Sphider-utf - utf-8 version of sphider here September 24, 2011 12:19PM |
Registered: 6 years ago Posts: 7 |
Sphider can now index multi-language sites; at least it indexes Russian pages in cp-1251 and UTF-8 pretty well. There may be some errors and it may fail to index some pages, but in general it works fine.
1. Sphider now uses:
- UTF-8, so you must change your database collation to utf8_general_ci
- the MySQLi class to interact with MySQL
- multi-byte string functions, so it works correctly with UTF-8
2. Fixed the "MySQL server has gone away" error.
3. Because of a PHP limitation and the current indexing algorithm, Sphider ignores pages larger than 1 megabyte.
4. Some changes were made to the database, so make sure to use sql/upgrade_to_1.4.sql to update your DB.
5. Sphider now auto-detects the site code page.
And many other changes, so many I can't remember them all.
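To illustrate the points about multi-byte functions and code page detection, here is a minimal sketch. It is not Sphider's actual detection code: the helper name `to_utf8` and the choice of Windows-1251 as the only fallback are assumptions made for this example, and it needs the mbstring extension.

```php
<?php
// Hypothetical helper (not taken from Sphider): return $raw as UTF-8.
// Assumption: anything that is not valid UTF-8 is encoded in $fallback.
function to_utf8(string $raw, string $fallback = 'Windows-1251'): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

// "привет" encoded as cp-1251 bytes
$cp1251 = "\xEF\xF0\xE8\xE2\xE5\xF2";
$utf8   = to_utf8($cp1251);

// Byte-oriented and multi-byte functions disagree on UTF-8 text:
$bytes = strlen($utf8);                  // counts bytes
$chars = mb_strlen($utf8, 'UTF-8');      // counts characters
$upper = mb_strtoupper($utf8, 'UTF-8');  // case mapping that understands Cyrillic
```

Each Cyrillic letter takes two bytes in UTF-8, so `strlen` reports twice as many units as `mb_strlen` here. That difference is why byte-oriented string functions give wrong offsets and lengths on UTF-8 pages.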
Re: Sphider-utf - utf-8 version of sphider here October 07, 2011 02:08PM |
Registered: 6 years ago Posts: 1 |
Re: Sphider-utf - utf-8 version of sphider here October 10, 2011 08:07AM |
Registered: 6 years ago Posts: 2 |
Re: Sphider-utf - utf-8 version of sphider here October 10, 2011 02:23PM |
Registered: 6 years ago Posts: 2 |
Re: Sphider-utf - utf-8 version of sphider here February 02, 2012 11:34AM |
Registered: 6 years ago Posts: 2 |
Re: Sphider-utf - utf-8 version of sphider here April 01, 2012 05:50PM |
Registered: 6 years ago Posts: 22 |
Re: Sphider-utf - utf-8 version of sphider here June 08, 2012 03:21PM |
Registered: 5 years ago Posts: 1 |
Re: Sphider-utf - utf-8 version of sphider here August 24, 2012 06:24PM |
Registered: 5 years ago Posts: 3 |
[mysqld]
.........
skip-character-set-client-handshake
collation-server=utf8_unicode_ci
character-set-server=utf8
..........
AddDefaultCharset UTF-8
default_charset="UTF-8"
$pdftotext_path='/usr/bin/pdftotext';
.htmlentities($word)
.htmlentities($word, ENT_NOQUOTES, "UTF-8")
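The two fragments above differ only in the explicit flags and charset arguments. Here is a small sketch of why the charset matters; the sample words are assumptions, and ISO-8859-1 is passed explicitly to mimic the default that `htmlentities` used before PHP 5.4.

```php
<?php
// Sample words written as explicit UTF-8 bytes, so the result does not
// depend on how this file is saved.
$word = "caf\xC3\xA9";   // "café"
$cyr  = "\xD0\xBF";      // Cyrillic "п"

// Treating UTF-8 bytes as ISO-8859-1 turns each byte into its own entity.
$word_latin1 = htmlentities($word, ENT_NOQUOTES, 'ISO-8859-1');
$cyr_latin1  = htmlentities($cyr, ENT_NOQUOTES, 'ISO-8859-1');

// With the charset stated, multi-byte characters are handled as one unit.
$word_utf8 = htmlentities($word, ENT_NOQUOTES, 'UTF-8'); // "caf&eacute;"
$cyr_utf8  = htmlentities($cyr, ENT_NOQUOTES, 'UTF-8');  // unchanged: no HTML 4.01 entity
```

With the wrong charset, a single accented or Cyrillic letter comes out as two unrelated entities, so highlighted search terms show up as garbage in the result page.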
Re: Sphider-utf - utf-8 version of sphider here August 31, 2014 01:59PM |
Registered: 3 years ago Posts: 6 |
Re: Sphider-utf - utf-8 version of sphider here February 02, 2015 07:58PM |
Registered: 3 years ago Posts: 6 |
Re: Sphider-utf - utf-8 version of sphider here April 10, 2015 05:12PM |
Registered: 3 years ago Posts: 1 |
function sanitizar_utf8($texto) {
    $saida = '';
    $i = 0;
    $len = strlen($texto);
    while ($i < $len) {
        $char = $texto[$i++];
        $ord = ord($char);
        if (($ord & 0x80) == 0x00) {
            // 0xxxxxxx: single-byte ASCII
            if (($ord >= 0 && $ord <= 31) || $ord == 127) {
                // Control character: keep only tab, LF and CR
                if ($ord == 9 || $ord == 10 || $ord == 13) {
                    $saida .= $char;
                }
            } else {
                // ASCII symbol
                $saida .= $char;
            }
        } else {
            // Count the leading 1 bits to get the expected sequence length
            $bytes = 0;
            for ($b = 7; $b >= 0; $b--) {
                $bit = $ord & (1 << $b);
                if ($bit) {
                    $bytes += 1;
                } else {
                    break;
                }
            }
            switch ($bytes) {
                case 2: // 110xxxxx 10xxxxxx
                case 3: // 1110xxxx 10xxxxxx 10xxxxxx
                case 4: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                    $valido = true;
                    $saida_padrao = $char;
                    $i_inicial = $i;
                    for ($b = 1; $b < $bytes; $b++) {
                        if (!isset($texto[$i])) {
                            $valido = false;
                            break;
                        }
                        $char_extra = $texto[$i++];
                        $ord_extra = ord($char_extra);
                        if (($ord_extra & 0xC0) == 0x80) {
                            $saida_padrao .= $char_extra;
                        } else {
                            $valido = false;
                            break;
                        }
                    }
                    if ($valido) {
                        $saida .= $saida_padrao;
                    } else {
                        // Invalid sequence: treat the lead byte as ISO-8859-1
                        // and rescan the bytes that followed it.
                        // Note: utf8_encode() is deprecated as of PHP 8.2.
                        $saida .= ($ord < 0x7F || $ord > 0x9F) ? utf8_encode($char) : '';
                        $i = $i_inicial;
                    }
                    break;
                case 1: // 10xxxxxx: ISO-8859-1
                default: // 11111xxx: ISO-8859-1
                    $saida .= ($ord < 0x7F || $ord > 0x9F) ? utf8_encode($char) : '';
                    break;
            }
        }
    }
    return $saida;
}
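A cheap pre-check can tell whether a string needs to go through the function above at all. The following is only a sketch: `needs_sanitizing` is a hypothetical name, not part of the posted code. It relies on PCRE's UTF-8 validation, which is stricter than the bit-pattern test in `sanitizar_utf8` (it also rejects overlong forms and surrogates), so the two can disagree on edge cases.

```php
<?php
// Hypothetical pre-check, not part of the posted function.
function needs_sanitizing(string $texto): bool
{
    // With the /u modifier, preg_match() returns false (not 0 or 1)
    // when the subject is not valid UTF-8.
    if (preg_match('//u', $texto) !== 1) {
        return true;
    }
    // Control characters that sanitizar_utf8() drops: everything below
    // 0x20 except tab (9), LF (10) and CR (13), plus DEL (127).
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', $texto) === 1;
}

$clean   = needs_sanitizing("hello\n");  // plain ASCII with an allowed LF
$latin1  = needs_sanitizing("caf\xE9");  // lone ISO-8859-1 byte, invalid UTF-8
$control = needs_sanitizing("a\x00b");   // embedded NUL byte
```

Running the check first avoids the byte-by-byte loop for the common case where the page is already clean UTF-8.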