Sphider-utf - utf-8 version of sphider here

Posted by SkyRanger 
September 24, 2011 12:19PM
Hi all, I made some changes to Sphider so it can index multi-language and UTF-8 sites.

http://code.google.com/p/sphider-utf/

Here are the changes:
Sphider can now index multi-language sites; at least it indexes cp-1251 and UTF-8 pages in
Russian pretty well. There may be some errors and it may fail to index some pages, but in
general it works fine.

1. Sphider now uses:
- UTF-8, so you must change your database collation to utf8_general_ci
- the MySQLi class to interact with MySQL
- multibyte string functions, so it works correctly with UTF-8
2. Fixed the "MySQL server has gone away" error
3. Because of PHP limitations and the current indexing algorithm, Sphider ignores pages
larger than 1 megabyte
4. Some changes were made to the database, so make sure to use sql/upgrade_to_1.4.sql to
update your DB.
5. Sphider now auto-detects the site's code page
and many other changes, so many that I can't remember them all.
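
To illustrate why the multibyte string functions matter for UTF-8 and cp-1251 text, here is a minimal sketch. It is not taken from the Sphider-utf source; the variable names are made up for the example, and it assumes the PHP mbstring extension is installed.

```php
<?php
// Illustrative only: not code from Sphider-utf. Requires the mbstring extension.
$word = "\xD0\xBF\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82"; // "привет" as UTF-8 bytes

// strlen() counts bytes, mb_strlen() counts characters.
$byte_count = strlen($word);
$char_count = mb_strlen($word, 'UTF-8');

// mb_substr() cuts on character boundaries, so no letter is split in half.
$first_three = mb_substr($word, 0, 3, 'UTF-8');

// Round trip through cp-1251, as when a cp-1251 page is converted
// to UTF-8 before it is indexed.
$cp1251 = mb_convert_encoding($word, 'Windows-1251', 'UTF-8');
$utf8_again = mb_convert_encoding($cp1251, 'UTF-8', 'Windows-1251');

echo $byte_count, ' ', $char_count, ' ', $first_three, "\n"; // prints "12 6 при"
```

Each Cyrillic letter takes two bytes in UTF-8 but one byte in cp-1251, which is why byte-oriented functions give wrong lengths and offsets on UTF-8 input.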

The current version seems to work fine, but it needs to be tested. It may contain bugs and errors, so I will be glad if you report them.

You can get the sources here:

http://code.google.com/p/sphider-utf/source/checkout

P.S. If something doesn't work, it may be because I forgot to put some files in the repo :)

https://code.google.com/p/sphider-utf/ - free for all | support http://www.sphider.eu/forum/read.php?3,8793
Re: Sphider-utf - utf-8 version of sphider here
October 07, 2011 02:08PM
My search form does not work until I remove $mysqli_conn->query("SET NAMES 'utf8'"); and when I do that, I cannot search UTF-8 text.



Edited 1 time(s). Last edit at 10/07/2011 02:20PM by vanja.
Re: Sphider-utf - utf-8 version of sphider here
October 10, 2011 08:07AM
Hi, I tried your Sphider UTF-8 version. It works fine for English, but not for Arabic.
Re: Sphider-utf - utf-8 version of sphider here
October 10, 2011 02:23PM
Fixed it.

I put
header("Content-Type: text/html; charset=utf-8");
at the beginning of search.php

Thanks
Re: Sphider-utf - utf-8 version of sphider here
February 02, 2012 11:34AM
Hello SkyRanger,

You said that you've integrated a file-size limit. What exactly were the problems with bigger files?

I ask because the older version handled 3 MB PDF files, and I think there were no problems.
Maybe I've overlooked something?

Thanks for a response,

Schmidi
Re: Sphider-utf - utf-8 version of sphider here
April 01, 2012 05:50PM
Is this being hosted somewhere else now?
His Google Code project has been removed.

Just wanted to edit and say: my mistake, the files are not under downloads but under "source>>browse".



Edited 1 time(s). Last edit at 04/02/2012 06:02AM by cladiron.
Re: Sphider-utf - utf-8 version of sphider here
June 08, 2012 03:21PM
Can anybody get this to work? Every time I try, the search page is blank and does nothing. All the pages on my website are in UTF-8 and I would really like to be able to search for non-ASCII characters. Can anybody help out?

Thanks
Re: Sphider-utf - utf-8 version of sphider here
August 24, 2012 06:24PM
Hi all,

In order to index and search MS Word documents containing non-ascii characters with Sphider, I had to do the following in Fedora Linux (both FC11 & FC17):

1) Modify mysql server configuration in /etc/my.cnf

[mysqld]
.........
skip-character-set-client-handshake
collation-server=utf8_unicode_ci
character-set-server=utf8
..........

2) Modify httpd (Apache) server configuration in /etc/httpd/conf/httpd.conf

AddDefaultCharset UTF-8

3) Modify php configuration in /etc/php.ini

default_charset="UTF-8"

4) Install xpdf package (yum install xpdf) in order to have pdftotext program in /usr/bin/pdftotext and report it in settings/conf.php

$pdftotext_path='/usr/bin/pdftotext';

After all that, French letters é, è, ç, à, ù, ö, î, etc. show up OK in links, documents and indexes, except in my Sphider admin "Statistics/Search log/Query" report column, where they stay garbled (e.g. Ã© instead of é).

This is solved (for PHP versions < 5.4) by updating calls to the PHP "htmlentities" function as follows:

replace the single-argument call
.htmlentities($word)

with the 3-argument call
.htmlentities($word, ENT_NOQUOTES, "UTF-8")

at lines 1095 and 1114 in admin.php.
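
The effect of the charset argument can be seen in a small standalone sketch. This is not code from admin.php; the sample word is made up, and the old pre-5.4 default is emulated by passing "ISO-8859-1" explicitly.

```php
<?php
// Illustrative only: "café" with é encoded as the two UTF-8 bytes C3 A9.
$word = "caf\xC3\xA9";

// Treating the UTF-8 bytes as ISO-8859-1 (the default before PHP 5.4)
// turns each byte of the é into its own entity.
$garbled = htmlentities($word, ENT_NOQUOTES, 'ISO-8859-1');

// Declaring the real charset yields a single entity for é.
$correct = htmlentities($word, ENT_NOQUOTES, 'UTF-8');

echo $garbled, "\n"; // prints "caf&Atilde;&copy;"
echo $correct, "\n"; // prints "caf&eacute;"
```

The two entities in the first result render in the browser as the garbled pair seen in the report column.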


I also noted that it is impossible to index .doc files directly without garbling such letters: one must export them from MS Word (and import them into Sphider using "Sites/Add site") either as .pdf files using a PDF converter (such as CutePDF), or as "Web page" format files (.htm).

Hope it might help other users of standard Sphider struggling with non-ascii documents...

Fedora 11 ; php 5.2.12 ; mysql 14.14 Distrib 5.1.42 ; Apache 2.2.13 ; pdftotext 0.10.7 ; Sphider fresh install



Edited 1 time(s). Last edit at 08/25/2012 11:01AM by grandebou.
Ant
Re: Sphider-utf - utf-8 version of sphider here
August 31, 2014 01:59PM
Hello!

I just installed Sphider UTF-8. Unfortunately, in the admin panel the "settings" and "database" tabs display a blank page.

Please help.

Anton.
Re: Sphider-utf - utf-8 version of sphider here
February 02, 2015 07:58PM
Very cool project. One question: how can you index thousands of pages when PHP is limited in doing this, mostly due to speed or returning page results? The big players (Google, MSN) are using CGI to index. Another idea is to write a PHP script to join other OpenSearch databases.
Re: Sphider-utf - utf-8 version of sphider here
April 10, 2015 05:12PM
Wouldn't it be better to convert using this function?
function sanitizar_utf8($texto) {
    $saida = '';

    $i = 0;
    $len = strlen($texto);
    while ($i < $len) {
        $char = $texto[$i++];
        $ord  = ord($char);
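
        // 0xxxxxxx: single-byte (ASCII) character
        if (($ord & 0x80) == 0x00) {

            // Control character
            if (($ord >= 0 && $ord <= 31) || $ord == 127) {

                // Keep only tab, line feed, and carriage return
                if ($ord == 9 || $ord == 10 || $ord == 13) {
                    $saida .= $char;
                }

            // ASCII symbol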
            } else {
                $saida .= $char;
            }

        // Non-ASCII lead byte
        } else {

            // Count the leading 1 bits to get the expected sequence length
            $bytes = 0;
            for ($b = 7; $b >= 0; $b--) {
                $bit = $ord & (1 << $b);
                if ($bit) {
                    $bytes += 1;
                } else {
                    break;
                }
            }

            switch ($bytes) {
            case 2: // 110xxxxx 10xxxxxx
            case 3: // 1110xxxx 10xxxxxx 10xxxxxx
            case 4: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                $valido = true;
                $saida_padrao = $char;
                $i_inicial = $i;
                for ($b = 1; $b < $bytes; $b++) {
                    if (!isset($texto[$i])) {
                        $valido = false;
                        break;
                    }
                    $char_extra = $texto[$i++];
                    $ord_extra  = ord($char_extra);

                    if (($ord_extra & 0xC0) == 0x80) {
                        $saida_padrao .= $char_extra;
                    } else {
                        $valido = false;
                        break;
                    }
                }
                if ($valido) {
                    $saida .= $saida_padrao;
                } else {
                    $saida .= ($ord < 0x7F || $ord > 0x9F) ? utf8_encode($char) : '';
                    $i = $i_inicial;
                }
                break;
            case 1:  // 10xxxxxx: ISO-8859-1
            default: // 11111xxx: ISO-8859-1
                $saida .= ($ord < 0x7F || $ord > 0x9F) ? utf8_encode($char) : '';
                break;
            }
        }
    }
    return $saida;
}
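
For comparison, a much coarser whole-string approach can be sketched with mbstring. This is a different technique from the byte-by-byte function above, not a replacement for it: it assumes any input that is not valid UTF-8 is entirely ISO-8859-1, and the helper name to_utf8_or_latin1 is hypothetical. It avoids utf8_encode(), which is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper, shown only as a sketch. Requires the mbstring extension.
// Assumption: input is either valid UTF-8 or pure ISO-8859-1, never a mix.
function to_utf8_or_latin1($text) {
    // Already valid UTF-8: return unchanged.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise reinterpret every byte as ISO-8859-1.
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$from_utf8   = to_utf8_or_latin1("caf\xC3\xA9"); // é as UTF-8
$from_latin1 = to_utf8_or_latin1("caf\xE9");     // é as ISO-8859-1

var_dump($from_utf8 === $from_latin1); // both are now the same UTF-8 string
```

Unlike the function posted above, this sketch does not strip control characters and cannot repair a string that mixes valid UTF-8 sequences with stray ISO-8859-1 bytes.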