Mod for Russian Users

Posted by Z@Zmaster 
Mod for Russian Users
October 07, 2007 01:33AM
Add this code after #210 line

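/* Mod for cp1251 & koi8-r support by Z@Zmaster */
// Pull out the <head> section of the fetched page.
$headdata = "";
if (preg_match("@<head[^>]*>(.*?)<\/head>@si", $file, $regs)) {
    $headdata = $regs[1];
}

// Look for <meta http-equiv="Content-Type" content="...; charset=...">.
$content = "";
$res = array();
if (preg_match("/<meta +http-equiv *=[\"']?Content-Type[\"']? *content=[\"']?([^<>'\"]+)[\"']?/i", $headdata, $res)) {
    $content = strtolower($res[1]);
}

// No usable meta tag: fall back to the HTTP Content-Type header.
if ( $content == "" ) {
    $sc = get_headers($url, 1);
    if (is_array($sc) && isset($sc['Content-Type'])) {
        $ctype = $sc['Content-Type'];
        // After redirects the header may come back as an array; use the last value.
        $content = strtolower(is_array($ctype) ? end($ctype) : $ctype);
    }
}

$codepages = array("windows-1251", "utf-8", "koi8-r");

$cur_codepage = "";
for ($i=0;$i<count($codepages);$i++){
    if ( strpos($content, $codepages[$i]) !== false ) {
        $cur_codepage = $codepages[$i];
        break;
    }
}

// Convert only when a known non-UTF-8 code page was detected.
if ( $cur_codepage != "utf-8" && $cur_codepage != "" ) {
    $file = iconv($cur_codepage, "UTF-8", $file);
}
/* Mod for cp1251 & koi8-r support by Z@Zmaster */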

Now you should no longer have problems with sites in the cp1251 and koi8-r code pages.
Hello from Russian programmers, Tomsk, TUSUR University ;)
Thanks to 21h (aka Vladimir Smagin) for the help.

For those who know Russian :) The author simply did not consider that sites might be in something other than UTF-8, and that was a real pain, so a small quick-and-dirty add-on was written so that you no longer have to fuss with encodings :) Thanks to comrade 21h aka Vladimir Smagin for prodding me into this blasphemy :) Tomsk rules, TUSUR forever!



Edited 3 time(s). Last edit at 10/07/2007 01:40AM by Z@Zmaster.
Tec
Re: Mod for Russian Users
October 10, 2007 01:21AM
Sphider, as developed by Ando, consists of 74 files, most of them with more than 210 lines. For the Russian members of this forum it would be helpful to know which script is the target of your modification.

Tec
Re: Mod for Russian Users
October 11, 2007 03:50PM
Hi Z@Zmaster, thanks for this mod.
But you said "Add this code after #210 line". In which file?
Re: Mod for Russian Users
October 21, 2007 11:29PM
Sorry, my bad... The target file is admin/spider.php. But you should be able to work out where to insert the code on your own ;)
Re: Mod for Russian Users
October 21, 2007 11:47PM
And by the way, Ando may want to read the mbstring section of the PHP documentation: http://www.php.net/manual/en/ref.mbstring.php ;)
Re: Mod for Russian Users
November 02, 2007 06:52AM
Z@Zmaster! You're brilliant!

I've been using Sphider for several years now to maintain a small search engine for 80 Hungarian language sites. 75 are in ISO-8859-2 (Latin2) and 5 in UTF-8. Until now, search results from the UTF-8 sites have been gobbledy-gook, as my search site and database use ISO-8859-2. I made a few changes to your code (as below), added it at line 210 in admin/spider.php, searched my UTF-8 sites and got real words in the database and readable search results.

Many thanks,

Michael

if ($file_read_error) {
    $contents = getFileContents($url);
    $file = $contents['file'];
}


/* Mod for cp1251 & koi8-r support by Z@Zmaster, adapted for ISO-8859-2 */
// Pull out the <head> section of the fetched page.
$headdata = "";
if (preg_match("@<head[^>]*>(.*?)<\/head>@si", $file, $regs)) {
    $headdata = $regs[1];
}

// Look for <meta http-equiv="Content-Type" content="...; charset=...">.
$content = "";
$res = array();
if (preg_match("/<meta +http-equiv *=[\"']?Content-Type[\"']? *content=[\"']?([^<>'\"]+)[\"']?/i", $headdata, $res)) {
    $content = strtolower($res[1]);
}

// No usable meta tag: fall back to the HTTP Content-Type header.
if ( $content == "" ) {
    $sc = get_headers($url, 1);
    if (is_array($sc) && isset($sc['Content-Type'])) {
        $ctype = $sc['Content-Type'];
        // After redirects the header may come back as an array; use the last value.
        $content = strtolower(is_array($ctype) ? end($ctype) : $ctype);
    }
}

$codepages = array("windows-1250", "utf-8", "iso-8859-2");

$cur_codepage = "";
for ($i=0;$i<count($codepages);$i++){
    if ( strpos($content, $codepages[$i]) !== false ) {
        $cur_codepage = $codepages[$i];
        break;
    }
}

// Convert only when a known code page other than ISO-8859-2 was detected.
if ( $cur_codepage != "iso-8859-2" && $cur_codepage != "" ) {
    $file = iconv($cur_codepage, "iso-8859-2", $file);
}
/* Mod for cp1251 & koi8-r support by Z@Zmaster, adapted for ISO-8859-2 */

$pageSize = number_format(strlen($file)/1024, 2, ".", "");
printPageSizeReport($pageSize);
Re: Mod for Russian Users
February 11, 2008 01:17PM
Which version of Sphider was this mod written for?
I get the error "Fatal error: Call to undefined function: get_headers() in ...\admin\spider.php on line 223"
( line 223: $sc = get_headers($url, 1); )
Could someone please upload a working spider.php with this mod somewhere? I really need it.
Thanks.
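
For what it's worth, get_headers() was only added in PHP 5, so an "undefined function" error like this is what one would expect on a PHP 4 server. Below is a minimal sketch of a guard, not a tested Sphider patch. The helper names content_type_from_headers() and fetch_content_type() are made up for illustration and are not part of Sphider. The first takes an array shaped like the result of get_headers($url, 1) and returns the Content-Type value in lowercase; the second calls get_headers() only where it exists.

```php
<?php
// Hypothetical helper (not part of Sphider): extract a lowercase Content-Type
// value from an array shaped like the result of get_headers($url, 1).
function content_type_from_headers($headers) {
    if (!is_array($headers)) {
        return "";                       // get_headers() returns false on failure
    }
    foreach ($headers as $name => $value) {
        // Header names are matched case-insensitively; index 0 is the status line.
        if (is_string($name) && strtolower($name) == "content-type") {
            // After redirects the value can be an array; keep the last entry.
            if (is_array($value)) {
                $value = end($value);
            }
            return strtolower($value);
        }
    }
    return "";
}

// Hypothetical wrapper: only call get_headers() where it exists (PHP 5 and later).
function fetch_content_type($url) {
    if (!function_exists("get_headers")) {
        return "";                       // PHP 4: no header-based detection
    }
    return content_type_from_headers(@get_headers($url, 1));
}
```

In the mod, the line $content = strtolower($sc[Content-Type]); could then become $content = fetch_content_type($url); so that on PHP 4 the header lookup is simply skipped and only the <meta> tag is used.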
Re: Mod for Russian Users
February 11, 2008 03:34PM
All sorted, I solved the problem and it works :)
The only thing is that it does not index pages that have no <meta ... charset=...> line.
Thanks Z@Zmaster!
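
One possible cause, offered as a guess rather than a diagnosis: if no known charset is detected and iconv() is still called with an empty source charset, the conversion may fail and return false, which leaves nothing to index. Here is a minimal sketch of a more defensive conversion. The function names detect_codepage() and to_utf8() are hypothetical and not part of Sphider. The idea is to convert only when a known non-UTF-8 charset was found, and to keep the original page if iconv() fails.

```php
<?php
// Hypothetical helpers (not part of Sphider).

// Return the first known charset named in a Content-Type string, or "".
function detect_codepage($content, $codepages) {
    $content = strtolower($content);
    foreach ($codepages as $codepage) {
        if (strpos($content, $codepage) !== false) {
            return $codepage;
        }
    }
    return "";
}

// Convert to UTF-8 only when a non-UTF-8 charset was detected; otherwise,
// or if iconv() fails, hand back the page unchanged.
function to_utf8($file, $cur_codepage) {
    if ($cur_codepage == "" || $cur_codepage == "utf-8") {
        return $file;
    }
    $converted = @iconv($cur_codepage, "UTF-8", $file);
    return ($converted === false) ? $file : $converted;
}

$codepages = array("windows-1251", "utf-8", "koi8-r");
$page = "<html><body>no charset declared here</body></html>";

// No charset in the Content-Type string, so the page passes through untouched.
echo to_utf8($page, detect_codepage("text/html", $codepages));
```

With this shape, a page without any charset declaration is indexed as-is instead of being run through iconv(). Whether that is the right default depends on what encoding such pages actually use.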
Re: Mod for Russian Users
February 21, 2008 12:20PM
Hi there

I am trying to do something similar, and index a Chinese page.
I made the changes above, using gb2312_chinese_ci as my collation, but every time I try it I get the message 'Page contains less than 10 words'.

Any idea where it goes wrong really?
Re: Mod for Russian Users
June 26, 2008 03:53PM
solaris Wrote:
-------------------------------------------------------
> Hi there
>
> I am trying to do something similar, and index a
> Chinese page.
> I made the changes above, using gb2312_chinese_ci
> as my collation, but every time I try it I get
> the message 'Page contains less than 10 words'.
>
> Any idea where it goes wrong really?

Sorry, but on this forum it seems you can only ask questions and then look for the answers yourself :(
Re: Mod for Russian Users
March 30, 2010 05:29PM
mburp Wrote:
-------------------------------------------------------
> Z@Zmaster! You're brilliant!
>
> I've been using Sphider for several years now to
> maintain a small search engine for 80 Hungarian
> language sites. 75 are in ISO-8859-2 (Latin2) and
> 5 in UTF-8. Until now, search results from the
> UTF-8 sites have been gobbledy-gook, as my search
> site and database use ISO-8859-2. I made a few
> changes to your code (as below), added it at line
> 210 in admin/spider.php, searched my UTF-8 sites
> and got real words in the database and readable
> search results.
>
> Many thanks,
>
> Michael
>
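> if ($file_read_error) {
>     $contents = getFileContents($url);
>     $file = $contents['file'];
> }
>
>
> /* Mod for cp1251 & koi8-r support by Z@Zmaster, adapted for ISO-8859-2 */
> // Pull out the <head> section of the fetched page.
> $headdata = "";
> if (preg_match("@<head[^>]*>(.*?)<\/head>@si", $file, $regs)) {
>     $headdata = $regs[1];
> }
>
> // Look for <meta http-equiv="Content-Type" content="...; charset=...">.
> $content = "";
> $res = array();
> if (preg_match("/<meta +http-equiv *=[\"']?Content-Type[\"']? *content=[\"']?([^<>'\"]+)[\"']?/i", $headdata, $res)) {
>     $content = strtolower($res[1]);
> }
>
> // No usable meta tag: fall back to the HTTP Content-Type header.
> if ( $content == "" ) {
>     $sc = get_headers($url, 1);
>     if (is_array($sc) && isset($sc['Content-Type'])) {
>         $ctype = $sc['Content-Type'];
>         // After redirects the header may come back as an array; use the last value.
>         $content = strtolower(is_array($ctype) ? end($ctype) : $ctype);
>     }
> }
>
> $codepages = array("windows-1250", "utf-8", "iso-8859-2");
>
> $cur_codepage = "";
> for ($i=0;$i<count($codepages);$i++){
>     if ( strpos($content, $codepages[$i]) !== false ) {
>         $cur_codepage = $codepages[$i];
>         break;
>     }
> }
>
> // Convert only when a known code page other than ISO-8859-2 was detected.
> if ( $cur_codepage != "iso-8859-2" && $cur_codepage != "" ) {
>     $file = iconv($cur_codepage, "iso-8859-2", $file);
> }
> /* Mod for cp1251 & koi8-r support by Z@Zmaster, adapted for ISO-8859-2 */
>
> $pageSize = number_format(strlen($file)/1024, 2, ".", "");
> printPageSizeReport($pageSize);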


Thank you, Z@Zmaster. You are really great! Well done! Cheers! :D

"Always turn a negative situation into a positive situation."
Editor @ Daily Reviews (http://www.daily-reviews.com)