Page is a duplicate - no, not really => Forcing a reindex

Posted by Hitman 
May 17, 2007 02:32PM
Hi,

I have a couple of hundred PDF files that people can search through, and it has come to my attention that some of these PDFs can't be found. The reason is that these PDFs were scanned as images (and not OCR'ed), so the output of pdftotext was empty except for some binary characters. I fixed this by using a PDF batch-stamping program that adds the filename to the header of the first page (this is the keyword that people search for most). Fixed... or so I thought.

I've uploaded one of the new 'stamped' PDFs (renamed to test.pdf), but when reindexing I keep getting the message 'Page is a duplicate'. This shouldn't be possible, because the content has changed (I added text to the PDF) and the filename is different. I tried reindexing from the admin pages and also tried the command line option 'php spider.php -all'.

Can somebody tell me how I can force Sphider to reindex everything without checking whether it already exists in the database? If that isn't possible, how can I clear the links in the database and reindex everything? Thanks!



Edited 1 time(s). Last edit at 05/17/2007 02:36PM by Hitman.
Re: Page is a duplicate - no, not really => Forcing a reindex
May 18, 2007 10:36AM
> Can somebody tell me how I can force Sphider to
> reindex everything without checking if it already
> exists in the database?
In your case it wouldn't solve the problem: Sphider reports the duplicate message only when the md5 sum of the text matches that of an already indexed page. You only get a matching md5 sum if the page content is exactly equal. So I'm guessing the PDF converter isn't working correctly; it might just "extract" some error message or something similar. Check exactly what text the converter is producing, as this might give some clues.
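To illustrate the point about md5 sums, here is a minimal Python sketch of content-based duplicate detection. It is not Sphider's actual code (Sphider is written in PHP); the names content_fingerprint, is_duplicate and seen_digests are hypothetical and exist only for this example. It assumes the converter returns an empty string for every image-only PDF, which is one way two different files could end up with the same digest.

```python
import hashlib


def content_fingerprint(extracted_text: str) -> str:
    """Return the MD5 hex digest of the text extracted from a document."""
    return hashlib.md5(extracted_text.encode("utf-8")).hexdigest()


def is_duplicate(extracted_text: str, seen: set) -> bool:
    """Report a duplicate when the digest of the text was already seen.

    The filename plays no role here: only the extracted text is hashed.
    """
    digest = content_fingerprint(extracted_text)
    if digest in seen:
        return True
    seen.add(digest)
    return False


seen_digests = set()

# Assumption: a failing converter yields the same (empty) output for every file.
first_pdf = is_duplicate("", seen_digests)    # first file with empty output
second_pdf = is_duplicate("", seen_digests)   # different file, same empty output

# Once the converter really extracts the stamped text, the digest differs.
stamped_pdf = is_duplicate("1-SCA1987-01.pdf", seen_digests)

print(first_pdf, second_pdf, stamped_pdf)
```

In this sketch the second file is flagged as a duplicate even though it is a different PDF, because both produced identical extracted text. That is why checking the converter output is the useful first step.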
Re: Page is a duplicate - no, not really => Forcing a reindex
May 18, 2007 02:45PM
Ando, thanks for the help. It seems there was indeed a problem with the permissions on pdftotext. Unfortunately, I'm still not there.

I have a PDF file named '1-SCA1987-01.pdf'. It contains only graphics, but I managed to insert the filename by batch-stamping my documents, so the only 'real' text in the file is 1-SCA1987-01.pdf

but,
- I cannot find it when searching for: 1-SCA1987-01.pdf
- I cannot find it when searching for: SCA1987
- I *can* find it when searching for: 1-SCA1987-01

Can somebody please give me a clue as to how I can fix this? Renaming all the documents is really not an option! Thanks for your help.
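One pattern that would produce exactly these three results is whole-word matching against tokens that were split on the dot but not on the hyphen. The Python sketch below only demonstrates that pattern. Nothing in this thread confirms that Sphider tokenizes this way, and the functions tokenize and whole_word_match are hypothetical helpers invented for the illustration.

```python
import re


def tokenize(text: str) -> list:
    """Hypothetical rule: split on whitespace and dots, keep hyphens, lowercase."""
    return [token.lower() for token in re.split(r"[\s.]+", text) if token]


def whole_word_match(query: str, index_tokens: set) -> bool:
    """Hypothetical lookup: the query must equal one indexed token exactly."""
    return query.lower() in index_tokens


# Under the assumed rule, the stamped text becomes two tokens.
index_tokens = set(tokenize("1-SCA1987-01.pdf"))

full_name = whole_word_match("1-SCA1987-01.pdf", index_tokens)
fragment = whole_word_match("SCA1987", index_tokens)
stem = whole_word_match("1-SCA1987-01", index_tokens)

print(sorted(index_tokens), full_name, fragment, stem)
```

If the real index behaves like this, the full filename fails because no single token contains the ".pdf" part, and the fragment fails because it is only part of a token. Looking at the keywords actually stored for the page would confirm or rule this out.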
Re: Page is a duplicate - no, not really => Forcing a reindex
May 18, 2007 07:39PM
Check the extract_text function in spiderfuncs.php and try printing out the result before it is returned. This will show whether the text you need is actually being extracted.
Re: Page is a duplicate - no, not really => Forcing a reindex
May 20, 2007 03:10PM
I extracted the first array item from $result and it contained:

1-SCA1987-01.pdf

So this seems to work, the only problem is that I can't search for it...

Edit: doesn't Sphider allow you to use the keywords in the url as searchable keywords? This way I already should be able to find these files, but it appears that this also doesn't work?!? (and yes, the option 'Index words in domain name and url path' is checked)



Edited 1 time(s). Last edit at 05/20/2007 05:18PM by Hitman.
Re: Page is a duplicate - no, not really => Forcing a reindex
May 21, 2007 02:05PM
> 1-SCA1987-01.pdf
Did you set "Required number of words in a page in order to be indexed" to 1 in settings?

> Edit: doesn't Sphider allow you to use the
> keywords in the url as searchable keywords? This
> way I already should be able to find these files,
> but it appears that this also doesn't work?!? (and
> yes, the option 'Index words in domain name and
> url path' is checked)
It is possible and does work. The option needs to be checked before indexing, of course.
Re: Page is a duplicate - no, not really => Forcing a reindex
May 21, 2007 04:06PM
Actually this setting was 0.