PDF Indexing

Posted by cmarcera 
PDF Indexing
April 23, 2007 05:56PM
New Sphider user here :)

I've successfully installed Sphider on my GoDaddy account. I also downloaded xpdf and uploaded that to the account. Using $_SERVER['ROOT'], I plugged in the xpdf path into Sphider's config (/home/content/p/d/f/pdfhosting/html/sphider/xpdf/pdftotxt) and set off to index a directory of PDFs.

Unfortunately I'm getting this:

Retrieving: http://[mywebsite.com]/pdf/2007/04/13/1.pdf at 10:54:18.
Size of page: 339.43kb. Starting indexing at 10:54:18. Page contains less than 10 words
Links found: 0. New links: 0

I checked to make sure the xpdf binaries all had 755 permissions and I even tried reuploading the xpdf files in forced-binary mode just to make sure. Anyone have any ideas/experience with this?

Thanks in advance!

-C. Marcera
Anonymous User
Re: PDF Indexing
April 23, 2007 06:30PM
If you have a directory of PDFs to index, there are two approaches:

Configure the server to allow directory browsing or, if you can't or don't want to,

Prepare an HTML list of your PDFs (you can automate it) to help Sphider find them.

I have successfully tested the first solution, and I suppose the second may fit your particular situation.
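
For the second option, the list can be generated by a short script. This is only a minimal sketch in Python, assuming the PDFs live in one local directory tree that is also served under a known base URL. The function name `build_pdf_index` and the example URL in the note below are made up for illustration; nothing here comes from Sphider itself.

```python
import html
from pathlib import Path
from urllib.parse import quote


def build_pdf_index(pdf_dir, base_url):
    """Return an HTML page that links to every .pdf file found under pdf_dir.

    pdf_dir  -- local directory holding the PDFs (searched recursively)
    base_url -- URL under which that same directory is served (an assumption)
    """
    root = Path(pdf_dir)
    items = []
    # Note: the "*.pdf" pattern is case-sensitive on most Linux hosts.
    for pdf in sorted(root.rglob("*.pdf")):
        rel = pdf.relative_to(root).as_posix()
        # Percent-encode spaces and other unsafe characters, keep the slashes.
        href = base_url.rstrip("/") + "/" + quote(rel)
        items.append(
            '<li><a href="%s">%s</a></li>' % (html.escape(href), html.escape(rel))
        )
    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>\n"
```

Writing the result of something like `build_pdf_index("pdf", "http://example.com/pdf")` to an HTML file inside the site, and pointing the spider at that file, should give it plain links to follow without directory browsing.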

Good luck.
Re: PDF Indexing
April 23, 2007 06:52PM
I do have directory browsing on. I first got a 403 Forbidden error, but an .htaccess file with "Options +Indexes" fixed that. It is seeing all the PDFs; it just thinks they all contain fewer than 10 words, even though I have checked them in Acrobat and they have more than enough text to be found.
Anonymous User
Re: PDF Indexing
April 23, 2007 06:57PM
Make sure you use an absolute path to pdftotext (not a relative one).
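
A quick way to rule out path problems is to check the configured value mechanically. The sketch below is Python and would have to run on the host itself, which is an assumption about what shell or script access the account allows; `converter_problems` is a made-up helper name. One thing worth comparing by eye as well: xpdf's text extractor is normally named `pdftotext`, so a configured path ending in any other spelling will not point at the binary.

```python
import os


def converter_problems(path):
    """Return a list of human-readable problems with a converter path.

    An empty list means the path is absolute, points at a regular file,
    and that file is executable by the current user.
    """
    problems = []
    if not os.path.isabs(path):
        problems.append("path is not absolute")
    if not os.path.isfile(path):
        problems.append("no file exists at this path")
    elif not os.access(path, os.X_OK):
        problems.append("file is not executable")
    return problems
```

Calling it with the exact string from Sphider's config and printing the result shows whether the file is really where the config says it is. Note that the check runs as whichever user executes the script, which may differ from the user the web server runs as.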
Re: PDF Indexing
April 24, 2007 08:13AM

Double checked that, it's absolute!
Anonymous User
Re: PDF Indexing
April 25, 2007 10:13AM
Your path to pdftotext seems too long.
Re: PDF Indexing
April 25, 2007 09:51PM

Log into Sphider's admin page and change the value of "Required number of words in a page in order to be indexed" from 10 to something like 5, then re-index your PDF files and see if that makes any difference!
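
Before lowering the threshold, it may help to see how many words the converter actually produces for one of the PDFs. The Python sketch below assumes the converter accepts `-` as the output file to write text to standard output, as xpdf's `pdftotext` does. The helper names are made up, and the `min_words` comparison is only a guess at how the "less than 10 words" check behaves, not Sphider's actual code.

```python
import subprocess


def extract_text(converter, pdf_path):
    """Run the converter on a local PDF and return its text output.

    Returns an empty string if the converter cannot be started, times out,
    or exits with a non-zero status.
    """
    try:
        result = subprocess.run(
            [converter, pdf_path, "-"], capture_output=True, timeout=60
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""
    if result.returncode != 0:
        return ""
    return result.stdout.decode("utf-8", errors="replace")


def count_words(text):
    """Count whitespace-separated words."""
    return len(text.split())


def would_be_indexed(text, min_words=10):
    """Assumed rule: a page is indexed when it has at least min_words words."""
    return count_words(text) >= min_words
```

If `count_words(extract_text(...))` comes back as 0 for a PDF that clearly contains text, the converter is probably not running at all, and changing the word threshold is unlikely to help.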