I used to enjoy sanity. Now I barely remember it.

I've spent three frustrating days working with Sphider, downloading Xpdf binaries, reading every scrap of related advice Google conjures up, and experimenting until my eyes bleed! Sphider absolutely refuses to index my PDF files.

I have tested a hundred different "pdftotext" variations, extensions, paths, locations, code strings, settings and permissions.
I have deleted the entire system - every file and process - and started from scratch twice!
I am ready to kill small animals.

a) My computer runs Windows 7 on a 64-bit platform
b) My websites are hosted on a commercial Linux server accessed via cPanel
c) My PDF files contain clean copy-able text and are NOT password protected
d) I have carefully read and tested every related tip, trick, technique and tutorial in this forum and many others.

Here's the problem:

Sphider reads my HTML pages perfectly, but returns the ridiculous PDF error message "Page contains less than 10 words" every time.

I am going to tie concrete blocks to my head and jump off the ferry.

I am not a programmer, but a reasonably well-rounded website builder.

Any help would be hugely appreciated !!
January 28, 2011 11:10PM
Unfortunately I don't know what the problem might be, so I can't be of any help, but I'd like to let you know that I admire your determination.

I hope someone else will show up and give you the tip that will keep you from insanity.
Thanks Willy, appreciate the sentiment. I hope someone helps me out, too.

In the meantime, I've joined a lawn bowling club ... sanity is optional.

I wonder where the support teams are? It's been a week or so since you posted this, and no support is coming? Hmmm?
February 04, 2011 05:31PM
I am revisiting this forum again.

There is no support team in this Sphider forum. We just kind of help each other when someone has a good idea.

I do not have a problem indexing PDF files.

Make sure you have installed the small utility the author (Ando) suggests and configured it correctly on the server. Review the documentation again.

My company's Sphider has indexed more than 90 GB of websites already.

I hope you've found a solution in the meantime, but here's some feedback, just in case :)

Have you checked your web server's mime types? From a quick search of Sphider, it expects pdf files to be reported as “application/pdf”. If the web server's mime types are not set up correctly, then Sphider will not know that a particular file is pdf.
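A quick way to check this is to look at the Content-Type header the server sends for one of the PDFs. The following is only a minimal sketch: reports_pdf_mime_type() is a hypothetical helper (not part of Sphider), and the URL in the usage comment is a placeholder.

```php
<?php
// Hypothetical helper (not part of Sphider): returns true when a list of raw
// HTTP response header lines reports the document as application/pdf.
function reports_pdf_mime_type(array $header_lines) {
    $type = null;
    foreach ($header_lines as $line) {
        // Header names are case-insensitive, and a parameter such as
        // "; charset=..." may follow the type. Keep the last match so that
        // headers from redirects earlier in the chain are ignored.
        if (preg_match('/^Content-Type:\s*([^;\s]+)/i', $line, $m)) {
            $type = strtolower($m[1]);
        }
    }
    return $type === 'application/pdf';
}

// Usage against a live server (placeholder URL, substitute one of your PDFs):
// $headers = get_headers('http://www.example.com/docs/sample.pdf');
// var_dump($headers !== false && reports_pdf_mime_type($headers));
```

If this reports false, the usual fix on an Apache host is a line such as "AddType application/pdf .pdf" in .htaccess, assuming the hosting company allows such overrides.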

HTH - Pete
I have the same problem.
I put the pdftotext.exe in the sphider directory and set the code in conf.php to read c:\sphider\pdftotext.exe

Is that the correct location and pointer, and why is it coded like that? It is NOT the C drive but a directory on an ISP's server, and the sphider directory is not in the root.
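For anyone debugging the same setup, a small check script may help narrow it down. This is a minimal sketch, not Sphider code: diagnose_pdftotext_path() is a hypothetical helper, and it assumes the value being tested is whatever $pdftotext_path is set to in conf.php. It relies on the fact that a Windows .exe will generally not run on a Linux host, so a c:\ path there cannot work.

```php
<?php
// Hypothetical diagnostic (not part of Sphider): describes why a configured
// pdftotext path may not be usable on the machine that runs the indexer.
function diagnose_pdftotext_path($path) {
    $on_windows = strtoupper(substr(PHP_OS, 0, 3)) === 'WIN';

    // A drive-letter path such as c:\... only makes sense on Windows.
    if (!$on_windows && preg_match('/^[A-Za-z]:\\\\/', $path)) {
        return 'Windows-style path on a non-Windows server';
    }
    if (!is_file($path)) {
        return 'file not found';
    }
    // A Windows .exe binary will generally not run on a Linux host.
    if (!$on_windows && strtolower(substr($path, -4)) === '.exe') {
        return 'Windows .exe binary on a non-Windows server';
    }
    if (!is_executable($path)) {
        return 'file exists but is not executable';
    }
    return 'ok';
}

// Placeholder value; substitute the $pdftotext_path setting from conf.php.
echo diagnose_pdftotext_path('c:\\sphider\\pdftotext.exe'), "\n";
```

Run it on the server itself (upload it next to conf.php), since the answer depends on the machine where the indexer runs, not on your own PC.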

Given that the above has received no reply for 12 months is there any point in even bothering to get sphider to work? It seems that there is no support for it and it is badly broken. The PDF indexing does not work and the link on their web site to the doc file indexer is broken.
Yes, pmolsen is right.

Indexing PDF files is not working.

August 11, 2012 06:15PM
How did you get to 90GB?! Mine crashes silently after less than a gig!
I am integrating Sphider into my current project and I have the same issue as mentioned previously: PDF files are found, but they all get skipped because their extracted length is less than my minimum.

It has been quite a while since this thread was active, but has anyone had any luck indexing PDFs?

I have been looking through the code that makes Sphider able to index a PDF and have found a solution to the error you are receiving.

Inside the Sphider project, '/sphider/admin/spiderfuncs.php' contains a function called extract_text(). This function writes the content of the PDF you are trying to index to a temporary file and then runs pdftotext.exe on that file. The issue is that this temporary file has no file extension (i.e. '.pdf'), and Xpdf's pdftotext.exe seems to require one. Below is example code you could put in place of the lines just below the global declarations and before the fopen() call, to add the file extension to the temporary file:

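        // Give the temporary file a ".pdf" extension so pdftotext accepts it.
        $ext = "";
        if ($source_type == 'pdf') {
            $ext = ".pdf";
        }
        $temp_file = "tmp_file".$ext;
        // "\\" is the Windows path separator; on a Linux server use "/" instead.
        $filename = $tmp_dir."\\".$temp_file;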

Additionally, if you have spaces in the names of your PDF files, the file name must be wrapped in double quotes. I decided to do this for all my PDFs, not just those with spaces. When working in a Windows environment, the following line of code adds the double quotes for all PDFs:

        $command = $pdftotext_path." \"$filename\" -";
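For servers that are not running Windows, the same two changes can be written without hard-coding the path separator or the quote style. The following is only a sketch under assumptions: temp_file_path() and pdftotext_command() are hypothetical helpers rather than Sphider functions, and the paths in the usage lines are placeholders. It swaps the hand-written double quotes for PHP's escapeshellarg(), which quotes the argument in the way the current platform's shell expects.

```php
<?php
// Hypothetical helpers (not Sphider functions) combining both changes.

// Builds the temporary file path, appending ".pdf" for PDF sources.
function temp_file_path($tmp_dir, $source_type) {
    $ext = ($source_type == 'pdf') ? '.pdf' : '';
    // Strip any trailing slash, then join with the platform's separator.
    return rtrim($tmp_dir, '/\\') . DIRECTORY_SEPARATOR . 'tmp_file' . $ext;
}

// Builds the pdftotext command line. The trailing "-" asks pdftotext to
// write the extracted text to standard output instead of a file.
function pdftotext_command($pdftotext_path, $filename) {
    return $pdftotext_path . ' ' . escapeshellarg($filename) . ' -';
}

// Placeholder paths; a directory name containing a space is used on purpose.
$filename = temp_file_path('/tmp/sphider files', 'pdf');
echo pdftotext_command('/usr/bin/pdftotext', $filename), "\n";
```

The quoting matters whenever the temporary directory or file name contains a space, because an unquoted name would otherwise reach pdftotext as two separate arguments.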

Hopefully this post helps someone out. Let me know if there are any questions.
