Indexing PDFs on Linux with Sphider-plus (solved for me)

Posted by rasc 
June 30, 2008 10:55AM
After hours of trying, I found a working way to index PDFs on a shared Linux host with Sphider-plus. Here's how:

Sphider-plus includes a PDF converter, but it doesn't work on Linux systems because it is an .exe file, which Linux cannot run.

Instead of using the built-in converter try this:
1. Download the pre-compiled Linux binary of pdftotext, included in the xpdf bundle, from: www.foolabs.com/xpdf/download.html
2. Unzip/untar the package and keep only the pdftotext file (it has no extension; that's OK)
3. Rename "pdftotext" to "pdftotext.script"
4. Upload this file via FTP to the "converter" directory of Sphider-plus
5. Identify the physical path of your web site (your hosting provider should be able to give you this information)
6. Create an empty text file and write these two lines into it:
#!/bin/sh
/PATH/TO/YOUR/WEB/DOWN/TO/converter/pdftotext.script $1 -
7. Adapt the full path above to your setup and use forward slashes (not double backslashes)
The second line begins with a slash and ends WITH the minus sign!
(Thanks to the user who posted this hint some time ago.)
8. Save this file as "pdftotext" (without the quotes)
9. Upload it to the converter dir
10. Set the permissions of both pdftotext and pdftotext.script to 755 or 777 (whichever is needed for them to run correctly)
11. Set the permissions of the converter dir to 777! Otherwise indexing fails because pdftotext cannot write the temp file it needs!
12. Last: change the pdftotext path in conf.php to:
$pdftotext_path = '/PATH/TO/YOUR/WEB/DOWN/TO/converter/pdftotext';
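
For anyone with shell access to the host, steps 6 to 11 above can be sketched as a short script. This is only a rough sketch, not part of the original recipe: CONVERTER_DIR is a placeholder variable that has to be set to the real converter directory, and the renamed binary from step 3 still has to be uploaded separately.

```shell
#!/bin/sh
# Sketch of steps 6-11. CONVERTER_DIR is a placeholder: set it to the real
# converter directory. With nothing set, a throwaway temp directory is used.
CONVERTER_DIR="${CONVERTER_DIR:-$(mktemp -d)/converter}"
mkdir -p "$CONVERTER_DIR"

# Steps 6-9: write the two-line wrapper. \$1 is escaped so it stays literal
# in the file, while $CONVERTER_DIR is expanded to the full path.
cat > "$CONVERTER_DIR/pdftotext" <<EOF
#!/bin/sh
$CONVERTER_DIR/pdftotext.script \$1 -
EOF

# Step 10: make the wrapper and, if it is already uploaded, the renamed
# binary executable.
chmod 755 "$CONVERTER_DIR/pdftotext"
if [ -f "$CONVERTER_DIR/pdftotext.script" ]; then
    chmod 755 "$CONVERTER_DIR/pdftotext.script"
fi

# Step 11: let the converter write its temp file into the directory.
chmod 777 "$CONVERTER_DIR"
```

Run it as, for example, `CONVERTER_DIR=/PATH/TO/YOUR/WEB/DOWN/TO/converter sh setup.sh` (the file name setup.sh is just an example).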

Now it should work fine. For me it does.
Re: Indexing PDFs on Linux with Sphider-plus (solved for me)
July 01, 2008 10:58PM
I'm new to Sphider, but adapting your instructions, I got the converter working with regular Sphider (not Sphider-Plus). Thanks.

My only problem is that it doesn't seem to work if I index via the command line, although it does work via the web interface. Unfortunately this client's web host seems to have a ridiculously low memory limit, so any PDF over 3 MB generates a fatal error. No access to php.ini, unfortunately.
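
If the memory limit cannot be raised, it may at least help to know in advance which files will hit it. A minimal sketch, assuming shell access; large_pdfs and PDF_DIR are hypothetical names, and the 3 MB threshold is taken from the post above rather than from any Sphider setting:

```shell
# large_pdfs DIR BYTES: list PDF files under DIR that are larger than BYTES.
large_pdfs() {
    find "$1" -type f -name '*.pdf' -size +"$2"c
}

# 3 MB = 3 * 1024 * 1024 bytes. PDF_DIR defaults to the current directory.
large_pdfs "${PDF_DIR:-.}" 3145728
```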



Edited 1 time(s). Last edit at 07/01/2008 11:00PM by Dekortage.
Re: Indexing PDFs on Linux with Sphider-plus (solved for me)
July 10, 2008 06:59AM
Thanks for sharing your solution.
I am totally new to both Sphider and Sphider-Plus, and I applied your method for PDF indexing (on Suse Linux with Sphider-Plus). It works well.

But what about indexing Word documents?
I downloaded catdoc-0.94.2, but I don't know how to make it run in a Linux environment.

If you have any ideas, please help. I searched the forum for this but still have not found a proper solution.
Re: Indexing PDFs on Linux with Sphider-plus (solved for me)
May 30, 2009 06:03AM
Hi to all!
I have the same problem as many others: Sphider 1.3.4 doesn't index PDF files, and I tried the solutions posted on this forum without success. So I considered using a PHP function to convert them. I found this:
http://community.livejournal.com/php/295413.html
and it works!
So now, how can we include that script in spiderfuncs.php? If pdftotext fails (maybe because of server settings; on my PC it works fine...), it's probably easier to use something else! ;-)



Edited 1 time(s). Last edit at 05/30/2009 06:05AM by RedWolf.
Re: Indexing PDFs on Linux with Sphider-plus (solved for me)
August 28, 2009 12:57AM
You're a genius.
Re: Indexing PDFs on Linux with Sphider-plus (solved for me)
August 28, 2009 08:27PM
Hello,
I'm completely new to all of this.
Goal: create an accessible repository of PDF files (exclusively) for visitors, with full-text search.

Work done so far:
- Created mysql database.
- Filled database.php file
- Installed Sphider on shared server (dreamhost). Database installed properly. No problem so far.

- Downloaded the pdftotext file and renamed it pdftotext.script. Created the pdftotext file and changed the pdftotext.script path in it. Placed both files in the converter directory.
- Chmodded the converter directory to 777 and the files to 775.
- Changed user and pw in auth.php
- Created datapdf directory. Pointed to it using /cmxxx.com/searchpub/datapdf
- Modified conf.php
$pdftotext_path = '/mydomain.com/converter/pdftotext'; (I got this path from FileZilla and replaced mydomain with cmxxxxxx.com.)
- Modified the spider.php file to check whether the file exists. Launched IndexAll and got the "file does not exist" message.

Now, the problem: obviously I do not know how to set the path properly. I also tried http://www.cmxxxx.com/converter/ and that failed too.

Can any of you nice people help me here?
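
A possible way to find the physical path, assuming the host offers shell (SSH) access: the path an FTP client displays is often relative to the account's home directory, and an http:// URL is not a filesystem path at all, so neither will work in conf.php. The sketch below prints the absolute path of the directory it is run from. CONVERTER_DIR is only an illustrative variable name.

```shell
# Run this from inside the converter directory on the server.
# pwd -P prints the physical (symlink-free) absolute path.
CONVERTER_DIR=$(pwd -P)

# Print a line in the form conf.php expects.
echo "\$pdftotext_path = '$CONVERTER_DIR/pdftotext';"
```

The same absolute path is what belongs in the second line of the pdftotext wrapper file.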
Re: Indexing PDFs on Linux with Sphider-plus (solved for me)
September 10, 2009 04:29PM
If you are using Debian / Ubuntu as your server, it's very easy:

# apt-get install xpdf

Then set "Full executable path to PDF converter" to /usr/bin/pdftotext

That's it!
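
To confirm where the package actually put the binary before entering the path, a quick check along these lines may help. PDFTOTEXT_BIN is just an illustrative variable name. On newer Debian / Ubuntu releases pdftotext is also shipped in the poppler-utils package, so the package name may differ from the one above.

```shell
# Look up pdftotext on the PATH; the variable stays empty if it is not found.
PDFTOTEXT_BIN=$(command -v pdftotext || true)

if [ -n "$PDFTOTEXT_BIN" ]; then
    echo "pdftotext found at: $PDFTOTEXT_BIN"
else
    echo "pdftotext not found; is the package installed?"
fi
```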
Re: Indexing PDFs on Linux with Sphider-plus (solved for me)
October 07, 2009 02:10AM
How did you integrate the function into Sphider???

It might be good info for others running into issues with the PDFtoText converters.
Re: Indexing PDFs on Linux with Sphider-plus (solved for me)
July 26, 2011 01:02AM
Hello out there,

I've installed it and it works great. But my question is: would it be possible to install the pdftotext language support packages in the same way, and if so, how exactly?

I need to index Cyrillic and Greek PDFs with Sphider!

Thanks a lot for your help,

Schmidi
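
One thing that may be worth trying first, as an untested assumption rather than a confirmed fix: pdftotext accepts an -enc option for the output encoding, and UTF-8 is one of its built-in encoding names. The wrapper from the first post could pass it as sketched below. WRAPPER_LINE is a hypothetical variable used only to display the line, and the path is the same placeholder as in the first post. Installing the xpdf language packages themselves additionally involves an xpdfrc configuration file, which is not covered here.

```shell
# The second line of the wrapper with the output encoding forced to UTF-8.
# Single quotes keep $1 literal, as it must appear in the wrapper file.
WRAPPER_LINE='/PATH/TO/YOUR/WEB/DOWN/TO/converter/pdftotext.script -enc UTF-8 $1 -'

# Show the complete two-line wrapper.
printf '#!/bin/sh\n%s\n' "$WRAPPER_LINE"
```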
Re: Indexing PDFs on Linux with Sphider-plus (solved for me)
January 14, 2014 02:00PM
Not sure if this thread is still active, but I tried both options, installing pdftotext via apt-get and placing the binary in my folder, and both methods return this result:

Retrieving: http://example.com/download_file/A_Fair_Share.pdf at 15:51:08.
Not text or html

help!?
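
One way to narrow this down might be to run the converter by hand, outside Sphider, and see whether it produces any text. This only tests the converter itself. It says nothing about how Sphider decides whether a URL is "text or html". check_converter is a hypothetical helper, and both paths in the example call are placeholders.

```shell
# check_converter BIN PDF: run the converter by hand and show the first
# lines of extracted text.
check_converter() {
    bin=$1
    pdf=$2
    if [ ! -x "$bin" ]; then
        echo "not executable: $bin" >&2
        return 1
    fi
    # Same call shape as the wrapper in the first post: input file, then
    # "-" to send the text to standard output.
    "$bin" "$pdf" - | head -n 5
}

# Example call. The fallback message appears when the binary is missing
# or not executable.
check_converter /usr/bin/pdftotext /path/to/sample.pdf || echo "converter check failed"
```

If text comes out here but indexing still reports "Not text or html", the converter is probably fine and the cause lies elsewhere in the configuration.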