
Multi-threading
Posted by akreider
June 02, 2009 08:40AM
Has anyone thought of adding multi-threading?

The XENU Sleuth program is a great example of this. It multithreads when it checks whether links are valid. I've run 30 threads on my computer and it goes super fast.

If you were running the threads on different websites it would avoid causing the sites trouble. So you'd be spidering 30 websites at once.

30 might be more than necessary, but 5-10 threads would increase the speed dramatically without being a burden on a modern CPU.

Re: Multi-threading
November 29, 2009 09:58AM
I've thought of this, but I haven't really looked into it.

On the surface, it probably needs either a good refactoring of the code, or a way of running the spider that doesn't go through the httpd/web server, in order to be highly efficient.

I'm taking a course on parallel programming now, and I'll see if I have time to hack something up after these few weeks. :-)
Re: Multi-threading
December 02, 2009 11:40PM
That must mean you have to rewrite the code from scratch. I tried dividing Sphider's workload by 10, sending out 10 sockets at the same time, but the problem is that the first socket will scan the 2nd through 10th portions all over again, making multi-threading useless: you're just running the spider 10x over the same pages. I'm trying to tell it to stop executing when it reaches socket two, but if I put halt or similar stop commands in the PHP, it quits the search completely. :-(
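For what it's worth, the duplicate scanning described above goes away if each worker is handed a disjoint slice of the URL list up front, instead of every socket starting from the same full list. A minimal shell sketch (not Sphider code; assumes GNU split, and echo stands in for the real spider call):

```shell
# Split the URL list into two disjoint slices (chunk_aa, chunk_ab) and give
# each background worker its own slice, so no URL is spidered twice.
printf '%s\n' http://www.example.com http://blog.example.com \
    http://shop.example.com http://wiki.example.com > urls.txt
split -n l/2 urls.txt chunk_
for f in chunk_*; do
    while read -r u; do
        echo "worker($f) spidering $u"   # stand-in for: /bin/php spider.php -u "$u"
    done < "$f" &
done
wait
```

The same idea works with any number of slices; the point is that the partitioning happens once, before the workers start.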
Re: Multi-threading
December 22, 2009 02:44PM
This will need some additional logic, without question.

Ideally, the process should be a batch of URLs divided into groups. Each thread should be assigned a group of URLs to process. Once a thread completes or dies, it should be restarted to handle another group of URLs.
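On a Unix box, this group-and-restart pattern can be sketched with xargs -P, which keeps a fixed pool of worker processes and hands the next URL to a fresh process as soon as one finishes. A hedged sketch, not Sphider's actual code (echo stands in for the spider call):

```shell
# Feed URLs to a pool of 2 workers; xargs starts a new worker on the next URL
# whenever one completes, mirroring the "reassign the thread" idea above.
printf '%s\n' \
    http://www.example.com \
    http://blog.example.com \
    http://shop.example.com |
xargs -P 2 -I{} echo "indexing {}" | tee indexed.log
# a real call might look like: xargs -P 2 -I{} /bin/php spider.php -u {} -d 0 -r
```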

This is similar to what we have within our ad server. In our case, a single-server deployment can index a couple hundred thousand URLs per day quite easily. We've taken this a bit further to allow multiple dedicated indexing nodes for high-volume users, an excellent complement to using cloud server services.

This sounds like a good idea for a Sphider extension: On Demand Indexing

Feedback, anyone?
Re: Multi-threading
June 05, 2012 08:29PM
It's difficult when a thread has several topics/tasks, so I'll take them one by one.

(1) several jobs
If you run spider.php from the command line, you can send each PHP process to the background. Use a shell script like this:

cd /home/axel/(...)/php/sphider/admin/ || exit 1

urls="$urls http://www.example.com"
urls="$urls http://blog.example.com"
# (...)

for myurl in $urls
do
        /bin/php spider.php -u "$myurl" -d 0 -r &
done

(2) multithreaded spider
I could imagine it in the following way:
all pages that are in the index and that still match the include/exclude rules would be pushed to curl

You could then parse these pages to find new pages on the website, e.g.:

preg_match_all( '/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is', $pagecontent, $output, PREG_SET_ORDER );
foreach( $output as $item ) {
	// check the link: is it a new page or not - if so, spider it, index it and parse it for links
}
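The "is it a new page or not" check is essentially a shared visited list: once a URL is on it, it's skipped rather than spidered again, which is also what the multi-socket experiment earlier in the thread was missing. A tiny shell sketch of the idea (not Sphider code; echo stands in for fetch/index/parse):

```shell
# Keep a visited list; a URL already on it is skipped instead of re-spidered.
: > visited.txt
for u in http://www.example.com/ http://www.example.com/ http://blog.example.com/; do
    if grep -qxF "$u" visited.txt; then
        continue                          # already spidered, skip it
    fi
    echo "$u" >> visited.txt
    echo "spidering $u"                   # stand-in for fetch + index + parse links
done
```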

(3) On demand indexing
It would be easy if spider.php could be called via an HTTP request (with an auth key for admin authentication as a URL parameter).
Then you just need to send the HTTP request. You don't need to wait for the response - so you can use curl, or exec() a wget sent into the background like this:

$sCmd = '/bin/wget -O /dev/null ' . escapeshellarg( $url ) . ' >/dev/null 2>&1 &';
$exec = exec( $sCmd );

