Spidering sites with token id

Posted by pedalpete 
May 01, 2007 03:53AM
I am attempting to crawl music.myspace.com, but they use tokens in the URL.
The token is not tied to the user through a session; it changes on certain page views.

Any idea how to set up Sphider to ignore the tokens, so that it does not crawl the same page again and again?

Myspace works fine if you request the page without the token.

I have attempted to limit the crawl to URLs whose tokens start with E, etc., but because the tokens change, the spider only crawls part of the site.

Basically, what I am trying to do is build an index of friendids for musicians only on Myspace (trying not to pick up all profile pages).

Any suggestions would be great.
Pete
Re: Spidering sites with token id
May 03, 2007 08:39AM
To ignore the token, add its parameter name to the list of session IDs in the file admin/spiderfuncs.php

line 828:

return preg_replace("/(\?|&)(PHPSESSID|JSESSIONID|ASPSESSIONID|sid)=[0-9a-zA-Z]+$/", "", $url);
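
Note that the pattern above only matches a parameter at the very end of the URL, and only values made of letters and digits. If the token can appear in the middle of the query string or contain other characters such as hyphens, something along these lines may work better. This is only a rough sketch: strip_query_param is a made-up helper (not part of Sphider), and "MyToken" is an assumed parameter name, so substitute whatever name the site actually uses.

```php
<?php
// Hypothetical helper, not part of Sphider: removes one query parameter
// wherever it appears, so URLs that differ only by the token compare equal.
function strip_query_param($url, $param)
{
    $name = preg_quote($param, '/');
    // Remove "name=value" together with one trailing "&", case-insensitively,
    // keeping the "?" or "&" that preceded it.
    $url = preg_replace('/([?&])' . $name . '=[^&#]*&?/i', '$1', $url);
    // Drop a "?" or "&" left dangling at the end or before a #fragment.
    return preg_replace('/[?&](?=#|$)/', '', $url);
}

// "MyToken" is an assumption; use the real parameter name.
echo strip_query_param('http://example.com/index.cfm?fuseaction=music&MyToken=ab-12&friendid=5', 'MyToken'), "\n";
// prints http://example.com/index.cfm?fuseaction=music&friendid=5
```

If you go this route, the idea would be to run each URL through the helper at the same point where Sphider strips session IDs, so the stored URL no longer contains the token.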

Diego Medina
[url=http://www.fmpwizard.com]Web Developer[/url]