Temporary ignore robots.txt

Posted by Tec
Temporary ignore robots.txt
July 23, 2007 03:30PM
Normally it is a pleasure to use a well-behaved Sphider that follows the instructions in robots.txt. But some webmasters are strange people: they try to prevent me from indexing their sites with a restrictive robots.txt.
For those cases it is helpful to also have a naughty Sphider.

To let Sphider temporarily forget its respect for robots.txt:


Open ... admin/admin.php and search for:

<input type="radio" name="soption" value="level" <?php print $levelchecked;?>>To depth: <input type="text" name="maxlevel" size="2" value="<?php print $spider_depth;?>"><br/>

Delete the next 2 rows and replace them with the following 3 rows:

<?php if ($reindex==1) $check="checked"; $userobot = "checked";?>
<input type="checkbox" name="reindex" value="1" <?php print $check;?>> Reindex<br/>
<input type="checkbox" name="use_robot" value="1" <?php print $userobot;?>> Use 'robots.txt'<br/>


Open .../admin/spider.php and search for

if(!isset($reindex)) {
$reindex=0;
}

Directly below this, add:

if(!isset($use_robot)) {
$use_robot=0;
}
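
The default is needed because an unchecked HTML checkbox is simply not sent with the form, so $use_robot would otherwise be undefined on a 'naughty' run. The same guard, with comments spelling that out:

if(!isset($use_robot)) {   // checkbox was left unchecked, so no value arrived
$use_robot=0;              // 0 = ignore robots.txt for this run
}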


In the same file search for:

index_site($url, $reindex, $maxlevel, $soption, $in, $out, $domaincb);

Delete this row and replace it with:

index_site($url, $reindex, $maxlevel, $soption, $in, $out, $domaincb, $use_robot);


In the same file search for:

function index_site($url, $reindex, $maxlevel, $soption, $url_inc, $url_not_inc, $can_leave_domain) {

Delete this row and replace it with:

function index_site($url, $reindex, $maxlevel, $soption, $url_inc, $url_not_inc, $can_leave_domain, $use_robot) {
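
If your Sphider version calls index_site() from more than one place (the command line entry point, for example), every caller needs the extra argument as well. A defensive variant, only a sketch, is to give the new parameter a default value, so any call that was not updated keeps the old, polite behaviour:

function index_site($url, $reindex, $maxlevel, $soption, $url_inc, $url_not_inc, $can_leave_domain, $use_robot = 1) {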


In the same file search for

$omit = check_robot_txt($url);

Delete this row and replace it with:

$robots = ("robots.txt"winking smiley; // standardname of file
if ($use_robot != '1') {
$robots = ("no_robots.txt"winking smiley; // Sphider never will find this file and ignore the contents of robots.txt
}

$omit = check_robot_txt($url, $robots);
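
The trick here is that the spider asks the web server for no_robots.txt, gets nothing back, and therefore ends up with an empty disallow list. Just as a sketch (not part of the patch above, and assuming check_robot_txt() returns the array of disallowed paths, as the surrounding code suggests), one could instead skip the request completely:

if ($use_robot == '1') {
$omit = check_robot_txt($url); // original, unmodified call
} else {
$omit = array(); // no robots.txt consulted, nothing is disallowed
}

With that variant the change to spiderfuncs.php below would not be needed, at the cost of one extra if in spider.php.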


Now open .../admin/spiderfuncs.php and search for:

function check_robot_txt($url) {
global $user_agent;
$urlparts = parse_url($url);
$url = 'http://'.$urlparts['host']."/robots.txt";

Delete all these rows and replace them with:

function check_robot_txt($url,$robots) {
global $user_agent;
$urlparts = parse_url($url);
$url = 'http://'.$urlparts['host']."/$robots";


That's it.
In the Index and Re-index submenus you will now find an additional checkbox called Use 'robots.txt'. It is normally checked and can be unchecked whenever a temporarily naughty Sphider is required.
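
A quick way to convince yourself that the switch works is to call the modified function directly from a small test script that includes spiderfuncs.php and the usual Sphider configuration (the host is only an example; the function is assumed, as above, to return the array of disallowed paths):

// With the real file any disallowed paths come back;
// with the dummy name the request fails and the list stays empty.
print_r(check_robot_txt("http://www.example.com/", "robots.txt"));
print_r(check_robot_txt("http://www.example.com/", "no_robots.txt"));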

Happy coding

Tec
Re: Temporary ignore robots.txt
July 24, 2007 09:19AM
Thanks a lot, Tec!
Re: Temporary ignore robots.txt
August 22, 2007 07:59PM
Hilarious!
Re: Temporary ignore robots.txt
August 21, 2008 02:31PM
Hi everyone.
I use the mp3 mod for Sphider. In the file .../admin/spider.php I cannot find the line $omit = check_robot_txt($url, $robots); and this mod does not work for Sphider with the mp3 mod. Please help. Sorry for my bad English.
Re: Temporary ignore robots.txt
September 02, 2008 06:11PM
Hey!

Is it working for the original Sphider only? I'm asking because I have some problems getting this to work in Sphider-Plus. I'm not very good at PHP, but I try my best to do some modifications with the help of this forum.

The problem is the code modification in admin.php, where the code is a little bit different. I've changed it as much as I can to get it to work, and there is finally an additional checkbox "robots.txt". But now the spider stops indexing after the first URL. What's wrong? Can you please help? How do I get it to work for Sphider-Plus?

Here are the modified rows of admin.php:

Index depth:
<input type='text' name='maxlevel' size='2' title='Enter indexing depth level' value='$spider_depth'
/>
";
if ($reindex==1) {$check='checked="checked"'; $userobot = "checked";}
echo "<label class='em' for='reindex'>Re-index</label>
<input type='checkbox' name='reindex' id='reindex' title='Check box to Re-index' value='1' $check
/> Check to Re-index</fieldset>
<input type='checkbox' name='use_robot' value='1' $check/> Use 'robots.txt'</fieldset>
";


Thanks!

Upshapes Interactive
http://www.upshapes.de/
Re: Temporary ignore robots.txt
October 01, 2008 08:03PM
Never try this on my website or on the websites of the many webmasters I know.
Your crawler (name and IP) would quickly end up on the ban list, and its data would be transmitted to the list maintainer.

When a webmaster forbids crawlers, he or she has a reason.

Two of mine are configured to do so because they are still in test mode and I don't want anybody to come by except the few people who help me debug them.



Re: Temporary ignore robots.txt
March 16, 2012 11:56AM
GeekMan Wrote:
-------------------------------------------------------
> Your crawler (name and IP) would quickly end up on the ban list, and its data would be transmitted to the list maintainer.

I'm having the same problem as Wesley, and it makes sense, so now I don't know whether this mod didn't work or the server just blocked my host's IP.
The error message no longer appears, but the crawling stops anyway.
I don't know whether I'm banned or not, because the visible pages are still crawled, so either their server is blocking robots only for my IP or this mod is not working.

How does their server know I'm not a browser?
How can I crawl as a browser? I changed the user agent and that didn't work either.
I tried Sphider-Plus 1.6, no success.

...or maybe adding a big delay between pages?

The pages are public, user-generated content (a forum, not copyrighted), but the company blocks access so we can't find each other. The company abuses its users, removes any contact link, moves contact forms around so that links posted online stop working, and keeps the forums as archive pages you can't search. All that to make it utterly impossible for us to damage their marketed image.

I'd really want to make an index of all those problems, and allow others to find each other.

Do you have any solution?
Re: Temporary ignore robots.txt
March 27, 2012 08:10PM
Wow, bro. THANK YOU. You are the shiz!
Re: Temporary ignore robots.txt
March 27, 2012 08:20PM
Hey Mr. Big Wig Webmaster,

Tell me exactly how you would assign a name to the IP address.

You can't, that's how.

That is all.
Re: Temporary ignore robots.txt
March 27, 2012 08:36PM
Tec Wrote:
-------------------------------------------------------
> [Tec's full instructions, quoted from the first post above]
=================
So why is there a friggin error at the end of the script?

Parse error: syntax error, unexpected $end in somecrap/sphider-1.3.5/admin/spiderfuncs.php on line 839
=====

Ah, never mind, I am an idiot sometimes... I forgot to comment out the original function header for check_robot_txt($url).
It works!



Re: Temporary ignore robots.txt
May 31, 2012 05:16PM
Tec, what would the syntax look like when running this from the command line?
Tec
Re: Temporary ignore robots.txt
June 01, 2012 08:01AM
Not available, as the command line options are limited to
-all Reindex everything in the database
-u <url> Set the url to index
-f Set indexing depth to full (unlimited depth)
-d <num> Set indexing depth to <num>
-l Allow spider to leave the initial domain
-r Set spider to reindex a site
-m <string> Set the string(s) that a URL must include (use \n as a delimiter between multiple strings)
-n <string> Set the string(s) that a URL must not include (use \n as a delimiter between multiple strings)

You would need to improve the scripts
.../admin/spider.php
and
.../admin/spiderfuncs.php
to accept the 'temporary ignore robots.txt' as a new command line option.
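
As a rough pointer only, here is a minimal sketch of what such an option could look like; the -b switch, the parsing loop and the comments are assumptions, not existing Sphider code:

// Hypothetical sketch for .../admin/spider.php -- not part of Sphider.
$use_robot = 1;                      // default: honour robots.txt
for ($i = 1; $i < count($argv); $i++) {
    if ($argv[$i] == '-b') {         // hypothetical "be naughty" switch
        $use_robot = 0;              // ignore robots.txt for this run
    }
}
// ...and $use_robot is then handed on exactly as in the web interface:
// index_site($url, $reindex, $maxlevel, $soption, $in, $out, $domaincb, $use_robot);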
The same scripts need to be modified to accept additional 'category' selection.
Please do not ask me to do so for you. I am busy with the scripts of Sphider-plus, which, up to now, also do not meet your desires.


Tec
Re: Temporary ignore robots.txt
June 04, 2012 04:11AM
OK.
Thank you for your tips.
Not sure I'll be able to do it either, but I'll look into it.