1.3.5 does not obey robots.txt

Posted by t-p 
t-p
1.3.5 does not obey robots.txt
January 07, 2010 12:11AM
Hi,

I updated my 1.3.4 install by replacing the changed files.

I noticed that 1.3.5 does not fully follow my robots.txt (it obeys some rules but still indexes some disallowed pages), while 1.3.4 followed it completely.

Can anybody please help me solve this problem? Do I need PHP 5+ to use 1.3.5? Thanks.
Re: 1.3.5 does not obey robots.txt
February 11, 2010 03:13AM
t-p,

I installed 1.3.5 just today and also noticed that it does not obey robots.txt. I would be interested if anyone has a fix for this as well. I have two existing installs using 1.3.4 (but they show 1.3.3 in admin...)

BTW, I'm using PHP 5 on my servers, so that is not the problem...



Edited 1 time(s). Last edit at 02/11/2010 03:14AM by Convergence.
Re: 1.3.5 does not obey robots.txt
February 22, 2010 07:34PM
Hey guys,

I discovered the same thing. It looks like when the author replaced the eregi functions with preg_match, he forgot to add the case-insensitive modifier.

Basically, if your robots.txt has User-agent instead of user-agent, the line won't be matched.

Anyway, I have created a patch for it here: http://mdj.us/media/spiderfuncs.patch

If you don't know what the hell that is, just open spiderfuncs.php, look for preg_match, and add the "i" modifier to the regular expression wherever it's missing. There are three instances to change.
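For anyone unsure what the "i" modifier changes, here is a minimal sketch. The helper name is_user_agent_line and the pattern are made up for illustration; they are not the actual code from spiderfuncs.php.

```php
<?php
// Hypothetical helper for illustration only, NOT Sphider's real code.
// Returns true if $line looks like a User-agent directive.
function is_user_agent_line($line, $caseInsensitive)
{
    // The trailing "i" after the closing delimiter makes the match
    // case-insensitive, which is what eregi() used to do implicitly.
    $pattern = $caseInsensitive ? '/^user-agent:/i' : '/^user-agent:/';
    return preg_match($pattern, trim($line)) === 1;
}

var_dump(is_user_agent_line('User-agent: *', false)); // bool(false)
var_dump(is_user_agent_line('User-agent: *', true));  // bool(true)
```

Without the modifier, only the all-lowercase spelling matches, so a robots.txt written with the usual "User-agent" capitalization is silently skipped.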
t-p
Re: 1.3.5 does not obey robots.txt
February 22, 2010 11:39PM
Thanks matt for taking time to help, really appreciate it.
Re: 1.3.5 does not obey robots.txt
February 25, 2010 07:40PM
Thanks Matt,

A tiny but significant change... thumbs up
Re: 1.3.5 does not obey robots.txt
March 04, 2010 08:34PM
Hi again all, there's yet another problem in the check_robot_txt function within spiderfuncs.php.

Around line 217, it has:
return null;

This makes the function return null and exit as soon as a Disallow rule has an empty value. That is a huge problem, because many robots.txt files start with a general "disallow nothing" rule and then go on to restrict specific directories.

For example:
User-agent: *
Disallow:
Disallow:  /cgi-bin/ 
Disallow: /private/

And so on. As written, the function exits as soon as it sees the first empty Disallow rule, so the rules after it are never read. Change the return null line (line 217) to this:
continue;

This tells the script to skip the empty rule and carry on with the next line in the loop.

Now it finally parses and applies rules from robots.txt correctly.
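To show why continue matters, here is a simplified sketch of the loop. It is an assumption-laden illustration, not the real check_robot_txt: the function name collect_disallowed is made up, and it ignores User-agent grouping entirely.

```php
<?php
// Simplified, hypothetical parser for illustration only.
// It is NOT the real check_robot_txt from spiderfuncs.php.
function collect_disallowed($lines)
{
    $disallowed = array();
    foreach ($lines as $line) {
        // Case-insensitive match on "Disallow:", capturing the path after it.
        if (preg_match('/^disallow:\s*(.*)$/i', trim($line), $m)) {
            $path = trim($m[1]);
            if ($path === '') {
                // An empty Disallow restricts nothing. A "return null;" here
                // would throw away every later rule; skip this line instead.
                continue;
            }
            $disallowed[] = $path;
        }
    }
    return $disallowed;
}

$robots = array('User-agent: *', 'Disallow:', 'Disallow:  /cgi-bin/ ', 'Disallow: /private/');
print_r(collect_disallowed($robots)); // lists /cgi-bin/ and /private/
```

With return null in place of continue, the same input would yield nothing at all, because the empty "Disallow:" on the second line ends the function before the real rules are reached.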

Now when you index, you should see the disallowed rules at the top of the output/logs, for example:
Disallowed files and directories in robots.txt:
http: //example.com/cgi-bin/
http: //example.com/private/
And so on. I added a space after "http:" because the forum software keeps trying to linkify the text.

I have updated the patch at http://mdj.us/media/spiderfuncs.patch



Edited 1 time(s). Last edit at 03/04/2010 08:35PM by matt.
Re: 1.3.5 does not obey robots.txt
August 06, 2010 05:14AM
Thanks man, perfect!



Edited 1 time(s). Last edit at 08/06/2010 12:34PM by Malcolm.
Re: 1.3.5 does not obey robots.txt
August 29, 2010 07:40AM
Some sites also set rules for robots in META tags, regardless of what the robots.txt file says. Some pages are not indexed by Sphider unless their robots meta tag says "index, follow".
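As a rough illustration of what a robots META tag check could look like, here is a small sketch. The function meta_robots_directives is hypothetical and is not how Sphider itself reads the tag; it also assumes the name attribute comes before content, which real pages do not guarantee.

```php
<?php
// Hypothetical helper for illustration only, not Sphider's own logic.
// Returns the lowercase directives from a robots META tag, or an empty
// array if no such tag is found. Assumes name="..." precedes content="...".
function meta_robots_directives($html)
{
    $pattern = '/<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*)["\']/i';
    if (!preg_match($pattern, $html, $m)) {
        return array();
    }
    // Split "NOINDEX, FOLLOW" into array('noindex', 'follow').
    return array_map('trim', explode(',', strtolower($m[1])));
}

$directives = meta_robots_directives('<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">');
var_dump(in_array('noindex', $directives, true)); // bool(true)
```

A page can therefore be excluded by its own META tag even when robots.txt allows it, which is worth checking before blaming the robots.txt parser.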
Re: 1.3.5 does not obey robots.txt
January 13, 2011 10:04AM
Thanks heaps for your patch, that solved a big hassle I was having..!

Ross..
Re: 1.3.5 does not obey robots.txt
January 21, 2011 01:20AM
Hello everyone, thank you! :) :)
Re: 1.3.5 does not obey robots.txt
December 02, 2011 02:41AM
Kudos on that patch, Matt. I've been pounding my head against this for days thinking I was doing something wrong. You're my hero.
Re: 1.3.5 does not obey robots.txt
January 27, 2013 09:08AM
Thank you very much! Works very well now.

I owe you a pint.

thumbs up