Follow sitemap.xml (version II)

Posted by Tec 
November 05, 2007 11:33AM
Advantages over prior versions:

In admin settings there is a new checkbox to select: "If available follow sitemap.xml".

If selected, this mod will force Sphider to index all links found in a sitemap.xml file.

This mod will also force Sphider to reindex only links that are
- new and not yet known in Sphider's link table
and
- links whose 'last modified' date is newer than Sphider's 'last indexed' date.
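
The two conditions above can be sketched as a small stand-alone check. This is only an illustration, not code from the mod: the function name needs_reindex() and the null convention for unknown links are assumptions, and treating an unparsable date as "reindex" is a choice made for this sketch.

```php
<?php
// Hypothetical helper, not part of the mod: decide whether a sitemap entry
// should be queued for (re)indexing.
// $lastmod   - <lastmod> value from sitemap.xml, e.g. "2007-10-06T00:06:27+01:00"
// $indexdate - date of the last index from the link table, or null if the link is unknown
function needs_reindex($lastmod, $indexdate) {
    if ($indexdate === null) {
        return true;                 // new link, not yet in the link table
    }
    $modified = strtotime($lastmod);
    $indexed  = strtotime($indexdate);
    if ($modified === false || $indexed === false) {
        return true;                 // assumption of this sketch: reindex if a date cannot be parsed
    }
    return $modified > $indexed;     // modified after the last index run
}
```

A link indexed on 2007-11-01 with a lastmod of 2007-10-01 would be skipped, while an unknown link is always queued.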

The dummy sitemap.xml link is no longer required. Version II no longer causes a "Not text or html" message when indexing.


Mod disadvantage: requires PHP5


Notes:
This mod is not compatible with older versions of the mod. If you use a prior version, please remove it before installing this one.
During a reindex this mod does not overwrite the existing link table. The sitemap information is used only for the current reindex; Sphider's link table remains unchanged for further tasks.
If used together with this mod, my mod "Create sitemap.xml" will store the full set of links belonging to the site, not only the newly reindexed ones.




In .../admin/configset.php search for:

include "auth.php";

After this line, insert:

if ($_follow_sitemap=="") {
$_follow_sitemap=0;
}


In the same file search for:

fwrite($fhandle,"$"."word_upper_bound = ".$_word_upper_bound. ";");

After this line, insert:

fwrite($fhandle, "\n\n// If available follow 'sitemap.xml'\n");
fwrite($fhandle,"$"."follow_sitemap = ".$_follow_sitemap. ";");


In the same file search for:

<td> Keyword weight depending on the number of times it appears in a
page is capped at this value</td>
</tr>

After these lines, insert:

<tr>
<td class="left1"><input
name="_follow_sitemap" type="checkbox" value="1" id="follow_sitemap" <?php if
($follow_sitemap==1) echo "checked";?>></td>
<td> If available follow 'sitemap.xml'</td>
</tr>


In .../admin/spider.php search for:

global $supdomain;

Delete that row and replace it with the following:

global $supdomain, $smp, $follow_sitemap;


In the same file search for:
$thislevel = $level - 1;

After this row, additionally insert:

if ($data['nofollow'] != 1) {
    if ($smp != 1 && $follow_sitemap == 1) { // enter here if we don't already know a valid sitemap and if the admin settings allow it
        $tmp_urls = get_temp_urls($sessid); // reload previous temp
        $url2 = remove_sessid(convert_url($url));
        // get the folder where the sitemap should be and, if present, cut the existing filename, suffix and subfolder
        $local = "http://localhost/publizieren/"; // your base address for your local server
        $sitemap_name = "sitemap.xml"; // could be individualized
        $host = parse_url($url2);
        $hostname = $host['host'];
        $host1 = ''; // only filled when running on localhost

        if ($hostname == 'localhost') $host1 = str_replace($local,'',$url2);
        $pos = stripos($host1, "/"); // on local server delete everything behind the /

        if ($pos) $host1 = substr($host1,0,$pos); // build full address again, now only up to the host
        if ($hostname == 'localhost') {
            $url2 = ("$local$host1");
        } else {
            $url2 = ("$host[scheme]://$hostname");
        }
        $input_file = "$url2/$sitemap_name"; // create path to sitemap

        if ($handle = fopen($input_file, "r")) { // happy times, we found a new sitemap
            fclose($handle); // the handle is only needed to test that the file exists
            $links = get_sitemap($input_file,$mysql_table_prefix); // now extract links from sitemap.xml
            if ($links !='') {
                reset ($links);
                while ($thislink = each($links)) {
                    mysql_query ("insert into ".$mysql_table_prefix."temp (link, level, id) values ('$thislink[1]', '$level', '$sessid')");
                    echo mysql_error();
                }
                $smp = '1'; // there was a valid sitemap and we stored the new links
            }
        }
    }
} else {
    printStandardReport('noFollow',$command_line);
}


In the same file search for:

if ($data['nofollow'] != 1) {
$links = get_links($file, $url, $can_leave_domain, $data['base']);
$links = distinct_array($links);
$all_links = count($links);
$numoflinks = 0;
//if there are any, add to the temp table, but only if there isnt such url already
if (is_array($links)) {
reset ($links);

while ($thislink = each($links)) {
if ($tmp_urls[$thislink[1]] != 1) {
$tmp_urls[$thislink[1]] = 1;
$numoflinks++;
mysql_query ("insert into ".$mysql_table_prefix."temp (link, level, id) values ('$thislink[1]', '$level', '$sessid')");
echo mysql_error();
}
}
}
} else {
printStandardReport('noFollow',$command_line);
}


Delete all that and replace it with the following:

if ($smp != 1) {
if ($data['nofollow'] != 1) {
$links = get_links($file, $url, $can_leave_domain, $data['base']);
$links = distinct_array($links);
$all_links = count($links);
$numoflinks = 0;
//if there are any, add to the temp table, but only if there isnt such url already
if (is_array($links)) {
reset ($links);

while ($thislink = each($links)) {
if ($tmp_urls[$thislink[1]] != 1) {
$tmp_urls[$thislink[1]] = 1;
$numoflinks++;
mysql_query ("insert into ".$mysql_table_prefix."temp (link, level, id) values ('$thislink[1]', '$level', '$sessid')");
echo mysql_error();
}
}
}
} else {
printStandardReport('noFollow',$command_line);
}
}


In the same file search for:

global $mysql_table_prefix, $command_line, $mainurl, $tmp_urls, $domain_arr, $all_keywords;

Delete that row and replace it with the following:

global $mysql_table_prefix, $command_line, $mainurl, $tmp_urls, $domain_arr, $all_keywords, $smp, $follow_sitemap;
$smp = '0';


At the end of .../admin/spiderfuncs.php additionally include this function:

function get_sitemap ($input_file,$mysql_table_prefix) {
    $s_map = simplexml_load_file ($input_file);
    if ($s_map != '') { // if sitemap.xml conforms to XML version 1.0
        $links = array ();
        $indexdate = '';
        $lastmod = '';
        foreach($s_map as $url) {
            $lastmod = strtotime(substr(($url->lastmod),0,10)); // get lastmod date only for this page from sitemap
            $del=mysql_query("delete from ".$mysql_table_prefix."temp"); // function get_sitemap will build a new temp table
            echo mysql_error();

            $res=mysql_query("select indexdate from ".$mysql_table_prefix."links where url like '%$url->loc%'");
            echo mysql_error();
            $num_rows = mysql_num_rows($res); // do we already know this link?
            if ($num_rows==0) $indexdate = strtotime("2000-01-01"); // if we don't know this link set the indexdate to very old
            if ($num_rows > 0) $indexdate = strtotime(mysql_result($res,0,"indexdate")); // first matching row
            $new = $lastmod - $indexdate;
            if ($new > '0') $links[] =($url->loc); // add new link only if date from sitemap.xml is newer than date of last index
        }
        echo "<br><font color=\"green\"><b>>>> Valid sitemap.xml found here <<< <br></b></font>";
        $links = explode(",",(implode(",",$links))); // destroy SimpleXMLElement Object and get link array
    }
    return $links;
}


Happy coding

Tec
Re: Follow sitemap.xml (version II)
November 05, 2007 12:43PM
$del=mysql_query("delete from ".$mysql_table_prefix."temp"); // function get_sitemap will build a new temp table

That isn't what you want to do if there's more than one Sphider running.

$del=mysql_query("delete from ".$mysql_table_prefix."temp WHERE id='".$sessid."'"); // function get_sitemap will build a new temp table

Overall Pretty Nice
I don't understand what the localhost and base site part is for, can you go into more detail?
Re: Follow sitemap.xml (version II)
November 05, 2007 05:26PM
Can this be run, using the sitemap, from the command line?
Tec
Re: Follow sitemap.xml (version II)
November 05, 2007 09:55PM
Thanks for the $sessid note. I didn't check it, but to prevent further questions: the additional variable will have to be passed to and received by the function get_sitemap (). Something like
($input_file,$mysql_table_prefix,$sessid) will be required.


<<< I don't understand what the localhost and base site part is for, can you go into more detail? >>>
sitemap.xml is always expected in the root folder of a domain. If, for example, the url to be indexed is something like: [http://www.abc.de/subfolder/index.html] there is no sitemap in that folder. So in order to find the sitemap I have to extract the "base folder".
I prepared this mod on my local system, so I needed a localhost url. But under localhost the function parse_url() returns only 'localhost' as host, while my site lives in a subfolder below it. So some additional effort was required.
I left it as part of the code for those who intend to use it in a localhost environment.
If you are running only "in the wild", forget it.
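
For the "in the wild" case, extracting the base folder could look roughly like the sketch below. The function name sitemap_url() is hypothetical and not part of the mod, and the special handling for a site living in a subfolder of localhost is deliberately left out.

```php
<?php
// Hypothetical helper: build the URL where the sitemap is expected,
// i.e. at the root of the host that serves $url.
function sitemap_url($url, $sitemap_name = "sitemap.xml") {
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;                 // e.g. "localhost/page.html" without http://
    }
    $root = $parts['scheme'] . "://" . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ":" . $parts['port'];
    }
    return $root . "/" . $sitemap_name;
}
```

Called with http://www.abc.de/subfolder/index.html this returns http://www.abc.de/sitemap.xml; a url without scheme yields null instead of a broken path.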


<<< Can this be run, using the sitemap, from the command line? >>>

Yes. On my Windows system I had no problem.

Tec
Re: Follow sitemap.xml (version II)
November 07, 2007 02:59PM
It looks like it still requests every page it finds in the sitemap, is it supposed to?
Tec
Re: Follow sitemap.xml (version II)
November 07, 2007 05:26PM
A page is indexed / reindexed only if the date of Sphider's last index is older than its 'last modified' date; this is checked for every link separately. Additionally, all (new) pages not yet stored in Sphider's link table will be indexed and stored in the link table.

As debug assistance:
When you do a reindex do you see the row

>>> Valid sitemap.xml found here <<<

at the beginning of result output? This row is formatted in green.

Something like:

1. Retrieving: http://www.abc.de/index.php at 18:03:35.
>>> Valid sitemap.xml found here <<<
Size of page: 12.21kb. Starting indexing at 18:03:35. MD5 sum checked. Page content not changed
Links found: 0. New links: 0
2. Retrieving: http://www.abc.de/html//html/galerie.html at 18:03:35.
. . .

If you don't see that green row, the mod didn't find a valid sitemap.xml. If this is already the problem, we will have to debug why this happens in your application.


If you see the row >>> Valid sitemap.xml found here <<< you should check which links were detected in the sitemap.xml file. For this check do the following:

In .../admin/spider.php search for:
$smp = '1'; // there was a valid sitemap and we stored the new links
}
}

After these rows, temporarily insert the following row:

echo "<br>Links found in sitemap.xml:<br><pre>";print_r($links);echo "</pre>";

Together with your next index / reindex you will get a list of links displayed when a valid sitemap was found. Compare that list with the rest of the index list. Should be the same.

That's it for now. So that I can help further, please let me know the results of your test.

Tec
Re: Follow sitemap.xml (version II)
November 07, 2007 07:59PM
I do get the line regarding the sitemap being found, and it is clearly pulling the links from there.

It then goes through page by page requesting it from the server and says "Indexed" in green.

I'll do a bit of debugging when I get home.
Re: Follow sitemap.xml (version II)
November 08, 2007 05:46AM
I had to tweak a couple of things. 1) It was reindexing everything because you were cutting the lastmod date from the xml file down to its first 10 characters; I'm not really sure why. :) 2) URLs in xml files need to have their ampersands encoded, so it wasn't finding those urls in the db, because in there they are just an &.

It seems the actual index time is not stored, only the date, so if you run this more than once a day it will still reindex anything that changed that day, even if you have reindexed since it changed last, but this is better than the alternative. Thanks, I'll let you know if I find anything else.
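
The same-day effect described above can be reproduced without a database. The dates below are made-up examples, and the time zone is fixed only to keep the result reproducible:

```php
<?php
date_default_timezone_set('UTC');            // fixed zone so the example is reproducible

$indexdate = "2007-11-08";                   // date only, as stored for the link (example value)
$lastmod   = "2007-11-08T05:00:00+00:00";    // full timestamp from sitemap.xml (example value)

// Full timestamp against a date-only value: any change after midnight looks newer,
// even if the page was already reindexed later that same day.
$full_is_newer = strtotime($lastmod) > strtotime($indexdate);

// Date against date (first 10 characters only): a same-day change looks unchanged.
$date_is_newer = strtotime(substr($lastmod, 0, 10)) > strtotime($indexdate);
```

Here $full_is_newer is true and $date_is_newer is false, so neither comparison can be exact as long as only the date is stored.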

The following should go in admin/spiderfuncs.php:


function get_sitemap ($input_file,$mysql_table_prefix)
{
    $s_map = simplexml_load_file ($input_file);
    if ($s_map != '') // if sitemap.xml conforms to XML version 1.0
    {
        $links = array ();
        foreach($s_map as $url)
        {
            $the_url = str_replace("&amp;","&",$url->loc);
            $lastmod = strtotime($url->lastmod); // get lastmod date only for this page from sitemap
            $del=mysql_query("delete from ".$mysql_table_prefix."temp"); // function get_sitemap will build a new temp table
            $res=mysql_query("select indexdate from ".$mysql_table_prefix."links where url like '%$the_url%'");
            $num_rows = mysql_num_rows($res); // do we already know this link?
            $indexdate = 0;
            if ($num_rows > 0)
            {
                $indexdate = strtotime(mysql_result($res,0,"indexdate")); // first matching row
            }
            $new = $lastmod - $indexdate;
            if ($new > '0')
            {
                $links[] =($url->loc); // add new link only if date from sitemap.xml is newer than date of last index
            }
        }
        echo "<br><font color=\"green\"><b>>>> Valid sitemap.xml found here <<< <br></b></font>";
        $links = explode(",",(implode(",",$links))); // destroy SimpleXMLElement Object and get link array
    }
    return($links);
}
Re: Follow sitemap.xml (version II)
November 08, 2007 06:10AM
Also, since I forgot to add it: I hadn't run it since I originally requested this update. It took all night and timed out 3 times. Updating a week or so worth of changes, limited to the pages we know changed, took 10 minutes, so updating the few changes once a night will be a huge improvement. :)

(Also, there are more benefits than just this to have a functioning sitemap, if you don't have one already. Even "static" pages can have their file mod time checked and dynamically updated in the sitemap. You really ought to get one if you don't have one.)
Tec
Re: Follow sitemap.xml (version II)
November 08, 2007 03:52PM
Thank you for the url supplements. Concerning the index date: I also noticed that it is not stored as date+time but only as date. In order to remain compatible with standard Sphider database I intended not to touch that.

Tec



Edited 1 time(s). Last edit at 11/08/2007 04:20PM by Tec.
Re: Follow sitemap.xml (version II)
November 21, 2007 04:40PM
Does it follow, "url must have" and "url must not have"?
If it does not, can you provide a little code to do that and say where to put it?
Tec
Re: Follow sitemap.xml (version II)
November 21, 2007 08:01PM
If a sitemap.xml is available this mod delivers alternate links. The rest of index / reindex procedure remains unaffected.

Tec
Re: Follow sitemap.xml (version II)
March 20, 2008 08:49PM
Hello,

I´m trying to implement this MOD on localhost but I can´t get Sphider to read my sitemap.xml, well, it does read it but says
>>> Valid sitemap.xml found here <<<
Size of page: 0.59kb. Starting indexing at 21:43:47. Page contains less than 10 words
Links found: 0. New links: 0
2. Retrieving: at 21:43:47.
Failed to parse address "" NOHOST
Links found: 0. New links: 0
Strange thing I notice is that the real file size is 119kb, not 0.59kb.

I have all the urls in the sitemap.xml pointing to localhost, like
<url>
	<loc>localhost/phpBB2/viewtopic.php?t=1891</loc>
</url>

Do you have any suggestions?

Thanks, greetings.
Tec
Re: Follow sitemap.xml (version II)
March 21, 2008 08:09AM
Hello Willi,
your sitemap.xml should contain addressable links like:
<url>
<loc>http://localhost/. . . . </loc>
<lastmod>2007-10-06T00:06:27+01:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.5</priority>
</url>
The missing http:// prevents indexing and causes the error message "NOHOST".
You also need the <lastmod> information, because Sphider reindexes a link only if its 'lastmod' date is newer than the date stored in Sphider's database during the last index/reindex. Otherwise a reindex is not necessary at this moment. This saves time, because 'old' links whose data in the database is already up to date are not indexed again.
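
Both requirements can be checked before starting an index run. The following is only a sketch: check_sitemap() is a hypothetical helper that is not part of the mod, and it assumes a plain <urlset> without namespace prefixes.

```php
<?php
// Hypothetical checker: list problems that would keep a sitemap entry from being indexed.
function check_sitemap($xml_string) {
    $s_map = simplexml_load_string($xml_string);
    if ($s_map === false) {
        return array("sitemap is not well-formed XML");
    }
    $problems = array();
    foreach ($s_map->url as $url) {
        $loc = trim((string) $url->loc);
        $parts = parse_url($loc);
        if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
            $problems[] = "$loc: no scheme/host (missing http://?)";
        }
        $lastmod = trim((string) $url->lastmod);
        if ($lastmod === '' || strtotime($lastmod) === false) {
            $problems[] = "$loc: missing or unparsable <lastmod>";
        }
    }
    return $problems;
}
```

An entry such as <loc>localhost/phpBB2/viewtopic.php?t=1891</loc> without <lastmod> is reported twice, once for the missing http:// and once for the missing date; an empty result means every entry passed both checks.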

Tec



Edited 1 time(s). Last edit at 03/21/2008 08:12AM by Tec.
Re: Follow sitemap.xml (version II)
March 22, 2008 01:30PM
You´re right, I got it working now, after adding the http://

I guess it´s correct behaviour that the message for the sitemap.xml is 0 links found and after that simply all links in the sitemap.xml are followed?
Tec
Re: Follow sitemap.xml (version II)
March 23, 2008 02:22PM
Hello Willy,
Yes, if a sitemap.xml file is available, Sphider-plus does not search for other links. Only those links presented in the sitemap file will be used for index/reindex.

Tec
Re: Follow sitemap.xml (version II)
March 23, 2008 06:30PM
Good,

That´s what I´d like to have.

The pity is that this mod works brilliantly on my localhost running with Xampp and PHP5, but on my live host it won´t run; they´re running PHP4.

Isn´t there a way to make your code work with PHP4?
Re: Follow sitemap.xml (version II)
September 02, 2009 04:15AM
This is an excellent mod...

As an alternative way to use this, if you had a very specific set of files and/or folders you wanted indexed by Sphider, but you do NOT want to change your sitemap.xml file, then you could create a sitemap file with those files/folders in it, and then call it something like sitemap_sphider.xml.

Then in the admin/spider.php file you can change this line:
$sitemap_name = "sitemap.xml"; // could be individualized

To this:
$sitemap_name = "sitemap_sphider.xml"; // could be individualized

Now place the sitemap_sphider.xml file along side your sitemap.xml and Sphider will only use the links in the sitemap_sphider.xml file.

And just as a side note for clarification, this line:
$local = "[localhost]";

Should look something like this:
$local = "/home/mydomain/public_html";
Re: Follow sitemap.xml (version II)
February 03, 2010 11:13AM
The spider script says that it found the sitemap, but it doesn't seem to be using the sitemap to index my site. I tried setting the link depth to both 0 and 1 so that it would only index the pages in the sitemap. It seems that it's using the url's on my website rather than the sitemap because it only indexed the pages with links on the first page, and therefore did not index all of the pages that are listed in the sitemap.

To clarify the first paragraph, I want to index ONLY the pages I have in my sitemap. My website is a news site, and the index page is a list of news stories with a blurb of the story and a link that opens in another window/tab with the complete story. I want to index ONLY the pages with the full news story, which is what I have in my sitemap.

I copied the code the best I could considering that some of the code appears with smileys. I was going to ask for the code to be reposted with the code being written inside the "code" box, but even that puts smileys in the final result (WTF webmaster?).

Could someone link to a text file with the code in case I made a mistake in fixing the smileys?

Thanks in advance.
Re: Follow sitemap.xml (version II)
February 03, 2010 01:41PM
Does your sitemap include lastmoddates?

Urls should have this format:
<url>
<loc>your_site.url/page_title.html</loc>
<lastmod>2008-05-15T08:38:50+02:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.5</priority>
</url>
Re: Follow sitemap.xml (version II)
February 03, 2010 08:05PM
Willy Wrote:
-------------------------------------------------------
> Does your sitemap include lastmoddates?
>
> Urls should have this format:
> <url>
> <loc>your_site.url/page_title.html</loc>
> <lastmod>2008-05-15T08:38:50+02:00</lastmod>
> <changefreq>weekly</changefreq>
> <priority>0.5</priority>
> </url>
Yes, except for the priority tag, which I don't think is important in this instance.



Edited 1 time(s). Last edit at 02/03/2010 08:08PM by keith1764.
Re: Follow sitemap.xml (version II)
March 28, 2017 03:57PM
sphider doesn't index my sitemap.xml which contains 6000 links :-(