How to limit search to certain folders?

Posted by GerhardS 
February 07, 2015 01:35AM
Hi,
Can anybody show me how to spider only certain web pages?
My website consists of folders named projects_d and projects_e. projects_d contains only German-language files, while projects_e contains only English-language files. Sphider spiders both folders and shows both the English and the German version of a file containing the query. I would like to see results in only one language. Is this possible?
Re: How to limit search to certain folders?
February 07, 2015 11:57AM
Moin,

Under "index" -> "advanced options", set "URL must include:" to projects_e, or set "URL must not include:" to projects_d, to control whether these directories are indexed.

regards
Re: How to limit search to certain folders?
February 07, 2015 06:34PM
I have entered:
Address: http://www.mysite.de
Options: full, Reindex, URL must include: _de, URL must not include: _en

For some folders this gives the message:
1. Link http://www.mysite.de/: file checking forbidden by required/disallowed string rule
or
4. Retrieving: http://www.mysite.de/service_d/service_de.php at 17:27:59.
Size of page: 2.69kb. Starting indexing at 17:27:59. MD5 sum checked. Page content not changed
What does that mean?
Re: How to limit search to certain folders?
June 18, 2015 06:45PM
@GerhardS

I am no expert on the "URL must include:" box, but it appears to support regex, so you may want to look into a regular expression that checks for the presence of _de in your URL. An example of using regex to find a particular string can be seen here: http://stackoverflow.com/questions/9348326/regex-find-word-in-the-string
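
To illustrate how such include/exclude rules behave, here is a rough sketch in Python. It is not Sphider's actual code: the helper url_allowed and the third URL are made up for illustration, and it assumes each rule is a regex searched anywhere in the URL. Under that assumption, the site root http://www.mysite.de/ does not contain _de, which could be why it is reported as forbidden by the required/disallowed string rule.

```python
import re

def url_allowed(url, must_include=(), must_not_include=()):
    """Hypothetical helper (not Sphider's code): apply include/exclude rules to a URL."""
    # Every required pattern must be found somewhere in the URL.
    if any(re.search(pattern, url) is None for pattern in must_include):
        return False
    # No disallowed pattern may be found anywhere in the URL.
    if any(re.search(pattern, url) for pattern in must_not_include):
        return False
    return True

urls = [
    "http://www.mysite.de/",                          # from the log above
    "http://www.mysite.de/service_d/service_de.php",  # from the log above
    "http://www.mysite.de/projects_e/index_en.php",   # made-up example URL
]

for url in urls:
    allowed = url_allowed(url, must_include=[r"_de"], must_not_include=[r"_en"])
    print(url, "->", "index" if allowed else "skip")
```

If the rules really work this way, a pattern that matches the folder name (for example projects_d) rather than a file-name suffix may be easier to reason about.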

As for the following question,
Quote
GerhardS
4. Retrieving: http://www.mysite.de/service_d/service_de.php at 17:27:59.
Size of page: 2.69kb. Starting indexing at 17:27:59. MD5 sum checked. Page content not changed
What does that mean?
Sphider uses an MD5 hash function to compute a hash value that represents the content of each page it indexes. When you re-index your site, Sphider crawls it as usual, but for each page it computes the hash and compares it against the value stored from the previous indexing run. Because the hash is derived from the page content and will differ if anything has changed, this gives a quick way to tell whether the page is unchanged and can be skipped, or whether Sphider must re-index it. The message "MD5 sum checked. Page content not changed" simply tells you that, according to this check, no changes were found on that page.
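
The idea can be sketched in a few lines of Python. This is only an illustration of hash-based change detection, not Sphider's implementation: the function names and the in-memory dictionary standing in for the index database are assumptions.

```python
import hashlib

def md5_of(content: str) -> str:
    """Return the hex MD5 digest of the page content."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()

def needs_reindex(url: str, content: str, stored_hashes: dict) -> bool:
    """Hypothetical check: True if the page is new or its content changed."""
    new_hash = md5_of(content)
    if stored_hashes.get(url) == new_hash:
        # Same hash as last time: page content not changed, skip it.
        return False
    # New or changed page: remember the new hash and re-index.
    stored_hashes[url] = new_hash
    return True

stored = {}  # stands in for the hashes saved by the previous indexing run
page = "http://www.mysite.de/service_d/service_de.php"

print(needs_reindex(page, "<html>Service</html>", stored))     # first visit
print(needs_reindex(page, "<html>Service</html>", stored))     # unchanged content
print(needs_reindex(page, "<html>Service v2</html>", stored))  # changed content
```

Only the second call reports that no re-indexing is needed, which corresponds to the "Page content not changed" message.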

Hopefully this will help.