indexing very large sites possible? (and many other questions)

Posted by soaringeagle 
indexing very large sites possible? (and many other questions)
January 25, 2015 07:42PM
I have a site, www.dreadlockssite.com,
that's at least 1.5 million pages (I just moved to a new site structure and am just starting my first sitemap crawl).
Every time I start the indexer it seems very slow, it causes my sitemap crawler to time out, and the site itself slows way down (I think).
No pages in the Sphider admin will load until I kill the spider process on the server,
which by then has only crawled 800 pages.

I realize the first index might take days.
Should I set a slow crawl rate, say 1 page per second,
and let the first index take a week?

After that, will it only index new content?

I'm also confused about the categories, as I can't find any info on them.
Can I use them to search only forums, only photos, only blogs, or only music?
Do I just set a different search root and add that search to the category?

Is there any way it could use the existing sitemap to "seed" the index?
Re: indexing very large sites possible? (and many other questions)
February 02, 2015 07:46PM
Hi soaringeagle, I can help you if you like. I went to your site; yes, it is very slow and uses a lot of memory to index all the pages you have. I think I can index it for you in a day. Please contact me if you like: wifiboot@msn.com. Cheers,
Tec
Re: indexing very large sites possible? (and many other questions)
February 04, 2015 08:31AM
<<< i think i can index it for you in a day >>>
Well, let us do some simple calculation.
1 day = 24 hours = 1,440 minutes = 86,400 seconds.
Assuming that one page is completely indexed in one second, indexing 1,500,000 pages will take about seventeen days (1,500,000 / 86,400 ≈ 17.4).
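
Just to make the arithmetic explicit, here is a quick back-of-the-envelope sketch (Python, purely for illustration; it assumes a constant rate of one page per second):

```python
# Back-of-the-envelope estimate of a full crawl at an assumed constant
# indexing rate of one page per second.
pages = 1_500_000
seconds_per_day = 24 * 60 * 60        # 86,400

days = pages / seconds_per_day        # 1 page per second
print(f"{days:.1f} days")             # -> 17.4 days
```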

Now let us take a big step forward and assume you have already indexed 500,000 pages. Each page may contain 500 words, which is little. Your database will then contain 250,000,000 keyword/link relationships. For each word occurrence, one link relationship is required, because otherwise the search algorithm could not find all pages containing the user's search query. It does not matter whether a keyword was found for the first time on a page or is a duplicate: if you want to get all results, your database needs to know every page containing the keyword.
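
To picture what that means for the database, here is a simplified sketch of an inverted index (my own simplification, not Sphider's actual schema):

```python
# Simplified inverted-index schema (NOT Sphider's real tables): one row per
# keyword, one row per page, and one row per keyword/page occurrence.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE keywords     (keyword_id INTEGER PRIMARY KEY, keyword TEXT UNIQUE);
    CREATE TABLE links        (link_id    INTEGER PRIMARY KEY, url     TEXT UNIQUE);
    CREATE TABLE link_keyword (link_id INTEGER, keyword_id INTEGER);
""")

# 500,000 pages x 500 words per page = 250,000,000 rows in link_keyword.
print(500_000 * 500)  # -> 250000000
```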

Time now to start indexing page 500,001.
For each word of the full text on this page, the index procedure needs to search the complete database to find out whether it is a new word or an already known keyword, and afterwards place a new keyword/link relationship into the db. Let us forget the time it takes to store the new URL, title, description and new keywords (for each new page) in the db. Do you really believe page no. 500,001 will be completely indexed in one second? And the indexer becomes slower and slower, for each additional page.
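
The inner loop looks roughly like this (a self-contained Python sketch of the idea, not Sphider's actual PHP code; a real on-disk keyword table is far more expensive to search than this in-memory dictionary):

```python
# Illustrative only: every word on a new page needs a lookup against the
# ever-growing keyword store before a keyword/link row can be written.
# With 250,000,000 existing rows, those lookups and inserts dominate.
keywords = {}        # keyword -> keyword_id
link_keyword = []    # (link_id, keyword_id), one row per word occurrence

def index_page(link_id, words):
    for word in words:
        keyword_id = keywords.get(word)       # new word or known keyword?
        if keyword_id is None:
            keyword_id = len(keywords) + 1
            keywords[word] = keyword_id
        link_keyword.append((link_id, keyword_id))

index_page(500_001, ["dreadlocks", "forum", "photos"])
print(len(keywords), len(link_keyword))       # -> 3 3
```
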
<<< i think i can index it for you in a day >>>
Sorry, but you will have to wait for weeks.

I know he did not ask for this, but it would be better to suggest that 'soaringeagle' split the site into different topics/categories and index each topic into its own set of tables in the database. This would also speed up the search algorithm, because only one specific topic would need to be searched in the database. It cannot be done with the original Sphider, because during the index procedure the category function does not divide the content; it stores everything in one db.
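
As a rough illustration of that idea (my own sketch, not a feature of the stock Sphider), each category would get its own table set, and a search would only ever touch the tables of the chosen category:

```python
# Sketch of a per-topic index: each category gets its own table prefix, so a
# "forums" search never scans rows belonging to photos, blogs or music.
CATEGORIES = ("forums", "photos", "blogs", "music")

def tables_for(category):
    # e.g. forums_keywords, forums_links, forums_link_keyword
    return {t: f"{category}_{t}" for t in ("keywords", "links", "link_keyword")}

def build_query(category, term):
    tables = tables_for(category)
    return (f"SELECT url FROM {tables['links']} "
            f"JOIN {tables['link_keyword']} USING (link_id) "
            f"JOIN {tables['keywords']} USING (keyword_id) "
            f"WHERE keyword = '{term}'")

print(build_query("forums", "dreadlocks"))
```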

Tec



Re: indexing very large sites possible? (and many other questions)
February 05, 2015 06:20PM
The spider I use can index 50 pages in 18 seconds.
No, it is not using Sphider. I really like Sphider, but for now I am just watching it; it is a very good crawler for smaller indexes.

I am running it as a CGI from outside the websites, so it does not take up all their resources.
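
Just extrapolating that quoted rate (and assuming it held constant over the whole site, which is exactly what Tec disputes above):

```python
# 50 pages / 18 s, extrapolated over 1,500,000 pages.
rate = 50 / 18                       # ~2.8 pages per second
seconds = 1_500_000 / rate
print(seconds / 86_400)              # -> 6.25 days
```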
