Sphider Basic Software works with way over 6.000.000 pages

Posted by Andriko 
Sphider Basic Software works with way over 6.000.000 pages
April 30, 2007 05:38PM
Dear All,

Sphider (when the bugs are fixed) works with way over 6 million indexed pages!
If you have more than 100,000 pages you must increase the MySQL buffer settings (key_buffer, sort_buffer), and even with more than 9,000,000 keywords (in the keywords table) your searches can be as fast as 0.3 seconds.
I was really surprised!
So you may ask for the hardware configuration: AMD 64 X2, 4GB DDR2 RAM, 150 GB RAID mirror!
I may post my updates soon! They should also improve indexing (not speed but quality).
Just wanted to drop some words, and the updates will follow soon!

Andreas

http://www.made-easy.cc
Re: Sphider Basic Software works with way over 6.000.000 pages
May 06, 2007 06:10AM
Andriko Wrote:
-------------------------------------------------------
> Dear All,
>
> Sphider (when the bugs are fixed) works with way
> over 6 million indexed pages!

Hello,

Does the latest download version have these bugs fixed?


> If you have more than 100,000 pages you must
> increase the MySQL buffer settings (key_buffer,
> sort_buffer), and even with more than 9,000,000
> keywords (in the keywords table) your searches
> can be as fast as 0.3 seconds.

That's fantastic! Can you please be a bit more detailed in how to accomplish this?

> I was really surprised!
> So you may ask for the hardware configuration:
> AMD 64 X2, 4GB DDR2 RAM, 150 GB RAID mirror!
> I may post my updates soon!

Waiting! :)

> They should also improve indexing (not speed but quality)
> Just wanted to drop some words and the updates
> will follow soon!
>
> Andreas
Re: Sphider Basic Software works with way over 6.000.000 pages
May 07, 2007 07:20AM
Old Expat,

what specifically are you interested in?
Hardware Specs?
Software?

OK... as for the hardware!
As described: AMD dual-core 64-bit processor! SUSE Linux 10 basic installation, no graphical interface and such bulls..t. Plain old command prompt!
MySQL, PHP 5, Apache 2... I won't go into sub-version depth... the basics of these versions should do!
RAM 4GB... the base system running in text mode should not consume more than 40 MB... you don't need update engines, crons and so on... just a naked system!
Install, if you wish, the basic software... as is... it already does a great job!
Configure your MySQL tables!
Now we need to tweak the PHP as well as the MySQL setup.
In php.ini we need to increase the memory limit to, let's say, 512MB... that does fine for the scripting engine, even if you decide to index PDFs, DOCs and so on (which can be a memory-hungry process) -> you don't need a graphical interface to index PDFs... 512 is fine... there is more to it, but we are talking about the basic server setup here, not about the spiders, which run on separate, smaller PII machines (old ones I had left).
For PHP there is not much more to do!
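For reference, the php.ini change described above would look something like this (the exact path to php.ini varies by distribution; 512M is simply the value suggested in this post, not a universal recommendation):

```ini
; php.ini (location varies by distribution, e.g. /etc/php.ini)
; Raise the per-script memory limit so that indexing large
; documents (PDF, DOC conversion) does not abort mid-run.
memory_limit = 512M
```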

Now let's have a look at the MySQL setup! (/etc/my.cnf)

First of all you might want a HUGE key buffer (if you have the hardware and physical memory). For ease I will just say that mine is set to 2G,
so the entry should read

key_buffer = 2G
table_cache = 100M
sort_buffer_size = 100M
read_buffer_size = 100M
read_rnd_buffer_size = 50M
myisam_sort_buffer_size = 100M

These are the basic settings. Pay attention: if you have more than 4GB of physical RAM, you may increase the key buffer to 50% of your physical RAM!
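To make that 50%-of-RAM rule of thumb concrete, here is a tiny sketch (my own illustration, not from the original post) that turns a machine's physical RAM into a my.cnf-style key_buffer line:

```python
def suggest_key_buffer(physical_ram_gb: int, fraction: float = 0.5) -> str:
    """Suggest a MyISAM key_buffer setting as a my.cnf-style line.

    Follows the rule of thumb above: up to `fraction` (default 50%)
    of physical RAM on a dedicated database server.
    """
    gigabytes = int(physical_ram_gb * fraction)
    return f"key_buffer = {gigabytes}G"

# Example: the 4 GB machine described in this thread.
print(suggest_key_buffer(4))   # key_buffer = 2G
print(suggest_key_buffer(8))   # key_buffer = 4G
```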

Now, 2 GB is actually not enough to cache the whole key tables, but it is a good starting point!

The first (initial) search for a keyword will load the key tables into the cache, which increases the speed of all subsequent pages you have to display... the first search can be a bit slow with 2GB... the more you have, the better; the DB needs it!

So basically this is it. You may also tweak a bit in the [isamchk] and [myisamchk] sections, setting the key buffers there to 1 or 2 GB... but this depends on your compilation; it is not a requirement!
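As a sketch, those optional sections could look like this in /etc/my.cnf (the values just mirror the 1-2 GB suggestion above; as stated, this is not a requirement):

```ini
# /etc/my.cnf (continued) -- optional: buffers for the MyISAM
# table maintenance tools (repair, index rebuild).
[isamchk]
key_buffer = 1G
sort_buffer_size = 1G

[myisamchk]
key_buffer = 1G
sort_buffer_size = 1G
```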

Ready to go for the DB and Webserver!

Now we need a spider....

PII, 20GB HDD, 256MB RAM, old machine... I have RedHat Core 6 running on those, again in text mode.
DB connect over an IP:Port socket to the DB server you use! (the one described before for my installation)
These machines do not index PDFs... this is too memory-intensive!
Just start the spider with a domain (the best is a directory) and let it run... the -l flag to leave the domain... and you are set!
Now if you have a second and a third domain, you can have an "a la Google" search engine (much smaller, though) pretty fast...

You are missing out on AdWords! In the version I use, I have implemented one, though!

So ... my time is up! More Questions? Need help?

A.

http://www.eu-yellow.com
Re: Sphider Basic Software works with way over 6.000.000 pages
May 08, 2007 07:51AM
Hi

Sure, Sphider can perform faster when you increase the buffer size. But that is not a bug in Sphider. Or did you mean something else?
Re: Sphider Basic Software works with way over 6.000.000 pages
May 08, 2007 11:14AM
Hi Ando,

no, I was not talking about a bug! Though some things can and could be optimized.
I was talking about HOW to make it possible to handle several million keywords, links, sites and so on.
Keeping the DB settings at their defaults will result in searches taking minutes.
My database (we started from 6 million keywords in this discussion) has by now grown to over 7.5 million, so without tweaking the DB (which keeps the search time fast) this would by no means be possible!
So what is the next step? Depending on the keyword, I get search results of over 10,000 pages, just to give an example... So what will be the next step to get the initial search time for the first page display down from sometimes 3-5 seconds to 0.x seconds?
Well, I am working on a cluster to achieve this!
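One way such a cluster could work is to shard the keyword table across several DB nodes by a hash of the keyword, so each search hits exactly one node. This is only my illustrative sketch of the idea; the node names and routing are hypothetical and not the poster's actual design:

```python
import hashlib

# Hypothetical database hosts, one keyword shard each.
NODES = ["db-node-0", "db-node-1", "db-node-2"]

def node_for_keyword(keyword: str) -> str:
    """Pick the shard that owns a keyword via a stable hash.

    Lower-casing first means every spelling of a keyword maps
    to the same node, so a search queries exactly one machine.
    """
    digest = hashlib.md5(keyword.lower().encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# The same keyword always routes to the same node:
print(node_for_keyword("sphider"))
print(node_for_keyword("Sphider"))
```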

Everything is based on Sphider... modified, tweaked expanded...

What I'd really like... is to make searches as fast as Google... or one of the bigger search engines... the script at its base is well done... I am also working on a real algorithm for the link rating (how many sites link to one specific site) bla bla bla... You may know how the major search engines do this...
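The "how many sites link to one specific site" rating mentioned above can be sketched in a few lines. This is just a plain in-link count for illustration, not the actual algorithm the poster implemented:

```python
from collections import Counter

def inlink_scores(links: list[tuple[str, str]]) -> Counter:
    """Count incoming links per target site.

    `links` is a list of (source_site, target_site) pairs;
    self-links are ignored so a site cannot vote for itself.
    """
    scores = Counter()
    for source, target in links:
        if source != target:
            scores[target] += 1
    return scores

links = [("a.com", "c.com"), ("b.com", "c.com"),
         ("c.com", "a.com"), ("c.com", "c.com")]
print(inlink_scores(links).most_common(1))  # [('c.com', 2)]
```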

Well, what I do definitely goes beyond the scope of Sphider... and beyond the scope of this project! Some parts may be reused, added, refined; I already said I will post you some updates so you may compile a new version... though I do not (this is beyond my scope) write code for reusability; I am writing specific code for a specific project!

A.
Re: Sphider Basic Software works with way over 6.000.000 pages
May 28, 2007 11:02AM
Andriko Wrote:
> Sphider (when the bugs are fixed) works with way
> over 6 million indexed pages!

Any specific indexing/search times? :)

Also, I suppose that your pages are generated from DB entries as well?
Ando is honest when he qualifies Sphider as a lightweight. Anyway, I think he will improve what we have.
Re: Sphider Basic Software works with way over 6.000.000 pages
May 29, 2007 09:47PM
Guys!

The thing is not about the software used to spider, and not, or at least only secondly, about the data retrieval process! It's all about algorithms! Ando, when he says 100,000 pages, what does he mean? On an average server you can run it and have super fast results! And... hopefully you are happy!

What I said is simple and clear! Use the basic software for spidering! No changes! Let 3-4 low-end PCs do that! Collect data! No big deal, no database needed! The format is OK, and even .doc or whatever else gets indexed. So now you have this data in your DB! WHAT NOW?

Someone needs to search through it! And here we have the performance bottlenecks! That's what search engine technology is all about! The data retrieval process!

Here the hardware comes in! A single machine quickly runs out of resources! A cluster is the solution I think will work!

What will the actual front end be? Don't know yet! Maybe Java... but this is also not the problem!

Guys, there are so many issues I could write about for the rest of the night! Maybe some of you are interested in such a discussion... HOW-TOs... maybe...

And right... Ando will improve... the software, I mean! Nevertheless, trying out what, and more importantly HOW, things can be done also helps Ando as the core developer!

Greetz,

Andreas

http://www.made-easy.cc
Re: Sphider Basic Software works with way over 6.000.000 pages
May 30, 2007 12:41AM
Has anyone tried or explored using Sphider with SQLite? I haven't used it, and have only briefly been researching it, but everything points to quite a speed increase over MySQL (in certain applications). Maybe Sphider is one of those? SQLite seems to be being adopted rapidly and will probably be part of standard PHP hosting packages fairly soon if everything keeps going the way it is now.
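For anyone curious what that might look like, here is a minimal, self-contained sketch using Python's built-in sqlite3 module with a toy keyword table. The schema is invented for illustration only and is not Sphider's real schema:

```python
import sqlite3

# In-memory SQLite database with a toy keyword -> page mapping.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keywords (keyword TEXT, url TEXT)")
conn.execute("CREATE INDEX idx_keyword ON keywords (keyword)")
conn.executemany(
    "INSERT INTO keywords VALUES (?, ?)",
    [("sphider", "http://example.com/a"),
     ("sphider", "http://example.com/b"),
     ("mysql", "http://example.com/c")],
)

# The index makes this lookup fast even with many rows.
rows = conn.execute(
    "SELECT url FROM keywords WHERE keyword = ?", ("sphider",)
).fetchall()
print([url for (url,) in rows])
```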

Maker of BungeeBones.com
An Income generating Link Exchange for webmasters
Has anyone used a PostgreSQL database as a Sphider backend?