Posted by mindplay 
excellent search engine, but what about stemming?
May 16, 2007 10:48AM
this search engine is easily the best pure PHP search engine I have *ever* seen. Like wow! :)

but what about stemming?

currently, if you search for "download", you get one set of results, and a completely different set of results for "downloads".

this is very impractical for most users - which is why most search engines use stemming on the keyword index; stemming takes the "stem" of a word, so that "downloads", for example, becomes "download".

without stemming, when doing combined searches, the problem becomes worse - for example, the following four searches would yield four entirely different sets of results:

- "download files"
- "file downloads"
- "download file"
- "files downloads"

if you stem words before adding them to the index, and you stem words before performing a keyword lookup, your articles are indexed on the words "file" and "download", regardless of which form the word appears in, e.g. "download", "downloads", "downloading", "downloaders", etc., all become simply "download".
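
to make the idea concrete, here is a minimal sketch in Python (not Sphider's code, and not PHP, just for brevity). the `stem` function is a toy suffix stripper I made up for illustration; a real engine would use a full algorithm such as Porter's. `build_index` and `search` are likewise hypothetical names. the point is only that the same stemmer runs at index time and at lookup time:

```python
from collections import defaultdict

# Toy suffix stripper for illustration only (an assumption, not a real
# stemming algorithm). Longer suffixes are tried first.
SUFFIXES = ("ers", "ing", "er", "s")


def stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stemmed word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[stem(word)].add(doc_id)
    return index


def search(index, query):
    """AND search: return ids of documents matching every stemmed query word."""
    result = None
    for word in query.split():
        ids = set(index.get(stem(word), set()))
        result = ids if result is None else result & ids
    return result or set()


docs = {
    1: "download files here",
    2: "file downloads archive",
    3: "contact page",
}
index = build_index(docs)

for query in ("download files", "file downloads", "download file", "files downloads"):
    print(query, "->", sorted(search(index, query)))  # each query prints [1, 2]
```

with this toy index, all four word-order and plural variants return the same two documents, because both the indexed text and the query are reduced to "download" and "file" before comparison.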

this way, search results become broader, and combination searches function the way most users expect them to - as they do on Google, MSN, Yahoo, and so on.

the worst thing about not using stemming is, a lot of users are going to search for "download files", and then give up - while the page they were looking for would actually have come up if they had instead searched for "download file" ... this effectively means that a good share of the time, your users aren't finding what they're searching for.

another advantage of stemming is that it also makes the keyword index considerably smaller, since far fewer distinct words are indexed - which in turn makes searches faster, reduces storage, etc.

I am really, really impressed with Sphider - I didn't actually think it was possible to implement a search engine this fast and powerful with PHP; I always assumed you'd need the performance of a real programming language to do that.

But stemming is a crucial missing feature.

Your thoughts on this please? :)
May 16, 2007 11:40AM
Word stemming is already included.

Log in as Admin and select the Settings section.
Now select the checkbox for 'Use word stemming (e.g. find sites containing "runs" and "running" when searching for "run"). Should be enabled before indexing'. Don't try this with Reindex; you need a fresh index.
Also select the checkbox 'Enable spelling suggestions (Did you mean...)'.

Stemming download / downloads is no problem.

For file download / download file, select OR-search in the search frame.


Edited 1 time(s). Last edit at 05/16/2007 11:43AM by Tec.
May 16, 2007 01:57PM

I guess I should have examined it more closely before posting.

one question though - when this is enabled, does it break quoted searches? like for example, does a search for "the cat drinks water" (in quotes) yield the same result as "the cats drink water"?
May 16, 2007 02:45PM
Try it out and see for yourself.
Anonymous User
May 16, 2007 11:00PM

It was a privilege to read your analysis, as well as the secure comments and answers.

Maybe you also want to read a previous debate about that matter in:


In my experience, I use stemming in general research and abandon it when I want more specific results.

This is a kind of choice that only humans can make, for the moment.

Sphider's neural network is already good. In theory, it can be improved without leaving PHP.

Keep following your mind.

Thanks to Tec for all with identical profile.
