Not every occurrence of words gets indexed **SOLVED**

Posted by Kim 
Kim
Not every occurrence of words gets indexed **SOLVED**
August 16, 2011 08:54AM
Hi.
I have implemented this search engine on my site (with Danish words), but I find that not all occurrences of words get indexed. A regular search directly in the database finds, let's say, fifty occurrences of a specific word, but only three are shown (indexed). Could there be some kind of limit on the number of words? (I've looked briefly into the code, but see no sign of this.) I saw an answer in the forum mentioning a maximum of 200 KB of index data, but that didn't make sense to me, so I ignored it. Is there any solution to this?
Thanks.



Edited 1 time(s). Last edit at 08/17/2011 07:45PM by Kim.
Tec
Re: Not every occurrence of words gets indexed
August 16, 2011 10:55AM
Did you really index the complete site? Did you set the Spidering depth in 'Spidering options' to FULL?
Additionally:
The original Sphider only shows the first occurrence of the keyword per result. If a page contains 10 occurrences of that word, only the first will be presented in the result listing. If you would like all occurrences presented in the result listing, please have a look at [www.sphider-plus.eu], which also indexes Danish letters (færdige, brød, skål).

Tec
Kim
Re: Not every occurrence of words gets indexed
August 16, 2011 11:21AM
Thanks for your quick reply.

1) When I indexed the site, I did not use full site indexing. The site has a max depth of three, and I set the depth to ten, but it gave the same result. Does full site indexing do anything different?

2) The problem is not repeated occurrences of a word on the same page going missing, but the word going missing on several pages.

3) I don't think there is a problem with Danish letters, because it indexes all kinds of words with æøå and so forth.

I'll look into sphider-plus - maybe that will do the trick.
Thanks!
Tec
Re: Not every occurrence of words gets indexed
August 16, 2011 11:56AM
<<< Does full site indexing do anything different? >>>
No it doesn't.

Are you sure the complete site was indexed? By means of a tool like 'phpMyAdmin' you may check whether all links (pages) are indexed.
Are you running Sphider on a 'Shared Hosting' server?

Tec
Kim
Re: Not every occurrence of words gets indexed
August 16, 2011 01:06PM
I am absolutely sure that the site has been indexed and re-indexed. I even wiped the database clean and started from scratch. I think I've tried just about everything.

Regarding shared hosting, I have several sites (virtual hosts) on my server. I doubt that is the problem... or is it?

Thanks again
/Kim.
Tec
Re: Not every occurrence of words gets indexed
August 16, 2011 01:38PM
Just to be sure:
By means of a tool like 'phpMyAdmin' you may check whether all links (pages) are indexed.

Do you see all your pages in the table $table-prefix-links?
Kim
Re: Not every occurrence of words gets indexed
August 16, 2011 03:28PM
I know how many pages I have on the site, and there are more entries in the links table due to a few URL variations. I'm sure they are all there.

I can't check right now, but I have some long articles on the site, and I wonder whether the field type for fulltxt (mediumtext) should be longtext. I'll check the actual size of the pages later.

Thanks again for your help!
/Kim.
Kim
Re: Not every occurrence of words gets indexed
August 16, 2011 07:04PM
Now I have looked a bit more into the links table and found that the index procedure cuts off randomly, and not because of the MySQL field type I mentioned. Random pages are affected, and the text is cut off at random points within those pages. The only thing that differs between the pages is the actual content, which for the most part is plain text. A few div, link and image tags are present too, but that goes for the pages that get indexed correctly as well.

I have inserted some text at the end of each document, by which I can determine (via an SQL query) which pages have or have not been fully indexed.

Does that give you any clue to what could go wrong?

/Kim.
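
Editor's sketch of the marker check Kim describes above. This is only an illustration under assumptions: the marker string, the unprefixed table name `links` and the `url` column are hypothetical (only the links table and the fulltxt field are named in the thread), and an in-memory SQLite database stands in for the real MySQL one.

```python
import sqlite3

MARKER = "ENDOFPAGEMARKER"  # hypothetical marker text appended to every page


def find_truncated(conn, marker=MARKER):
    """Return URLs whose stored full text is missing the end-of-page marker."""
    rows = conn.execute(
        "SELECT url FROM links "
        "WHERE fulltxt IS NULL OR fulltxt NOT LIKE ? "
        "ORDER BY url",
        (f"%{marker}%",),
    )
    return [url for (url,) in rows]


# Stand-in data: one fully indexed page, one page whose text was cut off.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (url TEXT, fulltxt TEXT)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [
        ("/a.html", "complete article text ENDOFPAGEMARKER"),
        ("/b.html", "article text that stops half way"),
    ],
)

truncated = find_truncated(conn)
print(truncated)
```

Any page listed by such a query lost its tail somewhere between fetching and storing, which narrows the search to the content of those pages.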
Tec
Re: Not every occurrence of words gets indexed
August 17, 2011 12:13AM
<<< Does that give you any clue to what could go wrong? >>>
Not really. On a 'Shared Hosting' server your index procedure might get interrupted. But that would stop the indexing completely, and you would find an 'Unfinished' message in the 'Sites' view. It would not cut off the rest of the full text on some pages.
Possibly the index procedure gets aborted by some page content.
So that I can try to assist you, please send me the URL by PM.

Tec
Kim
Re: Not every occurrence of words gets indexed
August 17, 2011 08:29AM
I've looked into some specific pages that get cut off, and in a couple of cases the page had a syntax error in a tag just after the point where it got cut off, e.g. <div id="something> (missing end quote). One page had a "< <" pattern, and that killed the index engine. Another page had a <br /> tag between table tags (<table><tr><br /><td>).

Well, it's good to get those errors corrected, but not at the expense of search quality... Is there a reason for Sphider to parse the HTML?

/Kim.



Edited 1 time(s). Last edit at 08/17/2011 08:49AM by Kim.
Tec
Re: Not every occurrence of words gets indexed
August 17, 2011 09:43AM
<<< Is there a reason for Sphider to parse the HTML? >>>

Yes, there is, because words inside tags are not part of the full text, and only the text of a page should be indexed.
Sphider uses the PHP function strip_tags() to delete the tags from the page content.

Quote from the PHP manual:
Because strip_tags() does not actually validate the HTML, partial or broken tags can result in the removal of more text/data than expected.

Tec
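
Editor's sketch of why one broken tag can swallow the rest of a page. The function below is not PHP's strip_tags() and not Sphider's code; it is a deliberately small Python stand-in (the name `naive_strip_tags` is made up) that, like many tag strippers, ignores a ">" while it believes it is inside a quoted attribute value. With a missing end quote, that belief never ends.

```python
def naive_strip_tags(html: str) -> str:
    """Drop tags using a tiny state machine that honours quoted attributes."""
    out = []
    in_tag = False
    quote = None  # quote character currently open inside a tag, if any
    for ch in html:
        if in_tag:
            if quote:
                if ch == quote:      # closing quote ends the attribute value
                    quote = None
            elif ch in ('"', "'"):   # opening quote inside a tag
                quote = ch
            elif ch == ">":          # '>' only ends a tag outside quotes
                in_tag = False
        elif ch == "<":
            in_tag = True
        else:
            out.append(ch)           # plain text outside any tag is kept
    return "".join(out)


good = 'intro <div id="something">first</div><p>second</p>'
bad = 'intro <div id="something>first</div><p>second</p>'  # missing end quote

print(repr(naive_strip_tags(good)))
print(repr(naive_strip_tags(bad)))
```

In the broken variant everything after the opening quote is treated as part of the attribute value, so only the text before the faulty tag survives. That matches the pattern reported in this thread, where the indexed text stopped just before the syntax error.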
Kim
Re: Not every occurrence of words gets indexed
August 17, 2011 11:02AM
I see - I had missed that little note on strip_tags(). I just ran an isolated experiment with strip_tags() and (of course) reproduced the errors. Maybe this is a lesson to put into install.txt: incorrectly formed HTML is a risk factor...

Thanks for your help!
/Kim.
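
Editor's sketch of a pre-indexing check for two of the error patterns found in this thread (a missing end quote and a stray "<" inside a tag). It is a rough heuristic with made-up names, not part of Sphider; it will also flag legitimate attribute values that contain ">" and it does not detect misplaced tags such as a <br /> between table tags. A real HTML validator is the more thorough option.

```python
import re

# Everything from a '<' up to the next '>' is treated as one candidate tag.
TAG = re.compile(r"<[^>]*>")


def suspicious_tags(html):
    """Return (line_number, tag) pairs for tags that look malformed."""
    found = []
    for match in TAG.finditer(html):
        tag = match.group(0)
        odd_quotes = tag.count('"') % 2 == 1  # e.g. <div id="something>
        stray_open = "<" in tag[1:]           # e.g. < <b>
        if odd_quotes or stray_open:
            line = html.count("\n", 0, match.start()) + 1
            found.append((line, tag))
    return found


page = (
    'ok <p class="a">text</p>\n'
    '<div id="something>broken</div>\n'
    "< <b>x</b>"
)

report = suspicious_tags(page)
for line, tag in report:
    print(f"line {line}: {tag}")
```

Running a check like this over the pages before indexing would point at the lines to fix, instead of discovering them through missing search results.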