Filtering Out Duplicate Information

Posted by mtaylor 
May 24, 2007 03:55PM
Is there a way to filter out duplicate information (like header and footer) from a site you have no control over?

I know you can add Sphider tags to eliminate duplicate info on your own site...but if I want to use Sphider to grab pages from a half million websites, I don't want to fill up my database with megabytes of duplicate header/footer/sidebar info.

I realize Sphider off the shelf isn't built for that, but does anyone know of an algorithm or programming method to accomplish this?

Or even a separate piece of code that will clean up the database after Sphider has done its work...
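
One possible cleanup pass, sketched in Python purely as an illustration (this is not part of Sphider, and the function name, the `pages` dictionary, and the `threshold` parameter are all made up for the example): for each site, count how many of its pages contain a given line of text, then drop the lines that show up on most pages, since those are probably header, footer, or sidebar. It assumes the page text is already extracted and grouped per site, and that the repeated blocks are identical line for line.

```python
from collections import Counter


def strip_repeated_lines(pages, threshold=0.6):
    """Remove lines that recur across most pages of one site.

    pages: hypothetical dict mapping URL -> extracted page text,
           all taken from the same site.
    threshold: a line is treated as boilerplate when it appears on
               more than this fraction of the site's pages.
    Returns a new dict mapping URL -> cleaned text.
    """
    page_count = len(pages)

    # Document frequency: on how many pages does each distinct line occur?
    # A set per page ensures a line repeated within one page counts once.
    doc_freq = Counter()
    for text in pages.values():
        unique_lines = {line.strip() for line in text.splitlines() if line.strip()}
        doc_freq.update(unique_lines)

    cleaned = {}
    for url, text in pages.items():
        kept = []
        for line in text.splitlines():
            stripped = line.strip()
            if not stripped:
                continue  # blank lines are dropped as well
            # With fewer than two pages there is nothing to compare against,
            # so every line is kept.
            if page_count < 2 or doc_freq[stripped] / page_count <= threshold:
                kept.append(stripped)
        cleaned[url] = "\n".join(kept)
    return cleaned


if __name__ == "__main__":
    site = {
        "/cats": "Site Header\nAbout cats\nFooter",
        "/dogs": "Site Header\nAbout dogs\nFooter",
        "/birds": "Site Header\nAbout birds\nFooter",
    }
    for url, text in strip_repeated_lines(site).items():
        print(url, "->", text)
```

Exact line matching is the simplest thing that could work; headers that vary slightly from page to page (a date, a "logged in as" name) would slip through and need some fuzzier comparison. Running it per site rather than over the whole database keeps the counts small and avoids throwing away phrases that are merely common across the web.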

Michael Taylor
May 24, 2007 11:10PM
I may have found the way...more later...
