Howto: Index using a script (Theme MOD and Mass-Indexing)

Posted by cladiron 
November 23, 2013 02:50PM
This is a repost of a deleted thread. I'm unsure if the admin here will delete this again, so it may not stay here long.

I have posted my edited version at http://sourceforge.net/projects/sphidercomunity/ for download.
Any issues with Sphider-CV should be posted in this thread.
In this version you will have to add your links to the scripts manually.
I am working on a way to automate this process. It has to show the links before you press the button to index them, in case you need to remove sites that you do not want in your search.


This is an excellent way to populate your search engine with little effort.
Using scripts, you shouldn't have to worry about your site timing out while the indexer is running.
You will not be able to view each line yet, but that is a work in progress, along with a few other ideas.
You will be able to view the status area in the admin section and watch the links and keywords increase.

I'm not sure how many people know about this or have given it much thought, but I found it to be a life saver.
Because sites can take so long to index, I lost a lot of time whenever the indexer finished while I was asleep.
I would miss hours in which it could have been running.

This little script can keep it indexing for days, depending on how many sites you add to the script.

In this little tutorial I will explain how to create a script that holds all the website links you want to index.

Create a .sh file and call it whatever you like. I will call mine "run_indexer_depth.sh".
Now place the code below inside it, replacing the URLs and the indexing depth.
You can place as many URLs as you want in the file.
(The # sign marks a comment and means that line will be skipped.)
Even though such a line is not processed, it can still show in the console if you are viewing it. If you would rather not have these messages show in the console, you can remove the lines that start with a # sign.

#Usage: php spider.php <options>
#
#Options:
# -all            Reindex everything in the database
# -u <url>        Set url to index
# -f              Set indexing depth to full (unlimited depth)
# -d <num>        Set indexing depth to <num>
# -l              Allow spider to leave the initial domain
# -r              Set spider to reindex a site
# -m <string>     Set the string(s) that an url must include (use \n as a delimiter between multiple strings)
# -n <string>     Set the string(s) that an url must not include (use \n as a delimiter between multiple strings)
# ----------------------------------------------------------------------------------------------
php spider.php -u http://blahhhh.org/forums -d 5;
php spider.php -u http://blahhhhhh.net/forums -d 5;
php spider.php -u http://forums.blahhhhhhee.com -d 5;
php spider.php -u http://blahhhrrhhh.com/forums -d 5;
#php spider.php -u http://blahhhhhqqh.com/forums -d 5;
#php spider.php -u http://blahhhhhddh.com/forums -d 5;
php spider.php -u http://blahvvhhhhh.com/forums -d 5;
php spider.php -u http://blahhnnhhhh.com/forums -d 5;
exit;
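
As the list of sites grows, the links could also be kept in their own text file and fed to spider.php in a loop. What follows is only a rough sketch of that idea and not part of Sphider-CV: the file name urls.txt, the function index_from_list and the SPIDER_CMD variable are all made-up names. The dry run prints the commands first, so the links can be checked before anything is indexed.

```shell
#!/bin/sh
# Sketch only: urls.txt, index_from_list and SPIDER_CMD are assumed names.
SPIDER_CMD="${SPIDER_CMD:-php spider.php}"

index_from_list() {
    list="$1"      # text file with one URL per line
    depth="$2"     # value passed to the -d option
    dry_run="$3"   # "yes" prints the commands instead of running them
    while IFS= read -r url || [ -n "$url" ]; do
        case "$url" in
            ''|'#'*) continue ;;   # skip blank lines and commented-out URLs
        esac
        if [ "$dry_run" = "yes" ]; then
            echo "$SPIDER_CMD -u $url -d $depth"
        else
            # </dev/null keeps the indexer from reading the URL list on stdin
            $SPIDER_CMD -u "$url" -d "$depth" < /dev/null
        fi
    done < "$list"
}

# Preview first, then index for real:
#   index_from_list urls.txt 5 yes
#   index_from_list urls.txt 5 no
```

Lines in urls.txt that start with a # sign are skipped, the same way commented lines are skipped in the script above.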

Place your newly created run_indexer_depth.sh file in the same folder as spider.php

Chmod the .sh file to 755 (I use FTP to change the permissions).
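
If you have shell access, the same permission change can be made with chmod instead of FTP (chown changes a file's owner, not its permissions). A small sketch, where make_executable is just a helper name made up for this example:

```shell
# make_executable is a hypothetical helper, not part of Sphider.
make_executable() {
    # 755 = owner may read/write/execute, group and others may read/execute
    chmod 755 "$1" && ls -l "$1"
}

# Usage, from the folder that holds spider.php:
#   make_executable run_indexer_depth.sh
```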
Now navigate to the .sh file over SSH and execute it.
Use screen so you can close the SSH session whenever you want.


Example with screen:

cd /home/sites/public_html/admin/
screen ./run_indexer_depth.sh


To detach from the screen without stopping the indexer,
press these keys while viewing the indexer's output:

Ctrl+A, then D
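
The session can also be given a name and started already detached, which saves the detach key press. This is a rough sketch using screen's -d -m (start detached) and -S (session name) options; the session name "indexer" and the SCREEN_BIN variable are assumptions for this example only.

```shell
# SCREEN_BIN exists only so the sketch can be tried without screen installed.
SCREEN_BIN="${SCREEN_BIN:-screen}"

start_indexer_session() {
    name="$1"     # session name, e.g. indexer
    script="$2"   # script to run inside the session
    # -dmS: create the session detached and give it a name
    "$SCREEN_BIN" -dmS "$name" "$script"
}

# Usage:
#   cd /home/sites/public_html/admin/
#   start_indexer_session indexer ./run_indexer_depth.sh
#   screen -r indexer     # reattach later to watch the output
```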

Example without screen:
(If you start it this way, you must keep the SSH window that is running the indexer open. If you close it, the indexer WILL stop.)

cd /home/sites/public_html/admin/
./run_indexer_depth.sh

This can also be used to Reindex your sites.
Example below.

#Usage: php spider.php <options>
#
#Options:
# -all            Reindex everything in the database
# -u <url>        Set url to index
# -f              Set indexing depth to full (unlimited depth)
# -d <num>        Set indexing depth to <num>
# -l              Allow spider to leave the initial domain
# -r              Set spider to reindex a site
# -m <string>     Set the string(s) that an url must include (use \n as a delimiter between multiple strings)
# -n <string>     Set the string(s) that an url must not include (use \n as a delimiter between multiple strings)
# ----------------------------------------------------------------------------------------------
php spider.php -u http://blahhhh.org/forums -r;
php spider.php -u http://blahhhhhh.net/forums -r;
php spider.php -u http://forums.blahhhhhhee.com -r;
php spider.php -u http://blahhhrrhhh.com/forums -r;
#php spider.php -u http://blahhhhhqqh.com/forums -r;
#php spider.php -u http://blahhhhhddh.com/forums -r;
php spider.php -u http://blahvvhhhhh.com/forums -r;
php spider.php -u http://blahhnnhhhh.com/forums -r;
exit;
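
Since the reindex script is the index script with "-d <num>" swapped for "-r", it could be generated instead of typed out again. A minimal sketch, assuming every command line ends in "-d <num>" with an optional semicolon; make_reindex_script and the output file name are made up for this example.

```shell
# make_reindex_script is a hypothetical helper, not part of Sphider.
make_reindex_script() {
    src="$1"   # existing index script, e.g. run_indexer_depth.sh
    dst="$2"   # new file to write (must be a different file than src)
    # Replace a trailing " -d <num>" (with or without ";") by " -r;"
    sed 's/ -d [0-9][0-9]*;\{0,1\}$/ -r;/' "$src" > "$dst"
    chmod 755 "$dst"
}

# Usage:
#   make_reindex_script run_indexer_depth.sh run_reindexer.sh
```

Commented-out URLs stay commented out, and the "# -d <num>" line in the usage text is left alone because it contains no number.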

To install screen (you must be root or a sudo user):

CentOS:
yum install screen

Ubuntu:
apt-get install screen
CRON
This file can also be set up to run as a cron job.
Things to consider when setting up the cron job:

How large are the sites?
What depth are you going to index?
How many links have you added to the .sh script?
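
A large site or a deep index can easily run longer than the gap between two cron runs, so it can help to make sure only one copy of the script runs at a time. Below is a minimal sketch of such a wrapper, assuming a lock directory under /tmp. The names run_once, LOCKDIR and INDEXER, the wrapper file name and the crontab schedule in the comment are all assumptions for illustration, so adjust them to your server.

```shell
#!/bin/sh
# Hypothetical cron wrapper, saved for example as run_indexer_cron.sh.
# Example crontab line (assumed schedule: every day at 02:00):
#   0 2 * * * cd /home/sites/public_html/admin/ && ./run_indexer_cron.sh

LOCKDIR="${LOCKDIR:-/tmp/run_indexer.lock}"     # assumed lock location
INDEXER="${INDEXER:-./run_indexer_depth.sh}"    # the script from this tutorial

run_once() {
    # mkdir fails if the directory already exists, so it works as a simple lock
    if ! mkdir "$LOCKDIR" 2>/dev/null; then
        echo "indexer already running, skipping this run"
        return 0
    fi
    "$INDEXER"
    status=$?
    rmdir "$LOCKDIR"    # release the lock for the next cron run
    return "$status"
}

# End the wrapper file with a call to:  run_once
```

If the script is killed before it finishes, the lock directory stays behind and has to be removed by hand.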

Server stats with 8 groups running.
The numbers below are the average load or less, but I did see 2 spikes on the CPU that went over 5%:
one was at 25%, the other was at 10%.
Uptime: 1 days, 20 hours, 10 minutes
Tasks: 211 total,   1 running, 210 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.1%us,  0.3%sy,  0.0%ni, 95.5%id,  0.9%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   3960652k total,  2700808k used,  1259844k free,   185548k buffers
Swap:  5996536k total,        0k used,  5996536k free,  1447388k cached
-sh-3.2$ screen -r 
There are several suitable screens on:
        31448.script    (Detached)
        31452.script    (Detached)
        31469.script    (Detached)
        31427.script    (Detached)
        31436.script    (Detached)
        31461.script    (Detached)
        31441.script    (Detached)
        31422.script    (Detached)
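
With several detached sessions, a plain screen -r only prints the list, as above; you have to name the session you want, for example screen -r 31448. As a small sketch, the helper below (list_detached is a made-up name) pulls the detached session IDs out of screen -ls style output.

```shell
# list_detached is a hypothetical helper, not part of screen.
list_detached() {
    # Reads `screen -ls` style output on stdin and prints the first field
    # (PID.name) of every line marked as Detached.
    awk '/\(Detached\)/ { print $1 }'
}

# Usage:
#   screen -ls | list_detached
#   screen -r 31448.script    # reattach to one session from the list
```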



Here is an archived link to the original thread:
http://web.archive.org/web/20130127024833/http://www.sphider.eu/forum/read.php?3,9239



Edited 1 time(s). Last edit at 11/23/2013 02:52PM by cladiron.