

Handle links with filenames containing spaces

Posted by redscourge 
February 24, 2009 04:46PM
In admin/spiderfuncs.php, the get_links() function contains a long series of preg_match_all() calls. You can change these so that links with spaces in them are handled (later on in the code, the spaces are converted to %20).

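All you have to do is copy the whole section of preg_match_all() calls and paste a second set of them below it. In the pasted copies, search for the # and put a space character at the end of the character class just before it (the one ending in 0-9-). For example, the first line below is an original call and the second is its modified copy: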

preg_match_all("/href\s*=\s*[\'\"]([+:%\/\?~=&;\\\(\),._a-zA-Z0-9-]*)(#[.a-zA-Z0-9-]*)?[\'\" ](\s*rel\s*=\s*[\'\"]?(nofollow)[\'\"]?)?/i", $file, $regs, PREG_SET_ORDER);


preg_match_all("/href\s*=\s*[\'\"]([+:%\/\?~=&;\\\(\),._a-zA-Z0-9- ]*)(#[.a-zA-Z0-9-]*)?[\'\" ](\s*rel\s*=\s*[\'\"]?(nofollow)[\'\"]?)?/i", $file, $regs, PREG_SET_ORDER);

Now, if you have a bunch of files with spaces in the filename, Sphider can index them properly and doesn't truncate the URL at the first space character. Filenames like this are more common on Windows, or when Windows users put files on the server, but it still makes sense to handle this case!
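
To see the difference between the two patterns in isolation, here is a minimal sketch that runs both against one sample link. The helper extract_hrefs() and the sample HTML are made up for illustration and are not part of Sphider; only the two pattern strings are taken from the calls above.

```php
<?php
// Original pattern: the URL character class does not allow spaces.
$original = "/href\s*=\s*[\'\"]([+:%\/\?~=&;\\\(\),._a-zA-Z0-9-]*)(#[.a-zA-Z0-9-]*)?[\'\" ](\s*rel\s*=\s*[\'\"]?(nofollow)[\'\"]?)?/i";

// Modified pattern: a space added at the end of the URL character class.
$withSpace = "/href\s*=\s*[\'\"]([+:%\/\?~=&;\\\(\),._a-zA-Z0-9- ]*)(#[.a-zA-Z0-9-]*)?[\'\" ](\s*rel\s*=\s*[\'\"]?(nofollow)[\'\"]?)?/i";

// Hypothetical helper (not a Sphider function): returns the URL part
// (capture group 1) of every href matched by $pattern in $html.
function extract_hrefs($pattern, $html) {
    preg_match_all($pattern, $html, $regs, PREG_SET_ORDER);
    $urls = array();
    foreach ($regs as $match) {
        $urls[] = $match[1];
    }
    return $urls;
}

// Made-up sample link with a space in the filename.
$html = '<a href="docs/My Report.pdf">report</a>';

print_r(extract_hrefs($original, $html));   // URL is cut off at the space: docs/My
print_r(extract_hrefs($withSpace, $html));  // full URL: docs/My Report.pdf
```

The original pattern still matches because a space is accepted as the character that ends the URL, which is why the link gets truncated instead of being skipped.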

Re: Handle links with filenames containing spaces
February 24, 2009 11:44PM

Your suggestion works quite well for all links. If you also need this for the main URL, you will additionally have to add the following:

In .../admin/spider.php search for:
$compurl = parse_url($url);
Above that line, add:
$url = str_replace(" ", "%20", $url);
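
As a rough standalone illustration of what that extra line does, the sketch below encodes the spaces and then calls parse_url(). The example.com URL is a placeholder, and the surrounding code in spider.php is not reproduced here.

```php
<?php
// Hypothetical start URL containing spaces (example.com is a placeholder).
$url = "http://example.com/My Files/annual report.pdf";

// Encode the spaces first, as suggested above...
$url = str_replace(" ", "%20", $url);

// ...then split the URL into its components.
$compurl = parse_url($url);

echo $url, "\n";              // http://example.com/My%20Files/annual%20report.pdf
echo $compurl['host'], "\n";  // example.com
echo $compurl['path'], "\n";  // /My%20Files/annual%20report.pdf
```

Note that str_replace() only touches literal spaces; other characters that would need encoding are left unchanged.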
