Monday, October 20, 2003

The xch perl script (posted on Oct 11) Xtracts URLs from a file and CHallenges the proxy server to determine which URLs are blocked.

xch source_file out_file

will extract URLs from source_file (which is an HTML file of links to other sites) and use curl to retrieve the HTTP header of each URL. If the header carries a 403 Forbidden status, that (usually) means the URL is on the SBA blacklist. The URL will then be written to out_file.
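
For the record, the check works roughly like this. This is a minimal sketch, not the actual xch source: the href regex and the exact curl invocation are my assumptions about how it is done.

    #!/usr/bin/perl
    # Sketch of the original xch check: extract hrefs from the source
    # file, fetch only the header with curl, and keep URLs that come
    # back 403 Forbidden. Regexes and filenames are illustrative.
    use strict;
    use warnings;

    my ($source_file, $out_file) = @ARGV;
    open my $in,  '<', $source_file or die "can't read $source_file: $!";
    open my $out, '>', $out_file    or die "can't write $out_file: $!";

    while (my $line = <$in>) {
        while ($line =~ /href\s*=\s*["']?(http[^"'\s>]+)/gi) {
            my $url    = $1;
            my $header = `curl -s -I "$url"`;   # -I = header only
            print $out "$url\n" if $header =~ m{^HTTP/1\.\d\s+403\b}m;
        }
    }
    close $out;
    close $in;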

As I mentioned earlier, this algorithm stopped working sometime between Q3 last year and this year. SCV's proxy server now returns a header that says everything is OK, but the server transparently gives you back an HTML document that says the URL has been blocked by the caching server.

Maybe I can keep retrieving only headers and rely on the "Content-Length: 1090" string in the header as a marker for blacklisted sites.
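
A rough sketch of that fallback, assuming the block page stays at 1090 bytes; the number is specific to SCV's current block page and this will break if they ever edit it.

    #!/usr/bin/perl
    # Sketch of the Content-Length marker idea: the caching server's
    # block page happens to be 1090 bytes, so the header gives it away.
    use strict;
    use warnings;

    my $url    = shift or die "usage: $0 url\n";
    my $header = `curl -s -I "$url"`;           # header only
    if ($header =~ /^Content-Length:\s*1090\s*$/mi) {
        print "$url looks blacklisted\n";
    } else {
        print "$url looks reachable\n";
    }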

If you just want to extract URLs from a file but not challenge the proxy server, use

xch -x source_file challenge_list

This puts all the extracted URLs into challenge_list.
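
The extraction pass on its own amounts to something like this. Again a sketch; xch may well use a different pattern to pull out the links.

    #!/usr/bin/perl
    # Extraction only, no proxy challenge: pull http links out of the
    # HTML source and write them one per line to the challenge list.
    use strict;
    use warnings;

    my ($source_file, $challenge_list) = @ARGV;
    open my $in,  '<', $source_file    or die "can't read $source_file: $!";
    open my $out, '>', $challenge_list or die "can't write $challenge_list: $!";
    while (<$in>) {
        print $out "$1\n" while /href\s*=\s*["']?(http[^"'\s>]+)/gi;
    }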

If you already have a list of URLs you want to test, use

xch -c challenge_list out_file

where challenge_list is the list of URLs in a text file, one URL per line.
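
Feeding such a list through the challenge step is just a loop over the file, one curl header request per line. A sketch, using the same 403 test as above:

    #!/usr/bin/perl
    # Sketch of the -c path: read URLs one per line from challenge_list
    # and write the ones the proxy refuses into out_file.
    use strict;
    use warnings;

    my ($challenge_list, $out_file) = @ARGV;
    open my $in,  '<', $challenge_list or die "can't read $challenge_list: $!";
    open my $out, '>', $out_file       or die "can't write $out_file: $!";
    while (my $url = <$in>) {
        chomp $url;
        next unless $url;                       # skip blank lines
        my $header = `curl -s -I "$url"`;       # header only
        print $out "$url\n" if $header =~ m{^HTTP/1\.\d\s+403\b}m;
    }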
