Crawler

You are probably reading this page because you found the link in one of your web server’s access logs. This page explains what is going on.

Your page has been (or is still being) visited by a crawler that I run to gather regular snapshots of the websites of German universities and research institutes as part of the L3S Web Observatory. The software used is Heritrix, a sophisticated crawler developed and also used by the Internet Archive. In particular, the crawler strictly adheres to the robots exclusion standard, which means you can restrict its access yourself with a robots.txt file on your server.
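As a rough illustration of how the robots exclusion standard can be used, the following Python sketch parses a sample robots.txt and checks which URLs a crawler may fetch. The user-agent token "example-crawler", the paths, and the host www.example.org are hypothetical placeholders; the token this crawler actually sends is the one recorded in your access log.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent token; use the one shown in your access log.
CRAWLER_AGENT = "example-crawler"

# Sample robots.txt: keep the crawler out of two directories,
# while leaving every other robot unrestricted.
ROBOTS_TXT = """\
User-agent: example-crawler
Disallow: /calendar/
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.org/research/",
        "https://www.example.org/calendar/2031/05",
    ):
        print(url, is_allowed(ROBOTS_TXT, CRAWLER_AGENT, url))
```

Checking the rules locally like this before publishing the file at /robots.txt may help avoid excluding more (or less) than intended.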

Although the crawler uses some heuristics to avoid so-called spider traps, it is sometimes trapped in a website that generates infinitely many links (calendar pages are a frequent example). If you detect such an access pattern, I would be very glad if you could tell me.
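One possible way to spot such a pattern on your side is to look for a single path that was requested with a very large number of different query strings. The sketch below is only a simple heuristic under that assumption; the function name, the threshold of 100, and the sample URLs are illustrative, and it does not describe the heuristics Heritrix itself uses.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def suspected_traps(requested_urls, threshold=100):
    """Return paths that were requested with at least `threshold`
    distinct query strings, sorted alphabetically."""
    queries_by_path = defaultdict(set)
    for url in requested_urls:
        parts = urlsplit(url)
        queries_by_path[parts.path].add(parts.query)
    return sorted(
        path
        for path, queries in queries_by_path.items()
        if len(queries) >= threshold
    )


if __name__ == "__main__":
    # Illustrative input: URLs extracted from the crawler's log entries.
    urls = [f"/calendar.php?month={i}" for i in range(150)]
    urls += ["/index.html", "/staff.html?lang=en", "/staff.html?lang=de"]
    print(suspected_traps(urls))
```

Traps that encode the changing part in the path itself (for example /calendar/2031/05) would not be caught by this check and would need grouping by path prefix instead.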

If you have any questions, comments, or complaints, please contact me via e-mail () or phone (+44 (0)114 222 2682). It would be helpful if you could provide information on the host or URLs that were crawled and the time of the access.

crawl progress