This page demonstrates C# and .NET code.
SiteMap v1.1 - Web Page scraper
This application is an example of a
Web spider, also known as a page scraper. It uses multiple threads and HttpWebRequest to
download pages from a web site. The pages are individually parsed in worker
threads. In this application, the HTML anchor tags are extracted and each HREF
is enqueued into the thread pool for continued processing. Thus, every page that is
reachable from the start page through some chain of HREFs will eventually be
traversed. The output is written as XML, and all the links are sent to the outputText pane.
Broken links are marked ERROR, so if you happen to have any bad HREFs on your
site, this will tell you. That is a useful application in its own right if you are
managing a large, active site. Version 1.1 adds support for the
Robots exclusion protocol. This is a standard way for a web site to
request that a page scraper not index certain pages. This is useful if the site has a lot of
dynamic content.
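The two parsing steps described above, pulling HREFs out of anchor tags and honoring robots.txt, can be sketched roughly as follows. This is not the SiteMap source. The class and method names (LinkExtractor, ExtractLinks, ParseDisallows, IsAllowed) are hypothetical, the regular expression is only an approximation of real HTML parsing, and the robots.txt handling is a simplification that reads only the Disallow lines under "User-agent: *".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from SiteMap itself.
public static class LinkExtractor
{
    // Matches href="..." or href='...' inside an <a ...> tag.
    // A regex is an approximation; malformed markup may be missed.
    private static readonly Regex AnchorHref = new Regex(
        "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns absolute http/https links found in html that stay on pageUrl's host.
    public static List<Uri> ExtractLinks(Uri pageUrl, string html)
    {
        var links = new List<Uri>();
        foreach (Match m in AnchorHref.Matches(html))
        {
            Uri absolute;
            // Resolve relative HREFs against the page they were found on.
            if (!Uri.TryCreate(pageUrl, m.Groups[1].Value, out absolute))
                continue;
            bool isWeb = absolute.Scheme == Uri.UriSchemeHttp
                      || absolute.Scheme == Uri.UriSchemeHttps;
            bool sameHost = string.Equals(absolute.Host, pageUrl.Host,
                                          StringComparison.OrdinalIgnoreCase);
            if (isWeb && sameHost)
            {
                // Drop any #fragment so the same page is not queued twice.
                links.Add(new Uri(absolute.GetLeftPart(UriPartial.Query)));
            }
        }
        return links;
    }

    // Simplified robots.txt reader: collects Disallow paths under "User-agent: *".
    public static List<string> ParseDisallows(string robotsTxt)
    {
        var rules = new List<string>();
        bool inWildcardGroup = false;
        foreach (string raw in robotsTxt.Split('\n'))
        {
            string line = raw.Split('#')[0].Trim();   // strip comments
            int colon = line.IndexOf(':');
            if (colon < 0) continue;
            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();
            if (field == "user-agent")
                inWildcardGroup = (value == "*");
            else if (field == "disallow" && inWildcardGroup && value.Length > 0)
                rules.Add(value);
        }
        return rules;
    }

    // A URL is allowed unless its path starts with one of the Disallow prefixes.
    public static bool IsAllowed(Uri url, List<string> disallows)
    {
        foreach (string prefix in disallows)
        {
            if (url.AbsolutePath.StartsWith(prefix, StringComparison.Ordinal))
                return false;
        }
        return true;
    }
}
```

In a spider along these lines, each worker would call ExtractLinks on the page it just downloaded, filter the result with IsAllowed, and enqueue whatever remains that has not been visited yet.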
IsServerUp - Continuously polls a list of web sites and indicates whether the server is up.
This demo is a simple utility for monitoring multiple web
servers. The WinForm shows whether each server is up or down and for how long. It
reads a list of URLs from the file Config.xml. Then it creates a pool of threads, one
per URL, and requests web pages from each URL. The up or down state of each web
server is stored in a DataSet and displayed in a WinForm with a DataGrid.
This is also a good demo of threads. The actual HTML is loaded and read, so it
would be possible to scrape dynamic pages for data if they have a simple
format.
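The polling loop described above might look something like the sketch below. It is not the IsServerUp source. The names ServerMonitor, CreateStatusTable, Probe, Record and StartPolling are hypothetical, as are the column names and the five second timeout. The probe here only checks the HTTP status rather than reading the HTML, and reading Config.xml is left out because its layout is not described.

```csharp
using System;
using System.Data;
using System.Net;
using System.Threading;

// Hypothetical helper, not taken from IsServerUp itself.
public static class ServerMonitor
{
    // Builds a table a DataGrid could bind to: one row per monitored URL.
    public static DataTable CreateStatusTable(string[] urls, DateTime now)
    {
        var table = new DataTable("Servers");
        table.Columns.Add("Url", typeof(string));
        table.Columns.Add("IsUp", typeof(bool));
        table.Columns.Add("Since", typeof(DateTime));
        table.PrimaryKey = new[] { table.Columns["Url"] };
        foreach (string url in urls)
            table.Rows.Add(url, false, now);   // assume "down" until first probe
        return table;
    }

    // Returns true if the server answers with 200 OK within the timeout.
    public static bool Probe(string url, int timeoutMs)
    {
        try
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Timeout = timeoutMs;
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return response.StatusCode == HttpStatusCode.OK;
            }
        }
        catch (WebException)
        {
            return false;   // timeouts, DNS failures and error statuses count as down
        }
    }

    // Stores a probe result. "Since" changes only when the state flips,
    // so it always holds the time the current up/down state began.
    public static void Record(DataTable table, string url, bool isUp, DateTime now)
    {
        lock (table)
        {
            DataRow row = table.Rows.Find(url);
            if ((bool)row["IsUp"] != isUp)
            {
                row["IsUp"] = isUp;
                row["Since"] = now;
            }
        }
    }

    // Starts one background thread that polls a single URL forever.
    public static Thread StartPolling(DataTable table, string url, TimeSpan interval)
    {
        var worker = new Thread(() =>
        {
            while (true)
            {
                Record(table, url, Probe(url, 5000), DateTime.Now);
                Thread.Sleep(interval);
            }
        });
        worker.IsBackground = true;   // do not keep the process alive on exit
        worker.Start();
        return worker;
    }
}
```

Calling StartPolling once per URL gives the one-thread-per-URL arrangement. If the table is bound to a DataGrid, the updates would typically need to be marshaled onto the UI thread, for example with Control.Invoke, rather than written directly from the workers as this sketch does.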