The web09-bst Dataset

http://boston.lti.cs.cmu.edu/Data/web08-bst/planning.html

The web09-bst dataset is a 25 terabyte collection of about 1 billion web pages crawled in January and February 2009. Pages were crawled in best-first order, using the OPIC metric to prioritize URLs. The crawl started from about 28 million seed URLs that either i) had high OPIC values in a web graph produced from an earlier 200 million page crawl, or ii) were ranked highly by a commercial search engine for one of 4,000 sample queries in one of 10 languages. The dataset covers web content in English, Chinese, Spanish, Japanese, French, German, Arabic, Portuguese, Korean, and Italian. More information about the dataset is available in our project planning document. That document is slightly outdated (for example, our plans for the seed URLs changed), but it is still the best description of our plans. Dataset construction is in progress now. Current progress on the crawler and statistics for pages fetched can be found here. The dataset is expected to be available to other researchers by April 2009, under a TREC-style data license, for a small fee.
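
For readers unfamiliar with the crawl order mentioned above, the following is a minimal sketch of best-first crawling driven by OPIC-style "cash", run on a tiny in-memory link graph. It is not the project's crawler. The function name, the toy graph, the equal initial cash for seeds, the alphabetical tie-breaking, and the choice to fetch each page only once are all assumptions made for illustration. The core idea it follows is the usual description of OPIC: each fetched page splits its cash equally among its outlinks, and the crawler next fetches the known page holding the most cash.

```python
# Toy link graph (hypothetical): page -> list of pages it links to.
TOY_GRAPH = {
    "seed": ["a", "b", "c"],
    "a": ["d"],
    "b": ["d"],
    "c": ["e"],
    "d": [],
    "e": [],
}


def opic_best_first_order(link_graph, seeds, max_pages):
    """Return the order in which pages would be fetched.

    Simplifications (assumptions, not the real crawler): the graph is
    known in advance, every page is fetched at most once, and ties are
    broken alphabetically so the result is deterministic.
    """
    # Cash held by pages that are known but not yet fetched.
    frontier = {url: 1.0 / len(seeds) for url in seeds}
    # Total cash each fetched page has held so far.
    history = {}
    order = []

    while frontier and len(order) < max_pages:
        # Best-first step: take the frontier page with the most cash.
        url = min(frontier, key=lambda u: (-frontier[u], u))
        cash = frontier.pop(url)
        history[url] = history.get(url, 0.0) + cash
        order.append(url)

        outlinks = link_graph.get(url, [])
        if not outlinks:
            continue  # nothing to pass cash on to in this sketch

        # OPIC step: split the page's cash equally among its outlinks.
        share = cash / len(outlinks)
        for target in outlinks:
            if target in history:
                # Already fetched: only record the cash it received.
                history[target] += share
            else:
                frontier[target] = frontier.get(target, 0.0) + share

    return order


if __name__ == "__main__":
    print(opic_best_first_order(TOY_GRAPH, ["seed"], max_pages=6))
```

In this toy graph, page "d" is linked from both "a" and "b", so it accumulates more cash than "c" and is fetched before it, whereas a plain breadth-first crawl would fetch "c" first.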