Data Sets Resources
Learning to rank
Learning to rank[1], or machine-learned ranking (MLR), is a type of supervised or semi-supervised machine learning problem in which the goal is to automatically construct a ranking model from training data. Training data consists of lists of items with some partial order specified between the items in each list. This order is typically induced by giving each item a numerical or ordinal score or a binary judgment (e.g. “relevant” or “not relevant”). The purpose of the ranking model is to rank, i.e. to produce a permutation of the items in new, unseen lists in a way that is “similar”, in some sense, to the rankings in the training data.
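To make the two ideas above concrete, here is a minimal Python sketch. It shows how per-item relevance labels induce a partial order (as preference pairs) and how a list of model scores turns into a permutation. The function names `preference_pairs` and `rank_by_score` are made up for this illustration and do not come from any particular library; the labels and scores are toy values.

```python
from itertools import combinations


def preference_pairs(labels):
    """Return (i, j) index pairs meaning "item i is judged more relevant than item j".

    Items with equal labels produce no pair, which is why the induced
    order is only partial.
    """
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] > labels[j]:
            pairs.append((i, j))
        elif labels[j] > labels[i]:
            pairs.append((j, i))
    return pairs


def rank_by_score(scores):
    """Return a permutation of item indices, highest score first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy list of four items with graded labels (hypothetical values).
labels = [2, 0, 2, 1]
print(preference_pairs(labels))

# Toy model scores for three unseen items (hypothetical values).
print(rank_by_score([0.1, 0.9, 0.5]))
```

Items 0 and 2 share the same label, so no pair relates them; a ranking model is free to place them in either order.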
Ranking Refinement and its Application to Information Retrieval
http://www.cs.pitt.edu/~valizadegan/Publications/WWW_presentation.pdf
http://www2008.org/papers/pdf/p397-jinA.pdf
We consider the problem of ranking refinement, i.e., to improve the accuracy of an existing ranking function with a small set of labeled instances. We are, particularly, interested in learning a better ranking function using two complementary sources of information, ranking information given by the existing ranking function (i.e., the base ranker) and that obtained from users’ feedbacks. This problem is very important in information retrieval where feedbacks are gradually collected. The key challenge in combining the two sources of information arises from the fact that the ranking information presented by the base ranker tends to be imperfect and the ranking information obtained from users’ feedbacks tends to be noisy. We present a novel boosting algorithm for ranking refinement that can effectively leverage the uses of the two sources of information. Our empirical study shows that the proposed algorithm is effective for ranking refinement, and furthermore it significantly outperforms the baseline algorithms that incorporate the outputs from the base ranker as an additional feature.
LETOR4.0 Datasets
http://research.microsoft.com/en-us/um/beijing/projects/letor/letor4dataset.aspx
LETOR is a package of benchmark data sets for research on LEarning TO Rank, which contains standard features, relevance judgments, data partitioning, evaluation tools, and several baselines. Version 1.0 was released in April 2007. Version 2.0 was released in Dec. 2007. Version 3.0 was released in Dec. 2008. This version, 4.0, was released in July 2009. Unlike the previous versions (V3.0 is an update based on V2.0, and V2.0 is an update based on V1.0), LETOR4.0 is a totally new release. It uses the Gov2 web page collection (~25M pages) and two query sets from the Million Query track of TREC 2007 and TREC 2008. We call the two query sets MQ2007 and MQ2008 for short. There are about 1700 queries with labeled documents in MQ2007 and about 800 queries with labeled documents in MQ2008.
Mechanical turk
The web09-bst Dataset
http://boston.lti.cs.cmu.edu/Data/web08-bst/planning.html
The web09-bst dataset is a 25 terabyte dataset of about 1 billion web pages crawled in January and February, 2009. The crawl order was best-first search, using the OPIC metric. The crawl was started from about 28 million URLs that either i) had high OPIC values in a web graph produced from an earlier 200 million page crawl, or ii) were ranked highly by a commercial search engine for one of 4,000 sample queries in one of 10 languages. This dataset covers web content in English, Chinese, Spanish, Japanese, French, German, Arabic, Portuguese, Korean, and Italian. More information about the dataset is available in our project planning document. This document is slightly outdated - for example, our plans for the seed URLs changed - but it is still the best description of our plans. Dataset construction is in progress now. Current progress on the crawler and statistics for pages fetched can be found here. It is expected to be available to other researchers by April, 2009, under a TREC-style data license, for a small fee.
dmoz dataset
http://ce.sharif.ir/courses/84-85/2/ce324/assignments/files/assignDir4/dmozReadme.html
(year 2005)
This dataset consists of the “Science” subtree of the dmoz.org web directory. It contains 11625 topics and 104853 documents. The topics are numbered by integers from 0 to 11624; the documents are numbered by integers from 0 to 104852. The topics are organized into a tree (topic 0 is the root). Each document belongs to one or more topics; however, the vast majority of documents (102801 out of 104853) belong to exactly one topic.
Using the Wikipedia page-to-page link database
Dataset information
Microsoft Research Projects
http://research.microsoft.com/en-us/projects/mslr/download.aspx
We release two large-scale datasets for research on learning to rank: MSLR-WEB30K, with more than 30,000 queries, and MSLR-WEB10K, a random sample of it with 10,000 queries.
Below are two rows from the MSLR-WEB10K dataset:
=============================================================
0 qid:1 1:3 2:0 3:2 4:2 … 135:0 136:0
2 qid:1 1:3 2:3 3:0 4:0 … 135:0 136:0
=============================================================
In the data files, each row corresponds to a query-url pair.
The first column is the relevance label of the pair, the second column is the query id, and the remaining columns are features. The larger the relevance label, the more relevant the query-url pair. Each query-url pair is represented by a 136-dimensional feature vector.
The details of the features can be found on the dataset's project page.
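The row format described above (label, `qid:<id>`, then `index:value` features) can be read with a few lines of Python. This is a rough sketch, not an official loader: the helper names `parse_row` and `group_by_query` are invented here, and the sample rows are shortened to the first four features from the rows shown above, so the elided features stay elided.

```python
from collections import defaultdict


def parse_row(line):
    """Parse one row into (relevance_label, query_id, {feature_index: value})."""
    tokens = line.split()
    label = int(tokens[0])
    key, _, qid = tokens[1].partition(":")
    if key != "qid":
        raise ValueError(f"expected 'qid:<id>' in second column, got {tokens[1]!r}")
    features = {}
    for token in tokens[2:]:
        index, _, value = token.partition(":")
        features[int(index)] = float(value)
    return label, qid, features


def group_by_query(lines):
    """Group parsed rows by query id, since ranking is done per query."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip():  # skip blank lines
            label, qid, features = parse_row(line)
            groups[qid].append((label, features))
    return dict(groups)


# Shortened sample rows (only the first four features are kept here).
rows = [
    "0 qid:1 1:3 2:0 3:2 4:2",
    "2 qid:1 1:3 2:3 3:0 4:0",
]
for qid, docs in group_by_query(rows).items():
    print(qid, [label for label, _ in docs])
```

Features are stored in a dictionary keyed by feature index rather than a fixed-length list, so the same sketch works on shortened rows like these as well as on full 136-feature rows.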