Data Sets Resources

Learning to rank

Learning to rank, or machine-learned ranking (MLR), is a type of supervised or semi-supervised machine learning problem in which the goal is to automatically construct a ranking model from training data. Training data consists of lists of items with some partial order specified between the items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. “relevant” or “not relevant”) for each item. The ranking model’s purpose is to rank, i.e. to produce a permutation of the items in new, unseen lists in a way that is “similar” in some sense to the rankings in the training data.

http://en.wikipedia.org/wiki/Learning_to_rank
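As a rough illustration of the definition above, the sketch below shows how graded judgments for one list induce a partial order (as preference pairs) and how a model's scores turn into a permutation. The function names, labels, and scores are hypothetical and chosen only for illustration; this is not any particular library's API.

```python
from itertools import combinations


def preference_pairs(labels):
    """Return (i, j) index pairs where item i is judged more relevant than item j.

    Items with equal labels stay unordered, which is why the order is only partial.
    """
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] > labels[j]:
            pairs.append((i, j))
        elif labels[j] > labels[i]:
            pairs.append((j, i))
    return pairs


def rank_by_score(scores):
    """Return item indices sorted from highest to lowest score (a permutation)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical graded judgments and model outputs for one list of four items.
labels = [2, 0, 1, 0]
scores = [0.9, 0.1, 0.4, 0.3]

print(preference_pairs(labels))  # pairs implied by the judgments
print(rank_by_score(scores))     # ranking produced by the (made-up) model
```

Items 1 and 3 share the same label here, so no pair is emitted between them; a ranking model is free to order them either way.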

Ranking Refinement and its Application to Information Retrieval

http://www.cs.pitt.edu/~valizadegan/Publications/WWW_presentation.pdf

http://www2008.org/papers/pdf/p397-jinA.pdf

We consider the problem of ranking refinement, i.e., to improve the accuracy of an existing ranking function with a small set of labeled instances. We are, particularly, interested in learning a better ranking function using two complementary sources of information, ranking information given by the existing ranking function (i.e., the base ranker) and that obtained from users’ feedbacks. This problem is very important in information retrieval where feedbacks are gradually collected. The key challenge in combining the two sources of information arises from the fact that the ranking information presented by the base ranker tends to be imperfect and the ranking information obtained from users’ feedbacks tends to be noisy. We present a novel boosting algorithm for ranking refinement that can effectively leverage the uses of the two sources of information. Our empirical study shows that the proposed algorithm is effective for ranking refinement, and furthermore it significantly outperforms the baseline algorithms that incorporate the outputs from the base ranker as an additional feature.

LETOR4.0 Datasets

http://research.microsoft.com/en-us/um/beijing/projects/letor/letor4dataset.aspx

LETOR is a package of benchmark data sets for research on LEarning TO Rank, which contains standard features, relevance judgments, data partitioning, evaluation tools, and several baselines. Version 1.0 was released in April 2007, version 2.0 in Dec. 2007, and version 3.0 in Dec. 2008. This version, 4.0, was released in July 2009. Unlike the previous versions (V3.0 is an update based on V2.0, and V2.0 is an update based on V1.0), LETOR4.0 is a totally new release. It uses the Gov2 web page collection (~25M pages) and two query sets from the Million Query track of TREC 2007 and TREC 2008. We call the two query sets MQ2007 and MQ2008 for short. There are about 1700 queries with labeled documents in MQ2007 and about 800 queries with labeled documents in MQ2008.

Mechanical turk

The web09-bst Dataset

http://boston.lti.cs.cmu.edu/Data/web08-bst/planning.html

The web09-bst dataset is a 25 terabyte dataset of about 1 billion web pages crawled in January and February, 2009. The crawl order was best-first search, using the OPIC metric. The crawl was started from about 28 million URLs that either i) had high OPIC values in a web graph produced from an earlier 200 million page crawl, or ii) were ranked highly by a commercial search engine for one of 4,000 sample queries in one of 10 languages. This dataset covers web content in English, Chinese, Spanish, Japanese, French, German, Arabic, Portuguese, Korean, and Italian. More information about the dataset is available in our project planning document. This document is slightly outdated (for example, our plans for the seed URLs changed), but it is still the best description of our plans. Dataset construction is in progress now. Current progress on the crawler and statistics for pages fetched can be found here. It is expected to be available to other researchers by April, 2009, under a TREC-style data license, for a small fee.

dmoz dataset

http://ce.sharif.ir/courses/84-85/2/ce324/assignments/files/assignDir4/dmozReadme.html

(year 2005)

This dataset consists of the “Science” subtree of the dmoz.org web directory. It contains 11625 topics and 104853 documents. The topics are numbered by integers from 0 to 11624; the documents are numbered by integers from 0 to 104852. The topics are organized into a tree (topic 0 is the root). Each document belongs to one or more topics; however, the vast majority of documents (102801 out of 104853) belong to exactly one topic.

Using the Wikipedia page-to-page link database

Dataset information

Microsoft Research Projects

http://research.microsoft.com/en-us/projects/mslr/download.aspx

We release two large-scale datasets for research on learning to rank: MSLR-WEB30K, with more than 30,000 queries, and MSLR-WEB10K, a random sample of it with 10,000 queries.

Below are two rows from the MSLR-WEB10K dataset:
=============================================================
0 qid:1 1:3 2:0 3:2 4:2 … 135:0 136:0 
2 qid:1 1:3 2:3 3:0 4:0 … 135:0 136:0 
=============================================================

In the data files, each row corresponds to a query-url pair.

The first column is the relevance label of the pair, the second column is the query id, and the following columns are features. The larger the relevance label, the more relevant the query-url pair is. A query-url pair is represented by a 136-dimensional feature vector.

The details of features can be found here.
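The row layout described above (label, then `qid:<id>`, then `index:value` features) can be read with a few lines of Python. This is a minimal sketch, not an official loader: the names `parse_row` and `group_by_query` are hypothetical, and the sample rows are shortened to four features for illustration, whereas real rows carry all 136.

```python
from collections import defaultdict


def parse_row(line):
    """Parse one 'label qid:ID index:value ...' row into (label, qid, features)."""
    tokens = line.split()
    label = int(tokens[0])                 # relevance label (first column)
    qid = tokens[1].split(":", 1)[1]       # query id, kept as a string
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, qid, features


def group_by_query(lines):
    """Group parsed rows by query id, since ranking is done per query."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip():                   # skip blank lines
            label, qid, features = parse_row(line)
            groups[qid].append((label, features))
    return dict(groups)


# Shortened, illustrative rows (only the first four features are shown).
sample = [
    "0 qid:1 1:3 2:0 3:2 4:2",
    "2 qid:1 1:3 2:3 3:0 4:0",
]

for qid, rows in group_by_query(sample).items():
    print(qid, [label for label, _ in rows])
```

Grouping by query id matters because relevance labels are only comparable among the documents retrieved for the same query.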