December 2011
1 post
Data Sets Resources
https://bitly.com/bundles/hmason/1
http://oreilly.com/catalog/0636920018254
http://www.delicious.com/pskomoroch/dataset
Reference from:
Hilary Mason Wants To Get You Started With Big Data
November 2011
6 posts
Learning to rank
Learning to rank or machine-learned ranking (MLR) is a type of supervised or semi-supervised machine learning problem in which the goal is to automatically construct a ranking model from training data. Training data consists of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment...
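To make the "partial order induced by scores" idea concrete, here is a minimal sketch in Python. It assumes each list is given as (item, score) tuples; the function name `preference_pairs` and the document ids are made up for illustration and do not come from any particular library.

```python
from itertools import combinations


def preference_pairs(scored_items):
    """Return (better, worse) pairs implied by per-item relevance scores.

    scored_items: list of (item, score) tuples for ONE query/list.
    """
    pairs = []
    for (a, score_a), (b, score_b) in combinations(scored_items, 2):
        if score_a > score_b:
            pairs.append((a, b))
        elif score_b > score_a:
            pairs.append((b, a))
        # Equal scores imply no preference, which is why the order is partial.
    return pairs


# Hypothetical judged list: d1 and d3 are equally relevant, d2 is not relevant.
judged = [("d1", 2), ("d2", 0), ("d3", 2)]
pairs = preference_pairs(judged)
print(pairs)
```

Pairwise learning-to-rank methods train on pairs like these; other families of methods consume the scores or the whole list directly.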
Ranking Refinement and its Application to...
http://www.cs.pitt.edu/~valizadegan/Publications/WWW_presentation.pdf
http://www2008.org/papers/pdf/p397-jinA.pdf
We consider the problem of ranking refinement, i.e., to improve the accuracy of an existing ranking function with a small set of labeled instances. We are, particularly, interested in learning a better ranking function using two complementary sources of information, ranking...
LETOR4.0 Datasets
http://research.microsoft.com/en-us/um/beijing/projects/letor/letor4dataset.aspx
LETOR is a package of benchmark data sets for research on LEarning TO Rank, which contains standard features, relevance judgments, data partitioning, evaluation tools, and several baselines. Version 1.0 was released in April 2007. Version 2.0 was released in Dec. 2007. Version 3.0 was released in Dec. 2008. This...
Mechanical turk
https://www.mturk.com/mturk/welcome
The web09-bst Dataset
http://boston.lti.cs.cmu.edu/Data/web08-bst/planning.html
The web09-bst dataset is a 25 terabyte dataset of about 1 billion web pages crawled in January and February 2009. The crawl order was best-first search, using the OPIC metric. The crawl was started from about 28 million URLs that either i) had high OPIC values in a web graph produced from an earlier 200 million page crawl, or ii) were...
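The best-first crawl order mentioned above can be sketched roughly as follows. This is only an illustration on a toy in-memory graph: the real crawl used the OPIC metric, whereas the sketch assumes a fixed, precomputed score per URL, and the names `best_first_crawl`, `links`, and `score` are hypothetical.

```python
import heapq


def best_first_crawl(seeds, links, score, limit):
    """Visit pages in descending score order, starting from the seed URLs.

    links: dict mapping a URL to the URLs it links to (toy stand-in for fetching).
    score: dict mapping a URL to a fixed priority (stand-in for OPIC, an assumption).
    """
    # heapq is a min-heap, so store negated scores to pop the best page first.
    frontier = [(-score.get(url, 0.0), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    order = []
    while frontier and len(order) < limit:
        _, url = heapq.heappop(frontier)
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score.get(nxt, 0.0), nxt))
    return order


toy_links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
toy_score = {"a": 1.0, "b": 0.2, "c": 0.9, "d": 0.5}
print(best_first_crawl(["a"], toy_links, toy_score, limit=10))
```

The only difference from a breadth-first crawl is the priority queue: the frontier is ordered by score rather than by discovery time.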
dmoz dataset
http://ce.sharif.ir/courses/84-85/2/ce324/assignments/files/assignDir4/dmozReadme.html
(year 2005)
This dataset consists of the “Science” subtree of the dmoz.org web directory. It contains 11625 topics and 104853 documents. The topics are numbered by integers from 0 to 11624; the documents are numbered by integers from 0 to 104852. The topics are organized into a tree (topic 0 is the...
October 2011
2 posts
Using the Wikipedia page-to-page link database
http://haselgrove.id.au/wikipedia.htm
Dataset information
http://kevinchai.net/datasets
August 2011
1 post
Microsoft Research Projects
http://research.microsoft.com/en-us/projects/mslr/download.aspx
We release two large-scale datasets for research on learning to rank: MSLR-WEB30K, with more than 30,000 queries, and MSLR-WEB10K, a random sample of it with 10,000 queries.
Below are two rows from the MSLR-WEB10K dataset:
0 qid:1 1:3 2:0 3:2 4:2 …...
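A row in this format can be read with a few lines of Python. The sketch below assumes the layout visible in the excerpt (a relevance label, a `qid:` token, then `index:value` feature pairs) and uses only the visible prefix of the row, since the rest is cut off. `parse_letor_line` is a name chosen here, not part of any released tool.

```python
def parse_letor_line(line):
    """Split one 'label qid:Q idx:val ...' row into (label, qid, features)."""
    tokens = line.split()
    label = int(tokens[0])
    if not tokens[1].startswith("qid:"):
        raise ValueError("expected a qid: token in second position")
    qid = tokens[1].split(":", 1)[1]
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, qid, features


# Only the part of the row shown in the excerpt; the full row has more features.
label, qid, features = parse_letor_line("0 qid:1 1:3 2:0 3:2 4:2")
print(label, qid, features)
```

Grouping the parsed rows by `qid` gives one ranked list per query, which is the unit most learning-to-rank code works with.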
April 2011
2 posts
AOL query data
Here is the AOL data: http://www.gregsadetsky.com/aol-data/ (439MB)
Site1: http://www.atrus.org/hosted/AOL-data.tgz
Site2: http://aolsearchlogs.cloudsites.com/AOL-data.tgz
Site3: http://sexygeeks.be/AOL-data.tgz
Site4: http://www.upodcast.be/mirror/AOL-data.tgz
Site5: http://www.leafyhost.com/AOL-data.tgz
Phantomjs
http://www.phantomjs.org/
PhantomJS can be fully scripted using JavaScript. It is well suited to headless testing of web-based applications, site scraping, page capture, SVG rendering, PDF conversion, and many other uses.
March 2011
1 post
Free Python learning
(1) http://www.swaroopch.com/notes/Python
(2) http://learnpythonthehardway.org/
(3) http://en.wikibooks.org/wiki/Non-Programmer's_Tutorial_for_Python_2.6
(4) http://en.wikibooks.org/wiki/Python_Programming
(5) http://docs.python.org/tutorial/index.html
(6) http://www.greenteapress.com/thinkpython/thinkpython.html