December 2011
1 post
Data Sets Resources
https://bitly.com/bundles/hmason/1 http://oreilly.com/catalog/0636920018254 http://www.delicious.com/pskomoroch/dataset Reference from: Hilary Mason Wants To Get You Started With Big Data
Dec 27th
November 2011
6 posts
Learning to rank
Learning to rank[1] or machine-learned ranking (MLR) is a type of supervised or semi-supervised machine learning problem in which the goal is to automatically construct a ranking model from training data. Training data consists of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment...
Nov 28th
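As a rough illustration of the "Learning to rank" excerpt above, the sketch below shows how per-item scores can induce a partial order within one list. The function name and the toy document ids are made up for this example; it only turns graded judgments into pairwise preferences and is not a ranking model.

```python
from itertools import combinations


def pairwise_preferences(items_with_scores):
    """Return (preferred, other) pairs implied by per-item scores.

    Items with equal scores produce no pair, which is why the
    resulting order is only partial.
    """
    pairs = []
    for (a, score_a), (b, score_b) in combinations(items_with_scores, 2):
        if score_a > score_b:
            pairs.append((a, b))
        elif score_b > score_a:
            pairs.append((b, a))
    return pairs


# Hypothetical judgments for one query: (document id, relevance grade).
judged = [("doc_a", 2), ("doc_b", 0), ("doc_c", 2), ("doc_d", 1)]
prefs = pairwise_preferences(judged)
for preferred, other in prefs:
    print(preferred, ">", other)
```

Pairs like these are the kind of training signal a pairwise learner would consume; doc_a and doc_c tie, so no preference is recorded between them.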
Ranking Refinement and its Application to...
http://www.cs.pitt.edu/~valizadegan/Publications/WWW_presentation.pdf http://www2008.org/papers/pdf/p397-jinA.pdf We consider the problem of ranking refinement, i.e., to improve the accuracy of an existing ranking function with a small set of labeled instances. We are, particularly, interested in learning a better ranking function using two complementary sources of information, ranking...
Nov 28th
LETOR4.0 Datasets
http://research.microsoft.com/en-us/um/beijing/projects/letor/letor4dataset.aspx LETOR is a package of benchmark data sets for research on LEarning TO Rank, which contains standard features, relevance judgments, data partitioning, evaluation tools, and several baselines. Version 1.0 was released in April 2007. Version 2.0 was released in Dec. 2007. Version 3.0 was released in Dec. 2008. This...
Nov 28th
Mechanical Turk
https://www.mturk.com/mturk/welcome
Nov 28th
The web09-bst Dataset
http://boston.lti.cs.cmu.edu/Data/web08-bst/planning.html The web09-bst dataset is a 25 terabyte dataset of about 1 billion web pages crawled in January and February, 2009. The crawl order was best-first search, using the OPIC metric. The crawl was started from about 28 million URLs that either i) had high OPIC values in a web graph produced from an earlier 200 million page crawl, or ii) were...
Nov 28th
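The web09-bst excerpt above mentions a best-first crawl order. A minimal sketch of that idea with a priority queue follows. The link graph and the scores are toy values invented for illustration; the scores merely stand in for an importance metric and OPIC itself is not implemented here.

```python
import heapq


def best_first_order(seeds, links, score):
    """Visit URLs in descending score order, expanding links as we go."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score.get(url, 0.0), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    order = []
    while frontier:
        _, url = heapq.heappop(frontier)
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score.get(nxt, 0.0), nxt))
    return order


# Hypothetical toy web graph and importance scores.
toy_links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
toy_score = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.5}
crawl_order = best_first_order(["a"], toy_links, toy_score)
print(crawl_order)
```

A real crawler would also update scores as pages are fetched; this sketch keeps them fixed to show only the ordering.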
dmoz dataset
http://ce.sharif.ir/courses/84-85/2/ce324/assignments/files/assignDir4/dmozReadme.html (year 2005) This dataset consists of the “Science” subtree of the dmoz.org web directory. It contains 11625 topics and 104853 documents. The topics are numbered by integers from 0 to 11624; the documents are numbered by integers from 0 to 104852. The topics are organized into a tree (topic 0 is the...
Nov 28th
October 2011
2 posts
Using the Wikipedia page-to-page link database
http://haselgrove.id.au/wikipedia.htm
Oct 29th
Dataset information
http://kevinchai.net/datasets
Oct 29th
August 2011
1 post
Microsoft Research Projects
http://research.microsoft.com/en-us/projects/mslr/download.aspx We release two large-scale datasets for research on learning to rank: MSLR-WEB30k, with more than 30,000 queries, and MSLR-WEB10K, a random sample of it with 10,000 queries. Below are two rows from the MSLR-WEB10K dataset: 0 qid:1 1:3 2:0 3:2 4:2 …...
Aug 1st
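The sample row quoted in the MSLR entry above reads as a relevance label, a query id, and a run of index:value features. A small parsing sketch under that assumption follows; the function name is made up, and only the visible prefix of the row is used because the rest is elided in the excerpt.

```python
def parse_ltr_line(line):
    """Split one 'label qid:ID idx:val ...' row into its parts.

    Assumes whitespace-separated tokens. Anything after a '#' is
    treated as a trailing comment and ignored (an assumption; the
    excerpt shows no comments).
    """
    tokens = line.split("#", 1)[0].split()
    label = int(tokens[0])
    qid = tokens[1].split(":", 1)[1]
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, qid, features


# Visible prefix of the row shown in the excerpt.
label, qid, feats = parse_ltr_line("0 qid:1 1:3 2:0 3:2 4:2")
print(label, qid, feats)
```

Grouping parsed rows by qid would give the per-query lists that learning-to-rank methods train on.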
April 2011
2 posts
AOL query data
Here is the AOL data: http://www.gregsadetsky.com/aol-data/ (439MB) Site1: http://www.atrus.org/hosted/AOL-data.tgz Site2: http://aolsearchlogs.cloudsites.com/AOL-data.tgz Site3: http://sexygeeks.be/AOL-data.tgz Site4: http://www.upodcast.be/mirror/AOL-data.tgz Site5: http://www.leafyhost.com/AOL-data.tgz
Apr 25th
PhantomJS
http://www.phantomjs.org/ PhantomJS can be fully scripted using JavaScript. It is suited to headless testing of web-based applications, site scraping, page capture, SVG rendering, PDF conversion, and many other uses.
Apr 1st
March 2011
1 post
Free Python learning
(1) http://www.swaroopch.com/notes/Python (2) http://learnpythonthehardway.org/ (3) http://en.wikibooks.org/wiki/Non-Programmer’s_Tutorial_for_Python_2.6 (4) http://en.wikibooks.org/wiki/Python_Programming (5) http://docs.python.org/tutorial/index.html (6) http://www.greenteapress.com/thinkpython/thinkpython.html
Mar 26th