<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><atom:link rel="hub" href="http://tumblr.superfeedr.com/" xmlns:atom="http://www.w3.org/2005/Atom"/><description></description><title>Memory is limited ...</title><generator>Tumblr (3.0; @yentw)</generator><link>http://note.yen.tw/</link><item><title>Data Sets Resources</title><description>&lt;p&gt;&lt;a href="https://bitly.com/bundles/hmason/1" target="_blank"&gt;&lt;a href="https://bitly.com/bundles/hmason/1"&gt;https://bitly.com/bundles/hmason/1&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://oreilly.com/catalog/0636920018254" target="_blank"&gt;http://oreilly.com/catalog/0636920018254&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://www.delicious.com/pskomoroch/dataset" target="_blank"&gt;http://www.delicious.com/pskomoroch/dataset&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Reference from:&lt;/p&gt;
&lt;p&gt;&lt;a href="http://www.readwriteweb.com/archives/redux_hilary_mason_wants_to_get_you_started_with_big_dat.php" title="Permanent link to Hilary Mason Wants To Get You Started With Big Data" target="_blank"&gt;Hilary Mason Wants To Get You Started With Big Data&lt;/a&gt;&lt;/p&gt;</description><link>http://note.yen.tw/post/14850323731</link><guid>http://note.yen.tw/post/14850323731</guid><pubDate>Tue, 27 Dec 2011 01:10:00 -0500</pubDate></item><item><title>Learning to rank</title><description>&lt;p&gt;&lt;strong&gt;Learning to rank&lt;/strong&gt;&lt;sup class="reference" id="cite_ref-liu_0-0"&gt;&lt;a href="http://en.wikipedia.org/wiki/Learning_to_rank#cite_note-liu-0"&gt;[1]&lt;/a&gt;&lt;/sup&gt;&lt;span&gt; or &lt;/span&gt;&lt;strong&gt;machine-learned ranking&lt;/strong&gt;&lt;span&gt; (MLR) is a type of &lt;/span&gt;&lt;a href="http://en.wikipedia.org/wiki/Supervised_learning" title="Supervised learning"&gt;supervised&lt;/a&gt;&lt;span&gt; or &lt;/span&gt;&lt;a href="http://en.wikipedia.org/wiki/Semi-supervised_learning" title="Semi-supervised learning"&gt;semi-supervised&lt;/a&gt; &lt;a href="http://en.wikipedia.org/wiki/Machine_learning" title="Machine learning"&gt;machine learning&lt;/a&gt;&lt;span&gt; problem in which the goal is to automatically construct a &lt;/span&gt;&lt;a href="http://en.wikipedia.org/wiki/Ranking_function" title="Ranking function"&gt;ranking model&lt;/a&gt;&lt;span&gt; from training data. Training data consists of lists of items with some &lt;/span&gt;&lt;a class="mw-redirect" href="http://en.wikipedia.org/wiki/Partial_order" title="Partial order"&gt;partial order&lt;/a&gt;&lt;span&gt; specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. &amp;#8220;relevant&amp;#8221; or &amp;#8220;not relevant&amp;#8221;) for each item. Ranking model&amp;#8217;s purpose is to rank, i.e. 
produce a &lt;/span&gt;&lt;a href="http://en.wikipedia.org/wiki/Permutation" title="Permutation"&gt;permutation&lt;/a&gt;&lt;span&gt; of items in new, unseen lists in a way that is &amp;#8220;similar&amp;#8221; to rankings in the training data in some sense.&lt;/span&gt;&lt;/p&gt;
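&lt;p&gt;As a minimal sketch of what producing a permutation means here, assume a ranking model boils down to a scoring function and that items are sorted by descending score. The document names and scores below are hypothetical, and the dictionary lookup merely stands in for a learned model.&lt;/p&gt;

```python
def rank(items, score):
    """Return the items ordered from highest to lowest score."""
    return sorted(items, key=score, reverse=True)


# Hypothetical scores standing in for the output of a learned model.
scores = {"doc_a": 0.2, "doc_b": 0.9, "doc_c": 0.5}

ranking = rank(list(scores), scores.get)
print(ranking)  # ['doc_b', 'doc_c', 'doc_a']
```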
&lt;p&gt;&lt;a href="http://en.wikipedia.org/wiki/Learning_to_rank"&gt;&lt;a href="http://en.wikipedia.org/wiki/Learning_to_rank"&gt;http://en.wikipedia.org/wiki/Learning_to_rank&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;</description><link>http://note.yen.tw/post/13454892517</link><guid>http://note.yen.tw/post/13454892517</guid><pubDate>Mon, 28 Nov 2011 10:25:00 -0500</pubDate></item><item><title>Ranking Refinement and  its Application to Information  Retrieval</title><description>&lt;p&gt;&lt;a href="http://www.cs.pitt.edu/~valizadegan/Publications/WWW_presentation.pdf"&gt;&lt;a href="http://www.cs.pitt.edu/~valizadegan/Publications/WWW_presentation.pdf"&gt;http://www.cs.pitt.edu/~valizadegan/Publications/WWW_presentation.pdf&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://www2008.org/papers/pdf/p397-jinA.pdf"&gt;http://www2008.org/papers/pdf/p397-jinA.pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We consider the problem of ranking refinement, i.e., to improve the accuracy of an existing ranking function with a small set of labeled instances. We are, particularly, interested in learning a better ranking function using two complementary sources of information, ranking information given by the existing ranking function (i.e., the base ranker) and that obtained from users&amp;#8217; feedbacks. This problem is very important in information retrieval where feedbacks are gradually collected. The key challenge in combining the two sources of information arises from the fact that the ranking information presented by the base ranker tends to be imperfect and the ranking information obtained from users&amp;#8217; feedbacks tends to be noisy. We present a novel boosting algorithm for ranking refinement that can effectively leverage the uses of the two sources of information. Our empirical study shows that the proposed algorithm is effective for ranking refinement, and furthermore it significantly outperforms the baseline algorithms that incorporate the outputs from the base ranker as an additional feature.&lt;/p&gt;</description><link>http://note.yen.tw/post/13454842124</link><guid>http://note.yen.tw/post/13454842124</guid><pubDate>Mon, 28 Nov 2011 10:23:22 -0500</pubDate></item><item><title>LETOR4.0 Datasets</title><description>&lt;p&gt;&lt;a href="http://research.microsoft.com/en-us/um/beijing/projects/letor/letor4dataset.aspx"&gt;&lt;a href="http://research.microsoft.com/en-us/um/beijing/projects/letor/letor4dataset.aspx"&gt;http://research.microsoft.com/en-us/um/beijing/projects/letor/letor4dataset.aspx&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;LETOR is a package of benchmark data sets for research on LEarning TO Rank, which contains standard features, relevance judgments, data partitioning, evaluation tools, and several baselines. Version 1.0 was released in April 2007. Version 2.0 was released in Dec. 2007. Version 3.0 was released in Dec. 2008. This version, 4.0, was released in July 2009. Very different from previous versions (V3.0 is an update based on V2.0 and V2.0 is an update based on V1.0), LETOR4.0 is a totally new release. It uses the Gov2 web page collection (~25M pages) and two query sets from &lt;/span&gt;&lt;a href="http://ir.cis.udel.edu/million/index.html"&gt;Million Query track&lt;/a&gt;&lt;span&gt; of TREC 2007 and TREC 2008. We call the two query sets MQ2007 and MQ2008 for short. There are about 1700 queries in MQ2007 with labeled documents and about 800 queries in MQ2008 with labeled documents.&lt;/span&gt;&lt;/p&gt;</description><link>http://note.yen.tw/post/13454626070</link><guid>http://note.yen.tw/post/13454626070</guid><pubDate>Mon, 28 Nov 2011 10:15:02 -0500</pubDate></item><item><title>Mechanical turk</title><description>&lt;p&gt;&lt;a href="https://www.mturk.com/mturk/welcome"&gt;&lt;a href="https://www.mturk.com/mturk/welcome"&gt;https://www.mturk.com/mturk/welcome&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;</description><link>http://note.yen.tw/post/13454459620</link><guid>http://note.yen.tw/post/13454459620</guid><pubDate>Mon, 28 Nov 2011 10:08:27 -0500</pubDate></item><item><title>The web09-bst Dataset</title><description>&lt;p&gt;&lt;a href="http://boston.lti.cs.cmu.edu/Data/web08-bst/planning.html" target="_blank"&gt;&lt;a href="http://boston.lti.cs.cmu.edu/Data/web08-bst/planning.html"&gt;http://boston.lti.cs.cmu.edu/Data/web08-bst/planning.html&lt;/a&gt; &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The web09-bst dataset is a 25 terabyte dataset of about 1 billion web pages crawled in January and February, 2009. The crawl order was best-first search, using the OPIC metric. The crawl was started from about 28 million URLs that either i) had high OPIC values in a web graph produced from an earlier 200 million page crawl, or ii) were ranked highly by a commercial search engine for one of 4,000 sample queries in one of 10 languages. This dataset covers web content in English, Chinese, Spanish, Japanese, French, German, Arabic, Portuguese, Korean, and Italian.  More information about the dataset is available in our project planning document. This document is slightly outdated - for example, our plans for the seed URLs changed - but it is still the best description of our plans.  Dataset construction is in progress now. Current progress on the crawler and statistics for pages fetched can be found here. It is expected to be available to other researchers by April, 2009, under a TREC-style data license, for a small fee.&lt;/p&gt;</description><link>http://note.yen.tw/post/13454255049</link><guid>http://note.yen.tw/post/13454255049</guid><pubDate>Mon, 28 Nov 2011 10:00:00 -0500</pubDate></item><item><title>dmoz dataset</title><description>&lt;p&gt;&lt;a href="http://ce.sharif.ir/courses/84-85/2/ce324/assignments/files/assignDir4/dmozReadme.html" target="_blank"&gt;&lt;a href="http://ce.sharif.ir/courses/84-85/2/ce324/assignments/files/assignDir4/dmozReadme.html"&gt;http://ce.sharif.ir/courses/84-85/2/ce324/assignments/files/assignDir4/dmozReadme.html&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(year 2005)&lt;/p&gt;
&lt;p&gt;&lt;span&gt;This dataset consists of the &amp;#8220;&lt;/span&gt;&lt;a href="http://www.dmoz.org/Science/"&gt;Science&lt;/a&gt;&lt;span&gt;&amp;#8221; subtree of the &lt;/span&gt;&lt;a href="http://www.dmoz.org/"&gt;dmoz.org&lt;/a&gt;&lt;span&gt; web directory. It contains 11625 topics and 104853 documents. The topics are numbered by integers from 0 to 11624; the documents are numbered by integers from 0 to 104852. The topics are organized into a tree (topic 0 is the root). Each document belongs to one or more topics; however, the vast majority of documents &lt;/span&gt;&lt;span&gt;(102801 out of 104853) belong to exactly one topic.&lt;/span&gt;&lt;/p&gt;</description><link>http://note.yen.tw/post/13454190254</link><guid>http://note.yen.tw/post/13454190254</guid><pubDate>Mon, 28 Nov 2011 09:57:35 -0500</pubDate></item><item><title>Using the Wikipedia page-to-page link database</title><description>&lt;p&gt;&lt;a href="http://haselgrove.id.au/wikipedia.htm"&gt;&lt;a href="http://haselgrove.id.au/wikipedia.htm"&gt;http://haselgrove.id.au/wikipedia.htm&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;</description><link>http://note.yen.tw/post/12067596037</link><guid>http://note.yen.tw/post/12067596037</guid><pubDate>Sat, 29 Oct 2011 06:25:20 -0400</pubDate></item><item><title>Dataset information</title><description>&lt;p&gt;&lt;a href="http://kevinchai.net/datasets"&gt;&lt;a href="http://kevinchai.net/datasets"&gt;http://kevinchai.net/datasets&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;</description><link>http://note.yen.tw/post/12067576922</link><guid>http://note.yen.tw/post/12067576922</guid><pubDate>Sat, 29 Oct 2011 06:23:57 -0400</pubDate></item><item><title>Microsoft Research Projects</title><description>&lt;p&gt;&lt;a href="http://research.microsoft.com/en-us/projects/mslr/download.aspx"&gt;http://research.microsoft.com/en-us/projects/mslr/download.aspx&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;span&gt;We release two large-scale datasets for research on learning to rank: MSLR-WEB30k, with more than 30,000 queries, and MSLR-WEB10K, a random sample of it with 10,000 queries.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span&gt;Below are two rows from MSLR-WEB10K dataset:&lt;br/&gt;=============================================================&lt;br/&gt;0 qid:1&amp;#160;1:3&amp;#160;2:0&amp;#160;3:2&amp;#160;4:2 &amp;#8230; 135:0&amp;#160;136:0 &lt;br/&gt;2 qid:1&amp;#160;1:3&amp;#160;2:3&amp;#160;3:0&amp;#160;4:0 &amp;#8230; 135:0&amp;#160;136:0 &lt;br/&gt;=============================================================&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span&gt;In the data files, each row corresponds to a query-url pair. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span&gt;The first column is the relevance label of the pair, the second column is the query id, and the following columns are features. The larger the relevance label, the more relevant the query-url pair is. A query-url pair is represented by a 136-dimensional feature vector. &lt;/span&gt;&lt;/p&gt;
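&lt;p&gt;A minimal sketch of reading one such row, assuming whitespace-separated tokens in the order label, qid:ID, then index:value features. The function name parse_row and the shortened sample row are hypothetical; a real row carries all 136 features.&lt;/p&gt;

```python
def parse_row(line):
    """Split one row into (relevance label, query id, feature dict)."""
    tokens = line.split()  # also splits on non-breaking spaces
    label = int(tokens[0])
    qid = tokens[1].split(":", 1)[1]
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, qid, features


# Shortened, hypothetical row: only a few of the 136 features are shown.
row = "2 qid:1 1:3 2:3 3:0 4:0 135:0 136:0"
label, qid, features = parse_row(row)
print(label, qid, features[2])  # 2 1 3.0
```

&lt;p&gt;Keeping the features in a dictionary keyed by feature index means rows that omit some indices can still be read, at the cost of building a dense vector later.&lt;/p&gt;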
&lt;p&gt;&lt;span&gt;The details of features can be found &lt;a target="_self" href="http://research.microsoft.com/en-us/projects/mslr/feature.aspx"&gt;here&lt;/a&gt;.&lt;/span&gt;&lt;/p&gt;</description><link>http://note.yen.tw/post/8333625286</link><guid>http://note.yen.tw/post/8333625286</guid><pubDate>Mon, 01 Aug 2011 04:00:58 -0400</pubDate></item><item><title>AOL query data</title><description>&lt;p&gt;Here is the AOL data&amp;#160;: &lt;a href="http://www.gregsadetsky.com/aol-data/"&gt;&lt;a href="http://www.gregsadetsky.com/aol-data/"&gt;http://www.gregsadetsky.com/aol-data/&lt;/a&gt;&lt;/a&gt; (439MB)&lt;/p&gt;
&lt;p&gt;Site1: &lt;a href="http://www.atrus.org/hosted/AOL-data.tgz"&gt;http://www.atrus.org/hosted/AOL-data.tgz&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Site2: &lt;a href="http://aolsearchlogs.cloudsites.com/AOL-data.tgz"&gt;http://aolsearchlogs.cloudsites.com/AOL-data.tgz&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Site3: &lt;a href="http://sexygeeks.be/AOL-data.tgz"&gt;http://sexygeeks.be/AOL-data.tgz&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Site4: &lt;a href="http://www.upodcast.be/mirror/AOL-data.tgz"&gt;http://www.upodcast.be/mirror/AOL-data.tgz&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Site5: &lt;a href="http://www.leafyhost.com/AOL-data.tgz"&gt;&lt;a href="http://www.leafyhost.com/AOL-data.tgz"&gt;http://www.leafyhost.com/AOL-data.tgz&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;</description><link>http://note.yen.tw/post/4925670821</link><guid>http://note.yen.tw/post/4925670821</guid><pubDate>Mon, 25 Apr 2011 09:01:45 -0400</pubDate></item><item><title>Phantomjs </title><description>&lt;p&gt;&lt;a href="http://www.phantomjs.org/"&gt;&lt;a href="http://www.phantomjs.org/"&gt;http://www.phantomjs.org/&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;span&gt;PhantomJS can be fully scripted using JavaScript. It is an optimal solution for headless testing of web-based applications, site scraping, pages capture, SVG renderer, PDF converter and many other usages.&lt;/span&gt;&lt;/p&gt;</description><link>http://note.yen.tw/post/4254195919</link><guid>http://note.yen.tw/post/4254195919</guid><pubDate>Fri, 01 Apr 2011 03:35:17 -0400</pubDate></item><item><title>Free Python learning</title><description>&lt;p&gt;(1)&lt;a target="_blank" href="http://www.swaroopch.com/notes/Python"&gt;&lt;a href="http://www.swaroopch.com/notes/Python"&gt;http://www.swaroopch.com/notes/Python&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(2)&lt;a target="_blank" href="http://learnpythonthehardway.org/"&gt;http://learnpythonthehardway.org/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(3)&lt;a href="http://en.wikibooks.org/wiki/Non-Programmer's_Tutorial_for_Python_2.6"&gt;http://en.wikibooks.org/wiki/Non-Programmer's_Tutorial_for_Python_2.6&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(4)&lt;a target="_blank" href="http://en.wikibooks.org/wiki/Python_Programming"&gt;http://en.wikibooks.org/wiki/Python_Programming&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(5)&lt;a target="_blank" href="http://docs.python.org/tutorial/index.html"&gt;http://docs.python.org/tutorial/index.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;(6)&lt;a target="_blank" href="http://www.greenteapress.com/thinkpython/thinkpython.html"&gt;&lt;a href="http://www.greenteapress.com/thinkpython/thinkpython.html"&gt;http://www.greenteapress.com/thinkpython/thinkpython.html&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;</description><link>http://note.yen.tw/post/4104723528</link><guid>http://note.yen.tw/post/4104723528</guid><pubDate>Sat, 26 Mar 2011 06:27:53 -0400</pubDate></item></channel></rss>

