dmoz dataset

http://ce.sharif.ir/courses/84-85/2/ce324/assignments/files/assignDir4/dmozReadme.html

(year 2005)

This dataset consists of the “Science” subtree of the dmoz.org web directory. It contains 11625 topics and 104853 documents. Topics are numbered by integers from 0 to 11624, and documents by integers from 0 to 104852. The topics are organized into a tree, with topic 0 as the root. Each document belongs to one or more topics; however, the vast majority of documents (102801 out of 104853, about 98%) belong to exactly one topic, and the remaining 2052 belong to two or more.
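
The structure described above (a topic tree rooted at topic 0, plus a document-to-topics mapping) could be held in memory roughly as in the following sketch. The on-disk file format is not described here, so the sketch uses small made-up data; the names `parent`, `doc_topics`, `depth`, and `membership_histogram` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy data, NOT taken from the real dataset.
# parent[t] is the parent topic of topic t; the root (topic 0) has parent -1.
parent = [-1, 0, 0, 1, 1, 2]

# doc_topics[d] lists the topics that document d belongs to.
# Most documents have exactly one topic; a few have more.
doc_topics = [[3], [4], [5], [3, 5], [1]]


def depth(topic, parent):
    """Return the number of edges from `topic` up to the root."""
    d = 0
    while parent[topic] != -1:
        topic = parent[topic]
        d += 1
    return d


def membership_histogram(doc_topics):
    """Count how many documents belong to 1, 2, 3, ... topics."""
    return Counter(len(topics) for topics in doc_topics)


hist = membership_histogram(doc_topics)
single_topic_share = hist[1] / len(doc_topics)

print("depth of topic 3:", depth(3, parent))
print("documents per membership count:", dict(hist))
print("share with exactly one topic:", single_topic_share)
```

On the real data, the same histogram would be expected to report 102801 documents with exactly one topic out of 104853, if the loader (not shown, since the file layout is an assumption) fills `doc_topics` correctly.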