Datasets

The datasets used in Doug Downey's PhD thesis are listed below, in the order they appear in the thesis. Each chapter's datasets are documented in a readme file; for more details, please see the thesis document.

Chapter 2: Monotonic Feature Algorithm

Training/test examples
Unlabeled examples
Corpus file (about 0.5 GB, compressed)
Examples in text format
Dictionary file for contexts (see readme)
Readme file explaining the parameter values in the labeled and unlabeled datasets

Chapter 3: Urns

Training/test examples
Precision measurements
Precision measurements information, including classes considered & tagging conventions
Readme file explaining the parameter values in the datasets

Chapter 4: REALM

Seed/test examples
Corpus file (about 100 MB, compressed)
Readme file explaining the parameter values in the datasets

Code & other data

Urns Code

The Java classes needed to execute Urns are provided here without separate documentation. The "probabilitiesForCounts" method learns urn parameters and computes probabilities for a given set of extraction parameters; an example of its use is given in main(). Note: The code requires both the Weka and Colt libraries.

HMM-T model

An HMM-T model file is provided for a model trained on the REALM corpus above. Each line gives a term, along with its 20-dimensional latent state distribution.
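
As a rough illustration of reading a file in this format, the sketch below parses each line into a term and its 20 probabilities. The exact layout is an assumption, not something stated on this page: fields are taken to be whitespace-separated, with the 20 values last on the line. The function names parse_hmmt_line and load_hmmt_model are hypothetical and are not part of the released code.

```python
from typing import Dict, Iterable, List, Tuple

NUM_STATES = 20  # dimensionality of the latent state distribution, as stated above


def parse_hmmt_line(line: str, num_states: int = NUM_STATES) -> Tuple[str, List[float]]:
    """Split one model line into (term, distribution).

    Assumption: fields are whitespace-separated and the last `num_states`
    fields are the probabilities. Everything before them is treated as the
    term, so a term containing spaces is kept intact.
    """
    fields = line.split()
    if len(fields) < num_states + 1:
        raise ValueError(
            f"expected a term plus {num_states} values, got {len(fields)} fields"
        )
    term = " ".join(fields[:-num_states])
    distribution = [float(value) for value in fields[-num_states:]]
    return term, distribution


def load_hmmt_model(
    lines: Iterable[str], num_states: int = NUM_STATES
) -> Dict[str, List[float]]:
    """Build a term -> distribution mapping from an iterable of lines."""
    model: Dict[str, List[float]] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        term, distribution = parse_hmmt_line(line, num_states)
        model[term] = distribution
    return model
```

Because load_hmmt_model accepts any iterable of lines, an open file handle can be passed to it directly. If the actual file uses a different delimiter (for example tabs or commas), only the split in parse_hmmt_line would need to change.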