Datasets

The datasets used in Doug Downey's PhD thesis are listed below, in thesis order. Each chapter's datasets are documented in a readme file; for more detail, see the thesis document.

Chapter 2: Monotonic Feature Algorithm

Chapter 3: Urns

Chapter 4: REALM

Code & other data

Urns Code

The Java classes needed to run Urns are provided here without separate documentation. The "probabilitiesForCounts" method learns the urn parameters and computes probabilities for a given set of extraction counts; an example of its use is given in main(). Note: the code requires both the Weka and Colt libraries.
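
For intuition, the sketch below implements a simplified, uniform special case of the calculation: every correct extraction is assumed to repeat with the same per-draw probability pC, and every error with pE, so Bayes' rule with binomial likelihoods gives the probability that an extraction seen k times in n draws is correct. The parameter names and values here (pC, pE, numC, numE) are hypothetical; the released code learns the urn parameters from data rather than fixing them.

    public class UrnsSketch {

        // P(x is correct | x was extracted k times in n draws), by Bayes' rule
        // with binomial likelihoods; the binomial coefficients cancel out.
        // pC / pE: assumed per-draw repeat rates for correct / error labels.
        // numC / numE: assumed number of correct / error labels in the urn.
        static double probCorrect(int k, int n, double pC, double pE,
                                  double numC, double numE) {
            double likCorrect = Math.pow(pC, k) * Math.pow(1 - pC, n - k);
            double likError = Math.pow(pE, k) * Math.pow(1 - pE, n - k);
            return (numC * likCorrect) / (numC * likCorrect + numE * likError);
        }

        public static void main(String[] args) {
            // Hypothetical parameter values, for illustration only.
            double pC = 0.01, pE = 0.001, numC = 500, numE = 5000;
            int n = 1000;
            for (int k : new int[] {1, 2, 5, 10}) {
                System.out.printf("seen %d times: P(correct) = %.4f%n",
                                  k, probCorrect(k, n, pC, pE, numC, numE));
            }
        }
    }

As expected under the model, the estimated probability of correctness rises sharply with the number of times an extraction is observed.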

HMM-T model

An HMM-T model file, trained on the REALM corpus above, is provided. Each line gives a term along with its 20-dimensional latent state distribution.
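
As a convenience, here is a small sketch for loading the model file into memory. It assumes whitespace-separated fields on each line (the term followed by 20 numbers); verify the delimiter against the file itself before relying on it.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class HmmTReader {

        // Load the model file as a map from term to its 20-dimensional latent
        // state distribution. Assumes one entry per line, whitespace-separated:
        // the term followed by 20 numbers.
        static Map<String, double[]> load(String path) throws IOException {
            Map<String, double[]> model = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] fields = line.trim().split("\\s+");
                    if (fields.length != 21) continue;  // skip malformed lines
                    double[] dist = new double[20];
                    for (int i = 0; i < 20; i++) {
                        dist[i] = Double.parseDouble(fields[i + 1]);
                    }
                    model.put(fields[0], dist);
                }
            }
            return model;
        }

        public static void main(String[] args) throws IOException {
            Map<String, double[]> model = load(args[0]);
            System.out.println("Loaded " + model.size() + " terms");
        }
    }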