Datasets
The datasets used in Doug Downey's PhD thesis are listed below, in thesis order. Each chapter's datasets are documented in an accompanying readme file;
for further details, please see the thesis document itself.
Chapter 2: Monotonic Feature Algorithm
Chapter 3: Urns
Chapter 4: REALM
Code & other data
Urns Code
The Java classes needed to run Urns are provided here, without separate documentation.
The "probabilitiesForCounts" method learns the urn parameters and computes probabilities for a given set of extraction counts; an example of its use is given in main(), and a hedged sketch of the call pattern appears below.
Note: the code requires both the Weka and Colt libraries.
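Since the code ships without documentation, the following sketch illustrates one plausible call pattern. Only probabilitiesForCounts and main() are named in the distribution; the class name Urns, its constructor, and the exact method signature below are assumptions made for illustration. main() in the distributed source is the authoritative example, and the Weka and Colt jars must be on the classpath.

    // Hypothetical usage sketch -- class name, constructor, and signature
    // are assumed; consult main() in the distributed source for the real API.
    public class UrnsExample {
        public static void main(String[] args) {
            // Extraction counts for a set of candidate extractions
            // (e.g., how many times each candidate was extracted from the corpus).
            int[] counts = {47, 12, 3, 1};

            Urns urns = new Urns(); // assumed constructor
            // Learns urn parameters from the counts, then returns, for each
            // candidate, the probability that its extraction is correct.
            double[] probs = urns.probabilitiesForCounts(counts);

            for (int i = 0; i < counts.length; i++) {
                System.out.printf("count=%d  P(correct)=%.3f%n", counts[i], probs[i]);
            }
        }
    }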
HMM-T model
An HMM-T model file is provided, trained on the REALM corpus above. Each line gives a term along with its 20-dimensional latent state distribution.
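For anyone loading the file programmatically, the sketch below shows one plausible reader. It assumes a plain-text format of whitespace-separated fields, the term followed by its 20 state probabilities; the delimiter and field layout are assumptions, so check an actual line of the file before relying on this.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a reader for the HMM-T model file. Assumed format: each line
    // is a term followed by 20 whitespace-separated probabilities.
    public class HmmtReader {
        public static Map<String, double[]> read(String path) throws IOException {
            Map<String, double[]> dist = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.trim().split("\\s+");
                    if (f.length != 21) continue; // skip lines not matching: term + 20 values
                    double[] p = new double[20];
                    for (int i = 0; i < 20; i++) {
                        p[i] = Double.parseDouble(f[i + 1]);
                    }
                    dist.put(f[0], p); // latent state distribution for this term
                }
            }
            return dist;
        }

        public static void main(String[] args) throws IOException {
            Map<String, double[]> model = read(args[0]);
            System.out.println("Loaded " + model.size() + " terms.");
        }
    }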