teLEX

An implementation of the LEX algorithm for locating named entities
by Colin Bayer and Doug Downey

teLEX is a scalable system that runs the "LEX" algorithm for automatically locating named entities in text. This zip file includes the Java source files, the required libraries, and a script demonstrating use of the code. Instructions for using the codebase are given below (see "Executing the teLEX code").

teLEX expects an "n-gram database" in the format of the Google n-grams dataset, in compressed form. The code was designed for scalability: it operates effectively even with the massive Google n-grams database, and it allows the database to remain compressed on disk (see "Performance" below). If you do not have access to the Google n-grams, a program for creating an n-gram database from a corpus is included; see "Creating an n-gram database" below.

Executing the teLEX code

The code executes in five stages, listed below in order. Usage details for any stage can be obtained by running that stage without command-line arguments.

java QueryExtractor [-nonlp] <input directory>
java QuerySorter <extracted queries iqf directory>
java QueryRunner <sorted queries iqf file> <nGram database directory>
java QueryRouter [-fh n] <resolved queries file>
java LEXEvaluator [-nonlp] [-counts <count file path>] <tau> <delta> <input text directory> <resolved and routed queries directory>

Note: tau and delta are parameters of the LEX algorithm, described in the paper listed under "References" below. We have found that a tau of 1E-8 and a delta of 0.4 work relatively well with the Google n-grams database.

For concreteness, the example script demonstrates running the above Java programs with specific parameter values, along with the classpath variables and JVM directives that were effective in an example run. The directory locations used in the script will likely differ from those on your system.

Performance

In an experiment on a relatively standard desktop computer, the teLEX system processed 50,000 documents averaging 30 KB each, for a total of 1.5 GB of data, in about 8.5 hours using the Google n-grams database. Because the runtime scales sub-linearly with corpus size (it tends to scale with the number of unique n-grams in the corpus, which is typically sub-linear in the size of the corpus), faster execution times per KB may be possible for larger corpora. teLEX decompresses the files it needs from the Google n-grams at runtime, so the database does not need to be stored in uncompressed form; this saves roughly 100 GB of disk space.
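
To illustrate the idea of reading a compressed n-gram file at runtime without unpacking it to disk, here is a minimal Java sketch. It is not the teLEX implementation. The class and method names (GzipNGramReader, lookup, writeSample) are hypothetical, and the sketch assumes gzip-compressed files whose lines hold the n-gram tokens separated by spaces, then a tab, then the count.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; not part of the teLEX codebase.
final class GzipNGramReader {

    /**
     * Streams a gzip-compressed n-gram file and returns counts for the
     * requested n-grams only. Assumed line format: "tok1 tok2 ...<TAB>count".
     */
    static Map<String, Long> lookup(Path gzFile, Set<String> wanted) throws IOException {
        Map<String, Long> found = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                int tab = line.lastIndexOf('\t');
                if (tab < 0) {
                    continue; // skip lines that do not match the assumed format
                }
                String ngram = line.substring(0, tab);
                if (wanted.contains(ngram)) {
                    found.put(ngram, Long.parseLong(line.substring(tab + 1).trim()));
                }
            }
        }
        return found;
    }

    /** Writes a tiny gzip-compressed sample file so the sketch can run on its own. */
    static Path writeSample(Path gzFile) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(gzFile)), StandardCharsets.UTF_8)) {
            out.write("new york\t1000\n");
            out.write("new york city\t400\n");
            out.write("york city\t450\n");
        }
        return gzFile;
    }

    public static void main(String[] args) throws IOException {
        Path sample = writeSample(Files.createTempFile("ngrams", ".gz"));
        // Only the requested n-grams are kept in memory; the file stays compressed on disk.
        System.out.println(lookup(sample, Set.of("new york", "york city")));
        Files.delete(sample);
    }
}
```

Because the file is decompressed as a stream and only the requested n-grams are retained, memory use in this sketch depends on the number of distinct queries rather than on the size of the database file.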

Creating an n-gram database

The following Java program will create an n-gram database from a corpus of text:
java NGramDBFromText <corpus directory> <n-gram database output> <max n-gram size>

Note that this program requires that the n-gram database be small enough to fit in memory, and has not been tested as thoroughly as the main teLEX system described above.
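
As a rough picture of what building n-gram counts up to a maximum size involves, the following Java sketch counts every n-gram of length 1 through maxN in a hash map. It is not the NGramDBFromText implementation: the class name NGramCounter, the whitespace tokenization, and the in-memory map are all assumptions made for illustration, and the sketch does not write the Google n-grams file format.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the teLEX codebase.
final class NGramCounter {

    /** Counts every n-gram of length 1..maxN in whitespace-tokenized lines of text. */
    static Map<String, Long> count(List<String> lines, int maxN) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // nothing to count on blank lines
            }
            String[] tokens = trimmed.split("\\s+");
            for (int start = 0; start < tokens.length; start++) {
                StringBuilder ngram = new StringBuilder();
                // Extend the n-gram one token at a time, up to maxN tokens.
                for (int n = 1; n <= maxN && start + n <= tokens.length; n++) {
                    if (n > 1) {
                        ngram.append(' ');
                    }
                    ngram.append(tokens[start + n - 1]);
                    counts.merge(ngram.toString(), 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(List.of("new york city", "new york"), 2);
        counts.forEach((ngram, c) -> System.out.println(ngram + "\t" + c));
    }
}
```

Since every distinct n-gram becomes a map entry, a structure like this grows with the number of unique n-grams in the corpus, which is one way to see why an in-memory approach only suits smaller corpora.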

References

If you use the teLEX code, please cite the following paper:
Locating Complex Named Entities in Web Text. Doug Downey, Matthew Broadhead, Oren Etzioni. IJCAI 2007.