teLEX
An implementation of the LEX algorithm for locating named entities
by Colin Bayer and Doug Downey
teLEX is a scalable system that executes the "LEX" algorithm for automatically locating named entities in text.
The Java source files, required libraries, and a script demonstrating the use of the code are
included in this zip file. Instructions on how to use the
codebase are given below (see ``Executing the teLEX code''). Note that teLEX assumes an "n-gram database" formatted in
the manner of the Google n-grams dataset. The teLEX code was designed for scalability: it operates
effectively even on the massive Google n-grams database, while allowing the database
to be stored in compressed form (see ``Performance'' below).
If you do not have access to the Google n-grams, there is a script for creating an n-gram database from a corpus;
see ``Creating an n-gram database'' below.
Executing the teLEX code
The code executes in five stages, as detailed below. Detailed information about usage
for any stage can be obtained by executing the stage without command-line arguments.
java QueryExtractor [-nonlp] <input directory>
java QuerySorter <extracted queries iqf directory>
java QueryRunner <sorted queries iqf file> <nGram database directory>
java QueryRouter [-fh n] <resolved queries file>
java LEXEvaluator [-nonlp] [-counts <count file path>] <tau> <delta> <input text directory> <resolved and routed queries directory>
Note: Tau and delta are parameters of the LEX algorithm, detailed in the paper below (see ``References'').
We have found that a tau of 1E-8 and a delta
of 0.4 seem to work relatively well when using the Google n-grams database.
For concreteness, the included example script demonstrates running the above Java programs with concrete parameter values,
along with classpath variables and JVM directives that were effective in an example run. Of course, the particular
directory locations used in the script may not correspond to those on your system.
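The five stages form a simple linear pipeline, which a small driver can chain together. The following Python sketch assembles the command line for each stage; the directory paths, tau, delta, and classpath shown are illustrative placeholders, not values from the teLEX distribution:

```python
# Sketch of a driver that assembles the five teLEX stage commands in order.
# All paths, the classpath, and the tau/delta values are placeholders;
# adapt them (and any JVM flags) to your own setup.

def build_pipeline(input_dir, queries_dir, sorted_iqf, ngram_db,
                   resolved, routed_dir, tau="1E-8", delta="0.4"):
    """Return the argv list for each of the five teLEX stages, in order."""
    java = ["java", "-cp", "."]  # placeholder classpath
    return [
        java + ["QueryExtractor", input_dir],
        java + ["QuerySorter", queries_dir],
        java + ["QueryRunner", sorted_iqf, ngram_db],
        java + ["QueryRouter", resolved],
        java + ["LEXEvaluator", tau, delta, input_dir, routed_dir],
    ]

if __name__ == "__main__":
    import subprocess
    for cmd in build_pipeline("input/", "queries/", "sorted.iqf",
                              "ngrams/", "resolved.txt", "routed/"):
        print(" ".join(cmd))
        # subprocess.run(cmd, check=True)  # uncomment to actually execute
```

Each stage consumes the previous stage's output on disk, so the stages can also be run by hand, one at a time, as shown above.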
Performance
In an experiment on a relatively standard desktop computer, the teLEX system processed 50,000 documents averaging 30KB each (roughly 1.5GB of text) in about 8.5 hours,
using the Google n-grams database. Because the runtime scales sub-linearly with corpus size (in particular, it tends to scale with the number of
unique n-grams in the corpus, which is typically sub-linear in the size of the corpus), faster execution times per KB may be possible
for larger corpora. teLEX uncompresses the files it needs from the Google n-grams at runtime, and thus does not require that the
Google n-grams database be stored in uncompressed form (this capability saves roughly 100G of disk space).
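The decompress-at-runtime idea can be illustrated in miniature. The Python sketch below (teLEX itself is Java, and the "ngram TAB count" line format here is an invented stand-in for the real shard layout) streams counts out of a gzipped shard without ever writing an uncompressed copy to disk:

```python
import gzip

def lookup_counts(gz_path, wanted):
    """Stream a gzipped n-gram shard ("ngram<TAB>count" per line) and
    return counts for the requested n-grams, decompressing on the fly
    rather than materializing the uncompressed file on disk."""
    found = {}
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").rsplit("\t", 1)
            if ngram in wanted:
                found[ngram] = int(count)
    return found

if __name__ == "__main__":
    # Build a tiny gzipped shard purely for demonstration.
    with gzip.open("shard.gz", "wt", encoding="utf-8") as f:
        f.write("named entity\t1234\nthe the\t99\n")
    print(lookup_counts("shard.gz", {"named entity"}))
```

Only the shard currently being read is decompressed, and only in memory, which is the property that saves the ~100G of disk space mentioned above.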
Creating an n-gram database
The following java program will create an n-gram database from a corpus of text:
java NGramDBFromText <corpus directory> <n-gram database output> <max n-gram size>
Note that this program requires that the n-gram database be small enough to fit in memory, and has not
been tested as thoroughly as the main teLEX system described above.
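For intuition, the core of such a database builder can be sketched in a few lines. The Python sketch below is not the NGramDBFromText program itself (its on-disk format is not shown here); it simply counts all n-grams up to a maximum size over tokenized sentences, held entirely in memory:

```python
from collections import Counter

def count_ngrams(sentences, max_n):
    """Count all n-grams of size 1..max_n over tokenized sentences.
    Every count lives in a single in-memory Counter, which is why the
    resulting database must be small enough to fit in memory."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

if __name__ == "__main__":
    corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
    db = count_ngrams(corpus, 2)
    print(db["the cat"])  # the bigram "the cat" occurs in both sentences
```

The in-memory Counter is the limiting factor: for a corpus the size of the Web, the set of unique n-grams no longer fits, which is why teLEX instead consumes a precomputed, compressed database such as the Google n-grams.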
References
If you use the teLEX code, please cite the following paper:
Locating Complex Named Entities in Web Text.
Doug Downey, Matthew Broadhead, Oren Etzioni. (IJCAI 2007).