| langdetect
A statistical language and encoding detector
Langdetect is a simple statistical language and encoding detector.
It is written in Perl and C. The Perl engine runs offline (and only once):
it gathers sample data for each language, pre-processes it and produces
C++ source code for a fast language detector. The C++ source code can then be
compiled to produce the detector itself, or modified and embedded in other projects.
The basic detection algorithm enhances the one found in Gertjan van Noord's textcat program.
Besides producing a fast C++ detector, langdetect
uses floating-point arithmetic and is largely independent
of the length of the sample data files.
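
To give a feel for this family of detectors, here is a minimal sketch of n-gram profile matching with floating-point scores. It is not langdetect's actual algorithm or the generated code: the names Profile, make_profile, similarity, Language and detect are made up for illustration, and the choice of byte trigrams with cosine similarity is an assumption. Because the profiles hold relative frequencies, the score does not depend on how long each sample is.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical profile type: relative frequency of each byte trigram.
typedef std::map<std::string, double> Profile;

// Builds a trigram profile. Counts are divided by the number of trigrams,
// so profiles built from samples of different lengths stay comparable.
Profile make_profile(const std::string& text) {
    Profile p;
    if (text.size() < 3) return p;
    const std::size_t total = text.size() - 2;
    for (std::size_t i = 0; i < total; ++i)
        p[text.substr(i, 3)] += 1.0;
    for (Profile::iterator it = p.begin(); it != p.end(); ++it)
        it->second /= static_cast<double>(total);
    return p;
}

// Cosine similarity between two profiles (assumed scoring rule).
// Returns 0 when either profile is empty.
double similarity(const Profile& a, const Profile& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (Profile::const_iterator it = a.begin(); it != a.end(); ++it) {
        na += it->second * it->second;
        Profile::const_iterator other = b.find(it->first);
        if (other != b.end()) dot += it->second * other->second;
    }
    for (Profile::const_iterator it = b.begin(); it != b.end(); ++it)
        nb += it->second * it->second;
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Hypothetical pairing of a language name with its sample profile.
struct Language {
    std::string name;
    Profile profile;
};

// Returns the name of the best-scoring language, or "" if nothing matches.
std::string detect(const std::string& text, const std::vector<Language>& langs) {
    const Profile p = make_profile(text);
    std::string best;
    double best_score = 0.0;
    for (std::size_t i = 0; i < langs.size(); ++i) {
        const double s = similarity(p, langs[i].profile);
        if (s > best_score) {
            best_score = s;
            best = langs[i].name;
        }
    }
    return best;
}
```

In this sketch the profiles are built at run time; in langdetect the equivalent tables are computed once by the Perl engine and emitted as C++ source.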
Langdetect also attempts to guess the encoding of the data file (although you
may need to experiment a bit with the sample data files to get good results).
Don't be scared by the contents of the generated C++ file: the tables
produced by the Perl engine are optimized for size, since with many
sample files the source can grow to several megabytes. When editing,
you can simply take a look at the structure declarations at the top
and then jump straight to the final lines, which contain the real
detection algorithm.
The Perl engine generates a simple main() routine that opens the
file specified as a command-line argument, loads its first 4KB and
applies the detection algorithm to it. If you're going to
embed the source in some other program (remember the GPL!) then
you will typically remove the main routine and paste in some
other implementation that can be called from external files.
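
As a rough illustration of such a replacement, the sketch below reads the first 4KB of a file and hands it to the detector through a function that other source files can call. The names detect_language, read_prefix, detect_file and kSampleBytes are hypothetical, and detect_language is only a stub here: the routine in the generated file may have a different name and signature, so check the final lines of your generated source.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the generated detection routine. Replace it
// with a call into the real code found at the end of the generated file.
std::string detect_language(const char* data, std::size_t len) {
    (void)data;
    return len == 0 ? "empty" : "unknown";
}

// Number of leading bytes to examine, matching the 4KB that the
// generated main() loads.
const std::size_t kSampleBytes = 4096;

// Reads at most kSampleBytes from the start of `path` into `out`.
// Returns false if the file cannot be opened.
bool read_prefix(const char* path, std::string& out) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    char buf[kSampleBytes];
    const std::size_t n = std::fread(buf, 1, sizeof buf, f);
    std::fclose(f);
    out.assign(buf, n);
    return true;
}

// Entry point intended to be called from other source files.
// Returns "error" when the file cannot be read.
std::string detect_file(const char* path) {
    std::string sample;
    if (!read_prefix(path, sample)) return "error";
    return detect_language(sample.data(), sample.size());
}
```

A caller would then declare detect_file in a small header and link against the generated source instead of running the detector as a separate program.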
Besides Perl (for generating the source file) there are no other special
requirements. In particular, the only runtime requirement is the standard C library.
This program is covered by the GPL: it is free software :)
Happy langdetecting! :)
|