langdetect

A statistical language and encoding detector

Langdetect is a simple statistical language and encoding detector. It is written in Perl and C++. The Perl engine works offline (and only once): it gathers sample data for each language, pre-processes it and produces the C++ source code of a fast language detector. That C++ source can then be compiled to produce the detector itself, or modified and embedded in other projects.

The basic detection algorithm enhances the one found in Gertjan van Noord's textcat program. Besides producing a fast C++ detector, langdetect uses floating-point arithmetic and is largely independent of the length of the sample data files. Langdetect also attempts to guess the encoding of the data file (although you may need to experiment a bit with the sample data files to get good results).
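To give a rough idea of what a textcat-style statistical detector does, here is a minimal sketch. It is not the code langdetect generates: the names (trigram_profile, similarity, guess_language, LanguageSample) are made up for illustration, and cosine similarity over byte trigrams is an assumed scoring rule, chosen only because it shows how floating-point relative frequencies make the result independent of sample length.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Relative frequency of each byte trigram in `text`.
// Dividing by the trigram count keeps the profile independent of text length.
std::map<std::string, double> trigram_profile(const std::string& text) {
    std::map<std::string, double> profile;
    if (text.size() < 3) return profile;
    const double total = static_cast<double>(text.size() - 2);
    for (std::size_t i = 0; i + 3 <= text.size(); ++i)
        profile[text.substr(i, 3)] += 1.0 / total;
    return profile;
}

// Cosine similarity between two profiles: 0 means no shared trigrams,
// 1 means identical frequency distributions. (Assumed scoring rule.)
double similarity(const std::map<std::string, double>& a,
                  const std::map<std::string, double>& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (const auto& kv : a) {
        na += kv.second * kv.second;
        const auto it = b.find(kv.first);
        if (it != b.end()) dot += kv.second * it->second;
    }
    for (const auto& kv : b) nb += kv.second * kv.second;
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Hypothetical container for one language's sample text.
struct LanguageSample {
    std::string name;
    std::string text;
};

// Return the name of the best-matching sample, or "unknown" if nothing matches.
std::string guess_language(const std::string& input,
                           const std::vector<LanguageSample>& samples) {
    const std::map<std::string, double> input_profile = trigram_profile(input);
    std::string best = "unknown";
    double best_score = 0.0;
    for (const auto& s : samples) {
        const double score = similarity(input_profile, trigram_profile(s.text));
        if (score > best_score) {
            best_score = score;
            best = s.name;
        }
    }
    return best;
}
```

In this sketch the sample profiles are rebuilt on every call. A generator in the spirit of langdetect would instead compute them once, offline, and emit them as static tables.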

Don't be scared by the contents of the generated C++ file: the tables produced by the Perl engine are optimized for size, since with many sample files the source can grow to several megabytes. When editing, you can simply look at the structure declarations at the top and then jump straight to the last lines, which contain the actual detection algorithm.

The Perl engine generates a simple main() routine that opens the file given as a command-line argument, loads its first 4KB and applies the detection algorithm to it. If you're going to embed the source in some other program (remember the GPL!), you will typically remove the main routine and paste in another implementation that can be called from other source files.
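As an illustration of the kind of replacement you might paste in, here is a minimal sketch. The detect() function below is only a placeholder for the generated detection routine, whose real name and signature may differ. load_prefix() and detect_file() are hypothetical helpers, not part of langdetect.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Placeholder for the generated detection routine (assumption: the real
// one has a different name, signature and, of course, real logic).
std::string detect(const std::string& data) {
    return data.empty() ? "unknown" : "placeholder";
}

// Read at most `limit` bytes from the start of the file at `path`.
// Returns false if the file cannot be opened.
bool load_prefix(const char* path, std::size_t limit, std::string& out) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    out.resize(limit);
    const std::size_t n = std::fread(&out[0], 1, limit, f);
    std::fclose(f);
    out.resize(n);  // keep only the bytes actually read
    return true;
}

// Entry point meant to be called from other source files instead of main():
// loads the first 4KB of the file and runs the detector on it.
std::string detect_file(const char* path) {
    std::string data;
    if (!load_prefix(path, 4096, data)) return "error";
    return detect(data);
}
```

A caller would then use something like detect_file("sample.txt") and handle the "error" result itself.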

Besides Perl (for generating the source file) there are no other special requirements. In particular, the only runtime requirement is the standard C library.

This program is covered by the GPL: it is free software :)

Happy langdetecting! :)

download

langdetect.tar.bz2 [256713 bytes]
