During Christmas, I tried a small fun hack with language modeling, which obviously requires training and evaluating an LM. There are many ways to do so, but here is a method I really like: using the Python interface of KenLM.
So here is a note for myself (largely adapted from Victor Chahuneau):
To install KenLM, you first need to install Boost, and to build Boost you need libbz2.
- First, install libbz2:
sudo apt-get install libbz2-dev
- Then install Boost (I used 1.60.0): download it from the Boost website, unpack it, and from the Boost directory run
./bootstrap.sh
and finally
./b2 -j 4
Now we install KenLM; I am using Victor's copy:
git clone https://github.com/vchahun/kenlm.git
pushd kenlm
./bjam
python setup.py install
popd
Training an LM
Download some books from Project Gutenberg; I am using Chambers's Journal of Popular Literature, Science, and Art, No. 723, which gives the plain-text file 50780-0.txt.
So all you need to do to train a model is,
cat 50780-0.txt | /home/archan/src/github/kenlm/bin/lmplz -o 3 > yourLM.arpa
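The -o 3 flag asks lmplz for a trigram model. lmplz itself estimates probabilities with modified Kneser-Ney smoothing, but the raw ingredient is just n-gram counts. As a toy sketch only (the corpus and names here are made up, and this is plain maximum likelihood, not what lmplz actually does):

```python
from collections import Counter

def ngram_counts(tokens, order):
    """Count all n-grams of the given order in a token list."""
    return Counter(tuple(tokens[i:i + order])
                   for i in range(len(tokens) - order + 1))

# Tiny made-up corpus standing in for the Gutenberg text.
tokens = "i like science fiction and i like science".split()
trigrams = ngram_counts(tokens, 3)
bigrams = ngram_counts(tokens, 2)

# Maximum-likelihood estimate of P(fiction | like science):
mle = trigrams[("like", "science", "fiction")] / bigrams[("like", "science")]
```

Smoothing matters because most trigrams in a real text never occur; an unsmoothed MLE model would assign them probability zero.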
Then you can binarize the LM, which is the part I like about KenLM: it feels snappier and faster than the other toolkits I have used.
/home/archan/src/github/kenlm/bin/build_binary yourLM.arpa yourLM.klm
Evaluate a Sentence with an LM
Write a Python script like this:
import kenlm
model = kenlm.LanguageModel('yourLM.klm')
score = model.score('i like science fiction')
print(score)
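The score KenLM returns is a log10 probability, computed by the backoff lookup that the ARPA file encodes: use the longest n-gram the model has seen, otherwise back off to a shorter history and pay a backoff penalty. Here is a toy sketch of that idea with invented numbers (this is not KenLM's code, and the OOV floor is arbitrary):

```python
# Toy ARPA-style tables: log10 probabilities and backoff weights
# (all numbers here are made up for illustration).
logprob = {
    ("science",): -1.0,
    ("like", "science"): -0.3,
}
backoff = {
    ("like",): -0.2,
}

def score(ngram):
    """Backoff scoring: use the longest matching n-gram, otherwise
    drop the oldest history word and add the history's backoff weight."""
    if ngram in logprob:
        return logprob[ngram]
    if len(ngram) == 1:
        return -7.0  # arbitrary floor for unseen words
    history = ngram[:-1]
    return backoff.get(history, 0.0) + score(ngram[1:])
```

So score(("like", "science")) hits the bigram entry directly, while score(("like", "fiction")) backs off through ("like",) down to the unseen-word floor.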