Quick-and-dirty language model with OpenGRM
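This assumes the corpus has already been preprocessed appropriately (tokenized, whitespace replaced with an entity token, etc.) and saved as corpus_filename.split. The script uses ngrammake's default smoothing method, which is Katz backoff; pass --method=kneser_ney to ngrammake to get Kneser-Ney instead.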
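The preprocessing itself is not shown in these notes. As a rough sketch only, assuming a character-level model in which every character becomes its own token and each run of whitespace is marked with a placeholder token, it could look like the following. The function name `split_chars` and the `<space>` label are made up for illustration and are not part of OpenGRM; the sketch also assumes the input never contains a literal `<space>` and that `sed` runs in a locale matching the corpus encoding.

```shell
#!/bin/sh
# Hypothetical preprocessing helper (not part of OpenGRM).
# Reads text on stdin and writes one line per input line, with every
# character separated by a space and whitespace runs shown as <space>.
split_chars() {
    sed -e 's/[[:space:]][[:space:]]*/ /g' \
        -e 's/^ //' -e 's/ $//' \
        -e '/^$/d' \
        -e 's/./& /g' -e 's/ $//' \
        -e 's/   / <space> /g'
    # 1st line: collapse each whitespace run to a single space
    # 2nd line: trim leading and trailing spaces
    # 3rd line: drop lines that are now empty
    # 4th line: put a space after every character, then trim the last one
    # 5th line: an original space is now three spaces wide; label it
}

# Example (file names are placeholders):
# split_chars < corpus.txt > corpus.split
```

With a file produced this way, the script below would be called with `corpus` as its second argument, since it appends `.split` itself.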
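#!/bin/sh
# usage: buildlm.sh ngram_order corpus_basename   (reads corpus_basename.split)
NGRAM_SIZE=${1:-3}    # n-gram order, defaults to 3
FNAME=$2
if [ -z "$FNAME" ]; then
    echo "usage: buildlm.sh ngram_order corpus_basename" >&2
    exit 1
fi
echo "ng: $NGRAM_SIZE"
set -x
set -e
# Build the symbol table from the tokenized corpus.
ngramsymbols < "$FNAME.split" > "$FNAME.syms"
# Compile each corpus line into an FST and store them in a FAR archive.
farcompilestrings -symbols="$FNAME.syms" -keep_symbols=1 "$FNAME.split" > "$FNAME.far"
# Count n-grams up to the requested order.
ngramcount --order="$NGRAM_SIZE" < "$FNAME.far" > "$FNAME.$NGRAM_SIZE.counts"
# Smooth and normalize the counts using ngrammake's default method.
ngrammake "$FNAME.$NGRAM_SIZE.counts" > "$FNAME.$NGRAM_SIZE.smoothed.mod"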