baseline LM訓練指令

requirements

  • Chinese Gigaword Dataset (the CNA part)

  • Kaldi

  • MITLM (安裝步驟看這篇)

流程

(1) 算count

gunzip -c cna.split.txt.gz  | cut -d' ' -f2- | ngram-count -text /dev/stdin -write count.split.txt.gz

  (2) Smoothing

~/MITLM/bin/estimate-ngram -c count.split.txt.gz -o 3 -s ModKN -wl lm0.arpa.gz

  (3) Pruning+砍掉OOV

(tmp_vocab是字典的第一個column)

ngram -debug 1 -order 3 -lm lm0.arpa.gz -vocab tmp_vocab -limit-vocab -prune 1e-7 -prune-lowprobs -unk -renorm -write-lm lang_test/lm.gz

  (4) 把LM變成G.fst

(在lang_test底下)

gunzip -c pruned.arpa.gz | arpa2fst - - | fstprint | eps2disambig.pl | s2eps.pl | fstcompile --isymbols="words.txt" --osymbols="words.txt" --keep_isymbols=false --keep_osymbols=false | fstrmepsilon | fstdeterminizestar --use-log=true | fstarcsort --sort_type=ilabel > "G.fst"