The CMU Statistical Language Modeling Toolkit

Authors: Ronald Rosenfeld & Philip Clarkson
Updated: Mon 07 June 1999
Source: http://www.speech.cs.cmu.edu/SLM_info.html
Type: software toolkit
Languages: N/A
Keywords: language, data, experiment, word-frequency
Open Access: yes
License:
Documentation: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html
Publications: Clarkson, P.R. & Rosenfeld, R. (1997). Statistical Language Modeling Using the CMU-Cambridge Toolkit. In Proceedings of ESCA Eurospeech 1997.
Citation: Rosenfeld R. & Clarkson P. (1999). The CMU Statistical Language Modeling Toolkit. Carnegie Mellon University. http://www.speech.cs.cmu.edu/SLM_info.html
Summary:

The Carnegie Mellon Statistical Language Modeling (CMU SLM) Toolkit is a set of Unix software tools designed to facilitate language modeling work in the research community. Version two is no longer limited to bigram and trigram models; it supports n-grams of arbitrary size. It also supports several discounting schemes, rather than limiting the user to the Good-Turing discounting strategy used in version one. In addition, the tools used to count word n-grams, vocabulary n-grams and id n-grams have been rewritten to greatly increase their speed of operation. Other changes include a more flexible way of handling context cues, the ability to calculate probabilities from ARPA-format language models, the ability to force the model to back off under certain circumstances (for example, if there is an unknown word in the context), and support for gzip-compressed files as well as files compressed with the compress utility.
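
To make "n-grams of arbitrary size" concrete, the following is a minimal Python sketch of counting word n-grams for any n. It is not the toolkit's code or its interface; the function name count_ngrams and the whitespace tokenization are assumptions chosen for illustration only.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_ngrams(tokens: Iterable[str], n: int) -> Counter:
    """Count every contiguous run of n tokens (hypothetical helper, not a toolkit API)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    seq = list(tokens)
    # Slide a window of width n over the sequence; a sequence shorter
    # than n yields no windows and therefore an empty Counter.
    windows = (tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    return Counter(windows)


if __name__ == "__main__":
    # Assumption: tokens are separated by whitespace.
    text = "the cat sat on the mat the cat slept"
    for order in (1, 2, 3):
        counts = count_ngrams(text.split(), order)
        print(order, counts.most_common(2))
```

The same function serves unigrams, bigrams, trigrams or any higher order, which is the generalization the summary describes; discounting and back-off would operate on counts of this kind.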