C Y Suen.
Abstract
n-gram (n = 1 to 5) statistics and other properties of the English language were derived for applications in natural language understanding and text processing. They were computed from a well-known corpus composed of 1 million word samples. Similar properties were also derived from the most frequent 1000 words of three other corpuses. The positional distributions of n-grams obtained in the present study are discussed. Statistical studies on word length and trends of n-gram frequencies versus vocabulary are presented. In addition to a survey of n-gram statistics found in the literature, a collection of n-gram statistics obtained by other researchers is reviewed and compared.
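As a rough illustration of the kind of statistics the abstract describes, the following minimal Python sketch counts n-gram frequencies and their positional distributions over a word list. It assumes that the n-grams are letter sequences within words and that "position" means the zero-based offset of the n-gram inside its word; the abstract does not spell out either point. The function names `letter_ngrams` and `ngram_statistics` and the sample sentence are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def letter_ngrams(word, n):
    """Yield (start_offset, ngram) for every letter n-gram in `word`."""
    for i in range(len(word) - n + 1):
        yield i, word[i:i + n]


def ngram_statistics(words, n):
    """Return (freq, positions) for letter n-grams of length `n`.

    freq[gram]           -> total number of occurrences of `gram`
    positions[gram][pos] -> occurrences of `gram` starting at offset `pos`
    Assumption: n-grams never span word boundaries.
    """
    freq = Counter()
    positions = defaultdict(Counter)
    for word in words:
        for pos, gram in letter_ngrams(word.lower(), n):
            freq[gram] += 1
            positions[gram][pos] += 1
    return freq, positions


# Hypothetical toy sample; the study itself used a 1-million-word corpus.
sample = "the cat sat on the mat with the other cats".split()

freq, positions = ngram_statistics(sample, 2)
word_lengths = Counter(len(w) for w in sample)  # word-length distribution

print(freq.most_common(3))
print(dict(positions["th"]))
print(sorted(word_lengths.items()))
```

Calling `ngram_statistics` once for each n from 1 to 5 would cover the range named in the abstract. Keeping the positional counts in a separate mapping lets the overall frequency table stay a plain `Counter`.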
Year: 1979 PMID: 21868845 DOI: 10.1109/tpami.1979.4766902
Source DB: PubMed Journal: IEEE Trans Pattern Anal Mach Intell ISSN: 0098-5589 Impact factor: 6.226