
n-gram

An n-gram model is a type of probabilistic language model for predicting the next item in a sequence from the previous n − 1 items. More concisely, an n-gram model predicts x_i based on x_{i-(n-1)}, …, x_{i-1}. In probability terms, this is P(x_i | x_{i-(n-1)}, …, x_{i-1}). When used for language modeling, independence assumptions are made so that each word depends only on the last n − 1 words.




1-gram, 2-gram, 3-gram, ...
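To make the conditional probability concrete, here is a minimal sketch of a maximum-likelihood bigram estimate (the toy corpus and helper names are my own, not from the post):

```python
from collections import Counter

# Toy corpus; in practice this would be a large tokenized text collection.
corpus = "the dog smelled like a skunk the dog barked".split()

# Count unigrams and bigrams.
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("the", "dog"))      # 1.0 ("the" is always followed by "dog" here)
print(bigram_prob("dog", "smelled"))  # 0.5
```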







Applications and considerations

n-gram models are widely used in statistical natural language processing. In speech recognition, phonemes and sequences of phonemes are modeled using an n-gram distribution. For parsing, words are modeled such that each n-gram is composed of n words. For language identification, sequences of characters/graphemes (e.g. letters of the alphabet) are modeled for different languages.[4] For sequences of characters, the 3-grams (sometimes referred to as "trigrams") that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor" and so forth (sometimes the beginning and end of a text are modeled explicitly, adding "__g", "_go", "ng_", and "g__"). For sequences of words, the trigrams that can be generated from "the dog smelled like a skunk" are "# the dog", "the dog smelled", "dog smelled like", "smelled like a", "like a skunk" and "a skunk #".
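As a quick check of the examples above, here is a minimal sketch (the function names are my own) that generates character 3-grams from "good morning" and word trigrams with "#" boundary markers:

```python
def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(sentence, n=3):
    """Word n-grams with '#' markers padding the sentence boundaries."""
    tokens = ["#"] + sentence.split() + ["#"]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(char_ngrams("good morning"))
# ['goo', 'ood', 'od ', 'd m', ' mo', 'mor', 'orn', 'rni', 'nin', 'ing']
print(word_ngrams("the dog smelled like a skunk"))
# ['# the dog', 'the dog smelled', 'dog smelled like', 'smelled like a', 'like a skunk', 'a skunk #']
```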


 

"n-gram 은 통계적인 자연어 처리에서 널리 사용되는 모델"이다. 이 한마디로 어느정도 n-gram 에 대해 감을 잡을 수 있는 것 같았다..!!




Out-of-vocabulary words

An issue when using n-gram language models is out-of-vocabulary (OOV) words. They are encountered in computational linguistics and natural language processing when the input includes words which were not present in a system's dictionary or database during its preparation. By default, when a language model is estimated, the entire observed vocabulary is used. In some cases, it may be necessary to estimate the language model with a specific fixed vocabulary. In such a scenario, the n-grams in the corpus that contain an out-of-vocabulary word are ignored. The n-gram probabilities are smoothed over all the words in the vocabulary even if they were not observed.[7]

Nonetheless, it is essential in some cases to explicitly model the probability of out-of-vocabulary words by introducing a special token (e.g. <unk>) into the vocabulary. Out-of-vocabulary words in the corpus are effectively replaced with this special <unk> token before n-gram counts are accumulated. With this option, it is possible to estimate the transition probabilities of n-grams involving out-of-vocabulary words.[8]
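A minimal sketch of that replacement step (the vocabulary and function name below are assumptions for illustration): map every token outside a fixed vocabulary to <unk> before counting n-grams.

```python
from collections import Counter

# Assumed fixed vocabulary; in practice this is chosen before estimating the model.
vocab = {"the", "dog", "smelled", "like", "a", "skunk"}

def replace_oov(tokens, vocab, unk="<unk>"):
    """Replace every out-of-vocabulary token with the <unk> marker."""
    return [t if t in vocab else unk for t in tokens]

tokens = replace_oov("the cat smelled like a skunk".split(), vocab)
print(tokens)  # ['the', '<unk>', 'smelled', 'like', 'a', 'skunk']

# Bigram counts now include transitions involving <unk>.
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts[("the", "<unk>")])  # 1
```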


n-grams have an issue with OOV (when a word is not in the dictionary). By default those n-grams are simply ignored when estimating the n-gram probabilities. So approaches like handling unregistered words with a special <unk> token are also used.





n-grams for approximate matching

n-grams can also be used for efficient approximate matching. By converting a sequence of items to a set of n-grams, it can be embedded in a vector space, thus allowing the sequence to be compared to other sequences in an efficient manner. For example, if we convert strings with only letters in the English alphabet into single character 3-grams, we get a 26^3-dimensional space (the first dimension measures the number of occurrences of "aaa", the second "aab", and so forth for all possible combinations of three letters). Using this representation, we lose information about the string. For example, both the strings "abc" and "bca" give rise to exactly the same 2-gram "bc" (although {"ab", "bc"} is clearly not the same as {"bc", "ca"}). However, we know empirically that if two strings of real text have a similar vector representation (as measured by cosine distance) then they are likely to be similar. Other metrics have also been applied to vectors of n-grams with varying, sometimes better, results. For example, z-scores have been used to compare documents by examining how many standard deviations each n-gram differs from its mean occurrence in a large collection, or text corpus, of documents (which form the "background" vector). In the event of small counts, the g-score (also known as g-test) may give better results for comparing alternative models.
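Here is a minimal sketch of the vector-space idea (the helper names are my own, and cosine similarity is just one of the metrics mentioned): each string becomes a bag of character 3-grams, and two strings are compared by the cosine of their count vectors.

```python
import math
from collections import Counter

def trigram_vector(text):
    """Sparse count vector of character 3-grams."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(v1, v2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * v2[g] for g, c in v1.items())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Similar strings share many trigrams and get a high score.
print(cosine(trigram_vector("good morning"), trigram_vector("good evening")))
print(cosine(trigram_vector("good morning"), trigram_vector("something else entirely")))
```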

Another method for approximate matching is signature files. The study reported in [9] shows that a bit-sliced signature file can be compressed to a smaller size than an inverted file which is the standard way of implementing a vector space approach. With a signature width less than half the number of unique n-grams, the signature file method is about as fast as the inverted file method, and significantly smaller.

It is also possible to take a more principled approach to the statistics of n-grams, modeling similarity as the likelihood that two strings came from the same source directly in terms of a problem in Bayesian inference.

n-gram-based searching can also be used for plagiarism detection.


n-grams are also widely used for approximate matching. No idea what approximate matching even is... ㅠㅠ Anyway, they're used a lot in statistical approaches like Bayesian inference, and in plagiarism detection too~~




Other applications

n-grams find use in several areas of computer science, computational linguistics, and applied mathematics.

They have been used for a wide range of tasks.


n-grams are used in even more areas:

things like kernel design, typo correction, extracting keywords of interest, retrieving the documents a user is looking for when searching, improving search performance, predicting characters or words, cryptanalysis...!!
