A Beginner’s Guide to Natural Language Processing — Part 2

Ch Ravi Raj · Published in Analytics Vidhya
4 min read · Dec 3, 2020


Part 1 of this series introducing NLP covered preprocessing and text cleanup. In this article, you will find:

  1. Tokens.
  2. N-grams.
  3. Co-occurrence matrices.
  4. PMI — Pointwise Mutual Information.

Tokens

Tokens are the units that constitute a sentence. A token can be a sequence of characters, digits, punctuation marks, or a combination of these, usually separated by spaces.

Depending on the package used, tokens are identified differently. Some packages, like NLTK, treat punctuation marks as separate tokens, and contractions are split as well: the word can’t becomes the two tokens ca and n’t. With a library like Spacy, the author notes, it can remain a single word, although that creates another issue by keeping the punctuation attached to the word it came with. As seen in the example below, sigh, is made into 2 tokens by NLTK, but a single token by Spacy.

NLTK seems easier to work with, since it simply returns a list of tokens. However, when scaling to larger datasets and further downstream processing, Spacy does a faster job despite its overhead.
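
As a quick sketch of the two approaches (assuming NLTK's punkt models and spaCy's en_core_web_sm model are installed; the sample sentence is made up):

import nltk
import spacy

nltk.download("punkt", quiet=True)   # tokenizer models used by nltk.word_tokenize
nlp = spacy.load("en_core_web_sm")   # small English pipeline for Spacy

text = "I can't believe it. *sigh* Not again."

# NLTK returns a plain list of strings; "can't" is split into "ca" and "n't"
print(nltk.word_tokenize(text))

# Spacy returns a Doc of Token objects; take the text of each token
print([token.text for token in nlp(text)])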

Tokens usually carry very little meaning as individual entities. While they do convey the dictionary meaning to the machine, they do not give any context, nor direction to where the sentence is headed. To resolve this problem, we use N-Grams.

N-Grams

N-grams are combinations of N tokens that usually co-occur. For example, the word new occurs in many contexts, but the word york frequently occurs together with new, so we combine the two into new york to convey better information.

Combining 2 tokens (unigrams) gives us a bigram. Higher-order n-grams are formed from two overlapping (n-1)-grams: two bigrams give a trigram, two trigrams form a quadgram, and so on.
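
For instance, NLTK's ngrams utility slides a window of size N over a list of tokens (the token list here is illustrative):

from nltk.util import ngrams

tokens = ["i", "moved", "to", "new", "york", "last", "year"]

bigrams = list(ngrams(tokens, 2))    # pairs of consecutive tokens
trigrams = list(ngrams(tokens, 3))   # triples of consecutive tokens

print(bigrams)    # [('i', 'moved'), ('moved', 'to'), ('to', 'new'), ('new', 'york'), ...]
print(trigrams)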

A few of the common bigrams found in the Twitter dataset, along with their frequencies, are shown below.

Top 10 bigrams from a collection of tweets.

As you can see, words like gonna, gotta and wanna are informal contractions and have been split into two tokens. The others on the list seem intuitively easy for us to predict. But when the dataset is unfamiliar, or much larger than we can comprehend, we may not be able to make enough sense of the data to validate the bigrams ourselves.

Hence, statistical measures are useful for identifying bigrams more accurately and for ranking them when many bigrams share the same first token or the same last token.
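
A frequency table like the one above can be built by counting bigrams across the corpus; here is a minimal sketch using collections.Counter (the list of tweets is hypothetical):

from collections import Counter

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

tweets = ["we are gonna win this time", "i am gonna watch the game tonight"]  # hypothetical tweets

bigram_counts = Counter()
for tweet in tweets:
    tokens = word_tokenize(tweet.lower())
    bigram_counts.update(ngrams(tokens, 2))

# the ten most frequent bigrams with their counts
print(bigram_counts.most_common(10))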

Co-occurrence

One of the simplest techniques for finding a close association between tokens is to count the number of times they occur together. This, in technical terms, is called co-occurrence. It simply counts the number of times one token is followed by another and uses those counts as a score. Usually, this is represented as a matrix of order n × n, where n is the total number of unique tokens in a set of strings/documents.

For the set of strings ‘the cat sat’, ‘the cat sat in the hat’ and ‘the cat with the hat’, the co-occurrence matrix looks like this.

Co-occurrence matrix to find bigrams, where the row name is the first word, and the column name is the second word.
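
A minimal sketch of building such a matrix as a nested dictionary, where rows are first words and columns are second words:

from collections import defaultdict

docs = ["the cat sat", "the cat sat in the hat", "the cat with the hat"]

# cooc[first][second] = number of times `second` immediately follows `first`
cooc = defaultdict(lambda: defaultdict(int))
for doc in docs:
    tokens = doc.split()
    for first, second in zip(tokens, tokens[1:]):
        cooc[first][second] += 1

for first, row in sorted(cooc.items()):
    print(first, dict(row))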

However, this is a rather naive approach since it conveys only the counts. We need a more robust approach that also considers other factors, such as how many times a token occurs in the entire corpus versus how many times it occurs as part of the bigram.

Pointwise Mutual Information (PMI)

If a token play occurs 100 times in a corpus in total but occurs immediately before the token cricket only 5 times, it makes little sense to replace the token play with the bigram play cricket. However, if play occurs immediately before the token football 95 times, it seems intuitively correct to replace it with play football.

Given that the purpose of forming N-grams is to reduce the number of unique tokens and build a more robust corpus, this process seems apt. PMI is one of the metrics available for forming bigrams in such a relative manner.

PMI is calculated using conditional probabilities:

PMI(x, y) = log [ p(x | y) / p(x) ] = log [ p(y | x) / p(y) ]

p(x|y) = probability of finding the token x after the token y.
p(y|x) = probability of finding the token y after the token x.
p(y) = probability of the token y in the entire corpus.
p(x) = probability of the token x in the entire corpus.
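
To make the formula concrete, here is a minimal sketch that computes PMI from raw counts, using the equivalent form log[ p(x, y) / ( p(x) * p(y) ) ]; the counts and corpus size below are made up for illustration.

import math

def pmi(count_xy, count_x, count_y, total_tokens):
    """PMI of the bigram (x, y), computed from raw counts using log base 2."""
    p_xy = count_xy / total_tokens   # probability of the bigram (x, y)
    p_x = count_x / total_tokens     # probability of the token x
    p_y = count_y / total_tokens     # probability of the token y
    return math.log2(p_xy / (p_x * p_y))

# made-up counts: "play" 100 times, "football" 120 times,
# and the bigram "play football" 95 times, in a corpus of 10,000 tokens
print(pmi(95, 100, 120, 10_000))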

A sample code segment to find the PMI scores and bigrams using NLTK is given below.
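
A minimal sketch along those lines, using NLTK's BigramCollocationFinder and BigramAssocMeasures; the input text and the (commented-out) frequency threshold are illustrative.

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)

text = "i play football every weekend but i rarely play cricket"   # illustrative corpus
tokens = word_tokenize(text.lower())

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# optionally ignore bigrams that occur fewer than 2 times before scoring
# finder.apply_freq_filter(2)

# (bigram, PMI score) pairs, sorted from highest to lowest score
for bigram, score in finder.score_ngrams(bigram_measures.pmi):
    print(bigram, round(score, 3))

# or just the top 5 bigrams ranked by PMI
print(finder.nbest(bigram_measures.pmi, 5))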

In the next part of the series, we will cover POS tagging, stemming and lemmatization.

I really hope you liked the article. Do clap for the article if it was helpful. Any advice or criticism is welcome. Thanks for reading!

