A Beginner’s Guide to Natural Language Processing — Part 5

Ch Ravi Raj · Published in Analytics Vidhya · 4 min read · Jan 28, 2021


In Part 4, we saw what vectors are and how they are generated using TF-IDF or simple counts. The vector's length equals the number of unique words in the entire corpus. In the examples I gave, there were only 4–10 unique words, so each vector held only 4–10 integer or float values. A corpus with only 100 unique words still seems acceptable, right? But as we move to bigger corpora, the vector size becomes increasingly cumbersome. For instance, a 10-word sentence in a corpus with 10,000 unique words would be represented by a vector with 9,990 zeros and only 10 non-zero values. Such a sparse matrix becomes difficult to work with, and it would be much easier to deal with small vectors. But we cannot reliably decide which words to keep as part of our vector and which ones to leave out.
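To see this sparsity concretely, here is a quick sketch using scikit-learn's CountVectorizer on a toy corpus (the sentences and the library choice are assumptions for illustration; any count-based vectoriser shows the same pattern).

```python
from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus; a real corpus would have thousands of unique words.
corpus = [
    "the king spoke to the man",
    "the policeman stopped the male driver",
    "a man became king",
]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(corpus)

vocab_size = len(vectorizer.get_feature_names_out())
print("Vector length (unique words in corpus):", vocab_size)

# Even here, most entries of each sentence vector are zero.
first_vector = vectors[0].toarray()[0]
print("Non-zero entries in the first vector:", (first_vector != 0).sum(), "out of", vocab_size)
```

With a 10,000-word vocabulary the same code would report a vector length of 10,000, with only a handful of non-zero entries per sentence.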

Another problem we face when using TF-IDF/Count Vectoriser values directly is that they do not capture the relationships between words. There might be no resemblance at all between the vectors for the words — "Male", "Man", "Policeman" and "King" — even though the words themselves are clearly related.

To avoid this problem of selection, and to generate vectors that capture the similarity between closely related words, we use Word Embeddings — vectors of a fixed length that also encode the semantic relationships between words. Word2Vec is one of the most commonly used techniques for generating word embeddings. Before we implement Word2Vec, we first have to discuss two related concepts — the Skip-gram model and the Continuous Bag-of-Words (CBOW) model.

What is Word2Vec?

Word2Vec is a neural network model introduced by Google in 2013 to generate continuous dense vectors (vectors without many zeros) for words. The resulting vectors capture the contextual and semantic similarity between words. It is an unsupervised model: it takes a huge corpus with a large number of unique words and returns a dense vector space representing the entire vocabulary. The user typically chooses the length of the vectors, called the number of dimensions, and this dimension is usually much smaller than the number of unique words in the vocabulary — in contrast to our TF-IDF or Count vectors.

What is the CBOW Model?

In this architecture of Word2Vec, we try to predict the target word t from the context words around it (t−n, …, t−1, t+1, …, t+n). The value of n is the window size, which we choose based on the problem we are solving. We can model the CBOW architecture as a classification model where the context words are the input x and the target word is the prediction y, as the sketch below shows.
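As a rough illustration of this framing (the sentence and window size below are made up for the example), here is how a tokenised sentence can be turned into (context, target) training pairs:

```python
# Turn one tokenised sentence into CBOW-style (context words -> target word) pairs.
sentence = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
n = 2  # window size

pairs = []
for i, target in enumerate(sentence):
    # Context = up to n words on each side of the target word.
    context = sentence[max(0, i - n):i] + sentence[i + 1:i + 1 + n]
    pairs.append((context, target))

for context, target in pairs[:3]:
    print(context, "->", target)
# ['quick', 'brown'] -> the
# ['the', 'brown', 'fox'] -> quick
# ['the', 'quick', 'fox', 'jumps'] -> brown
```

A CBOW model takes the context words on the left as its input x and learns to predict the target word on the right as y.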

What is the Skipgram Model?

In this model, we try to predict the context words (t−n, …, t−1, t+1, …, t+n) around the target word t. We feed the skip-gram model its input x and label y as pairs (x, y). We train it with [(target, context), 1] values as positive samples, where target is our word of interest and context is a word actually occurring near it; label 1 indicates this is a positive pair, with context and target relevant to each other. We also give [(target, random), 0] pairs as negative samples, where random is simply a word picked at random from the vocabulary that has no association with the target word. Through this, we teach the model which pairs of words are relevant and which are not, which allows it to generate similar embeddings for similar words.
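Here is a simplified sketch of that pair-generation step (in the real algorithm, negative words are drawn from a smoothed unigram distribution rather than uniformly, and true context words are excluded; the sentence below is an assumption for illustration):

```python
import random

sentence = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
vocabulary = sorted(set(sentence))
n = 2  # window size

samples = []
for i, target in enumerate(sentence):
    context_words = sentence[max(0, i - n):i] + sentence[i + 1:i + 1 + n]
    for context in context_words:
        # Positive sample: a (target, context) pair that really occurs, labelled 1.
        samples.append(((target, context), 1))
        # Negative sample: the target paired with a random vocabulary word, labelled 0.
        samples.append(((target, random.choice(vocabulary)), 0))

for pair, label in samples[:4]:
    print(pair, label)
```

The skip-gram network is then trained to output 1 for the positive pairs and 0 for the negative ones, which pushes words that appear in similar contexts towards similar embeddings.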

Below is a demo of a Word2Vec implementation on a very small dataset. The accuracy is low, but it gives you an idea of how to work with larger datasets.
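The original embedded demo is not reproduced here, but the sketch below shows the same workflow using the gensim library on a made-up toy corpus (the corpus and parameter values are illustrative assumptions; in gensim versions before 4.0 the vector_size parameter was called size):

```python
from gensim.models import Word2Vec

# A toy corpus: each "document" is already tokenised into a list of words.
corpus = [
    ["the", "king", "spoke", "to", "the", "man"],
    ["the", "queen", "spoke", "to", "the", "woman"],
    ["the", "policeman", "stopped", "the", "male", "driver"],
    ["the", "man", "became", "king"],
    ["the", "woman", "became", "queen"],
]

# sg=1 trains the skip-gram architecture; sg=0 would train CBOW instead.
model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # length of each word vector (the "dimension")
    window=2,         # context window size
    min_count=1,      # keep every word, since the corpus is tiny
    sg=1,
    epochs=100,
)

print(model.wv["king"].shape)                  # (50,) -- a dense vector
print(model.wv.most_similar("king", topn=3))   # nearest words in the embedding space
```

With such a tiny corpus, the neighbours returned by most_similar will be noisy; the same code on a corpus of millions of sentences is where Word2Vec starts to shine.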

With this, we complete the series — ‘A Beginner’s Guide to Natural Language Processing’. Once the vectors and embeddings are made, you can use them to solve almost all the problems that NLP has to offer!

Thank you for reading! Let me know what you think about the series! Do share it with a few beginners if you like it, and help them as well! Cheers! Clap, share, and correct!
