A Beginner’s Guide to Natural Language Processing — Part 4

Ch Ravi Raj
Published in Analytics Vidhya
4 min read · Jan 7, 2021


After the three previous posts (Parts 1, 2, and 3), you are now familiar with cleaning and tagging textual data and reducing words to their root forms. We are now going to represent those words as numbers.

The most common doubts beginners have with NLP revolve around the representation of text as numbers, whether about the ideal approach or about the metric used for the reduction. Although there is no perfect answer to these doubts, most of us tackle them based on the problem statement. When we are trying to solve a word-prediction problem, the sequence of the words matters much more than it does in a sentiment-analysis problem, where the meaning conveyed by the words has the more profound impact.

But whatever the problem, we need to build a model or method that can formulate a solution when it encounters new data. Hard-coding a system to tag sentences containing words like “good”, “nice”, “amazing”, and so on as POSITIVE is a naive approach that does not really require ML or NLP. But if we can find a way to represent all positive-sounding words similarly, then we need not worry about the system encountering new words.

Vectors and Vectorizers

Now a question arises: how do we represent related words similarly? One of the most common ways is with a VECTOR, an array of numbers. These numbers can be positive integers, or positive and negative floating-point values. Vectorizers are packages that build such vectors for words, sentences, or documents.

  • Count Vectorizer: The bag-of-words (BOW) is a representation that turns text into fixed-length vectors by counting how many times each word appears in a sentence or document. It is called a “bag” of words because any information about the order or structure of the words in the document is discarded. We simply collect the data and make a corpus of all the unique words in it. We then make a matrix-like representation of those words. The fixed length, equal to the number of unique words in the corpus, gives every text a standard representation: a vector.
[Image: an example of the bag-of-words vector concept.]

We convert each sentence into a vector of the counts of the words in the vocabulary.

Consider the sentences “the cat sat”, “the cat sat in the hat”, and “the cat with the hat”. A standard vectorizer would generate a fixed-length vector for each, with size equal to that of the vocabulary. Below is a code sample to make vectors using CountVectorizer.
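
A minimal sketch of this idea with scikit-learn’s CountVectorizer (assuming scikit-learn 1.0 or later for get_feature_names_out):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "the cat sat",
        "the cat sat in the hat",
        "the cat with the hat",
    ]

    vectorizer = CountVectorizer()
    bow = vectorizer.fit_transform(corpus)  # sparse document-term count matrix

    print(vectorizer.get_feature_names_out())
    # ['cat' 'hat' 'in' 'sat' 'the' 'with']
    print(bow.toarray())
    # [[1 0 0 1 1 0]
    #  [1 1 1 1 2 0]
    #  [1 1 0 0 2 1]]

Each row is one sentence and each column counts one vocabulary word, so “the” (the fifth column) counts 2 in the last two sentences.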

  • TFIDF Vectorizer: TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.

The term frequency (TF) of a word in a document is the count of how many times the word appears in it, usually normalized by dividing by the total number of words in the document; the normalized form is what the scoring below uses.

The inverse document frequency (IDF) of a word measures how common or rare the word is across the entire document set. It is calculated by taking the total number of documents, dividing it by the number of documents that contain the word, and taking the log of that ratio.

TF-IDF Scoring

We get the final value by multiplying the two: how proportionally often a word appears in a document, and the inverse document frequency of the word across the set of documents.
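
In symbols, one common form of the measure (the log base is an implementation choice; base 10 is assumed here so that the numbers below work out):

    \mathrm{tf}(w, d) = \frac{f_{w,d}}{|d|}, \qquad
    \mathrm{idf}(w, D) = \log_{10}\frac{N}{n_w}, \qquad
    \text{tf-idf}(w, d, D) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w, D)

where f_{w,d} is the count of word w in document d, |d| is the number of words in d, N is the total number of documents, and n_w is the number of documents containing w.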

Given two sentences, “The car is driven on the road” and “The truck is driven on the highway”, below is a small demonstration of TF-IDF scoring.
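
Working through the word “car”: each sentence has 7 words, and “car” appears once, in 1 of the 2 documents, so

    tf("car", sentence 1) = 1/7 ≈ 0.143
    idf("car")            = log10(2/1) ≈ 0.301
    tf-idf                ≈ 0.143 × 0.301 ≈ 0.043

The words shared by both sentences (“the”, “is”, “driven”, “on”) get idf = log10(2/2) = 0, so their scores vanish. The same arithmetic gives 0.043 for “road”, “truck”, and “highway” in the sentences where they appear.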

[Image: example of TF-IDF scoring.]

We convert each sentence based on the TF-IDF scores generated. With the vocabulary ordered as (car, driven, highway, is, on, road, the, truck), the first sentence’s vector would be [0.043, 0, 0, 0, 0, 0.043, 0, 0] and the second’s would be [0, 0, 0.043, 0, 0, 0, 0, 0.043]. Below is the CountVectorizer code sample again, using TfidfVectorizer instead.
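
Again a minimal sketch with scikit-learn, swapping in TfidfVectorizer. Note that scikit-learn’s variant of the formula differs from the hand computation above (by default it uses a smoothed IDF with a natural log and L2-normalizes each row), so its numbers will not be exactly 0.043:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "The car is driven on the road",
        "The truck is driven on the highway",
    ]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(corpus)  # sparse TF-IDF matrix

    print(vectorizer.get_feature_names_out())
    # ['car' 'driven' 'highway' 'is' 'on' 'road' 'the' 'truck']
    print(tfidf.toarray().round(3))  # one L2-normalized row per sentence

Either way, the pattern is the same: words unique to a document score high, and words appearing everywhere score zero or near zero.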

In the next article, we shall discuss Embeddings and why they are needed.

Happy new year! Any and all criticism to help me improve is appreciated! Clap, follow, and share!
