In Part 4, we saw what vectors are and how they are generated using TF-IDF or simple counts. Each vector has as many entries as there are unique words in the entire corpus. The examples there had only 4–10 unique words, so the vectors held only 4–10 integer or float values. Now take a corpus with only 100 unique words. Seems acceptable, right? But as we move to bigger corpora, the vector size becomes increasingly cumbersome. For instance, if there was a 10-word sentence in a 10,000-word…
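The link between vocabulary size and vector size can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two-sentence corpus and the helper names `build_vocabulary` and `count_vector` are made up for the example and do not come from the original article.

```python
from collections import Counter

def build_vocabulary(corpus):
    """Return a sorted list of the unique lowercase words in the corpus."""
    return sorted({word for sentence in corpus for word in sentence.lower().split()})

def count_vector(sentence, vocabulary):
    """Map a sentence to word counts, with one slot per vocabulary word."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical toy corpus, chosen only for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vocabulary = build_vocabulary(corpus)
vector = count_vector("the cat sat", vocabulary)

# The vector length always equals the vocabulary size,
# no matter how short the sentence is.
print(len(vocabulary), len(vector))

# Most slots are zero, because the sentence uses only a few vocabulary words.
nonzero = sum(1 for value in vector if value != 0)
print(nonzero, "of", len(vector), "entries are non-zero")
```

Every new unique word in the corpus adds one more slot to every vector, which is why the representation becomes mostly zeros as the corpus grows.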


After the 3 previous posts (Parts 1, 2 and 3), you are now familiar with the cleaning and tagging of textual data. After reducing words to their root forms, we are now going to represent them as numbers.

The most common doubts that beginners have with NLP revolve around the representation of text as numbers. They could be about the ideal approach or about the reduction metric. Although there is no perfect answer to these doubts, many of us tackle them based on the problem statement. …


Tokens and N-grams with scoring metrics were covered in Part 2 of this series of articles introducing NLP.

Photo by Edho Pratama on Unsplash

In this article, after a brief discussion on Data Parsing, you will find:

  1. POS Tagging.
  2. Stemming.
  3. Lemmatization.

What is Data Parsing?

Traditional sentence parsing is done as a method of understanding the exact meaning of a sentence or word. It usually explains the importance of various divisions such as subject and predicate. …


Preprocessing and text cleanup were covered in Part 1 of this series of articles introducing NLP. In this article, you will find:

  1. Tokens.
  2. N-grams.
  3. Co-occurrence matrices.
  4. PMI — Pointwise Mutual Information.

Tokens

Tokens are the units that constitute a sentence. A token can be a sequence of characters, digits, punctuation marks or a combination of these, and tokens are usually separated by spaces.
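As a minimal sketch of this idea, the snippet below compares two simple ways of splitting a sentence into tokens using only the Python standard library. The regular expression is an illustrative assumption, not the rule used by any particular NLP package.

```python
import re

sentence = "Alexa, play me a song."

# Splitting on whitespace keeps punctuation attached to the neighbouring word.
whitespace_tokens = sentence.split()
print(whitespace_tokens)  # ['Alexa,', 'play', 'me', 'a', 'song.']

# An illustrative pattern: runs of word characters, or any single
# character that is neither a word character nor whitespace.
punct_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(punct_tokens)  # ['Alexa', ',', 'play', 'me', 'a', 'song', '.']
```

The same sentence yields 5 tokens with the first approach and 7 with the second, so the token count depends on the tokenization rule chosen.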

Depending on the package used, the identification of tokens works differently. Some packages, like NLTK, also treat punctuation marks as separate tokens. So the word "can’t" will split into 2 tokens, "ca" and "n’t". But with a library…


A phone using a voice assistant.
Photo by Omid Armin on Unsplash

“Alexa, play me a song.”, “Ok Google, call Mummy.” or even “Hey Siri, what is the value of Pi?” have become common in today’s world. While we celebrate the advancements that technology has brought to our lives, we also have to ask ourselves how far we have really come by 2020, compared with what our ancestors envisioned for this year many generations ago.

To live in this age of information, it is imperative to know how to handle the data we are generating, by the terabyte, every second. It’s beyond the capability of the human mind to consume information at that pace. But our…

Ch Ravi Raj

Data Analyst | NLP | ML | Problem Solving | Poetry | Books | Coffee
