A Beginner’s Guide to Natural Language Processing — Part 3

Ch Ravi Raj · Analytics Vidhya · Dec 17, 2020

Part 2 of this series covered tokens and N-grams, along with their scoring metrics, as part of this introduction to NLP.


In this article, after a brief discussion on Data Parsing, you will find:

  1. POS Tagging.
  2. Stemming.
  3. Lemmatization.

What is Data Parsing?

Traditional sentence parsing is a method of understanding the exact meaning of a sentence or word. It usually explains the role of grammatical divisions such as subject and predicate. To a computer, data parsing is a similar process: it analyses a string by breaking it into its constituents, with each resulting component becoming a meaningful piece to the system.
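
As a concrete illustration of breaking a string into its constituents, here is a minimal sketch using NLTK's word_tokenize, the kind of tokenization covered in Part 2 (it assumes the punkt tokenizer data has been downloaded):

from nltk.tokenize import word_tokenize
# Requires: nltk.download('punkt')

word_tokenize("Ravi typed a letter.")
# ['Ravi', 'typed', 'a', 'letter', '.']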

Why is data parsing necessary?

After preprocessing and clean-up, it seems intuitive to the human mind that the text is now clean and that the sequence of meaningful words is easy to understand. We understand parts of speech; we understand the concept of subject and predicate. To the machine, however, the words are still not in a useful form. The text is still redundant and conveys no rules of grammar to the machine.

To the human mind, the sequence of words “ravi typed letter” makes as much sense as “Ravi typed a letter.” But to a machine, neither statement makes any sense. It does not understand that “Ravi” is the subject, nor does it understand that “typed” is the past tense of a verb. Makes you feel grateful for that mind of yours, doesn't it? To guide the system and give it some insight into the words, we usually perform a collection of steps.

POS Tagging

This is the first step in explaining the grammatical rules of the unfamiliar text to the system. We understand that RUN is a VERB because it implies action. Though the machine has access to a dictionary to look up the definition of RUN, it still needs to be told that RUN is a VERB. Just as a first-grade student learns verbs from a teacher, in this step the tagger is the teacher to the first-grade machine.

POS stands for Part of Speech. POS tagging is the process of marking a word in a corpus with its corresponding part-of-speech tag, based on the word's context and definition.

There are many methods of achieving this:

  1. Lexical-Based Methods — The assigned POS tag is the one that most frequently occurs with the word in the training corpus.
  2. Rule-Based Methods — The assigned POS tag is based on rules. For example, a rule might state that words ending with “ed” or “ing” must be tagged as verbs (a small rule-based sketch follows the example below).
  3. Probabilistic Methods — The assigned POS tag is based on the probability of a particular tag sequence occurring.
  4. Deep Learning Methods — Recurrent Neural Networks and other deep learning models can also be used for POS tagging.
from nltk import pos_tag
# Requires: nltk.download('averaged_perceptron_tagger')
pos_tag(['the', 'cat', 'sat', 'in', 'the', 'hat'])
# Output
# [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('in', 'IN'),
#  ('the', 'DT'), ('hat', 'NN')]

In the above result:

  • DT — Determiner
  • NN — Noun Singular or Mass
  • VBD — Verb Past Tense
  • IN — Preposition
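
To make the rule-based approach from the list above concrete, here is a minimal sketch using NLTK's RegexpTagger. The patterns below are illustrative assumptions rather than a complete rule set; each pattern is tried in order, and the first match wins:

from nltk.tag import RegexpTagger

# Each rule is a (regular expression, tag) pair, tried in order.
patterns = [
    (r'.*ing$', 'VBG'),       # gerunds, e.g. "typing"
    (r'.*ed$', 'VBD'),        # simple past, e.g. "typed"
    (r'^(the|a|an)$', 'DT'),  # determiners
    (r'.*s$', 'NNS'),         # plural nouns, e.g. "letters"
    (r'.*', 'NN'),            # default: singular noun
]
tagger = RegexpTagger(patterns)
tagger.tag(['the', 'cat', 'typed', 'letters'])
# [('the', 'DT'), ('cat', 'NN'), ('typed', 'VBD'), ('letters', 'NNS')]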

Stemming and Lemmatization

In some cases, the tense of a word matters, and in some cases it is inconsequential. In the example above, we can see that SAT is correctly classified as past tense. However, when building an application that detects verbs and classifies a sentence as indicating some action, the tense does not matter. Similarly, there are plenty of applications where we just need the word in its simplest form. To solve this problem, we have concepts like stemming and lemmatization.

Stemming is the process of reducing words to their word stem, base or root form. Here, we work with common suffixes that indicate tenses and we cut them from the word.

Example of stemming: studies → studi, studying → study.
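
As a runnable sketch of this idea, NLTK's PorterStemmer applies a series of such suffix-stripping rules (exact outputs vary from one stemming algorithm to another):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['studies', 'typed', 'letters', 'flies']:
    print(word, '->', stemmer.stem(word))
# studies -> studi
# typed -> type
# letters -> letter
# flies -> fli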

Looking at these stemmed outputs, an obvious problem appears, doesn't it? Studi does not seem to be the root of any English word. Removing suffixes like “ed” and “es” is a naive method of trying to remove the unnecessary overhead. As an alternative that solves this problem, ensuring that words like studied are traced back to the word study, we use lemmatization.

Lemmatization is the process of grouping together the altered forms of a word so they can be analyzed as a single item, identified by the word's lemma, or dictionary form. It requires a comprehensive dictionary to map all the variants of a word back to its root word. Unlike stemming, the dictionary lookup ensures that the words are accurately mapped to the original root.

Example of lemmatization: studies → study, studying → study.
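
As a minimal sketch, NLTK's WordNetLemmatizer performs this dictionary lookup against WordNet. It treats words as nouns by default, so a part-of-speech hint helps with verbs, which is one reason POS tagging comes first:

from nltk.stem import WordNetLemmatizer
# Requires: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('studies'))            # study (treated as a noun)
print(lemmatizer.lemmatize('studying', pos='v'))  # study (verb lemma)
print(lemmatizer.lemmatize('typed', pos='v'))     # type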

As seen, stemming converted studies to studi and studying to study. Even after stemming, there are still two different words for the machine to handle. Lemmatization, on the other hand, maps both studying and studies to the word study. This improves accuracy, but because of the dictionary lookup, lemmatizing data takes much more time than stemming it.

In the next part of the series, you will find Vectors and Vectorizers.

Happy learning everyone! I hope my articles are making a difference and helping you learn more. Do let me know if there is anything I need to improve on! Clap, follow and share!
