A phone using a voice assistant.
Photo by Omid Armin on Unsplash

A Beginner’s Guide to Natural Language Processing — Part 1

Ch Ravi Raj · Published in Analytics Vidhya · 5 min read · Nov 19, 2020


“Alexa, play me a song”, “Ok Google, call Mummy” and even “Hey Siri, what is the value of Pi?” have become commonplace in today’s world. While we celebrate the advancements that technology has brought to our lives, we also have to ask ourselves how far we have come in 2020, the year our ancestors envisioned many generations ago.

To live in this age of information, it is imperative to know how to handle the data we are generating, by the terabyte, every second. Consuming information at that pace is beyond the capability of the human mind. But our machines, which understand only 0s and 1s, cannot match our ability to comprehend information in its raw form. Hence the problem: how do we make systems process natural language the way humans do?

What is NLP?

Natural Language Processing (NLP) forms the bedrock of all interactive systems that handle text data in any format. A few examples of our daily encounters with it include:

  • Voice Assistants: Siri, Google Assistant, Cortana, Alexa. All of them use NLP to comprehend your commands, map them to a specific activity and subsequently initiate the task.
  • Translators: Google Translate, Microsoft Translator, Linguee. They use NLP to map words from one language to another, by considering context and similarities.
  • Smart keyboards: Autocorrect, spell check suggestions and grammar checks are a result of NLP, where the system tries to understand what the user wants to convey and predict accordingly.

While these applications and features now appear common and easily available, research on them is still active. Making better predictions, replying more accurately and producing apter translations are all works in progress. The in-depth knowledge required to build these systems end to end is something an individual acquires by putting theory into practice. There are many solutions to the problems of NLP, but the basic steps and procedures remain the same before we can feed the data into any model.

NLP Workflows

Most NLP problems follow a similar pattern, as seen in the image below.

A basic NLP problem-solving pipeline.

The entities in the image can be described as:

  1. Text Documents — The input data that we will be using. Each document could be an entire book, a paragraph or even a single sentence.
  2. Text Preprocessing — Cleaning the input documents so that they contain no noise elements and follow a uniform pattern across the dataset.
  3. Text Parsing — The uniform data now needs to be reduced. By this, we mean shrinking the corpus (the collection of all words across all documents in the dataset). Replacing different versions of the same word, identifying the tense of words and grouping similar words are all steps performed in this part.
  4. Text Representation — Converting the cleaned documents into a numerical format so that every unique word/phrase in the complete set of documents has a fixed-size representation. This can involve either increasing the number of features (the parameters used to represent the words) with techniques like Feature Engineering, or decreasing them with methods like Embeddings.
  5. Modeling — This is the part of the pipeline where concepts such as Machine Learning, Deep Learning and Artificial Intelligence are applied to the numerical representation of our Text Data. Problems ranging from Sentiment Prediction or Text Classification to the likes of Autocorrect and Predictive Typing are modeled here using various algorithms.
  6. Evaluation — After the model is made, it needs to be tested against new unseen data, to validate its performance in a real-time situation. The model also has to be scaled to the required number of users before deploying it for consumption.
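Conceptually, the first few stages compose into a single flow. Here is a minimal skeleton in Python; all names and placeholder bodies are hypothetical illustrations, not code from this series.

    # Illustrative skeleton of the pipeline stages described above.
    def preprocess(doc):
        """Step 2: remove noise and normalize the raw document."""
        return doc.lower().strip()

    def parse(doc):
        """Step 3: reduce the document to a list of normalized tokens."""
        return doc.split()

    def represent(tokens, vocab):
        """Step 4: map each known token to a fixed numerical id."""
        return [vocab[t] for t in tokens if t in vocab]

    raw_doc = "  The Quick Brown Fox  "
    vocab = {"the": 0, "quick": 1, "brown": 2, "fox": 3}
    print(represent(parse(preprocess(raw_doc)), vocab))  # [0, 1, 2, 3]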

In this post, the first in the series, only the generic preprocessing steps are discussed.

What is Preprocessing?

To the human mind, text clean-up means spelling and grammar checks, but for a machine it is a lot more than a syntactically and semantically correct sequence of words. Many steps, chosen based on the problem at hand, have to be performed under the umbrella of preprocessing.

Conversion to lower case

A machine does not understand formal grammatical rules: to it, H and h are two different entities. Hence the words happy, Happy and HAPPY are three different words.

To avoid unnecessary redundancy and neutralize any case differences, converting all the data into lowercase text is of utmost importance.
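In Python, for instance, this is a one-line transformation (a minimal sketch):

    words = ["happy", "Happy", "HAPPY"]
    lowered = [w.lower() for w in words]
    print(lowered)       # ['happy', 'happy', 'happy']
    print(set(lowered))  # {'happy'}: three surface forms collapse into one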

Removing punctuations and accents

Punctuation marks are treated as separate characters by the system. While they do convey some meaning with respect to the text data, they are irrelevant to the machine. In most cases, they play no role in the processing except as delimiters that separate one word from another. With this in mind, most NLP workflows include dropping punctuation as a necessary preprocessing step.

Other languages also have accents, which occur only minimally in English. When processing an English-language problem, they do not help the pipeline and can cause deviations in the model. To avoid erroneous suggestions, unless the model is being trained in a cross-linguistic manner, we drop the accents and the words containing them.
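A minimal sketch of both steps in plain Python, using only the standard library (the helper names are illustrative):

    import string
    import unicodedata

    def strip_punctuation(text):
        # Map every punctuation character to a space, preserving word boundaries.
        table = str.maketrans(string.punctuation, " " * len(string.punctuation))
        return text.translate(table)

    def strip_accents(text):
        # Decompose accented characters (e.g. é -> e + combining mark), then drop the marks.
        decomposed = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(strip_punctuation("Hello, world! It's NLP."))  # 'Hello  world  It s NLP '
    print(strip_accents("café naïve résumé"))            # 'cafe naive resume'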

Removing Stop Words

WordCloud of the poem IF by Rudyard Kipling

Stop words are words that occur frequently across the entire corpus but convey little information without context. Common words like prepositions, conjunctions and even pronouns are usually very generic and occur in large numbers in a collection, often with distinctly higher counts than the other words in the corpus.

The WordCloud above shows the most commonly occurring words in larger sizes than the rare ones. The words “if”, “you”, “and”, and “can” are the most frequent, yet they convey nothing about what the poem discusses. Words like these can be safely dropped in most cases to reduce the data volume and eliminate the noise.
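As an example, NLTK ships a ready-made English stop word list. A sketch follows; it requires a one-time download of the list, and the exact output depends on the NLTK version:

    import nltk
    nltk.download("stopwords", quiet=True)  # one-time download of the word list
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words("english"))
    line = "if you can keep your head when all about you are losing theirs"
    filtered = [w for w in line.split() if w not in stop_words]
    print(filtered)  # ['keep', 'head', 'losing']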

Additional preprocessing steps, which vary based on the problem being solved or the methods used in previous steps, can include the following (a short code sketch follows the list):

  • Dropping off numbers.
  • Collapsing multiple blank spaces into one.
  • Adding unique delimiter sequences.
  • Dropping usernames/websites/links from the data.
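A minimal sketch of a few of these optional steps, using Python’s built-in re module (all patterns here are illustrative assumptions):

    import re

    def extra_cleanup(text):
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop websites/links
        text = re.sub(r"@\w+", " ", text)                   # drop @usernames
        text = re.sub(r"\d+", " ", text)                    # drop numbers
        return re.sub(r"\s+", " ", text).strip()            # collapse blank spaces

    print(extra_cleanup("ping @bot at 42 via https://example.com today"))
    # 'ping at via today'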

This completes the cleaning up of the text data without parsing it. Sample code is available here.

What is next?

In the next post, you will find Tokenization, N-grams and methods to identify them.

I hope you liked the article. This is my first technical blog post, so any advice or criticism is highly appreciated. Thanks for reading!
