
Natural Language Processing Techniques: Text Preprocessing

Updated: Apr 16

Natural Language Processing (NLP) is rapidly becoming one of the most widely used technologies today. It involves programming computers to comprehend, process, and analyze large amounts of natural human language data, including text and speech.


For instance, consider reading reviews for a book. Humans can effortlessly gauge whether the sentiment is positive or negative, right? But how do machines interpret these sentiments? That's where NLP shines.


NLP libraries like spaCy and NLTK empower developers to harness the power of natural language understanding. These tools enable tasks such as text analysis, sentiment analysis, and language translation.


Examples of NLP in action abound: from analyzing WhatsApp text messages to filtering spam in Gmail and even conversing with virtual assistants like Siri. The applications of NLP are diverse and continually expanding, revolutionizing how we interact with technology.


[Image: some common uses of NLP]


STEP 1: Text Preprocessing

Text preprocessing is the initial crucial step in handling textual data, aimed at making the data more manageable and useful for analysis. This step involves various techniques, including tokenization, stemming, and lemmatization.


1. Tokenization

Tokenization breaks a text down into smaller units, such as sentences or words. These smaller units are called tokens and serve as the building blocks for further analysis.


For example, a paragraph can be tokenized into sentences, and sentences can be further tokenized into words.


Let's illustrate this with a code snippet:




# Import the necessary library
import nltk
nltk.download("punkt")

paragraph = "A paragraph is a series of related sentences developing a central idea, called the topic. Try to think about paragraphs in terms of thematic unity: a paragraph is a sentence or a group of sentences that supports one central, unified idea. Paragraphs add one idea at a time to your broader argument."

# Split the paragraph into a list of sentences
sentences = nltk.sent_tokenize(paragraph)

for sentence in sentences:
    print(sentence)

Here, the paragraph is tokenized into sentences using nltk.sent_tokenize(), and each sentence is printed.
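Sentences can in turn be split into word tokens with nltk.word_tokenize(). A minimal sketch, reusing the sentences list from the snippet above:

# Split the first sentence into word tokens
words = nltk.word_tokenize(sentences[0])
print(words) # list of word tokens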


2. Stemming

Stemming involves reducing words to their root form by removing affixes, such as prefixes or suffixes. The resulting stem may not always be a valid word, but it simplifies the text for analysis.


For example, the word "running" is reduced to the stem "run".


There are different stemming algorithms such as Porter, Porter2 (Snowball), and Lancaster. These algorithms vary in aggressiveness and computational efficiency.
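To get a feel for how they differ, the short sketch below uses NLTK's PorterStemmer, SnowballStemmer, and LancasterStemmer classes to stem a few sample words; Lancaster is generally the most aggressive of the three:

# Compare the three stemmers on a few sample words
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

for word in ["running", "generously", "happiness"]:
    print(word, "->", porter.stem(word), snowball.stem(word), lancaster.stem(word))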



Let's demonstrate stemming with a code snippet:

# Using the same list of sentences from the tokenization step
# Importing libraries
from nltk.stem.snowball import SnowballStemmer

# Stop words are common words that add little meaning to a sentence
from nltk.corpus import stopwords
nltk.download('stopwords')

# Create a stemmer object
stemmer = SnowballStemmer("english")

# Stem each word in every sentence and drop English stop words
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [stemmer.stem(word) for word in words if word not in set(stopwords.words('english'))]
    sentences[i] = " ".join(words)

print(sentences)

Here, SnowballStemmer is used to stem words in each sentence, and stopwords (words that do not add meaning to a sentence) are removed.


3. Lemmatization

Lemmatization is similar to stemming but aims to convert words to their base or dictionary form, known as the lemma. Unlike stemming, lemmatization ensures that the resulting word is a valid, meaningful word.


Let's apply lemmatization using NLTK:

# Importing the library
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet') # WordNetLemmatizer needs the WordNet corpus

# Create a lemmatizer object
lemma = WordNetLemmatizer()

# Lemmatize each word in every sentence and drop English stop words
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemma.lemmatize(word) for word in words if word not in set(stopwords.words('english'))]
    sentences[i] = " ".join(words)

print(sentences)

WordNetLemmatizer lemmatizes words in each sentence, and stopwords are removed.


In summary, text preprocessing techniques like tokenization, stemming, and lemmatization prepare textual data for further analysis and modeling. They help in standardizing text, reducing complexity, and improving the efficiency and accuracy of natural language processing tasks.


STEP 2: Text Vectorization

In this step, we focus on converting words into vectors, which are numerical representations necessary for machine learning models to process text data effectively.


1. Bag of Words

Bag of Words (BOW) is a simple yet powerful technique for converting text data into numerical vectors. It involves creating a matrix where each row represents a document (or sentence), and each column represents a unique word in the entire corpus.


For example, let's consider three sentences:

  1. "This is a wonderful book."

  2. "I liked this book very much."

  3. "I didn’t find it interesting."


Using the CountVectorizer class from scikit-learn, we can create a matrix that counts the frequency of each word in each sentence. Stop words (common words like "the", "is", etc.) are often removed to focus on meaningful words.
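A minimal sketch of this idea, assuming scikit-learn is installed and using its built-in English stop-word list:

# Bag of Words with scikit-learn's CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is a wonderful book.",
          "I liked this book very much.",
          "I didn't find it interesting."]

cv = CountVectorizer(stop_words='english')
bow = cv.fit_transform(corpus)

print(cv.get_feature_names_out()) # the vocabulary (columns of the matrix)
print(bow.toarray()) # one row of word counts per sentence

(get_feature_names_out() is available in scikit-learn 1.0+; older versions use get_feature_names().)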


Binary Bag of Words (Binary BOW) represents whether a word is present (1) or not present (0) in a sentence, irrespective of its frequency.
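Continuing from the snippet above, passing binary=True to the same vectorizer gives this presence/absence form:

# Binary BOW: 1 if the word occurs in the sentence, 0 otherwise
binary_cv = CountVectorizer(stop_words='english', binary=True)
print(binary_cv.fit_transform(corpus).toarray())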


While BOW is effective in capturing the occurrence of words, it lacks semantic understanding and doesn't differentiate the importance of words. For example, in the sentence "This is a good book," the word "good" is crucial for sentiment analysis but receives the same importance as other words in BOW.


2. TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a more sophisticated technique that overcomes the limitations of BOW by considering the importance of words in a document relative to the entire corpus.


TF (Term Frequency) measures how often a word appears in a document relative to the total number of words in that document. IDF (Inverse Document Frequency) measures how unique or rare a word is across all documents in the corpus.


The TF-IDF weight is the product of TF and IDF. It assigns higher weights to words that are frequent in a document but rare in the corpus, emphasizing their importance.


For instance, in our example sentences, "wonderful" and "liked" might have higher TF-IDF scores compared to common words like "this" or "book," as they are more indicative of the document's content.


Based on these definitions:

TF(word, sentence) = (Number of times the word appears in the sentence) / (Total number of words in the sentence)

IDF(word) = log_e(Total number of sentences / Number of sentences containing the word)

TF-IDF(word, sentence) = TF(word, sentence) × IDF(word)
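In code, scikit-learn's TfidfVectorizer computes these weights directly. A minimal sketch; note that by default it uses a smoothed IDF and L2-normalizes each row, so its values differ slightly from the plain formulas above:

# TF-IDF with scikit-learn's TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is a wonderful book.",
          "I liked this book very much.",
          "I didn't find it interesting."]

tfidf = TfidfVectorizer(stop_words='english')
matrix = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out()) # the vocabulary (columns)
print(matrix.toarray()) # TF-IDF weights, one row per sentence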


By considering both the frequency and uniqueness of words, TF-IDF provides more meaningful representations of text data, making it suitable for tasks like information retrieval, document classification, and sentiment analysis.


Conclusion

Text preprocessing is crucial in NLP, simplifying text for machine understanding. Techniques like tokenization, stemming, and lemmatization standardize text, while methods like Bag of Words (BOW) and TF-IDF convert words into numerical vectors. These techniques enable sentiment analysis and text classification, laying the groundwork for successful NLP applications. Choosing the right preprocessing methods is key to effectively analyzing textual data.
