Text preprocessing is the essential first step in any NLP pipeline. Raw text data is noisy, inconsistent, and unstructured. Before a machine learning model can work with text, it must be cleaned, normalised, and transformed into a consistent format.
Raw text contains many sources of noise:
| Issue | Example |
|---|---|
| Mixed case | "Hello" vs "hello" vs "HELLO" |
| Punctuation | "great!" vs "great" |
| Stop words | "the", "is", "and" — frequent but low information |
| Inflections | "running", "runs", "ran" — same root meaning |
| Special characters | URLs, HTML tags, emojis |
| Whitespace | Extra spaces, tabs, newlines |
Tip: The preprocessing steps you apply depend on your task. For sentiment analysis, you might keep exclamation marks (they indicate emphasis). For topic modelling, you would remove them.
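For instance, here is a minimal sketch of task-dependent cleaning (the regex patterns are illustrative, not a prescribed recipe):

import re

review = "This movie was great!!!"

# Sentiment analysis: keep exclamation marks, they signal emphasis
print(re.sub(r"[^a-zA-Z!\s]", "", review))
# Output: "This movie was great!!!"

# Topic modelling: strip all punctuation
print(re.sub(r"[^a-zA-Z\s]", "", review))
# Output: "This movie was great"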
A typical text preprocessing pipeline follows these steps: lowercasing, noise removal (HTML, URLs, special characters), tokenisation, stop-word removal, and reducing inflections through stemming or lemmatisation. We walk through each step below.
Converting text to lowercase ensures that "Cat", "cat", and "CAT" are treated as the same word.
text = "Natural Language Processing is AMAZING!"
text_lower = text.lower()
print(text_lower)
# Output: "natural language processing is amazing!"
Caution: Lowercasing can lose information. For example, "US" (United States) becomes "us" (pronoun). Named Entity Recognition may require preserving case.
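If case matters for your task, one workaround is to lowercase selectively. The sketch below keeps all-caps tokens of two or more letters untouched; this heuristic is an assumption for illustration, not a standard rule:

def selective_lower(text):
    # Preserve likely acronyms ("US", "NLP"); lowercase everything else
    return " ".join(
        tok if tok.isupper() and len(tok) > 1 else tok.lower()
        for tok in text.split()
    )

print(selective_lower("The US is investing in NLP Research"))
# Output: "the US is investing in NLP research"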
Regular expressions handle most remaining noise, such as HTML tags, URLs, special characters, and stray whitespace:
import re
def clean_text(text):
# Remove HTML tags
text = re.sub(r'<[^>]+>', '', text)
# Remove URLs
text = re.sub(r'https?://\S+|www\.\S+', '', text)
# Remove special characters and digits
text = re.sub(r'[^a-zA-Z\s]', '', text)
# Remove extra whitespace
text = re.sub(r'\s+', ' ', text).strip()
return text
raw = "<p>Check out https://example.com! NLP is #awesome 123</p>"
print(clean_text(raw))
# Output: "Check out NLP is awesome"
Tokenisation is the process of splitting text into individual units called tokens. Tokens can be words, subwords, or characters.
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')  # tokeniser data required by word_tokenize
text = "The cat sat on the mat."
tokens = word_tokenize(text)
print(tokens)
# Output: ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
NLTK can also split text into sentences rather than words:
from nltk.tokenize import sent_tokenize
text = "NLP is fascinating. It powers many applications. Let's learn more!"
sentences = sent_tokenize(text)
print(sentences)
# Output: ['NLP is fascinating.', 'It powers many applications.', "Let's learn more!"]
spaCy performs tokenisation as the first step of its processing pipeline:
import spacy
nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
doc = nlp("The cat sat on the mat.")
tokens = [token.text for token in doc]
print(tokens)
# Output: ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
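Beyond splitting text, each spaCy token carries annotations that later preprocessing steps can reuse. Reusing the `nlp` pipeline loaded above, a short sketch (exact tags depend on the model version):

doc = nlp("The cats were running.")
for token in doc:
    # lemma_ gives the base form; is_stop flags common stop words
    print(token.text, token.lemma_, token.is_stop)
# Output (approximate):
# The the True
# cats cat False
# were be True
# running run False
# . . False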
Modern transformer models use subword tokenisation (BPE, WordPiece, SentencePiece), which breaks rare words into smaller units.
| Tokeniser | Used By |
|---|---|
| Byte-Pair Encoding (BPE) | GPT-2, GPT-3, RoBERTa |
| WordPiece | BERT |
| SentencePiece | T5, ALBERT, XLNet |
| Unigram | Used within SentencePiece |
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("unhappiness is overwhelming")
print(tokens)
# Output: ['un', '##happiness', 'is', 'over', '##whelm', '##ing']
Tip: Subword tokenisation handles out-of-vocabulary words by breaking them into known subword units. This is why transformer models rarely encounter unknown tokens.
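You can see the same behaviour with a BPE tokeniser. Below is a hedged sketch using GPT-2's vocabulary; the printed pieces depend on the learned merges, so treat the output as indicative:

from transformers import AutoTokenizer

bpe_tokenizer = AutoTokenizer.from_pretrained("gpt2")
# An invented word still maps to known subword pieces rather than an unknown token
print(bpe_tokenizer.tokenize("hyperpreprocessing"))
# e.g. ['hyper', 'pre', 'processing'] (actual splits depend on the vocabulary)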
Stop words are extremely common words (e.g., "the", "is", "and", "to") that carry little semantic meaning on their own.
from nltk.corpus import stopwords
nltk.download('stopwords')  # downloads stop-word lists for many languages
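With the stop-word list downloaded, removal is a simple list comprehension over your tokens; a minimal sketch:

stop_words = set(stopwords.words('english'))
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
# Output: ['cat', 'sat', 'mat']

Caution: NLTK's English list includes negations such as "not" and "no", which often carry crucial signal for sentiment analysis. Review the list before removing words wholesale.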