Text preprocessing is the essential first step in any NLP pipeline. Raw text data is noisy, inconsistent, and unstructured. Before a machine learning model can work with text, it must be cleaned, normalised, and transformed into a consistent format.
Raw text contains many sources of noise:
| Issue | Example |
|---|---|
| Mixed case | "Hello" vs "hello" vs "HELLO" |
| Punctuation | "great!" vs "great" |
| Stop words | "the", "is", "and" — frequent but low information |
| Inflections | "running", "runs", "ran" — same root meaning |
| Special characters | URLs, HTML tags, emojis |
| Whitespace | Extra spaces, tabs, newlines |
Tip: The preprocessing steps you apply depend on your task. For sentiment analysis, you might keep exclamation marks (they indicate emphasis). For topic modelling, you would remove them.
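For instance, here is a minimal sketch of task-dependent cleaning (the regex patterns are illustrative, not a prescribed recipe):

import re

review = "This movie was great!!!"

# Sentiment analysis: keep exclamation marks, they signal emphasis
print(re.sub(r"[^a-zA-Z!\s]", "", review))
# Output: "This movie was great!!!"

# Topic modelling: strip all punctuation
print(re.sub(r"[^a-zA-Z\s]", "", review))
# Output: "This movie was great"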
A typical text preprocessing pipeline follows these steps: lowercasing, noise removal (HTML, URLs, special characters), tokenisation, stop-word removal, and reducing inflections through stemming or lemmatisation. We walk through each step below.
Converting text to lowercase ensures that "Cat", "cat", and "CAT" are treated as the same word.
text = "Natural Language Processing is AMAZING!"
text_lower = text.lower()
print(text_lower)
# Output: "natural language processing is amazing!"
Caution: Lowercasing can lose information. For example, "US" (United States) becomes "us" (pronoun). Named Entity Recognition may require preserving case.
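If case matters for your task, one workaround is to lowercase selectively. The sketch below keeps all-caps tokens of two or more letters untouched; this heuristic is an assumption for illustration, not a standard rule:

def selective_lower(text):
    # Preserve likely acronyms ("US", "NLP"); lowercase everything else
    return " ".join(
        tok if tok.isupper() and len(tok) > 1 else tok.lower()
        for tok in text.split()
    )

print(selective_lower("The US is investing in NLP Research"))
# Output: "the US is investing in NLP research"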
Regular expressions handle most remaining noise, such as HTML tags, URLs, special characters, and stray whitespace:
import re
def clean_text(text):
# Remove HTML tags
text = re.sub(r'<[^>]+>', '', text)
# Remove URLs
text = re.sub(r'https?://\S+|www\.\S+', '', text)
# Remove special characters and digits
text = re.sub(r'[^a-zA-Z\s]', '', text)
# Remove extra whitespace
text = re.sub(r'\s+', ' ', text).strip()
return text
raw = "<p>Check out https://example.com! NLP is #awesome 123</p>"
print(clean_text(raw))
# Output: "Check out NLP is awesome"
Tokenisation is the process of splitting text into individual units called tokens. Tokens can be words, subwords, or characters.
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')  # tokeniser data required by word_tokenize
text = "The cat sat on the mat."
tokens = word_tokenize(text)
print(tokens)
# Output: ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
NLTK can also split text into sentences rather than words:
from nltk.tokenize import sent_tokenize
text = "NLP is fascinating. It powers many applications. Let's learn more!"
sentences = sent_tokenize(text)
print(sentences)
# Output: ['NLP is fascinating.', 'It powers many applications.', "Let's learn more!"]
spaCy performs tokenisation as the first step of its processing pipeline:
import spacy
nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
doc = nlp("The cat sat on the mat.")
tokens = [token.text for token in doc]
print(tokens)
# Output: ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
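Beyond splitting text, each spaCy token carries annotations that later preprocessing steps can reuse. Reusing the `nlp` pipeline loaded above, a short sketch (exact tags depend on the model version):

doc = nlp("The cats were running.")
for token in doc:
    # lemma_ gives the base form; is_stop flags common stop words
    print(token.text, token.lemma_, token.is_stop)
# Output (approximate):
# The the True
# cats cat False
# were be True
# running run False
# . . False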
Modern transformer models use subword tokenisation (BPE, WordPiece, SentencePiece), which breaks rare words into smaller units.
| Tokeniser | Used By |
|---|---|
| Byte-Pair Encoding (BPE) | GPT-2, GPT-3, RoBERTa |
| WordPiece | BERT |
| SentencePiece | T5, ALBERT, XLNet |
| Unigram | Used within SentencePiece |
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("unhappiness is overwhelming")
print(tokens)
# Output: ['un', '##happiness', 'is', 'over', '##whelm', '##ing']
Tip: Subword tokenisation handles out-of-vocabulary words by breaking them into known subword units. This is why transformer models rarely encounter unknown tokens.
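You can see the same behaviour with a BPE tokeniser. Below is a hedged sketch using GPT-2's vocabulary; the printed pieces depend on the learned merges, so treat the output as indicative:

from transformers import AutoTokenizer

bpe_tokenizer = AutoTokenizer.from_pretrained("gpt2")
# An invented word still maps to known subword pieces rather than an unknown token
print(bpe_tokenizer.tokenize("hyperpreprocessing"))
# e.g. ['hyper', 'pre', 'processing'] (actual splits depend on the vocabulary)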
Stop words are extremely common words (e.g., "the", "is", "and", "to") that carry little semantic meaning on their own.
from nltk.corpus import stopwords
nltk.download('stopwords')  # downloads stop-word lists for many languages
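With the stop-word list downloaded, removal is a simple list comprehension over your tokens; a minimal sketch:

stop_words = set(stopwords.words('english'))
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
# Output: ['cat', 'sat', 'mat']

Caution: NLTK's English list includes negations such as "not" and "no", which often carry crucial signal for sentiment analysis. Review the list before removing words wholesale.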