Machine learning models work with numbers, not words. To apply ML to text, we must convert words, sentences, and documents into numerical representations — a process called text representation or text vectorisation.
The quality of your text representation directly affects model performance. A good representation captures semantic meaning, while a poor one treats every word as an independent, unrelated symbol.
| Representation | Captures Meaning? | Dimensionality | Use Case |
|---|---|---|---|
| Bag of Words | No | High (sparse) | Simple classification |
| TF-IDF | Partially | High (sparse) | Information retrieval, classification |
| Word Embeddings | Yes | Low (dense) | Most modern NLP tasks |
| Contextual Embeddings | Yes (context-aware) | Low (dense) | State-of-the-art NLP |
Bag of Words is the simplest text representation. Each document is represented as a vector of word counts over the vocabulary; word order is ignored, and only frequency matters.
Given a corpus:
The vocabulary is: [cat, dog, log, mat, on, sat, the]
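The original corpus documents are not shown in this preview, but a minimal sketch of Bag of Words can be built with an illustrative two-document corpus (an assumption) that produces the same vocabulary as above:

```python
from collections import Counter

# Illustrative corpus (assumed; the preview omits the original documents).
# These two sentences yield the vocabulary [cat, dog, log, mat, on, sat, the].
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Build the sorted vocabulary from all tokens in the corpus.
vocab = sorted({word for doc in corpus for word in doc.split()})
print(vocab)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']

def bag_of_words(doc, vocab):
    """Represent a document as a vector of word counts over the vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc, vocab) for doc in corpus]
print(vectors[0])  # "the cat sat on the mat" -> [1, 0, 0, 1, 1, 1, 2]
print(vectors[1])  # "the dog sat on the log" -> [0, 1, 1, 0, 1, 1, 2]
```

Note that both vectors are identical except in the positions for `cat`/`mat` versus `dog`/`log`: because order is discarded, the model sees only which words occur and how often. In practice this is what `sklearn.feature_extraction.text.CountVectorizer` computes, with extra options for tokenisation and n-grams.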