Bag of Words and TF-IDF
Once text has been tokenised, the next challenge is representing it as numbers that a machine learning model can use. The two approaches covered in this lesson, Bag of Words and TF-IDF, were the standard methods for text representation for decades, and they remain useful and widely deployed today.
The Bag of Words Model
The Bag of Words (BoW) model represents a document as a set of word counts, ignoring word order and grammar entirely. The name comes from the metaphor of shaking all the words in a document into a bag: you know which words are present and how often, but you lose all information about their arrangement.
Building a Vocabulary
The first step is to build a vocabulary: the complete set of unique tokens across all documents in your corpus. Suppose you have two short documents:
- Document 1: "the cat sat on the mat"
- Document 2: "the dog sat on the floor"
The vocabulary is {the, cat, sat, on, mat, dog, floor}, which contains seven unique words.
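As a minimal sketch of this step, the snippet below collects the unique tokens from the two example documents. The helper name `build_vocabulary` and the whitespace-based tokenisation are assumptions made for illustration; a real pipeline would use whatever tokeniser was chosen earlier.

```python
# The two example documents from the text.
documents = [
    "the cat sat on the mat",
    "the dog sat on the floor",
]


def build_vocabulary(docs):
    """Return the unique tokens in order of first appearance.

    Assumes tokens are separated by whitespace (an illustrative simplification).
    """
    vocabulary = []
    seen = set()
    for doc in docs:
        for token in doc.split():
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)
    return vocabulary


vocabulary = build_vocabulary(documents)
print(vocabulary)       # ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'floor']
print(len(vocabulary))  # 7
```

Keeping the tokens in a fixed order matters because each position in the vocabulary later becomes one dimension of the document vectors.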
Representing Documents as Vectors
Each document is represented as a vector with one dimension per vocabulary word. The value in each position is the count of that word in the document.
| Document | the | cat | sat | on | mat | dog | floor |
|---|---|---|---|---|---|---|---|
| Doc 1 | 2 | 1 | 1 | 1 | 1 | 0 | 0 |
| Doc 2 | 2 | 0 | 1 | 1 | 0 | 1 | 1 |
Now both documents are numerical vectors that any standard ML algorithm can work with.
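The table can be reproduced with a few lines of Python. This is a sketch rather than a production implementation: the function name `to_count_vector` is hypothetical, and tokens are again assumed to be separated by whitespace.

```python
from collections import Counter

# Vocabulary and documents from the example above.
vocabulary = ["the", "cat", "sat", "on", "mat", "dog", "floor"]
documents = [
    "the cat sat on the mat",
    "the dog sat on the floor",
]


def to_count_vector(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    # Counter returns 0 for words that do not occur in the document.
    return [counts[word] for word in vocab]


vectors = [to_count_vector(doc, vocabulary) for doc in documents]
for vector in vectors:
    print(vector)
# [2, 1, 1, 1, 1, 0, 0]
# [2, 0, 1, 1, 0, 1, 1]
```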
The Problem of Sparsity
In practice, vocabularies contain tens or hundreds of thousands of words. Any single document uses only a tiny fraction of them. The resulting vectors are sparse: mostly zeros. A document containing 200 unique words in a vocabulary of 50,000 has 99.6% zeros. Storing and processing all those zeros in dense form wastes memory and computation, and such high-dimensional inputs can make learning harder.
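The arithmetic behind the 99.6% figure, and one common way to avoid storing the zeros, can be sketched as follows. The dictionary-based representation and the helper `to_sparse` are illustrative assumptions; libraries typically provide dedicated sparse matrix types instead.

```python
vocabulary_size = 50_000
unique_words_in_doc = 200

# Fraction of vector positions that are zero.
sparsity = 1 - unique_words_in_doc / vocabulary_size
print(f"{sparsity:.1%}")  # 99.6%


def to_sparse(dense_vector):
    """Keep only the non-zero entries as {position: count}."""
    return {i: value for i, value in enumerate(dense_vector) if value != 0}


# A synthetic dense vector with exactly 200 non-zero positions.
dense = [0] * vocabulary_size
for i in range(unique_words_in_doc):
    dense[i * 250] = 1

sparse = to_sparse(dense)
print(len(dense), len(sparse))  # 50000 200
```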
The Limits of Word Counts
Raw counts have a significant problem: common words dominate. In any large corpus, words like "the", "a" and "of" appear in almost every document and with high frequency. They overwhelm the signal from the words that actually distinguish documents from one another.
Removing stopwords helps, but it is a blunt instrument. What you really want is a way to up-weight words that are distinctive and down-weight words that are ubiquitous.
That is exactly what TF-IDF does.
TF-IDF: Term Frequency–Inverse Document Frequency
TF-IDF is a statistical measure that reflects how important a word is to a particular document in a collection. It combines two components.
Term Frequency (TF)
How often does the word appear in this document? A word that appears ten times in a document is usually more relevant to that document than one that appears once.
$$\text{TF}(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{total terms in document } d}$$
Dividing by the total number of terms normalises for document length: a word appearing 10 times in a 100-word document is more significant than one appearing 10 times in a 10,000-word document.
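The definition translates directly into code. The sketch below assumes a document is already a list of tokens; the function name `term_frequency` is hypothetical, and returning 0.0 for an empty document is a choice made here to avoid division by zero.

```python
def term_frequency(term, doc_tokens):
    """Count of `term` in the document divided by the document's length."""
    if not doc_tokens:
        return 0.0  # assumption: an empty document has TF 0 for every term
    return doc_tokens.count(term) / len(doc_tokens)


tokens = "the cat sat on the mat".split()
tf_the = term_frequency("the", tokens)  # 2 occurrences out of 6 tokens
tf_cat = term_frequency("cat", tokens)  # 1 occurrence out of 6 tokens

# The length-normalisation example from the text:
# 10 occurrences in 100 words versus 10 occurrences in 10,000 words.
tf_short = 10 / 100     # 0.1
tf_long = 10 / 10_000   # 0.001
```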
Inverse Document Frequency (IDF)
How rare is the word across all documents? A word that appears in almost every document (like "the") is not useful for distinguishing between documents. A word that appears in only a few documents is more distinctive.
$$\text{IDF}(t) = \log\left(\frac{\text{total documents}}{\text{documents containing term } t}\right)$$
Words that appear in every document get an IDF of 0 (log 1 = 0), effectively zeroing them out. Rare words get high IDF scores.
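A minimal sketch of the IDF formula follows. Two assumptions are worth flagging: the text does not specify the base of the logarithm, so the natural logarithm is used here, and the formula is undefined for a term that appears in no document, which this sketch handles by raising an error. The function name `inverse_document_frequency` is hypothetical.

```python
import math


def inverse_document_frequency(term, tokenised_docs):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for doc in tokenised_docs if term in doc)
    if containing == 0:
        # The formula would divide by zero; treated as an error in this sketch.
        raise ValueError(f"term {term!r} does not appear in any document")
    return math.log(len(tokenised_docs) / containing)  # natural log (assumed)


tokenised_docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the floor".split(),
]

idf_the = inverse_document_frequency("the", tokenised_docs)
idf_cat = inverse_document_frequency("cat", tokenised_docs)
print(idf_the)            # 0.0, because "the" appears in both documents
print(f"{idf_cat:.3f}")   # 0.693, because "cat" appears in one of two
```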
Combining TF and IDF
$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$
The result is a high score for words that are frequent in a specific document but rare across the collection: exactly the words most likely to characterise that document's content.
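Putting the two components together, the sketch below scores every term in every document using exactly the unsmoothed formulas from this lesson. The function name `tf_idf` and the natural logarithm are assumptions. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalisation by default, so their scores will generally differ from these.

```python
import math
from collections import Counter


def tf_idf(tokenised_docs):
    """Return one {term: score} dictionary per document."""
    n_docs = len(tokenised_docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenised_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenised_docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


tokenised_docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the floor".split(),
]
scores = tf_idf(tokenised_docs)
print(scores[0]["the"])           # 0.0: frequent here, but present everywhere
print(f"{scores[0]['cat']:.3f}")  # 0.116: appears only in the first document
```

Even in this tiny corpus the effect is visible: "the" is the most frequent word in both documents but scores zero, while "cat" and "mat" carry all the weight for the first document.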
An Intuition Example
In a corpus of news articles, the word "the" appears in every article. Its IDF is 0: it is useless for distinguishing articles. The word "cryptocurrency" appears in only a few articles. It gets a high IDF. If it also appears frequently in a particular article, its TF-IDF score for that article will be high, correctly signalling that this article is about cryptocurrency.
Practical Applications of BoW and TF-IDF
Despite their simplicity, these methods have powered production systems for years:
- Search engines: TF-IDF is still a component of ranking algorithms, though modern engines combine it with neural approaches
- Document classification: spam filters, news categorisation, sentiment analysis on short text
- Document similarity: compare TF-IDF vectors using cosine similarity to find related documents
- Information retrieval: retrieve the most relevant documents for a query
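For the document similarity case, cosine similarity measures the angle between two vectors and ignores their length. The sketch below is a plain-Python version under the assumption that both vectors have the same dimensions; the function name `cosine_similarity` is chosen here for illustration, and the example reuses the two count vectors from earlier in the lesson, although TF-IDF vectors are handled in the same way.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# Count vectors over the vocabulary (the, cat, sat, on, mat, dog, floor).
doc1 = [2, 1, 1, 1, 1, 0, 0]
doc2 = [2, 0, 1, 1, 0, 1, 1]

similarity = cosine_similarity(doc1, doc2)
print(f"{similarity:.2f}")  # 0.75
```

Much of that 0.75 comes from the shared word "the", which is one reason TF-IDF weights are usually preferred over raw counts for similarity.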
What These Methods Cannot Do
The bag of words model has a fundamental, irrecoverable limitation: it loses word order completely.
"The dog bit the man" and "The man bit the dog" have identical BoW representations. They mean completely different things.
This is not a minor flaw. Language is deeply sequential. Meaning emerges from the order of words, not just their presence. Negation ("not good" vs "good"), modification ("almost finished" vs "finished") and context all depend on order.
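The loss of word order is easy to verify. In this small sketch the helper `bag_of_words` is hypothetical; it lowercases the sentence and splits on whitespace.

```python
from collections import Counter


def bag_of_words(sentence):
    """Word counts for a sentence, ignoring case and word order."""
    return Counter(sentence.lower().split())


bow_a = bag_of_words("The dog bit the man")
bow_b = bag_of_words("The man bit the dog")

print(bow_a == bow_b)  # True: the two sentences are indistinguishable
```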
TF-IDF improves the weighting of words but does not solve the order problem. Both methods are also unable to capture semantic similarity between words: "happy" and "joyful" are treated as completely unrelated tokens, even though they mean nearly the same thing.
These limitations motivated the development of word embeddings, which we will cover next.
Quiz: Why does a word that appears in every document receive a TF-IDF score of zero? What fundamental limitation do both Bag of Words and TF-IDF share and why does it matter?