Tokenisation and Text Preprocessing
Before a machine learning model can learn anything from text, that text needs to be transformed into a form the model can work with. Computers do not understand words, sentences or paragraphs: they understand numbers. The pipeline that takes raw human-written text and prepares it for a model is called preprocessing and it begins with one fundamental operation: tokenisation.
The Challenge of Raw Text
Raw text is messy. Consider a single tweet: it might contain hashtags, URLs, emoji, abbreviations, deliberate misspellings and mixed casing. A news article will have punctuation, numbers, proper nouns and technical terms. A product review might include sarcasm, regional slang and rating stars embedded in prose.
Before any of this is useful to a model, you need to answer a deceptively simple question: what is a unit of meaning?
Common sources of noise in raw text include:
- Punctuation: does the comma in "well, that worked" carry meaning or is it just a breath mark?
- Casing: "Python" (the language) and "python" (the snake) are different, but "The" and "the" are functionally the same in most tasks
- Contractions and abbreviations: "don't" vs "do not", "Dr." vs "Doctor"
- Special characters: URLs, email addresses, HTML tags and emoji
- Whitespace: multiple spaces, line breaks and tabs
Each of these requires a deliberate decision. There is no universally correct answer: the right preprocessing strategy depends on the task.
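As one illustration of such decisions, the sketch below handles three of the noise sources listed above: HTML tags, URLs and irregular whitespace. The function name `clean_text`, the `[URL]` placeholder and the regular expressions are assumptions made for this example, not a standard recipe; a different task might keep URLs or strip emoji instead.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")   # simplistic URL pattern (assumption)
TAG_RE = re.compile(r"<[^>]+>")        # anything that looks like an HTML tag
WS_RE = re.compile(r"\s+")             # runs of spaces, tabs and line breaks


def clean_text(raw: str) -> str:
    """Strip HTML tags, replace URLs with a placeholder, collapse whitespace."""
    text = html.unescape(raw)          # "&amp;" becomes "&"
    text = TAG_RE.sub(" ", text)       # drop tags such as <b> or </b>
    text = URL_RE.sub("[URL]", text)   # keep the fact that a link was present
    return WS_RE.sub(" ", text).strip()


sample = "Loved it!!   <b>Best</b> phone ever \n see https://example.com/review &amp; more"
print(clean_text(sample))
# Loved it!! Best phone ever see [URL] & more
```

Replacing a URL with a placeholder rather than deleting it is one possible choice: it preserves the signal that a link was there without adding thousands of unique strings to the vocabulary.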
What is Tokenisation?
Tokenisation is the process of splitting a piece of text into smaller units called tokens. These tokens become the basic units of input for a model.
The simplest approach is to split on whitespace, turning "natural language processing" into ["natural", "language", "processing"]. But this immediately runs into problems. How do you handle "isn't"? Is it one token or two? How do you handle "New York" if you want to treat it as a single entity?
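The whitespace problem is easy to see in a few lines of Python. The regular expression in the second half is only one possible rule set, chosen for this sketch: it keeps contractions whole and splits punctuation into separate tokens.

```python
import re

text = "New York isn't cheap, is it?"

# Naive approach: split on runs of whitespace. Punctuation stays glued to words.
whitespace_tokens = text.split()

# One alternative: words (keeping internal apostrophes) or single punctuation marks.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
regex_tokens = TOKEN_RE.findall(text)

print(whitespace_tokens)  # ['New', 'York', "isn't", 'cheap,', 'is', 'it?']
print(regex_tokens)       # ['New', 'York', "isn't", 'cheap', ',', 'is', 'it', '?']
```

Neither version treats "New York" as a single entity; that requires extra knowledge, such as a list of multi-word expressions.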
There are three main tokenisation strategies, each with different trade-offs.
Word-level Tokenisation
Split text into individual words. This is intuitive and easy to implement, but it creates problems:
- Vocabulary explosion: a realistic corpus might have hundreds of thousands of unique words, including typos, inflections and rare words. The model needs an entry for each one.
- Out-of-vocabulary (OOV) words: any word not seen during training cannot be represented. A word like "COVID-19" would have been OOV for any model trained before 2020.
- Language variation: "run", "running", "ran" and "runner" are all different tokens, even though they share a root meaning.
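A toy vocabulary makes the last two problems concrete. In the sketch below, the three-sentence corpus and the `[UNK]` placeholder id are assumptions chosen for illustration: every unseen word collapses to the same id, and related forms such as "run" and "running" receive unrelated ids.

```python
from collections import Counter

train_corpus = ["the runner ran", "she is running", "they run daily"]
counts = Counter(word for sentence in train_corpus for word in sentence.split())

# Map each word to an integer id; id 0 is reserved for unknown words.
UNK = "[UNK]"
vocab = {UNK: 0}
for word in counts:
    vocab[word] = len(vocab)


def encode(sentence: str) -> list[int]:
    """Look up each whitespace-separated word, falling back to the UNK id."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]


print(len(vocab))                              # 10
print(encode("she ran to COVID-19 briefing"))  # [4, 3, 0, 0, 0]
print(vocab["run"], vocab["running"])          # 8 6
```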
Character-level Tokenisation
Split text into individual characters. The vocabulary is tiny (for English, 26 letters plus punctuation and digits) and OOV problems all but disappear. But the sequences become very long, making it harder for models to learn long-range dependencies. Each token also carries very little information on its own: the character "a" says almost nothing about meaning.
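The trade-off between vocabulary size and sequence length can be checked directly. A minimal sketch, reusing the example phrase from earlier:

```python
text = "natural language processing"

word_tokens = text.split()
char_tokens = list(text)  # every character, including spaces, is a token

print(len(word_tokens))       # 3
print(len(char_tokens))       # 27
print(len(set(char_tokens)))  # number of distinct characters in this phrase
```

The same phrase is nine times longer as a character sequence, while its character vocabulary stays small however much text is added.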
Subword Tokenisation
Subword tokenisation is the approach used by almost all modern NLP systems. The idea is to break rare or unknown words into smaller meaningful pieces while keeping common words intact.
For example, the word "tokenisation" might be split into ["token", "isation"] and "unhappiness" into ["un", "happiness"]. The word "the" stays as ["the"].
Popular subword algorithms include:
- Byte Pair Encoding (BPE): used by GPT models. Starts with individual characters and iteratively merges the most frequent pairs until a target vocabulary size is reached.
- WordPiece: used by BERT. Similar to BPE but uses a likelihood criterion rather than raw frequency.
- SentencePiece: a language-agnostic toolkit (it implements BPE and a unigram language model) that works directly on raw text without requiring pre-tokenisation.
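Subword tokenisation strikes a balance between the two extremes: the vocabulary stays a manageable size, rare and unseen words can still be represented by falling back to smaller pieces, and many of those pieces are meaningful units such as stems and affixes.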
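The BPE merge loop described above can be sketched in a few dozen lines. This is a toy version under stated assumptions: the four-word corpus and its frequencies are made up, ties are broken by first occurrence, and production implementations typically add details omitted here, such as end-of-word markers or byte-level symbols. The function names are hypothetical.

```python
from collections import Counter


def get_pair_counts(words: dict[tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs: dict[str, int], num_merges: int):
    """Start from characters and apply the most frequent merge num_merges times."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_corpus, num_merges=4)
print(merges)
# [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(list(segmented))
# [('low',), ('low', 'e', 'r'), ('n', 'e', 'w', 'est'), ('w', 'i', 'd', 'est')]
```

After four merges the frequent word "low" is a single token, while "lower" is represented as "low" plus leftover characters: common words stay intact and rarer ones are built from pieces.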
Normalisation
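Alongside tokenisation comes normalisation: bringing text into a consistent form. Some steps, such as lowercasing, are usually applied to the raw string before tokenising; others, such as stemming and lemmatisation, operate on individual tokens afterwards.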
Lowercasing
Converting all text to lowercase is the simplest normalisation step. "Apple", "apple" and "APPLE" become the same token. This reduces vocabulary size and ensures consistent treatment. The downside is lost information: "Apple" (the company) and "apple" (the fruit) are no longer distinguishable.
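A quick check of the effect on vocabulary size, using a made-up token list:

```python
tokens = ["Apple", "apple", "APPLE", "released", "an", "Apple", "Watch"]

cased_vocab = set(tokens)
lowered_vocab = {token.lower() for token in tokens}

print(len(cased_vocab))    # 6
print(len(lowered_vocab))  # 4
```

The three spellings of "apple" collapse into one entry, which is exactly the gain and the loss: the company name in "Apple Watch" now looks the same as the fruit.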
Stemming
Stemming strips suffixes from words to reduce them to a common root. "running" and "runs" both become "run". Rule-based stemmers do not catch every related form, though: the Porter stemmer leaves "runner" unchanged. The stem does not have to be a real word: "studies" becomes "studi" under the Porter stemmer. Stemming is fast but crude and can produce odd results.
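To show the flavour of rule-based suffix stripping, here is a deliberately crude stemmer. It is not the Porter algorithm: the four rules, the minimum stem length of three and the name `crude_stem` are all assumptions made for this sketch, and a real project would use an established implementation instead.

```python
# (suffix, replacement) pairs, tried in order; the first match wins.
RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]
VOWELS = set("aeiou")


def crude_stem(word: str) -> str:
    """Strip one suffix, then undo consonant doubling ("runn" -> "run")."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
                stem = stem[:-1]
            return stem
    return word


for word in ["running", "runs", "runner", "studies", "caring"]:
    print(word, "->", crude_stem(word))
# running -> run
# runs -> run
# runner -> runner
# studies -> studi
# caring -> car
```

The last line shows how blunt such rules can be: "caring" is cut down to "car", an unrelated word. More careful stemmers add rules to avoid some of these collisions, at the cost of extra complexity.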
Lemmatisation
Lemmatisation reduces words to their dictionary root form (the lemma). "running" becomes "run", "better" becomes "good", "was" becomes "be". Unlike stemming, the output is always a real word. Lemmatisation requires knowledge of the language (a lexicon, and usually the word's part of speech, since "better" only maps to "good" when it is used as an adjective) and is slower, but produces cleaner results.
| Input | Stemmed (Porter) | Lemmatised |
|---|---|---|
| running | run | run |
| studies | studi | study |
| better | better | good |
| caring | care | care |
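Because lemmatisation is a lookup against linguistic knowledge rather than string surgery, it can be mimicked with a small table keyed by word and part of speech. The table below is a hand-written stand-in for a real lexicon, and the tag names and the `lemmatise` function are assumptions for this sketch.

```python
# Tiny hand-written lexicon: (surface form, part of speech) -> lemma.
LEMMA_TABLE = {
    ("running", "VERB"): "run",
    ("studies", "NOUN"): "study",
    ("studies", "VERB"): "study",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("was", "VERB"): "be",
    ("caring", "VERB"): "care",
}


def lemmatise(word: str, pos: str) -> str:
    """Return the lemma if the lexicon knows it, else the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


print(lemmatise("better", "ADJ"))   # good
print(lemmatise("better", "ADV"))   # well
print(lemmatise("Was", "VERB"))     # be
print(lemmatise("tokens", "NOUN"))  # tokens (not in this toy lexicon)
```

The two results for "better" show why part-of-speech information matters, and the last line shows the cost of relying on a lexicon: coverage is only as good as the resource behind it.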
Stopword Removal
Stopwords are common words that carry little semantic meaning: "the", "a", "is", "in", "of". Removing them reduces the size of the vocabulary and focuses the model on content-bearing words. This is valuable for tasks like document retrieval or topic modelling, but can be harmful for tasks that depend on grammatical structure: "not good" loses its negation if "not" is removed as a stopword.
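The negation problem is easy to reproduce. The stopword set below is a small made-up list that is assumed to include "not", as some off-the-shelf lists do; the second call shows one possible mitigation, which is to keep negation words.

```python
STOPWORDS = {"the", "a", "is", "in", "of", "not"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop any token whose lowercase form is in the stopword set."""
    return [token for token in tokens if token.lower() not in stopwords]


tokens = "the plot of the film is not good".split()

print(remove_stopwords(tokens))
# ['plot', 'film', 'good']
print(remove_stopwords(tokens, STOPWORDS - {"not"}))
# ['plot', 'film', 'not', 'good']
```

For a sentiment classifier, the first output reads as praise for a film the reviewer disliked.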
Putting It Together
A typical basic preprocessing pipeline looks like this:
- Lowercase the text
- Remove or replace special characters (URLs, HTML tags, etc.)
- Tokenise into words or subwords
- Optionally remove stopwords
- Optionally apply stemming or lemmatisation
The output is a sequence of cleaned tokens, ready to be converted into numbers. The next step, representation, is where tokens become vectors a model can actually learn from.
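The steps listed above can be sketched as a single function. Everything here is an assumption chosen for illustration (the regular expressions, the five-word stopword set, the `preprocess` name and its parameters); the point is the order of the steps and the fact that the optional ones are switches rather than fixed behaviour.

```python
import re
from typing import Callable, Optional

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")  # word-level tokens, contractions kept
STOPWORDS = frozenset({"the", "a", "is", "in", "of"})


def preprocess(
    text: str,
    remove_stopwords: bool = True,
    stem: Optional[Callable[[str], str]] = None,
) -> list[str]:
    text = text.lower()                    # 1. lowercase
    text = URL_RE.sub(" ", text)           # 2. remove URLs...
    text = TAG_RE.sub(" ", text)           #    ...and HTML tags
    tokens = TOKEN_RE.findall(text)        # 3. tokenise into words
    if remove_stopwords:                   # 4. optional stopword removal
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem is not None:                   # 5. optional stemming or lemmatisation
        tokens = [stem(t) for t in tokens]
    return tokens


review = "The battery is <b>NOT</b> good. Details in https://example.com/r/42"

print(preprocess(review))
# ['battery', 'not', 'good', 'details']
print(preprocess(review, remove_stopwords=False))
# ['the', 'battery', 'is', 'not', 'good', 'details', 'in']
```

Any function from token to token can be passed as `stem`, so a stemmer or a lemmatiser can be swapped in without changing the rest of the pipeline.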
Key Takeaway
Text preprocessing is not a one-size-fits-all process. The choices you make (which tokenisation strategy to use, whether to lowercase, whether to stem or lemmatise, whether to remove stopwords) depend on your task, your language and your model. A good NLP practitioner understands each tool and when to reach for it.
Quiz: What is the main advantage of subword tokenisation over word-level tokenisation? Under what circumstances might removing stopwords be harmful rather than helpful?