| Rule-based | Probabilistic / ML | |
|---|---|---|
| Method | Hand-written rules | Learn from annotated corpus |
| Data | Not needed | Treebank required |
| Pros | Interpretable, works small data | Scales, adapts |
| Cons | Doesn't scale, language-specific | Needs annotation |
| Use when | No data, specific domain | Standard case for English |
What is POS tagging?
POS = the role a word plays in a sentence (noun, verb, adjective, …).
Goal: determine the POS tag for a particular instance of a word — same word can be different POS in different sentences (book = noun or verb).
Uses:
- Enhanced regex:
(Det) Adj* N+catches multiword expressions - Text-to-speech (how to pronounce lead)
- Input for full syntax parsing
- Backoff for downstream tasks (NER → NN, sentiment → JJ)
Two approaches
Modern default: ML with CRF / BiLSTM-CRF / fine-tuned transformer.
The HMM math
For words w₁…wₙ and tags t₁…tₙ:
Emission: P(wᵢ | tᵢ) — how often is this word tagged this way?
P(flies | VBZ)= times flies was a verb / total verb tags
Transition: P(tᵢ | tᵢ₋₁) — how often does this tag follow that tag?
P(VBZ | NN)= how often a verb follows a noun
Pick the sequence that maximises:
argmax ∏ᵢ P(wᵢ | tᵢ) · P(tᵢ | tᵢ₋₁)
Solved efficiently by Viterbi (dynamic programming).
Punchline: deep learning is, at heart, learning to count.
The accuracy trap
State-of-the-art: 94–97% on clean English. Looks great, isn't.
5% error = one wrong tag every 20 words = one error per sentence.
| Corpus | Accuracy |
|---|---|
| English WSJ | ~97% |
| English Brown | ~95% |
| Spanish AnCora | ~93% |
| English Twitter | ~89% |
| Chinese | ~75% |
Easy tokens (the, a, punctuation) inflate the number — real accuracy on hard cases is lower.
Sources of error
| Source | Example |
|---|---|
| Lexicon gap | new words, slang, domain jargon |
| Unknown word | misspellings, rare/unseen |
| Multi-word expressions | take off, by the way, out of control |
| Structural ambiguity | "Visiting relatives can be boring." |
| Annotation standard | guidelines disagree at edges |
| Punctuation / tokenisation | upstream tokenizer fights you |
Fix: richer features (suffix, capitalisation, prev/next word, embeddings) + sequential models (CRF, BiLSTM, BERT).
Error compounding
Each layer multiplies the previous one's error:
TOK (95%) × POS (93%) × DP (90%) × CLS (85%) ≈ 68%
Two takeaways:
- Don't trust headline accuracies on individual NLP tasks — the pipeline eats them.
- End-to-end transformers skip the explicit pipeline → no layer-wise compounding → real structural advantage.
Two types of parsing
| Constituency | Dependency | |
|---|---|---|
| What | Phrase structure tree | Word-to-word arrows |
| Levels | Terminal (words), pre-terminal (POS), non-terminal (S/NP/VP) | Each word has one head, each arc labelled |
| Used by | Linguistic theory, older NLP | Modern parsers (spaCy, Stanza), translation |
import spacy
nlp = spacy.load("en_core_web_sm")
for tok in nlp("I saw the big dog"):
print(tok.text, tok.dep_, "→", tok.head.text)
# I nsubj → saw
# saw ROOT → saw
# the det → dog
# big amod → dog
# dog dobj → sawCFG → PCFG
CFG (Context-Free Grammar) = rules like S → NP VP, VP → V NP.
Problem: "Fed raises interest 0.5 percent" has 36 valid parses by typical English CFG rules. All grammatical, most absurd.
PCFG (Probabilistic CFG) = same rules + probabilities learned from a treebank. Pick the parse with the highest product of rule probabilities.
P(rule) = count(rule applied) / count(LHS expanded)
Same machinery as the HMM tagger, one level up. Replaced today by neural parsers (transition-based, graph-based, transformer-based).
