Maria Aguilera

Definition. Automatic analysis and understanding of natural language for the communication between computers and humans.

Natural language = how humans actually speak. Artificial language = Java, Python, SQL.
NLP ≠ Neuro Linguistic Programming. Same letters, very different thing.
NLP = AI + Computational Linguistics. Modern stack = ML / Deep Learning + CL.
NLP = systems that understand language. NLP × ML = systems that learn to understand language.

Task	What it does
Question Answering	Natural-language queries over a dataset.
Machine Translation	Language A → language B (Google Translate, DeepL).
Text generation	Autocomplete, Smart Compose, full reports.
Fake news detection	Classify text as truthful or misleading.
Summarization	Long text → short text, meaning preserved.
Information extraction	Pull structured fields from free text.
Text classification	Spam vs not-spam, sentiment, intent.

The underlying question: if I have information in natural language, can I use it to solve a business problem?
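As a rough illustration of the "information extraction" row, here is a minimal sketch that pulls structured fields from a free-text message using regular expressions. The message, the field names, and the patterns are all invented for this example; real extraction systems are usually far more robust than three regexes.

```python
import re

# Hypothetical free-text support message (not from the original notes).
message = "Order 48213 arrived damaged on 2024-03-15. Please refund 59.90 EUR."

# Illustrative patterns that only cover this toy format.
patterns = {
    "order_id": r"Order (\d+)",
    "date": r"(\d{4}-\d{2}-\d{2})",
    "amount": r"(\d+\.\d{2}) EUR",
}

def extract_fields(text):
    """Return a dict of field name -> first match, or None if absent."""
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1) if match else None
    return fields

record = extract_fields(message)
print(record)
```

Once the text is a record like this, the business problem (refunds, routing, reporting) becomes an ordinary data problem.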

#	Level	Question	Status
1	Morphology	what are the tokens?	🟢 mostly solved
2	Syntax	how do words relate?	🟢 mostly solved
3	Semantics	what do words together mean?	🟡 making progress
4	Pragmatics	what's the speaker's intent?	🔴 still really hard
5	Inference	what's true that wasn't said?	🔴 still really hard

Read it as a ladder. Every NLP task lives mainly on one rung, though some span two. You can't do rung n reliably without rung n−1.

Sentence: "The dog is chasing the boy in the playground."

Level	Output
1 Morphology	`[The, dog, is, chasing, the, boy, in, the, playground, .]`
2 Syntax	`chasing` → `dog` (nsubj), `chasing` → `boy` (obj)
3 Semantics	`{agent: dog, action: chasing, patient: boy, location: playground}`
4 Pragmatics	"narrating a scene"
5 Inference	"the boy is probably scared"

The bottom is automation. The top is open research.
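The first three rungs of the example can be sketched in a few lines. This is a toy under stated assumptions: the tokenizer is a plain regex, the dependency arcs are written by hand (a real system would get them from a parser), and the `obl` arc and the `ROLE_OF` mapping are illustrative choices that only hold for simple active-voice sentences.

```python
import re

sentence = "The dog is chasing the boy in the playground."

# Level 1 (morphology): split into word and punctuation tokens.
tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Level 2 (syntax): dependency arcs as (head, relation, dependent).
# Hand-written here; the "obl" arc is an assumed attachment.
arcs = [
    ("chasing", "nsubj", "dog"),
    ("chasing", "obj", "boy"),
    ("chasing", "obl", "playground"),
]

# Level 3 (semantics): map grammatical relations to semantic roles.
ROLE_OF = {"nsubj": "agent", "obj": "patient", "obl": "location"}

def to_frame(arcs):
    """Build a role frame from arcs that all share the same head verb."""
    frame = {"action": arcs[0][0]}
    for head, relation, dependent in arcs:
        frame[ROLE_OF[relation]] = dependent
    return frame

frame = to_frame(arcs)
print(tokens)
print(frame)
```

Levels 4 and 5 have no comparable few-line sketch, which is the point of the ladder.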

Three reasons language is brutal for computers:

Common sense is omitted. "help me with the window" — open it? close it? clean it? The speaker assumes you know.
Productivity → combinatorial explosion. Limitless new sentences. You can't memorise; the model must generalise.
Ambiguity is the killer. "I saw the man with the telescope." Two valid readings — humans pick one without noticing.
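Two of these points can be made concrete with a small sketch. The vocabulary size is an assumed round number, and the two "readings" are hand-written labels for the attachment choices, not parser output.

```python
# Productivity: with a vocabulary of V words there are V**n possible
# word sequences of length n. Most are ungrammatical, but the space a
# model must generalise over still grows exponentially.
vocabulary_size = 10_000  # assumed figure, for illustration only
for length in (1, 2, 5, 10):
    print(length, vocabulary_size ** length)

# Ambiguity: one string, two attachment sites for "with the telescope".
sentence = "I saw the man with the telescope."
readings = {
    "instrument": ("saw", "telescope"),  # I used the telescope to see him
    "modifier": ("man", "telescope"),    # the man is the one holding it
}
for name, (head, dependent) in readings.items():
    print(f"{name}: '{dependent}' attaches to '{head}'")
```

Both structures are valid for the same surface string; nothing in the text itself says which one the speaker meant.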

The fry-an-egg test. Explain it to someone who has never seen an egg. Now imagine explaining it without a shared language. That's NLP's grounding problem.

Deep learning works shockingly well — and has real cracks:

Bias — if the corpus isn't curated, the model inherits and amplifies it.
Black box — induction at scale, not logical reasoning. Answers, not explanations.
No causal grounding — correlation patterns, not "because."
No real-world semantics or pragmatics — the model has read about coffee, never tasted it.
Frozen world model — BERT (trained in 2018) doesn't know what COVID is. A model knows nothing about events after its training cutoff.
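The frozen-world-model point can be illustrated with a toy vocabulary that is fixed at training time. The corpus sentences are invented, and whole-word lookup is a simplification: models such as BERT split unseen words into subword pieces, but still have no learned meaning for a concept that never appeared in training.

```python
import re

# Toy "training corpus" frozen at some cutoff (invented sentences).
training_corpus = [
    "The flu season starts in winter.",
    "Doctors recommend a vaccine for the flu.",
]

def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

# The vocabulary is fixed once training ends.
vocabulary = {tok for doc in training_corpus for tok in tokenize(doc)}

def unknown_words(text):
    """Words the frozen vocabulary has never seen."""
    return [tok for tok in tokenize(text) if tok not in vocabulary]

new_text = "Doctors recommend a vaccine for covid."
print(unknown_words(new_text))
```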

The honest take: reasoning over text is exactly where current models still struggle.

Knowledge-based

Big explicit DB of facts/rules · IBM Watson · logical, slow to scale

Statistical / DL

Transformers · LLMs · "give me data, I'll learn anything" · black-box
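A minimal sketch of the contrast, using spam detection as the task. Both classifiers are deliberately tiny and the training data is invented; the keyword list and the count-based scoring are stand-ins for the two traditions, not how production systems are built.

```python
from collections import Counter

# Knowledge-based: a hand-written rule. Transparent, but someone has to
# write and maintain every rule.
SPAM_KEYWORDS = {"prize", "winner", "free"}

def rule_based_is_spam(text):
    return any(word in SPAM_KEYWORDS for word in text.lower().split())

# Statistical: learn word counts from labelled examples (invented data).
training_data = [
    ("claim your free prize now", True),
    ("you are a winner claim now", True),
    ("meeting moved to monday", False),
    ("see you at the meeting", False),
]

def train(examples):
    """Count how often each word appears in spam vs non-spam messages."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in examples:
        (spam_counts if is_spam else ham_counts).update(text.split())
    return spam_counts, ham_counts

def learned_is_spam(text, model):
    """Flag as spam when the words were seen more often in spam."""
    spam_counts, ham_counts = model
    words = text.lower().split()
    spam_score = sum(spam_counts[w] for w in words)
    ham_score = sum(ham_counts[w] for w in words)
    return spam_score > ham_score

model = train(training_data)
print(rule_based_is_spam("claim it now"))      # no keyword matches
print(learned_is_spam("claim it now", model))  # "claim", "now" seen in spam
```

The rule misses a message its author did not anticipate; the learned model catches it but cannot say why beyond "these words co-occurred with spam".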

Historically knowledge-based was on top; now statistics is.
The pendulum has swung before. It will swing again.
Don't fire the linguist. The cognitive chasm — between pattern-matching text and actually understanding language — is still wide open.

Problem	Level it lives on
Tokenizer for Japanese / Chinese	1 · Morphology
Parse a sentence to find subject/object	2 · Syntax
Sentiment / classification / search ranking	3 · Semantics
Translation between high-resource languages	2 + 3
Chatbot intent / dialogue	4 · Pragmatics
Summarise medical reports + answer reasoning questions	5 · Inference
Fake news / contradiction detection	3 + 5

Rule of thumb. If the answer requires information that isn't in the text, you're at level 5. Lower your expectations or add a knowledge source.
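The rule of thumb can be written as a small triage helper. The status strings are copied from the five-level table; the function name `triage` and the `needs_outside_knowledge` flag are hypothetical, and taking the highest level touched as the difficulty is an assumption based on the ladder reading.

```python
# Status per level, copied from the five-level table in these notes.
LEVEL_STATUS = {
    1: "mostly solved",
    2: "mostly solved",
    3: "making progress",
    4: "still really hard",
    5: "still really hard",
}

def triage(levels, needs_outside_knowledge=False):
    """Return (hardest level, status) for a problem touching `levels`."""
    hardest = max(levels)
    if needs_outside_knowledge:
        hardest = 5  # rule of thumb: missing information means level 5
    return hardest, LEVEL_STATUS[hardest]

print(triage([2, 3]))                             # high-resource translation
print(triage([3], needs_outside_knowledge=True))  # answer is not in the text
```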

Part 1 · Introduction to NLP — Cheat Sheet

What NLP is

What NLP solves

The 5 levels (the spine)

The example climbs the ladder

Why language is hard

Why ML alone isn't enough

The two traditions (and the pendulum)

Decision: where does your problem sit?