| Measure | Captures |
|---|---|
| Positive / negative tone | Optimism vs pessimism vocabulary. |
| Uncertainty | "Maybe", "approximately", "uncertain"... |
| Litigious | Legal-defensive vocabulary. |
| Modal-weak | "Could", "might", "perhaps". |
| Modal-strong | "Must", "will", "definitely". |
| Document length | Total word count. Some of the literature treats longer filings as a sign of obfuscation. |
The question
After a public company is hacked, how does the language of its annual report change — and does that change carry information about future performance?
- Companies want to signal calm to markets.
- Disclosure rules force them to mention the breach.
- They have an incentive to mention it in the softest possible language.
The hypothesis: language reveals what the numbers hide.
The data
- 645 confirmed U.S. corporate data breaches between 2005 and 2019.
- 29,000+ 10-K annual reports filed with the SEC over the same window (matched on CIK).
- Breach data: Privacy Rights Clearinghouse + manual cross-references.
- 10-K texts: SEC EDGAR full-text downloads.
Sampling design: each breached firm gets a matched control firm — same sector, similar size, no recorded breach in the window.
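A minimal sketch of how such a matched control could be picked, assuming size is proxied by total assets and matching is greedy, one-to-one, and without replacement. The `Firm` fields, the helper names, and the CIKs are hypothetical illustrations, not the project's actual code or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Firm:
    cik: str        # SEC Central Index Key (made-up values below)
    sector: str     # sector label (assumed field)
    assets: float   # size proxy (assumption: total assets)
    breached: bool  # True if a breach is recorded in the window


def pick_control(treated, candidates, used):
    """Return the same-sector, never-breached firm closest in size, or None."""
    pool = [
        f for f in candidates
        if not f.breached and f.sector == treated.sector and f.cik not in used
    ]
    if not pool:
        return None  # no eligible control: the breached firm stays unmatched
    return min(pool, key=lambda f: abs(f.assets - treated.assets))


def match_all(firms):
    """Greedy one-to-one matching: each control is used at most once."""
    used = set()
    pairs = {}
    for treated in (f for f in firms if f.breached):
        control = pick_control(treated, firms, used)
        if control is not None:
            used.add(control.cik)
            pairs[treated.cik] = control.cik
    return pairs


# Toy universe, invented for illustration only.
firms = [
    Firm("0001", "retail", 500.0, True),
    Firm("0002", "retail", 450.0, False),
    Firm("0003", "retail", 5000.0, False),
    Firm("0004", "banking", 520.0, False),
]
pairs = match_all(firms)
print(pairs)
```

Here the breached retailer is paired with the retailer closest in size; the bank is closer in assets than the large retailer but is excluded by the same-sector rule.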
Linguistic features
Standard finance-NLP measures, summarized in the table at the top: word-list categories from the Loughran–McDonald financial-tone dictionaries, plus raw document length.
Each firm-year filing is reduced to a vector of these scores.
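As a rough illustration of how a filing becomes a score vector, the sketch below counts dictionary hits and divides by total words. The word sets are tiny stand-ins chosen for the example, not the real Loughran–McDonald lists, and `score_filing` is a hypothetical helper name.

```python
import re
from collections import Counter

# Tiny illustrative stand-ins, NOT the actual Loughran-McDonald word lists.
WORD_LISTS = {
    "positive": {"strong", "improved", "success"},
    "negative": {"loss", "breach", "failure"},
    "uncertainty": {"maybe", "approximately", "uncertain"},
    "litigious": {"lawsuit", "plaintiff", "litigation"},
    "modal_weak": {"could", "might", "perhaps"},
    "modal_strong": {"must", "will", "definitely"},
}


def score_filing(text):
    """Return the share of words in each category, plus total word count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens)
    counts = Counter(tokens)
    scores = {
        name: (sum(counts[w] for w in words) / total if total else 0.0)
        for name, words in WORD_LISTS.items()
    }
    scores["length"] = total  # document length is kept as its own feature
    return scores


sample = "We might face a lawsuit, but results improved and we will remain strong."
vec = score_filing(sample)
print(vec)
```

Normalizing by total words keeps the tone scores comparable across filings of very different lengths, while the raw length is carried separately.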
The methodology
A difference-in-differences design:
- Pre-period: filings before the breach.
- Post-period: filings after the breach.
- Treated group: breached firms.
- Control group: matched non-breached firms.
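The difference in language change (post minus pre) between treated and control firms estimates the causal effect of the breach on tone. Differencing against the controls nets out time trends and industry-wide shifts, provided the two groups would otherwise have followed parallel trends.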
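The arithmetic behind that estimate can be sketched with four cell means. The tone scores below are invented purely to show the calculation, and the simple means-based estimator is an assumption; a regression with firm and year fixed effects would be the more common implementation.

```python
from statistics import mean

# (group, period, positive-tone score): invented numbers for illustration.
rows = [
    ("treated", "pre", 0.010), ("treated", "pre", 0.012),
    ("treated", "post", 0.016), ("treated", "post", 0.018),
    ("control", "pre", 0.011), ("control", "pre", 0.013),
    ("control", "post", 0.012), ("control", "post", 0.014),
]


def cell_mean(group, period):
    """Average score for one group in one period."""
    return mean(v for g, p, v in rows if g == group and p == period)


def diff_in_diff():
    """(post - pre) change for treated firms minus the same change for controls."""
    treated_change = cell_mean("treated", "post") - cell_mean("treated", "pre")
    control_change = cell_mean("control", "post") - cell_mean("control", "pre")
    return treated_change - control_change


estimate = diff_in_diff()
print(round(estimate, 6))
```

In this toy data the treated firms' tone rises by about 0.006 and the controls' by about 0.001, so roughly 0.005 of the change is attributed to the breach.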
Then a second regression: does abnormal positive tone predict future return on assets (ROA) or stock returns?
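A bare-bones version of that second stage is a one-variable least-squares fit. Two assumptions are made for the sketch: "abnormal" tone is taken to be a firm's tone minus its matched control's tone, and the data points are invented. The real specification would include control variables, and `ols_fit` is a hypothetical helper.

```python
def ols_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x (single regressor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented toy data.
# abnormal_tone: firm tone minus matched-control tone (assumed definition),
# in positive words per 1,000. future_roa: next-year ROA in percent.
abnormal_tone = [-2.0, -1.0, 0.0, 1.0, 2.0]
future_roa = [5.0, 4.5, 3.0, 2.5, 1.0]

slope, intercept = ols_fit(abnormal_tone, future_roa)
print(slope, intercept)
```

A negative slope in a fit like this is the pattern the question is probing: firms whose language is more upbeat than their peers' go on to report weaker results.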
The headline findings
After a breach, treated firms' 10-Ks:
- More positive tone (a statistically significant increase).
- Less uncertainty language.
- Longer documents overall.
- More litigious language (this is forced — they have to disclose lawsuits).
Consistent with strategic obfuscation — drowning the bad news in softer, longer prose.
The kicker: abnormally positive tone predicts worse future earnings. The language contains information the numbers don't yet show.
What I learned
- Match-and-difference designs are a credible way to support causal claims from observational financial data.
- Loughran–McDonald is the right dictionary for finance — general-purpose sentiment (VADER, etc.) underperforms.
- Document length is a feature, not a control. Long 10-Ks correlate with bad outcomes.
- Awards opened doors — INNCYBER Innovation Award 2019 + 2020. The methodology mattered more than the headline.