AI that only knows things from before the Great Depression: Introducing talkie, a 13B language model trained on 260B tokens of historical pre-1931 English text.
- A 13B language model trained exclusively on pre-1931 text, talkie can code in Python, can't discuss the New Deal, and still occasionally hallucinates Franklin Roosevelt's presidency into existence, because keeping history out of history is harder than it sounds.
- Vintage LMs are free of modern benchmark contamination by construction, which makes them useful for testing genuine generalization: can a model with no knowledge of digital computers learn to write Python from a handful of in-context examples? Barely, but performance improves with scale (see the prompt sketch after this list).
- OCR noise is the quiet killer here: training on raw historical scans yields only 30% of the learning efficiency of training on human-transcribed text, and modern VLM-based OCR helpfully hallucinates contemporary facts into 19th-century books (a filtering sketch follows the prompt example below).
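A minimal sketch of the kind of few-shot probe the generalization question implies, assuming a plain text-completion interface via Hugging Face `transformers`; the checkpoint id `example-org/talkie-13b` and the prompt are hypothetical stand-ins, not the project's actual evaluation:

```python
# Hypothetical few-shot probe: can the model continue a Python pattern
# it has only ever seen in the prompt? The model id is a placeholder.
from transformers import pipeline

FEW_SHOT_PROMPT = """\
# Task: write a Python function from its description.

# Description: return the square of a number.
def square(x):
    return x * x

# Description: return the sum of a list of numbers.
def total(xs):
    s = 0
    for x in xs:
        s = s + x
    return s

# Description: return the largest number in a list.
def largest(xs):
"""

generator = pipeline("text-generation", model="example-org/talkie-13b")
completion = generator(
    FEW_SHOT_PROMPT,
    max_new_tokens=64,
    do_sample=False,         # greedy decoding keeps the probe reproducible
    return_full_text=False,  # keep only the model's continuation
)[0]["generated_text"]
print(completion)
```

A model whose training data predates digital computers can only complete `largest` by generalizing from the in-context examples, which is exactly the contamination-free test the bullet describes.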
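On the OCR point, one pragmatic mitigation is to filter scanned pages with a cheap noise heuristic before training. This is a minimal sketch under assumed parameters; the tiny stand-in vocabulary and the 0.5 threshold are illustrative, not the project's actual pipeline, which would use a full dictionary and a tuned cutoff:

```python
import re

# Tiny stand-in vocabulary; a real filter would load a large wordlist.
KNOWN_WORDS = {
    "the", "of", "and", "to", "in", "a", "is", "that", "it", "was",
    "for", "on", "are", "as", "with", "his", "they", "at", "be", "times",
}
# Illustrative cutoff; with a full dictionary a much higher bar makes sense.
MIN_KNOWN_FRACTION = 0.5

def known_word_fraction(page: str) -> float:
    """Fraction of alphabetic tokens that appear in the vocabulary."""
    tokens = re.findall(r"[a-z]+", page.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in KNOWN_WORDS) / len(tokens)

def keep_page(page: str) -> bool:
    """Keep a page only if its OCR output looks clean enough to train on."""
    return known_word_fraction(page) >= MIN_KNOWN_FRACTION

clean = "it was the best of times and it was the worst of times"
noisy = "tbe c@t s4t 0n tlie rnat"  # classic OCR character confusions
print(keep_page(clean), keep_page(noisy))  # True False
```

Note that a lexical filter like this catches garbled scans but not the opposite failure mode above: VLM-based OCR that silently rewrites a damaged page into fluent, anachronistic text would sail straight through.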