coherenceism
river · Agency
piece 14 of 16

The Copyright Horizon

~3 min read · by Ash

The lawsuits filed. The licensing deals closed. The robots.txt debates. The ambient question hanging over every AI project now: where did your training data come from, and did you have the right to use it?

Most teams are still running from that question. The people who built talkie stopped running and turned around.

Talkie is a 13-billion-parameter language model trained entirely on pre-1931 text. Nick Levine, David Duvenaud, and Alec Radford picked a date — January 1, 1931, the boundary of U.S. copyright protection — and built toward it. Not around it. Toward it.

Simon Willison called it a "vegan model." That framing is cute, but it undersells the engineering decision. This isn't ethical positioning. It's a design move. The copyright boundary isn't what constrains talkie — it's what defines it.

i · constraints as coordinates

Most builders encounter hard limits and immediately start mapping escape routes. Different licensing terms. Synthetic data. Fair use arguments. The instinct makes sense: the constraint feels like the problem, so you try to eliminate it.

Here's what that instinct gets wrong: the constraint isn't just in your way. It's showing you the shape of what's actually yours.

The pre-1931 corpus is enormous. 260 billion tokens. Every novel, scientific paper, newspaper, legal brief, and technical manual published in English before talking pictures. The team didn't have to negotiate for any of it. It won't be relicensed out from under them. It never expires. It's a permanent capability — the kind you build once and use forever.

They named their constraint precisely: everything before January 1, 1931. That precision is what made the project possible. Vague constraint-avoidance produces vague capabilities. Sharp constraints produce sharp tools.
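A constraint named that precisely is effectively executable. A minimal sketch of what an era-bounded corpus filter might look like — the record shape and field names here are illustrative assumptions, not talkie's actual schema:

```python
from datetime import date

# Hard boundary: everything published before this date is in scope.
CUTOFF = date(1931, 1, 1)

def in_corpus(record: dict) -> bool:
    """A precise constraint is one comparison, with no judgment calls.
    Undated material fails closed: if you can't prove it's inside the
    boundary, it's outside."""
    pub = record.get("publication_date")
    return pub is not None and pub < CUTOFF

docs = [
    {"title": "The Maltese Falcon", "publication_date": date(1930, 2, 14)},
    {"title": "Brave New World", "publication_date": date(1932, 1, 1)},
    {"title": "Unknown pamphlet", "publication_date": None},
]
kept = [d["title"] for d in docs if in_corpus(d)]
# The 1930 novel passes; the 1932 novel and the undated pamphlet are excluded.
```

Note the asymmetry: "copyright is complicated" cannot be written as a predicate, but "published before January 1, 1931" can.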

ii · the contamination problem — and why it's honest

There's a harder lesson in this project, and it's more useful than "work with open licenses."

The talkie team ran into what they call the contamination problem. To make the base model actually useful — to fine-tune it for conversation and instruction-following — they needed judges. Systems that could evaluate whether responses were good. And the obvious judges were modern language models: Claude, GPT.

Using modern LLMs to evaluate a vintage model introduces anachronistic knowledge. The historical purity breaks. The team documents this clearly: they "hope to be able to use our vintage base models themselves as judges to enable a fully bootstrapped era-appropriate post-training pipeline." They're not there yet. The seam is visible.

Most builders would treat this as a failure to hide. The talkie team treats it as an honest accounting of where their clean system touches the messy world.

That's the right move. Every system that claims purity has seams. The question isn't whether you can eliminate them — you can't — but whether you're honest about where they are. Documented seams are debuggable. Hidden seams are liabilities.
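Documented seams can be made structural rather than left as prose in a readme. A sketch of one way to do that — the class and field names are hypothetical, not anything the talkie team describes:

```python
from dataclasses import dataclass, field

@dataclass
class Seam:
    stage: str        # where the clean system touches the messy world
    contaminant: str  # what anachronistic dependency leaks in
    exit_plan: str    # how the seam could eventually be closed

@dataclass
class Pipeline:
    seams: list = field(default_factory=list)

    def declare_seam(self, stage: str, contaminant: str, exit_plan: str):
        self.seams.append(Seam(stage, contaminant, exit_plan))

    def report(self) -> list:
        # Emit one line per seam, suitable for pasting into a readme.
        return [f"{s.stage}: uses {s.contaminant} (exit plan: {s.exit_plan})"
                for s in self.seams]

p = Pipeline()
p.declare_seam("post-training judge",
               "modern LLM evaluations",
               "bootstrap a vintage base model as judge")
```

The point is not the data structure; it's that a seam you declare is a seam you can track, audit, and eventually remove.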

iii · how to use this tomorrow

The constraint-as-specification pattern applies wherever you hit a hard wall:

Name it precisely. "We can only use data we own outright" is a specification. "Copyright is complicated" is a complaint. Precision turns the wall into a boundary you can build against.

Map what's inside. Before trying to escape the constraint, survey what it actually gives you. The talkie team found 260 billion tokens. You might find a corpus, a customer segment, a workflow, a legal structure — whatever is unambiguously within your boundary.

Build toward the edge. The most interesting work in constrained systems happens right at the boundary, not in the safe middle.

Document your seams. When your clean system has to touch the messy world — and it will — write it down. Put it in the readme. The contamination problem isn't shameful. It's real. Honest seams are a sign of careful work, not sloppiness.

The copyright horizon isn't a wall blocking you from the good data. It's the edge of a map you can actually stand on. The question is whether you know what's on your side of it.

source · Simon Willison — Introducing talkie: a 13B vintage language model from 1930 (simonwillison.net, Apr 28 2026)
