It's time for a safer programming language AI agents can easily work with

The missing standard

Anthropic defined how models use tools and capabilities. It has not yet defined how models write code.

Anthropic published the Model Context Protocol so models could use tools through one open interface, and Skills so models could load capabilities through one open format. Both became standards the field builds on, and Anthropic steers their direction. A third surface remains undefined.

Published

MCP

How models reach tools. Open, adopted across the field, Anthropic-stewarded.

Published

Skills

How models load capabilities. Open format, growing ecosystem.

Undefined

The language

How models write and train on code. No standard. No verifiability floor. The biggest safety surface, unowned.

Every model today writes code in languages built for human ergonomics, where unsafe behavior — hidden mutation, null dereferences, data races, injection, undeclared side effects — is something the model can emit and no analyzer can statically rule out. As autonomous agents write and run more code with less human review, that gap becomes the safety problem.

The safety thesis

Verifiable code is safe code for autonomous agents.

The danger in agent-written code is not that the model is unskilled. The danger is unverified behavior: an effect the model did not declare, a mutation no one expected, a path that dereferences null, a string that reaches a SQL or HTML sink unescaped. Today these are caught — when they are caught — by review, testing, and luck.

In a language where these classes are compile-time type errors, the agent cannot write the unsafe thing, and a static analyzer can prove what the code will and will not do before it runs. Defense moves out of the adversarial runtime and into something as tractable as a type checker.

The model stays fixed. The dangerous classes become unwritable.
Capability is composed from verifiable parts a static analyzer can read.

This is directed evolution rather than open-ended self-modification: a fixed model generating within a verifiable substrate, where new capability is composition of safe parts and the safety properties are decided statically. The thing that makes the agent capable and the thing that makes it defensible become the same design.

The compiler as free red team

LLMs are probabilistic. Compilers are deterministic. Together they catch what neither could alone.

Current AI coding tools generate code that could be wrong. The standard response is more review — human or model. Both are expensive and have correlated failure modes: if the generating model missed something, the reviewing model is likely to miss it too, trained on similar data with similar blind spots.

A compiler has uncorrelated failure modes. It does not hallucinate, does not lose track of invariants over long contexts, does not get distracted. When an LLM writes U code and the compiler rejects it, that is a vulnerability caught before it runs — at zero marginal cost, in milliseconds, exhaustively across every code path the type system covers. The compile-error loop between model and compiler runs thousands of times a day without human involvement.

The pit of success for LLMs: the zero-annotation default is already safe. The compiler catches what the model missed. Two different kinds of intelligence, two different failure modes — together they are stronger than either alone.

This advantage compounds. Anthropic observes LLM-generated U code at scale — which patterns recur, which modifier combinations the model gets right automatically, which it misses. Future versions of U add new compile-time checks derived from those observations: new modifier dimensions, new constraint validations, new error classes made impossible. No other language designer has a billion lines of LLM-generated code to learn from. The language becomes safer with every model generation that trains on it.

What the language does

Every binding declares its own safety properties, verified by the compiler.

The design places a small, commutative set of verified annotations at each binding site: ownership (stack or heap), mutability and its write policy, nullability, async boundary, and declared effects. The zero-annotation default is the safest and fastest configuration, so every cost and every risk requires a visible, deliberate opt-in. A reader — human or model — sees each property in the source, with no inference required.

U is inspired by Python, and reads like it — clean indentation, low ceremony, familiar shape. The relationship is the one TypeScript has to JavaScript: not a replacement but a safety layer over a language people already use, stewarded by a single company, that became the default way serious teams write the underlying language because it caught a whole class of errors before they shipped. U is that layer for Python's world, built for the agentic era. It also holds to Python's own maxim, "there should be one obvious way to do it," which Python set for itself and then drifted from as it sprawled into many ways to format a string, loop, or run something async — U keeps the single obvious way, the way Kotlin kept Java's ecosystem while removing its verbosity and footguns. Teams that reach for Python will find U immediately legible, and find that it does not let them write the things Python lets them get wrong.

Python decorators — @cache, @retry, @auth — become modifier annotations: +D(cache(60)), +D(retry(3)), +D(auth("admin")). The same cross-cutting behavior, applied by the compiler rather than at runtime, visible at every binding site. Python's @ is matrix multiplication in U, following NumPy convention. Complex numbers C64 and C128 are built-in primitives — scientific Python users find the numeric stack already there, no external libraries needed.

// the zero-annotation default is safe and fast; risk is always visible
f processPayment(
    order: Order+R+M(MVCC),     // heap, mutable under MVCC — declared
    user:  User+R,               // heap reference, immutable — declared
    { timeout: I = 30 }          // optional named params, defaulted — no wrapper
): PaymentResult ! PaymentError  // the failure mode is declared

// at the call site, the optional is just a value — no type, no wrapper
processPayment(order, user)                  // timeout defaults to 30
processPayment(order, user, { timeout: 60 }) // override — no :I needed

Because the properties are verified rather than inferred, a whole catalog of failures stops being runtime surprises and becomes compile errors:

✕Use-after-free — a stack value cannot be stored where a heap reference is required.
✕Null dereference — a nullable value cannot be used without a check.
✕Data race — concurrent mutable access without a declared write policy.
✕Injection — an unescaped string reaching a SQL or HTML sink.
✕Undeclared effect — a side effect the function's signature did not declare.

Python parity for scientific computing — without external libraries.

Complex numbers C64 / C128 are built-in primitives. @ is matrix multiplication, ^ is powers — the same operators Python uses. +V enables SIMD vectorization. +R(GPU) moves data to device memory. Python people read U code immediately.

For an autonomous agent, this is a guardrail it cannot step around: the unsafe program does not compile. For a reviewing analyzer, every consequential property is a fact in the source, not a guess.

Full language reference — modifiers, type system, concurrency, formal guarantees →

Side by side

The same bugs other languages let you ship — and where they stop compiling.

Agent-generated code rarely fails one bug at a time; the failures arrive clustered, latent, and surface in production against a real user or attacker. Below, the failures other languages accept at runtime are, in U, compile-time type errors or checks the compiler inserts by default. The agent cannot emit them; the analyzer proves them absent.

SQL injection

JavaScript — runtime

// query built by string concatenation
const q = "SELECT * FROM users " +
          "WHERE id = " + userId;
db.query(q);
// userId = "0 OR 1=1"
// → every row leaks

U — compile time

// {uid} is bound, never interpolated
rows = sql`SELECT * FROM users
           WHERE id = {uid}`
// a raw string cannot become Sql —
// concatenating a query by hand is
// a type error

Bounds, null, and type confusion — clustered, the way agent code really fails

Python — four latent failures

def summarize(orders, users):
  first = orders[0]               # IndexError if empty
  user  = users.get(first["uid"]) # None if missing
  return {
    "name":  user.name,           # 💥 None.name
    "total": first["amount"]
           + user.credit          # str + int, fails deep
  }

U — each one handled or impossible

f summarize(orders: [Order],
            users: {S: User}): Summary+N
  orders.length == 0 & r => none  // empty handled
  first = orders[0]               // Order, bounds-checked
  user: User+N = users[first.uid] // +N in the type
  user ?? r => none               // must handle missing
  r => Summary({
    name:  user.name,             // typed, non-null
    total: first.amount + user.credit // both N
  })

Four failures sit latent in the Python: an empty list, a missing key, a null dereference, and a string-plus-number confusion from untyped data. In U the type confusion and the null dereference are compile errors, the array access is bounds-checked rather than silent, and the missing-key case is part of the type the compiler forces you to handle. The U version is longer precisely because the compiler made every failure path visible.

The duck typing in the Python version is not only a correctness hazard — it is an attack surface. Prototype pollution in JavaScript and attribute-injection tricks in Python exist because the type and attribute structure is mutable and inferred at runtime rather than declared and sealed at compile time. U has no untyped escape hatch, no open prototype, and no inferred attribute bag: every binding's type is fixed and verified, so the whole class of pollution and injection-by-type-confusion attacks becomes unexpressible.

Silent integer overflow

C — runtime

int total = balance + deposit;
// silent wraparound on overflow →
// a negative balance, a classic
// source of financial CVEs

U — type communicates intent

total: I32 = balance + deposit
// signed types trap on overflow — always

hash: U32  = seed * prime
// unsigned types wrap — by definition
// intent is in the type, not the operator

The pattern repeats across the catalog: use-after-free, data races on shared mutable state without a declared write policy, async violations, const mutation, missing i18n keys. Each is a runtime failure elsewhere and a compile error in U.

The full catalog of compile-time safety guarantees →

Why a standard, and why Anthropic

The same move Anthropic made with MCP — applied to how AI writes code.

A verifiability-first language is most valuable as an open standard the field adopts, not as a closed tool. Published openly, other labs train on it and generate in it, which entrenches it as the default substrate for machine-written code. Stewarded by Anthropic, its direction stays with the lab that defined it. Adoption by others is not leakage; it is what makes the standard load-bearing, exactly as it was for MCP.

Open standard, Anthropic-stewarded. Other labs train on it. Anthropic holds the pen on how AI writes safe code.

This follows a proven path. Go was built inside a large company, published open, and championed by its designers; it became infrastructure the field relies on, with no ownership friction because it was open from the first day. A language for safe machine-generated code, published as an Anthropic standard, puts Anthropic at the center of the surface where AI safety and AI productivity meet.

The strategic payoff is position, not enclosure: the steward of the standard everyone trains on shapes the field more than the owner of a tool no one else can touch.

The deeper move · a compilation target

Not just a language to write. A target to compile into — the way the JVM and the CLR became platforms.

The Java Virtual Machine and the .NET Common Language Runtime did not win as languages. They won as targets: a common substrate that many languages compile to, where the platform owner controlled the verification, the security model, and therefore the ecosystem. Adoption never required developers to hand-write bytecode. It required the tooling around the target to be good.

The same shape applies here, with one decisive difference. JVM bytecode and CLR IL are machine-only. This target is human-readable and verifiable — so the compiled-to form can be audited, reviewed, and statically proven safe, which is exactly the property safety demands. A platform target that a person or an analyzer can still read.

Be flexible in what you accept. Be strict in what you emit.

Postel's robustness principle, applied to code itself: accept the world's messy languages as input, normalize to one strict, verifiable, analyzable target as output. The diversity goes in; the safety comes out.

Because models generate the target and transpilers port to it, adoption does not wait on developers learning new syntax. A large fraction of existing code can be ported into the safe target — by the same models that now write code — and every codebase ported becomes more training data in the safe target, which makes the porting better, which ports more. Anthropic's head start in training on it and porting to it compounds.

And a runtime to run it — safely, in production

The target compiles to WASM and to C. Compiling to WASM means it runs inside the sandboxed-execution standard the field already trusts, rather than inside a general-purpose JavaScript runtime never designed for untrusted machine-written code. Anthropic can ship the runtime that safely executes the target in production — the move Node made when it lifted Google's V8 engine into a server runtime, except here the runtime is sandboxed and attested by design.

Anthropic already owns the runtime. Bun — the JavaScript runtime written in Zig with JavaScriptCore, now part of Anthropic since December 2025 — is the natural distribution layer. U compiles to WASM; WASM runs on Bun. The language, the porting models, and the runtime are all under one roof. No new infrastructure required.

Generate safe. Port safe. Run safe. One target Anthropic stewards across all three.

The platform is three layers, each an Anthropic-owned standard: the verifiable target models write and train on; the porting models that bring the world's code into it; and the sandboxed runtime that executes it. The same way Sun owned the JVM, Microsoft owned the CLR, and the V8 team's engine became the substrate under Node.

The enterprise offer: bring your old code, leave it safer

This is where the three layers become a product motion. An enterprise or trusted partner brings its existing codebases, the porting models translate them into the verifiable target, and the result is code that is analyzable, sandboxed, and runtime-verified — measurably safer than what they ran before, without a greenfield rewrite. The work happens inside the customer's own environment; their source never has to leave their walls. After the port, Anthropic's agents — trained on the target — close the loop, maintaining and improving the ported code autonomously, around the clock, against the same verifiable guarantees.

Port your legacy code into the verifiable target. It comes out safer, sandboxed, and maintained by agents that cannot write the unsafe thing.

Publish the remediations, not just the vulnerability counts

Anthropic already publishes the vulnerabilities its models find in real codebases. The next step is to publish the fixes: take great open-source codebases, port them into the verifiable target, and the patched, hardened version — with the remediations applied and, in this language, more obvious — lives in the target. Each one is a public good and a draw: the safer version of a well-known project exists in U, and the safety is legible there in a way it was not in the original.

The precedent is recent and close to home. Bun, now Anthropic's, was ported from Zig to Rust almost entirely by Claude Code — over a million lines, the test suite passing. Bun's own pre-release audit of that port is the argument for this target. The port carries 13,365 unsafe blocks; the audit's fair point is that most are removable FFI-boundary and ported-idiom scaffolding. But the finding that matters is the one the count does not include: five functions are unsound today — undefined behavior reachable from safe Rust. The escape hatch leaked past its own boundary, so even the region the language calls safe was not.

U has no escape hatch at all. There is no unsafe region whose invariants can leak into the safe one — because there is no unsafe region.

That is the difference between porting into a language that permits unsafety behind a keyword and porting into one that cannot express it. Rust's unsafe is a boundary that can be crossed, and as Bun's audit shows, its invariants can leak back into safe code. A target with no escape hatch has no such boundary to leak. The remediation is real because the target will not compile the unsafe version at all. Every hardened codebase published this way is more high-quality code in the language, drawing the next one in.

From finding problems to delivering fixes

This also changes the posture of Anthropic's security work. Publishing the vulnerabilities a model found — in code the lab cannot itself release fixes for — is a capability demonstration that leaves the finding on the table and the finder looking adversarial. Porting-and-hardening inverts it: instead of announcing the problem, Anthropic delivers the remediation — the codebase ported into the verifiable target, the issues surfaced and fixed at scale, handed back safer than it arrived. The same capability, in the opposite social position: not the lab that catches you out, but the lab that hardens your code.

For open-source projects this is a public good at a scale no human team could match. For enterprises it is a service: bring a codebase, get back a hardened, verifiable, sandboxed version, with the remediations applied and the unsafe forms made unexpressible — a paid offering that also turns every engagement into more code in the language and more training signal for the porting models. The economics and the flywheel point the same way. Only the lab with the model that ports at scale, and a target where the safety holds by construction, can offer it credibly.

The ecosystem, and the research optics

A language with a community around it is a standing research and recruiting asset.

React did this for Facebook and Go did it for Google: a language or framework published openly, with the originating lab stewarding its direction, becomes a durable source of research visibility, conference presence, and engineering recruiting. The community forms around the standard, and the standard's home institution sits at the center of it.

The same follows here, and the ecosystem is concrete: conferences and workshops dedicated to the language and its verification properties; books and courses; certifications; tooling, analyzers, and editor support contributed by the community. Each one is research surface for Anthropic and a recruiting funnel of exactly the engineers who care about safe, verifiable systems. I am happy to help organize all of it — the conferences, the curriculum, the certification program — as part of stewarding the standard.

The standard publishes once. The conferences, books, certifications, and community compound for years.

The research-optics benefit is the kind a lab cannot buy directly: being the institution that defined how AI writes safe code, with the field's practitioners gathering around the standard you steward.

The research it answers

It makes the verifiable region of the work as large as it can be.

Autonomous code generation compounds where verification is cheap: types, tests, and compilation let an agent run, check, and correct itself without a human in the loop. A language designed for verification widens that region deliberately — more properties become checkable, so more of the agent's work can run unsupervised and safe.

The concrete research that proves this is porting. Training a model to port an existing codebase into the verifiable target is itself a verifier-rich task — the ported code compiles or it does not, the analyzer proves the unsafe classes absent or it does not, the tests pass or they do not. That makes porting the ideal first experiment: a measurable ML result that tests the language, the target, and the safety claim in a single spike, and produces the flywheel's first turn.

Where it leads next

From declared structure to declared policy — closing the loop on agentic flows.

Forcing every declaration upfront — types, write policies, and the handling of every declared failure mode — does more than catch bugs. It makes a program's consequential behavior legible and machine-checkable at the boundary: what it reads, what it writes, what it can fail at, and how concurrent writes resolve are all facts in the source rather than runtime surprises.

That legibility opens the next research direction. Once an agent's actions are expressed against a substrate where effects and failure modes are declared, an organization can express its policies against the same substrate: which classes of action an agent may complete autonomously, and which must route to human approval. The verifiable flows close their own loop; the consequential ones stop at a gate. Capability rises while the dangerous fraction stays under review — measured against declared structure rather than inferred behavior.

Declared structure makes verifiable flows safe to automate. Declared policy decides which flows are allowed to close the loop at all.

This is the research arc beyond the pilot: the verifiable language is the floor, and organizational policy expressed against it is how autonomous agents are given real authority safely — most of the work closing its own loop, the consequential remainder held for approval. The pilot earns the right to build toward it.

The ask · a scoped pilot

Prove the number first. Build the platform only if it holds.

This is not a request to build a language, a porting fleet, and a runtime on day one. It is a request for one bounded experiment that tests the whole thesis cheaply, with a clear deliverable and a clear way to fail.

Phase 1

The porting eval — a focused spike, with the engineer

Build a minimal verifiable target plus its static analyzer, and train a model to port a real codebase into it. Measure: does the ported code have a lower unsafe-action rate, and a higher fraction of fault classes the analyzer proves absent, than the original? Hold the model fixed; vary only the target. One experiment that tests the language, the target, and the safety claim together.

Phase 2

Standardize it — only on a positive result

If the number holds, publish the target's specification as an open, Anthropic-stewarded standard with the reference implementation, on the MCP model, and scale the porting models. If it does not hold, the lab has learned something cheap and concrete about the limits of verifiable substrates.

Phase 3

Ship the runtime — the platform horizon

With the target proven and adopted, ship the sandboxed runtime that executes it safely in production, compiling to WASM and C. Generate safe, port safe, run safe — three Anthropic-stewarded layers. This is the destination the pilot earns toward, not the ask.

unsafe-action rate

ported target vs. original source, same model, same tasks — the headline metric

analyzer coverage

fault classes the static analyzer proves absent before execution, ported vs. original

port fidelity

share of a real codebase ported to the verifiable target that compiles and passes its existing tests

Kill condition

If Phase 1 shows no measurable safety improvement from porting into the verifiable target, the project stops at the spike. The cost is a short, self-contained research effort; the result is a real finding either way.

The next research direction · Safeguards

Structure constrains the attack surface — and shifts Safeguards from runtime cost to compile-time fact.

Safeguards today runs ML analysis on every prompt execution, because freeform prompts are unstructured strings: the surface area is unbounded, so the cost is O(executions). A language that encourages structured, typed, hash-pinned prompt templates changes the economics. The surface area shrinks. Most of the analysis moves from the runtime to the audit registry. The teraflops go where they are actually needed.

Freeform prompts: every execution is a new surface to analyze.
Structured templates: audit once, verify the hash at runtime — O(1).

When prompt`Summarize {doc: S} in {n: I} words` is compiled, the static template text becomes a constant with a deterministic hash. The variable parts are typed slots — {doc: S}, {n: I} — not arbitrary strings. The hash is baked into the module, included in the bundle's Merkle root, and submitted to the audit registry. Safeguards reviews it once. Every subsequent execution verifies the hash is signed and moves on. The ML runs on the rare anomalies — not on every call.

A tiered analysis stack, not a flat cost

Structure makes a layered defense tractable. For any prompt execution, the tiers are:

Tier 1

Hash verification — O(1), free, deterministic

Is this bundle hash registered and signed by auditors? No ML. No teraflops. If the hash is in the registry, the template was already reviewed. This handles the vast majority of production traffic.

Tier 2

Static analysis — compiler, also free

Does this instantiation have correct typed slots? Does the input match the declared {doc: S}? Static checks the compiler already enforces. No inference required.

Tier 3

Distilled expert — cheap, trained on template structures

Is this instantiation within the normal distribution for this template class? A small model trained on the structure of audited templates catches the 99% that are typical. The structure defines the verifier-rich domain; the expert is cheap precisely because the surface area is declared, not inferred.

Tier 4

Full model — reserved for genuine anomalies

Only the rare edge case — outside the distilled expert's certified ball — reaches the full model. The teraflops land where the uncertainty actually is. This is where Safeguards' ML budget produces the most signal per dollar.

The distilled expert at Tier 3 is not a replacement for the base model. It is accrued structure knowledge the base model lacks: this template class, these slot types, this distribution of normal inputs. The expert is cheaper and more accurate on its ball precisely because the template gave it a defined domain. Without structure, every prompt is a fresh open-ended problem. With structure, 99% of executions are pattern-matching against a known schema.

Attested provenance — every LLM output traceable to its signed prompt

The bundle's Merkle root hash covers every prompt function in the deployed system. Any response from any agent can be traced through a cryptographic chain:

// The provenance chain — automatic, no manual SRI
LLM response
  → execution hash         // Safebox already generates this
    → prompt function hash  // baked in at compile time
      → module hash         // covers all templates in the module
        → bundle Merkle root // covers the entire deployed system
          → M-of-N auditor signatures  // registry sign-off

// Any tampered template changes its hash
// → fails to match signed registry entry → blocked before execution
// → zero tampered prompts reach the full-model tier

This is SRI for prompts — generated automatically by the compiler, not hand-coded per resource. Changing a prompt in any deployed module changes its hash, breaks the registry match, and stops execution. Supply chain attacks on prompt templates become detectable before they run.

The M-of-N auditor model scales with LLMs. New template hash arrives → LLM auditors review at scale → human sign-off on high-capability packages → signed entry in registry. One audit. Every execution that follows is O(1). The teraflops go to auditing new structure once, not to rechecking known-safe structure forever.

The client offer — savings passed on through Claude Code and Bun

Claude Code already ships as a Bun executable. If Claude Code encourages — or generates by default — structured U prompt templates rather than freeform strings, every Claude Code user becomes a source of auditable, hash-pinned prompt artifacts. The Safeguards cost per execution drops. The savings are real and compounding:

Fewer flops per call

Hash-verified templates skip ML analysis entirely. Only genuine anomalies reach Tier 4. Safeguards budget concentrates on the uncertain fraction.

Better signal per flop

When the full model runs, it runs on genuine edge cases — not on the 99% that hash verification already resolved. Signal-to-noise improves with no model change.

Provenance for enterprise

Enterprises using Claude Code can show auditors exactly which prompt templates ran, when, signed by which reviewers — cryptographic chain of custody for every AI output.

Structure is what makes safety cheap. The language encourages structure. The compiler enforces it. Safeguards analyzes the surface that remains — which is the surface that actually matters.

The research direction is concrete: measure the reduction in Safeguards cost per execution as Claude Code adoption of structured prompt templates grows. The metric is a ratio — Tier 1+2 resolutions vs Tier 4 escalations — and it should improve monotonically as the template registry grows. A larger registry means more of the world's prompt traffic has a hash that resolves cheaply. The distilled expert trains on an expanding corpus of audited structure. The full model focuses on a shrinking tail of genuine uncertainty. The savings compound with adoption, the same way the porting flywheel compounds with every codebase ported.

A second research direction · after the language is built

Exact arithmetic in the language opens a training-and-inference research program — one only the implemented language can run.

The design already puts the numeric stack in the language: complex numbers C64/C128, @ for matrix multiply, ^ for powers, +V for SIMD, +R(GPU) for device memory, and exact rationals as primitives — the things a scientific-Python user otherwise imports. And within a stated window (n, d < 2⁵³, covering currency probabilities, and small-denominator ratios), the comparisons float gets wrong are simply correct, and results are bit-reproducible across hardware given a fixed evaluation order. That much is a property of the type system, not a promise.

Python — float, silently wrong

>>> 0.1 + 0.2 == 0.3
False
>>> 0.1 + 0.2
0.30000000000000004
# money needs a separate library
# 1/3 is never exact in float

U — exact within the integer window

Q(1,10) + Q(2,10) == Q(3,10)  // true, exactly
third = 1 / 3                 // Q(1,3) — exact, not 0.333…
price = Q(1999,100)           // $19.99, no Decimal import

Reproducibility is verifiability. Same inputs, identical outputs across GPUs and driver versions — within the exactness window, given fixed order.

Float training is notoriously non-deterministic across hardware and runs. Exact rational evaluation makes a result a fact a re-run and an analyzer can both confirm — the property the rest of this proposal asks of code, applied to numbers.

What that exactness enables for model training and inference is not a result this proposal claims. It is a research agenda that becomes testable only once the language and its exact-rational runtime exist — sequenced after the pilot, the way the policy work is. Stated honestly, with the open questions intact:

→New architectures, if rational gradients don't vanish. Within the window a product of nonzero exact rationals is nonzero, so for rational-valued gradients, underflow-to-zero — one classic source of vanishing gradients — cannot occur. Whether that lets very deep plain networks train without the machinery introduced to fight underflow (residual streams, gating, clipping for that reason) is an open empirical question — a chance to test architectures float arithmetic effectively rules out. A hypothesis to measure, not a result.
→Exact where the math is rational. Affine and polynomial steps compute exactly; transcendental steps — softmax, normalization, GELU — are not rational and stay approximate. The right precision tradeoff at those boundaries is itself the research, not end-to-end exact training.
→Automatic differentiation as a typed primitive. A dual number is a pair and the symbolic tree is a tape, so forward-mode AD over exact rationals yields exact derivatives for the rational and affine parts — certified sensitivities, verified Jacobians, exact optimality checks. Reverse-mode at model scale, and differentiation through transcendentals, stay open: the derivative of exp or √ is not rational, and exact adjoints over a deep graph face the same growth cost as the forward pass.
→Reproducible, certifiable inference. Fixed-order exact evaluation makes a model's outputs bit-identical across hardware and runs, and exact comparison removes epsilon-threshold heuristics. For safety-critical inference, a result that re-runs identically and can be checked exactly is a verifiable artifact — worth measuring once the runtime exists.
→Cost, and whether hardware closes it. Exact rational operations carry a GCD reduction, so they run several times slower than FP32 today and further behind tensor cores. Whether a cleanup primitive in the ISA makes this practical at training and inference scale is a hardware co-design question — explicitly future work.

None of this is claimed as done. Each is a measurable question the implemented language would let us answer — the language first, the research it earns second.

Why this is clean, and why now

Open from the first day, built inside, the field's default substrate to define.

The language is designed to be published open and stewarded by Anthropic — no licensing, no enclosure, built fresh inside the lab as Anthropic's standard. The timing matches the moment: agentic code generation is scaling now, before any verifiability standard exists, which is the same window MCP arrived in. The lab that defines the safe-code substrate while the field is still settling will steer it for the field's lifetime.

Models already write the code. The only open question is whether the code they write can be proven safe before it runs.

I have spent years designing this language and can build the reference implementation. The pilot asks for the smallest version of that work that produces a real answer.

A note on prior disclosure

The language design described here has also been submitted as a paper to POPL (the ACM Symposium on Principles of Programming Languages). The submission is anonymized, though anyone familiar with the work could deanonymize it. The reviewer pool is drawn almost entirely from the academic programming-languages community, and to my knowledge the large majority do not work at Anthropic. I mention it so the prior disclosure is on the record and the timeline is clear: the design exists, it is documented, and it is dated.