RFD 0051 — oxfmt: a canonical, idempotent source formatter
- State: discussion
- Depends on: the shared lossless rowan CST (
oxc-parser,oxc-syntax); the[[notation]]table and the dual-notation surface (§2.4.1); the[fmt]notation-policy cascade (RFD 0013 toolchain, #762); the example corpus (RFD 0044 / RFD 0050), which the formatter is exercised against. - Prior art: the pretty-printing lineage — Oppen, Prettyprinting (1980); Hughes, The Design of a Pretty-Printing Library (1995); Wadler, A Prettier Printer (2003); Lindig, Strictly Pretty (2000); Bernardy, A Pretty But Not Greedy Printer (2017). Production formatters built on that algebra — gofmt, Black, Prettier, dprint, Biome, nixfmt, Dhall, and Lean’s printer. The normal-form framing — Eberhart, On the Relationship Between Parsing and Pretty-Printing (2012); Clarke, Liepelt & Orchard, Scrap Your Reprinter (2017), on why layout-preservation and canonicalization conflict; Black’s
assert_equivalentAST-safety check.
Question
What should Argon’s ox fmt / oxfmt be? The shipped v1 normalized whitespace only — it stripped trailing space and capped blank lines but could not re-indent, reflow, or impose a canonical layout, so the promise in the toolchain chapter that “spacing and layout are the Argon convention” had no implementation behind it.
Context
A formatter is the normal-form function for layout-equivalence on programs. Define s₁ ∼ s₂ iff parse(s₁) = parse(s₂); then fmt = print ∘ parse sends each text to the canonical representative of its class. Two properties follow from that framing rather than being engineered:
- Idempotence (
fmt(fmt(s)) = fmt(s)) is forced — the normal form of a normal form is itself. - Canonical form is defined by the printer, not discovered. There is no platonic “right” layout; the break and spacing rules are the definition.
Three facts about the codebase shape the design. The parser already produces a lossless rowan CST in which trivia (whitespace, the four comment kinds) are ordinary tokens — so losslessness is structural, with no parse-time attachment to get wrong. The grammar is uniform: every composite node is a delimited comma-list, a brace-delimited item/statement body, a bare comma-list, or an inline token sequence — so the lowering is a generic element walk plus a few family handlers, not ~100 bespoke rules. And the surface is newline-insensitive, so layout carries no meaning to preserve beyond the author’s blank-line paragraphing.
Decision
Build the formatter as a normalizer over a Wadler/Oppen combinator document IR, fed from the shared CST, emitting a single canonical layout with near-zero configuration, behind a layered correctness guard. Concretely:
-
Document IR (the one irreversible choice). Each node lowers to a
Docoftext/line/softline/hardline/nest/group/ concatenation; a renderer resolves eachgroupto flat-or-broken by whether the flat form fits the target width. A naive transcription of Wadler’s lazy algorithm is exponential in a strict language, so the renderer is the strict (Lindig) reformulation: a single work-stack with bounded look-ahead and pre-propagated break flags (agroupcontaining a forced break never probes flat), keeping it linear. -
Greedy selector. Each
groupis resolved locally (flat if it fits, else broken). Greedy is space-suboptimal on a minority of constructs; an optimal cost-based evaluator can be swapped in behind the same IR per-construct later, against the real corpus — that is not an architectural fork. -
Near-zero configuration. The only layout knob is the target line width. Everything else is fiat (the gofmt/Black/Dhall posture, matching Argon’s single-canonical-form stance). Notation direction (
preserve/unicode/ascii) is policy, resolved through the existingox.tomlcascade, not layout. -
Canonicalize, do not preserve. A single tool cannot both preserve the author’s layout and produce an idempotent canonical form — the two obey conflicting lens laws (Clarke et al. 2017).
oxfmtchooses canonicalization: it discards input layout and imposes the canonical one. The one author signal it keeps is blank lines between items, collapsed to at most one (gofmt semantics). A layout-preserving reprinter — for refactoring tools and LSP code actions — is a separate tool with distinct laws and is out of scope here. -
Trivia stays in the CST. Comments are read from the tree and placed by line ownership (own-line comments lead, same-line comments trail); the
//////!/////* */distinctions are preserved and comment interiors are never reflowed. Post-parse attachment-to-AST-nodes — gofmt’s self-described “biggest mistake” — is avoided. -
Macro and
quote {}bodies are opaque. A macro body is parsed twice — by the host grammar (where layout is dead) and by the macro engine (where token adjacency, repetition spacing like$($x:tt),*, and trailing commas can be load-bearing). Host-grammar token-equivalence therefore does not imply the macro is unchanged, sooxfmtrenders macro definitions andquote {}subtrees verbatim and formats only the surrounding code (the rustfmt-conservative stance). This keeps the guard below sound — the formatter never alters bytes it cannot prove inert — with the matching consequence that a macro-bearing file is not layout-invariant (its body layout is preserved), exactly as blank-line paragraphing is. -
A meaning-preservation safety guard (the maximal-correctness contract). Every rewrite is verified before it is returned: the candidate must re-parse cleanly, its significant-token sequence (kind + text, modulo trailing commas, which are layout in Argon’s grammar) must equal the input’s, and its comment multiset must match. On any failure the formatter emits the input unchanged — it is, by construction, incapable of changing what a file means. This is stronger than an AST-shape comparison because the lossless tree lets the guard check comment preservation directly.
-
Location. In
oxc-fmt, over the sharedoxclexer/parser/CST — no second grammar (the Nix cautionary tale).ox fmt [--check] <paths>and theoxfmtbinary share one path;--checkis exactly the “already a normal form?” predicate.
The canonical-forms layering (why this ships now)
Settling a canonical surface form (clause ordering, dual-form spellings, the forall/exists surface, predicate-sublanguage unification) refines ∼ by merging classes, giving a chain ∼₀ ⊆ ∼₁ ⊆ ⋯. The formatter is well-defined at every stage, and each later decision lands as one additional confluent pass that only shrinks the already-canonical set — it never invalidates prior output. So the layout floor ships now on settled syntax, and the unsettled canonical-form decisions are neither blocked by nor forced by the formatter; each becomes a small pass with a built-in acceptance test (idempotence + token/comment equivalence on the corpus) when its decision lands.
Correctness
The layered net, each rung catching what the one below misses:
- Parse-preservation — the output re-parses cleanly (cheapest gate).
- Token-and-comment equivalence — the safety guard above, enforced on every format.
- Idempotence —
fmt(fmt(s)) = fmt(s)byte-exact, over the whole corpus and under fuzzing. - Layout-invariance (modulo blank lines) — re-spacing a file’s token stream (mangling indentation and intra-line spacing, preserving paragraph breaks) yields identical canonical bytes across the corpus: input layout does not leak into output.
- Totality / no-panic fuzzing — arbitrary-string and token-salad property tests confirm the lowering and renderer never crash and every produced output is a fixed point.
What is out of scope
The layout-preserving reprinter (refactoring / range edits); incremental and range formatting for the LSP; an optimal cost-based evaluator beyond the constructs that demonstrably need it; and forcing the unsettled canonical-surface decisions, which the layering above defers to their own rulings. A code style guide (spec/reference/code-styleguide.md) describes the canonical layout the implementation produces — derived from the formatter, not the other way round.