Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

RFD 0051 — oxfmt: a canonical, idempotent source formatter

  • State: discussion
  • Depends on: the shared lossless rowan CST (oxc-parser, oxc-syntax); the [[notation]] table and the dual-notation surface (§2.4.1); the [fmt] notation-policy cascade (RFD 0013 toolchain, #762); the example corpus (RFD 0044 / RFD 0050), which the formatter is exercised against.
  • Prior art: the pretty-printing lineage — Oppen, Prettyprinting (1980); Hughes, The Design of a Pretty-Printing Library (1995); Wadler, A Prettier Printer (2003); Lindig, Strictly Pretty (2000); Bernardy, A Pretty But Not Greedy Printer (2017). Production formatters built on that algebra — gofmt, Black, Prettier, dprint, Biome, nixfmt, Dhall, and Lean’s printer. The normal-form framing — Eberhart, On the Relationship Between Parsing and Pretty-Printing (2012); Clarke, Liepelt & Orchard, Scrap Your Reprinter (2017), on why layout-preservation and canonicalization conflict; Black’s assert_equivalent AST-safety check.

Question

What should Argon’s ox fmt / oxfmt be? The shipped v1 normalized whitespace only — it stripped trailing space and capped blank lines but could not re-indent, reflow, or impose a canonical layout, so the promise in the toolchain chapter that “spacing and layout are the Argon convention” had no implementation behind it.

Context

A formatter is the normal-form function for layout-equivalence on programs. Define s₁ ∼ s₂ iff parse(s₁) = parse(s₂); then fmt = print ∘ parse sends each text to the canonical representative of its class. Two properties follow from that framing rather than being engineered:

  • Idempotence (fmt(fmt(s)) = fmt(s)) is forced — the normal form of a normal form is itself.
  • Canonical form is defined by the printer, not discovered. There is no platonic “right” layout; the break and spacing rules are the definition.

Three facts about the codebase shape the design. The parser already produces a lossless rowan CST in which trivia (whitespace, the four comment kinds) are ordinary tokens — so losslessness is structural, with no parse-time attachment to get wrong. The grammar is uniform: every composite node is a delimited comma-list, a brace-delimited item/statement body, a bare comma-list, or an inline token sequence — so the lowering is a generic element walk plus a few family handlers, not ~100 bespoke rules. And the surface is newline-insensitive, so layout carries no meaning to preserve beyond the author’s blank-line paragraphing.

Decision

Build the formatter as a normalizer over a Wadler/Oppen combinator document IR, fed from the shared CST, emitting a single canonical layout with near-zero configuration, behind a layered correctness guard. Concretely:

  1. Document IR (the one irreversible choice). Each node lowers to a Doc of text / line / softline / hardline / nest / group / concatenation; a renderer resolves each group to flat-or-broken by whether the flat form fits the target width. A naive transcription of Wadler’s lazy algorithm is exponential in a strict language, so the renderer is the strict (Lindig) reformulation: a single work-stack with bounded look-ahead and pre-propagated break flags (a group containing a forced break never probes flat), keeping it linear.

  2. Greedy selector. Each group is resolved locally (flat if it fits, else broken). Greedy is space-suboptimal on a minority of constructs; an optimal cost-based evaluator can be swapped in behind the same IR per-construct later, against the real corpus — that is not an architectural fork.

  3. Near-zero configuration. The only layout knob is the target line width. Everything else is fiat (the gofmt/Black/Dhall posture, matching Argon’s single-canonical-form stance). Notation direction (preserve / unicode / ascii) is policy, resolved through the existing ox.toml cascade, not layout.

  4. Canonicalize, do not preserve. A single tool cannot both preserve the author’s layout and produce an idempotent canonical form — the two obey conflicting lens laws (Clarke et al. 2017). oxfmt chooses canonicalization: it discards input layout and imposes the canonical one. The one author signal it keeps is blank lines between items, collapsed to at most one (gofmt semantics). A layout-preserving reprinter — for refactoring tools and LSP code actions — is a separate tool with distinct laws and is out of scope here.

  5. Trivia stays in the CST. Comments are read from the tree and placed by line ownership (own-line comments lead, same-line comments trail); the /// / //! / // / /* */ distinctions are preserved and comment interiors are never reflowed. Post-parse attachment-to-AST-nodes — gofmt’s self-described “biggest mistake” — is avoided.

  6. Macro and quote {} bodies are opaque. A macro body is parsed twice — by the host grammar (where layout is dead) and by the macro engine (where token adjacency, repetition spacing like $($x:tt),*, and trailing commas can be load-bearing). Host-grammar token-equivalence therefore does not imply the macro is unchanged, so oxfmt renders macro definitions and quote {} subtrees verbatim and formats only the surrounding code (the rustfmt-conservative stance). This keeps the guard below sound — the formatter never alters bytes it cannot prove inert — with the matching consequence that a macro-bearing file is not layout-invariant (its body layout is preserved), exactly as blank-line paragraphing is.

  7. A meaning-preservation safety guard (the maximal-correctness contract). Every rewrite is verified before it is returned: the candidate must re-parse cleanly, its significant-token sequence (kind + text, modulo trailing commas, which are layout in Argon’s grammar) must equal the input’s, and its comment multiset must match. On any failure the formatter emits the input unchanged — it is, by construction, incapable of changing what a file means. This is stronger than an AST-shape comparison because the lossless tree lets the guard check comment preservation directly.

  8. Location. In oxc-fmt, over the shared oxc lexer/parser/CST — no second grammar (the Nix cautionary tale). ox fmt [--check] <paths> and the oxfmt binary share one path; --check is exactly the “already a normal form?” predicate.

The canonical-forms layering (why this ships now)

Settling a canonical surface form (clause ordering, dual-form spellings, the forall/exists surface, predicate-sublanguage unification) refines by merging classes, giving a chain ∼₀ ⊆ ∼₁ ⊆ ⋯. The formatter is well-defined at every stage, and each later decision lands as one additional confluent pass that only shrinks the already-canonical set — it never invalidates prior output. So the layout floor ships now on settled syntax, and the unsettled canonical-form decisions are neither blocked by nor forced by the formatter; each becomes a small pass with a built-in acceptance test (idempotence + token/comment equivalence on the corpus) when its decision lands.

Correctness

The layered net, each rung catching what the one below misses:

  1. Parse-preservation — the output re-parses cleanly (cheapest gate).
  2. Token-and-comment equivalence — the safety guard above, enforced on every format.
  3. Idempotencefmt(fmt(s)) = fmt(s) byte-exact, over the whole corpus and under fuzzing.
  4. Layout-invariance (modulo blank lines) — re-spacing a file’s token stream (mangling indentation and intra-line spacing, preserving paragraph breaks) yields identical canonical bytes across the corpus: input layout does not leak into output.
  5. Totality / no-panic fuzzing — arbitrary-string and token-salad property tests confirm the lowering and renderer never crash and every produced output is a fixed point.

What is out of scope

The layout-preserving reprinter (refactoring / range edits); incremental and range formatting for the LSP; an optimal cost-based evaluator beyond the constructs that demonstrably need it; and forcing the unsettled canonical-surface decisions, which the layering above defers to their own rulings. A code style guide (spec/reference/code-styleguide.md) describes the canonical layout the implementation produces — derived from the formatter, not the other way round.