RFD 0034 — Source text encoding and the Unicode lexical policy

State: committed
Opened: 2026-06-14
Decides: that Argon source is UTF-8; that identifiers are Unicode per UAX #31; that comments and string/char literals admit arbitrary UTF-8; and that operators, punctuation, and keywords stay ASCII (modulo the established ⊑/⊤/⊥ typeset aliases). Records two safety/correctness items as documented fast-follows: NFC normalization at the name-resolution layer (canonical equivalence) and a mixed-script confusable warning (UAX #39). Also records the module-file membership rule (Rust semantics: only mod/use-reachable files are part of a package) and its loud counterpart, OW0710.
Implements: §2.1/§2.3 of the reference; the lexer change in oxc-lexer; the reachable-closure workspace build in oxc-workspace; OW0710 (OrphanModuleFile) flipped from reserved to live.

This RFD records, as built, two adjacent lexical-layer decisions that surfaced together while diagnosing a real authoring incident: a tenant ontology package whose editor lit up with “unsupported non-ASCII character” diagnostics pointing at obviously-valid, pure-ASCII rule files.

Question

Encoding. §2.1 already declared source UTF-8, but §2.3 defined identifiers as [A-Za-z_][A-Za-z0-9_]* (ASCII only), and the lexer rejected any non-ASCII byte outside string literals — including in comments. So an em-dash in a // comment was a hard lexer error. What is the real policy?
Identifiers. Should identifiers be ASCII-only, or full Unicode? If Unicode, with what normalization, and how do we keep visually-confusable homoglyphs from silently denoting different names?
Module membership. A .ar file sitting in a package’s source tree but declared by no mod was being lexed, parsed, and checked — and (through Salsa accumulation during cross-module name resolution) its lex errors bubbled up and were misattributed to sibling files. Is a non-mod-reachable file part of the package?

Decision

1. Source is UTF-8; non-ASCII is admitted in identifiers, comments, and literals

Comments (//, ///, //!, /* */) and string/char literals admit arbitrary UTF-8. (The lexer already scanned these byte-by-byte; the policy is now explicit and tested.)
Operators, punctuation, and keywords are ASCII. The only non-ASCII operator forms are the recognized typeset aliases ⊑ (U+2291 → <:), ⊤ (U+22A4 → Top), ⊥ (U+22A5 → Bot). A non-ASCII codepoint in operator position is still a hard error (OE0001), now reported at the correct file and codepoint.
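
As a rough illustration of the operator rule, the alias table can be read as a small total mapping with everything else non-ASCII rejected. This is a sketch only: `ascii_spelling` and `check_operator_char` are hypothetical names, not the oxc-lexer API, and only the three aliases and the OE0001 code come from the text above.

```rust
/// Maps a recognized typeset alias to its ASCII spelling.
/// Hypothetical helper; the three entries are the ones listed in this RFD.
fn ascii_spelling(c: char) -> Option<&'static str> {
    match c {
        '\u{2291}' => Some("<:"),  // ⊑
        '\u{22A4}' => Some("Top"), // ⊤
        '\u{22A5}' => Some("Bot"), // ⊥
        _ => None,
    }
}

/// Accepts ASCII and the recognized aliases in operator position;
/// any other non-ASCII codepoint yields the OE0001 error code.
/// (Whether a given ASCII character is actually an operator is not modeled here.)
fn check_operator_char(c: char) -> Result<(), &'static str> {
    if c.is_ascii() || ascii_spelling(c).is_some() {
        Ok(())
    } else {
        Err("OE0001")
    }
}
```

Under these assumptions `check_operator_char('⊑')` succeeds while an arrow such as U+2192 is rejected with OE0001.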

2. Identifiers are Unicode (UAX #31)

An identifier starts with a XID_Start character or _ and continues with XID_Continue characters (unicode-ident, the rustc-grade table). The ASCII subset is the common case and the recommended style. The token text is the raw source slice, byte-for-byte — see the lexer constraint below.
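
The scanning rule can be sketched as follows. To stay standard-library-only, the character classes are passed in as predicates; a real lexer would presumably pass `unicode_ident::is_xid_start` and `unicode_ident::is_xid_continue`. The `approx_*` stand-ins below are assumptions for illustration and are not the XID tables (for example, they reject combining marks that XID_Continue admits). `scan_ident` is a hypothetical name.

```rust
/// Returns the byte length of the identifier at the start of `src`,
/// or `None` if `src` does not begin with an identifier.
/// The caller slices `&src[..len]`, so the token text is the raw source bytes.
fn scan_ident(
    src: &str,
    is_start: impl Fn(char) -> bool,
    is_continue: impl Fn(char) -> bool,
) -> Option<usize> {
    let mut chars = src.char_indices();
    let (_, first) = chars.next()?;
    if !(first == '_' || is_start(first)) {
        return None;
    }
    for (i, c) in chars {
        if !is_continue(c) {
            return Some(i); // byte offset of the first non-identifier char
        }
    }
    Some(src.len())
}

// Std-only approximations of XID_Start / XID_Continue (NOT the real tables).
fn approx_start(c: char) -> bool {
    c.is_alphabetic()
}

fn approx_continue(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}
```

With these stand-ins, scanning `número = 1` yields a 7-byte identifier (the `ú` occupies two bytes), and `1abc` yields no identifier.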

This follows the Rust/Cargo aesthetic (Rust accepts Unicode identifiers per UAX #31) and keeps the substrate ontology-neutral: a vocabulary authored in a non-Latin script is first-class.

Lexer constraint — token text must equal the source bytes. The parser rebuilds the rowan green tree from token text and derives every node’s text_range() by accumulating token byte-lengths. If the lexer rewrote an identifier’s text (e.g. folding a de-normalized spelling to NFC), the tree’s offset space would diverge from the raw-source offset space that the checker, the LSP LineIndex, and miette all index against — shifting every downstream span. So canonicalization does not happen in the lexer; the token carries the source bytes verbatim.
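
The offset argument can be made concrete with a minimal sketch. It models only the accumulation step (ranges derived by summing token byte lengths), not rowan itself; `ranges_from_tokens` is a hypothetical helper.

```rust
use std::ops::Range;

/// Derives each token's byte range by accumulating token text lengths,
/// mirroring how ranges fall out of a tree rebuilt from token text.
fn ranges_from_tokens(tokens: &[&str]) -> Vec<Range<usize>> {
    let mut offset = 0;
    tokens
        .iter()
        .map(|t| {
            let start = offset;
            offset += t.len(); // byte length, not character count
            start..offset
        })
        .collect()
}
```

Take the source `cafe\u{301} = 1` (NFD spelling, 10 bytes). With verbatim token text the final token gets range 9..10, which slices the source back to `1`. If the lexer had folded the identifier to its NFC spelling `caf\u{e9}` (one byte shorter), the final token would get 8..9, which in the raw source is a space: every span after the identifier is off by one byte.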

NFC normalization — fast-follow. Canonical equivalence (precomposed é U+00E9 vs e+combining-acute) should hold: two such spellings ought to denote the same name. Per the constraint above, that belongs at the name-resolution / interning layer (normalize the name key, not the token text) — the rust-analyzer model. Name comparison is currently spread across the resolver, checker, and elaborator on raw .text(), so doing this correctly is its own focused change. Until it lands, identifiers are matched by their exact source bytes (an NFD and an NFC spelling of the same glyphs are distinct names).
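
A sketch of "normalize the name key, not the token text": the interner below takes the normalization function as a parameter, so the token text passed in is never modified. All names are hypothetical. `toy_nfc` composes only the single pair e + U+0301 and is in no way a real NFC implementation; an actual change would presumably use a proper normalization library.

```rust
use std::collections::HashMap;

/// Interns names under a normalized key; the caller's token text is untouched.
struct Interner<F: Fn(&str) -> String> {
    normalize: F,
    ids: HashMap<String, u32>,
}

impl<F: Fn(&str) -> String> Interner<F> {
    fn new(normalize: F) -> Self {
        Self { normalize, ids: HashMap::new() }
    }

    /// Returns a stable id for the name; equal keys share an id.
    fn intern(&mut self, token_text: &str) -> u32 {
        let key = (self.normalize)(token_text);
        let next = self.ids.len() as u32;
        *self.ids.entry(key).or_insert(next)
    }
}

/// Current behaviour: names are keyed by their exact source bytes.
fn exact_bytes(s: &str) -> String {
    s.to_owned()
}

/// Toy stand-in for NFC: handles ONLY "e" + combining acute (U+0301).
fn toy_nfc(s: &str) -> String {
    s.replace("e\u{301}", "\u{e9}")
}
```

With `exact_bytes`, `caf\u{e9}` and `cafe\u{301}` intern to different ids, which is the behaviour described above; with a normalizing key function they share one id while spans still index the raw source.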

Confusable safety (UAX #39) — fast-follow. Permitting arbitrary scripts reopens the homoglyph surface (Latin A U+0041 and Cyrillic А U+0410 render identically). The decided mitigation is a warning, not a refusal: a mixed-script confusable identifier is reported so the confusion is loud, never silent. It needs the unicode-security / unicode-script tables and a deny.toml license allowance, so it lands as a focused fast-follow. Until it lands, cross-script confusables are not flagged.
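
The shape of the mixed-script check might look like the sketch below. It is deliberately crude: `script_of` hard-codes three coarse codepoint ranges as an assumption for illustration, whereas the real warning would rely on the UAX #39 data mentioned above. `Script`, `script_of`, and `mixed_scripts` are hypothetical names.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Script {
    Latin,
    Greek,
    Cyrillic,
}

/// Very coarse script classification by codepoint range (illustrative only).
fn script_of(c: char) -> Option<Script> {
    match c as u32 {
        0x41..=0x5A | 0x61..=0x7A | 0xC0..=0x24F => Some(Script::Latin),
        0x370..=0x3FF => Some(Script::Greek),
        0x400..=0x4FF => Some(Script::Cyrillic),
        _ => None, // digits, '_', and everything else are treated as neutral
    }
}

/// Returns the scripts involved if the identifier mixes more than one,
/// which is the condition a warning would be attached to.
fn mixed_scripts(ident: &str) -> Option<Vec<Script>> {
    let scripts: BTreeSet<Script> = ident.chars().filter_map(script_of).collect();
    if scripts.len() > 1 {
        Some(scripts.into_iter().collect())
    } else {
        None
    }
}
```

Here `\u{410}ccount` (Cyrillic А followed by Latin letters) is reported as mixing Latin and Cyrillic, while an all-Latin or all-Cyrillic identifier passes without a report.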

3. Module membership is the `mod`/`use`-reachable closure (Rust semantics)

A .ar file is part of a package iff it is reachable from the package entry through a chain of mod/use declarations — exactly as a .rs file is part of a Rust crate only when a mod brings it in. A sibling file no chain reaches is not compiled, not checked, not linted, and cannot contribute diagnostics.

The compiled workspace is therefore built from the reachable closure alone. Leaving unreachable files in the workspace was the root cause of the misattribution in the Question: checking a reachable file transitively parsed every workspace file during name resolution, and an unreachable file’s lex/parse diagnostics bubbled through Salsa accumulation onto whichever reachable file triggered the parse.

Loud counterpart — OW0710 (OrphanModuleFile), now live. Rust silently ignores an unreferenced source file (the IDE hints at it); Argon’s loud-not-silent doctrine and the already-reserved §3.1 code argue for surfacing it. We emit OW0710 as a warning (the build stays green, matching Rust’s non-fatal treatment) at ox check/ox build, naming each on-disk .ar under the schema root that no mod/use chain reaches. This is the diagnostic that would have immediately explained the incident (“rel_example.ar is not part of this package”).
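
The membership rule and its OW0710 counterpart amount to a graph reachability pass followed by a set difference. The sketch below assumes the mod/use declarations have already been collected into a map from file to declared files; `reachable`, `orphans`, and that map shape are hypothetical, not the oxc-workspace API.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Computes the set of files reachable from `entry` by following declarations.
fn reachable<'a>(
    entry: &'a str,
    decls: &BTreeMap<&'a str, Vec<&'a str>>,
) -> BTreeSet<&'a str> {
    let mut seen = BTreeSet::new();
    let mut stack = vec![entry];
    while let Some(file) = stack.pop() {
        // `insert` returns false for files already visited, so cycles terminate.
        if seen.insert(file) {
            if let Some(children) = decls.get(file) {
                stack.extend(children.iter().copied());
            }
        }
    }
    seen
}

/// Files on disk that no declaration chain reaches: the OW0710 candidates.
fn orphans<'a>(
    on_disk: &BTreeSet<&'a str>,
    reachable: &BTreeSet<&'a str>,
) -> Vec<&'a str> {
    on_disk.difference(reachable).copied().collect()
}
```

For a package whose entry (say a hypothetical `lib.ar`) declares `rules.ar` and `types.ar`, with `rel_example.ar` also present on disk, the workspace would be built from the first three files and `rel_example.ar` would be the single orphan reported.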

Consequences

Vocabulary and model packages may use Unicode identifiers (matched by exact source bytes today; NFC canonical equivalence is the fast-follow above).
A scratch/tutorial .ar left in a package’s source tree no longer breaks the build with misattributed errors; it is ignored and surfaced as OW0710.
The confusable warning is owed; until it lands, a mixed-script identifier is accepted silently.
No change to operators/keywords; ⊑/⊤/⊥ aliases preserved.
