RFD 0034 — Source text encoding and the Unicode lexical policy
- State: committed
- Opened: 2026-06-14
- Decides: that Argon source is UTF-8 and that identifiers are Unicode per UAX #31, comments and string/char literals admit arbitrary UTF-8, and operators/punctuation/keywords stay ASCII (modulo the established ⊑/⊤/⊥ typeset aliases). Records two safety/correctness items as documented fast-follows: NFC normalization at the name-resolution layer (canonical equivalence) and a mixed-script confusable warning (UAX #39). Also records the module-file membership rule (Rust semantics: only mod/use-reachable files are part of a package) and its loud counterpart, OW0710.
- Implements: §2.1/§2.3 of the reference; the lexer change in oxc-lexer; the reachable-closure workspace build in oxc-workspace; OW0710 (OrphanModuleFile) flipped from reserved to live.
This RFD records, as built, two adjacent lexical-layer decisions that surfaced together while diagnosing a real authoring incident: a tenant ontology package whose editor lit up with “unsupported non-ASCII character” diagnostics pointing at obviously-valid, pure-ASCII rule files.
Question
- Encoding. §2.1 already declared source UTF-8, but §2.3 defined identifiers as [A-Za-z_][A-Za-z0-9_]* (ASCII only), and the lexer rejected any non-ASCII byte outside string literals, including in comments. So an em-dash in a // comment was a hard lexer error. What is the real policy?
- Identifiers. Should identifiers be ASCII-only, or full Unicode? If Unicode, with what normalization, and how do we keep visually-confusable homoglyphs from silently denoting different names?
- Module membership. A .ar file sitting in a package’s source tree but declared by no mod was being lexed, parsed, and checked, and (through Salsa accumulation during cross-module name resolution) its lex errors bubbled up and were misattributed to sibling files. Is a non-mod-reachable file part of the package?
Decision
1. Source is UTF-8; non-ASCII is admitted in identifiers, comments, and literals
- Comments (//, ///, //!, /* */) and string/char literals admit arbitrary UTF-8. (The lexer already scanned these byte-by-byte; the policy is now explicit and tested.)
- Operators, punctuation, and keywords are ASCII. The only non-ASCII operator forms are the recognized typeset aliases ⊑ (U+2291 → <:), ⊤ (U+22A4 → Top), ⊥ (U+22A5 → Bot). A non-ASCII codepoint in operator position is still a hard error (OE0001), now reported at the correct file and codepoint.
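The operator rule above can be sketched as a small classification function. This is a minimal illustration, not the oxc-lexer code: the OperatorChar type and classify_operator_char name are hypothetical, and the sketch assumes the caller only invokes it for a non-ASCII character found in operator position.

```rust
/// What a single non-ASCII character means in operator position.
/// Hypothetical type: the real oxc-lexer token kinds are not shown in this RFD.
#[derive(Debug, PartialEq)]
enum OperatorChar {
    /// A recognized typeset alias and the ASCII form it stands for.
    Alias(&'static str),
    /// Any other non-ASCII codepoint: a hard error (OE0001 in this RFD).
    Unsupported(char),
}

/// Classify a non-ASCII character met where an operator is expected.
/// Only the three aliases named in this RFD are accepted.
fn classify_operator_char(c: char) -> OperatorChar {
    match c {
        '\u{2291}' => OperatorChar::Alias("<:"),  // ⊑
        '\u{22A4}' => OperatorChar::Alias("Top"), // ⊤
        '\u{22A5}' => OperatorChar::Alias("Bot"), // ⊥
        other => OperatorChar::Unsupported(other),
    }
}
```

Note that the alias only determines what the token denotes; the token text itself stays the original source bytes.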
2. Identifiers are Unicode (UAX #31)
An identifier starts with a XID_Start character or _ and continues with XID_Continue
characters (unicode-ident, the rustc-grade table). The ASCII subset is the common case and the
recommended style. The token text is the raw source slice, byte-for-byte — see the lexer
constraint below.
This follows the Rust/Cargo aesthetic (Rust accepts Unicode identifiers per UAX #31) and keeps the substrate ontology-neutral: a vocabulary authored in a non-Latin script is first-class.
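As a rough sketch of the scanning rule, the function below returns the identifier at the start of a source string as an unmodified slice. It uses only the standard library, so is_ident_start and is_ident_continue are stand-ins that approximate XID_Start and XID_Continue; the real lexer uses the unicode-ident table (which exposes is_xid_start and is_xid_continue). All function names here are hypothetical.

```rust
/// Stand-in for XID_Start using only the standard library.
/// Assumption: `char::is_alphabetic` is close to, but not identical to, XID_Start.
fn is_ident_start(c: char) -> bool {
    c == '_' || c.is_alphabetic()
}

/// Stand-in for XID_Continue (same caveat). Known gap: combining marks such as
/// U+0301 are XID_Continue but are rejected by `char::is_alphanumeric`.
fn is_ident_continue(c: char) -> bool {
    c == '_' || c.is_alphanumeric()
}

/// Return the identifier at the start of `src` as a raw source slice, or `None`.
/// The slice is never rewritten, so its byte length equals the bytes consumed.
fn scan_identifier(src: &str) -> Option<&str> {
    let mut chars = src.char_indices();
    let (_, first) = chars.next()?;
    if !is_ident_start(first) {
        return None;
    }
    // The identifier ends at the first character that cannot continue it.
    let end = chars
        .find(|&(_, c)| !is_ident_continue(c))
        .map(|(i, _)| i)
        .unwrap_or(src.len());
    Some(&src[..end])
}
```

With these stand-ins, scanning "foo_1 = 2" yields "foo_1", and a Cyrillic name such as "имя" is returned as a 6-byte slice, while "9lives" is rejected because a digit cannot start an identifier.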
Lexer constraint — token text must equal the source bytes. The parser rebuilds the rowan green
tree from token text and derives every node’s text_range() by accumulating token byte-lengths.
If the lexer rewrote an identifier’s text (e.g. folding a de-normalized spelling to NFC), the tree’s
offset space would diverge from the raw-source offset space that the checker, the LSP LineIndex,
and miette all index against — shifting every downstream span. So canonicalization does not happen
in the lexer; the token carries the source bytes verbatim.
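The offset argument can be made concrete with a small sketch. The accumulate_ranges function below is a hypothetical stand-in for how ranges fall out of summing token byte lengths; it is not the rowan API.

```rust
/// Derive each token's (start, end) byte range by accumulating token text lengths,
/// mirroring how this RFD describes text_range() being derived from token text.
fn accumulate_ranges(tokens: &[&str]) -> Vec<(usize, usize)> {
    let mut offset = 0;
    tokens
        .iter()
        .map(|text| {
            let start = offset;
            offset += text.len(); // byte length, not character count
            (start, offset)
        })
        .collect()
}
```

For the source "e\u{301} = x" (a decomposed é, 3 bytes), verbatim tokens place x at bytes 6..7, which is where it sits in the file. If the lexer folded the identifier to the precomposed U+00E9 (2 bytes), the accumulated range for x would become 5..6, which in the raw source is the preceding space.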
NFC normalization — fast-follow. Canonical equivalence (precomposed é U+00E9 vs
e+combining-acute) should hold: two such spellings ought to denote the same name. Per the constraint
above, that belongs at the name-resolution / interning layer (normalize the name key, not the
token text) — the rust-analyzer model. Name comparison is currently spread across the resolver,
checker, and elaborator on raw .text(), so doing this correctly is its own focused change. Until it
lands, identifiers are matched by their exact source bytes (an NFD and an NFC spelling of the same
glyphs are distinct names).
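A minimal sketch of "normalize the name key, not the token text" follows. Everything in it is hypothetical: NameInterner is not the resolver's actual type, and toy_nfc handles a single hard-coded pair purely for illustration. A real implementation would apply full NFC, for example through the unicode-normalization crate.

```rust
use std::collections::HashMap;

/// Hypothetical interner: maps a normalized name key to a stable id.
/// The token text is never modified; only the lookup key is normalized.
struct NameInterner<F: Fn(&str) -> String> {
    normalize: F,
    ids: HashMap<String, u32>,
}

impl<F: Fn(&str) -> String> NameInterner<F> {
    fn new(normalize: F) -> Self {
        Self { normalize, ids: HashMap::new() }
    }

    /// Intern the raw token text and return the id of its normalized key.
    fn intern(&mut self, token_text: &str) -> u32 {
        let key = (self.normalize)(token_text);
        let next = self.ids.len() as u32;
        *self.ids.entry(key).or_insert(next)
    }
}

/// Toy normalizer for illustration only: composes e + U+0301 into U+00E9.
/// This is NOT full NFC; it covers one pair so the sketch needs no dependencies.
fn toy_nfc(s: &str) -> String {
    s.replace("e\u{301}", "\u{e9}")
}

/// Current behavior per this RFD: names are matched by exact source bytes.
fn exact_bytes(s: &str) -> String {
    s.to_owned()
}
```

With exact_bytes as the key function, "cafe\u{301}" and "caf\u{e9}" intern to different ids, which is the current behavior. With a normalizing key function they intern to the same id, while each token still carries its own source bytes.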
Confusable safety (UAX #39) — fast-follow. Permitting arbitrary scripts reopens the homoglyph
surface (Latin A U+0041 vs Cyrillic А U+0410 read identically). The decided mitigation is a
warning, not a refusal: a mixed-script confusable identifier is reported so the confusion is
loud, never silent. It needs the unicode-security / unicode-script tables and a deny.toml
license allowance, so it lands as a focused fast-follow. Until then, cross-script confusables are
not flagged.
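To show the shape of a mixed-script check, here is a deliberately crude sketch. It is an assumption-heavy simplification: rough_script buckets characters by code point block, whereas UAX #39 works from script data and confusable skeletons. The Script enum and both function names are hypothetical and do not reflect the unicode-security or unicode-script APIs.

```rust
use std::collections::BTreeSet;

/// Very rough script buckets by code point block (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Script {
    Latin,
    Greek,
    Cyrillic,
    Other,
}

/// Map a character to a rough script, or `None` if it is script-neutral here.
fn rough_script(c: char) -> Option<Script> {
    match c {
        // Digits, underscore, and combining diacritics do not count as a script.
        '0'..='9' | '_' | '\u{0300}'..='\u{036F}' => None,
        'A'..='Z' | 'a'..='z' | '\u{00C0}'..='\u{024F}' => Some(Script::Latin),
        '\u{0370}'..='\u{03FF}' => Some(Script::Greek),
        '\u{0400}'..='\u{04FF}' => Some(Script::Cyrillic),
        _ => Some(Script::Other),
    }
}

/// Return the scripts an identifier mixes, or `None` if it uses at most one.
/// A `Some` result is what would trigger a warning, not a refusal.
fn mixed_scripts(ident: &str) -> Option<BTreeSet<Script>> {
    let scripts: BTreeSet<Script> = ident.chars().filter_map(rough_script).collect();
    if scripts.len() > 1 {
        Some(scripts)
    } else {
        None
    }
}
```

Under this sketch, "Account" spelled with a Cyrillic А (U+0410) followed by Latin letters is reported as mixing Latin and Cyrillic, while an identifier written entirely in one script passes without a warning.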
3. Module membership is the mod/use-reachable closure (Rust semantics)
A .ar file is part of a package iff it is reachable from the package entry through a chain of
mod/use declarations — exactly as a .rs file is part of a Rust crate only when a mod brings
it in. A sibling file no chain reaches is not compiled, not checked, not linted, and cannot
contribute diagnostics.
The compiled workspace is therefore built from the reachable closure alone. Leaving unreachable files in the workspace was the root cause of the misattribution in the Question: checking a reachable file transitively parsed every workspace file during name resolution, and an unreachable file’s lex/parse diagnostics bubbled through Salsa accumulation onto whichever reachable file triggered the parse.
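The reachable closure is an ordinary graph traversal from the package entry. The sketch below assumes a hypothetical input shape (a map from each file to the files its mod/use declarations bring in); the actual oxc-workspace data structures are not described in this RFD.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Compute the set of files reachable from `entry` by following declaration edges.
/// `declares` maps a file to the files its mod/use declarations bring in
/// (a hypothetical representation chosen for this sketch).
fn reachable_closure<'a>(
    entry: &'a str,
    declares: &BTreeMap<&'a str, Vec<&'a str>>,
) -> BTreeSet<&'a str> {
    let mut seen = BTreeSet::from([entry]);
    let mut queue = VecDeque::from([entry]);
    while let Some(file) = queue.pop_front() {
        for &next in declares.get(file).into_iter().flatten() {
            // `insert` returns true only the first time, which also guards against cycles.
            if seen.insert(next) {
                queue.push_back(next);
            }
        }
    }
    seen
}
```

A file that declares other modules but is itself declared by nobody never enters the queue, so neither it nor its diagnostics can appear in the compiled workspace.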
Loud counterpart — OW0710 (OrphanModuleFile), now live. Rust silently ignores an
unreferenced source file (the IDE hints at it); Argon’s loud-not-silent doctrine and the
already-reserved §3.1 code argue for surfacing it. We emit OW0710 as a warning (the build
stays green, matching Rust’s non-fatal treatment) at ox check/ox build, naming each on-disk
.ar under the schema root that no mod/use chain reaches. This is the diagnostic that would
have immediately explained the incident (“rel_example.ar is not part of this package”).
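Given the reachable set, the orphan check reduces to a set difference against the files found on disk. The sketch below is illustrative: orphan_warnings is a hypothetical name, and the message wording is an assumption rather than the compiler's actual output.

```rust
use std::collections::BTreeSet;

/// Build one OW0710 warning per on-disk file that the reachable closure does not contain.
/// The message format is illustrative, not the real diagnostic text.
fn orphan_warnings<'a>(
    on_disk: &BTreeSet<&'a str>,
    reachable: &BTreeSet<&'a str>,
) -> Vec<String> {
    on_disk
        .difference(reachable)
        .map(|file| format!("warning[OW0710]: {} is not part of this package", file))
        .collect()
}
```

Because the result is a list of warnings rather than errors, a caller can print them and still report a successful build.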
Consequences
- Vocabulary and model packages may use Unicode identifiers (matched by exact source bytes today; NFC canonical equivalence is the fast-follow above).
- A scratch/tutorial .ar left in a package’s source tree no longer breaks the build with misattributed errors; it is ignored and surfaced as OW0710.
- The confusable warning is owed; until it lands, a mixed-script identifier is accepted silently.
- No change to operators/keywords; ⊑/⊤/⊥ aliases preserved.