RFD 0036 — Heterogeneous and specialized data stores

  • State: discussion
  • Opened: 2026-06-15
  • Decides: how specialized / heterogeneous stores — a columnar analytical store (a >TB “market oracle” securities master), a blob store, a DynamoDB/KV store — become part of the Argon knowledge graph and are queried through it, alongside the default store (in-memory / Postgres). Establishes three distinct patterns (foreign federation, external-valued attributes, persistence-backend swap), the connector SPI, the mapping & placement surface, the recursion×federation discipline, the world-assumption-as-tier-input rule, and the provenance/freshness contract. Built on RFD 0035 (the operator-tree pipeline — its forcing consumer and substrate), RFD 0033 (the store-agnostic Schema query-provider and the ad-hoc surface), and RFD 0020/RFD 0021 (the engine).
  • Grounded in: a six-track literature campaign (polystore/federation, OBDA/VKG, pluggable persistence, recursion×federation, compute pushdown, consistency/provenance) whose decision-ready briefing this RFD selects from. External systems and PL theory are cited as evidence, never authority; current-repo code is cited at verified anchors.
  • Depends on: RFD 0033 (PR #401) — the store-agnostic Schema query-provider this RFD’s C1 rests on. Merge #401 first; on this branch the 0033-*.md links are forward-referential by design, not dangling.

This RFD is co-designed with RFD 0035: 0035 is the engine layer the foreign data flows through, 0036 is the store layer. Per the repo workflow this is language-surface + engine architecture (RFD + reference → Lean where a new semantic notion appears → code); the only net-new semantic obligation is the frozen-foreign-EDB preservation theorem (RFD 0035 D3), expressible against spec/lean/Argon/Reasoning/.

Reconciled with the performance / distribution / consensus research campaign (2026-06-15). D2/D3/D4/D7/D9/D10 are updated to record the campaign’s findings: the read-model is the primary path as a columnar content-addressed immutable segment maintained by IVM (D9), the >TB analytical bar is met by pushdown + push-compute-to-data + Argon’s own vectorized streaming rather than a BTreeMap freeze (D4), the segment carries provenance inline via a two-semiring maintainer (D7), the write spine is a scalar root pointing at an immutable content-addressed manifest (D9), and the financial path is three workloads with Argon as the read/OLAP DB beside a federated ledger (D10). The governing principle below frames all of it. The campaign is research and decides nothing; these edits are the cut.


Question

Argon’s value is the ontology + reasoning layer over data. Some of that data is too large, or too workload-specialized, to live as Argon axiom events: a >TB columnar securities master built for analytics; a blob store for documents; a KV store for point lookups. The intent — stated since the ad-hoc work began — is that such stores are part of the knowledge graph and queryable through Argon naturally, joined against native facts and reasoned over, without copying their data into Argon’s log.

The forcing example: “flag every security in the same k-means cluster as a known-distressed security, where clusters are computed over the market oracle’s return vectors.” distressed/1 is a small native relation; returns(...) lives in the columnar store; cluster_of(...) is computed by the store’s own engine. This single query exercises every hard axis: foreign data in a native shape, a relation→relation compute operator, a federated join, recursion-adjacency, world assumptions, and cross-store provenance.

What is the design by which heterogeneous stores join the knowledge graph — the connector contract, the placement/mapping surface, the recursion discipline, the world-assumption handling, and the provenance/freshness model — built correctly and completely, with no hollow path?


Context

What RFD 0035 provides

The operator-tree pipeline: LogicalPlan as the shared lowering target (D1), the frozen-EDB materialization discipline (D3 — a non-Datalog sub-plan’s result is materialized into the catalog as a frozen EDB the evaluator joins natively), the relation-valued table operator IR (D4), the tree optimizer with pushdown + magic-sets/demand + federation-split (D5), and the generalized operator-call executor (D6). 0036 attaches its connector/compute/store layers to these seams. Without 0035 there is no place to attach (verified: the live path is AtomIR → compile_rule → Engine::evaluate; LogicalPlan is orphaned).

What the substrate already provides (verified, not assumed)

  • AxiomEvent already is the native provenance/freshness token (oxc-protocol/src/storage.rs:1508): content_id (BLAKE3 — C10) + the four scope axes (tenant/fork/standpoint/module — C8) + a bitemporal extent + a proof_tag (defeasibility) + a derivation (PosBool(M) DNF why-provenance). Federation adds one leaf, not a new token.
  • Per-concept world assumption is already mechanized: WorldAssumptionMap and Locality/Cwa.lean, with cwa_owa_transfer proven (CWA-true ⇒ OWA-true; the reverse proved unsound), and diagnostics OE0901/OW0902 reserved but unbuilt. Argon is ahead of deployed OBDA here, which is uniformly OWA (Ontop never implemented closed predicates).
  • The freeze discipline is in the engine (eval.rs:170-174 clock pin; 232-235 stable arrangements) — RFD 0035 D3 generalizes it.
  • No store config exists. ox.toml (oxc-workspace/src/lib.rs:160-168) parses package/project/schema/dependencies/lattice only. The .oxbin/pg projection cache exists but is inert (None everywhere; oxc-storage-pg get/put_projection_cache tested, never called by the evaluator).
  • A RuntimeStorageBackend trait exists (oxc-runtime/src/lib.rs:2890) — a sync, in-process replay seam; PgStorage is a separate, async, CQRS-shaped durable layer reached by hydrate-then-replay (it does not implement that trait).

Decision

Governing principle — Argon is a first-class engine and a permanent orchestrator (both, forever). Two things are true at once and neither subsumes the other. (1) Argon itself is a high-performance database engine — never “dumb,” never a thin forwarder, never pawning off its own/core performance to another DB; the end-state includes Argon’s own custom engine, and for data Argon owns and reasons over, Argon’s own engine does the work. (2) Argon is also a permanent orchestrator/federator over heterogeneous external stores — now and forever, even after that engine exists — because different workloads genuinely require different stores (sub-ms operational KV, >TB columnar analytics, blob stores, time-series). This is first-class permanent architecture, not scaffolding to outgrow. The router between them is [placement] (D6): per-workload, which store serves it — Argon’s own engine the first-class default, a specialized external store chosen when genuinely optimal. The only thing that lessens over time is the current degree of reliance on externals (and outsourcing the durability backend for Argon’s own log, à la Datomic); the orchestration capability is permanent. “Don’t pawn off” is therefore narrow: it forbids delegating Argon’s own core performance and forbids Argon being a dumb passthrough — it does not mean retreating from external stores.

D1 — Three patterns, named and kept separate

| Pattern | What it is | Driving example | Seam |
|---|---|---|---|
| P2: Foreign federation (centerpiece) | Foreign data in its native shape, mapped into the ontology, queried through Argon, never copied into the log | the market oracle | connector SPI (D3) + frozen-EDB (RFD 0035 D3) |
| P3: External-valued attributes | A property’s value is a content-addressed handle; bytes live in a blob store | document on an individual | Ref<Blob> (D8) |
| P1: Persistence swap | Argon’s own event-log/read-model lives in a chosen durable backend | log in DynamoDB | thin durable seam (D9) |

Conflating them is rejected: P2 data has no axiom-event semantics (no bitemporality, polarity, standpoints, defeat tags); forcing those onto it is the trap. P1 is an Argon-own-model persistence concern; P3 is a typed value with a remote byte-store.

D2 — Foreign data is a virtual extensional predicate, outside the log, frozen per query; read-only in v1

A foreign relation is a virtual EDB produced on demand by a connector and materialized once into the catalog as a frozen EDB for the query’s duration (RFD 0035 D3). Argon stores only the mapping + a connection contract, never the foreign data. The axiom-event log stays source-of-truth for native facts; the foreign store for its own; they meet at query time. Foreign stores are read-only in v1 — and this is the correct steady-state architecture, not a limitation: per-entity ACID + idempotent cross-entity messaging is the right shape (Helland); 2PC is an anti-availability protocol that only works inside one store (Gray/Lamport). Cross-store write transactions stay out of scope, with the sole admissible exception being P3’s non-transactional content-put (D8). This is the realization of RFD 0020 D11 (“ad-hoc queries/mutations first-class … gating is engine policy”) extended across the federation boundary.

The BTreeMap freeze is for small slices and is not the >TB performance path (updated per the performance campaign, 2026-06-15). Materializing a slice into an in-memory BTreeMap CatalogEntry is correct for a small federated result (or as the fallback), but it is the O(database) cost CP2 forbids for a >TB store — and you cannot freeze a >TB slice into a BTreeMap. The >TB analytical bar is met by never moving the data: push filters/projections/aggregates down (D3), run analytical compute in-store where the store can (the k-means train/predict split — only the small result returns), and for what must run in Argon, stream it through Argon’s own vectorized engine (RFD 0035 D4) over columnar segments/sources — never a BTreeMap ingest. Only the small result re-enters as a frozen EDB to join native facts. Argon provides the performance; a “dumb” store provides only bytes — Argon is never the dumb layer (Decision lead).

D3 — The connector SPI: per-operation, three-valued pushdown verdict + binding patterns, object-safe

A foreign store implements an object-safe ForeignRelation trait (held as Arc<dyn> in a CatalogEntry, so one catalog holds heterogeneous connectors — DataFusion’s Arc<dyn TableProvider> precedent):

  • schema() — answers type questions into Argon’s vocabulary; never owns type semantics. Schema stays store-agnostic (C1, RFD 0033) — the connector sees the IR, never Schema.
  • binding_patterns() — capability as {b,f}^n adornments (C4; Rajaraman–Sagiv–Ullman): ff… free-scannable, bf… requires a bound key (a blob/KV store cannot free-scan). Consumed by the optimizer’s bound-set propagation (the same bound: BTreeSet<VariableIdx> reorder.rs already tracks, lifted to LogicalPlan — RFD 0035 D5).
  • apply_filter(&LogicalPlan) -> (handle, Absorption) with three-valued Absorption ∈ {Exact, Inexact, Unsupported} (DataFusion TableProviderFilterPushDown): Exact = no re-check; Inexact = the source prunes but the engine re-applies the whole predicate (so the residual is not the set-complement of the absorbed work); Unsupported = engine does it. Under Inexact the residual is the original typed predicate, so the type-identical residual demand (C2/C3) is met automatically.
  • scan(handle, demand) -> stream — lazy; a bounded binding-pushed slice via the b positions in demand (never a free count on the hot path — C5).
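
The four operations above can be sketched as an object-safe Rust trait. This is a minimal illustration under stated assumptions, not the SPI itself: EqFilter is a hypothetical stand-in for the LogicalPlan predicate, the pushdown handle is omitted, scan is synchronous, and KvStore, ColumnarStore, Catalog and can_free_scan are invented names. Only ForeignRelation, the four method names and the Absorption variants come from the text.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Three-valued pushdown verdict named in D3.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Absorption {
    Exact,
    Inexact,
    Unsupported,
}

/// One position of a {b,f}^n adornment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Adornment {
    Bound,
    Free,
}

/// Hypothetical stand-in for a LogicalPlan predicate: `column == value`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EqFilter {
    pub column: usize,
    pub value: String,
}

pub type Row = Vec<String>;

/// Object-safe connector trait (simplified and synchronous for illustration).
pub trait ForeignRelation: Send + Sync {
    fn schema(&self) -> Vec<String>;
    fn binding_patterns(&self) -> Vec<Vec<Adornment>>;
    fn apply_filter(&self, filter: &EqFilter) -> Absorption;
    /// `demand[i] = Some(v)` binds position i; `None` leaves it free.
    fn scan(&self, demand: &[Option<String>]) -> Box<dyn Iterator<Item = Row> + '_>;
}

/// One catalog holding heterogeneous connectors behind `Arc<dyn ...>`.
pub type Catalog = BTreeMap<String, Arc<dyn ForeignRelation>>;

/// Point-lookup store: only the `bf` pattern, so it cannot free-scan.
pub struct KvStore {
    pub rows: BTreeMap<String, String>,
}

impl ForeignRelation for KvStore {
    fn schema(&self) -> Vec<String> {
        vec!["key".to_string(), "value".to_string()]
    }
    fn binding_patterns(&self) -> Vec<Vec<Adornment>> {
        vec![vec![Adornment::Bound, Adornment::Free]]
    }
    fn apply_filter(&self, filter: &EqFilter) -> Absorption {
        // Key equality is pruned at the source; the engine still re-checks.
        if filter.column == 0 { Absorption::Inexact } else { Absorption::Unsupported }
    }
    fn scan(&self, demand: &[Option<String>]) -> Box<dyn Iterator<Item = Row> + '_> {
        match demand.first().and_then(|k| k.as_ref()) {
            Some(key) => Box::new(
                self.rows.get(key).map(|v| vec![key.clone(), v.clone()]).into_iter(),
            ),
            // No bound key: a bf-only source yields nothing rather than scanning.
            None => Box::new(std::iter::empty()),
        }
    }
}

/// Free-scannable store: the `ff` pattern.
pub struct ColumnarStore {
    pub rows: Vec<Row>,
}

impl ForeignRelation for ColumnarStore {
    fn schema(&self) -> Vec<String> {
        vec!["id".to_string(), "sector".to_string()]
    }
    fn binding_patterns(&self) -> Vec<Vec<Adornment>> {
        vec![vec![Adornment::Free, Adornment::Free]]
    }
    fn apply_filter(&self, _filter: &EqFilter) -> Absorption {
        Absorption::Inexact
    }
    fn scan(&self, demand: &[Option<String>]) -> Box<dyn Iterator<Item = Row> + '_> {
        let demand = demand.to_vec();
        Box::new(self.rows.iter().filter(move |row| {
            demand.iter().zip(row.iter()).all(|(d, v)| d.as_ref().map_or(true, |want| want == v))
        }).cloned())
    }
}

/// True if some declared pattern has every position free.
pub fn can_free_scan(rel: &dyn ForeignRelation) -> bool {
    rel.binding_patterns().iter().any(|p| p.iter().all(|a| *a == Adornment::Free))
}
```

In this sketch the optimizer would consult can_free_scan (or the patterns directly) before placing a connector where no key is bound; a bf-only store with no bound key returns an empty slice instead of scanning.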

Rejected: the delegated-subplan-rewrite shape (datafusion-federation: the connector ships its own optimizer rule and self-determines the federated fragment). It is more expressive and a cleaner IR fit, but the engine cannot independently cost or type-check what the connector absorbed opaquely, which is irreconcilable with the type-identical-residual demand (C2/C3), and that demand is non-negotiable for Argon.

The async connector is reconciled with the sync fixpoint by the freeze rule: await the slice once, materialize it as a frozen EDB, iterate sync (RFD 0035 D3).

Relation-provider and compute-provider are distinct seams: a relation provider negotiates binding patterns; a compute provider is a table operator (RFD 0035 D4) routed to an analytical-tier executor (RFD 0035 D6). The market-oracle k-means is the latter: train (non-deterministic) runs outside the fixpoint and emits a frozen content-addressed model artifact; predict/assign (deterministic given the model) re-enters as a frozen EDB. Compute determinism at the foreign boundary is contract-asserted (the connector declares it; undeclared compute operators are refused in a fixpoint position, per RFD 0035 D4), since Argon cannot verify foreign code by inspection.

Connector verdicts are a trust boundary — two distinct dimensions, with different defenses; do not fuse them. A connector that wrongly returns Exact suppresses the in-engine re-check and yields silently wrong answers — the one outcome a correctness-first engine cannot tolerate (collation, NULL handling, numeric coercion are the classic mismatches). The differential oracle (RFD 0035 D8) is load-bearing for in-engine passes but structurally cannot test a connector — it has no foreign data. The two dimensions:

  • Data completeness (false negatives — a source omits rows it should return). Irreducibly trusted in both Exact and Inexact, because re-applying the predicate only re-filters the rows that came back — it can never recover omitted ones. “Always re-check” buys exactly nothing here. This is the Postgres-is-trusted dimension: a connector is, irreducibly, a trusted data source for the relations it serves.
  • Verdict honesty (false positives — a source claims Exact but returns rows that fail the predicate). Not irreducible — this is the connector’s code, not the source’s data, and it is closable.

Three measures follow from the split:

  • Inexact-by-default. An Exact claim is honored — i.e. allowed to suppress the residual re-check — only from a connector that has passed a conformance harness for the operations it claims; otherwise pushdown is treated as Inexact and the engine re-applies the whole predicate. This removes the engine-introduced footgun (honoring an unverified guarantee).
  • Audit mode is the “always-re-check” configuration. It re-runs every Exact verdict against the residual and flags divergence — run in CI / dev / canary, where the re-check is free, rather than taxing production. This captures everything a blanket “no Exact ever” would buy on the verdict-honesty dimension, at zero steady-state cost — the federation analogue of the WCOJ-soundness bug the Lean↔Rust conformance framework caught (1bdfa164f).
  • Honest scope. A perpetual production re-check would pay a hot-path tax to defend only verdict-honesty — the dimension audit already closes for free — while leaving data-completeness, the irreducible hole, exactly as open. So Inexact-default + earned-Exact + audit-in-CI is the calibrated answer, not “always re-check.” Conformance raises confidence in the completeness trust; it does not abolish it.
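
The three measures can be read as a small decision function plus an audit pass. The sketch below is one possible shape, assuming a boolean conformance_passed flag per connector operation; effective_verdict, residual_needed and audit_exact are hypothetical names, not part of the SPI.

```rust
/// Three-valued pushdown verdict named in D3.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Absorption {
    Exact,
    Inexact,
    Unsupported,
}

/// Inexact-by-default: an Exact claim is honored only from a connector that
/// has passed the conformance harness; otherwise it is downgraded.
pub fn effective_verdict(claimed: Absorption, conformance_passed: bool) -> Absorption {
    match (claimed, conformance_passed) {
        (Absorption::Exact, false) => Absorption::Inexact,
        (verdict, _) => verdict,
    }
}

/// The engine re-applies the whole predicate unless the verdict is Exact.
pub fn residual_needed(verdict: Absorption) -> bool {
    verdict != Absorption::Exact
}

/// Audit mode: re-run the predicate over rows returned under an Exact
/// verdict and report the rows that fail it (verdict-honesty violations).
/// It cannot detect omitted rows; data completeness stays trusted.
pub fn audit_exact<R: Clone>(rows: &[R], predicate: impl Fn(&R) -> bool) -> Vec<R> {
    rows.iter().filter(|r| !predicate(*r)).cloned().collect()
}
```

A non-empty result from audit_exact is the divergence signal to raise in CI, dev or canary runs. The function only sees rows that came back, which is why it closes verdict honesty and leaves data completeness untouched.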

The store taxonomy — one SPI, heterogeneous roles (updated per the performance campaign, 2026-06-15). The same ForeignRelation SPI accommodates stores with very different capabilities because binding_patterns() negotiates them: a point-lookup-only KV store answers bf (needs a bound key), a free-scannable columnar store answers ff. Concretely: DynamoDB plays up to three distinct roles — the outsourced CAS / durability backend for Argon’s own log (D9), an operational point-lookup tier, and a federated source; DuckDB / Parquet / Arrow is a >TB columnar source (a D3 connector); S3 is the immutable-segment object-store truth (D9) and the Ref<Blob> byte store (D8). DataFusion’s TableProvider is borrowed only as the connector interface shape — it is not Argon’s execution engine (RFD 0035 D4: Argon’s own vectorized engine is forced by the inline-provenance segment and justified by the small vectorized-vs-compiled constant). Federating to such a store is correct precisely for data Argon does not own or has deliberately placed there (D6) — the permanent-orchestrator half of the Decision lead — while Argon’s own engine remains the first-class default for what Argon owns.

D4 — Recursion × federation: the frozen-foreign-EDB rule + the refusal gates

A foreign relation in a fixpoint body is frozen-materialized once (RFD 0035 D3) — sound because a frozen slice is indistinguishable from a native EDB at the level of the mechanized semantics. For free-scannable sources this is the complete v1 mechanism — within a materialization cardinality budget. Freezing a slice into an in-memory BTreeMap CatalogEntry is bounded by demand for a binding-limited (bf) source, but bounded by nothing for a large or non-selective free-scan (ff) — an O(foreign-database) pull into memory, in direct tension with C5. So free-scannable freezing carries an explicit cardinality guard: a slice projected to exceed the budget is refused with a remediation diagnostic (“add a selective filter or declare a binding pattern”), never silently materialized to OOM. (The market-oracle headline is safe — k-means runs in-store and only the small cluster_of returns; a generic free-scan federated join is the case the guard protects.) Push-compute-to-data + streaming columnar execution is load-bearing for the >TB bar, not a follow-on (updated per the performance campaign, 2026-06-15): the BTreeMap freeze fundamentally cannot do >TB, so the trading-grade analytical path is pushdown (D3) + in-store compute where possible + Argon’s own vectorized streaming engine over columnar data otherwise (RFD 0035 D4) — the frozen-into-BTreeMap EDB is reserved for the small result re-entering the fixpoint. (Spilling joins and pushing the join itself down remain genuine follow-ons that further lift the budget.) For binding-limited (bf) sources the demand is itself recursive (Duschka–Genesereth: binding-limited access compiles to a recursive demand program), so “freeze once” is naive; the demand is computed by magic-sets/demand transformation (RFD 0035 D5) and the bounded slice pulled, with the connector required idempotent and monotone under growing demand. 
The choice between the two resolutions, demand-stratify (compute the complete magic_F extent in a lower stratum, then freeze) versus monotone bounded re-consultation (consult per demand-growth round; F* only grows), is left to the implementation; free-scannable is the floor, binding-limited the careful extension.
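
The cardinality guard on free-scan freezing can be sketched as an admission check run before any rows are pulled. Access, FreezeDecision and admit_freeze are hypothetical names, and the sketch assumes the planner can supply a projected row count; it also assumes a binding-limited slice bypasses the guard because demand already bounds it.

```rust
/// How the source is being accessed for this slice.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Access {
    FreeScan,       // ff: bounded by nothing unless a filter is selective
    BindingLimited, // bf: bounded by the demand set
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FreezeDecision {
    Materialize,
    Refuse { diagnostic: String },
}

/// Admit or refuse freezing a slice into the in-memory catalog entry.
/// `projected_rows` is an assumed planner estimate, not a measured count.
pub fn admit_freeze(access: Access, projected_rows: u64, budget_rows: u64) -> FreezeDecision {
    match access {
        // Assumption: demand already bounds a binding-limited slice.
        Access::BindingLimited => FreezeDecision::Materialize,
        Access::FreeScan if projected_rows <= budget_rows => FreezeDecision::Materialize,
        Access::FreeScan => FreezeDecision::Refuse {
            diagnostic: format!(
                "projected slice of {} rows exceeds budget of {}: \
                 add a selective filter or declare a binding pattern",
                projected_rows, budget_rows
            ),
        },
    }
}
```

The refusal carries the remediation text, so an over-budget free-scan fails at plan time with an actionable message rather than at run time with an out-of-memory condition.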

Refusal gates (C9 teeth — static, checkable, never silent mis-evaluation):

  • NAF over an OWA foreign relation — absence ≠ false in an open world; refuse (or thread three-valued).
  • A relation both foreign-mapped and rule-derived — intensional/extensional conflict; compile-time refusal.
  • Pushing recursion into a source — Li–Chang: decidable for conjunctive fragments, undecidable with recursion + integrity constraints; only bounded binding-slices push, the fixpoint stays in-engine.
  • Result-bounded incomplete foreign slices (paginated/rate-limited APIs) under CWA-NAF — an incomplete F* silently breaks closed-world negation; refuse until the slice is warranted complete (D5).
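
Because the gates are static, they can be expressed as a pure check over facts the compiler already knows about each use of a foreign relation. The sketch below assumes those facts are available as booleans and implements the "refuse" branch of the first gate (not the three-valued alternative); ForeignUse, Refusal and check_gates are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum World {
    Closed, // CWA
    Open,   // OWA
}

/// Statically known facts about one use of a foreign relation in a rule body.
#[derive(Debug, Clone, Copy)]
pub struct ForeignUse {
    pub world: World,
    pub under_naf: bool,
    pub also_rule_derived: bool,
    pub recursion_pushed_to_source: bool,
    pub slice_warranted_complete: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Refusal {
    NafOverOwa,
    ForeignAndDerived,
    RecursionPushedToSource,
    IncompleteSliceUnderCwaNaf,
}

/// Returns every gate the use trips; an empty vector means the use is admitted.
pub fn check_gates(u: &ForeignUse) -> Vec<Refusal> {
    let mut refusals = Vec::new();
    if u.under_naf && u.world == World::Open {
        refusals.push(Refusal::NafOverOwa);
    }
    if u.also_rule_derived {
        refusals.push(Refusal::ForeignAndDerived);
    }
    if u.recursion_pushed_to_source {
        refusals.push(Refusal::RecursionPushedToSource);
    }
    if u.under_naf && u.world == World::Closed && !u.slice_warranted_complete {
        refusals.push(Refusal::IncompleteSliceUnderCwaNaf);
    }
    refusals
}
```

Collecting all tripped gates, rather than stopping at the first, lets a single compile report every reason a query is refused.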

D5 — World assumption is a decidability-tier input, not a soundness flag

A foreign relation declares its CWA/OWA (C6); the mark propagates into the tier classifier (the Tier metadata LogicalPlan nodes carry — RFD 0035 D5/D6). Closing a predicate is a complexity cliff — CQ answering jumps from AC0 (DL-Lite) to coNP-hard the instant any predicate is closed, unless the query is quantifier-free. So Argon admits closed foreign predicates only inside the Lutz–Seylan–Wolter Thm-5 FO-rewritable island (quantifier-free UCQs, no open→closed role inclusion — a static syntactic gate, firing the reserved OE0901/OW0902), and refuses the rest rather than silently moving a query past its tier ceiling. The world-assumption mark also gates which fixpoint flavour a foreign relation may enter (a CWA relation admits NAF / a WFS-SCC; an OWA one does not — the C6×D4 hinge). The cross-boundary completeness warrant — what a connector must supply for the CWA→OWA transfer to be sound — is the same composite leaf as D7’s freshness token, specialized with a closed? flag; the in-engine theorem (cwa_owa_transfer, CwaOwa.lean) exists, the connector contract is net-new.

D6 — Mapping & placement: three levels + a compiled content-addressed artifact; placement versioned separately from schema

  • Level 1 — source annotation (vocabulary-free — C7): a declaration marks a relation/concept foreign, with its world assumption and an abstract field-correspondence to logical names. No ontology vocabulary, no store identity.
  • Level 2 — ox.toml [store] / [placement] (versioned package contract, no secrets — C11): the named store, its kind (columnar/blob/relational/kv), binding-pattern hints, and an RML-style mapping shape; credentials referenced only by an indirect @deploy: handle.
  • Level 3 — deployment config (not versioned): endpoints, credentials, the concrete connector instance.

The mapping compiles to a content-addressed artifact hashed against the Schema composition signature (the .oxbin discipline — oxc-oxbin already carries per-section BLAKE3 + a composition signature — C10), so schema↔mapping drift is a load-time refusal, not a runtime surprise. Placement is versioned separately from schema: a relation can move stores without a schema bump; the mapping pins to a schema hash. Schema stays store-agnostic throughout (C1) — it answers type-checking questions (subsumption, refinement, world assumption) for a foreign relation without learning where bytes live; placement is the parallel catalog layer beside it.
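
The drift check and the placement/schema separation can be illustrated with a small sketch. The field names, the u64 signature (standing in for the BLAKE3 composition signature) and the functions check_mapping and replace_store are assumptions made for illustration; the real artifact layout is whatever oxc-oxbin defines.

```rust
/// Hypothetical compiled mapping artifact, pinned to a schema signature.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MappingArtifact {
    pub relation: String,
    pub store: String,
    /// Stand-in for the Schema composition signature the mapping was compiled against.
    pub schema_signature: u64,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LoadError {
    SchemaDrift { pinned: u64, current: u64 },
}

/// Load-time refusal: the mapping is rejected if the schema it pins to
/// differs from the schema currently loaded.
pub fn check_mapping(artifact: &MappingArtifact, current_signature: u64) -> Result<(), LoadError> {
    if artifact.schema_signature == current_signature {
        Ok(())
    } else {
        Err(LoadError::SchemaDrift {
            pinned: artifact.schema_signature,
            current: current_signature,
        })
    }
}

/// Moving a relation to another store changes placement only;
/// the schema pin is carried over unchanged.
pub fn replace_store(artifact: &MappingArtifact, new_store: &str) -> MappingArtifact {
    MappingArtifact {
        store: new_store.to_string(),
        ..artifact.clone()
    }
}
```

In this reading a store move produces a new placement with the same schema pin, so it passes check_mapping without a schema bump, while any schema change invalidates every mapping compiled against the old signature.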

D7 — Provenance & freshness: a foreign leaf in the existing PosBool(M) DNF

AxiomEvent already is the native token (Context). Federation adds one generator leaf (source_id, mapping_content_hash, as_of, closed?) into the same derivation DNF — the engine’s ⊗/⊕ provenance composition is unchanged. The mapping_content_hash triple-duties: the OBDA mapping-axiom provenance label (Calvanese 2019) + the freshness coordinate + the C10 content-address. Freshness is a three-rung ladder gated by source capability: (a) TTL/staleness-bound (weakest; a liveness property, source needs no cooperation); (b) CDC/change-feed (push invalidation — the delta path, RFD 0035 D7); (c) per-source as_of barrier (strongest, read-your-writes). Argon’s bitemporal tt is a barrier coordinate, so the engine-side mechanism for rung (c) already exists — but the rung is not “free”: the connector must expose a monotonic source position and a mapping that aligns it with tt. That alignment is a real per-connector obligation, not a given. The mandatory floor and the composite freshness of a multi-source join (v1: meet-of-leaves — the answer is as fresh as its weakest leaf) are the open residuals.
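
The v1 meet-of-leaves rule can be sketched as a minimum over the leaves of a join. The leaf fields follow the tuple above; the rung field on the leaf, the CompositeFreshness type and the choice to take the minimum over both rung and as_of are assumptions of this sketch, one plausible reading of "as fresh as its weakest leaf".

```rust
/// Freshness ladder; derive(Ord) orders variants as declared: Ttl < Cdc < AsOfBarrier.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum FreshnessRung {
    Ttl,
    Cdc,
    AsOfBarrier,
}

/// The foreign generator leaf, plus a hypothetical `rung` field.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ForeignLeaf {
    pub source_id: String,
    pub mapping_content_hash: String,
    pub as_of: u64,
    pub closed: bool,
    pub rung: FreshnessRung,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CompositeFreshness {
    pub rung: FreshnessRung,
    pub as_of: u64,
}

/// Meet of leaves: the weakest rung and the oldest source position.
/// Returns None for a derivation with no foreign leaves.
pub fn meet_of_leaves(leaves: &[ForeignLeaf]) -> Option<CompositeFreshness> {
    let rung = leaves.iter().map(|l| l.rung).min()?;
    let as_of = leaves.iter().map(|l| l.as_of).min()?;
    Some(CompositeFreshness { rung, as_of })
}
```

A join of a CDC-fed source at position 120 with a TTL-bounded source at position 90 would, under this reading, be reported as TTL-fresh as of 90.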

Provenance under incremental maintenance — the two-semiring split; the segment carries provenance inline (updated per the performance campaign, 2026-06-15). The persisted read-model segment (D9) carries the PosBool(M) derivation DNF + the proof_tag + the BitemporalExtent inline — “answer why at segment granularity” — which no surveyed columnar store does (they are all provenance-free), and which is one of the two reasons Argon’s read-model needs Argon’s own engine rather than an off-the-shelf one (RFD 0035 D4). The IVM maintainer keeps the two concerns on separate semiring components so they cannot fight: ℤ weights drive the cardinality IVM (DRedc / two-semiring DBSP; insertion is free under semi-naive, retraction is what forces the real algorithm), while PosBool(M) why-provenance rides as a value-field payload via the Green et al. ℕ[X] → PosBool(M) homomorphism. distinct collapses only the ℤ multiplicity component to set semantics for the oracle-identity obligation (RFD 0035 D7/D8); it does not touch the PosBool(M) payload — so the equivalence collapse and the provenance-carrying obligation are discharged by construction, not in tension.

D8 — P3 external-valued attributes: Ref<Blob>, the one tractable cross-store write

A blob-valued property is a first-class Ref<Blob> handle type (explicit indirection, composing with the reflective-Type/refinement machinery — RFD 0023 — over a magic blob-typed field). The write decomposes into three ops with different guarantees:

  1. idempotent content-addressed put: handle = BLAKE3(bytes), outside any transaction; re-putting identical bytes is a no-op by content hash;
  2. a transactional single-entity reference write — a native AxiomEvent holding the handle (the token already exists);
  3. a background GC sweep for the orphan window (put succeeded, reference never written).

It is admissible precisely because it is not a distributed transaction: it never crosses an entity boundary. The genuinely hard part is GC over a bitemporal, four-axis, fork-branched, content-addressed log: a blob referenced in fork A but retracted in fork B is not orphaned; one referenced only outside the current as_of window is live-but-invisible. Convex’s flat refCount is insufficient. v1: conservative mark-and-sweep with a grace period, and its cost is named, not implied cheap: to prove a blob unreferenced the sweep must scan reachability across all forks × as_of windows (≈ O(|log|), i.e. linear in the size of the event log, per sweep over a branched history). It is a background job, so C5 (hot-path) does not bind it, but it is not free, and that cost is precisely why a maintained per-fork refCount CQRS projection (incremental, O(1) per reference event) is the named follow-on rather than the v1 default. Schema sees only the Ref<Blob> type (C1).
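
The put and sweep steps of the three-op write can be sketched in memory. Two loud assumptions: std's DefaultHasher stands in for BLAKE3 (it is not cryptographic and is used only to keep the sketch dependency-free), and the live set passed to sweep is assumed to have been computed already by the reachability scan across forks and as_of windows. BlobStore, BlobHandle and the tick-based clock are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct BlobHandle(pub u64);

/// Stand-in for BLAKE3(bytes); NOT a cryptographic hash.
fn content_hash(bytes: &[u8]) -> BlobHandle {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    BlobHandle(hasher.finish())
}

#[derive(Default)]
pub struct BlobStore {
    /// handle -> (bytes, tick of first put)
    blobs: HashMap<BlobHandle, (Vec<u8>, u64)>,
}

impl BlobStore {
    /// Op 1: idempotent content-addressed put, outside any transaction.
    /// Re-putting identical bytes leaves the store unchanged.
    pub fn put(&mut self, bytes: &[u8], now: u64) -> BlobHandle {
        let handle = content_hash(bytes);
        self.blobs.entry(handle).or_insert_with(|| (bytes.to_vec(), now));
        handle
    }

    pub fn len(&self) -> usize {
        self.blobs.len()
    }

    /// Op 3: conservative sweep. A blob is removed only if it is absent from
    /// `live` AND older than the grace period, which covers the orphan window
    /// between a put and its reference write. Returns the number removed.
    pub fn sweep(&mut self, live: &HashSet<BlobHandle>, now: u64, grace: u64) -> usize {
        let before = self.blobs.len();
        self.blobs
            .retain(|handle, (_, put_at)| live.contains(handle) || now.saturating_sub(*put_at) < grace);
        before - self.blobs.len()
    }
}
```

Op 2, the transactional reference write, is an ordinary native event holding the BlobHandle and is not modeled here. The expensive part in practice is building live, which this sketch takes as given.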

D9 — P1 persistence swap: a thin async durable seam below the existing replay seam

The existing sync RuntimeStorageBackend (replay seam, oxc-runtime/src/lib.rs:2890) is preserved. A P1 backend (DynamoDB / FoundationDB-layer / RocksDB / Cassandra) plugs in at a separate, thin, async durable layer, where PgStorage already sits, reached by hydrate-then-replay (the async/sync split is by design, not a defect; RDFox + Datomic confirm it as the normal shape). The contract is Datomic-shaped and tiny: a consistent kv-read + one linearizable CAS on the root/watermark; the bulk store needs only eventual consistency, because the stored data is immutable (Datomic stood up DynamoDB in ~2 weeks on exactly this). First-party backends sit behind a compile-time enum, and third-party “external storage composers” plug in through an async builder trait (SurrealDB’s actual hybrid, not the enum-vs-trait dichotomy a naive reading assumes). The external durability backend (DynamoDB/S3/FoundationDB) is a swappable durability primitive, not Argon’s engine: the engine over it is always Argon’s (Decision lead). ox.toml holds at most a backend-kind selector; URIs/secrets are deployment config (C11).

Read-model persistence is the primary read path, not a coupled afterthought (updated per the performance campaign, 2026-06-15). The live read-model is a columnar, content-addressed, immutable segment (Datomic/Materialize/TerminusDB lineage, made columnar — RFD 0035 D7), maintained incrementally by the IVM maintainer so a mutate is a delta, not a rebuild. It serves C5 (reads hit segments, never the log), discharges C10 (the read model is a content-addressed cache), survives restart, and is the campaign’s #1 single-node win. This is content-addressed immutable segments, not the current scope-versioned mutable cache — the immutable shape is load-bearing (it is also the replication and cache-placement unit, and the fork mechanism). The fork axis (C8) plausibly rides the same content-addressed-segment mechanism as the read-model (fork = a pointer-set over shared immutable segments — the Neon/Snowflake zero-copy-clone shape; the Datomic/TerminusDB C10↔C8 convergence), while tenant/standpoint/module stay scoping coordinates — a two-mechanism split, flagged for investigation, not forced here.

  • The segment manifest — “which segments compose the read-model at watermark W per (tenant, fork, standpoint, module) at as_of” — is runtime state, not config (the Neon IndexPart / Iceberg metadata role), and is 4-axis + bitemporal. It lives with the write-spine (the natural home for the linearizable watermark). It is distinct from D6’s foreign-placement artifact.
  • The write spine stays a single scalar linearizable CAS (Datomic “db root” shape, the Track C verdict). The structured 4-axis + bitemporal manifest does not force a structured CAS, because the manifest is itself an immutable, content-addressed object. Advancing the frontier is two steps: write the new immutable manifest, then let one scalar CAS swing the root pointer to its content_id. Atomicity is automatic (the manifest is written and content-addressed before the CAS makes it live), and the outsourced-single-CAS simplicity is preserved: the linearizable cell’s value stays scalar even though what it points at is arbitrarily structured. This resolves the campaign’s scalar-CAS-vs-structured-manifest tension.
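
The write-then-swing protocol can be sketched in-process with an atomic cell. Assumptions: an AtomicU64 stands in for the outsourced linearizable CAS, a mutex-guarded map stands in for the immutable object store, DefaultHasher stands in for the content hash, and 0 is used as an "empty root" sentinel. WriteSpine, Manifest and advance are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Arbitrarily structured, immutable once written.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Manifest {
    pub watermark: u64,
    pub segments: Vec<u64>,
}

pub const EMPTY_ROOT: u64 = 0;

pub struct WriteSpine {
    /// The one linearizable cell: a scalar content id of the live manifest.
    root: AtomicU64,
    /// Stand-in for the eventually consistent, immutable object store.
    objects: Mutex<HashMap<u64, Manifest>>,
}

/// Stand-in content address; a real spine would use BLAKE3.
fn content_id(manifest: &Manifest) -> u64 {
    let mut hasher = DefaultHasher::new();
    manifest.hash(&mut hasher);
    hasher.finish()
}

impl WriteSpine {
    pub fn new() -> Self {
        WriteSpine {
            root: AtomicU64::new(EMPTY_ROOT),
            objects: Mutex::new(HashMap::new()),
        }
    }

    pub fn root(&self) -> u64 {
        self.root.load(Ordering::SeqCst)
    }

    /// Step 1: write the manifest under its content id.
    /// Step 2: one scalar CAS swings the root from `expected_root` to that id.
    /// Returns Ok(new root) or Err(actual root) if another writer won.
    pub fn advance(&self, expected_root: u64, next: Manifest) -> Result<u64, u64> {
        let id = content_id(&next);
        self.objects.lock().unwrap().insert(id, next);
        self.root
            .compare_exchange(expected_root, id, Ordering::SeqCst, Ordering::SeqCst)
            .map(|_| id)
    }

    /// The manifest the root currently points at, if any.
    pub fn live(&self) -> Option<Manifest> {
        let objects = self.objects.lock().unwrap();
        let found = objects.get(&self.root()).cloned();
        found
    }
}
```

A writer that loses the CAS leaves its manifest in the object store as an unreferenced object; that is harmless because it is immutable and never becomes live. The cell's value stays a single u64 however large the manifest grows.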

D10 — Sequencing and the scope line (relationship to RFD 0035)

The complete, correct mechanism is frozen-per-query federation on the RFD 0035 pipeline — it lowers end-to-end for every stage or refuses at a checkable gate; nothing half-checked executes (C9). The persisted-read-model + IVM (cross-query reuse, delta-maintained freshness — D7 rung (b)) is the named subsequent optimization whose correct fallback is the frozen-per-query path (RFD 0035 D7) — not a hollow deferral. P1 (D9) and P3 (D8) are independent of the pipeline and can proceed in parallel. P2 splits cleanly: the relation-provider half for free-scannable sources rides existing semantics (frozen EDB ≡ native EDB); binding-limited sources extend it via demand (D4, RFD 0035 D5); the compute-provider half (table operators / analytical tier) is the genuinely novel IR work (RFD 0035 D4/D6). Two hard prerequisites gate P2’s federation-split (both are sequencing facts, not open questions): (i) the logical-layer expression interpreter (RFD 0035 D5) — without an inspectable predicate, apply_filter (D3) cannot return a verdict; and (ii) the connector conformance harness (D3) — without it, Exact cannot be honored, so federation runs Inexact-only (correct, just slower). Until both land, federation-split is a stub, and the RFDs say so plainly rather than implying it works. The implementation owns the ordering and the cut, subject to: no hollow path, every optimization a genuine mechanism with a correct fallback, the differential oracle gating each (RFD 0021 D8 / RFD 0035 D8).

The cut, informed by the performance campaign (2026-06-15). The campaign’s evidence-grounded sequencing (research, not a directive; the cut stays the implementation’s): (1) the single-node IVM + columnar content-addressed segment read-model — the highest-leverage win, no S3/distribution/consensus, the benchmark suite lands here (none exists today, CP8); (2) the financial read paths over those segments; (3) the outsourced-CAS write spine + batching, independent and parallelizable; (4) distribution — later and greenfield (distribute storage, keep compute local: distributing the fixpoint imposes a per-iteration barrier and distributed incremental-recursive Datalog does not exist to adopt — reasoning stays single-node in v1). This reorders the earlier framing, which treated the persisted read-model + IVM as a someday-optimization; the architecture was already right (pure Z-set, RFD 0035 D7), only the priority was set without performance data.

The financial path is three workloads, not one. “Trading query” decomposes into a point lookup (one instrument’s current state — the operational tier, sub-ms–5ms), an analytical slice (aggregate/k-means over a >TB returns slice — the columnar segment + Argon’s vectorized engine, ~100ms–1s), and a ledger write (contended debit-credit — the TigerBeetle pattern). TigerBeetle is an accelerator, not a system of record (“Write Last, Read First”), so Argon’s reasoning + query layer is the general-purpose DB beside the ledger — the ledger federates out (a frozen-foreign EDB of its user_data-linked facts), and the trading-query path is a read/OLAP problem, not an OLTP-ledger one. v1 targets the analytical/columnar path (the market-oracle headline). A dedicated sub-ms operational point-lookup tier over the 5-coordinate (4-axis + bitemporal) key — and whether it forces a second segment kind (a point-lookup index layout vs the analytical scan layout, opposite physical shapes over the same log) — is a named-later residual (no operational store does sub-ms on a 5-coordinate key today).

D11 — The async execution boundary: await once at the EDB-loading edge; the fixpoint stays synchronous

A real foreign connector (Postgres, S3, DynamoDB, DuckDB) does async network I/O; the Argon evaluator (executor/eval.rs) and RuntimeStorageBackend are synchronous (semi-naive over BTreeMap). The two are reconciled by the frozen-foreign-EDB rule (D4) itself: a connector’s scan is async, but it is awaited exactly once, at the EDB-loading edge, before the fixpoint — its bounded demand-slice (D4, RFD 0035 D5) is drained into the frozen CatalogEntry, and the synchronous fixpoint then iterates over a constant, in-memory snapshot. The async↔sync seam sits outside and above the fixpoint, never inside an iteration. This is the same async-durable-seam-below-sync-replay discipline D9 establishes for persistence, now for reads — the read-side counterpart of “hydrate-then-replay.”

This is forced, not chosen — await-inside-iteration is unsound, not merely awkward. A synchronous fixpoint’s monotonicity / WFS guarantees assume a fixed input relation; re-consulting a foreign source mid-fixpoint lets the EDB change under the operator — the case the proofs do not cover (the operator’s parameter, not just its argument, would vary — D4’s freeze rationale). The boundary is the convergent answer across every mature engine, on primary sources: Soufflé .input / Nemo @import load once before evaluation as stratum-0 EDBs; DDlog / Materialize / Differential Dataflow ingest async sources from outside the synchronous dataflow (SyncActivator wake + capability-stamped batches) and keep operators non-blocking; DataFusion’s own recursive-CTE operator iterates over an in-memory WorkTable, not re-issued remote scans; and PostgreSQL recursive-CTE-over-FDW re-scanning per iteration is the bug its Dec-2025 Material-node patch fixes (the negative control — freeze is the soundness fix, not an optimization). The connector-SPI strawman reached the same shape independently: await a bounded binding-pushed slice once, freeze it, iterate the fixpoint over the frozen synchronous snapshot.
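
A minimal sketch of the shape this converges on, assuming a hypothetical foreign `edge(src, dst)` relation that has already been fetched and frozen: the fixpoint borrows the snapshot immutably and joins only against it, so the input relation is constant across iterations. The names `Edge` and `transitive_closure` are illustrative only and are not taken from the Argon codebase.

```rust
use std::collections::BTreeSet;

/// One row of a hypothetical foreign `edge(src, dst)` relation.
type Edge = (u32, u32);

/// Semi-naive transitive closure over a frozen snapshot:
///   reach(a, d) :- edge(a, d).
///   reach(a, d) :- reach(a, b), edge(b, d).
/// `frozen` is borrowed immutably, so the input relation cannot change
/// between iterations of the loop.
fn transitive_closure(frozen: &BTreeSet<Edge>) -> BTreeSet<Edge> {
    let mut total = frozen.clone();
    let mut delta = frozen.clone();
    while !delta.is_empty() {
        let mut next = BTreeSet::new();
        // Join only the newly derived facts against the frozen EDB.
        for &(a, b) in &delta {
            for &(c, d) in frozen {
                if b == c && !total.contains(&(a, d)) {
                    next.insert((a, d));
                }
            }
        }
        total.extend(next.iter().copied());
        delta = next;
    }
    total
}

fn main() {
    let frozen: BTreeSet<Edge> = [(1, 2), (2, 3), (3, 4)].iter().copied().collect();
    let closure = transitive_closure(&frozen);
    assert!(closure.contains(&(1, 4)));
    println!("{} reachable pairs", closure.len());
}
```

The immutable borrow rules out re-consulting the source inside the loop, at least in this toy: there is no handle through which the EDB could change.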

The SPI is async, dyn-dispatched, and the connector lives in Store state. ForeignRelation::scan is async; the connector is held Arc<dyn> in Store state, never in CatalogEntry (which must stay Serialize / Eq for the read-model segments and the differential oracle — D9 / RFD 0035 D8). Native async fn in a trait is not dyn-compatible, so the trait carries #[async_trait] (the per-call box is noise against a network round-trip; dynosaur is the static-dispatch-by-default alternative if the box ever matters). A blocking-only source (rare; the targeted stores are natively async) bridges via spawn_blocking + a channel (the DataFusion pattern), never a blocking call on a runtime worker.
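
As a sketch of why the per-call box appears, the trait below hand-writes roughly the shape that `#[async_trait]` generates: the method returns a boxed, pinned future, which keeps the trait dyn-compatible so the connector can sit behind `Arc<dyn ForeignRelation>`. Everything here is an assumption for illustration: the `scan` signature, `Row`, `InMemorySource`, the `Store` field, and the tiny polling `block_on` (a std-only stand-in for a real runtime) are not the RFD's actual SPI.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

type Row = (String, i64);
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Hypothetical connector trait. Returning a boxed future keeps the trait
/// dyn-compatible; this is roughly the shape `#[async_trait]` generates.
trait ForeignRelation: Send + Sync {
    fn scan<'a>(&'a self, key_prefix: &'a str) -> BoxFuture<'a, Vec<Row>>;
}

/// Stand-in source; a real connector would do network I/O here.
struct InMemorySource {
    rows: Vec<Row>,
}

impl ForeignRelation for InMemorySource {
    fn scan<'a>(&'a self, key_prefix: &'a str) -> BoxFuture<'a, Vec<Row>> {
        Box::pin(async move {
            let hits: Vec<Row> = self
                .rows
                .iter()
                .filter(|(key, _)| key.starts_with(key_prefix))
                .cloned()
                .collect();
            hits
        })
    }
}

/// Hypothetical store state: the connector lives here, not in a
/// serializable catalog entry.
struct Store {
    connector: Arc<dyn ForeignRelation>,
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor, for demonstration only.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = std::pin::pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::thread::yield_now();
    }
}

fn main() {
    let source = InMemorySource {
        rows: vec![("AAPL".to_string(), 1), ("MSFT".to_string(), 3)],
    };
    let store = Store { connector: Arc::new(source) };
    let rows = block_on(store.connector.scan("A"));
    println!("{} matching rows", rows.len());
}
```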

Where the await lives, per entry point. Serve handlers are already async; the foreign fetch is hoisted above the existing flavour-aware sync core (run_reasoner’s block_in_place-vs-inline, oxc-serve/src/lib.rs). Concretely, materialize_predicates splits into [sync: seed base + plan demand] → [async: fetch + freeze] → [sync: fixpoint], with the one await in the middle and the CPU-bound fixpoint kept off the executor exactly as today. The CLI (no runtime today) builds one current-thread runtime and block_ons the whole command, doing all fetching inside that single block_on before the sync core. Rejected alternatives, on mechanism: making the query stack async end-to-end (function-colouring contagion — it colours the recursive core async for zero benefit, since the core does no I/O, and forces a runtime into the CLI while dragging the maintainer and the differential oracle along); and block_on inside the sync freeze (it runs the CPU-bound fixpoint on a runtime worker — executor starvation — and panics at the CLI, which has no runtime).
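
The three-phase split can be sketched as follows, under stated assumptions: the relation names (`parent` as a native base relation, a foreign "flagged" set), the function bodies, and the std-only polling `block_on` are hypothetical stand-ins, and only the phase ordering mirrors the text. The recursion runs over the native relation, so the foreign demand does not depend on the IDB. The fixpoint is naive rather than semi-naive, for brevity.

```rust
use std::collections::BTreeSet;
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Phase 1 (sync): plan the bounded demand from the native base relation.
fn plan_demand(parent: &[(u32, u32)]) -> BTreeSet<u32> {
    parent.iter().flat_map(|&(p, c)| vec![p, c]).collect()
}

/// Phase 2 (async): stand-in for a connector scan of a foreign "flagged"
/// set, restricted to the demanded keys and frozen into a snapshot.
async fn fetch_and_freeze(demand: &BTreeSet<u32>) -> BTreeSet<u32> {
    let remote_flagged = [3u32, 99];
    remote_flagged.iter().copied().filter(|id| demand.contains(id)).collect()
}

/// Phase 3 (sync): tainted(x) :- flagged(x).
///                 tainted(p) :- parent(p, c), tainted(c).
fn fixpoint(parent: &[(u32, u32)], frozen: &BTreeSet<u32>) -> BTreeSet<u32> {
    let mut tainted = frozen.clone();
    loop {
        let before = tainted.len();
        for &(p, c) in parent {
            if tainted.contains(&c) {
                tainted.insert(p);
            }
        }
        if tainted.len() == before {
            return tainted;
        }
    }
}

/// Hypothetical orchestration: exactly one await, between two sync phases.
async fn materialize_predicates(parent: &[(u32, u32)]) -> BTreeSet<u32> {
    let demand = plan_demand(parent);
    let frozen = fetch_and_freeze(&demand).await; // the single await
    fixpoint(parent, &frozen)
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor standing in for the CLI's single
/// per-command block_on; a real build would use a proper runtime.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = std::pin::pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::thread::yield_now();
    }
}

fn main() {
    let parent = [(1, 2), (2, 3), (4, 5)];
    let tainted = block_on(materialize_predicates(&parent));
    println!("tainted = {:?}", tainted);
}
```

In a serve handler the same `materialize_predicates` future would be awaited directly, with the CPU-bound phase kept off the executor as the text describes; that part is not modelled here.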

The single-fetch soundness boundary → a new refusal gate (extends D4). A single bounded fetch is complete iff the foreign relation’s extension is independent of the IDB computed in that fixpoint — i.e. there is no recursion through the foreign source (it sits strictly below the IDB it feeds; magic-sets / limited-access-patterns theory — Duschka–Levy 1997, Nash–Ludäscher 2004). This is the dual of D4’s Li–Chang gate (which forbids pushing recursion into a source): here the source is a leaf, but a recursive cycle that derives new foreign demand from already-consumed foreign tuples would make one fetch incomplete. v1 gates it statically: recursion-through-a-foreign-source is a refusal, reusing the analytical tier’s theorem-backed transitive no-cycle / dependency-cone check (AvoidsVocab, F2.4 / Reasoning/Datalog/AnalyticalFreeze.lean) — a foreign scan in an SCC with the IDB deriving its demand is refused with a remediation diagnostic, never silently under-derived. The relaxation — admit it via monotone bounded re-consultation (iterative demand rounds owned by the orchestrator at the boundary, the connector required idempotent + monotone under growing demand — D4’s second resolution) — is a named, gated follow-on, not a v1 shortcut; the full sans-IO yield-demand state machine is explicitly not adopted (overkill for batch compute — the loop, when needed, lives in the orchestrator, not the evaluator). F2.3’s existing base-binder boundary (bf binders must be already-materialized base relations) already enforces a conservative form of the gate.
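
A toy version of the static gate, assuming a hypothetical dependency map in which each relation lists the relations it reads (for a foreign relation, the relations that bind its demand). A foreign relation is refused when it can reach itself, i.e. when it sits on a cycle with the IDB that derives its demand. The names `Gate`, `on_cycle`, and `single_fetch_gate` are invented for this sketch. It does not reproduce the `AvoidsVocab` check.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Deps<'a> = BTreeMap<&'a str, Vec<&'a str>>;

#[derive(Debug, PartialEq)]
enum Gate {
    Admit,
    Refuse { relation: String, remediation: String },
}

/// True if `rel` can reach itself through one or more dependency edges.
fn on_cycle<'a>(deps: &Deps<'a>, rel: &str) -> bool {
    let mut seen: BTreeSet<&str> = BTreeSet::new();
    let mut stack: Vec<&str> = deps.get(rel).cloned().unwrap_or_default();
    while let Some(next) = stack.pop() {
        if next == rel {
            return true;
        }
        if seen.insert(next) {
            if let Some(children) = deps.get(next) {
                stack.extend(children.iter().copied());
            }
        }
    }
    false
}

/// Refuse any plan where a foreign relation's demand is derived, directly
/// or transitively, from facts that themselves depend on that relation.
fn single_fetch_gate<'a>(deps: &Deps<'a>, foreign: &[&'a str]) -> Gate {
    for &f in foreign {
        if on_cycle(deps, f) {
            return Gate::Refuse {
                relation: f.to_string(),
                remediation: "bind the foreign scan from base relations only".to_string(),
            };
        }
    }
    Gate::Admit
}

fn main() {
    let mut deps: Deps = BTreeMap::new();
    // reach is recursive over itself and reads the foreign edge relation.
    deps.insert("reach", vec!["edge_foreign", "reach"]);
    // The foreign scan's demand is bound by the IDB: a cycle through the source.
    deps.insert("edge_foreign", vec!["reach"]);
    println!("{:?}", single_fetch_gate(&deps, &["edge_foreign"]));
}
```

Recursion that stays within the IDB (here `reach` reading `reach`) is admitted; only a cycle that passes through the foreign relation trips the gate.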


Rationale

  • OBDA over a federated executor, not Convex-style absorption. You cannot absorb a pre-existing >TB store into one integrated backend (Convex’s model); Argon’s value is the ontology/reasoning layer over data where it lives. Foreign data = virtual EDB mapped into the ontology, queried through it (D2/D3/D6) — the OBDA/Trino/DataFusion shape.
  • The substrate was designed well, and it shows (D5/D7). AxiomEvent already being the provenance/freshness token, and per-concept CWA/OWA already being mechanized and ahead of deployed OBDA, mean federation adds a leaf and a tier input, not new machinery.
  • The freeze rule is the spine, and it carries zero new reasoning semantics (D2/D4). Because the mechanized operators never inspect provenance, a frozen foreign/computed EDB is a native EDB; federation’s hardness is engineering (the pipeline, the SPI), not semantics.
  • Type-identity is non-negotiable, so the SPI is per-operation (D3). The one Argon demand no federation system has — both-sides type-check must agree — forces the per-operation, independently-re-derivable residual over the opaque delegated rewrite.
  • Read-only-foreign is correct, not conservative (D2). The distributed-systems literature treats per-entity ACID + idempotent messaging as the right steady state; P3’s content-put is the one admissible cross-store write because it isn’t a distributed transaction.

Alternatives considered

  • Ingest/mirror foreign data into axiom events. Rejected (D2): O(database) copy of a >TB store, stale by construction; defeats the premise.
  • Delegated-subplan-rewrite connector SPI (datafusion-federation). Rejected (D3): opaque absorption is irreconcilable with the type-identical-residual demand (C2/C3).
  • Mapping as hand-authored interpreted config (R2RML/RML verbatim). Kept as influence, not adopted whole (D6): with interpreted config, drift becomes a runtime surprise; the compiled, content-addressed artifact makes it a load-time refusal.
  • A magic blob-typed field (P3). Rejected in favor of explicit Ref<Blob> (D8): honest indirection, composes with reflective-Type/refinement.
  • Generalize the sync RuntimeStorageBackend to durable backends (P1). Rejected (D9): every backend would have to speak ABox kinds / retraction tombstoning / deep-clone synchronously — a heavy contract few stores fit; the thin async kv+CAS seam below it admits the widest backend set.
  • 2PC / cross-store distributed transactions. Rejected (D2): anti-availability; only works inside one store.
  • Treat IVM / the persisted read-model as a someday-optimization. Rejected after the performance campaign (D9/D10): the columnar content-addressed segment + IVM is the primary read-model and the highest-leverage work; frozen-per-query is its correct fallback. The architecture was already right; only the priority was wrong.
  • Delegate the analytical engine to DataFusion/DuckDB. Rejected (D3, RFD 0035 D4): Argon keeps its own vectorized engine, which the inline-provenance segment forces and which the small vectorized-vs-compiled constant gives no reason to abandon. Off-the-shelf engines are a connector shape / temporary bridge, never Argon’s own engine. (Federating to such a store for data Argon does not own remains correct and permanent; that is the orchestrator half of the Decision lead.)
  • Make the write spine a structured (4-axis) CAS to carry the manifest. Rejected (D9): the manifest is an immutable content-addressed object the scalar root points at; one scalar CAS advances it. A structured CAS would needlessly reopen the harder-consensus question.
  • Make the query stack async end-to-end, or block_on inside the sync freeze. Rejected (D11): the first colours the recursive core async for zero benefit (it does no I/O) and breaks the sync CLI + maintainer + oracle; the second runs the CPU-bound fixpoint on a runtime worker (executor starvation) and panics at the CLI. The connector await belongs at the EDB-loading edge, above the sync fixpoint — which the frozen-foreign-EDB rule (D4) makes both sound and natural.

Consequences

  • New runtime dependencies + seams. The connector SPI (ForeignRelation, async / #[async_trait], dyn-dispatched, held in Store state — D11), the table-operator/analytical-tier executor (RFD 0035 D4/D6), the [store]/[placement] ox.toml sections + the compiled mapping artifact, the foreign-provenance leaf in derivation, the world-assumption tier-input threading, the Ref<Blob> type + blob-put SPI + GC sweep, and the thin async durable persistence seam. Each lands behind RFD 0035’s pipeline seams. The async↔sync seam is the EDB-loading edge (D11): materialize_predicates gains a plan-demand → await-fetch → freeze → sync-fixpoint split, with the await hoisted above the existing sync evaluator core (serve) or a single per-command block_on (CLI).
  • Spec / reference / Lean. Reference gains a heterogeneous-stores chapter; the only net-new semantic obligation is the frozen-foreign-EDB preservation theorem (RFD 0035 D3) against Fixpoint.lean/Compiled.lean. The CWA→OWA completeness warrant reads against CwaOwa.lean.
  • AGENTS nodes. oxc-reasoning, oxc-runtime, oxc-serve, and a new store-layer node updated once seams land; the “reasoner was not built here” / “rejected by design” framings retired (coordinated with RFD 0033).
  • Performance posture. The RFD 0020 scaling contract holds across the federation boundary: no O(database) on any hot path; the bounded binding-pushed slice + frozen EDB is the realization; the foreign store’s cost is the unknown the v1 cost model treats degenerately (push only when strictly better, RFD 0035 D5).

Open questions / tracked-future

  1. Engine-driven verdict (chosen) vs provider-driven callback — settled to per-operation (D3); the iterate-to-fixpoint negotiation (Trino) is deferred until a connector needs it. The genuinely-open piece is the connector conformance harness that gates the Exact claim (D3): its shape (a generated battery of predicates checked source-vs-oracle? a declared semantic profile — collation/NULL/coercion — the engine validates?) is net-new and unattested in the surveyed systems.
  2. Where mapping compilation lives — engine-side off-line (Ontop T-mappings) vs pushed into the source as views (Ultrawrap); and the exact [store]/[placement] schema (D6).
  3. Demand-stratify vs monotone bounded re-consultation for recursive foreign demand (D4) — settled with RFD 0035 D5’s magic-sets interface.
  4. Freeze vs delta as steady state (D7 rung (b)) — when a foreign CDC/change-feed earns its cost; the IVM follow-on (D10, RFD 0035 D7).
  5. Composite multi-source freshness beyond meet-of-leaves; read-your-writes across a mixed native-exact / foreign-barrier boundary (D7).
  6. The cross-boundary CWA→OWA completeness warrant — the connector contract that discharges cwa_owa_transfer for a closed foreign predicate, and how OE0901 fires when it is absent (D5).
  7. Blob GC over the bitemporal/fork-branched/content-addressed log — conservative mark-and-sweep vs a maintained per-fork refCount projection (D8).
  8. Whether content-addressed immutable layers unify C10 identity with the C8 fork axis (D9) — a deeper storage simplification to investigate.
  9. The per-symbol identity residual (RFD 0033): artifact identity is solid (composition signature + section hashes); threading it to per-event/per-symbol resolution (the module_id collision) is shared with the storage-identity fix and bears on the foreign-leaf source_id.
  10. Read-model segment GC over the 4-axis + bitemporal + fork space (D9): a single segment is referenced from many (tenant, fork) coordinates (cheap branching = pointer-sets over shared segments), so it is collectable only when no axis-coordinate’s manifest references it — a cross-axis reachability computation no surveyed system does (Neon GCs by single-axis LSN-horizon). The cheap-branching win and the GC-reachability cost are in direct tension; the algorithm is unsketched.
  11. The IVM maintainer’s checkpoint cadence (D9, RFD 0035 D7): per-mutation segment minting (churn + GC pressure) vs batched minting (in which case the in-memory materialization must survive restart some other way). A genuine open question that the maintainer loop surfaces.
  12. Composed cross-tier freshness (extends D7’s meet-of-leaves): when one fixpoint joins facts from a point-lookup (operational tier), an analytical slice (segment watermark), and a federated ledger (as_of barrier), the derived conclusion is consistent only relative to the weakest of the three barriers — a three-way compose sharper than the two-source (native + foreign) case D7 states.
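
The collectability condition in item 10 can be stated as a naive full-mark baseline. This is not a proposal for the open algorithm: it is the brute-force statement of "no manifest at any coordinate references the segment", assuming a hypothetical simplified coordinate of `(tenant, fork)` and integer segment ids in place of content hashes.

```rust
use std::collections::{BTreeMap, BTreeSet};

type SegmentId = u64; // stand-in for a content hash
type Coordinate = (&'static str, &'static str); // hypothetical (tenant, fork)

/// Naive baseline: mark every segment referenced by any manifest, then
/// report the unmarked remainder. Cost is linear in all manifests, which
/// is exactly what a real algorithm would need to avoid.
fn collectable(
    all_segments: &BTreeSet<SegmentId>,
    manifests: &BTreeMap<Coordinate, BTreeSet<SegmentId>>,
) -> BTreeSet<SegmentId> {
    let mut live = BTreeSet::new();
    for segments in manifests.values() {
        live.extend(segments.iter().copied());
    }
    all_segments.difference(&live).copied().collect()
}

fn main() {
    let all: BTreeSet<SegmentId> = [1, 2, 3, 4].iter().copied().collect();
    let mut manifests: BTreeMap<Coordinate, BTreeSet<SegmentId>> = BTreeMap::new();
    manifests.insert(("t1", "main"), [1, 2].iter().copied().collect());
    // The fork shares segment 2 with main and adds segment 3.
    manifests.insert(("t1", "fork-a"), [2, 3].iter().copied().collect());
    println!("collectable = {:?}", collectable(&all, &manifests));
}
```

Dropping the fork's manifest makes segment 3 collectable while segment 2 stays live through `main`, which is the shared-segment tension the item describes.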