The Axiom Portable Record Standard (APRS) defines the minimum contract every record in every Codex dataset must satisfy. It exists so you can join any two Codex datasets without wrangling code, pin a dataset to a schema version and trust forward compatibility, and feed records directly into an LLM, RAG index, or training pipeline.

Principles

Every Codex dataset conforms to seven principles:
  1. Schema-locked. Every dataset publishes a versioned schema. Breaking changes bump the major version. You pin to a version and trust forward compatibility within a major.
  2. Source-attributed. Every record carries provenance — what system emitted it, when, which pipeline version processed it, and a confidence score.
  3. AI-optimized labels. Categories, entity types, and sentiment labels are pre-computed at normalization time, not at query time.
  4. Spatially consistent. All geospatial data normalizes to H3 resolution 8 as the primary spatial key. Lat/lng is retained for display, not for joining.
  5. Bitemporally honest. Every record separates valid time (when the real-world event happened) from system time (when it was ingested or modified).
  6. Versioned snapshots. Monthly immutable snapshot releases with full changelogs. Pin to a month for reproducible research.
  7. Joinable by construction. A fixed set of shared keys lets any dataset join to any other without custom wrangling.

Record envelope

Every record carries four groups of mandatory fields, applied at normalization time.

Identity

| Field | Type | Description |
| --- | --- | --- |
| record_id | URN | Stable record URN: urn:aprs:record:{namespace}:{source_system}:{local_id} |
| chunk_id | URN | Deterministic chunk URN derived from record_id + optional section label (SHA-256-backed) |
| source_uri | URL or URN | Points to the original source record for citation and re-fetch |
| source_system | string | Originating system name (e.g. granicus, aisstream, equasis) |

record_id and chunk_id are deterministic: two calls with the same arguments always produce the same URN. This enables incremental sync, deduplication, and stable vector index keys.
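
The determinism described above can be sketched as follows. This is a minimal illustration, not the APRS reference implementation: the function names are hypothetical, and the exact bytes fed to SHA-256 (record_id and section label joined by a newline) are an assumption, since the standard only states that chunk_id is SHA-256-backed.

```python
import hashlib
from typing import Optional


def make_record_id(namespace: str, source_system: str, local_id: str) -> str:
    """Build a URN of the form urn:aprs:record:{namespace}:{source_system}:{local_id}."""
    return f"urn:aprs:record:{namespace}:{source_system}:{local_id}"


def make_chunk_id(record_id: str, section: Optional[str] = None) -> str:
    """Derive a chunk URN from record_id plus an optional section label.

    Assumption: the digest input is record_id, or record_id and section
    joined by a newline, and the full hex digest is kept.
    """
    payload = record_id if section is None else f"{record_id}\n{section}"
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"urn:aprs:chunk:{digest}"
```

Because both functions are pure, re-running ingestion on the same source row yields the same keys, which is what makes upserts into a vector index safe.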

Schema and lineage

| Field | Type | Description |
| --- | --- | --- |
| schema_version | semver | APRS profile version, e.g. aprs.civic/1.0.0 |
| normalization_version | semver | Version of the pipeline that produced this row |
| acl_tier | enum | One of research, commercial, or internal; gates which exports may include this row |
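
A consumer that pins to a schema version might check each row's schema_version before processing it. The sketch below assumes the value always has the shape `profile/MAJOR.MINOR.PATCH`, as in the `aprs.civic/1.0.0` example, with no pre-release or build suffixes; the helper names are hypothetical.

```python
from typing import NamedTuple


class SchemaVersion(NamedTuple):
    profile: str
    major: int
    minor: int
    patch: int


def parse_schema_version(value: str) -> SchemaVersion:
    """Split a string such as 'aprs.civic/1.0.0' into profile and version parts."""
    profile, _, version = value.partition("/")
    major, minor, patch = (int(part) for part in version.split("."))
    return SchemaVersion(profile, major, minor, patch)


def is_compatible(pinned: str, actual: str) -> bool:
    """True when `actual` shares the pinned profile and major and is not older."""
    want = parse_schema_version(pinned)
    got = parse_schema_version(actual)
    return (
        want.profile == got.profile
        and want.major == got.major
        and (got.minor, got.patch) >= (want.minor, want.patch)
    )
```

Rejecting rows whose major differs from the pin keeps a breaking release from silently flowing into a downstream job.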

Bitemporal fields

| Field | Type | Description |
| --- | --- | --- |
| ingested_at | ISO 8601 | When the system first ingested this row. Never mutates after insert. |
| modified_at | ISO 8601 | When the row was last updated. Refreshed on any write. |
| occurred_at | ISO 8601 | When the real-world event happened (e.g. council vote date, AIS position timestamp) |
| filed_at | ISO 8601 | When the record was filed or submitted to the authority. Required for permits and OSHA records. |
| published_at | ISO 8601 | When the source authority published the record |
| effective_from | ISO 8601 | When a ruling or record became legally effective. Null if not applicable. |
| effective_to | ISO 8601 | When it expired or was superseded. Null means currently active. |

See the bitemporal fields reference for per-dataset availability and domain-specific temporal extensions.

A permit can be voted on (occurred_at), published weeks later (published_at), become legally effective months after that (effective_from), and be ingested by Codex retroactively (ingested_at). Choose the clock that matches your analysis.
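
To make the choice of clock concrete, here is a small sketch that asks two different questions of the same record: was it legally in effect at a given moment, and had Codex already ingested it at a given moment. The helper names are hypothetical, and two behaviors are assumptions rather than documented rules: timestamps without an offset are treated as UTC, and effective_to is treated as an exclusive bound.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp; None (field not applicable) stays None."""
    if value is None:
        return None
    ts = datetime.fromisoformat(value)
    # Assumption: a timestamp without an offset is interpreted as UTC.
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def in_effect(record: dict, at: datetime) -> bool:
    """Valid-time question: was the record legally effective at `at`?"""
    start = parse_ts(record.get("effective_from"))
    end = parse_ts(record.get("effective_to"))
    if start is None or at < start:
        return False
    # A null effective_to means the record is still active.
    return end is None or at < end


def known_by(record: dict, system_time: datetime) -> bool:
    """System-time question: had the row been ingested by `system_time`?"""
    return parse_ts(record["ingested_at"]) <= system_time
```

Combining the two predicates gives an "as known at" view: filter on known_by to reproduce what an analysis could have seen on a past date, then on in_effect or occurred_at for the real-world window of interest.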

Confidence and provenance

| Field | Type | Description |
| --- | --- | --- |
| confidence_score | float [0,1] | Normalizer’s confidence in the record. Methodology is documented per-dataset. |
| provenance | JSON array | Ordered list of transformations: [{stage, version, ts, notes?}] for full lineage reconstruction |
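
A lightweight check of these two fields could look like the sketch below. It only tests the shape given in the table: a score in [0, 1] and provenance entries carrying stage, version, and ts, with notes optional. The function names, the stage names in the usage example, and the requirement that provenance be non-empty are assumptions.

```python
from typing import Optional


def provenance_entry(stage: str, version: str, ts: str, notes: Optional[str] = None) -> dict:
    """Build one lineage step; `notes` is omitted when not supplied."""
    entry = {"stage": stage, "version": version, "ts": ts}
    if notes is not None:
        entry["notes"] = notes
    return entry


def validate_confidence_and_provenance(record: dict) -> list:
    """Return a list of problems; an empty list means both fields look well-formed."""
    problems = []
    score = record.get("confidence_score")
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    if not is_number or not 0.0 <= score <= 1.0:
        problems.append("confidence_score must be a number in [0, 1]")
    provenance = record.get("provenance")
    if not isinstance(provenance, list) or not provenance:
        # Assumption: at least one transformation step is always recorded.
        problems.append("provenance must be a non-empty list")
    else:
        for i, step in enumerate(provenance):
            missing = {"stage", "version", "ts"} - set(step)
            if missing:
                problems.append(f"provenance[{i}] is missing {sorted(missing)}")
    return problems
```

For example, a record with `confidence_score` 0.9 and a provenance list built from `provenance_entry("fetch", "1.0.0", "2024-03-16T02:00:00+00:00")` followed by `provenance_entry("normalize", "2.1.0", "2024-03-16T02:05:00+00:00")` passes with no problems reported.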

Spatial consistency

All geospatial data includes h3_index at resolution 8 (avg edge length ~461 m, cell area ~0.74 km²). This is the universal spatial join key across all Codex datasets.
  • Point geometry: h3_index is the H3 cell containing the point.
  • Polygon geometry: h3_indexes (array) covers the polygon at resolution 8. geometry_wkt is retained for display.
  • Line geometry: H3 cells intersecting the line buffer.
Higher or lower resolutions may be published as supplementary fields (h3_index_9, etc.), but resolution 8 is authoritative.
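
Because the join key is a plain cell identifier, a spatial join reduces to an equality join on strings. The sketch below pairs records from two datasets that share at least one resolution-8 cell, using only the h3_index and h3_indexes fields named above. The function names are hypothetical, and the cell values in the usage note are placeholders rather than real H3 indexes.

```python
from collections import defaultdict


def cells_of(record: dict) -> set:
    """Collect a record's resolution-8 cells from h3_index and/or h3_indexes."""
    cells = set(record.get("h3_indexes") or [])
    if record.get("h3_index"):
        cells.add(record["h3_index"])
    return cells


def spatial_join(left: list, right: list) -> list:
    """Pair each left record with every right record that shares an H3 cell."""
    by_cell = defaultdict(list)
    for right_rec in right:
        for cell in cells_of(right_rec):
            by_cell[cell].append(right_rec)

    pairs = []
    for left_rec in left:
        seen = set()
        for cell in cells_of(left_rec):
            for right_rec in by_cell.get(cell, []):
                # A polygon may share several cells with the same record;
                # emit each pair only once.
                if id(right_rec) not in seen:
                    seen.add(id(right_rec))
                    pairs.append((left_rec, right_rec))
    return pairs
```

Joining a zone with `h3_indexes` of `["cell-a", "cell-b"]` against permits keyed by `h3_index` returns only the permits whose cell is `cell-a` or `cell-b`, with no geometry math at query time.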

LLM-ready surface

Every dataset publishes an llm_text view in Markdown-KV format alongside the structured Parquet/CSV view. This format is optimized for LLM reasoning: it benchmarked at 60.7% accuracy versus 44.3% for CSV. A sample record in this format:
- chunk_id: urn:aprs:chunk:9f3a1b2c...
- record_id: urn:aprs:record:civic:us:granicus:phila-2024-03-15-item7b
- jurisdiction: Philadelphia, PA
- occurred_at: 2024-03-15
- event_type: zoning_vote
- summary: Council voted 11-6 to approve rezoning of the 2200 block...
- entities: {name: Kenyatta Johnson, role: councilmember, sentiment: supportive}
- litigation_risk_score: 0.42
- source_uri: https://phlcouncil.com/meetings/...
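
A renderer for this layout can be very small. The sketch below turns a flat record dict into `- key: value` lines and renders a nested dict inline in braces, mirroring the entities line in the sample above. It is an illustration only: the function names are hypothetical, and the handling of lists (comma-joined) is an assumption, since the sample does not show one.

```python
def format_value(value) -> str:
    """Render a scalar, nested dict, or list as a single-line string."""
    if isinstance(value, dict):
        inner = ", ".join(f"{k}: {format_value(v)}" for k, v in value.items())
        return "{" + inner + "}"
    if isinstance(value, (list, tuple)):
        # Assumption: lists are rendered as comma-separated items.
        return ", ".join(format_value(v) for v in value)
    return str(value)


def to_markdown_kv(record: dict) -> str:
    """Render a record as one '- key: value' line per field, in insertion order."""
    return "\n".join(f"- {key}: {format_value(value)}" for key, value in record.items())
```

Field order follows the dict's insertion order, so placing chunk_id and record_id first keeps the identifiers at the top of every chunk.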

Claim vs. fact separation

For datasets where multiple sources may contradict each other (corporate ownership, permits, civic proceedings), records are published at two layers:
  • Claim layer — what a source asserted, preserved verbatim with source attribution. Multiple claims per subject are expected.
  • Fact layer — Codex’s resolution of contradictory claims into a canonical row, with a resolution_method field explaining the choice (e.g. latest_by_published_at, authority_priority, manual_review).
This separation ensures you can always trace how a fact was derived and which original assertions it is based on. See the claim/fact separation page for full schemas, resolution methods, and query examples.
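
The relationship between the two layers can be sketched as a function from a list of claims to one fact row. Only resolution_method and its example values come from the text above; the claim fields subject and value, the resolved_from list, and the authority ranking are hypothetical stand-ins for the real schemas on the claim/fact separation page. The sketch also assumes published_at values share one format and offset, so that string comparison orders them correctly.

```python
from typing import Optional


def resolve(claims: list, method: str, authority_rank: Optional[dict] = None) -> dict:
    """Collapse contradictory claims about one subject into a canonical fact row."""
    if not claims:
        raise ValueError("at least one claim is required")

    if method == "latest_by_published_at":
        # Assumption: timestamps are uniformly formatted ISO 8601 strings.
        winner = max(claims, key=lambda c: c["published_at"])
    elif method == "authority_priority":
        rank = authority_rank or {}
        # Lower rank wins; sources missing from the ranking sort last.
        winner = min(claims, key=lambda c: rank.get(c["source_system"], len(rank)))
    else:
        # manual_review and other methods cannot be automated in this sketch.
        raise ValueError(f"unsupported resolution_method: {method}")

    return {
        "subject": winner["subject"],
        "value": winner["value"],
        "resolution_method": method,
        "resolved_from": [c["record_id"] for c in claims],
    }
```

Keeping every contributing record_id on the fact row is what lets a reader walk back from the canonical value to the verbatim assertions behind it.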

Versioning

  • Schema semver: MAJOR.MINOR.PATCH. Breaking changes (removing fields, changing types, narrowing enums) bump MAJOR. Additive changes bump MINOR. Doc-only clarifications bump PATCH.
  • Normalization semver: independent of schema version. A normalization version bump is always accompanied by a changelog entry.
  • Snapshot releases: published on the first of each month and immutable once published. Labeled YYYY-MM.
  • Deprecation: fields marked @deprecated in a MINOR release may be removed in the next MAJOR with at least 6 months' notice.
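
The schema bump rules can be expressed as a small decision function. In the sketch below the change labels (`remove_field`, `add_field`, and so on) are hypothetical names chosen to mirror the categories listed above, not identifiers from any APRS tooling, and version strings are assumed to be bare MAJOR.MINOR.PATCH.

```python
BREAKING = {"remove_field", "change_type", "narrow_enum"}
ADDITIVE = {"add_field"}
DOC_ONLY = {"clarify_docs"}


def next_schema_version(current: str, changes: set) -> str:
    """Return the version a release must carry given the kinds of change it contains."""
    unknown = changes - BREAKING - ADDITIVE - DOC_ONLY
    if unknown:
        raise ValueError(f"unclassified change kinds: {sorted(unknown)}")

    major, minor, patch = (int(part) for part in current.split("."))
    if changes & BREAKING:
        return f"{major + 1}.0.0"  # any breaking change dominates
    if changes & ADDITIVE:
        return f"{major}.{minor + 1}.0"
    if changes & DOC_ONLY:
        return f"{major}.{minor}.{patch + 1}"
    return current  # nothing changed
```

Refusing to classify unknown change kinds is a deliberate choice in this sketch: an unrecognized change is more safely escalated to a human than defaulted to a PATCH bump.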