Normalization standard

The Axiom Portable Record Standard (APRS) defines the minimum contract every record in every Codex dataset must satisfy. It exists so you can join any two Codex datasets without wrangling code, pin a dataset to a schema version and trust forward compatibility, and feed records directly into an LLM, RAG index, or training pipeline.

Principles

Every Codex dataset conforms to seven principles:

Schema-locked. Every dataset publishes a versioned schema. Breaking changes bump the major version. You pin to a version and trust forward compatibility within a major.
Source-attributed. Every record carries provenance — what system emitted it, when, which pipeline version processed it, and a confidence score.
AI-optimized labels. Categories, entity types, and sentiment labels are pre-computed at normalization time, not at query time.
Spatially consistent. All geospatial data normalizes to H3 resolution 8 as the primary spatial key. Lat/lng is retained for display, not for joining.
Bitemporally honest. Every record separates valid time (when the real-world event happened) from system time (when it was ingested or modified).
Versioned snapshots. Monthly immutable snapshot releases with full changelogs. Pin to a month for reproducible research.
Joinable by construction. A fixed set of shared keys lets any dataset join to any other without custom wrangling.

Record envelope

Every record carries four groups of mandatory fields, applied at normalization time.

Identity

Field	Type	Description
`record_id`	URN	Stable record URN: `urn:aprs:record:{namespace}:{source_system}:{local_id}`
`chunk_id`	URN	Deterministic chunk URN derived from `record_id` + optional section label (SHA-256-backed)
`source_uri`	URL or URN	Points to the original source record for citation and re-fetch
`source_system`	string	Originating system name (e.g. `granicus`, `aisstream`, `equasis`)

record_id and chunk_id are deterministic: generating either one twice from the same inputs always produces the same URN. This enables incremental sync, deduplication, and stable vector index keys.
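The determinism described above can be illustrated with a minimal sketch. The `record_id` template comes from the Identity table; the exact input to the SHA-256 digest is not specified here, so the `"<record_id>#<section>"` layout and the use of the full hex digest are assumptions, and the function names are hypothetical.

```python
import hashlib


def make_record_id(namespace: str, source_system: str, local_id: str) -> str:
    """Build a record URN from the template in the Identity table."""
    return f"urn:aprs:record:{namespace}:{source_system}:{local_id}"


def make_chunk_id(record_id: str, section: str = "") -> str:
    """Derive a chunk URN from record_id plus an optional section label.

    Assumption: the digest input is "<record_id>#<section>" and the full
    hex digest is kept; the real APRS derivation may differ.
    """
    digest = hashlib.sha256(f"{record_id}#{section}".encode("utf-8")).hexdigest()
    return f"urn:aprs:chunk:{digest}"


rid = make_record_id("civic", "granicus", "phila-2024-03-15-item7b")
cid = make_chunk_id(rid, "summary")
```

Because both functions are pure, re-running a sync over the same source rows yields the same keys, which is what makes upserts and vector index updates idempotent.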

Schema and lineage

Field	Type	Description
`schema_version`	semver	APRS profile version, e.g. `aprs.civic/1.0.0`
`normalization_version`	semver	Version of the pipeline that produced this row
`acl_tier`	enum	`research`, `commercial`, or `internal` — gates which exports may include this row

Bitemporal fields

Field	Type	Description
`ingested_at`	ISO 8601	When the system first ingested this row. Never mutates after insert.
`modified_at`	ISO 8601	When the row was last updated. Refreshed on any write.
`occurred_at`	ISO 8601	When the real-world event happened (e.g. council vote date, AIS position timestamp)
`filed_at`	ISO 8601	When the record was filed or submitted to the authority. Required for permits and OSHA records.
`published_at`	ISO 8601	When the source authority published the record
`effective_from`	ISO 8601	When a ruling or record became legally effective. Null if not applicable.
`effective_to`	ISO 8601	When it expired or was superseded. Null means currently active.

See the bitemporal fields reference for per-dataset availability and domain-specific temporal extensions.

A permit can be voted on (occurred_at), published weeks later (published_at), become legally effective months after that (effective_from), and be ingested by Codex retroactively (ingested_at). Choose the clock that matches your analysis.
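As a rough illustration of choosing a clock, the sketch below asks two different questions of the same permit: was it legally in effect on a date, and had Codex ingested it by that date. The sample timestamps are invented for illustration, the helper names are hypothetical, and treating the effective interval as half-open (`effective_from` inclusive, `effective_to` exclusive) is an assumption.

```python
from datetime import datetime
from typing import Optional


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp; None passes through for nullable fields."""
    return None if value is None else datetime.fromisoformat(value)


def in_effect(record: dict, at: datetime) -> bool:
    """Valid-time question: was the record legally effective at `at`?"""
    start = parse_ts(record.get("effective_from"))
    end = parse_ts(record.get("effective_to"))
    if start is None:
        return False  # assumption: a null effective_from means "not applicable"
    # A null effective_to means the record is still active.
    return start <= at and (end is None or at < end)


def known_by(record: dict, as_of: datetime) -> bool:
    """System-time question: had the row been ingested by `as_of`?"""
    return parse_ts(record["ingested_at"]) <= as_of


# Illustrative values only, not real data.
permit = {
    "occurred_at": "2024-03-15T18:00:00+00:00",
    "published_at": "2024-04-02T09:00:00+00:00",
    "effective_from": "2024-07-01T00:00:00+00:00",
    "effective_to": None,
    "ingested_at": "2024-09-10T12:00:00+00:00",
}
august = datetime.fromisoformat("2024-08-01T00:00:00+00:00")
```

Here `in_effect(permit, august)` is true while `known_by(permit, august)` is false: the permit was in force in August, but a backtest that only uses data available at the time should not see it.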

Confidence and provenance

Field	Type	Description
`confidence_score`	float [0,1]	Normalizer’s confidence in the record. Methodology is documented per-dataset.
`provenance`	JSON array	Ordered list of transformations: `[{stage, version, ts, notes?}]` for full lineage reconstruction
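A consumer might sanity-check these two fields before trusting a row. The sketch below is one possible check, not a published validator: the function name is hypothetical, and requiring at least one provenance step is an assumption. The required keys (`stage`, `version`, `ts`) follow the shape given in the table, with `notes` optional.

```python
from typing import List

REQUIRED_STEP_KEYS = {"stage", "version", "ts"}  # "notes" is optional


def check_confidence_and_provenance(record: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    score = record.get("confidence_score")
    is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
    if not is_number or not 0.0 <= score <= 1.0:
        problems.append("confidence_score must be a number in [0, 1]")

    steps = record.get("provenance")
    if not isinstance(steps, list) or not steps:
        # Assumption: at least one transformation step is always recorded.
        problems.append("provenance must be a non-empty list")
        return problems

    for i, step in enumerate(steps):
        present = set(step) if isinstance(step, dict) else set()
        missing = REQUIRED_STEP_KEYS - present
        if missing:
            problems.append(f"provenance[{i}] is missing: {sorted(missing)}")
    return problems


# Illustrative rows only.
good_row = {
    "confidence_score": 0.93,
    "provenance": [{"stage": "normalize", "version": "1.4.2", "ts": "2024-09-10T12:00:00+00:00"}],
}
bad_row = {
    "confidence_score": 1.5,
    "provenance": [{"stage": "normalize", "version": "1.4.2"}],
}
```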

Spatial consistency

All geospatial data includes h3_index at resolution 8 (avg edge length ~461 m, cell area ~0.74 km²). This is the universal spatial join key across all Codex datasets.

Point geometry — h3_index is the H3 cell containing the point.
Polygon geometry — h3_indexes (array) covers the polygon at resolution 8. geometry_wkt is retained for display.
Line geometry — H3 cells intersecting the line buffer.

Higher or lower resolutions may be published as supplementary fields (h3_index_9, etc.), but resolution 8 is authoritative.
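Because every dataset shares the resolution 8 key, a spatial join reduces to an equality match on cell IDs. The sketch below joins point records (`h3_index`) to polygon records (`h3_indexes`) using only the standard library. The cell IDs are opaque placeholders rather than real H3 indexes, and the function name and sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def join_on_h3(points: List[dict], polygons: List[dict]) -> List[Tuple[str, str]]:
    """Pair each point record with every polygon record covering its cell."""
    # Invert the polygon coverage: cell ID -> polygon record IDs.
    polygons_by_cell: Dict[str, List[str]] = defaultdict(list)
    for polygon in polygons:
        for cell in polygon["h3_indexes"]:
            polygons_by_cell[cell].append(polygon["record_id"])

    pairs = []
    for point in points:
        for polygon_id in polygons_by_cell.get(point["h3_index"], []):
            pairs.append((point["record_id"], polygon_id))
    return pairs


# Placeholder cell IDs; real values are H3 resolution 8 indexes.
points = [
    {"record_id": "p1", "h3_index": "CELL_A"},
    {"record_id": "p2", "h3_index": "CELL_Z"},
]
polygons = [{"record_id": "zone1", "h3_indexes": ["CELL_A", "CELL_B"]}]
matches = join_on_h3(points, polygons)
```

The match is approximate at cell granularity: a point near a polygon boundary can share a cell with the polygon without lying inside it, which is why `geometry_wkt` remains useful when exact containment matters.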

LLM-ready surface

Every dataset publishes an llm_text view in Markdown-KV format alongside the structured Parquet/CSV view. This format is optimized for LLM reasoning: it benchmarked at 60.7% accuracy versus 44.3% for CSV.

- chunk_id: urn:aprs:chunk:9f3a1b2c...
- record_id: urn:aprs:record:civic:us:granicus:phila-2024-03-15-item7b
- jurisdiction: Philadelphia, PA
- occurred_at: 2024-03-15
- event_type: zoning_vote
- summary: Council voted 11-6 to approve rezoning of the 2200 block...
- entities: {name: Kenyatta Johnson, role: councilmember, sentiment: supportive}
- litigation_risk_score: 0.42
- source_uri: https://phlcouncil.com/meetings/...
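To show how a structured row maps onto this layout, here is a minimal rendering sketch. It is inferred from the example above rather than taken from a published serializer: the function names are hypothetical, and omitting null fields and rendering nested objects as `{key: value, ...}` are assumptions.

```python
def format_value(value) -> str:
    """Render nested objects inline as {key: value, ...}; everything else via str()."""
    if isinstance(value, dict):
        inner = ", ".join(f"{k}: {v}" for k, v in value.items())
        return "{" + inner + "}"
    return str(value)


def to_markdown_kv(record: dict) -> str:
    """Render one record as '- key: value' lines in field order."""
    lines = [
        f"- {key}: {format_value(value)}"
        for key, value in record.items()
        if value is not None  # assumption: null fields are left out
    ]
    return "\n".join(lines)


sample = {
    "occurred_at": "2024-03-15",
    "event_type": "zoning_vote",
    "entities": {"name": "Kenyatta Johnson", "role": "councilmember", "sentiment": "supportive"},
    "litigation_risk_score": 0.42,
    "effective_to": None,
}
rendered = to_markdown_kv(sample)
```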

Claim vs. fact separation

For datasets where multiple sources may contradict each other (corporate ownership, permits, civic proceedings), records are published at two layers:

Claim layer — what a source asserted, preserved verbatim with source attribution. Multiple claims per subject are expected.
Fact layer — Codex’s resolution of contradictory claims into a canonical row, with a resolution_method field explaining the choice (e.g. latest_by_published_at, authority_priority, manual_review).

This separation ensures you can always trace how a fact was derived and which original assertions it is based on. See the claim/fact separation page for full schemas, resolution methods, and query examples.
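As an illustration of the fact layer, the sketch below applies one of the named methods, `latest_by_published_at`, to a set of claims about the same subject. Only `resolution_method`, `published_at`, and `source_uri` are field names from this page; `subject`, `value`, `winning_source_uri`, and `claim_count` are hypothetical, and the sample claims are invented.

```python
from datetime import datetime
from typing import List


def resolve_latest_by_published_at(claims: List[dict]) -> dict:
    """Pick the most recently published claim and record how it was chosen."""
    if not claims:
        raise ValueError("at least one claim is required")
    winner = max(claims, key=lambda c: datetime.fromisoformat(c["published_at"]))
    return {
        "subject": winner["subject"],
        "value": winner["value"],
        "resolution_method": "latest_by_published_at",
        "winning_source_uri": winner["source_uri"],  # hypothetical field
        "claim_count": len(claims),  # hypothetical field
    }


# Illustrative claims only; the claim layer keeps both rows verbatim.
claims = [
    {
        "subject": "acme-holdings",
        "value": "owner: Alpha LLC",
        "published_at": "2023-11-01T00:00:00+00:00",
        "source_uri": "urn:example:registry-a",
    },
    {
        "subject": "acme-holdings",
        "value": "owner: Beta LLC",
        "published_at": "2024-02-10T00:00:00+00:00",
        "source_uri": "urn:example:registry-b",
    },
]
fact = resolve_latest_by_published_at(claims)
```

The input claims are never modified, so the canonical row can always be re-derived or re-resolved with a different method.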

Versioning

Schema semver — MAJOR.MINOR.PATCH. Breaking changes (removing fields, changing types, narrowing enums) bump MAJOR. Additive changes bump MINOR. Doc-only clarifications bump PATCH.
Normalization semver — independent of schema version. A normalization version bump is always accompanied by a changelog entry.
Snapshot releases — first of each month, immutable once published. Labeled YYYY-MM.
Deprecation — fields marked @deprecated in a MINOR release may be removed in the next MAJOR with at least 6 months' notice.
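A consumer pinned to a schema version can check whether a newly published version is safe to read under the rules above. This is a minimal sketch: the `profile/MAJOR.MINOR.PATCH` layout is taken from the `aprs.civic/1.0.0` example, while the function names and the rule that a published version must be at least the pinned one are assumptions.

```python
from typing import Tuple


def parse_schema_version(value: str) -> Tuple[str, Tuple[int, int, int]]:
    """Split 'aprs.civic/1.0.0' into ('aprs.civic', (1, 0, 0))."""
    profile, _, semver = value.partition("/")
    major, minor, patch = (int(part) for part in semver.split("."))
    return profile, (major, minor, patch)


def is_compatible(pinned: str, published: str) -> bool:
    """True if `published` keeps the same profile and MAJOR and is not older."""
    pinned_profile, pinned_version = parse_schema_version(pinned)
    published_profile, published_version = parse_schema_version(published)
    return (
        pinned_profile == published_profile
        and pinned_version[0] == published_version[0]  # same MAJOR: no breaking changes
        and published_version >= pinned_version  # additive or doc-only changes since the pin
    )


ok_minor = is_compatible("aprs.civic/1.0.0", "aprs.civic/1.2.0")
ok_major = is_compatible("aprs.civic/1.0.0", "aprs.civic/2.0.0")
```

With these inputs `ok_minor` is true and `ok_major` is false, mirroring the rule that only a MAJOR bump may remove fields, change types, or narrow enums.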
