Datasets
Civic Intelligence
517K records — Council votes, permits, and zoning decisions with entity extraction, sentiment scores, upzoning probability, and DAG-mapped approval sequences. Tags: NLP-ready, Labeled, Temporal
Formats: Parquet, JSON Lines
AIS Maritime Positions
1.4M positions — Decoded vessel tracks, port calls, and anchor events enriched with Equasis vessel metadata, flag state, DWT, and kinematic fingerprints. Tags: Time-series, Enriched, Geospatial
Formats: Parquet
Urban Signal Grid
454K H3 cells — Cell-level ESGI composite scores and 8 signal-group subscores across 22 US metros at H3 resolution 8. Tags: Geospatial, Scored, Multi-signal
Formats: GeoParquet, CSV
Events Timeline
1.7M+ events — Unified temporal intelligence spanning permits, council decisions, AIS anomalies, OSHA violations, and business openings normalized to a single event schema. Tags: Temporal, Multi-source, Cross-domain
Formats: JSON Lines, Parquet
LEHD Commuter Flows
454K OD pairs — Census LEHD worker origin-destination pairs normalized to H3 cells with income bands, job sector, and Huff gravity index pre-computed. Tags: Geospatial, Demographics, Transport
Formats: CSV, Parquet
POI Intelligence
89K locations — Points of interest enriched with category taxonomy, NAICS codes, pioneer business flags, walk/transit scores, and a reviews sample. Tags: Enriched, Categorized, Labeled
Formats: JSON
Permit Signals
2.1M permits — Building permit activity across 22 metros with LLM-extracted scope type, building type, unit count, and estimated cost tier. Tags: NLP-extracted, Temporal, Development
Formats: Parquet, CSV
OSHA Safety Index
500K+ inspections — OSHA inspection records with NLP-classified hazard categories, violation severity tiers, and inflation-adjusted penalty normalization by H3 cell. Tags: NLP-classified, Safety, Industrial
Formats: CSV, Parquet
What makes Codex datasets different
Every record in every dataset satisfies the Axiom Portable Record Standard (APRS). In practice, that means:
- Zero-wrangling joins. A fixed set of shared keys (`h3_index`, `event_id`, `jurisdiction_slug`, `mmsi`, `imo`, and more) lets you join any two datasets with a single SQL `JOIN`.
- Pre-computed AI labels. Entity types, categories, sentiment, and risk scores are computed at normalization time — not at query time.
- LLM-ready formatting. Every dataset ships a Markdown-KV view (`llm_text`) optimized for RAG and LLM reasoning, benchmarked at 60.7% accuracy versus 44.3% for raw CSV.
- Bitemporal timestamps. Every record separates event time (`occurred_at`) from system time (`ingested_at`), publication time (`published_at`), and legal effective dates (`effective_from`/`effective_to`).
- Versioned monthly snapshots. Immutable monthly releases you can pin to for reproducible research.
- Full provenance. Every record carries a `provenance` chain documenting each transformation stage, and a `confidence_score` with per-dataset methodology.
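To make the "zero-wrangling joins" and bitemporal claims concrete, here is a minimal sketch in pandas. The column subsets and row values are invented for illustration; only the key and timestamp names (`h3_index`, `occurred_at`, `ingested_at`) come from the APRS field list above.

```python
import pandas as pd

# Two fabricated APRS-shaped fragments (Permit Signals and Urban Signal Grid).
permits = pd.DataFrame({
    "h3_index": ["8828308281fffff", "8828308283fffff"],
    "scope_type": ["renovation", "new_construction"],
    "occurred_at": pd.to_datetime(["2024-03-01", "2024-06-15"]),
    "ingested_at": pd.to_datetime(["2024-03-05", "2024-06-20"]),
})
signals = pd.DataFrame({
    "h3_index": ["8828308281fffff", "8828308283fffff"],
    "esgi_score": [0.72, 0.41],
})

# Single shared-key join: no column mapping or cleanup needed when
# both sides carry the same APRS key.
joined = permits.merge(signals, on="h3_index", how="inner")

# Bitemporal slice: events that had occurred by April 1 AND were
# already known to the system (ingested) by that date.
as_of = pd.Timestamp("2024-04-01")
known = joined[(joined["occurred_at"] <= as_of) & (joined["ingested_at"] <= as_of)]
print(len(known))  # 1
```

The same pattern applies to any shared key in the registry: because key names and types are fixed by APRS, the join condition is always a single equality.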
Pricing and delivery
| Tier | Records | Formats | Price |
|---|---|---|---|
| Research | 100K-record stratified sample | Parquet | Free on Hugging Face (CC-BY-4.0) |
| Commercial | Full dataset | Parquet, CSV, JSON Lines, GeoParquet | $299/dataset/month |
| Enterprise | Full dataset + entity graph | All formats + Markdown-KV | Contact sales |
Standards compliance
Every dataset ships with a DCAT-US v3.0 `catalog.jsonld` sidecar for discoverability by data.gov-compatible crawlers. Per-dataset ontology crosswalks map Codex fields to the relevant domain standards:
| Dataset | Target standards |
|---|---|
| Civic Intelligence | DCAT-US v3.0, PROV-DM, schema.org/GovernmentService |
| AIS Maritime Positions | IHO S-100, IMO A.600(15), ITU-R M.1371 |
| Urban Signal Grid | OSCRE IDM 2.0, OGC API — Features, DCAT-US v3.0 |
| Events Timeline | W3C PROV-DM, schema.org/Event, DCAT-US v3.0 |
| LEHD Commuter Flows | Census TIGER/FIPS, OGC GeoJSON, DDI-Lifecycle |
| POI Intelligence | schema.org/LocalBusiness, GS1 GLN, OGC GeoJSON |
| Permit Signals | DCAT-US v3.0, schema.org/ConstructionPermit, PROV-DM |
| OSHA Safety Index | schema.org/GovernmentPermit, NAICS, FRS, DCAT-US |
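Since the sidecar is plain JSON-LD, it can be inspected with nothing but the standard library. The sidecar contents below are a fabricated minimal stand-in; a real `catalog.jsonld` carries the full DCAT-US v3.0 vocabulary.

```python
import json

# Fabricated minimal sidecar for illustration only.
sidecar = json.loads("""
{
  "@type": "dcat:Catalog",
  "dataset": [
    {"@type": "dcat:Dataset", "title": "Permit Signals", "conformsTo": "DCAT-US v3.0"}
  ]
}
""")

# A crawler (or a quick sanity check) can enumerate dataset entries:
titles = [d["title"] for d in sidecar["dataset"]]
print(titles)  # ['Permit Signals']
```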
RAG and vector search
Every dataset ships with a `_chunks` Parquet file containing structure-aware chunks designed for retrieval-augmented generation. Chunks follow source boundaries (agenda items, speaker turns, permit scopes) rather than fixed token windows, and are compatible with LangChain and LlamaIndex ParentDocumentRetriever patterns.
Each chunk carries:
- A deterministic `chunk_id` (SHA-256-derived) for stable vector index keys
- A `parent_chunk_id` for parent-child retrieval
- An `evidence_anchor` for tracing back to the source document
- Token counts and embedding metadata
AIS Maritime Positions are point-in-time records and are not chunked. The Urban Signal Grid, LEHD Flows, and POI Intelligence datasets use their compact Markdown-KV views as single-record chunks.
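A sketch of how the `chunk_id` / `parent_chunk_id` columns support the parent-child pattern: after a vector hit on a small chunk, look up its parent to hand the LLM the larger surrounding context. The rows, texts, and anchor strings here are fabricated stand-ins for a real `_chunks` Parquet.

```python
import pandas as pd

# Fabricated _chunks-shaped table: one parent (agenda item) and two
# child chunks (speaker turns), linked via parent_chunk_id.
chunks = pd.DataFrame({
    "chunk_id": ["c_parent", "c_child_1", "c_child_2"],
    "parent_chunk_id": [None, "c_parent", "c_parent"],
    "text": ["Full agenda item ...", "Speaker turn 1 ...", "Speaker turn 2 ..."],
    "evidence_anchor": ["doc1#item3", "doc1#item3.t1", "doc1#item3.t2"],
}).set_index("chunk_id")

def expand_to_parent(hit_id: str) -> str:
    """Given a vector-search hit, return the parent chunk's text if one
    exists (ParentDocumentRetriever pattern), else the hit's own text."""
    parent = chunks.loc[hit_id, "parent_chunk_id"]
    return chunks.loc[parent, "text"] if parent else chunks.loc[hit_id, "text"]

print(expand_to_parent("c_child_2"))  # Full agenda item ...
```

Because `chunk_id` is deterministic (SHA-256-derived), the same IDs can be reused as stable keys in an external vector index across snapshot releases.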
Next steps
Normalization standard
The contract every record satisfies — field definitions, versioning rules, and conformance requirements.
Join keys
The registry of shared keys that make cross-dataset joins work without wrangling.
Chunking policy
How records are chunked for RAG — structure-aware boundaries, deterministic IDs, and parent-child retrieval.
Entity graph export
A Parquet edge-list for multi-hop reasoning over entities, vessels, permits, and jurisdictions (Enterprise tier).
Bitemporal fields
Per-dataset reference for every temporal field, so you always know which clock a timestamp is on.
Claim/fact separation
How source assertions are separated from resolved facts, with full provenance.
Reference notebooks
Jupyter notebooks demonstrating upzoning classification and civic risk mapping with research-tier data.