Axiom Codex publishes eight datasets covering civic proceedings, maritime activity, building permits, labor flows, points of interest, workplace safety, urban signals, and a unified event timeline that ties them all together. Every dataset follows the normalization standard, ships with pre-computed AI labels, and joins to any other Codex dataset without custom wrangling.

Datasets

Civic Intelligence

517K records — Council votes, permits, and zoning decisions with entity extraction, sentiment scores, upzoning probability, and DAG-mapped approval sequences.
Tags: NLP-ready, Labeled, Temporal
Formats: Parquet, JSON Lines

AIS Maritime Positions

1.4M positions — Decoded vessel tracks, port calls, and anchor events enriched with Equasis vessel metadata, flag state, DWT, and kinematic fingerprints.
Tags: Time-series, Enriched, Geospatial
Formats: Parquet

Urban Signal Grid

454K H3 cells — Cell-level ESGI composite scores and 8 signal-group subscores across 22 US metros at H3 resolution 8.
Tags: Geospatial, Scored, Multi-signal
Formats: GeoParquet, CSV

Events Timeline

1.7M+ events — Unified temporal intelligence spanning permits, council decisions, AIS anomalies, OSHA violations, and business openings normalized to a single event schema.
Tags: Temporal, Multi-source, Cross-domain
Formats: JSON Lines, Parquet

LEHD Commuter Flows

454K OD pairs — Census LEHD worker origin-destination pairs normalized to H3 cells with income bands, job sector, and Huff gravity index pre-computed.
Tags: Geospatial, Demographics, Transport
Formats: CSV, Parquet

POI Intelligence

89K locations — Points of interest enriched with category taxonomy, NAICS codes, pioneer business flags, walk/transit scores, and a reviews sample.
Tags: Enriched, Categorized, Labeled
Formats: JSON

Permit Signals

2.1M permits — Building permit activity across 22 metros with LLM-extracted scope type, building type, unit count, and estimated cost tier.
Tags: NLP-extracted, Temporal, Development
Formats: Parquet, CSV

OSHA Safety Index

500K+ inspections — OSHA inspection records with NLP-classified hazard categories, violation severity tiers, and inflation-adjusted penalty normalization by H3 cell.
Tags: NLP-classified, Safety, Industrial
Formats: CSV, Parquet

What makes Codex datasets different

Every record in every dataset satisfies the Axiom Portable Record Standard (APRS). In practice, that means:
  • Zero-wrangling joins. A fixed set of shared keys (h3_index, event_id, jurisdiction_slug, mmsi, imo, and more) lets you join any two datasets with a single SQL JOIN.
  • Pre-computed AI labels. Entity types, categories, sentiment, and risk scores are computed at normalization time — not at query time.
  • LLM-ready formatting. Every dataset ships a Markdown-KV view (llm_text) optimized for RAG and LLM reasoning, benchmarked at 60.7% accuracy versus 44.3% for raw CSV.
  • Bitemporal timestamps. Every record separates event time (occurred_at) from system time (ingested_at), publication time (published_at), and legal effective dates (effective_from / effective_to).
  • Versioned monthly snapshots. Immutable monthly releases you can pin to for reproducible research.
  • Full provenance. Every record carries a provenance chain documenting each transformation stage, and a confidence_score with per-dataset methodology.
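The zero-wrangling join can be sketched with nothing but the standard library. The table and column names below (beyond the documented h3_index key) are hypothetical stand-ins for two Codex datasets, loaded here as tiny in-memory tables:

```python
import sqlite3

# Hypothetical miniature tables standing in for two Codex datasets,
# both carrying the shared APRS key h3_index.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE permit_signals (permit_id TEXT, h3_index TEXT, cost_tier TEXT)")
con.execute("CREATE TABLE osha_safety (inspection_id TEXT, h3_index TEXT, severity_tier TEXT)")
con.executemany("INSERT INTO permit_signals VALUES (?, ?, ?)", [
    ("P-1", "882a100d25fffff", "high"),
    ("P-2", "882a100d27fffff", "low"),
])
con.executemany("INSERT INTO osha_safety VALUES (?, ?, ?)", [
    ("I-9", "882a100d25fffff", "serious"),
])

# Because both datasets share h3_index, no key mapping is needed:
# a single JOIN lines up permits with inspections in the same cell.
rows = con.execute("""
    SELECT p.permit_id, o.inspection_id, p.h3_index
    FROM permit_signals p
    JOIN osha_safety o USING (h3_index)
""").fetchall()
print(rows)
```

The same single-JOIN pattern applies to any pair of datasets that share a key from the registry (event_id, jurisdiction_slug, mmsi, and so on).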
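The bitemporal fields enable "as-of" queries: what did the system know, and when? A minimal sketch with hypothetical records, using only occurred_at and ingested_at from the field set above:

```python
from datetime import datetime

# Hypothetical records carrying two of the bitemporal fields.
records = [
    {"event_id": "E1", "occurred_at": "2024-03-02T10:00:00+00:00",
     "ingested_at": "2024-03-05T00:00:00+00:00"},
    {"event_id": "E2", "occurred_at": "2024-03-04T09:00:00+00:00",
     "ingested_at": "2024-04-01T00:00:00+00:00"},
]

def as_of(records, known_by):
    """Events that had occurred AND were already ingested at `known_by`."""
    cutoff = datetime.fromisoformat(known_by)
    return [r["event_id"] for r in records
            if datetime.fromisoformat(r["occurred_at"]) <= cutoff
            and datetime.fromisoformat(r["ingested_at"]) <= cutoff]

# Both events occurred before mid-March, but E2 was ingested late,
# so a mid-March as-of view reproduces what was knowable then: only E1.
print(as_of(records, "2024-03-15T00:00:00+00:00"))
```

Separating the two clocks is what makes backtests honest: filtering on occurred_at alone would leak E2 into a mid-March view even though it was not yet in the system.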
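The exact layout of the llm_text view is not specified here, but the general shape of a Markdown-KV rendering can be sketched as follows (the title key and formatting are assumptions for illustration):

```python
def to_markdown_kv(record, title_key="event_id"):
    """Render a flat record as a Markdown key-value block,
    roughly in the spirit of the llm_text view (exact layout assumed)."""
    lines = [f"## {record[title_key]}"]
    for key, value in record.items():
        if key != title_key:
            lines.append(f"- **{key}**: {value}")
    return "\n".join(lines)

record = {"event_id": "E1", "event_type": "permit_issued",
          "jurisdiction_slug": "seattle-wa"}
print(to_markdown_kv(record))
```

The benchmark figures quoted above compare views like this against raw CSV for LLM reasoning; the labeled key-value structure is what gives the model its footing.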

Pricing and delivery

Tier         Records                          Formats                                Price
Research     100K-record stratified sample    Parquet                                Free on Hugging Face (CC-BY-4.0)
Commercial   Full dataset                     Parquet, CSV, JSON Lines, GeoParquet   $299/dataset/month
Enterprise   Full dataset + entity graph      All formats + Markdown-KV              Contact sales
All paid tiers deliver monthly immutable Parquet snapshots via signed Cloudflare R2 download URLs. See getting started for setup instructions.

Standards compliance

Every dataset ships with a DCAT-US v3.0 catalog.jsonld sidecar for discoverability by data.gov-compatible crawlers. Per-dataset ontology crosswalks map Codex fields to the relevant domain standards:
Dataset                   Target standards
Civic Intelligence        DCAT-US v3.0, PROV-DM, schema.org/GovernmentService
AIS Maritime Positions    IHO S-100, IMO A.600(15), ITU-R M.1371
Urban Signal Grid         OSCRE IDM 2.0, OGC API — Features, DCAT-US v3.0
Events Timeline           W3C PROV-DM, schema.org/Event, DCAT-US v3.0
LEHD Commuter Flows       Census TIGER/FIPS, OGC GeoJSON, DDI-Lifecycle
POI Intelligence          schema.org/LocalBusiness, GS1 GLN, OGC GeoJSON
Permit Signals            DCAT-US v3.0, schema.org/ConstructionPermit, PROV-DM
OSHA Safety Index         schema.org/GovernmentPermit, NAICS, FRS, DCAT-US
Full crosswalks are available at axiomcodex.io/standards.

RAG chunking

Every dataset ships with a _chunks Parquet file containing structure-aware chunks designed for retrieval-augmented generation. Chunks follow source boundaries (agenda items, speaker turns, permit scopes) rather than fixed token windows, and are compatible with LangChain and LlamaIndex ParentDocumentRetriever patterns. Each chunk carries:
  • A deterministic chunk_id (SHA-256-derived) for stable vector index keys
  • A parent_chunk_id for parent-child retrieval
  • An evidence_anchor for tracing back to the source document
  • Token counts and embedding metadata
AIS Maritime Positions are point-in-time records and are not chunked. The Urban Signal Grid, LEHD Flows, and POI Intelligence datasets use their compact Markdown-KV views as single-record chunks.
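The value of a deterministic, SHA-256-derived chunk_id is that rebuilding the vector index never churns keys. The exact input fields used for derivation are an assumption here; the point the sketch makes is determinism:

```python
import hashlib

def chunk_id(dataset, record_id, span):
    """Derive a stable SHA-256-based chunk id.

    The input fields (dataset, record id, source span) are assumed for
    illustration -- what matters is that the same inputs always yield
    the same key, so vector index entries survive rebuilds.
    """
    payload = f"{dataset}|{record_id}|{span}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:32]

a = chunk_id("civic_intelligence", "rec-001", "agenda_item:3")
b = chunk_id("civic_intelligence", "rec-001", "agenda_item:3")
print(a == b)  # same inputs, same key
```

Random UUIDs would force a full re-embed-and-reindex on every snapshot; content-derived IDs let unchanged chunks be skipped.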

Next steps

Normalization standard

The contract every record satisfies — field definitions, versioning rules, and conformance requirements.

Join keys

The registry of shared keys that make cross-dataset joins work without wrangling.

Chunking policy

How records are chunked for RAG — structure-aware boundaries, deterministic IDs, and parent-child retrieval.

Entity graph export

A Parquet edge-list for multi-hop reasoning over entities, vessels, permits, and jurisdictions (Enterprise tier).

Bitemporal fields

Per-dataset reference for every temporal field, so you always know which clock a timestamp is on.

Claim/fact separation

How source assertions are separated from resolved facts, with full provenance.

Reference notebooks

Jupyter notebooks demonstrating upzoning classification and civic risk mapping with research-tier data.