Datasets
Civic Intelligence
517K records — Council votes, permits, and zoning decisions with entity extraction, sentiment scores, upzoning probability, and DAG-mapped approval sequences. Tags: NLP-ready, Labeled, Temporal
Formats: Parquet, JSON Lines
AIS Maritime Positions
1.4M positions — Decoded vessel tracks, port calls, and anchor events enriched with Equasis vessel metadata, flag state, DWT, and kinematic fingerprints. Tags: Time-series, Enriched, Geospatial
Formats: Parquet
Urban Signal Grid
454K H3 cells — Cell-level ESGI composite scores and 8 signal-group subscores across 22 US metros at H3 resolution 8. Tags: Geospatial, Scored, Multi-signal
Formats: GeoParquet, CSV
Events Timeline
1.7M+ events — Unified temporal intelligence spanning permits, council decisions, AIS anomalies, OSHA violations, and business openings normalized to a single event schema. Tags: Temporal, Multi-source, Cross-domain
Formats: JSON Lines, Parquet
LEHD Commuter Flows
454K OD pairs — Census LEHD worker origin-destination pairs normalized to H3 cells with income bands, job sector, and Huff gravity index pre-computed. Tags: Geospatial, Demographics, Transport
Formats: CSV, Parquet
POI Intelligence
89K locations — Points of interest enriched with category taxonomy, NAICS codes, pioneer business flags, walk/transit scores, and a reviews sample. Tags: Enriched, Categorized, Labeled
Formats: JSON
Permit Signals
2.1M permits — Building permit activity across 22 metros with LLM-extracted scope type, building type, unit count, and estimated cost tier. Tags: NLP-extracted, Temporal, Development
Formats: Parquet, CSV
OSHA Safety Index
500K+ inspections — OSHA inspection records with NLP-classified hazard categories, violation severity tiers, and inflation-adjusted penalty normalization by H3 cell. Tags: NLP-classified, Safety, Industrial
Formats: CSV, Parquet
What makes Codex datasets different
Every record in every dataset satisfies the Axiom Portable Record Standard (APRS). In practice, that means:
- Zero-wrangling joins. A fixed set of shared keys (`h3_index`, `event_id`, `jurisdiction_slug`, `mmsi`, `imo`, and more) lets you join any two datasets with a single SQL `JOIN`.
- Pre-computed AI labels. Entity types, categories, sentiment, and risk scores are computed at normalization time — not at query time.
- LLM-ready formatting. Every dataset ships a Markdown-KV view (`llm_text`) optimized for RAG and LLM reasoning, benchmarked at 60.7% accuracy versus 44.3% for raw CSV.
- Bitemporal timestamps. Every record separates event time (`occurred_at`) from system time (`ingested_at`), publication time (`published_at`), and legal effective dates (`effective_from`/`effective_to`).
- Versioned monthly snapshots. Immutable monthly releases you can pin to for reproducible research.
- Full provenance. Every record carries a `provenance` chain documenting each transformation stage, and a `confidence_score` with per-dataset methodology.
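To make the "zero-wrangling joins" and bitemporal claims concrete, here is a minimal sketch in pandas. The column subsets and row values are invented for illustration; only the key and timestamp names (`h3_index`, `occurred_at`, `ingested_at`) come from the APRS field list above.

```python
import pandas as pd

# Two fabricated APRS-shaped fragments (Permit Signals and Urban Signal Grid).
permits = pd.DataFrame({
    "h3_index": ["8828308281fffff", "8828308283fffff"],
    "scope_type": ["renovation", "new_construction"],
    "occurred_at": pd.to_datetime(["2024-03-01", "2024-06-15"]),
    "ingested_at": pd.to_datetime(["2024-03-05", "2024-06-20"]),
})
signals = pd.DataFrame({
    "h3_index": ["8828308281fffff", "8828308283fffff"],
    "esgi_score": [0.72, 0.41],
})

# Single shared-key join: no column mapping or cleanup needed when
# both sides carry the same APRS key.
joined = permits.merge(signals, on="h3_index", how="inner")

# Bitemporal slice: events that had occurred by April 1 AND were
# already known to the system (ingested) by that date.
as_of = pd.Timestamp("2024-04-01")
known = joined[(joined["occurred_at"] <= as_of) & (joined["ingested_at"] <= as_of)]
print(len(known))  # 1
```

The same pattern applies to any shared key in the registry: because key names and types are fixed by APRS, the join condition is always a single equality.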
Pricing and delivery
| Tier | Records | Formats | Price |
|---|---|---|---|
| Research | 100K-record stratified sample | Parquet | Free on Hugging Face (CC-BY-4.0) |
| Commercial | Full dataset | Parquet, CSV, JSON Lines, GeoParquet | $299/dataset/month |
| Enterprise | Full dataset + entity graph | All formats + Markdown-KV | Contact sales |
Standards compliance
Every dataset ships with a DCAT-US v3.0 `catalog.jsonld` sidecar for discoverability by data.gov-compatible crawlers. Per-dataset ontology crosswalks map Codex fields to the relevant domain standards:
| Dataset | Target standards |
|---|---|
| Civic Intelligence | DCAT-US v3.0, PROV-DM, schema.org/GovernmentService |
| AIS Maritime Positions | IHO S-100, IMO A.600(15), ITU-R M.1371 |
| Urban Signal Grid | OSCRE IDM 2.0, OGC API — Features, DCAT-US v3.0 |
| Events Timeline | W3C PROV-DM, schema.org/Event, DCAT-US v3.0 |
| LEHD Commuter Flows | Census TIGER/FIPS, OGC GeoJSON, DDI-Lifecycle |
| POI Intelligence | schema.org/LocalBusiness, GS1 GLN, OGC GeoJSON |
| Permit Signals | DCAT-US v3.0, schema.org/ConstructionPermit, PROV-DM |
| OSHA Safety Index | schema.org/GovernmentPermit, NAICS, FRS, DCAT-US |
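Since the sidecar is plain JSON-LD, it can be inspected with nothing but the standard library. The sidecar contents below are a fabricated minimal stand-in; a real `catalog.jsonld` carries the full DCAT-US v3.0 vocabulary.

```python
import json

# Fabricated minimal sidecar for illustration only.
sidecar = json.loads("""
{
  "@type": "dcat:Catalog",
  "dataset": [
    {"@type": "dcat:Dataset", "title": "Permit Signals", "conformsTo": "DCAT-US v3.0"}
  ]
}
""")

# A crawler (or a quick sanity check) can enumerate dataset entries:
titles = [d["title"] for d in sidecar["dataset"]]
print(titles)  # ['Permit Signals']
```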
RAG and vector search
Every dataset ships with a `_chunks` Parquet file containing structure-aware chunks designed for retrieval-augmented generation. Chunks follow source boundaries (agenda items, speaker turns, permit scopes) rather than fixed token windows, and are compatible with LangChain and LlamaIndex ParentDocumentRetriever patterns.
Each chunk carries:
- A deterministic `chunk_id` (SHA-256-derived) for stable vector index keys
- A `parent_chunk_id` for parent-child retrieval
- An `evidence_anchor` for tracing back to the source document
- Token counts and embedding metadata
AIS Maritime Positions are point-in-time records and are not chunked. The Urban Signal Grid, LEHD Flows, and POI Intelligence datasets use their compact Markdown-KV views as single-record chunks.
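A sketch of how the `chunk_id` / `parent_chunk_id` columns support the parent-child pattern: after a vector hit on a small chunk, look up its parent to hand the LLM the larger surrounding context. The rows, texts, and anchor strings here are fabricated stand-ins for a real `_chunks` Parquet.

```python
import pandas as pd

# Fabricated _chunks-shaped table: one parent (agenda item) and two
# child chunks (speaker turns), linked via parent_chunk_id.
chunks = pd.DataFrame({
    "chunk_id": ["c_parent", "c_child_1", "c_child_2"],
    "parent_chunk_id": [None, "c_parent", "c_parent"],
    "text": ["Full agenda item ...", "Speaker turn 1 ...", "Speaker turn 2 ..."],
    "evidence_anchor": ["doc1#item3", "doc1#item3.t1", "doc1#item3.t2"],
}).set_index("chunk_id")

def expand_to_parent(hit_id: str) -> str:
    """Given a vector-search hit, return the parent chunk's text if one
    exists (ParentDocumentRetriever pattern), else the hit's own text."""
    parent = chunks.loc[hit_id, "parent_chunk_id"]
    return chunks.loc[parent, "text"] if parent else chunks.loc[hit_id, "text"]

print(expand_to_parent("c_child_2"))  # Full agenda item ...
```

Because `chunk_id` is deterministic (SHA-256-derived), the same IDs can be reused as stable keys in an external vector index across snapshot releases.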
Next steps
Normalization standard
The contract every record satisfies — field definitions, versioning rules, and conformance requirements.
Join keys
The registry of shared keys that make cross-dataset joins work without wrangling.
Chunking policy
How records are chunked for RAG — structure-aware boundaries, deterministic IDs, and parent-child retrieval.
Entity graph export
A Parquet edge-list for multi-hop reasoning over entities, vessels, permits, and jurisdictions (Enterprise tier).
Bitemporal fields
Per-dataset reference for every temporal field, so you always know which clock a timestamp is on.
Claim/fact separation
How source assertions are separated from resolved facts, with full provenance.
Reference notebooks
Jupyter notebooks demonstrating upzoning classification and civic risk mapping with research-tier data.