Codex chunks records along structural boundaries (agenda items, speaker turns, permit scopes) rather than fixed token windows. This preserves semantic context and improves retrieval accuracy by 20–40% compared to naive fixed-size chunking on government and legal text.
Principles
- Respect source structure. Chunk at sections, speaker turns, or clause boundaries. Fall back to token-window chunking only when no structure exists.
- One chunk = one contract obligation. Permit conditions, lease clauses, and blocker tags each become their own chunk.
- Parent-child friendly. Every chunk carries doc_id + section_id so retrievers can fetch the parent document when a chunk matches.
- Bounded size. Target 300–800 tokens for narrative sections; hard cap 1,500 tokens with intelligent split.
- Overlap only when needed. Zero overlap for structured splits. Fixed 50-token overlap only for token-window fallback.
- Deterministic IDs. chunk_id is SHA-256-derived via make_chunk_id(doc_id, section_id) — re-running the chunker on unchanged text produces identical IDs, enabling incremental vector-index updates.
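The deterministic-ID principle can be sketched as follows. The spec fixes SHA-256 and a 16-hex suffix on the urn:aprs:chunk: prefix; the delimiter between doc_id and section_id and the exact truncation point are assumptions.

```python
import hashlib

def make_chunk_id(doc_id: str, section_id: str) -> str:
    """Derive a deterministic chunk URN from the parent doc and section.

    Assumed scheme: SHA-256 over "doc_id|section_id", truncated to the
    first 16 hex characters. The "|" delimiter is an assumption.
    """
    digest = hashlib.sha256(f"{doc_id}|{section_id}".encode("utf-8")).hexdigest()
    return f"urn:aprs:chunk:{digest[:16]}"
```

Because the ID depends only on the inputs, re-running the chunker on unchanged text yields the same URN, which is what makes incremental index updates safe.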
Chunk schema
Every chunk record has this shape, written to a per-dataset _chunks Parquet file alongside the main dataset export.
| Field | Type | Description |
|---|---|---|
| chunk_id | URN | urn:aprs:chunk:{16-hex} — SHA-256 of doc_id + section_id. Deterministic. |
| doc_id | URN | Parent record’s APRS URN |
| section_id | string | Stable section identifier within the document |
| chunk_text | text | The chunk content |
| chunk_type | enum | See chunk types below |
| token_count | int | Token count (tiktoken cl100k_base) |
| char_count | int | Character count |
| seq_index | int | 0-based sequential index within the document |
| parent_chunk_id | URN | Null for top-level chunks; set for subsections |
| evidence_anchor | JSON | {page, char_span, xpath?} for source verification |
| embedding_model | string | Embedding model used (e.g. text-embedding-3-large) |
| embedding_version | string | Model version or fingerprint |
| chunked_at | timestamp | When chunking occurred |
| chunking_version | string | Version of the chunking policy applied |
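As a rough illustration, the schema above maps to a record type like this sketch. The Python types are approximations of the Parquet column types, and chunked_at is shown as an ISO-8601 string.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChunkRecord:
    chunk_id: str                    # urn:aprs:chunk:{16-hex}, deterministic
    doc_id: str                      # parent record's APRS URN
    section_id: str                  # stable section identifier
    chunk_text: str                  # the chunk content
    chunk_type: str                  # one of the chunk-type enum values
    token_count: int                 # tiktoken cl100k_base count
    char_count: int
    seq_index: int                   # 0-based within the document
    parent_chunk_id: Optional[str]   # None for top-level chunks
    evidence_anchor: dict            # {"page": ..., "char_span": ..., "xpath": ...}
    embedding_model: str
    embedding_version: str
    chunked_at: str                  # ISO-8601 timestamp
    chunking_version: str
```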
Per-dataset chunking rules
Civic Intelligence
Source material: council transcripts, agenda items, meeting minutes.
Chunk at, in priority order:
- Agenda item boundaries — each item is its own chunk tree. section_id = "item.{agenda_item_number}".
- Speaker-turn boundaries — inside an agenda item, chunk per speaker turn. section_id = "item.{n}.turn.{m}". Turns under 50 tokens are merged with the next turn from the same speaker.
- Blocker/condition enumeration — each blocker tag or condition becomes its own chunk with chunk_type=condition.
- Contingency DAG nodes — each node becomes a chunk_type=contingency chunk.
The parent chunk for an agenda item contains the item summary; each speaker-turn, condition, and contingency is a child with parent_chunk_id set.
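The short-turn merge rule can be sketched as below. The whitespace token count is a simplifying assumption (the pipeline counts tokens with tiktoken cl100k_base), as is the exact merge direction: here a short turn absorbs the immediately following turn when the speaker matches.

```python
def merge_short_turns(turns, min_tokens=50, count_tokens=lambda t: len(t.split())):
    """Merge turns under `min_tokens` into the next turn by the same speaker.

    `turns` is a list of (speaker, text) pairs in transcript order.
    """
    merged = []
    i = 0
    while i < len(turns):
        speaker, text = turns[i]
        # Keep absorbing the following turn while this one is still short
        # and the next turn comes from the same speaker.
        while (i + 1 < len(turns)
               and count_tokens(text) < min_tokens
               and turns[i + 1][0] == speaker):
            text = text + " " + turns[i + 1][1]
            i += 1
        merged.append((speaker, text))
        i += 1
    return merged
```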
Events Timeline
Each event is a chunk. section_id is the event’s own ID. chunk_type=event.
Long event descriptions (e.g. dark_event with extensive kinematic detail) split at 800 tokens with 50-token overlap. All splits share the same parent_chunk_id.
Permit Signals
Two-level tree:
- Permit parent — summary plus APRS envelope. chunk_type=permit_summary.
- Scope sections — each LLM-extracted scope section (e.g. “Floor 1 addition”, “Mechanical — HVAC”) becomes its own chunk with chunk_type=permit_scope_section.
Each condition in the permit (e.g. “Subject to ADA variance approval”) is an independent chunk with chunk_type=condition — they are frequently cited in isolation, so joining them with their scope section would hurt retrieval.
OSHA Safety
Two-level tree:
- Case parent — chunk_type=osha_case_summary.
- Citation sections — one per citation, since OSHA citations are independently enforceable. chunk_type=osha_citation.
AIS Maritime Positions
Not chunked. Positions are point-in-time records — one row equals one record. No _chunks file for this dataset.
Urban Signal Grid, LEHD Flows, POI Intelligence
Structured-only datasets. The llm_text Markdown-KV view is the text surface and is already compact (under 400 tokens per record). Each record maps to one chunk with chunk_type=record_view.
Chunk types
| Type | Description |
|---|---|
| event | Single event row (Events Timeline) |
| record_view | Markdown-KV view of a structured record (USG, LEHD, POI) |
| agenda_item | Top-level civic agenda item |
| speaker_turn | Civic transcript speaker turn |
| condition | Enforceable condition, blocker, or covenant |
| contingency | DAG node in a civic contingency chain |
| permit_summary | Permit root chunk |
| permit_scope_section | Scope subsection of a permit |
| osha_case_summary | OSHA case root |
| osha_citation | Individual OSHA citation |
| narrative_window | Fallback token-window chunk |
Token-window fallback
When a source has no identifiable structure (rare; mostly raw scraped HTML), Codex falls back to sliding-window chunking:
- Window: 600 tokens
- Stride: 550 tokens (50-token overlap)
- Chunk type: narrative_window
- Section ID: "window.{seq_index}"
Overlap is zero for all structured chunks — duplication across chunks degrades retrieval precision.
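A minimal sketch of the fallback, assuming the input is already tokenized. The window, stride, chunk type, and section-ID convention come from the spec above; the dict shape is illustrative.

```python
def window_chunks(tokens, window=600, stride=550):
    """Sliding-window fallback: 600-token windows on a 550-token stride,
    giving 50 tokens of overlap between consecutive chunks."""
    chunks = []
    start = 0
    seq = 0
    while start < len(tokens):
        chunks.append({
            "section_id": f"window.{seq}",
            "chunk_type": "narrative_window",
            "seq_index": seq,
            "tokens": tokens[start:start + window],
        })
        if start + window >= len(tokens):
            break  # this window reached the end of the document
        start += stride
        seq += 1
    return chunks
```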
Parent-child retrieval
You can use the chunk tree directly with parent-child retrieval frameworks such as LangChain’s ParentDocumentRetriever or LlamaIndex’s auto-merging retriever:
```sql
-- Retrieve a chunk's parent, then re-score at parent scope
WITH hit AS (
  SELECT chunk_id, doc_id, parent_chunk_id, chunk_text
  FROM codex.civic_intelligence_chunks
  WHERE chunk_id = :hit_chunk_id
)
SELECT parent.chunk_text
FROM hit
JOIN codex.civic_intelligence_chunks parent
  ON parent.chunk_id = hit.parent_chunk_id;
```
Cross-encoder re-ranking
Initial retrieval over chunk embeddings is coarse. The recommended post-retrieval step is a cross-encoder re-rank over the top 50 results. Score pairs: (query, chunk_text). Drop chunks below a score_threshold of 0.3.
Codex does not run the re-ranker — this is a consumer-side pattern. The chunk structure (bounded size, clean boundaries) is designed to be compatible with standard re-ranking pipelines.
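A consumer-side sketch of this pattern. score_fn is a placeholder for a real cross-encoder scorer (e.g. a sentence-transformers CrossEncoder's predict over (query, text) pairs); the top-50 cut and 0.3 threshold come from the recommendation above.

```python
def rerank(query, chunks, score_fn, top_k=50, score_threshold=0.3):
    """Re-rank coarse vector-search hits with a cross-encoder-style scorer.

    `chunks` are dicts with a "chunk_text" key, ordered by vector score.
    `score_fn(query, text)` returns a relevance score; higher is better.
    """
    candidates = chunks[:top_k]  # coarse top-k from the vector index
    scored = [(score_fn(query, c["chunk_text"]), c) for c in candidates]
    # Drop low-confidence chunks, then order best-first.
    scored = [(s, c) for s, c in scored if s >= score_threshold]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in scored]
```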
Embedding policy
- Default model: text-embedding-3-large (OpenAI), 3072 dimensions, reduced to 1536 for storage.
- Re-embedding triggers: model version bump or chunking_version bump. Embeddings are tagged with embedding_model and embedding_version so you can decide whether to re-embed.
- Deterministic re-embedding: because chunk_id is deterministic, re-chunking unchanged text yields the same chunk_id, so re-embedded rows update in place without orphans.
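The re-embedding triggers can be checked per stored row. The field names follow the chunk schema; the current_* parameters are hypothetical consumer-side configuration, not part of Codex.

```python
def needs_reembedding(chunk, current_model, current_model_version, current_chunking_version):
    """Return True if a stored chunk row should be re-embedded.

    Triggers, per the policy above: the embedding model (or its version)
    changed, or the chunking policy version changed.
    """
    return (chunk["embedding_model"] != current_model
            or chunk["embedding_version"] != current_model_version
            or chunk["chunking_version"] != current_chunking_version)
```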