Codex chunks records along structural boundaries (agenda items, speaker turns, permit scopes) rather than fixed token windows. This preserves semantic context and improves retrieval accuracy by 20–40% compared to naive fixed-size chunking on government and legal text.
Principles
- Respect source structure. Chunk at sections, speaker turns, or clause boundaries. Fall back to token-window chunking only when no structure exists.
- One chunk = one contract obligation. Permit conditions, lease clauses, and blocker tags each become their own chunk.
- Parent-child friendly. Every chunk carries doc_id + section_id so retrievers can fetch the parent document when a chunk matches.
- Bounded size. Target 300–800 tokens for narrative sections; hard cap 1,500 tokens with intelligent split.
- Overlap only when needed. Zero overlap for structured splits. Fixed 50-token overlap only for token-window fallback.
- Deterministic IDs. chunk_id is SHA-256-derived via make_chunk_id(doc_id, section_id) — re-running the chunker on unchanged text produces identical IDs, enabling incremental vector-index updates.
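The deterministic-ID principle can be sketched as follows. The spec fixes SHA-256 and a 16-hex suffix on the urn:aprs:chunk: prefix; the delimiter between doc_id and section_id and the exact truncation point are assumptions.

```python
import hashlib

def make_chunk_id(doc_id: str, section_id: str) -> str:
    """Derive a deterministic chunk URN from the parent doc and section.

    Assumed scheme: SHA-256 over "doc_id|section_id", truncated to the
    first 16 hex characters. The "|" delimiter is an assumption.
    """
    digest = hashlib.sha256(f"{doc_id}|{section_id}".encode("utf-8")).hexdigest()
    return f"urn:aprs:chunk:{digest[:16]}"
```

Because the ID depends only on the inputs, re-running the chunker on unchanged text yields the same URN, which is what makes incremental index updates safe.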
Chunk schema
Every chunk record has this shape, written to a per-dataset _chunks Parquet file alongside the main dataset export.
| Field | Type | Description |
|---|---|---|
| chunk_id | URN | urn:aprs:chunk:{16-hex} — SHA-256 of doc_id + section_id. Deterministic. |
| doc_id | URN | Parent record’s APRS URN |
| section_id | string | Stable section identifier within the document |
| chunk_text | text | The chunk content |
| chunk_type | enum | See chunk types below |
| token_count | int | Token count (tiktoken cl100k_base) |
| char_count | int | Character count |
| seq_index | int | 0-based sequential index within the document |
| parent_chunk_id | URN | Null for top-level chunks; set for subsections |
| evidence_anchor | JSON | {page, char_span, xpath?} for source verification |
| embedding_model | string | Embedding model used (e.g. text-embedding-3-large) |
| embedding_version | string | Model version or fingerprint |
| chunked_at | timestamp | When chunking occurred |
| chunking_version | string | Version of the chunking policy applied |
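As a rough illustration, the schema above maps to a record type like this sketch. The Python types are approximations of the Parquet column types, and chunked_at is shown as an ISO-8601 string.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChunkRecord:
    chunk_id: str                    # urn:aprs:chunk:{16-hex}, deterministic
    doc_id: str                      # parent record's APRS URN
    section_id: str                  # stable section identifier
    chunk_text: str                  # the chunk content
    chunk_type: str                  # one of the chunk-type enum values
    token_count: int                 # tiktoken cl100k_base count
    char_count: int
    seq_index: int                   # 0-based within the document
    parent_chunk_id: Optional[str]   # None for top-level chunks
    evidence_anchor: dict            # {"page": ..., "char_span": ..., "xpath": ...}
    embedding_model: str
    embedding_version: str
    chunked_at: str                  # ISO-8601 timestamp
    chunking_version: str
```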
Per-dataset chunking rules
Civic Intelligence
Source material: council transcripts, agenda items, meeting minutes.
Chunk at, in priority order:
- Agenda item boundaries — each item is its own chunk tree. section_id = "item.{agenda_item_number}".
- Speaker-turn boundaries — inside an agenda item, chunk per speaker turn. section_id = "item.{n}.turn.{m}". Turns under 50 tokens are merged with the next turn from the same speaker.
- Blocker/condition enumeration — each blocker tag or condition becomes its own chunk with chunk_type=condition.
- Contingency DAG nodes — each node becomes a chunk_type=contingency chunk.
The parent chunk for an agenda item contains the item summary; each speaker-turn, condition, and contingency is a child with parent_chunk_id set.
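The short-turn merge rule can be sketched as below. The whitespace token count is a simplifying assumption (the pipeline counts tokens with tiktoken cl100k_base), as is the exact merge direction: here a short turn absorbs the immediately following turn when the speaker matches.

```python
def merge_short_turns(turns, min_tokens=50, count_tokens=lambda t: len(t.split())):
    """Merge turns under `min_tokens` into the next turn by the same speaker.

    `turns` is a list of (speaker, text) pairs in transcript order.
    """
    merged = []
    i = 0
    while i < len(turns):
        speaker, text = turns[i]
        # Keep absorbing the following turn while this one is still short
        # and the next turn comes from the same speaker.
        while (i + 1 < len(turns)
               and count_tokens(text) < min_tokens
               and turns[i + 1][0] == speaker):
            text = text + " " + turns[i + 1][1]
            i += 1
        merged.append((speaker, text))
        i += 1
    return merged
```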
Events Timeline
Each event is a chunk. section_id is the event’s own ID. chunk_type=event.
Long event descriptions (e.g. dark_event with extensive kinematic detail) split at 800 tokens with 50-token overlap. All splits share the same parent_chunk_id.
Permit Signals
Two-level tree:
- Permit parent — summary plus APRS envelope. chunk_type=permit_summary.
- Scope sections — each LLM-extracted scope section (e.g. “Floor 1 addition”, “Mechanical — HVAC”) becomes its own chunk with chunk_type=permit_scope_section.
Each condition in the permit (e.g. “Subject to ADA variance approval”) is an independent chunk with chunk_type=condition — they are frequently cited in isolation, so joining them with their scope section would hurt retrieval.
OSHA Safety
Two-level tree:
- Case parent — chunk_type=osha_case_summary.
- Citation sections — one per citation, since OSHA citations are independently enforceable. chunk_type=osha_citation.
AIS Maritime Positions
Not chunked. Positions are point-in-time records — one row equals one record. No _chunks file for this dataset.
Urban Signal Grid, LEHD Flows, POI Intelligence
Structured-only datasets. The llm_text Markdown-KV view is the text surface and is already compact (under 400 tokens per record). Each record maps to one chunk with chunk_type=record_view.
Chunk types
| Type | Description |
|---|---|
| event | Single event row (Events Timeline) |
| record_view | Markdown-KV view of a structured record (USG, LEHD, POI) |
| agenda_item | Top-level civic agenda item |
| speaker_turn | Civic transcript speaker turn |
| condition | Enforceable condition, blocker, or covenant |
| contingency | DAG node in a civic contingency chain |
| permit_summary | Permit root chunk |
| permit_scope_section | Scope subsection of a permit |
| osha_case_summary | OSHA case root |
| osha_citation | Individual OSHA citation |
| narrative_window | Fallback token-window chunk |
Token-window fallback
When a source has no identifiable structure (rare; mostly raw scraped HTML), Codex falls back to sliding-window chunking:
- Window: 600 tokens
- Stride: 550 tokens (50-token overlap)
- Chunk type: narrative_window
- Section ID: "window.{seq_index}"
Overlap is zero for all structured chunks — duplication across chunks degrades retrieval precision.
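A minimal sketch of the fallback, assuming the input is already tokenized. The window, stride, chunk type, and section-ID convention come from the spec above; the dict shape is illustrative.

```python
def window_chunks(tokens, window=600, stride=550):
    """Sliding-window fallback: 600-token windows on a 550-token stride,
    giving 50 tokens of overlap between consecutive chunks."""
    chunks = []
    start = 0
    seq = 0
    while start < len(tokens):
        chunks.append({
            "section_id": f"window.{seq}",
            "chunk_type": "narrative_window",
            "seq_index": seq,
            "tokens": tokens[start:start + window],
        })
        if start + window >= len(tokens):
            break  # this window reached the end of the document
        start += stride
        seq += 1
    return chunks
```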
Parent-child retrieval
You can use the chunk tree directly with parent-child retrieval frameworks such as LangChain’s ParentDocumentRetriever or LlamaIndex’s auto-merging retriever:
```sql
-- Retrieve a chunk's parent, then re-score at parent scope
WITH hit AS (
  SELECT chunk_id, doc_id, parent_chunk_id, chunk_text
  FROM codex.civic_intelligence_chunks
  WHERE chunk_id = :hit_chunk_id
)
SELECT parent.chunk_text
FROM hit
JOIN codex.civic_intelligence_chunks parent
  ON parent.chunk_id = hit.parent_chunk_id;
```
Cross-encoder re-ranking
Initial retrieval over chunk embeddings is coarse. The recommended post-retrieval step is a cross-encoder re-rank over the top 50 results. Score pairs: (query, chunk_text). Drop chunks below a score_threshold of 0.3.
Codex does not run the re-ranker — this is a consumer-side pattern. The chunk structure (bounded size, clean boundaries) is designed to be compatible with standard re-ranking pipelines.
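A consumer-side sketch of this pattern. score_fn is a placeholder for a real cross-encoder scorer (e.g. a sentence-transformers CrossEncoder's predict over (query, text) pairs); the top-50 cut and 0.3 threshold come from the recommendation above.

```python
def rerank(query, chunks, score_fn, top_k=50, score_threshold=0.3):
    """Re-rank coarse vector-search hits with a cross-encoder-style scorer.

    `chunks` are dicts with a "chunk_text" key, ordered by vector score.
    `score_fn(query, text)` returns a relevance score; higher is better.
    """
    candidates = chunks[:top_k]  # coarse top-k from the vector index
    scored = [(score_fn(query, c["chunk_text"]), c) for c in candidates]
    # Drop low-confidence chunks, then order best-first.
    scored = [(s, c) for s, c in scored if s >= score_threshold]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in scored]
```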
Embedding policy
- Default model: text-embedding-3-large (OpenAI), 3072 dimensions, reduced to 1536 for storage.
- Re-embedding triggers: model version bump or chunking_version bump. Embeddings are tagged with embedding_model and embedding_version so you can decide whether to re-embed.
- Deterministic re-embedding: because chunk_id is deterministic, re-chunking unchanged text yields the same chunk_id, so re-embedded rows update in place without orphans.
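The re-embedding triggers can be checked per stored row. The field names follow the chunk schema; the current_* parameters are hypothetical consumer-side configuration, not part of Codex.

```python
def needs_reembedding(chunk, current_model, current_model_version, current_chunking_version):
    """Return True if a stored chunk row should be re-embedded.

    Triggers, per the policy above: the embedding model (or its version)
    changed, or the chunking policy version changed.
    """
    return (chunk["embedding_model"] != current_model
            or chunk["embedding_version"] != current_model_version
            or chunk["chunking_version"] != current_chunking_version)
```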