Codex chunks records along structural boundaries (agenda items, speaker turns, permit scopes) rather than fixed token windows. This preserves semantic context and improves retrieval accuracy by 20–40% compared to naive fixed-size chunking on government and legal text.

Principles

  1. Respect source structure. Chunk at sections, speaker turns, or clause boundaries. Fall back to token-window chunking only when no structure exists.
  2. One chunk = one contract obligation. Permit conditions, lease clauses, and blocker tags each become their own chunk.
  3. Parent-child friendly. Every chunk carries doc_id + section_id so retrievers can fetch the parent document when a chunk matches.
  4. Bounded size. Target 300–800 tokens for narrative sections; hard cap 1,500 tokens with intelligent split.
  5. Overlap only when needed. Zero overlap for structured splits. Fixed 50-token overlap only for token-window fallback.
  6. Deterministic IDs. chunk_id is SHA-256-derived via make_chunk_id(doc_id, section_id) — re-running the chunker on unchanged text produces identical IDs, enabling incremental vector-index updates.
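The deterministic-ID principle can be sketched as follows. Only the SHA-256 / 16-hex / urn:aprs:chunk: shape comes from this page; the exact separator between doc_id and section_id inside the hash is an assumption of this sketch.

```python
import hashlib

def make_chunk_id(doc_id: str, section_id: str) -> str:
    """Derive a stable chunk URN from doc_id + section_id.

    Illustrative sketch: the real joining of the two inputs may differ;
    the SHA-256 derivation and urn:aprs:chunk:{16-hex} layout follow
    the principles above.
    """
    digest = hashlib.sha256(f"{doc_id}\x1f{section_id}".encode("utf-8")).hexdigest()
    return f"urn:aprs:chunk:{digest[:16]}"

# Re-running on unchanged inputs yields the identical ID, which is what
# makes incremental vector-index updates possible.
a = make_chunk_id("urn:aprs:doc:example", "item.3")
b = make_chunk_id("urn:aprs:doc:example", "item.3")
assert a == b and a.startswith("urn:aprs:chunk:")
```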

Chunk schema

Every chunk record has this shape, written to a per-dataset _chunks Parquet file alongside the main dataset export.
| Field | Type | Description |
| --- | --- | --- |
| chunk_id | URN | urn:aprs:chunk:{16-hex} — SHA-256 of doc_id + section. Deterministic. |
| doc_id | URN | Parent record’s APRS URN |
| section_id | string | Stable section identifier within the document |
| chunk_text | text | The chunk content |
| chunk_type | enum | See chunk types below |
| token_count | int | Token count (tiktoken cl100k_base) |
| char_count | int | Character count |
| seq_index | int | 0-based sequential index within the document |
| parent_chunk_id | URN | Null for top-level chunks; set for subsections |
| evidence_anchor | JSON | {page, char_span, xpath?} for source verification |
| embedding_model | string | Embedding model used (e.g. text-embedding-3-large) |
| embedding_version | string | Model version or fingerprint |
| chunked_at | timestamp | When chunking occurred |
| chunking_version | string | Version of the chunking policy applied |
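To make the field list concrete, here is one illustrative record as a Python dict. Every value below is a made-up example, not real data; only the field names follow the schema.

```python
# One illustrative chunk record. All values are hypothetical examples;
# the field names follow the chunk schema table.
chunk = {
    "chunk_id": "urn:aprs:chunk:9f2c41aa03d7e8b1",       # hypothetical 16-hex URN
    "doc_id": "urn:aprs:doc:example-meeting",            # hypothetical parent URN
    "section_id": "item.3.turn.2",
    "chunk_text": "Staff confirms the variance hearing is set for March.",
    "chunk_type": "speaker_turn",
    "token_count": 11,                                   # illustrative count
    "char_count": 53,
    "seq_index": 4,
    "parent_chunk_id": "urn:aprs:chunk:51be90c2aa7d3f04",  # child of the agenda-item chunk
    "evidence_anchor": {"page": 12, "char_span": [3021, 3074]},
    "embedding_model": "text-embedding-3-large",
    "embedding_version": "v3",                           # hypothetical fingerprint
    "chunked_at": "2025-01-15T08:30:00Z",                # hypothetical timestamp
    "chunking_version": "1.0",                           # hypothetical policy version
}
```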

Per-dataset chunking rules

Civic Intelligence

Source material: council transcripts, agenda items, meeting minutes. Chunk at, in priority order:
  1. Agenda item boundaries — each item is its own chunk tree. section_id = "item.{agenda_item_number}".
  2. Speaker-turn boundaries — inside an agenda item, chunk per speaker turn. section_id = "item.{n}.turn.{m}". Turns under 50 tokens are merged with the next turn from the same speaker.
  3. Blocker/condition enumeration — each blocker tag or condition becomes its own chunk with chunk_type=condition.
  4. Contingency DAG nodes — each node becomes a chunk_type=contingency chunk.
The parent chunk for an agenda item contains the item summary; each speaker-turn, condition, and contingency is a child with parent_chunk_id set.
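The turn-merge rule in point 2 can be sketched as follows. This is one reading of the rule, with a whitespace token count standing in for tiktoken; merge_short_turns and its carry-forward behaviour are assumptions of this sketch, not the shipped implementation.

```python
def merge_short_turns(turns, min_tokens=50, count_tokens=lambda t: len(t.split())):
    """Merge turns under min_tokens into the same speaker's next turn.

    `turns` is a list of (speaker, text) pairs. A short turn's text is
    carried forward and prepended to that speaker's next turn; if no
    later turn by that speaker exists, the short turn is kept as-is.
    """
    pending = {}   # speaker -> text carried forward from a short turn
    merged = []
    for speaker, text in turns:
        if speaker in pending:
            text = pending.pop(speaker) + " " + text
        if count_tokens(text) < min_tokens:
            pending[speaker] = text      # too short: hold for the next turn
        else:
            merged.append((speaker, text))
    # Flush leftovers that never met the threshold.
    merged.extend(pending.items())
    return merged
```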

Events Timeline

Each event is a chunk. section_id is the event’s own ID. chunk_type=event. Long event descriptions (e.g. dark_event with extensive kinematic detail) split at 800 tokens with 50-token overlap. All splits share the same parent_chunk_id.
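The 800-token split with shared parent can be sketched like this; the function name and the pre-tokenized input are assumptions, while the 800/50 parameters and the shared parent_chunk_id rule come from the text above.

```python
def split_event(tokens, parent_chunk_id, limit=800, overlap=50):
    """Split a long event description into overlapping child chunks.

    Returns (parent_chunk_id, token_window) pairs: every split keeps the
    same parent_chunk_id. Windows are `limit` tokens, with `overlap`
    tokens repeated between neighbours (stride = limit - overlap).
    """
    if len(tokens) <= limit:
        return [(parent_chunk_id, tokens)]
    stride = limit - overlap          # 750 with the defaults
    parts, start = [], 0
    while start < len(tokens):
        parts.append((parent_chunk_id, tokens[start:start + limit]))
        if start + limit >= len(tokens):
            break
        start += stride
    return parts
```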

Permit Signals

Two-level tree:
  1. Permit parent — summary plus APRS envelope. chunk_type=permit_summary.
  2. Scope sections — each LLM-extracted scope section (e.g. “Floor 1 addition”, “Mechanical — HVAC”) becomes its own chunk with chunk_type=permit_scope_section.
Each condition in the permit (e.g. “Subject to ADA variance approval”) is an independent chunk with chunk_type=condition — they are frequently cited in isolation, so joining them with their scope section would hurt retrieval.

OSHA Safety

Two-level tree:
  1. Case parent — chunk_type=osha_case_summary.
  2. Citation sections — one per citation, since OSHA citations are independently enforceable. chunk_type=osha_citation.

AIS Maritime Positions

Not chunked. Positions are point-in-time records — one row equals one record. No _chunks file for this dataset.

Urban Signal Grid, LEHD Flows, POI Intelligence

Structured-only datasets. The llm_text Markdown-KV view is the text surface and is already compact (under 400 tokens per record). Each record maps to one chunk with chunk_type=record_view.

Chunk types

| Type | Description |
| --- | --- |
| event | Single event row (Events Timeline) |
| record_view | Markdown-KV view of a structured record (USG, LEHD, POI) |
| agenda_item | Top-level civic agenda item |
| speaker_turn | Civic transcript speaker turn |
| condition | Enforceable condition, blocker, or covenant |
| contingency | DAG node in a civic contingency chain |
| permit_summary | Permit root chunk |
| permit_scope_section | Scope subsection of a permit |
| osha_case_summary | OSHA case root |
| osha_citation | Individual OSHA citation |
| narrative_window | Fallback token-window chunk |

Token-window fallback

When a source has no identifiable structure (rare; mostly raw scraped HTML), Codex falls back to sliding-window chunking:
  • Window: 600 tokens
  • Stride: 550 tokens (50-token overlap)
  • Chunk type: narrative_window
  • Section ID: "window.{seq_index}"
Overlap is zero for all structured chunks — duplication across chunks degrades retrieval precision.
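The fallback parameters translate directly into a sliding window. A sketch over a pre-tokenized list (the real pipeline counts tokens with tiktoken cl100k_base; the function name is this sketch's own):

```python
def window_chunks(tokens, window=600, stride=550):
    """Token-window fallback: 600-token windows on a 550-token stride,
    i.e. a 50-token overlap between neighbours. Returns
    (section_id, token_window) pairs with section_id "window.{seq_index}".
    """
    chunks = []
    for seq, start in enumerate(range(0, len(tokens) or 1, stride)):
        chunks.append((f"window.{seq}", tokens[start:start + window]))
        if start + window >= len(tokens):
            break
    return chunks
```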

Parent-child retrieval

You can use the chunk tree directly with parent-child retrieval patterns such as LangChain's ParentDocumentRetriever (LlamaIndex offers a similar auto-merging retriever):
-- Retrieve a chunk's parent, then re-score at parent scope
WITH hit AS (
  SELECT chunk_id, doc_id, parent_chunk_id, chunk_text
  FROM codex.civic_intelligence_chunks
  WHERE chunk_id = :hit_chunk_id
)
SELECT parent.chunk_text
FROM hit
JOIN codex.civic_intelligence_chunks parent
  ON parent.chunk_id = hit.parent_chunk_id;

Cross-encoder re-ranking

Initial retrieval over chunk embeddings is coarse. The recommended post-retrieval step is a cross-encoder re-rank over the top 50 results. Score pairs: (query, chunk_text). Drop chunks below a score_threshold of 0.3.
Codex does not run the re-ranker — this is a consumer-side pattern. The chunk structure (bounded size, clean boundaries) is designed to be compatible with standard re-ranking pipelines.
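A minimal consumer-side sketch of the pattern. The cross-encoder itself is abstracted as score_fn (e.g. a wrapper around a sentence-transformers CrossEncoder's predict); the top-50 window and 0.3 threshold are the defaults recommended above, and the function name is this sketch's own.

```python
def rerank(query, chunks, score_fn, top_n=50, score_threshold=0.3):
    """Cross-encoder re-rank over the top-N retrieved chunks.

    `chunks` is a list of dicts with at least "chunk_text";
    `score_fn(query, text) -> float` wraps whatever cross-encoder you
    use. Chunks scoring below score_threshold are dropped, and the
    survivors are returned best-first.
    """
    scored = [(score_fn(query, c["chunk_text"]), c) for c in chunks[:top_n]]
    scored = [(s, c) for s, c in scored if s >= score_threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]
```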

Embedding policy

  • Default model: text-embedding-3-large (OpenAI), 3072 dimensions, reduced to 1536 for storage.
  • Re-embedding triggers: model version bump or chunking_version bump. Embeddings are tagged with embedding_model and embedding_version so you can decide whether to re-embed.
  • Deterministic re-embedding: because chunk_id is deterministic, re-embedding the same text produces the same chunk_id. Embedding rows update in place without orphans.
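The re-embedding triggers can be checked with a small helper; a minimal sketch assuming chunk metadata is available as a dict carrying the schema's tag fields:

```python
def needs_reembed(chunk, model, model_version, chunking_version):
    """Consumer-side staleness check for a stored chunk embedding.

    Compares the embedding_model / embedding_version / chunking_version
    tags on a chunk record against the currently configured pipeline;
    any mismatch is a re-embedding trigger per the policy above.
    """
    return (chunk.get("embedding_model") != model
            or chunk.get("embedding_version") != model_version
            or chunk.get("chunking_version") != chunking_version)
```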