Civic Intelligence schema

The Civic Intelligence dataset contains ~517K records sourced from Granicus transcripts, Legistar council matters, and Chicago ELMS, plus historical CivicPlus records. Records are collected daily and published as monthly immutable snapshots.

As of April 2026, the council_decisions, zoning_variances, and environmental_reviews tables are actively populated. Legistar-sourced records can use an HTML collection path for cities where the JSON API is no longer accessible, and California environmental reviews are sourced from the CEQANet registry. EPA EIS records are not yet available; see known limitations for details.

Every record inherits the full APRS envelope (record_id, chunk_id, bitemporal fields, confidence_score, provenance) and carries the join keys documented below.

Dataset-specific fields

Field	Type	Nullable	Description
`jurisdiction_slug`	string	no	Civic jurisdiction identifier (`city-slug-state`).
`h3_index`	string	yes	H3 resolution-8 cell derived from meeting or parcel location.
`document_type`	enum	no	Record classification. See document types.
`source`	enum	no	Originating system (`granicus`, `legistar`, `legistar_html`, `civicplus`, `chicago_elms`, `ceqanet`, `manual`).
`source_id`	string	yes	Source-native identifier (Granicus clip ID, Legistar matter ID).
`committee_name`	string	yes	Committee or body that owned the proceeding.
`duration_min`	integer	yes	Meeting duration in minutes (transcripts only).
`word_count`	integer	yes	Transcript word count.
`summary`	text	yes	LLM-generated summary. Max 500 chars in `llm_text` view, full length in Parquet.
`raw_text`	text	yes	Source transcript or matter body.
`entities_extracted`	JSON array	yes	Extracted entities with role, sentiment, and URN. See entities.
`blockers`	JSON array	yes	Approval blockers from a controlled vocabulary. See blockers.
`contingency_dag`	JSON object	yes	DAG of conditional approvals. See contingency DAG.
`language_signals`	JSON array	yes	Detected topical signals from a controlled vocabulary.
`sentiment_polarity`	enum	no	`positive`, `neutral`, or `negative`.
`topic_velocity`	numeric	yes	Rate-of-mentions signal for this topic in this jurisdiction.
`momentum_score`	numeric [0,1]	yes	Confidence-weighted aggregate score.
`upzoning_probability`	numeric [0,1]	yes	Classifier-estimated probability of density-increasing zoning change. See scores.
`hostility_index`	numeric [0,1]	yes	Aggregate of hostile-sentiment mentions and opposition-coded signals.
`litigation_risk_score`	numeric [0,1]	yes	Probability of formal legal challenge within 24 months.
`source_url`	URL	yes	Public-facing source link (council packet PDF, meeting recording).

Document types

Value	Description
`council_meeting`	Full council or board meeting (transcript or minutes)
`zoning_vote`	Council or Planning Commission vote on a zoning item
`rezoning_hearing`	Public hearing on a proposed rezoning
`variance_hearing`	Zoning board of adjustment or variance hearing
`environmental_review`	CEQA/NEPA or state-level environmental review. California records sourced from CEQANet; federal EPA EIS records are not yet available.
`capital_improvement`	Capital improvement plan item
`code_enforcement`	Code enforcement proceeding
`tax_assessment`	Tax assessment appeal or action
`building_inspection`	Inspection outcome
`foia_request`	Filed FOIA or public-records request
`planning_matter`	Other planning matter (Legistar catch-all)
`civic_document`	Uncategorized civic document (low classification confidence)

Entities

The entities_extracted field contains an array of entities mentioned in the record. Each entity includes:

{
  "name": "Kenyatta Johnson",
  "role": "councilmember",
  "sentiment": "supportive",
  "entity_urn": "urn:aprs:entity:person:k-johnson-phila",
  "quote_span": [12481, 12723],
  "mention_count": 7
}

Role values: councilmember, mayor, planning_commissioner, zoning_board_member, developer, resident, attorney, city_agency_staff, state_agency_staff, nonprofit_representative, business_owner, expert_witness, other.

Sentiment values: supportive, favorable, neutral, concerned, opposed, hostile.

entity_urn is populated by the entity resolution pipeline. Records where resolution has not yet run will have null URNs.
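
As an illustration of working with this field, the following minimal Python sketch parses an entities_extracted value and keeps the entities whose sentiment leans against the proceeding. The grouping of concerned, opposed, and hostile into one set, the function name opposing_entities, and the second sample entity are assumptions made for the example and are not part of the dataset.

```python
import json

# Sentiment values (from the list above) treated as "opposing" in this
# example. The grouping is an assumption, not a dataset definition.
OPPOSING = {"concerned", "opposed", "hostile"}


def opposing_entities(entities_json):
    """Return (name, role, entity_urn) for entities with an opposing sentiment.

    Accepts the JSON text of entities_extracted, or None for records
    where the field is null.
    """
    entities = json.loads(entities_json) if entities_json else []
    return [
        (e["name"], e["role"], e.get("entity_urn"))
        for e in entities
        if e.get("sentiment") in OPPOSING
    ]


# First entity is the documented example; the second is made up and has a
# null URN, as happens when entity resolution has not yet run.
sample = json.dumps([
    {"name": "Kenyatta Johnson", "role": "councilmember",
     "sentiment": "supportive",
     "entity_urn": "urn:aprs:entity:person:k-johnson-phila",
     "quote_span": [12481, 12723], "mention_count": 7},
    {"name": "Example Resident", "role": "resident",
     "sentiment": "opposed", "entity_urn": None,
     "quote_span": [100, 180], "mention_count": 2},
])

result = opposing_entities(sample)
```

Because entity_urn can be null, code that joins on it should tolerate missing values, as the sketch does by passing None through.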

Blockers

Ordered array of blocker tags identified by the LLM as standing between a proceeding and final approval:

Tag	Meaning
`awaiting_eis` / `awaiting_ceqa`	Environmental review incomplete
`community_opposition`	Organized opposition beyond expected public comment
`litigation_threat` / `active_litigation`	Legal challenge threatened or in progress
`design_revision_required`	Design changes needed before approval
`affordability_covenant_negotiation`	Affordability terms under negotiation
`traffic_study_pending`	Traffic impact study not complete
`historic_preservation_review`	Historic district review required
`infrastructure_funding_gap`	Insufficient infrastructure funding
`inter_agency_coordination`	Requires another jurisdiction’s sign-off
`political_holdover`	No movement across multiple meetings with no stated reason

Contingency DAG

The contingency_dag field represents conditional approval chains as a directed acyclic graph:

{
  "nodes": [
    { "id": "n1", "label": "Council final vote", "status": "pending" },
    { "id": "n2", "label": "Planning Commission recommendation", "status": "approved", "date": "2024-02-01" },
    { "id": "n3", "label": "Traffic study", "status": "pending" }
  ],
  "edges": [
    { "from": "n2", "to": "n1" },
    { "from": "n3", "to": "n1" }
  ]
}

Node status values: pending, approved, denied, withdrawn, deferred.
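
One way to consume this structure is sketched below in Python, using the standard-library graphlib module. It assumes that an edge from A to B means A must resolve before B, which is how the example above reads; the helper names approval_order and open_prerequisites are hypothetical.

```python
from graphlib import TopologicalSorter

dag = {
    "nodes": [
        {"id": "n1", "label": "Council final vote", "status": "pending"},
        {"id": "n2", "label": "Planning Commission recommendation",
         "status": "approved", "date": "2024-02-01"},
        {"id": "n3", "label": "Traffic study", "status": "pending"},
    ],
    "edges": [
        {"from": "n2", "to": "n1"},
        {"from": "n3", "to": "n1"},
    ],
}


def approval_order(dag):
    """Return node ids ordered so every prerequisite precedes what it gates."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    predecessors = {node["id"]: set() for node in dag["nodes"]}
    for edge in dag["edges"]:
        predecessors[edge["to"]].add(edge["from"])
    return list(TopologicalSorter(predecessors).static_order())


def open_prerequisites(dag, target):
    """Return direct prerequisites of `target` that are not yet approved."""
    status = {node["id"]: node["status"] for node in dag["nodes"]}
    return sorted(
        edge["from"]
        for edge in dag["edges"]
        if edge["to"] == target and status[edge["from"]] != "approved"
    )


order = approval_order(dag)
blocking = open_prerequisites(dag, "n1")
```

For the example graph, the only open prerequisite of the council vote is the traffic study (n3). TopologicalSorter raises graphlib.CycleError if the edges contain a cycle, which gives a cheap validity check on the field.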

Scores

Upzoning probability

Classifier-estimated probability that the proceeding results in zoning changes allowing greater density or use intensity. Built with DistilBERT fine-tuned on council minutes with labeled upzoning outcomes. Features include language_signals, entity sentiment distribution, document_type, and historical base rate by jurisdiction.

>= 0.600 — flagged as “likely”
>= 0.850 — flagged as “highly likely”

The training corpus is weighted toward San Francisco, Philadelphia, and Chicago. Probability calibration for smaller jurisdictions may be less accurate.
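
The thresholds above can be applied client-side as in the following Python sketch. It assumes that the higher flag supersedes the lower one and that scores below 0.600, or null scores, carry no flag; the function name upzoning_flag is hypothetical.

```python
def upzoning_flag(probability):
    """Map upzoning_probability to its documented flag, or None if unflagged."""
    if probability is None:  # the field is nullable
        return None
    if probability >= 0.850:
        return "highly likely"
    if probability >= 0.600:
        return "likely"
    return None
```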

Hostility index

Aggregate of hostile-sentiment entity mentions and opposition-coded language signals, normalized to document length. The index is intended to predict procedural delay, not the outcome of the proceeding.

Litigation risk score

Probability of a formal legal challenge (lawsuit, appeal to state board) within 24 months. Built with a gradient boosted classifier trained on historic filings matched to upstream proceedings. Features include hostility_index, presence of attorney role in entities, litigation_threat blocker tag, and jurisdiction base rate.

>= 0.700 — flagged as “high risk”

Join keys

Key	Presence	Notes
`record_id`	always	APRS URN
`chunk_id`	always	Deterministic from `record_id`
`h3_index`	often	Null when no location is attributable
`event_id`	often	Present when mapped to Events Timeline; null for routine filings
`jurisdiction_slug`	always	Required on every record
`entity_urn`	via entities	Through `entities_extracted[].entity_urn`
`parcel_id`	sometimes	Present for zoning votes and variances

Example query

Find high litigation-risk zoning votes in Philadelphia:

SELECT
  record_id,
  occurred_at,
  summary,
  upzoning_probability,
  litigation_risk_score,
  entities_extracted
FROM read_parquet('civic-intelligence-2026-04.parquet')
WHERE jurisdiction_slug = 'philadelphia-pa'
  AND document_type = 'zoning_vote'
  AND litigation_risk_score >= 0.7
ORDER BY occurred_at DESC
LIMIT 10;

HTML collection mode

Legistar provides a JSON API for accessing council matters, but many cities have restricted API access behind authentication tokens. When the JSON API is unavailable for a jurisdiction, the collector can switch to an HTML collection mode that scrapes the same data from Legistar’s public web pages.

How it works

The HTML collector follows the same three-page navigation path a user would on a Legistar site:

Calendar page — the collector reads the meeting calendar to find upcoming and recent meetings for zoning-related bodies.
Meeting detail page — for each relevant meeting, the collector retrieves linked legislation items.
Legislation detail page — each legislation item is scraped for matter ID, file number, title, type, status, and key dates (introduced, on agenda, final action).

The resulting records are normalized into the same schema as JSON API records, so downstream consumers see no difference in field names or structure.

Zoning variance filtering

The zoning variance collector uses keyword-based filtering to identify relevant records from the full stream of council matters. A record is included if any of the following keywords match.

Body keywords (checked against the legislative body or committee name):

zoning
board of appeals
board of adjustment
planning commission
land use

Matter type keywords (checked against the matter type and title):

variance
conditional use
special permit
special use
cup

A record passes the filter if it matches on body name, matter type, or title. This is intentionally broad so that a variance filed under a generically named committee is still captured if the title mentions the relevant keywords.

These filters differ from the ones used by the council monitor, which targets broader development-related activity (rezoning, demolition, TIF districts, etc.). The zoning variance collector is narrower and focused specifically on variance and conditional-use proceedings.
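
The filter described above could look roughly like the following Python sketch. The keyword lists come from this page; case-insensitive substring matching and the function name is_variance_candidate are assumptions, since the collector's actual matching rules are not documented here.

```python
BODY_KEYWORDS = (
    "zoning", "board of appeals", "board of adjustment",
    "planning commission", "land use",
)
MATTER_KEYWORDS = (
    "variance", "conditional use", "special permit", "special use", "cup",
)


def _contains_any(text, keywords):
    """Case-insensitive substring match; None or empty text never matches."""
    lowered = (text or "").lower()
    return any(keyword in lowered for keyword in keywords)


def is_variance_candidate(body_name, matter_type, title):
    """True if the body name, matter type, or title matches a keyword."""
    return (
        _contains_any(body_name, BODY_KEYWORDS)
        or _contains_any(matter_type, MATTER_KEYWORDS)
        or _contains_any(title, MATTER_KEYWORDS)
    )
```

Under the substring assumption, the short keyword cup also matches inside words such as "occupancy". Whether the production collector guards against that is not stated, so treat this sketch as an upper bound on what the filter admits.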

Date handling

Records collected via the HTML path may include dates in m/d/YYYY format (for example, 3/15/2026) rather than the ISO format returned by the JSON API (2026-03-15T00:00:00). The collector normalizes all date formats before storage, so occurred_at and other date fields are always stored as ISO dates regardless of the collection path.
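
A minimal Python sketch of this kind of normalization follows. The function name normalize_date, the date-only output, and the choice to return None for unrecognized input are assumptions for illustration; the collector's actual implementation may differ.

```python
from datetime import datetime

# Input formats tried in order: the HTML path's m/d/YYYY, the JSON API's
# ISO timestamp, and a plain ISO date.
_FORMATS = ("%m/%d/%Y", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d")


def normalize_date(raw):
    """Return an ISO date string (YYYY-MM-DD), or None if unparseable."""
    if not raw or not raw.strip():
        return None
    text = raw.strip()
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None
```

Returning None for unrecognized labels mirrors the null date fields that can appear on legistar_html records.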

Identifying HTML-sourced records

Records collected via the HTML path have source set to legistar_html. You can use this to distinguish them from JSON API records (source = 'legistar') in queries:

SELECT source, count(*)
FROM read_parquet('civic-intelligence-2026-04.parquet')
WHERE document_type IN ('zoning_vote', 'variance_hearing')
GROUP BY source;

Signal extraction backend

LLM-based signal extraction (used to populate entities_extracted, language_signals, blockers, and item-level fields like outcome, units, and dollar_amount) runs on a configurable backend. Schema-constrained classification work defaults to a local model; the hosted Anthropic backend is opt-in.

Backend selection

The Granicus extractor selects a backend based on the CIVIC_EXTRACTOR environment variable:

Value	Backend	Notes
`ollama` (default)	Local Ollama runtime	Uses the `qwen3.6:35b-a3b` model with `format: json` and reasoning disabled (`think: false`) so JSON output is returned directly.
`anthropic`	Claude Haiku via the Anthropic API	Requires `ANTHROPIC_API_KEY`. Use for one-off audits or when the local runtime is unavailable.

When CIVIC_EXTRACTOR is unset or set to ollama, the extractor calls the local runtime only — there is no silent fallback to Anthropic. If the local runtime returns a non-200 response or invalid JSON, the record is skipped and extracted_signals remains null. Re-run the extractor after the local runtime is restored to populate skipped records.
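
The skip-on-failure behavior can be sketched in Python as follows. The request payload reflects the format: json and think: false options named above; the stream: false option, the function names, and the requirement that the decoded value be a JSON object are assumptions for this example rather than the extractor's actual code.

```python
import json


def build_generate_payload(model, prompt):
    """Request body for a local generate call (shape assumed)."""
    return {
        "model": model,
        "prompt": prompt,
        "format": "json",   # ask the runtime for JSON output
        "think": False,     # reasoning disabled so JSON is returned directly
        "stream": False,    # assumption: a single, non-streamed reply
    }


def parse_extraction(status_code, body_text):
    """Return the extracted signals as a dict, or None to skip the record.

    None is returned on a non-200 status or when the model output in the
    `response` field is not valid JSON. There is no fallback backend.
    """
    if status_code != 200:
        return None
    try:
        envelope = json.loads(body_text)
        signals = json.loads(envelope["response"])
    except (ValueError, KeyError, TypeError):
        return None
    return signals if isinstance(signals, dict) else None
```

A caller would leave extracted_signals null whenever parse_extraction returns None, so that a later re-run can pick the record up.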

Configuration

Variable	Default	Description
`CIVIC_EXTRACTOR`	`ollama`	Backend selector: `ollama` or `anthropic`.
`OLLAMA_BASE_URL`	`http://100.94.177.111:11434`	Ollama host. Override for local development or alternate inference hosts.
`OLLAMA_MODEL`	`qwen3.6:35b-a3b`	Ollama model tag. The model must accept the `think` option and return JSON in `response`.
`ANTHROPIC_API_KEY`	unset	Required only when `CIVIC_EXTRACTOR=anthropic`.
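
For reference, the variables above could be resolved as in this Python sketch. The defaults are taken from the table; the ExtractorConfig class, the load_config function, and the decision to raise on an unknown backend or a missing key are assumptions, not the extractor's documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractorConfig:
    backend: str
    ollama_base_url: str
    ollama_model: str
    anthropic_api_key: Optional[str]


def load_config(env=None):
    """Build an ExtractorConfig from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    backend = env.get("CIVIC_EXTRACTOR", "ollama")
    if backend not in ("ollama", "anthropic"):
        raise ValueError(f"unknown CIVIC_EXTRACTOR value: {backend!r}")
    api_key = env.get("ANTHROPIC_API_KEY")
    if backend == "anthropic" and not api_key:
        raise ValueError("ANTHROPIC_API_KEY is required for the anthropic backend")
    return ExtractorConfig(
        backend=backend,
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://100.94.177.111:11434"),
        ollama_model=env.get("OLLAMA_MODEL", "qwen3.6:35b-a3b"),
        anthropic_api_key=api_key,
    )
```

Passing a plain dict instead of os.environ keeps the loader easy to exercise in tests.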

Output schema

Both backends are prompted with the same schema, so downstream consumers — including the civic_decision_signals materialized view — observe identical field shapes regardless of which backend produced a record. Each item carries item_type, outcome, units, sqft, dollar_amount, h3_signal, development_sentiment, and high_signal.

Example

Run a one-off backfill against the hosted Anthropic backend:

export CIVIC_EXTRACTOR=anthropic
export ANTHROPIC_API_KEY=sk-ant-...
python -m groundswell council.granicus --city sf --since 2026-01-01

Run the same command against a locally hosted Ollama instance:

export CIVIC_EXTRACTOR=ollama
export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_MODEL=qwen3.6:35b-a3b
python -m groundswell council.granicus --city sf --since 2026-01-01

Known limitations

Jurisdictional coverage is uneven — dense for San Francisco, Philadelphia, and Boston; sparse for sunbelt growth markets.
The CivicPlus scraper has been retired. The upstream /AgendaCenter/ViewFile/... URL pattern it depended on is no longer served by the tracked city domains, so no new records with source = 'civicplus' are being ingested. Historical CivicPlus rows remain valid and queryable, and the civicplus value stays in the source enum for backward compatibility. Coverage will resume once a rebuilt scraper ships.
language_signals vocabulary is English-only. Bilingual meetings may underperform on signal extraction.
Transcripts lag real-time by 24–72 hours. Use occurred_at for event-time analysis, not ingested_at.
The legacy document_date field is retained for backward compatibility. Use occurred_at instead.
Legistar HTML-sourced records (source = 'legistar_html') may have null date fields when the source page uses non-standard labels. Most major cities — including New York City, Seattle, San Francisco, and Chicago — now resolve titles, sponsoring bodies, and dates through multiple fallback labels, but some jurisdictions may still return nulls. Filter on occurred_at IS NOT NULL if your query requires dates.
EPA EIS environmental reviews are not yet available. The environmental_review document type currently covers California CEQA filings (via CEQANet) only. Federal NEPA coverage is planned.
Records collected while the local extraction backend was unreachable have null extracted_signals and may also have empty raw_text if the upstream Granicus transcript fetch did not complete. Records with empty raw_text require both the upstream fetch and the extractor to be re-run before signals appear; the rest need only the extractor re-run. See signal extraction backend for backend configuration.
