The Civic Intelligence dataset contains ~517K records sourced from Granicus transcripts, Legistar council matters, CivicPlus, and Chicago ELMS. Records are collected daily and published as monthly immutable snapshots.
As of April 2026, the council_decisions, zoning_variances, and environmental_reviews tables are actively populated. Legistar-sourced records can use an HTML collection path for cities where the JSON API is no longer accessible, and California environmental reviews are sourced from the CEQANet registry. EPA EIS records are not yet available; see known limitations for details.
Every record inherits the full APRS envelope (record_id, chunk_id, bitemporal fields, confidence_score, provenance) and carries the join keys documented below.
Dataset-specific fields
| Field | Type | Nullable | Description |
|---|---|---|---|
| jurisdiction_slug | string | no | Civic jurisdiction identifier (city-slug-state). |
| h3_index | string | yes | H3 resolution-8 cell derived from meeting or parcel location. |
| document_type | enum | no | Record classification. See document types. |
| source | enum | no | Originating system (granicus, legistar, legistar_html, civicplus, chicago_elms, ceqanet, manual). |
| source_id | string | yes | Source-native identifier (Granicus clip ID, Legistar matter ID). |
| committee_name | string | yes | Committee or body that owned the proceeding. |
| duration_min | integer | yes | Meeting duration in minutes (transcripts only). |
| word_count | integer | yes | Transcript word count. |
| summary | text | yes | LLM-generated summary. Max 500 chars in llm_text view, full length in Parquet. |
| raw_text | text | yes | Source transcript or matter body. |
| entities_extracted | JSON array | yes | Extracted entities with role, sentiment, and URN. See entities. |
| blockers | JSON array | yes | Approval blockers from a controlled vocabulary. See blockers. |
| contingency_dag | JSON object | yes | DAG of conditional approvals. See contingency DAG. |
| language_signals | JSON array | yes | Detected topical signals from a controlled vocabulary. |
| sentiment_polarity | enum | no | positive, neutral, or negative. |
| topic_velocity | numeric | yes | Rate-of-mentions signal for this topic in this jurisdiction. |
| momentum_score | numeric [0,1] | yes | Confidence-weighted aggregate score. |
| upzoning_probability | numeric [0,1] | yes | Classifier-estimated probability of density-increasing zoning change. See scores. |
| hostility_index | numeric [0,1] | yes | Aggregate of hostile-sentiment mentions and opposition-coded signals. |
| litigation_risk_score | numeric [0,1] | yes | Probability of formal legal challenge within 24 months. |
| source_url | URL | yes | Public-facing source link (council packet PDF, meeting recording). |
Document types
| Value | Description |
|---|---|
| council_meeting | Full council or board meeting (transcript or minutes) |
| zoning_vote | Council or Planning Commission vote on a zoning item |
| rezoning_hearing | Public hearing on a proposed rezoning |
| variance_hearing | Zoning board of adjustment or variance hearing |
| environmental_review | CEQA/NEPA or state-level environmental review. California records sourced from CEQANet; federal EPA EIS records are not yet available. |
| capital_improvement | Capital improvement plan item |
| code_enforcement | Code enforcement proceeding |
| tax_assessment | Tax assessment appeal or action |
| building_inspection | Inspection outcome |
| foia_request | Filed FOIA or public-records request |
| planning_matter | Other planning matter (Legistar catch-all) |
| civic_document | Uncategorized civic document (low classification confidence) |
Entities
The entities_extracted field contains an array of entities mentioned in the record. Each entity includes:
{
  "name": "Kenyatta Johnson",
  "role": "councilmember",
  "sentiment": "supportive",
  "entity_urn": "urn:aprs:entity:person:k-johnson-phila",
  "quote_span": [12481, 12723],
  "mention_count": 7
}
Role values: councilmember, mayor, planning_commissioner, zoning_board_member, developer, resident, attorney, city_agency_staff, state_agency_staff, nonprofit_representative, business_owner, expert_witness, other.
Sentiment values: supportive, favorable, neutral, concerned, opposed, hostile.
entity_urn is populated by the entity resolution pipeline. Records where resolution has not yet run will have null URNs.
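As a minimal sketch of consuming this field, the snippet below collects resolved URNs for joins and totals mentions per sentiment value. The second entity in the sample payload and both helper names are hypothetical, added only to show how a null URN might be handled.

```python
import json
from collections import Counter

# Sample payload in the documented shape. The second entity is hypothetical
# and illustrates a record where entity resolution has not yet run.
entities_extracted = json.loads("""
[
  {"name": "Kenyatta Johnson", "role": "councilmember",
   "sentiment": "supportive",
   "entity_urn": "urn:aprs:entity:person:k-johnson-phila",
   "quote_span": [12481, 12723], "mention_count": 7},
  {"name": "Example Resident", "role": "resident",
   "sentiment": "opposed", "entity_urn": null,
   "quote_span": [500, 640], "mention_count": 2}
]
""")


def resolved_urns(entities):
    """Return URNs usable as join keys, skipping unresolved (null) ones."""
    return [e["entity_urn"] for e in entities if e.get("entity_urn")]


def mentions_by_sentiment(entities):
    """Total mention_count per sentiment value."""
    counts = Counter()
    for e in entities:
        counts[e["sentiment"]] += e.get("mention_count", 0)
    return dict(counts)


print(resolved_urns(entities_extracted))
print(mentions_by_sentiment(entities_extracted))
```

Skipping null URNs before joining avoids accidental matches between unrelated unresolved entities.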
Blockers
Ordered array of blocker tags identified by the LLM as standing between a proceeding and final approval:
| Tag | Meaning |
|---|---|
| awaiting_eis / awaiting_ceqa | Environmental review incomplete |
| community_opposition | Organized opposition beyond expected public comment |
| litigation_threat / active_litigation | Legal challenge threatened or in progress |
| design_revision_required | Design changes needed before approval |
| affordability_covenant_negotiation | Affordability terms under negotiation |
| traffic_study_pending | Traffic impact study not complete |
| historic_preservation_review | Historic district review required |
| infrastructure_funding_gap | Insufficient infrastructure funding |
| inter_agency_coordination | Requires another jurisdiction’s sign-off |
| political_holdover | No movement across multiple meetings with no stated reason |
Contingency DAG
The contingency_dag field represents conditional approval chains as a directed acyclic graph:
{
  "nodes": [
    { "id": "n1", "label": "Council final vote", "status": "pending" },
    { "id": "n2", "label": "Planning Commission recommendation", "status": "approved", "date": "2024-02-01" },
    { "id": "n3", "label": "Traffic study", "status": "pending" }
  ],
  "edges": [
    { "from": "n2", "to": "n1" },
    { "from": "n3", "to": "n1" }
  ]
}
Node status values: pending, approved, denied, withdrawn, deferred.
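One way to work with this structure is sketched below using Python's standard graphlib module. It assumes an edge from A to B means A must be resolved before B, which matches the example above but is not stated explicitly; the helper names are hypothetical.

```python
from graphlib import TopologicalSorter

dag = {
    "nodes": [
        {"id": "n1", "label": "Council final vote", "status": "pending"},
        {"id": "n2", "label": "Planning Commission recommendation",
         "status": "approved", "date": "2024-02-01"},
        {"id": "n3", "label": "Traffic study", "status": "pending"},
    ],
    "edges": [
        {"from": "n2", "to": "n1"},
        {"from": "n3", "to": "n1"},
    ],
}


def prerequisites(dag):
    """Map each node id to the set of node ids it directly depends on."""
    preds = {node["id"]: set() for node in dag["nodes"]}
    for edge in dag["edges"]:
        # Assumption: "from" must be resolved before "to".
        preds[edge["to"]].add(edge["from"])
    return preds


def open_prerequisites(dag, node_id):
    """Labels of direct prerequisites of node_id that are not yet approved."""
    by_id = {node["id"]: node for node in dag["nodes"]}
    return sorted(
        by_id[p]["label"]
        for p in prerequisites(dag)[node_id]
        if by_id[p]["status"] != "approved"
    )


# static_order() yields prerequisites before dependents and raises
# graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(prerequisites(dag)).static_order())

print(order)
print(open_prerequisites(dag, "n1"))
```

For the example graph, the only unapproved prerequisite of the council vote is the traffic study.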
Scores
Upzoning probability
Classifier-estimated probability that the proceeding results in zoning changes allowing greater density or use intensity. Built with DistilBERT fine-tuned on council minutes with labeled upzoning outcomes. Features include language_signals, entity sentiment distribution, document_type, and historical base rate by jurisdiction.
- >= 0.600: flagged as “likely”
- >= 0.850: flagged as “highly likely”
The training corpus is weighted toward San Francisco, Philadelphia, and Chicago. Probability calibration for smaller jurisdictions may be less accurate.
Hostility index
Aggregate of hostile-sentiment entity mentions and opposition-coded language signals, normalized by document length. The index predicts procedural delay, not the outcome of the proceeding.
Litigation risk score
Probability of a formal legal challenge (lawsuit, appeal to state board) within 24 months. Built with a gradient-boosted classifier trained on historical filings matched to upstream proceedings. Features include hostility_index, the presence of an attorney role in entities_extracted, the litigation_threat blocker tag, and the jurisdiction base rate.
- >= 0.700: flagged as “high risk”
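The flag thresholds for upzoning_probability and litigation_risk_score can be applied client-side as in the sketch below. The function names are hypothetical, and two behaviors are assumptions: the higher upzoning flag takes precedence when both thresholds are met, and scores below every threshold (or null scores) carry no flag.

```python
from typing import Optional


def upzoning_flag(probability: Optional[float]) -> Optional[str]:
    """Label an upzoning_probability using the documented thresholds."""
    if probability is None:  # the field is nullable
        return None
    if probability >= 0.850:
        return "highly likely"
    if probability >= 0.600:
        return "likely"
    return None


def litigation_flag(score: Optional[float]) -> Optional[str]:
    """Label a litigation_risk_score using the documented threshold."""
    if score is None:  # the field is nullable
        return None
    return "high risk" if score >= 0.700 else None


print(upzoning_flag(0.91), upzoning_flag(0.62), litigation_flag(0.74))
```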
Join keys
| Key | Presence | Notes |
|---|---|---|
| record_id | always | APRS URN |
| chunk_id | always | Deterministic from record_id |
| h3_index | often | Null when no location is attributable |
| event_id | often | Present when mapped to Events Timeline; null for routine filings |
| jurisdiction_slug | always | Required on every record |
| entity_urn | via entities | Through entities_extracted[].entity_urn |
| parcel_id | sometimes | Present for zoning votes and variances |
Example query
Find high litigation-risk zoning votes in Philadelphia:
SELECT
  record_id,
  occurred_at,
  summary,
  upzoning_probability,
  litigation_risk_score,
  entities_extracted
FROM read_parquet('civic-intelligence-2026-04.parquet')
WHERE jurisdiction_slug = 'philadelphia-pa'
  AND document_type = 'zoning_vote'
  AND litigation_risk_score >= 0.7
ORDER BY occurred_at DESC
LIMIT 10;
HTML collection mode
Legistar provides a JSON API for accessing council matters, but many cities have restricted API access behind authentication tokens. When the JSON API is unavailable for a jurisdiction, the collector can switch to an HTML collection mode that scrapes the same data from Legistar’s public web pages.
How it works
The HTML collector follows the same three-page navigation path a user would on a Legistar site:
- Calendar page — the collector reads the meeting calendar to find upcoming and recent meetings for zoning-related bodies.
- Meeting detail page — for each relevant meeting, the collector retrieves linked legislation items.
- Legislation detail page — each legislation item is scraped for matter ID, file number, title, type, status, and key dates (introduced, on agenda, final action).
The resulting records are normalized into the same schema as JSON API records, so downstream consumers see no difference in field names or structure.
Zoning variance filtering
The zoning variance collector uses keyword-based filtering to identify relevant records from the full stream of council matters. Records are included if any of the following match:
Body keywords, checked against the legislative body or committee name:
- zoning
- board of appeals
- board of adjustment
- planning commission
- land use
Matter type keywords, checked against the matter type and title:
- variance
- conditional use
- special permit
- special use
- cup
A record passes the filter if it matches on body name, matter type, or title. This is intentionally broad so that a variance filed under a generically named committee is still captured if the title mentions the relevant keywords.
These filters differ from the ones used by the council monitor, which targets broader development-related activity (rezoning, demolition, TIF districts, etc.). The zoning variance collector is narrower and focused specifically on variance and conditional-use proceedings.
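The filter described above can be approximated as follows. This is a sketch, not the collector's code: the function name is hypothetical, and matching case-insensitively on whole words is an assumption (plain substring matching would let "cup" match "occupancy", and would also let "zoning" match "rezoning").

```python
import re

BODY_KEYWORDS = (
    "zoning", "board of appeals", "board of adjustment",
    "planning commission", "land use",
)
MATTER_KEYWORDS = (
    "variance", "conditional use", "special permit", "special use", "cup",
)


def _compile(keywords):
    # Assumption: whole-word, case-insensitive matching.
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


_BODY_RE = _compile(BODY_KEYWORDS)
_MATTER_RE = _compile(MATTER_KEYWORDS)


def is_variance_candidate(body_name, matter_type, title):
    """True if the body name, matter type, or title matches a keyword."""
    if _BODY_RE.search(body_name or ""):
        return True
    return any(_MATTER_RE.search(text or "") for text in (matter_type, title))


print(is_variance_candidate("Committee on Finance", "Ordinance",
                            "Variance for 123 Main St"))
```

Because any one of the three fields is enough, a variance filed under a generically named committee still passes on its title alone.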
Date handling
Records collected via the HTML path may include dates in m/d/YYYY format (for example, 3/15/2026) rather than the ISO format returned by the JSON API (2026-03-15T00:00:00). The collector normalizes all date formats before storage, so occurred_at and other date fields are always stored as ISO dates regardless of the collection path.
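A minimal sketch of this kind of normalization is shown below. The function name and the list of accepted formats are assumptions, as is the choice to emit a date-only ISO string and to return None for values that cannot be parsed.

```python
from datetime import datetime
from typing import Optional

# Formats tried in order: HTML-path style, JSON API style, plain ISO date.
_FORMATS = ("%m/%d/%Y", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d")


def normalize_date(raw: Optional[str]) -> Optional[str]:
    """Return an ISO 8601 date string, or None if the value is unusable."""
    if raw is None or not raw.strip():
        return None
    for fmt in _FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    return None


print(normalize_date("3/15/2026"))
print(normalize_date("2026-03-15T00:00:00"))
```

Both inputs above describe the same day, so both calls produce the same stored value.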
Identifying HTML-sourced records
Records collected via the HTML path have source set to legistar_html. You can use this to distinguish them from JSON API records (source = 'legistar') in queries:
SELECT source, count(*)
FROM read_parquet('civic-intelligence-2026-04.parquet')
WHERE document_type IN ('zoning_vote', 'variance_hearing')
GROUP BY source;
Signal extraction backend
LLM-based signal extraction (used to populate entities_extracted, language_signals, blockers, and item-level fields like outcome, units, and dollar_amount) runs on a configurable backend. Schema-constrained classification work defaults to a local model; the hosted Anthropic backend is opt-in.
Backend selection
The Granicus extractor selects a backend based on the CIVIC_EXTRACTOR environment variable:
| Value | Backend | Notes |
|---|---|---|
| ollama (default) | Local Ollama runtime | Uses the qwen3.6:35b-a3b model with format: json and reasoning disabled (think: false) so JSON output is returned directly. |
| anthropic | Claude Haiku via the Anthropic API | Requires ANTHROPIC_API_KEY. Use for one-off audits or when the local runtime is unavailable. |
When CIVIC_EXTRACTOR is unset or set to ollama, the extractor calls the local runtime only — there is no silent fallback to Anthropic. If the local runtime returns a non-200 response or invalid JSON, the record is skipped and extracted_signals remains null. Re-run the extractor after the local runtime is restored to populate skipped records.
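The skip-on-failure rule can be illustrated with the sketch below. The function name and signature are hypothetical and no HTTP call is made; it only shows how a response might be mapped to either parsed signals or None, with no second backend tried.

```python
import json
from typing import Any, Optional


def parse_extraction(status_code: int, body: Optional[str]) -> Optional[Any]:
    """Return parsed signals, or None when the record should be skipped."""
    if status_code != 200:
        return None  # non-200 response: skip, do not fall back
    try:
        return json.loads(body)
    except (TypeError, ValueError):
        # Invalid JSON (json.JSONDecodeError is a ValueError) or a missing
        # body: skip the record so a later re-run can populate it.
        return None


print(parse_extraction(200, '{"high_signal": true}'))
print(parse_extraction(503, "{}"))
```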
Configuration
| Variable | Default | Description |
|---|---|---|
| CIVIC_EXTRACTOR | ollama | Backend selector: ollama or anthropic. |
| OLLAMA_BASE_URL | http://100.94.177.111:11434 | Ollama host. Override for local development or alternate inference hosts. |
| OLLAMA_MODEL | qwen3.6:35b-a3b | Ollama model tag. The model must accept the think option and return JSON in response. |
| ANTHROPIC_API_KEY | unset | Required only when CIVIC_EXTRACTOR=anthropic. |
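The variables and defaults in the table could be resolved along these lines. ExtractorConfig and load_config are hypothetical names, and raising an error for an unrecognized CIVIC_EXTRACTOR value is an assumption; the documented behavior covers only ollama and anthropic.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ExtractorConfig:
    backend: str
    ollama_base_url: str
    ollama_model: str
    anthropic_api_key: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> ExtractorConfig:
    """Resolve extractor settings from environment variables."""
    backend = env.get("CIVIC_EXTRACTOR", "ollama")
    if backend not in ("ollama", "anthropic"):
        # Assumption: unknown values are rejected rather than ignored.
        raise ValueError(f"unsupported CIVIC_EXTRACTOR value: {backend!r}")
    api_key = env.get("ANTHROPIC_API_KEY")
    if backend == "anthropic" and not api_key:
        raise ValueError(
            "ANTHROPIC_API_KEY is required when CIVIC_EXTRACTOR=anthropic"
        )
    return ExtractorConfig(
        backend=backend,
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://100.94.177.111:11434"),
        ollama_model=env.get("OLLAMA_MODEL", "qwen3.6:35b-a3b"),
        anthropic_api_key=api_key,
    )


print(load_config({"OLLAMA_BASE_URL": "http://localhost:11434"}))
```

Passing a plain dict instead of os.environ makes the resolution easy to check without touching the real environment.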
Output schema
Both backends are prompted with the same schema, so downstream consumers — including the civic_decision_signals materialized view — observe identical field shapes regardless of which backend produced a record. Each item carries item_type, outcome, units, sqft, dollar_amount, h3_signal, development_sentiment, and high_signal.
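A consumer that wants to confirm the shared shape could check for the eight item keys, as in this sketch. The helper name and the sample item values are hypothetical; the source lists the field names but not their types, so only key presence is checked.

```python
ITEM_FIELDS = (
    "item_type", "outcome", "units", "sqft", "dollar_amount",
    "h3_signal", "development_sentiment", "high_signal",
)


def missing_item_fields(item):
    """Return the documented item keys that are absent from item."""
    return [field for field in ITEM_FIELDS if field not in item]


# Hypothetical item with placeholder values, for illustration only.
sample_item = {
    "item_type": "rezoning",
    "outcome": "approved",
    "units": 120,
    "sqft": None,
    "dollar_amount": None,
    "h3_signal": None,
    "development_sentiment": "positive",
    "high_signal": True,
}

print(missing_item_fields(sample_item))
print(missing_item_fields({"item_type": "rezoning"}))
```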
Example
Run a one-off backfill against the hosted Anthropic backend:
export CIVIC_EXTRACTOR=anthropic
export ANTHROPIC_API_KEY=sk-ant-...
python -m groundswell council.granicus --city sf --since 2026-01-01
Run the same command against a locally hosted Ollama instance:
export CIVIC_EXTRACTOR=ollama
export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_MODEL=qwen3.6:35b-a3b
python -m groundswell council.granicus --city sf --since 2026-01-01
Known limitations
- Jurisdictional coverage is uneven: dense for San Francisco, Philadelphia, and Boston; sparse for sunbelt growth markets.
- language_signals vocabulary is English-only. Bilingual meetings may underperform on signal extraction.
- Transcripts lag real-time by 24–72 hours. Use occurred_at for event-time analysis, not ingested_at.
- The legacy document_date field is retained for backward compatibility. Use occurred_at instead.
- Legistar HTML-sourced records (source = 'legistar_html') may have null date fields when the source page uses non-standard labels. Most major cities, including New York City, Seattle, San Francisco, and Chicago, now resolve titles, sponsoring bodies, and dates through multiple fallback labels, but some jurisdictions may still return nulls. Filter on occurred_at IS NOT NULL if your query requires dates.
- EPA EIS environmental reviews are not yet available. The environmental_review document type currently covers California CEQA filings (via CEQANet) only. Federal NEPA coverage is planned.
- Records collected while the local extraction backend was unreachable have null extracted_signals and may also have empty raw_text if the upstream Granicus transcript fetch did not complete. These records require both the upstream fetch and the extractor to be re-run before signals appear. See signal extraction backend for backend configuration.