The Civic Intelligence dataset contains ~517K records sourced from Granicus transcripts, Legistar council matters, CivicPlus, and Chicago ELMS. Records are collected daily and published as monthly immutable snapshots.
As of April 2026, the council_decisions, zoning_variances, and environmental_reviews tables are actively populated. Legistar-sourced records can use an HTML collection path for cities where the JSON API is no longer accessible, and California environmental reviews are sourced from the CEQANet registry. EPA EIS records are not yet available; see known limitations for details.
Every record inherits the full APRS envelope (record_id, chunk_id, bitemporal fields, confidence_score, provenance) and carries the join keys documented below.

Dataset-specific fields

| Field | Type | Nullable | Description |
| --- | --- | --- | --- |
| jurisdiction_slug | string | no | Civic jurisdiction identifier (city-slug-state). |
| h3_index | string | yes | H3 resolution-8 cell derived from meeting or parcel location. |
| document_type | enum | no | Record classification. See document types. |
| source | enum | no | Originating system (granicus, legistar, legistar_html, civicplus, chicago_elms, ceqanet, manual). |
| source_id | string | yes | Source-native identifier (Granicus clip ID, Legistar matter ID). |
| committee_name | string | yes | Committee or body that owned the proceeding. |
| duration_min | integer | yes | Meeting duration in minutes (transcripts only). |
| word_count | integer | yes | Transcript word count. |
| summary | text | yes | LLM-generated summary. Max 500 chars in llm_text view, full length in Parquet. |
| raw_text | text | yes | Source transcript or matter body. |
| entities_extracted | JSON array | yes | Extracted entities with role, sentiment, and URN. See entities. |
| blockers | JSON array | yes | Approval blockers from a controlled vocabulary. See blockers. |
| contingency_dag | JSON object | yes | DAG of conditional approvals. See contingency DAG. |
| language_signals | JSON array | yes | Detected topical signals from a controlled vocabulary. |
| sentiment_polarity | enum | no | positive, neutral, or negative. |
| topic_velocity | numeric | yes | Rate-of-mentions signal for this topic in this jurisdiction. |
| momentum_score | numeric [0,1] | yes | Confidence-weighted aggregate score. |
| upzoning_probability | numeric [0,1] | yes | Classifier-estimated probability of density-increasing zoning change. See scores. |
| hostility_index | numeric [0,1] | yes | Aggregate of hostile-sentiment mentions and opposition-coded signals. |
| litigation_risk_score | numeric [0,1] | yes | Probability of formal legal challenge within 24 months. |
| source_url | URL | yes | Public-facing source link (council packet PDF, meeting recording). |

Document types

| Value | Description |
| --- | --- |
| council_meeting | Full council or board meeting (transcript or minutes) |
| zoning_vote | Council or Planning Commission vote on a zoning item |
| rezoning_hearing | Public hearing on a proposed rezoning |
| variance_hearing | Zoning board of adjustment or variance hearing |
| environmental_review | CEQA/NEPA or state-level environmental review. California records sourced from CEQANet; federal EPA EIS records are not yet available. |
| capital_improvement | Capital improvement plan item |
| code_enforcement | Code enforcement proceeding |
| tax_assessment | Tax assessment appeal or action |
| building_inspection | Inspection outcome |
| foia_request | Filed FOIA or public-records request |
| planning_matter | Other planning matter (Legistar catch-all) |
| civic_document | Uncategorized civic document (low classification confidence) |

Entities

The entities_extracted field contains an array of entities mentioned in the record. Each entity includes:
{
  "name": "Kenyatta Johnson",
  "role": "councilmember",
  "sentiment": "supportive",
  "entity_urn": "urn:aprs:entity:person:k-johnson-phila",
  "quote_span": [12481, 12723],
  "mention_count": 7
}
Role values: councilmember, mayor, planning_commissioner, zoning_board_member, developer, resident, attorney, city_agency_staff, state_agency_staff, nonprofit_representative, business_owner, expert_witness, other. Sentiment values: supportive, favorable, neutral, concerned, opposed, hostile.
entity_urn is populated by the entity resolution pipeline. Records where resolution has not yet run will have null URNs.
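As a minimal sketch of consuming this field, the Python snippet below parses an entities_extracted value and keeps entities with opposing sentiment while tolerating null URNs. The helper name opposing_entities, the grouping of opposed and hostile, and the second sample entity are illustrative assumptions, not part of the dataset.

```python
import json

# Assumption: "opposed" and "hostile" are grouped here purely for illustration.
OPPOSING = {"opposed", "hostile"}


def opposing_entities(entities_json):
    """Return (name, role, entity_urn) for entities with opposing sentiment.

    entities_json is the raw JSON text of entities_extracted, or None when
    the field is null for the record.
    """
    if entities_json is None:
        return []
    entities = json.loads(entities_json)
    return [
        (e["name"], e["role"], e.get("entity_urn"))  # URN may be None
        for e in entities
        if e.get("sentiment") in OPPOSING
    ]


sample = json.dumps([
    {"name": "Kenyatta Johnson", "role": "councilmember",
     "sentiment": "supportive",
     "entity_urn": "urn:aprs:entity:person:k-johnson-phila",
     "quote_span": [12481, 12723], "mention_count": 7},
    # Hypothetical entity whose URN has not been resolved yet.
    {"name": "Example Resident", "role": "resident",
     "sentiment": "opposed", "entity_urn": None,
     "quote_span": [100, 180], "mention_count": 2},
])

print(opposing_entities(sample))
```

Because unresolved entities carry a null URN, joins on entity_urn should expect to drop them until the resolution pipeline has run.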

Blockers

Ordered array of blocker tags identified by the LLM as standing between a proceeding and final approval:

| Tag | Meaning |
| --- | --- |
| awaiting_eis / awaiting_ceqa | Environmental review incomplete |
| community_opposition | Organized opposition beyond expected public comment |
| litigation_threat / active_litigation | Legal challenge threatened or in progress |
| design_revision_required | Design changes needed before approval |
| affordability_covenant_negotiation | Affordability terms under negotiation |
| traffic_study_pending | Traffic impact study not complete |
| historic_preservation_review | Historic district review required |
| infrastructure_funding_gap | Insufficient infrastructure funding |
| inter_agency_coordination | Requires another jurisdiction’s sign-off |
| political_holdover | No movement across multiple meetings with no stated reason |

Contingency DAG

The contingency_dag field represents conditional approval chains as a directed acyclic graph:
{
  "nodes": [
    { "id": "n1", "label": "Council final vote", "status": "pending" },
    { "id": "n2", "label": "Planning Commission recommendation", "status": "approved", "date": "2024-02-01" },
    { "id": "n3", "label": "Traffic study", "status": "pending" }
  ],
  "edges": [
    { "from": "n2", "to": "n1" },
    { "from": "n3", "to": "n1" }
  ]
}
Node status values: pending, approved, denied, withdrawn, deferred.
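One way to work with this structure is sketched below in Python, using the example object above. It assumes an edge from A to B means A must be resolved before B, which matches the example but is an interpretation rather than a documented guarantee. The function names are hypothetical.

```python
from graphlib import TopologicalSorter

dag = {
    "nodes": [
        {"id": "n1", "label": "Council final vote", "status": "pending"},
        {"id": "n2", "label": "Planning Commission recommendation",
         "status": "approved", "date": "2024-02-01"},
        {"id": "n3", "label": "Traffic study", "status": "pending"},
    ],
    "edges": [
        {"from": "n2", "to": "n1"},
        {"from": "n3", "to": "n1"},
    ],
}


def unmet_prerequisites(dag, target_id):
    """Return ids of direct prerequisites of target_id that are not approved."""
    status = {n["id"]: n["status"] for n in dag["nodes"]}
    return [
        e["from"] for e in dag["edges"]
        if e["to"] == target_id and status[e["from"]] != "approved"
    ]


def approval_order(dag):
    """Return node ids ordered so every prerequisite precedes its dependent."""
    sorter = TopologicalSorter({n["id"]: set() for n in dag["nodes"]})
    for e in dag["edges"]:
        sorter.add(e["to"], e["from"])  # e["from"] must come before e["to"]
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(sorter.static_order())


print(unmet_prerequisites(dag, "n1"))
print(approval_order(dag))
```

In the example, only the traffic study (n3) still stands between the proceeding and the council's final vote.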

Scores

Upzoning probability

Classifier-estimated probability that the proceeding results in zoning changes allowing greater density or use intensity. Built with DistilBERT fine-tuned on council minutes with labeled upzoning outcomes. Features include language_signals, entity sentiment distribution, document_type, and historical base rate by jurisdiction.
  • >= 0.600 — flagged as “likely”
  • >= 0.850 — flagged as “highly likely”
The training corpus is weighted toward San Francisco, Philadelphia, and Chicago. Probability calibration for smaller jurisdictions may be less accurate.

Hostility index

Aggregate of hostile-sentiment entity mentions and opposition-coded language signals, normalized to document length. Predicts procedural delay, not outcome.

Litigation risk score

Probability of a formal legal challenge (lawsuit, appeal to state board) within 24 months. Built with a gradient-boosted classifier trained on historic filings matched to upstream proceedings. Features include hostility_index, presence of an attorney role among the extracted entities, the litigation_threat blocker tag, and jurisdiction base rate.
  • >= 0.700 — flagged as “high risk”
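The flag thresholds in this section can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the highest matching threshold wins and that null scores are left unflagged.

```python
def upzoning_flag(probability):
    """Map upzoning_probability to its flag label, or None if unflagged."""
    if probability is None:  # score is nullable
        return None
    if probability >= 0.850:
        return "highly likely"
    if probability >= 0.600:
        return "likely"
    return None


def litigation_flag(score):
    """Map litigation_risk_score to "high risk", or None if unflagged."""
    if score is None:  # score is nullable
        return None
    return "high risk" if score >= 0.700 else None


print(upzoning_flag(0.91), upzoning_flag(0.62), litigation_flag(0.70))
```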

Join keys

| Key | Presence | Notes |
| --- | --- | --- |
| record_id | always | APRS URN |
| chunk_id | always | Deterministic from record_id |
| h3_index | often | Null when no location is attributable |
| event_id | often | Present when mapped to Events Timeline; null for routine filings |
| jurisdiction_slug | always | Required on every record |
| entity_urn | via entities | Through entities_extracted[].entity_urn |
| parcel_id | sometimes | Present for zoning votes and variances |

Example query

Find high litigation-risk zoning votes in Philadelphia:
SELECT
  record_id,
  occurred_at,
  summary,
  upzoning_probability,
  litigation_risk_score,
  entities_extracted
FROM read_parquet('civic-intelligence-2026-04.parquet')
WHERE jurisdiction_slug = 'philadelphia-pa'
  AND document_type = 'zoning_vote'
  AND litigation_risk_score >= 0.7
ORDER BY occurred_at DESC
LIMIT 10;

HTML collection mode

Legistar provides a JSON API for accessing council matters, but many cities have restricted API access behind authentication tokens. When the JSON API is unavailable for a jurisdiction, the collector can switch to an HTML collection mode that scrapes the same data from Legistar’s public web pages.

How it works

The HTML collector follows the same three-page navigation path a user would on a Legistar site:
  1. Calendar page — the collector reads the meeting calendar to find upcoming and recent meetings for zoning-related bodies.
  2. Meeting detail page — for each relevant meeting, the collector retrieves linked legislation items.
  3. Legislation detail page — each legislation item is scraped for matter ID, file number, title, type, status, and key dates (introduced, on agenda, final action).
The resulting records are normalized into the same schema as JSON API records, so downstream consumers see no difference in field names or structure.

Zoning variance filtering

The zoning variance collector uses keyword-based filtering to identify relevant records from the full stream of council matters. Records are included if any of the following match.
Body keywords, checked against the legislative body or committee name:
  • zoning
  • board of appeals
  • board of adjustment
  • planning commission
  • land use
Matter type keywords, checked against the matter type and title:
  • variance
  • conditional use
  • special permit
  • special use
  • cup
A record passes the filter if it matches on body name, matter type, or title. This is intentionally broad so that a variance filed under a generically named committee is still captured if the title mentions the relevant keywords.
These filters differ from the ones used by the council monitor, which targets broader development-related activity (rezoning, demolition, TIF districts, etc.). The zoning variance collector is narrower and focused specifically on variance and conditional-use proceedings.
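The filter described in this section can be sketched in Python as below. The keyword lists come from the text above; the function name and the case-insensitive substring matching are assumptions, since the collector's exact matching rule is not documented here.

```python
BODY_KEYWORDS = (
    "zoning", "board of appeals", "board of adjustment",
    "planning commission", "land use",
)
MATTER_KEYWORDS = (
    "variance", "conditional use", "special permit", "special use", "cup",
)


def _matches(text, keywords):
    # Assumption: case-insensitive substring match; null fields never match.
    text = (text or "").lower()
    return any(keyword in text for keyword in keywords)


def is_variance_record(body_name, matter_type, title):
    """True if the body name, matter type, or title matches a keyword."""
    return (
        _matches(body_name, BODY_KEYWORDS)
        or _matches(matter_type, MATTER_KEYWORDS)
        or _matches(title, MATTER_KEYWORDS)
    )


print(is_variance_record("Committee on Finance", "Resolution",
                         "Request for a variance at 12 Main St"))
```

Under substring matching, a short keyword such as cup also matches inside longer words like "occupancy", which fits the intentionally broad design but can admit unrelated matters; a word-boundary match would be a stricter alternative.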

Date handling

Records collected via the HTML path may include dates in m/d/YYYY format (for example, 3/15/2026) rather than the ISO format returned by the JSON API (2026-03-15T00:00:00). The collector normalizes all date formats before storage, so occurred_at and other date fields are always stored as ISO dates regardless of the collection path.
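A minimal Python sketch of this kind of normalization follows. The function name is hypothetical, and returning a date-only ISO string (rather than a full timestamp) is an assumption about the stored form.

```python
from datetime import datetime


def normalize_date(value):
    """Return an ISO 8601 date string (YYYY-MM-DD), or None if unparseable."""
    if not value:
        return None
    value = value.strip()
    parsers = (
        datetime.fromisoformat,                      # 2026-03-15T00:00:00
        lambda s: datetime.strptime(s, "%m/%d/%Y"),  # 3/15/2026
    )
    for parse in parsers:
        try:
            return parse(value).date().isoformat()
        except ValueError:
            continue  # try the next known format
    return None


print(normalize_date("3/15/2026"))
print(normalize_date("2026-03-15T00:00:00"))
```

Returning None for unrecognized input, instead of raising, mirrors the null date fields that HTML-sourced records can carry when a page uses an unexpected format.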

Identifying HTML-sourced records

Records collected via the HTML path have source set to legistar_html. You can use this to distinguish them from JSON API records (source = 'legistar') in queries:
SELECT source, count(*)
FROM read_parquet('civic-intelligence-2026-04.parquet')
WHERE document_type IN ('zoning_vote', 'variance_hearing')
GROUP BY source;

Signal extraction backend

LLM-based signal extraction (used to populate entities_extracted, language_signals, blockers, and item-level fields like outcome, units, and dollar_amount) runs on a configurable backend. Schema-constrained classification work defaults to a local model; the hosted Anthropic backend is opt-in.

Backend selection

The Granicus extractor selects a backend based on the CIVIC_EXTRACTOR environment variable:

| Value | Backend | Notes |
| --- | --- | --- |
| ollama (default) | Local Ollama runtime | Uses the qwen3.6:35b-a3b model with format: json and reasoning disabled (think: false) so JSON output is returned directly. |
| anthropic | Claude Haiku via the Anthropic API | Requires ANTHROPIC_API_KEY. Use for one-off audits or when the local runtime is unavailable. |

When CIVIC_EXTRACTOR is unset or set to ollama, the extractor calls the local runtime only — there is no silent fallback to Anthropic. If the local runtime returns a non-200 response or invalid JSON, the record is skipped and extracted_signals remains null. Re-run the extractor after the local runtime is restored to populate skipped records.
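The selection and skip behavior described above might look roughly like the Python sketch below. Both function names are hypothetical, and raising on an unknown value or a missing API key is an assumption; the source only states the default and the no-fallback rule.

```python
import json


def select_backend(env):
    """Pick the extraction backend from a mapping of environment variables."""
    # Assumption: an empty string is treated the same as unset.
    backend = env.get("CIVIC_EXTRACTOR") or "ollama"
    if backend not in ("ollama", "anthropic"):
        raise ValueError(f"unknown CIVIC_EXTRACTOR value: {backend!r}")
    if backend == "anthropic" and not env.get("ANTHROPIC_API_KEY"):
        raise ValueError("CIVIC_EXTRACTOR=anthropic requires ANTHROPIC_API_KEY")
    return backend


def signals_or_skip(status_code, body):
    """Return parsed signals, or None so the caller skips the record.

    None corresponds to leaving extracted_signals null; no other backend
    is tried.
    """
    if status_code != 200:
        return None
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return None


print(select_backend({}))
print(signals_or_skip(503, ""))
```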

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| CIVIC_EXTRACTOR | ollama | Backend selector: ollama or anthropic. |
| OLLAMA_BASE_URL | http://100.94.177.111:11434 | Ollama host. Override for local development or alternate inference hosts. |
| OLLAMA_MODEL | qwen3.6:35b-a3b | Ollama model tag. The model must accept the think option and return JSON in response. |
| ANTHROPIC_API_KEY | unset | Required only when CIVIC_EXTRACTOR=anthropic. |
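To illustrate how these variables could translate into a request, the Python sketch below reads the Ollama settings with the documented defaults and builds a request body with format: json and think: false. The use of Ollama's /api/generate endpoint, the stream setting, the prompt argument, and all names are assumptions; the extractor's actual call is not documented here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class OllamaSettings:
    base_url: str
    model: str


def ollama_settings(env):
    """Read Ollama settings from env, falling back to the documented defaults."""
    return OllamaSettings(
        base_url=env.get("OLLAMA_BASE_URL", "http://100.94.177.111:11434"),
        model=env.get("OLLAMA_MODEL", "qwen3.6:35b-a3b"),
    )


def generate_request(settings, prompt):
    """Build (url, json_body) for a non-streaming Ollama generate call."""
    body = {
        "model": settings.model,
        "prompt": prompt,
        "format": "json",  # request JSON output
        "think": False,    # reasoning disabled
        "stream": False,   # assumption: one response object, not a stream
    }
    url = f"{settings.base_url.rstrip('/')}/api/generate"
    return url, json.dumps(body)


url, body = generate_request(
    ollama_settings({"OLLAMA_BASE_URL": "http://localhost:11434"}),
    "Classify this agenda item.",
)
print(url)
```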

Output schema

Both backends are prompted with the same schema, so downstream consumers — including the civic_decision_signals materialized view — observe identical field shapes regardless of which backend produced a record. Each item carries item_type, outcome, units, sqft, dollar_amount, h3_signal, development_sentiment, and high_signal.

Example

Run a one-off backfill against the hosted Anthropic backend:
export CIVIC_EXTRACTOR=anthropic
export ANTHROPIC_API_KEY=sk-ant-...
python -m groundswell council.granicus --city sf --since 2026-01-01
Run the same command against a locally hosted Ollama instance:
export CIVIC_EXTRACTOR=ollama
export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_MODEL=qwen3.6:35b-a3b
python -m groundswell council.granicus --city sf --since 2026-01-01

Known limitations

  • Jurisdictional coverage is uneven — dense for San Francisco, Philadelphia, and Boston; sparse for sunbelt growth markets.
  • language_signals vocabulary is English-only. Bilingual meetings may underperform on signal extraction.
  • Transcripts lag real-time by 24–72 hours. Use occurred_at for event-time analysis, not ingested_at.
  • The legacy document_date field is retained for backward compatibility. Use occurred_at instead.
  • Legistar HTML-sourced records (source = 'legistar_html') may have null date fields when the source page uses non-standard labels. For most major cities, including New York City, Seattle, San Francisco, and Chicago, the collector now resolves titles, sponsoring bodies, and dates through multiple fallback labels, but some jurisdictions may still return nulls. Filter on occurred_at IS NOT NULL if your query requires dates.
  • EPA EIS environmental reviews are not yet available. The environmental_review document type currently covers California CEQA filings (via CEQANet) only. Federal NEPA coverage is planned.
  • Records collected while the local extraction backend was unreachable have null extracted_signals and may also have empty raw_text if the upstream Granicus transcript fetch did not complete. These records require both the upstream fetch and the extractor to be re-run before signals appear. See signal extraction backend for backend configuration.