Generative AI and Public Data: A Framework for Secure Access to Transportation Safety Data

Democratizing Access to Transportation Safety Data with Generative AI

Transportation safety analysis is a complex process requiring the integration of crash records, roadway attributes, and geospatial data, often managed through intricate GIS-based workflows. However, access to this crucial information remains uneven across various agencies and community stakeholders. The high technical prerequisites needed to utilize analytical tools create a significant gap between data availability and the capacity of practitioners, local agencies, school committees, and residents to use it to address their safety concerns.

This disparity limits the ability to retrieve, filter, map, and analyze relevant data, which hinders effective planning and targeted interventions. Generative AI is a potential way to bridge this divide by offering a more intuitive interface. Its deployment in the public sector, however, raises fundamental questions about reliability, reproducibility, and data governance, all of which are critical for any application affecting public safety and trust.

A Structured Approach for Natural Language Interpretation

To address these challenges, the framework proposes a “schema-grounded” natural language interface for transportation safety analysis. The system uses a Large Language Model (LLM) only to interpret user intent; query execution is deterministic and verifiable, and runs against an authoritative database. This “bounded” design separates language interpretation from execution logic, so results are reproducible and anchored to the database schema, a fundamental requirement for critical public sector applications.
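
The separation described above can be illustrated with a minimal Python sketch. It is not the framework's actual code: the table whitelist, the `id` and `geom` column names, the JSON request shape, and the `build_proximity_query` helper are all assumptions made for illustration. The model's output is treated purely as data, and the SQL text comes from a fixed template, so the same request always yields the same query.

```python
import json

# Hypothetical whitelist of tables; a real deployment would derive this
# from the authoritative database schema.
ALLOWED_TABLES = {"crashes", "schools", "bus_stops", "crosswalks"}


def build_proximity_query(llm_output: str):
    """Turn a JSON request (standing in for LLM output) into SQL plus parameters.

    Identifiers are accepted only if they appear in the whitelist, and the
    distance is passed as a bound parameter, so free text produced by the
    model never becomes part of the SQL string.
    """
    request = json.loads(llm_output)
    target, near = request["target"], request["near"]
    distance_m = float(request["within_m"])

    for name in (target, near):
        if name not in ALLOWED_TABLES:
            raise ValueError(f"table not in schema: {name!r}")
    if distance_m <= 0:
        raise ValueError("distance must be positive")

    # Assumes each table has `id` and `geom` columns (hypothetical names).
    # ST_DWithin on geography values measures distance in meters.
    sql = (
        f"SELECT DISTINCT t.id FROM {target} AS t "
        f"JOIN {near} AS n "
        "ON ST_DWithin(t.geom::geography, n.geom::geography, %s) "
        "ORDER BY t.id"
    )
    return sql, (distance_m,)


request_text = '{"target": "crashes", "near": "schools", "within_m": 300}'
sql, params = build_proximity_query(request_text)
```

The `%s` placeholder follows the parameter style used by common PostgreSQL drivers for Python; the returned pair would be handed to the driver's execute call rather than interpolated by hand.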

The process translates user queries into structured “semantic frames,” which are then validated by a rule-based layer. The validated frames are compiled into a typed directed acyclic graph of spatial operations, which is finally executed against a PostGIS database. This approach mitigates the risks associated with the probabilistic nature of LLMs by ensuring that data operations are precise and conform to schema definitions.
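
The pipeline described above (frame, rule-based validation, typed operation graph) can be sketched as follows. This is a simplified illustration under stated assumptions, not the framework's implementation: the schema excerpt, the `SemanticFrame` fields, the four operation kinds, and their type signatures are all hypothetical.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Optional

# Hypothetical schema excerpt: table -> column -> allowed values.
SCHEMA = {
    "crashes": {"severity": {"fatal", "serious_injury", "minor_injury"}},
    "schools": {},
}


@dataclass
class SemanticFrame:
    target: str                        # table whose rows are returned
    filters: dict = field(default_factory=dict)
    near: Optional[str] = None         # optional layer for a proximity constraint
    within_m: Optional[float] = None   # distance in meters


def validate(frame):
    """Rule-based checks of a frame against the schema; returns a list of errors."""
    if frame.target not in SCHEMA:
        return [f"unknown table: {frame.target}"]
    errors = []
    columns = SCHEMA[frame.target]
    for column, value in frame.filters.items():
        if column not in columns:
            errors.append(f"unknown column: {frame.target}.{column}")
        elif value not in columns[column]:
            errors.append(f"value {value!r} not allowed for {column}")
    if frame.near is not None:
        if frame.near not in SCHEMA:
            errors.append(f"unknown table: {frame.near}")
        if frame.within_m is None or frame.within_m <= 0:
            errors.append("proximity constraint needs a positive distance")
    return errors


# Operation signatures: kind -> (input types, output type).
SIGNATURES = {
    "scan": ((), "rows"),
    "filter": (("rows",), "rows"),
    "buffer": (("rows",), "areas"),
    "intersect": (("rows", "areas"), "rows"),
}


@dataclass
class Op:
    kind: str
    inputs: tuple = ()
    params: dict = field(default_factory=dict)


def compile_frame(frame):
    """Compile a valid frame into named operations; reject invalid frames."""
    problems = validate(frame)
    if problems:
        raise ValueError("; ".join(problems))
    ops = {
        "scan_target": Op("scan", params={"table": frame.target}),
        "filter_target": Op("filter", ("scan_target",), dict(frame.filters)),
    }
    if frame.near is not None:
        ops["scan_near"] = Op("scan", params={"table": frame.near})
        ops["buffer_near"] = Op("buffer", ("scan_near",), {"meters": frame.within_m})
        ops["result"] = Op("intersect", ("filter_target", "buffer_near"))
    return ops


def execution_order(ops):
    """Type-check every edge, then return the operations in dependency order."""
    for name, op in ops.items():
        expected = SIGNATURES[op.kind][0]
        actual = tuple(SIGNATURES[ops[i].kind][1] for i in op.inputs)
        if actual != expected:
            raise TypeError(f"{name}: expected inputs {expected}, got {actual}")
    graph = {name: set(op.inputs) for name, op in ops.items()}
    return list(TopologicalSorter(graph).static_order())


frame = SemanticFrame("crashes", {"severity": "fatal"}, near="schools", within_m=300.0)
order = execution_order(compile_frame(frame))
```

In this sketch, a frame that fails validation never reaches the compiler, and an operation graph with mismatched input types or a cycle is rejected before anything touches the database. An executor would then walk `order` and map each operation to its PostGIS equivalent.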

Evaluation and Implications for Trustworthy AI

The framework was evaluated using a statewide Massachusetts transportation safety database that integrates crash records, roadway attributes, and geospatial layers, including schools, bus stops, crosswalks, and municipal boundaries. All test queries executed successfully. A significant finding is that the validation layer corrected errors in 29% of the queries, highlighting the gap between the flexibility of natural language and the strict requirements of a database schema.
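
To make the idea of a correcting validation layer concrete, here is a small Python sketch of one kind of repair: mapping free-form wording onto a schema vocabulary and counting how often a repair was needed. The vocabulary, the synonym table, and the sample inputs are invented for illustration and are unrelated to the evaluation data or the reported 29% figure; the article does not describe which corrections the actual layer performs.

```python
# Hypothetical controlled vocabulary for a severity column.
ALLOWED_SEVERITY = {"fatal", "serious_injury", "minor_injury", "property_damage"}

# Hypothetical synonym rules mapping loose wording to schema values.
SYNONYMS = {"deadly": "fatal", "fatality": "fatal", "pdo": "property_damage"}


def repair_severity(value):
    """Return (schema_value, was_corrected); raise ValueError if no rule applies."""
    normalized = value.strip().lower().replace(" ", "_")
    if normalized in ALLOWED_SEVERITY:
        # Only case or spacing differed from the schema value, if anything.
        return normalized, normalized != value
    if normalized in SYNONYMS:
        return SYNONYMS[normalized], True
    raise ValueError(f"cannot map {value!r} to the schema vocabulary")


def correction_rate(values):
    """Fraction of inputs that needed a correction to match the schema."""
    corrected = sum(1 for value in values if repair_severity(value)[1])
    return corrected / len(values)


# Illustrative inputs only; not drawn from the evaluation.
sample = ["fatal", "Deadly", "minor_injury", "pdo"]
rate = correction_rate(sample)
```

Because every repair is a fixed rule rather than a model guess, the same input is always corrected the same way, and inputs that no rule covers fail loudly instead of being silently reinterpreted.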

This result underscores the importance of a robust validation mechanism to effectively translate human intent into precise database operations. The combination of natural language accessibility and deterministic execution represents a practical direction for broadening access to transportation safety data, with significant implications for the development of trustworthy AI in public sector planning. For organizations considering the deployment of similar AI solutions, especially in self-hosted or air-gapped contexts, the ability to maintain data sovereignty and execution transparency is a key factor. AI-RADAR offers analytical frameworks on /llm-onpremise to evaluate these trade-offs.

Future Prospects for Data Governance with LLMs

The adoption of Large Language Models in the public sector, particularly for managing sensitive data such as safety information, requires careful consideration of governance and compliance. This framework demonstrates how the power of LLMs can be leveraged to improve accessibility while maintaining rigorous control over the accuracy and reproducibility of results. The separation between interpretation and execution is a model that can be replicated in other contexts where trust and verifiability are paramount.

Investing in self-hosted infrastructure to support such systems, including robust databases and compute capacity for LLMs, can offer public agencies greater control over the Total Cost of Ownership (TCO) and data security. This approach not only facilitates access to information for a broader audience but also strengthens institutions' ability to make informed decisions based on reliable data, promoting responsible and transparent use of artificial intelligence.