From Street Views to Urban Science: Discovering Road Safety Factors with Multimodal Large Language Models

Tang, Yihong; Qu, Ao; Yu, Xujing; Deng, Weipeng; Ma, Jun; Zhao, Jinhua; Sun, Lijun

doi:10.1016/j.trc.2026.105692

Transportation Research Part C: Emerging Technologies, 188, 105692

UrbanX is an interpretable, MLLM-powered framework for hypothesis-driven urban scientific discovery.

Abstract

Urban and transportation research has long sought to uncover statistically meaningful relationships between key variables and societal outcomes such as road safety, aiming to generate actionable insights that guide the planning, development, and renewal of urban mobility systems. However, traditional workflows face several key challenges: (1) reliance on human experts to propose hypotheses, which can be time-consuming and prone to confirmation bias; (2) limited interpretability, particularly in deep learning approaches; and (3) underutilization of unstructured data that encodes critical urban context. To address these limitations, we propose UrbanX, a Multimodal Large Language Model (MLLM)-based framework for interpretable hypothesis inference that enables the automated generation, assessment, and refinement of hypotheses concerning urban form and transportation safety. Specifically, we leverage MLLMs to generate road safety-relevant questions and automatically answer them based on street view images (SVIs) through visual question answering (VQA). These responses are used to construct interpretable embeddings for each SVI, which are then incorporated into linear statistical models for transparent and explainable regression analysis. UrbanX supports iterative hypothesis testing and refinement guided by statistical evidence, such as coefficient significance, thereby enabling rigorous, transparent scientific discovery of previously overlooked correlations between urban design and transportation risk. We evaluate the framework on Manhattan street segments and demonstrate that it outperforms pretrained deep learning baselines while offering full interpretability. We further show that UrbanX matches or exceeds the explanatory power of existing expert-curated built environment variables, validating its potential to replace labor-intensive feature engineering with automated, scalable discovery of potential safety-related factors.
Beyond road safety, UrbanX can serve as a general-purpose foundation for hypothesis-driven urban mobility analysis, extracting structured insights from unstructured data across diverse socioeconomic and environmental outcomes. This approach establishes a scalable and trustworthy pathway for interpretable, data-driven scientific discovery in urban and transportation systems using foundation models such as MLLMs.

Method

UrbanX framework figure from the paper — The framework iterates through hypothesis generation, MLLM-based VQA embedding construction, and interpretable regression assessment.
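
The assessment step of this loop can be illustrated with a minimal sketch. It is not the authors' implementation: the two questions, the binary VQA answers, the crash counts, and the |t| > 2 cutoff are all made-up assumptions, and the MLLM calls are replaced by a hard-coded answer matrix. It only shows how per-image answers can act as an interpretable embedding whose regression coefficients are checked for significance.

```python
import math

def invert(matrix):
    """Invert a small square matrix with Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix (collinear features?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols_with_t_stats(features, outcome):
    """Fit OLS with an intercept; return (coefficients, t-statistics)."""
    n = len(features)
    design = [[1.0] + [float(v) for v in row] for row in features]
    p = len(design[0])
    xtx = [[sum(design[k][i] * design[k][j] for k in range(n)) for j in range(p)]
           for i in range(p)]
    xty = [sum(design[k][i] * outcome[k] for k in range(n)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    residuals = [outcome[k] - sum(design[k][j] * beta[j] for j in range(p))
                 for k in range(n)]
    sigma2 = sum(r * r for r in residuals) / (n - p)  # residual variance
    std_err = [math.sqrt(sigma2 * inv[i][i]) for i in range(p)]
    return beta, [b / s for b, s in zip(beta, std_err)]

# Hypothetical MLLM-generated questions (illustrative only, not from the paper).
questions = ["Does the road have more than two lanes?", "Are street trees visible?"]

# Toy interpretable embeddings: one row per street segment, one 0/1 VQA answer
# per question. A real pipeline would obtain these from an MLLM.
answers = [[0, 0], [0, 1], [0, 0], [0, 1], [1, 0], [1, 1], [1, 0], [1, 1]]
crashes = [2, 3, 3, 2, 8, 7, 8, 8]  # made-up outcome values per segment

beta, t_stats = ols_with_t_stats(answers, crashes)

# Keep hypotheses whose coefficient passes a rough |t| > 2 rule of thumb
# (an assumed cutoff; a real analysis would use proper p-values).
kept = [q for q, t in zip(questions, t_stats[1:]) if abs(t) > 2.0]
for q, b, t in zip(questions, beta[1:], t_stats[1:]):
    print(f"{q}  coef={b:.2f}  t={t:.2f}")
```

In an iterative setting, the questions in `kept` would be retained and the rejected ones replaced by newly generated hypotheses before refitting. With larger samples, a statistics library that reports exact p-values would be preferable to the fixed threshold used here.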

Results

Model performance comparison from the UrbanX paper — UrbanX improves road-safety prediction while preserving human-readable features.

Spearman correlation ablation result from the UrbanX paper — Rank-correlation comparison against direct VLM prediction and latent visual embeddings.
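
For readers unfamiliar with the metric, Spearman correlation is the Pearson correlation computed on ranks, so it rewards getting the ordering of segments right rather than the exact values. The sketch below is a generic stdlib illustration with made-up numbers; the function names are hypothetical and it does not reproduce the paper's evaluation code or results.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))

# Made-up observed and predicted risk scores for five segments.
observed = [1, 2, 3, 4, 5]
monotone_prediction = [10, 20, 25, 70, 300]  # same ordering, different scale
one_swap_prediction = [2, 1, 3, 4, 5]        # first two segments swapped

rho_monotone = spearman(observed, monotone_prediction)
rho_one_swap = spearman(observed, one_swap_prediction)
print(rho_monotone, rho_one_swap)
```

Because only ranks matter, the monotone prediction scores perfectly despite its very different scale, while a single swapped pair lowers the score.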

Hypothesis importance SHAP summary from the UrbanX paper — Top discovered hypotheses remain interpretable and map back to natural-language questions.

Hypothesis significance and independence figure from the UrbanX paper — Discovered factors are predictive, statistically significant, and weakly correlated.

VQA Case Study

Panoramic street-view image (SVI) used for VQA analysis in the paper. The open-source repo does not redistribute the full private image collection.

Cost-accuracy trade-off for VQA in UrbanX across model scales.

BibTeX

@article{tang2026street,
  title={From street views to urban science: Discovering road safety factors with multimodal large language models},
  author={Tang, Yihong and Qu, Ao and Yu, Xujing and Deng, Weipeng and Ma, Jun and Zhao, Jinhua and Sun, Lijun},
  journal={Transportation Research Part C: Emerging Technologies},
  volume={188},
  pages={105692},
  year={2026},
  publisher={Elsevier}
}