Urban Street-Scene Perception and Renewal Strategies Powered by Vision–Language Models

Yao Yuhan; Dall'O' Giuliano
2026-01-01

Abstract

Rapid urbanization has made urban renewal increasingly important. Traditional research has relied on expert assessments and objective indicators and lacks scalable frameworks that translate street-level conditions into actionable renewal strategies. This study proposes a Vision–Language Model (VLM)-based framework to address this gap, using the Hongshan Central District of Urumqi, China, as a case study. We collected 4215 street-view images (SVIs) and used VLMs to score six perceptual dimensions (safety, liveliness, beauty, wealthiness, depressiveness, and boringness) and to produce accompanying textual descriptions. The best-performing model, selected by validation against a 500-respondent perception survey, was then used for spatial pattern and text mining analyses to inform targeted urban renewal strategies. Results show that (1) VLM evaluations are highly consistent with human evaluations across the six perceptual dimensions; (2) spatial clustering analysis delineated four distinct renewal priority tiers, demonstrating that the method can translate perceptual data into actionable spatial strategies; and (3) text mining of the VLM's rationales revealed that areas with lower perceptual scores are predominantly characterized by deficiencies in foundational infrastructure and street-level order, providing explanatory evidence directly linked to the generated renewal priorities. This study provides a generative artificial intelligence (GAI)-driven, interpretable evaluation framework for urban renewal decision-making, facilitating precision-oriented and intelligent urban regeneration. © 2026 by the authors.
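The model-selection step described in the abstract (choosing the VLM that agrees best with the perception survey) can be illustrated with a minimal sketch. The abstract does not state which agreement metric the authors used. Spearman rank correlation averaged over dimensions is an assumption made here for illustration, and the function names, data layout, and toy model names are all hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant input: correlation is undefined, treat as 0
    return cov / (var_x * var_y) ** 0.5


def select_best_model(model_scores, human_scores):
    """Pick the model with the highest mean per-dimension Spearman rho.

    model_scores: {model_name: {dimension: [score per image]}}
    human_scores: {dimension: [mean survey rating per image]}
    (Assumed layout; the paper's actual data format is not given.)
    """
    mean_rho = {
        name: mean(spearman(scores[d], human_scores[d]) for d in human_scores)
        for name, scores in model_scores.items()
    }
    return max(mean_rho, key=mean_rho.get), mean_rho


# Toy data for two of the six dimensions; values are made up.
human = {"safety": [1, 2, 3, 4, 5], "beauty": [2, 1, 4, 3, 5]}
models = {
    "model_a": {"safety": [10, 20, 30, 40, 50], "beauty": [3, 2, 8, 5, 9]},
    "model_b": {"safety": [50, 40, 30, 20, 10], "beauty": [9, 10, 4, 6, 1]},
}
best, rho = select_best_model(models, human)
print(best, rho)
```

In this toy case `model_a` preserves the human ranking in both dimensions while `model_b` reverses it, so `model_a` is selected. A real validation would cover all six dimensions and might use a different agreement measure.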
Files in this record:
land-15-00244 (1) (1).pdf (publisher's version, open access, Adobe PDF, 6.53 MB)

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1308460