Urban Street-Scene Perception and Renewal Strategies Powered by Vision–Language Models
Yao Yuhan; Dall'O' Giuliano
2026-01-01
Abstract
With rapid urbanization, urban renewal has become increasingly important. Traditional research has relied on expert assessments and objective indicators and lacks scalable frameworks that effectively translate street-level conditions into actionable renewal strategies. This study proposes a Vision–Language Model (VLM)-based framework to address these gaps, using the Hongshan Central District of Urumqi, China, as a case study. Specifically, we collected 4215 street-view images (SVIs) and employed VLMs to score six perceptual dimensions (safety, liveliness, beauty, wealthiness, depressiveness, and boringness) and to generate accompanying textual descriptions. The best-performing model, selected through validation against a 500-respondent perception survey, was used for spatial pattern and text mining analyses that inform targeted urban renewal strategies. Results show that (1) VLM assessments are highly consistent with human judgments across the six perceptual dimensions; (2) spatial clustering analysis delineated four distinct renewal priority tiers, confirming that the method can translate perceptual data into actionable spatial strategies; and (3) text mining of the VLM’s rationales revealed that areas with lower perceptual scores are predominantly characterized by deficiencies in foundational infrastructure and street-level order, providing explanatory evidence directly linked to the generated renewal priorities. This study provides a generative artificial intelligence (GAI)-driven, interpretable evaluation framework for urban renewal decision-making, facilitating precision-oriented and intelligent urban regeneration. © 2026 by the authors.

| File | Access | Version | Size | Format |
|---|---|---|---|---|
| land-15-00244 (1) (1).pdf | Open access | Publisher’s version | 6.53 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


