Authors - Sejal Vaishnav, Sanskrati Jain, Suman Dikshit, Vashvi Srivastava, Shailendra Sharma, Vaishnav Preeti Prakash, Vishal Shrivastava, Ram Babu Buri, Mohit Mishra

Abstract - Traditional object detection systems are limited in their ability to capture the complexity of urban scenes, often overlooking the spatial, contextual, and functional relationships required for holistic scene understanding. This paper introduces Urban Scene Intelligence, a Semantic Anchor-and-Expand (SAE) framework that integrates multi-modal perception, structured scene graph construction, and controlled narrative generation to produce grounded descriptions of urban environments. The proposed modular architecture incorporates OWL-ViT for open-vocabulary object detection, SegFormer for semantic segmentation, DepthAnything for spatial depth estimation, Qwen2-VL for attribute enrichment, and OCR for extracting textual context. Unlike end-to-end multimodal models, the three-stage pipeline explicitly separates visual perception, symbolic reasoning, and language generation, thereby improving interpretability and factual grounding. By unifying heterogeneous visual cues into a symbolic representation and generating context-aware descriptions from that representation, the SAE framework establishes a transparent and extensible approach to urban scene understanding in complex real-world environments.
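The three-stage separation sketched in the abstract (perception, symbolic scene graph, grounded generation) can be illustrated as follows. This is a minimal sketch, not the paper's implementation: the function names, the scene-graph schema, and the stubbed perception outputs are all hypothetical, and a real pipeline would replace the stub with calls to OWL-ViT, SegFormer, DepthAnything, Qwen2-VL, and an OCR engine.

```python
# Illustrative sketch of a three-stage perception -> scene graph ->
# description pipeline. All names and data shapes are hypothetical;
# the perception stage is stubbed with canned outputs in place of
# real model inference (OWL-ViT, DepthAnything, Qwen2-VL, OCR).

def perceive(image):
    """Stage 1: multi-modal perception (stubbed example outputs)."""
    return {
        "objects": [  # e.g. open-vocabulary detections (OWL-ViT)
            {"label": "bus", "box": (40, 60, 220, 180)},
            {"label": "traffic sign", "box": (300, 20, 340, 70)},
        ],
        "depth": {"bus": 12.5, "traffic sign": 8.0},     # e.g. DepthAnything
        "attributes": {"bus": ["red", "double-decker"]},  # e.g. Qwen2-VL
        "text": {"traffic sign": "STOP"},                 # e.g. OCR
    }

def build_scene_graph(percepts):
    """Stage 2: symbolic reasoning - fuse heterogeneous cues per object."""
    graph = {}
    for obj in percepts["objects"]:
        label = obj["label"]
        graph[label] = {
            "box": obj["box"],
            "depth_m": percepts["depth"].get(label),
            "attributes": percepts["attributes"].get(label, []),
            "text": percepts["text"].get(label),
        }
    return graph

def describe(graph):
    """Stage 3: controlled generation grounded only in the graph."""
    parts = []
    for label, node in graph.items():
        attrs = " ".join(node["attributes"])
        phrase = f"a {attrs} {label}".replace("  ", " ").strip()
        if node["text"]:
            phrase += f' reading "{node["text"]}"'
        if node["depth_m"] is not None:
            phrase += f" about {node['depth_m']:.0f} m away"
        parts.append(phrase)
    return "The scene contains " + "; ".join(parts) + "."

graph = build_scene_graph(perceive(image=None))
print(describe(graph))
```

Because the generator reads only the symbolic graph, every phrase in the output can be traced back to a specific perception cue, which is the factual-grounding property the abstract claims for the SAE design.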