Thursday April 9, 2026 12:15pm - 2:15pm GMT+07

Authors - Sejal Vaishnav, Sanskrati Jain, Suman Dikshit, Vashvi Srivastava, Shailendra Sharma, Vaishnav Preeti Prakash, Vishal Shrivastava, Ram Babu Buri, Mohit Mishra
Abstract - Traditional object detection systems are limited in their ability to capture the complexity of urban scenes, often overlooking the critical spatial, contextual, and functional relationships between scene elements. This paper introduces Urban Scene Intelligence, a Semantic Anchor-and-Expand (SAE) framework that integrates multi-modal perception, structured scene graph construction, and controlled narrative generation to produce grounded descriptions of urban environments. The proposed modular architecture incorporates OWL-ViT for open-vocabulary object detection, SegFormer for semantic segmentation, DepthAnything for spatial depth estimation, Qwen2-VL for attribute enrichment, and OCR for extracting textual context. Unlike end-to-end multimodal models, the three-stage pipeline explicitly separates visual perception, symbolic reasoning, and language generation, thereby improving interpretability and factual grounding. By unifying heterogeneous visual cues into a symbolic representation and generating context-aware descriptions from this representation, the SAE framework establishes a transparent and extensible approach to urban scene understanding in complex real-world environments.
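The three-stage separation described in the abstract can be sketched in miniature. This is an illustrative sketch only, not the paper's implementation: the actual SAE pipeline uses OWL-ViT, SegFormer, DepthAnything, Qwen2-VL, and OCR, whereas here each stage is stubbed with plain data structures (all names below, such as `Detection` and `build_scene_graph`, are hypothetical) to show how perception, symbolic reasoning, and generation stay decoupled.

```python
# Hypothetical sketch of a three-stage perceive -> reason -> generate pipeline.
# Real perception models (OWL-ViT, DepthAnything, etc.) are replaced by stubs.
from dataclasses import dataclass, field

@dataclass
class Detection:
    """Stage 1 output: one perceived object (normally from a detector)."""
    label: str
    depth: float                      # estimated distance, arbitrary units
    attributes: list = field(default_factory=list)

def build_scene_graph(detections):
    """Stage 2: symbolic reasoning. Anchor each object as a node and
    expand with pairwise spatial relations derived from depth ordering."""
    nodes = [d.label for d in detections]
    edges = []
    for a in detections:
        for b in detections:
            if a is not b and a.depth < b.depth:
                edges.append((a.label, "in_front_of", b.label))
    return {"nodes": nodes, "edges": edges}

def describe(graph):
    """Stage 3: controlled narrative generation from the symbolic graph,
    so every generated claim is grounded in an explicit edge."""
    parts = [f"{s} is {r.replace('_', ' ')} {o}" for s, r, o in graph["edges"]]
    return ("Scene contains " + ", ".join(graph["nodes"]) + ". "
            + "; ".join(parts) + ".")

dets = [Detection("bus", 5.0, ["red"]), Detection("storefront", 12.0)]
print(describe(build_scene_graph(dets)))
# Scene contains bus, storefront. bus is in front of storefront.
```

Because the scene graph is an explicit intermediate, the generation stage can only state relations that the reasoning stage produced, which is the interpretability property the abstract claims for the real framework.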
Virtual Room B Bangkok, Thailand

