Authors - Hemamalini Siranjeevi, Swaminathan Venkatraman, Dharshini V, Gayathri A, Sushma Sri R

Abstract - Urban environments generate massive volumes of video data from surveillance and mobile sensors, necessitating efficient and intelligent summarization for smart city and transportation systems. This paper proposes a multimodal video summarization framework that moves beyond object-centric analysis toward high-level urban scene understanding. Unlike traditional methods that rely on low-level visual features or isolated object detection, the proposed approach captures contextual relationships and temporal continuity through a multi-stage pipeline. The system integrates multimodal perception, combining deep learning-based object detection, multi-object tracking, and acoustic analysis to preserve entity identities and environmental context. We employ relational inference and motion heuristics to model spatial and semantic interactions, which are then structured into a Dynamic Knowledge Graph (DKG) representing entities, interactions, and temporal events. A semantic synthesis module, powered by a transformer-based language model, generates concise, coherent, and semantically meaningful summaries. This architecture enables scalable, context-aware video summarization adaptable to real-world urban applications.
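To make the pipeline concrete, the sketch below shows one plausible shape for the Dynamic Knowledge Graph intermediate representation and its linearization into input for the language model. All names here (Entity, Event, DynamicKnowledgeGraph, to_prompt) are illustrative assumptions, not the authors' implementation; in a full system, trained detection, tracking, and acoustic models would populate the graph.

```python
# Hypothetical sketch of the DKG stage of the pipeline described above.
# Class and function names are placeholders, not the paper's actual code;
# real detectors/trackers (e.g., a CNN detector plus a multi-object tracker)
# would supply the entities and inferred interactions.
from dataclasses import dataclass, field

@dataclass
class Entity:
    track_id: int       # identity preserved across frames by the tracker
    label: str          # e.g., "car", "pedestrian"

@dataclass
class Event:
    timestamp: float    # seconds into the video
    subject: int        # track_id of the acting entity
    predicate: str      # interaction inferred by relational/motion heuristics
    target: int         # track_id of the affected entity

@dataclass
class DynamicKnowledgeGraph:
    """Nodes are tracked entities; edges are timestamped interactions."""
    entities: dict = field(default_factory=dict)   # track_id -> Entity
    events: list = field(default_factory=list)     # Event records

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.track_id] = entity

    def add_event(self, event: Event) -> None:
        self.events.append(event)

    def to_prompt(self) -> str:
        """Linearize the graph so a language model can synthesize a summary."""
        lines = []
        for ev in sorted(self.events, key=lambda ev: ev.timestamp):
            subj = self.entities[ev.subject].label
            obj = self.entities[ev.target].label
            lines.append(
                f"t={ev.timestamp:.1f}s: {subj}#{ev.subject} "
                f"{ev.predicate} {obj}#{ev.target}"
            )
        return "\n".join(lines)

# Minimal usage: two tracked entities and one inferred interaction.
dkg = DynamicKnowledgeGraph()
dkg.add_entity(Entity(track_id=1, label="pedestrian"))
dkg.add_entity(Entity(track_id=2, label="car"))
dkg.add_event(Event(timestamp=12.4, subject=1,
                    predicate="crosses_in_front_of", target=2))

# A transformer-based language model would consume this linearized graph;
# here we simply print the structured intermediate form.
print(dkg.to_prompt())
```

Linearizing the graph into timestamped subject-predicate-object lines keeps the language-model input compact while preserving the entity identities and temporal ordering that the tracking and relational-inference stages establish.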