Authors - Deepak T. Mane, Deepak R. More, Gopal D. Upadhye, Rucha C. Samant, Hemlata U. Karne, Suraksha Suryawanshi, Prem Borse

Abstract - Efficient vehicle type classification is vital for intelligent transportation systems, traffic monitoring, and urban mobility planning. This paper presents a real-time multimodal vehicle type classification system that uses both visual and acoustic data to identify and categorize vehicles such as cars, buses, trucks, and motorcycles from live video streams. The system integrates CNN-based and Transformer-based models for feature extraction across modalities, improving detection robustness under diverse lighting, weather, and traffic conditions. A lightweight preprocessing pipeline performs synchronized frame extraction, audio segmentation, and feature fusion with minimal latency in real-time environments. The architecture applies late fusion to the visual and audio features so that classification remains reliable when one modality is degraded, for example by low visibility or occlusion. Experimental evaluations show that the framework achieves a classification accuracy of 96.2% at 28 fps, outperforming unimodal baselines while sustaining real-time throughput. The system is suited to deployment in intelligent traffic surveillance, automated tolling, and urban safety analytics.
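
The late fusion of visual and audio outputs mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the fusion rule, so a weighted average of per-class probabilities is assumed here, and the names `CLASSES`, `late_fusion`, `predict`, and `visual_weight` are hypothetical.

```python
from typing import Dict

# Vehicle classes named in the abstract; the label strings are illustrative.
CLASSES = ("car", "bus", "truck", "motorcycle")


def late_fusion(
    visual_probs: Dict[str, float],
    audio_probs: Dict[str, float],
    visual_weight: float = 0.5,
) -> Dict[str, float]:
    """Fuse per-class probabilities from two unimodal classifiers.

    Assumption: fusion is a weighted average of class probabilities.
    The paper's actual fusion rule is not given in the abstract.
    """
    if not 0.0 <= visual_weight <= 1.0:
        raise ValueError("visual_weight must lie in [0, 1]")
    audio_weight = 1.0 - visual_weight
    fused = {
        c: visual_weight * visual_probs[c] + audio_weight * audio_probs[c]
        for c in CLASSES
    }
    total = sum(fused.values())
    if total <= 0.0:
        raise ValueError("fused scores must have a positive sum")
    # Renormalize so the fused scores form a probability distribution.
    return {c: p / total for c, p in fused.items()}


def predict(fused_probs: Dict[str, float]) -> str:
    """Return the class with the highest fused probability."""
    return max(fused_probs, key=fused_probs.get)


# Example: the visual model is uncertain (e.g. an occluded vehicle),
# while the audio model clearly favors "truck".
visual = {"car": 0.3, "bus": 0.3, "truck": 0.3, "motorcycle": 0.1}
audio = {"car": 0.1, "bus": 0.1, "truck": 0.7, "motorcycle": 0.1}
fused = late_fusion(visual, audio, visual_weight=0.5)
label = predict(fused)
```

In this example the audio scores break the tie left by the ambiguous visual scores. Lowering `visual_weight` when the video stream is known to be degraded would shift more of the decision onto the audio model.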