Authors - Thota Neha, Napa Sai Gopi, R. Aarthi

Abstract - The increasing realism of deepfake media has raised significant concerns regarding the authenticity of digital content. Most existing detection methods rely on audio–visual fusion, which often introduces additional complexity and may degrade performance when one modality is unavailable or unreliable. This work presents a dual-stream deep learning framework that processes audio and video independently, avoiding explicit fusion. The audio stream employs a CNN–BiLSTM model on log-Mel spectrograms to capture temporal and spectral artifacts, while the video stream uses EfficientNet-B0 with a BiLSTM to model spatial inconsistencies and temporal variations in facial sequences. Experiments conducted on multiple benchmark datasets, including ASVspoof 2019, WaveFake, LJSpeech, FaceForensics++, and Celeb-DF (v2), demonstrate that the proposed approach achieves competitive detection performance. In addition, the framework remains robust under missing-modality conditions and offers improved interpretability compared to fusion-based methods. These results indicate that independent modality-specific learning provides a practical and effective alternative for deepfake detection in real-world scenarios.
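To make the dual-stream design concrete, the following is a minimal PyTorch sketch of two independent streams of the kind the abstract describes: a small CNN over log-Mel spectrograms followed by a BiLSTM for audio, and per-frame EfficientNet-B0 features followed by a BiLSTM for video. All layer widths, kernel sizes, sequence handling, and the binary classification heads are illustrative assumptions, not the authors' exact configuration.

```python
# A hedged sketch of the dual-stream idea: audio and video are scored by
# separate models, with no fusion step. Hyperparameters are assumptions.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0

class AudioStream(nn.Module):
    """CNN over log-Mel spectrogram frames, then a BiLSTM over time."""
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(64 * (n_mels // 4), hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)  # real vs. fake logits

    def forward(self, spec):                  # spec: (B, 1, n_mels, T)
        f = self.cnn(spec)                    # (B, 64, n_mels/4, T/4)
        f = f.permute(0, 3, 1, 2).flatten(2)  # (B, T/4, 64 * n_mels/4)
        out, _ = self.lstm(f)
        return self.head(out[:, -1])          # classify from last step

class VideoStream(nn.Module):
    """Per-frame EfficientNet-B0 features, then a BiLSTM over frames."""
    def __init__(self, hidden=128):
        super().__init__()
        backbone = efficientnet_b0(weights=None)
        self.features = nn.Sequential(backbone.features,
                                      nn.AdaptiveAvgPool2d(1),
                                      nn.Flatten())
        self.lstm = nn.LSTM(1280, hidden,     # B0 feature width is 1280
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)

    def forward(self, frames):                # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        f = self.features(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(f)
        return self.head(out[:, -1])

# Each stream is trained and evaluated on its own, so if one modality is
# missing at test time the other stream's prediction is still available.
audio_model, video_model = AudioStream(), VideoStream()
audio_logits = audio_model(torch.randn(2, 1, 80, 200))  # dummy spectrograms
video_logits = video_model(torch.randn(2, 8, 3, 224, 224))  # dummy frames
```

Because the two streams never share parameters or a fusion layer, each can be inspected independently, which is the property the abstract credits for robustness under missing-modality conditions.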