Authors - Tri Wiyana, Roberto Tomahuw

Abstract - Mental health disorders are among the leading global health problems, and early diagnosis is key to effective management. Conventional methods rely on self-reported symptoms or clinical rating scales, so intervention often comes late. In this paper, we propose a multimodal AI framework for the early detection of mental health disorders from text and voice behaviors. Using the DAIC-WOZ dataset, we extract BERT-based linguistic embeddings from text transcripts and spectral features from the speech signals to capture verbal and acoustic cues. These features are then fused and passed to machine learning algorithms to classify depression. The proposed framework prioritizes non-invasive, privacy-conscious detection, with explainability techniques used to foster clinical confidence. We present experimental results showing that multimodal fusion improves classification over unimodal baselines. This study demonstrates the potential of AI-based, real-time methods for proactive mental health monitoring and provides a stepping stone toward healthcare deployment.
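The abstract describes a pipeline of BERT text embeddings, spectral audio features, and feature-level fusion for classification. A minimal sketch of such a pipeline follows; it is not the authors' implementation, and the model name (bert-base-uncased), the MFCC feature choice, and the logistic-regression classifier are illustrative assumptions.

```python
import numpy as np
import librosa
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Assumption: a generic pretrained BERT; the paper does not specify the checkpoint.
TOKENIZER = AutoTokenizer.from_pretrained("bert-base-uncased")
ENCODER = AutoModel.from_pretrained("bert-base-uncased")

def text_embedding(transcript: str) -> np.ndarray:
    """Mean-pooled BERT embedding of an interview transcript (768-dim)."""
    inputs = TOKENIZER(transcript, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        hidden = ENCODER(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

def audio_features(wav_path: str) -> np.ndarray:
    """Spectral summary features: per-coefficient MFCC means (13-dim).
    MFCCs stand in for the unspecified spectral features in the abstract."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, frames)
    return mfcc.mean(axis=1)

def fuse_and_classify(transcripts, wav_paths, labels):
    """Early fusion: concatenate both modalities, then fit a linear classifier
    for the binary depression label."""
    X = np.stack([np.concatenate([text_embedding(t), audio_features(w)])
                  for t, w in zip(transcripts, wav_paths)])
    return LogisticRegression(max_iter=1000).fit(X, labels)
```

Early (feature-level) fusion is only one of several possible fusion strategies; late fusion of per-modality classifiers would fit the abstract's description equally well.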