Authors - Kaniska D, Shreya J V, Srinidhi K, Sudhakar K S, Bagavathi Sivakumar P, Krishna Priya G

Abstract - Language modeling of clinical text in healthcare demands both rich context and strong safeguards for sensitive patient information. Several large language models have shown strong clinical performance in documentation and summarization and have been released openly. However, these models can generate hallucinated or unverifiable outputs. Retrieval-augmented approaches mitigate this problem by grounding answers in retrieved evidence. Yet most existing systems rely on textual records alone and do not integrate diagnostic imaging systematically. In this paper, we propose a retrieval-grounded multimodal clinical modeling framework that unifies structured clinical text with imaging-derived contextual features. A patient-specific vector indexing approach restricts retrieval to each patient's own records, and a modality-aware visual analytics pipeline converts imaging outputs into structured signals that condition language generation. The entire framework runs fully offline, supporting privacy-preserving deployment in resource-limited clinical settings. Experimental results demonstrate stable multimodal integration and semantic alignment between the retrieved evidence and the generated output.
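The patient-specific ("isolated") retrieval idea can be illustrated with a minimal sketch. This is not the paper's implementation: the class and function names are invented, and a toy bag-of-words embedding stands in for whatever locally hosted embedding model an offline deployment would actually use. The key property shown is that each patient gets a separate index, so retrieval can never surface another patient's records.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words embedding (assumption); a real offline system
    # would use a locally hosted sentence-embedding model instead.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class PatientIndex:
    """One vector index per patient, so retrieval is isolated to that
    patient's records and never crosses patient boundaries."""

    def __init__(self):
        self.indexes = {}  # patient_id -> list of (vector, record)

    def add(self, patient_id, record_text):
        self.indexes.setdefault(patient_id, []).append(
            (embed(record_text), record_text))

    def retrieve(self, patient_id, query, k=2):
        qv = embed(query)
        scored = [(cosine(qv, v), rec)
                  for v, rec in self.indexes.get(patient_id, [])]
        scored.sort(key=lambda x: x[0], reverse=True)
        return [rec for score, rec in scored[:k] if score > 0]

idx = PatientIndex()
idx.add("p1", "chest x-ray shows mild cardiomegaly")
idx.add("p1", "patient reports chronic cough and fatigue")
idx.add("p2", "mri of the knee shows meniscal tear")
hits = idx.retrieve("p1", "cardiomegaly on chest imaging")
```

In a full pipeline, the retrieved records (together with structured signals derived from imaging) would be passed as grounding context to the language model, constraining generation to retrieved evidence.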