Authors - Gia Nghi Thoi, My An Tran, Tram Thi Tuyet Le, Nhat Van Hoang Nguyen, Long Hong Buu Nguyen, Dien Dinh

Abstract - Medical diagnosis using Small Language Models (SLMs) often suffers from hallucinations and knowledge inconsistency. While reinforcement learning (RL) from knowledge graph feedback offers a potential solution, pure RL strategies often encounter challenges related to sample inefficiency and poor exploration. To address this, a hybrid training pipeline that combines supervised alignment with structural reinforcement is proposed. The method applies knowledge-guided supervised fine-tuning (SFT) with hard negatives to refine decision boundaries and employs a bipartite-specific reward model to capture interactions between symptoms and diseases. Experiments on multiple medical datasets, including DXY, GMD, and MED-D, demonstrate that this hybrid approach outperforms pure RL methods. By incorporating knowledge graph (KG) information as a structural regularizer, the model achieves improved accuracy, stronger cross-dataset generalization, and reduced overfitting while maintaining strict adherence to diagnostic output constraints.
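
Illustrative sketch - The abstract does not specify the reward or the hard-negative selection, so the following is only a minimal sketch of how a reward over a bipartite symptom-disease graph and KG-based hard negatives could look. Everything in it is an assumption made for illustration: the toy graph `KG`, the Jaccard overlap score, the mixing weight `alpha`, the penalty of -1.0 for out-of-vocabulary outputs, and the function names `kg_reward` and `hard_negatives`. It is not the authors' implementation.

```python
from typing import Dict, List, Set

# Hypothetical toy bipartite KG: each disease maps to its associated symptoms.
KG: Dict[str, Set[str]] = {
    "influenza": {"fever", "cough", "fatigue"},
    "common cold": {"cough", "sneezing", "runny nose"},
    "gastroenteritis": {"nausea", "diarrhea", "vomiting"},
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two symptom sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def kg_reward(symptoms: Set[str], predicted: str, gold: str,
              kg: Dict[str, Set[str]], alpha: float = 0.5) -> float:
    """Assumed reward: correctness mixed with KG support for the prediction.

    A prediction outside the disease vocabulary violates the output
    constraint and receives a fixed penalty (an assumption of this sketch).
    """
    if predicted not in kg:
        return -1.0
    correctness = 1.0 if predicted == gold else 0.0
    support = jaccard(symptoms, kg[predicted])
    return alpha * correctness + (1.0 - alpha) * support


def hard_negatives(gold: str, kg: Dict[str, Set[str]], k: int = 2) -> List[str]:
    """Assumed heuristic: wrong diseases whose symptoms overlap most with gold."""
    others = [d for d in kg if d != gold]
    others.sort(key=lambda d: jaccard(kg[gold], kg[d]), reverse=True)
    return others[:k]


if __name__ == "__main__":
    patient = {"fever", "cough"}
    print(kg_reward(patient, "influenza", "influenza", KG))
    print(kg_reward(patient, "common cold", "influenza", KG))
    print(hard_negatives("influenza", KG, k=1))
```

In this sketch, a wrong but KG-plausible diagnosis still earns a small positive reward, while an output outside the allowed disease set is penalized; the hard negatives are the confusable diseases that an SFT stage could contrast against the gold label.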