Authors - Meixin Hu, Chuanchen BI Abstract - Speech synthesis is an important tool for improving human-computer interaction, accessibility, and other multimedia applications. Traditional Text-to-Speech (TTS) systems suffer from a robotic tone, slow inference, and a lack of expressiveness. This study implements a neural TTS system and examines its effectiveness. The system combines FastSpeech 2, the underlying neural model that generates high-quality utterances, with HiFi-GAN as the neural vocoder. FastSpeech 2 converts input text into mel-spectrograms, using variance adaptation to model pitch, duration, and energy. HiFi-GAN then converts these mel-spectrograms into natural-sounding speech in real time. The implementation showed that FastSpeech 2 generates mel-spectrograms quickly enough for real-time use and that HiFi-GAN produces natural-sounding utterances in real time. These results extend the potential use of FastSpeech 2 to virtual assistants, audiobooks, and accessible text services, highlighting its significance for advanced human-computer interaction systems.
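The duration part of the variance adaptation mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `length_regulate`, the toy feature values, and the example durations are all assumptions made for illustration. The idea shown is the length-regulation step commonly used in FastSpeech-style models, where each phoneme-level feature vector is repeated for as many mel-spectrogram frames as its predicted duration.

```python
from typing import List, Sequence


def length_regulate(phoneme_features: Sequence[Sequence[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Repeat each phoneme-level feature vector durations[i] times.

    Hypothetical helper: expands a phoneme-level sequence to a
    frame-level sequence, as a length regulator would.
    """
    if len(phoneme_features) != len(durations):
        raise ValueError("one duration is required per phoneme")
    frames: List[List[float]] = []
    for vector, n_frames in zip(phoneme_features, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector once per output frame.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Toy input: three phonemes with 2-dimensional features (made-up values).
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# Assumed predicted durations, measured in mel-spectrogram frames.
predicted_durations = [2, 0, 3]

frame_level = length_regulate(hidden, predicted_durations)
print(len(frame_level))  # total frames = 2 + 0 + 3
```

In a full system, pitch and energy information would also be added to these features before the mel-spectrogram is decoded and passed to the vocoder. That part is omitted here to keep the sketch small.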