Authors - Linda Sara Mathew, Anna Irene Ditto, Anna Keerthana V, Cristal James Tomy

Abstract - Accurate, real-time crop mapping and yield prediction are essential for agricultural planning, food security, and climate-resilient decision-making. Conventional field surveys are slow, expensive, and inconsistent, whereas the growing availability of multispectral, hyperspectral, and SAR satellite imagery has made automated crop monitoring feasible. Nevertheless, operational methods still face significant limitations: reduced accuracy under cloud cover, inadequate modeling of the complex temporal dynamics of crop growth, difficulty handling mixed pixels in smallholder landscapes, and the absence of a unified framework that integrates optical, SAR, and phenological data. Although recent studies have explored deep spatio-temporal models for rice mapping, SAR-optical fusion, mixed-pixel decomposition, temporal attention networks, multi-GPU U-Net architectures, and phenology-based yield estimation, none offers a comprehensive, scalable framework. This study proposes a Multimodal Deep Spatio-Temporal Framework that combines multispectral imagery, SAR imagery, and phenological data for automated crop mapping and yield prediction. By integrating CNN-LSTM encoders, attention-based temporal convolutional networks (TCNs), adaptive mixed-pixel processing, multimodal fusion, and multi-GPU segmentation, the framework aims to deliver a robust, scalable agricultural intelligence system for real-time monitoring at regional and national scales.
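To make the fusion idea above concrete, the following is a minimal NumPy sketch of attention-weighted temporal fusion of per-time-step optical and SAR features. All shapes, the concatenation-based fusion, and the linear attention scoring are illustrative assumptions, not the authors' implementation (which uses learned CNN-LSTM encoders and TCNs).

```python
import numpy as np

# Illustrative sketch only: simplified attention-based temporal fusion of
# optical and SAR feature sequences. Dimensions and the scoring function
# are assumptions made for this example.
rng = np.random.default_rng(0)

T, D = 12, 8                       # 12 time steps, 8 features per modality
optical = rng.normal(size=(T, D))  # stand-in for per-step optical (CNN) features
sar = rng.normal(size=(T, D))      # stand-in for per-step SAR features (cloud-robust)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Early fusion: concatenate the two modalities at each time step.
fused = np.concatenate([optical, sar], axis=1)   # shape (T, 2D)

# Score each time step with a (here: random) attention vector, then normalize.
w = rng.normal(size=fused.shape[1])
alpha = softmax(fused @ w)                       # (T,) temporal attention weights

# Attention-weighted summary for downstream mapping / yield regression.
summary = alpha @ fused                          # shape (2D,)

print(alpha.shape, round(float(alpha.sum()), 6), summary.shape)
```

In a trained system the attention vector and encoders would be learned end-to-end; here the weights are random purely to show the data flow and shapes.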