Authors - Samiksha M, Sharanya G S, Shrina Anahosur, Surabhi K C, Surabhi Narayan

Abstract - Multi-angle image synthesis is central to the generation of 3D scenes, yet current methods are either computationally expensive or lack photorealism in their outputs. We propose a novel sketch- and text-based multiview image generation approach that addresses these problems by making efficient use of multimodal diffusion models. Our pipeline utilises DreamShaper v8 to convert the input sketch and text into a photorealistic 2D image, then passes this 2D image to a fine-tuned Zero123plus model for the final generation of consistent multiview images, achieving a 43.69% improvement in overall perceptual quality over baseline sketch-to-multiview models. Moreover, our pipeline scales flexibly, generating anywhere from 6 to 64 consistent multiview images according to the requirements of downstream tasks. We demonstrate the effectiveness of our pipeline through extensive experiments conducted using voxel-based grid approaches and Neural Radiance Fields (NeRF). Our pipeline greatly reduces computational costs while maintaining photorealism in the outputs, confirming the potential of sketch- and text-based multimodal conditioning as an intuitive and efficient paradigm for controlled 3D content generation.