MULTIMODAL SENTIMENT ANALYSIS USING VISION-LANGUAGE TRANSFORMERS (VLTS) FOR SOCIAL MEDIA CONTENT

Om Prakash Singh, Neha Gupta

Department of Computer Science, Dr. K.N. Modi University, Newai, Tonk, 304021, Rajasthan, India

Abstract: Multimodal sentiment analysis examines human emotion by jointly processing text, audio, and visual signals. This research introduces a transformer-based multimodal framework built on vision-language models and large language models, which captures relationships across modalities while preserving contextual understanding. The model combines cross-modal attention with feature fusion to improve the accuracy of sentiment prediction. It was evaluated on the CMU-MOSEI dataset, which contains over 23,000 annotated video segments from more than 1,000 speakers expressing opinions on a range of topics, with sentiment intensity labels ranging from -3 to +3. Experimental results show that the proposed model outperforms both traditional unimodal methods and early-fusion methods in accuracy and F1-score. The system also incorporates explainable AI techniques to improve model interpretability, supporting real-world applications in social media analytics, human-computer interaction, and affective computing.
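
As an illustration only, the cross-modal attention and feature fusion named in the abstract can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy two-dimensional features, and the choice of concatenation followed by mean pooling as the fusion step are all hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention across modalities.

    Each query vector (e.g. a text token) attends over the key/value
    vectors of another modality (e.g. visual frame features).
    """
    d = len(keys[0])
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        attended.append(out)
    return attended

def fuse(text_feats, attended_feats):
    """Assumed fusion step: concatenate each text vector with its
    attended visual context, then mean-pool over the sequence."""
    joined = [t + a for t, a in zip(text_feats, attended_feats)]
    n = len(joined)
    return [sum(row[j] for row in joined) / n for j in range(len(joined[0]))]

# Toy features (hypothetical): 2 text tokens, 3 visual frames, dimension 2
text = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

attended = cross_modal_attention(text, visual, visual)
fused = fuse(text, attended)  # one 4-dimensional vector for a downstream classifier
```

In this sketch the fused vector would feed a sentiment regressor or classifier. A real system would use learned projection matrices for the queries, keys, and values, and multiple attention heads.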

Keywords: Multimodal Sentiment Analysis, Large Language Models (LLMs), Vision-Language Models, Explainable AI (XAI), Cross-Modal Attention, CMU-MOSEI Dataset, Support Vector Machine (SVM).

VOLUME 10 ISSUE 03 2026: 268 – 280

DOI: https://doi.org/10.71058/jodac.v10i03011
