Original scientific paper

Hybrid Vision Transformers and CNNs for Enhanced Image Captioning with Beam Search Optimization

Sushma Jaiswal ; Guru Ghasidas Central University, Bilaspur (C.G.) *
Harikumar Pallthadka ; Manipur International University, Imphal, Manipur
Rajesh P. Chinhewadi ; Manipur International University, Imphal, Manipur
Tarun Jaiswal ; National Institute of Technology, Raipur (C.G.)

* Corresponding author.


Full text: English PDF (689 kB)

Pages: 130–138


Abstract

Deep learning has significantly advanced image captioning, and the Transformer, a neural network architecture originally designed for natural language processing, now excels at this task and at other computer vision applications. This paper provides a detailed review of Transformer-based image captioning methods. Traditional approaches relied on convolutional neural networks (CNNs) to extract image features and on recurrent neural networks (RNNs) or long short-term memory (LSTM) networks to generate captions, but these methods often suffer from information bottlenecks and struggle to capture long-range dependencies. The Transformer brought groundbreaking improvements to natural language processing with its attention mechanism and parallel processing, and researchers have successfully adapted the architecture to image captioning.
Transformer-based image captioning systems now outperform previous methods in both accuracy and efficiency by integrating visual and textual data into a unified model. This paper explores how the self-attention mechanisms and positional encodings of Transformers have been adapted for image captioning, and discusses the use of Vision Transformers (ViTs) and hybrid CNN-Transformer models. It also highlights the importance of pre-training, fine-tuning, and reinforcement learning for improving caption quality. The paper examines challenges such as multimodal fusion, aligning visual and textual information, and ensuring caption interpretability. Finally, it emphasizes how future research may extend Transformer-based methods to areas such as medical imaging and remote sensing, unlocking new possibilities for multimodal understanding and generation and enhancing human-computer interaction.
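As a rough illustration of the pipeline the paper surveys, the sketch below pairs a CNN encoder (standing in for a ResNet or ViT backbone) with a Transformer decoder equipped with learned positional encodings, and decodes captions with beam search. All dimensions, token ids, and the beam_search helper are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

VOCAB, D_MODEL, MAX_LEN = 10_000, 512, 20
BOS, EOS = 1, 2  # assumed special-token ids

class HybridCaptioner(nn.Module):
    """Minimal hybrid CNN-encoder / Transformer-decoder captioner (a sketch)."""
    def __init__(self):
        super().__init__()
        # Small CNN standing in for a ResNet/ViT feature extractor:
        # maps a 224x224 image to a 7x7 grid of D_MODEL-dim features.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, D_MODEL, 3, stride=8, padding=1),
            nn.AdaptiveAvgPool2d(7),
        )
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        # Learned positional encoding for the caption tokens.
        self.pos = nn.Parameter(torch.zeros(1, MAX_LEN + 1, D_MODEL))
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(D_MODEL, VOCAB)

    def forward(self, image, tokens):
        # Flatten the CNN grid into a sequence of "visual tokens" the
        # decoder cross-attends to.
        mem = self.cnn(image).flatten(2).transpose(1, 2)          # (B, 49, D)
        tgt = self.embed(tokens) + self.pos[:, :tokens.size(1)]   # (B, T, D)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.out(self.decoder(tgt, mem, tgt_mask=mask))    # (B, T, VOCAB)

@torch.no_grad()
def beam_search(model, image, beam=3):
    """Keep the `beam` highest log-probability partial captions each step."""
    beams = [(0.0, [BOS])]
    for _ in range(MAX_LEN):
        candidates = []
        for score, seq in beams:
            if seq[-1] == EOS:            # finished hypotheses carry over
                candidates.append((score, seq))
                continue
            logits = model(image, torch.tensor([seq]))[0, -1]
            logp = torch.log_softmax(logits, dim=-1)
            top = torch.topk(logp, beam)  # expand only the top-k continuations
            for lp, tok in zip(top.values, top.indices):
                candidates.append((score + lp.item(), seq + [tok.item()]))
        beams = sorted(candidates, key=lambda b: -b[0])[:beam]
    return beams[0][1]  # best-scoring token sequence

model = HybridCaptioner().eval()
print(beam_search(model, torch.randn(1, 3, 224, 224)))

Beam search keeps several partial captions alive instead of committing greedily to the single most likely next token, which is why the surveyed systems typically use it at inference time.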

Keywords

CNN, LSTM, Image Captioning, BLSTM.

Hrčak ID:

324933

URI:

https://hrcak.srce.hr/324933

Publication date:

23.12.2024.