Semantic interdisciplinary evaluation of image captioning models

Abstract In our day-to-day life, synchronizing vision and language plays a crucial role in solving various real-time challenges. Image captioning is one of them: it aims to recognise objects, activities, and their relationships in order to produce a syntactically and semantically correct visual description. Existing image captioning work spans various directions, such as the news, fashion, art, and medical domains. The core architectural idea of image captioning is to merge CNN, RNN, and transformer models. In practice, there are many conceivable combinations, and brute-forcing all of them would take a long time. To the best of our knowledge, there is no prior work interpreting image captioning models across various use cases. In this research article, we examine and analyze image captioning models used across various domains, and extract multiple insights to determine the best combinational architecture for a new application without ignoring contextual semantics. Having examined numerous designs, we find that LSTM is best suited for image captioning across several domains.


Introduction
Visual data like photographs and videos may now be acquired fast and inexpensively, providing a wealth of knowledge for addressing real-world problems. The availability of huge amounts of data creates demands for automatic visual understanding and content summarization, which is not possible in real time. Humans can understand an image without a description, and models will not inherit this flexibility until an appropriate hybrid architectural combination is established. As a result, image captioning is used to describe images by using various deep learning models.

ABOUT THE AUTHOR
Uddagiri Sirisha is a full-time PhD research scholar in the School of Computer Science and Engineering at VIT-AP University, Amaravati.

PUBLIC INTEREST STATEMENT
Image captioning has various applications, such as recommendations in editing applications, virtual assistants, image indexing, aids for visually impaired persons, social media, and several other natural language processing applications. Many researchers have been working in this field of image captioning using deep learning architectures. The paper contains a brief study of the methods involved in a variety of contexts, including news articles, fashion, medical photos, art images, and even human images.
Captioning is a tool for describing the content of an image by utilising computer vision and natural language processing. Captioning allows the model to recognise not only the objects in an image but also their relationships with other objects to generate human-readable phrases.
Captions are generated by using a CNN-RNN architecture, which involves CNN layers for feature extraction on input data and RNN to make predictions based on time series data. A sample diagram is shown in Figure 1.
A convolutional neural network is an algorithm that takes an input image and applies weights and biases to different items in the image to distinguish one image from another. Many convolutional layers are integrated with nonlinear and pooling layers in each neural network. When a picture is streamed through each convolution layer, the first layer's output becomes the input for the next layer, and so on for all subsequent layers. To construct an N-dimensional vector, a fully connected layer is required after a series of convolutional, nonlinear, and pooling layers.
Long short-term memory networks are a sort of recurrent neural network that can learn order dependence in sequence prediction problems and overcome the RNN's short-term memory constraints. LSTM can save useful information throughout the processing of inputs while discarding irrelevant data. The prior inputs have a longer reference in LSTM and GRU, but only to a limited extent. To overcome this, we use transformers to provide an unlimited reference to the prior inputs.
A transformer is an encoder-decoder architecture that transforms one sequence into another using attention alone. The encoder transforms an input sequence into an abstract continuous representation that retains all the information in the input. The decoder then takes the encoder's continuous representation and generates the output one token at a time, feeding each generated token back in as input for the next step.
The approaches listed below are used for image captioning: 1) Hand-crafted features are learned using techniques like the Scale-Invariant Feature Transform (SIFT; Lowe, 2004), Local Binary Patterns (LBP), the Histogram of Oriented Gradients (HOG; Dalal & Triggs, 2005), and combinations of these approaches. To classify an object, extracted features are fed into a classifier like a support vector machine (SVM; Boser et al., n.d.). As hand-crafted features are task-dependent, extracting features from a huge and diverse amount of data is tough. 2) Deep-learning-based methods automatically learn both linear and non-linear feature spaces.
The following methods are used for deep-learning-based image captioning (Adriyendi, 2021): Template/retrieval-based methods: different objects, properties, actions, and their relationships are identified using a template-based technique, and then the empty slots in the templates are filled. The retrieval-based approach detects visually similar images in the training data and then generates captions. A novel approach creates captions from both the visual and multimodal spaces.
Dense/Single Sentence Captioning: Dense captioning (Johnson et al., 2016) generates captions for each region in the image, whereas single sentence captioning generates captions for the entire scene in the image.
Architecture/compositional-based image captioning: architecture-based image captioning involves a convolutional neural network and a recurrent neural network (Vinyals et al., 2015), whereas compositional-based approaches include (Fang et al., 2015), GAN-based, semantic-concept-based (You et al., 2016), graph-based (Pan et al., 2004), attention-based (Xu et al., 2015), and stylized captions (Gan et al., 2017). In encoder-decoder architecture-based image captioning, the visual features are extracted using a CNN and fed into an LSTM for caption generation. In compositional architecture-based image captioning, however, both the visual features and the attributes are extracted using CNN approaches and fed into language models for caption generation. The candidate captions are then re-ranked to identify high-quality image captions using a deep multimodal architecture.
In recent years, there has been considerable work on image captioning because of the capability of deep learning models to efficiently extract useful features from images. Most deep learning models use convolutional neural networks (LeCun et al., 1998) for extracting image features and language models for caption generation. The most common CNN approaches are ResNet-50, VGG-16, Inception v3, DenseNet, Faster R-CNN, etc., and the most frequently used RNN models are LSTM and its successors, i.e. single-layer LSTM, single-layer LSTM with attention, bidirectional LSTM, hierarchical LSTM, GRU, and transformers.
Model performance depends on two categories of metrics: 1) measures of the overall compatibility between generated captions and the ground truth, and 2) calculation of precision and recall scores.

BLEU
BLEU stands for Bilingual Evaluation Understudy Score and was initially proposed for machine translation. BLEU is a metric that measures the compatibility between two text strings. BLEU is applied on the test dataset.
The BLEU score is determined as follows: 1) Consider a candidate phrase c and a list of possible reference phrases.
2) The modified n-gram precision is computed as
p_n = (number of times an n-gram of the candidate phrase appears in a reference phrase) / (total number of n-grams in the candidate phrase),
where the counts are clipped so that each candidate n-gram is credited at most as many times as it occurs in a reference.
3) Recall is a measure of quantity, as it measures the entire content of the output, whereas precision is a measure of quality, as it measures the n-gram scores separately. Recall is not considered directly in the BLEU metric, so BLEU compensates with a brevity penalty.
4) Let r be the reference phrase length and c the candidate phrase length. The brevity penalty P is computed as P = 1 if c > r, and P = e^(1 - r/c) otherwise.
5) The geometric mean of the n-gram precision scores multiplied by the brevity penalty P for short phrases generates the BLEU score: BLEU = P * exp(Σ_n w_n log p_n).
6) The BLEU metric ranges from 0 to 1. To get a score of 1, the phrase must exactly match a reference phrase; otherwise, the score lies strictly between 0 and 1. As a result, even a human translator will not always get a perfect score of 1.
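As a minimal illustrative sketch of steps 1-6 (not the exact implementation used in any of the surveyed works), the clipped n-gram precisions, brevity penalty, and geometric mean can be combined as follows:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with brevity penalty (illustrative sketch).

    `candidate` is a token list; `references` is a list of token lists.
    """
    c = len(candidate)
    # Reference length closest to the candidate length.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(candidate, n)
        if not cand_ngrams:
            return 0.0
        # Clipped counts: each candidate n-gram is credited at most as
        # many times as it appears in the best-matching reference.
        max_ref = Counter()
        for ref in references:
            for gram, cnt in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_ngrams.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / sum(cand_ngrams.values())))
    # Brevity penalty: 1 if the candidate is longer than the reference,
    # exp(1 - r/c) otherwise.
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the n-gram precisions, weighted uniformly.
    return bp * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to its reference scores 1.0, while a candidate with no matching higher-order n-grams scores 0.0, matching step 6 above.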

METEOR
METEOR is one of the evaluation metrics for machine translation. METEOR addresses disadvantages of the BLEU metric such as the absence of recall evaluation and of explicit word matching. METEOR scores are generated from word-to-word similarities between a candidate and a reference phrase. If more than one reference phrase is provided, the score is calculated individually against each reference. METEOR chooses the alignment with the most similar word order between the two strings; to discover the best alignment, the exact-mapping module is applied first, followed by stemming and synonym matching. We then calculate unigram precision p = m/c and recall r = m/s, where m denotes the number of mapped unigrams between the two strings and c and s denote the total number of words in the candidate and reference phrases, respectively.
The METEOR score is computed from the harmonic mean of the precision and recall of unigram matches between the candidate and reference phrases. The recall-weighted harmonic mean is given as: F_mean = 10pr / (r + 9p). METEOR groups the matched unigrams in the candidate and reference phrases into chunks, i.e. when the entire candidate phrase matches the reference phrase there is only one chunk, and the more fragmented the match, the more chunks there are. The penalty for the chunks is then computed using the formula: penalty = 0.5 * (number of chunks / number of matched unigrams)^3. The penalty grows with additional chunks up to a maximum of 0.5, and shrinks toward its minimum as the number of matched unigrams per chunk increases. Finally, the METEOR score M_s is calculated as: M_s = F_mean * (1 - penalty).
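The harmonic mean and chunk penalty above can be sketched as a small helper (illustrative only; a full METEOR implementation would also perform the stemming and synonym matching needed to obtain the match statistics):

```python
def meteor_score(matched, cand_len, ref_len, chunks):
    """METEOR-style score from unigram match statistics (illustrative).

    matched  -- number of unigrams mapped between candidate and reference
    cand_len -- number of words in the candidate phrase
    ref_len  -- number of words in the reference phrase
    chunks   -- number of contiguous matched chunks in the alignment
    """
    if matched == 0:
        return 0.0
    p = matched / cand_len                   # unigram precision p = m/c
    r = matched / ref_len                    # unigram recall    r = m/s
    f_mean = 10 * p * r / (r + 9 * p)        # recall-weighted harmonic mean
    penalty = 0.5 * (chunks / matched) ** 3  # fragmentation penalty, max 0.5
    return f_mean * (1 - penalty)
```

A fully matched phrase in a single chunk scores close to 1, and the same match fragmented into more chunks scores lower, as the penalty formula dictates.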

SPICE
SPICE stands for Semantic Propositional Image Caption Evaluation. SPICE is used for finding image caption similarity using a scene graph. Therefore, in SPICE, both the candidate phrase and the reference phrases are transformed into an intermediate representation, the scene graph.
For a given candidate phrase c and a set of reference phrases S = {s_1, s_2, ..., s_m} associated with an image, SPICE computes the similarity between c and S. Object classes (C), relations (R), and attributes (A) are parsed from a given phrase via its scene graph. To create the scene graph, the candidate phrase c is first transformed into logical propositions, i.e. a set of semantic tuples over objects, attributes, and relations; SPICE is then computed as an F-score over the tuples of the candidate and reference scene graphs.
The performance of SPICE depends on parsing and its value is bounded between 0 and 1.
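Assuming the phrases have already been parsed into semantic tuples, the final F-score step can be sketched as follows (exact-match only; the real SPICE metric additionally matches WordNet synonyms when intersecting tuples):

```python
def spice_f1(candidate_tuples, reference_tuples):
    """SPICE-style F1 between semantic tuple sets (illustrative sketch).

    Tuples are hashable items parsed from scene graphs, e.g.
    ('girl',), ('girl', 'young'), or ('girl', 'standing-on', 'field').
    """
    cand, ref = set(candidate_tuples), set(reference_tuples)
    if not cand or not ref:
        return 0.0
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)   # fraction of candidate tuples confirmed
    recall = matched / len(ref)       # fraction of reference tuples covered
    return 2 * precision * recall / (precision + recall)
```

Like SPICE itself, this value is bounded between 0 (no tuple overlap) and 1 (identical tuple sets).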

CIDEr
CIDEr stands for Consensus-based Image Description Evaluation. It measures the consistency of a candidate phrase with a set of human-written reference phrases, and it shows higher agreement with human consensus than earlier metrics. The characteristics of saliency, grammaticality, semantics, and accuracy are essentially captured by this measure.
For a given candidate phrase c and a set of reference phrases S = {s_1, s_2, ..., s_m} associated with an image, initial stemming is applied and each phrase is represented in the form of n-grams. The n-gram co-occurrences between the candidate and reference phrases are then calculated. N-grams that appear in all image descriptions are given a lower weight in the CIDEr metric. After applying term frequency-inverse document frequency (TF-IDF) weighting to each n-gram, the cosine similarity between the candidate and reference n-gram vectors is computed.
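A simplified sketch of this TF-IDF weighting and cosine similarity for a single n-gram order follows (the full metric averages over n = 1..4 and scales the result; here `doc_freq` is assumed to map each n-gram to the number of images whose reference captions contain it):

```python
import math
from collections import Counter

def tfidf_vector(tokens, n, doc_freq, num_images):
    """TF-IDF weighted n-gram vector for one phrase (CIDEr-style)."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    # Common n-grams (high document frequency) get low IDF weight.
    return {g: (c / total) * math.log(num_images / max(doc_freq.get(g, 1), 1))
            for g, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_n(candidate, references, n, doc_freq, num_images):
    """Average cosine similarity of TF-IDF n-gram vectors for one n."""
    c_vec = tfidf_vector(candidate, n, doc_freq, num_images)
    return sum(cosine(c_vec, tfidf_vector(r, n, doc_freq, num_images))
               for r in references) / len(references)
```

Note how an n-gram appearing in every image (doc_freq equal to num_images) receives a weight of zero, which is exactly the down-weighting of ubiquitous n-grams described above.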

ROUGE
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It was created with the intention of assessing summarization systems. ROUGE measures the amount of overlapping units, such as n-grams, word sequences, and word pairs. There are four main variants of ROUGE: ROUGE-L, which measures the longest common subsequence; ROUGE-N, n-gram co-occurrence statistics; ROUGE-W, weighted longest common subsequence; and ROUGE-S, skip-bigram co-occurrence statistics. Since the ROUGE metric relies heavily on recall, it favors long phrases.
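ROUGE-L, the variant most often reported for captioning, can be sketched as an LCS-based F-measure (the recall-weighting parameter beta below follows the common formulation and is illustrative):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-measure based on the longest common subsequence.

    beta > 1 weights recall more heavily than precision, reflecting
    ROUGE's recall-oriented design.
    """
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    p = lcs / len(candidate)
    r = lcs / len(reference)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
```

An identical candidate and reference score 1.0 regardless of beta, since precision and recall are then both 1.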

Prior analysis in image captioning
A huge number of articles on image captioning have been published in recent years. As summarised in Table 1, only a few survey studies have been published, each of which gave a decent literature review of image captioning. The authors of (Hossain et al., 2019) looked at deep-learning-based image captioning algorithms and a taxonomy of image captioning techniques. UpDown, OSCAR, VIVO, meta-learning, and GAN-based models are among the state-of-the-art techniques examined by the authors in (Elhagry & Kadaoui, 2021). The advancements in semantic segmentation were discussed by the authors in (Oluwasammi et al., 2021). Template-based and retrieval-based approaches, as well as recent improvements in the encoder-decoder structure, were also surveyed. In Monshi et al. (2020), the authors presented a literature survey on multi-modal datasets for training deep DL models that generate radiology text. Pavlopoulos et al. (2021) presented an overview of publicly available datasets, evaluation measures, and encoder-decoder architectures in the medical domain.

Novelty and contributions
The novelty and contributions of this work include:
• We investigate the performance of various image captioning algorithms in a variety of contexts, including news articles, fashion, medical photos, art images, and even human images.
-News image captioning creates meaningful captions for photos from news articles that have been published in a variety of newspapers.
-High-level semantic information, professional knowledge, and several specific symbols are frequently included in art image captioning.
-Medical captioning encodes medical images taken during a patient's examination and generates a report.
-Image captioning helps visually impaired people to navigate more easily.
• From diverse domains, we compiled a list of image captioning models. However, there is a lack of understanding of these models' performance with respect to a variety of parameters. We apply statistical analysis to overcome this problem and reduce the amount of time spent on brute-forcing for new applications.

Image captioning in various domains
The main theme of the paper is to reduce the time in finding out the best possible CNN and RNN architectures irrespective of the domain under consideration. Various domains and organization of this section are shown in Figure 3.

News image captioning
News image captioning differs from generic image captioning in three aspects: 1) Input consists of image-article pairs.
2) Caption generation is performed based on the image and its related article.
3) News captions contain named entities and additional context extracted from the article.
Considering these differences, a sample image caption for a news article is shown in Figure 4. News image captioning is used in editing applications. It is the process of creating detailed and informative captions for photos from news articles published in various newspapers. A news article contains objects along with text, and they refer to various locations, people, events, times, etc. The generated captions therefore depict both the visual and textual features of a news article. The most commonly used datasets are Visual News, GoodNews, NYTimes800K, BreakingNews, BBC News, and the Daily Mail dataset. Table 2 shows various datasets used in news image captioning, and an analysis of the architectures used in news image captioning is shown in Table 3. In state-of-the-art approaches, the authors of (Hu et al., 2020; Liu et al., 2020; Yang et al., 2021; Zhao et al., 2021) specifically used attention-based encoder-decoder frameworks for generating captions for news article images. Mostly, in news image captioning, the authors of (Furkan Biten et al., 2019; Hu et al., 2020; Liu et al., 2020; A. Tran et al., 2020; Yang et al., 2021; Yang & Okazaki, 2020; Zhao et al., 2021) used ResNet-152, while Chen and Zhuge (2019) and Ramisa et al. (2017) applied the Oxford VGGNet for visual feature extraction.
Named entity recognition is critical for identifying and classifying named entities in news image captioning. The identified entities are divided into numerous categories based on the raw and structured text, such as persons, organisations, places, money, time, and so on. The authors of (Hu et al., 2020; Liu et al., 2020) utilized the spaCy toolkit and applied named entity embedding (NEE) and multi-span text reading (MSTR) for captioning.
Text feature extraction from news stories can be done at the article or sentence level; however, this is not enough to capture all of the information in the article. GoodNews (A. Tran et al., 2020; Yang et al., 2021; Zhao et al., 2021) and NYTimes800K (A. Tran et al., 2020; Yang et al., 2021; Zhao et al., 2021) are the most widely used datasets in news image captioning. The authors of the Visual News dataset introduced the largest visual news image captioning dataset, consisting of one million images with articles, captions, and other metadata. Model performance is more effective on the Visual News dataset when compared to the other two datasets, i.e. GoodNews and NYTimes800K.
A detailed workflow of the process involved in news image captioning is shown in Figure 5:
• The input for news image captioning is a news article {image + text}, and the image and text need to be processed separately.
• The image encoder uses multiple CNN models to extract characteristics from the image, while the text encoder uses various embedding approaches to extract features from the news article.
• Decoder combines both the visual and textual features using various RNN models and generates the caption as output.
News image captioning evaluation metrics analysis is shown in Table 4. Among all metrics, CIDEr is the best-performing metric in news captioning since its embedding-based weighting focuses more on uncommon words. The authors who introduced the Visual News dataset proposed a model able to generate better captions more efficiently, improving the CIDEr score from 13.2 to 50.5. Other authors constructed a multi-modal knowledge graph, and some even integrated domain-specific knowledge and achieved the best results on both the GoodNews and NYTimes800k datasets. News image captioning is an inherently complicated challenge for machine intelligence because it combines images and news articles for caption development. A detailed workflow of the process involved in news image captioning is also outlined. We summarized the work across various parameters used in news image captioning, and news image captioning architectures are further evaluated across different datasets.

Image captioning for visually impaired
Visually impaired people face a number of challenges every day, from reading the label on a frozen dinner to figuring out how to reach home safely. These problems worsen when they travel to new places. Many tools based on computer vision and other sensors have been developed to address these issues (talking OCR, GPS, radar canes, etc.), but deep learning makes it possible to add a different perspective to solving these problems. Various data science researchers have addressed several problems for the visually impaired, one of which is generating image/scene captions that consider the capabilities of the respective person. Image captioning helps visually impaired people to navigate more easily. The most commonly used datasets are the VizWiz dataset, Flickr8k, and MSCOCO (Table 5).
To create captions for a given input image, state-of-the-art algorithms primarily focus on image captioning models. To create a caption, an input image is passed through a CNN to extract visual features, and the outputs are merged and submitted to a multimodal transformer network. Various image captioning architectures used for visually impaired people are shown in Table 6.
In (Dognin et al., 2020), the authors developed a multi-modal transformer that uses ResNeXt visual features, object detection-based textual features, and OCR-based textual features. The authors in (Ahsan et al., 2021) used AoANet as a captioning model and BERT to build OCR token embeddings. In (Makav & Kılıç, 2019b; Pasupuleti et al., 2021; Zaman et al., 2019), the authors utilized VGG16 for feature extraction and Guided LSTM for text generation, while Makav and Kılıç (2019a) applied VGG16 for visual features and an NLP model for generating human-like captions.
The models built in this domain are deployed in edge and IoT devices. For example, the authors of (Rane et al., 2021) and (Makav & Kılıç, 2019a) integrated their models into smartphones, (Chharia & Upadhyay, 2020) converted the captions to audio, and (Zaman et al., 2019) made the model output available in Braille. The authors in (Makav & Kılıç, 2019b) designed an Android application, "Eye of Horus", to provide a user-friendly interface: the user can either choose an image from the gallery or take a new photo using the smartphone camera to produce captions. The caption can be listened to and also displayed on the screen.
We came up with a common workflow based on all the existing systems (Figure 6), briefed as below: • The input for image captioning is an image.
• Image features need to be extracted using CNN model and sent to RNN model for further processing.
• The model can even extract text from image using state-of-the-art OCR.
• Later, the results obtained from feature extraction and text extraction are combined to generate a caption.
• The generated caption can be converted to voice using gTTS, deployed in any smart device, or converted to Braille. The performance of models for visually challenged persons is shown in Table 7.
We summarized the views of many researchers who, to support a natural and socially important use case, presented image captioning algorithms that generate captions from images using object detection, text detection, and recognition. We outlined the developed models that are integrated into electronic devices to help visually impaired people traverse more easily, and reviewed the work across various architectures evaluated on different datasets.

Image captioning with art images
In the field of computer vision, generating captions from art images has been an important task (Figure 7). Artworks are characterized by various artistic styles, attributes, and motives, along with a great diversity of artists in different periods. To annotate historical artwork images, we require professional knowledge and high-level semantic information, and the textual descriptions of ancient objects often include a lot of specific symbols. The most commonly used artwork datasets are IconClass Caption, SemArt, BibleVSA, ancient Egyptian art, ancient Chinese art, Flickr8K, Flickr30K, and the WikiArt dataset (Table 8).
An extensive quantitative and qualitative study (Bai et al., 2021; Garcia & Vogiatzis, 2018; Sheng & Moens, 2019) is used to validate the captions for the artwork. On these iconographies datasets, the authors in (Vaswani et al., 2017) fine-tuned the state-of-the-art models. External knowledge (Garcia & Vogiatzis, 2018) is utilised to characterise many characteristics of the image, such as its style, content, or composition (Table 9).
Art Image captioning workflow is depicted in Figure 8.
• An art image is used as the input for captioning.
• The encoder extracts the image characteristics using various CNN models, while art features need to be extracted based on external knowledge.
• Decoder combines both the art and image features using various models and generates the caption as output.
Art image captioning performance is summarized in Table 10. Generic image captioning metrics may not correlate well with assessments of art expertise and originality, so the authors in (Hacheme & Sayouti, 2021) describe three additional metrics: Greedy Matching, Skip-Thought, and Embedding Average.
We curated the work on image captioning for art images across various parameters. Annotating antique artwork photos requires professional skill, as artworks are characterized by various artistic styles, attributes, and motives. A comprehensive view of these models' performance with respect to various parameters therefore helps to reduce the time needed to develop new applications.

Image captioning with fashion images
Fashion is an industry related to social, cultural, and economic implications in the real world. It is critical to provide correct captions for online fashion items not only to attract customers, but also to boost online sales. It is difficult to recognise and describe the rich features of fashion products, unlike conventional image captioning. In this case, the input is a fashion photograph, and the output is a fashion caption. The most commonly used datasets are DeepFashion, InFashAIv1, Fashion Captioning Dataset, FASHION-IQ, Fashion Database (Table 11). An instance of image captioning for fashion-related images is shown in Figure 9.
To increase the quality of text descriptions, Yang et al. (2020) proposed attribute-level and sentence-level semantic rewards as measures, and Hacheme and Sayouti (2021) combined the DeepFashion and InFashAIv1 datasets for performance improvement. The authors in (Tateno et al., 2020) applied a DNN to translate visual information collected from clothing into language expressions, enabling visually impaired persons to access the shape and texture of objects. The authors in (J. Li et al., 2019) tuned several hyperparameters to achieve a 39.12% average recall using a single model and a 43.67% average recall with an aggregation of 16 models on the FASHION-IQ dataset (Table 12).
A detailed workflow of the process involved in fashion image captioning is shown in Figure 10 and the same is briefed below: • The input for image captioning is a fashion image.
• Encoder extracts the features from the image using various CNN models.
• Decoder takes the input from the encoder and also takes the attributes of the image into consideration to generate the caption as output.
Fashion image captioning performance is summarized in Table 13. Along with the generic image captioning metrics, some authors have used mAP (mean average precision) and ACC (accuracy), and Sadeh et al. (2019) applied diversity (Div.) and vocabulary usage (Vocab.) metrics. Unlike generic image captioning, the identification and description of attributes plays a vital role in fashion image captioning. A detailed workflow of the process involved in fashion image captioning is also outlined. We briefed fashion image captioning architectures across various parameters; an in-depth view of model performance with respect to various parameters helps in generating accurate captions.

Image captioning with medical images
Medical captioning encodes medical images from a patient's examination and generates a full or partial report. Medical reports are always key decisive factors for initiating the right treatment of various diseases. Medical images are typically interpreted by highly skilled professionals, who write medical reports describing the findings on the patient's abnormalities and diseases. Even for experienced radiologists, drafting a medical report can be time-consuming and unpleasant. As a result, the automatic creation of medical reports can aid radiologists in making decisions, as well as assist medical teams in reducing workload and improving work efficiency. The most commonly used datasets are summarized below:
• IU-X-Ray: The Indiana University chest X-ray dataset, which contains 7,470 chest X-ray scans and 3,955 de-identified radiology reports. Each report includes Impression, Findings, and Indication sections. The dataset is divided into three parts: 70% for training, 10% for validation, and 20% for testing.
• MIMIC-CXR: MIMIC-CXR contains 377,110 chest X-ray images and 227,835 patient reports from a total of 64,588 patients. The training set has 368,960 records, the validation set has 2,991 records, and the test set has 5,159 records.
• OpenI-IU: The OpenI-IU dataset contains 3,996 radiology reports and 8,121 accompanying chest X-ray images that have been carefully annotated by human specialists. Only unique frontal images and their related reports with findings or impressions are chosen from the dataset.
• CheXpert (Yuan et al., 2019): CheXpert has 224,316 multi-view chest X-ray images from 65,240 patients, representing 14 common radiographic findings. The observations are derived from radiology reports that are categorised as positive, negative, or uncertain using NLP methods.
• OpenI: OpenI is a radiography dataset of 3,851 distinct radiology reports and 7,784 frontal/lateral images. Body parts, observations, and diagnoses were annotated on each OpenI report.
• CX-CHR (Jing et al., 2020): The CX-CHR dataset, provided by a professional medical examination institution, contains 35,500 images. Each image includes a textual report authored by trained radiologists, with sections including Complaint, Findings, and Impression.
• ChestX-ray14: There are 108,948 frontal-view X-ray images in the ChestX-ray database, divided into three parts: 70% for training, 10% for validation, and 10% for testing.
With the tremendous amount of research publications in the medical domain, selection of relevant papers for review is a typical problem. We addressed this by prioritizing the very recent works based on the citation score (i.e. greater than 20).
In recent works, the authors employed a multi-modal input encoder and decoder architecture, introduced a retinal disease identifier (RDI) and a clinical description generator (CDG), and applied a contextualized keyword encoder and a medical description generator for retinal report generation (Table 15).
To generate radiology reports for a given set of radiology images (N), the visual backbone extracts the visual features F, resulting in the source sequence f1, f2, . . ., fs for the subsequent visual language model. The authors in (Chen et al., 2021; Jing et al., 2020; Liu, Yin et al., 2021; Pahwa et al., 2021; Wang et al., 2018, 2021; Xue et al., 2018; Yuan et al., 2019) extracted the visual features with pre-trained ResNet convolutional neural networks, while the authors in (Gale et al., 2018; Guanxiong Hsu Liu et al., 2019; Laserson et al., 2018; Y. Li et al., 2018; Nooralahzadeh et al., 2021; Pino et al., 2021; Zhou et al., 2021) used DenseNet as it is more effective for the report generation task. The authors in (Guanxiong Hsu Liu et al., 2019; Jing et al., 2017; Y. Li et al., 2018; Yuan et al., 2019) also contributed to the generation of radiology reports. Xue et al. (2018) proposed a multimodal recurrent model containing an iterative decoder to improve the coherence between sentences. Wang et al. (2018) proposed a multi-attention model to combine image and text modalities using the CNN-RNN architecture to improve disease classification and report generation, and Jing et al. (2020) constructed a model to find the relationship between findings and impressions.
The authors in (Liu, Yin et al., 2021) proposed contrastive learning to organize the data into similar/dissimilar image pairs. For chest X-ray report generation, the authors in (Guanxiong Hsu Liu et al., 2019; Y. Li et al., 2018) applied reinforcement learning and knowledge graphs, and Pahwa et al. (2021) used skip connections and transformers. The template-based multi-attention model (TMRGM) was presented by Wang et al. (2021) for automatically creating reports for healthy and abnormal individuals. Most recently, Han et al. (2018) and Zhang et al. (2020) utilized an abnormality graph embedding module to assist the generation of reports, which was further extended by Gurari et al. (2020) to generate reports based on GANs. A detailed workflow of the process involved in medical image captioning is shown in Figure 12 and briefed below:

• The input for image captioning is a medical image.
• The encoder extracts features from the image using a CNN model.
• The decoder takes the encoder output, together with attentive word embeddings, and generates the caption as output.
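The encoder-decoder flow above can be sketched numerically. The snippet below is an illustrative toy only (random weights, a made-up mini-vocabulary, and NumPy arrays standing in for a real CNN encoder and LSTM decoder); it shows how attention weights over encoder feature regions produce a context vector that drives the next-word prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-in for CNN encoder output: 49 spatial regions x 256-dim features
# (e.g., a 7x7 feature map flattened), as a pre-trained backbone would produce.
features = rng.standard_normal((49, 256))

# One decoder step with dot-product attention (weights are illustrative only).
hidden = rng.standard_normal(256)            # current decoder (LSTM) hidden state
scores = features @ hidden                   # alignment score per image region
alpha = softmax(scores)                      # attention distribution over regions
context = alpha @ features                   # attended visual context vector

# Project [context; hidden] to vocabulary logits and pick the next word greedily.
vocab = ["a", "chest", "x-ray", "shows", "no", "acute", "findings"]  # toy vocab
W = rng.standard_normal((len(vocab), 512))
logits = W @ np.concatenate([context, hidden])
next_word = vocab[int(np.argmax(logits))]
```

In a trained model, `W`, the attention scoring, and the recurrent state update would all be learned parameters; here they merely illustrate the data flow from image features to an output token.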
Medical captioning performance is summarized in Table 16. Automatic detection from medical images has also been benchmarked in the ImageCLEF challenge (Pelka et al., 2020). We discussed how to create medical reports more efficiently based on various parameters. Several architectures have been evaluated across different datasets, and substantial progress has been made towards generating automatic reports with various deep learning models. Medical reports help experienced radiologists make the right decisions for the treatment of patients.

Image captioning in other domains
Image captioning is increasingly being employed in a variety of additional applications, including the effective retrieval of images in military applications, driving operations in complex traffic scenarios, and risky situations (Table 17).
The authors in (W. Li et al., 2020; Mori et al., 2019) utilized the model as an assistance system that can prevent traffic accidents, and Arriaga et al. (2017) described how image captioning helps in dangerous circumstances involving knives, guns, fire, blood, dead bodies, and broken objects.
The summary work on other domains is shown in Table 18.

Extracting insights from image captioning models across various domains
We compiled a list of image captioning models from various fields. However, a deeper understanding of these models' performance in relation to various parameters remains unexplored. We use statistical analysis to address this problem and reduce the time spent brute-forcing architectures, even for new applications.
Statistical analysis is a crucial tool in experimental research for effective interpretation. We used the Chi-square test of independence, which checks whether two variables have a significant relationship. The null hypothesis (H0) and alternate hypothesis (H1) are defined as follows:
H0: The metric score is unaffected by size or architecture.
H1: The metric score is influenced by size and architecture.
We tested the above hypothesis on our metadata which was created based on the different domains. From Table 19 the following conclusions can be drawn.
• Interpretation 1: The evaluation metrics BLEU-1 and ROUGE are dependent on size.
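As a worked illustration of the test (with a hypothetical contingency table, not the paper's actual metadata), the chi-square statistic of independence can be computed directly:

```python
import numpy as np

# Hypothetical contingency table (counts are illustrative only):
# rows = size (small, large), cols = BLEU-1 score band (low, high).
observed = np.array([[18,  7],
                     [ 6, 19]])

# Expected counts under H0 (independence): row total x column total / grand total.
row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
expected = row_tot * col_tot / observed.sum()

# Chi-square statistic and degrees of freedom: (rows-1) x (cols-1).
chi2 = ((observed - expected) ** 2 / expected).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Critical value of the chi-square distribution for df=1 at alpha=0.05.
critical = 3.841
reject_h0 = chi2 > critical   # True -> the metric score depends on size
```

For these illustrative counts the statistic exceeds the critical value, so H0 would be rejected, mirroring the kind of conclusion drawn from Table 19.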

Challenges
• Traditional image captioning models lack compositionality and naturalness because they typically generate captions sequentially, i.e., the next word depends on both the previous word and the image attributes. Even though the output is syntactically correct, in some complex scenarios semantically irrelevant language structures will be constructed.
• The second challenge is datasets: models struggle to differentiate captions across similar contexts when they overfit to objects that frequently co-occur in a common domain.
• The third difficulty is determining the quality of the generated captions. Since existing captioning models do not take the complete image context into account, the captions are not helpful for images that have high variance in comparison with the training data.

Research directions
• Image captioning models have fewer datasets than other types of models. Furthermore, the annotation procedure is entirely manual. As a result, semi-automated or automatic annotation procedures are required.
• Image captioning models are evaluated using various metrics such as BLEU, CIDEr, and ROUGE, but the appropriate metric varies from application to application, and there is no standard mechanism for choosing one for the application under consideration. This motivates the design of a framework for optimal metric selection that considers parameters such as the dataset, the application, and the model.
• Interdisciplinary image captioning models show low performance and require considerable time, so optimization techniques are needed to improve them.
• Image captioning models are data hungry. They require a lot of data to generate captions. There are recent deep learning advancements that can build robust models even with less data. Integration of these techniques in the context of image captioning would be a possible alternative for improving existing image captioning models.
• New approaches need to be developed to provide diverse, creative, and human-like captions across multiple areas.
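Of the metrics mentioned above, sentence-level BLEU-1 is the simplest: it reduces to clipped unigram precision multiplied by a brevity penalty. The sketch below implements that definition with a made-up reference/candidate pair (whitespace tokenization is a simplifying assumption):

```python
import math
from collections import Counter

def bleu1(reference: str, candidate: str) -> float:
    """Sentence-level BLEU-1: clipped unigram precision x brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    ref_counts, cand_counts = Counter(ref), Counter(cand)
    # Clip each candidate word's count by its count in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

ref = "a dog is running on the beach"
cand = "a dog runs on the beach"
score = bleu1(ref, cand)
```

Note how the score is blind to word order and synonymy ("runs" earns no credit for "running"), which is one reason a single metric rarely suffices across applications.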

Conclusion
Image captioning is now being applied across many domains. We observed that no single architecture performs best across all of them. To guide the user in picking the right architecture for a new application, we surveyed, analysed, and interpreted the constraints associated with the best-performing models across various domains. We summarised datasets, architectures, and evaluation metrics, as well as how the architectures are evaluated across different datasets.
Finally, we presented our inferences from this extensive survey, which should help researchers streamline their work and reduce the burden of brute-forcing highly complex neural network models. The interpretation can be further extended by considering multiple datasets across various domains.