Multi-Script Video Caption Localization Based on Visual Rhythms

ABSTRACT Localization of video captions plays an important role in information retrieval in multimedia applications. In this work, we present and evaluate a novel method for localizing video captions using visual rhythms, which enable the representation and analysis of a specific feature over time. We build visual rhythms from the text location maps produced by general text localization methods, which are far more common in the literature than caption-oriented ones. Then, we process the maps to keep only the captions, generating caption localization masks. To meet the need for a large, standardized dataset, we constructed a new one, in which captions in thirteen different scripts are added to the video frames, generating a total of 221 videos with ground truth. Experiments demonstrate that our method achieves competitive results when compared to other literature approaches.


Introduction
Texts present in videos can be categorized into scene texts and captions (Zedan, Elsayed, and Emary 2016). Scene texts occur naturally in video content, whereas captions are artificially embedded into the video frames. In this work, we are particularly interested in localizing video captions, which are usually static, that is, fixed in the same position in some consecutive frames. However, captions can also be scrolled, such as the vertical scrolling in video credits.
Several recent works have investigated different tasks related to video captions, whose development can assist the task of video content analysis (Valio, Pedrini, and Leite 2011). We define detection as the task that aims to verify whether there is a caption in a given frame. The localization problem, in turn, returns a bounding box for the caption of each frame.
The script recognition problem aims to recognize the script of the text in order to facilitate the task of text recognition. Finally, text recognition aims to determine the words that are written in the text.
The majority of text localization methods proposed in the literature are not focused on captions. However, they can be used as a baseline for methods addressing this specific problem, provided that additional steps are introduced to distinguish the two types of text. Thus, the main objective of our work is to propose a new method for caption localization built on general text localization methods.
As a first contribution of this work, we propose and evaluate a novel method for localizing video captions based on the visual rhythm. A visual rhythm is a spatiotemporal representation that summarizes relevant information present in a video. It is defined as the concatenation of a predefined feature extracted from each frame or from small groups of neighboring frames. Initially, we apply a text localization method to the video frames, which results in binary maps in which white pixels indicate text regions. Then, we build visual rhythms from the text location maps. The components of these rhythms are processed to keep only the captions. Finally, caption localization masks are retrieved from the processed rhythms. We can use any text localization method in the first step, including those for natural text. The remaining process aims to transform the initial text localization into caption localization. This process is completely deterministic and does not use or require any learning-based technique.
As with caption-specific methods, there is also a lack of large datasets containing the annotation needed to evaluate them. Since the methods do not usually employ a standardized and public dataset, it is difficult to compare their results and generalize their findings. For this reason, our second contribution is a new multi-script dataset with ground truth for video caption tasks. In order to build this dataset, we collected 17 distinct videos from YouTube with subtitles and under the Creative Commons license. Each subtitle was obtained in English and then translated in such a way that we get 13 different scripts. Subtitles are added to videos according to preset settings such as font color and size. By including 13 types of subtitles in each of the 17 different videos, we built a dataset with 221 unique videos. Concerning the number of frames, our dataset has 87,789 frames from the original videos and more than 1 million frames with subtitles. Since detection and localization are performed on the frames and not on the video as a whole, it can be considered a large dataset. To the best of our knowledge, there is no similar dataset in the literature. Experiments demonstrate that the proposed method achieves superior results when compared to the method used as a baseline. We believe that our results can be even better in future work if we improve the performance of the caption detection step.
This text is organized as follows. This section presents a brief introduction with some relevant concepts. Related work is described in Section 2. Section 3 presents the proposed method for video caption localization. The description of the proposed dataset and its construction is provided in Section 4. Section 5 reports and discusses the obtained results. Final considerations and directions for future work are outlined in Section 6.

Background
This section presents a brief review of the literature related to the topic investigated in this work. Methods for text localization in still images are presented in Subsection 2.1. Methods for text detection and localization in videos are described in Subsection 2.2.
Long, He, and Yao (2018) conducted a detailed literature review, categorizing the works based on traditional image processing and deep learning techniques. Prior to the advent of deep learning techniques, methods mainly used connected component and sliding window strategies. Some of these works are described as follows.
Wu, Hsieh, and Chen (2008) presented an algorithm for localization of text regions in images based on mathematical morphology techniques. First, an average filter is applied to smooth the image. Then, the difference between the opening and closing operations of the smoothed image is calculated. A closing operation is applied to the image of the differences to merge the characters into a single component. The image is binarized and the components are labeled. Since texts can still be divided into several small segments of different orientations, the orientation of the text is estimated based on statistical moments. From the texts and their rotation angles, their properties are used to select the candidate texts. The nearby text segments are joined. Then, an x-projection technique is used to extract features from the candidate texts and verify them.
Epshtein, Ofek, and Wexler (2010) proposed an operator to find the stroke width value for each image pixel, and applied it to the text detection problem. Initially, each pixel is set to an infinite value. Edges are computed using the Canny algorithm (Canny 1987). For each edge pixel, its gradient direction is used to find the next one, which is expected to be on the opposite edge of the same stroke. Each intermediate pixel in this path receives the value of the distance between the two edge pixels, if it is smaller than the current value. Based on these distances, pixels with similar values are clustered, making them candidates for letters, which are filtered according to their size. The letters are finally grouped into lines of text.
Neumann and Matas (2010) proposed localizing text in images using Maximally Stable Extremal Regions (MSER). Each of the localized regions is classified by a Support Vector Machine (SVM) into characters and non-characters using features such as aspect ratio and compactness. From similarity and spatial proximity features, the characters are joined in lines. Zhang et al. (2016) used a fully convolutional network (FCN) to detect text blocks. The network has five convolutional stages based on the 16-layer VGG model. A deconvolution layer is added at the end of each stage. The fusion of the result of each feature map produces the salience map for text detection. Several other neural network architectures have been proposed for text detection, localization, and recognition in images (Arafat and Iqbal 2020; He et al. 2017a, 2017b; Jiang et al. 2017, 2020; Katper et al. 2020; Liao et al. 2018; Liu and Jin 2017; Shi, Bai, and Belongie 2017; Villamizar, Canévet, and Odobez 2020).
Yin et al. (2016) investigated the most relevant approaches to detection, tracking and recognition of texts in videos. High-pass filters were used for text detection in videos (Agnihotri and Dimitrova 1999). Other works used the coefficients of the Discrete Cosine Transform (DCT) in the Moving Picture Experts Group (MPEG) domain, in order to perform text detection in compressed videos with low computational cost (Zhang and Chua 2000; Zhong, Zhang, and Jain 2000). Khare, Shivakumara, and Raveendran (2015) proposed a new descriptor for text localization in videos that is invariant to rotation, scaling, font type, and font size. For each frame, the proposed descriptor finds the orientations with the second-order geometric moments. Text candidates are obtained from the analysis of the dominant orientations of the connected components. Candidates with constant speed and uniform direction, verified by optical flow analysis, are finally considered as text.

Text Localization in Videos
Searching for captions on all video frames can be computationally expensive. In order to reduce the cost, the visual rhythm representation (Chun et al. 2002; Concha et al. 2018; Moreira, Menotti, and Pedrini 2017; Pinto et al. 2012, 2015; Souza 2018; Tacon et al. 2019; Torres and Pedrini 2018; Valio, Pedrini, and Leite 2011) can be used to detect frames in which captions are present. Thus, localization techniques can be applied only to the appropriate frames. Chun et al. (2002) proposed the visual rhythm for the frame detection problem. The visual rhythm is constructed by concatenating the vertical, main diagonal and secondary diagonal rhythms. A Prewitt filter is applied to the visual rhythm to highlight the horizontal edges. Captions are detected based on the analysis of certain properties such as duration. Lyu, Song, and Cai (2005) proposed a method for detection, localization and extraction of texts in videos. Text detection is done through edge detection and local thresholding. Text localization is performed through a coarse-to-fine analysis. Text extraction is done by adaptive thresholding, followed by labeling and padding steps. Lee et al. (2007) detected captions in videos based on the assumption that a caption persists over many consecutive frames. Initially, they identify the frames in which new captions start with a strategy based on decreasing the frame rate. Twelve wavelet features are extracted from the region, which are used as input to a classifier that determines whether that region refers to a caption. Valio, Pedrini, and Leite (2011) addressed the problem of caption detection with rotation invariance through visual rhythms. Initially, the visual rhythm is calculated and segmented. Captions are then determined based on some predefined rules. Visual rhythms are calculated in a zigzag scheme. Different scales were considered, which demonstrated a trade-off between efficiency and efficacy.
Zedan, Elsayed, and Emary (2016) proposed a method for caption detection and localization in videos. The method is based on edge features and the integration of multiple frames. Initially, the edges are computed using the Canny method (Canny 1987), only in the lower third (1/3) of the frame. Horizontal lines are detected through the number of edge pixels in each row of the frame. The caption locations are determined from an analysis of these horizontal lines. Finally, the frames are clustered. From the clustering, the authors classified the captions as static or as scrolling horizontally or vertically. Chen and Su (2018) performed caption localization in videos using visual rhythms. Vertical and horizontal rhythms are extracted from the video. Then, vertical lines in the horizontal rhythm and horizontal lines in the vertical rhythm are found using the Sobel filter and the Hough transform. Vertical and horizontal rhythms are extracted at different positions in order to obtain a rhythm that contains a barcode pattern. Localization is also estimated from visual rhythms. Vertical and horizontal projection techniques are finally used to refine the location of captions. Sravani, Maheswararao, and Murthy (2021) extracted video text using a hybrid method of MSER through morphological filtering. A 2D discrete wavelet transform was employed to remove noise from the background and to enhance the text contrast. The method is also combined with traditional text extraction approaches based on edges and connected components to produce better results.
Valery and Jean (2020) developed a method for detecting and localizing embedded subtitles in video streams based on the search for static regions in the video frames. Connected components are extracted from the background via a Gaussian mixture model (GMM), generating binary image masks. Heuristic rules are used to identify the subtitles in the video stream.
Most approaches available in the literature, either in videos or images, localize the scene texts and not caption texts. In this sense, the method proposed in this work can be used to extend such methods to the problem of caption localization in videos. This aspect is interesting due to the scarce development of methods specific to caption texts.
Approaches that address video captioning tasks do not employ a standardized, large-numbered dataset, making it difficult to compare the methods. The dataset proposed in this work can be used to compare the results of the methods to solve these problems. The vast majority of approaches in the literature use their own sets of videos and do not make them publicly available, so it is not possible to compare different methods in the same videos in these cases. For example, Zhang and Chua (2000) employed six television news videos, while Chun et al. (2002) used three long videos (7-59 minutes). Chen and Su (2018) evaluated their method with a set of five videos. Valery and Jean (2020) presented only qualitative results in a small set of videos.

Caption Localization
The method proposed in this work for caption localization applies a scene text localization approach as a baseline, followed by a visual rhythm analysis, which uses the spatio-temporal information of rhythms to maintain only the captions from all scene texts. We use this approach due to the large number of literature methods for the localization of scene texts. The visual rhythm was employed because of its ability to summarize the spatio-temporal information of a video into a single image (or few images), allowing the employment of classic image processing techniques. Figure 1 presents a diagram with an overview of our method, which has the captioned video as input. Each of the frames has its captions located separately. Then, the horizontal and vertical visual rhythms are constructed from the caption localization masks. Visual rhythms are processed to correct the estimation of caption locations. Finally, we reconstruct the masks across the regions defined by the visual rhythms. The following subsections explain each step of the method.
The core of the Frame Text Localization step can be done by any third-party text localization method (in our case, the method proposed by Epshtein, Ofek, and Wexler (2010)), which can use any technique, such as deep learning. In turn, the other steps use a deterministic algorithm based on classical image processing techniques without any learning-based approach, which tends to have a lower computational cost, in addition to not requiring a training stage.

Frame Caption Localization
Initially, we localized the text of each frame independently. Several methods in the literature, such as those discussed in Section 2, could be used in this step.
In this work, we apply the method proposed by Epshtein, Ofek, and Wexler (2010), motivated by the quality of the results obtained with this method and the availability of its code. Figure 2a presents an example of caption localization obtained through this method. In our context, we are interested in the regions surrounding the caption. Since the baseline method generates a segmentation result, we perform a postprocessing step that obtains the caption localization from the segmentation. In addition, since caption texts are often present either at the top or bottom of the frames, we assume that they are not in the central region and we keep only the top and bottom estimates. These estimates are used to generate the mask illustrated in Figure 2b. To do so, a new image of the same size is initialized with zeros. For each row of the localization mask, we identify the first and the last non-black pixels in the corresponding row of the segmentation image. Every pixel between these two (including them) becomes a white pixel in the mask image. Although we use a specific method for this step, any other method that finds text in images could be applied. We conjecture that the final performance may be superior when a better method is used to find text in the frames. Nevertheless, the focus of the method proposed in this work is described in the following subsections, where visual rhythms are employed to refine the localization from temporal information.
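The row-wise mask construction described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `segmentation_to_mask` and the `border_divisor` parameter (which decides how many rows count as "top" and "bottom") are assumptions, since the text does not state how large the discarded central region is.

```python
def segmentation_to_mask(segmentation, border_divisor=3):
    """Turn a text segmentation (rows of 0 / nonzero values) into a 0/1 mask.

    ASSUMPTION: only the top and bottom height // border_divisor rows are
    kept; the size of the ignored central region is a hypothetical choice.
    """
    height = len(segmentation)
    width = len(segmentation[0]) if height else 0
    mask = [[0] * width for _ in range(height)]  # new image initialised with zeros
    band = height // border_divisor
    for r, row in enumerate(segmentation):
        if band <= r < height - band:
            continue  # central region: captions are assumed not to be here
        text_cols = [c for c, value in enumerate(row) if value != 0]
        if text_cols:
            # Fill everything between the first and last non-black pixel.
            for c in range(text_cols[0], text_cols[-1] + 1):
                mask[r][c] = 1
    return mask
```

Filling whole row spans rather than keeping individual strokes is what turns a segmentation into a localization: gaps between letters and words disappear, so each caption row becomes one continuous run.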

Visual Rhythm Construction
After computing the localization mask for all frames, we built the visual rhythms in the vertical and horizontal directions. A visual rhythm is an image that summarizes information from a video. Inspired by the idea of building visual rhythms from the average of rows and columns (Souza and Pedrini 2020), we use the standard deviation of rows and columns.
Visual rhythms are formally defined as follows. Let a video be defined as the set $V = \{F_1, F_2, \ldots, F_t\}$ of $t$ frames, and let $T(F_i) = S_i$ be an arbitrary operation that maps each frame $F_i$ into an $n \times 1$ column vector $S_i$. A visual rhythm (VR) is defined as the $n \times t$ image given by:

$$VR = [\, T(F_1) \;\; T(F_2) \;\; \cdots \;\; T(F_t) \,] = [\, S_1 \;\; S_2 \;\; \cdots \;\; S_t \,] \quad (1)$$

In our case, the operation $T(F_i)$ is the standard deviation. To compose the i-th vertical rhythm column, the standard deviation of each row of the mask is calculated for the i-th frame of the video. Similarly, to construct the i-th column of the horizontal rhythm, the standard deviation of each column of the i-th frame mask is calculated. By concatenating the information from each frame into columns, the vertical rhythm will have dimensions $H \times N$, whereas the horizontal rhythm will have dimensions $W \times N$, where H and W, respectively, correspond to the frame height and width, and N is the number of frames in the video. The standard deviation was chosen empirically, as we noticed that the average would have problems since the caption usually has a very small size compared to the frame size, which becomes worse in extreme cases. Since the input frame has only values 0 and 1, the standard deviation will give us the information we need: which rows and columns have the most variation. Moreover, horizontal and vertical directions are more suitable for our context (for instance, in comparison with the zigzag scheme (Valio, Pedrini, and Leite 2011)) since they match the usual caption direction. Figure 3 presents examples of visual rhythms, where captions appear as rectangles. The rectangles are larger in the horizontal visual rhythm because the horizontal dimension of the caption is typically larger than its vertical dimension. We also observed, especially in the horizontal visual rhythm, that the rectangles have discontinuities and are not regular. This is due to text detection failures, which we intend to correct in the next step.
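As an illustration of the construction above, the following sketch builds both rhythms from a list of per-frame masks. It is only a sketch under stated assumptions: the function name `build_visual_rhythms` is hypothetical, and the population standard deviation (`statistics.pstdev`) is assumed because the text does not say which estimator is used.

```python
from statistics import pstdev


def build_visual_rhythms(masks):
    """Build the vertical (H x N) and horizontal (W x N) rhythms.

    masks: list of N frames, each an H x W list of 0/1 values.
    ASSUMPTION: population standard deviation is used as T(F_i).
    """
    n = len(masks)
    h = len(masks[0])
    w = len(masks[0][0])
    vertical = [[0.0] * n for _ in range(h)]
    horizontal = [[0.0] * n for _ in range(w)]
    for i, mask in enumerate(masks):
        for r in range(h):
            # i-th column of the vertical rhythm: std of each mask row.
            vertical[r][i] = pstdev(mask[r])
        for c in range(w):
            # i-th column of the horizontal rhythm: std of each mask column.
            horizontal[c][i] = pstdev([mask[r][c] for r in range(h)])
    return vertical, horizontal
```

A row that is entirely background (or entirely text) has zero deviation, while a row that a caption crosses only partially has a positive value, which is the property the text relies on.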

Visual Rhythm Processing
Errors that occur in frame-by-frame text localization can be corrected by incorporating temporal information. From the visual rhythms, we can analyze the temporal information regarding the location of captions through the analysis of adjacent columns. Since we assume that the caption remains fixed in the same position for a certain period of time, it is expected that the corresponding columns should contain uniform rectangles. Thus, in this step we use image processing techniques to make the noisy rectangles from the previous step more uniform. Figure 4 presents a diagram with the steps of processing visual rhythms. In processing, we consider the binary images of the visual rhythms. To obtain such images, we define as positive any pixel above a constant threshold T = 10, empirically chosen.
We preprocess the vertical rhythm based on the horizontal projection as shown in Algorithm 1, in which we determine which rows of the vertical visual rhythm will be kept according to an analysis of the projection. In summary, we try to keep consecutive rows with positive pixels (white pixels). This is done in order to calculate the vertical position of the captions, which can be either top or bottom. This preprocessing is done only on the vertical rhythm since the vertical positions of the captions in the frames do not change significantly throughout the video. In some real-world scenarios, captions may appear at both the top and the bottom, such as in sound effect captions. However, this does not appear in our dataset, and we consider it beyond the scope of our work. Since the caption is predominantly present in the same frame region, false positives can be filtered by the frequency with which estimates appear in a given region. Thus, we calculate the horizontal projection of the vertical rhythm, which consists of a histogram, where the i-th value is the sum of the pixels of the i-th row. From this histogram, we define a range of rows that represent the location of the captions, so that rhythm rows that fall outside this range are set to null. The desired range of the histogram is one that has high values. These values can be represented by one or more nearby peaks. Thus, we calculate the sequence Δ of differences between consecutive histogram values, where Hist_i is the i-th value of the histogram. To avoid problems with border values, a zero value is added at the beginning and end of Hist before this calculation.
The histogram rows we are looking for are defined as a range based on an analysis of the Δ values. We assign the first index i as the beginning of the range, where $\Delta_{i+1}$ is the first value greater than the $p_{max}$ parameter, and assign the last index j as the end of the range, where $\Delta_{j-1}$ is the last value that is less than the $p_{min}$ parameter. To define the values of $p_{max}$ and $p_{min}$, we consider the initial values, chosen empirically, respectively as 0.45 and −0.38. The final values are chosen separately, checking if there is at least one value in Δ that satisfies these restrictions. If no such value exists for the $p_{max}$ parameter, it is decremented by 0.01, while the $p_{min}$ parameter is incremented by the same value.
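The range selection can be sketched as below. Several details are assumptions made only for illustration, not taken from the text: the histogram is normalised by its peak so that the thresholds 0.45 and −0.38 are meaningful, each difference is taken as a value minus its predecessor in the zero-padded histogram, and the index offsets are chosen so that the returned range covers the rows between the first sharp rise and the last sharp fall. The function name `caption_row_range` is hypothetical.

```python
def caption_row_range(binary_rhythm, p_max=0.45, p_min=-0.38, step=0.01):
    """Return (first_row, last_row) of the caption band, or None if empty.

    binary_rhythm: H x N list of 0/1 values (binarised vertical rhythm).
    ASSUMPTIONS: peak normalisation of the histogram and the exact index
    convention of the differences are illustrative choices.
    """
    hist = [sum(row) for row in binary_rhythm]  # horizontal projection
    if not hist or max(hist) == 0:
        return None
    peak = max(hist)
    padded = [0.0] + [value / peak for value in hist] + [0.0]
    # delta[k] is the change when entering row k (delta[H] leaves the last row).
    delta = [padded[k + 1] - padded[k] for k in range(len(padded) - 1)]
    # Relax each threshold separately until at least one difference satisfies it.
    while not any(d > p_max for d in delta):
        p_max -= step
    while not any(d < p_min for d in delta):
        p_min += step
    first_row = next(k for k, d in enumerate(delta) if d > p_max)
    last_row = max(k for k, d in enumerate(delta) if d < p_min) - 1
    return first_row, last_row
```

Rows outside the returned range would then be set to zero in the vertical rhythm. Because the padded histogram starts and ends at zero and reaches one at its peak, both relaxation loops are guaranteed to stop.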
We compute the connected components of the horizontal rhythm and of the preprocessed vertical rhythm. We consider that two pixels belong to the same component if they are connected by an 8-neighborhood and are both positive. Only components that have a minimum width are considered, that is, captions that are in a minimum number of consecutive frames. Since a caption can end in one frame and another start in the next frame, two or more captions can be connected in the same component. Figure 5a provides an example of this case. Thus, for each of the detected components, we analyze and separate their subcomponents. It is possible to observe that the components are separated by entirely black columns indicating non-captioned frames.
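A minimal sketch of the component extraction described above is given next, using a breadth-first search with 8-neighbourhood connectivity and a minimum-width filter. The function name `components_with_min_width` and the `min_width` parameter are illustrative; the text does not state the minimum number of consecutive frames that is required.

```python
from collections import deque


def components_with_min_width(binary, min_width):
    """Return the 8-connected components spanning at least min_width columns.

    binary: 2-D list of 0/1 values (a binarised visual rhythm).
    Each component is returned as a list of (row, column) pixels.
    """
    h = len(binary)
    w = len(binary[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    kept = []
    for r0 in range(h):
        for c0 in range(w):
            if not binary[r0][c0] or seen[r0][c0]:
                continue
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            pixels = []
            while queue:
                r, c = queue.popleft()
                pixels.append((r, c))
                # Visit the eight neighbours (the centre is already marked seen).
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if (0 <= rr < h and 0 <= cc < w
                                and binary[rr][cc] and not seen[rr][cc]):
                            seen[rr][cc] = True
                            queue.append((rr, cc))
            cols = [c for _, c in pixels]
            # Width in columns equals the number of frames the component spans.
            if max(cols) - min(cols) + 1 >= min_width:
                kept.append(pixels)
    return kept
```

Since each rhythm column corresponds to one frame, filtering by width discards detections that last only a few frames, which are unlikely to be captions.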
We perform the subcomponent analysis by comparing the component columns. We start with an empty column set, $C = \emptyset$. To consider adding a new column to C, we find the first and last pixels with a positive value in that column. We call these pixels boundary pixels (superior and inferior). The current column c is added to the set C if at least one of the following conditions is satisfied.
(1) for at least one of the two boundary pixels, the difference between (a) its position in the column c and (b) the average position of the corresponding boundary pixel over all columns currently present in the set C does not exceed a threshold value. In this work, the threshold was empirically chosen as 10 for the horizontal rhythm and 5 for the vertical rhythm;
(2) at least 70% of the next k = 5 columns have at least one of the two boundary pixels close enough to the average to be part of the set C;
(3) c is the last column of this component.
When none of the previous conditions is valid, the end of a subcomponent is determined. This subcomponent starts at the first and ends at the last column added to C. Assuming that the frame text localization method generally gets the right result, the upper and lower bounds given by the first and last rows in which positive pixels exist are computed, respectively, as the mode of the first and last rows of the columns in the set C. Then, if the current column is not the last column of the component, it is added to the set C and the process continues to the last column. The result of processing a component from its subcomponents can be seen in Figure 5b, where we can observe that the irregularities present in the component have been corrected.
Irregularities within a component, as shown in Figure 5a, mean that the location of a specific caption changed over time. However, this contradicts our premise that each caption that appears throughout the video is fixed in the same position and with the same delimitation. If two consecutive subcomponents have close values for both the first and last rows, they are joined so that all columns between them are part of a single component. Finally, to reduce the number of possible failures, we check the correspondence of the localized components of the two rhythms, so as to maintain only the intersections of the components of both rhythms. Figure 6 presents the visual rhythms obtained at the end of the process.

Video Caption Localization
As we use the rhythms in the vertical and horizontal directions, we can retrieve the caption localization from the positive pixel coordinates. The positive pixels in the i-th column of the vertical rhythm indicate the rows of the i-th frame that should be positive. That is, if the j-th pixel of the i-th column is active in the vertical rhythm, the j-th row of the mask of the i-th frame must become active. Similarly, positive horizontal rhythm pixels indicate which columns should be active in the mask. Masks are constructed so that a given pixel is active only if its row and column are active.
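The reconstruction rule above amounts to a per-frame logical AND between active rows and active columns, which can be sketched as follows. The function name `masks_from_rhythms` is hypothetical, and the processed rhythms are assumed to be stored as row-major lists in which any positive value counts as active.

```python
def masks_from_rhythms(vertical, horizontal):
    """Rebuild one H x W mask per frame from the processed rhythms.

    vertical:   H x N list; entry [j][i] > 0 means row j is active in frame i.
    horizontal: W x N list; entry [j][i] > 0 means column j is active in frame i.
    """
    h, w = len(vertical), len(horizontal)
    n = len(vertical[0]) if h else 0
    masks = []
    for i in range(n):
        active_rows = [vertical[r][i] > 0 for r in range(h)]
        active_cols = [horizontal[c][i] > 0 for c in range(w)]
        # A pixel is active only if both its row and its column are active.
        masks.append([
            [1 if active_rows[r] and active_cols[c] else 0 for c in range(w)]
            for r in range(h)
        ])
    return masks
```

Note that this rule always yields rectangular regions per frame, which is consistent with the bounding-box form of the localization ground truth.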

Dataset
There is a lack of public data, both in terms of quantity and quality, to evaluate related literature approaches. Thus, we propose a dataset that contains videos with caption and the necessary information for its detection, localization, segmentation, script recognition and text recognition.
To create the dataset, seventeen videos were collected from YouTube. Three criteria were established for a video to be inserted into the dataset: (i) be marked with a Creative Commons license, which gives anyone the right to reuse and edit the video; (ii) have a separate caption in raw text, provided as a subtitle file in the .srt format; (iii) have little or no text already embedded in the video.
Preprocessing was done to deal with cases where the video had a few frames with embedded text, leaving only frames without embedded text. Thus, it was possible to create a dataset in which the location of the captions is known for all frames of the video. Table 1 presents information for each of the collected videos. It is possible to observe that the videos were recorded at different frame rates (frames per second, FPS), at high resolution and with different durations. Figure 7 presents a frame of each of the videos. Videos #1, #7, #8, #10 and #12 feature plenty of camera shake and cuts in different scenarios, such as music and sports clips. Videos #2, #3, #13, #14, #15, #16 and #17 have either a fixed camera or slow motion. Videos #4 and #5 are animations. Some other scenarios were also considered, such as indoor environments in videos #6 and #11 and slightly darker surroundings as in video #9.
To add the captions to the videos, different characteristics were considered, namely: position, size, color and script. Figure 8 illustrates the process of inserting captions and creating ground truth. Each video and script pair results in a captioned video. As thirteen scripts were considered, 221 captioned videos were obtained.
For each captioned video, the ffmpeg tool was used to insert the caption into the video. Position, color, and size of captions are randomly chosen, all with equal probability, within predefined options. The settings were selected to include the most common conditions found in real cases. Captions were positioned either below or above in the video frames. Font sizes were chosen from small, medium and large, respectively, 12, 24 and 36. White, yellow, and red colors were considered for captions.

After inserting captions on videos that did not have embedded text, we can calculate the mask for caption segmentation from the differences between the frames of the original video I and the frames with the added text J, expressed as

$$D(i, j, t) = I(i, j, t) - J(i, j, t) \quad (3)$$

where t indexes video frames over time. In addition, i and j index, respectively, the rows and columns of the frame. Pixel $(i, j)$ is active in the segmentation mask of the t-th frame if, and only if, $D(i, j, t)$ is nonzero. If there is at least one nonzero value in $D(t)$, there will be a caption in the t-th frame. By analyzing the rows and columns of D where there is at least one nonzero value, we can calculate the rectangle surrounding the segmentation mask, obtaining the location mask.
As mentioned previously, thirteen different scripts were considered. For this, each of the captions, originally in English, was translated into each of the scripts presented in Table 2. The scripts adopted are of different types and emerged at different times. For each script, we consider its official language. In the case of Roman, the English language was used. Figure 9 illustrates the same sentence in different scripts. The scripts used have substantial variation, which can hamper the development of a system that supports them all. In addition, there is a great similarity between some of the scripts, especially those of the same type, which can make their automatic recognition a hard task. Figure 10 shows statistics about the captions added to the videos. The graphics illustrate the number of videos in which captions were added with a certain color, size or position, relative to the seventeen videos collected (Figure 10(a, b, c)) or the thirteen scripts (Figure 10(d, e, f)). From these statistics, it is possible to observe the dataset profile. For example, for the Urdu script, the added texts were distributed almost equally over position and color. However, few large texts (36) were added.
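The ground-truth derivation from Equation (3) can be sketched as below. This is an illustration only: the function name `ground_truth_masks` is hypothetical, frames are represented as nested lists of pixel values, and pixels are compared for exact equality, which assumes that the captioned frame differs from the original only where caption pixels were drawn.

```python
def ground_truth_masks(original, captioned):
    """Derive segmentation and location masks for a single frame.

    original, captioned: H x W lists of pixel values (ints or colour tuples).
    Returns (segmentation, location, has_caption).
    ASSUMPTION: exact pixel comparison; any unrelated difference between
    the two frames (for example from re-encoding) would need extra handling.
    """
    h = len(original)
    w = len(original[0])
    # A pixel is active if and only if D(i, j, t) is nonzero.
    segmentation = [
        [1 if original[r][c] != captioned[r][c] else 0 for c in range(w)]
        for r in range(h)
    ]
    rows = [r for r in range(h) if any(segmentation[r])]
    cols = [c for c in range(w) if any(segmentation[r][c] for r in range(h))]
    location = [[0] * w for _ in range(h)]
    if rows:
        # Rectangle surrounding every changed pixel.
        for r in range(rows[0], rows[-1] + 1):
            for c in range(cols[0], cols[-1] + 1):
                location[r][c] = 1
    return segmentation, location, bool(rows)
```

The boolean result corresponds to the detection ground truth, the first mask to segmentation, and the second to localization, so one pass over a frame pair supplies the annotation for three of the tasks.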
Figure 10(g) shows the number of frames with and without text per video, where it is possible to see which videos have text on most of their frames. Video #6 has the largest number of frames without text, whereas video #4 is the video with the highest percentage of frames with text.

Evaluation Metrics
Although we propose a method for caption localization without an earlier step for explicit frame caption detection, we evaluated in which video frames a caption was localized and compared the result with detection methods available in the literature using the proposed dataset. Let TP, FP and FN be true positive, false positive, and false negative, respectively. In the detection problem, positive indicates a frame with a caption. Precision and recall metrics are used for caption detection evaluation and can be described, respectively, as

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

Precision evaluates the number of frames that actually have captions over all the frames estimated to have captions. A lower precision value indicates that more of the frames estimated as captioned are incorrect. Recall evaluates the number of frames that are estimated to have captions, within all frames that have captions. A lower recall value indicates that fewer of the captioned frames were detected.
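For concreteness, the two detection metrics can be computed from per-frame boolean labels as in the sketch below. The function name `precision_recall` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the text does not specify.

```python
def precision_recall(predicted, actual):
    """Compute frame-level precision and recall.

    predicted, actual: equally long lists of booleans, True meaning that
    the frame is (estimated to be) captioned.
    ASSUMPTION: a metric with a zero denominator is reported as 0.0.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For example, a detector that flags all four frames of a video in which only two are captioned obtains a precision of 0.5 and a recall of 1.0.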
The ratio between the intersection area and the union area of the estimated caption location and the mask can be used for evaluating the caption localization, expressed as
$$\text{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where A is the estimate and B the mask of the caption location. Values close to zero are obtained when the estimate differs from the mask, either by size or location, whereas values close to one occur when the estimate and mask are similar. From the intersection over the union, we define the accuracy used in localization as where N is the total number of frames in the video, and
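The intersection over union between an estimate and a ground-truth mask can be sketched as follows. This is a minimal example assuming that both are binary masks of the same size stored as NumPy arrays. The function name and the empty_score parameter, which sets the value returned when both masks are empty, are hypothetical choices for this illustration.

```python
import numpy as np

def intersection_over_union(estimate, mask, empty_score=1.0):
    """Ratio between intersection area and union area of two binary masks."""
    a = np.asarray(estimate, dtype=bool)
    b = np.asarray(mask, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Assumption: the ratio is undefined when both masks are empty,
        # so a caller-chosen value is returned instead.
        return empty_score
    return float(np.logical_and(a, b).sum() / union)

# Toy example: two 4x6 rectangles shifted vertically by two rows.
a = np.zeros((10, 10), dtype=bool)
a[2:6, 2:8] = True
b = np.zeros((10, 10), dtype=bool)
b[4:8, 2:8] = True
iou = intersection_over_union(a, b)
```

The two rectangles share 12 pixels and together cover 36, so the ratio is 1/3. Identical masks give a value of one and disjoint masks give zero, in agreement with the behavior described above.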

Experiments and Discussions
This section presents the experiments conducted on our dataset with the method proposed in this work and with different literature approaches. Subsection 5.1 presents the results obtained for the detection of frames with captions. Subsection 5.2 describes and discusses the results obtained for the caption localization task. We compared the results of our method to those of the methods proposed by Valio, Pedrini, and Leite (2011) and Epshtein, Ofek, and Wexler (2010). The method proposed by Valio, Pedrini, and Leite (2011) performs caption detection, while the one proposed by Epshtein, Ofek, and Wexler (2010) performs text localization. The experiments presented in Subsection 5.1 compare our method with both. These experiments are preliminary in relation to those presented in Subsection 5.2 and show an aspect in which our method can be improved in future work. These methods were chosen due to their good results and code availability. It is important to highlight the difficulty in making a comparison with other methods in the literature because, in addition to the lack of available code, the datasets were not, until this work, standardized for these tasks, so that each work used its own small set of videos.

Caption Frame Detection
The following experiments present the results of frame detection methods on the proposed dataset, and also verify and compare the results obtained by determining in which frames the localization methods found a caption. To do so, we analyzed the information of the generated masks, in such a way that it was possible to make a comparison with the method proposed by Valio, Pedrini, and Leite (2011). The results obtained with our method and with the approach developed by Epshtein, Ofek, and Wexler (2010) are calculated by assigning true when the sum of the caption mask is greater than zero, and false otherwise. The experiments of this subsection refer only to the detection of frames with captions, which is important to eventually determine the methods' drawbacks. By improving the quality of the methods at this step, the final results (localization) also tend to be improved. Table 3 presents the average results obtained for each source video. For the method proposed by Valio, Pedrini, and Leite (2011), the recall values obtained are above 90% for almost all videos, which shows that this method rarely erroneously disregards frames with a caption. In terms of precision, most videos have values above 90%. However, videos #4, #6 and #10 have low values, with 49.4% in the worst case.
The results of the method proposed by Epshtein, Ofek, and Wexler (2010) show a better balance between recall and precision, with recall values above 90% for almost all videos, and precision below 80% only for video #10. For our method, we can observe that the precision values were above 90% for almost all videos, outperforming the low results of the method proposed by Valio, Pedrini, and Leite (2011). The precision value for video #10 was also higher than the value of the method proposed by Epshtein, Ofek, and Wexler (2010). On the other hand, recall values are lower, especially for videos #5, #6, #7 and #11. This shows that our method may have trouble finding captions in some frames, but it almost never erroneously finds captions in frames that do not have them. This result also indicates that our method removes true-positive frames from the estimation, which affects its final results. The poor performance of all methods on video #10 may possibly be explained by its high rate of smaller, yellow captions located mainly at the top, where the colors of objects in the scene, such as yellowed dry leaves, can cause the methods to confuse captions with the background.
Table 3. Results obtained for detecting frames with captions for the different videos in the dataset.
Table 4 presents the average results calculated for each script. For the method proposed by Valio, Pedrini, and Leite (2011), we can observe that the values obtained for the different scripts are very close for all the metrics. This was expected since this method considers information independent of the characters used, which shows script invariance. Similar results can be seen for different caption font characteristics, presented in Table 5.
The method proposed by Epshtein, Ofek, and Wexler (2010) shows superior results for precision, with recall above 90%. For our method, high precision was obtained, but for some scripts, such as Arabic, Bangla, and Telugu, we achieved lower recall values. Similarly to the results obtained by Valio, Pedrini, and Leite (2011) and Epshtein, Ofek, and Wexler (2010), the values obtained with our method showed invariance to different characteristics of the caption font. These results can be seen in Table 5.
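The rule used in this subsection to turn localization masks into frame-level detections (true when the sum of the caption mask is greater than zero, false otherwise) can be sketched as below. The code assumes that the masks of a video are stacked in a NumPy array of shape (T, H, W), and the function name frames_with_caption is hypothetical.

```python
import numpy as np

def frames_with_caption(masks):
    """Label each frame True when its caption mask has a positive sum.

    masks: array of shape (T, H, W) with nonnegative entries.
    """
    masks = np.asarray(masks)
    # Flatten each frame and sum its mask values.
    per_frame_sum = masks.reshape(masks.shape[0], -1).sum(axis=1)
    return per_frame_sum > 0

# Toy example: four 5x5 masks, with captions in the second and fourth frames.
masks = np.zeros((4, 5, 5), dtype=np.uint8)
masks[1, 3:5, 0:4] = 1
masks[3, 0, 0] = 1
labels = frames_with_caption(masks)
```

The resulting labels can then be compared with the ground-truth labels of each frame to count true positives, false positives and false negatives.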

Caption Localization in Frames
In the following experiments, we present the final results for caption localization. Table 6 presents the average accuracy for each video for our method and for the method based on the stroke width operator proposed by Epshtein, Ofek, and Wexler (2010). Additionally, the accuracy rate was also computed only in frames where there was a caption according to the ground truth (correct frames). This was done so that we could also evaluate the results with a lower impact of errors on non-caption frames, isolating the assessment of localization errors from false-positive detections. In this case, we do not modify our method to take into account only the correct frames, but only calculate the evaluation metrics in those frames. For twelve of the seventeen videos, our method had better accuracy, with the most noticeable improvement for videos #2, #6 and #9. On the other hand, the results for the other five videos were slightly lower, with video #14 standing out negatively.
For results considering only correct frames, there was an improvement for all videos, except video #14, with the most evident improvement on videos #2, #6, #7 and #9. These results show that our method achieved better results than the method proposed by Epshtein, Ofek, and Wexler (2010) mainly due to the improved localization of captions in frames that had already been detected. However, the final result is not as good as it could be, because our method erroneously disregarded some true positives, as in video #7, which has a low recall on detection, as shown in Table 3. Table 7 presents the average accuracy in caption localization for each script. Our method obtained better results than those achieved by Epshtein, Ofek, and Wexler (2010) for all scripts, especially Arabic, Bangla, Malayalam, Persian, and Tamil, where their method obtained low accuracy, probably due to the character formats of these scripts. Since our method considers temporal information, without analyzing the character format, we possibly solved problems that occurred in the caption localization due to the nature of the script. Table 8 shows the average accuracy in caption localization for different font characteristics. Regarding the font size, our method was superior for larger fonts, maintaining good results for videos with smaller fonts, while both methods were invariant to the position and color of the captions. From the results obtained, we believe that the method by Epshtein, Ofek, and Wexler (2010) had problems in dealing with very large fonts, which is reflected in the better performance of our method. Figure 11 presents examples of results for frames where scene texts and captions are shown together. Overall, our method performed a good caption localization (despite some misalignments) and ignored scene texts. In the first case, the scene text is present in the center of the frame, a region that is disregarded by our method.
In the second example, the scene text appears on the opposite side of the captions. The success of our method, in this case, is associated with considering only captions in the same region of the frame, and the scene text is eliminated through temporal analysis. The third example also uses temporal analysis to succeed. In this case, the caption and scene texts are presented in the same region and the method separated them. Finally, the last example shows that our method is also capable of disregarding large scene texts.

Conclusions
The contributions of this paper are twofold. We presented a novel method for localizing video captions using visual rhythms for temporal analysis of frame-by-frame location estimates. A new dataset was created with seventeen initial videos. For each video, we added captions with thirteen distinct scripts to obtain a total of two hundred and twenty-one videos with ground truth.
Experiments were performed on the proposed dataset to compare our results against two other methods. It is worth mentioning that a more extensive comparison would be a very difficult task due to the lack of available code and annotated public datasets. In this sense, the dataset built in this work may provide an opportunity for other authors to evaluate their methods. Experimental results demonstrated that our method can considerably improve results for most videos, with the improvement being evident in some specific videos. However, our method could still be improved in order to reduce the number of captioned frames that it erroneously disregards (false negatives), which limit the obtained results.
As directions for future work, we intend to evaluate the results of our method considering frame-by-frame localization made by a deep learning approach, such as convolutional neural networks. We will also extend the comparative analysis with other methods from the literature on the dataset proposed in this paper.
Note 1. https://github.com/aperrau/DetectText/