Parsing AUC Result-Figures in Machine Learning Specific Scholarly Documents for Semantically-enriched Summarization

ABSTRACT Machine learning specific scholarly full-text documents contain a number of result-figures expressing valuable data, including experimental results, evaluations, and cross-model comparisons. Scholarly search systems often overlook this vital information while indexing important terms using conventional text-based content extraction approaches. In this paper, we propose creating semantically enriched document summaries by extracting meaningful data from the result-figures specific to the evaluation metric of the area under the curve (AUC) and their associated captions from full-text documents. First, we classify the extracted figures, using a convolutional neural network classification model with a ResNet-50 pre-trained on 1.2 million images from ImageNet, and analyze them by parsing the figure text, legends, and data plots. Next, we extract information from the result-figures specific to AUC by approximating the region under the function's graph as a series of trapezoids and calculating their area, i.e., the trapezoidal rule. Using over 12,000 figures extracted from 1000 scholarly documents, we show that figure specialized summaries contain more enriched terms about figure semantics. Furthermore, we empirically show that the trapezoidal rule can calculate the area under the curve by dividing the curve into multiple intervals. Finally, we measure the quality of specialized summaries using ROUGE, Edit Distance, and Jaccard Similarity metrics. Overall, we observed that figure specialized summaries are more comprehensive and semantically enriched. The applications of our research are numerous, including improved document searching, figure searching, and figure-focused plagiarism detection. The data and code used in this paper can be accessed at the following URL: https://github.com/slab-itu/fig-ir/.


Introduction
In recent years, the number of scientific publications in digital libraries has increased tremendously, making it challenging for end-users to access and search for relevant content using traditional Information Retrieval (IR) techniques (Safder, Hassan, and Aljohani 2018; Said et al. 2019; Iqbal et al. 2021). Digital libraries such as Google Scholar, CiteSeerx, Scopus, Web of Science and DBLP are the prominent places to search for scholarly documents (Mutlu, Sezer, and Akcayol 2019; Zhu et al. 2014). The number of documents in these repositories is enormous. For example, Google Scholar has reached up to 170-175 million documents. Thus, it has become critical to mine this ever-increasing data and understand the semantics of scholarly documents (Rahi et al. 2019). Scholarly full-text documents primarily consist of textual and non-textual contents. Traditional IR systems are generally limited to the textual contents of scholarly documents and fail to exploit non-textual contents when answering user queries (Safder et al. 2020). Therefore, understanding the different non-textual contents (figures, tables, and graphs) and developing a relationship between them is essential to cater to user search needs.
Notably, scholarly articles contain many tables and figures that are enriched with helpful knowledge. A recent study, conducted over 10,000 articles randomly selected from top computer science conference proceedings from 2004 to 2014, shows that over 70% of publications contained at least one figure, around 43% had one table, and almost 36% contained at least one table and one figure (Choudhury, Wang, and Giles 2015). Additionally, researchers use result-figures (bar charts, line graphs) to present their data, compare the findings of different experimental studies, summarize experimental results, etc. Therefore, a knowledge gap will remain if result-figures from scholarly documents are not parsed and analyzed in IR systems to provide efficient results against user queries across scientific disciplines (Hassan et al. 2012). A number of studies have been done on developing IR systems for textual data, but only a few works have used non-textual contents to improve IR results (Hassan et al. 2018b; Hassan et al. 2018a; Safder et al. 2017). A few studies have used figure metadata, such as the text lines near the figures, the text lines mentioning the figures, and the figure captions, from within the textual content of the documents to generate enriched descriptions of the figures (Bhatia and Mitra 2012; Moraes et al. 2014). Additionally, parsing figure contents to understand figure semantics has also received little attention. FigureSeer, established by Siegel et al. (2016), is one of the prominent recent frameworks; it localizes scientific result-figures (line graphs) in scholarly documents, extracts and classifies them, and then parses them to improve IR systems.
However, none of these studies has focused on designing specialized result-figure based summaries for an article by parsing both the semantic contents of result-figures (figure text, plotted data, area under the curve) and figure metadata (figure captions and similar sentences) from the full text of a scholarly document.
We formulate our problem in terms of the following sets. Let $F_d = \{f_1, f_2, f_3, \dots, f_n\}$ be the set of all result-figures present in scholarly documents $D$. Let $C_d = \{c_1, c_2, c_3, \dots, c_n\}$ be the set of all captions against each $F_d$ found in collection $D$, and let $S_d = \{s_1, s_2, s_3, \dots, s_n\}$ be the set of summaries generated for each $d$ in $D$, using $C_d$. Furthermore, we merge figure semantics by parsing $F_d$ and information from $S_d$ together, to form a tuple such that $\{f_n, s_n\}$ is the result-figures specialized summary for a document $d_n$.
In this paper, we present a technique to identify, extract and parse result-figures and to generate specialized summaries that improve the search results of IR systems. The two main contributions of this study are as follows.
The first and foremost contribution of this research is to parse scientific result-figures to extract figure semantics from research documents. Firstly, we extracted result-figures such as bar charts, line charts, and node graphs from PDF documents. Secondly, we parsed these figures to extract textual and plotted data. Furthermore, we designed an approach to compute the value of the area under the curve for the precision-recall (PR) graph plots found in full-text documents. The designed approach shows encouraging results when employed over a collection of result-figures that contains 17,950 figures extracted from 1,000 research papers indexed by Semantic Scholar.
The second contribution of this study is to generate result-figures specialized summaries with the addition of figure semantics extracted from each figure image and figure metadata mined from the full text of a document. Firstly, we captured figure caption sentences against each figure using regular expressions. Secondly, we extracted sentences similar to the caption sentences from the whole document, using Okapi BM25 (Beaulieu et al. 1997).
Furthermore, we combined figure metadata (caption sentences and similar sentences) and figure semantics extracted by our designed figure parsing technique to build enhanced figures specialized summaries for each scholarly document. Lastly, the specialized summaries are evaluated against human-generated summaries using ROUGE-N, ROUGE-L, Jaccard Similarity and Edit Distance metrics (Lin 2004).
The rest of the paper is organized as follows: Section 2 focuses on the previous research that has been carried out in figure parsing and summarization. Section 3 discusses the entire approach, including the steps involved in specialized summary generation. Section 4 discusses the experimental results for figure parsing and also presents metrics for summary evaluation. Finally, Section 5 concludes the paper.

Related work
Many recent studies have focused on mining knowledge from figures and full text to understand text and figure semantics efficiently (Unar et al. 2019; Zhao et al. 2019). We categorize the related work into three sections. The first section presents prominent studies on figure classification and parsing. The second is concerned with generating document summaries using figures and full text. The last section covers the research work on specialized scholarly search systems that have been developed over the years.

Review of figure classification & parsing
Extracting information from figures has gained much attention over the last few years (Qian et al. 2019). A number of studies have been presented to extract figures, classify them and mine figure contents (Saba et al. 2014; Takimoto, Omori, and Kanagawa 2021; Thepade and Chaudhari 2021; Xu et al. 2019). Recently, a machine learning-based, heuristic-independent approach has been designed that extracts figures from PDF documents (Ray Choudhury, Mitra, and Giles 2015). Clark and Divvala (2016) presented PDFFigures, an advanced system that takes a PDF document as input and captures figures along with figure titles, captions, and body text. The technique decomposes pages into different parts such as graphics, figure text, body text, and captions. It then locates figures by analyzing the empty regions within the text and separates figures, tables, and captions from the PDF. Generally, figures found in the scientific literature are often complex and composed of many different subfigures. In order to understand these figures, it is important to separate these subfigures. However, PDFFigures cannot divide figures into relevant subfigures. Later, a data-driven approach was proposed to separate subfigures from PDF documents using deep convolutional neural networks (Tsutsui and Crandall 2017).
Moreover, Choudhury, Wang, and Giles (2015) proposed an architecture that automatically extracts figures and their metadata such as captions, headings, etc. Additionally, the proposed system also utilized a Natural Language Processing (NLP) module to understand the intended content and knowledge from figures and designed a search engine to index the extracted figures and their related metadata. Choudhury, Wang, and Giles (2016) claimed that if an image is extracted by rasterizing (PNG, JPEG) a PDF page, all information is lost. At the same time, all characters in the original image can be restored if images are converted into SVG vectors. Likewise, neural network-based page segmentation techniques are also explored to segregate text blocks, figures, and tables from scholarly pdf documents (Chen et al. 2015;He et al. 2017).
Classification of figures has also become an exciting area of research in the last few years. Choudhury, Wang, and Giles (2015) designed an unsupervised technique for figure classification that outperformed traditional feature learning methods such as SIFT and HoG. Additionally, the authors developed figure parsing modules as sub-components of the system and performed different analyses on colored line graph extraction to highlight the easy and hard cases. Moreover, OverFeat (Sermanet et al., 2013) is an integrated framework that uses a convolutional network to detect, classify and locate objects in images simultaneously. This convolution-based approach localizes and detects the figure by accumulating bounding boxes.
A state-of-the-art approach for figure extraction and parsing is FigureSeer (Siegel et al. 2016) that deployed a deep neural network to classify result-figures into bar charts, line graphs, node diagrams and scatter plots. The system also designed a figure parsing technique for line graphs. Al-Zaidy and Giles (2015) automatically extract text and graphical components from bar charts by using different heuristics and image-processing techniques. A number of machine learning approaches have been proposed for the semantic structuring of charts by computing features from both graphic and text components of graphs (Al-Zaidy and Giles 2017; Siegel et al. 2018).
Among the notable figure extraction systems, ChartSense (Jung et al. 2017) is a semi-automatic system that extracts data from charts. It uses a deep learning approach to classify the type of chart. However, it requires user interaction to complete the extraction task effectively. In contrast, Scatteract (Cliche et al. 2017) is a fully automated system that deals with scatter plots with a linear scale. This system uses deep learning to identify the components of the charts and maps the pixels to the chart coordinates with the help of OCR and robust regression. Its authors also focused on text detection, recognition and data extraction from scatter plots. To calculate the area under a curve, the trapezoidal rule has long been used. The trapezoidal rule approximates the integral of a function by dividing its domain into small intervals, each representing a trapezoid (Tallarida and Murray 1987). Much work has used the trapezoidal rule to calculate the area under curves in different fields, for both discrete and continuous curves. The area under the curve approach is used in medicine, for instance, to calculate the magnitude of pain experienced by patients or to compute the area under the plasma level-time curve. However, to our knowledge, no work has been done to find the area under the curve of parsed result-figures.

Review on document summarization using figure metadata
Generating summaries for documents has been explored extensively for many years (Barros et al. 2019; Liu et al. 2021; Mohamed and Oussalah 2019; Sinoara et al. 2019). However, generating customized summaries for figures by analyzing figure contents and figure metadata has received relatively less focus. Moraes et al. (2014) proposed a system that generates a high-level description for images from the textual contents found near the image in the document. Moreover, generating summaries of document elements such as figures, tables and algorithms may assist users in grasping the critical details about a document element quickly, instead of reading and understanding the whole document. Bhatia and Mitra (2012)

Review on specialized scholarly search systems
The practical significance of searching for the most relevant research articles has motivated the development of advanced scholarly search systems (Li et al. 2013). Recently, Safder and Hassan (2019) presented a prototype system for algorithm metadata searching. The proposed system supports users in searching for an algorithm based on the evaluation results, such as precision, recall and f-measure, reported in the text of a paper, using deep learning-based techniques. AlgorithmSeer (Tuarob et al. 2016) is a customized system to search for algorithms in full-text articles. Its authors designed rule-based and machine learning-based techniques to identify and extract algorithms written in full-text publications. Furthermore, the presented system has been integrated with the CiteSeerx repository. Hassan, Akram, and Haddawy (2017) used textual elements from full text to improve IR results.
TableSeer (Liu et al. 2007) is a specialized tool to identify, extract and index tables from documents. Its authors implemented a custom-made TableRank algorithm to tweak the search results. FigureSeer (Siegel et al. 2016) is a system that identifies, parses, and indexes result-figures. AckSeer (Khabsa, Treeratpituk, and Giles 2012) is an acknowledgment repository that extracts acknowledgments and identifies entities from CiteSeerx data. Furthermore, Choudhury et al. (2013) designed a tailored search engine for figures from chemistry publications.
Despite the availability of these advanced systems, there is a need to design more tailored techniques to search for result-figures in articles, since the non-textual content of a document is as important as the textual content. In particular, result-figures such as graphs contain important information that is often not found in the running text. Unfortunately, this information is ignored, leaving a gap in conveying the idea of the document. Hence, to get the gist out of the figures, it is important to parse these images. Lee, West, and Howe (2017) use techniques from computer vision and machine learning to classify more than 8 million figures from PubMed into five figure types and study the resulting patterns of visual information as they relate to scholarly impact. They found that the distribution of figures and figure types in the literature has remained relatively constant over time but can vary widely across fields and topics. They also found a significant correlation between scientific impact and the use of visual information, where higher impact papers tend to include more diagrams and, to a lesser extent, more plots. Zha et al. (2019) define a new problem called mining algorithm roadmap in scientific publications and then propose a new weakly supervised method to build the roadmap. The algorithm roadmap describes the evolutionary relationships between different algorithms and sketches the ongoing research and the dynamics of the area. It is a tool for analysts and researchers to locate the successors and families of algorithms when analyzing and surveying a research field. They proposed abbreviated words as candidates for algorithms and then used tables as weak supervision to extract these candidates and labels. Next, they proposed a new method called Cross-sentence Attention NeTwork for cOmparative Relation (CANTOR) to extract comparative algorithms from the text.
Finally, they derived order for individual algorithm pairs with time and frequency to construct the algorithm roadmap. Through comprehensive experiments, their proposed algorithm shows its superiority over the baseline methods on the proposed task.

System's architecture
This section explains the details of our dataset and proposed approach. Figure 1 shows the detailed architecture of our designed technique. Firstly, we mined the figure semantics by extracting, classifying and parsing result-figures. Our designed parsing technique extracts curves from line graphs and calculates their area under the curve values. Secondly, we extracted figure metadata from the article. Specifically, we converted the PDF to text and detected the figure captions and similar text lines. Lastly, we combined figure semantics and result-figure metadata to generate a specialized summary.

Data
Our dataset is a subset of the data corpus used by FigureSeer (Siegel et al. 2016), which contained over 22,000 full-text documents from top computer science conferences (CVPR, ICML, ACL, CHI, and AAAI) indexed by Semantic Scholar. From among these, we randomly selected 1000 documents and obtained 12,146 figures belonging to different classes (graph plots, bar charts, flow charts, etc.). We also extracted 5,804 subfigures from these documents. Later, we used a random sample of 1000 line graphs for parsing, which yielded 1272 axes, 2183 legend entries, and plots. Generally, graph plots are the standard form of figures representing experimental results. Once we had extracted the figures, we applied a classification mechanism to find the type of each extracted figure. For this purpose, we deployed an existing state-of-the-art model for figure classification (Siegel et al. 2016). A CNN classification model using ResNet-50 (He et al. 2017) was pre-trained on 1.2 million images from ImageNet (Deng et al. 2009) and fine-tuned for figure classification.

Parsing result-figures
Figure parsing is a complex task, especially when it comes to result-figures, which have strict requirements: minor variations while parsing plot data can change the results and affect the overall output. Moreover, figures are created with different design structures and formatting styles, which makes it hard to establish a common ground between them. Color is a helpful feature for identifying different parts of a figure; however, sometimes the same color is reused for different plots, and often a figure is displayed in grayscale, making it hard to distinguish the plotted parts of the figure. Also, noise such as heavy clutter, deformation, and occlusion hinders accurate parsing of the plotted data. Algorithm 1 presents the detailed pseudo-code to parse a line graph figure.

Textual Parsing
We used OpenCV together with the deep-learning-based OCR tool Pytesseract for figure parsing. For each detected text element, it extracts the associated bounding box in the form [x, y, w, h], where "x" and "y" are the coordinates of the box in the Cartesian coordinate system of the figure, "w" is the width and "h" is the height of the bounding box. Width and height are later used to find the rotation of the text, i.e., either vertical or horizontal. In this way, it extracts figure text, axes labels, axes scales and legend texts. This information is then saved in a JSON file.
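The orientation check and JSON export described above could be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function names, the record layout and the rule "taller than wide means vertical" are assumptions, and the detections are hypothetical values standing in for real OCR output.

```python
import json

def text_orientation(box):
    """Guess text rotation from an [x, y, w, h] bounding box."""
    _, _, w, h = box
    # Assumption: a box that is taller than it is wide holds vertical text.
    return "vertical" if h > w else "horizontal"

def to_json_records(detections):
    """Serialize (text, box) pairs into a JSON string (hypothetical layout)."""
    records = [
        {"text": text, "box": list(box), "orientation": text_orientation(box)}
        for text, box in detections
    ]
    return json.dumps(records, indent=2)

# Hypothetical OCR detections: an x-axis label and a rotated y-axis label.
detections = [("Recall", (120, 300, 60, 14)), ("Precision", (5, 80, 14, 70))]
print(to_json_records(detections))
```

A rule this simple can misjudge very short strings, such as a single-digit scale mark whose box is taller than it is wide, so a real parser would likely need an additional check.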

Curve separation
Figures containing only one curve are easier to parse than figures containing multiple curves. The case gets harder if the curves intersect each other or if the curves are in grayscale. Therefore, the approach taken here is to separate the curves based on color. This is done by transforming the figure into the HSV color space. Every single curve is saved into a separate file. However, we came across the problem of color variation within a single curve and across multiple curves. We tackled this issue by observing the HSV values within each curve and across the curves and then setting a range for the popular colors. We extracted points on the curves using the Hough transformation. This gave us the end-points of small line segments on the curves, which are then sorted in increasing order along the horizontal axis.
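The color-range idea can be illustrated with a small, dependency-free sketch. The hue ranges, thresholds and function names below are assumptions chosen for illustration (the actual ranges are not listed in this paper), and the input is a hypothetical list of already-detected curve pixels rather than a full image or Hough transform output.

```python
import colorsys

# Hypothetical hue ranges in degrees; the real ranges are not given in the paper.
HUE_RANGES = {"red": (345, 15), "green": (90, 150), "blue": (210, 270)}

def hue_label(rgb, min_saturation=0.3, min_value=0.2):
    """Map an (R, G, B) pixel to a named color range, or None if colorless."""
    r, g, b = (channel / 255.0 for channel in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    if s < min_saturation or v < min_value:
        return None  # gray, black or white pixels carry no usable hue
    degrees = h * 360.0
    for name, (low, high) in HUE_RANGES.items():
        if low <= high:
            inside = low <= degrees <= high
        else:  # range wraps around 0 degrees (e.g. red)
            inside = degrees >= low or degrees <= high
        if inside:
            return name
    return None

def separate_curves(pixels):
    """Group (x, y, rgb) points by color and sort each curve along the x-axis."""
    curves = {}
    for x, y, rgb in pixels:
        label = hue_label(rgb)
        if label is not None:
            curves.setdefault(label, []).append((x, y))
    return {name: sorted(points) for name, points in curves.items()}
```

Because colorless pixels are skipped, black or grayscale curves are not recovered by a scheme like this.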

Extracting area under the curve
To compute the area under the curve (AUC) in line graphs, we applied the trapezoidal rule (Pavičić et al. 2018), which divides the region under the curve into multiple small intervals, each in the shape of a trapezoid, and computes the area of each small trapezoid. The trapezoidal rule for computing the area under a curve from multiple points on the curve is given in Eq. 1:
$$\int_{x_0}^{x_n} f(x)\,dx \approx \frac{h}{2}\left[y_0 + 2\left(y_1 + y_2 + \cdots + y_{n-1}\right) + y_n\right] \quad (1)$$
where $y_0 = f(x_0)$, $y_1 = f(x_1)$, ..., $y_{n-1} = f(x_{n-1})$, $y_n = f(x_n)$ and $h$ is the distance between two consecutive points on the curve, as shown in Figure 2.
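As a small worked illustration of the composite trapezoidal rule with equally spaced points, consider the sketch below. The function name and the sample values are assumptions made for this example only.

```python
def composite_trapezoid(ys, h):
    """Composite trapezoidal rule for equally spaced samples y_0 .. y_n."""
    if len(ys) < 2:
        raise ValueError("at least two points are needed")
    # (h / 2) * [y_0 + 2 * (y_1 + ... + y_{n-1}) + y_n]
    return (h / 2.0) * (ys[0] + 2.0 * sum(ys[1:-1]) + ys[-1])

# Hypothetical precision values sampled at recall = 0, 0.25, 0.5, 0.75, 1.
precision = [1.0, 0.9, 0.7, 0.4, 0.0]
print(composite_trapezoid(precision, h=0.25))
```

For these samples the rule evaluates to 0.125 × (1.0 + 2 × 2.0 + 0.0) = 0.625.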
Note that each trapezoid contributes to the computation of the AUC. We computed the AUC for all the curves of a line graph, as they might represent results corresponding to different models or to the same model with different parameter settings. Furthermore, we generated a figure semantic summary by parsing the figure details. Algorithm 2 presents the pseudo-code to calculate the area under the curve from a line graph. Firstly, we identified the x-axis label and the y-axis label. Next, we identified each curve's legend label and symbol based on the color of the legend symbol. We separated the curves and calculated the AUC for each curve. Furthermore, we sorted the curves by area in descending order and picked the curve with the greatest AUC. Lastly, we generated a description for the parsed line graph as follows: [X-axis Label] versus [Y-axis Label] [Name of the curve that has a higher value of AUC] performed better than [Name of all other curves].
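The ranking and template-filling step can be sketched as below. This is only an illustration of the description template given above; the function name, the dictionary input and the curve names and AUC values are hypothetical.

```python
def describe_line_graph(x_label, y_label, curve_aucs):
    """Fill the description template from per-curve AUC values."""
    # Sort curves by AUC in descending order; the first one is the best curve.
    ranked = sorted(curve_aucs.items(), key=lambda item: item[1], reverse=True)
    best = ranked[0][0]
    others = ", ".join(name for name, _ in ranked[1:])
    return f"{x_label} versus {y_label} {best} performed better than {others}"

# Hypothetical curve names and AUC values.
aucs = {"words": 0.61, "greenCombo": 0.79, "templates": 0.55}
print(describe_line_graph("Recall", "Precision", aucs))
```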

Extracting figure metadata from full-text
General summaries provide a broad idea of a full-text document, so information about specific document elements might be lost, whereas a specialized summary may help to find particular knowledge in the full text. The amount of detail shown in a specialized summary should be proportional to the user's need; it should be neither too long nor too short. Inspired by the work of Safder et al. (2020), where the authors first generated a synopsis of full-text documents and then enriched it with algorithm-specific sentences to obtain a more meaningful summarization of full-text documents, we generated specialized summaries for the figures. The following preprocessing steps are carried out in specialized summary generation.

Hierarchical segmentation of document
A full-text document is composed of different sections such as Abstract, Introduction, Literature review, Methodology, Experiments and Results, Conclusion and References. These sections are organized hierarchically. The possibility that the result-figures will occur in the Experiments and Results section is high. Therefore, we need to segment this section and ignore the rest of the paper. To this end, we extracted the plain text from the PDF document using the PDFBox library and performed a document segmentation mechanism to divide a document into its standard sections.

Figure caption detection
In order to extract figure metadata, we extracted figure captions from the full text. Generally, figure captions are placed under the figures and follow a specific pattern: a caption keyword ("Figure", "Fig", "FIGURE" or "FIG"), followed by an integer that identifies the specific figure, followed by a delimiter, which can be ":" or ".", and finally some text explaining the figure.
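A regular expression following the caption pattern described above might look like the sketch below. The exact expression used in this work is not reproduced here, so the pattern, the optional dot after the keyword (to accept "Fig. 2.") and the function name are assumptions.

```python
import re

# Keyword, optional dot (assumption), figure number, ":" or "." delimiter, text.
CAPTION_RE = re.compile(r"^(Figure|Fig|FIGURE|FIG)\.?\s*(\d+)\s*[:.]\s*(.+)$")

def parse_caption(line):
    """Return (figure_number, caption_text), or None if the line is not a caption."""
    match = CAPTION_RE.match(line.strip())
    if match is None:
        return None
    return int(match.group(2)), match.group(3).strip()

print(parse_caption("Figure 6: Products domain precision-recall curves."))
print(parse_caption("Figure 6 shows the precision recall curves"))
```

The second call returns None because a sentence that merely mentions a figure has no delimiter after the figure number, which is what separates captions from in-text references.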

Average line length
Line length is the count of words in a line. The average line length is the total number of words over all lines divided by the total number of lines. All lines shorter than the average line length are discarded. In this way, we filter out section headings, titles, etc. As captions have already been detected, they are not removed. Furthermore, sparse lines are often created when converting from PDF to text. These lines can be associated with author names, table data, equations, etc. These lines are removed using the word density measure given in Eq. 2.
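The average line length filter can be sketched as follows. The function name and the `is_caption` callback are hypothetical, and the word density measure of Eq. 2 is deliberately not implemented because its definition is not reproduced in this text.

```python
def filter_short_lines(lines, is_caption=lambda line: False):
    """Drop lines shorter than the average line length, but keep captions."""
    lengths = [len(line.split()) for line in lines]
    if not lengths:
        return []
    average = sum(lengths) / len(lengths)
    return [
        line
        for line, length in zip(lines, lengths)
        if length >= average or is_caption(line)
    ]

# Hypothetical lines from a converted PDF.
lines = [
    "Results",
    "Figure 6: Products curves.",
    "The proposed parser separates the curves by color before extraction.",
    "Table 2",
]
print(filter_short_lines(lines, is_caption=lambda line: line.startswith("Figure")))
```

Here the average length is 4.25 words, so the heading and the table label are dropped, the long sentence is kept, and the caption survives only because it was detected earlier.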

Similarity to caption
In order to generate figure metadata, we parse the text document and compute the similarity of each line to the caption. Each line is given a score, and we pick the top-scoring lines. For this purpose, we adopt the Okapi BM25 (Beaulieu et al. 1997) similarity measure, as shown in Eq. 3. The motivation behind using Okapi BM25 as a similarity measure is that it is efficient and performs well in many ad-hoc retrieval tasks.
where $N$ is the number of lines in $D$, $lf_t$ is the line frequency of term $t$, $tf_{tL}$ is the frequency of term $t$ in line $L$, $tf_{tC}$ is the frequency of term $t$ in caption $C$, $l_L$ is the length of line $L$, $l_{av}$ is the average length of a line in $D$, and $k_1$, $k_3$ and $b$ are set to 2, 2 and 0.75, respectively.
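For illustration, one common form of Okapi BM25 that uses exactly the symbols listed above is sketched below. It is an assumption that Eq. 3 takes this form; in particular, the idf term log(N / lf_t) and the whitespace tokenization are choices made for this example, and the function name is hypothetical.

```python
import math
from collections import Counter

def bm25_scores(lines, caption, k1=2.0, k3=2.0, b=0.75):
    """Score each line of a non-empty document against a caption."""
    tokenized = [line.lower().split() for line in lines]
    caption_tf = Counter(caption.lower().split())
    n_lines = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_lines
    line_freq = Counter()  # number of lines containing each term
    for tokens in tokenized:
        line_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term, tf_c in caption_tf.items():
            if tf[term] == 0:
                continue
            idf = math.log(n_lines / line_freq[term])  # assumed idf form
            line_part = ((k1 + 1) * tf[term]) / (
                k1 * ((1 - b) + b * len(tokens) / avg_len) + tf[term]
            )
            caption_part = ((k3 + 1) * tf_c) / (k3 + tf_c)
            score += idf * line_part * caption_part
        scores.append(score)
    return scores
```

Lines that share no term with the caption receive a score of zero, so ranking by score and keeping the top lines yields the sentences most similar to the caption.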

Specialized summary generation
We generated a specialized summary by combining two sub-summaries: (a) figure semantics, which we generated by parsing the figure and computing the AUC, and (b) figure metadata. We also propose a query search mechanism; however, developing it is beyond the scope of this paper. The proposed query search mechanism will work according to a ranking formula in which $n$ is the total number of result-figures in a full-text document $D$, $Q$ is the query, $D_f$ is the tf-idf based summary of the full-text document, $D_{s_i}$ is the specialized figure-based summary, including figure semantics and figure metadata, for result-figure $i$, and $\lambda$ is the weighting parameter, which is set to 0.5 so that equal weight is given to both summaries.

Results and discussion
In this section, we discuss the results for figure extraction, classification and figure parsing, as well as the empirical results for summary generation. The figure parsing results are divided into three subsections: parsing figure text, curve separation and finding the area under the curves.

Figure extraction & classification
In order to extract and classify figures from PDF documents, we deployed existing state-of-the-art techniques (Clark and Divvala 2016; Siegel et al. 2016). Figure 3 shows the figure extraction results of the existing approach on our dataset (described in Section 3.1). The blue bounding box highlights the figure on the PDF document page, whereas the red bounding boxes highlight the subfigures within the figure. Moreover, Figure 4 illustrates the classification results of the existing state-of-the-art approach (Siegel et al. 2016). The deployed approach achieved an accuracy of 86% using ResNet-50. The model classifies figures into different classes such as graph plots, node diagrams, bar charts, and scatter plots. Figure 4 shows sample classification results for a graph plot (line graph) and a bar chart.

Textual parsing
The designed approach draws bounding boxes around the axes labels, axes scales, legend text and figure text found in the figure. Each bounding box is in the form [x, y, w, h]. Using the width "w" and height "h", we can find the angle of rotation. Figure 5 shows the bounding boxes identifying the text in the figure. However, we can see that not all the text is identified; our parser omits two scale marks from the y-axis. Such minor errors are likely to occur.

Curve Separation
The curve separation process is divided into two steps. Firstly, curves are separated by color. Secondly, the Hough transformation is applied to extract the plotted data. It first finds the end-points of small line segments on the curve, which are then joined to form contours surrounding the curves. Figure 6 shows that not all curves are identified: two out of five curves are black, and our parser did not pick them up, which is a limitation of our current model.

Area under the curve
To find the AUC, we used the trapezoidal rule (Pavičić et al. 2018), which divides the curve into multiple small intervals and calculates the area covered by those intervals. Furthermore, we performed an empirical evaluation of the computed AUC values against the AUC results reported in the respective papers. Among the 12,146 figures, we chose a random sample of over 1000 figures. For the empirical evaluation of the AUC approach, we randomly selected 55 precision-recall curves from our dataset and computed their AUC values. Next, we cross-matched the computed AUC with the AUC reported in the full text; the results are shown in Figure 7. Every blue dot in Figure 7 represents a pair of computed and reported AUC values. The red line is the ideal straight line on this plot. The closer the blue dots are to the red line, the smaller the error between the computed and reported AUC, and vice versa. Figure 7 indicates that most of the blue dots are close to the red line, which shows the encouraging performance of our designed AUC computation approach. The trapezoidal-rule AUC formula has a limitation: it is an approximation and therefore introduces some error; however, if the curve is divided into more segments, the error becomes smaller.
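The effect of the number of segments on the approximation error can be checked with a small experiment. The sketch below is illustrative only: it uses a synthetic curve, 1 - x^2 on [0, 1], whose exact area is 2/3, rather than any curve from our dataset, and the function names are hypothetical.

```python
def trapezoid_area(points):
    """Trapezoidal rule over (x, y) points with possibly uneven spacing."""
    pts = sorted(points)
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )

def sample_curve(f, n_intervals):
    """Sample f at n_intervals + 1 equally spaced points on [0, 1]."""
    return [(i / n_intervals, f(i / n_intervals)) for i in range(n_intervals + 1)]

def curve(x):
    # Synthetic stand-in for a parsed curve; its exact area on [0, 1] is 2/3.
    return 1.0 - x * x

EXACT_AREA = 2.0 / 3.0
for n in (2, 8, 32):
    error = abs(trapezoid_area(sample_curve(curve, n)) - EXACT_AREA)
    print(n, error)
```

The printed error shrinks as the number of intervals grows, in line with the observation above.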

Summary generation
In order to evaluate the specialized summaries, we created human-generated summaries to serve as reference summaries. For the human-generated summaries, we employed three human annotators. These summaries were three to five lines long and contained at least 12 keywords. Figure 8 shows a sample human-generated summary (reference summary) and a system-generated summary for a line chart extracted from Goldberg et al. (2009). It is observed that system summaries are longer and provide a more detailed description of the given line chart than the human-generated summaries. Moreover, in order to measure the quality of system-generated summaries against reference summaries, we applied four standard and commonly used metrics: ROUGE-1, ROUGE-L, Jaccard Similarity and Edit Distance (Lin 2004). Firstly, we performed some pre-processing, e.g., lemmatization and stop word removal, on both summaries. Then we computed the values of all four evaluation metrics. Brief details of the summary evaluation metrics follow.

ROUGE-1
ROUGE-N (N = 1) is a 1-gram recall, calculated between the system summary and the reference summary by finding the overlapping words between them:

\[
\text{ROUGE-N}_{\text{Recall}} = \frac{\text{Count}_{\text{overlapping words}}}{\text{\# of words in reference summary}}
\]

whereas ROUGE-N Precision is calculated by dividing by the number of words in the system summary:

\[
\text{ROUGE-N}_{\text{Precision}} = \frac{\text{Count}_{\text{overlapping words}}}{\text{\# of words in system summary}} \tag{6}
\]

Figure 8. System generated summary against human-generated summary.

Note that the numerator sums over the overlapping keywords and thus gives more weight to the matched words. Therefore, ROUGE-N favors a system summary that shares more words with the reference summary.
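
The ROUGE-1 recall and precision described in this section can be illustrated with a short sketch. It rests on assumptions that are not stated in the text: summaries are tokenized by lowercasing and splitting on whitespace, and the overlap is counted as a clipped multiset intersection, so a word repeated in one summary is only matched as many times as it occurs in the other. The function name `rouge_1` is hypothetical.

```python
from collections import Counter


def rouge_1(system_summary, reference_summary):
    """Return (recall, precision) of unigram overlap between two summaries."""
    # Assumed tokenization: lowercase and split on whitespace.
    sys_tokens = system_summary.lower().split()
    ref_tokens = reference_summary.lower().split()

    # Clipped overlap: Counter intersection keeps the minimum count per word.
    overlap = sum((Counter(sys_tokens) & Counter(ref_tokens)).values())

    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    precision = overlap / len(sys_tokens) if sys_tokens else 0.0
    return recall, precision
```

For a ten-word system summary that contains all five words of a reference summary, this returns a recall of 1.0 and a precision of 0.5, which is the pattern expected when the system summary adds words beyond the reference.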

ROUGE-L
ROUGE-L measures the longest common subsequence between the system summary and the reference summary. The basic intuition is that the longer the common subsequence, the more similar the two summaries are.
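
The longest common subsequence underlying ROUGE-L can be computed with standard dynamic programming, as in the sketch below. The names `lcs_length` and `rouge_l` are hypothetical, whitespace tokenization is an assumption, and the normalization (LCS length divided by the reference length for recall and by the system length for precision) follows the usual definition in Lin (2004) rather than anything spelled out in this section.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(system_summary, reference_summary):
    """Return (recall, precision) based on the word-level LCS."""
    # Assumed tokenization: lowercase and split on whitespace.
    sys_tokens = system_summary.lower().split()
    ref_tokens = reference_summary.lower().split()
    lcs = lcs_length(sys_tokens, ref_tokens)
    recall = lcs / len(ref_tokens) if ref_tokens else 0.0
    precision = lcs / len(sys_tokens) if sys_tokens else 0.0
    return recall, precision
```

Unlike ROUGE-1, this measure rewards words that appear in the same order in both summaries, although they do not have to be contiguous.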

Edit distance
The Edit Distance is a well-known measure of the dissimilarity between two strings: it counts the minimum number of operations required to transform one string into the other. In our case, we computed the number of operations (removal, insertion, or substitution) needed to convert the human-generated summary into the system summary. The larger the edit distance, the more dissimilar the summaries are.
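
A minimal sketch of the edit distance with the three operations named above (removal, insertion, substitution) is given below, using the standard Levenshtein recurrence. The function name `edit_distance` is hypothetical. The text does not say whether the distance is computed over characters or words, so the sketch accepts any pair of sequences: pass strings for a character-level distance or token lists for a word-level one.

```python
def edit_distance(source, target):
    """Minimum number of removals, insertions, and substitutions
    needed to turn `source` into `target` (Levenshtein distance)."""
    # prev[j] is the distance between the processed prefix of source
    # and target[:j]; initially the prefix is empty, so j insertions.
    prev = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        curr = [i]  # turning source[:i] into an empty target: i removals
        for j, t in enumerate(target, start=1):
            cost = 0 if s == t else 1
            curr.append(min(
                prev[j] + 1,         # remove s
                curr[j - 1] + 1,     # insert t
                prev[j - 1] + cost,  # substitute s with t, or keep a match
            ))
        prev = curr
    return prev[-1]
```

For instance, `edit_distance("kitten", "sitting")` is 3, and `edit_distance(reference.split(), system.split())` would give a word-level distance from a reference summary to a system summary.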

Jaccard similarity
The Jaccard Similarity is a well-known measure of the similarity between two summaries. It ranges from 0 to 1, and a value closer to 1 indicates higher similarity between the summaries. If X is the reference summary and Y is the system summary, then

\[
J(X, Y) = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|}
\]

where |X ∩ Y| represents the number of common words in both summaries, and |X| and |Y| denote the lengths of the respective summaries.

Table 1 summarizes the results for all evaluation metrics against five summaries. We observed high recall and slightly lower precision for ROUGE-1 (R1) and ROUGE-L (RL) (see Figure 9). Similarly, for the Jaccard Similarity, we achieved approximately 40% similarity between system and reference summaries. For Edit Distance, a relatively low score was observed. Overall, this shows that the system summaries contain most of the keywords used in the human-generated summaries plus some additional information. Therefore, system summaries are more detailed.

[Figure 8 content: sample system-generated and human-generated summaries for the caption "Figure 6: Products domain precision-recall curves."]
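
The Jaccard Similarity defined in this section can be sketched in a few lines. The sketch assumes that each summary is treated as a set of unique lowercase words obtained by whitespace splitting, so that the denominator equals the size of the union of the two sets. The function name `jaccard_similarity` is hypothetical, and returning 0.0 for two empty summaries is a convention chosen only for this illustration.

```python
def jaccard_similarity(reference_summary, system_summary):
    """Jaccard similarity between the word sets of two summaries."""
    # Assumed tokenization: lowercase, whitespace split, unique words only.
    x = set(reference_summary.lower().split())
    y = set(system_summary.lower().split())

    union_size = len(x | y)  # equals |X| + |Y| - |X ∩ Y| for sets
    if union_size == 0:
        return 0.0  # convention for two empty summaries (assumption)
    return len(x & y) / union_size
```

For example, the summaries "precision recall curves" and "recall curves area under" share two of five distinct words, giving a similarity of 0.4.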

Concluding remarks
We presented a system that first extracts result-figures from scholarly documents, classifies them, and then parses them to extract figure semantics such as text, plotted data, and the area under the curve. In addition, our system detects captions in full-text scholarly documents and uses those captions to generate figure-specific summaries. Finally, it combines the figure semantics obtained from parsing figures with the figure metadata from the summaries to generate specialized summaries. We also propose a query search mechanism in which a document ranking approach based on the semantic metadata of result-figures is suggested for future work. One limitation of this work is that the system summary depends on extracting the original reference summary from the paper. In the future, we need a stronger evaluation method, keeping in mind that low precision may not be bad, since the system enriches the summary by adding words to the original summary. Our work can be extended to create an automated figure evaluation mechanism that reviews the area under curves or other similar shapes. Another important application of this work is semantic plagiarism detection of figures in scholarly documents.