Semi-automatic staging area for high-quality structured data extraction from scientific literature

We propose a semi-automatic staging area for efficiently building an accurate database of experimental physical properties of superconductors from the literature, called SuperCon2, to enrich the existing manually built superconductor database SuperCon. Here we report our curation interface (SuperCon2 Interface) and a workflow managing the state transitions of each examined record, to validate the dataset of superconductors collected from PDF documents using Grobid-superconductors in a previous work. This curation workflow allows both automatic and manual operations: the former includes an ``anomaly detection'' step that scans new data for outliers, and a ``training data collector'' mechanism that gathers training examples from manual corrections. This training data collection policy is effective in improving the machine-learning models with a reduced number of examples. For manual operations, the SuperCon2 Interface is designed to increase the efficiency of manual correction by providing smart controls and an enhanced PDF document viewer. We show that our interface significantly improves curation quality by boosting precision and recall compared with the traditional ``manual correction''. Our semi-automatic approach provides a solution for achieving a reliable database through text and data mining of scientific documents.


Introduction
The emergence of new methodologies using machine learning for materials exploration has given rise to a growing research area called materials informatics (MI) [2]. This field leverages the materials data accumulated in the past to efficiently screen candidate materials with desired properties. Naturally, such an approach requires a large amount of material-related data for training models. Researchers have been developing large aggregated databases of physical properties generated by first-principles calculations based on Density Functional Theory (DFT), such as the Materials Project [3], JARVIS (Joint Automated Repository for Various Integrated Simulations) [4], and NOMAD (Novel Materials Discovery) [5], which have been a strong driving force for the development of materials informatics. Using DFT data for machine learning (ML) in materials science has become popular since, in principle, it allows researchers to simulate and obtain various physical properties of the target materials knowing only their crystal structures. DFT codes are designed to reproduce/simulate the physical properties that would be observed by experiment. Nonetheless, caution must be exercised when using these computed values to build ML models aimed at steering experiments: their predictions may lack validity because of specific simplifications of the interactions between atoms and electrons in solids, such as electron-electron Coulomb correlation, spin-orbit coupling, and similar factors.
In contrast, accumulated datasets of experimental data from scientific publications remain scarce, despite the abundance of publications and the exponential growth of materials science [6]. Currently, only a few such resources exist, such as the Pauling File [7] and SuperCon [8], necessitating reliance on manual extraction methods. This scarcity can be attributed to inadequate infrastructure and a shortage of computer-science expertise within the materials science field.
The SuperCon database has been built manually since 1987 [8] by the National Institute for Materials Science (NIMS) in Japan, and it is considered a reliable source of experimental data on superconductors [9][10][11][12]. However, keeping SuperCon up to date has become increasingly challenging due to the high publication rate. In response to the need for a more efficient approach to sustain productivity, we embarked on the development of an automated system for extracting material and property information from the text of relevant scientific publications. This automated process enabled the rapid creation of the "SuperCon 2 Database", a comprehensive database of superconductors containing around 40,000 entries, within an operational duration of just a few days [1]. Matching the level of quality seen in SuperCon while automating the extraction of structured data can be achieved with a properly designed curation process. We use the term curation to describe the overall process of reviewing and validating database records, while correction refers to the specific action of altering the values of one or more properties within an individual record. At the time of writing, we are not aware of any other curation tool focusing on structured databases of extracted information. There are several tools for data annotation, such as Inception [13] and Doccano [14], which concentrate on text labelling and classification.
In this work, we designed and developed a workflow with a user interface, the "SuperCon 2 Interface", crafted to produce structured data of higher quality and with greater efficiency than the "traditional" manual approach of reading documents and noting records, usually in an Excel file. We developed this framework around the specific use case of SuperCon; however, our goal is for it to be adaptable to other data frameworks.
Our contributions can be summarised as follows:
• We developed a workflow and a user interface that allow the curation of a machine-collected database. We demonstrate that using them for data correction results in higher quality than the "traditional" (manual) approach.
• We devised an anomaly detection process for incoming data that achieves a low rejection rate (false positive rate) from domain experts.
• We propose a mechanism that selects training data based on corrected records, and we demonstrate that such selections rapidly improve the ML models.
The remainder of this paper is organised as follows: Section 2 describes the curation workflow and Section 3 the user interface built on top of it. Finally, we discuss our evaluation experiments and results in Section 4.

Curation workflow
The curation of the SuperCon 2 Database acts as a workflow in which user actions result in state transitions of database records (Figure 1). Allowed manual actions include: a) mark as valid (validation), when a record is considered correct or has been corrected by someone else. When a record is not valid, users can: b) mark as invalid, when it is considered "potentially" invalid (or the curator is not confident); c) perform a manual correction to update it according to the information in the original PDF document; and d) remove the record, when it should not have been extracted.
Besides manual operations from users, this workflow also supports automatic actions: "anomaly detection" for pre-screening records (Section 2.2) and the "training data collector" for accumulating training data to improve the ML models (Section 2.3).
Although only the most recent version of a record can be viewed on this system, the correction history is recorded (Section 3.3).

Workflow control
The workflow state is determined by the "curation status" (Section 2.1.1), the user action, and the error type (Section 2.1.2).

Curation status
The curation status (Figure 1) is defined by the type of action, manual or automatic, and the status, which can assume the following values:
• new: default status when a new record is created.
• curated: the record has been amended manually.
• validated: the record was manually marked as valid.
• invalid: the record is wrong or inappropriate for the situation (e.g., T m or T curie extracted as a superconducting critical temperature).
• obsolete: the record has been updated, and the updated values are stored in a new record (internal status 1 ).
• removed: the record has been removed by a curator (internal status).
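As an illustrative sketch, these status values can be modelled as a small state machine. The status names come from the paper, but the exact transition table below is our assumption based on Figure 1:

```python
# Hypothetical sketch of the record life cycle. Status names are from the
# paper; the transition table itself is an assumption based on Figure 1.
ALLOWED_TRANSITIONS = {
    "new":       {"curated", "validated", "invalid", "removed"},
    "curated":   {"validated", "invalid", "obsolete", "removed"},
    "invalid":   {"curated", "validated", "removed"},
    "validated": {"obsolete", "removed"},
    "obsolete":  set(),   # internal status: superseded by a new record
    "removed":   set(),   # internal status: removed by a curator
}

def apply_action(status: str, target: str) -> str:
    """Move a record to `target` status, rejecting illegal transitions."""
    if target not in ALLOWED_TRANSITIONS[status]:
        raise ValueError(f"illegal transition: {status} -> {target}")
    return target
```

Encoding the allowed transitions explicitly makes it easy to reject, for instance, any attempt to revive a removed record.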

Error types
We first introduced error types in [1] and extended their scope in this work to cover data curation and anomaly detection.
Users are required to select one error type at every record update or removal. This information is stored in the "original" record and can differ at every record modification. The error type values can be summarised as follows:
• From table: the entities Material → T c → Pressure are identified in a table. At the moment, table extraction is not performed.
• Extraction: the material, temperature, and pressure are not extracted (no box) or are extracted incorrectly.
• Linking: the material is incorrectly linked to the T c , even though the entities are correctly recognised.
• T c classification: the temperature is not correctly classified as a "superconducting critical temperature" (e.g., it is a Curie temperature, a magnetic transition temperature, etc.).
• Composition resolution: the exact composition cannot be resolved (e.g., the stoichiometric values cannot be resolved).
• Value resolution: the extracted formula contains variables that cannot be resolved, even after reading the paper. This includes cases where the data comes from tables.
• Anomaly detection: the data has been modified by anomaly detection; this error type facilitates retrieval of these records from the interface.
• Curation amends: the curator is updating data that does not present issues caused by the automatic system.
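For illustration, the error types could be encoded as an enumeration attached to each record update. This is a sketch: the identifier names below are our own, not necessarily those used by the interface:

```python
from enum import Enum

class ErrorType(Enum):
    """Error types from Section 2.1.2 (identifier names are illustrative)."""
    FROM_TABLE = "from table"
    EXTRACTION = "extraction"
    LINKING = "linking"
    TC_CLASSIFICATION = "tc classification"
    COMPOSITION_RESOLUTION = "composition resolution"
    VALUE_RESOLUTION = "value resolution"
    ANOMALY_DETECTION = "anomaly detection"
    CURATION_AMENDS = "curation amends"
```

Storing the error type as a fixed enumeration (rather than free text) keeps the per-modification history queryable, e.g. when retrieving all records touched by anomaly detection.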

Anomaly detection
Anomaly detection is the process of identifying unusual events or patterns in data. In our context, this means identifying data that deviate greatly from the expected values. This post-process was introduced with a limited scope, to draw attention to certain cases during curation.
The anomaly detection uses a rule-based approach and marks any record that matches one of the following conditions:
• the extracted T c is greater than room temperature (273 K), negative, or contains invalid characters and cannot be parsed (e.g. "41]");
• the chemical formula cannot be processed by an ensemble composition parser combining Pymatgen [15] and text2chem [16];
• the extracted applied pressure cannot be parsed or falls outside the range 0-250 GPa.
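The rules above can be sketched as simple predicate functions. The thresholds are the ones stated in this section, but the formula check below is a crude stand-in for the ensemble parser (Pymatgen + text2chem) used in the real system:

```python
import re

# Thresholds from Section 2.2.
ROOM_TEMPERATURE_K = 273.0
PRESSURE_RANGE_GPA = (0.0, 250.0)

def tc_is_anomalous(tc_raw):
    """Flag Tc values that are unparseable (e.g. "41]"), negative,
    or above room temperature."""
    try:
        tc = float(tc_raw)
    except (TypeError, ValueError):
        return True
    return tc < 0 or tc > ROOM_TEMPERATURE_K

def pressure_is_anomalous(p_raw):
    """Flag applied pressures that are unparseable or outside 0-250 GPa."""
    try:
        p = float(p_raw)
    except (TypeError, ValueError):
        return True
    lo, hi = PRESSURE_RANGE_GPA
    return not (lo <= p <= hi)

def formula_is_anomalous(formula):
    """Crude placeholder for the composition parser: accept strings that
    look like a sequence of element symbols with optional (possibly
    fractional) counts."""
    return not re.fullmatch(r"(?:[A-Z][a-z]?\d*(?:\.\d+)?)+", formula or "")
```

A record matching any predicate would then be marked invalid with error type "anomaly detection", as described below.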
Records identified as anomalies are given the status "invalid" and the error type "anomaly detection" for easy identification. Since this process may produce false positives, its output requires validation by curators. For example, in certain contexts, T c values above room temperature or applied pressures up to 500 GPa may be valid in researchers' hypotheses, calculations, or simulated predictions.
We ran the anomaly detection on the full SuperCon 2 Database (40,324 records [1]). It identified 1506 records with an invalid T c , 5021 records with an incomplete chemical formula, 304 records with an invalid applied pressure, and 1440 materials linked to multiple T c values. Further analysis and cross-referencing with contrasting information may be added in the future.

Automatic training data collector
The curation process is a valuable endeavour demanding significant knowledge and human effort. To maximise the use of this time for collecting as much information as possible, we integrated an automatic procedure into the curation process that, for every correction, accumulates the related data examples that can be used to improve the underlying ML models.

Training data collection
In the event of a correction (update, removal) to a database record, this process retrieves the corresponding raw data: the text passage, the recognised entities (spans), and the layout token information. This information is sufficient to be exported as training examples, which can be examined, corrected, and fed back to the ML model.
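A minimal sketch of what the collector might assemble on each correction; all field names here are hypothetical, since the actual schema is internal to the SuperCon 2 implementation:

```python
def collect_training_example(record: dict, raw_store: dict) -> dict:
    """On a correction, bundle the raw data needed to re-train the
    extraction model. `raw_store` maps record ids to the raw output
    of the original extraction (field names are illustrative)."""
    raw = raw_store[record["id"]]
    return {
        "document_id": record["document_id"],
        "text": raw["passage"],           # the source text passage
        "spans": raw["entities"],         # recognised entities with offsets
        "tokens": raw["layout_tokens"],   # layout token information
        "status": "new",                  # to be reviewed before export
    }
```

The resulting example carries everything needed to review and, if necessary, re-annotate the passage before export.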

Training data management
We designed a dedicated page of the interface (Section 3) to manage the collected data (Figure 2), in which each row corresponds to a training example composed of the decorated text showing the identified entities, the document identifier, and the status. Users can examine the data, delete it, send it to the annotation tool to be corrected, and then export it. We integrated our interface with Label-studio [17] for the correction of the collected training examples. Label-studio is an open-source, Python-based, modern interface supporting many different TDM tasks (NER, topic modelling, image recognition, etc.).

Curation interface
The workflow is operated through the user interface, which offers several key features to facilitate the data curation process (Figure 1). It provides a comprehensive view of materials and their related properties as a table with search, filtering, and sorting functionality (Figure 3). The detailed schema, including examples, is reported in our previous work [1].
During the curation process, it is often necessary to switch back and forth between the database record and the related context in the paper (the related paragraph or sentence). Our interface provides a viewer for individual documents, which visualises in the same window a table with the extracted records and the original PDF document decorated with annotations that identify the extracted materials and properties (Figure 4).

Manual curation approach
In this section, we discuss our strategy for manual curation, which remains indispensable for developing high-quality structured data.
We selected curators from domain experts in the field to ensure sufficient data quality. Nevertheless, as confirmed by our experiment in Section 4.3, the experience of each individual may have an impact on the final result. We followed two principles to guarantee robustness in the curation process. First, we built solid curation documentation in the form of example-driven guidelines, using an iterative approach we first introduced in [18]. Second, we used a double-round validation approach, in which the data was initially corrected by one person and then validated in a second round by a different individual.

Curation guidelines
The guidelines consist mainly of two parts: general principles, and correction rules with examples of solutions. They are designed to provide general information applicable to corrections, along with basic explanations and illustrations for faster understanding (e.g. the meaning of the colours of the annotations). Differently from our previous work [18], these guidelines are organised into examples for different scenarios based on the error types described in Section 2.1.2. Each example describes the initial record, its context, the expected corrected record, and a brief explanation, as illustrated in Figure 5.

Curation and processing logs
The SuperCon 2 interface gives access to information regarding the ingestion (processing log) and the curation process (curation log). The processing log is populated when new data is ingested; it was built with minimal functionality to explain why certain documents have not been processed (Figure 6, top). For example, documents sometimes fail because they do not contain any text (image-only PDF documents) or because they are too large (more than 100 pages).
The curation log provides a view of what, when, and how a record has been corrected (Figure 6, bottom).

Results and evaluation
In this section, we illustrate the experiments we ran to evaluate our work. The evaluation consists of three sets of results. The anomaly detection rejection rate (Section 4.1) indicates how many anomalies were rejected by curators after validation. Then, we demonstrate that the automatically selected training data contributed to improving the ML model with a small set of examples (Section 4.2). Finally, we evaluated the quality of the data extraction using the interface (and the semi-automatic TDM process) against the classical method of reading the PDF articles and noting the experimental information in an Excel file. In Section 4.3 we find that using the interface improves the quality of the curated data by reducing the amount of missing experimental data.

Anomaly detection rejection rate
We evaluated the anomaly detection by observing the "rejection rate", i.e., the fraction of detected anomalies that were rejected by human validation. Running the anomaly detection on a database subset of 667 records, it found 17 anomalies in T c , 1 anomaly in applied pressure, and 16 anomalies in the chemical formulas. Curators examined each reported record and rejected 4 (23%) of the anomalies in T c , 6 (37%) of the anomalies in chemical formulas, and 0 anomalies in applied pressure. This indicates an appropriately low rate of false positives, although a study with a larger dataset might be necessary.

Training data generation
We selected around 400 records in the SuperCon 2 Database that had been marked as invalid by the anomaly detection process and corrected them following the curation guidelines (Section 3.2). Then, we examined the corresponding training data collected through the interface (Section 2.3) and obtained a set of 352 training examples for our ML models. We call the obtained dataset curation, to distinguish it from the original SuperMat dataset, which is referred to as base.
We prepared our experiment using SciBERT [19], which we fine-tuned for our downstream task as in [1]. We trained five models and evaluated them using a fixed holdout dataset from SuperMat, averaging the results to smooth out fluctuations. We used the DeLFT (Deep Learning For Text) [20] library for training, evaluating, and managing the models for prediction. A model can be trained with two different strategies: (1) "from scratch", when the model is initialised randomly, denoted with (s); and (2) "incremental", when the initial model weights are taken from an already existing model, denoted with (i).
The latter can be seen as a way to "continue" the training from a specific checkpoint. We thus define three different training protocols:
(1) base(s): using the base dataset and training from scratch (s).
(2) (base+curation)(s): using both the base and curation datasets and training from scratch (s).
(3) base(s)+(base+curation)(i): using the base dataset to train from scratch (s), and then continuing the training with the curation dataset (i).
We merge curation with the base dataset because the curation dataset is very small compared to base, and we want to avoid catastrophic forgetting [21] or overfitting. The trained models are then tested using the fixed holdout dataset that we designed in our previous work [1]; the evaluation scores are shown in Table 1.
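The three protocols can be sketched as follows, with toy stand-ins for model initialisation and training; the real experiments fine-tune SciBERT through DeLFT, which is not reproduced here:

```python
# Toy stand-ins: a "model" just records which examples it has seen, so the
# three protocols can be compared structurally.

def init_random():
    """Stand-in for a randomly initialised model."""
    return {"examples_seen": []}

def fit(model, dataset):
    """Stand-in for one training run: records the examples used."""
    return {"examples_seen": model["examples_seen"] + list(dataset)}

base = ["b1", "b2", "b3"]   # stands for the original SuperMat (base) dataset
curation = ["c1"]           # stands for the small curation dataset

# (1) base(s): train from scratch on base only
m1 = fit(init_random(), base)
# (2) (base+curation)(s): train from scratch on the merged dataset
m2 = fit(init_random(), base + curation)
# (3) base(s)+(base+curation)(i): continue training from m1's weights
m3 = fit(m1, base + curation)
```

Note that protocol (3) revisits the base examples during the incremental step, which is the merging choice made above to mitigate catastrophic forgetting.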
This experiment demonstrates that with only 352 examples (2% of the SuperMat dataset), comprising 1846 additional entities (11% of the entities in the SuperMat dataset) (Table 2), we obtain an improvement in F1-score from 76.67% to 77.44% (+0.77) and 77.48% (+0.81) for (base+curation)(s) and base(s)+(base+curation)(i), respectively. This experiment gives interesting insight into the positive impact of the way we select the training data. However, there are some limitations: the curation dataset is small compared to the base dataset. This could be investigated further by correcting all the available training data, repeating the experiment, and studying the relation between the relative sizes of the two datasets and the obtained evaluation scores. A second limitation is that the hyperparameters we chose for our model, in particular the learning rate and batch size, could be further tuned to obtain better results with the second and third training protocols.

Data quality
We conducted an experiment to evaluate the effectiveness and accuracy of data curation using two methods: a) the user interface (interface), and b) the "traditional" manual approach of reading PDF documents and populating an Excel file (PDF documents).
We selected a dataset of 15 papers, which we assigned to three curators: a senior researcher (SR), a PhD student (PS), and a master's student (MS). Each curator received 10 papers: half to be corrected with the interface and half with the PDF document method. Overall, each pair of curators had 5 papers in common, which they processed using opposite methods. For instance, if curator A received paper 1 to be corrected with the interface, curator B, who received the same paper 1, corrected it with the PDF document method. After curation, a fourth individual manually reviewed the curated content. The raw data is available in Appendix A.
We evaluated the curation from a double perspective: time and correctness. Time was calculated as the accumulated minutes required using each method. Correctness was assessed using standard measures: precision, recall, and F1-score. Precision measures the accuracy of the extracted information, while recall assesses the ability to capture all expected information. The F1-score is the harmonic mean of precision and recall.
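The correctness measures can be computed as follows, as a minimal sketch over sets of extracted and expected records:

```python
def precision_recall_f1(extracted, expected):
    """Correctness measures used in the evaluation: precision over the
    extracted records, recall over the expected ones, and their
    harmonic mean (F1)."""
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1
```

For example, an extraction that finds one of two expected records plus one spurious record scores P = R = F1 = 0.5.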

Discussion
Overall, both methods required the same accumulated time: 185 minutes using the interface and 184 minutes using the PDF document method. When the experiment was carried out, not all the curators were familiar with the interface method. Although they had access to the user documentation, they had to become acquainted with the user interface, so the accumulated 185 minutes includes such activities.
We examined the quality of the extracted data and observed an improvement of +5.55% in precision and a substantial +46.69% in recall when using the interface compared with the PDF document method (Table 3). The F1-score improved by 39.35%.
The disparity in experience significantly influenced the accuracy of curation, particularly in terms of high-level skills. The senior researcher consistently achieved an average F1-score approximately 13% higher than the other curators (see Table 4). Furthermore, we observed only a modest difference between the master's student and the PhD student. These findings indicate that for large-scale projects, employing master's students instead of PhD students may be a more cost-effective choice, reserving a few senior researchers for the second round of validation (Section 3.1).
Finally, the collected data suggest that all three curators obtained overall more correct results by using the interface, as illustrated in Table 5.
The results of this experiment confirm that our curation interface and workflow significantly improve the quality of the extracted data, with a remarkable improvement in recall, thus preventing curators from overlooking important information.

Code availability
This work is available at https://github.com/lfoppiano/supercon2. The repository contains the code of the SuperCon 2 interface, the curation workflow, and the ingestion processes for harvesting the SuperCon 2 Database of materials and properties. The guidelines are accessible at https://supercon2.readthedocs.io.

Conclusions
We built a semi-automatic staging area, called SuperCon 2 , to efficiently validate new experimental records automatically collected from superconductor research articles (SuperCon 2 Database [1]) before they are ingested into the existing, manually built database of superconductors, SuperCon [8]. The system provides a curation workflow and a user interface (SuperCon 2 Interface) tailored to efficiently support domain experts in data correction and validation, with fast context switching and an enhanced PDF viewer. Under the hood, the workflow runs "anomaly detection" to automatically identify outliers and a "training data collector" based on human corrections, to efficiently accumulate training data to be fed back to the ML model.
Compared with the traditional manual approach of reading PDF documents and extracting information into an Excel file, SuperCon 2 significantly improves the curation quality, by approximately +6% in precision and +47% in recall. In future, this work can be expanded to support other materials science domains, such as magnetic materials, spintronics, and thermoelectric research, and the evaluation can be extended to a larger dataset.

Table 1. F1-score from the evaluation of the fine-tuned SciBERT models. The training is performed with three different approaches. The base dataset is the original dataset described in [18], and the curation dataset is automatically collected based on the database corrections made through the interface and then manually corrected. (s) indicates "training from scratch", while (i) indicates "incremental training". The evaluation is performed using the same holdout dataset from SuperMat [18]. The results are averaged over 5 runs of training and evaluation.

Figure
Figure 1. Schema of the curation workflow. Each node has two properties: type and status (Section 2.1.1). Each edge indicates one action. The workflow starts on the left side of the figure. New records begin with "Automatic, New". Changes of state are triggered by automatic (Section 2.2) or manual operations (update, mark as valid, etc.; Section 3.1) and result in changes of the properties in the node. Each combination of property values identifies a state. "(*)" indicates a transition for which training data are collected (Section 2.3).

Figure 2.

Figure 3 .
Figure 3. Screenshot of the SuperCon 2 interface showing the database. Each row corresponds to one material-Tc pair. At the top, there are search-by-attribute, sorting, and other filtering operations. On the right (last column) there are curation controls (mark as valid, update, etc.). Records are grouped by document, with alternating light yellow and white backgrounds.

Figure 4 .
Figure 4. PDF document viewer showing an annotated document. The table at the top is linked to the annotated entities. The user can navigate from a record to the exact point in the PDF, with a pointer (the red bulb light) identifying the context of the entities being examined.

Figure 5 .
Figure 5. Sample curation sheet from the curation guidelines. The sheet is composed of the following information: a) Sample input data: a screenshot of the record from the SuperCon 2 Interface; b) Context: the related part of the annotated document referring to the record under examination; c) Motivation: a description of the issue; d) Action: the action to be taken; and e) Expected output.

Figure 6 .
Figure 6. Top: processing log, showing the output of each ingestion operation and the outcome, with details of any error that occurred. Bottom: correction log, indicating each record, the number of updates, and the date/time of the last update. By clicking on the "Record id", it is possible to visualise the latest record values.

Table 2 .
Number of entities for each label in each of the datasets used for evaluating the ML models. The base dataset is the original dataset described in [18], and the curation dataset is automatically collected based on the database corrections made through the interface and then manually corrected.

Table 3 .
Evaluation scores (P: precision, R: recall, F1: F1-score) comparing curation using the SuperCon 2 interface (Interface) with the traditional method of reading the PDF document (PDF document).

Table A1 .
Timetable recording the time spent on each of the 15 articles. Each row indicates the time and the event (Start, Finish) for each of the curators: master's student (MS), PhD student (PS), and senior researcher (SR). Duration is expressed in minutes.