SuperMat: Construction of a linked annotated dataset from superconductors-related publications

A growing number of papers are published in the area of superconducting materials science. However, novel text and data mining (TDM) processes are still needed to efficiently access and exploit this accumulated knowledge, paving the way towards data-driven materials design. Herein, we present SuperMat (Superconductor Materials), an annotated corpus of linked data derived from scientific publications on superconductors, which comprises 142 articles, 16052 entities, and 1398 links that are characterised into six categories: the names, classes, and properties of materials; links to their respective superconducting critical temperature (Tc); and parametric conditions such as applied pressure or measurement methods. The construction of SuperMat resulted from a fruitful collaboration between computer scientists and materials scientists, and its high quality is ensured through validation by domain experts. The quality of the annotation guidelines was ensured by satisfactory Inter Annotator Agreement (IAA) between the annotators and the domain experts. SuperMat includes the dataset, annotation guidelines, and annotation support tools that use automatic suggestions to help minimise human errors.


Introduction
The vast majority of scientific knowledge exists as published articles [1][2][3][4]. These publications are presented mainly as text, which is difficult to use as a machine-readable structure. Meanwhile, as a part of the text and data mining (TDM) discipline, computer-assisted information collection from the literature has become a supportive asset for scientific research [5]. In the past decades, new TDM processes were developed for several natural science disciplines to achieve automatic document processing such as information retrieval, entity extraction, and clustering. TDM has been applied in biology for identifying interactions between agents (e.g. bacteria, viruses, genes, and proteins) [6][7][8] to support research on serious diseases including cancer [9]. In chemistry, it has been used for the disambiguation of chemical compound names, synthesis extraction, and retrieval [10]. In both domains, the application of TDM was based on manually curated datasets (corpora) that functioned as infrastructure. Examples are the BioCreative IV CHEMDNER corpus [11] in chemistry, and Genia [12] and GENETAG [13,14] in biology. Such datasets are crucial for developing, training, and evaluating TDM systems.
In comparison, such resources in the materials science domain are rather limited. Reported cases include NaDev [15] on nanocrystal devices research, a corpus for extracting synthesis recipes [16], and ChemDataExtractor [17], which focuses only on chemical entities. In the superconductors domain, we could identify MagDb [18], which focuses on magnetic materials with limited information categories. Another project is SC-CoMIcs [19]. SuperMat differs from SC-CoMIcs in two respects: (a) it provides full papers instead of abstracts, and full papers contain more detailed information about the research on superconducting materials, and (b) it contains linked entities.
To address this shortage of infrastructure, experimental data is extracted manually [20], or ab-initio calculations are used [21], although the latter might not accurately describe the real system. Several challenges still hinder the data-driven exploration of materials (also called Materials Informatics (MI)), namely: the lack of data standards, the infant stage of the data-driven culture, a wide variety of conflicting stakeholders, and missing incentives for researchers to contribute to large collaborative initiatives [22]. To bridge these gaps, it is necessary to create infrastructural resources to support TDM processes in materials science through the automatic construction of databases for materials and their properties. Such an application can minimise the need for humans to read new papers and extract the key information therein. Equally important, it enables scientists to focus and leverage computing power and human resources to find deeper relationships between superficially unrelated information. Other applications include providing semantically enriched search engines that accept fine-grained queries [23] to reduce the time needed to access specific information. These processes cannot be established without essential resources such as dictionaries, lexicons, and datasets.
Research on superconducting materials has been growing rapidly towards both fundamental science and practical applications. Superconductors display many intriguing phenomena including zero resistivity, the ability to host a high magnetic field, quantisation of the magnetic flux, and vortex pinning. Current applications of superconductors include medical instruments, high-speed trains, quantum computers, and the Large Hadron Collider (LHC) [24][25][26]. However, discovering a new superconductor is a challenging task [27]. For example, in a previous work [28], out of ∼1000 studied materials, only 3% were found to be superconducting. The National Institute for Materials Science (NIMS) in Japan has been manually constructing databases to support materials research, and SuperCon (http://supercon.nims.go.jp) is a manually curated data source for the superconductor domain. These databases would help researchers design new superconducting materials with a higher superconducting critical temperature (Tc) (ideally up to room temperature) [29,30]. However, the current resources are very limited and not dynamic enough to incorporate the information from new publications in a timely manner. In this paper, we present SuperMat (Superconductor Materials), an annotated linked corpus for superconducting material information. This dataset contains 142 documents with 16052 (7166 unique) entities and 1398 links, and can serve as infrastructural data for TDM processes in the domain of superconducting materials. We also describe the construction guidelines for SuperMat, in the hope of supporting researchers in systematically creating annotated data. Furthermore, the unique feature of links between entities in SuperMat will allow the development of more precise methodologies to associate a particular material with its properties.

Content acquisition
SuperMat originates from PDF documents of scientific articles related to superconductor research. The PDF format is the most widely used format for scientific publications [31]. The original documents were collected from the following sources: (a) the Open Access (OA) version of peer-reviewed articles referenced in the SuperCon database records; (b) articles provided by domain experts containing suitable items and potential links of material names, Tc values, measurement methods, and pressures; (c) articles from the "condensed matter" category of arXiv (https://arxiv.org/archive/cond-mat) selected using the search terms "superconductor", "critical temperature", and "superconductivity".
Pre-print versions of peer-reviewed articles were obtained using a lookup service for bibliographic data called biblio-glutton (https://github.com/kermitt2/biblio-glutton) that aggregates data from various sources: the Crossref (https://www.crossref.org/) bibliographic database, the unPaywall (http://unpaywall.org) service, the PubMed Central repository (https://pubmed.ncbi.nlm.nih.gov/), and mappings to other databases. We queried biblio-glutton using the bibliographic data of each article referenced in SuperCon; subsequently, we downloaded the pre-print article associated with the retrieved record, if available. Although the published version may differ from the pre-print version of a document, a comparison of pre-print and peer-reviewed articles in biology [32] measured objective differences of around 5%.

Preliminary annotation study
A preliminary annotation study was carried out to assess the effort required from the annotators to reach an acceptable Inter Annotator Agreement (IAA > 0.7). We annotated two randomly selected OA papers, using a preliminary version of the guidelines with a limited tag-set of four labels: <material>, <tc> (expression describing the presence or absence of superconductivity), <tcValue> (value of Tc), and <doping> (amount of substitution, such as stoichiometric values, usually expressed as functions of x or y). The process was iterated multiple times. Each iteration ended with computing the IAA using Krippendorff's alpha coefficient [33,34], after which the annotators discussed the disagreements and updated the guidelines.
Based on the results in Table 1, the IAA reached a satisfactory level of around 0.9 after the third iteration. In the second iteration, although the IAA reached 0.7 on three of the four labels, the average agreement was not satisfactory. When analysing the disagreement, we noticed that the low score in the <doping> label was caused by a heavy overlap with the <material> label, which required a more precise definition in the guidelines.
Based on this preliminary study, the following changes were implemented. (a) The label <doping> was merged under <material> because, even with detailed documentation, it was too difficult for humans to annotate the two consistently. (b) Three more labels were added: measurement methods and pressure (described as parametric conditions in relation to Tc), and class of materials.
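The agreement coefficient computed at the end of each iteration can be illustrated with a minimal sketch of Krippendorff's alpha for nominal labels. The unit of comparison (one list of labels per annotated unit, one label per annotator) and the function name are assumptions made for illustration; the paper does not state how units were defined or which implementation was used.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists: the labels assigned to one unit by the
    annotators who labelled it (None marks a missing annotation).
    """
    # Coincidence matrix: every ordered pair of labels from different
    # annotators within a unit contributes 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        labels = [label for label in labels if label is not None]
        m = len(labels)
        if m < 2:
            continue  # units labelled by fewer than two annotators are ignored
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        return float("nan")  # alpha is undefined when only one label occurs
    return 1.0 - (n - 1) * observed / expected


# Toy example: two annotators, four units, one disagreement.
toy_units = [
    ["material", "material"],
    ["material", "material"],
    ["tcValue", "tcValue"],
    ["material", "tcValue"],
]
alpha = krippendorff_alpha_nominal(toy_units)
```

An alpha of 1 indicates perfect agreement and 0 indicates agreement at chance level, which is why a threshold such as 0.7 can serve as a stopping criterion for the guideline iterations.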

Tag-set design
The tag set (also referred to as labels) represents the classes of entities and the type of links between them, which were designed to be extracted from the text (Figure 1).

Entities
Entities (also referred to as Named Entities, mentions, or surface forms) are chunks of text that represent information of interest, as follows:
• Class (tag: <class>) represents a group of materials defined by certain characteristics. Superconducting materials can be classified according to different criteria such as the composition and magnetic properties. Among publications collected for this study, the domain experts identified three types of classes based on: (a) the composition and crystal structure, (b) material phenomena (e.g. "I-type" and "II-type superconductivity", "BCS superconductors", "nematic", and "conventional/unconventional superconductivity"), and (c) high/low Tc value (e.g. "high-tc" superconductors).
In this work, we only considered the (a) classes, mainly because the material composition and crystal structure do not change with time. For example, a cuprate from 1998 is still called a cuprate today. In comparison, many material phenomena used for (b) are not robust enough, and can be biased by the viewpoint of the author(s) or research group, or by the measurement methods. Finally, the definition of "high-tc" superconductors (c) is completely relative; i.e., with the progress of research, materials once considered "high-tc" might not be so anymore.
• Material (tag: <material>) identifies the name of one or more materials. This label is used to collect the following types of information:
  • Chemical formula indicating the material by its general or stoichiometric formula (e.g. LaFe1-xO7, WB2),
  • Compositional name (e.g. magnesium diboride) or abbreviations (e.g. YBCO),
  • The material's shape (e.g. wire, powder, thin film) or form of material (e.g. single/poly crystal),
  • Modification by a dopant (Zn-doped, Si-doped) or by percentage of doping (2%-doped). We also considered qualitative expressions such as overdoped, lightly doped, and pure as valid information,
  • Substrate information (e.g. grown on MgO(100) film) when it was adjacent to the material name or formula in the text,
  • Additional information about the sample (e.g. as-grown, untwinned, single-layer) when it was adjacent to the material name or formula in the text.
• Superconducting critical temperature (tag: <tc>) identifies expressions related to the phenomenon of superconductivity. Not every temperature mentioned in the text is the Tc. Rather, it could refer to the temperature of other processes/events such as annealing/sintering, specific measurements, and structural changes. This label identifies the presence or absence of superconductivity at a given temperature. In addition, modifiers of this information (increasing/decreasing Tc) are also retained.
• Superconducting critical temperature value (tag: <tcValue>) represents the temperature at which the superconducting phenomenon occurs. It can be defined by different experimental criteria, such as the onset, the mid-point of the resistivity drop, or zero resistivity. This value also considers boundary conditions, such as the onset of superconductivity or zero resistance.
• Applied pressure (tag: <pressure>) indicates the applied pressure corresponding to a measured Tc.
• The measurement method (tag: <me_method>) indicates the method used to measure or calculate the presence of superconductivity. Here, we considered the following categories: resistivity, magnetic susceptibility, specific heat, and theoretical calculations.

Links
Links connect entities of materials or samples to their corresponding properties, conditions, and results. The links are non-directional, and there are no restrictions on the number of links for each entity. We defined three types of links:
• material-tc: linking materials to their Tc values.
• tc-pressure: connecting Tc and the applied pressure under which it was obtained.
• tc-me_method: linking Tc and the corresponding measurement method.
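The tag set and link types described above can be modelled as a small data structure. The following is a minimal sketch, not the representation used by the authors: the class names, fields, and helper functions are hypothetical, the measurement-method label is written `me_method`, and the endpoint labels assumed for tc-pressure and tc-me_method (a <tcValue> on the Tc side) are an assumption.

```python
from dataclasses import dataclass

# Entity labels listed in the tag set.
ENTITY_LABELS = {"class", "material", "tc", "tcValue", "pressure", "me_method"}

# Each link type maps to the unordered pair of labels it connects.
# The pairs for tc-pressure and tc-me_method are assumptions.
LINK_ENDPOINTS = {
    "material-tc": frozenset({"material", "tcValue"}),
    "tc-pressure": frozenset({"tcValue", "pressure"}),
    "tc-me_method": frozenset({"tcValue", "me_method"}),
}


@dataclass(frozen=True)
class Entity:
    ident: str      # identifier used for linking
    label: str      # one of ENTITY_LABELS
    text: str       # surface form as it appears in the paper
    paragraph: int  # index of the paragraph containing the mention


@dataclass(frozen=True)
class Link:
    kind: str
    first: Entity
    second: Entity


def is_valid(link):
    """A link is valid when its endpoints carry the labels its type expects.

    Links are non-directional, so the order of the endpoints is ignored.
    """
    expected = LINK_ENDPOINTS.get(link.kind)
    actual = frozenset({link.first.label, link.second.label})
    return expected is not None and actual == expected


def is_intra_paragraph(link):
    """True when both linked entities occur in the same paragraph."""
    return link.first.paragraph == link.second.paragraph


material = Entity("x2", "material", "LaFeAsO1-xHx", paragraph=4)
tc_value = Entity("x3", "tcValue", "exceed 45K", paragraph=4)
pressure = Entity("x1", "pressure", "3.0 GPa", paragraph=4)
links = [Link("material-tc", tc_value, material), Link("tc-pressure", tc_value, pressure)]
```

Because an entity may take part in any number of links, a single <tcValue> can be attached to a material, a pressure, and a measurement method at the same time.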

Annotation guidelines
Annotation guidelines include the principles and the rules that describe what constitutes the desired information for the SuperMat dataset and how to annotate it. They include a detailed description of the specific rules that have been defined for each type of information to be annotated, with one or more definitions and examples illustrating what to annotate in different cases, exceptions, and references. We used an online system to track the discussions and decisions when a question or a comment was raised, and provided a link to such issues in the respective description or example. In addition, the guidelines include linking rules that provide information on how to correctly connect the entities in a relationship. The guidelines were written in a markup language (reStructuredText) and stored in a git (https://git-scm.com/) version control repository. We deployed them as HTML files on the web, which were updated automatically after each modification. They can be accessed at https://supermat.readthedocs.io.

Annotation support tools
The task of annotating documents is tedious and requires both attention and subject knowledge from the annotators. Annotation support tools aim to maximise the efficiency of annotators and minimise human mistakes. They are composed of a web-based collaborative annotation tool, automatic annotation suggestions, and automatic corpus analysis.

Web-based collaborative annotation tool: INCEpTION
The annotation tool is the platform used for creating, correcting, and linking annotations. After evaluating several tools, we selected INCEpTION [35,36], a web-based multi-user platform for machine-assisted rapid construction of annotated datasets. INCEpTION provides supportive functionalities that include:
• Multi-layer annotation sheets that allow different annotation schemas over the same documents,
• Two annotation steps: annotation consists of manually correcting pre-imported documents, while curation allows another user to validate the annotations (Figure 5),
• On-the-fly automatic suggestions based on active learning and string matching (Figure 5),
• Bulk annotation corrections, and
• Being open-source (Apache 2.0 license) and under active development at the time of this paper (https://inception-project.github.io/).

Annotation suggestions
Previous works have demonstrated that annotation suggestions improve the quality of the output [37][38][39]. We provide two types of annotation suggestions. (i) Machine-based annotated data that were assigned to the documents before loading them into the annotation tool. Here, we use a machine learning (ML)-based system from a previously implemented prototype [40] to support our tag-set. (ii) Active learning recommendations provided by INCEpTION, which are assigned on-the-fly based on previous annotations. The active-learning recommendations are less precise since they aim to increase the recall, and therefore they need to be explicitly accepted by the annotator.
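As an illustration of how suggestions can be derived from previous annotations, the sketch below proposes labels by plain string matching against surface forms that were already annotated. It is a simplified stand-in written for this explanation, not the recommender implemented in INCEpTION nor the ML-based prototype [40]; the function name and the sample sentence are hypothetical.

```python
import re


def suggest(text, known_mentions):
    """Propose (start, end, label) spans for surface forms seen before.

    `known_mentions` maps a previously annotated surface form to its label.
    Longer surface forms are matched first so that they take precedence
    over shorter ones that overlap with them.
    """
    suggestions = []
    for surface in sorted(known_mentions, key=len, reverse=True):
        # Require non-word characters around the match to avoid partial hits.
        pattern = re.compile(r"(?<!\w)" + re.escape(surface) + r"(?!\w)")
        for match in pattern.finditer(text):
            start, end = match.start(), match.end()
            overlaps = any(s < end and start < e for s, e, _ in suggestions)
            if not overlaps:
                suggestions.append((start, end, known_mentions[surface]))
    return sorted(suggestions)


sentence = "MgB2 becomes superconducting below 39 K; MgB2 wires were also tested."
known = {"MgB2": "material", "39 K": "tcValue"}
proposed = suggest(sentence, known)
```

Suggestions of this kind favour recall over precision, which is consistent with requiring the annotator to accept each one explicitly.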

Automatic corpus analysis
Automatic corpus analysis is a set of scripts designed to run after the validation step. These scripts automatically find inconsistencies in the links and entities, while extracting the statistics of the corpus. We calculated the inconsistencies by examining every annotated entity and computing the frequency of the same text being annotated with different labels. The script outputs a summary table listing each annotation value together with its labels and frequencies. We visually inspected this table, because the reported inconsistencies can be either obvious mistakes (Table 3) or arise from ambiguities (Table 2); therefore their context should be verified.
Although the links are conceptually non-directed, we have defined a practical convention to maintain their consistency. For example, material-tc is always represented as a link between <tcValue> and <material> entities. The script also computes the statistics (Table 4) for the number of entities (total, unique, by class), the number of links (total, intra-paragraph, and inter-paragraph), and other statistical information.
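The inconsistency check described above amounts to counting, for each surface form, how often it was annotated with each label, and reporting the forms that received more than one label. A minimal sketch follows; the function name and the sample annotations are hypothetical and do not reproduce the authors' scripts.

```python
from collections import Counter, defaultdict


def label_inconsistencies(annotations):
    """Return surface forms that were annotated with more than one label.

    `annotations` is an iterable of (text, label) pairs. The result maps
    each inconsistent text to a {label: frequency} dictionary.
    """
    by_text = defaultdict(Counter)
    for text, label in annotations:
        by_text[text][label] += 1
    return {text: dict(counts) for text, counts in by_text.items() if len(counts) > 1}


sample = [
    ("cuprate", "class"),
    ("cuprate", "class"),
    ("cuprate", "material"),
    ("MgB2", "material"),
    ("MgB2", "material"),
]
report = label_inconsistencies(sample)

# Print a small summary table for visual inspection.
for text, counts in sorted(report.items()):
    for label, frequency in sorted(counts.items()):
        print(f"{text}\t{label}\t{frequency}")
```

The output only flags candidates: whether a flagged entry is a mistake or a legitimate ambiguity still has to be decided by looking at its context.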

Annotation process
The annotation workflow (Figure 2) was designed following the MATTER (Model, Annotate, Train, Test, Evaluate, and Revise) schema [41] and other related work [11,15]. The workflow is composed of five steps: data preparation, correction, validation, testing and evaluation, and revision. This workflow involves three main actors: the automatic process, computer scientists, and the domain experts.
The first step of the annotation process involves preparing the machine-based annotated data from the source PDF documents. The PDF files are converted to an XML-based format, and annotations are automatically applied. This is followed by four more steps:
• Annotation: The human annotator can select a document and manually add, remove, or modify each entity based on rules defined in the guidelines. Once the annotation is complete, the document is marked "ready" for validation.
• Validation/Curation by domain experts: Annotations from different users are validated and merged into a final document (Figure 5). The domain expert ("curator") can compare the different annotated versions, and select the best combination of annotations, or add new ones. This step ensures that the annotations are cross-checked and that the document is validated by domain experts.
• Automatic consistency checks and statistical analysis: This step aims to discover obvious mistakes such as mislabelling or incorrect linking. A sequence labelling model is trained and evaluated using 10-fold cross-validation. The evaluation provides precision, recall, and f-score metrics for all the labels. The resulting model is used for producing machine-based annotated data in the following iteration.
• Review: Retrospective analysis of the past iteration, where unclear cases are discussed and documented in the annotation guidelines.
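The evaluation in the consistency-check step can be sketched as follows. The code shows only how documents could be split into ten folds and how precision, recall, and f-score are computed from gold and predicted entities; training the sequence labelling model is left out because the paper does not describe it here. Exact matching on (start, end, label) triples and the round-robin fold assignment are assumptions of this sketch.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, held_out) index lists for k-fold cross-validation.

    Items are assigned to folds in round-robin order; every item is held
    out exactly once.
    """
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_items) if i not in held]
        yield train, held_out


def precision_recall_f1(gold, predicted):
    """Score predicted entities against gold entities by exact match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


splits = list(k_fold_indices(142, k=10))

gold_entities = {(0, 4, "material"), (10, 14, "tcValue")}
predicted_entities = {(0, 4, "material"), (20, 24, "pressure")}
scores = precision_recall_f1(gold_entities, predicted_entities)
```

In a per-label evaluation the same function would be applied to the subset of entities carrying each label, and the scores averaged over the ten folds.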

Data transformation
There are two processes of data transformation (Figure 3): (a) from the source document (PDF) to the dataset format representation (XML-based), and (b) from the dataset format representation to the annotation tool exchange formats (https://inception-project.github.io/releases/0.16.1/docs/user-guide.html#sect_formats) and vice-versa.
• PDF to XML-based: This step converts the PDF source document to the dataset format representation in XML following the Text Encoding Initiative (TEI, https://tei-c.org/) format guidelines. This transformation is performed by leveraging the functionalities provided by GROBID (https://github.com/kermitt2/grobid).
We developed a customised process for collecting a subset of information from the source PDF document. The process extracts the title, keywords, and abstract from the header; and paragraphs, sections, and figure and table captions from the body. All the callouts to references, tables, and figures are ignored. The resulting structured document is then encoded in XML as will be described below.
• XML to the annotation tool exchange formats: We transform our XML-formatted data into an INCEpTION-compatible import format, such as the Webanno TSV 3.2 (https://inception-project.github.io/releases/0.17.0/docs/user-guide.html#sect_formats_webannotsv3), and vice-versa using a set of Python scripts. The Webanno TSV 3.2 format is an extension of the CoNLL (https://www.signll.org/conll/) format, with additions of the header and column representation.

Data Record
The dataset is composed of 142 PDF documents, of which 92% (130) are OA (Figure 4a). To comply with copyright restrictions, a few articles from our dataset are not publicly available in our repository. The top three publishers represented in the corpus are the American Physical Society (APS), Elsevier, and IOP Publishing (Figure 4b). Figure 4c illustrates the distribution by publication date. We summarise SuperMat's content in Table 4, with the statistics of documents, entities, and links given separately. In particular, this dataset contains 16052 (7166 unique) entities spread over six labels and 1398 links.
Each document is encoded according to the XML TEI guidelines, which is a rich format for document representation. We have carried out no specific customisation, in order to remain fully compliant with the general TEI schema. A TEI document has two main parts: the header (within the <teiHeader> tags) containing all the document metadata, and the body (within the section delimited by the <text> tag). The transformed data has the following structure:
We transformed the source documents into these TEI-compliant structures using a simplified representation for specific content types. The general objective is to flatten the content into a generic structure where priority is given to the annotations. For instance, the keywords section, which groups together the key terms defined by the author(s) of the paper, is encoded using the generic tag <ab type="keywords"> as free text, instead of the dedicated <keywords> element that would typically be part of the header. For both the abstract and the article body, the text is segmented into paragraphs (by means of the <p> element). The text is annotated with the generic <rs> (referencing string) element adorned with three attributes: @type (the entity type), @corresp (to provide a link to another annotation, such as from material to Tc), and @xml:id (to uniquely identify the annotation for referencing or linking purposes).

[...] </p>
In the above snippet, the entities "3.0 GPa", "exceed 45K", and "LaFeAsO1-xHx" are linked together via the attribute pair @corresp and @xml:id. This schema supports multiple annotations to any part of the document. For example, the entity "exceed 45K" has a second link with the corresponding identifier ("#9") to an annotation outside this paragraph.
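To make the role of the three attributes concrete, the following sketch reads <rs> elements from a paragraph and resolves their @corresp pointers against @xml:id values. The sample paragraph is a hypothetical reconstruction built around the three entities named above, not a copy of the dataset: the identifiers x1 to x3, the surrounding sentence, the placement of @corresp on the <tcValue> entity, and the separator between multiple pointers are all assumptions. Real TEI files also declare the TEI namespace, which would have to be added to the tag name when iterating.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical paragraph; identifiers and sentence are illustrative only.
SAMPLE = """<p>Under <rs type="pressure" xml:id="x1">3.0 GPa</rs>, the Tc of
<rs type="material" xml:id="x2">LaFeAsO1-xHx</rs> was found to
<rs type="tcValue" xml:id="x3" corresp="#x1 #x2 #9">exceed 45K</rs>.</p>"""


def extract(paragraph_xml):
    """Return (entities, resolved_links, dangling_links) for one paragraph.

    entities maps an identifier to (type, text). A link is a pair of
    identifiers; it is dangling when its target lies outside the paragraph.
    """
    root = ET.fromstring(paragraph_xml)
    entities = {}
    pointers = []
    for rs in root.iter("rs"):
        ident = rs.get(XML_ID)
        entities[ident] = (rs.get("type"), "".join(rs.itertext()))
        # Accept whitespace- or comma-separated pointers (an assumption).
        for target in re.split(r"[\s,]+", rs.get("corresp", "").strip()):
            if target:
                pointers.append((ident, target.lstrip("#")))
    resolved = [(src, dst) for src, dst in pointers if dst in entities]
    dangling = [(src, dst) for src, dst in pointers if dst not in entities]
    return entities, resolved, dangling


entities, resolved, dangling = extract(SAMPLE)
```

Pointers that cannot be resolved inside the paragraph, such as "#9" here, correspond to links that cross paragraph boundaries and need to be resolved at document level.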

Applications
SuperMat is constructed as a resource for TDM applications in superconducting materials. It can be used as a data source in several complementary tasks: (1) creation of an automatic information extraction system for dataset creation, (2) article classification, (3) named entity extraction (for example, automatic dictionary construction), (4) clustering and document synthesis, (5) training of machine learning (ML) algorithms, (6) evaluation of rule-based or ML-based algorithms, and (7) development of downstream processes, such as a material name parser or quantity normalisation.

Practical applications
Such a dataset may benefit several types of possible applications:
• Evaluation tasks: This corpus can be used for evaluation tasks on automatic extraction. In particular, we can envision two popular tasks in superconducting materials science, namely: (a) Named Entity Recognition (NER) and (b) Entity Linking (EL) methods. EL techniques have been mainly designed and studied using text from Wikipedia and newswire services, which represent most of the available data. To the best of our knowledge, however, there is no application within materials science.
• Automatic information extraction for superconducting materials: This dataset can be used as training data for such a purpose. Automatic information extraction using ML and text mining techniques can accelerate the construction of databases for superconducting materials.
• Document retrieval: Information retrieval is a key application helping researchers overcome information overload. One way is through query expansion to cover multiple expressions of the same term. By collecting and clustering all expressions under the same concept, it would be possible to retrieve documents when, for example, the resistivity measurement is described by a phrase other than "resistivity". Furthermore, the assigned labels can be used to boost documents where a certain term carries a specific label. For example, cobalt oxide can appear as either <material> or <class> depending on the context, while a user would like to obtain documents where cobalt oxide appears as <material>.
• Weighted clustering: Scientific document clustering has recently gained growing attention because of its potential capacity for finding additional relevant documents of interest. For example, clustering can help locate similar experimental settings in a large collection of documents. However, clustering documents based on their general content might not be optimal for finding such detailed similarities. Annotations can be leveraged to tilt the clustering algorithm toward entity similarity, which may provide a clustering that is more focused on a specific type of information.
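The label-aware retrieval idea from the list above (preferring documents where cobalt oxide appears as <material> rather than <class>) can be sketched as a simple scoring function. The index layout, the function name, and the boost factor are hypothetical choices made for illustration; they do not describe an existing search system.

```python
def search(index, term, label=None, boost=2.0):
    """Rank documents that mention `term`, boosting the requested label.

    `index` maps a document identifier to a list of (surface, label)
    mentions. Each matching mention scores 1.0, or `boost` when its label
    equals the requested one. Ties are broken by document identifier.
    """
    needle = term.lower()
    scores = {}
    for doc_id, mentions in index.items():
        score = 0.0
        for surface, mention_label in mentions:
            if surface.lower() != needle:
                continue
            score += boost if mention_label == label else 1.0
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


toy_index = {
    "doc-a": [("cobalt oxide", "class")],
    "doc-b": [("cobalt oxide", "material"), ("39 K", "tcValue")],
    "doc-c": [("MgB2", "material")],
}
ranked = search(toy_index, "cobalt oxide", label="material")
```

Query expansion could be layered on top by replacing the single term with the set of surface forms clustered under the same concept.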

Technical Validation
The following measures were employed to ensure the creation of a high-quality dataset:
• Each document was revised and validated by domain experts.
• The workflow begins by assigning machine-based annotated data. This has been shown to improve the annotation task in several aspects, namely: time consumption, error rate, and annotation agreement [37][38][39].
• On-the-fly automatic annotation recommendations provide fresh suggestions based on online decisions made by the annotators.
• The annotators have rapid access to changes in the annotation guidelines.
• The discussions were documented and linked in the guidelines.
• Reviews are discussed and approved collaboratively between domain experts and other annotators.
These guidelines are a vital piece of this work since they contain knowledge accumulated from these activities. However, measuring the completeness of the guidelines is challenging. Assuming that the documents validated by domain experts represent the ground truth, we conducted an IAA analysis between different annotators against the ground truth, using the Krippendorff's alpha metric [33]. Table 5 shows the average IAA, which is satisfactory with a value of approximately 0.9. The highest score is obtained for the <material> entities, while the lowest one is obtained for <pressure>, which appears less frequently in the papers. The agreement for <tcValue> can appear to be too low as compared with other labels such as <class>, which is, at first look, more ambiguous. We analysed the different cases and identified three reasons why this happens. First, <tcValue> may depend heavily on the context, which requires more human attention, and it is therefore more prone to errors. Second, our suggestion system is challenged in its ability to disambiguate critical temperatures from other temperature data, leading to incorrect or invalid suggestions. Finally, the presence of mathematical symbols (e.g. "~", "<", and ">") or other modifiers ("up to", "exceeds", etc.) before the <tcValue> could generate small disagreements that accumulate in the average score.
To more precisely isolate the impact of the guidelines, we grouped the IAA results by level of domain experience. Table 6 displays the IAA between the validated data and the data corrected by (a) domain experts (researchers who conduct experiments on superconductor development), (b) non-domain experts (researchers with no experience with superconducting materials), and (c) novices (students in materials science with limited domain experience). As expected, the domain experts have the highest agreement, and their IAA value (around 0.95) is 0.06 higher on average than that of non-domain experts. This suggests that superconducting materials form a complex domain that requires knowledge of materials science to produce high-quality data, and that crowdsourcing initiatives such as Amazon Mechanical Turk might not work well.
Furthermore, we measured the reliability of the guidelines by observing how quickly novices could reach a satisfactory agreement with the validation of the domain experts, without any previous training on the guidelines. As shown in Table 6, the novices can attain high IAA results by only using the guidelines and our annotation support tools. The average difference in agreement with domain experts (around 0.05) indicates that the guidelines are precise and complete, and that the annotation tools offer sufficient support.

Conclusions
In this paper, we described the construction of an annotated linked dataset from scientific publications on superconductor development. SuperMat aims to establish a solid infrastructure on which to build or improve TDM processes in the superconductor materials domain. We annotated 142 full-text articles, where the data was automatically extracted from the PDF document and encoded following the XML TEI guidelines, providing a basic structure of the original document. The dataset is validated by domain experts and provides 16052 entities of six categories, and 1398 links between materials, properties, and conditions. This approach can be extended to other materials domains following a similar methodology.

Figure 1. Example in the annotated corpus. The excerpt was taken from [43].

Figure 3. Summary of the data transformation flows.

Figure 4. Distribution of papers in the dataset by (a) license, (b) publisher, and (c) year of publication.

Table 1. Summary of the IAA for each annotation iteration.

Table 2. Inconsistencies resulting from the overlapping of <material> and <class> labels.

Table 3. Inconsistencies resulting from human mistakes.

Figure 2. Annotation workflow. Different colours illustrate the involvement of each group at each step of the workflow.

Table 4. Statistical overview of the dataset. Links_ip indicates the number of links within the same paragraph (intra-paragraph). Links_ep indicates the number of links between different paragraphs (extra-paragraph).

Table 5. Average IAA between the annotated and validated documents.

Table 6. Calculated IAA for annotations produced by domain experts, non-domain experts, and novices compared to the validated version. Annotations from domain experts are cross-validated.