Automatic knowledge acquisition from superconductivity information in literature

Abstract
In this study, we developed a natural language processing model for extracting information solely from the abstracts of literature on superconducting materials, with the aim of making predictions for materials science. Using a dataset of tagged documents (annotations) on superconductivity, the DyGIE++ framework was employed for the simultaneous extraction of the named entities, relations, and events. Additionally, a model was developed for classifying the subject material in the abstracts. After training with 1,000 annotated abstracts, the model extracted information, such as the material composition, superconducting transition temperature, doping information, and process information, automatically from 48,565 abstracts registered in the Scopus database since 1937. The numbers of extracted entries concerning superconducting materials and transition temperatures were 43,944 and 24,075, respectively, i.e., equivalent to the number of entries in the existing databases. Machine learning models were constructed to predict physical and chemical properties. For example, the superconducting transition temperatures were predicted for compositions, with a mean absolute error of 15 K. In addition, the doping information indicated that the superconducting transition temperature was correlated with the choice of dopant and doping site.


Introduction
Materials informatics (MI) aims to efficiently explore materials using information science approaches such as machine learning [1]. Although constructing an extensive database of materials and physical properties can generally enable data-driven research, such large structured databases are not generally available. In recent years, active studies have been conducted to extract desired information from unstructured literature text using natural language processing (NLP) based on deep learning. In this context, NLP has attracted attention as a technique for automatically and sustainably constructing the databases necessary for MI [2]. In specific applications of NLP, a large amount of human effort is often required to annotate the literature for supervised learning. This requirement has limited the extension of corpora to a wide range of fields.
Research on information extraction in materials science has gained momentum over the past few years [3][4][5][6][7][8]. Watson et al. proposed a method for extracting terms, such as material compositions and analytical methods, for inorganic materials [8], and Friedrich et al. proposed an information extraction task related to solid oxide fuel cells [7]. However, the corpus of materials that includes compositions, structures, properties, synthesis processing, and analyses remains in an early stage of development. Superconducting materials are an important research field in science and technology, with a wide range of applications. In particular, the development of high-temperature superconductors able to exhibit superconductivity at higher transition temperatures (typically higher than 77 K, the boiling temperature of liquid nitrogen) has been intensively pursued for practical applications, such as linear bullet trains, medical analyses, and solutions to energy problems. To this end, the exploration of new superconducting materials remains an active topic.
There have been a few studies on extracting information on superconducting materials from the literature. Foppiano et al. constructed the corpus 'SuperMat' by extracting information from annotated superconductivity literature [9]. The compositions and superconducting transition temperatures were linked using a rule base. Court and Cole constructed a classification model for the Curie temperature, Néel temperature, and superconducting critical temperature from 300 phase-transition records, and predicted the phase transitions of superconducting materials [10].
In this study, we used a superconductivity corpus for materials informatics (SC-CoMIcs), a tagged textual corpus tailored for the superconductivity domain [11][12][13]. It enables fully automated annotation of abstracts related to superconductivity. We applied SC-CoMIcs to perform a slot extraction of information, such as material compositions, superconducting transition temperatures, doping information, and process information, from approximately 48,000 literature abstracts. The extracted information was validated by comparison with existing superconductivity databases. In addition, useful knowledge for designing materials, such as predictions of transition temperatures and selections of doping materials, could be obtained using only the automatically extracted information.

Methods
Figure 1 presents an overview of the proposed information extraction system. The system defines the named entities and relations/events, and then performs slot filling by linking material compositions with doping conditions, physical properties, and synthesis process conditions. Then, 1,000 manually annotated abstracts from the literature on superconducting materials are used to train and validate two independent neural models employed in the system. DyGIE++ [14] simultaneously extracts named entities, relations, and events, and the classification model identifies the subject material from the abstract (called the 'Main' entity class). The first slot filling is performed based on the relations automatically tagged by DyGIE++. Rule-based linking is then performed for the main entity, allowing for additional slot extractions for information implicitly related to the subject material throughout the abstract (even if there is no specific relation assigned by DyGIE++).
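The two-stage slot filling described above can be illustrated with a minimal sketch. It assumes that entities arrive as a mapping from an identifier to a (label, text) pair and that relations are (material, value) identifier pairs; the labels 'Material', 'Tc', and 'Process', the `Slot` class, and the function `fill_slots` are hypothetical names chosen for illustration and are not the actual DyGIE++ output format or SC-CoMIcs schema.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """One material and the values linked to it (hypothetical structure)."""
    material: str
    properties: dict = field(default_factory=dict)


def fill_slots(entities, relations, main_id):
    """Two-stage slot filling.

    entities:  {entity_id: (label, text)}
    relations: [(material_id, value_id), ...] as tagged by a relation extractor
    main_id:   id of the entity classified as the subject ('Main') material
    """
    slots = {
        eid: Slot(text)
        for eid, (label, text) in entities.items()
        if label == "Material"
    }
    linked = set()
    # Stage 1: fill slots from explicitly tagged relations.
    for material_id, value_id in relations:
        label, text = entities[value_id]
        slots[material_id].properties.setdefault(label, []).append(text)
        linked.add(value_id)
    # Stage 2 (rule-based): attach every unlinked value to the main material.
    for eid, (label, text) in entities.items():
        if label != "Material" and eid not in linked:
            slots[main_id].properties.setdefault(label, []).append(text)
    return slots


entities = {0: ("Material", "MgB2"), 1: ("Tc", "39 K"), 2: ("Process", "annealing")}
slots = fill_slots(entities, relations=[(0, 1)], main_id=0)
print(slots[0])
```

In this toy input, the transition temperature is linked by an explicit relation, whereas the process entity has no relation and is attached to the main material by the fallback rule.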
In this study, the literature abstracts were obtained from the Scopus peer-reviewed literature database by using application programming interfaces (APIs) with appropriate search conditions ('supercond* AND tc OR transition temp'). After excluding 1,000 literature abstracts used to construct the SC-CoMIcs, 47,965 abstracts registered from 1937 to 2021 were retrieved. The extracted information included the superconducting transition temperature, doping information, and synthesis process, as these were associated with the material composition. For verification, we compared the statistics regarding the extracted superconducting transition temperatures with those from the existing 'SuperCon' database provided by the National Institute for Materials Science [14].
In superconducting materials, doping with additional elements often significantly changes properties, such as electron or hole densities and superconducting transition temperatures. Doping can be specified based on a small amount of additives called 'dopants' and the 'sites' to which the dopants are introduced. For example, in the case of the composition Cu1−xZnxO, Zn is the dopant and Cu is the site. This can be expressed by the equivalent phrase, 'Zn is doped for the Cu site in CuO' in the literature. In this study, we considered both cases: the former was extracted from the chemical compositions, and the latter was extracted by DyGIE++, which was trained based on doping event triggers, such as 'doped' and 'substituted', and event arguments, such as 'dopant' and 'site'. Regarding the doping information, we investigated the correlations between the dopant and site elements using scatter plots of the atomic radius, ionic radius, and electronegativity of the dopant and site atoms. The relationship between the doping information and the transition temperature was also examined using the slot extraction. The highest transition temperature under the doping conditions was used as the transition temperature. The results were plotted in a three-dimensional scatter plot, and the distribution was analyzed.
To perform data-driven material exploration from the extracted information, we attempted to construct a machine learning model for predicting transition temperatures from composition information. In previous studies, Stanev et al. [15,16] and Z. Liu et al. [17] constructed regression models for superconducting transition temperatures using descriptor vectors generated from compositional formulas in the superconducting material database SuperCon. To test our fully automatic prediction from the extracted information in the literature, we vectorized the material compositions using 'magpie' descriptors [18], and constructed a random forest regression model to predict the superconducting transition temperature.
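As a hedged illustration of the first case (doping read directly from a composition), the sketch below treats an element whose subscript is the bare variable x as the dopant and an element whose subscript has the form n−x as the site. The function `parse_doping` and its regular expressions are assumptions made for this example; they are not the rules used in this study and handle only the single variable x.

```python
import re

# An element symbol followed by an optional subscript. The second letter of the
# symbol excludes 'x' so that 'Fx' is read as F with subscript x (no element
# symbol ends in 'x').
TOKEN = re.compile(r"([A-Z][a-wyz]?)([0-9x.\-]*)")
SITE_SUBSCRIPT = re.compile(r"\d+(\.\d+)?-x")


def parse_doping(formula):
    """Return dopant and site elements from a formula such as 'Cu1-xZnxO'."""
    # Normalize extraction artifacts: drop spaces, map the minus sign to '-'.
    formula = formula.replace(" ", "").replace("\u2212", "-")
    dopants, sites = [], []
    for element, subscript in TOKEN.findall(formula):
        if subscript == "x":
            dopants.append(element)
        elif SITE_SUBSCRIPT.fullmatch(subscript):
            sites.append(element)
    return {"dopant": dopants, "site": sites}


print(parse_doping("Cu 1\u2212x Zn x O"))
print(parse_doping("LnFeAsO1-xFx"))
```

For the composition in the text, this yields Zn as the dopant and Cu as the site; for LnFeAsO1−xFx, it yields F as the dopant and O as the site.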
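To make the vectorization step concrete, the following sketch computes composition-weighted statistics of one elemental property, in the style of magpie descriptors. It is an illustration only: the two-element property table and the function `composition_descriptor` are assumptions for this example, whereas the actual descriptor set covers many more elemental properties and statistics, and the resulting vectors would then be passed to a random forest regressor.

```python
# Illustrative elemental properties (atomic number, Pauling electronegativity).
# A real descriptor set uses a much larger table; this one is an assumption.
ELEMENT_PROPS = {
    "Mg": {"atomic_number": 12, "electronegativity": 1.31},
    "B": {"atomic_number": 5, "electronegativity": 2.04},
}


def composition_descriptor(composition, prop):
    """Composition-weighted statistics of one elemental property.

    composition: {element: amount}, e.g. {"Mg": 1, "B": 2} for MgB2
    prop:        key in ELEMENT_PROPS[element]
    """
    total = sum(composition.values())
    fractions = {el: amount / total for el, amount in composition.items()}
    values = {el: ELEMENT_PROPS[el][prop] for el in composition}
    mean = sum(fractions[el] * values[el] for el in composition)
    # Fraction-weighted mean absolute deviation from the weighted mean.
    avg_dev = sum(fractions[el] * abs(values[el] - mean) for el in composition)
    lowest, highest = min(values.values()), max(values.values())
    return {
        "mean": mean,
        "avg_dev": avg_dev,
        "min": lowest,
        "max": highest,
        "range": highest - lowest,
    }


mgb2 = {"Mg": 1, "B": 2}
vector = [
    composition_descriptor(mgb2, prop)[stat]
    for prop in ("atomic_number", "electronegativity")
    for stat in ("mean", "avg_dev", "min", "max", "range")
]
print(vector)
```

Because the statistics depend only on atomic fractions, compositions written with different overall scaling map to the same vector.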

Results
The number of literature abstracts, extracted by year, is shown in Figure 2. The number of studies on superconducting materials increased dramatically after 1987, when YBa2Cu3O7−δ was discovered by C.W. Chu et al. [18]. The number of studies has since decreased, but still exceeds 500 published papers each year.
The results from the slot extraction using SC-CoMIcs are summarized in Table 1. We compared two databases: one from the 1,000 abstracts used for training SC-CoMIcs, and another comprising 47,965 abstracts. The numbers of results for the subject materials in 'Main' classes, and the superconducting properties (such as Tc) in 'SC' class are approximately proportional to the numbers in the database, as expected. In contrast, the numbers of results for doping information are approximately 10 times as many as those from the database of 47,965 abstracts. This is probably because the search conditions for the 1,000 literature abstracts included 'doping'.
A histogram of the extracted transition temperatures is shown in Figure 3. The red lines show the transition temperatures of the representative superconducting materials, which coincide with the histogram peaks. The statistics for the extracted transition temperatures for each material group are listed in Table 2. It can be seen that the peaks in the histogram and the average Tc values for the material groups containing Mg, Fe, and Cu correspond well to the Tc values of MgB2 (39 K), LnFeAsO1−xFx (26-55 K), and YBa2Cu3O7−δ (60-90 K), respectively. Figure 4 shows a comparison between the extracted data and existing SuperCon data. Although the extracted data contain some noise, the distributions and magnitudes are generally consistent. These results indicate that the database automatically extracted only from the abstracts can cover most of the subject materials and their transition temperatures as reported in the literature.
Figures 5-7 show scatter plots of the atomic radius, ionic radius, and electronegativity for the dopant and site atoms, respectively. We used 6,097 dopant-site pairs from the extracted slots. The distribution of the data was fitted by a Gaussian function, as shown in the contours. The straight lines indicate the principal direction of the Gaussian distribution and coincide with the diagonal direction of the plots. This means that the dopant and site atoms tend to be selected so that their atomic radii, ionic radii, and electronegativities are close to each other. This is consistent with the Hume-Rothery rule, an empirical law concerning the solubility of doping [19,20].
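The principal direction of a fitted two-dimensional Gaussian can be obtained from the sample covariance. The sketch below assumes that the fit reduces to the sample mean and covariance of the (dopant, site) pairs and returns the orientation of the major axis; the function name `principal_direction_deg` and the toy points are assumptions for illustration, not the procedure or data of this study.

```python
import math


def principal_direction_deg(points):
    """Angle (degrees from the x-axis) of the major axis of the 2D Gaussian
    defined by the sample covariance of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xx = sum((x - mean_x) ** 2 for x, _ in points) / n
    s_yy = sum((y - mean_y) ** 2 for _, y in points) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Orientation of the eigenvector with the largest eigenvalue.
    return math.degrees(0.5 * math.atan2(2.0 * s_xy, s_xx - s_yy))


# Toy (dopant radius, site radius) pairs in pm; values are made up.
pairs = [(120.0, 120.0), (130.0, 130.0), (140.0, 140.0), (150.0, 150.0)]
print(principal_direction_deg(pairs))
```

An angle close to 45 degrees corresponds to the diagonal of the plot, that is, to dopant and site atoms having similar property values.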
The scatter plots of the doping information were color-coded based on the transition temperatures of the slots, as shown in Figure 8. We observed two clusters in the data centered at 150-200 K and 100-150 K. A mixed Gaussian distribution was then employed to distinguish between the two classes of clusters in the transition temperatures. The results are shown in Figure 9. Clearly, two clusters are visualized: one with an average transition temperature of 45 K for smaller covalent radii, and another with an average transition temperature of 112 K for larger covalent radii. The average atomic radii of the dopants and sites for each cluster are (138.4 pm, 135.08 pm) and (176.47 pm, 175.56 pm), respectively. The higher transition temperature cluster (class 1) predominantly includes cuprates, while the lower transition temperature cluster (class 0) includes all kinds of materials (Fig. S1). The doped sites appearing frequently in class 1 are Ba, Ca, Sr, Tl, Bi, and Hg (Fig. S2), which are chosen to achieve a high transition temperature for cuprates; those in class 0 are Cu, La, Fe, Zr, Si, B, Ru, O, C, F, As, and Al, which are typical for Fe-based materials such as LnFeAsO1−xFx. Note that although we chose two classes of clusters to find a clear correlation between the transition temperature and the atomic radius of the dopant/site, as shown in Figure 9, the number of clusters chosen for this analysis is rather arbitrary, as discussed in the Supplemental Information.
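A two-component Gaussian mixture of the kind used to separate such clusters can be sketched with the expectation-maximization algorithm. The example below is a deliberately simplified one-dimensional version acting on transition temperatures only, whereas the analysis in this study also involves the dopant and site radii; the function `fit_two_gaussians`, its initialization, and the toy values are assumptions for illustration.

```python
import math


def fit_two_gaussians(values, n_iter=100):
    """Fit a 1D two-component Gaussian mixture by EM.

    Returns (means, labels); component 0 is initialized at the minimum value
    and component 1 at the maximum value.
    """
    n = len(values)
    mean_all = sum(values) / n
    var_all = sum((v - mean_all) ** 2 for v in values) / n + 1e-6
    means = [min(values), max(values)]
    variances = [var_all, var_all]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            dens = [
                w * math.exp(-((v - m) ** 2) / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)
                for w, m, s in zip(weights, means, variances)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and variances.
        for k in range(2):
            n_k = sum(r[k] for r in resp)
            weights[k] = n_k / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / n_k
            variances[k] = (
                sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / n_k
                + 1e-6  # small floor to keep the variance positive
            )
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return means, labels


# Toy transition temperatures in K; values are made up.
tc_values = [40.0, 45.0, 50.0, 105.0, 112.0, 119.0]
means, labels = fit_two_gaussians(tc_values)
print(means, labels)
```

With this initialization, class 0 corresponds to the lower-temperature cluster and class 1 to the higher-temperature cluster, matching the class naming used in the text.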
Using the extracted compositions and transition temperatures, we trained the random forest regression model using magpie descriptors. Table 3 summarizes the performance of the predictions of the transition temperature after the optimization of the hyperparameters in the model. As shown, selecting only the maximum value improves the prediction accuracy. In addition, using only the automatic tying with DyGIE++ results in better accuracy relative to using all slot-extracted data. This can be attributed to the fact that DyGIE++ based binding has a higher fit rate; this reduces the amount of noise information and correspondingly increases the prediction accuracy for the transition temperatures.
Figure 10 shows the prediction results for the test data from the model trained with the maximum transition temperature for each composition. The performance is quantified with a coefficient of determination R² = 0.708 and mean absolute error MAE = 13.63 K. Although the prediction is good overall considering the present fully automatic procedure, there are scattered data in the plot, which hinder accurate predictions, as discussed in the Supplemental Information. Figure 11 shows the distribution of errors between 'Predict' and 'Target'. The peak near 0 and symmetrical distribution indicate that the prediction model is unbiased in its errors.
One of the sources of noise is that the same composition may have several different transition temperatures when 'Tc is 70-88 K' exists in the sentence. As a countermeasure, we took the maximum value as the transition temperature for the same composition in the literature. However, this cannot solve the case where multiple transition temperatures are obtained for the same composition but under different conditions. Such a problem may intrinsically exist when extracting only abstracts containing limited information. However, one could improve the model by including additional information, such as doping and process information, in the slots.
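For reference, the two metrics quoted above follow their standard definitions, MAE = (1/n) Σ|y_i − ŷ_i| and R² = 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)². The sketch below implements them directly; the function names and the toy values are assumptions for illustration and are unrelated to the data in Figure 10.

```python
def mean_absolute_error(targets, predictions):
    """Average absolute difference between target and predicted values."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def r_squared(targets, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


# Toy transition temperatures in K; values are made up.
target = [10.0, 20.0, 30.0]
predict = [12.0, 18.0, 33.0]
print(mean_absolute_error(target, predict), r_squared(target, predict))
```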
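The countermeasure of keeping the maximum value can be sketched as follows. The example assumes that each record is a (composition, text) pair and that temperatures appear as a number or a hyphenated range followed by K; the regular expression and the functions `max_tc` and `max_tc_per_composition` are hypothetical and are not the extraction rules of this study.

```python
import re

# A temperature such as '39 K' or a range such as '70-88 K' (assumed format).
TC_PATTERN = re.compile(r"(\d+(?:\.\d+)?)(?:\s*[-\u2013]\s*(\d+(?:\.\d+)?))?\s*K\b")


def max_tc(text):
    """Largest temperature in kelvin mentioned in the text, or None."""
    values = []
    for low, high in TC_PATTERN.findall(text):
        values.append(float(low))
        if high:
            values.append(float(high))
    return max(values) if values else None


def max_tc_per_composition(records):
    """Keep the maximum transition temperature for each composition."""
    best = {}
    for composition, text in records:
        tc = max_tc(text)
        if tc is not None:
            best[composition] = max(tc, best.get(composition, tc))
    return best


records = [
    ("YBa2Cu3O7", "Tc is 70-88 K"),
    ("YBa2Cu3O7", "a transition at 60 K"),
    ("MgB2", "MgB2 has Tc of 39 K"),
]
print(max_tc_per_composition(records))
```

As noted above, this reduction does not distinguish values measured under different conditions; it only collapses ranges and repeated mentions to a single number per composition.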

Conclusion
We have demonstrated a fully automatic system for obtaining structured data on superconducting materials from approximately 48,000 literature abstracts. We then applied it to acquire useful knowledge on materials science, such as doping selections and predictions of transition temperatures for compositions. Statistical comparison showed that the number of transition temperatures extracted by the present system using only the abstracts was approximately similar to that from an existing database constructed with great human effort and time. Although the extracted data include noise owing to imperfections in the model and limitations concerning the information described in the abstracts, the benefit of the present system should increase with an extension of sources, such as patents, and with immediate online updates in real time.
We emphasize that the slot filling in the present method, linking material compositions with doping conditions, physical properties, and synthesis process conditions, enables us to construct machine learning models to find new physical and chemical knowledge behind the relations and to predict the relations for new materials. Our findings from the extracted data, regarding the selection of doping materials and the prediction of the transition temperatures, encourage data mining in the literature to identify new 'empirical' knowledge and predictions in materials science based on human scientific activity in the past and future. Thus, the statistical analysis and visualization of a large amount of data will become increasingly important.

Figure 1. Overview of the extraction system.

Figure 2. Number of literature abstracts from Scopus for each year.

Figure 3. Frequency of transition temperatures in abstracts.

Figure 4. Comparison between the extracted data and the existing SuperCon database.

Figure 5. Scatter plots of covalent radius for the dopant and site atoms.

Figure 6. Scatter plots of ionic radius for the dopant and site atoms.

Figure 7. Scatter plots of electronegativity for the dopant and site atoms.

Figure 8. Scatter plots of covalent radius for the dopant and site atoms with transition temperatures indicated in the color bar.

Figure 9. Scatter plots of covalent radius for the dopant and site atoms with transition temperatures. A mixed Gaussian distribution was employed to distinguish the two classes indicated as blue and red.

Table 1. Slot extraction for each class. The numbers of abstracts used for the extractions are 1,000 and 47,965, respectively.

Table 2. Frequencies of each material group.