Creating research topic map for NIMS SAMURAI database using natural language processing approach

ABSTRACT In this study, we present an approach to create a visual research topic map for materials science researchers from a large collection of archived research papers using natural language processing (NLP). We apply this approach to SAMURAI, a directory service for the researchers of the National Institute for Materials Science (NIMS), Japan. Visualization of research content supports exploratory searches by helping readers absorb more information and intuitively capture the research characteristics of each materials science researcher. In addition, a research topic map can connect researchers with similar topics to help them find potential collaborators, and collaboration can support the advance of scientific research. We analyze all available publications of a researcher using frequent term analysis. In addition, materials science knowledge resources were utilized, including dictionaries and automatic extraction of material names. Noise reduction was implemented using stop word filtering. The topics were then visualized using a word cloud technique for each researcher. An analysis of the topic similarity was conducted to find researchers that share similar topics, leading to the creation of a topic map for each researcher. The approach aims at maximizing information absorbance for public knowledge by applying NLP approaches to information mining from materials science research papers. The NLP analysis and visualization code is available at https://github.com/ThaerDieb/Topic_map.


Introduction
With the large volume of research papers published continuously by materials scientists, it is desirable to maximize how much of the information in these papers can be absorbed. Natural language processing is receiving increasing attention as a way to analyze scientific publications for knowledge discovery [1,2], for example, by clustering scientific publications [3]. Recently, topic extraction and visualization have been utilized to tackle this issue because their intuitive display of information facilitates information access. Additionally, topic visualization allows a reader to capture the research trend quickly without having to navigate an entire list of publications. Furthermore, it can link researchers with similar research topics to find potential collaborations, leading to the construction of a topic map. Such digital infrastructure systems are necessary to facilitate the accumulation and distribution of scientific data.
In this paper, we present an approach to create a research topic map for materials science researchers from a large collection of publications using natural language processing techniques and materials science knowledge resources such as dictionaries. We extract representative topic terms from research publications that are related to materials science and engineering. We use frequent term analysis and automatic extraction of material names in addition to domain knowledge resources. Noise reduction is implemented by removing stop words. A word cloud approach was used to visualize the extracted terms. Finally, a topic similarity analysis was conducted to find researchers who share similar topics, leading to a research topic map for each researcher.
Visualization of topics from unstructured text has been previously studied using several different approaches. For example, Miller et al. applied wavelet transforms to a custom digital signal constructed from words within a document to create a visualization [4]. Recently, the word cloud visualization technique [5,6] has been widely used to present the contents and themes in a text for summary and visualization. Such a technique is a visual representation of word frequency using different font sizes and colors to represent terms with different levels of importance. Examples include the work of Gottron et al. on visualizing web documents [7] and the ConceptCloud browser by Dunaiski et al., which combines concept lattices and tag clouds in an interactive word cloud visualization to present academic publication data [8]. Venetis et al. presented metrics to capture the structural properties of a tag cloud, and a set of tag selection algorithms [9]. Some publishers have implemented word clouds from research papers on a large scale. An example of this is Elsevier SciVal [10]. Although these studies had a significant impact on the visualization of research content, there has been no focus on the materials science domain. It is necessary to use domain-aware resources to improve the processing of materials science research publications. A few attempts have been made to visualize the research content in materials science; however, they have been restricted to a limited research area such as quantum materials only [11].
Some services, such as 'ResearchGate' and 'Researchmap', provide collaborator suggestions; however, these functions are for the general science community and do not focus on materials science. In addition, our proposed system provides a wider connected network of potential collaborators. A special feature of this system is that collaborator suggestion is focused on each topic of interest rather than on the general researcher output. For example, a materials scientist may seek a collaborator for developing a certain material or a specific measurement method.
A key task in the construction of a topic map is the extraction of key terms [12,13] that represent the researchers' topics from their publications. Frequent term analysis has been utilized for text mining applications. For example, Nagwani et al. designed a frequent term based text summarization algorithm [14]. The term frequency can reflect the importance of a term in a text or a collection of texts. However, in scientific publications, term frequency analysis based on the general content of the paper is insufficient. It is necessary to study the term frequency for domain-specific terms. For example, in materials science, researchers are interested in searching for publications discussing the synthesis or characterization of certain materials. In this study, we utilized domain knowledge dictionaries and an automatic chemical formula extraction tool to extract material names and other domain-specific terms from research publications. In addition, the importance of a term can be weighted depending on not only its frequency but also other factors such as its location in the text (title, keywords, abstract).
In this study, we validate the proposed approach on SAMURAI [15], a researcher directory service maintained by the National Institute for Materials Science (NIMS), Japan for its researchers. Publications for each researcher are linked to the NIMS Digital Library [16]. SAMURAI can synchronize the profile information of researchers and their publications self-archived in the NIMS institutional repository systems [17][18][19][20]. This database has interoperable functions using the Open Researcher and Contributor ID (ORCID) [21]. It identifies authors of journal articles from NIMS and research members of NIMS through the implementation of a global ID. Each researcher profile is attached to a unique ResearcherID linked to major global databases such as KAKEN (Database of Grants-in-Aid for Scientific Research by the National Institute of Informatics (NII)). Through this directory, NIMS is promoting materials research, supporting the management of its researchers' activities, and introducing NIMS researchers and their work to the public.
Even though we validated this approach on SAMURAI, it can be applied to visualize the content of research papers related to materials science domain in general as it integrates domain knowledge features and it is not dependent on the database itself.
The remainder of this paper is organized as follows. In Sec. 2, we introduce our approach for researcher output visualization and topic map construction. The details of the experiment and results are discussed in Sec. 3. Sec. 4 then provides some concluding remarks and discusses areas of future study.

Outline
The proposed approach consists of six main components. In the data collection component, a set of materials science publications is collected in XML format for a group of researchers from a database. Next, these publications are preprocessed by extracting certain sections and removing noisy data. Additionally, domain knowledge terms are extracted in a separate step. After that, frequent term analysis is conducted for each extracted section. In the visualization component, a vector representing each researcher's topics is constructed based on frequent term analysis of the researcher's publications. This vector is visualized using a word cloud technique. In the final step, similarity analysis between different researchers is conducted to construct the topic map. Figure 1 illustrates the outline of the system. Details of each component are discussed in the following subsections.

Data collection
The SAMURAI database holds a list of publications for each researcher that is linked to the NIMS digital library and other data sources. A list of publication digital object identifiers (DOIs) [22] for each researcher was constructed from the researcher's profile in SAMURAI (in SAMURAI, the DOIs are stored in a PostgreSQL database). Using the DOIs, recent publications, mainly available beginning in 2004, were retrieved from the NIMS text and data mining platform (TDM-PF). The dataset used for this study consists of 8269 published articles by 1058 SAMURAI researchers in XML format. Publications are the sum of all retrieved publications for each author using the DOIs.

Data preprocessing
We extracted the plain text from the following sections of all the publications authored by each SAMURAI researcher:
• Title of the paper
• 'Keywords' section
• Abstract of the paper
We used the XML tag name for each section provided by the publisher to extract the section. Because the main body of the publication is not collected, the effect of including the 'Keywords' section is balanced. The text of each section is segmented into tokens using the NLTK word tokenizer. The Natural Language Toolkit (NLTK) is an open-source Python package with data sets, supporting research and development in natural language processing [23]. The following processing steps are then conducted on the tokenized texts to remove noisy data:
□ Removing numeric values and punctuation marks (for example, "23.5", "!", "?"). Such data are not related to the topics discussed in the publications. Numerical values are important in some "grey area" topics, such as when reporting catalysts with "high yield"; however, it is beyond the scope of this study to tackle these issues.
□ Filtering general English language stop-words such as "but", "an", "he". These stop-words occur frequently in English but do not carry a thematic component or significance themselves [24].
□ Removing physical units such as "m" (meter) for length measurement and "K" (Kelvin) for temperature measurement, among others. Units are frequently found in materials science research publications; however, for our objective in this study, they are not informative regarding the research output. The list was compiled using SI-based and derived units [25].
Other terms can also be filtered out depending on the application. For example, too frequent terms in a certain sub-domain of materials science do not contribute to similarity metrics between researchers in this sub-domain.
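The filtering steps above can be sketched as follows. This is a minimal illustration rather than the code used in the study: a simple regular-expression tokenizer stands in for the NLTK word tokenizer, and `STOP_WORDS` and `UNITS` are small hypothetical subsets of the full stop-word and SI unit lists.

```python
import re

# Illustrative subsets only (assumption): the study uses a full English
# stop-word list and a list compiled from SI base and derived units.
STOP_WORDS = {"but", "an", "he", "the", "a", "of", "and", "at", "is", "was", "in"}
UNITS = {"m", "K", "nm", "eV", "Pa"}

# Words (letters, digits, hyphens), numbers, or single punctuation marks.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9\-]*|\d+(?:\.\d+)?|[^\sA-Za-z0-9]")


def preprocess(text):
    """Tokenize a section and drop numbers, punctuation, stop words, and units."""
    kept = []
    for tok in TOKEN.findall(text):
        if not re.match(r"[A-Za-z]", tok):   # numeric values and punctuation
            continue
        if tok.lower() in STOP_WORDS:        # general English stop words
            continue
        if tok in UNITS:                     # physical units (case-sensitive)
            continue
        kept.append(tok)
    return kept


print(preprocess("The film was annealed at 300 K, but the resistance is 23.5!"))
```

Unit matching is kept case-sensitive in this sketch so that "K" (Kelvin) is removed without also removing every lowercase "k".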

Motivation
We aim to construct a topic map for materials science researchers, so they can find potential collaborators.
Using general linguistic features only is not sufficient to compute the similarity between researchers' topics that are focused on materials science. It is necessary to include domain-related knowledge terms. The use of domain knowledge features has been shown to improve the performance of text mining applications in materials science [26]. In this study, we used two types of domain knowledge resources: a chemical compound extractor and a dictionary of measurement-related terms.

Extraction of chemical compounds
In materials science research, material names play an extremely informative role in the content of the publication. Many publications are centered around the synthesis or characterization of a specific material. A segmentation of text using a general language tokenization schema might miss the material names. For this reason, it is necessary to extract material names using a domain-specific tokenization schema [27] to provide chemistry-aware tokenization. Because numerous material names are basically chemical compounds, we developed a chemical formula extraction tool to extract material names from text. This tool uses regular expressions and element information from the periodic table to identify chemical compound formulas.
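To make the idea concrete, the following is a minimal sketch of a regular-expression-based formula extractor checked against element symbols. It is not the authors' tool: the patterns, the abbreviated `ELEMENTS` set, and the rule of requiring at least two element units are all assumptions made for illustration.

```python
import re

# Abbreviated set of element symbols (assumption: a full tool would list
# every element of the periodic table).
ELEMENTS = {"H", "C", "N", "O", "Al", "Si", "Ti", "Mn", "Fe", "Co", "Ni",
            "Cu", "Ga", "Sb", "Ag", "Ba", "Sr"}

# A candidate is a run of "capital letter + optional lowercase letter +
# optional count" units, e.g. "NiMnSb" or "Fe2O3".
CANDIDATE = re.compile(r"\b(?:[A-Z][a-z]?\d*(?:\.\d+)?)+\b")
UNIT = re.compile(r"([A-Z][a-z]?)(\d*(?:\.\d+)?)")


def extract_formulas(text):
    """Return candidate chemical formulas whose symbols are all known elements."""
    found = []
    for match in CANDIDATE.finditer(text):
        candidate = match.group(0)
        symbols = [symbol for symbol, _count in UNIT.findall(candidate)]
        # Keep compounds (two or more units) built only from element symbols;
        # this rejects ordinary capitalized words and most acronyms.
        if len(symbols) >= 2 and all(s in ELEMENTS for s in symbols):
            found.append(candidate)
    return found


text = "NiMnSb films and Co2MnSi electrodes were grown on SrTiO3; We used XRD."
print(extract_formulas(text))
```

A sketch like this still misses formulas with parentheses or hyphenated dopant notation, and it can accept acronyms that happen to be spelled with element symbols, so a production tool would need further rules.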

Extraction of measurement related terms
Measurement of material properties is a fundamental issue in materials science and engineering. Some materials scientists even specialize in measurement science. A significant number of materials science publications contain discussions on measurement and related issues. Identifying measurement-related terms, such as methods, results, and simulation, is necessary to characterize the researcher output. As discussed in Section 2.4.2, a general English language-based tokenization schema might have a low matching ratio for domain-specific knowledge, including measurement-related terms. The domain-aware tokenization issue has been reported in other domains such as bioinformatics [28,29], and domain-aware tokenizers have also been developed [30]. A matching ratio analysis for some domain-specific terms has been conducted [27]. We used a digitized version of the Japanese dictionary of physics and chemistry [31] to extract several categories of measurement-related terms translated into English:
• Measurement method dictionary, for example, X-ray diffraction microscopy and the Maxam-Gilbert method [32].
• Simulation category, such as the Monte Carlo method [33].
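Dictionary-based extraction of multi-word terms can be sketched as a phrase lookup over the raw text, so that a term such as "Monte Carlo method" is counted as one unit rather than three tokens. The three entries below are the examples named above; the case-insensitive whole-phrase matching rule is an assumption, not a description of the authors' implementation.

```python
import re

# Example entries taken from the text; the real dictionary is far larger.
MEASUREMENT_TERMS = [
    "X-ray diffraction microscopy",
    "Maxam-Gilbert method",
    "Monte Carlo method",
]


def extract_measurement_terms(text, terms=MEASUREMENT_TERMS):
    """Return one list entry per occurrence of each dictionary term in text."""
    found = []
    for term in terms:
        # Match the whole phrase, not a fragment inside a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        found.extend([term] * len(hits))
    return found


sample = ("Monte Carlo method simulations and X-ray diffraction microscopy; "
          "the monte carlo method again.")
print(extract_measurement_terms(sample))
```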

Frequent term analysis
We computed the term frequency for the terms extracted in Section 2.3 for each section (title, keywords, abstract) separately. In addition, the term frequency was calculated for the terms extracted in Sections 2.4.2 and 2.4.3. Five sets of (term, frequency) were created and normalized based on the length of the extracted sections for each researcher as follows:
• $T_{tf}$: title terms extracted from all publications.
• $K_{tf}$: keyword terms extracted from all publications.
• $A_{tf}$: abstract terms extracted from all publications.
• $M_{tf}$: material names (chemical compounds) extracted as described in Section 2.4.2.
• $ME_{tf}$: measurement-related terms extracted as described in Section 2.4.3.
The research output for a researcher is represented as a set $R_{tf}$ resulting from the merging of the five sets. In research publications, sections in the paper structure have different levels of influence on the publication topic [34]. For example, an occurrence in the title is more representative of the topic than an occurrence in the body. Several studies have used a term weighting strategy for document processing applications [35,36]. For this reason, occurrences of the terms should be weighted differently based on their location in the publication. We use a simple strategy of assigning a heavier weight to terms in the title and in the keywords section than to those in the abstract. Because domain knowledge terms are extracted separately from other terms, the frequencies of domain knowledge terms are used in the case of double extraction to avoid double-counting. We denote $T_w$, $K_w$, $A_w$, $M_w$, and $ME_w$ as the weights for $T_{tf}$, $K_{tf}$, $A_{tf}$, $M_{tf}$, and $ME_{tf}$, respectively. Here $R_{tf}$ is given by Equation (1), where the values of the weights $T_w$, $K_w$, $A_w$, $M_w$, and $ME_w$ are determined by experimenting on a small subset of the data and using feedback from materials scientists in SAMURAI, and are 1.2, 1.5, 1.0, 1.0, and 1.0, respectively.
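One possible reading of the weighted merge is sketched below. The weights are those reported above; everything else is an assumption made for illustration: frequencies are normalized by the token count of the section, weighted frequencies are summed across sets, and a term that also appears in a domain knowledge set is counted from the domain set only. The function names are hypothetical.

```python
from collections import Counter

# Weights reported in the text for title, keywords, abstract,
# material names, and measurement terms.
WEIGHTS = {"T": 1.2, "K": 1.5, "A": 1.0, "M": 1.0, "ME": 1.0}


def normalized_tf(tokens):
    """Term frequency divided by the number of tokens in the section."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def merge_weighted(sets):
    """Merge the five (term, frequency) sets into one weighted set.

    Assumption: weighted frequencies are summed across sets, and a term that
    also appears in a domain set (M or ME) is taken from the domain set only.
    """
    domain_terms = set(sets.get("M", {})) | set(sets.get("ME", {}))
    merged = Counter()
    for name, tf in sets.items():
        for term, freq in tf.items():
            if name in ("T", "K", "A") and term in domain_terms:
                continue  # avoid double-counting a domain knowledge term
            merged[term] += WEIGHTS[name] * freq
    return dict(merged)


sets = {
    "T": {"spin": 0.5, "NiMnSb": 0.5},
    "K": {"spin": 1.0},
    "A": {"spin": 0.25},
    "M": {"NiMnSb": 1.0},
    "ME": {},
}
print(merge_weighted(sets))
```

In this example "NiMnSb" appears in both the title set and the material set, so only the material-set frequency contributes to the merged value.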

Word cloud generation
The word cloud for each researcher is generated using the R tf set of the researcher, i.e. terms and their frequencies multiplied by the designated weights. We use a word cloud generator package in Python language [37] with the following settings: the maximum number of words equals 200, and minimum number of letters for a word to be included is zero.

Topic map construction
Research collaborations are extremely beneficial to the advancement of science. Collaborators can learn from each other by exchanging ideas and sharing skills. To find potential collaborators, we want to identify researchers who share similar topics. Each researcher is represented by a vector $V$ of discrete variables $v \in \mathbb{N}$, where $v$ indicates the frequencies from the set $R_{tf}$ discussed in Section 2.5. The similarity between two given researchers is computed using the cosine similarity metric between the two vectors of these researchers, $V_1$ and $V_2$, as indicated in Equation (2):
$$\cos(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert} = \frac{\sum_k v_{1,k} \, v_{2,k}}{\sqrt{\sum_k v_{1,k}^2} \, \sqrt{\sum_k v_{2,k}^2}} \quad (2)$$
where $v_{1,k}$ and $v_{2,k}$ are the components of $V_1$ and $V_2$, respectively. When two researchers have a high similarity, the most frequent common term is extracted by intersecting the $R_{tf}$ sets of both researchers. This approach was applied to each SAMURAI researcher in relation to all others to create a relationship based on frequent common terms between researchers. We visualize these relationships as links to create a topic map for the researchers.
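A minimal sketch of this step is shown below, with each researcher's $R_{tf}$ stored as a dictionary from term to weighted frequency. The ranking rule used to pick the "most frequent common term" (the smaller of the two weights) is an assumption, since the text does not specify how frequencies from the two researchers are combined.

```python
import math


def cosine_similarity(r1, r2):
    """Cosine similarity of two sparse term-weight vectors stored as dicts."""
    dot = sum(w * r2[t] for t, w in r1.items() if t in r2)
    norm1 = math.sqrt(sum(w * w for w in r1.values()))
    norm2 = math.sqrt(sum(w * w for w in r2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def top_common_term(r1, r2):
    """Shared term with the largest weight, ranked by the smaller of the two."""
    common = set(r1) & set(r2)
    return max(common, key=lambda t: min(r1[t], r2[t])) if common else None


a = {"spin": 3, "heusler": 2, "Ag": 1}
b = {"spin": 3, "heusler": 1, "XPS": 4}
print(round(cosine_similarity(a, b), 4), top_common_term(a, b))
```

Storing the vectors as dictionaries keeps the computation sparse: only terms present in both researchers contribute to the dot product, which matters when the vocabulary is much larger than any single researcher's term set.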

Visualization of word cloud
Using the approach discussed in Section 2, we created word clouds and topic maps for all SAMURAI members. We demonstrate the results here using the three researchers below as examples. The number in parentheses is the number of the author's publications that could be retrieved from the NIMS repository. Publications coauthored by multiple SAMURAI members are counted for each author independently.
• Yuya Sakuraba [39], also affiliated with the Research Center for Magnetic and Spintronic Materials in NIMS (22).
• Hideki Yoshikawa [40], affiliated with the Research and Services Division of Materials Data and the Integrated System, and Research Center for Advanced Measurement and Characterization in NIMS (51).
All the above mentioned researchers have explicitly agreed to have their names, photos, and related data published in this manuscript. Because Kazuhiro Hono and Yuya Sakuraba are affiliated with the same research center, it is expected that they would have similar research topics that differ from those of Hideki Yoshikawa. Figure 2 shows a simplified word cloud visualization for each of these researchers' topics.
From Figure 2, it can be seen that both Kazuhiro Hono and Yuya Sakuraba have common terms with a large font size such as 'spin', 'magnetics', and 'heusler'. This is correlated with the fact that they both belong to the same research center. By contrast, Yuya Sakuraba does not share any of his top terms ('spin', 'Ag', 'magnetoresistance', 'heusler') with Hideki Yoshikawa, who belongs to a different center. This indicates that researchers with different affiliations are less correlated with each other.
However, we can also see that there are some irrelevant terms, such as 'investigated', owing to insufficient tokenization. Morphological analysis and/or part-of-speech tagging could be used to further improve tokenization efficiency. Because our visualization is based on an assumed model and limited data, it is qualitative and projective; using several models in parallel, or selecting among models, would be useful for visualizing multiple facets of the research. As pointed out above, the domain knowledge terms are very important for characterizing the research field.

Research topic map
Using the word clouds generated in Section 3.1, a vector for each researcher was constructed using the method discussed in Section 2.7. Using cosine similarity, each researcher vector was compared against those of all other SAMURAI researchers. When the similarity between two researchers was high, their word cloud sets were intersected to find the best common term, and a link was created between their profiles. Figure 3 shows the topic map for each researcher mentioned in Section 3.1. From Figure 3, we can observe the topic terms that a researcher shares with one or more other researchers. This map can provide insights on potential collaboration on a specific topic.
When publications are authored by multiple authors, the position of a researcher in the author list can reflect how close the topics are to their interests. For example, the topics from a publication where the researcher is the principal investigator (PI) are closer to their interests than those of a publication where they are not. To verify the differences in extracted topic terms when the researcher is the PI or not, we compare the word cloud visualization in both cases. We collected the publications of Yuya Sakuraba from Section 3.1 in which he is either the first author or the corresponding author, and created the word cloud. Figure 4 shows Yuya Sakuraba's topic terms for his publications as a PI. When comparing Figure 2 (b) and Figure 4, we can notice that even though the major terms are similar, the frequency (importance) of the terms has changed. For example, 'NiMnSb' has gained more importance in publications where the researcher is the first or corresponding author. This approach could be extended to add a weighting factor based on the position of the researcher in the author list.

Effect of domain knowledge resources
In this section, we study how the utilization of domain knowledge resources influences the word cloud visualization. Table 1 shows information about the extracted word cloud terms, including terms from domain knowledge resources.
From Table 1, domain knowledge terms make up approximately 5.7% of all the terms in the generated word clouds and 7.8% of the unique terms. Chemical compounds make up approximately 5.5% of all terms and 7.5% of the unique terms. This confirms the importance of the extraction of chemical compounds in representing materials science research because many material names exist in their chemical compounds form. By contrast, measurement related resources in this study did not contribute to representing the output of the researchers. Measurement terms make up less than 0.5% of the total and unique extracted terms. Table 2 shows the most frequent domain knowledge terms in the generated word cloud.
To visualize how the utilization of domain knowledge resources affects the research output representation, we compare the word cloud visualization with and without using domain knowledge resources for the same SAMURAI researcher. Figure 5 illustrates the differences. This presented researcher is currently affiliated with both the Electric and Electronic Materials Field, Nano Electronics Device Materials Group, Research Center for Functional Materials, and the Quantum Beam Field, Synchrotron X-ray Group, Research Center for Advanced Measurement and Characterization. From Figure 5, we confirmed that the use of domain knowledge resources generated terms that could not be found through a normal frequent term analysis.
Figure 5. Comparison of word cloud visualization with and without using domain knowledge resources for the same researcher.

Researcher correlation analysis
In order to validate the word cloud results, we analyzed the correlation of SAMURAI researchers who belong to different research centers within NIMS using clustering techniques. Researchers who belong to the same center are expected to have a stronger correlation with each other and a weaker correlation with researchers in different centers. We selected two centers, namely, the Research Center for Magnetic and Spintronic Materials in NIMS, and the Research Center for Advanced Measurement and Characterization. SAMURAI members with no publications available in NIMS repositories are not included.
The word clouds for these researchers were generated using the proposed approach described in Section 2. Using these word cloud terms and their relative importance values, a vector for each researcher is created. We used two methods to perform clustering on these vectors. The first is the K-means algorithm [41,42], where a multidimensional dataset is grouped into K clusters using the concept in which a cluster center is the arithmetic mean of all its data points. We set K to 2 because we have two different research centers. The second method is agglomerative clustering, which is a 'bottom-up' approach to hierarchical clustering [43], in which each data point starts as its own cluster and clusters are merged moving up the hierarchy. There are different ways to determine how clusters are merged. In our setting, we use the Ward linkage, which minimizes the variance of the clusters being merged using the Euclidean metric. The number of clusters is also set to 2. Figure 6 shows the clustering results using both methods. To produce a 2D plot, the high dimensionality of the dataset is reduced using the truncated singular value decomposition (T-SVD) method. T-SVD is a matrix decomposition method suitable for applications with a sparse dataset [44] such as our data.
We evaluate the quality of the clustering from internal and external perspectives. An internal evaluation is used to compare the performance of different algorithms on clustering the same dataset, whereas an external evaluation is used to evaluate the clustering results using known class labels for the data points. We use the following three metrics:
• The Davies-Bouldin index [45] is used for the internal evaluation and is given by Equation (3):
$$\mathrm{DB} = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \quad (3)$$
where $n$ is the number of clusters, $c_i$ is the centroid of cluster $i$, $\sigma_i$ is the average distance of all elements in cluster $i$ to the centroid, and $d(c_i, c_j)$ is the distance between the centroids of clusters $i$ and $j$. A clustering algorithm that produces clusters with low intra-cluster distances and high inter-cluster distances will have a lower Davies-Bouldin index, so a smaller Davies-Bouldin index is better.
• The purity (external metric), which measures the extent to which clusters contain a single class [46], is given by Equation (4):
$$\mathrm{Purity} = \frac{1}{N} \sum_{m \in M} \max_{d \in D} \lvert m \cap d \rvert \quad (4)$$
where $N$ is the number of data points, $M$ is the set of clusters, and $D$ is the set of known classes for the data points.
• The adjusted Rand index, ARI (external metric), measures how similar the clusters are to the known class labels [47]. It is the Rand index (RI) adjusted for chance. RI is given by Equation (5):
$$\mathrm{RI} = \frac{\text{true positive} + \text{true negative}}{\text{true positive} + \text{true negative} + \text{false positive} + \text{false negative}} \quad (5)$$
Since the true labels are known, the true positives, true negatives, false positives, and false negatives can be computed against the ground truth. ARI can then be computed from RI using Equation (6):
$$\mathrm{ARI} = \frac{\mathrm{RI} - \text{Expected RI}}{\max(\mathrm{RI}) - \text{Expected RI}} \quad (6)$$
ARI then has a value close to 0.0 for random labeling.
Table 3 contains the evaluation metric values for both clustering methods. We can confirm that agglomerative clustering performed better on our dataset from the internal evaluation point of view; however, K-means had a better performance from the external evaluation perspective. The high purity in both methods showed a high correlation between researchers affiliated with the same center (same class). The adjusted Rand index of the K-means clustering showed a 70% accuracy, eliminating the possibility that the accuracy occurred simply by chance.
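The external metrics can be computed directly from two label lists, as in the sketch below. This is an illustrative pure-Python version, not the code used for Table 3; the ARI is written in its usual contingency-table (pair-counting) form, which is one standard way of evaluating Equation (6), and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb


def purity(clusters, classes):
    """Fraction of points that belong to the majority class of their cluster."""
    total = 0
    for c in set(clusters):
        members = [cls for clu, cls in zip(clusters, classes) if clu == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(classes)


def rand_index(clusters, classes):
    """Share of point pairs on which the clustering and true labels agree."""
    pairs = list(combinations(range(len(classes)), 2))
    agree = sum((clusters[i] == clusters[j]) == (classes[i] == classes[j])
                for i, j in pairs)
    return agree / len(pairs)


def adjusted_rand_index(clusters, classes):
    """Rand index corrected for chance, from pair counts of the contingency table.

    Undefined (division by zero) in the degenerate case where the maximum
    and expected index coincide, e.g. a single cluster and a single class.
    """
    n = len(classes)
    sum_cells = sum(comb(v, 2) for v in Counter(zip(clusters, classes)).values())
    sum_rows = sum(comb(v, 2) for v in Counter(clusters).values())
    sum_cols = sum(comb(v, 2) for v in Counter(classes).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical example: six researchers, two clusters, two known centers.
clusters = [0, 0, 0, 1, 1, 1]
centers = ["A", "A", "B", "B", "B", "B"]
print(purity(clusters, centers), rand_index(clusters, centers),
      adjusted_rand_index(clusters, centers))
```

In this toy case the purity (5/6) is much higher than the ARI (about 0.32), which illustrates why a chance-corrected metric is reported alongside purity.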

Conclusion and future studies
In this study, we presented an approach to create a topic map for materials science researchers using natural language processing (NLP). Publications were collected and preprocessed to remove noisy data. Domain knowledge resources and frequent term analysis were utilized to extract the representative terms. We conducted a similarity analysis to find links between researchers who share similar research topics, leading to the construction of a topic map for each researcher. This approach was validated on SAMURAI, a researcher database at the National Institute for Materials Science (NIMS), Japan. By performing a clustering experiment, we confirmed a stronger correlation between researchers with the same affiliation and a weaker correlation between researchers from different affiliations.
In the future, we plan to conduct morphological analysis to improve tokenization efficiency. Another extension is to consider a weighting factor for each researcher based on his position in the author list (for example, corresponding author, first author). In addition, we plan to collect and analyze other types of resources such as research notes. We are also implementing an interactive evaluation and adjustment system. Researchers will be allowed to adjust the level of importance of their own research topics or add new terms.