Data-driven analysis of electron relaxation times in PbTe-type thermoelectric materials

Data mining from published papers can generate large experimental datasets that have been overlooked in computational materials informatics. We developed an open web system, Starrydata2, to accelerate the comprehensive digitization of materials data from as-reported plot images in published papers, without sample selection based on performance. By plotting our dataset of experimental thermoelectric properties of 434 samples of rock-salt-type (PbTe-type) thermoelectric materials, we revealed differences in the electronic structures of the parent compounds PbTe, PbSe, PbS, and SnTe from experimental data alone. We observed that the calculated Seebeck coefficients were fairly consistent with experimental data for n-type PbTe but not for p-type PbTe, indicating possible modifications in its valence-band electronic structure. We evaluated the electron relaxation time τ_el from 207 reported samples of n-type PbTe by combining calculations and experimental data. We found that τ_el is not a constant but varies by at least two orders of magnitude. Achieving long τ_el was suggested to be critical in increasing the thermoelectric figure of merit ZT. Reproduced with permission from Thermoelectrics Society of Japan.
ARTICLE HISTORY: Received 22 January 2019; Revised 8 March 2019; Accepted 1 April 2019


Introduction
Materials informatics based on large-scale first-principles calculations is rapidly developing in functional materials science [1]. However, the properties of functional materials are determined not only by the electronic structures of the parent compounds but also by various experimental factors such as impurity doping and microstructural control. To include these effects in materials informatics, we need a large amount of digitized experimental data. Thermoelectric materials are an example of such functional materials [2][3][4]. They are studied for applications in compact cooling and power-generation devices that interconvert heat and electricity. Experimental searches for efficient thermoelectric materials have been conducted for more than a century. Because there are many material families, the maximum reported value of the thermoelectric figure of merit,
ZT = S^2 σ T / κ, (1)
where T is the absolute temperature, is widely used in selecting a good material family. Thermoelectric conversion efficiency increases with increasing ZT, and ZT > 1 has been considered the criterion for applications. Equation (1) implies that the Seebeck coefficient S and the electrical conductivity σ need to be increased to increase ZT, whereas the thermal conductivity κ, comprising the electron thermal conductivity κ_el and the phonon (lattice) thermal conductivity κ_ph, needs to be decreased. However, S, σ, and κ_el depend strongly on the carrier doping density n. As a result, these parameters cannot be controlled independently, and a balance is needed to maximize ZT. Introducing impurities and defects of various scales [2][3][4][5] is effective in decreasing κ_ph. The phonon-glass-electron-crystal (PGEC) is a concept that describes the best materials for high ZT, in which phonons are scattered as in amorphous glass whereas electrons pass freely as in crystals [6].
Although first-principles calculation is a powerful tool for screening many candidate thermoelectric materials [7,8], various kinds of uncertainties arise in predicting actual thermoelectric properties [9]. Calculations usually examine ideally clean crystals, whereas experimental high-ZT samples are much dirtier. Errors in band gap values create huge errors in the calculated Seebeck coefficients of many high-ZT compounds, which are narrow-gap semiconductors. Calculations employing Boltzmann's transport equation cannot calculate σ and κ_el directly; they can only evaluate σ/τ_el and κ_el/τ_el, where τ_el is an unknown variable called the electron relaxation time. As τ_el is the time between electron scattering events, long τ_el is expected in PGEC materials. Although many studies assume a constant τ_el (often 10^−14 s), a large sample dependence of τ_el has been reported in a few studies [9,10].
Recently, materials informatics that combines first-principles calculations with experimental data has emerged. The leading example is the Citrination thermoelectrics recommendation engine [11], which uses machine learning to assist users in selecting good parent compounds for thermoelectric materials.
As calculation data, they used the TE Design Lab database [7], which contains electronic structure parameters of over 2300 parent compounds. As experimental data, they used several experimental databases including UCSB Thermoelectrics Data (MRL Datamining Chart/Energy Materials Datamining) [12], which contains experimental data of about 300 samples (over 1000 data points at 4 different T) of thermoelectric materials reported in over 100 publications.
To empower such experimental materials informatics, we attempted to collect experimental data on a greater scale. Over the history of thermoelectric materials research, thousands of experimental samples have been fabricated and their properties published in papers. Unfortunately, most of these precious data are buried in plot images and are not accessible in digital form. We attempted to recycle such experimental data by plot digitization, termed 'plot mining'. Unlike text and images, digitized scientific data extracted from plot images can be shared without violating the copyrights of publishers [13], provided that the necessary citation is indicated. Plot mining is a fast and low-cost way to obtain experimental data compared with repeating similar experiments.
In this study, we developed a web system named Starrydata2, which can efficiently collect and share digital experimental data, extracted from plot images in published papers. We developed this web system from scratch, employing a completely different architecture and user interface from our prototype web system [14]. By embedding the plot digitization and data upload interface in a typical reference manager like Mendeley [15], we attempted to speed up the human-based comprehensive data collection. This web system enables community-based collections of published data by worldwide researchers, and the collected data can be downloaded freely.
A classical approach in theoretical materials science has been to use a state-of-the-art calculation derived from theory and to support it with a few experimental data from selected samples. However, with this approach it is difficult to identify the critical factors that determine the properties of the majority of samples, which may be dirty, low in performance, or missing some information. We present an alternative approach that uses a large dataset of experimental samples to test for guiding principles that work for most of the experimental data. The name of our web system reflects our concept of treating every datum as being as important as the top data, like a starry sky that contains numerous dwarf stars.
As an example, we used our dataset of experimental samples of rock-salt-type thermoelectric materials. Rock-salt-type thermoelectric materials, which are often referred to as PbTe-type thermoelectric materials, are a family of thermoelectric materials that crystallize in the NaCl-type crystal structure. The typical parent compounds are PbTe, PbSe, PbS, and SnTe. Various other metals can substitute on the cation sites, and non-metals can substitute on the anion sites. Some samples in this family have been reported to possess high ZT values over 2 [16,17].
The phonon properties of real samples of rock-salt-type thermoelectric materials are difficult to predict from first principles. Large atomic mass and weak bonding result in low phonon velocity, and strong phonon anharmonicity related to the lone pairs of the divalent cations is related to the high phonon scattering rate in intrinsic PbTe [18,19]. The parent compounds tolerate nonstoichiometry, doping with various elements, and solid-solution formation over wide ranges of compositions. The resulting site vacancies, impurity elements, intra-grain nanoprecipitates and grain boundaries in the polycrystalline samples all scatter phonons of various wavelengths [16] to achieve low κ_ph [17,20,21].
The electronic structures of rock-salt-type thermoelectric materials are also complex, making theoretical prediction of transport properties extremely difficult. The valence bands, which are composed of the p-bands of the anions, have hole pockets along the Σ line and at the L point [22,23], requiring complex treatment of inter-band scattering for transport properties. The calculated hole Fermi surfaces of PbTe have complex shapes that can hardly be expressed in a parabolic band model [22,23]. The band degeneracy changes with T and lattice parameters [16,24]. Spin-orbit interaction splits the bands and reconstructs the band structure around the Fermi level. The width of this splitting increases in the order Sn<Pb and S<Se<Te, and further variation in band structure is expected when these elements are randomly mixed. The conduction band, which is composed of the sp-bands of the cation, is relatively simple compared with the valence band. There are electron pockets at the L point, forming a direct band gap at L. Site vacancies and impurity elements may further modify the band structure around the Fermi level. The calculation results strongly depend on the selection of exchange-correlation potentials [23,25].
In such a complex material family, it is important to reveal trends that apply to most of the experimental samples. We plotted the thermoelectric properties of 434 reported samples in single plots to reveal the common trends across the samples. As a candidate parameter to enable prediction of ZT, we evaluated τ_el under the constant relaxation time approximation for 207 samples of PbTe by combining the experimental data with a first-principles calculation.

Methods
A list of papers with the keyword 'thermoelectric' was retrieved from the Scopus web system [26]. From this list, we collated possible papers on material properties by automatically detecting characteristic words of material names in the titles. We manually downloaded the full-text PDF (Portable Document Format) files of these selected papers from the publishers' websites. We then manually checked the content of each paper and classified the papers into categories based on the parent compounds. The papers on bulk samples composed of PbTe, PbSe, PbS, and SnTe with less than 10% impurity elements were classified as rock-salt-type thermoelectric materials.
From each full-text PDF, we manually captured the plot images of the experimental T-dependence of the thermoelectric properties S, σ (or electrical resistivity ρ = 1/σ), κ = κ_el + κ_ph, power factor P = S^2 σ, and ZT. We opened these images in WebPlotDigitizer [27] and extracted the original numerical data by semi-automatic colour detection or by manual mouse-clicking. The numerical data were fitted using polynomial functions of up to 5th order to evaluate the thermoelectric properties at T = 300, 400, ..., 800 K. The missing parameters were estimated by mathematical operations between known parameters. If reported, the experimental Hall carrier density n_H,exp at room temperature was also recorded. The possible chemical composition of each sample was extracted by reading the text and identifying the starting compositions.
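The fitting and resampling step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' script: the function names `fit_on_grid` and `derive_missing` are hypothetical, the choice to return NaN outside the measured temperature range (rather than extrapolate) is an assumption, and SI units (V/K, S/m, W/mK) are assumed throughout.

```python
import numpy as np

# Common temperature grid used for all samples: 300, 400, ..., 800 K.
T_GRID = np.arange(300.0, 801.0, 100.0)


def fit_on_grid(T, y, max_order=5):
    """Fit digitized points y(T) with a polynomial and evaluate on T_GRID.

    The order is capped at 5 and lowered when there are few data points.
    Grid temperatures outside the measured range are set to NaN
    (assumption: no extrapolation).
    """
    T = np.asarray(T, dtype=float)
    y = np.asarray(y, dtype=float)
    order = min(max_order, len(T) - 1)
    poly = np.polynomial.Polynomial.fit(T, y, order)
    values = poly(T_GRID)
    outside = (T_GRID < T.min()) | (T_GRID > T.max())
    values[outside] = np.nan
    return values


def derive_missing(S, sigma, kappa):
    """Derive the power factor P = S^2 sigma and ZT = P T / kappa on T_GRID."""
    power_factor = S**2 * sigma
    zt = power_factor * T_GRID / kappa
    return power_factor, zt
```

Using `Polynomial.fit` rather than a raw power-series fit keeps the least-squares problem well conditioned, because the temperatures are rescaled internally before fitting.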
To make the above data collection process more efficient, we developed a web system named Starrydata2 on a cloud server at http://www.starrydata2.org. When a Digital Object Identifier (DOI) is supplied by a user, the web system automatically retrieves bibliographic information such as author names and journal names from CrossRef.org [28] and records it in our database. The web system automatically generates links to the publisher's website, the data-collection page, and the data-browsing page. The data-collection page contains the interface of WebPlotDigitizer [27] and a data-upload text box with an automatic unit converter. The collected datasets can be freely downloaded as text files in Comma-Separated Values (CSV) and JavaScript Object Notation (JSON) formats. This web system is accessible to the public free of charge.
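A DOI-based bibliographic lookup of the kind described above could look like the following sketch. It is not the Starrydata2 implementation: the function names are hypothetical, and it assumes the public Crossref REST API endpoint `https://api.crossref.org/works/<DOI>`, which returns a JSON document whose `message` object carries fields such as `title`, `container-title`, and `author`.

```python
import json
import urllib.parse
import urllib.request

# Public Crossref REST API endpoint for a single work (assumed here).
CROSSREF_WORKS = "https://api.crossref.org/works/"


def fetch_crossref_record(doi, timeout=10):
    """Download the Crossref metadata ('message' object) for one DOI."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)["message"]


def summarize_record(message):
    """Reduce a Crossref 'message' object to the fields a paper record needs."""
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in message.get("author", [])
    ]
    return {
        "doi": message.get("DOI"),
        "title": (message.get("title") or [None])[0],
        "journal": (message.get("container-title") or [None])[0],
        "authors": authors,
    }
```

Separating the network call from the parsing step allows the parser to be checked against a stored response without contacting the server.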
The example first-principles calculations of PbTe were performed using the Full-potential Linearized Augmented Plane Wave (FLAPW) method implemented in the WIEN2k code [29]. We employed the Generalized Gradient Approximation (GGA) exchange-correlation potential [30] with spin-orbit interaction. The core/valence cut-off energy was set at −6.0 Ry, and the calculation was performed on a 50 × 50 × 50 k-mesh. Thermoelectric properties were calculated from the Boltzmann transport equations using the BoltzTraP code [31], which was modified [9] to include second-order terms for κ_el. The chemical potential (μ) dependences of the additional charge per unit cell N, S, σ/τ_el, and κ_el/τ_el at T = 300, 400, 500, 600, 700, and 800 K were obtained from output files generated by BoltzTraP. The values of μ were converted to carrier doping densities n = −N/V_cell [cm^−3], where V_cell is the unit cell volume.
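The conversion from the additional charge per unit cell to a carrier doping density is a one-line unit conversion, sketched below. The function name is hypothetical, the cell volume is assumed to be given in cubic ångströms, and the sign convention simply follows n = −N/V_cell as stated in the text.

```python
# 1 cubic angstrom expressed in cubic centimetres.
ANGSTROM3_TO_CM3 = 1e-24


def doping_density_cm3(extra_charge_per_cell, cell_volume_A3):
    """Return n = -N / V_cell in cm^-3.

    extra_charge_per_cell : N, additional charge per unit cell (dimensionless)
    cell_volume_A3        : V_cell in cubic angstroms (assumed input unit)
    """
    return -extra_charge_per_cell / (cell_volume_A3 * ANGSTROM3_TO_CM3)
```

For example, with an illustrative cell volume of 50 Å^3, N = −0.001 per cell corresponds to n = 2 × 10^19 cm^−3.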
We evaluated τ_el at each T for each sample using the experimental Seebeck coefficient S_exp and the experimental electrical conductivity σ_exp. From the calculated S-n curve, we estimated n, the carrier doping density that corresponds to S_exp. If the S-n curve was bell-shaped due to the bipolar effect, we selected the solution with higher n, unless bipolar effects were obvious in the experimental data. From this n and the calculated n-dependence of σ/τ_el, we evaluated (σ/τ_el)_calc to estimate τ_el from
τ_el = σ_exp / (σ/τ_el)_calc. (2)
With this τ_el, we estimated κ_ph using
κ_ph = κ_exp − τ_el (κ_el/τ_el)_calc, (3)
where κ_exp is the experimental thermal conductivity and (κ_el/τ_el)_calc is the calculated value at the same n.
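The estimation procedure in this paragraph can be sketched as follows for a single sample at a single temperature. This is an illustration under assumptions rather than the authors' code: the function name is hypothetical, the calculated curves are assumed to be tabulated on an ascending grid of positive carrier densities for one carrier type, |S| is assumed to decrease monotonically on the high-n side of its peak, and interpolation is done linearly in log10(n). Only the higher-n solution is implemented; the manual exception for obviously bipolar samples is not.

```python
import numpy as np


def estimate_tau_el(S_exp, sigma_exp, kappa_exp,
                    n_calc, S_calc, sigma_over_tau, kappa_el_over_tau):
    """Estimate n, tau_el and kappa_ph from experiment plus calculated curves.

    n_calc            : ascending carrier densities of the calculated curves
    S_calc            : calculated Seebeck coefficient at each n_calc
    sigma_over_tau    : calculated sigma/tau_el at each n_calc
    kappa_el_over_tau : calculated kappa_el/tau_el at each n_calc
    """
    n_calc = np.asarray(n_calc, dtype=float)
    abs_S = np.abs(np.asarray(S_calc, dtype=float))
    sigma_over_tau = np.asarray(sigma_over_tau, dtype=float)
    kappa_el_over_tau = np.asarray(kappa_el_over_tau, dtype=float)

    # Keep only the branch at and above the |S| peak (the higher-n solution).
    peak = int(np.argmax(abs_S))
    n_branch, S_branch = n_calc[peak:], abs_S[peak:]
    target = abs(S_exp)
    if not (S_branch[-1] <= target <= S_branch[0]):
        raise ValueError("S_exp is outside the calculated high-n branch")

    # np.interp needs ascending x, so reverse the decreasing |S| branch.
    log_n = np.interp(target, S_branch[::-1], np.log10(n_branch[::-1]))

    # Look up sigma/tau and kappa_el/tau at the estimated n (log-log interpolation).
    log_n_calc = np.log10(n_calc)
    sot = 10.0 ** np.interp(log_n, log_n_calc, np.log10(sigma_over_tau))
    kot = 10.0 ** np.interp(log_n, log_n_calc, np.log10(kappa_el_over_tau))

    tau_el = sigma_exp / sot                 # relaxation time from conductivity
    kappa_ph = kappa_exp - kot * tau_el      # remainder assigned to phonons
    return 10.0 ** log_n, tau_el, kappa_ph
```

Because the carrier density is inferred from S alone, the result is only meaningful where the calculated and experimental S-n curves agree.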

Results and discussion
From Scopus [26], we retrieved a list of 47,936 papers published between 1875 and 2015 using the keyword 'thermoelectric'. Our original material-name detection script selected 18,585 papers from the list, and among them we accessed the full text of 14,835 papers to select those that contain plots of interest and to classify them into material families. A screenshot of our original web system Starrydata2 is shown in Figure 1. For each paper record, Starrydata2 stores the bibliographic information, the numerical data extracted from the plots, and the chemical compositions of the corresponding samples. The system shows only the numerical data and the replots, without storing the original full text or the plot images, which are often protected by publishers' copyright. Users can generate lists of publications of interest and browse the data collected by all users. They can download the data as a file, either in a spreadsheet-like format (CSV and JSON) or in a relational-database-like format (JSON only). Our data visualization system can display the data files in various formats, including line plots, heat maps, and multiple scatter plots.
Using Starrydata2, we considerably improved the speed of manual data collection. Unlike previous data collections, we did not select papers or samples, which increased both the number of recorded samples and the speed of data collection. Currently, we have collected the data for 11,506 samples in 9,509 figures published in 1,957 papers. About 500-1000 samples are added each month. Since the experimental data for a sample usually appear in multiple figures, we manually related the data from an identical sample by reading the paper. Using the recent version of our web system, a single data collector processed 166 papers (806 plots, 1,148 samples, 3,251 datasets and 89,210 data points) in 25 working days. On average, 1.03 papers (5.00 plots, 7.12 samples, 20.2 datasets, 553 data points) were processed per hour. This time includes the time to read the text to identify the chemical composition of each sample. With more data collectors, much more experimental data on other material families can be uploaded to our database.
Figure 2 shows a part of our experimental dataset on rock-salt-type thermoelectric materials in comparison with the UCSB Thermoelectrics Data [12], the largest literature-based experimental dataset on thermoelectric properties. Each data point, which corresponds to one experimental sample, is entered in a plot of P against κ. Our dataset contained 434 samples of PbTe, PbSe, PbS, SnTe and their solid solutions from 64 publications, whereas the UCSB Thermoelectrics Data contained 8 such samples. The large diversity of the scatter plot is a result of the non-selective character of our dataset, which accepted many samples with bad properties. In contrast, the samples from the UCSB Thermoelectrics Data were distributed in the bottom-right corner of Figure 2(c), implying that the UCSB Thermoelectrics Data selectively collected samples that possess high ZT.
Figure 1. Concept of plot mining in the Starrydata2 web system. An example paper [32] and the screenshots of Starrydata2 web system are presented. Reproduced with permission from Thermoelectrics Society of Japan.
Figure 3 shows the raw data of the temperature dependences of S, σ, and κ of the 434 samples of rock-salt-type thermoelectric materials. The colours indicate the maximum ZT of each sample. We observed that samples with wide ranges of S, σ, and κ can be fabricated in this type of thermoelectric material. Consequently, it is difficult to select a single sample to represent the overall properties of rock-salt-type thermoelectric materials. Most of the samples possessed S values between ±300 μV/K that monotonically increased with T. Two samples that underwent sign changes in S were typical low-n (bipolar) samples. Most of the samples underwent monotonic decreases in σ and κ with increasing T. The range of σ was between 10^3 and 10^6 S/m, and the high-ZT samples were distributed in the middle. The range of κ was below 10 W/mK, and the high-ZT samples were observed near the bottom of the distribution. Such a direct comparison of many experimental data helps to reveal the non-calculable characteristics of each material family despite the confusion inherent in the strong sample dependence of thermoelectric properties.
The diversity of our dataset enabled us to reveal the characteristics of the parent compounds from simple scatter plots alone. Figure 4 is a Jonker plot showing the relationship between S and log σ. Among p-type samples, the Jonker curve of p-type PbTe was higher than those of PbSe and PbS. In contrast, the Jonker curve of n-type PbTe was not distinguishable from those of PbSe and PbS. These differences are consistent with calculated electronic structures, where the valence bands are composed of p-bands of Te, Se, and S, and the conduction bands are composed of Pb/Sn sp-bands. In the plot, all reported samples of SnTe were p-type and high in σ, suggesting the presence of a strong hole-doping mechanism.
Figure 5 compares theoretical S-n curves obtained from a typical first-principles calculation with the experimental S-n data of PbTe samples. Only 38 of the 207 samples of PbTe are shown in this plot, because the values of experimental n above room temperature were not reported for the other samples. Our transport calculation for n-type PbTe was close to the experimental data at each T, especially in the high-n region over 10^19 cm^−3. In contrast, the S-n curve of our calculation was far from most of the reported p-type samples, especially at high T. This showed that the model of our transport calculation is not valid for most samples of p-type PbTe. Several other theoretical papers [38,87] have successfully reproduced the experimental S-n curve of n-type PbTe; however, they did not show the S-n curve for p-type PbTe. These results imply possible difficulty in modelling the transport properties of p-type PbTe, despite its high ZT values compared with n-type PbTe.
By combining first-principles calculations and our experimental data of S_exp, σ_exp, and κ_exp, we attempted to evaluate τ_el, an unknown parameter in first-principles calculations of σ and κ_el. The evaluation of τ_el requires the calculated value of σ/τ_el. As this value is given as a function of n, the analysis could in principle be done only for samples with known experimental n. However, for most of the reported samples of thermoelectric materials, Hall measurements to determine the experimental n have not been carried out. In this study, we therefore estimated n from the reported values of experimental S and a calculated S-n curve. Since this analysis is applicable only when the calculated and experimental S-n curves are consistent, we carried it out only for n-type samples. Figure 6(a) shows the values of n and τ_el of the 207 samples of n-type PbTe, estimated from S_exp and σ_exp. The values of τ_el were between 10^−15 s and 10^−13 s, exhibiting a variation of two orders of magnitude. These τ_el values are composites of intrinsic τ_el and extrinsic τ_el, which can be expressed by Matthiessen's rule as f_el,total = f_el,intrinsic + f_el,extrinsic, using the electron scattering rate f_el = τ_el^−1. The intrinsic f_el due to the electron-phonon interaction of PbTe has been calculated to be around 10^12-10^13 Hz by a DFPT calculation considering screening effects by electrons [87]. Our total f_el values were in the range 10^13-10^15 Hz. The samples with f_el ~ 10^13 Hz (τ_el ~ 10^−13 s) are expected to be clean samples, in which the intrinsic electron-phonon interaction is one of the dominant electron scattering mechanisms. The samples with f_el ~ 10^15 Hz (τ_el ~ 10^−15 s) are expected to be dirty samples, in which extrinsic electron scattering mechanisms are dominant.
The candidates for such extrinsic electron scattering centers in n-type PbTe include atomic vacancies, impurity atoms, intra-grain nanoprecipitates, dislocations, grain boundaries and impurity phases. The trend that short τ_el is observed more frequently in high-n samples was consistent with the expectation that the carrier-doping vacancies and impurities also scatter electrons. The trend that τ_el decreases with increasing T was observed especially in low-n samples, and this was consistent with the expectation that the phonon scattering rate increases with increasing T.
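The decomposition by Matthiessen's rule used in this discussion can be illustrated with a short sketch. The function names are hypothetical, and the intrinsic rate passed in is an input chosen by the reader (for instance a value within the calculated 10^12-10^13 Hz range), not a result of this study.

```python
def extrinsic_scattering_rate(tau_total_s, f_intrinsic_hz):
    """Return f_extrinsic = 1/tau_total - f_intrinsic (Matthiessen's rule), in Hz."""
    f_total = 1.0 / tau_total_s
    f_extrinsic = f_total - f_intrinsic_hz
    if f_extrinsic < 0:
        raise ValueError("intrinsic rate exceeds the total scattering rate")
    return f_extrinsic


def extrinsic_fraction(tau_total_s, f_intrinsic_hz):
    """Fraction of the total scattering rate attributed to extrinsic mechanisms."""
    return extrinsic_scattering_rate(tau_total_s, f_intrinsic_hz) * tau_total_s
```

With an assumed intrinsic rate of 10^13 Hz, a sample with τ_el = 10^−15 s has about 99% of its scattering rate attributed to extrinsic mechanisms, whereas a sample with τ_el = 10^−13 s has essentially none.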
Our analysis also revealed that the popular transport calculations that use a fixed value of τ_el = 10^−14 s for the constant relaxation time approximation are not valid for most of the reported samples of n-type PbTe. Figure 6(b) shows the relationship between τ_el and ZT. For each T, we observed a trend whereby an increase in τ_el results in a monotonic increase in ZT. The surprising thing here is not that ZT increased with τ_el, but that the τ_el of almost all the reported experimental samples clearly followed a common curve, which can be drawn for each T = 300, 400, ..., 800 K. This showed that the ZT values of these real samples of PbTe-type thermoelectric materials can be estimated from T and τ_el alone. In research on thermoelectric materials, it is quite rare that ZT values can be predicted from only one physical parameter other than T. This result gives researchers a strong guiding principle: an increase in τ_el directly enhances the ZT of n-type PbTe.
Figure 6(c) shows the relationship between τ_el and κ_ph. If the dominant electron scattering centers also acted as the dominant phonon scattering centers, we would have seen a correlation in this plot. However, we could not observe any correlation between τ_el and κ_ph. This indicated that in most samples of n-type PbTe, there are dominant phonon scattering mechanisms other than electrons, or there are dominant electron scattering mechanisms other than phonons. Even though electrons in ideal PbTe crystals are reported to be scattered mainly by phonons [87], this does not mean that phonons are mainly scattered by electrons. The advantage of n-type PbTe may be the availability of phonon scattering mechanisms that do not heavily scatter electrons, which realizes the concept of PGEC.
Figure 5. Calculated carrier doping level (n) dependences of the Seebeck coefficient (S) from first-principles calculation using Boltzmann transport equations, in the range 300-800 K, against the experimental Hall carrier concentration at room temperature, for 38 samples of n-type and p-type PbTe. Experimental data of S are also plotted against n.

Conclusion
We developed an original web system, Starrydata2, as an open database that lets researchers gather and share experimental data from published plot images. So far, we have collected experimental data for more than 11,500 samples of thermoelectric materials. Our web system enabled collective analysis of published samples to discover guiding principles that work for most of them.
By combining our partial dataset on rock-salt-type thermoelectric materials with first-principles calculations, we analysed the thermoelectric properties of this material family. The differences in the valence-band electronic structures of PbTe, PbSe, PbS, and SnTe were revealed in the Jonker plots.
For n-type PbTe, we found that the effective τ_el for the constant relaxation time approximation varied by at least two orders of magnitude, indicating that τ_el is not a constant but a very important determinant of ZT. This analysis is applicable to other thermoelectric materials whose S-n curves are successfully reproduced by first-principles and transport calculations. As our database grows to cover more families of thermoelectric materials, evaluation of τ_el for various experimental samples will be a strong guide for experimental researchers in designing high-ZT samples.