How can polydispersity information be integrated in the QSPR modeling of mechanical properties?

ABSTRACT Polymer informatics is an emerging discipline that has benefited from the strong development that data science has experienced over the last decade. Machine learning methods are useful to infer QSPR (Quantitative Structure-Property Relationships) models that allow predicting mechanical properties related to the industrial profile of polymeric materials based on their structural repeating units (SRUs). Nonetheless, the chemical structure of the SRU is only one of the many factors that affect the industrial usefulness of a polymer. Other equally relevant factors are polymer molecular weight, molecular weight distribution, and production method, which are related to the inherent polydispersity of this kind of material. For this reason, the computational characterization used for building QSPR models that predict mechanical properties should consider these main factors. The aim of this paper is to highlight recent advances in data science that address the inclusion of polydispersity information of polymeric materials in QSPR modeling. We present two dimensions of discussion: data representation and algorithmic issues. In the first one, we examine how different strategies can be applied to include polydispersity data in the molecular descriptors that characterize the polymers. We explain two data representation approaches designed by our group, named trivalued and multivalued molecular descriptors. In the second dimension, we discuss algorithms proposed to deal with these new molecular descriptor representations during the construction of the QSPR models. Thus, we present here a comprehensive and integrated methodology to address the challenges that polydispersity generates in the QSPR modeling of mechanical properties of polymers.


Introduction
There is a growing awareness that plastics or synthetic polymers are part of our lives at any level due to the versatility and low cost of these materials, which have increased their consumption almost 20 times over the last 30 years in multiple industrial applications. A strategy to deal with this high demand is the use of polymer informatics [1][2][3][4][5][6]. This interdisciplinary field applies the principles and practices of informatics to polymer science to improve the characterization, development, and discovery of new materials [7][8][9][10][11][12]. It is an emerging field that seeks to achieve efficient and reliable acquisition, management, analysis, and dissemination of data on diverse materials with the goal of greatly reducing the time and risk required to design, produce, and deploy a new material, which generally takes more than 20 years [13,14].
Contact: ip@cs.uns.edu.ar, Instituto de Ciencias e Ingeniería de la Computación (ICIC), Universidad Nacional del Sur (UNS) - Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Bahía Blanca, Argentina
Several authors highlight as a challenge for creating polymeric databases the fact that a synthetic polymer is rarely a single chemical entity [14][15][16]. In this sense, even the simplest polymers are generally described by distributions. However, the characterizations of the most used polymers today usually store only the data from the structural repeating units (SRUs) of the polymers, without considering the polydispersity of these materials [14,17]. Hence, it is worth asking whether data science can allow us to overcome these challenges.
Data science, especially data mining, develops methods and tools to extract meaningful information and patterns from data. The capability to predict a property value, with a certain degree of accuracy, is one of the most widely used applications of data mining in Chemistry. Emulating chemists in the well-known computer-aided drug design, materials researchers have used data mining for selecting materials with a desired property. Each material is described by features (molecular descriptors) that represent aspects of its chemical structure; and the target property is measured in a real material (after synthesis). Measuring this target requires the synthesis and characterization of materials; consequently, it is the most expensive and time-consuming step. Thus, in silico property prediction techniques have been proposed to use these predictions as a virtual screen to synthesize the minimum possible compounds (see Figure 1).
In particular, the Quantitative Structure-Property Relationship (QSPR) is a methodology that provides a mathematical relationship between the information encoded in the computational representation of the chemical structure of molecules and a given property (e.g. physicochemical, mechanical, etc.). As a result, the prediction of target properties for unknown materials is obtained before synthesis, based on the already known molecular descriptors [18][19][20][21]. Currently, QSPR is widely applied in several studies, such as the creation of multifunctional databases based on design intent [22] and the prediction of important optical properties in organic [23] and inorganic materials [24]. Specifically, in the case of polymeric materials, the most studied property, and coincidentally the one for which the most data are available, is the glass transition temperature (Tg) [25,26]; studies range from seminal works with only 22 compounds in their database [27,28] to current ones with large findable, accessible, interoperable, and reusable databases [29]. Several challenges are still present in the informatics of materials [3,9,30]; thus, it remains a high-interest research topic [31][32][33][34].
Polymeric materials cannot be treated like small molecules in conventional molecular computing, since in addition to being several orders of magnitude larger, they are usually polydisperse. This means that they are made up of families of molecules of the same structural constitution but of different sizes; to represent them, it is necessary to refer to the average molecular weights derived from the molecular weight distribution curve. Another important value derived from this curve is the polydispersity index (PDI). It is equal to 1 in the exceptional case in which all the molecules are of the same size; otherwise, it is higher than 1. Therefore, it is intuitive to think that this polydispersity affects the final properties of the polymeric materials. In fact, the mechanical properties are among the most affected by this intrinsic condition, since every fraction of molecular size contributes to the final property. This is the reason why a new approach from polymer informatics is needed to successfully address property prediction [35,36].
The aim of this paper is to present the relevance of polydispersity in the QSPR modeling of polymer mechanical properties and to highlight recent developments in data science addressing this issue. Two dimensions of analyses are presented: how to incorporate polydispersity information in the computational representation of polymers and how to address the algorithmic challenges derived from these new representations. In the first one, we explain two data representation approaches designed by our group based on the definition of trivalued and multivalued molecular descriptors. Both representations capture information from the molecular weight distribution of the materials and use an in silico polymerization strategy also developed by our group. In the second analysis, we discuss the algorithms proposed by our group to treat these new representations during the building of the QSPR models.

Polydisperse synthetic polymers and mechanical properties
As explained above, the size of the polymeric materials is represented in the molecular weight distribution curve. Two widely used average weights are derived from it: Mn (number-average molecular weight) and Mw (weight-average molecular weight), which are shown in Figure 2 and whose mathematical formulas are given in Equations (1) and (2), respectively. In addition, the PDI is obtained from the ratio of both (Mw/Mn). At this point, it is relevant to understand that the molecular weight distribution has a great influence on mechanical properties. For example, for polypropylene (PP), the higher its average molecular weights, the higher the modulus of elasticity and the tensile strength (properties derived from the tensile test) [36]. For this reason, companies produce relatively new PPs in which the fraction of remarkably high molecular weight has been eliminated without a considerable increase in the fraction of low molecular weight, thus decreasing the PDI (narrow molecular weight distribution curve). Consequently, the resulting PP has a lower level of elasticity [37].
$$M_n = \frac{\sum_i N_i M_i}{\sum_i N_i} \qquad (1)$$
$$M_w = \frac{\sum_i N_i M_i^2}{\sum_i N_i M_i} \qquad (2)$$
where $N_i$ is the number of molecules of molecular weight $M_i$.
The mechanical properties of a material are those that involve a reaction to an applied force. Although mechanical tests are standardized, the derived properties are not constant and often change as a function of temperature, loading rate, and other conditions (testing parameters). In particular, the tensile test consists of subjecting a specimen of the material to a variable tensile force at a constant crosshead speed until it fractures (Figure 3(a)). The stress-strain curve derived from the test provides information on the stiffness, strength, toughness, and ductility of the polymer, that is, how the material behaves under the testing conditions (Figure 3(b)) [38,39].
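The averages of Equations (1) and (2) and the PDI can be illustrated with a minimal Python sketch. The function name and the two-fraction sample are hypothetical and serve only to show the arithmetic; real distribution curves would first have to be discretized into (weight, count) pairs.

```python
def molecular_weight_averages(weights, counts):
    """Return (Mn, Mw, PDI) for a discrete molecular weight distribution.

    weights: molecular weights M_i in g/mol
    counts:  number of molecules N_i with weight M_i
    """
    if len(weights) != len(counts) or not weights:
        raise ValueError("weights and counts must be non-empty and of equal length")
    total_n = sum(counts)                                        # sum of N_i
    total_nm = sum(n * m for m, n in zip(weights, counts))       # sum of N_i * M_i
    total_nm2 = sum(n * m * m for m, n in zip(weights, counts))  # sum of N_i * M_i^2
    mn = total_nm / total_n        # number-average molecular weight
    mw = total_nm2 / total_nm      # weight-average molecular weight
    return mn, mw, mw / mn         # PDI = Mw / Mn


# Hypothetical sample: two chain sizes present in equal number.
mn, mw, pdi = molecular_weight_averages([10_000, 30_000], [1, 1])
```

For a monodisperse sample (a single weight), both averages coincide and the function returns a PDI of 1, in agreement with the definition given in the Introduction.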

Data science and QSPR modeling in polymer informatics
QSPR is one of the most widely used methodologies for in silico molecular modeling. Once validated, these models are applied to predict the properties of new compounds [40][41][42]. The main advantages of QSPR models are the reduction of time and cost because of the computational design prior to synthesis. In this context, data science has a key role, since it allows inferring QSPR models using machine learning techniques applied to discover patterns and relationships among data [43].
In the context of polymeric materials, QSPR modeling presents major challenges due to the uncertainty introduced by the polydispersity of these materials and the few reliable data available to build databases. In this sense, both the calculation and the selection of molecular descriptors become more complex and computationally demanding tasks because of the large size of the polymers and their characterization, which is given by the molecular weight distribution curve and not by a single value, as in the case of drug molecules. Most of the existing QSPR models for polymeric materials use a single molecule to describe each material without considering polydispersity; these approaches tend to simplify the computational representation of the material using only its SRU [14,17,44,45]. Then, the calculation of the molecular descriptors is done using the shortest chain of polymers that characterizes the material. In this context, it becomes necessary to add the information from the molecular weight distribution curve to achieve more reliable predictive models, as explained in the following sections.
Furthermore, the limited availability of data poses a problem in itself. Compared to other molecules, existing studies on the properties of polymers, particularly those related to mechanical properties, are very scarce. In particular, most curated datasets contain little data, and molecular weight distribution curves or other structural characteristics of polymers useful to predict the target property are rarely reported. Likewise, different experimental factors that determine the final value of the target property must be considered, such as those related to the processing of the material, the thermal history, and the property measurement test (e.g. the tensile test parameters), since these experimental parameters influence the value of the measured property. Consequently, there arises the need to homogenize the data and gather those obtained under the same conditions (temperature, crosshead speed, specimen shape, etc.). At the same time, homogenizing the data is difficult, since there is not a large amount of experimental data generated under similar conditions. Therefore, developing a polymer database for mechanical properties is arduous work [14,15,44].

Trivalued representation of polymers
The first proposal incorporates relevant information from the weight distribution curve through a trivalued representation of the molecular descriptors of the polymers [46]. We work with an ad hoc database of 77 linear, amorphous homopolymers, curated according to criteria for the processing of the material, the thermal history, and the standard of the property measurement test. For this trivalued representation, in addition to the SRU, it is necessary to compute, by means of in silico polymerization, the two polymeric chains whose lengths correspond to the Mn and Mw values of the material. In particular, in our database Mn ranged from 4700 to 765,000 g/mol and Mw ranged from 19,500 to 2,200,000 g/mol. Thus, each material is represented by the Simplified Molecular Input Line Entry Specification (SMILES) strings corresponding to three chemical structures relevant for its characterization. Hereinafter, we will refer to these three SMILES encodings as instances of different weight of the same material mat, generically denoting them as mat_SRU, mat_Mn, and mat_Mw. Therefore, the i-th polymer in a materials database will have the following chemical structures associated with it: $mat^i_{SRU}$, $mat^i_{Mn}$, and $mat^i_{Mw}$. This process is outlined in Figure 4.
Regarding the in silico polymerization, we developed a computational tool called PolyMaS [47], which characterizes an instance of weight in 2D using SMILES code, which allows a more compact representation than connection tables. This computational polymerization starts from the SMILES code (character string) of the SRU, in which the head and tail are indicated. Then, through operations on these character strings, a head-tail polymerization is generated, resulting in a single polymer chain of the length required for the instance to be represented. Note that the expression "computational polymerization" refers to the macromolecule generation procedure.
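The string-based head-tail polymerization described above can be sketched as follows. This is not the PolyMaS implementation: it is a simplified illustration that assumes the SRU is written so that its first atom is the head and any atom appended to the string bonds to the tail, ignores end groups, and takes the SRU molar mass as an input. Both function names are hypothetical.

```python
def polymerize_head_tail(sru_smiles, n_units):
    """Concatenate n_units copies of an SRU SMILES string head-to-tail.

    Assumes the SRU string starts at the head atom and that an atom appended
    to the string bonds to the tail atom, so plain concatenation links units.
    """
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    return sru_smiles * n_units


def units_for_weight(target_weight, sru_weight):
    """Number of repeat units whose total weight is closest to target_weight.

    End groups are ignored (an assumption of this sketch).
    """
    return max(1, round(target_weight / sru_weight))


# Polypropylene SRU written as head carbon (with methyl branch) then tail carbon.
PP_SRU = "C(C)C"
PP_SRU_WEIGHT = 42.08  # g/mol for C3H6

trimer = polymerize_head_tail(PP_SRU, 3)          # "C(C)CC(C)CC(C)C"
n_mn = units_for_weight(4700, PP_SRU_WEIGHT)      # chain length for Mn = 4700 g/mol
```

SRUs that contain ring-closure digits or charged atoms may need extra care when concatenated, which is one reason a dedicated tool is preferable to this naive sketch.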
Molecular descriptors can be calculated for each of these instances, defining three databases for the same group of materials: DB_SRU, DB_Mn, and DB_Mw. Note that each database refers to the same universe of materials and molecular descriptors; the only thing that varies is the instance of weight represented in each case. For this reason, the value $mdA^i_{Mw}$ corresponds to the calculation of the descriptor mdA made on the instance of weight Mw of the i-th material. In this way, it is possible to define the concatenation of the DB_SRU, DB_Mn, and DB_Mw databases as the trivalued representation of the molecular descriptors of a materials database. Figure 5 shows the definition of the database with trivalued representation, DB_TRI.
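The concatenation that defines the trivalued database can be sketched as a column-wise join of three descriptor tables. The dictionary layout, the column suffixes, and the function name below are assumptions made for illustration only.

```python
def build_trivalued(db_sru, db_mn, db_mw):
    """Concatenate three descriptor tables column-wise into a trivalued table.

    Each input maps a material identifier to a dict {descriptor_name: value}.
    Column names receive a suffix identifying the instance of weight.
    """
    if not (db_sru.keys() == db_mn.keys() == db_mw.keys()):
        raise ValueError("the three databases must describe the same materials")
    db_tri = {}
    for material in db_sru:
        row = {}
        for suffix, table in (("SRU", db_sru), ("Mn", db_mn), ("Mw", db_mw)):
            for name, value in table[material].items():
                row[f"{name}_{suffix}"] = value
        db_tri[material] = row
    return db_tri


# Hypothetical descriptor values for a single material.
db_tri = build_trivalued(
    {"mat1": {"mdA": 1.0}},
    {"mat1": {"mdA": 110.0}},
    {"mat1": {"mdA": 460.0}},
)
```

Keeping the instance of weight in the column name is what later allows a feature selection method to keep a descriptor for one instance and discard it for another.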

Algorithmic construction of QSPR using the trivalued representation
Once DB_TRI is obtained, the next step to build the QSPR model is to identify the descriptor values most relevant to the property to be predicted. This task can be performed through the feature selection algorithms typically used in data science. This representation allows identifying situations in which a molecular descriptor is relevant for a particular instance of weight but not for the rest, as well as detecting those that are relevant for all instances. For example, Cravero et al. [46] found that many of the selected descriptors were computed for several instances of weight, which motivated the design of a multivalued representation.
Descriptor selection can be carried out in a completely automatic way, directly taking the output of the feature selection procedure (whose search algorithm can be, for example, a neural network, random committees, or a random forest), or this output can be revised by a domain expert (expert-in-the-loop) using visual analytics methods, such as VIDEAN [48]. This second option can help to obtain models that are more interpretable in physicochemical terms, since the expert's knowledge allows prioritizing descriptors according to their semantic meanings. The last step to obtain the predictive model consists of conducting a supervised machine learning process using the information from the selected molecular descriptors and the target property.

Performance analysis of the impact of using the trivalued representation
In Cravero et al. [46], an exhaustive performance analysis of the use of trivalued representations was presented. In particular, the main research question explored in that work was: 'is it advisable to integrate, in a single database, molecular descriptors corresponding to polymeric chains of different characteristic weights related to the molecular weight distribution curves of the materials?' In that work, an in-house database developed by our research lab was used [49]. The polymers included in this database are linear, amorphous homopolymers. The 77 initial polymers were characterized by their SRU in SMILES code. To obtain the DB_SRU, DB_Mn, and DB_Mw databases used in that work, it was necessary to generate the macromolecules that reached the average weights of the polymers in the database by using the PolyMaS software tool [47]. Once all molecular descriptors for the three databases were computed, the trivalued database DB_TRI was defined by joining all the molecular descriptors (see Figure 5). Note that this integrated database was denoted as DB_Global in Cravero et al. [46], but it is conceptually the same database. Therefore, this fourth database contains all the information associated with the three different instances of molecular weight of the polymeric materials, characterizing them by capturing part of their polydispersity. Regarding the studied targets, three mechanical properties derived from the tensile test were included in the performance analysis: tensile modulus, elongation at break, and tensile strength at break [50].
Several experiments were executed in Cravero et al. [46] to establish performance comparisons among the three representations based only on single polymeric chains, DB_SRU, DB_Mn, and DB_Mw, also known as unique weight instance representations, and the trivalued representation, DB_TRI. In these experiments, different machine learning strategies were explored both for selecting the most useful molecular descriptors and for learning the most accurate QSPR models. From all these experiments, we focus here only on the comparison between the best QSPR models obtained from the unique weight instance representations and from the trivalued representation, because these results answer the aforementioned research question; the detailed results of the remaining experiments can be found in Cravero et al. [46].
The generalizability of the QSPR_TRI models can be compared with a unified quantification of the generalizability of the QSPR models generated from the unique weight instance representations (DB_SRU, DB_Mn, and DB_Mw), which were denoted as QSPR_SRU, QSPR_Mn, and QSPR_Mw models, respectively. This methodology consists of computing the $R^2$ metric of the alternative models generated from the unique weight instance representations (SRU, Mn, and Mw) in an aggregated manner, called QSPR_AWI (all weight instances; AWI). For example, the $R^2$ value of the QSPR_AWI aggregated model for the tensile modulus property is 0.9609. This value was obtained by computing the $R^2$ that corresponds to the complete set of prediction results obtained by the QSPR_SRU, QSPR_Mn, and QSPR_Mw models for the same property when each model is applied to its own external validation dataset (pooling the three testing outputs into a single set of results). Therefore, the validation outputs obtained by these three QSPR models are interpreted as the $R^2$ of a unified QSPR model. Table 1 shows the $R^2$ performances of the QSPR_AWI models reported by Cravero et al. [46] for the three target properties, in contrast with the $R^2$ metrics achieved by the QSPR_TRI models on the validation datasets of the same properties. In all cases, the trivalued representation achieves the highest $R^2$ values, which supports an affirmative response to our research question.
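The aggregated metric described above can be sketched by pooling the external validation outputs of the three single-instance models and computing one $R^2$ over the pooled set. The sketch assumes that $R^2$ is the usual coefficient of determination, 1 - SS_res/SS_tot, which the text does not state explicitly; the function names and the numbers are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def pooled_r_squared(results):
    """R^2 over the union of several validation outputs.

    results: list of (observed, predicted) pairs, one per model, for example
             the outputs of the SRU, Mn, and Mw models on their own test sets.
    """
    all_obs = [o for obs, _ in results for o in obs]
    all_pred = [p for _, pred in results for p in pred]
    return r_squared(all_obs, all_pred)


# Hypothetical validation outputs of three single-instance models.
r2_awi = pooled_r_squared([
    ([1.0, 2.0], [1.0, 2.0]),
    ([3.0], [4.0]),
    ([2.5], [2.5]),
])
```

Because the pooled value depends on the spread of all observed values together, it is not in general the average of the three individual $R^2$ values.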

Multivalued representation of polymers
In the second representation, we proposed to define the molecular descriptors of the polymers from data sampled from the weight distribution curve [51,52]. For this representation, in addition to the SRU, it is necessary to sample a set of s instances of different weight $w_j$ from the weight distribution curve of the material together with their corresponding frequencies $f_j$. Ideally, these instances should be equally distributed along the x-axis of the weight distribution curve, covering the weight range of the material as widely as possible. Once the sampling is completed, the in silico polymerization of the polymer chains corresponding to the s sampled weights must be carried out. Thus, for the same material mat, we will have s instances of weight that characterize it, which are encoded as SMILES and which we will denote consecutively as $mat_1, mat_2, \ldots, mat_s$. Thus, the i-th polymer in a materials database will have the following s associated chemical structures: $mat^i_1, mat^i_2, \ldots, mat^i_s$. Furthermore, as we also know the frequencies $f_j$ of these chains in the bulk material, we can say that the i-th polymer will be characterized by the following s pairs: $(mat^i_1, f^i_1), (mat^i_2, f^i_2), \ldots, (mat^i_s, f^i_s)$. This process is outlined in Figure 6.
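The sampling step can be illustrated with a short sketch that picks s weights equally spaced along the x-axis of a tabulated curve and normalizes the interpolated curve heights into frequencies. The tabulated input format, the linear interpolation, and the function names are assumptions of this sketch, not part of the published procedure.

```python
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Linear interpolation of a tabulated curve (xs increasing) at x."""
    pos = bisect_left(xs, x)
    if pos == 0:
        return ys[0]
    if pos >= len(xs):
        return ys[-1]
    x0, x1 = xs[pos - 1], xs[pos]
    y0, y1 = ys[pos - 1], ys[pos]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def sample_weight_instances(curve_weights, curve_heights, s):
    """Return s pairs (w_j, f_j) with weights equally spaced on the x-axis.

    curve_weights: increasing molecular weights tabulated along the curve
    curve_heights: curve height at each tabulated weight
    The frequencies f_j are the interpolated heights normalized to sum to 1.
    """
    if s < 2 or len(curve_weights) < 2:
        raise ValueError("at least two samples and two curve points are required")
    lo, hi = curve_weights[0], curve_weights[-1]
    step = (hi - lo) / (s - 1)
    targets = [lo + j * step for j in range(s)]
    heights = [interpolate(curve_weights, curve_heights, w) for w in targets]
    total = sum(heights)
    if total <= 0:
        raise ValueError("the curve must have positive height somewhere")
    return [(w, h / total) for w, h in zip(targets, heights)]


# Hypothetical triangular curve sampled at three points.
pairs = sample_weight_instances([1000, 2000, 3000], [1.0, 3.0, 1.0], 3)
```

Each sampled weight would then be turned into a polymer chain by in silico polymerization, giving the s pairs of chains and frequencies described above.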
Molecular descriptors can be calculated for each of these instances, which allows defining s databases for the same group of materials: $DB_1, DB_2, \ldots, DB_s$. Note that each database refers to the same universe of materials and molecular descriptors; the only thing that varies is the instance of weight represented in each case. For this reason, the value $mdA^i_1$ corresponds to the calculation of the descriptor mdA carried out on the instance of weight $mat^i_1$ of the i-th material. In this way, it is possible to define a multivalued representation of the molecular descriptors of a material as a discrete distribution that combines the values of the descriptors computed for the different instances of weight with their corresponding frequencies. Following our notation, the discrete distribution corresponding to the descriptor mdA for the i-th polymer, denoted as $mdA^i$, would be defined by the following sequence of pairs: $(mdA^i_1, f^i_1), (mdA^i_2, f^i_2), \ldots, (mdA^i_s, f^i_s)$. Thus, a DB_MUL database corresponding to this multivalued representation will be composed of molecular descriptors modeled using discrete distributions closely related to the weight distribution curves of the materials. The definition of the database with multivalued representation, DB_MUL, is outlined in Figure 7.
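A multivalued database of this kind can be sketched as a nested mapping in which every descriptor of every material holds a list of (value, frequency) pairs. The data layout and the function name are assumptions chosen for illustration.

```python
def build_multivalued(instance_dbs, frequencies):
    """Combine s single-instance descriptor tables into a multivalued table.

    instance_dbs: list of s tables, each {material: {descriptor: value}},
                  ordered as the sampled instances of weight
    frequencies:  {material: [f_1, ..., f_s]} for the same instances
    Returns {material: {descriptor: [(value_1, f_1), ..., (value_s, f_s)]}},
    with the frequencies of each material normalized to sum to 1.
    """
    db_mul = {}
    for material, freqs in frequencies.items():
        if len(freqs) != len(instance_dbs):
            raise ValueError("one frequency per instance of weight is required")
        total = sum(freqs)
        names = instance_dbs[0][material].keys()
        db_mul[material] = {
            name: [(db[material][name], f / total)
                   for db, f in zip(instance_dbs, freqs)]
            for name in names
        }
    return db_mul


# Hypothetical material with two sampled instances of weight.
db_mul = build_multivalued(
    [{"mat1": {"mdA": 1.0}}, {"mat1": {"mdA": 3.0}}],
    {"mat1": [1, 3]},
)
```

Note that, for a given material, all descriptors share the same frequencies, because these come from the weight distribution curve and not from the descriptor.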

Algorithmic construction of QSPR using the multivalued representation
Unlike the classic representations based on the SRU and the trivalued representation, where the molecular descriptors are characterized as real or integer numbers, the multivalued representation defines the descriptors as discrete distributions. For this reason, it is not possible to apply in the multivalued representation the traditional feature selection algorithms to identify the most relevant descriptors or the traditional supervised learning methods to learn the QSPR model. In other words, the multivalued representation requires specific algorithms to infer the predictive models.
To drive the selection of molecular descriptors, our group proposed an algorithm called FS4RV_DD, an acronym for Feature Selection for Random Variables with Discrete Distribution. This method was proposed by Cravero et al. [52], and a more in-depth evaluation of its performance, in terms of scalability and noise level of the data, was later presented in Cravero et al. [51]. Below is a self-contained overview of this method.
FS4RV_DD is organized in two sequential phases, as illustrated in Figure 8. In the first phase, a pairwise correlation analysis is performed to rank the molecular descriptors from the most to the least correlated with the target property to be predicted. In terms of our notation, we will express, for example, that the descriptor mdA is the vector $mdA = [mdA^1, mdA^2, \ldots, mdA^q]$ of discrete distributions computed for the q materials of the database, and that the target property is the vector tp of q real values. Since we cannot directly calculate a correlation between the vector that represents a molecular descriptor, such as mdA, and a vector of real numbers, such as tp, in FS4RV_DD we sample k times the discrete distributions contained in a descriptor, thus generating k vectors of real numbers. Continuing with our example, in the case of the descriptor mdA, each sample $mdA_k$ would be a vector formed by q real values, where the i-th entry of the vector $mdA_k$ would come from sampling the discrete distribution $mdA^i$. Then, it is possible to calculate the Pearson correlation between each of these k vectors sampled from mdA and the vector tp, and thus define a vector $pc_{tp,mdA}$ that contains the k correlation values computed between both variables. Finally, the point estimator ($\mu$) of $pc_{tp,mdA}$ and its corresponding variance ($\sigma^2$) are computed to rank the descriptors. The higher the value of the point estimator of $pc_{tp,mdA}$, the higher the position of mdA in the ranking. Between descriptors with the same point estimator value, the variance is used to break the tie, the winner being the one with the lowest variance.
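The first phase can be illustrated with the following sketch. It is not the published implementation: it assumes that the point estimator is the mean of the k correlations, uses the population variance, represents each discrete distribution as a list of (value, frequency) pairs, and returns a correlation of 0 when a sampled vector is constant. All function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def sample_vector(descriptor, rng):
    """Draw one real value from each of the q discrete distributions."""
    return [
        rng.choices([v for v, _ in dist], weights=[f for _, f in dist])[0]
        for dist in descriptor
    ]


def rank_descriptors(descriptors, tp, k, seed=0):
    """Rank descriptors by mean sampled correlation with tp (ties: lower variance).

    descriptors: {name: [dist_1, ..., dist_q]}, each dist a list of (value, freq)
    tp:          list of q target property values
    Returns a list of (name, mean_correlation, variance), best first.
    """
    rng = random.Random(seed)
    scores = []
    for name, descriptor in descriptors.items():
        pcs = [pearson(sample_vector(descriptor, rng), tp) for _ in range(k)]
        mean = sum(pcs) / k
        variance = sum((c - mean) ** 2 for c in pcs) / k
        scores.append((name, mean, variance))
    scores.sort(key=lambda item: (-item[1], item[2]))
    return scores
```

A call such as `rank_descriptors(db, tp, k=30)` would return the ranking that the second phase explores; whether the ranking should use the signed or the absolute correlation is a design choice that this sketch leaves as the signed value, following the description above.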
In the second phase of FS4RV_DD, the best c descriptors are selected through an iterative exploration of the ranking built in the previous phase, where the cardinality c of the selected subset is a parameter defined by the user of the algorithm. This process prioritizes the descriptors most correlated with the target variable (i.e. the first ones in the ranking), provided that they are not similar to each other. In other words, the c selected molecular descriptors are sought to contribute nonredundant information.
To measure the similarity between two descriptors $mdA = [mdA^1, mdA^2, \ldots, mdA^i, \ldots, mdA^q]$ and $mdB = [mdB^1, mdB^2, \ldots, mdB^i, \ldots, mdB^q]$, it is necessary to compare both vectors entry by entry; that is, it is necessary to compare the similarity between each pair of discrete distributions $(mdA^i, mdB^i)$, for i varying from 1 to q. This can be achieved by calculating the Bhattacharyya Distance [53], a function denoted as $B_{dist}$ that generates values between 0 and 1, where zero indicates the highest degree of similarity between the contrasted distributions. In this way, we can define a distance vector between the descriptors mdA and mdB, denoted as $dist_{mdA,mdB} = [B_{dist}(mdA^1, mdB^1), B_{dist}(mdA^2, mdB^2), \ldots, B_{dist}(mdA^i, mdB^i), \ldots, B_{dist}(mdA^q, mdB^q)]$. Finally, we will establish that mdA and mdB contain similar information if the ratio between the number of entries of $dist_{mdA,mdB}$ that have a distance value lower than $\theta_{BD}$ and the total number of entries, q, is greater than or equal to $\theta_{SR}$, where $\theta_{BD}$ and $\theta_{SR}$ are two thresholds defined by the user as parameters of the FS4RV_DD algorithm.
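The similarity test and the second phase can be sketched as follows, under several assumptions that are not taken from the published method: each discrete distribution is given as a probability vector over a support shared by the two descriptors being compared; the distance is computed as sqrt(1 - BC), where BC is the Bhattacharyya coefficient, because this form is bounded in [0, 1] (the classical form, -ln BC, is unbounded); and the ranking is explored greedily. The parameter names `theta_bd` and `theta_sr` stand for the two user-defined thresholds.

```python
import math


def b_dist(p, q):
    """Bounded distance from the Bhattacharyya coefficient: sqrt(1 - BC).

    p, q: probability vectors over the same support (assumption of this sketch).
    """
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - bc))  # clamp guards against rounding error


def are_similar(md_a, md_b, theta_bd, theta_sr):
    """True if the share of entries closer than theta_bd is at least theta_sr."""
    close = sum(1 for p, q in zip(md_a, md_b) if b_dist(p, q) < theta_bd)
    return close / len(md_a) >= theta_sr


def select_descriptors(ranking, descriptors, c, theta_bd, theta_sr):
    """Greedy walk down the ranking keeping up to c mutually dissimilar descriptors.

    ranking:     descriptor names, best first
    descriptors: {name: [prob_vector_1, ..., prob_vector_q]}
    """
    selected = []
    for name in ranking:
        if len(selected) == c:
            break
        if all(not are_similar(descriptors[name], descriptors[kept],
                               theta_bd, theta_sr) for kept in selected):
            selected.append(name)
    return selected
```

With this scheme a highly ranked descriptor is skipped only when it is redundant with one already kept, so the result may contain fewer than c descriptors if the ranking is exhausted first.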
The last step to obtain the QSPR model consists of conducting a supervised machine learning process using the information from the selected molecular descriptors and the target property. To accomplish this, it is necessary to define supervised learning methods designed to solve regression problems in which the descriptive variables correspond to discrete distributions. Figure 9 outlines a possible algorithmic methodology to carry out this task proposed by Cravero et al. [52]. The strategy is very simple and consists of generating k samples of all the discrete distributions associated to the different molecular descriptors selected to build the QSPR model. Thus, our original supervised learning problem with descriptors represented as discrete distributions is now transformed into a new problem that consists of solving k traditional supervised learning problems in which the descriptors are real numbers obtained through the sampling process. Therefore, we can generate k QSPR models, one for each transformed problem, and integrate them into a single final model based on the consensus of the k predictions. We provided here a brief and intuitive explanation of this approach, but a more detailed description can be found in Cravero et al. [52].
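The sampling-and-consensus strategy can be sketched with a deliberately simple learner. The sketch assumes a single selected descriptor, uses univariate least squares as a stand-in for whatever supervised method is actually chosen, and takes the mean of the k predictions as the consensus; none of these choices comes from the cited work, and all function names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx


def draw(dist, rng):
    """Sample one value from a discrete distribution of (value, frequency) pairs."""
    values, freqs = zip(*dist)
    return rng.choices(values, weights=freqs)[0]


def consensus_predict(train_descriptor, tp, new_dist, k, seed=0):
    """Average the predictions of k models, each fitted on one sampled dataset.

    train_descriptor: list of q discrete distributions (training materials)
    tp:               list of q target property values
    new_dist:         discrete distribution of the material to predict
    """
    rng = random.Random(seed)
    predictions = []
    for _ in range(k):
        xs = [draw(dist, rng) for dist in train_descriptor]  # one sampled dataset
        slope, intercept = fit_line(xs, tp)                   # one traditional model
        predictions.append(slope * draw(new_dist, rng) + intercept)
    return sum(predictions) / k
```

Other consensus rules, such as the median of the k predictions, would fit the same scheme and may be more robust to outlying samples.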

Performance analysis of the impact of using the multivalued representation
In Cravero et al. [51], an exhaustive performance analysis of the use of multivalued representations was presented. In particular, the main research question explored in that work was: given a probabilistic characterization of molecular descriptors by using discrete distributions in combination with the FS4RV_DD method, is it possible to achieve a more accurate identification of the most relevant molecular descriptors than the one obtained from traditional approaches that use a simplified representation of the molecular descriptors?
Developing a database with multivalued characterization of real polymers is a complex task. First, it is necessary to collect the weight distribution curves of the materials in order to sample various instances of different weights and their frequencies in the bulk material. These curves are not usually reported in publications, unlike the Mn and Mw values required for the trivalued representation. Additionally, the calculation of the molecular descriptor values for the largest instances sampled from these curves is computationally expensive because of the huge size of these macromolecules. For all these reasons, in Cravero et al. [51] and Cravero et al. [52] synthetic databases were generated to evaluate a proof of concept of the multivalued representation and to study the soundness and scalability of the FS4RV_DD algorithm, where each material, $mat^i$, is represented by a discrete distribution. This discrete distribution is then used as a template for obtaining the discrete distributions of the molecular descriptors associated with the i-th simulated material. A detailed explanation of this procedure is provided in Cravero et al. [51].
Three synthetic databases comprising 400, 800, and 1600 polymeric materials were built. Each database included 100 molecular descriptors, and diverse scenarios were defined by varying the conditions for building the synthetic target variables. Targets were created by changing the number of randomly selected molecular descriptors (5, 10, or 20), modifying the kind of correlation (linear or nonlinear), and incorporating (or not) noise into the data. In summary, the combinations of these conditions generated 36 different experimental scenarios.
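The count of 36 scenarios follows from the Cartesian product of the four experimental factors (3 database sizes, 3 numbers of relevant descriptors, 2 kinds of correlation, and 2 noise settings). A short sketch, with hypothetical labels for the factor levels, makes the enumeration explicit.

```python
from itertools import product


def experimental_scenarios():
    """Enumerate every combination of the four experimental factors."""
    database_sizes = (400, 800, 1600)          # simulated materials
    relevant_descriptors = (5, 10, 20)         # descriptors used to build the target
    correlation_kinds = ("linear", "nonlinear")
    noise_settings = (False, True)             # without / with noise
    return list(product(database_sizes, relevant_descriptors,
                        correlation_kinds, noise_settings))


scenarios = experimental_scenarios()  # 3 * 3 * 2 * 2 combinations
```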
An extra challenge was to set a fair experimental framework for performance comparisons with other approaches, because there are no similar methods that work with molecular descriptors characterized by discrete distributions. For this reason, Cravero et al. [51] decided to compare the FS4RV_DD method with state-of-the-art feature selection methods using representations based on a single value. In particular, Cravero et al. [51] decided to use the smallest value of the discrete distribution of a simulated polymer $mat^i$ as an analogue of the SRU-based representation of this material. In this way, an additional synthetic database with single-value representations of the molecular descriptors could be defined by considering only the first value of the discrete distribution as the single descriptor value. Regarding the performance metrics, the idea was to determine the capability of a feature selection method to detect the molecular descriptors that had been previously correlated with the target variables during the construction of the simulated datasets. The performance can then be assessed as in a classification problem, in which the feature selection algorithm classifies the molecular descriptors into two classes: correlated and not correlated with the simulated target variable. In this work, we focus on the Percentage of Correctly Classified (%CC) molecular descriptors, also known as accuracy in machine learning research; additional metrics and experiments were reported in detail by Cravero et al. [51].
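The %CC metric described above can be sketched as the accuracy of a two-class labeling of the descriptors (correlated or not correlated with the simulated target). The function name and the example identifiers are hypothetical.

```python
def percent_correctly_classified(all_descriptors, truly_correlated, selected):
    """Percentage of descriptors whose selected/not-selected label is correct.

    all_descriptors:  every descriptor name in the database
    truly_correlated: names used to build the simulated target
    selected:         names returned by the feature selection method
    """
    truth = set(truly_correlated)
    chosen = set(selected)
    correct = sum(1 for d in all_descriptors if (d in truth) == (d in chosen))
    return 100.0 * correct / len(all_descriptors)


# Hypothetical case: 10 descriptors, 2 truly correlated, 1 of them recovered.
names = [f"md{i}" for i in range(10)]
cc = percent_correctly_classified(names, ["md0", "md1"], ["md0", "md2"])
```

In this example one relevant descriptor is missed and one irrelevant descriptor is wrongly selected, so 8 of the 10 descriptors are classified correctly.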
For the single-value representation, the AttributesSelection method included in Weka [54] was used in that work, where the attributes correspond to the molecular descriptors of the simulated database. It provides a fair and direct comparison with the FS4RV DD method in terms of classification metrics. In particular, CorrelationAttributeEval was used as the attribute evaluator and Ranker as the search method. Table 2 shows the %CC values achieved by the multivalued and single-valued representations using FS4RV DD and AttributesSelection as feature selection algorithms, respectively, under the different experimental conditions. Only in one experiment, which corresponds to the database with 400 simulated materials, 20 selected molecular descriptors, a linear target, and no noise in the data, is the highest performance obtained by the single-valued representation. In the remaining 35 scenarios, the multivalued representation achieves the highest %CC values, which supports an affirmative response to our research question.
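As a rough illustration of what a correlation-based evaluator combined with a ranking search does on single-valued descriptors, the following Python sketch scores each descriptor by its Pearson correlation with the target and keeps the top-ranked ones. This is not Weka's implementation: the function names, the use of the absolute correlation as the score, the tie-breaking rule, and the top_k cutoff are all assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def rank_descriptors(columns, target, top_k):
    """Return the indices of the top_k descriptors by absolute correlation.

    columns: one list of single-valued descriptor values per descriptor.
    Ties are broken by descriptor index (an arbitrary, illustrative choice).
    """
    scores = [(abs(pearson(col, target)), idx) for idx, col in enumerate(columns)]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores[:top_k]]


# Hypothetical toy data: descriptors 0 and 2 are perfectly (anti)correlated
# with the target, while descriptor 1 is essentially uncorrelated.
target = [1, 2, 3, 4, 5]
columns = [[2, 4, 6, 8, 10], [1, -1, 1, -1, 1], [5, 4, 3, 2, 1]]
print(rank_descriptors(columns, target, top_k=2))
```

The selected indices can then be compared against the descriptors used to build the simulated target in order to compute %CC.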

Concluding remarks
The demand for polymer design is experiencing strong growth, and changing scenarios are shaping new and complex challenges. These favor an increase in the volume of data available for the discovery of new materials, which makes the application of data science techniques more feasible. However, an adequate use of these techniques requires rigorous and realistic computational representation models of polymers. In this regard, a shortcoming identified by many authors is the lack of characterization of polydispersity in the most widely used representations for QSPR modeling of this type of material, such as those based on the SRU. For this reason, in this paper, two alternative computational representations of polymers, previously developed by our research group, were described to show how an expert can incorporate information from the weight distribution curve of a material by using trivalued and multivalued representations.
The trivalued representation constitutes a more rigorous characterization of a polymer than the characterizations based on a single polymer chain, such as those supported exclusively by the SRU. In this sense, the experiments reported by Cravero et al. [46] provided clear evidence that combining the molecular descriptors obtained for the three instances of weight considered (SRU, Mn, and Mw) allows generating more precise QSPR models than those obtained using a single instance of weight, whichever instance is considered. Another favorable aspect of this representation is that the Mn and Mw values of polymeric materials are usually provided in publications; therefore, it is not difficult to access this information. Beyond these advantages, it is worth noting that the trivalued representation does not take advantage of the information about the frequency that each instance of weight has in the weight distribution curve of the material. These frequencies have a clear impact on the properties of a polymer and, for this reason, they were considered to define the multivalued representation.
Table 2. Performance comparisons, in terms of %CC, between single-valued (SV) and multivalued (MV) molecular representations (MR) in 36 different experimental scenarios, varying the relationship with the target in a linear (L) or nonlinear (NL) way and the presence (N) or absence (nN) of noise. The highest result obtained in each scenario is highlighted in bold.
In our opinion, the multivalued representation is conceptually the most realistic characterization for polymers in the context of QSPR modeling, because it captures the information about the molecular weights and frequencies of several polymeric chains that define the bulk material. The results reported by Cravero et al. [51,54] showed that this alternative is clearly superior to single-value representations, such as the SRU-based representation, for detecting the most relevant molecular descriptors, and that it captures more information from the weight distribution curve of the materials than the trivalued representation presented in the previous section. Nevertheless, this approach presents several important challenges. Firstly, in addition to the already well-known lack of available databases of polymeric materials for QSPR modeling, weight distribution curves are needed for each material. This limitation is so strong that, in the publications in which this representation was presented [51,52], its feasibility and performance had to be evaluated using simulated data. For this reason, it should be acknowledged that the trivalued representation is supported by data that are currently more accessible.
Secondly, the new algorithms that the multivalued representation demands pose an important challenge. In the present work, we describe the FS4RV DD feature selection algorithm and a consensus QSPR modeling method, both developed by our group, which provide a first alternative solution. However, there is certainly much work to be done in this regard. An alternative we are currently evaluating is to define a new multivalued representation based on molecular descriptors represented as fuzzy numbers. In this way, the information from the weight distribution curve would continue to be sampled to generate the molecular descriptors, but it would also take advantage of all the methodologies that fuzzy logic provides.
Finally, beyond the challenges mentioned above, we consider that multivalued representations will be an affordable alternative in the medium term, thanks to the strong advances that data science is making in the field of Materials Science. This should lead to an ever-increasing availability of the type of databases required to support these rigorous approaches.