Understanding the evolution of a de novo molecule generator via characteristic functional group monitoring

ABSTRACT Recently, artificial intelligence (AI)-enabled de novo molecular generators (DNMGs) have automated molecular design based on data-driven or simulation-based property estimates. In some domains like the game of Go where AI surpassed human intelligence, humans are trying to learn from AI about the best strategy of the game. To understand DNMG’s strategy of molecule optimization, we propose an algorithm called characteristic functional group monitoring (CFGM). Given a time series of generated molecules, CFGM monitors statistically enriched functional groups in comparison to the training data. In the task of absorption wavelength maximization of pure organic molecules (consisting of H, C, N, and O), we successfully identified a strategic change from diketone and aniline derivatives to quinone derivatives. In addition, CFGM led us to a hypothesis that 1,2-quinone is an unconventional chromophore, which was verified with chemical synthesis. This study shows the possibility that human experts can learn from DNMGs to expand their ability to discover functional molecules.


Details of chemical synthesis
All reactions involving oxygen- and moisture-sensitive compounds were carried out in a dry reaction vessel under a nitrogen atmosphere. Unless otherwise noted, chemicals obtained from commercial suppliers were used as received. Dehydrated 1,4-dioxane was degassed by nitrogen bubbling before use. 1H NMR (400 MHz) and 13C NMR (100 MHz) spectra were measured for a CDCl3 solution of a sample and are reported in ppm (δ) from internal tetramethylsilane for 1H NMR and from the solvent peak for 13C NMR. The UV/vis spectrum was measured in CH3CN at room temperature. The electrospray ionization mass (ESI-MS) spectrum was recorded on a spectrometer in the positive mode, with the sample injected as a CH3CN solution.

Dependence on the MCTS parameter
To find a suitable MCTS parameter for molecule generation by ChemTS, we used ChemTS to generate long-wavelength-absorbing molecules with MCTS parameters C = 1, 2, and 4. In the case of C=1, the growth of the averaged absorption wavelength saturates around 800 nm after generating 25,000 molecules, as shown in Fig. S5(a). In the case of C=4, on the other hand, the growth of the averaged absorption wavelength is very slow, as shown in Fig. 6(a), and the averaged absorption wavelength is still about 400 nm after generating 40,000 molecules. Hence, we concluded that C=2 was a suitable parameter for designing long-wavelength-absorbing molecules. In the case of C=1, the molecular properties (HOMO/LUMO gap, oscillator strength, molecular weight, conjugation length, and number of aromatic rings) evolve along with the absorption wavelength and show tendencies similar to those discussed in the main text. In the case of C=4, no clear large change in the molecular properties is observed. We consider that the computational time (120 h) was not enough to develop long-wavelength-absorbing molecules with C=4.

Figure caption (partially recovered): ... the distribution profiles of generated molecules for each property. The lightly shaded area represents the 5%-95% range of the total distribution, while the densely shaded area represents the 15%-75% range, at each number of generated molecules.
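The trade-off governed by C can be illustrated with a generic UCB1-style selection score. This is only a minimal sketch: the exact selection rule used inside ChemTS is not stated in this section, so the formula, the function name `ucb_score`, and the toy node statistics below are assumptions made for illustration.

```python
import math


def ucb_score(total_reward, visits, parent_visits, c):
    """Generic UCB1-style score (assumed form, not taken from the source):
    mean reward plus an exploration bonus scaled by c."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploitation = total_reward / visits
    exploration = c * math.sqrt(2.0 * math.log(parent_visits) / visits)
    return exploitation + exploration


# Hypothetical child nodes: (total_reward, visits). Values are made up.
children = {
    "well_explored": (9.0, 10),  # high mean reward, visited often
    "rarely_visited": (1.0, 5),  # low mean reward, visited less often
}
parent_visits = 15


def select(c):
    """Return the name of the child with the highest score for a given c."""
    return max(
        children,
        key=lambda name: ucb_score(*children[name], parent_visits, c),
    )


for c in (1, 2, 4):
    print(c, select(c))
```

In this toy setting a small C keeps selecting the child with the highest mean reward, while a large C shifts the choice toward the less-visited child. This is consistent with the qualitative picture above, where C=1 saturates early and C=4 improves only slowly.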

The dependence of the odds ratio on the MCTS parameter was also investigated. In the cases of C=1 and C=4, although ChemTS still favored ketone derivatives after generating 30,000 molecules, the odds ratios of 1,2-naphthoquinone derivatives grow after 20,000 generated molecules for C=1 and after 30,000 generated molecules for C=4, as shown in Figures S7 and S8. Total odds ratios are summarised in Table S1 for each parameter. Except for C=4, 1,2-naphthoquinone derivatives show a high odds ratio. In the case of C=4, we consider that ChemTS would have focused on the 1,2-naphthoquinone derivatives if more computational time had been available.

Figure caption (partially recovered): ... Table S1 as a function of the number of generated molecules. Odds ratios are computed for every 100 generated molecules.
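The odds-ratio monitoring can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the use of non-overlapping windows of 100 molecules, the 0.5 pseudocount (Haldane-Anscombe correction) that avoids division by zero, and the function names are all hypothetical. Substructure matching is not shown; each generated molecule is represented by a boolean flag that says whether it contains the functional group of interest.

```python
def odds_ratio(gen_with, gen_total, train_with, train_total, pseudo=0.5):
    """Odds of containing the group in generated molecules divided by the
    odds in the training data. `pseudo` is an assumed smoothing term."""
    a = gen_with + pseudo                    # generated, with group
    b = gen_total - gen_with + pseudo        # generated, without group
    c = train_with + pseudo                  # training, with group
    d = train_total - train_with + pseudo    # training, without group
    return (a * d) / (b * c)


def windowed_odds_ratios(flags, train_with, train_total, window=100):
    """Compute one odds ratio per non-overlapping window of generated
    molecules (window size of 100 follows the text; the non-overlapping
    scheme is an assumption). A trailing partial window is ignored."""
    ratios = []
    for start in range(0, len(flags) - window + 1, window):
        chunk = flags[start:start + window]
        ratios.append(odds_ratio(sum(chunk), window, train_with, train_total))
    return ratios


# Toy data: the group is absent in the first 100 generated molecules and
# present in half of the next 100; it occurs in 1% of a made-up training set.
flags = [False] * 100 + [True] * 50 + [False] * 50
ratios = windowed_odds_ratios(flags, train_with=100, train_total=10000)
print(ratios)
```

An odds ratio above 1 indicates that the group is enriched in the generated molecules relative to the training data, and a value below 1 indicates depletion. Plotting these values against the number of generated molecules gives a time series of the kind described above.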

Computational studies of 1-4.
Density functional theory (DFT) calculations were carried out using the Gaussian 16 program package [1]. The APFD hybrid functional [2] and the 6-31G(d) basis set [3] were employed.