Special Report

Benford’s Law and distributions for better drug design

Pages 131-137 | Received 11 Jun 2023, Accepted 26 Oct 2023, Published online: 03 Nov 2023



Modern drug discovery incorporates various tools and data, heralding the beginning of the data-driven drug design (DD) era. Understanding and effectively using the distributions of the chemical and physical data that feed Artificial Intelligence (AI)/Machine Learning (ML) and drive DD has thus become highly important.

Areas covered

The authors perform a comprehensive exploration of the statistical distributions driving the data-intensive era of drug discovery, including Benford’s Law in AI/ML-based DD.

Expert opinion

As the relevance of data-driven discovery escalates, we anticipate meticulous scrutiny of datasets utilizing principles like Benford’s Law to enhance data integrity and guide efficient resource allocation and experimental planning. In this data-driven era of the pharmaceutical and medical industries, addressing critical aspects such as bias mitigation, algorithm effectiveness, data stewardship, effects, and fraud prevention is essential. Harnessing Benford’s Law and other distributions and statistical tests in DD provides a potent strategy to detect data anomalies, fill data gaps, and enhance dataset quality. Benford’s Law is a fast method for checking the integrity and quality of datasets, the backbone of AI/ML and other modeling approaches, and can prove very useful in the design process.

1. Introduction

As we mark the beginning of the data-driven drug design era, understanding and effectively using the statistical distributions that form the core of AI/ML methodologies and drug discovery tools has gained high importance. This work aims to elucidate the impact of these distributions, first with a section on methods and then with a section on distributions, with a keen focus on the application of Benford’s Law in DD. This approach provides a comprehensive discussion of statistical distributions used in the past and recently. It sheds light on how Benford’s Law can make a significant difference in AI/ML-based DD models, an area that has yet to be extensively explored in the literature.

1.1. Data-driven discovery

Drug discovery is slow and expensive: developing a small-molecule drug can take approximately 15 years and $2 billion. The process may be sped up by advances in structural biology (cryo-EM, prediction), the development of vast virtual libraries of drug-like small molecules, abundant computing resources, physics-based methods, artificial intelligence/machine learning (AI/ML), and the screening of gigascale chemical spaces. This computer-driven drug discovery [Citation1,Citation2] may provide many initial hits in libraries of 10^10 structures [Citation3]. However, an essential factor for distribution-based optimization is generating and accessing distributions that include negative or inactive data. Structure-Inactivity Relationships (SIRs) in drug discovery [Citation4]** may help significantly here, and they emphasize that machine and deep learning techniques can benefit from negative results. Currently, a gap in the literature regarding inactivity data limits the use of in silico methods, as authors are more inclined to publish novel datasets than exhaustive ones. Negative data may be published as preprints as a first step [Citation4]** or in bespoke databases.

In some cases, positives can be labeled as ‘spies’, i.e. negatives that help identify a decision boundary, such as peptides in a neural network (NN) classifier giving similar results to real negatives [Citation5]. Experimentally confirmed hit rates for DD are still around 10–30%, and free energy predictions still carry errors of circa 1 kcal/mol [Citation3]. In addition, de-risking drug development necessitates properly dealing with therapeutically relevant data from translatable animal or computational models, including accelerating the toxicological characterization of compounds.

Distributions for chemical bioavailability, lead-likeness, and fragment-likeness are well known [Citation6]. Lately, AI models have gained importance, making use of the relative strengths of machine vision in AI/ML-enabled medical devices [Citation7], and are being considered for replacing experiments in drug approvals by regulatory bodies [Citation8]. New data are becoming available from automated labs, organs-on-a-chip or functional organoids, multiparameter optimization, and the consideration of polypharmacology. New tools for data science and AI/ML make finding patterns among multivariate and synthetic data faster and provide relevant information.

A vital consideration in this drug design transformation is the use of different statistical distributions that play a pivotal role in optimizing the process and outcomes. This work discusses ideas and perspectives for these distributions, describing data science and drug design techniques where these distributions are used, some new advances, and how they are employed. These include machine learning, Bayesian methods, quantum computing, multi-objective optimization, personalized medicine, and a section on statistical distributions used in cutting-edge methodologies in DD.

2. Machine learning and AI in drug design

The advent of Machine Learning and AI in drug design is a game-changer, with statistical distributions playing a pivotal role in optimizing outcomes. This section delves deeper into the specifics of how ML and AI utilize distributions such as Normal, Uniform, and Xavier/Glorot in drug design.

The initialization of weights in deep learning and NNs often uses the Normal or Uniform distribution. The Xavier/Glorot scheme, which draws weights from a Normal or Uniform distribution whose variance is scaled by the layer dimensions, is widely used to keep the variance of activations and back-propagated gradients stable across layers [Citation9]. More recently, the AutoInit algorithm has advanced the initialization of weights in NNs by adapting to different network architectures and analytically tracking the mean and variance of signals, thus improving performance across various network settings [Citation10].
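As an illustration of variance-scaled initialization, the following minimal sketch samples weights with the commonly quoted Glorot/Xavier scaling, where the target variance is 2/(fan_in + fan_out). The function names and the pure-Python list-of-lists representation are assumptions made for this example only; deep learning frameworks provide their own initializers.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng=random):
    """Sample a fan_in x fan_out weight matrix from U(-limit, limit).

    limit = sqrt(6 / (fan_in + fan_out)) gives a variance of
    2 / (fan_in + fan_out), the usual Glorot/Xavier target.
    """
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def glorot_normal(fan_in, fan_out, rng=random):
    """Sample weights from N(0, std^2) with std = sqrt(2 / (fan_in + fan_out))."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]
```

Both variants share the same variance; they differ only in the shape of the distribution from which the weights are drawn.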

With the rise of deep learning, deep generative models like Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) have found use in de novo drug design. These models learn latent representations of molecular structures and use these to generate novel drug candidates [Citation11].** The DeepTarget model has been proposed as an end-to-end deep learning model for generating novel drug candidates based solely on the amino acid sequence of the target protein, reducing the heavy reliance on prior knowledge [Citation12]. In a low-shot data setting, a single potent Nurr1 agonist was used as the template for fragment augmentation to fine-tune a chemical language model (CLM) on SMILES. This yielded novel Nurr1 agonists, with sampling frequency used for design prioritization, demonstrating the usefulness of these methods in hit and lead generation [Citation13].

In drug design, reinforcement learning (RL) can optimize molecular properties and target binding through iterative optimization. The probability distributions in RL algorithms, such as policy gradients or Q-learning, help guide the search for promising drug candidates. The Softmax distribution is often used to convert the output of a NN into a probability distribution over actions. These probability distributions are advantageous in multi-armed bandit problems, where the agent must balance exploration and exploitation [Citation14]. The integration of RL with drug-target interaction for drug design has led to the development of a model that uses a recurrent NN for molecular modeling and drug-target affinity as the reward function for optimal molecular generation, thus improving the efficiency of drug design [Citation15].
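The conversion of NN outputs into a probability distribution over actions can be sketched as follows. This is a generic illustration, not the implementation of any cited model; the `temperature` parameter and the function names are assumptions introduced here to show how exploration and exploitation can be balanced.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Convert raw scores into probabilities that sum to 1.

    Subtracting the maximum score avoids overflow in exp().
    A higher temperature flattens the distribution (more exploration).
    """
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(scores, temperature=1.0, rng=random):
    """Draw one action index according to the softmax probabilities."""
    probs = softmax(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]
```

With a low temperature the agent almost always picks the highest-scoring action, whereas a high temperature makes the choice closer to uniform.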

Sudden jumps in properties such as bioactivities of closely related chemical compounds, called activity cliffs, are common in drug design and represent hard-to-model distributions since the changes in structure can be subtle, such as a change in one atom. Indeed, ML using molecular descriptors performed better than deep learning on activity cliffs in datasets [Citation16].

Bayesian approaches have gained popularity in drug design due to their capability to incorporate prior knowledge and uncertainty into the model. These methods leverage a variety of distributions to represent prior beliefs about the parameters being estimated. Here, the use of these distributions by Bayesian methods in the drug design process is explored.

Bayesian models often use various probability distributions, such as Normal, Gamma, or Beta distributions, to represent prior beliefs about the parameters being estimated [Citation17]. In clinical trials, Bayesian statistical methods incorporate prior data into trial design, analysis, and decision-making. They are increasingly recognized for their potential to reduce the time and cost of bringing innovative medicines to patients [Citation18].

Gaussian processes are also used as priors in Bayesian optimization. A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution, which makes it possible to model uncertainty about the function being optimized [Citation19]. The choice of distribution in a Bayesian network depends on the nature of the variables. For binary variables, a Bernoulli distribution might be used. For non-negative integer counts, a Poisson distribution might be used instead. For continuous variables, a Normal distribution can be used [Citation20].

Quantum computing has emerged as a promising tool for describing chemical processes in drug design. Quantum states are described using complex probability amplitudes, a generalization of classical probabilities. The Born rule converts these amplitudes into probabilities, creating a unique distribution for each quantum state [Citation21]. Quantum computing has been proposed for molecular simulation algorithms [Citation22] and understanding reaction mechanisms [Citation23].

Drug design often involves balancing multiple objectives, that is, optimizing multiple and often opposing physicochemical and pharmacological properties. Pareto analysis is a vital tool for modeling the trade-off between different objectives in multi-objective optimization, and this section discusses its significance for this task. The Pareto front represents the set of non-dominated solutions, where a solution ‘dominates’ another if it is better in at least one objective and no worse in the others. In molecular (inverse) design, graph-based, non-dominated sorting genetic algorithms (NSGA-II, NSGA-III) have been proposed for molecular multi-objective optimization [Citation24].
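The dominance definition above can be made concrete with a short sketch that filters a set of candidates down to its non-dominated members. It assumes that every objective is to be maximized and uses a simple quadratic-time filter, not NSGA-II or NSGA-III; the example values are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one.

    All objectives are assumed to be maximized.
    """
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that no other point dominates (O(n^2) filter)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]


# Hypothetical candidates scored as (potency, solubility), both maximized.
candidates = [(7.0, -3.0), (8.0, -5.0), (6.5, -3.5), (8.0, -4.0)]
front = pareto_front(candidates)  # [(7.0, -3.0), (8.0, -4.0)]
```

An objective that should be minimized, such as predicted toxicity, can be handled by negating it before the comparison.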

Also crucial in drug design is the response to medicaments. Personalized medicine tailors medical treatment to the individual characteristics of each patient and aims to improve drug design for populations and individuals. Statistical distributions play a crucial role in modeling individual differences in response to treatment. This part presents the importance of these distributions in personalized medicine. The random effects in a mixed effects model are often assumed to follow a Normal distribution. This distribution allows us to model individual differences in response to treatment [Citation25]. The Cox proportional hazards model is commonly used in survival analysis. This model assumes that the hazard function, which describes the risk of the event as a function of time, is a product of a baseline hazard function and an exponential function of the covariates. The baseline hazard function can take any form and is not associated with a specific distribution.

3. Distribution-centric design

In addition to the previously mentioned distributions, several others, such as Gaussian, Boltzmann, Poisson, Log-normal, Weibull, Gamma, Beta, and Gumbel distributions, have also found application in drug design. Each of these distributions has strengths and limitations and a unique role in different aspects of drug design, which are discussed in this section. Some of these distributions have an established role, for example in following patient data, and others have received renewed interest. The choice of distribution depends on the specific context, the problem, and the available data.

The Gaussian distribution (Equation 1) has a long history in modeling. It is heavily used in quantitative structure-activity relationship (QSAR) modeling of (normally distributed) descriptors to understand the relationship between molecular properties and biological activity, which aids in drug design [Citation3].

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)} \tag{1}$$

The Gaussian distribution is bell-shaped and symmetric about the mean μ. It is fully characterized by its mean and standard deviation σ.

The generalized Boltzmann distribution (Equation 2) has been applied in statistical thermodynamics and molecular dynamics simulations for the past few decades to sample various molecular conformations, helping researchers understand the thermodynamics and kinetics of drug-target interactions [Citation6].

$$P = e^{(F-E)/kT} \tag{2}$$

The Boltzmann distribution is a decreasing exponential function of the energy, where E is the energy of a microstate, T is the temperature, k is the Boltzmann constant, and F is the free energy (Helmholtz free energy).
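Because the Helmholtz free energy satisfies F = −kT ln Z, where Z is the sum of e^(−E/kT) over all states, Equation 2 is equivalent to normalizing the factors e^(−E/kT). The following minimal sketch uses this to estimate the relative populations of a few conformers. The energies in kcal/mol, the default temperature of 298.15 K, and the function name are assumptions made for illustration.

```python
import math

K_B = 0.0019872041  # Boltzmann (gas) constant in kcal/(mol*K)


def boltzmann_populations(energies, temperature=298.15):
    """Return the Boltzmann probability of each state.

    energies: state energies in kcal/mol (assumed units).
    Energies are shifted by their minimum for numerical stability;
    the shift cancels in the normalization.
    """
    kt = K_B * temperature
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / kt) for e in energies]
    z = sum(weights)  # partition function of the shifted energies
    return [w / z for w in weights]
```

At room temperature kT is roughly 0.59 kcal/mol, so a conformer lying 1 kcal/mol above the minimum is populated at less than one fifth of the population of the lowest-energy conformer.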

The Poisson distribution (Equation 3) is used in cheminformatics to model rare events, such as the occurrence of specific chemical patterns or motifs within a large chemical library [Citation26].

$$P(x; \lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!} \tag{3}$$

The Poisson distribution is discrete and represents the probability of a given number of events occurring in a fixed interval of time or space. Its shape depends on the parameter λ (the average rate or expected value of events) and the number of occurrences, x.
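As a worked illustration of Equation 3, the sketch below evaluates the Poisson probability mass function and the chance of observing a rare motif at least once. The function names and the example rate are hypothetical.

```python
import math


def poisson_pmf(x, lam):
    """Probability of exactly x events when the expected number is lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


def prob_at_least_one(lam):
    """Probability of observing one or more events: 1 - P(0; lam)."""
    return 1.0 - poisson_pmf(0, lam)


# Hypothetical example: a motif expected twice in a screened library.
p_none = poisson_pmf(0, 2.0)      # equals e^-2
p_seen = prob_at_least_one(2.0)   # equals 1 - e^-2
```

The same two functions can be reused for any fixed interval of time or space, as long as λ is expressed for that interval.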

The Log-normal distribution (Equation 4) is used in pharmacokinetics and pharmacodynamics to model the distribution of various drug-related parameters, such as drug clearance, half-life, and volume of distribution [Citation27].

$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}}\, e^{-(\ln x-\mu)^2/(2\sigma^2)} \tag{4}$$

The log-normal distribution arises from the multiplicative product of many independent positive random variables. It is positively skewed and is characterized by its parameters μ and σ, the mean and standard deviation of ln x. The variable x is always positive.

The Weibull distribution (Equation 5) is frequently utilized in survival analysis, which plays a pivotal role in clinical trials. An instance of this would be the time until a patient experiences a specific side effect, which may conform to a Weibull distribution. These times to side effects can aid in comprehending the safety profile of a drug [Citation28].

$$f(x; \lambda, k) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^{k}} \tag{5}$$

The Weibull distribution can take on a variety of shapes depending on its shape parameter k and scale parameter λ, both positive numbers.
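The effect of the shape parameter is easiest to see through the survival and hazard functions that follow from Equation 5. The sketch below is a minimal illustration for x > 0; the function names are assumptions made for this example.

```python
import math


def weibull_pdf(x, lam, k):
    """Weibull density for x > 0 with scale lam and shape k."""
    return (k / lam) * (x / lam) ** (k - 1) * math.exp(-(x / lam) ** k)


def weibull_survival(x, lam, k):
    """Probability that the event has not yet occurred at time x."""
    return math.exp(-(x / lam) ** k)


def weibull_hazard(x, lam, k):
    """Instantaneous event rate, pdf / survival = (k/lam) * (x/lam)^(k-1)."""
    return (k / lam) * (x / lam) ** (k - 1)
```

For k < 1 the hazard decreases with time (early-onset events), for k = 1 it is constant and the distribution reduces to the exponential, and for k > 1 it increases with time (events that become more likely with prolonged exposure).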

The Gamma distribution (Equation 6) is employed in modeling waiting times between events. In drug discovery, it may be used to model the time until a given reaction transpires in a biochemical process, which can be critical in comprehending the mechanism of action of a drug [Citation29].

$$f(x; \alpha, \beta) = \frac{x^{\alpha-1} e^{-x/\beta}}{\Gamma(\alpha)\,\beta^{\alpha}} \tag{6}$$

The shape of the Gamma distribution depends on the shape parameter α and, in the form given in Equation 6, the scale parameter β (the inverse of the rate).

The Beta distribution (Equation 7) is utilized in Bayesian statistics, which is gaining popularity in drug discovery. It can model the prior knowledge concerning the efficacy of a drug, which is subsequently updated with fresh data to furnish a posterior distribution for the drug’s efficacy [Citation30].

$$f(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)} \tag{7}$$

The Beta distribution is defined on the interval [0, 1] and can take a variety of shapes depending on its shape parameters α and β.
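The updating of a Beta prior with fresh data can be sketched under one common assumption, namely that the new observations are independent binary outcomes (a binomial likelihood). In that case the posterior is again a Beta distribution. The function names and the example numbers are hypothetical.

```python
def beta_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binary outcomes.

    Assumes a binomial likelihood, for which the Beta prior is conjugate.
    """
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical example: weak prior Beta(2, 2), then 7 responders and 3 non-responders.
post_alpha, post_beta = beta_update(2, 2, successes=7, failures=3)  # (9, 5)
post_mean = beta_mean(post_alpha, post_beta)  # 9 / 14, about 0.643
```

The prior mean of 0.5 moves towards the observed response rate of 0.7, and the size of the shift depends on how much weight the prior carries relative to the new data.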

The Gumbel distribution (Equation 8) is usually used in extreme value theory. However, in drug discovery, the Gumbel distribution could be utilized to model the drug-drug effects (efficacy or toxicity) that drugs can induce, which is significant in understanding the drug’s therapeutic [Citation35].

$$f(x; \mu, \beta) = \frac{1}{\beta}\, e^{-(z + e^{-z})} \tag{8}$$

where z = (x − μ)/β.

The Gumbel distribution is asymmetric and skewed to the right. It can model the distribution of the maximum (or the minimum) of several samples of various distributions.

A particular distribution that can be further explored in dataset design is Benford’s Law. Also called the First-Digit Law, it provides a pattern in the leading digits of some natural datasets. It has potential for application in drug design, particularly in identifying anomalies or inconsistencies in large chemical databases. In this section, we focus on the potential of Benford’s Law in drug design, and how it can enhance the quality of datasets, which is crucial for AI/ML and modeling.

Benford’s Law is a statistical principle that posits that the primary digit (initial non-zero digit) is often small in numerous (not all) naturally occurring datasets. The likelihood P of observing a leading digit d (d ∈ {1, 2, …, 9}) is precisely described by Equation 9:

$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right) \tag{9}$$
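The expected first-digit frequencies under Benford’s Law can be tabulated directly from the formula P(d) = log10(1 + 1/d). The names used in this short sketch are illustrative.

```python
import math


def benford_probability(d):
    """Expected frequency of leading digit d (1..9) under Benford's Law."""
    return math.log10(1 + 1 / d)


# Expected frequencies for all nine leading digits.
BENFORD = {d: benford_probability(d) for d in range(1, 10)}
```

The digit 1 is expected in about 30.1% of cases and the digit 9 in about 4.6%, and the nine probabilities sum to one.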

Despite being extensively utilized in various fields, including accounting, finance, and data analysis for anomaly and fraud detection, Benford’s Law has not yet been widely applied in drug design. Nevertheless, there is potential for applying Benford’s Law in drug design, particularly in identifying anomalies and inconsistencies in large chemical databases or datasets. For instance, scientists could employ Benford’s Law to recognize potential data entry errors, inconsistencies in experimental measurements, or biases in the reported data. Identifying and correcting such issues could enhance the quality of datasets utilized in various drug design tasks such as QSAR modeling, molecular docking, or virtual screening.

The chi-squared (χ²) statistic is then helpful for checking the statistical significance of deviations from Benford’s distribution. The null hypothesis is that the observed distribution fits the distribution of first digits expected under Benford’s Law. If the p-value is smaller than a defined significance level (say 5%), the null hypothesis can be rejected, indicating that the observed distribution deviates significantly from what is expected under Benford’s Law.
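A minimal sketch of such a check is given below, using only the standard library. Instead of computing a p-value, it compares the χ² statistic with the tabulated 5% critical value for 8 degrees of freedom (nine digit classes minus one), which is 15.507. The function names are assumptions made for this example, and the approach is a generic goodness-of-fit test rather than the exact procedure of the cited studies.

```python
import math
from collections import Counter

CHI2_CRIT_DF8_5PCT = 15.507  # tabulated critical value, 8 degrees of freedom, alpha = 0.05


def first_digit(value):
    """First significant (non-zero) digit of a finite, non-zero number; the sign is ignored."""
    mantissa = f"{abs(value):.15e}"  # e.g. '3.140000000000000e-02'
    return int(mantissa[0])


def benford_chi2(values):
    """Chi-squared statistic of the observed first digits against Benford's Law."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat


def deviates_from_benford(values):
    """True if the null hypothesis of Benford conformity is rejected at the 5% level."""
    return benford_chi2(values) > CHI2_CRIT_DF8_5PCT
```

As a usage example, `deviates_from_benford([2.0 ** n for n in range(1000)])` returns `False`, since powers of two are a classic Benford-conforming sequence, whereas a set in which all nine digits lead equally often is flagged. The χ² approximation becomes unreliable when the expected counts are small, so very small datasets should be interpreted with caution.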

Benford’s Law has been proposed to check the manipulation of data for QSAR/QSPR models since they are usually subject to high selection to attain high correlation values [Citation31]*. Also, it can be used for the distribution of mRNA transcription data from a large number of organisms, solubility and activity data, with several data sets available from ChEMBL [Citation32] and NCBI [Citation33] following Benford’s Law distribution [Citation31]*.

Burgeoning data-centric technologies are pivotal in enhancing pattern detection while accentuating the influence of data quality and potential bias on processes like ML and modeling. One approach uses the distribution of the initial significant digits of critical parameters in medicinal chemistry (specifically logP, logS, and pKa, both predicted and observed) to evaluate their compliance with Benford’s Law, an underlying pattern discernible in numerous natural phenomena [Citation34]*. Data quality is heavily contingent upon the dimensions, diversity, and scale of datasets. The logarithm of the octanol/water partition coefficient (logP) estimates the ability of a chemical to traverse a biological membrane. The solubility of a compound (measured as solubility in mg/L or as its logarithm, logS) has a crucial impact on the ability to deliver a chemical to its site of action in a sufficient dose and for a sufficiently long time. The negative logarithm of the acid dissociation constant, pKa, indicates the possible ionization states of compounds at physiological conditions, which also affect the compound’s solubility, bioavailability, and permeability. Distributions of experimentally determined values of these parameters for drug compounds follow Benford’s Law with statistical significance, but not as well as larger datasets of experimental or computationally obtained values (Figure 1), as also seen in their p-values.

Figure 1. Distributions of the pharmaceutically-relevant properties of logP, pKa, and solubility, logS, compared to Benford’s Law, shown as frequencies of first significant digits in percentages. Data from [Citation34]*.


LogPexp, pKa_exp, and Solubility_exp distributions show (non-statistically significant) deviations from Benford’s Law, especially for the first few digits. These deviations could indicate that these datasets may have inherent biases, errors, or anomalies. LogP_ALOGPS, LogP_JCHEM, pKa_JCHEM, and LogS_ALOGPS show distributions with a closer fit to Benford’s Law for the first few digits, suggesting that the data may be more uniformly distributed across different orders of magnitude. However, deviations for later digits indicate that there may still be some anomalies or biases in the data. The distribution of LogS_exp shows deviations from Benford’s Law, particularly for the first few digits. However, it aligns more closely with the later digits. This alignment suggests that specific ranges of values are overrepresented in the dataset. For Solubility_ALOGPS and LogPexp_NCI, there are deviations from Benford’s Law across all digits, suggesting that the data may be skewed or biased in some way. The distributions that deviate the most often have smaller dataset sizes. This could be related to the ‘Law of Large Numbers’, which states that as a sample size grows, the sample average gets closer to the average for the whole population. In other words, smaller datasets have a higher chance of showing deviations from expected patterns, such as Benford’s Law.

Drug-centric profiling may be overly restrictive or undersized since there is a relatively small number of approved drug compounds (a few thousand); hence, deploying more extensive collections of predicted or experimentally verified values can reinstate the distribution typically observable in other natural occurrences. This approach may be instrumental in refining, profiling, ML, comprehensive dataset analysis, and other data-driven methodologies, thereby improving automatic data generation and compound design processes.

Another application is the detection of data manipulation and fraud in chemical processes, reporting, and regulatory filing, as checking compliance with Benford’s Law can quickly indicate whether the underlying numerical data are likely to have been sampled from a non-manipulated distribution.

ML and similar technologies are profoundly reliant on data quality. Benford’s Law is a rapid statistical tool to determine if some specific types of non-bounded data that traverse multiple orders of magnitude are likely to align with natural phenomena. This methodology is particularly effective for large datasets, and the first significant digit distributions of experimental and predicted values of logP, pKa, and solubility can significantly impact drug design campaigns and processes.

These methods are just a few examples of the many probability distributions and approaches used in drug design. As the field continues to evolve, newer methods and algorithms may be developed and implemented to improve the drug discovery process.

Data integrity is key to ML monitoring, though fundamental procedures such as addressing missing values, range violation, feature analysis and engineering, and type mismatch must be performed prior to using data for training a model. In addition, subject domain knowledge is irreplaceable for finding and interpreting any anomaly or manipulation in the data presented. ML-based models are generally driven by a pipeline with complex features and automated workflows that can cause multiple transformations of the data for the model to train. Studying the distributions in the data can give at least preliminary observations of the underlying data and other integrity tests.
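The fundamental checks listed above (missing values, type mismatch, and range violation) can be sketched for a single numeric field as follows. The record layout (a list of dictionaries), the field name, and the bounds are hypothetical and would need to be set using domain knowledge.

```python
import math


def integrity_report(records, field, low, high):
    """Count missing values, type mismatches and range violations for one numeric field."""
    report = {"missing": 0, "type_mismatch": 0, "out_of_range": 0, "ok": 0}
    for record in records:
        value = record.get(field)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            report["missing"] += 1          # absent key, None, or NaN
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            report["type_mismatch"] += 1    # e.g. a number stored as text
        elif not low <= value <= high:
            report["out_of_range"] += 1     # outside the plausible interval
        else:
            report["ok"] += 1
    return report


# Hypothetical records with a 'logP' field and illustrative bounds of -10 to 15.
records = [{"logP": 2.1}, {"logP": None}, {"logP": "3.4"},
           {"logP": 42.0}, {}, {"logP": float("nan")}]
report = integrity_report(records, "logP", low=-10.0, high=15.0)
```

Only the values that pass these checks would then be passed on to distribution-level tests such as a first-digit analysis.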

4. Conclusion

In conclusion, the effective use of statistical distributions, particularly in AI/ML models, is central to the future of drug design. Substantial datasets are required to fully utilize AI models, which have increasingly gained prominence in this field. They can be drawn from automated labs, organs-on-a-chip, or functional organoids, among other sources. These models also rely on a variety of statistical distributions, including Normal, Uniform, Xavier/Glorot, Gaussian processes, Bernoulli, and others, each having its strengths and limitations. For instance, deep learning employs various distributions for weight initialization in neural networks and generating novel drug candidates. Reinforcement learning uses probability distributions for optimizing molecular properties, and Bayesian methods incorporate prior knowledge and uncertainty into models. Quantum computing and multi-objective optimization also utilize specific distributions, and personalized medicine relies on models like the Cox proportional hazards model.

Further, Benford’s Law, which outlines a pattern in the leading digits of natural datasets, may serve as a valuable tool for anomaly detection in large chemical databases. Utilising Benford’s Law could enhance data quality, which is critical for AI/ML and modeling. For instance, essential parameters in medicinal chemistry, such as logP, logS, and pKa, can be evaluated for compliance with Benford’s Law, leading to improvements in profiling and data-driven methods.

The effective use of these technologies, distributions, and methodologies is central to the future of drug design, with the potential to vastly improve the efficiency and cost-effectiveness of the drug discovery process. Our contribution to the literature is emphasizing the potential of Benford’s Law in enhancing the quality of large and small datasets, ultimately improving the drug discovery process.

5. Expert opinion

While Benford’s Law has not yet been a prominent method in drug design, it could be employed to detect anomalies in data and improve the quality of the datasets used in the drug discovery process.

Benford’s Law can also identify data gaps in distributions and thus help to design better experiments for improving the quality and representativeness of datasets, and to plan and guide optimization processes while controlling the use of resources. Datasets of more compounds than the few thousand existing approved drugs are needed to better represent the full phenomena of bioactive (and bioinactive) compounds.

Data-driven discovery will become ever more present in several fields, including the pharmaceutical and medical fields. As such, it is envisioned that more and better data, together with methods to deal appropriately with data bias, algorithm bias, and their effects, as well as the ownership of, access to, and fair use of data and tools, will be crucial aspects to focus on.

Issues such as accountability, fraud, data manipulation, consent, representativeness, and reliability of assumptions will be central for further research and development.

It will also be necessary for regulators to consider the effects of AI/ML and distribution-based devices and methods, such as through preregistered trials and research, open source and ethical supervision of data collection, processing, storing, featurisation, model building, model deployment, and inference, as well as effects, including indirect ones, on different populations and individuals.

Machine vision is consolidating, and initial successes are already plentiful. Though progress has been slower so far, we expect successes to also come from other areas such as signal processing, time series, automated monitoring, predictive analysis, and tools to process securely and quickly the vast amounts of curated data presently stored in pharmaceutical companies.

Article highlights

  • As the relevance of AI/ML and data-driven discovery escalates in fields like pharmaceuticals and medical devices, meticulous scrutiny of datasets utilizing principles like Benford’s Law is anticipated to enhance data integrity and guide efficient resource allocation, experimental planning, and strategy.

  • In the era of data-driven discovery, addressing critical aspects including bias mitigation, algorithm effectiveness, data stewardship, and fraud prevention, is essential.

  • Harnessing Benford’s Law and other distributions in drug design provides a potent and fast strategy to detect data anomalies, fill data gaps, and enhance dataset quality.

  • For a more comprehensive and accurate portrayal of bioactive and bioinactive compounds and their phenomena, data sets need to encompass a broader spectrum than the mere few thousand presently approved drugs.

  • These considerations, coupled with the ability to generate distribution-based, automated experimental data and securely and swiftly process vast volumes of proprietary, curated data, could revolutionize areas for drug design.

  • Advances in other fields, such as AI, signal processing, and sparse, limited, and noisy data, can follow the successes seen in the machine vision field.

Declaration of interest

The author has no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

Reviewer disclosures

Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.

Additional information


The authors are funded by the Estonian Research Council (grant no. PRG1509); EU COST (European Cooperation in Science and Technology) Action CA21111 OneHealthDrugs.

