EPA Is Mandating the Normal Distribution

The United States Environmental Protection Agency (USEPA) is responsible for overseeing the cleanup of sites that fall within the jurisdiction of the Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA; also known as “Superfund”). This process almost always involves a remedial investigation/feasibility study, including deriving upper confidence, prediction, and/or tolerance limits based on concentrations from a designated “background” area which are subsequently used to determine whether a remediated site has achieved compliance. Past USEPA guidance states outlying observations in the background data should not be removed based solely on statistical tests, but rather on some scientific or quality assurance basis. However, recent USEPA guidance states “extreme” outliers, based on tests that assume a normal (Gaussian) distribution, should always be removed from background data, and because “extreme” is not defined, USEPA has interpreted this to mean all outliers identified by a test should be removed. This article discusses problems with current USEPA guidance and how it contradicts past guidance, and illustrates USEPA’s current policy via a case study of the Portland, Oregon Harbor Superfund site. Supplementary materials for this article are available online.


Introduction
The United States Environmental Protection Agency (USEPA) is responsible for overseeing the cleanup of sites that fall within the jurisdiction of the Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA; also known as "Superfund"). This process almost always involves a remedial investigation/feasibility study (RI/FS), including deriving background threshold values (BTVs) for pollutants based on concentrations in a designated background area. These BTVs (usually upper confidence limits [UCLs], upper prediction limits [UPLs], or upper tolerance limits [UTLs]) are subsequently used to determine whether a remediated site has achieved compliance by comparing statistics based on data from the remediated site with the BTVs. A key component of deriving BTVs is determining exactly what the "background" data are. Past USEPA guidance states outlying observations in background data should not be removed based solely on statistical tests, but rather on some scientific or quality assurance basis. However, recent USEPA guidance states "extreme" outliers, defined by tests that assume a normal (Gaussian) distribution, should always be removed from background data, and because "extreme" is not defined, USEPA has interpreted this to mean all outliers identified by a test should be removed. As part of the RI/FS process for the Portland, Oregon Harbor Superfund site, in 2014 the USEPA directed potentially responsible parties (PRPs) to omit all observations from the background data that were identified as outliers according to Rosner's test, a directive that prompted a review of USEPA's approach. This article summarizes past and current USEPA guidance for assessing and dealing with outliers in background data, and, via the Portland Harbor case study, documents USEPA's defense of its procedure.

The Assumed Distribution Matters
Whether an observation is an outlier depends on the assumption for the underlying distribution that generated the data. Although statistical models for environmental data include the normal distribution, often the assumed model is lognormal, gamma, some other right-skewed distribution, or nonparametric (Heath 1967; Ott 1995; Limpert et al. 2001; van Belle 2008; Limpert and Stahel 2011). Several USEPA guidance documents point out that environmental data are often not normally distributed (USEPA 1995, 2002, 2009, 2013a, 2015b). USEPA (2009) states that if a test for outliers based on the raw data indicates significance, it should also be performed on transformed (e.g., log) data as well, and if the significance disappears then the observation(s) should be retained.
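USEPA (2009)'s raw-versus-log check can be illustrated with a short simulation. This is a minimal Python sketch (the article's supplemental code is in R); the data and the Grubbs-type statistic are illustrative assumptions, not the article's analysis. For right-skewed lognormal data with no true outliers, the maximum typically looks extreme on the raw scale but unremarkable after a log transform:

```python
import numpy as np

rng = np.random.default_rng(42)
# Right-skewed lognormal "background" data containing no true outliers
x = rng.lognormal(mean=2.0, sigma=0.8, size=50)

def max_studentized(v):
    """Studentized deviation of the sample maximum (Grubbs-type statistic)."""
    return (v.max() - v.mean()) / v.std(ddof=1)

z_raw = max_studentized(x)          # skewness inflates this on the raw scale
z_log = max_studentized(np.log(x))  # on the log scale the data really are normal
```

Because the log-transformed data are exactly normal here, the apparent outlier "disappears" on the log scale, which under USEPA (2009) argues for retaining it.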

Dealing With Outliers
Many USEPA guidance documents discuss ways to deal with outliers, including: performing quality control procedures, such as checking for transcription errors, possible sources of contamination in the field or laboratory, and/or laboratory errors (USEPA 2002, 2006a, 2009); taking more than one physical sample so that an alternate sample from the same location is available to analyze if questionable values arise (USEPA 1995); and looking at results both with and without outliers (USEPA 2006a, 2006b, 2013b, 2015b). Several guidance documents state observations should never be discarded simply based on a statistical test; rather, there should be a scientific explanation justifying omitting the data (USEPA 2002, 2006a, 2006b, 2009). USEPA (2009) describes the dilemma faced by an analyst when an observation appears very different from the others without any explanation: …it may be advisable at times to remove high-magnitude outliers in background even if the reasons for these apparently extreme observations are not known. …On the other hand, an isolated increase without any other evidence could be a real but extreme background measurement. Ideally, removal of one or more statistically identified outliers should be based on other technical information or knowledge which can support that decision. (USEPA 2009, pp. 5-5 to 5-6)
In particular, USEPA (2002, pp. 4-6) states: "… background areas are not necessarily pristine areas. A data point should not be eliminated from the background dataset simply because it is the highest value that was observed."

Issues With Current USEPA Guidance for CERCLA Sites
For assessing an RI/FS at a CERCLA site, USEPA relies on the ProUCL software and ProUCL Technical Guide (USEPA 2015a, 2015b), a package developed under contract to USEPA and maintained for USEPA by Lockheed Martin. USEPA's Office of Research and Development funded and managed the research, and it was peer reviewed and approved for publication by USEPA. Statistical issues where this guidance contradicts past USEPA guidance for identifying and dealing with outliers at background sites are discussed below.

Current Guidance Assumes a Normal Distribution
The ProUCL Technical Guide (USEPA 2015b, pp. 223, 225) states: In practice, the boundaries of an environmental population (background) of interest may not be well-defined and … a sampled data set may consist of outlying observations coming from population(s) not belonging to the main dominant background population of interest. … a lognormal model tends to accommodate outliers …. … impacted locations may need further investigations. Outlier tests should be performed on raw data, as the cleanup decision needs to be made based upon concentration values in the raw scale and not in log-scale or some other transformed scale (e.g., cube root).
The only tests for outliers available in ProUCL are versions of Dixon's test and Rosner's test, both of which assume all of the data that are not outliers come from a normal distribution. There are many forms of Dixon's test (Barnett and Lewis 1995); the one available in ProUCL assumes just one outlier (Dixon 1953). Dixon's test is vulnerable to "masking," in which the presence of multiple outliers can mask the fact that even one outlier is present. Rosner's test allows testing for up to k possible outliers and avoids the problem of masking (Barnett and Lewis 1995; Lau 2015; NIST 2018). The original form of Rosner's test is based on the extreme Studentized deviate (ESD) (Rosner 1975), whereas the form of Rosner's test available in ProUCL is based on the generalized ESD (Rosner 1983; Gilbert 1987). Rosner's test based on the ESD has the appropriate Type I error level if there are no outliers in the dataset, but if there are actually, say, m outliers, where m < k, then the ESD version of Rosner's test tends to declare more than m outliers with a probability that is greater than the stated Type I error level (referred to as "swamping"). Rosner's test based on the generalized ESD fixes this problem.
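For concreteness, the generalized ESD procedure can be sketched as follows. This is a minimal Python implementation following the usual formulation of Rosner (1983) as presented by NIST (2018); the article's supplemental code is in R, and this sketch is not ProUCL's implementation:

```python
import numpy as np
from scipy import stats

def generalized_esd(x, k, alpha=0.05):
    """Rosner's generalized ESD test (two-sided): returns the number of
    outliers detected (0..k), assuming the non-outliers are normal."""
    work = np.asarray(x, dtype=float).copy()
    n = len(work)
    exceeded = []
    for i in range(1, k + 1):
        mu, s = work.mean(), work.std(ddof=1)
        j = int(np.argmax(np.abs(work - mu)))
        R_i = abs(work[j] - mu) / s                  # extreme Studentized deviate
        work = np.delete(work, j)                    # remove it and recompute next pass
        # Critical value lambda_i (Rosner 1983)
        p = 1.0 - alpha / (2.0 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam_i = (n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1))
        exceeded.append(R_i > lam_i)
    # Number of outliers = the LARGEST i for which R_i exceeds lambda_i
    hits = [i for i, hit in enumerate(exceeded, 1) if hit]
    return hits[-1] if hits else 0
```

Note the decision rule: rather than stopping at the first nonsignificant statistic, the declared number of outliers is the largest significant step, which is what controls swamping.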
Because ProUCL states that outlier tests should only be performed on raw data, and the only available tests for outliers in ProUCL assume all of the data that are not outliers come from a normal distribution, ProUCL by default assumes that "the main dominant background population" is essentially normally distributed. This is puzzling, given much of the ProUCL Technical Guide explains how to compute means and BTVs on the original scale assuming distributions other than the normal distribution, including the gamma and lognormal distribution, as well as nonparametric methods.
As previously stated, under the null hypothesis, Rosner's generalized ESD test assumes all of the data come from the same normal distribution, and under the alternative hypothesis, up to k of the observations come from a different distribution, where k is pre-specified. ProUCL indicates outlier tests should be used to identify "elevated outliers" (USEPA 2015b, p. 18), but because Rosner's test is inherently two-sided, the Type I error the user specifies is twice the effective Type I error for identifying only large outliers (e.g., setting α = 0.1 yields a 5% Type I error for identifying only large outliers). Figure B1 in Appendix B of the supplemental material shows the distribution of the number of high outliers identified by Rosner's test based on 5000 simulations, assuming a maximum of k = 10 outliers (the same number used in the Case Study below), using normal, gamma, and lognormal distributions with mean 10 and CV of 0.5 or 1, and sample size n = 50. Figure B2 (supplemental material) shows the false positive rate for identifying at least one high outlier. As expected, the false positive rate for right-skewed distributions is grossly inflated (e.g., for the gamma and lognormal distributions with mean 10 and CV 0.5, the proportion of times [95% CI] at least one outlier was indicated is 0.45 [0.44, 0.46] and 0.68 [0.67, 0.69], respectively). Appendix A (supplemental material) contains R code (R Core Team 2016; Millard 2013) used to generate the figures and analyses shown in this article.
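The qualitative effect behind Figure B2 can be reproduced in miniature. The sketch below is a Python stand-in (the article's simulations are in R) and, as a simplification, uses a single-step, one-sided Grubbs-type test rather than the full Rosner procedure; the sample sizes and distribution parameters are illustrative:

```python
import numpy as np
from scipy import stats

def grubbs_high_crit(n, alpha=0.05):
    """One-sided critical value for the largest-value Grubbs statistic."""
    t = stats.t.ppf(alpha / n, n - 2)  # lower-tail t quantile
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

rng = np.random.default_rng(1)
n, reps = 50, 2000
crit = grubbs_high_crit(n)

def false_positive_rate(draw):
    """Fraction of outlier-free samples in which a high outlier is declared."""
    hits = 0
    for _ in range(reps):
        x = draw()
        if (x.max() - x.mean()) / x.std(ddof=1) > crit:
            hits += 1
    return hits / reps

# Normal background (mean 10, CV 0.5) vs. lognormal background (mean 10, CV 1)
sigma = np.sqrt(np.log(2.0))          # lognormal log-scale sd giving CV = 1
mu = np.log(10.0) - sigma**2 / 2.0    # lognormal log-scale mean giving mean = 10
rate_norm = false_positive_rate(lambda: rng.normal(10.0, 5.0, n))
rate_lnorm = false_positive_rate(lambda: rng.lognormal(mu, sigma, n))
```

For the normal background the empirical rate sits near the nominal 5% level, while for the right-skewed lognormal background it is several times larger, even though neither simulation contains any true outliers.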

Current Guidance for Dealing With Nondetects Is Faulty
Environmental data often contain left-censored observations, called nondetects. Nondetects occur when the contaminant of interest is present in negligible concentrations or absent, so that the chemical signal recorded on the measuring instrument is small relative to process noise. In this case, the concentration is usually reported as being less than an analytical threshold (detection limit or reporting limit), such as "< 2 μg/kg" (Helsel 2012). For datasets with nondetect values, to construct Q-Q plots using ProUCL, the options are: (1) ignore nondetects, (2) set nondetects to half the detection limit, or (3) set nondetects to the detection limit (the default). Method 1 is not valid because it throws away data and changes the focus from the underlying true distribution to a truncated version of the true distribution. Methods 2 and 3 are incorrect, as noted by USEPA (2009, p. 12): "If simple substitution is used to estimate a value for each nondetect prior to plotting, the resulting probability plot may appear nonlinear simply because the censored observations were not properly handled." Proper methods for computing plotting positions for Q-Q plots when data are left-censored involve using valid estimates of the cumulative distribution function (CDF). When the detection limits of all nondetects are less than all uncensored observations, the usual formulas for the uncensored observations can be used; see, for example, Millard (2013). Figure 1 illustrates normal Q-Q plots based on a random sample of 40 observations from a normal distribution with mean 10 and SD 5 (CV = 0.5). Observations less than 6 (the 21st percentile) were set to 6 and designated nondetects. The first Q-Q plot shows all of the original observations prior to censoring, the second one omits nondetects, the third one sets nondetects to the detection limit, and the fourth one is a censored Q-Q plot based on the Kaplan-Meier technique (USEPA 2009).
Omitting nondetects or substituting the detection limit (Figures 1(b) and 1(c)) gives false impressions of the data, and in fact the goodness-of-fit tests reject the null hypothesis of a normal distribution (p = 0.049 and p = 0.001, respectively). A valid method of dealing with nondetects, USEPA's (2009) modified version of the reverse Kaplan-Meier method (Figure 1(d)), indicates the data appear to come from a normal distribution, and the goodness-of-fit test that accounts for censored observations (Royston 1993) does not reject the null hypothesis of an underlying normal distribution (p = 0.54).
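For the simple singly censored case described above (all detection limits below all detects), valid plotting positions take only a few lines. This Python sketch mirrors the Figure 1 setup (the article's own code is in R); the Blom constant 0.375 is one common plotting-position choice, not necessarily the one used in the article:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(10.0, 5.0, 40)   # true underlying data
dl = 6.0
detected = x >= dl              # values below dl are reported as "< 6"

# Singly censored case: every nondetect is below every detect, so the detects
# keep their full-sample ranks and the usual Blom plotting positions apply.
n = x.size
det_sorted = np.sort(x[detected])
ranks = np.arange(n - det_sorted.size + 1, n + 1)  # detects occupy the top ranks
pp = (ranks - 0.375) / (n + 0.25)                  # Blom plotting positions
theo_q = stats.norm.ppf(pp)                        # theoretical normal quantiles

# Linearity of the censored Q-Q plot (a PPCC-style check for normality)
r = np.corrcoef(theo_q, det_sorted)[0, 1]
```

Because the ranks are taken within the full sample of size n rather than within the detects alone, the resulting Q-Q plot estimates the underlying CDF correctly and stays nearly linear for genuinely normal data, unlike the substitution plots in Figures 1(b) and 1(c).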
To test for outliers using ProUCL, the options are to ignore nondetects or set nondetects to half the detection limit (the default). Figure B3 in Appendix B of the supplemental material shows the distribution of the number of high outliers identified by Rosner's generalized ESD test based on 5000 simulations, assuming a 5% Type I error (i.e., a 10% two-sided Type I error) and a maximum of k = 10 outliers, using normal, gamma, and lognormal distributions with a mean of 10 and CV of 0.5, and a sample size of n = 50, with 10% single censoring (i.e., observations less than the 10th percentile of the distribution were set to the 10th percentile and considered nondetects), and setting nondetects to half the detection limit or omitting them. Figure B4 (supplemental material) shows the false positive rate for identifying at least one high outlier. For the normal distribution, setting nondetects to half the detection limit only mildly affects the false positive rate, but ignoring them inflates the Type I error by about a factor of three (0.14, 95% CI [0.13, 0.15]).

Current Guidance States Outliers Should Always be Omitted
Contrary to past USEPA guidance, the ProUCL Technical Guide (USEPA 2015b, p. 18) states certain kinds of outliers should never be included, even if there is no evidence to indicate the outliers are the result of process errors: "Since the presence of outliers in a dataset tends to yield distorted (poor and misleading) values of the decision making statistics (e.g., UCLs, UPLs, and UTLs), elevated outliers should not be included in background datasets and estimation of BTVs." As illustrated in the case study, USEPA interprets this guidance to mean all outliers should always be removed.

Case Study-The Portland Harbor CERCLA Site
The Portland, Oregon Harbor has been affected by more than a century of industrial, commercial, urban, agricultural, and residential use along the Willamette River. Water and sediments along Portland Harbor are contaminated with many hazardous substances, including heavy metals, polychlorinated biphenyls (PCBs), polynuclear aromatic hydrocarbons (PAHs), dioxin, and pesticides. Portland Harbor was added to EPA's National Priorities List in December 2000 (USEPA 2018). The Lower Willamette Group (LWG) is a subset of PRPs identified by USEPA (persons for whom EPA has sufficient information to make a preliminary determination of liability under CERCLA; USEPA 2017a) and is composed of 10 parties who signed an agreement with the USEPA to conduct the RI/FS of the Portland Harbor Superfund Site, and four other parties who have contributed financially to the project (LWG 2017). Appendix C in the supplemental material includes maps of the Portland Harbor CERCLA site and the Background area (denoted the Upriver Reach), along with figures showing the locations of the surface sediment sampling stations within the Background area. Information on how the Background area was chosen is given in Section 7 of the Remedial Investigation Report (LWG 2016), included in Appendix C, supplemental material.
While the history of the RI/FS for this CERCLA site spans more than a decade and encompasses numerous documents and statistical issues, this section deals only with data on chemical concentrations in surface sediment sampled from the Background area for a subset of constituents, and specifically focuses on USEPA's method of identifying outliers and its justification for excluding them from calculations of BTVs. (Data used in this case study are described in Appendix D, supplemental material.) USEPA's procedure was as follows:
1. Using all of the data, compute a BTV by using ProUCL's automated method, which involves ProUCL first choosing a distribution for the data (normal, gamma, lognormal, or nonparametric) based on goodness-of-fit tests that omit nondetects.
2. Using all of the data, perform Rosner's test (which assumes a normal distribution for non-outliers) to identify up to 10 possible outliers (i.e., k = 10) based on the raw (untransformed) data (regardless of what was assumed for the underlying distribution in Step 1). For constituents that include nondetects, set nondetects to half the detection limit.

3. Remove all outliers identified by Rosner's test, then repeat Step 1. Note that for Step 3, the distribution chosen by ProUCL to compute the BTV may differ from the distribution chosen by ProUCL in Step 1.
4. Compare the BTV from Step 1 with the one from Step 3 and assess the effect of outliers on the BTV.
USEPA also looked at normal Q-Q plots generated by ProUCL to determine outliers, but identified the same observations as outliers that were identified by Rosner's test. Table D1 in Appendix D of the supplemental material summarizes the number of observations, number and percent of nondetects, and number of outliers identified by USEPA for 17 constituents that were measured in the Background area. The percentage of nondetects ranges from 0 to 52%. Figure 2 shows plots for total PCBs (in the form of Aroclors) and arsenic. For the PCBs as Aroclors raw data, Rosner's test identified the five largest observations as outliers. However, the normal Q-Q plot using USEPA's (2009) modified version of the reverse Kaplan-Meier method and the log-transformed observations that include the outliers (Figure 2(b)) does not indicate the presence of outliers, and the Shapiro-Francia goodness-of-fit test that accounts for censored observations (Royston 1993) yields p = 0.49 using the log-transformed data. On the other hand, for arsenic, the Shapiro-Wilk goodness-of-fit tests based on the raw and log-transformed observations both yield p < 0.05.
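The compare-with-and-without-outliers step can be illustrated with a normal-theory upper tolerance limit. This Python sketch uses entirely synthetic data and the standard noncentral-t tolerance factor; it does not reproduce ProUCL's automated distribution selection or the Portland Harbor data:

```python
import numpy as np
from scipy import stats

def utl_normal(x, coverage=0.95, conf=0.95):
    """One-sided upper tolerance limit assuming normality: mean + K*sd,
    with K computed from the noncentral t distribution."""
    n = len(x)
    nc = stats.norm.ppf(coverage) * np.sqrt(n)
    K = stats.nct.ppf(conf, df=n - 1, nc=nc) / np.sqrt(n)
    return np.mean(x) + K * np.std(x, ddof=1)

rng = np.random.default_rng(3)
bg = rng.lognormal(mean=2.0, sigma=0.6, size=50)  # skewed synthetic "background"
trimmed = np.sort(bg)[:-3]                        # drop the 3 largest ("outliers")
utl_all, utl_trim = utl_normal(bg), utl_normal(trimmed)
```

Removing the largest values necessarily lowers both the mean and the standard deviation, so the recomputed BTV drops, which is exactly why deleting valid high background values makes the background area look cleaner than it is.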
Two notable features of Figure 2 are (1) the cluster of large values in Figure 2(a) for PCBs in a relatively small section of the Background area, which might prompt further investigation, and (2) all three outliers for arsenic shown in Figure 2(c) are less than the Portland basin upper background limit of 8.8 mg/kg published by the Oregon Department of Environmental Quality (2013a, 2013b), which states "If the maximum detected concentration is less than the default value, then that metal is not present in site soil above background levels and that metal is not a chemical of potential concern or potential ecological concern" (ORDEQ 2013a). Thus, removing these observations from the dataset would amount to removing data that ORDEQ considers below background levels. Figures D1-D17 in the supplemental material show plots similar to Figure 2 for all 17 constituents.
In an E-mail of August 12, 2014, USEPA directed LWG to exclude all observations identified as outliers when computing BTVs. Under the RI/FS process, USEPA issues an Administrative Settlement Agreement and Order on Consent (AOC) that includes a section on Dispute Resolution (USEPA 2017a). This section contains text describing the process for both informal and formal dispute resolution within USEPA's RI/FS administrative procedure if a PRP disagrees with USEPA's administrative decisions. In its dispute-resolution decision, USEPA (2015d) stated: "… the overall thrust of the Technical Guide recommends the removal of extreme values based solely on their significant difference from the remainder of the data set. … EPA did not assume a normal distribution of the data but instead used the most appropriate methods available in the guidance to perform outlier analyses …." USEPA (2015d) admits little is known about potential sources of contamination in the Background area that could be used as weight-of-evidence to justify removal of outliers, and as an example acknowledges multiple possible sources of PCBs in the Background area. USEPA reasons that although all sources of PCBs in the Background area are anthropogenic, values from some kinds of sources should not be included in the Background area. USEPA then implies that since at least some of the elevated values of PCBs may have come from these disallowed sources, this justifies omitting all observations identified as outliers by Rosner's test for all chemicals measured in the Background area.

Discussion
USEPA's current policy of deleting outliers from background data based solely on the results of statistical tests and plots is contrary to past USEPA guidance and good statistical practice. Note also that these tests do not consider potential spatial variability in the underlying geochemical composition of the sediment, which can affect the concentration of analytes. Both NAVFAC (2003) and Myers and Thorbjornsen (2004) recommend looking at scatterplots of trace elements versus major elements (e.g., iron) in order to identify high concentrations of trace elements not explained by the underlying geochemical composition. This idea could be extended to look at Q-Q plots and perform outlier tests based on residuals from a regression model for the analyte of interest that includes geochemical predictors. However, any outlier identified with this method would still require investigation and scientific explanation in order to omit it from the dataset.
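The geochemical-regression idea can be sketched as follows. This Python example uses entirely synthetic iron and arsenic values (hypothetical names and numbers chosen for illustration); a real application would use site measurements and, per the text, any flagged value would still need a scientific explanation before being omitted:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 60
iron = rng.uniform(1.0, 5.0, n)                 # major element (hypothetical Fe, %)
arsenic = 1.5 * iron + rng.normal(0.0, 0.4, n)  # trace element tracks geochemistry
arsenic[0] += 5.0                               # one value elevated beyond geochemistry

# Ordinary least squares fit and internally studentized residuals
X = np.column_stack([np.ones(n), iron])
beta, *_ = np.linalg.lstsq(X, arsenic, rcond=None)
resid = arsenic - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages
s = np.sqrt(resid @ resid / (n - 2))
stud = resid / (s * np.sqrt(1.0 - h))

# Flag concentrations not explained by the geochemical predictor
flagged = np.flatnonzero(np.abs(stud) > 3.0)
```

A high raw arsenic value at a high-iron location would not be flagged here, because the regression explains it; only concentrations elevated relative to the underlying geochemistry stand out in the residuals.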
It is not clear why USEPA maintains that using Dixon's test, Rosner's test, and/or normal Q-Q plots to determine outliers does not involve assuming a normal distribution. USEPA justifies omitting outliers because of their effects on computed BTVs, especially in the case of small sample sizes. Deleting observations because of their effect on estimated parameters is not a valid way to fix a sample size issue.
USEPA (2002, p. B-6) is relevant as to why valid observations should not be arbitrarily removed from background data: Generally, under CERCLA, cleanup levels are not set at concentrations below natural … [or] anthropogenic background concentrations …. The reasons for this approach include cost-effectiveness, technical practicability, and the potential for recontamination of remediated areas by surrounding areas with elevated background concentrations.
For example, if the CERCLA site is downstream from the background site (as is the case for the Portland Harbor), then contaminants present in the background site can affect contaminant levels in the CERCLA site after remediation.
For cases in which an outlier cannot be explained as being a data entry error, analytical error, or the result of accidental contamination of the physical sample in the field or lab, two possible actions are to run analyses on field replicates of the physical sample (if they exist) and/or collect more physical samples in the vicinity of the sample associated with the large values. For CERCLA sites, which often involve tens of millions of dollars or more in cleanup costs (GAO 2010), it is wise for PRPs to collect multiple physical samples at the same locations, split physical samples into duplicates, and to require the analytical lab(s) performing the analyses to run blinded lab duplicates of each sample. In its final decision, USEPA (2015d, p. 13) implies it might have considered additional evidence before removing outliers: While I understand that conditions of the reference area are uncertain, I note that LWG could have addressed this uncertainty by either investigating potential sources of out-lier contamination, or collected additional samples near or at the location of the potential outlier sample. Based on the record before me, it appears that LWG failed to propose or perform either.
USEPA's mission is to protect the environment and human health, so it is understandable the agency would prefer to use data that reflect the cleanest possible background area. However, observations that do not fit a model are not necessarily wrong (e.g., McNutt 2014), and, assuming they represent true concentrations, removing them makes the background site look cleaner than it really is. If USEPA considers outlier concentration levels in the background site unacceptable, it would make sense to verify these concentration levels, clean up the affected areas of the background site, and implement a new sampling program in these areas to determine updated background concentrations.
Thirty years ago, Millard (1987) discussed shortcomings of environmental monitoring programs and laws, and noted many of the problems were "the result of the lack of involvement of qualified statisticians in monitoring design, data-analysis, and policy decisions," and a lack of appreciation for what is involved in the statistics profession. Recently, USEPA published federal regulations that mandate specific statistical procedures for monitoring groundwater at coal combustion residuals landfills (US Government Publishing Office 2017, Electronic Code of Federal Regulations, Title 40, Section 257.93). These regulations mandate that the statistical analysis plan must be certified by a "qualified professional engineer [emphasis added] stating that the selected statistical method is appropriate." There is still a need for improved awareness of the role of statisticians, and improved statistical support and guidance in the field of environmental regulation.

Supplementary Materials
The supplementary materials contain Appendices A-H. More information is available in the "ReadMe" document accompanying those materials.