Disparities in case frequency and mortality of coronavirus disease 2019 (COVID-19) among various states in the United States

Abstract Objective To utilize publicly reported, state-level data to identify factors associated with the frequency of cases, tests, and mortality in the USA. Materials and methods Retrospective study using publicly reported data collected included the number of COVID-19 cases, tests and mortality from March 14th through April 30th. Publicly available state-level data was collected which included: demographics comorbidities, state characteristics and environmental factors. Univariate and multivariate regression analyses were performed to identify the significantly associated factors with percent mortality, case and testing frequency. All analyses were state-level analyses and not patient-level analyses. Results A total of 1,090,500 COVID-19 cases were reported during the study period. The calculated case and testing frequency were 3332 and 19,193 per 1,000,000 patients. There were 63,642 deaths during this period which resulted in a mortality of 5.8%. Factors including to but not limited to population density (beta coefficient 7.5, p < .01), transportation volume (beta coefficient 0.1, p < .01), tourism index (beta coefficient −0.1, p = .02) and older age (beta coefficient 0.2, p = .01) are associated with case frequency and percent mortality. Conclusions There were wide variations in testing and case frequencies of COVID-19 among different states in the US. States with higher population density had a higher case and testing rate. States with larger population of elderly and higher tourism had a higher mortality. Key messages There were wide variations in testing and case frequencies of COVID-19 among different states in the USA. States with higher population density had a higher case and testing rate. States with larger population of elderly and higher tourism had a higher mortality.


Introduction
The coronavirus disease 2019 (COVID-19) has spread worldwide after its onset in Wuhan, China, reaching over three million cases [1]. In the United States (US), over seven million tests have been performed with over one million tests resulting positive. This disease has resulted in over 70,000 deaths in the US and a case fatality rate at approximately 6.2%. Recent reports have shown differences in case and fatality rates between nearby localities and thought to be likely due to differences in demographic and socioeconomic factors [2]. Furthermore, differences in climate, access to healthcare and adherence to social distancing have also been hypothesized to affect these rates [3][4][5]. However, factors mediating the differences in case frequency and testing frequency between the states in the USA have not been formally assessed. The aim of this study is to use publicly available, state-level data to determine the factors associated with COVID-19 percent mortality and case and testing frequency. approval was required or sought. The study was performed in accordance with Declaration of Helsinki.

Variable identification and data collection
Absolute counts for COVID-19 cases, tests and mortality were obtained from Worldometer [6]. Data were collected from March 13th, 2020 through April 30th, 2020. These dates were selected as March 13th was the first day when these data were publicly reported, and April 30th was the last full day prior to data collection. The number of cases and tests were then normalized to the specific state's population to develop a frequency per 1,000,000 population. Any reference in this manuscript to "case frequency" or "testing frequency" refers to the normalized values in this manner, unless explicitly stated otherwise.
Next, data were collected at a state-wide level to help characterize the population, environment, and infrastructure in the state. The data sources are listed in Supplementary file 1. The selected variables were identified by a literature review of the factors that impact the frequency and severity of viral illnesses, including COVID-19. The following variables were collected: age, gender, underinsured population, ethnicity, influenza vaccination status, population density (persons per square mile), urban air quality rank (lower number signifying better air quality), drinking water quality rank (lower number signifying better drinking water quality), ultraviolet index, precipitation (in inches), temperature (in degree Fahrenheit), average household income (in US$), per capita spending on healthcare (in US$) and high school graduation rate. To capture comorbid conditions, the prevalence of obesity, prevalence of smoking, prevalence of cocaine abuse, prevalence of marijuana abuse, alcohol consumption (gallons per person per year), prevalence of asthma, prevalence of diabetes, prevalence of chronic obstructive pulmonary disease, prevalence of myocardial infarction, prevalence of coronary artery disease, prevalence of hypertension, prevalence of hyperlipidaemia and presence of inactivity were collected. To capture immunosuppressed states, the annual incidence of new cancer and HIV cases per 100,000 population were collected. Finally, the social distancing score by global positioning satellite data, public transportation volume, number of incarcerated inmates, number of nursing home residents, and tourism rank (lower number implies more tourists) were collected. Of these, ultraviolet index, temperature and precipitation were averages for March and April of 2020, whereas the remainder of data were collected from the most recent iteration for each state. Much of the data was collected from government sources, such as the Centres for Disease Control and the complete list of sources is provided The collected data represents state-level and not patient-level data. The endpoints were divided amongst the authors and collected by everyone. The data for each endpoint was then verified by another author who did not primarily collect the data. Finally, values in the top and bottom 10th percentile were identified and verified by a third author.

Statistical analyses
As the data was collected for each state and intended for state-level analyses, the absolute number of COVID-19 cases, tests, and mortality were converted to a frequency using the state population. The frequencies for all the endpoints were calculated per 1,000,000 population. The case frequencies were then used as the dependent variables in a series of singleindependent variable linear regressions to determine the univariate association between case frequency and the other variables previously defined and served as the univariate analyses. Next, a stepwise multivariate regression was conducted with p-value of .05 or less required for inclusion into the final model. Of the resulting models, the one with the highest R-squared value was selected as the final model. The same process was repeated for testing frequency and percent mortality as the dependent variable. Collinearity analyses were run with all multivariate regressions.
All statistical analyses were done using the user-coded, syntax-based interface of SPSS Version 23.0. A p-value of .05 or less was considered was considered statistically significant. The use of the word significant throughout the manuscript refers to "statistically significant" unless explicitly specified otherwise. All statistical analyses were done at the state-level with state-level data. Analyses were not conducted at a patient-level with patient-level data. The subjects here were the 50 states. The age, gender and comorbidity prevalence are not based on patient-specific data but rather the state prevalence.

COVID-19 cases, testing and mortality
A total of 1,090,500 COVID-19 cases were reported in the study period. This resulted in a case frequency of approximately 3332 per 1,000,000 patients (3.3%). In the same period, 6,299,143 tests were done, resulting in a testing frequency 19,193 per 1,000,000 patients out of which 17.3% were reported positive. There were 63,642 deaths during this period which resulted in a mortality of 5.8%. Figures 1 and 2 demonstrate the case frequency and the percent mortality by state.

COVID-19 case frequency, multivariate analyses
The following factors were associated with greater case frequency on multivariable analyses: higher population density (beta coefficient 7.5, p < .01) and increased public transportation volume (beta coefficient 0.1, p < .01). The R-square for this model was 0.78. Collinearity analyses did not demonstrate any significant collinearity (Table 2 shows multivariate analyses).

COVID-19 testing frequency, multivariate analyses
The following factors were associated greater testing frequency on multivariable analyses: higher population density (beta coefficient 19.9, p < .01). The R-square for this model was 0.27. Collinearity analyses did not demonstrate any significant collinearity.

COVID-19 percent mortality, multivariate analyses
The following factors were associated with greater percent mortality using multivariable analyses: median age in years (beta coefficient 0.2, p ¼ .01) and tourism ranking (beta coefficient-0.1, p ¼ .02) ( Table 3). The Rsquare for this model was 0.29. Collinearity analyses did not demonstrate significant collinearity.

Power analyses
For multivariate regression analyses for case frequency, for which there is a relatively adequate effect size, and two predictors in the model, 31 subjects would be required to achieve 80% power. With 50 states and thus 50 subjects in these analyses, the multivariate analyses for case frequency are adequately powered.
For multivariable regression analyses for testing frequency, for which there is a relatively low effect size, and one predictor in the model, 385 subjects would be required to achieve 80% power. With 50 states and thus 50 subjects in these analyses, the multivariable analyses for testing frequency are not adequately powered.
For multivariable regression analyses for percent mortality, for which there is relatively low effect size, and two predictors in the model, 478 patients would be required to achieve 80% power. With 50 states and thus 50 subjects in these analyses, the multivariable analyses for percent mortality are not adequately powered.

Discussion
In these analyses, wide variations in case frequency, testing frequency, and percent mortality were observed. These results also identified factors, such as population density, transportation volume, tourism index and older age to be some of the factors affecting the above outcomes. To the authors' knowledge, this is the first study to incorporate large number of variables to study the differences in COVID-19 transmission among different states in the US. Several laboratory and clinical risk factors have been postulated to predispose patients to symptomatic infection with COVID-19 and related mortality [7][8][9]. However, non-clinical factors that affect such outcomes are unknown. After multivariate analysis in this study, a higher population density was found to be associated with higher case and testing frequency. The results also identified increased public transportation to be associated with higher case frequency. Finally, the results identified older age and increased tourism independently related to higher mortality.
Since these analyses were based on state-level data, the power for multivariate regression was reduced compared to if the analyses were completed with patient-level data. Hence, the findings of the univariate analyses deserve attention as well here. We found direct relationships between case frequency in a state and female gender, underinsured status, average household income and per capita healthcare spending and inverse relationship with obesity, smoking and UV light exposure. Public transportation was significantly associated with frequency of cases, testing and mortality rates on univariate analysis. However, on multivariate analysis, only the frequency of cases remained significantly associated with public transportation. Tourism ranking was also a significant predictor of all three endpoints on univariate analysis and remained significantly associated with mortality even after multivariate regression.
There is limited data on the nature of healthcare disparities during the COVID-19 pandemic. The cumulative COVID-19 incidence has been reported to be significantly variable among jurisdictions, ranging from 20.6 per 100,000 cases in Minnesota to 915.3 per 100,000 cases in New York City [10]. The timing of the introduction of COVID-19 in the state and the extent of mitigation measures may mediate some of this variation. The age of patients has been shown to be a significant predictor of COVID-19 infection and worse outcomes [9,11]. The race and gender of patients have also been reported to be associated with a higher case frequency and worse outcomes in patients with COVID-19 [12,13]. It continues to be shown that population density is associated with an increase in transmission and infection for the high-risk population [14,15].
The results of this study showed that public transport volume was also linked to a higher case rate. Prior studies have demonstrated similar findings with influenza-like illnesses and that its use increases the individual's risk for acquiring an acute respiratory infection [16,17]. A simulation model indicated that the high level of subway usage in New York can influence disease spread in an influenza epidemic and that between 4 and 5% of total infections would occur on subways [18]. This information is particularly important as recommendations to maintaining strict disinfecting guidelines for public transport along with shelter-inplace whenever possible are established [19,20].
The results of this study identified that higher tourism is associated with increased mortality. A possible explanation of this identified phenomenon has been published and thought to be related to an influx of infected patients presenting late in the disease course [21,22]. Furthermore, prior studies have shown that air transportation accelerates viral spreading mainly related to high passenger traffic and risk of surface contamination in airports [23,24]. Similar findings have been reported in trains and other types of commercial vessels [25,26]. Hence, the widespread implementation of travel restrictions, social distancing and lockdowns has become the main preventative intervention to decrease viral spreading during this pandemic. However, in this analysis, social distancing score was not associated with COVID-19 case frequency or mortality. The result from prior studies showed that social distancing is an effective measure at decreasing viral spreading when comprised of quarantines, school closure and workplace distancing [27,28]. However, our current analysis showed a lack of association between social distancing and COVID-19 case frequency, which may also represent a limitation of the process of measuring the social distancing score. This is particularly important as some countries like Sweden have encountered higher COVID-19 case frequencies after adopting more lenient social distancing measures [29]. This analysis also showed a lack of impact of climaterelated factors on case and testing frequency. These findings require further validation as conflicting reports have been published [5,30,31]. Surprisingly, in our analysis states with a higher number of uninsured patients were found to have lower mortality which could possibly be related to underreporting in such populations as both case numbers and testing numbers were also lower in such states.
Finally, most of the comorbidities analyzed were not found to be independently associated with the case mortality in this analysis. This may suggest the complex interplay between demographics, environmental factors and diseases processes [32,33]. The findings from this analysis also highlight some of the potential limitations of state-level data rather than patient-level data, as previous studies have found a few comorbidities to be associated with increases in case frequency and percent mortality.
These analyses offer some early assessment of the factors that may be mediating COVID-19 case frequency, testing frequency, and percent mortality. These analyses present associations using state-level data and not patient-level data. While these analyses offer novel data regarding case frequency, testing frequency and percent mortality in the US, these analyses are not without their limitations. First, all the study data was captured from publicly available sources which only had data until 2018. The use of state-level data reduced the power of analyses, as we used for the multivariate regression models the number of states as the subjects. Although the data collection carries a risk of bias, this was minimized by utilizing multiple investigators for the accuracy of data captured. The ecologic design of the paper and use of various data sources with varying methods are other limitations of our study. Finally, the lack of granularity to county or city level data further limits our interpretations.
With these limitations in mind, it is important to frame the intentions of this study appropriately. These analyses are by no means intended to be definitive data but are intended to be exploratory data to help identify variables that should be accounted for in larger, multicenter studies that utilize patient-level data. Factors such as the environmental and local infrastructural characteristics appear to modulate the case frequency and percent mortality and thus could be beneficial to capture in future studies. The data from those variables may assist with the understanding of viral spreading and the pandemic evolution. For instance, the identification of the association between higher tourism volume with higher case frequency and percent mortality may help implement faster travel restrictions for future pandemics. Similarly, the association between public transportation volume and its association with increased case frequency and percent mortality may assist in developing a future public response.

Conclusion
This observational analysis of publicly reported statelevel data identified factors associated with increased case frequency, testing frequency, and percent mortality for COVID-19. These data can guide future study design and develop risk prediction models.