Application of data pooling to longitudinal studies of early post-traumatic stress disorder (PTSD): the International Consortium to Predict PTSD (ICPP) project

ABSTRACT Background: Understanding the development of post-traumatic stress disorder (PTSD) is a precondition for efficient risk assessment and prevention planning. Studies to date have been site- and sample-specific. Towards developing generalizable models of PTSD development and prediction, the International Consortium to Predict PTSD (ICPP) compiled data from 13 longitudinal, acute-care-based PTSD studies performed in six different countries. Objective: The objectives of this study were to describe the ICPP's approach to data pooling and harmonization, and to present cross-study descriptive results informing the longitudinal course of PTSD after acute trauma. Methods: Item-level data from 13 longitudinal studies of adult civilian trauma survivors were collected. Constructs (e.g. PTSD, depression), measures (questions or scales), and time variables (days from trauma) were identified and harmonized, and those with inconsistent coding (e.g. education, lifetime trauma exposure) were recoded. Administered in 11 studies, the Clinician-Administered PTSD Scale (CAPS) emerged as the main measure of PTSD diagnosis and severity. Results: The pooled data set included 6254 subjects (39.9% female). The studies' average retention rate was 87.0% (range 49.1-93.5%). Participants' baseline assessments took place within 2 months of trauma exposure. Follow-up durations ranged from 188 to 1110 days. Reflecting the studies' inclusion criteria, the prevalence of baseline PTSD differed significantly between studies (range 3.1-61.6%), and similar differences were observed at subsequent assessments (4.3-38.2% and 3.8-27.0% for the second and third assessments, respectively). Conclusion: Pooling data from independently collected studies requires careful curation of individual data sets to extract and optimize informative commonalities. However, it is an important step towards developing robust and generalizable prediction models for PTSD, and can yield findings beyond the reach of single studies.
The large longitudinal differences in PTSD prevalence caution against using any individual study to infer trauma outcomes. The multiplicity of instruments used in individual studies emphasizes the need for common data elements in future studies.


Introduction
Post-traumatic stress disorder (PTSD) is the most frequent and best documented psychopathological consequence of traumatic events. PTSD is tenacious, debilitating, and in many cases treatment refractory (Breslau, Peterson, Poisson, Schultz, & Lucia, 2004; Hoskins et al., 2015; Institute of Medicine, 2014; Kessler, 2000; Roberts, Roberts, Jones, & Bisson, 2015; Schnyder et al., 2015; Sijbrandij, Kleiboer, Bisson, Barbui, & Cuijpers, 2015). Early interventions may reduce the prevalence of chronic PTSD among survivors at risk (Kearns, Ressler, Zatzick, & Rothbaum, 2012), but they are resource demanding and effective in only a subset of survivors. The frequent presence of spontaneously remitting early symptoms (Bryant et al., 2015; Galatzer-Levy et al., 2013) makes it difficult to differentiate those at risk for chronic disorder from those who will remit on their own. This, in turn, constitutes a barrier to targeting prevention efforts to those at risk. Improving PTSD prediction is a highly desirable clinical and public health goal.
Towards this goal, significant amounts of acute-care-based, longitudinal data have been collected to date. Studies (reviewed below) have typically annotated the type of traumatic event, participants' symptoms, and information about known PTSD predictors, such as gender, lifetime trauma exposure, prior mental illness, education, and recovery environment (Brewin, Andrews, & Valentine, 2000; Bryant et al., 2012; Freedman et al., 2002; Gabert-Quillen et al., 2012; Koren, Arnon, & Klein, 1999; Macklin et al., 1998; Ozer, Best, Lipsey, & Weiss, 2003), to evaluate the prediction of non-remitting PTSD. These data constitute a viable source for inferring risk estimates across different studies, while reflecting the specific culture and context in which each study was conducted.
Pooled analysis, otherwise known as individual participant data meta-analysis (Debray et al., 2015), is therefore preferable to conventional meta-analysis, a central-tendency-driven quantitative review of published results, in that the latter cannot properly account for cultural and contextual factors, such as sample heterogeneities, collection rules, and the assumptions underlying the original investigators' statistical analyses. Inconsistency across studies in the variables included in predictive models and in analytic techniques may lead to inaccurate estimates and misleading conclusions. For example, a traditional meta-analysis that examines female gender as a risk factor for PTSD may overlook the effect of explicit and hidden covariates in each study (e.g. exclusion of comorbid disorders or prior PTSD), thereby creating unaccounted-for heterogeneity in gender-effect estimates. Pooled data analyses differ from meta-analytic methods in that they rely on studies' raw data and can consequently reveal the full distribution of variables rather than mean results. They further account for the original studies' designs and address data-analytic heterogeneities (Blettner, Sauerbrei, Schlehofer, Scheuchenpflug, & Friedenreich, 1999). Data pooling is increasingly being used to build large data sets from separately collected samples to achieve statistical power and generalizability (e.g. Logue et al., 2015).
Pooling data from different sources, however, requires intense data management, careful identification of constructs and related measures, quality appraisal, and harmonization. Investigators involved in pooling data must carefully consider many important aspects of the studies, including sampling, measurements and assessment schedules, and loss to follow-up. They must also define the dimensions and resolutions within which the 'pooled' data set can be reliably interrogated. The steps to conduct pooled analysis include: (1) defining each pooled study's objectives and inclusion criteria; (2) identifying qualified studies and collecting item-level individual data; (3) harmonizing and merging data from different sources; (4) examining heterogeneity between studies; and (5) analysing pooled data, including sensitivity analyses (Friedenreich, 1993;Smith-Warner et al., 2006).
The International Consortium to Predict PTSD (ICPP) is an effort sponsored by the US National Institute of Mental Health to create a consortium of principal investigators of published and unpublished longitudinal PTSD studies (Bonne et al., 2001; Bryant, Creamer, O'Donnell, Silove, & McFarlane, 2008; deRoon-Cassini, Mancini, Rusch, & Bonanno, 2010; Hepp et al., 2008; Irish et al., 2008; Jenewein, Wittmann, Moergeli, Creutzig, & Schnyder, 2009; Matsuoka et al., 2009a; Mouthaan et al., 2014; Shalev et al., 2000, 2008, 2012; van Zuiden et al., 2017), combine their individual- and item-level data towards carrying out a pooled secondary analysis, and synthesize information about the predictors of PTSD. The ICPP's goal is to pool and harmonize extant data sets so as to inform PTSD pathogenesis and prediction across trauma types, severity, geography, and clinical circumstances. Participating investigators contributed raw data stripped of personal identification information from current and previous studies. Data sets were reviewed, annotated, harmonized, and used to build a common data set. This paper describes the ICPP's approach to pooling PTSD-specific studies, outlines challenges and solutions, describes the data set generated, and presents a descriptive map of longitudinal PTSD research. In the light of our experience, we discuss study-specific and generic aspects of data pooling and analytics.

Data sources
Longitudinal studies tracking the development of PTSD among survivors admitted to acute care centres were identified by a literature review and by contacting researchers active in the field. Studies were eligible for inclusion if they (1) evaluated civilian survivors of a distinct traumatic event, (2) had a baseline assessment shortly after trauma exposure, (3) included at least one consecutive assessment of PTSD and PTSD symptoms using validated instruments, and (4) had individual participant data available for pooling.
We contacted 12 principal investigators, whose longitudinal studies evaluated 7737 recent trauma survivors. Ten investigators (6648 participants) agreed to share their data. Investigators provided preliminary descriptions of their studies, studies' assessment schedules, published results and, when available, codebooks linking data-set items to instruments and measurements. They contributed a total of 16 studies. One study (N = 99) was discarded owing to loss of follow-up data, and two others (N = 168 and 127) were excluded for lack of item-level data. The excluded studies comprised exclusively motor vehicle accident (MVA) survivors (100%, compared with 77.8% MVA survivors in the included studies). Table 1 shows the main features of the 13 studies that were included in the final pool. Two longitudinal studies included early interventions. The Jerusalem Trauma Outreach and Prevention Study (JTOPS) had 296 out of 1996 participants randomly assigned to treatment groups that included cognitive behavioural therapy, escitalopram, and placebo (Shalev et al., 2012). The Amsterdam oxytocin study (van Zuiden et al., 2017) evaluated the preventive effect of oxytocin, and the data included in the ICPP consisted of that study's placebo group. Beyond their general designs, studies used various inclusion and exclusion criteria for recruitment (Appendix 1).

Constructs and related measures
Following studies' acceptance, the original data sets were reviewed, individual items were identified and linked with specific instruments (e.g. rating scales), and the latter were mapped into six overarching psychopathological constructs: PTSD and PTSD symptoms, other Diagnostic and Statistical Manual of Mental Disorders, 4th Edition (DSM-IV) Axis I disorders, acute stress symptoms, depression and anxiety symptoms, substance use disorders, and global functioning (Table 2).

Participants' identities and time anchors
To enable data amalgamation, each participant in each of the original studies was assigned an ICPP global identifier. Data in different formats and languages were translated and transformed into a standard format following the procedure described in Appendix 2. Because data represented in these studies involved repeated assessments, a 'days since trauma' variable was attached to each instrument, representing the exact timing of the instrument's administration relative to the traumatic event. Days since trauma were subsequently used to build a master summary sheet including all instances of repeated evaluation. The global ID and the days since trauma in the summary sheet were then used as unique identifiers to link different instruments assessed at different times.

Notes to Table 1: (a) Studies are listed in chronological order of end year, from earliest to most recent; if two studies ended in the same year, alphabetical order of the principal investigator's last name is used, and if both end year and principal investigator are the same, the study with the larger sample size is listed first. (b) Follow-up rate is defined as the percentage of subjects with at least one follow-up after the baseline assessment, using dates of follow-up as an indicator of being assessed; the Midwest resilience study has missing dates for both follow-up time-points, so its follow-up rate is the percentage of subjects with at least one PTSD assessment after baseline, using the PDS data. (c) Unpublished at the time of data transfer.
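The linkage scheme described above (a global participant ID combined with a days-since-trauma value, used together as a composite key) can be sketched in pandas as follows. The identifiers, instrument names, and scores below are hypothetical, not drawn from the ICPP data set.

```python
import pandas as pd

# Hypothetical fragments of two instrument tables after standardization;
# IDs, days, and scores are illustrative only.
caps = pd.DataFrame({
    "global_id": ["S01-001", "S01-001", "S02-014"],
    "days_since_trauma": [30, 120, 28],
    "caps_total": [54, 31, 12],
})
bdi = pd.DataFrame({
    "global_id": ["S01-001", "S02-014"],
    "days_since_trauma": [30, 28],
    "bdi_total": [18, 5],
})

# The (global_id, days_since_trauma) pair acts as the unique key that links
# instruments administered at the same assessment occasion; an outer merge
# keeps occasions at which only one instrument was administered.
master = caps.merge(bdi, on=["global_id", "days_since_trauma"], how="outer")
master = master.sort_values(["global_id", "days_since_trauma"]).reset_index(drop=True)
```

An outer merge preserves every assessment occasion, leaving missing values where an instrument was not administered, which mirrors the master summary sheet described in the text.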

Demographics and personal variables
Demographics and baseline information are critical data descriptors and potential predictors (Breslau et al., 1998;Brewin et al., 2000;Karstoft et al., 2015). While age and gender could be reliably derived from studies' data sets, other variables were differentially collected, and had to be recoded for use in the common data set, as follows.
Variability in reporting education stemmed from using different coding systems (e.g. years of schooling vs highest level of education completed) and from different education models in different countries. For example, Switzerland and Australia have an apprenticeship system after grade 9 or 10, while the USA and other countries have a high-school system that continues until grade 12. After consultation with the ICPP's international investigators, this diversity was reconciled by reducing 'education' to a dichotomous variable indicating whether a person had finished high school (or the equivalent education level in their country). Marital status was encoded as a dichotomous variable differentiating 'married or cohabiting' from 'unmarried and not cohabiting'.
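As a sketch, the dichotomization of education described above might be implemented as follows. The two coding schemes and their thresholds ('years' of schooling vs an ordinal highest-level code) are illustrative assumptions, not the actual study codebooks.

```python
# Hypothetical harmonization of two education coding schemes into the single
# dichotomous "finished high school (or national equivalent)" variable.
def finished_high_school(value, scheme):
    """Return True if the coded value indicates completed secondary education.

    scheme: 'years' (years of schooling; >= 12 assumed to mean completion)
            or 'level' (ordinal highest level; levels >= 3 assumed to denote
            completed secondary education in this sketch).
    """
    if scheme == "years":
        return value >= 12
    if scheme == "level":
        return value >= 3
    raise ValueError(f"unknown coding scheme: {scheme}")
```

In practice each study's codebook would supply its own mapping; the point is that every scheme is reduced to the same boolean before pooling.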
Several constructs (e.g. PTSD, depression) were consistently measured, whereas others (e.g. coping mechanism, memory) were measured less frequently. Different instruments were used to capture the same construct. Table 2 summarizes the most frequently measured constructs, instruments used, and the frequency of their usage across studies. Other less common instruments were identified; however, these were not considered for pooling at this point. They were inconsistently measured in a minority of studies and therefore were not included in Table 2. As a result, certain baseline predictors known in the literature were not discussed in the current paper (e.g. pain, medication, income, and social support).
Studies typically used four to seven categories of traumatic events, varying from one site to another. For example, the JTOPS study (Shalev et al., 2012) had MVAs, work accidents, physical assaults, terrorist attacks, and other traumas as trauma categories, while the Midwest resilience study (deRoon-Cassini et al., 2010) had MVAs, assaults, gunshot wounds, stabbings, falls, work accidents, household accidents, snowmobile accidents, and object-fell-on-person accidents. Based on epidemiological studies showing a higher conditional prevalence of PTSD following interpersonal trauma (Benjet et al., 2016), and the categorization of the World Mental Health Survey (Karam et al., 2014), we recoded these variables into (1) MVAs, (2) other non-interpersonal trauma (e.g. work or home accidents, falls, and sports accidents), and (3) interpersonal trauma (e.g. assaults, rape and other violence). Inconsistently encoded prior trauma exposures, collected with study-specific instruments, were similarly recoded into categories known to differentially predict PTSD (Karam et al., 2014): 'interpersonal trauma' (e.g. war-related events, physical violence, sexual violence, terror, and kidnapping) and 'all other events'.
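A minimal sketch of this recode, mapping study-specific trauma labels onto the three harmonized classes; the labels and their assignments below are illustrative examples, not the actual ICPP codebook entries.

```python
# Illustrative mapping from heterogeneous study-specific trauma categories
# to the three harmonized classes described in the text.
RECODE = {
    "motor vehicle accident": "MVA",
    "work accident": "other non-interpersonal",
    "household accident": "other non-interpersonal",
    "fall": "other non-interpersonal",
    "assault": "interpersonal",
    "gunshot wound": "interpersonal",
    "terrorist attack": "interpersonal",
}

def harmonize_trauma_type(raw_label):
    """Map a study-specific label to a harmonized category ('unknown' if unmapped)."""
    return RECODE.get(raw_label.strip().lower(), "unknown")
```

Labels that fall outside the mapping are flagged as 'unknown' rather than silently recoded, so that unmapped categories can be reviewed with the original investigators.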
The severity of traumatic events was assessed in some studies, directly or indirectly, through several different measurements, including the Injury Severity Score, Glasgow Coma Score, amnesia, loss of consciousness, Abbreviated Injury Score, ad-hoc severity or exposure levels, length of hospital stay, and pain score. Trauma severity is not included in this paper owing to the lack of consistency across the entire pool. However, it can be used in the future for subsets of studies in which a consistent severity score can be reliably derived.
PTSD diagnostic status was re-determined from these instruments using DSM-IV decision rules. To infer the presence of each PTSD diagnostic criterion we used, for CAPS interviews, the recommended threshold of frequency ≥ 1 and intensity ≥ 2 (Weathers et al., 1999), and for the PSS and PDS an item score ≥ 2 (Foa, 1995; Foa, Cashman, Jaycox, & Perry, 1997; Foa et al., 1993; Foa & Tolin, 2000). We also accepted direct reporting, as some studies recorded symptom criteria as 'present' versus 'absent'.
However, criterion A (exposure to a traumatic event and initial responses) was not explicitly annotated in many studies; we therefore assumed that subjects' trauma exposure was implied by their inclusion in each of the studies.
Symptom duration (≥ 1 month required, criterion E) was inconsistently documented and could not be included in our definition of PTSD. Similarly, PTSD criterion F (clinically significant distress or impairment) was absent from more than half of the studies and thus could not be used to sanction the presence or absence of PTSD. To address this shortcoming, we compared the prevalence of PTSD with and without criterion F. Among 1219 cases meeting criteria B, C, and D across all time-points, only 26 (2.1%) of them did not meet criterion F. We consequently refer to participants who meet PTSD symptom criteria (B, C, and D) as having PTSD.
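The symptom-criterion logic described above can be sketched as follows, using the CAPS rule (frequency ≥ 1 and intensity ≥ 2) to score each of the 17 symptoms as present, and the DSM-IV counts of at least one B (re-experiencing), three C (avoidance/numbing), and two D (hyperarousal) symptoms.

```python
# Sketch of the DSM-IV symptom-criteria decision rule applied to CAPS items.
def caps_symptom_present(frequency, intensity):
    """CAPS rule: a symptom counts as present if frequency >= 1 and intensity >= 2."""
    return frequency >= 1 and intensity >= 2

def meets_symptom_criteria(freq, inten):
    """freq/inten: lists of 17 CAPS frequency and intensity scores, ordered
    as B1-B5 (re-experiencing), C1-C7 (avoidance/numbing), D1-D5 (hyperarousal)."""
    present = [caps_symptom_present(f, i) for f, i in zip(freq, inten)]
    b = sum(present[0:5]) >= 1    # criterion B: at least 1 of 5 symptoms
    c = sum(present[5:12]) >= 3   # criterion C: at least 3 of 7 symptoms
    d = sum(present[12:17]) >= 2  # criterion D: at least 2 of 5 symptoms
    return b and c and d
```

As in the text, criteria A, E, and F are not evaluated here; a participant meeting B, C, and D is counted as having PTSD.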

Study features
The 13 studies included by the ICPP reflected data collected longitudinally in the Netherlands, Switzerland, the USA, Israel, Japan, and Australia. The studies' sample sizes ranged from 50 to 1996. Studies had up to four repeated PTSD assessments, with the time of final assessments ranging between 4 and 36 months after trauma exposure. Studies recruited from a variety of acute trauma settings, most commonly the emergency room, trauma wards, and intensive care units. The average follow-up rate was 87.0% (range 49.1-93.5%) (Table 1).

Baseline characteristics and demographics
Participants included in the ICPP data set (n = 6254) were enrolled between the years 1995 and 2014. Baseline demographic information such as age, gender, marital status, education, and trauma types is provided in Table 3. The participants' mean age was 37.77 years (range 31.25-42.92). The gender distribution was 39.9% female and 60.1% male. The most frequent traumatic event was MVAs (73.8%), followed by other accidents (18.5%) and interpersonal violence (7.7%). Over 60% of individual participants across all studies had experienced another traumatic event before the current trauma.

Sampling heterogeneity
Studies differed in inclusion and exclusion criteria (Appendix 1). The main inclusion criteria related to the seriousness of injury (e.g. a minimum injury severity score), the initial response to the event (e.g. the DSM-IV PTSD A2 criterion), and symptom expression after the event (e.g. a minimum score on a screening instrument). For example, three studies (Bryant et al., 2012; deRoon-Cassini et al., 2010; Jenewein et al., 2009) required a minimum hospital admission length (24-48 hours), and three studies (Bonne et al., 2001; Shalev et al., 2012; van Zuiden et al., 2017) set a minimum threshold or criterion for initial symptoms. These criteria selected for more severely injured or more symptomatic patients. The main exclusion criteria concerned the extent of injury (aiming to exclude patients considered too severely injured to participate, especially patients with head injury), prior mental health problems (aiming to study the onset of new mental health issues), certain event characteristics (e.g. self-harm), and practical or ethical considerations (e.g. incarceration). Given these criteria, each study can be seen as representing a specific subset of the entire trauma population in medical settings.

Assessment time-points
Times of assessments ranged from in-hospital baseline assessment on the day of the traumatic event to 2-3 years later. A total of 175 assessment date variables were found across all studies. Studies differed, however, in their intended assessment timing and actual days since trauma. Some studies had an assessment date input for each instrument or group of instruments (e.g. participants taking different parts of interviews and questionnaires belonging to the same time-point on different days). The days since trauma calculated from these dates were then clustered according to intended time-points from the original studies (e.g. 1 month, 3 months, 1 year). The actual number of days since trauma varied around intended time-points, with some participants seen earlier and many later than scheduled (see Table 4 for details). Figure 1 is a frequency distribution depicting the number of subjects assessed at any given time relative to trauma. The combined data for all studies (except for the Midwest resilience study, owing to a lack of information on assessment dates) are presented in the upper row, and the lower rows show data for the largest individual studies. Figure 1 not only emphasizes the importance of using real time after trauma as a time indicator in pooled data (since data collection periods were often non-overlapping between studies), but also demonstrates that choosing certain time periods in the pooled data may over-sample subjects from certain studies.
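The clustering of actual days since trauma around intended time-points can be sketched as a nearest-match assignment; the three intended time-points used here (roughly 1 month, 3 months, and 1 year) are illustrative, not a particular study's schedule.

```python
# Illustrative intended assessment time-points, in days since trauma.
INTENDED_DAYS = [30, 90, 365]

def nearest_timepoint(days_since_trauma):
    """Assign an assessment to the closest intended time-point (in days)."""
    return min(INTENDED_DAYS, key=lambda t: abs(t - days_since_trauma))
```

Real-time analyses would retain the raw days-since-trauma value alongside the assigned cluster, since, as Figure 1 shows, actual timing often drifts well past the intended schedule.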

Missing observations
ICPP studies had different rates of attrition (Table 1). Defining loss to follow-up as the absence of a follow-up CAPS assessment among those with an initial CAPS (n = 3909), 667 participants (17%) were lost and 3242 (83%) retained. Participants lost to follow-up did not differ from those retained on baseline CAPS total scores (27.5 ± 25.0 vs 28.6 ± 25.8, respectively; Mann-Whitney U-test p = 0.46). Nonetheless, future data analyses should assess the nature of loss to follow-up for specific time periods (e.g. 6-9 months, 1-1.5 years) and constructs of interest (e.g. depression, lifetime trauma), and devise ways to mitigate potential sampling bias (e.g. via selective imputation).
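The attrition comparison reported above can be reproduced in outline with a Mann-Whitney U test on synthetic data. The simulated scores below only loosely echo the reported means and SDs and are for illustration, not a reanalysis; the test uses the standard normal approximation without tie correction.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Synthetic baseline CAPS totals; group sizes and the means/SDs (27.5 +/- 25.0
# vs 28.6 +/- 25.8) loosely echo the reported values but the data are simulated.
lost = rng.normal(27.5, 25.0, size=300).clip(0)
retained = rng.normal(28.6, 25.8, size=1500).clip(0)

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation.

    No tie correction is applied; ties are rare for continuous simulated scores.
    """
    a, b = np.asarray(a), np.asarray(b)
    n1, n2 = len(a), len(b)
    combined = np.concatenate([a, b])
    order = combined.argsort()
    ranks = np.empty(len(combined))
    ranks[order] = np.arange(1, len(combined) + 1)  # 1-based ranks
    u1 = ranks[:n1].sum() - n1 * (n1 + 1) / 2       # U statistic for sample a
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))            # two-sided p-value
    return u1, p

u, p = mann_whitney_u(lost, retained)
# A large p-value would indicate no detectable baseline difference between
# participants lost to follow-up and those retained.
```

In applied work a library implementation (e.g. `scipy.stats.mannwhitneyu`) would normally be used; the hand-rolled version here just makes the rank-sum logic explicit.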

Potential predictors of PTSD
The current work harmonized data for the most consistently measured potential predictors for PTSD. Pre-trauma predictors included gender, education, age, marital status, and prior trauma exposure; peri-and post-trauma predictors include trauma type and acute stress responses. Other risk factors on which fewer studies collected data are not included in the current paper (such as income, pain, and social support); however, these could be analysed in a smaller subset of samples using the same data-processing methods.

PTSD prevalence
Reflecting studies' inclusion rules, the prevalence of PTSD varied considerably between studies (3.1-61.6% at baseline, 4.3-38.2% at the second assessment, and 3.8-27.0% at the third) (Table 5). For example, the study with the highest prevalence purposefully recruited participants with a high likelihood of endpoint PTSD, aiming to evaluate the effect of early interventions (Shalev et al., 2012), whereas other studies recruited liberally among acute care admissions (Wittmann, Moergeli, Martin-Soelch, Znoj, & Schnyder, 2008) without screening for trauma severity or initial symptoms.

Discussion
The ICPP brought together the largest pooled longitudinal data set of adult civilian trauma survivors to date, containing extensive item-level information on 6254 individuals from 13 studies performed in six different countries. Conceptually, this project hinges on the assumption that PTSD is a robust construct, universally applicable, and therefore amenable to generalization across samples, measures, and data-collection routines (as long as these variations are accounted for). The ICPP effort implies, therefore, that data collected under different circumstances represent subsets of a generic PTSD population. Similar assumptions underlie all current data-pooling enterprises, from genetic consortia such as the Psychiatric Genomics Consortium (http://www.med.unc.edu/pgc/) to the National Institutes of Health data depositories (https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html). As such, the ICPP effort, described above, illustrates generic dilemmas and choices made across similar enterprises. The following discussion addresses the specific data features of the ICPP, our approach to pooling and the consequent decisions, descriptive results, limitations, and implications for future studies.

Sources of data
Studies in general used analogous recruitment and follow-up templates, adopted similar instruments, and used a common, long-term PTSD outcome. However, participating studies sampled different communities (e.g. communities with differing rates of violent crime; deRoon-Cassini et al., 2010; Schnyder, Wittmann, Friedrich-Perez, Hepp, & Moergeli, 2008) and applied study-specific inclusion/exclusion criteria (e.g. initial PTSD symptom severity, injury severity, and history of present and past mental disorders) (Appendix 1). Pooling across such differences can be seen as an opportunity to create a sample that is more inclusive, more informative of the general link between trauma exposure and PTSD, and less influenced by specific studies' selection criteria.
Table 4. Intended assessment time-points and actual days since trauma (median ± SD) for different studies in the International Consortium to Predict PTSD (ICPP).

Addressing time from trauma
The unique challenge in pooling longitudinal data was to organize the data temporally. Assessment schedules differed from study to study and may have been based on theoretical considerations or sampling convenience. Converting study-defined time-points into real time based on interview dates is the optimal solution to this problem: it allowed us to choose specific time frames after trauma and study the predictive value of factors proximal to trauma on distal outcomes, regardless of study-specific schedules.

Follow-up timing, duration, and attrition
DSM-IV defined 'chronic PTSD' as persisting for more than 3 months. Participating studies' follow-up assessments extended from 4 months (Shalev et al., 2000, 2008) to over 2 years (Shalev et al., 2012), and therefore match the DSM threshold. PTSD persisting beyond a certain time period may become permanent (Marmar et al., 2015): a 6-year follow-up study showed that more than half of the patients meeting PTSD criteria 12 months after trauma continued to meet the diagnosis at the 6-year follow-up (O'Donnell et al., 2016). As such, the time needed to infer PTSD 'chronicity' is an important open question, which the ICPP data will allow us to approach given the availability of a 'days since trauma' measure with a wide range of follow-up times. The choice of specific time ranges will be elaborated in the relevant individual papers reporting these analyses.

Harmonizing across instruments
One of the major challenges was organizing a great variety of instrumentation. Apart from the common measurements, studies used various instruments to measure different risk factors and different outcome domains. Finding common items across different instruments measuring the same construct can maximize obtainable information without requiring exact equivalence. For example, the 17 PTSD symptoms were essential in providing the PTSD diagnosis as a major outcome measure in all studies. Nevertheless, higher sensitivity and lower specificity of the PSS and PDS compared to the CAPS have been reported (Foa & Tolin, 2000; Griffin, Uhlmansiek, Resick, & Mechanic, 2004), and symptoms may be over- or under-reported in different administration contexts (in-person vs telephone interviews) (Aziz & Kenford, 2004).
Harmonizing common items with different wording or scaling usually implies reducing the number of categories to the smallest common denominator, often to dichotomization. Although the resolution of the information is diminished, this is a simple way to maximize the number of studies that can be analysed. Researchers from a previous pooled analysis [the PTSD after Acute Child Trauma (PACT) data archive] reported similar issues and solutions (Kassam-Adams, Palmieri, Kenardy, & Delahanty, 2011; Kassam-Adams et al., 2012). When different instruments were used to measure the same disorder, population effects made it challenging to reach comparable results across the whole sample. The cut-off score used to distinguish psychopathology on a given scale can vary across situations and populations (Beekman et al., 1997; Brennan, Worrall-Davies, McMillan, Gilbody, & House, 2010; Cheng & Chan, 2005; Dozois, Dobson, & Ahnberg, 1998; Geisser, Roth, & Robinson, 1997; Hinz & Brähler, 2011; Hiroe et al., 2005; Kugaya, Akechi, Okuyama, Okamura, & Uchitomi, 1998; Lasa, Ayuso-Mateos, Vázquez-Barquero, Díez-Manrique, & Dowrick, 2000; Matsudaira et al., 2009; Wada et al., 2007). Therefore, we could not reliably derive the same diagnosis from different scales in different study populations. Using item response theory to link items from separate instruments to a common scale (Reise & Waller, 2009) may be an approach worth exploring as a next step. This paper has not included all measures found in the participating studies, especially regarding predictors (e.g. social support, medical/psychiatric history). These parameters were measured in some studies, mostly with study-specific questions, and require further data processing in the relevant studies before they can be pooled.

PTSD outcomes
The project demonstrated that rates of PTSD outcomes varied considerably between studies even when study methods were grossly similar. Since each sample was selected according to a series of study-specific criteria, the disparity in PTSD prevalence between studies may result from multiple sample-level factors; inferring the prevalence of PTSD after traumatic events in hospital settings from any single cohort study is therefore unreliable. Initial symptom severity and prior mental illness can be strong risk factors for later PTSD (Brewin et al., 2000; Ozer et al., 2003). Furthermore, epidemiological studies report a higher prevalence of PTSD in women, in people living in areas with high community violence, and in victims of interpersonal trauma (Breslau, Chilcoat, Kessler, & Davis, 1999; Breslau et al., 1998; Goldmann et al., 2011; McLean, Asnaani, Litz, & Hofmann, 2011). These population factors, as well as sample sizes and assessment timing, may all have contributed to the differences in PTSD prevalence between studies (Matsuoka, Nishi, Yonemoto, Nakajima, & Kim, 2009b; O'Donnell, Creamer, Bryant, Schnyder, & Shalev, 2003). Pooling these studies can minimize the effect of study-specific selection and yield results generalizable to the entire trauma population in first-responder settings.

Early versus prolonged PTSD
The current literature has not established an optimal time for assessing acute or chronic PTSD. One month of symptom duration is required to meet the DSM criteria for PTSD. However, some researchers evaluate PTSD symptoms much earlier, without necessarily assigning a diagnosis, owing to the difficulty of reaching individuals after their discharge from hospital (Bonne et al., 2001; Bryant et al., 2008; Hepp et al., 2008). Notably, some PTSD symptoms (e.g. insomnia, avoidance) may not manifest because of medication or the hospital setting itself. We could not directly observe the impact of early versus late baseline assessment in our results because the time effect is confounded with the study population effect, as studies adopted different baseline schedules.

Limitations and boundaries to generalization
The ICPP selected studies only from first-responder medical settings, mainly because these settings best capture people with a distinct, single traumatic event early on. In addition, acute care centres and emergency rooms receive large numbers of potential trauma survivors (Tusche, Smallwood, Bernhardt, & Singer, 2014), and their traumatic events are relatively well documented. Since we did not include studies of people with chronic or repetitive trauma, such as refugee or family violence studies, or military and war-zone studies, the results of this study may not be generalizable to these trauma populations. Nevertheless, the methods explored in this study may inform studies in other settings. Another limitation is that all PTSD measures follow the DSM-IV/International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10) system. DSM-5 has additional symptom criteria that were not measured in the original studies. Future work might substantiate the effects and implications of shifting diagnostic templates (Hoge, Riviere, Wilk, Herrell, & Weathers, 2014).

Implication for future studies
Publicly funded projects are moving towards an era of data sharing, and the ability to use shared data effectively can profoundly deepen the understanding of individual research findings. Although pooling individual data is much more time-consuming than pooling summary results, it yields a harmonized data set that can subsequently be used to calculate predictive probabilities, discover symptom trajectories, and conduct analyses that were not part of the original studies (e.g. random forest). The lessons we learned from pooling ICPP data carry several implications for future efforts. First, when planning prospective studies, researchers should anticipate the future pooling of their data and facilitate the process by documenting study-specific features in detail. Using common data elements, such as those in the PhenX Toolkit (Hamilton et al., 2011), is highly recommended; this practice will ease future pooled analyses and maximize the amount of poolable information. Secondly, data processing and quality control require careful planning and laborious work, and pooled data projects need sufficient resources allocated to these areas. Most importantly, future studies should be encouraged to produce more accessible data and to support more efficient and informative large-scale analyses. The ICPP's work advances the field of PTSD prediction by strengthening the empirical database and opening gateways to more robust clinical prediction tools leading to targeted intervention.

Conclusions
As this work has shown, pooling different data sets at the item level is an important step towards more robust and generalizable findings. Although pooling longitudinal data is far from simple and requires content expertise, it supports innovative data analyses that might not have been conducted in the original studies and yields results from a heterogeneous population that is less constricted by any single study's selection criteria. The crucial question in pooling is how to acquire data from various sources and maximize the obtainable information. Few variables are uniformly annotated across studies; most require informed decisions about their harmonization and formatting in the pooled data set. In this work, the relative homogeneity and clarity of a few reliably poolable variables (e.g. gender, age, PTSD severity and status, time from trauma) constituted a precondition for pooling. Many variables measuring the same constructs (e.g. trauma type, education, marital status, prior trauma) could be recoded and organized, and ultimately formed an informative ensemble. Where several instruments were used to capture a certain construct (e.g. depression), using that information required cross-instrument harmonization, which constrains the depth of information available for subsequent analyses. Ultimately, much of the present work amounts to defining the resolution at which a data set can be questioned, and has implications for future studies pooling item-level data from data repositories.
Appendix 2. Data-Set Creation, Cleaning and Quality Assurance Procedures of the International Consortium to Predict PTSD (ICPP)

Overview
The International Consortium to Predict PTSD (ICPP) seeks to develop versatile tools for predicting post-traumatic stress disorder (PTSD) by pooling longitudinal data sets measuring early trauma aftermath. To date, the ICPP has included original data sets from 13 longitudinal studies conducted between 1998 and 2014. These studies were conducted in a variety of nations across several continents using different languages. In these studies, primary investigators obtained a baseline assessment shortly after trauma exposure, followed respondents for at least 1 month, assessed both PTSD diagnosis and the severity of PTSD symptoms, and provided item-level data. A wide variety of instruments evaluated an extensive array of trauma-related outcomes.
Participants were assessed at different time-points: in the emergency room, and at 1 week, 1 month, 3 months, 4 months, 5 months, 6 months, 9 months, 12 months, 18 months, 2 years, and 3 years after trauma. Each study adopted two to five of these time-points, with some overlap across studies. We examined the assessment dates to calculate the true time from trauma, in order to standardize the indicator of follow-up time.
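The standardization described above reduces to simple date arithmetic. The sketch below is a minimal Python illustration, not the consortium's actual SAS code; the function name and example dates are hypothetical.

```python
from datetime import date


def days_since_trauma(trauma_date: date, interview_date: date) -> int:
    """Days elapsed between the traumatic event and an assessment.

    This standardized count replaces study-specific time-point labels
    (e.g. '3-month follow-up') with a directly comparable quantity.
    """
    return (interview_date - trauma_date).days


# Hypothetical record: trauma on 1 March, follow-up interview on 1 June.
dst = days_since_trauma(date(2005, 3, 1), date(2005, 6, 1))  # 92 days
```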
Our goal in this project is to construct standardized item-level data sets that will be of significant value when used in analyses leading towards advances in predicting and treating PTSD.
We created a demographic sheet covering all subjects from the original data files, including their age, gender, education, marital status, trauma type, and prior trauma information. We also created a data summary sheet listing the number of assessments for each instrument and the days since trauma at each assessment point. These two documents serve as the anchor for merging the different instruments once they are cleaned and processed into a standardized format (Figure A1).

Original data
Data were received in a wide variety of formats, including SPSS files, Excel spreadsheets, and Access databases. Our staff used the SAS language to process all data, with the ultimate outcome being labelled SPSS files, or any other format that statisticians requested (e.g. CSV).
Most of the data sets contained all data for a subject in one record (wide form), i.e. data points for all interviews at every time-point, together with all dates and demographic information. In some studies, however, data were delivered in a number of separate files, each containing one time-point or one or several instruments. Moreover, variable names for a given instrument were not consistent across studies or across time-points.

Classifying variables
This is the first step towards understanding and processing raw data from original files. A research associate and a post-doctoral student looked through all variables in each original data set, identified which instrument each variable belonged to, and classified instruments into different categories for sorting. Figure A2 shows an example of how variables were classified in a study.

Data dictionaries
Before a data set can be created, a data dictionary must be created for the corresponding instrument. Working in conjunction with the owner of each data set, we ascertained the names and time-points of the variables for each instrument in a study's data set. We used our own naming convention for the data points in each instrument, with each variable name consisting of between six and eight characters. The variable name combines the acronym for the instrument with the sequential number of the variable in the instrument. For example, the first variable in the Hospital Anxiety and Depression Scale data set is named HADS01.
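The naming convention can be expressed as a one-line rule. The following Python sketch is illustrative only (the actual renaming was done in SAS); the function name is hypothetical.

```python
def standard_name(acronym: str, item_number: int) -> str:
    """Build a standardized variable name from the instrument acronym
    and the zero-padded sequential number of the item, e.g. HADS01."""
    return f"{acronym}{item_number:02d}"
```

Applied across studies, this maps every study's idiosyncratic variable names for a given instrument onto one shared vocabulary, so the per-study data sets can later be stacked.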
The data dictionary contains useful information pertaining to each variable.

Figure A1. Data processing procedures.

SAS programs
With the data dictionary providing a blueprint, a unique SAS program was written to transform all the data for an instrument in a study into a cohesive data set. The data set contains one record for each subject at every time-point for which data were collected for the instrument. If no responses were elicited for the instrument from a subject on any given interview date, the resulting blank record was not included in the final data set.
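The rule for excluding blank records can be sketched in a few lines. This is a Python illustration of the logic, not the SAS programs themselves; the record layout and field names are hypothetical.

```python
def drop_blank_records(records, item_fields):
    """Keep only records with at least one non-missing response among
    the instrument's items; a record that is blank for every item
    (e.g. an interview date with no responses) is excluded."""
    return [r for r in records
            if any(r.get(field) is not None for field in item_fields)]


# Hypothetical two-item instrument: subject 2 gave no responses at all.
rows = [{"id": 1, "q1": 3, "q2": None},
        {"id": 2, "q1": None, "q2": None}]
kept = drop_blank_records(rows, ["q1", "q2"])
```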

Master Summary sheet
The Master Summary sheet is the most important 'road map' guiding the creation of the final data sets for analysis. It follows the long form: the number of records per subject equals the number of time-points at which the subject was seen. We structured the data sets from each instrument to contain a group of standard variables, including a subject identifier that is unique for every subject in the database, the date of the traumatic event, the date of the interview, and a calculated variable giving the number of days elapsed since the traumatic event at the time of each interview. There is also an interview time-point description variable that fits the time-point into a range of time (e.g. baseline, < 1 month, 1–3 months, 3–6 months); this time description is consistent across studies. The Midwest study only provided the date of the first interview and the days since trauma at that interview. We back-calculated the trauma date accordingly, and added 30, 90, and 180 days to the first interview date to generate subsequent days-since-trauma values. For this study, days since trauma serve only as indicators to facilitate merging and do not represent the real time since trauma.
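The two calculations above, binning days since trauma into a common time-point description and reconstructing the Midwest study's schedule, can be sketched as follows. This is an illustrative Python version; the exact bin boundaries are assumptions, and the function names are hypothetical.

```python
from datetime import date, timedelta


def timepoint_label(dst: int) -> str:
    """Map days since trauma to a coarse, study-independent label.
    Cut points here are illustrative; the real bins follow the
    Master Summary sheet conventions."""
    if dst < 31:
        return "< 1 month"
    if dst < 91:
        return "1-3 months"
    if dst < 181:
        return "3-6 months"
    return "> 6 months"


def impute_midwest_schedule(first_interview: date, dst_at_first: int):
    """Back-calculate the trauma date from the first interview's DST,
    and generate nominal later interview dates at +30, +90, and +180
    days, as described for the Midwest study."""
    trauma_date = first_interview - timedelta(days=dst_at_first)
    later_dates = [first_interview + timedelta(days=d) for d in (30, 90, 180)]
    return trauma_date, later_dates
```

As the text notes, the imputed Midwest dates are merge keys only, not measurements of real elapsed time.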

Demographics sheet
The demographics sheet contains age, gender, level of formal education, marital status, and other available baseline information, such as trauma type and prior trauma, when available. Each subject has only one record in the data set (wide form). This file contains all subjects with any data from all original data files collected. It provides information on the number of subjects and some essential baseline information as potential predictors.

Standard instrument data
Standard data sets were created for an instrument for each study that contained data for that instrument. Data were cleaned and then given a thorough quality assurance check.
A merged instrument data set can be created containing all records from all standard data sets with this instrument.
Using the Clinician Administered PTSD Scale (CAPS) data set as an example, it was then possible to merge all variables into each record. Our master CAPS data set contains 3952 subjects and almost 10,000 records.

Cleaning data
We aim to clean data as quickly and accurately as possible from the moment a data set is created, and any cleaning actions are thoroughly documented. A good example of our data cleaning effort is the days since traumatic event (DST) variable. We first correct data-entry errors. The interview date should never precede the date of the traumatic event; in such instances, the DST value is negative. We look for such values and quickly determine whether a reasonable assumption can be made about the error. In the majority of such cases, the interview occurred shortly after the beginning of a new calendar year and the year of the interview date was inadvertently recorded as the prior year. These errors are then corrected in the data-set creation program. When no such data-entry assumption can be made, we write back to the data owners and ask them to check their original records, such as paper logs. When those records are not available, we set the variable to missing.
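The new-year correction just described can be sketched as a small check-and-fix routine. This Python version is illustrative only (the real corrections live in the SAS data-set creation programs); the function name is hypothetical, and it ignores the rare 29 February edge case that `replace(year=...)` cannot handle.

```python
from datetime import date
from typing import Optional, Tuple


def correct_negative_dst(trauma_date: date,
                         interview_date: date) -> Tuple[Optional[date], Optional[int]]:
    """If the recorded interview date precedes the trauma date (negative
    DST), try the common new-year fix of advancing the interview year by
    one. Return (None, None) when no safe correction exists, so the value
    can be set to missing or referred back to the data owner."""
    dst = (interview_date - trauma_date).days
    if dst >= 0:
        return interview_date, dst  # nothing to fix
    corrected = interview_date.replace(year=interview_date.year + 1)
    dst_corrected = (corrected - trauma_date).days
    if dst_corrected >= 0:
        return corrected, dst_corrected
    return None, None
```

A real pipeline would additionally log every correction, as the text emphasizes that all cleaning actions are documented.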
For data points that have a set of acceptable, valid responses, we write out-of-range data checks in our data cleaning programs to find values that do not match up with a corresponding valid response description. For these records, we will contact the owner of that study's data to see whether an acceptable explanation can be provided for the out-of-range value. If so, we are then usually able to programmatically map the data point to a valid response.
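An out-of-range check of this kind amounts to comparing each response against the instrument's set of valid values. The sketch below is an illustrative Python version of that logic; the function and field names are hypothetical.

```python
def out_of_range_values(records, field, valid_values):
    """Return (record index, value) pairs where the response falls
    outside the instrument's set of acceptable values. Missing values
    (None) are ignored; they are handled separately."""
    valid = set(valid_values)
    return [(i, r[field]) for i, r in enumerate(records)
            if r.get(field) is not None and r[field] not in valid]


# Hypothetical 0-4 response scale: the second record holds an invalid 9.
rows = [{"q1": 2}, {"q1": 9}, {"q1": None}]
flagged = out_of_range_values(rows, "q1", range(0, 5))
```

Flagged pairs would then be sent back to the study's data owner, and, given an acceptable explanation, mapped programmatically to a valid response.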

Quality assurance
Our data quality-assurance process begins with the transformation of the original data set into a usable file for processing, and does not end until the final data sets are as clean and accurate as possible. During the development and data-set creation processes, our programmers continuously browse the intermediate files they work with, looking for any potential problems in the data.
The data sets can be compared to living entities: they are created, often grow and expand as they are used in analysis, and sometimes have additional/derived variables and calculated scores merged into their records. At each point along the way, statisticians, research coordinators, and programmers look closely at the files with which they work to guarantee that their quality is of the highest standard. Potential problems are discussed as a group, and solutions are sought with input from everyone involved.
We continue to receive data from additional studies and researchers, and continue to enlarge our longitudinal PTSD database following the above procedures.