DATASETS AND

Because of their efficiency and ability to keep many other factors constant, twin studies have a special appeal for investigators. Just as with any teaching dataset, a “matched-sets” dataset used to illustrate a statistical model should be compelling, still relevant, and valid. Indeed, such a “model dataset” should meet the same tests for worthiness that news organization editors impose on their journalists: are the data new? Are they true? Do they matter? This article introduces and shares a twin dataset that meets, to a large extent, these criteria. In fact, while more than two decades old, the data are still widely cited today in ongoing related research. This dataset was the basis of a clever study that confirmed an inspired hunch, changed the way pregnancies in HIV-positive mothers are managed, and led to reductions in the rates of maternal-to-child transmission of HIV.


Introduction
Because of their efficiency and ability to keep many other factors constant, twin studies have a special appeal for students, teachers, and investigators. Since it is difficult to assemble a sufficient number of twin pairs in a short time, investigators often use a "next best" design, namely, pairs or sets of siblings, neighbors, etc., matched on key potential confounding factors. Variables that cannot be matched on are usually handled using conditional (i.e., within-set) regression adjustments.
The infertility dataset (infert) in the R datasets package is often used along with the clogit function in the survival package to illustrate conditional logistic regression. The dataset contains data from a matched case-control study (Trichopoulos et al. 1976) "dating from before the availability" of this type of regression. It is of little surprise, therefore, that current students see these data as somewhat outdated. And there is another reason the infertility dataset is a less-than-ideal teaching example: conclusions based on these data have since been refuted.
On the basis of these data, together with similar case-control data from a tightly matched study (Panayotou et al. 1972), it was suggested that about 45% of the cases of secondary infertility in Greece, and half of ectopic implantations in Athens, "may be attributable to previous induced abortions." It is important to note that these conclusions relied on self-reported abortion histories. Investigators would eventually become aware of the dangers of relying on self-reported abortion histories and of the need to conduct analyses appropriately; see, for example, Melbye et al. (1997) and Rookus and van Leeuwen (1996).
Unfortunately, it was not until the 1990s that, with better data and improved methods, Trichopoulos et al.'s (1976) findings were "considered as an outlier by most investigators, who were inclined to attribute it to chance or to the fact that induced abortions were, at the time, illegal in Greece," Tzonou et al. (1993), and the relation between induced abortion and ectopic pregnancy was put into question, Atrash et al. (1997). Despite this objective evidence, the infertility dataset continues to be used as a "model" dataset for teaching in textbooks, tutorials, and in the classroom; see, for example, Jewell (2003), Günther and Fritsch (2010), Aragon et al. (2010), and Kerns (2011).
In this article, we provide a teaching dataset containing twin data that are relatively new, that still appear to be true, and that clearly matter. In fact, while more than two decades old, the data continue to be widely cited today in ongoing related research; see, for example, Makunyane, Moodley, and Titus (2016), Lion-Cachet (2016), Milligan and Overbaugh (2014), and Frange and Blanche (2014). We have used these twin data both in introductory courses, for simple comparisons involving paired binary outcomes, and in regression courses that deal with correlated binary outcomes.
[…] and mother-to-child HIV transmission, Katz (2003). It is no exaggeration to say that, as a result of this dataset, standard practices for the care of pregnant women with HIV promptly changed and, as a consequence, many lives were saved. In 1991, analysis based on the data suggested that "caesarean delivery may be helpful" in reducing rates of transmission, a conclusion that would be confirmed in a 1999 meta-analysis (International Perinatal HIV Group 1999). Besides containing 24 lines of data that themselves make an ideal teaching dataset, Table 3 in the 1999 meta-analysis (International Perinatal HIV Group 1999) is a testament to the impact of following up on an anecdote involving just one pair of twins. Only a few years later, following the widespread use of the drug zidovudine in 1994, rates of perinatal HIV in industrialized countries dropped dramatically. In developing countries, attention focused on identifying effective, simple, and affordable interventions, McGowan and Shah (2000). (Challenges remained, as not all proposed interventions were found to be successful. Biggar et al. (1996) examined the efficacy of birth canal washing for reducing perinatal transmission in Malawi and found no significant reductions.)
In a March 1993 interview for the "In Their Own Words" project, which documents the work of researchers at the National Institutes of Health (NIH) during the early days of the HIV/AIDS epidemic (Harden and Rodrigues 1993), Dr. James Goedert, M.D., NIH medical researcher, recalled how the registry originated:
I was invited by the Pediatric AIDS Foundation to go to a meeting in California on risk factors for pediatric AIDS and different kinds of immune response.
[…] Because of the time difference between California and the East Coast, the first morning I was there at a hotel on the beach I woke up very early. I took a long walk on the beach. I was trying to put together the things I had heard from the day before and make some kind of sense out of them. It was a good time for thinking; the sunrise was not quite over the Pacific, but coming over the mountains in Santa Barbara.
There had been one anecdote of twins being born that someone [Dr. Arye Rubenstein] had mentioned, where one had been infected and the other had not. It seemed to me a possible way to distinguish when the infection occurs; whether it has already occurred before the woman goes into labor, in which case the chance of infection should be random for the first-born and the second-born twin; or whether, as I suspected, again wrongly, the virus was transmitted during separation of the placenta, in which case the second-born should be at higher risk because they are in the womb longer, they are exposed for a longer time.
After Dr. Goedert assembled a group of collaborators with whom to pursue his idea, invitations to submit data were initially sent, in late 1990, to 154 investigators of pediatric AIDS around the world. Contributors, including pediatricians, obstetricians, and infectious disease specialists, were asked to provide demographic, clinical, and epidemiological data on sets of HIV-1-infected women and their twin or triplet offspring. Data were abstracted from clinical records and submitted on standardized forms, with follow-up forms sent out for any required additional information. All the data obtained were anonymous, with no names or identifying information. The definition used for HIV status was based on the Centers for Disease Control classification system; further details in  and CDC (1987).
Dr. Goedert recalled his excitement when the data first started arriving through the FAX machine: "Every day was like Christmas. I would come in here and there would be a FAX waiting for me, or maybe two, or three, or four, with more data." Indeed, the FAX machine was indispensable: "It is the one study I have worked on in which 99 percent of it was conducted by FAX machine.
[…] Without the FAX machine that study would never have been done. We literally faxed out invitation letters with forms, and people faxed the forms back in." Results from analysis using the registry data were first published in a poster presented at the VIIth International Conference on AIDS in June 1991 in Florence, Italy (Goedert, Duliège, and Amos 1991). This first report, based on 82 sets of twins and 1 set of triplets (50 of which had "complete data"), served to welcome new contributors to the registry.
In December 1991, with 101 sets obtained, analysis results were published in the Lancet. This milestone article presented strong evidence against the null (p-value = 0.004), suggesting a higher risk of HIV-1 infection for first-born twins. The study was among the first to suggest the link between mode of delivery and perinatal transmission of HIV. As a result, it was highly cited and even reported in the New York Times, Warren (1992).
There was, however, some criticism. In response to the Lancet article, Dr. Marc Bulterys and colleagues at the National University of Rwanda suggested that the results may be due to selection bias, Bulterys et al. (1992): "The much higher risk of HIV-1 infection among first-born twins […] was apparent only among twins identified because of an HIV-1 related illness in at least one twin and not among twins identified because the mother was known to be infected. Selection bias would occur if second-born twins who were HIV-1 infected were more likely to die early and thus would be excluded from the analysis." In a response, Goedert wrote: "We hope that [more] prospectively collected data […] will be contributed to the registry."
Helpful Hint: Use this criticism as an opportunity to discuss the subtleties of selection bias. Ask students why selection bias could occur with the registry data that were not collected prospectively. Remind students that second-born twins are usually lighter than first-borns and thus their infant mortality is higher.
Data continued to be collected, and in June 1993, an updated analysis was published based on 147 sets of twins and 2 sets of triplets, Goedert et al. (1993). By December 1, 1993, the registry included 203 contributed sets from collaborators in 14 different countries. These included 148 sets with "complete data," including 115 sets of twins ascertained prospectively. Courtesy of Dr. Goedert and the computer programmer Myhanh Dotrang, who shared it with us in 2011, this is the hivtwins dataset now being made publicly available for teaching purposes. Results based on the analysis of the prospectively identified twins were published in the Journal of Pediatrics, Duliège et al. (1995). The analysis confirmed the higher risk of HIV infection among first-born twins. What is more, the research concluded that intrapartum transmission is likely responsible for the majority of pediatric HIV infections, and that reducing exposure to HIV in the birth canal could reduce transmission of the virus between mother and infant: "the protective effect of cesarean delivery is real and clinically meaningful." For work on the International Registry of HIV-Exposed Twins, Dr. Goedert received the Public Health Service Outstanding Service Medal and the International AIDS Society 1992 International Life Prize. While the original Lancet study has today been rightfully overshadowed by the effectiveness of antiretroviral therapy, it remains an important example of the merits of using twin pairs to gain initial insights, an approach used throughout Dr. Goedert's career; see, for example, Cozen et al. (2013).

The Data
The hivtwins dataset includes 201 sets of twins and 2 sets of triplets. As such, two sets of twins from the registry are missing in the current hivtwins dataset. Thirty-one variables are included, including "in94paper," which identifies the 115 sets with complete data that were ascertained prospectively (by follow-up of infants born to HIV-infected mothers).
While the particular measures taken by the NIH with respect to confidentiality and informed consent are difficult to ascertain today, students should certainly not be given the impression that these matters can be simply overlooked. Neff (2008) provided a simple and straightforward discussion of these matters with regard to chart review, case reports, and observational studies.
For the current release of the data, we altered two variables to reduce (even further) the risk of anyone identifying any of the mother-infant pairs. The data dictionary we received from Dr. Goedert contained the codes (001-235) that identified the physicians (and the cities they practiced in) who "contributed" the information. We have left the codes in the dataset, but have removed the names of the corresponding physicians and cities. Some students or teachers may want to make a point of checking whether any physician contributed more than one mother-twin/triplet set, and what the statistical implications would be if they did. The original dictionary also identified the country of origin for each observation. We have replaced the 17 countries with the continents. This further anonymized the data while maintaining the possibility for analysts to reproduce published results.
The variables of primary interest in the dataset are: the HIV status of each infant, the delivery method of each infant, and an indicator of whether or not the twins/triplets were identical. Information about the mother is also available and includes: the mother's race, the number of weeks pregnant, the mother's AIDS diagnosis, and her potential HIV risk factors. Finally, the data include variables of birth weight, birth length, and head circumference measurements for the majority of the infants as well as a binary indicator of whether or not the child was breastfed. Table 1, described in the next section, provides a summary of the data. Figure 1, a compass plot, shows the proportion of 1st and 2nd born infants with positive HIV-1 status given delivery method.

Replication of the Published Results
With the hivtwins dataset we can easily reproduce the table "Factors for mother-to-infant transmission of HIV-1 infection in 115 prospectively identified twin sets" as published in Duliège et al. (1995); see Table 1. The code to reproduce this table is provided in the Appendix, available in the online supplementary information. Only two differences from the originally published table will be apparent. First, weight measurements are missing for two sets of twins with "same" birth weight. Second, the variable to determine the presence of an HIV-infected sibling is missing from hivtwins. Also of note, "Mother's route of infection-Sexual/other" could be calculated differently depending on how one categorizes missing data (i.e., entries of "NA," "Unknown").
Potential Pitfall: You may wish to ask students to reproduce the Table. Try not to get distracted by small differences that may occur due to minor choices such as how to calculate birth weight concordance and how to categorize missing entries.

Simple Analyses, and Discussion of Effect Measures, Suitable for Introductory Courses
The display and analysis of paired binary responses can be challenging. This is particularly so if the sampling is based on outcome, as in case-control studies, but even if it is based on exposure, it is still not that simple. In Table 2 […]
In order to answer the more interesting question, whether or not the infection rate is the same among first and second borns, we can employ McNemar's test. McNemar's test determines whether the marginal proportions are significantly different. In this case, is 13% (= 15/115) significantly different from 26% (= 30/115)? To answer this, we need only consider the discordant twins (i.e., 4 and 19); the 11 cases in which both the first and second born are infected do not provide additional information. With a continuity correction, we obtain $Q = (|4 - 19| - 1)^2 / (4 + 19) = 8.52$. Under the null, $Q$ follows a chi-squared distribution with 1 degree of freedom. Therefore, we have a p-value of 0.0035, indicating strong evidence against the null. Fay (2015) provided a good explanation of McNemar's test for both twin data and case-control data with accompanying R code.
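The arithmetic of the test can be checked in base R. The sketch below is a minimal illustration rather than code from the Appendix: it rebuilds the 2 × 2 table of paired outcomes from the counts quoted in the text (11 pairs with both twins infected, 19 with only the first-born infected, 4 with only the second-born infected). The count of 81 pairs with neither twin infected is derived here as 115 − 11 − 19 − 4, and the object names are our own.

```r
# Paired outcomes for the 115 prospectively identified twin sets.
# Rows: first-born twin; columns: second-born twin.
# 81 = 115 - 11 - 19 - 4 is derived, not quoted in the text.
tab <- matrix(c(11, 4, 19, 81), nrow = 2,
              dimnames = list(first_born  = c("infected", "not infected"),
                              second_born = c("infected", "not infected")))

# Marginal totals: 30 infected first-borns, 15 infected second-borns
rowSums(tab)
colSums(tab)

# McNemar's test with continuity correction uses only the discordant cells
res <- mcnemar.test(tab, correct = TRUE)
res$statistic   # (|19 - 4| - 1)^2 / (19 + 4) = 196/23, about 8.52
res$p.value     # about 0.0035
```

Changing the 11 or the 81 while keeping 19 and 4 fixed leaves the statistic unchanged, which is one concrete way to show students that the concordant pairs do not enter the test.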
We have used the data in Table 2 in introductory courses to show how to display and compare binary outcomes in paired observations. Whereas many introductory courses merely handwave, and state (without proof or discussion) that the "a" and "d" frequencies in the concordant cells are "uninformative," teachers may wish to motivate this "McNemar" approach. For example, they can ask what would happen if the paired binary responses were analyzed with a paired t-test. The results may surprise students and prompt some teachers to locate and consult McNemar's original article, McNemar (1947). One can also ask what is the most natural parameter in these situations (i.e., the number that an infinite amount of data would converge to). Is it the absolute difference in two proportions, or their ratio? Why, in this context, do statisticians focus so much on the odds ratio? Is it because the single parameter of the noncentral hypergeometric distribution obtained by conditioning on both margins of a 2 × 2 table is the ratio of the two odds corresponding to the two binomial parameters (i.e., the odds ratio $[P_1/(1 - P_1)]/[P_2/(1 - P_2)]$)? This discussion can lead students toward the simpler $P_1/P_2$ comparative parameter, and in nonexperimental contexts, to binary regression models with a log rather than a logistic link. It may also lead one to consider the possibilities available with "marginal" models. What if all infants in the dataset had been (or were to be) delivered by one method?
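The paired t-test question can be explored directly. The following sketch is our own illustration, not part of the published analysis: it expands the pair counts quoted earlier (11, 19, 4, and a derived 81 = 115 − 11 − 19 − 4) into one row per twin pair, with 1 for infected and 0 for not infected, and compares the paired t statistic with the uncorrected McNemar statistic on the z scale. All variable names are hypothetical.

```r
# Pair counts: both infected, first-born only, second-born only, neither
both <- 11; first_only <- 19; second_only <- 4
n <- 115
neither <- n - both - first_only - second_only   # derived: 81

# One element per twin pair (1 = infected, 0 = not infected)
first  <- rep(c(1, 1, 0, 0), times = c(both, first_only, second_only, neither))
second <- rep(c(1, 0, 1, 0), times = c(both, first_only, second_only, neither))

# Paired t-test applied to the 0/1 responses
tt <- t.test(first, second, paired = TRUE)

# The same t statistic written in terms of the discordant counts and n
d <- first_only - second_only          # net discordance
s <- first_only + second_only          # number of discordant pairs
t_by_hand <- d / sqrt((s - d^2 / n) * n / (n - 1))

# McNemar's statistic without continuity correction, on the z scale
z_mcnemar <- d / sqrt(s)

c(t = unname(tt$statistic), t_by_hand = t_by_hand, z_mcnemar = z_mcnemar)
```

The algebra shows that the concordant pairs reach the t statistic only through n, so the two approaches give similar (though not identical) answers, which may help motivate why the concordant cells carry so little information about the difference.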

More Complex, Regression-Based Analyses
The main analysis results published in Duliège et al. (1995) relied on "quasi-likelihood modeling" to estimate the marginal probability of infection with adjusted odds ratios (and 95% confidence intervals). The authors cite the work of Qaqish and Liang (1992) for their methods. While it may be difficult to use this methodology today within existing software, it is most pleasing that the main result (i.e., "the odds ratio of HIV-1 infection for A twins compared with B twins was 2.4 (CI: 1.4 to 4.0)," Duliège et al. 1995) can be reproduced identically by estimating parameters for a marginal model with Generalized Estimating Equations (GEE). Using R's geeglm function within the geepack library, Halekoh, Højsgaard, and Yan (2006), we have:

> library(geepack)
> GEEmodel <- geeglm(HIV ~ First, id = BATCHID, family = binomial(), data = twins94_long, corstr = "exchangeable")
> round(exp(confint.geeglm(GEEmodel)), 1)
The risk ratio is another important estimate to consider. In this case, the risk ratio, equal to 2.0, is very intuitive as there are 30 infected first-born twins versus 15 infected second-born twins. It is fit in a binary regression with a log link, where $\log(P) = \log(P_2) + \log(P_1/P_2) \times I$, and $I = 0/1$ indicates whether (1) or not (0) the twin is born first. The exponentiated value of the fitted coefficient of $I$ yields the estimate of the "risk ratio" parameter $P_1/P_2$.
> GEEmodelRR <- geeglm(HIV ~ First, id = BATCHID, family = binomial("log"), data = twins94_long, corstr = "exchangeable")
> round(exp(confint.geeglm(GEEmodelRR)), 1)
Using GEE, we can also obtain the same coefficient estimates and confidence intervals as those published in Duliège et al. (1995) for mode of delivery; the complete R code is provided in the Appendix. Replicating the published odds ratios by quasi-likelihood methods may be an interesting challenge for advanced students, with the understanding that statistical modeling of correlated data has evolved quite substantially since the original publication. Hanley, Negassa, and Forrester (2003) provided an accessible introduction to GEE models, while Sjölander et al. (2012) provided an overview of various analysis methods for correlated binary data with special attention to the analysis of twin studies.
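For the unadjusted birth-order comparison, the marginal odds ratio and risk ratio have closed forms, so the point estimates can be cross-checked by hand from the counts quoted earlier (30 and 15 infected out of 115, with 11 pairs in which both twins are infected). The sketch below is our own back-of-envelope check in base R, not the authors' quasi-likelihood analysis and not a GEE fit: it uses a delta-method standard error that accounts for the covariance between the two proportions measured on the same pairs, so agreement with a sandwich-based GEE interval should be expected only approximately. All object names are hypothetical.

```r
# Counts quoted in the text for the 115 prospectively identified sets
n      <- 115
x1     <- 30   # infected first-born twins
x2     <- 15   # infected second-born twins
x_both <- 11   # pairs in which both twins are infected

p1  <- x1 / n
p2  <- x2 / n
p11 <- x_both / n

# Variances and covariance of the two marginal proportions (same pairs)
v1  <- p1 * (1 - p1) / n
v2  <- p2 * (1 - p2) / n
c12 <- (p11 - p1 * p2) / n

z <- qnorm(0.975)

# Marginal odds ratio: log OR = logit(p1) - logit(p2)
log_or <- qlogis(p1) - qlogis(p2)
se_or  <- sqrt(v1 / (p1 * (1 - p1))^2 + v2 / (p2 * (1 - p2))^2 -
               2 * c12 / (p1 * (1 - p1) * p2 * (1 - p2)))
or_ci  <- exp(log_or + c(-1, 1) * z * se_or)

# Risk ratio: log RR = log(p1) - log(p2)
log_rr <- log(p1) - log(p2)
se_rr  <- sqrt(v1 / p1^2 + v2 / p2^2 - 2 * c12 / (p1 * p2))
rr_ci  <- exp(log_rr + c(-1, 1) * z * se_rr)

round(c(OR = exp(log_or), lower = or_ci[1], upper = or_ci[2]), 1)  # 2.4, 1.4, 4.0
round(c(RR = exp(log_rr), lower = rr_ci[1], upper = rr_ci[2]), 1)  # 2.0, 1.3, 3.1
```

Dropping the covariance term (setting c12 to zero) widens both intervals, which gives students a concrete sense of what is lost when the pairing is ignored.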
Helpful Hint: Consider using the estimates obtained as an opportunity to discuss the differences between marginal and conditional models, see Gardiner, Luo, and Roman (2009); as well as the choice between Odds Ratio and Risk Ratio metrics, see Schmidt and Kohlmann (2008).
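One way to make the marginal versus conditional distinction concrete is to compute both odds ratios from the same counts. The sketch below is our own illustration with hypothetical object names: the conditional (within-pair) odds ratio for matched pairs is estimated by the ratio of the discordant counts (19 and 4), with an exact interval obtained by treating the 19 as a binomial count among the 23 discordant pairs, while the marginal odds ratio uses the marginal proportions 30/115 and 15/115.

```r
first_only  <- 19   # pairs with only the first-born infected
second_only <- 4    # pairs with only the second-born infected

# Conditional (within-pair) odds ratio: ratio of the discordant counts
or_conditional <- first_only / second_only

# Exact interval: binomial proportion among discordant pairs,
# mapped from the proportion p to the odds p / (1 - p)
p_ci <- as.numeric(binom.test(first_only, first_only + second_only)$conf.int)
or_conditional_ci <- p_ci / (1 - p_ci)

# Marginal odds ratio from the marginal counts (30/115 versus 15/115)
or_marginal <- (30 / 85) / (15 / 100)

c(conditional = or_conditional, marginal = or_marginal)
or_conditional_ci
```

The conditional estimate (4.75) is about twice the marginal one (about 2.4), even though both summarize the same 115 pairs; the two parameters answer different questions, which is the point of the comparison.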

Conclusion
As with urban myths, data, data analyses, and remembered results persist even when they have been subsequently shown to be problematic. This is all the more so in teaching, and when students pursue datasets to illustrate a new statistical method. Consider, for example, the well-known "twin age-at-appendectomy" dataset. Obtained from a questionnaire survey, Duffy, Martin, and Mathews (1990), this is another twin dataset, particularly attractive for illustrating bivariate "survival" models. As joyful as statisticians are when they find such a rare dataset, they should first check with their own experience, and wonder what it was about the data collection that led to certain "outlying" results. No matter how elegant the matching, a finding that 21% of respondents had undergone appendectomy, or a conclusion that 45% of secondary infertility is caused by induced abortions, or for that matter that winning an Oscar adds 4 years to the longevity of performers (see Han et al. 2011), should raise questions as to the data or the data analysis.
Our pursuit of the data from the registry of HIV-exposed twins began when, in 2011, JH told students in his class about a twin pair where one was HIV+ and the other HIV−, and asked what factors might be responsible for the discordance. One student was so skeptical, even after seeing the 19:4 discordance ratio in the Lancet article, that JH emailed Dr. Goedert and asked about getting the raw data. He immediately responded: Dear Jim. You bring back very fond memories-the origin of the idea (at a very small pediatric AIDS meeting in California overlooking the Pacific), assembling the collaborators and contributors, monitoring the raw data as they arrived in my Fax machine (very modern then), my amazement at the difference in risk by birth order, going back to contributors to validate birth order (a few changed actually strengthening the difference), and discussions with the Statistician on the analysis. It will take some digging by a programmer or two who are still around, but the odds are good of finding a clean data set. I will let you know.
The digging was successful and JH received the data within a week.
Datasets that underlie substantial public health changes can be an important way to help students not limit themselves to the statistical analyses: the importance of the question, the biological underpinnings, the cleverness or elegance of the design, the quality of the data, and the impact on society, are even more important.