A Probabilistic Record Linkage Model for Survival Data

ABSTRACT In the absence of a unique identifier, combining information from multiple files relies on partially identifying variables (e.g., gender, initials). With a record linkage procedure, these variables are used to distinguish record pairs that belong together (matches) from record pairs that do not belong together (nonmatches). Generally, the combined strength of the partially identifying variables is too low, causing imperfect linkage: some true nonmatches are identified as matches and, conversely, some true matches as nonmatches. To avoid bias in further analyses, it is necessary to correct for imperfect linkage. In this article, pregnancy data from the Perinatal Registry of the Netherlands were used to estimate the associations between the (baseline) characteristics of the first delivery and the time to a second delivery. Because of privacy regulations, no unique identifier was available to determine which pregnancies belonged to the same woman. To deal with imperfect linkage in a time-to-event setting, where we have a file with baseline characteristics and a file with event times, we developed a joint model in which the record linkage procedure and the time-to-event analysis are performed simultaneously. R code and example data are available as online supplemental material.


Introduction
Computerized record linkage is often used to aggregate information from multiple existing data files from different sources. Different data files may contain records that belong to the same individual. With a record linkage procedure, we determine whether a pair of records belongs to the same individual (a match) or to different individuals (a nonmatch). Using this information, we can combine the files and use these data to answer new research questions without the need of gathering additional data (Gill et al. 1993; Newcombe 1988).
Ideally, each file contains the same unique identifier, which is guaranteed to be unique for each individual (e.g., patient number). Unfortunately, due to privacy regulations or practical reasons, such a unique identifier is often not available and matches and nonmatches cannot be distinguished from each other. In many situations, however, partially identifying variables may have been registered in all files. Examples of partially identifying variables are surname, zip code, initials, date of birth, and gender. Usually, partially identifying variables have a limited number of unique values and may be contaminated with errors (e.g., typing errors). This means that the combined strength of the partially identifying variables is often not high enough to perfectly distinguish matches from nonmatches, causing imperfect linkage. Importantly, if imperfect linkage is not properly dealt with, further analyses of the data could be biased and inefficient (Hof and Zwinderman 2012; Scheuren and Winkler 1993; Baldi et al. 2010).

This article was motivated by data from the Perinatal Registry of the Netherlands (PRN), containing information about 96% of all pregnancies in the Netherlands (Méray et al. 2007). We were interested in the effects of certain characteristics of the first delivery on the time to a second delivery. A problem of the PRN registry is that all pregnancies had been registered individually and no unique identifier was available to determine which pregnancies were from the same mother. However, a number of partially identifying variables (e.g., date of birth of the mother, date of delivery) had been registered for each pregnancy. More generally, in this article, we consider the situation in which we have two files: file A (e.g., cross-sectional cohort study data) containing the baseline characteristics of a number of individuals, and file B (e.g., a national death registry) containing event times.
We assume that the event times of the individuals who experienced the event of interest in the period between collecting A and B were registered in B.
Separation of record linkage and data analysis is common practice and, for the sake of confidentiality, even encouraged. After a record linkage procedure is performed, the regression model of interest is fitted to the linked data. As a consequence of separating the record linkage and the regression analysis, the methods to deal with imperfect linkage in the regression analysis require external information about the probability of correct linkage (Lahiri and Larsen 2005; Kim and Chambers 2009; Chambers 2009; Scheuren and Winkler 1997; Kim and Chambers 2012). Note that this information should be accurate to obtain unbiased parameter estimates. Goldstein, Harron, and Wade (2012) proposed an imputation method to deal with imperfect linkage in the regression analysis. Although their method can deal with time-to-event data, it requires that the linkage is complete, that is, all individuals with baseline measurements in A experience the outcome of interest and have a true match in B. If this assumption does not hold, which is likely in real data, the imputation method requires external information about the mortality rate among the individuals (Goldstein, Harron, and Wade 2012, page 3485).
Since external information on the quality of the linked data is often not available, it is necessary to obtain this information by taking, for instance, an audit sample and checking the true matching status (Chambers 2009). Note that this requires extra resources, which lowers the attractiveness of combining files to answer new research questions. As an alternative, we propose a joint model in which we simultaneously estimate the record linkage model and the time-to-event regression model. The main advantage of our joint model is that it does not require additional external information.

Problem Description
Consider that we have two files A and B containing respectively $n$ and $m$ records. Let $A = \{a_i : i = 1, \ldots, n\}$ be the file containing baseline measurements from $n$ iid individuals, where each individual $i$ is described by the baseline measurements in record $a_i$ and the values of $q$ partially identifying variables $r^0_i = (r^0_{i1}, \ldots, r^0_{iq})$. A partially identifying variable contains some personal information which cannot be used to perfectly distinguish individuals. Examples of such variables are gender, date of birth, and initials.
File $B = \{b_j : j = 1, \ldots, m\}$ contains $m$ events, where each record $b_j = (t^1_j, r^1_j)$ contains the event time $t^1_j$ and the same $q$ partially identifying variables as $A$, denoted as $r^1_j = (r^1_{j1}, \ldots, r^1_{jq})$. Examples of such files are death registries, hospital information systems, or healthcare (insurance) databases. File $B$ is obtained at time $t_B$. Importantly, $B$ is obtained after all the baseline measurements are gathered, that is, $t_B > \max[t^0_1, \ldots, t^0_n]$, where $t^0_i$ is the time at which the baseline measurements of individual $i$ were obtained. Our main interest is to investigate the associations between the baseline measurements and the time from $t^0_i$ onwards to the occurrence of a particular event (e.g., death). Let $T_i$ be the time at which individual $i$ with a record in $A$ experiences the event of interest, where $T_i > t^0_i$. We assume that, if individual $i$ experienced the event of interest before $t_B$ (i.e., $T_i \le t_B$), the event of individual $i$ has been registered in $B$. In addition, if $T_i > t_B$, individual $i$ has no registered event in $B$.
To reconstruct the complete data, it is necessary to determine which baseline measurements from $A$ and events from $B$ belong to the same individual. Let $\{(a_i, b_j) : i = 1, \ldots, n, \text{ and } j = 1, \ldots, m\}$ be the Cartesian product of $A$ and $B$, consisting of all $n \times m$ possible pairs of baseline measurements from $A$ and events from $B$. In addition, let $\mathbf{d} = \{d_{ij} : i = 1, \ldots, n, \text{ and } j = 1, \ldots, m\}$ be the set of $n \times m$ corresponding matching indicators, where $d_{ij}$ is a binary indicator which describes whether baseline measurement $a_i$ and event $b_j$ belong together (i.e., a match and $d_{ij} = 1$) or not (i.e., a nonmatch and $d_{ij} = 0$). If a unique identifier, such as a citizen service number, is present for each individual in both $A$ and $B$, it is possible to determine $\mathbf{d}$ and to calculate the follow-up times of all individuals with baseline measurements in $A$. For instance, consider individual $i$ who experienced the event at $T_i = t^1_j$, meaning that baseline measurement $a_i$ and event $b_j$ are a match ($d_{ij} = 1$). We treat $t^0_i$ as the start of the follow-up period, so the follow-up time of individual $i$ is $t_{ij} = t^1_j - t^0_i$. On the contrary, if individual $i$ did not experience an event before $t_B$, this individual was followed until time $t^c_i = t_B - t^0_i$ without experiencing the event and is assumed to be right-censored.
In many situations, a unique identifier is not available and we have to rely on the partially identifying variables to determine whether a baseline measurement $a_i$ and event $b_j$ are a match or a nonmatch. For each pair $(a_i, b_j)$, the values of the partially identifying variables are compared to each other. In this article, we compare the values of the $q$ partially identifying variables with a binary comparison function, leading to the comparison vector $g_{ij} = (g_{ij1}, \ldots, g_{ijq})$. Note that it is also possible to use other comparison functions based on, for instance, string comparator metrics (Winkler 1990; DuVall, Kerber, and Thomas 2010; Porter and Winkler 1997). The extent to which the comparison vectors can distinguish matches from nonmatches depends on the combined strength of the partially identifying variables, where the strength of each variable is determined by the amount of (typing) errors, the number of unique values, and the distribution over these values. The more unique values a partially identifying variable has, the stronger its discriminative strength. For instance, gender (2 unique values) is a weaker partially identifying variable than the first initial (26 unique values) when the values of both variables are uniformly distributed in both $A$ and $B$.
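As a concrete illustration, a binary comparison function can be sketched as follows. This is a minimal example with hypothetical field names, not the variables used in the PRN application:

```python
def compare(record_a, record_b, keys):
    """Binary comparison vector: 1 if the partially identifying
    variable agrees exactly between the two records, 0 otherwise."""
    return tuple(int(record_a[k] == record_b[k]) for k in keys)

# Hypothetical records with three partially identifying variables.
a_i = {"gender": "F", "initial": "J", "birth_year": 1978}
b_j = {"gender": "F", "initial": "K", "birth_year": 1978}

g_ij = compare(a_i, b_j, ["gender", "initial", "birth_year"])
# g_ij == (1, 0, 1): gender and birth year agree, the initial does not
```

String comparator metrics would replace the exact-equality test with a similarity score, at the cost of a more involved model for the comparison outcomes.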
To determine whether a record pair is a match or not, record linkage is necessary (Fellegi and Sunter 1969). Basically, there are two types of record linkage strategies (Tromp et al. 2011). In the deterministic approach, all partially identifying variables or a subset of partially identifying variables has to agree to consider a record pair a match. In this article, we focus on the probabilistic approach, where for each record pair we calculate the probability of a match given the comparison vector. Fellegi and Sunter (1969) developed a framework for record linkage in which they treat the set of all $n \times m$ record pairs from $A$ and $B$ as the union of two disjoint sets of matches $M$ and nonmatches $U$, where
$$M = \{(a_i, b_j) : d_{ij} = 1,\ i = 1, \ldots, n, \text{ and } j = 1, \ldots, m\}$$
and
$$U = \{(a_i, b_j) : d_{ij} = 0,\ i = 1, \ldots, n, \text{ and } j = 1, \ldots, m\}.$$
In many situations, the partially identifying variables are assumed to disagree completely at random (DCAR) in matches $M$ and agree completely at random (ACAR) in nonmatches $U$, which Fellegi and Sunter (1969) and Schürle (2005) refer to as conditional independence. Under DCAR and ACAR, we have
$$p(g_{ij} \mid d_{ij} = 1) = \prod_{k=1}^{q} \mu_k^{g_{ijk}} (1 - \mu_k)^{1 - g_{ijk}}, \quad (2a)$$
$$p(g_{ij} \mid d_{ij} = 0) = \prod_{k=1}^{q} \upsilon_k^{g_{ijk}} (1 - \upsilon_k)^{1 - g_{ijk}}, \quad (2b)$$
where $\mu_k$ and $\upsilon_k$ are the probabilities that the $k$th partially identifying variable agrees in, respectively, $M$ and $U$. The DCAR assumption implies that disagreements of the partially identifying variables in $M$ are uncorrelated. The ACAR assumption implies that an accidental agreement of a partially identifying variable in $U$ happens completely at random. For instance, an accidental agreement on gender is independent of an accidental agreement on surname.
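The conditional independence model (2a)-(2b) can be sketched directly; the agreement probabilities below are illustrative assumptions, not estimates from any real data:

```python
from math import prod

def p_g_given(g, agree_probs):
    """Probability of a comparison vector g under independent binary
    agreement probabilities (one per partially identifying variable)."""
    return prod(p if gk else 1.0 - p for gk, p in zip(g, agree_probs))

mu = [0.95, 0.98, 0.90]   # assumed agreement probabilities in matches M
nu = [0.50, 0.04, 0.02]   # assumed agreement probabilities in nonmatches U
g = (1, 1, 0)             # the third variable disagrees

p_match = p_g_given(g, mu)      # (2a): 0.95 * 0.98 * (1 - 0.90)
p_nonmatch = p_g_given(g, nu)   # (2b): 0.50 * 0.04 * (1 - 0.02)
```

The ratio of these two probabilities is the familiar Fellegi-Sunter match weight for the pair.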
Because $\mathbf{d}$ is not observed, the correlations between the comparison outcomes within $M$ and $U$ cannot be measured. Therefore, it is not possible to test the ACAR and DCAR assumptions. If we suspect that the ACAR and DCAR assumptions do not hold, we can extend (2) by adding correlations between the comparison outcomes (Larsen and Rubin 2001; Winkler 1993; Schürle 2005). Another possible extension of (2) is the introduction of extra classes (e.g., potential nonmatch and likely nonmatch classes) (Scheuren and Winkler 1997; Larsen and Rubin 2001).

Estimation of the Matching Indicators
Since the set of matching indicators $\mathbf{d}$ is not observed, we need to rely on the partially identifying variables to determine which record pairs are matches and nonmatches. Because each individual has one baseline measurement in $A$ and can experience the outcome of interest only once, not all configurations of $\mathbf{d}$ are possible. Each baseline measurement from $A$ can form a match with at most one record from $B$ and vice versa, that is, $\sum_{j=1}^{m} d_{ij} \le 1$ and $\sum_{i=1}^{n} d_{ij} \le 1$, leading to a one-to-one matching structure. Following graph theory, Sadinle (2016) showed that a one-to-one matching structure is equivalent to a bipartite matching. The files $A$ and $B$ can be considered as two disjoint sets of nodes. Because of the data structure, each node from $A$ can have at most one edge that connects it to a node in $B$ and vice versa. Therefore, all edges are pairwise disjoint and the corresponding graph is a bipartite matching.
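Counting the bipartite matchings makes the combinatorial burden of the one-to-one structure concrete. The count below uses the standard formula $\sum_{k=0}^{\min(n,m)} \binom{n}{k}\binom{m}{k}\, k!$: choose which $k$ records of each file are matched, then pair them in $k!$ ways.

```python
from math import comb, factorial

def n_matchings(n, m):
    """Number of one-to-one matching configurations (bipartite
    matchings) between n records in file A and m records in file B."""
    return sum(comb(n, k) * comb(m, k) * factorial(k)
               for k in range(min(n, m) + 1))

print(n_matchings(2, 2))    # 7 configurations for two 2-record files
print(n_matchings(5, 30))   # grows into the tens of millions
```

Even two single-record files already admit two configurations (matched or not), and the count explodes long before realistic file sizes are reached.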
Let $\mathbf{g} = \{g_{ij} : i = 1, \ldots, n, \text{ and } j = 1, \ldots, m\}$ be the set of $n \times m$ comparison vectors. To estimate $\mathbf{d}$, we treat the matching indicators as missing data by considering the marginal distribution of the comparison vectors over all configurations of $\mathbf{d}$. Under a one-to-one matching structure, the marginal distribution of the comparison vectors is
$$p(\mathbf{g}) = \sum_{\mathbf{d} \in D} p(\mathbf{g}, \mathbf{d}), \quad (3)$$
where $p(\mathbf{g}, \mathbf{d})$ is the joint probability distribution of the comparison vectors $\mathbf{G}$ and the matching indicators $\mathbf{D}$. In addition, the set $D$ contains all possible matching indicator configurations, that is,
$$D = \Big\{\mathbf{d} : \sum_{j=1}^{m} d_{ij} \le 1 \text{ for all } i, \text{ and } \sum_{i=1}^{n} d_{ij} \le 1 \text{ for all } j\Big\}.$$
Using Bayes' theorem, the most likely matching indicator configuration can be determined by calculating the probability $p(\mathbf{D} = \mathbf{d} \mid \mathbf{g})$ for each possible matching indicator configuration $\mathbf{d} \in D$.
Sadinle (2016), Gutman, Afendulis, and Zaslavsky (2013), and Tancredi and Liseo (2011) proposed methods based on Bayesian inference to estimate $\mathbf{d}$ under the one-to-one matching structure. The problem of imposing the one-to-one matching structure is that the size of $D$, given by
$$|D| = \sum_{k=0}^{\min(n,m)} \binom{n}{k} \binom{m}{k}\, k!,$$
is huge (Hof and Zwinderman 2015). Even in record linkage problems with really small files, for instance, $n = 5$ and $m = 30$, $D$ already contains about $2.06 \times 10^7$ matching indicator configurations. Consequently, these methods quickly become very computationally expensive. As a solution, Jaro (1989), Larsen and Rubin (2001), and Winkler (1990), among others, proposed to simplify the model by assuming independence between all $n \times m$ record pairs, leading to the traditional mixture model for record linkage
$$p(\mathbf{g}) = \prod_{i=1}^{n} \prod_{j=1}^{m} \big[\lambda\, p(g_{ij} \mid d_{ij} = 1) + (1 - \lambda)\, p(g_{ij} \mid d_{ij} = 0)\big], \quad (4)$$
where $\lambda = p(d_{ij} = 1)$ is the marginal probability that a record pair is a match. A consequence of assuming independence between all record pairs is that the decision to declare a record pair a match or a nonmatch only depends on the information retrieved from that record pair and not on the complete data (Tancredi and Liseo 2011). Therefore, if the partially identifying variables are not strong enough, it is possible to assign multiple matches to each record from $A$ (i.e., $\sum_{j=1}^{m} d_{ij} > 1$) and to each record from $B$ (i.e., $\sum_{i=1}^{n} d_{ij} > 1$). To solve the problem of multiple matches, Jaro (1989) proposed a post hoc procedure to introduce the one-to-one matching structure. The optimal $\mathbf{d}$ that satisfies the one-to-one matching structure can be found by solving the following linear sum assignment problem:
$$\hat{\mathbf{d}} = \arg\max_{\mathbf{d}} \sum_{i=1}^{n} \sum_{j=1}^{m} d_{ij}\, p(d_{ij} = 1 \mid g_{ij}), \quad \text{subject to } \sum_{j=1}^{m} d_{ij} \le 1 \text{ and } \sum_{i=1}^{n} d_{ij} \le 1. \quad (5)$$
An alternative method to impose the one-to-one matching structure after fitting (4) was proposed by Zardetto and Scannapieco (2013), who used a genetic algorithm to search for the best-fitting set of matching indicators satisfying the one-to-one matching structure.
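The post hoc assignment step can be sketched with a brute-force search; `w` holds per-pair match weights (e.g., posterior match probabilities or log-likelihood ratios) and records may stay unmatched. This exhaustive search is illustration only; real implementations solve the linear sum assignment problem with dedicated algorithms (e.g., the Hungarian method).

```python
from itertools import permutations

def best_one_to_one(w):
    """Exhaustive search for the one-to-one assignment maximizing the
    total weight of declared matches; None means 'no match'. Feasible
    only for tiny files (a sketch of Jaro's post hoc step)."""
    n, m = len(w), len(w[0])
    candidates = list(range(m)) + [None] * n
    best_score, best = float("-inf"), None
    for perm in set(permutations(candidates, n)):
        score = sum(w[i][j] for i, j in enumerate(perm) if j is not None)
        if score > best_score:
            best_score, best = score, perm
    return best, best_score

# Illustrative weights for a toy 2 x 3 problem (higher = more likely match).
w = [[2.3, -1.0, 0.4],
     [0.1,  1.8, -0.7]]
assignment, total = best_one_to_one(w)   # assignment == (0, 1)
```

Because unmatched records contribute zero, pairs with negative weight (more likely nonmatch than match) are left unlinked automatically.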

Two-Stage Approach for Survival Analysis and Probabilistic Record Linkage
After the set of matching indicators has been estimated, it is possible to fit a time-to-event model to the linked data. Let $\tilde{T}_i$ be a nonnegative continuous random variable representing the true follow-up time of individual $i$, that is, the time from baseline measurement $t^0_i$ until the event of interest at $T_i$. We assume that $\tilde{T}_i$ can be described by a proportional hazards model. Conditionally on the baseline measurements $x_i$, the hazard function is
$$h(t \mid x_i; \alpha, \beta) = h_0(t \mid \alpha) \exp\{x_i^\top \beta\},$$
where $\beta = (\beta_1, \ldots, \beta_p)$ is a vector of regression parameters and $h_0(t \mid \alpha)$ is a baseline hazard function with parameter vector $\alpha$. Evaluation of the hazard function at $t < 0$ gives a hazard of zero since $p(\tilde{T}_i < 0 \mid x_i) = 0$. Throughout this article, we assume a parametric baseline hazard function. The baseline hazard function could be parameterized by, for instance, a constant term and a linear trend over time, or by more flexible parametric functions based on spline specifications. If no match was found in $B$ for individual $i$, that is, $\sum_{j=1}^{m} \hat{d}_{ij} = 0$, this individual is considered to be right-censored at $t^c_i$. Analogously, if a match is found with the $j$th event time from $B$, that is, $\hat{d}_{ij} = 1$, individual $i$ experienced the event of interest at follow-up time $t_{ij}$. Using the estimated matching indicators, the time-to-event model is defined as
$$L(\alpha, \beta) = \prod_{i=1}^{n} \Big[ S(t^c_i \mid x_i; \alpha, \beta)^{1 - \sum_{j=1}^{m} \hat{d}_{ij}} \prod_{j=1}^{m} f(t_{ij} \mid x_i; \alpha, \beta)^{\hat{d}_{ij}} \Big], \quad (7)$$
where $S(t \mid x_i; \alpha, \beta) = \exp\{-\int_0^t h(u \mid x_i; \alpha, \beta)\,du\}$ and $f(t \mid x_i; \alpha, \beta) = h(t \mid x_i; \alpha, \beta)\, S(t \mid x_i; \alpha, \beta)$ are, respectively, the survival function and density function of $\tilde{T}_i$ evaluated at time $t$.
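The hazard, survival, and density functions can be sketched numerically for a parametric baseline hazard. The linear baseline hazard $h_0(t) = \alpha_0 + \alpha_1 t$ and the parameter values below are illustrative assumptions, not those of the PRN analysis:

```python
import math

def hazard(t, x, beta, alpha):
    """h(t|x) = h0(t|alpha) * exp(x'beta), with an assumed linear
    baseline hazard h0(t) = alpha[0] + alpha[1] * t."""
    h0 = alpha[0] + alpha[1] * t
    return h0 * math.exp(sum(b * xi for b, xi in zip(beta, x)))

def survival(t, x, beta, alpha, steps=1000):
    """S(t|x) = exp(-integral_0^t h(u|x) du), trapezoidal rule."""
    if t <= 0:
        return 1.0
    du = t / steps
    hs = [hazard(k * du, x, beta, alpha) for k in range(steps + 1)]
    cum = du * (sum(hs) - 0.5 * (hs[0] + hs[-1]))
    return math.exp(-cum)

def density(t, x, beta, alpha):
    """f(t|x) = h(t|x) * S(t|x)."""
    return hazard(t, x, beta, alpha) * survival(t, x, beta, alpha)

x, beta, alpha = [0.5, -1.0], [0.25, -0.16], (0.1, 0.04)
S2 = survival(2.0, x, beta, alpha)   # S(2 | x) under the toy model
```

For this linear baseline hazard the cumulative hazard has the closed form $(\alpha_0 t + \alpha_1 t^2 / 2)\exp\{x^\top\beta\}$, which the trapezoidal rule reproduces exactly; the numerical integral is kept so that other baseline hazards can be dropped in.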
With time-to-event data, it is likely that the probability of a match depends on the covariates (e.g., healthier individuals are less likely to be admitted to the hospital) and the follow-up time (e.g., the longer an individual is followed in time, the higher the probability of an event). If we treat the record linkage and the time-to-event analysis as separate tasks, it is necessary to allow the probability of a match in the record linkage model to depend on the time-to-event data and the covariates. However, it is not straightforward to add the time-to-event data and the covariates in the record linkage model. Importantly, as can be seen in the simulation in Section 6, substantial bias could be introduced in the time-to-event model if the probability of a match in the record linkage model is only allowed to depend on the comparison vectors.

Joint Model for Survival Analysis and Probabilistic Record Linkage
Unlike the two-stage approach, where we run the risk of misspecifying the probability of a match in the record linkage model, we propose to write the probability distribution of the matching indicators in terms of time-to-event probabilities. Since we are primarily interested in the parameters from the time-to-event model, we consider the matching indicators to be nuisance variables and treat them as missing data. By marginalizing the matching indicators out of the joint likelihood, both the record linkage submodel and the time-to-event submodel are estimated simultaneously. The main advantage of this approach, in contrast to the two-stage approach, is that the probability of a match depends on both the time-to-event data and the covariates.
Let $\mathbf{d}_i = \{d_{ij} : j = 1, \ldots, m\}$ be the set of matching indicators of individual $i$. Ideally, we would impose a one-to-one matching structure on the data. However, similarly to the marginal approach for record linkage from (3), this would require the evaluation of an extremely large number of potential matching indicator configurations.
To reduce the computational burden, we propose to assume independence between the observations from different individuals, that is,
$$p(\mathbf{g}, \mathbf{t}, \mathbf{d} \mid \mathbf{x}) = \prod_{i=1}^{n} p(g_i, t_i, \mathbf{d}_i \mid x_i),$$
where $g_i = (g_{i1}, \ldots, g_{im})$ and $t_i = (t_{i1}, \ldots, t_{im})$ contain, respectively, the comparison vectors and possible follow-up times of individual $i$. Each individual can experience the outcome of interest at most once (i.e., $\sum_{j=1}^{m} d_{ij} \le 1$). Therefore, the set of all possible matching indicator configurations for individual $i$ is
$$D_i = \Big\{\mathbf{d}_i : \sum_{j=1}^{m} d_{ij} \le 1\Big\},$$
which contains $m + 1$ configurations. As a consequence of assuming independence between the observations from different individuals, the decision to declare a record pair a match or a nonmatch only depends on the data from a particular individual. Therefore, it is possible to assign multiple potential matches to each record from file $B$, leading to a one-to-many matching structure. If we were specifically interested in determining the matching indicators, this could be a problem (Tancredi and Liseo 2011). However, our main interest is the estimation of the parameters from the time-to-event submodel. We conjecture that, given that the partially identifying variables are of decent quality (i.e., a relatively low level of errors), the assumption of independence between the observations from different individuals introduces negligible bias in the parameter estimates of the joint model and therefore does not hamper our analysis. Results from our simulation study in Section 6 confirm this.

Time-to-Event Submodel
To show that the probability distribution of the matching indicators can be written in terms of time-to-event probabilities, recall from Section 2 that the event time $T_i$ of individual $i$ is registered in $B$ if individual $i$ experienced the event of interest before $t_B$. In other words, record $a_i$ has a match in $B$ when $T_i \le t_B$ or, equivalently, when the follow-up time $\tilde{T}_i \le t^c_i$. Therefore, the probability that individual $i$ has a match in $B$ can be written as
$$p\Big(\sum_{j=1}^{m} d_{ij} = 1 \,\Big|\, x_i\Big) = p(\tilde{T}_i \le t^c_i \mid x_i) = 1 - S(t^c_i \mid x_i; \alpha, \beta).$$
If record $a_i$ has a match in $B$, individual $i$ experienced the event of interest at one of the follow-up times in the set $t_i$ (i.e., $\tilde{T}_i \in t_i$). For instance, if record pair $(a_i, b_z)$ is a match ($d_{iz} = 1$), individual $i$ experienced the event of interest at the follow-up time $\tilde{T}_i = t_{iz}$. The probability that $(a_i, b_z)$ is a match, given $\sum_{j=1}^{m} d_{ij} = 1$, is equivalent to the probability that individual $i$ experienced the event at follow-up time $t_{iz}$ and not at the other possible follow-up times from $t_i$, that is,
$$p\Big(d_{iz} = 1 \,\Big|\, \sum_{j=1}^{m} d_{ij} = 1, x_i\Big) = \frac{f(t_{iz} \mid x_i; \alpha, \beta)}{\sum_{j=1}^{m} f(t_{ij} \mid x_i; \alpha, \beta)}.$$
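These two probabilities can be made concrete with a toy survival model; the constant (exponential) hazard below is an assumption chosen purely for illustration:

```python
import math

LAM = 0.4                                # assumed constant hazard
S = lambda t: math.exp(-LAM * t)         # survival function S(t)
f = lambda t: LAM * math.exp(-LAM * t)   # density f(t) = h(t) * S(t)

t_c = 3.0                  # censoring time t_B - t0_i of individual i
t_i = [0.5, 1.2, 2.7]      # follow-up times implied by the candidates in B

# Probability that individual i has a match in B: 1 - S(t_c).
p_match = 1.0 - S(t_c)

# Given a match, the probability that the match is candidate z is the
# density at t_iz normalized over all candidate follow-up times.
p_z = [f(t) / sum(f(u) for u in t_i) for t in t_i]
```

With a decreasing density, earlier candidate event times receive higher conditional match probability, which is exactly how the time-to-event submodel informs the linkage.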

Record Linkage Submodel
Similarly to the traditional mixture model for record linkage from (4), the comparison vectors are assumed to be independent of each other and of all observed and unobserved variables given the matching indicators of individual $i$. Consider that record $a_i$ from $A$ has no match in $B$. The only possible vector $\mathbf{d}_i \in D^u_i$ is a vector of zeros. Therefore, conditionally on $\mathbf{d}_i \in D^u_i$, the probability of observing $g_i$ is
$$p(g_i \mid \mathbf{d}_i \in D^u_i) = \prod_{j=1}^{m} p(g_{ij} \mid d_{ij} = 0).$$
Given that record $a_i$ from $A$ has a match in $B$, there are $m$ possible matching indicator configurations. Consider the vector $\mathbf{d}_i \in D^m_i$ with $d_{iz} = 1$, that is, record $a_i$ from $A$ is a match with record $b_z$ from $B$ and not with the other records from $B$. Conditionally on $\mathbf{d}_i$, we can write the probability of observing $g_i$ as
$$p(g_i \mid \mathbf{d}_i) = \prod_{j=1}^{m} p(g_{ij} \mid d_{ij} = 1)^{I(j = z)}\, p(g_{ij} \mid d_{ij} = 0)^{1 - I(j = z)},$$
where $I(\cdot)$ is the indicator function. The assumption of conditional independence between the comparison vectors and the other variables requires stable partially identifying variables that do not change over time. Initials, date of birth, and gender are examples of such variables. If we use more unstable partially identifying variables, such as place of residence or marital status, it is possible that the conditional independence in matches does not hold. The probability that place of residence agrees in matches may decrease with a longer follow-up time because some persons may have moved to a different address. In addition, the probability that marital status agrees in nonmatches could depend on the age of the individual, since younger individuals are less likely to be married. If we suspect associations between the values of the comparison outcomes and other variables, we can add these associations to the probabilistic record linkage submodel by extending (2).
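The two conditional probabilities above can be sketched for one individual with three candidate records in B; the agreement probabilities are illustrative assumptions:

```python
from math import prod

def p_gij(g, agree_probs):
    """Independent binary agreement model for one record pair."""
    return prod(p if gk else 1.0 - p for gk, p in zip(g, agree_probs))

def p_gi_given_di(g_i, z, mu, nu):
    """p(g_i | d_i): candidate z is the match (z=None means no match);
    every other record pair is treated as a nonmatch."""
    return prod(p_gij(g, mu if j == z else nu)
                for j, g in enumerate(g_i))

mu = [0.95, 0.90]                # assumed agreement probabilities in matches
nu = [0.20, 0.05]                # assumed agreement probabilities in nonmatches
g_i = [(1, 1), (1, 0), (0, 0)]   # comparisons of a_i with b_1, b_2, b_3

p_no_match = p_gi_given_di(g_i, None, mu, nu)   # all pairs nonmatches
p_match_b1 = p_gi_given_di(g_i, 0, mu, nu)      # b_1 is the match
```

Here the fully agreeing pair `(1, 1)` is far more probable under the match distribution, so the configuration declaring b_1 the match dominates the no-match configuration.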
In some situations, a variable can be used both as a baseline measurement in the time-to-event submodel and as a partially identifying variable in the probabilistic record linkage submodel (e.g., gender). Because we use the observed values from $A$ as baseline measurements in the time-to-event submodel, the variable is assumed to be without (typing) errors in $A$. The partially identifying variable could still disagree in matches, but these disagreements are then only caused by errors in $B$.

Joint Likelihood
For the joint likelihood specification, we assume that the comparison vectors are independent of each other and of all observed and unobserved variables given the matching indicators. If we suspect that the comparison vectors are associated with some observed variables, the record linkage submodel can be extended by modifying (2). The density function of the variables $\{(g_{ij}, d_{ij}) : j = 1, \ldots, m\}$ of individual $i$ is formed by combining the time-to-event submodel and the record linkage submodel, that is,
$$p(g_i, \mathbf{d}_i \mid t_i, x_i; \theta) = p(g_i \mid \mathbf{d}_i; \mu, \upsilon)\, p(\mathbf{d}_i \mid t_i, x_i; \alpha, \beta), \quad (12)$$
where $\theta = \{\mu, \upsilon, \beta, \alpha\}$ is the set of all model parameters. The vectors $\mu$ and $\upsilon$ contain the parameters from the probabilistic record linkage submodel. In addition, $\beta$ is the vector of regression parameters and $\alpha$ is the vector of parameters of the baseline hazard function from the time-to-event submodel.
Since the matching indicators are considered to be nuisance variables, we marginalize them out from (12), leading to the likelihood
$$L_n(\theta) = \prod_{i=1}^{n} \sum_{\mathbf{d}_i \in D_i} p(g_i \mid \mathbf{d}_i; \mu, \upsilon)\, p(\mathbf{d}_i \mid t_i, x_i; \alpha, \beta). \quad (13)$$
Parameter estimates $\hat{\theta}_n$ are obtained by maximizing the likelihood, that is,
$$\hat{\theta}_n = \arg\max_{\theta \in \Theta} L_n(\theta),$$
where $\Theta$ is the parameter space. Since the joint model is a mixture of $m + 1$ components of finite measure products, identifiability of our joint model follows from Sec. 5 in Allman, Matias, and Rhodes (2009). We propose to estimate the variance of $\hat{\theta}_n$ with the inverse Hessian evaluated at $\hat{\theta}_n$. This variance estimate gave us good results in terms of coverages of the 95% confidence intervals in our simulations (see Section 6). However, our variance estimate should only be considered a good approximation; a rigorous proof of asymptotic normality remains challenging.
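A per-individual likelihood contribution of this marginalized form can be sketched as follows. The exponential survival model and all parameter values are illustrative assumptions; the sum runs over the m + 1 matching configurations of one individual:

```python
import math
from math import prod

def p_gij(g, agree_probs):
    """Independent binary agreement model for one record pair."""
    return prod(p if gk else 1.0 - p for gk, p in zip(g, agree_probs))

def likelihood_i(g_i, t_i, t_c, S, f, mu, nu):
    """Likelihood contribution of one individual, summed over the
    m + 1 matching configurations: no match (censored at t_c), or a
    match with candidate z (event at follow-up time t_i[z])."""
    p_U = [p_gij(g, nu) for g in g_i]
    base = prod(p_U)                       # all record pairs nonmatches
    total = S(t_c) * base                  # configuration d_i = 0
    f_sum = sum(f(t) for t in t_i)
    for z, g in enumerate(g_i):            # configurations d_iz = 1
        p_event = (1.0 - S(t_c)) * f(t_i[z]) / f_sum
        total += p_event * base * p_gij(g, mu) / p_U[z]
    return total

LAM = 0.4                                  # assumed constant hazard
S = lambda t: math.exp(-LAM * t)
f = lambda t: LAM * math.exp(-LAM * t)

g_i = [(1, 1), (0, 0)]                     # comparison vectors, m = 2
t_i = [1.0, 2.0]                           # candidate follow-up times
L_i = likelihood_i(g_i, t_i, 3.0, S, f, mu=[0.95, 0.90], nu=[0.20, 0.05])
```

A useful sanity check on this construction: if the agreement probabilities in matches and nonmatches coincide, the comparison data carry no linkage information and the contribution collapses to the all-nonmatch comparison probability.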
We implemented the joint model in the statistical program R (R Core Team 2014), where we used numerical approximation techniques to maximize the log-likelihood and obtain estimates of the Hessian. The code of the joint model and example data are available in the online supplementary materials.

Reducing the Computational Burden
If $A$ and $B$ contain a large number of records, evaluating the likelihood from (13) is computationally expensive. Because we treat the matching indicator vector $\mathbf{d}_i$ as unobserved, we have to evaluate all $m + 1$ possible matching indicator vectors $\mathbf{d}_i \in D_i$ for each record $a_i$. The computational costs to evaluate (13) can be reduced substantially when one or more of the $q$ partially identifying variables are registered without errors in both files. In the record linkage literature, these variables are often referred to as blocking variables (Lahiri and Larsen 2005; Winkler 1993).
Consider that the $k$th ($1 \le k \le q$) partially identifying variable is registered without errors, that is, $p(g_{ijk} = 1 \mid d_{ij} = 1) = 1$, which implies that a record pair with $g_{ijk} = 0$ cannot be a match. The sum in (13) can therefore be restricted to the matching indicator vectors in which the declared match agrees on the $k$th partially identifying variable; all other matching indicator vectors have probability zero and can be skipped.
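The blocking idea can be sketched as an index on the error-free variable: only record pairs within the same block need to enter the sum. The field name below is hypothetical:

```python
from collections import defaultdict

def block(file_a, file_b, key):
    """Group records of B by an error-free blocking variable; only
    pairs within the same block can be matches, so all other pairs
    drop out of the likelihood."""
    index = defaultdict(list)
    for j, b in enumerate(file_b):
        index[b[key]].append(j)
    return {i: index.get(a[key], []) for i, a in enumerate(file_a)}

A = [{"year": 1978}, {"year": 1980}]
B = [{"year": 1980}, {"year": 1978}, {"year": 1978}]
candidates = block(A, B, "year")   # {0: [1, 2], 1: [0]}
```

Instead of m + 1 configurations per individual, only the candidates in the individual's block (plus the no-match configuration) remain.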

Setup
In the simulations, file $A$ contained baseline measurements of $n = 750$ individuals and file $B$ contained $m = 1000$ event times. For each individual $i$, the baseline measurements in $A$ were obtained at time $t^0_i = 0$. All individuals were at risk of experiencing the event until $t_B = 3$. The baseline measurement $a_i = (x_i, r^0_i)$ of individual $i$ consisted of three baseline characteristics $x_i = \{x_{ip} : p = 1, \ldots, 3\}$ and five partially identifying variables $r^0_i = \{r^0_{ik} : k = 1, \ldots, 5\}$. The baseline characteristics were sampled from a multivariate normal distribution with means zero, variances one, and covariances zero.
File $B$ was generated as follows. First, file $B$ was generated without matches by creating a file with uncorrelated records. For each record $b_j = (t^1_j, r^1_j)$, the event time $t^1_j$ was sampled from a continuous uniform distribution over the interval $[0, t_B]$ and the partially identifying variables $r^1_j = \{r^1_{jk} : k = 1, \ldots, 5\}$ were sampled from similar distributions as the partially identifying variables in $A$. Second, we added the matches to file $B$. For each individual $i$ with a baseline measurement in $A$, a follow-up time $\tilde{T}_i$ was sampled from a Poisson process with intensity $h_i(t) = h_0(t)\exp\{x_i^\top \beta\}$, where $h_0(t) = 0.1 + 0.04t - 0.02t^2$ and $\beta = (0.25, -0.16, 0.41)$. Individual $i$ had a record (i.e., a match) in $B$ if $\tilde{T}_i \le t_B$. This record was created by changing the values of record $b_i$ to represent the characteristics of individual $i$, that is, by setting the event time $t^1_i = \tilde{T}_i$ and deriving the partially identifying variables $r^1_i$ from $r^0_i$. With respect to the distribution of the partially identifying variables in both $A$ and $B$, we considered the following two scenarios. In the first scenario, we generated partially identifying variables that satisfied the ACAR and DCAR assumptions. In this scenario, we compared the joint model with the two-stage method described in more detail in Section 4. In the second scenario, we generated partially identifying variables that violated the ACAR or DCAR assumption. In this scenario, we were particularly interested in the performance of the joint model under misspecification of the probabilistic record linkage submodel. The scenarios are described in more detail in Section 6.2.
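The event-time generation step can be sketched with Poisson-process thinning. This reproduces the simulation's hazard but is our own illustrative implementation, not the authors' R code:

```python
import math
import random

def simulate_event_time(x, beta, t_max=3.0):
    """Draw a follow-up time from the proportional hazards model with
    h0(t) = 0.1 + 0.04 t - 0.02 t^2 by thinning a homogeneous Poisson
    process; returns None if no event occurs before t_max."""
    rate = math.exp(sum(b * xi for b, xi in zip(beta, x)))
    h = lambda t: (0.1 + 0.04 * t - 0.02 * t * t) * rate
    h_max = 0.12 * rate            # h0 peaks at t = 1 on [0, 3]
    t = 0.0
    while True:
        t += random.expovariate(h_max)
        if t > t_max:
            return None            # censored: no record added to B
        if random.random() < h(t) / h_max:
            return t

random.seed(1)
beta = (0.25, -0.16, 0.41)
times = [simulate_event_time((0.0, 0.0, 0.0), beta) for _ in range(5000)]
frac_events = sum(t is not None for t in times) / len(times)
```

For an individual with all covariates at zero, the event probability before $t_B = 3$ is $1 - \exp\{-\int_0^3 h_0(u)\,du\} = 1 - e^{-0.3} \approx 0.26$, which the empirical fraction approximates.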
For each model in both scenarios, the bias, mean squared error, and coverage of the 95% confidence interval of the regression parameter estimates $\hat{\beta}$ were reported. In addition, the estimated baseline hazards were reported. All results were obtained by repeating the simulation 500 times. A computer with 8 Intel Xeon (2.40 GHz) processors was used to fit the joint models. We reported the computer run-time for the joint models in both scenarios.

... Scenario : Uncorrelated Partially Identifying Variables with Uncorrelated Errors (ACAR and DCAR)
In this scenario, the five partially identifying variables were sampled from independent discrete uniform distributions with 6, 7, 4, 8, and 2 unique values, meaning that they agree completely at random in nonmatches. In addition, we introduced respectively 10%, 5%, 2%, 0%, and 5% uncorrelated errors in the partially identifying variables. Errors were introduced by changing the value of the partially identifying variable to one of the other possible values. Given our parameter choices, we expected that $n(1 - \exp\{-\int_0^{t_B} h_0(u)\,du\}) \approx 265$ individuals experienced the event before $t_B$ (i.e., 265 true matches). However, because the partially identifying variables contained errors, approximately 54 (20.4%) of these true matches had a comparison vector in which one or more partially identifying variables disagreed. In addition, because the partially identifying variables had a limited number of unique values, approximately 279 nonmatches were expected to have a comparison vector in which all the partially identifying variables agreed. This was a substantial proportion of the record pairs in which all the partially identifying variables agreed: of the approximately 490 such record pairs, about 279 (56.9%) were nonmatches.
The performance of the joint model was compared to a two-stage procedure to estimate the parameters from the time-to-event submodel. In the first stage, the matching indicators were estimated using the record linkage mixture model from (4). We used the post hoc procedure from (5) to guarantee one-to-one matching. In the second stage, the estimated matching indicators were used in the time-to-event model from (7). Similarly to the joint model, a parametric proportional hazards model was fitted to the data. In both models, we correctly assumed ACAR and DCAR for the record linkage (sub)model and we used a natural cubic spline to estimate the log baseline hazard.

Scenario 2a: Correlated Partially Identifying Variables with Uncorrelated Errors (Not ACAR and DCAR)
In this scenario, the values of the first and second partially identifying variable were (positively) correlated by sampling them from a bivariate normal distribution with means zero, variances one, and covariance 0.3. Both partially identifying variables were converted into discrete variables with 10 equally likely values by using the quantile function of the standard normal distribution.
Because the values of these partially identifying variables were correlated, the ACAR assumption did not hold. The remaining three partially identifying variables were sampled from independent discrete uniform distributions with 4, 8, and 2 unique values. In addition, we introduced respectively 10%, 5%, 2%, 0%, and 5% uncorrelated errors in the partially identifying variables, implying that the partially identifying variables disagreed completely at random in matches.
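The discretization of the correlated normals described above can be sketched with the standard library only; `math.erf` gives the standard normal CDF, and binning on the probability scale yields 10 equally likely categories:

```python
import math
import random

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def correlated_discrete_pair(rho=0.3, n_values=10):
    """Two correlated partially identifying variables with n_values
    equally likely categories, obtained by binning correlated
    standard normals on the probability scale."""
    z1 = random.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * random.gauss(0.0, 1.0)
    to_cat = lambda z: min(int(phi(z) * n_values), n_values - 1)
    return to_cat(z1), to_cat(z2)

random.seed(7)
pairs = [correlated_discrete_pair() for _ in range(20000)]
```

Each marginal is (approximately) uniform over the 10 categories, while the categories remain positively correlated, which is exactly what violates ACAR in this scenario.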
Two joint models were fitted to the simulated data. In the first joint model, we captured the correlation between the comparison outcomes of the first two partially identifying variables in nonmatches. Instead of (2b), we used
$$p(g_{ij} \mid d_{ij} = 0) = p(g_{ij1}, g_{ij2} \mid d_{ij} = 0) \prod_{k=3}^{q} \upsilon_k^{g_{ijk}} (1 - \upsilon_k)^{1 - g_{ijk}},$$
where $p(g_{ij1}, g_{ij2} \mid d_{ij} = 0)$ is the joint probability of the comparison outcomes of the first two partially identifying variables in nonmatches. In the second joint model, we wrongly assumed ACAR and used (2b) in the record linkage parts of the joint likelihood regarding the nonmatches. In both joint models, a natural cubic spline was used to estimate the log baseline hazard.

... Scenario 2b: Uncorrelated Partially Identifying Variables with Errors that Depend on Time (ACAR and not DCAR)
The five partially identifying variables were sampled from independent discrete uniform distributions with 30, 7, 4, 8, and 2 unique values, meaning that all partially identifying variables agreed completely at random in nonmatches. Although the first partially identifying variable had a large number of unique values and thus appeared to be a relatively strong identifier, it contained a large amount of error. Moreover, this amount increased with time, meaning that the DCAR assumption did not hold. The probability that the first partially identifying variable agreed in matches was given by μ1(tij) in (14), where we used φ0 = 3 and φ1 = −1.25 to simulate the data. The other four partially identifying variables contained 5%, 2%, 0%, and 5% uncorrelated errors.
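These data can be generated along the following lines (Python sketch; since (14) is not reproduced in this excerpt, the logistic form expit(φ0 + φ1·t) is our assumption, chosen so that the agreement probability decreases in time when φ1 < 0):

```python
import numpy as np

rng = np.random.default_rng(7)
phi0, phi1 = 3.0, -1.25  # values used in the scenario

def p_agree(t):
    """Assumed form of (14): probability that the first partially
    identifying variable agrees in a match with event time t."""
    return 1.0 / (1.0 + np.exp(-(phi0 + phi1 * t)))

# Illustrative event times for matched pairs
t = rng.exponential(1.0, size=5000)

# Binary comparison outcome for the first identifier in matches:
# agreement becomes rarer as the event time grows
g1 = rng.random(t.size) < p_agree(t)

early = g1[t < 0.5].mean()  # agreement rate for early events
late = g1[t > 2.0].mean()   # agreement rate for late events
```

A model that wrongly assumes DCAR replaces p_agree(t) by a constant, which is exactly the misspecification studied in this scenario.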
Two joint models were fitted to the simulated data. In the first model, we replaced μ1 from (2a) with μ1(tij) from (14) to capture the relation between time and the probability that the first partially identifying variable agreed in matches. In the second joint model, we wrongly assumed DCAR and used (2a) in the record linkage parts of the joint likelihood regarding the matches. In both joint models, a natural cubic spline was used to estimate the log baseline hazard.

Results
The simulation results regarding the regression parameters are summarized in Table 1. In addition, the estimated baseline hazard curves are given in Figure 1. The computational run-times of the joint models were comparable in both scenarios. On average, the regression parameters were estimated in about 18 minutes; estimating the Hessian matrix took about 9 minutes.

Scenario 1. The joint model gave unbiased regression parameter estimates with good coverage of the 95% confidence interval. With respect to the baseline hazard, the joint model gave slightly overestimated estimates. However, the coverage of the 95% confidence interval was relatively good: on average 0.92. With the two-stage method, the regression parameter estimates were all biased toward zero. The baseline hazard was accurately estimated and had a lower mean squared error than with the joint model. The variance estimate of the baseline hazard, however, was too low, which led to a poor coverage of the 95% confidence interval: on average 0.82.
Scenario 2. In Scenario 2, the joint model gave unbiased regression parameter estimates with good coverage of the 95% confidence interval, even when the probabilistic record linkage submodel was misspecified. Similarly to Scenario 1, the baseline hazard was slightly overestimated with a correctly specified probabilistic record linkage submodel. However, the coverage of the 95% confidence interval was relatively good: on average 0.92 in Scenario 2a and 0.93 in Scenario 2b.
Misspecifying the probabilistic record linkage submodel, however, led to biased baseline hazard estimates. In Scenario 2a, wrongly assuming ACAR resulted in an overestimation of the baseline hazard. This can be explained by the fact that the identifying strength of the first two partially identifying variables was overestimated, leading to an overestimation of the number of matches (and hence events).
Wrongly assuming DCAR in Scenario 2b also led to biased baseline hazard estimates. In Scenario 2b, the amount of error in the first partially identifying variable increased over time. Under the DCAR assumption, this trend was ignored, which resulted in an overestimation of the baseline hazard at the beginning of the follow-up period (i.e., too many matches with short follow-up) and an underestimation at the end of the follow-up period (i.e., too few matches with long follow-up).

Application: Analysis of the Time to Second Delivery Using the Perinatal Registry Netherlands

Data Description
The proposed joint model was fitted to data from the Perinatal Registry Netherlands (PRN), which is a large nationwide registry containing information from pregnancies with a duration of at least 22 weeks (Méray et al. 2007). We were interested in the probability of a second delivery given the characteristics of the mother (age and social economic status) at the first delivery and characteristics of the first delivery (gender of the child, whether the pregnancy was induced with assisted reproductive technology, and pre-term birth, defined as gestational age < 37 weeks). In this analysis, we focused on mothers between 20 and 40 years of age. We assumed a proportional hazards model for the time to second delivery. File A contained the baseline measurements from n = 79830 mothers with a first delivery in 1999. The baseline characteristics are given in Table 2. File B contained the data from m = 696023 second deliveries from 1999 until 2010. We assumed that, when a mother with a delivery in 1999 had a second delivery between 1999 and 2010, the second delivery had been registered in B.
To distinguish matches from nonmatches, we relied on the partially identifying variables "birth date of the mother" and current "place of residence," described by a four-digit zip-code, both of which had been registered for all first and second deliveries without missing values. In addition, because the date of the previous delivery was registered for 361871 (52.0%) second deliveries in B, a third partially identifying variable "date of delivery" was created by comparing the date of the previous delivery from the second deliveries in B to the date of delivery from the first deliveries in A. With respect to the probabilistic record linkage submodel, we assumed that all partially identifying variables agreed completely at random in nonmatches. In addition, we assumed that the birth date of the mother and the date of delivery disagreed completely at random in matches. The probability that place of residence disagreed might increase with the time between the first and second delivery. Moreover, mothers with less economic stability (i.e., low social economic status) could move more often. Therefore, the probability of a disagreement of place of residence in matches depended on maternal characteristics and the time between deliveries.
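The construction of the comparison vectors can be sketched as follows (Python; the field names and toy records are hypothetical, and the binary comparison in (1) is taken to be exact equality):

```python
# Hypothetical records: one first delivery (file A) and candidate
# second deliveries (file B); dates as ISO strings, zip as 4 digits
record_a = {"birth_date": "1975-03-14", "zip": "1012",
            "delivery_date": "1999-06-01"}
file_b = [
    {"birth_date": "1975-03-14", "zip": "1012", "prev_delivery": "1999-06-01"},
    {"birth_date": "1975-03-14", "zip": "2511", "prev_delivery": None},
    {"birth_date": "1981-11-02", "zip": "1012", "prev_delivery": "2001-01-20"},
]

def compare(a, b):
    """Binary comparison vector g and availability indicator w3.
    g[k] = 1 if the k-th partially identifying variable agrees;
    w3 = 0 when the date of the previous delivery is missing in B."""
    g1 = int(a["birth_date"] == b["birth_date"])
    g2 = int(a["zip"] == b["zip"])
    w3 = int(b["prev_delivery"] is not None)
    g3 = int(w3 and a["delivery_date"] == b["prev_delivery"])
    return (g1, g2, g3), w3

pairs = [compare(record_a, b) for b in file_b]
# pairs -> [((1, 1, 1), 1), ((1, 0, 0), 0), ((0, 1, 0), 1)]
```

In the full procedure, such comparison vectors are computed for all candidate record pairs and fed into the record linkage submodel.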
Let birth date of the mother, place of residence, and date of delivery be respectively the first, second, and third partially identifying variable. We used a binary comparison function to compare the values of the partially identifying variables (see (1)) and used the following probabilistic record linkage submodel:

p(g_ij | d_ij = 1, z_ij; μ, φ) = μ_1^g_ij1 (1 − μ_1)^(1 − g_ij1) × μ_2(z_ij; φ)^g_ij2 {1 − μ_2(z_ij; φ)}^(1 − g_ij2) × [μ_3^g_ij3 (1 − μ_3)^(1 − g_ij3)]^w_ij3,

where μ_2(z_ij; φ) is the probability that place of residence agreed in a match and the vector z_ij contained the time between deliveries, the age of the mother at the first delivery, and the social economic status of the mother at the first delivery. The indicator w_ij3 described for record pair ij whether the comparison value of date of delivery was available (w_ij3 = 1) or missing (w_ij3 = 0). We assumed that the date of the previous delivery registered in B was missing completely at random.

Let t be the time since the first delivery in weeks. Because a delivery only had been registered in the PRN registry after a pregnancy of at least 22 weeks, the risk of experiencing a second delivery is zero if t ≤ 22. For our model, this meant that the baseline hazard h_0(t|α) = 0 if t ≤ 22. The log baseline hazard was parametrized with a natural cubic spline. Since we suspected a highly nonlinear association between the age of the mother and the probability of a second delivery, a cubic B-spline was used to model the effect of the age of the mother. The effect of age was fixed to be zero at the age of 30 years. A computer cluster with 40 Intel Xeon (2.40 GHz) processors was used to fit the joint model to the data.
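Under the assumptions above, the match part of the record linkage likelihood factorizes over the three comparison outcomes. The sketch below (Python) illustrates this factorization; the logistic parametrization of the place-of-residence agreement probability and all numeric values are our own illustrative assumptions, not the article's estimates:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def match_likelihood(g, w3, z, mu1, mu3, phi):
    """p(g | match, z) under conditional independence: constant
    agreement probabilities mu1 (birth date) and mu3 (date of
    delivery), and a covariate-dependent probability expit(phi'z)
    for place of residence.  The date-of-delivery factor only
    contributes when its comparison is available (w3 = 1)."""
    g1, g2, g3 = g
    mu2 = expit(np.dot(phi, z))  # assumed logistic link
    lik = mu1**g1 * (1 - mu1)**(1 - g1)
    lik *= mu2**g2 * (1 - mu2)**(1 - g2)
    if w3:
        lik *= mu3**g3 * (1 - mu3)**(1 - g3)
    return lik

# Toy covariates: intercept, time between deliveries (years), low SES
z = np.array([1.0, 3.0, 0.0])
phi = np.array([1.5, -0.2, -0.4])  # illustrative coefficients
p_all_agree = match_likelihood((1, 1, 1), 1, z, mu1=0.98, mu3=0.95, phi=phi)
```

Because each factor is a Bernoulli probability, the likelihood contributions sum to one over all possible comparison vectors for a given availability pattern.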

Results
Estimating the regression parameters of the joint model took about 17.8 hours. The estimation of the Hessian matrix took approximately 10.5 hours. The estimated parameters of the probabilistic record linkage submodel are given in Table 3. As expected, the probability that place of residence agreed in matches decreased with more time between the first and second delivery. The probability that place of residence agreed in a match consisting of a mother between 25 and 30 years of age at her first delivery with a normal social economic status and a second delivery which occurred 0-2 years later was about 0.78. This probability decreased to 0.65, 0.67, 0.55, and 0.50 when her second delivery happened respectively 2-4, 4-6, 6-8, or 8-10 years later.
The probability of an agreement was slightly higher for younger mothers (20-25 years of age) and substantially higher for older mothers (35-40 years of age). In addition, the probability of an agreement of place of residence in matches was lower for mothers with a low social economic status.
The effects of the characteristics of the first delivery on the probability of a second delivery are summarized in Table 3 and Figure 2. Mothers with a low social economic status at the first delivery had a lower probability of a second delivery. In addition, the use of assisted reproductive technology to achieve the pregnancy that led to the first delivery decreased the probability of a second delivery. No associations were found between the gender (i.e., a girl) and pre-term birth of the first child and having a second delivery. The relation between the age of the mother at the first delivery and the probability of a second delivery had a parabolic shape with a maximum around 30 years of age.

Figure 2. Estimated effect of the age of the mother at the first delivery on the probability of a second delivery. The estimated effect, given by the solid line, was fixed at zero at 30 years of age. The dotted lines, representing the 95% confidence interval, were obtained as follows. Consider that the effect of the age of the mother is described by g(t; ζ) = w(t)'ζ, where w(t) is the basis vector of the spline function describing the nonlinear association, t is the age of the mother, and ζ is the vector of spline function parameters. Assuming that the estimate ζ̂_n asymptotically follows a normal distribution, the standard error of the effect is se[g(t; ζ̂)] = √{w(t)' Σ w(t)}, where Σ = var(ζ̂_n) is the variance-covariance matrix of ζ̂_n. The 95% confidence interval was obtained by g(t; ζ̂) ± 1.96 × se[g(t; ζ̂)].

The estimated baseline hazard is given in Figure 3. The baseline hazard was fixed to zero until 22 weeks after the first delivery. After this time, the baseline hazard quickly increased and peaked at approximately 3.6 years after the first delivery. After this peak, the baseline hazard decreased quickly, and at approximately 6 years after the first delivery the hazard was almost zero.
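The confidence-band construction described in the Figure 2 caption can be sketched numerically (Python; the polynomial basis and covariance matrix below are illustrative stand-ins for the spline basis w(t) and Σ = var(ζ̂), which are not given in the article):

```python
import numpy as np

rng = np.random.default_rng(3)
ages = np.linspace(20, 40, 81)  # grid of maternal ages

# Illustrative polynomial basis standing in for the spline basis w(t),
# centered so the fitted effect is zero at age 30 (as in the article)
W = np.column_stack([(ages - 30) ** k for k in (1, 2, 3)])

zeta = np.array([0.05, -0.02, 0.0003])    # illustrative coefficients
A = rng.normal(size=(3, 3))
Sigma = 1e-6 * (A @ A.T + 3 * np.eye(3))  # illustrative pos.-def. var(zeta)

effect = W @ zeta                         # g(t; zeta) = w(t)' zeta
# Pointwise standard error sqrt(w(t)' Sigma w(t)) for every age
se = np.sqrt(np.einsum("ij,jk,ik->i", W, Sigma, W))
lower, upper = effect - 1.96 * se, effect + 1.96 * se
```

Because every basis function vanishes at age 30, both the effect and its standard error are exactly zero there, so the band collapses to a point at the reference age.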

Discussion
We proposed a joint model for the analysis of time-to-event or survival data obtained from multiple files. More specifically, we considered the situation in which baseline measurements for a set of individuals have been registered in one file and event data in the other file. In the absence of a unique identifier, we used the information from partially identifying variables to determine which records from both files belonged to the same individual. With simulations, we illustrated the performance of the joint model under misspecification of the probabilistic record linkage submodel. The results showed that the regression parameter estimates from the time-to-event submodel were robust to misspecification of the probabilistic record linkage submodel. The baseline hazard estimate, however, was affected by misspecification of the probabilistic record linkage submodel.
Further research should be focused on extending the joint model to more complex types of time-to-event data (e.g., competing risks data and left-truncated data) and to record linkage problems with more than two files. Dealing with more than two files leads to a huge increase in the number of possible matching indicator vectors. Starting points to reduce the computational burden can be found in the Bayesian record linkage literature, in which a number of methods have been proposed to deal with more than two files (Sadinle 2014; Steorts, Hall, and Fienberg 2016; Sadinle and Fienberg 2013).

Supplementary Materials
R code and example data: Implementation of the joint model in R (assuming DCAR/ACAR). In addition, two example files are provided. (zipped file)