Reliability of vertical-cavity surface-emitting laser arrays with redundancy

This paper describes a theoretical reliability analysis of a system containing n optical ports in which each optical port contains m redundant vertical-cavity surface-emitting lasers. We study the wearout failure statistics, modelled with a lognormal distribution, for three different chip-level integration approaches: (A) each laser on its own chip, resulting in m·n chips, (B) the m redundant lasers associated with one channel on a single chip, resulting in n chips, and (C) all m·n lasers integrated on a single chip. We present a model that includes the run-to-run fluctuation of the reliability parameters and find that the three integration schemes consistently exhibit MTTF(C) ≥ MTTF(A) ≥ MTTF(B) for lognormal distribution shape parameters observed in commercial vertical-cavity surface-emitting lasers. We also provide analytic approximations for the failure statistics of all three integration approaches, enabling straightforward calculation of the failure statistics for any redundancy and channel number.


Introduction
The expected growth in demand for data transmission bandwidth places increasingly difficult requirements on the density, thermal management, and cost of components in data centres, where aggregate data transfer presently exceeds petabytes per second. Existing and emerging technologies that address this demand increasingly rely on optical interconnects in place of copper because of the promise of lower power consumption and price, yet data centres still require interconnect components with lower cost, power dissipation, and size. Today, the cost of transceiver manufacturing is limited by packaging cost and yield rather than by the electronic or optical components. Vertical-cavity surface-emitting laser (VCSEL) based technologies are preferred in environments where low power consumption and high density are of primary interest. Presently, VCSELs driven by SiGe driver circuitry operate at line rates above 50 Gb/s [1]. Silicon and silica-based packaging platforms offer flexibility in the choice of materials and allow 3D micromachining using standard semiconductor processes at the wafer level [2].
In highly integrated systems, field replacement of optical components will not be available, and multi-chip module replacement will become an important maintenance cost issue. One way of addressing this issue is to use redundant optical sources in the same multi-chip module. The topic of this work is the reliability of modules that comprise multiple channels with redundant lasers in each channel. Our focus is on vertical-cavity surface-emitting lasers (VCSELs) as sources that can be integrated on the same chip in a straightforward fashion, and on the reliability expectation of such modules. The subject of this analysis is a multi-chip module with multiple (n) optical ports, each of the n ports having m redundant devices. In each optical port, the redundant optical devices are arranged in such a way that they all couple simultaneously into the respective fibre.
An illustration of such an integration with n = 4 and m = 4 is shown in Figure 1. Here groups of four lasers are coupled into their respective fibres. The lasers of one group are redundant: when the first device in a single channel fails, the next one is turned on electronically without significant interruption to the system operation. Each of the four VCSELs is coupled into its respective channel fibre or, in the case of a coarse-WDM (CWDM) source, into one core of a single fibre using a suitable optical interposer.
Figure 1 shows three approaches for integrating electronic or optical devices into an array of n channels with m redundant devices per channel. The figure illustrates this with VCSELs for m = n = 4, and names the integration strategies A, B, and C.
The three strategies (with m ≥ 1) describe (A) the case in which redundant sources and the sources of different channels are separate chips and generally come from different manufacturing subpopulations, (B) the case in which the redundant sources of a channel come on a single chip, namely from the same manufacturing subpopulation, but different channels come from different subpopulations, and (C) the case in which all the devices are integrated on a single chip and hence all devices come from the same manufacturing subpopulation. Case (B) is common in the integration of CWDM devices, where it is more practical to manufacture arrays of VCSELs of the same wavelength on the same chip and to manufacture VCSELs of a different wavelength on another wafer. Case (C) refers to the case in which even multiple wavelengths appear on the same wafer. It is clear that if all VCSELs had identical failure probability and the failures inherent to array chips were neglected, all three integration schemes would show identical failure statistics. However, experimental evidence available in the industry shows that device reliability varies from run to run. This means that, when chips come from different runs or manufacturers, the three integration approaches must lead to different system life expectancies.
In this work, we are interested in determining which one of the three approaches shown in Figure 1 is expected to exhibit the longest useful life (longest time to wearout) according to theory, using commercially available subpopulation reliability characteristics. It will be shown that approach (B) results in the worst life expectancy, while the choice between approaches (A) and (C) is not so clear: the 1% cumulative failure time is best for strategy (A), while the mean time to failure (MTTF) is highest for the full integration approach (C).
In this work, we develop a mathematical model for the reliability of multi-chip modules with n channels and m redundant devices in which the wearout-failure statistics of each individual device are known. We first briefly introduce the reliability concepts needed for the analysis and apply them to a multi-chip module in which devices fail randomly, an assumption that leads to an analytic solution. In the second, central part of this work, we discuss a system reliability model that accounts for the manufacturing variance of reliability parameters. The model is applicable to any type of optoelectronic device and to arbitrary failure distributions, but we focus our attention on VCSEL wearout described with lognormal statistics, because experimental data are available in the industry. Optical modules contain mechanical, electronic, and optoelectronic elements. The primary failure concern is the degradation of the laser. Infant mortality in optical modules employing VCSELs is dramatically reduced, if not entirely eliminated, by a suitable burn-in process that is applied to every module. As burn-in is ubiquitous in the optical component industry, the devices ultimately fail due to random failures or wearout. Random failures, characterized by a constant hazard rate, exhibit exponentially distributed failure times. The wearout failure of optoelectronic components has successfully been described using the lognormal distribution. This has also been confirmed for VCSELs through extensive reliability studies [3,4]. The present approach, illustrated on lognormal wearout statistics, can equally be applied to random failures and also allows straightforward inclusion of the failure probability of the electronic circuitry associated with each device, provided the manufacturing variance of these failures is quantified.

Modelling random failures
The channels of the system shown in Figure 1 are assumed to fail independently with identical failure probability functions; the system operates with the channels arranged in series, or first-to-fail mode [5][6][7]. Each channel has m redundant devices operating in such a way that the next device turns on when the previous device fails. The failure times of the redundant devices add, thereby increasing the life of the overall line card. The reliability of the other electronics is omitted from this analysis for simplicity, but it is understood that their failure can be accounted for with their own survival functions. The failure cumulative distribution function (CDF) F(t) for the system shown in Figure 1 is
$$F(t) = 1 - \left[1 - F_m(t)\right]^n,$$
where F m (t) is the probability that a channel with m redundant devices will fail before time t.
The failure probability density function (PDF) of a channel with m redundant devices is given by the convolution
$$f_m(t) = \int_0^t f_{m-1}(\tau)\, f(t-\tau)\, d\tau, \qquad f_1(t) = f(t). \tag{3}$$
Equation (3) assumes that the failure time of the last of m redundant devices is the sum of the failure times of the individual redundant devices and that the device failures are independent. The MTTF of this system is obtained by using [5]
$$\mathrm{MTTF} = \int_0^\infty t\, \frac{dF(t)}{dt}\, dt = \int_0^\infty \left[1 - F(t)\right] dt. \tag{4}$$
The 1%CDF, i.e. the time at which the cumulative failure probability reaches 1%, is determined by setting p 0 = 0.01 in
$$F(t) = p_0 \tag{5}$$
and solving for t. With these general statements, we briefly illustrate the behaviour of random failure of the optical end of our multi-chip module. For random failures we use the exponential failure distribution [5], f(t) = λ·exp(−λt). A channel with m redundant devices has the failure PDF given by the gamma distribution,
$$f_m(t) = \frac{\lambda^m t^{m-1}}{(m-1)!}\, e^{-\lambda t}. \tag{6}$$
The CDF is given by the regularized incomplete gamma function,
$$F_m(t) = \frac{\gamma(m, \lambda t)}{\Gamma(m)}. \tag{7}$$
From the recursion relation for the incomplete gamma function (for integers) [8] we derive the cumulative failure distribution F m,n (t) of this system as
$$F_{m,n}(t) = 1 - \left[e^{-\lambda t} \sum_{k=0}^{m-1} \frac{(\lambda t)^k}{k!}\right]^n. \tag{8}$$
The computed values of the two figures of merit, MTTF and 1%CDF, are shown in Figure 2. The MTTF has been obtained by numerically integrating Equation (4), while the 1%CDF is determined from Equation (5).
There are several interesting features of these redundant systems evident in Figure 2. As expected, when using only one channel (n = 1), the MTTF increases linearly with the number of redundant devices: λ·MTTF(m,1) = m. On the other hand, keeping a single laser per channel (m = 1) and increasing the number of channels decreases the MTTF in inverse proportion: λ·MTTF(1,n) = 1/n. The figure also shows that the MTTF of a three-channel device (n = 3) can be made approximately equal to that of a single-channel, single-laser device if two redundant devices (m = 2) are used per channel, and that the same holds for a nine-channel device (n = 9) if three redundant devices (m = 3) are used per channel. The graph reveals a dramatic improvement of the 1%CFT when more than one redundant device is used. The primary reason for the large jump between m = 1 and m = 2 is the change in the PDF shape: for m = 1, the PDF is an exponential function with a maximum at t = 0, while for m > 1 the PDF is a gamma distribution with a maximum that is always at t > 0. The difference in behaviour between the MTTF and the 1%CDF time will also be evident in the wearout characteristics shown next.
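The random-failure figures of merit can be reproduced with a few lines of code. The sketch below is illustrative only: it assumes the closed-form survival function S(t) = [exp(−λt)·Σ_{k<m} (λt)^k/k!]^n for n channels with m exponentially failing redundant devices each, integrates it with Simpson's rule to obtain the MTTF, and finds the 1% cumulative failure time by bisection. The function names, the integration limit, and the step count are choices made for this example, not part of the original analysis.

```python
import math

def survival(t, m, n, lam=1.0):
    """Module survival: n channels in series, m exponential devices used one after another."""
    x = lam * t
    channel = math.exp(-x) * sum(x ** k / math.factorial(k) for k in range(m))
    return channel ** n

def mttf(m, n, lam=1.0, steps=20000):
    """MTTF as the integral of the survival function (composite Simpson rule)."""
    t_max = (20.0 + 10.0 * m) / lam      # assumed cut-off; the tail beyond it is negligible
    h = t_max / steps                    # steps must be even for Simpson's rule
    total = survival(0.0, m, n, lam) + survival(t_max, m, n, lam)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * survival(i * h, m, n, lam)
    return total * h / 3.0

def cft(m, n, p0=0.01, lam=1.0):
    """Time at which the cumulative failure probability reaches p0 (bisection)."""
    lo, hi = 0.0, (20.0 + 10.0 * m) / lam
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 1.0 - survival(mid, m, n, lam) < p0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    for m, n in [(1, 1), (2, 3), (3, 9)]:
        print(m, n, round(mttf(m, n), 4), round(cft(m, n), 4))
```

For (m, n) = (2, 3) the integral evaluates to 26/27 ≈ 0.963 in units of 1/λ, which is why the three-channel module with two devices per channel comes close to, but does not exactly match, the single-device MTTF.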

Modelling wearout failures
We now address the wearout statistics and their manufacturing variation. Two terms will be needed in the upcoming discussion. The term entire device population means all devices associated with a certain design or manufacturing process. For example, it can mean all devices coming from a specific manufacturer that are interchangeable in our multi-channel module with redundant devices. A device subpopulation is the smallest set of devices for which the failure statistics can be practically assessed. The failure statistics depend on the device design, epilayer design, growth parameters, and the epilayer variation across a wafer. A subpopulation may mean a group of neighbouring VCSELs on a single chip, VCSELs coming from an area on a wafer, or VCSELs coming from a single wafer, as long as we can treat them as (a) having identical failure statistics and (b) failing independently of each other. This definition (and the subsequent analysis) neglects any failures that are associated with building arrays of different sizes, damage that comes from fabricating larger or smaller chips, or thermal cross-talk, for example. It only focuses on independent device failure. In practice, the size of the subpopulation will be determined by the sample size necessary to estimate the reliability parameters. The choice of the size of the subpopulation (multiple wafers in a single wafer run, a wafer, or an area on a wafer) does not reduce the generality of our analysis. Evidently, every device subpopulation is a subset of the entire device population.
The probability that a device originated from subpopulation r is denoted with p r . The conditional failure probability density that a device from subpopulation r will fail at time t is denoted with f r (t). The failure distribution of the entire device population is denoted with f e (t). It is straightforward to show that f e (t) is a weighted sum of the failure distributions of each subpopulation:
$$f_e(t) = \sum_{r=1}^{q} p_r\, f_r(t), \tag{9}$$
where q is the number of subpopulations. Using these definitions, we now define the reliability statistics for the three integration approaches shown in Figure 1.
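Case (A): Each redundant device (and each channel) is an independent device randomly selected from all subpopulations. The expected failure probability density function of one device is given by the weighted sum,
$$f_A(t) = \sum_{r=1}^{q} p_r\, f_r(t). \tag{10}$$
Every time a redundant component fails, it is replaced with another that is also randomly selected from all subpopulations. The failure probability density for the last of m redundant devices and the survival function S A (t) of an n-channel module are given by
$$f_{A,m}(t) = \underbrace{(f_A * f_A * \cdots * f_A)}_{m\ \text{terms}}(t), \tag{11}$$
$$S_A(t) = \left[1 - \int_0^t f_{A,m}(\tau)\, d\tau\right]^n, \tag{12}$$
where * denotes convolution as in Equation (3).
Case (B): The redundant devices come on one chip and there are n independent chips that form the n channels of the module. Each channel now comes from a different subpopulation, while the redundant devices of a channel belong to the same subpopulation. For each subpopulation (or channel), the PDF is given by
$$f_{m,r}(t) = \underbrace{(f_r * f_r * \cdots * f_r)}_{m\ \text{terms}}(t). \tag{13}$$
The failure PDF for a randomly selected channel chip is given by the weighted sum
$$f_{B,m}(t) = \sum_{r=1}^{q} p_r\, f_{m,r}(t). \tag{14}$$
The survival function S B (t) of an n-channel module is given by
$$S_B(t) = \left[1 - \int_0^t f_{B,m}(\tau)\, d\tau\right]^n. \tag{15}$$
Case (C): All of the devices are on the same chip and hence belong to the same subpopulation. The survival function S C (t) of the arrayed chip is given by the weighted sum over subpopulations:
$$S_C(t) = \sum_{r=1}^{q} p_r \left[S_{m,r}(t)\right]^n, \qquad S_{m,r}(t) = 1 - \int_0^t f_{m,r}(\tau)\, d\tau, \tag{18}$$
where S m,r (t) is the probability that a channel with m redundant devices selected from subpopulation r will be operational at time t.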
As a sanity check, we note that for m = 1, S A (t) = S B (t) and for n = 1 we have S B (t) = S C (t), and hence for m = 1 and n = 1, or when there is no subpopulation variation we have S A (t) = S B (t) = S C (t).
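The sanity check can be made concrete for the simplest case m = 1, where no convolution is needed. The sketch below assumes the forms read from the case descriptions above: for case (B) the per-channel survival is averaged over subpopulations and then raised to the n-th power, whereas for case (C) the n-th power is taken within each subpopulation before averaging. The three subpopulations (weights and median lifetimes) and the shape parameter are hypothetical values chosen only for illustration.

```python
import math

def lognormal_sf(t, t_med, sigma):
    """Survival function of a lognormal with median t_med and shape parameter sigma."""
    if t <= 0.0:
        return 1.0
    z = math.log(t / t_med) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)

# Hypothetical subpopulations: (probability p_r, median failure time t_r)
SUBPOPS = [(0.25, 0.7), (0.50, 1.0), (0.25, 1.4)]
SIGMA_S = 0.2  # assumed common subpopulation shape parameter

def s_b(t, n):
    """Case (B) with m = 1 (identical to case (A)): average first, then n channels in series."""
    mix = sum(p * lognormal_sf(t, t_r, SIGMA_S) for p, t_r in SUBPOPS)
    return mix ** n

def s_c(t, n):
    """Case (C) with m = 1: all n channels share one subpopulation, then average."""
    return sum(p * lognormal_sf(t, t_r, SIGMA_S) ** n for p, t_r in SUBPOPS)

if __name__ == "__main__":
    for t in (0.4, 0.6, 0.8, 1.0):
        print(t, s_b(t, 1), s_c(t, 1), s_b(t, 10), s_c(t, 10))
```

For n = 1 the two functions coincide. For n > 1 the convexity of x^n on [0, 1] guarantees s_c ≥ s_b at every t, which is the single-device version of the advantage of full integration discussed later in the text.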
The model described by survival functions (12), (15), and (18) is applicable to arbitrary failure or subpopulation parameter distributions. Most notably, the electronic circuitry failures or random failures of the optical components can be introduced into the individual subpopulation failure distributions in Equations (10), (13), and (16). We will now focus on the wearout phase and use the experimental data available in the field to provide a quantitative comparison between the different integration schemes shown in Figure 1. Experimental evidence from the manufacturers of VCSELs indicates that the wearout failure statistics of VCSELs, i.e. of entire device populations, may, with a few exceptions, be described with a lognormal failure distribution. Although significantly more scarce and not publicly available, evidence exists that the failure statistics of device subpopulations (wafers and areas on wafers) also approximately obey a lognormal distribution, albeit with different shape parameters.
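The failure probability density function for a device selected from the entire device population is then given by [5],
$$f_e(t) = \frac{1}{\sigma_e\, t \sqrt{2\pi}} \exp\!\left[-\frac{\left(\ln t - \ln t_e\right)^2}{2\sigma_e^2}\right]. \tag{19}$$
Here σ e and t e (the 50% cumulative failure time) are assessed on the entire device population. From now on we will use LN(t|t e ,σ e ) to specify the lognormal distribution. The cumulative distribution function of (19) is given by
$$F_e(t) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{\ln(t/t_e)}{\sigma_e \sqrt{2}}\right)\right], \tag{20}$$
where erf(·) is the error function.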
Characterizations of the failure statistics of VCSELs from multiple wafer runs by a variety of VCSEL manufacturers have exhibited the following lognormal shape parameter values: σ e = 0.45 [9], σ e = 0.48 [10], σ e = 0.5 [11], σ e = 0.75 [3], and σ e ≈ 0.9 [12]. Reference [10] fits the wearout statistics evaluated on devices from multiple wafer runs to the Weibull distribution [5] with shape parameter β = 2.7. This distribution can be fitted reasonably well to a lognormal distribution with shape parameter σ e = 0.48. As a consequence, for our analysis, we select σ e = 0.5 as the most representative value for the entire device population.
The failure probability density function of a device in subpopulation r is approximately described with a lognormal failure distribution:
$$f_r(t) = \mathrm{LN}(t\,|\,t_r, \sigma_r). \tag{21}$$
Here the subscript r refers to the subpopulation number. As r represents either a wafer or a chip number, r is an integer that ranges from one to the number of subpopulations. As noted above, real subpopulation failure PDFs were also observed to be approximately lognormal, but with a shape parameter that can be as low as σ r = 0.2 [9], [13].
In our model, we require that all subpopulations have equal shape parameters, and select σ s = 0.2. Setting σ r ≡ σ s in (21) reveals that the only PDF parameter that varies between subpopulations is the 50% cumulative failure time t r . This simple restriction, along with f r (t) and f e (t) being lognormal, uniquely specifies the shape of the subpopulation distribution p r : the time t r obeys a lognormal distribution. To show this we write μ r = ln t r as μ r = μ avg + ν, where ν is normally distributed with N(0,σ ν ) and μ avg = ln t r (avg) = E[ln t r ]. Here E[·] stands for the expected value of a random variable (over all subpopulations), so that p(μ r ) = N(μ r |ln t r (avg) ,σ ν ). We convert the sum (9) into an integral, which amounts to saying that t r varies continuously between subpopulations, using p(t r ) = LN(t r |t e ,σ ν ):
$$f_e(t) = \int_0^\infty \mathrm{LN}(t\,|\,t_r, \sigma_s)\; \mathrm{LN}(t_r\,|\,t_e, \sigma_\nu)\; dt_r. \tag{22}$$
The integral evaluation yields a lognormal distribution (19) with shape parameter σ e ² = σ s ² + σ ν ².
From the assumed values σ s = 0.2 and σ e = 0.5 we find σ ν = 0.46. Note that the repeated application of the weighted sum (22) always leads to a distribution that is asymmetric in t. This is different from the convolution of the lognormal distribution with itself, which does not lead to another lognormal distribution. The Central Limit Theorem [14] teaches that repeated convolution of the lognormal distribution ultimately leads to a normal distribution, which is symmetric in t. The relationship between the subpopulation failure distributions and the failure distribution of the entire population is graphically illustrated in Figure 3. This figure shows three subpopulation distributions f r (t), each with σ s = 0.2, for cumulative distribution of the subpopulation 1/4, 1/2, and 3/4, which correspond to t r = 0.8·t e , t e , and 1.24·t e . The entire device population (combined) distribution f e (t) has shape parameter equal to σ e = 0.5.
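The variance relation σ e ² = σ s ² + σ ν ² can be checked numerically. The following sketch evaluates the continuous weighted sum with the trapezoid rule in the variable u = ln(t r /t e ), for which the lognormal weight p(t r ) becomes a zero-mean normal density, and compares the result with a single lognormal of shape σ e . The grid size and integration span are assumptions of this example; t e is set to 1 without loss of generality.

```python
import math

SIGMA_S, SIGMA_E, T_E = 0.2, 0.5, 1.0
SIGMA_NU = math.sqrt(SIGMA_E ** 2 - SIGMA_S ** 2)   # about 0.458

def lognormal_pdf(t, t_med, sigma):
    """Lognormal density with median t_med and shape parameter sigma."""
    z = math.log(t / t_med) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(t, points=2001, span=8.0):
    """Weighted sum over subpopulations, integrated over u = ln(t_r / t_e)."""
    lo, hi = -span * SIGMA_NU, span * SIGMA_NU
    h = (hi - lo) / (points - 1)
    total = 0.0
    for i in range(points):
        u = lo + i * h
        weight = 0.5 if i in (0, points - 1) else 1.0          # trapezoid end weights
        p_u = math.exp(-0.5 * (u / SIGMA_NU) ** 2) / (SIGMA_NU * math.sqrt(2.0 * math.pi))
        total += weight * lognormal_pdf(t, T_E * math.exp(u), SIGMA_S) * p_u
    return total * h

if __name__ == "__main__":
    print("sigma_nu =", round(SIGMA_NU, 3))
    for t in (0.3, 1.0, 2.5):
        print(t, mixture_pdf(t), lognormal_pdf(t, T_E, SIGMA_E))
```

The two columns printed for each t agree to within the integration error, which illustrates that the mixture of narrow lognormal subpopulations reproduces the broader lognormal of the entire population.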

Approximate analytic expressions
Before we embark on numerically solving the convolutions and the weighted sums, we provide several approximate analytic solutions that are helpful for a first-order comparison between the different integration schemes. First, we establish an approximation for repeated convolutions of the lognormal distribution. It can be shown that for small σ, the lognormal distribution can be approximated by a normal distribution: LN(t r ,σ) ≈ N(t r ,t r σ). After m convolutions with itself, the normal distribution becomes N(m·t r ,t r σ√m). Approximating this normal distribution again by a lognormal distribution, we obtain an approximate expression (23) for a lognormal distribution convolved with itself m times. Without any correction, this chain of approximations gives LN(m·t r ,σ/√m); the factor η that enters (23) is a heuristic correction. Approximation (23) is now used to derive several analytic expressions for the failure statistics of integration approaches (A), (B), and (C). The analytic approximations are summarized in Table 1. Note that all of the expressions have been derived assuming that σ s < σ ν , and that the accuracy will depend on the relative magnitude of these two parameters and on the choice of quantity fitted (see Figures 9 and 10). For our choice of σ ν = 0.46 and σ s = 0.2, the prediction of the MTTF is accurate to better than 1% of the MTTF value.
Table 1 shows a summary of analytic expressions for the failure statistics of the three integration approaches (columns A, B, and C; rows for each channel and for the module). The number of channels and the number of redundant devices per channel are denoted with n and m, respectively. The failure statistics of each subpopulation are assumed to be lognormal with variance σ s ² and subpopulation 50% cumulative failure time t r . The variance σ s ² is fixed, while the 50% cumulative failure time t r varies between the subpopulations due to manufacturing variation and is distributed with a lognormal distribution with variance σ ν ² and 50% cumulative failure time t e . The failure statistics of the entire device population are assumed lognormal with 50% cumulative failure time t e and variance equal to σ e ² = σ ν ² + σ s ². The heuristic factors η A and η B improve the lognormal fit to the convolution of n lognormal distributions, while the factor σ s ²/mn in column C (also heuristic) has been introduced to ensure that for n = 1 we have F B (t) = F C (t).
We can use the expressions shown in Table 1 to make some general statements about the integration approaches for the case when σ s = 0. We will use numerical calculations to check the analytic approximations and confirm that these statements are valid for σ s > 0 as well. Case (A) has a longer expected life than case (B) for m > 1 and any n, case (C) has a longer expected life than case (B) for any m and n > 1, and there is no (or very weak) channel-number dependence of the shape of the failure distribution in case (C). It is evident that for σ ν ≫ σ s the integration approach (B) yields the poorest life expectancy when lognormal wearout is assumed, but it is not immediately clear which of the two other approaches, (A) or (C), leads to a longer life expectancy for a wider range of failure distribution parameters. We address this question in the next section.
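The quality of the lognormal-to-normal-to-lognormal approximation can be gauged by comparing moments, which are known exactly for a sum of independent lognormal variables. The sketch below compares the exact mean and standard deviation of the sum of m lognormal lifetimes with those of the uncorrected approximation LN(m·t r ,σ/√m) that follows from the steps described above. It deliberately omits the heuristic factor η, whose form is not reproduced here, so the residual error it reports is an indication of what such a correction has to absorb; the function names are illustrative.

```python
import math

def lognormal_mean_std(t_med, sigma):
    """Mean and standard deviation of a lognormal with median t_med and shape sigma."""
    mean = t_med * math.exp(0.5 * sigma ** 2)
    std = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, std

def exact_sum_moments(m, t_r, sigma):
    """Exact mean and std of the sum of m independent LN(t_r, sigma) lifetimes."""
    mean1, std1 = lognormal_mean_std(t_r, sigma)
    return m * mean1, math.sqrt(m) * std1

def approx_sum_moments(m, t_r, sigma):
    """Moments of the uncorrected approximation LN(m * t_r, sigma / sqrt(m))."""
    return lognormal_mean_std(m * t_r, sigma / math.sqrt(m))

if __name__ == "__main__":
    for sigma in (0.1, 0.2, 0.5):
        exact = exact_sum_moments(4, 1.0, sigma)
        approx = approx_sum_moments(4, 1.0, sigma)
        print(sigma, exact, approx)
```

For σ = 0.2 and m = 4 the uncorrected approximation underestimates the mean by about 1.5% and the standard deviation by about 2%, and the discrepancy grows quickly with σ, consistent with the statement that the approximation is intended for small shape parameters.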

Numerical analysis
The analysis of unreliability for the different integration approaches was performed numerically using MATLAB. The time sampling was linear to enable the convolutions to be executed using the FFT algorithm. The number of time points used was 2^13 and the time increment was Δ(t/t r ) = 0.004. The number of points used to define p(r) was adjusted by checking how well the numerically evaluated weighted sum (22) matches a lognormal distribution when two lognormal distributions are used in the integral (22). The number was adjusted so that the cumulative distribution functions matched to better than 1% for all values for which CDF > 10^−6. The number of points varied between 19 and 41 depending on the values of m and n, and the fit was generally much better than 1% of the CDF value. All the results were also compared to the analytic approximations shown above.
In Figure 4 we show how the cumulative failure functions ("Unreliability") vary with an increasing number of channels. In this case (m = 1) we have F A (t) = F B (t), and hence we plot two sets of graphs as a function of the number of channels. Integrating devices on a single chip offers a significant advantage over using individual devices selected from different subpopulations. A simple example illustrates this. Suppose that 1% of the wafers have an unacceptably short lifetime and we are building a 100-channel module. In case (C), 99% of all integrated modules are likely to work properly, whereas in case (B) each channel has a 99% chance of working, which means that the module has only a 0.99^100 ≈ 37% chance of working properly. The analytic expressions given in Table 1 support this conclusion.
Figure 5 illustrates the unreliability for modules that have a single channel but a varying number of redundant devices. In this case n = 1, and we have F B (t) = F C (t). All the failure distributions follow an approximately lognormal behaviour, as expected from the analytic expressions given in Table 1. As expected, integrating devices on a single chip (case C) offers greater reliability over the other cases (A and B). Again, the scatter in reliability parameters among the subpopulations is clearly detrimental to the approaches that allow random selection of devices from the entire population.
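The bad-wafer example is simple arithmetic and can be written out explicitly. The sketch below assumes, as in the example, that a fraction of the wafers is unusable, that a fully integrated module (case C) draws one wafer, and that a module built from separate single-laser channel chips draws one wafer per channel; the function name and arguments are illustrative.

```python
def module_yield(good_wafer_fraction, channels, single_chip):
    """Probability that no channel of the module sits on a bad wafer."""
    if single_chip:
        return good_wafer_fraction            # one wafer draw for the whole module
    return good_wafer_fraction ** channels    # one independent wafer draw per channel

if __name__ == "__main__":
    print(f"case C: {module_yield(0.99, 100, True):.3f}")
    print(f"case B: {module_yield(0.99, 100, False):.3f}")
```

With 1% bad wafers and 100 channels the second value is 0.99^100 ≈ 0.366, so roughly two out of three modules built from independently drawn channel chips would contain at least one bad channel.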
The central question is how the unreliability depends on n and m when both n and m are greater than unity. These results are shown in Figure 6 for the three integration approaches with n = 100 and m varied from 1 to 4; the behaviour is representative of all other numbers of channels and redundant devices. We consistently find that the integration approach (B) delivers the poorest reliability expectation. Which of the two remaining approaches, (A) or (C), is more appropriate to use will depend on which statistic we are interested in: the 1% cumulative failure time or the mean time to failure. Although the unreliability of case (A) increases sharply with time, with the addition of redundant devices it improves faster than the unreliability of case (C). For the example shown in Figure 6, case (A) with m = 4 exhibits almost a factor of two better 1%CFT than the integration approach (C). Evidently, the 1%CDF is better for case (A) than for case (C) for all m > 1. The situation is quite different with the mean time to failure. As we can see in Figure 7, which shows the MTTF values calculated for the same set of m values, the MTTF values do not follow the same trend. When m > 1, the mean time to failure is best for case (C), the fully integrated chip.
Finally, we illustrate the dependence of the failure statistics for a wider range of lognormal distribution parameters. We do this on the same extreme example (n = 100, m = 4) as shown in Figures 6-8. We also compare the numerically calculated results with the analytic approximations given in Table 1. Consider a process whose reliability was characterized over many wafer runs. A randomly selected device obeys a lognormal failure probability with mean time to failure equal to MTTF(1,1) ≡ MTTF(m = 1, n = 1) and shape parameter σ e = 0.9. See, for example, reference [12] for a VCSEL process with such a shape parameter. Suppose further that we know that the failure statistics of devices coming from a single wafer also obey a lognormal failure distribution, but with a narrower shape, i.e. a shape parameter value smaller than that of the entire distribution: σ s < σ e . We desire to build a 100-channel module (n = 100) in which each channel will have four redundant devices (m = 4). In Figures 9 and 10 we show calculated values of MTTF and 1%CDF for this module as we vary σ e = 0.5, 0.7, 0.9 and the subpopulation spread σ s . The σ s values vary from 0 to σ e . While keeping σ e constant, σ s = 0 means that all devices on a single wafer fail at the same time, while the failure times vary from wafer to wafer. At the other end, when σ e = σ s , the failure statistics variation on one wafer is the same as across all wafers, i.e. it does not matter whether a device is selected randomly from one wafer or from any wafer.
The bold lines in Figures 9 and 10 show data determined exactly using convolution (numerically), while the thin dashed lines show the results obtained using the analytic expressions given in Table 1. Note that the MTTF value for cases (A) and (B) still had to be integrated numerically.

MTTF for cases A, B, and C
We take the integration approach (A) from Figure 1 and randomly select every one of the n·m devices that go into our module from the entire population (for this case, n·m = 400). By doing this we are not sensitive to any variation within the device subpopulations, and hence the graphs in Figure 9 for case (A) are flat: they start at zero and end at σ s = σ e . We now use the same procedure but integrate the redundant devices on one chip and build each channel as shown by case (B). When we do this, the failure statistics of the module become sensitive to the spread of failure times observed on a single wafer. In fact, any variation of the failure times within a subpopulation (wafer) that is less than the variation observed on the entire population is detrimental to the expected lifetime: the graphs for case (B) are all lower than for case (A). This interesting phenomenon is caused by the fact that if we happen to pick a bad wafer (each channel is a randomly selected wafer), all four of our redundant devices will be short-lived. This does not happen in case (A), where the selection of redundant devices is independent of the previous choice, i.e. each device has an equal chance of being a long-lived device.
Suppose we now decide to integrate all 400 devices on a single chip. This is modelled with case (C) in Figure 1. If our wafers have the same failure statistics distribution as all other wafers (σ s = σ e ), it will not matter whether we integrate the devices or not. However, the mean time to failure increases if our wafers exhibit more tightly shaped failure distributions (σ s < σ e ). The reason for this is that once we select a good wafer, we will have ensured not only that the redundant lasers work well, but also that all of the other channels operate well. Compare this to case (B), where selecting a bad wafer for just one of the channels will make the entire module fail prematurely. In case (A), selecting a bad wafer for one device still allows a longer useful life because the redundant devices are randomly selected from different wafers. The relationship between the MTTF values can be summarized with (equality applies when m = 1)
$$\mathrm{MTTF}(C) \ge \mathrm{MTTF}(A) \ge \mathrm{MTTF}(B). \tag{25}$$
This is a strong statement in favour of integrating devices on a single chip. As it may not be intuitively obvious why the MTTF of approach (B) is worse than that of both (A) and (C), we restate the physical interpretation of our findings. Suppose we know that a fixed percentage of wafers is bad and that the uniformity of failure times on each wafer is good. We use approach (C) as a reference, where it is reasonable to expect that the percentage of bad modules is the same as the percentage of bad wafers. What makes this integration scheme superior is that not only is the failure probability low for any device on any given channel on a good wafer, but the redundant devices on any channel are also good. In approach (B), where we place all redundant devices on a single chip but use different chips for each channel, we have reduced the reliability in two ways: (i) it is more likely that one of the channels will be bad because we are randomly sampling more than once from a group of wafers with a fixed percentage of bad wafers, and (ii) once a channel is bad, all of the redundant devices on that channel will also be bad. Clearly, we have MTTF(C) ≥ MTTF(B). That approach (B) is worse than approach (A) is evident because the reliability of approach (B) can be improved by letting the redundant devices of each channel (and bad wafer) be selected from the entire device pool (any wafer, case (A)), rather than from the same subpopulation (same bad wafer, case (B)). This leads to the better reliability exhibited by approach (A), MTTF(A) ≥ MTTF(B). However, the described improvement of approach (A) over approach (B) does not reach the reliability expectation of the all-on-one-chip integration of approach (C). The reason is that the randomly selected devices for all other channels (approach (A)) come from the entire device pool and hence offer a greater possibility for selecting a short-lived device than if all devices came from a single (good) wafer in approach (C). This leads to MTTF(C) ≥ MTTF(A), confirming relationship (25).
The relevant figure of merit for the maintenance of integrated systems is the expected number of failures in a given time period, also known as the renewal function [5,15]. The renewal function M(t) is given implicitly by the integral equation [5],
$$M(t) = F(t) + \int_0^t M(t-\tau)\, f(\tau)\, d\tau, \tag{26}$$
where f(t) and F(t) are the PDF and the CDF of our module. The integral equation (26) can be solved analytically only for a select number of cases, and we will leave its exact numerical solution for future work. Here we will only consider the limit that the renewal rate dM(t)/dt takes when t → ∞. It can be shown in a straightforward manner that this limit is 1/MTTF, and hence the relationship (25) also states the preference of device integration to achieve low-cost system maintenance.
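The ordering MTTF(C) ≥ MTTF(A) ≥ MTTF(B) can be spot-checked by direct simulation. The sketch below is a Monte Carlo substitute for the numerical convolutions used in this work, not a reproduction of them: each device lifetime is drawn from a lognormal with shape σ s around a wafer median t r , and t r itself is drawn from a lognormal with shape σ ν around t e = 1. The three schemes differ only in how often a new wafer is drawn. The module size (n = 20, m = 4), the trial count, and the seed are arbitrary choices made to keep the run short.

```python
import math
import random

SIGMA_S, SIGMA_E, T_E = 0.2, 0.5, 1.0
SIGMA_NU = math.sqrt(SIGMA_E ** 2 - SIGMA_S ** 2)

def draw_median(rng):
    """Median lifetime t_r of a randomly selected subpopulation (wafer)."""
    return rng.lognormvariate(math.log(T_E), SIGMA_NU)

def device_life(rng, t_r):
    """Wearout lifetime of one device from a wafer with median t_r."""
    return rng.lognormvariate(math.log(t_r), SIGMA_S)

def module_life(rng, scheme, n, m):
    """Lifetime of one module: channels in series, redundant devices used in turn."""
    module_median = draw_median(rng)          # single wafer for the whole module (C)
    channel_lives = []
    for _ in range(n):
        channel_median = draw_median(rng)     # one wafer per channel (B)
        life = 0.0
        for _ in range(m):
            if scheme == "A":
                t_r = draw_median(rng)        # a new wafer for every device
            elif scheme == "B":
                t_r = channel_median
            else:
                t_r = module_median
            life += device_life(rng, t_r)     # failure times of redundant devices add
        channel_lives.append(life)
    return min(channel_lives)                 # first channel to fail ends the module

def estimate_mttf(scheme, n=20, m=4, trials=1000, seed=1):
    """Sample-mean estimate of the module MTTF in units of t_e."""
    rng = random.Random(seed)
    return sum(module_life(rng, scheme, n, m) for _ in range(trials)) / trials

if __name__ == "__main__":
    for scheme in "ABC":
        print(scheme, round(estimate_mttf(scheme), 3))
```

With σ s = 0.2 and σ e = 0.5 the three estimates are well separated, so the ordering does not depend on the particular seed; the same routine could collect the sorted lifetimes to estimate the 1% cumulative failure time, although that quantile needs many more trials to be resolved reliably.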
We have determined the variation in 1%CDF with different integration approaches in Figure 10. As noted earlier, the preference between different integration approaches is different for 1%CDF than it is for MTTF, and the reason is the different shapes of the distribution functions. The relationship is summarized with, Here the equality applies when m = 1. Note that the analytic approximations result in a significantly better fit for MTTF than for 1%CDF. This is expected since MTTF is a calculation of the first moment, i.e. an integral, which is significantly less sensitive to the exact shape of the distribution than 1%CDF. It is clear that integrating redundant devices is beneficial for the life expectancy of any integrated system. For some cases, the failure statistics of individual devices will favour all-on-one-chip strategy (C), while in some cases using individual chips (A) may be a better option depending on what quantity we are interested in: 1%CFT is best with independent devices, but better MTTF is exhibited on fully integrated devices. As our analysis is based on lognormal failure statistics, the conclusions may differ if other more complex failure statistics are included.

Conclusion
High-density device integration is a necessity for future optical, electronic, and micro-mechanical devices. The reliability considerations will play an increasingly important role in the design of such systems, and this will especially become true as these technologies become the workhorse of future nano-scale devices for computation, storage, and biomedical applications.
In this work, we have presented a simple reliability model that accounts for run-to-run manufacturing-process variability of reliability parameters, and hence accounts for the interdependence of failures between neighbouring devices on wafers or arrays. The model has been used to estimate the lognormal wearout failure statistics of multi-channel integrated chips with redundant VCSELs. In doing this, we have assumed that the wearout failure of VCSELs obeys the lognormal distribution, but the approach could be applied using any other failure distribution. We have furthermore presented several analytic approximations for the failure statistics of multi-channel modules with redundant devices in which the failures of both the subpopulations and the entire population of devices follow the lognormal distribution. These expressions, summarized in Table 1, are useful in performing first-order estimates of life expectancy in systems where, due to manufacturing variance, there is a difference between the failure statistics within a subpopulation and those of the entire device population. The presented analysis stresses the importance of tracking the failure statistics of manufacturing subpopulations (wafers) and comparing them to the failure statistics of entire device populations for more accurate modelling of redundant and multi-channel systems.