Quantifying Gerrymandering in North Carolina

Using an ensemble of redistricting plans, we evaluate whether a given political districting faithfully represents the geo-political landscape. Redistricting plans are sampled by a Monte Carlo algorithm from a probability distribution that adheres to realistic and non-partisan criteria. Using the sampled redistricting plans and historical voting data, we produce an ensemble of elections that reveal geo-political structure within the state. We showcase our methods on the two most recent districtings of NC for the U.S. House of Representatives, as well as a plan drawn by a bipartisan redistricting panel. We find the two state enacted plans are highly atypical outliers whereas the bipartisan plan accurately represents the ensemble both in partisan outcome and in the fine scale structure of district-level results.


Methods
To sample from the space of congressional redistricting plans, we construct a family of probability distributions that are concentrated on plans adhering to non-partisan design criteria from proposed legislation. The non-partisan design criteria ensures that 1. the state population is evenly divided between the thirteen congressional districts, 2. the districts are connected and compact, 3. splitting counties is minimized, and 4. African-American voters are sufficiently concentrated in two districts to affect the winner.
The first three criteria come from House Bill 92 (HB92) of the NC General Assembly, which passed the House during the 2015 legislative session. HB92 also states that a districting should comply with the Voting Rights Act (VRA); the fourth criteria ensures the VRA is satisfied and is based on a redistricting plan proposed by the legislature along with recent court rulings. HB92 proposed establishing a bipartisan redistricting commission guided solely by these principles (see the Appendix for the precise criteria).
There is no consensus probability distribution to select compliant redistricting plans. For example, there is no criteria that determines when a plan contains districts that are not compact enough; it is also unclear if a distribution of plans should more heavily weight more compliant plans, or whether it should equally weight all compliant plans. In short, there is no 'correct' choice for this distribution. We select a particular distribution for our main results and then demonstrate that these results are remarkably stable when the distribution is changed (see Section C of the appendix).
Probability distributions are sampled with a standard Markov Chain Monte Carlo algorithm, which produces about 24,000 random redistricting plans (we test the fidelity and robustness of our sampling in Section B of the appendix) . For each generated redistricting, we re-tally the actual historic votes from a variety of electoral races, including the 2012 and 2016 US congressional elections, producing ensembles of election outcomes. When re-tallying the votes, we make the assumption that people vote for parties rather than people. We use this ensemble of election outcomes to quantify how representative a particular districting is by observing its place in this collection; we also use the ensemble to quantify gerrymandering by identifying districts which have an atypical partisan concentration of voters. We analyze and critique the NC U.S. Congressional districting plans used in the 2012 and 2016 elections, as well as the districting developed by a bipartisan group of retired judges as part of the "Beyond Gerrymandering" project spearheaded by Thomas Ross and the Duke POLIS Center (1). We refer to these districting plans of interest as NC2012, NC2016, and Judges respectively (see Figure 8 for the district maps).

Number of Democrats Elected
Using a related methodology, we assess to what degree three districting plans of interest (NC2012, NC2016, and Judges) are engineered. This is done by seeing how close each districting's properties are to the collection of nearby redistricting plans. Small changes to district boundaries should not have a significant effect on the character of election results.

Results
Using our ensemble of over 24,000 redistricting plans, we tabulate the observed probability distribution of the congressional delegation's partisan composition for the 2012 and 2016 vote counts. We then situate the NC2012, NC2016 and Judges districting plans on this probability distribution (see Figure 1). The partisan composition of the NC2012 and NC2016 districting plans occur in less than 1% of our generated redistricting plans for both sets of election data, and is heavily biased toward the Republicans. The partisan composition of Judges districting occurs in 39% and 28% of our generated redistricting plans for the 2012 and 2016 votes respectively, providing the second most likely outcomes both times.
By keeping the vote counts fixed and changing district boundaries, we have ignored any impact on incumbency. To test the effects of incumbency, we repeat the above analysis over a range of historic elections that include senatorial, presidential, gubernatorial races occurring between 2012 and 2016. We plot the histograms as a function of the statewide Democratic vote fraction in Figure 2. The NC2012 and NC2016 districting plans robustly elect three Democratic candidates over nearly all sets of examined historical voting data, and these results nearly always occur in less than 1% of the ensemble of redistricting plans when examined on the same set of election data. In contrast, the Judges plan gradually shifts from electing four to six Democrats as the statewide Democratic vote fraction changes from 43.7% to 51.6% of the vote; when situated within the ensemble of redistricting plans, the results are nearly always one of the two most expected outcomes.
Although the above results are compelling, the partisan balance in election outcomes is not the only signature of gerrymandering and gives little detail of the structures that produce the atypical results. To further probe the geo-political structure, we order the thirteen congressional districts in any given redistricting from the lowest to highest percentage of Democratic votes in each district to construct an ordered thirteen dimensional vector. For each index, we construct a marginal distribution. We summarize the thirteen distributions in a classical box-plot in Figure 3. On these box-plots, we overlay the percentage of the Democratic vote for the ordered districts in the NC2012, NC2016, and Judges districting plans.
The structure of the box plot reveals interesting features in the three examined districting plans. The Judge's districts gradually increase, roughly linearly, from the most Republican district (labeled 1) to the most Democratic (labeled 13); this behavior is identical to the behavior of None of the authors have any conflicts of interest to declare. 1 To whom correspondence should be addressed. E-mail: jonm@math.duke.edu the marginal distributions. Furthermore, most of the voting percentages from the Judges districts fall inside the boxes on the box-plot which mark the central 50% of the distribution. The NC2012 and NC2016 districting plans have a different structure: Both plans jump in partisan voter percentage between the tenth and eleventh most Republican districts (labeled 10 and 11, respectively). In the NC2012 districting, the fifth through tenth most Republican districts have more Republicans than predicted by the ensemble (labeled 5-10). In the NC2016 districting, the sixth through the tenth most Republican districts have more Republicans than predicted by the ensemble. In both the NC2012 and NC2016 districting plans, votes are removed from the central districts and added to the three most Democratic districts (labeled 11-13) -this is strong evidence that the middle districts have been cracked to reduce the Democrats' influence whereas the most Democratic districts have been packed with more Democrats than is expected. To quantify this observation, there are nearly no eleventh districts within the ensemble that have more Democratic votes than the eleventh district in the NC2012 and NC2016 data. Similarly there are nearly no tenth districts within the ensemble that have fewer Democratic voters than the NC2012 and NC206 redistricting plans. The only exceptions to this trend are in the NC2016 plan under the 2012 votes: There are 15 tenth districts in the ensemble (0.06%) with fewer Democratic voters than the tenth district of the NC2016 plan, and there are 25 eleventh districts in the ensemble (0.1%) with more Democratic voters than the eleventh district of the NC2016 plan.
Packing and cracking potentially reduce a party's political power. When considering the 2012 votes, the box-plot demonstrates that the 7th-9th districts are typically above the 50% line, meaning that we expect these districts to elect a Democratic representative; when comparing these districts to the NC2012 and NC2016 districting plans, we see that the 7th-9th districts fall below this line, meaning that these districts elected a Republican. When considering the 2016 votes, the NC2012 and NC2016 districting plans lead to a similar change in the delegation's partisan composition, which can be seen by examining the ninth and tenth most Republican districts (labeled 9-10).
In addition to the number of elected officials for each power, Figure 3 demonstrates add partisan safety -districts 6-10 are all more robustly republican than they would otherwise be. The effect is that elections within these districts may become more robustly Republican. This effect maybe seen in both the NC2012 and NC2016 plans for districts five through seven under the 2012 vote counts, and districts seven through eight under the 2016 vote counts.
Under a standard uniform partisan swing hypothesis, a statewide shift in the votes results in the box-plots shifting globally up or down in the direction of the swing (e.g. (2)). Hence, under this assumption, the jump in partisan vote fraction in the NC2012 and NC2016 plans results in a wide range of statewide partisan outcomes that produce an identical partisan composition of the Congressional delegation. This effect is absent from the typical ensemble plans as well as the Judges's plan.
The jump in Democratic vote fractions of the NC2012 and NC2016 plans relative to the fairly linear, gradually increasing vote fractions of the typical ensemble and Judges's plan provides a signature of gerrymandering. This structure further reveals the districts which have had votes from one party removed and dispersed to other districts. The cracking and packing dilutes the Democratic party's political power. In the next section, we further quantify this signature by contextualizing the plans of interest within the ensemble of redistricting plans when analyzed with summary metrics. Summary Metrics. Although the above visualizations provide a clear picture of the signature of gerrymandering, there is a long history of employing summary metrics that seek to encapsulate the above structures with a single number: Such metrics include electoral responsiveness, partisan bias, the efficiency gap, mean-median difference, declination and more (see, for example (3)(4)(5)(6)(7). Typically these metrics are contextualized with historical data across past elections, districting plans, and states. However it is unclear how meaningful the measures are, as they fail to consider the geo-political makeup of a region (for example, see (8,9)). For example, in (8), the geo-political makeup of a state may cause it to be have a naturally partisan biased.
Based on these observations, we propose two novel metrics that contextualize a redistricting plan within the underlying space of possible plans: The two indices are • Representativeness Index: Measures how typical the resulting balance of power obtained by a given districting plan is by contextualizing the number of elected representatives with the ensemble of redistricting plans.
• Gerrymandering Index: Quantifies how typical the observed level of packing and cracking is for a given redistricting by measuring how the individual districts deviate from the expected percentage of partisan votes.
All of the above mentioned summary statistics are based on the ordered districting vote percentages shown as the dots in Figure 3, however the two novel statistics also utilized marginal distributional data (shown as the box plots in Figure 3). In the present work, we consider two of the established statistics -partisan bias and the efficiency gap. We also consider the two new proposed statistics. We provide detailed descriptions of all utilized indices below. Each summary statistic is computed for each redistricting plan in our ensemble and for each districting plan of interest (NC2012, NC2016, and Judges).
We contextualize the partisan bias and the efficiency gap, using the 2012 and 2016 congressional voting data, with histograms in Figure 4. In both statistics and under both vote counts, the NC2012 and NC2016 districting plans are extreme outliers; they show extreme bias toward the Republican party and waste an atypical number of Democratic party. Under the 2012 voting data, only 14 of the generated redistricting plans (0.053% of the plans) are as, or more, biased than the NC2012 districting plan, and only 334 redistricting plans (1.3%) are as, or more, biased than the NC2016. For the 2016 voting data, no plan in the ensemble is as, or more, biased than either the NC2012 or NC2016 districting plans. Under both sets of voting data, the NC2012 districting plan has a higher efficiency gap than any plan in the ensemble. The efficiency gap for the NC2016 map is lower than 84 of the redistricting plans in the ensemble (0.34%) for the 2012 voting data, and 86 of the redistricting plans in the ensemble (0.35%) for the 2016 voting data.
In stark contrast, the Judges districting plan has less partisan biased than 15,327 redistricting plans (62.51%) for the 2012 votes and 6130 redistricting plans (or 25.0%) for the 2016 votes. The partisan bias of the Judges plan is also the most probable bias within the ensemble  for both vote counts. The Judges districting plan is also more efficient than 14,778 redistricting plans (60.02%) and 6912 redistricting plans (28.2%) under the 2012 and 2016 votes, respectively. We also contextualize the Gerrymandering Index and Representativeness index with histograms in Figure 5. None of the generated redistricting plans constructed have a partisan bias Gerrymandering Index bigger than NC2012, regardless of the voting data used. Similarly All four metrics over both election years indicate that the Judges plan is very typical. The Judges plan shows low partisan bias, has a reasonable efficiency gap, has a comparatively low level of gerrymandering, and is reasonably representative. In contrast, the NC2012 and NC2016 plans have strong partisan bias toward the republicans, have a large efficiency gap, are unrepresentative, and are highly gerrymandered.

Local engineering
If relatively small changes in a redistricting dramatically change the partisan vote balance in each district then it raises questions how representative the results generated by the redistricting are, and suggests the redistricting was selected or engineered. * We explore the degree to which the NC2012, NC2016 and Judges redistricting plans are locally typical by examining the distribution of Gerrymandering Indices of an ensemble approximately 2,500 nearby districting plans, (see Section S5 for a precise description of 'nearby'). We select the Gerrymandering Index, because it is the only considered index that is independent of the number of elected officials, and we do not expect this number to vary significantly when examining nearby districting plans. We display our results in Figure 6, and find that the NC2012 and NC2016 redistricting plans have Gerrymandering Indices which are significantly larger than their respective ensembles of nearby redistricting plans, while the Judges plan has a Gerrymandering Index in the middle of the ensemble produced by its nearby redistricting plans. In other words, small changes to district boundaries make the NC2012 and NC2016 redistricting plans less partisan but do not change the characteristics of the Judges redistricting. This suggests that NC2012 and NC2016 plans, in contrast to the Judge's plan, were precisely engineered to achieve a partisan goal.

The distribution on redistricting plans
Biased on the criteria for a reasonable redistricting outlined above, we define a family of probability distributions on the space of redistricting plans by first defining a score function on any given redistricting. This score function will return lower score to those redistricting which better adhere to the outlined design criteria. We will then use the score function to define a probability measure on the space of redistricting plans.
Possible methods for generating the ensemble of redistricting plans. Before detailing the method we use we discuss a number of alternative methods and situate them relative to the one we have employed. Ideas to generate redistricting plans with computational algorithms have been being developed since the 1960's (12)(13)(14). There are three main classifications of redistricting algorithms: constructive randomized algorithms (8,15,16), moving boundary MCMC algorithms (17)(18)(19)(20)(21), and optimization algorithms (22, 23). Constructive randomized algorithms begin * As the initial working paper reporting these results of this paper (10) was being completed an the work in (11) appeared which provides an interesting set of ideas to assess if samples being drawn are typical or outliers exactly in our context. We hope to explore these ideas in the near future.
each new redistricting with an initial random seed and either grow a fixed number of districts or combine small districts until the desired number of districts is achieved. Moving boundary MCMC algorithms find new redistricting plans by altering existing district boundaries to sample from a specified distribution on redistricting plans. In (21), the authors demonstrate that MCMC algorithms may be better at sampling the redistricting space than constructive algorithms. MCMC algorithms can theoretically sample the space of redistricting plans with the correct probability distribution, † whereas constructive algorithms may construct many similar redistricting plans while not generating other types of equally valid redistricting plans, leading to a skewed distribution. In particular, one is unsure what distribution on redistricting the algorithm is producing.
Recently an evolutionary algorithm has been proposed which begins with a constructive method, but then uses mixing to find either elite or 'good enough' redistricting plans (23); using the collection of 'good enough' redistricting plans, the authors make statistical predictions on the fitness of districting plans, however as with the constructive algorithms it is unclear how evolutionary algorithms compare with Monte Carlo models in terms of sampling the space properly and from what distribution they draw. One advantage of the MCMC approach over all of the other options is that it samples from a explicitly specified and constructed probability distribution on redistricting plans. Hence, this biases and preferences are explicit and open to critique. We remark that all of the above works have considered minimizing population deviation and compactness; a few of the works have considered minimizing county splitting; none of these works have included Voting Rights Act requirements, whereas our work does. For this work, we use a MCMC algorithm to sample the space of redistricting plans that is similar to those presented in (18,19,21).
Using a MCMC method requires one to define a family of probability distribution on the space of redistricting plans. We define this family of distributions first by developing a score function that evaluates the overall "goodness" of a districting plan, and then use the score function to define a probability distribution. The family of distributions focuses around compliant redistricting plans.
Defining the score function. To define the score function, we introduce several mathematical formalisms, the first of which is represents the state of North Carolina as a graph G with edges E and vertices V . Each vertex represents a Voting Tabulation District (VTD); an edge between two vertices exists if the two VTDs share boundaries with non-zero length. In general VTDs may be split into census blocks, however (i) the utilized redistricting criteria requires this splitting to minimized and (ii) we demonstrate that splitting VTDs to achieve zero population deviation in a district has nearly no effect on our results (see Section S4.A).
Defining the graph this way allows us to formally define a redistricting plan: Assuming each VTD belongs to a single district, a redistricting plan is defined as a function from the vertices, V , to one of the possible districts, which are represented by sequential integers -there are thirteen congressional districts in North Carolina, so we define a redistricting plan as a function ξ : V → {1, 2, . . . , 13}. The redistricting plan function ξ is interpreted as follows: If a VTD is represented by a vertex v ∈ V, then ξ(v) = i means that the VTD in question belongs to district i; similarly, for any i ∈ {1, 2, . . . , 13} and plan ξ, the i-th district, denoted Di(ξ), is given by the set {v ∈ V : ξ(v) = i}. We restrict the space of considered redistricting plans ξ such that each district Di(ξ) is a single connected component; this restriction, along with our edge criteria, ensures that §120-4.52(f) of HB92 is always satisfied. We denote the collection of all redistricting plans with connected districts by R.
A plan ξ is rated with our score function denoted J. J maps each redistricting ξ ∈ R to a nonnegative number. Lower scores signify redistricting plans that more closely adhere to the criteria of HB92. To define the score function J, we employ several sub-functions that measure how well a given redistricting satisfies the individual principles outlined in HB92. We will denote these sub-functions as Jp, JI , Jc, and Jm: the population score Jp(ξ) measures how well the redistricting ξ partitions the population of North Carolina into 13 equal population groups; the isoperimetric score JI (ξ) measures how compact the districts are; the county score Jc(ξ) measures the number of counties split between multiple districts; lastly, the minority score Jm(ξ) measures the extent to which a districting plan adheres to the VRA. Once the sub-functions are specified, our score function J is defined as a weighted sum of Jp, JI , Jc, and Jm; since all of the sub-score functions are not on the same scale, we use a weighted combination to balance the influence of each criteria. Specifically, we define [1] where wp, wI , wc, and wm are a collection of positive weights.
To describe the individual sub-functions, data is associated to our graph G which allows the recovery of relevant features on each VTD. The positive functions pop(v), area(v), and AA(v), defined on a vertex v ∈ V , represent (respectively) the total population, geographic area, and minority population of the VTD associated with v; the symbol AA represents the minority population because African-Americans are the only minority in North Carolina with a large enough population to gain representation under the VRA. The functions pop(v), area(v), and AA(v) are extended to a collection of vertices B ⊂ V by The boundary of a district Di(ξ), denoted ∂Di(ξ), is the subset of the edges E that connect vertices inside of Di(ξ) to vertices outside of Di(ξ). According to this definition, VTDs that border another state or the ocean will not have an edge that signifies this fact. To incorporate state boundary information, we add the vertex o to V , which represents the "outside" and connect it with an edge to each vertex representing a VTD which shares perimeter with the boundary of the state. We assume that any redistricting ξ always satisfies ξ(v) = 0 if and only if v = o; since ξ always satisfies ξ(o) = 0, and o ∈ Di(ξ) for i ≥ 1, it does not matter that we have not defined pop(o), area(o) or AA(o), as o is never included in the districts. † One can rigorously prove that the Markov Chain given by this algorithm converges to the desired distribution if run long enough. One only needs to establish that the Markov Chain transition matrix is irreducible and aperiodic. Since one can evolve from any connection redistricting to another through steps of the chain, it is irreducible. Aperiodicity follows as there exist redistricting plans which are connect to itself through a loop consisting of two steps and a loop consisting of three steps. Since 2 and 3 are prime and hence have greatest common divisor 1, the chain is aperiodic. See the Perron-Frobenius Theorem for more details.
Given an edge e ∈ E which connects two vertices v,ṽ ∈ V , boundary(e) will represent the length of common border of the VTDs associated with the vertex v andṽ. As before, the definition is extended to the boundary of a set of edges B ⊂ E by boundary(B) = e∈B boundary(e) . [3] With these preliminaries out of the way, we define the score sub-functions used to assess the goodness of a redistricting.
The population score function. The population score, which measures how evenly populated the districts are, is defined by where Npop is the total population of North Carolina, pop(Di(ξ)) is the population of the district Di(ξ) as defined in equation Eq. (2), and pop Ideal is the population that each district should have according to the 'one person one vote' standard: pop Ideal is equal to one-thirteenth of the total state population.
The Isoparametric score function. The Isoperimetric score, which measures the overall compactness of a redistricting, is defined by It is is the ratio of the square perimeter to the total area of each district. The Isoparametric score is minimized for a circle, which is the most compact shape. This compactness measure is one of two measures often used in the legal literature, where its reciprocal is proportional to as the Polsby-Popper score or the perimeter score (24,25). The second measure, usually referred to as the dispersion score, is more sensitive to overly elongated districts, although the perimeter score also penalizes them. We select the the perimeter score because it penalizes undulating boundaries, whereas the the dispersion score does not. There are over 20 measures to evaluate compactness (see, for example, (26)). We chose the the Polsby-Popper score both because of its historical significance and because it is consistent with the utilized compactness criteria. To validate the robustness of this choice, we have also examined a dispersion score and find qualitatively similar results (see Section S3.D).
The county score function. The county score function measures how many, and to what degree, counties are split between districts. If two VTDs belong to different districts but the same county, the county is called a split county. The score function is defined as Jc(ξ) ={# counties split between 2 districts} · W2(ξ) where MC is a large constant that heavily penalizes three county splits, and W2 and W3 are weight factors that smooth abrupt transitions between split/non-split counties and are defined by Fraction of county VTDs in 2nd largest intersection of a district with the county Fraction of county VTDs not in 1st or 2nd largest intersection of a district with the county 1 2 . The Voting Rights Act or minority score function. The VRA mandates that minorities have the ability to elect a number of representatives; the number is determined by the fraction of the population comprised of the minority. In North Carolina, the only minority large enough to warrant consideration under the VRA is the African American population. African-American voters make up approximately 20% of the eligible voters in North Carolina; since 0.2 is between 2 13 and 3 13 , the current judicial interpretation of the VRA stipulates that at least two districts should have enough African-American voters so that this demographic may elect a candidate of their choice.
African-American voters should not be overly represented in a district either. The NC2012 districting plan was ruled unconstitutional because two districts, each containing over 50% African-Americans, were ruled to have been packed too heavily with African-Americans, diluting their influence in other districts. The NC2016 districting was accepted based on racial considerations of the VRA and contained districts that held 44.48% African-Americans, and 36.20% African-Americans. The amount of deviation constitutionally allowed from these numbers is unclear.
Based on these considerations, we chose a VRA score function which awards lower scores to redistricting plans which had one district close to 44.48% African-Americans and a second district close to 36.20% African-Americans. We write the score function as where m1 and m2 represent the percentage of African-Americans in the districts with the highest and second highest percent of African-Americans, respectively. H is the function defined by H(x) = 0 for x ≤ 0 and H(x) = x for x > 0. The use of the square root function steepens the score function as districts near the desired population percentage; when used in conjunction with the Monte Carlo algorithm presented below, the steepening will have the effect that districts close to the set desired minority populations are more likely to move toward achieving these populations. Notice that whenever m1 ≥ 44.84% and m2 ≥ 36.20% we have that Jm = 0; this feature allows the possibility for high minority populations, but allows such instances to arise naturally and does not target such an outcome.
A Family of Probability Distribution on Redistricting Plans. We now use the score function J(ξ) to assign a probability to each redistricting ξ ∈ R that makes redistricting plans with lower scores more likely. Fixing a β > 0, we define the probability of ξ, denoted by P β (ξ), by where Z β is the normalization constant defined so that P β (R) = 1. Specifically, The positive constant β is often called the "inverse temperature" in analogy with statistical mechanics and gas dynamics. When β is very small (the high temperature regime), different elements of R have close to equal probability. As β increases ("the temperature decreases"), the measure concentrates the probability around the redistricting plans ξ ∈ R which minimize J(ξ). This idea has been previously used in (18) and (21).
Sampling the distribution. We use a standard Metropolis-Hastings algorithm. We define the proposal chain Q used for proposing new redistricting plans in the following way: 1. Uniformly pick a conflicted edge at random. An edge, 2. For the chosen edge e = (u, v), with probability 1 2 , either: Let conflicted(ξ) be the number of conflicted edges for redistricting ξ. Then we have Q(ξ, ξ ) = 1 2 1 conflicted(ξ) . The acceptance probability is given by: If a redistricting ξ is not connected, then we refuse the step, which is equivalent to setting J(ξ ) = ∞. We utilize simulated annealing to sample the space: β starts starts and remains at zero until 40,000 steps are accepted, which allows the MCMC algorithm to freely explore the space of connected redistricting plans; next β grows linearly to one over the course of 60,000 accepted steps, which allows the algorithm to search for a redistricting with a low score without getting caught in local minimum; finally, β is fixed at one for 20,000 accepted steps a sufficient number of steps so that the algorithm locally samples the measure P β . This process is repeated for each sampled redistricting. For other Monte Carlo algorithms see (17)(18)(19)(20)(21); for other redistricting algorithms see (8,15,16,22,23).
Determining the weight parameters. As we have mentioned above, we have four independent weights (wp, wI , wc, wm) used in balancing the effect of the different scores in the total score J(ξ). In addition to these parameters, we also have the low and high temperatures corresponding respectively to the maximum and minimum β values used in simulated annealing. We have set the minimum value of β to be zero which corresponds to infinite temperature. In this regime, no district is favored over any other, which allows the redistricting plan freedom to explore the space of possible redistrictings. To consider a high β value, we note that β multiplies the four weights in the probability distribution function: This means that one of the five remaining degrees (the four weights and high β) is redundant and can be set arbitrarily. We therefore chose to fix the low temperature (high value of β) to be one.
To select appropriate weights, we employ the following tuning method: 1. Set all weights to zero.
2. Find the smallest wp such that a fraction of the results are within a desired threshold (for the current work we ensured that at least 25% of the redistrictings were below 0.5% population deviation, however we typically did much better than this).
3. Using the wp from the previous step, find the smallest wI such that a fraction of the redistrictings have all districts below a given isoperimetric ratio (we ensured that at least 10% of the results were below this threshold; we chose a threshold of 60, as we have done above).
4. If above criteria for population is no longer met, repeat steps 2 through 4 until both population and compactness conditions are satisfied.
5. Using the wp and wI from the previous steps, find the smallest wm such that at least 50% of all redistrictings have at least one district with more than 40% African-Americans and a second district has at least 33.5% African-Americans.
6. If the thresholds for population were overwhelmed by increasing wm, repeat steps 2 through 6. If the thresholds for compactness were overwhelmed, repeat steps 3 through 6.
7. Using the wp, wI , and wm from the previous steps, find the smallest wc such that we nearly always only have two county splits, and the number of two county splits are, on average, below 25 two county splits.
8. If the thresholds for population are no longer satisfied, repeat steps 2 through 8. If the criteria for the compactness is no longer met, repeat steps 3 through 8. If the criteria for the minority populations is not satisfied, repeat steps 5 through 8. Otherwise, finish with a good set of parameters.
With this process, we settle on parameters wp = 3000, wI = 2.5, wc = 0.4, and wm = 800 and have used these parameters for all of the results presented in the main text. We remark that this choice of parameters allows us to sample the space more quickly. Both the primary and local redistricting plans are available for download ‡ . Other choices in parameters lead to similar results as is demonstrated Section C of the appendix.
Thresholding the sampled redistrictings. It is possible for the simulated annealing algorithm to draw a redistricting with a bad score when using the MCMC algorithm combined with the probability distribution given in Eq. (5). Additionally, using simulated annealing in the MCMC algorithm increases the chance of becoming trapped in a district that is a local minimum of the score function, but has a less-than-desirable score; this is because it may take more steps to escape a local minimum than we take with the high value of β. Lastly, our score functions focus on average values rather than individual districts: For example, the isoperimetric score function is the sum of the individual isoperimetric scores of each district, and it is therefore possible to have one bad district if the rest have exceptionally small isoperimetric scores.
In maximizing the degree of compliance with HB92, we only use samples which pass an additional set of thresholds, one for each of the selection criteria. This additional layer of rejection sampling was also used in reference (21), though the authors of this work chose to reweigh the samples to produce the uniform distribution over the set redistrictings that satisfy the thresholds. We prefer to continue to bias our sampling according to the score function so better redistrictings are given higher weights; we note that the idea of preferring some redistrictings to others is consistent with the provisions HB92.
From our experience from the Beyond Gerrymandering project, redistrictings which use VTDs as their building blocks and have less that 1% population deviation can readily be driven to 0.1% population deviation by breaking the VTDs into census tracts and performing minimal alterations to the overall redistricting plan; we also demonstrate in Section D.I of the appendix that splitting VTDs to achieve zero population deviation has a negligible effect on our results. We thus only accept redistrictings that have no districts above 1% population deviation. Many of our samples have deviations considerably below this value. It is important to emphasize that we require this of every district in the redistricting. In addition to showing that splitting VTDs to zero population deviation has no effect on our results (Section D.I of the appendix), we also show that when the population threshold is decreased from 1% to 0.75% and then to 0.5%, the results are quantitatively extremely similar, and qualitatively identical (Section C.I of the appendix).
We have found that districts with isoperimetric scores under 60 are almost always reasonably compact. Thus, we choose to accept a redistricting only if each district in the plan has an isoperimetric ratio less than 60. The Judges redistricting plan would be accepted under this threshold as its least compact district has an isoperimetric score of 53.5. Neither NC2012 nor NC2016 would be accepted with this thresholding as the least compact districts of each plan have isoperimetric scores of 434.65 and 80.1, respectively. We also note that only two of the thirteen districts for the NC2012 plan meet our isoperimetric score threshold, whereas eight of the thirteen districts of NC2016 fall below the threshold. Although we examine our principle results over a space of highly compact redistricting plans, we also demonstrate that our results are insensitive to lifting this restriction in the Appendix (see Section C.III).
Although redistrictings which split a single county in three are infrequent, they do occur among our samples. Since these are undesirable, we only accept redistrictings for which no counties are split across three or more districts. In order to satisfy population requirements, some counties must be split into two districts; an example making this clear is that Wake and Mecklenburg Counties each contain a population larger than a single Congressional district's ideal population. We do not explicitly threshold based on number of split counties, though redistrictings with more split counties have a higher scores, and hence are less favored. We remark that none of our generated redistrictings have more county splits than the NC2012 redistricting plan, and that the NC2012 plan was never critiqued or challenged based on the number of county splits.
To build a threshold based on minority requirements of the VRA, we note that the NC2016 redistricting was deemed by the courts to satisfy the VRA. The districts in this plan with the two highest proportion of African-Americans to total population are composed of 44.5% and 36.2% African-Americans. With this in mind, we only accept redistrictings if the districts with the two highest percentages of African-American population have at least 40% and 33.5% African-American voters, respectively.
Thresholding in this way sub-samples to roughly 16% of the samples initially produced by our MCMC runs. Although this leads to many unused samples, it ensures that all of the utilized redistrictings meet certain minimal standards. This better adheres to the spirit of HB92. The reported 24,518 samples used in our study refer to those left after thresholding. The full data set of samples was in excess of 150,000. That being said, we show in Section C.I of the appendix that results without thresholding were quantitatively very close and qualitatively identical.

Details of the Indices
Details on symmetry summary statistics. Partisan bias is defined to be the symmetric difference between the expected number of seats won, as a function of the votes cast. If we define E(s|v) as the expected number of seats, s, given a statewide Democratic vote fraction, v, then the partisan bias is defined to be (for a 13 district state) (E(s|0.5 + vs) − E(s|0.5 − vs))/13, where vs is a uniform shift in the overall votes (e.g. (3)). Historically vote shifts are computed by uniformly shifting the vote fractions in the marginal distributions, and we adopt this convention in the current work; we have set vs = 0.05 (5%) in all of the presented results.
The Efficiency Gap is an index that was used in the decision Whitford Op. and Order, Dkt. 166, Nov. 21, 2016 (see also (4)). It quantifies the difference of how many "wasted votes" each party cast; a larger number means that one party wasted more votes than another. More precisely, the Efficiency Gap measures the difference of the relative efficiency for the Democrats and Republicans. The efficiency for each party is the sum of the fraction of votes in districts the party loses plus the sum over the percentage points above 50% in the districts won. The relative efficiency is the efficiency of a given party divided by the sum of percentage of votes obtained by the party in each district. The efficiency gap is the difference between the relative efficiencies of the two parties. § Details of Gerrymandering Index. To compute the Gerrymandering Index, we begin by extracting the mean percentage of Democratic votes in each of the thirteen districts when the districts are ordered from most to least Republican (see Figure 2. For any given redistricting plan, we take the Democratic votes for each district when the districts are again ordered from most to least Republican and consider the differences between the mean and the observed democratic percentage. This gives us a set of thirteen numbers on which we consider a two-norm (each number is squared, and these squares are then summed; the sum is then square rooted).
The Gerrymandering Index is smallest when all of the ordered Democratic vote percentages are precisely the mean values. However, this is likely not possible as the percentages in the different districts are highly correlated. To understand the range of possible values, we have plotted the complementary cumulative distribution function of the Gerrymandering Index of our ensemble of randomly generated reasonable redistricting plans in the main text (see Figure 3). This gives a context in which to interpret any one score.
To provide an example, we note that for the 2012 votes, the mean percentages for the collection of redistricting plans we generate is In summary, in this example the Gerrymandering Index is √ 0.0291 = 0.17.

I. Details of Representativeness Index.
To calculate the Representativeness Index, we first construct a modified histogram of election results that captures how close an election was to flipping a congressional seat. To do this for a given redistricting plan, we examine the least Republican district in which a Republican won, and the least Democratic district in which a Democrat won. We then linearly interpolate between these districts and find where the interpolated line intersects with the 50% line. For example, in the 2012 election (NC2012 map with 2012 votes), the 9th most Republican district elected a Republican with 53.3% of the vote, and the fourth most Democratic district won their district with 50.1% of the vote. We would then calculate where these two vote counts cross the 50% line, which will be 50 − (100 − 50.1) 53.3 − (100 − 50.1) ≈ 0.03, [6] and add this to the number of Democratic seats won to arrive at the continuous value of 4.03. This index allows us to construct a continuous variable that contains information on the number of Democrats elected, and also demonstrates how much safety there is in the victory. Fractional parts close to zero suggest that the most competitive Democratic race is less likely to go Democratic than the most competitive Republican race is to go Republican. On the other hand, fractional parts close to one suggest that the most competitive Republican race is less likely to go Republican than the most competitive Democratic race is to go Democratic. Instead of simply creating a histogram of the number of seats won by the Democrats, we construct a histogram of our new interpolated value in Figure 7. We define the representativeness as the distance from the interpolated value to the mean value of this histogram (shown in the dashed line). These are the values we report in Figure 3. For the 2012 vote data, we find that the mean interpolated Democratic seats won is 7.01, and the Judges plan yields a value of 6.28, giving a Representativeness Index of |7.01 − 6.28| = 0.73. The NC2012 and NC2016 plans both have Representativeness Indices greater than two.  Figure 1 are overlaid on this plot for reference.

Discussion
We have found evidence that that the NC2012 and NC2016 are heavily gerrymandered: they employ packing and cracking to generate a dramatic jump of partisan vote counts in the ordered district structures. This jump is significantly larger than those found in the ensemble of redistricting plans which is a signature of gerrymandering. When summarizing these observations in single number metrics, we consistently find that the NC2012 and NC2016 redistricting plans are outliers, suggesting that (i) these districts are heavily gerrymandered and (ii) do not represent the geo-political landscape expressed in a number of elections occurring between 2012 and 2016. Further analysis reveals the NC2012 and NC2016 plans are locally engineered for partisan benefit. All of these results are consistent with the findings of (23). On the contrary, the districting plan produced by a bipartisan redistricting commission of retired judges from the Beyond Gerrymandering project produced results which are highly typical. The Judges plan is not gerrymandered, was typically representative of the people's will, and displayed consistent statistics with nearby redistricting plans.
The ideas presented in this report are generalizable at both the state and federal level. We note that each state may have different requirements when drawing district boundaries and so care must be taken when considering the criteria to be included in the generative procedure. We hope that the analysis in this report is utilized across different states and levels of government to test the viability of districting plans.
The appendix is organized as follows: (1) we begin by detailing the utilized data, including the HB92 criteria, voting data, the enacted NC2012 and NC2016 districting plans, and the Judges districting plan; (2) we then provide evidence that our MCMC algorithm is properly sampling the space of redistricting plans on the distribution described in the main text; (3) we demonstrate that our main results are insensitive to changing the probability distribution function in a variety of ways; (4) we provide error analyses for how our results might change due to resampling and due to splitting VTDs to achieve zero population; (5) we give details on the properties of our sampled redistricting plans.

A. Details on Data (Examined Districting Plans of Interest; Redistricting Criteria; Voting Data)
I. Information on the three considered redistricting plans. Maps for the NC2012, NC2016, and Judges districting plans are shown in Figure 8. The Democratic vote fractions for each district are given in Table 1; the vote fractions are given for the U.S. congressional elections in 2012 and 2016. In the table, vote fractions are adjacent to numbers in parentheses that give the numerical label of the individual districts as identified in the maps in Figure 8. The last two columns contain the mean values of percentage of Democratic votes for the ensemble of redistricting plans.  Table 1.
The three most Democratic districts (labeled 1, 4 and 12 in both the NC2012 and NC2016 plans) have significantly more Democratic votes than the expected average for these districts. Districts 9 and 13 both show evidence of having less Democrats than one would expect from their rankings. These conclusions are consistent across the 2012 and 2016 votes.
The raw data used to produce Figure 1 of the main text is given in Table 2. It underscores how atypical the results produced by the NC2012 and NC2016 redistricting plans are. If one is ready to accept four seats for Democrats in the 2012 vote then one should equally accept nine. Similarly in the 2016 votes, if one accepts three seats for Democrats as a legitimate outcome then one should also be willing to accept seven seats. None of these results are particularly representative of the votes cast.
II. Redistricting Criteria. The Judges districting plan was established in the Beyond Gerrymandering project ¶ was a collaboration between UNC system President Emeritus and Davidson College President Emeritus Thomas W. Ross, Common Cause, and the POLIS center at the Sanford School at Duke University. The project's goal was to educate the public on how an independent, impartial redistricting process would work. The project formed an independent redistricting commission made up of ten retired jurists -five Democrat and five Republican.
The commission used strong and clear criteria to create a new North Carolina congressional map based on NC House Bill 92 (HB92) from the 2015 legislative season, which states • §120-4.52(f): Districts must be contiguous; areas that meet only at points are not considered to be contiguous.
• §120-4.52(c): Districts should have close to equal populations, with deviations from the ideal population division within 0.1%.
• §120-4.52(g): Districts should be reasonably compact, with (1) the maximum length and width of any given district being as close to equal as possible and (2) the total perimeter of all districts being as small as possible. • §120-4.52(e): Counties will be split infrequently and into as few districts as possible. The division of Voting Tabulation Districts (VTDs) will also be minimized.
• §120-4.52(d): Redistricting plans should comply with pre-existing federal and North Carolina state law, such as the Voting Rights Act (VRA) of 1965.
• §120-4.52(h): Districts shall not be drawn with the use of (1) political affiliations of registered voters, (2) previous election results, or (3) demographic information other than population. An exception may be made only when adhering to federal law (such as the Voting Rights Act (VRA)).
All federal rules related to the Voting Rights Act were followed but no political data, election results or incumbents addresses were considered when creating new districts. The commission met twice over the summer of 2016 to deliberate and draw maps. The maps resulting from this simulated redistricting commission were released in August 2016. The Judges agreed on a redistricting at the level of Voting Tabulation Districts (VTD). This coarser redistricting was refined at the level of census blocks to achieve districts with less than 0.1% population deviation. The original VTD based maps are used in our study and are denoted Judges; our results are insensitive to splitting VTDs to zero population (for a careful study, see below in Section D.I).
Although unratified, HB92 was passed in the House, which provides some credence to using it as a guide to draw fair redistricting plans. In contrast, the criteria adopted in the drawing of the 2016 congressional districts contains a "Partisan Advantage" clause which seeks to predetermine the number of elected officials from each party || ; with the balance of power pre-selected by this criteria, it would be impossible determine any sense of the statistical range in the balance of power. Because we do not wish to use partisan data to draw maps, we utilize the criteria from the non-partisan HB92.
III. Data sources and extraction. VTD geographic data was taken from the NCGA website (27) and the United States Census Bureau website (28) , which provide for each VTD its area, population count of the 2010 census, the county in which the VTD lies, its shape and location. Perimeter lengths shared by VTDs were extracted in ArcMap from this data. Minority voting age population was found on the NCGA website using 2010 census data (29).
Data for the vote counts in each VTD in each election was taken from the NCSBE Public data (30). For the 2016 election, VTD data was reported for precincts rather than VTDs, but rather for each precinct; 2447 of the precincts are VTDs, meaning that we have data for the majority of the 2692 VTDs. However 172 precincts contain multiple VTDs, 66 VTDs were reported with split data, and 7 VTDs were reported with complex relationships. To extrapolate VTD data on those contained in the 172 precincts containing multiple VTDs, we split the votes for a precinct among the VTDs it contained proportional to the population of each VTD. For the 66 split VTDs, VTDs were comprised of multiple precincts all contained with in a certain VTD, so we simply added up the votes among the precincts that were contained in each VTD -there was no extrapolation for these VTDs and these results are precise. For the VTDs with complex relationships, we divided up the votes using estimates based on the geography and population of the VTDs. We note that roughly 10% of the population lies in the VTDs with imperfect data, and that we do not expect significant deviation in our results based on the above approximations. || The 2016 Contingent Congressional Plan Committee Adopted Criteria may be found here: http://www.ncleg.net/GIS/Download/ReferenceDocs/2016/CCP16_Adopted_Criteria.pdf In using 2012 and 2016 data we have only used presidential election year data. Unfortunately, the 2014 U.S. congressional election in North Carolina contained an unopposed race which prevents the support for both parties being expressed in the VTDs contained in that district. In reference (19), the missing votes were replaced with votes from the Senate race. However, since we had two full elections, namely 2012 and 2016, which needed little to no alterations, we chose not to include the 2014 U.S. congressional votes in our study.

B. Evidence of robust sampling
There is a possible pitfall of using simulated annealing: as the districting plan moves along the random walk it may become trapped in a local minimum from which it is unlikely to escape. This may be because the system has cooled too quickly, keeping it trapped in a local region, or it may be because the likelihood of finding a path out of one local region of redistricting plans and into another is small. To test that the MCMC algorithm has sufficiently sampled the space, we consider three tests: 1. we consider three different initial conditions, 2. we double the simulated annealing time, and 3. we increase the length of the MCMC chain for a longer run that includes more samples.
When comparing the main results to the results from the above tests, we will examine both the histogram and marginal distributions presented in Figures 1 and 3 in the main text. We will have evidence for a robust sampling if the histograms and box plots do not significantly change.
In addition to the above tests, we have animated our algorithm and have found that districts may travel from one end of the state to another; such motion suggests that our sampler is not trapped in a local well, and it is reasonable to hypothesize that as districts exchange locations, they lose information on past configurations.

I. Independence of initial condition.
The samples from the random walks presented in the main text begin with the Judges districting plan.
To test for the effect on initial conditions, we also examine using initial conditions with the NC2012 and NC2016 districting plans. Altering the initial condition has only a small effect on the ensemble of election results; the effects are displayed in Figure 9. The initial condition for the NC2012 redistricting has a 15% chance of electing five Democrats rather than the 12% chance we have seen before. We note that these are exploratory runs, with 785 and 835 redistricting plans for the NC2012 and NC2016 initial conditions. These sample sizes are robust enough to provide a general trend but are subject to statistical variations. Hence the small sample sizes are a possible and likely culprit of these variations.  Fig. 9. We display the probability distribution of elected Democrats with respect to initial conditions (left) We display our standard box-plots for the three initial conditions (right; vertical line length is the full range of possibilities, 5% of data is outside the outer lines, 20% of the data is outside the inner lines, and 50% of the data is outside of the boxes).

II. Independence of simulated annealing parameters.
We double the relaxation time of the simulated annealing process to see if it has an effect on the simulated election statistics: Instead of remaining hot (β = 0) for 40, 000 steps, cooling linearly for 60,000 steps, and remaining cold (β = 1) for 20,000 steps, we instead remain hot for 80,000 steps, cool linearly for 120,000 steps, and remain cold for 40,000 steps. 2,411 redistricting plans for the increased cooling times are presented in Figure 10 III. Considering more samples. The above tests on initial conditions and simulated annealing parameters provide evidence that we have properly sampled the probability distribution of redistricting plans. To strengthen this claim, we also continue to allow the algorithm to sample the space until we have sampled roughly 120 thousand acceptable redistricting plans as defined by the original thresholding criteria -this means that each chain in the MCMC algorithm samples nearly 5 times as many districts as the 24,518 plans used in the primary results. We then compare the results of the elections along with the box plots and histogram plots. We find that there is negligible change in the distribution of outcome both for the overall number of elected representatives and for each ordered district (from most to least republican). We display our results in Figure 11. The stability of these results together with those presented in the previous two tests provides robust evidence that we have correctly recovered the underlying probability distribution of redistricting plans.  Fig. 10. We display the probability distribution of elected Democrats with respect the original versus doubled simulated annealing parameters (left). The histogram presented in the primary text overlays this image with the gray shaded histogram for comparison. We display marginal distributions, rather than box plots, to further display the similarity of the sampled structure (right).

C. Insensitivity to the choice of distribution
We have sampled over a fixed a probability distribution function. As mentioned in the main text it is unclear that there is a single 'correct' distribution from which to sample. However, if the results are robust across a range of probability distributions, we may conclude that the precise choice of distributions is irrelevant. To investigate this possibility, we perform the following tests: In all of these tests we find either negligible changes to the histograms and box plots, or qualitatively similar results.
I. Varying thresholds. Achieving a 0.1% population deviation is the only statute of HB92 that we violate -we instead consider all plans below a 1% population deviation. The Judges original redistricting plans in the 'Beyond Gerrymandering' project each had districts that were slightly over 1% population deviation; VTDs were split to achieve a 0.1% population deviation, and this splitting did not impact the election results. In the current section we demonstrate that changing the threshold at lower than 1% population deviations does not effect our results (in Section D.I below, we demonstrate bounds to election result changes in our ensemble when zeroing population). To test that our modified threshold will not impact our results, we change the population threshold to 0.75% and 0.5%. The results are shown in Figure 12, for which we have used the 2012 US congressional voting data. We find that tightening the population threshold has negligible impact on the number of Democrats elected, and that the variation in the histogram box-plots is barely perceptible. In the 0.5% population deviation threshold plots, we have discarded over half of our utilized samples and we still do not see any significant changes. In terms of the compactness threshold, there is no corresponding law to dictate a definite value in the choice of threshold. The NC2016 districts have a maximum isoparametric ratio of around 80, and the NC2012 districts have a maximum of over 400. HB92 mentions that when two districting plans are compared, the one with more compact districts should be preferred. The Judges redistricting has a district with maximum isoparametric ratio of around 54. To test the effect of setting different compactness thresholds, we repeat our analysis with multiple compactness thresholds of 54, 80 and no threshold for all districts within a redistricting. We note that having no threshold does not mean that we have arbitrarily large compactness values. This is because of the cooling process in the simulated annealing algorithm and the fact that we continue to penalize large compactness scores. When considering no thresholding, we have an average maximum isoparametric ratio of around 75 and rarely see redistricting plans with maximal ratio larger than 120. Relaxing or tightening the compactness threshold minimally changes the election results as demonstrated in Figure 13.
II. Varying distribution weights. As described in the Methods section, we have proposed a methodology for determining the weights in the score function that is primarily concerned with obtaining a high percent of redistricting plans below our chosen threshold values. Other parameters may be chosen, and here we test whether making a different choice will affect the statistics on the election outcomes. There are four dimensions to explore -this means that exploring this space exhaustively would come at an large computational cost. To reduce this cost, we perform a sensitivity test about the reported weights used in the primary analysis: We first significantly increase and decrease wp, wI , and wm. For the fourth direction, we could simply increase or decrease wc; we could also increase and decrease the maximum value of β instead, and choose this path instead. Because changing β is equivalent to changing all parameters, this forms a fourth linearly independent search direction, and provides us with information equivalent to changing wc. This leads to eight different parameter sets, which still require a large number of runs. To cut down on the computational cost, we take advantage of the result presented above, where we conclude that ignoring the compactness threshold has a minimal effect on our results. The compactness threshold is by far the most restrictive, so omitting it will allow us  Fig. 11. We extend the samples from the main text by allowing the sampling algorithm to continue until we have sampled roughly 120 thousand districts that fall below the threshold. We find almost no difference between the distributions in the original and extended samples.
to sample more redistricting plans with fewer runs. We examine between 125 and 1100 samples for each search direction (we examine less samples with more restrictive parameters) All eight search directions maintain the overall structure of the election results. The results are visualized in Figure 14. We conclude that significant changes in the parameters will have little effect on the statistical results of the election data.

III. Different weights for lower county splits.
In the above analysis compact districts are prioritized over those with low county splits. In this section we determine the sensitivity of our results when we prioritize keeping a low number of county splits. To make this examination, we double the county weight (wc = 0.8) and reduce the compactness weight (wI = 2). By resetting the compactness threshold to be 80, we obtain just under 15 thousand redistricting plans and note that all of them have a worst district better than the worst NC2016 district. If we had kept the threshold at 60, there would only be a couple thousand samples, thus in order to obtain more samples and because we have found that compactness does not have a large effect on the results, we select the higher threshold. We find that despite changing the weights in this severe way, the over all election results, in addition to the distribution of results per ordered district remains remarkably stable. The results are presented in Figure 15. We also remark that we now have a median of 16 two county splits with a mean of 16.5, in contrast with a median of 21 and mean of 21.6 from the main results (see Section F).

IV. Using a different compactness energy.
We have used the isoparametric ratio for the compactness energy however there are other possible choices. Mentioned above when introducing the isoparametric score function, the dispersion score measures the spread of a district: Typically it is thought of as the ratio between the area of the minimal bounding circle and the districts area. Although useful as a metric to compare two districting plans, the dispersion score does not minimized for jagged perimeters and cannot be used as a sufficient criteria to draw reasonable districts. Nevertheless, we examine the space of districts in which we replace the isoparametric ratios with the dispersion ratio. Given §120-4.52(g)(1) of HB92, which specifies district length and width, we chose to measure dispersion as the ratio between the area of the minimal bounding rectangle and the district area. The redistricting plans we arrive at would never be used due to the jagged perimeters, however if such a drastic change in the compactness criteria gives similar results to those which we have found before, we would have stronger evidence still for the robustness of our analysis.
We threshold on everything but compactness as the isoparametric ratios become very high. We keep the weights the same as in the main results. With the 15,918 resulting redistricting plans for the new energy, the histogram of the election results and sorted district results are compared with our main results in Figure 16. Despite this drastic change in energy definition, the results taken with different energies are remarkably similar to the main results.  I. Error bounds for splitting VTDs to achieve zero population deviation. All of our redistricting plans do not split any VTDs. In practice, VTDs must be split to achieve a population deviation below 0.1%. Splitting VTDs may change the vote count in each district. In this section, we demonstrate that splitting VTDs will have a negligible impact on the partisan vote fractions in each district. To do this, we derive maximal and minimal error bounds on the democratic vote fractions for each redistricting plan by assuming that all of the changing votes will benefit only one party. We repeat this process for the three redistricting plans of interest. The results are plotted in Figure 17. The shifts in the margins and the three plans are barely visible. The small range of possible errors along with the test presented in Section C.I validate the idea that our results are robust when splitting VTDs to achieve a zero population deviation.

II. Bootstrapping.
We use bootstrapping to estimate the error on our samples. On our ensemble of 24,518 redistricting plans, we sample, with replacement, 24,518, and obtain a new mean, median, and upper and lower 50% bounds for the boxes in the box plots. We repeat the process 10,000 times. The maximum spread of resampled means, medians, and upper and lower 50% bounds for the boxes is less than 0.21% over all ordered marginal district distributions in the ensemble.

E. Details on examining nearby redistricting plans
To randomly sample nearby districts, we run the same MCMC algorithm described above, but add a small modification: If a proposed step ever would increase the Hamming distance between any of the districts from the original redistricting in question (either NC2012, NC2016, or the Judges) by more than 40 VTDs, the step is rejected. Alternatively, one can think of J(ξ) = ∞ for any ξ ∈ R which has a district that differs from the original redistricting by more than 40 VTDs. As before, we then threshold the results for NC2016 and the Judges on the population score, the county score, and the minority score. We do not threshold on the Isoperimetric Score as (i) keeping the redistricting near the original is likely sufficient (ii) the threshold would be too severe for the NC2016 redistricting, since it already violates the threshold, and (iii) we demonstrate (see Figure 18) that thresholding on compactness does not qualitatively affect our results. We use no thresholds for the NC2012 districting because most of the redistricting plans close to NC2012 would fail the population threshold since the compactness energy dominates here, as the districts are highly non-compact; the nearby NC2012 districts typically have a district with a few percentage points of population deviation. Subsampling with the above thresholds leads to ensembles of 2,523 redistricting plans about the NC2012 plan, 2,334 redistricting plans about the NC2016 pland, and 2,554 redistricting plans about the Judges plan. In order to justify the lack of thresholding on NC2012, we examine the difference in the local complementary cumulative distribution function for the Judges plan when thresholded and not. We find that there is only a modest difference between the thresheld and non-thresheld results from the Judges which provides evidence that using the non-thresheld results for NC2012 is unimportant for obtaining a representative space of nearby districts (see Figure 18).

F. Characteristics of the redistricting plans in the ensemble
We report the properties of the over 24,000 random restricting plans we have generated using the algorithm described in the preceding sections. All of the random redistricting plans passed the threshold test described in the previous section. As such, they all have no district with population deviation above 1%. However, most have a deviation much less than 1%: the mean population deviation taken over the more-than 13 × 24000 = 312, 000 districts is 0.16% with a standard deviation of 0.14%. Figure 19 gives a finer view for the distribution of the population deviation. We order each redistricting by the maximum population deviation over all districts. To simultaneously give a sense of the median population deviation of the districts with a given maximum population deviation, we examine the local statistics of the ordered districts to find the maximum and minimum values of the local median (plotted as the blue envelope) along with the standard deviation of the median (green envelop) and expected value of the median (dotted line). With this plot we notice that over 50% of redistricting plans have a worst case population deviation under 0.4% and many of these redistricting plans have a median population deviation well below 0.2%.
To compare the population deviation of our generated districts with the districting plans of NC2012, NC2016 and the Judges, we note that all of these districting plans all had to split VTDs in order to achieve a population deviation below 0.1%. Before splitting VTDs, NC2012, NC2016 and Judges had a district with maximum population deviation of 0.847%, 0.683%, and 0.313% respectively, and had median population deviations of 0.234%, 0.048%, and 0.078% respectively, meaning that the districts sampled by our algorithm are very similar to the three districts we have compared our results with in terms of population deviation.
Turing to the isoparametric ratios, recall that all of the districts have an isoparametric constant under our threshold value of 60. The mean isoparametric ratio of the more-than 13 × 24000 = 312, 000 districts is 36.9 with a standard deviation of 9. Examining the second part of Figure 19 gives an analogously finer view for the distribution of the isoparametric ratios of all districts. The figure shows that most redistricting plans have a median isoparametric ratio in the mid-thirties and that roughly 50% of our redistricting plans have a district with isoparametric ratio no worse than 55 for an isoparametric ratio.
When comparing our generated districts, we note that the NC2012, NC2016 and Judges redistricting plans have districts with maximum isoparametric ratio of 434.6, 80.1, and 54.1 respectively, and have median isoparametric ratios of 114.4, 54.5, and 38.2 respectively. The NC2012 and NC2016 districts would be rejected under our thresholding criteria, however we have seen above that lifting threshold conditions on compactness will note effect election results.
We examine the four districts with the highest minority population in each districting in Figure 20. The redistricting plans are ordered on the district with the highest minority population over the over 24000 accepted redistricting plans. The kink in this line at 44.46% occurs due to the minority energy function which does not favor any population above this limit (this number was based on NC2016 districting which was ruled to adhere to the VRA). Roughly half of the redistricting plans have a district with greater than 44.46% of the population as African-Americans, whereas the other half has between 40% and 44.46% in the district with the largest number of African-Americans. For the district with the second highest African-American representation, we remark that over 80% of all redistricting plans have more than 35% African-American representation in the second largest district; there is not a single redistricting that has the second largest African-American district with more than 40% of the population African-American.
Finally we display the histogram of the number of split counties over our generated redistricting plans. There is a median of 21 split counties with a mean of 21.6, and a range from 14 to 31. We remark that NC2012, NC2016, and Judges districting plans had 40, 13, and 12 split counties respectively. We display several of our generated redistricting plans in Figure 21. We display statistics of these selected redistricting plans in Table 3, and compare them with the NC2012, NC2016 and Judges districting plans.
We store the over 24,000 redistricting plans generated by the Monte Carlo algorithm. Each redistricting file holds the FID of the VTD, which are consistent with the FIDs in the Harvard's Election Data Archive Dataverse (31). Each district, labeled by the FID, is associated with a district labeled 1-13 in the second column. See git@git.math.duke.edu:gjh/districtingDataRepository.git  Table 3. We display the various energies for each of the districting plans that we present in the appendix. Note that reported numbers for districting plans are before VTD splits.    Fig. 20. The redistricting plans are ordered by the district with the largest African-American percentage (left). Subsequent ranges show standard deviations for districts with the second and third largest African-American representation. We plot the histogram of the number of county splits in each districting (right). The lighter histogram gives the number total split counties while the darker histogram gives only the number which splits the county into two parts each containing more than 10% of the total VTDs.