Emergence of order in self-assembly of the intrinsically disordered biomineralisation peptide n16N

Abstract We present the results of an aggregation study on the intrinsically disordered biomineralisation peptide n16N, which selects the aragonite polymorph of calcium carbonate and is expected to have aggregation-dependent structure and function. The peptide is a sub-sequence of the in vivo protein n16, with putative framework and polymorph selection roles in the nacre layer of pearl oyster (Pinctada fucata). Employing the intermediate-resolution coarse-grained protein model PLUM*, which has previously been validated with respect to n16N, we simulate assemblies of these peptide units for system sizes inaccessible to atomistic models. We use extensive conformational sampling to show that the configurational ensemble explored by n16N aggregates contains a significant proportion of ordered -structure, within which arrangement of monomers is consistent with a previous hypothesis on functionally distinct subdomains of n16N. We also study an n16N mutant which fails to aggregate in experimental studies and obtain very similar behaviour, the consequences of which are discussed.


Introduction
Biomineralisation refers to the formation of composite bioinorganic materials, known as biominerals, in controlled environments by certain living organisms. Such organisms possess an ability to form hierarchically nanostructured hard tissues via biomineralisation processes which are not well understood, but have a range of potential applications [1][2][3][4]. In a typical scenario, biomineralisation is biologically controlled in an extracellular environment [5,6] by an organic matrix [7][8][9], which becomes the organic component of the final structure. The matrix is typically formed of polysaccharides, glycoproteins and proteins [10] which are often members of the protein fold class known as intrinsically disordered proteins (IDPs).
IDPs are defined by their lack of a single stable tertiary structure [11,12], and often have a function for which their disorder is essential [13]. It has been suggested that biomineralisation proteins are the most disordered functional protein class [14], perhaps owing to required adaptability for multi-stage matrix assembly [13], interaction with ions [15] and selectivity of chemical environment in which assembly occurs [16].
A fully atomistic treatment of multiple n16N peptides, their aggregation and interaction with mineral ions, remains far beyond the reach of contemporary computational capability. This is particularly true in terms of time scale and hence the ability to sample the conformation space available to large aggregates of n16N. We are therefore motivated to study the structure of n16N aggregates using lower resolution, coarse-grained models. These can provide insight into the gross structural motifs adopted by aggregates of multiple n16N units, and hence provide structural input to detailed atomistic simulations of peptide-mineral interactions at the extremities.  [15]. Cationic and anionic residues are coloured in blue and red, respectively. An ellipsis indicates where the n16 parent protein's chain continues.
In previous work [37], we explored use of the coarsegrained protein model PLUM created by Bereau and Deserno [38]. PLUM is a four-bead-per-residue, implicit-solvent protein model. It includes steric backbone-backbone repulsion via Weeks-Chanlder-Anderson potentials, dipole-dipole interactions between peptide bonds, and an explicit backbonebackbone hydrogen bonding potential. Interactions between side chains are captured via further pair potentials parameterised from an analysis of square-well interaction strengths needed to reproduce structures of crystallised proteins. This analysis, which includes all 20 × 20 possible pairs of residues, implicitly captures electrostatic interactions between charged side chains. However, the model contains no explicit electrostatic terms and is hence unsuitable for simulating the interaction of peptides with mineral ions. All other parameters, including bond and angle potentials, were fitted by Bereau and Deserno to reproduce the conformational ensemble for a number of sample protein sequences.
We demonstrated that the standard PLUM parameter set required only minor modification to produce ensembles of n16N peptide structures in agreement with structural predictions and atomistic simulations [15] in the CHARMM22* model [39,40] with TIPS3P water [41]. Specifically, a 5.5% reduction in the strength of backbone-backbone hydrogen bonding produced encouraging preliminary results for both monomer and dimer, including distinguishing hypothesised subdomain regions by relative level of disorder, secondary structure manifestation, and degree and manner of involvement in dimerisation. We refer to the adjusted model as PLUM*.
This background offers the tantalising suggestion that the structure of large n16N aggregates, from which its function may derive, are accessible via simulations using PLUM-style models. The primary goal of this manuscript is therefore to build upon our one-chain and two-chain system findings by presenting results from simulations of three-, six-and eightchain systems of n16N. These system sizes are beyond those accessible to atomistic simulations.
As a secondary objective, we also simulate a mutant of n16N called n16NN, in which the acidic residues are replaced by charge-neutral counterparts (Asp → Asn, Glu → Gln, that is D→ N and E→ Q in Figure 1). n16NN aggregates differently [27,30] with circular dichroism spectra suggesting a greater preference for random coil structure compared to the original peptide. It also fails to form complexes with Ca 2+ [27], likely because Asp and Glu have an active role in organic-mineral association [28]. This hampers the aragonite selectivity of the peptide [29,31]. Our previous work studied the n16NN peptide in single-and two-chain systems, with the goals of gauging PLUM*'s sensitivity to the minor distinctions between the two chains, and forming a hypothesis about n16NN's failure to aggregate normally. In our previous work we observed lower conformational accessibility and greater stability of aggregationfacilitating hydrogen bonds throughout the chain, including in the aggregation-driving SD2 region (see Figure 1), which is identical in both systems. Here, we investigate if these distinctions persist to larger aggregates.

Simulation methodology
Each system simulated is referred to by the peptide name and number of chains in the system, e.g. n16N-3 for the trimer n16N system. Our aim in the present study is to establish the structure of oligomers, and the role of the three subdomains within these. Simulations of each system are initialised with the peptides in close proximity to encourage formation of oligomers. We do not sample configurations in which any of the peptides unbind. All statistics extracted from the ensemble generated should be considered conditional on oligomers of the size simulated having been already formed via aggregation. The simulation package LAMMPS [42] was used with a timestep of 3 fs and a Langevin thermostat with a damping parameter of 1000 fs.
As in our previous work, we employed the replica exchange molecular dynamics (REMD) technique [43,44], varying replicas by thermostatted temperature and keeping the Hamiltonian of each replica the same. The replicas at the highest temperatures achieved enhanced sampling through their ability to overcome barriers on the potential energy surface. We attempted moves to swap pairs of system coordinates regularly, at fixed intervals. These proposals were accepted or rejected according to the Metropolis prescription, allowing the enhanced sampling at high temperature to propagate to the reference temperature without violating canonical ensemble statistics [45].
The number of replicas required to obtain sufficiently overlapping energy histograms increases with the number of degrees of freedom in the system. For the smallest system considered here (3 peptides), we employ 27 replicas thermostatted at temperatures between 300 and 350 K, and for the largest (8 peptides) 84 replicas are used. We selected the temperature set for each system based on preliminary work, with the goal of achieving an acceptance rate of approximately 0.2 for swap attempts between replicas at adjacent temperatures.
Any memory of the initial configuration is very rapidly lost in the highest temperature replica. This de-correlation propagates to lower temperatures via exchanges. It is hence traversal of configurations from high to low temperature which is the limiting factor in determining the ability of our simulations to generate samples which are statistically independent of the starting configuration. To ensure the desired quality of sampling, we tracked the average time for a system to traverse the range of temperatures, and simulate for a time far in excess of this. Ordered by ascending system size, in pairs of n16N followed by n16NN systems, the simulation run times were equivalent to 6.4 and 6.4 µs, 2.5 and 3.3 µs, and 3.4 and 3.3 µs. We stress that the concept of time in coarse-grained models such as PLUM and PLUM* is not well defined, particularly when accelerating low temperature sampling via exchange with other replicas. Times reported above should be considered an indication of sample size only.

Analysis methodology
We visualise the conformational ensemble for each system simulated using the standard Ramachandran plot. Briefly, this visualises the frequency at which successive peptide-bond dihedral angles (φ and ψ measured between −180 • and +180 • ) occur in combination over all samples from the ensemble simulated. This is typically represented as a 'heat map' with hotter colours indicating a higher frequency of occurrence. A high frequency of data points in the upper-left quadrant of such plots indicates a preference for β-sheet structure.
To discern the most favoured aggregate conformations in each trajectory, we use a clustering analysis based on geometric similarity. The tool g_cluster is a part of the Gromacs package [46][47][48] which performs this function. In this work, we make use of the gromos clustering algorithm [49]. A pertinent detail arises from the indistinguishability of identical peptide chains. When calculating similarity to a reference structure, we must also compare to all structures resulting from permuting chain labels within that structure. For the eight-peptide system, this results in 40, 320 comparisons for each snapshot in the REMD simulation. We have modified the g_cluster tool to accommodate this, parallelising the permutations over processors using MPI.
For any clustering analysis, a root-mean-square deviation (RMSD) parameter, governing the allowable deviation of atomic coordinates of structures in the same cluster, must be set. Table 1 shows the values we used, determined as a compromise between the level of false negative groupings and false positive groupings. Note that in general, RMSD cut-offs used in geometric clustering analysis are size-extensive; however, the cut-off distances do not follow a clear-cut, simple dependence on system size in terms of number of atoms alone due to changes in the flexibility between oligomers of differing size. The main point of our clustering analysis is not to identify a definitive set of clusters that can be compared in some absolute sense, but to highlight key structural differences between the n16N-x and n16NN-x aggregates (for a fixed value of x), and between the subdomains therein. So long as the clustering cut-off is kept the same when comparing n16N-x and n16NN-x, and for each subdomain, this remains a fair and clear way to compare the two conformational ensembles.
This clustering analysis allows comparison of flexibility between chain regions and chain types via comparison of populations in the most probable cluster, and via the discrete entropy of the ensemble. Subdomain analyses used same-length representations of each chain region: residues 1 to 8, 9 to 16 and 23 to 30 were used. For all analyses, only backbone atom positions were passed to g_cluster as input.
The VMD viewer [50] was used to produce cartoon-style images of peptide aggregates. While automatic structureidentifying algorithms exist, these are not functional on the geometry of the PLUM model. Assignment was performed manually based on backbone dihedral angles and hydrogen bonding criteria. Figure 2 shows the evolution of the secondary structure of n16N chains with increasing system size. Between the n16N monomer and the n16N-6 system, these Ramachandran plots show a consistent migration of structure from the α-helix peak at (φ, ψ) ≈ ( − 60 • , −45 • ), and other accessible areas, to the βstructure peak at (φ, ψ) ≈ (−100 • , 130 • ). We will subsequently show in detail that the n16N-6 and n16N-8 systems aggregate into full β-sheets. In the n16N-6 and n16N-8 systems, the variance of accessed (φ, ψ) angles about the peaks is extremely low compared to the n16N-3 system.

Results and discussion
Two new peaks appear in n16N-6, Figure 2(c), and become dominant in n16N-8, Figure 2(e). The difference plots reveal that the magnitude of this change from n16N-6 to n16N-8 is approximately twice as great as any difference between the n16N-3 and n16N-6 systems. However, it will be illustrated that the n16N-8 system is in fact more strongly β-sheet-dominated than any of the smaller systems. We suggest these peaks result from an artefact of the PLUM* model occurring when β-strands are tightly packed, in which strands tend to adopt alternating pairs of these (φ, ψ) angles. Figure 3 is a trajectory snapshot which summarily displays the favoured aggregation behavior of the largest system of the n16N peptide, alongside the favoured structure of the monomer (inset). Geometric clustering of like conformations throughout the trajectory was used to generate the most likely structures in the conformational ensemble, and the most populous cluster is shown. This will henceforth be referred to as a system's top cluster. In the top cluster of the n16N-8 system, the chains lie in two highly-ordered β-sheets, both exhibiting a chain kink at P15 in the sequence. The smaller sheet lies at a 90 • rotation about its plane to the larger sheet, optimising overlap of sheet SD2 regions (see Figure 1 for a definition of SD2) which are hypothesised to be responsible for aggregation, without overlap of other subdomains. In stark contrast to the PLUM* predictions for the monomer system, almost no α-helical structure We refer readers to the supporting information for clustering results at every system size. A trend of aggregation through increasing β-structure content and decreasing α-helix content as chain number increases is demonstrated. Figure 4 shows the frequency of inter-peptide hydrogen bonds in n16N-8 over the entire ensemble. It is clear that chains exclusively undergo parallel β-sheet oligomerisation, and fall into a shifted register in which inter-peptide hydrogen bonding most frequently occurs between residues i and i + 2. This matches the staggered chains which constitute the sheets in Figure 3. The PLUM* model captures the fact that the proline residue is unable to form hydrogen bonds through its backbone nitrogen atom. The effect of this can be seen in the absence of hydrogen bonds from the heat map of Figure 4, and is likely related to the turn in each β-sheet of Figure 3. It is impossible for two proline residues to hydrogen bond to each other, which implies that a perfect parallel β-strand of two concurrent n16N chains cannot exist, and may account for the observation of chains preferentially aggregating out of step. Figure 4 suggests that shifts of ±n, where n = 1 or 2 account for the majority of bonding, with approximate symmetry between positive and  negative shifts. This leads to aggregates exhibiting a disorder in chain stacking, qualitatively distinct from amyloid-like structure. The lack of any hydrogen bonding away from the diagonal in Figure 4 signifies a degree of specificity in aggregation behavior of n16N, suggesting that the forms of ordered structures in the conformational ensemble of n16N-8 do not deviate significantly (other than registry shifts) from the cluster in Figure 3.
In Figure 5, we present the fraction of configurations in which each residue within n16N and n16NN is involved in hydrogen bonding. Compared to Figure 4, information on intra-chain bonding topology is lost, masking any possible difference in aggregation behavior between the two peptides. The area under the curves can be taken as a normalised measure of total interpeptide interaction, suggesting that hydrogen bonding in aggregates of both n16N and n16NN increases with system size, and also that differences in bonding between successive oligomers reduce with system size. We can therefore be confident that no qualitative difference in behavior will arise for larger aggregates than studied here.
Our expectation to see SD2 as the principal driver of aggregation in n16N is largely confirmed by Figure 5. The residues most involved in inter-peptide hydrogen bonding are always located in SD2. However, at the largest system sizes, the Nterminal region of SD3 is also highly involved, as it is in the aggregate in Figure 3. The continuation of β-structure well into SD3 is a result unseen at smaller system sizes. This defies the expected subdomain behavior, and is unusual given the highly charged nature of SD3. This behavior may arise from the crowding together of chains in the largest systems, coupled with the approximation inherent to the PLUM* model of describing all side-chain interactions on a simple hydrophobic scale, therefore omitting electrostatic repulsive interactions. Matching the subdomain hypothesis, SD1 remains mostly uninvolved in interpeptide hydrogen bonding for all systems.
In the two-unit systems previously studied [37], a wholechain difference was seen in measures of conformational accessibility, including the frequency of interpeptide hydrogen bonds between n16N and n16NN, with the latter being higher. This led to the hypothesis that the whole-chain conformational accessibility, which may be a prerequisite of assembly, was compromised in the mutant. However, Figure 5 reveals that increasing system size diminishes the hydrogen bonding differences between the two peptides: the 3-chain systems are similar up to residue K5, the 6-chain ones are similar up to residue W13, and the 8-chain ones are similar throughout the entire sequence. To examine this discrepancy fully, we include an additional approach to inspecting conformational accessibility.
In table 2, our clustering analysis is extended to include comparable, equal-length representations of subdomains SD1, SD2 and SD3, as well as the full systems. By comparing the population of every group's top cluster, we gain a further insight into the relative conformational accessibility of each. We also compute the discrete entropy (− i p i ln p i , where p i denotes the population of the ith cluster) over all clusters in the ensemble to provide an indication of the extent to which the remaining population is distributed over multiple clusters. Note that cluster population comparisons of differently-sized full systems are not meaningful. Similarly, comparison between the full chain and subdomain data is meaningless.
By this measure, there are no significant differences in levels of disorder in a given subdomain and system size between n16N Table 2. Population of most populated clusters in full-system and subdomain clustering analyses, and discrete entropy of the configurational ensemble, providing an insight into the stability of the system or chain region compared to regions or systems of the same size.

Population of region (%)
Discrete entropy (k B ) and n16NN. This includes the highly charged SD3 subdomain, where the point mutations are introduced. The population of the top cluster of SD3 increases with increasing system size, overtaking SD1 in the largest system, corroborating the evidence from the rising hydrogen bonding frequency in Figure 5 of falling disorder. Data in table 2 demonstrate that SD2 consistently explores fewer conformations than the other two subdomains at all system sizes. This supports suggestions that this subdomain is key to aggregation by presenting a more consistent geometry for inter-chain hydrogen bonding.

Conclusions
In summary, we have presented a specific, scalable oligomer structural model for n16N peptide systems involving staggeredchain β-sheets stacking at 90 • rotations to each other, with overlapping SD2 regions. The structure represents an emergence of order entirely unseen in the monomer state, matching expectations of disordered framework proteins [13]. The structure conforms in the main to existing hypotheses about the peptide's three subdomains [15], with aggregation being centred on SD2. We observe the maximum propensity for interchain hydrogen bonding between SD2 subdomains, particularly in the smaller aggregates, coupled with the most consistent subdomain structure. Results from this aggregation study do not highlight a clear role for SD1 and SD3. However these two regions are less involved in aggregation-enabling inter-peptide hydrogen bonds, which leaves them available to perform other roles, such as the 'fly-casting' mechanism proposed previously. Given our validation of the n16N PLUM* model against atomistic simulations, we expect this conclusion to be qualitatively robust. We caution that inclusion of side-chain interactions with atomistic detail could feasibly result in different quantitative conformational flexibility in this region.
We believe that the approach adopted here for simulating IDPs, i.e. validation of a PLUM-style model against atomistic simulations via the conformational ensemble for small numbers of peptides, followed by scale-up to larger aggregates, is a viable route to identifying representative structures for use in detailed atomistic simulations.
As a secondary objective, we have compared properties of the structural ensemble of n16N with those of n16NN in an attempt to understand their differing ability to facilitate biomineralisation. The resulting data do not demonstrate any significant differences in the ensemble statistics of the two peptides when formed into oligomers. Structural differences reported previously [37] for the dimer do not persist into these larger aggregates. Conclusions regarding n16NN are somewhat more tentative than for n16N, as PLUM* data have not been validated against atomistic simulations for the mutant. However, even if the PLUM* model is transferable to n16NN, there are a number of potential reasons why no difference in aggregation behaviour was observed. Firstly, we stress that our sampling does not consider unbinding of peptide chains and hence does not inform on the relative stability of bound versus unbound configurations. Detailed binding free energy calculations may reveal that n16NN structures are less stable than their n16N equivalents due to the weaker side-chain interactions. Secondly, the inability of the n16NN mutant to function correctly in biomineralisation processes may lie in kinetic effects (such as the mobility in solution of individual peptides), or interactions with ions during aggregation rather than in relative stabilities of different aggregate structures. Neither of these possibilities can be captured in the present model, motivating future studies into how such details can be captured in tractable models.