TMO: time and memory optimized algorithm for more accurate alignment of genes associated with trinucleotide repeat disorders

In this study, a time and memory optimized (TMO) algorithm is presented. Compared with Smith–Waterman's algorithm, TMO is applicable for a more accurate detection of continuous insertions/deletions (indels) in fragments of genes associated with disorders caused by over-repetition of a certain codon. The improvement comes from the tendency to pinpoint indels in the least preserved nucleotide pairs. All nucleotide pairs that occur less frequently are classified as less preserved and are considered mutated codons whose mid-nucleotides were deleted. Another benefit of the proposed algorithm is its general tendency to maximize the number of matching nucleotides included per alignment, regardless of any specific alignment metrics. Whereas the structure of a Smith–Waterman solution depends on the adjustment of the alignment parameters, so that an incomplete (shortened) solution may be derived, our algorithm does not reject any of the consistent matching nucleotides that can be included in the final solution. In terms of computational aspects, our algorithm runs faster than Smith–Waterman for very similar DNA and requires less memory than the most memory-efficient dynamic programming algorithms. The speed-up comes from the reduced number of nucleotide comparisons that have to be performed, without imperiling the completeness of the solution. Since only four integers (16 bytes) are required for tracking a matching fragment, regardless of its length, our algorithm requires less memory than Huang's algorithm.


Introduction
In order to detect mutual similarities (differences) between two or more biological sequences (DNA, RNA or protein), the sequences have to be aligned. By aligning two or more biological sequences, a model is generated that reveals not only the elements' matching and mismatching positions, but also the positions of elements' deletions (insertions). From an applicational viewpoint, alignment is used for phylogenetic analysis, identification of functional motifs, profiling of genetic diseases, prediction of proteins' tertiary structure from similarities in their primary structures, etc. As a computer process, each alignment requires time and memory. In general, the time and memory requirements for performing an alignment depend on the number and length of the sequences that are aligned. In some cases, when aligning many long sequences, such as eukaryotic chromosomes or complete eukaryotic genomes, the time and memory complexity of an algorithm may be the decisive factor in whether the alignment can be performed at all on a specific computer with limited processor speed and memory. Therefore, improvements in time and memory complexity are always welcome, but not at the cost of generating partial solutions that do not completely depict the structural, functional or evolutionary relationship between the aligned sequences.
Depending on the number of sequences that are being aligned, there are algorithms for pairwise and multiple sequence alignment. Pairwise alignment algorithms align two sequences, whereas multiple alignment algorithms align more than two. In terms of the scope of the generated solution, sequence alignment algorithms can be classified as local or global. Local alignment algorithms search for the most conserved (similar) DNA fragments between two or more sequences, whereas global alignment algorithms produce solutions that incorporate every element. In terms of the conceptual framework, since the pioneering work of Needleman and Wunsch [1] in 1970, all algorithms for DNA pairwise alignment can be classified either as dynamic programming based or as using some type of heuristic in order to obtain the solution (heuristic algorithms). Both groups of algorithms have their own pros and cons. Dynamic programming-based algorithms always produce a solution that maximizes the alignment score or minimizes the distance (dissimilarity) for a specific metrics. This solution is referred to as an optimal one and it requires O(n×m) time and memory, where n and m are the lengths of the sequences that are being aligned. The unfavourable quadratic time and memory complexity sometimes limits the application of these algorithms, especially if long sequences are aligned on a standard PC configuration. In contrast to dynamic programming-based algorithms, heuristic algorithms are faster and they can be applied to long sequences, such as complete chromosomes or complete genomes. They gain speed of execution through a significant reduction of the problem domain, by applying some type of filtering or prior indexing. As a result, a partial or incomplete solution may be produced that does not fully represent the real structural, functional or evolutionary relationship between the sequences. A compromise between these opposite groups of algorithms is always welcome.
The oldest, but still commonly used, DNA pairwise alignment algorithms are Needleman–Wunsch [1] and Smith–Waterman. [2] They are dynamic programming-based implementations that maximize the score of an alignment for a specific metrics. Needleman–Wunsch [1] generates a global alignment, whereas Smith–Waterman [2] generates a local alignment, both in quadratic O(n×m) time and memory. Instead of maximizing the alignment's score, Sellers [3,4] minimizes the distance (dissimilarity) between two sequences. Ulam [5] gave the formal definition of the distance (dissimilarity) between two sequences as the minimal number of edit operations (element deletion, element substitution or element insertion) that have to be performed in order to transform one of the sequences into the other. Smith et al. [6] proved that, no matter whether the alignment's score is maximized [1,2] or the distance (dissimilarity) between two sequences is minimized, [3,4] the two procedures are equivalent in terms of the structure of the generated solution under certain conditions. Rather than searching for one optimal solution, as done in previous studies, [1–4] the algorithms of Goad and Kanehisa [7] and Waterman and Eggert [8] generate a list of significant local alignments. For instance, Waterman and Eggert [8] identified a set of k best local alignments between two sequences. This is done by searching a dynamic programming matrix k times for the k highest-scoring alignments. It is worth mentioning that Fitch and Smith, [9] as well as Gotoh, [10,11] made a distinction between the penalty that is assigned for opening a gap and the penalty that is assigned for extending an already opened gap.
The first attempts at linearization of the memory complexity of dynamic programming algorithms were applications of the divide-and-conquer approach. The idea is to find in each step an interception point (u, v) such that the alignment of the sequences a: a_1...a_n and b: b_1...b_m is generated by joining the optimal alignments of the subsequences (a_1...a_u, b_1...b_v) and (a_{u+1}...a_n, b_{v+1}...b_m). This is repeated until a base is aligned to a base or to a gap. Since the path of the optimal alignment cannot pass through the north-east and south-west quadrants of the dynamic programming matrix, these quadrants can be rejected in each step of the procedure. This idea was theoretically discussed by Hirschberg [12] and it served as a basis for the memory-linear implementations that were proposed afterwards by Myers and Miller, [13] Huang et al. [14] and Huang and Miller. [15] Apart from the efforts at memory linearization, there were also attempts to improve the time complexity. The first approach is known as alignment within a specific diagonal band, or diagonal alignment. The idea is to calculate only those cells in the dynamic programming matrix that are located near the main diagonal, since the path of the optimal alignment converges to it. If the width of the diagonal band equals k, k < m, then the time complexity can be improved up to O(k×n). This approach was discussed by Sankoff and Kruskal, [16] whereas solutions of the problem were proposed by Fickett, [17] Ukkonen [18] and Chao et al. [19] However, a drawback is that its application is limited only to very similar sequences.
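To make the divide-and-conquer idea concrete, the following Python sketch (our illustration, not any of the cited implementations; scoring parameters and function names are assumptions) computes a global alignment in linear memory: nw_score derives one row of the dynamic programming matrix in O(m) space, and hirschberg joins the optimal alignments of the subsequences at the interception point (u, v).

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    # Last row of the Needleman-Wunsch score matrix in O(len(b)) memory.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev

def hirschberg(a, b):
    # Linear-memory global alignment: split a at its midpoint u, find the
    # interception point v in b, and join the two optimal sub-alignments.
    if not a:
        return '-' * len(b), b
    if not b:
        return a, '-' * len(a)
    if len(a) == 1:
        # Trivial base case: align the single base to its best match in b.
        j = b.find(a) if a in b else 0
        return '-' * j + a + '-' * (len(b) - j - 1), b
    u = len(a) // 2
    left = nw_score(a[:u], b)
    right = nw_score(a[u:][::-1], b[::-1])
    v = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    a1, b1 = hirschberg(a[:u], b[:v])
    a2, b2 = hirschberg(a[u:], b[v:])
    return a1 + a2, b1 + b2
```

Only two score rows are ever held in memory at once, which is the property the memory-linear implementations above exploit.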
Heuristic algorithms were introduced to meet the necessity for a significant improvement in the time complexity of the DNA alignment process. They can speed up the process up to a hundred times, but there is no guarantee of optimality of the generated solution. The increased speed of execution makes these algorithms suitable for comparing a query sequence to a database, rather than comparing only two sequences.
For instance, FASTA, the DNA and protein sequence alignment software package, [20] looks for common hits between two sequences, which are afterwards extended. Only the 10 most significant extensions are included in the solution. The basic local alignment search tool (BLAST) [21] compiles an extended list of words that resemble the words extracted from one of the sequences. These words are searched in the other sequence in order to find perfect hits. Each hit is extended in both directions until the score drops below a certain threshold. Unlike BLAST, the BLAST-like alignment tool (BLAT) [22] is a less sensitive approach that looks for perfect or near-perfect hits, and therefore can be applied only to the comparison of evolutionary closely related sequences. PatternHunter [23] improves the BLAST sensitivity by looking for common hits that include at least k common nucleotides. As in BLAST, these hits are afterwards extended in both directions. Fast length adjustment of short reads (FLASH) [24] is also a seed-based approach that looks for common words between two sequences. After calculating the differences between the starting positions of common hits, the longest list of equal differences determines the most similar fragments between the sequences. Yet Another Similarity Searcher (YASS), a pairwise sequence alignment software, [25] looks for groups of close hits. The maximum distance between the hits in a group should not exceed d, where d is calculated with respect to the frequencies of base substitutions and base insertions (deletions).
The most efficient approach to identify common hits between two sequences is to employ a suffix tree. This data structure is used by the maximal unique match finder MUMmer [26] and by AVID, a global alignment method. [27] MUMmer looks for maximum unique matches (MUMs), from which an alignment is derived. A maximum unique match is defined as a common hit of maximum length that is not part of any other hit. After rejecting hits whose length is less than half the length of the longest common hit, AVID looks for a subset of non-crossing and parallel hits, which are identified by traversing a suffix tree constructed for the sequence obtained by joining together the sequences being aligned. Unlike MUMmer and AVID, LAGAN, a tool for large-scale multiple alignment of genomic DNA, [28] detects ungapped local alignments between two sequences in which, in addition to nucleotide hits, mismatches are also allowed. Local ungapped alignments are identified by the CHAOS algorithm, a fast database search tool that creates a list of local sequence similarities [29] and can detect short and partial hits between two sequences. LAGAN selects a subset of consecutive ungapped alignments with maximum alignment score, which are afterwards connected. Super pairwise alignment (SPA) [30] exploits methods based on probability and combinatorics to find the positions where gaps should be inserted. SPA calculates the percentage of local similarity within a shifting window of fixed size. The shifting window includes fragments of both sequences for which the percentage of similarity is calculated as the ratio between the number of matching elements and the length of the shifting window. Gaps are inserted if a minimum shift of nucleotides drastically increases the percentage of local similarity.
In terms of computational aspects (time and memory complexity) and the comprehensiveness of the generated solution, a compromise between dynamic programming-based algorithms and heuristic algorithms is presented. The proposed algorithm has several distinctions from the existing ones. First, the proposed algorithm does not reject any of the hits that can be included in the solution. When dynamic programming algorithms are applied, some of the hits may be rejected because they decrease the overall alignment score for a specific metrics, in spite of the fact that these hits could be included in the alignment. The same also happens when heuristic algorithms are applied, due to the filtering in the preprocessing phase, where usually the shorter hits are rejected, which is how these algorithms gain speed of execution. By generating a solution that always includes the maximum possible number of base-to-base hits, the proposed algorithm is applicable for comprehensive structural, functional and evolutionary analysis and can be applied for distant homology detection. Even for dissimilar sequences, the proposed algorithm is superior to dynamic programming and heuristic algorithms in terms of accurate and comprehensive detection of matching elements between two DNA samples. Second, rather than inserting gaps arbitrarily, they are always inserted in base pairs XY with minimal frequency of occurrence; in other words, we take into consideration the preservation of base pairs when determining the position where an indel is localized. Most importantly, we found that this model allows a more accurate detection of continuous indels in polyglutamine tracts of spliced mRNA of the huntingtin gene than dynamic programming. Third, unlike dynamic programming algorithms, which require O(n×m) base comparisons in order to identify the set of common hits, the proposed algorithm performs fewer than n×m comparisons for the same purpose. This can be interpreted as a gain in the speed of execution, but unlike heuristic algorithms, there is no loss of data. Furthermore, if similar sequences are aligned, the time complexity of the proposed algorithm approximates O(n×k×log2 m), where k is the number of common hits (k < m), m is the length of the shorter sequence and n is the length of the longer sequence. When dissimilar sequences are being compared, the time performance of our algorithm equals Smith–Waterman, i.e. O(n×m) time is required. Fourth, due to the memory-efficient representation of each hit with a data tuple that tracks the hit's starting and final positions, the memory requirement of the proposed algorithm is linear. In all tests that we performed, our algorithm required less memory than the length of the shorter sequence, regardless of the identity of the sequences that were compared. This makes our algorithm one of the most memory-efficient approaches proposed so far.

Materials and methods
The time and memory optimized (TMO) algorithm runs in two phases. In the first phase, we perform a search for 'common' and 'consistent' hits of two DNA sequences, upon which an alignment is constructed in the second phase. To optimize the memory requirement, a data tuple of four integers (p_a,s; p_a,f; p_b,s; p_b,f) per identified consistent hit is tracked in memory, where p_a,s is the starting position of the hit in sequence a, p_a,f is the final position of the hit in sequence a, p_b,s is the starting position of the hit in sequence b and p_b,f is the final position of the hit in sequence b. Under this approach, the tuple (1, 4, 1, 4) represents the first consistent hit ACAC in the samples a and b, and the tuples (7, 12, 9, 14) and (17, 19, 15, 17) represent the second (TGGGGG) and the third (ACA) consistent hit, respectively (Figure 2).
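The four-integer representation can be sketched as follows (a minimal Python illustration; the Hit type and its field names are our own, and positions are 1-based as in the text):

```python
from typing import NamedTuple

class Hit(NamedTuple):
    # Four integers (16 bytes) track a matching fragment, regardless of length.
    pa_s: int  # starting position of the hit in sequence a (1-based)
    pa_f: int  # final position of the hit in sequence a
    pb_s: int  # starting position of the hit in sequence b
    pb_f: int  # final position of the hit in sequence b

a = "ACACAATGGGGGCTCTACA"
b = "ACACTGTTTGGGGGACA"
# The three consistent hits of Figure 2:
hits = [Hit(1, 4, 1, 4), Hit(7, 12, 9, 14), Hit(17, 19, 15, 17)]
fragments = [a[h.pa_s - 1:h.pa_f] for h in hits]
assert fragments == ["ACAC", "TGGGGG", "ACA"]
# Each tuple designates the same fragment in both samples:
assert all(a[h.pa_s - 1:h.pa_f] == b[h.pb_s - 1:h.pb_f] for h in hits)
```

Twelve integers thus suffice for the three hits above, independently of how long each matching fragment is.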

Phase 1: searching and representing common hits
The set of consistent hits {ACAC: (1, 4, 1, 4); TGGGGG: (7, 12, 9, 14); ACA: (17, 19, 15, 17)} is not the only one that can be derived for the samples that we consider. There are many other sets of consistent hits that mutually differ in the number of hits and in the number of matching nucleotides per hit, but the algorithm always tries to find the set of consistent hits that includes the maximum possible number of matching elements; i.e. the set {ACAC: (1, 4, 1, 4); TGGGGG: (7, 12, 9, 14); ACA: (17, 19, 15, 17)} is reported because no other set that can be defined for the samples a and b includes more matching nucleotides than the one detected.
During the search for consistent hits, our algorithm exploits an idea based on partial comparison of mismatching DNA words, instead of comparing them entirely: nucleotides at the same positions in two mismatching DNA words are mutually compared only until the first pair of mismatching nucleotides is detected. Under this approach, only one base comparison is required to detect the mismatch of two DNA words in the best case, whereas in the worst case all nucleotides have to be compared.
For instance, to detect the mismatch of the words AGCT and GCTA, only the starting nucleotides have to be compared (A to G). The remaining nucleotides are not compared since, regardless of whether they match or not, the words AGCT and GCTA cannot perfectly match. The worst-case scenario is when all nucleotides have to be compared in order to detect the mismatch. That is the case when all nucleotides match except the last ones. To detect the mismatch of the words GCTA and GCTG, four comparisons have to be performed (the number of base comparisons equals the words' length), since the last nucleotides (A and G) constitute the first found pair of mismatching nucleotides. In general, the number of base comparisons may be greater than one and less than the length of the words, depending on the position where the first pair of mismatching nucleotides is identified. For instance, when comparing the words TCTG and TACT, two base comparisons have to be performed in order to confirm the mismatch, since the first pair of mismatching nucleotides is found at position 2 (C to A).
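The early-exit word comparison can be sketched as follows (an illustrative Python helper, not the authors' code); it returns both the verdict and the number of base comparisons spent:

```python
def compare_words(w1, w2):
    # Compare two equal-length DNA words position by position, stopping
    # at the first pair of mismatching nucleotides.
    comparisons = 0
    for x, y in zip(w1, w2):
        comparisons += 1
        if x != y:
            return False, comparisons  # mismatch confirmed early
    return True, comparisons           # all positions match

# The three cases discussed above:
assert compare_words("AGCT", "GCTA") == (False, 1)  # best case
assert compare_words("GCTA", "GCTG") == (False, 4)  # worst case
assert compare_words("TCTG", "TACT") == (False, 2)  # intermediate case
```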
Our algorithm looks for consistent hits in 'consistent domains'. A 'consistent domain' is defined as a pair of sequences or subsequences in a and b that are located outside the identified hits and do not form a crossing diagonal with any of the hits (Figure 3).
At the beginning, there is only one consistent domain, which is made of the sequences being compared (Figure 3(a)). The search there results in one consistent hit, which derives two new consistent domains where two new hits may be identified (Figure 3(b)). These two hits, together with the first identified match, derive four new consistent domains, where we can also search for consistent hits (Figure 3(c)), and so on, until the length of at least one consistent fragment equals 0 or all nucleotides in the consistent domains are compared (Figure 3(d)).
In each consistent domain, the algorithm detects the 'longest' and 'leftmost' common hit by performing partial comparisons of mismatching DNA words of the shorter sequence/subsequence against the longer sequence/subsequence until the hit is identified.
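A naive sketch of this search is given below (our simplification: it scans words from longest to shortest and left to right, using early-exit comparisons, but does not reproduce the exact comparison schedule of Table 1):

```python
def compare_words(w1, w2):
    # Early-exit comparison: stop at the first mismatching pair.
    for x, y in zip(w1, w2):
        if x != y:
            return False
    return True

def longest_leftmost_hit(a, b):
    # Return (pos_in_a, pos_in_b, length) of the longest, leftmost common
    # word of a consistent domain (0-based positions), or None.
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    for length in range(len(shorter), 0, -1):        # longest words first
        for i in range(len(shorter) - length + 1):   # leftmost first
            for j in range(len(longer) - length + 1):
                if compare_words(shorter[i:i + length], longer[j:j + length]):
                    return (i, j, length) if shorter is a else (j, i, length)
    return None

# The first hit in the domain made of a: AGCTAG and b: TGCTG is GCT,
# found at (1-based) positions 2-4 in both samples:
assert longest_leftmost_hit("AGCTAG", "TGCTG") == (1, 1, 3)
```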
We will discuss this approach on the short samples a: AGCTAG and b: TGCTG (Figure 4(a)). At the beginning, there is only one consistent domain, which is made of the samples a and b, and this is the space where the algorithm tries to find the first consistent hit. Table 1 shows the order in which the algorithm compares mismatching DNA words and which of their nucleotides (highlighted) are mutually compared.
Words of 5, 4 and 3 nucleotides from both sequences are partially compared until the first hit GCT is identified. The search requires 20 base comparisons in total (Table 1). The number of performed comparisons per pair of DNA words depends on the position of the first identified pair of mismatching nucleotides. For instance, one base comparison is required to detect the mismatch (AGCTA, TGCTG), since the words' starting nucleotides mismatch. Two comparisons have to be performed to detect the mismatch (TAG, TGC), because the first mismatch of nucleotides is found at position 2, and four base comparisons are required in order to detect the mismatch of the words GCTA and GCTG, since all nucleotides except the last ones match (Table 1).
The first identified hit GCT derives two new consistent domains. The first consistent domain is made of the nucleotides that precede GCT (A in a and T in b), whereas the second is made of the nucleotides that follow GCT in a and b (AG in a and G in b) (Figure 4(b)).
There is no common hit in the domain that precedes GCT, whereas a single guanine match is found in the consistent domain that follows GCT (Figure 4(c)). The algorithm requires 23 comparisons in total to derive the set of consistent hits for the samples a: AGCTAG and b: TGCTG, {GCT: (2, 4, 2, 4); G: (6, 6, 5, 5)} (Table 1). Since tuples are appended into the set as hits are identified, before moving to the second phase, the set of tuples needs to be sorted according to the hits' occurrence in the analysed sequences. For instance, if the tuple (7, 12, 9, 14) was appended before (1, 4, 1, 4), the algorithm will swap the first two tuples before moving to the second phase, so that the sorted set {(1, 4, 1, 4); (7, 12, 9, 14); (17, 19, 15, 17)} tracks the natural order of appearance of the consistent hits found in the samples a and b (ACAC, followed by TGGGGG and then ACA) (Figure 2).
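The reordering before the second phase amounts to sorting the tuples by their starting positions (a one-line Python sketch; since consistent hits cannot cross, we assume sorting on p_a,s alone suffices):

```python
# Tuples appended out of natural order, e.g. the second hit found first:
hits = [(7, 12, 9, 14), (1, 4, 1, 4), (17, 19, 15, 17)]
hits.sort(key=lambda t: t[0])  # order by starting position in sequence a
assert hits == [(1, 4, 1, 4), (7, 12, 9, 14), (17, 19, 15, 17)]
```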
We can use the previous samples to analyse the gain in terms of memory reduction compared to Huang's algorithm [14] and the gain in saved nucleotide comparisons compared to Smith–Waterman. [2] For instance, if Huang is applied to a = ACACAATGGGGGCTCTACA and b = ACACTGTTTGGGGGACA, the memory requirement would be proportional to the sum of the lengths of the sequences being aligned, i.e. n + m = 19 + 17 = 36 integers would have to be stored in memory, in contrast to the number of hits × 4 = 3 × 4 = 12 integers that the proposed algorithm stores for tracking hits' positions. To derive the set of consistent hits for the samples a: AGCTAG and b: TGCTG, Smith–Waterman requires lengthOf(a) × lengthOf(b) = 6 × 5 = 30 comparisons, whereas our algorithm requires 23 comparisons for the same purpose, without losing data on any consistent hit, as may happen in heuristic algorithms for local DNA alignment such as FASTA, [20] BLAST, [21] BLAT, [22] PatternHunter [23], FLASH [24] and YASS. [25]

Phase 2: generating local pairwise alignment

At the beginning of the second phase, the number of indels between successive hits is precisely calculated.
Assuming that a is the longer sequence and b is the shorter sequence, then for each two successive hits, the k'th and the (k+1)'th, the number of gaps n.g that have to be inserted between them is calculated according to Equation (1), where (p_a,s|k; p_a,f|k; p_b,s|k; p_b,f|k) and (p_a,s|k+1; p_a,f|k+1; p_b,s|k+1; p_b,f|k+1) are the tuples that track their positions in a and b, respectively:

n.g = (p_a,s|k+1 - p_a,f|k) - (p_b,s|k+1 - p_b,f|k)    (1)

The sign of n.g (>0, <0 or 0) determines whether gaps should be inserted or not and, if yes, in which mismatching fragment they have to be inserted, whether in sequence a or in sequence b.
If n.g > 0, then n.g gaps have to be inserted in b. If n.g < 0, then |n.g| gaps have to be inserted in a. If n.g = 0, then no indel is found between the successive hits.
If we apply Equation (1) to the set of consistent hits {ACAC: (1, 4, 1, 4); TGGGGG: (7, 12, 9, 14); ACA: (17, 19, 15, 17)}, which was derived for the samples a = ACACAATGGGGGCTCTACA and b = ACACTGTTTGGGGGACA, we obtain that two gaps have to be inserted between ACAC and TGGGGG in a and four gaps have to be inserted between TGGGGG and ACA in b. Some base pairs may not occur at all if short sequences are analysed. Sample a contains one AA and one AT nucleotide pair, but they do not occur at all in the other sample (B[A, A] = 0, B[A, T] = 0) (Figure 6). The algorithm pinpoints indels in mismatching fragments that are extended by appending the last and the first nucleotide from the surrounding hits. Extending is required because, apart from inside the mismatching fragments, gaps can also be inserted between the hits and the mismatching DNA. For instance, the two gaps, apart from inside the AA mismatch (Figure 7(a)), can also be inserted between the first hit ACAC and AA in sample a (Figure 7). The whole process of identifying indels step by step, according to the proposed methodology, is shown in Figure 8. Our algorithm reports the CA-A-T modification, because AA and AT are considered mutated triplets whose mid-elements were deleted (A-A → AA; A-T → AT), due to the fact that the CA nucleotide pair tends to be more preserved than AA/AT in a (CA occurs three times, whereas AA and AT occur only once). Therefore, rather than modifying a frequently occurring nucleotide pair (CA), our algorithm chooses to pinpoint indels in rarely occurring nucleotide pairs (AA/AT) (Figure 8(b) and 8(c)).
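Equation (1) can be checked directly against the worked example (an illustrative Python sketch of our reading of the equation; a negative n.g means |n.g| gaps in a, a positive one means n.g gaps in b):

```python
def gap_count(hit_k, hit_k1):
    # n.g = (p_a,s|k+1 - p_a,f|k) - (p_b,s|k+1 - p_b,f|k)   -- Equation (1)
    _pa_s, pa_f, _pb_s, pb_f = hit_k
    pa_s1, _pa_f1, pb_s1, _pb_f1 = hit_k1
    return (pa_s1 - pa_f) - (pb_s1 - pb_f)

hits = [(1, 4, 1, 4), (7, 12, 9, 14), (17, 19, 15, 17)]
assert gap_count(hits[0], hits[1]) == -2  # two gaps in a, between ACAC and TGGGGG
assert gap_count(hits[1], hits[2]) == 4   # four gaps in b, between TGGGGG and ACA
```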
In general, for each extension, a list of all nucleotide pairs found there, together with data regarding their occurrence in a (b), is compiled. The number of occurrences of each base pair is read from matrixA (matrixB), depending on which sequence the extension belongs to. Data from the list is used to identify the rarest-occurring nucleotide pair in each step, where, in fact, an indel is identified. The rarest-occurring nucleotide pair XY in a (b) of all nucleotide pairs in the list is the one where the first gap is inserted: XY → X-Y. This nucleotide pair, in which an indel is identified, has to be removed from the list before repeating the search if another gap has to be inserted. The same is repeated until all gaps are inserted (their number is calculated according to Equation (1)). If insertion must continue but the list is empty (gaps were inserted in all nucleotide pairs in the extension), the proposed algorithm chooses to insert all remaining gaps next to the first nucleotide in the extension.
For the samples that we consider, CAAT is the extension of the mismatching AA fragment in sample a, obtained by appending the last nucleotide C from the first hit (ACAC) and the first nucleotide T of the following hit (TGGGGG). This extension contains three nucleotide pairs: CA, AA and AT (Figure 8(a)). The numbers of occurrences of these nucleotide pairs are read from matrix A (Figure 6(a)) and stored in the list L = {A[CA] = 3, A[AA] = 1, A[AT] = 1}. Searching the list for the rarest-occurring nucleotide pair in a of all nucleotide pairs in the list, the algorithm pinpoints the location of the first indel. The first found nucleotide pair is always reported and that is the nucleotide pair where the gap is inserted.
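The pair-frequency bookkeeping and the rarest-pair gap placement can be sketched as follows (an illustrative Python version; pair_counts plays the role of matrixA/matrixB, and ties are broken toward the leftmost pair, which we assume matches the 'first found pair is reported' rule):

```python
from collections import Counter

def pair_counts(seq):
    # Frequencies of all overlapping nucleotide pairs (role of matrixA/B).
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))

def place_gaps(extension, counts, n_gaps):
    # Insert n_gaps gaps, always splitting the rarest-occurring pair first
    # (XY -> X-Y); a pair that was split is no longer adjacent, so it drops
    # out of the candidate list automatically.
    ext = extension
    for _ in range(n_gaps):
        cand = [(counts[ext[i:i + 2]], i)
                for i in range(len(ext) - 1) if '-' not in ext[i:i + 2]]
        if not cand:                      # list exhausted:
            ext = ext[0] + '-' + ext[1:]  # remaining gaps after first base
            continue
        _, i = min(cand)                  # rarest pair, leftmost on ties
        ext = ext[:i + 1] + '-' + ext[i + 1:]
    return ext

A = pair_counts("ACACAATGGGGGCTCTACA")
assert (A["CA"], A["AA"], A["AT"]) == (3, 1, 1)
assert place_gaps("CAAT", A, 2) == "CA-A-T"  # Figures 8(b) and 8(c)
```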
The corresponding nucleotide pair in this case is AA in the extension CAAT, which is transformed into CA-AT (Figure 8). The algorithm pinpoints indels in 'the least preserved nucleotide pairs' because it assumes that they 'originate from codons whose mid-nucleotides were deleted' when working with coding DNA. This property does not restrict the application of our algorithm to coding DNA only. It is a general approach that can also be applied to junk DNA (tandem repeats and interspersed repeats), where, instead of codon mutations, mutations in non-coding triplets may be considered for accurate prediction of indels within satellite DNA or precise localization of Alu (LINE) repetitions. By applying our algorithm to eukaryotic chromosomes, we expect to retain Alu repetitions intact, due to their high occurrence rate, which can facilitate the identification of a newly evolved gene. On the other hand, Smith–Waterman will pinpoint indels in one or more Alu repetitions if that results in a better-scoring alignment.
Figure 8(d) shows the alignment of the samples a and b according to the proposed model. As can be noticed, all matching nucleotides are included in the solution. A shortened alignment, which only partially depicts the structural relationship between the samples, is obtained if Smith–Waterman is applied with a metrics that penalizes mismatches and gap insertions with -1 and awards +1 for each nucleotide match (Figure 9). For this metrics, the consistent hits ACAC and ACA are not detected (included in the alignment) (Figure 9). The overall alignment score would not be increased if the first consistent hit ACAC were included in the solution. On the other hand, the overall alignment score would be decreased by 1 if the third consistent hit ACA were included, since four gaps have to be inserted to align three matching nucleotides: A, C and A.
Obviously, when applying Smith–Waterman, the output depends on the selected metrics. As a consequence, some local similarities 'remain unidentified' ('excluded from the alignment'). For instance, although three matching fragments (ACAC, TGGGGG, ACA) can be included in the alignment of the samples a and b, Smith–Waterman detects only one matching fragment (TGGGGG).
Partial solutions derived from Smith–Waterman may imply incorrect identification, characterization and classification of DNA data, such as errors when predicting the disease predisposition of a specific gene or incorrect taxonomic classification. Unlike Smith–Waterman, the proposed algorithm maximizes the number of matching nucleotides per alignment, regardless of any specific metrics, and rather than inserting gaps at arbitrary positions, they are inserted in the least preserved nucleotide pairs. Due to these properties, a more accurate model of the relationship between spliced mRNAs of genes associated with trinucleotide repeat disorders can be derived than with Smith–Waterman. It is known that trinucleotide repeat disorders, such as Huntington's disease, [31,32] spinocerebellar ataxia [33] and dentatorubral-pallidoluysian atrophy, [34] are genetic disorders associated with excessive expansion of trinucleotide repeats in certain gene fragments. Usually, the number of uninterrupted trinucleotide repetitions determines whether an individual will be affected or not. For instance, Huntington's disease, which is associated with an excessive number of uninterrupted CAG repetitions in the huntingtin gene, causes a progressive decline of brain cells. An individual is affected if the number of uninterrupted repetitions exceeds 35. We found that indels of CAG repetitions in the spliced mRNA of the huntingtin polyglutamine (polyQ) tract are more specifically identified by the proposed algorithm than by Smith–Waterman.
Regardless of whether a non-glutamine cytosine is aligned to a glutamine or to a non-glutamine codon, the overall Smith–Waterman alignment score is not affected, and that is the only parameter that Smith–Waterman considers. The proposed algorithm does not separate the cytosine in hunt3 huntingtin partial cds from the ...CCTCAAGTCCTT matching fragment, because its length is increased by one if the non-glutamine cytosines are aligned (Figure 10(b)). Moreover, the nucleotide pair TC is considered unsplittable due to its frequent occurrence in hunt3 huntingtin partial cds (Figure 10(b)).
By knowing the number of uninterrupted glutamine repetitions in hunt1 huntingtin spliced mRNA, it is easy to find out whether the carrier of spliced mRNA, associated with hunt3 huntingtin, is affected by the disease or not.
In addition, we summarize the instructions by which our algorithm runs. Following these instructions, we developed a desktop application in C#, which was tested on an Acer Aspire 5570Z computer with a Genuine Intel CPU 2080 at 1.73 GHz and 2 GB of RAM. In total, nine tests were performed on very similar huntingtin partial cds (ENA IDs: EU797016–24.1), which were retrieved from the European Nucleotide Archive, [35] as well as a test performed on the dissimilar coding DNA of human tetratricopeptide repeat protein 25 (UniProt ID: Q96NG3) versus tetratricopeptide repeat protein 23 (UniProt ID: Q5W5X9). The same samples were also aligned using the European Molecular Biology Open Software Suite (EMBOSS) Water implementation of Smith–Waterman. [36] Both algorithms were compared in terms of computational performance and hit detection rate. Additionally, we analysed the number of detected glutamine deletions in polyglutamine tracts and the number of detected matching residues in tetratricopeptide repeats (TPR), which are considered to be similar, but different. A TPR is usually made of 34 amino acids and, in spite of the fact that the W4-L7-G8-Y11-A20-F24-A27-P32 consensus has been derived, all residues except the alanines at positions 20 and 27 (A20, A27) are likely to change.

Results and discussion
We obtained better computational results when we analysed the partial cds of the huntingtin gene than when we analysed the coding DNA of the rather dissimilar tetratricopeptide repeat proteins. Even for the tetratricopeptide repeat proteins, for which the percentage of identity is approximately 24%, our algorithm requires less memory than Huang's, but its time complexity approximates Smith–Waterman's, i.e. in both cases, six seconds were required for deriving a solution. In all cases, regardless of whether similar or dissimilar sequences were analysed, our algorithm detected more consistent hits than Smith–Waterman, which resulted in better detection of matching residues in the tetratricopeptide repeats. Moreover, our algorithm pinpointed indels in the polyglutamine tracts of the partial cds of the huntingtin gene more accurately than Smith–Waterman, due to the fact that it considers the relative frequency of mismatching DNA when it pinpoints an indel. In TPR4, our algorithm detected 14 matching amino acids (Table 6). It also detected between 11 and 12 matching residues more in TPR1, TPR2 and TPR3 than Smith–Waterman, and a total of four consensus elements were confirmed.
Even for sequences with a small percentage of identity, such as the human tetratricopeptide repeat proteins 23 and 25, our algorithm was able to detect more common hits than Smith–Waterman, regardless of any specific alignment metrics, which makes it suitable for a comprehensive analysis of the structural, functional and evolutionary relationship between coding DNA data.
The problem of identifying common hits between two DNA sequences has been widely explored and different solutions have been proposed. Index-based solutions, such as MUMmer, AVID and FLASH, can discover common hits faster than our algorithm, but they are expected to be worse in terms of memory complexity, due to the indexed data structures that they employ. For instance, MUMmer and AVID build a suffix tree, whose size is proportional to the sum of the sequences' lengths, and FLASH builds an index of all words found in both sequences, regardless of whether they match or not. There is also a preprocessing phase which precedes the actual hits' searching phase, which may be considered an additional computational cost.
Recent methods based on DNA hashing enable fast comparison of DNA words and efficient data storage. Instead of comparing the words themselves, their hash representations are mutually compared, and the cost of storing the hash of a DNA word is only 4 B (one integer). Sequence Search and Alignment by Hashing Algorithm (SSAHA) [37] is one of the hash-based implementations which can be applied for discovering common hits based on their hash representations. It assigns a unique number f(b_j) to each DNA nucleotide b_j. In general, indexing and hashing allow faster comparison of DNA words, but there is an additional memory cost for storing the word's index/hash. For instance, it would be faster to compare the hashes f(GCTAA) and f(GCTAG) than to perform a partial comparison based on a linear walk on GCTAA and GCTAG, because instead of comparing five characters (5 B), an integer would be compared with an integer, which requires a comparison of 4 B. The faster comparison, based on indexing/hashing, is only possible at the cost of an increased memory requirement, because the 8 B (two integers) that track the hashes of GCTAA and GCTAG would have to be stored in memory. There is also an additional computational cost for computing the hashes of the words, which is proportional to the length of the words that are indexed. Our algorithm is a compromise between the time and memory requirements, i.e.
it is expected to run slower than hash/index-based methods, but there is no cost for storing a DNA index/hash and no preprocessing phase that precedes the actual hits' searching phase. Theoretically, different sets of consistent hits can be defined if at least one of the common hits is found as a repeating sequence. For instance, if the trinucleotide GCT in the sample a : AGCTAG is extended into the GCTGCTGCT trinucleotide microsatellite (a : AGCTGCTGCTAG), then, instead of one, three different sets of consistent hits, which are equal in terms of matching content: {GCT : (2, 4, 2, 4), G : (12, 12, 5, 5)}, {GCT : (5, 7, 2, 4), G : (12, 12, 5, 5)} and {GCT : (8, 10, 2, 4), G : (12, 12, 5, 5)}, can be defined for the samples a : AGCTGCTGCTAG and b : TGCTG. So, which one of the three candidates does the algorithm report? Our algorithm works accurately for coding and non-coding DNA, due to the fact that the leftmost positioned consistent hits in minisatellite (microsatellite) DNA are always reported. For the samples a and b, the first set {GCT : (2, 4, 2, 4), G : (12, 12, 5, 5)} is the one which is reported, because GCT : (2, 4) is the leftmost positioned microsatellite triplet of all three (GCT : (2, 4), GCT : (5, 7) and GCT : (8, 10)) in the sample a : AGCTGCTGCTAG (Figure 13(a)). This approach enables accurate indel prediction, since the alignment is built starting from the leftmost positioned match in the set of consistent hits.
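As a minimal sketch (illustrative only, not the authors' implementation), the leftmost-occurrence rule can be reproduced by enumerating all occurrences of the repeated word and taking the first one; the positions match the GCT : (2, 4), (5, 7), (8, 10) candidates discussed for the sample a : AGCTGCTGCTAG:

```python
def occurrences(word, seq):
    """1-based (start, end) positions of every occurrence of word in seq,
    including overlapping ones."""
    hits = []
    i = seq.find(word)
    while i != -1:
        hits.append((i + 1, i + len(word)))
        i = seq.find(word, i + 1)
    return hits

a = "AGCTGCTGCTAG"
candidates = occurrences("GCT", a)
print(candidates)        # [(2, 4), (5, 7), (8, 10)]
# All three candidate hit sets are equal in matching content; the
# leftmost positioned occurrence is the one that gets reported:
print(min(candidates))   # (2, 4)
```

Reporting the leftmost candidate keeps the choice deterministic, which is what makes the subsequent indel placement reproducible.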

Conclusions
In the present study, a TMO algorithm, more accurate than the well-known and often used dynamic programming algorithms, is presented. Four crucial aspects outline the benefit of the proposed algorithm. First, our algorithm allows precise detection of missing glutamine-coding triplets in the polyglutamine tracts of the partial cds of the huntingtin gene, due to its predisposition to pinpoint indels in less frequently occurring nucleotide pairs. This property makes our algorithm suitable for accurate detection of Huntington's disease and other diseases associated with excessive trinucleotide repetitions, such as spinocerebellar ataxia and dentatorubral-pallidoluysian atrophy. Second, our algorithm runs faster than Smith–Waterman for very similar DNA, but not at the cost of generating partial or shortened alignments, which may happen in recent developments, such as FLASH, YASS, PatternHunter and other related approaches that usually cut the search space in order to improve time performance. It is also very important to note that the time improvement (the reduced number of comparisons) does not affect the completeness of the solution. Third, our algorithm requires less memory than the most memory efficient dynamic programming algorithm, Huang's algorithm, regardless of whether similar or dissimilar DNA is analysed. The memory complexity of the proposed algorithm is linear and proportional to the number of hits being detected. Fourth, our algorithm tends to maximize the number of matching nucleotides included per alignment, regardless of any specific alignment metrics, which makes it suitable for detecting complete homology between two DNA samples.

A 'common' hit for two DNA sequences is a character or word found in both sequences. Each hit that does not form a crossing diagonal with any other hit can be considered 'consistent'. To introduce the concept of consistent and inconsistent common hits, the following samples are considered: a = ACACAATGGGGGCTCTACA and b = ACACTGTTTGGGGGACA. If the first occurrence of ACA in a (b) is paired with the last occurrence of ACA in b (a), two inconsistent hits are obtained, due to the crossing diagonal that they form (Figure 1(a)). Consistent hits are formed by pairing the first-to-first and last-to-last occurrences of ACA in a and b (Figure 1(b)).
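To make the crossing-diagonal criterion concrete, here is a minimal sketch (a hypothetical helper, not the authors' code) over hit tuples of the form (p_a,s, p_a,f, p_b,s, p_b,f), using the ACA occurrences from the samples above:

```python
def consistent(h1, h2):
    """Hits given as (pa_s, pa_f, pb_s, pb_f) position tuples (1-based).

    Two hits are consistent when they appear in the same relative order
    in both sequences, i.e. their diagonals do not cross."""
    (a1s, a1f, b1s, b1f), (a2s, a2f, b2s, b2f) = h1, h2
    if a1f < a2s and b1f < b2s:   # h1 entirely before h2 in both a and b
        return True
    if a2f < a1s and b2f < b1s:   # h2 entirely before h1 in both a and b
        return True
    return False                  # overlapping or crossing diagonals

# a = ACACAATGGGGGCTCTACA, b = ACACTGTTTGGGGGACA.
# First-to-first and last-to-last pairing of ACA is consistent:
print(consistent((1, 3, 1, 3), (17, 19, 15, 17)))   # True
# First-to-last and last-to-first pairing forms crossing diagonals:
print(consistent((1, 3, 15, 17), (17, 19, 1, 3)))   # False
```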

Figure 1 .
Figure 1.Examples of inconsistent (a) and consistent (b) common hits.

Figure 2 .
Figure 2. Set of common and consistent hits.

Figure 3 . Figure 4 .
Figure 3. Searching for consistent hits: one consistent domain (a), one hit (two new consistent domains) (b), three hits (four new consistent domains) (c) and all nucleotides in consistent domains are compared (d). Note: arrows represent consistent domains; thickened and connected segments represent identified hits in each step; sequences (Seq) …

The hit ACA : (17, 19, 15, 17) in sample b needs to be shifted four positions (Figure 5(b)) if we want these hits to match perfectly within an alignment. Before localizing indels, two matrices [A]4×4 and [B]4×4 that track the number of occurrences of all nucleotide pairs XY, X, Y ∈ {A, C, T, G} in a and b are computed. With each row and each column in A and B marked with one of the letters A, C, T, G, which correspond to the four DNA nucleotides, A[X, Y] (B[X, Y]) represents the number of occurrences of the nucleotide pair XY in a (b). These matrices for the samples a = ACACAATGGGGGCTCTACA and b = ACACTGTTTGGGGGACA are shown in Figure 6. For instance, A[C, A] = 3 and B[C, A] = 2, because CA occurs three times in a and two times in b (Figure 6). … Two gaps can be inserted inside the AA mismatch (Figure 7(a)), between the first hit ACAC and AA (Figure 7(b)), as well as between AA and the second hit TGGGGG (Figure 7(c)). The CAAT-gapped modifications shown in Figure 7 are based on continuous insertion of gaps and they are three of the six possible two-indel transformations of the extension, since the gapped transformations C-A-AT, C-AA-T and CA-A-T of CAAT, which are based on interrupted localization of indels, are also possible.
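The 4×4 pair-occurrence matrices can be reproduced with a short sketch (illustrative code, not the authors' implementation); the counts below agree with A[C, A] = 3 and B[C, A] = 2 for the two samples:

```python
from itertools import product

NUCS = "ACTG"

def pair_counts(seq):
    """4x4 table tracking how often each (overlapping) nucleotide pair
    XY, X, Y in {A, C, T, G}, occurs in seq."""
    counts = {(x, y): 0 for x, y in product(NUCS, NUCS)}
    for i in range(len(seq) - 1):
        counts[(seq[i], seq[i + 1])] += 1
    return counts

A = pair_counts("ACACAATGGGGGCTCTACA")
B = pair_counts("ACACTGTTTGGGGGACA")
print(A[("C", "A")], B[("C", "A")])  # 3 2 -- CA occurs 3 times in a, 2 in b
print(A[("A", "A")], A[("A", "T")])  # 1 1 -- the rare pairs inside CAAT
```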

Figure 5 .
Figure 5. Indels (-) or DNA shifts enable perfect alignment of matching fragments: TGGGGG needs to be shifted two positions in sample a (a), ACA needs to be shifted four positions in sample b (b).

Figure 6 .
Figure 6. Occurrence of nucleotide pairs XY, X, Y ∈ {A, C, T, G} in a = ACACAATGGGGGCTCTACA and b = ACACTGTTTGGGGGACA.

Figure 7 .
Figure 7. Indels in extended mismatching fragments (marked with rectangles). Two gaps inserted: inside the AA mismatch (a), between the first hit ACAC and AA (b) and between AA and the second hit TGGGGG (c).
The first gap insertion transforms CAAT into (CA-AT) (Figure 8(b)). According to the previous discussion, the nucleotide pair AA, where the gap is inserted, A(AA) = 1, is removed from the list, leaving L = {A(CA) = 3, A(AT) = 1}. Once again the algorithm looks for the rarest occurring nucleotide pair in L = {A(CA) = 3, A(AT) = 1}, but now to identify the position where the second gap has to be inserted. The nucleotide pair AT is reported, since AT appears once and CA three times. The second gap insertion transforms (CA-AT) into (CA-A-T) (Figure 8(c)). Due to the absence of mismatching DNA content between TGGGGG and ACA in b, all four gaps are inserted next to the last nucleotide G of the second hit (TGGGGG) (Figure 8(d)).
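The rarest-pair gap placement can be sketched as follows (illustrative code under the assumption that the pair counts are read from matrix A of sample a = ACACAATGGGGGCTCTACA, as in Figure 6; not the authors' implementation):

```python
def place_gaps(extension, pair_count, n_gaps):
    """Insert n_gaps gaps into extension one at a time, always splitting
    the rarest-occurring nucleotide pair XY (per pair_count) into X-Y."""
    ext = list(extension)
    for _ in range(n_gaps):
        # candidate pairs: adjacent nucleotides (already-inserted gaps
        # are never split again)
        pairs = [(i, (ext[i], ext[i + 1]))
                 for i in range(len(ext) - 1)
                 if ext[i] != "-" and ext[i + 1] != "-"]
        # pick the pair with the smallest occurrence count in the sample
        i, _ = min(pairs, key=lambda p: pair_count[p[1]])
        ext.insert(i + 1, "-")
    return "".join(ext)

# counts taken from matrix A of a = ACACAATGGGGGCTCTACA (Figure 6)
counts = {("C", "A"): 3, ("A", "A"): 1, ("A", "T"): 1}
print(place_gaps("CAAT", counts, 2))  # CA-A-T
```

The first iteration splits AA (count 1), giving CA-AT; the second splits AT (count 1, versus 3 for CA), giving CA-A-T, matching Figure 8(b) and (c).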

Figure 8 .
Figure 8. Steps for generating a solution: (a) CAAT is an extension of the mismatching AA fragment and contains three nucleotide pairs: CA, AA and AT; (b) gapped modification of AA; (c) gapped modification of AT; and (d) four gaps are inserted following the last nucleotide G in TGGGGG.

Input: DNA sequence a, DNA sequence b
Output: Local alignment A
10. Partly comparing mismatching DNA words, IDENTIFY ALL CONSISTENT HITS.
20. REPRESENT each hit with the data tuple (p_a,s, p_a,f, p_b,s, p_b,f) in the memory.
30. SORT the set of tuples according to their appearance in a and b.
40. CALCULATE the number of indels between successive hits by applying Equation (1).
50. TRACK the number of occurrences of each nucleotide pair XY in a (b) in matrix [A]4×4 ([B]4×4).
60. EXTEND each mismatching fragment (where at least one indel is identified) with the last nucleotide from the preceding hit and the first nucleotide from the following hit.
70. FOR EACH nucleotide pair in the extension, read the number of occurrences from matrix A (B) and store it in list L.
80. REPEAT 90 UNTIL all gaps are inserted.
90. SEARCH L for the rarest occurring nucleotide pair XY and TRANSFORM XY into X-Y in the extension.
100. PRINT alignment A.
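Steps 20–30 can be illustrated with a small sketch (hypothetical hit values drawn from the examples in Figures 1 and 5; not the authors' code), which also shows the fixed 16 B memory footprint per hit regardless of the hit's length:

```python
import struct

# Each hit, regardless of its length, is tracked with four integers
# (p_a,s, p_a,f, p_b,s, p_b,f): 4 x 4 B = 16 B per hit.
hits = [(17, 19, 15, 17),   # ACA hit
        (1, 4, 1, 4),       # ACAC hit
        (7, 12, 9, 14)]     # TGGGGG hit
hits.sort()  # sort tuples by their appearance in a (and b), step 30

# pack each tuple as four 4-byte little-endian integers
packed = b"".join(struct.pack("<4i", *h) for h in hits)
print(len(packed) // len(hits))  # 16 bytes per hit
```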

Figure 11 .
Figure 11.Output of the application near polyQ tracts of Homo sapiens huntingtin hunt1 to Homo sapiens huntingtin hunt3 partial cds alignment.

Table 1 .
Compared words and nucleotides, until consistent hits in the samples a : AGCTAG and b : TGCTG are identified.

Table 6 .
Detailed analysis of matching residues in the TPR1–TPR4 repeats.

Table 5 .
Number of matching nucleotides per alignment.

Table 7 .
Summary of the proposed algorithm's key features.
Time complexity: Requires O(n × k × log2 m) time for very similar sequences, whereas there is no time improvement over Smith–Waterman for dissimilar DNA, i.e. O(n × m) time is required. It can be applied on medium-size DNA sequences (long genes, chromosomes).
Memory complexity: Requires less memory than Huang's algorithm. Fixed memory requirement per hit: 4 int = 16 B, regardless of its length.
Detection rate of matching nucleotides: Higher than Smith–Waterman, if it is based on metrics that assign high penalties. Suitable for the detection of complete homology between two DNA samples, regardless of any specific alignment metrics.