In silico analysis and a comparative genomics approach to predict pathogenic trehalase genes in the complete genome of Antarctica Shigella sp. PAMC28760

ABSTRACT Although four Shigella species (S. flexneri, S. sonnei, S. dysenteriae, and S. boydii) have been reported, S. sp. PAMC28760, an Antarctic isolate, is the only one whose complete genome is deposited in the NCBI database as an uncharacterized isolate. Because Antarctica is the world's driest, windiest, and coldest continent, it provides an unfavourable environment for microorganisms. Genomic sequences of the four Shigella species and our uncategorized Antarctic isolate Shigella sp. PAMC28760 were analysed computationally with the MP3 program (offline version) to predict whether trehalase-encoding genes are pathogenic or non-pathogenic. Additionally, we employed the RAST and Prokka (offline version) annotation programs to determine the locations of periplasmic (treA) and cytoplasmic (treF) trehalase genes in the studied genomes. The main objective of this study was to predict the prevalence of the two trehalase genes (treF and treA) in the complete genome of Shigella sp. PAMC28760 and in other complete genomes of Shigella species. Our results showed that only 56 out of 134 Shigella strains had both trehalase genes (treF and treA), and that the treF gene tends to be prevalent in Shigella species. Both treA and treF genes were present in our strain S. sp. PAMC28760. To date, this is the first study to show that two types of trehalase genes are present in Shigella species, which could offer insight into how the bacteria use accessible carbohydrates such as glucose produced by the trehalose degradation pathway, and into the importance of periplasmic trehalase in bacterial virulence.


Introduction
Shigella is a Gram-negative bacterium that is genetically related to Escherichia coli [1]. It is a non-motile, rod-shaped, non-spore-forming facultative anaerobe. Shigella species are among the common causes of diarrhoea worldwide, and Shigella infection is one of the top four infections among African and South Asian children [2]. Based on its serological features, the genus Shigella can be differentiated into four species: S. dysenteriae (serogroup A), S. flexneri (serogroup B), S. boydii (serogroup C), and S. sonnei (serogroup D). Shigella species have a highly immunogenic O-antigen made of many oligosaccharide unit (O) repeats with a wide range of sugar components, numbers of repeats, arrangements, and linkages. Each Shigella species can be further differentiated into several serotypes based on the O-antigen of its lipopolysaccharide layer: S. dysenteriae has 15 serotypes, S. flexneri has 6 serotypes with 15 subtypes, S. boydii has 18 serotypes, and S. sonnei has only 1 serotype [3][4][5]. Although serogroups A, B, and C are physiologically identical, S. sonnei is distinguished as a single serogroup D because of its positive beta-D-galactosidase and ornithine decarboxylase activity [6]. A previous study has reported that 60% of all Shigella infections worldwide are caused by S. flexneri. Thus, S. flexneri has been intensively studied, which has enhanced our understanding of Shigella pathophysiology and the underlying "host-pathogen" communication [7]. S. sp. PAMC28760 is a lichen-associated polar bacterium isolated from Antarctica. It has been deposited in the NCBI (National Center for Biotechnology Information) database (https://www.ncbi.nlm.nih.gov/) as an uncharacterized organism. Antarctica is a landmass covered with bare rock and with ice up to 13,000 feet thick, with small mosses and lichens being its primary vegetation [8].
Many microorganisms in such a harsh environment remain unknown; those that thrive there have developed specific adaptations to a wide range of extreme conditions [9]. Generally, Shigella species can grow in a temperature range from 6–8 °C to 45–47 °C [10]. However, temperatures of about 65 °C cause their rapid inactivation. Some Shigella species can survive for long durations when they are frozen at −20 °C or refrigerated at 4 °C [11,12]. Bacteria have developed a wide range of coping mechanisms to endure adverse conditions such as food deprivation, biochemical and biological changes, and extreme temperatures. Temperature is one of the most crucial factors influencing microbial protein expression. In a previous study, expression levels of outer membrane proteins were analysed using proteome profiles of S. flexneri cells grown at 37, 38.5, and 40 °C. The results suggested that pathogens might use the overexpression of specific proteins (18.4, 25.6, and 57.0 kDa) to govern the expression of virulence-related proteins when cells are exposed to higher temperatures [13]. Moreover, cold-adapted enzymes from organisms living in polar regions, deep oceans, and high altitudes have several benefits and have been increasingly analysed in recent years.
Trehalose is essential to organisms as a survival mechanism in stressful environments because of its unique physicochemical properties, which allow it to protect cell integrity against various kinds of environmental damage and nutritional limitation [14]. Trehalose and its derivatives have also been found to have crucial functions in the pathogenicity of a wide range of organisms, including bacteria (Gram-positive and Gram-negative) and plants [15]. Trehalose metabolism could therefore be employed as a target for novel pathogen-specific treatments. Trehalose is a disaccharide produced by various organisms. It can be degraded via several pathways. Among these pathways, the trehalose-6-phosphate pathway (TPP) is used by many bacteria to degrade trehalose. This pathway has been investigated under conditions of low osmolarity in both Gram-positive and Gram-negative bacteria [16,17]. E. coli K-12 has been reported to survive on trehalose as its sole carbon source and to use different pathways for its breakdown depending on osmolarity. At high osmolarity, external trehalose is hydrolysed by periplasmic trehalase (TreA), and the glucose PTS then transports the resulting glucose molecules into the cytoplasm [17,18]. During the transition between high and low osmolarity, a second trehalase, cytoplasmic trehalase (TreF), is active and removes the internal pool of trehalose as the cells adjust their metabolism to low osmolarity. The enzymatic activity of TreF is low enough not to interfere with trehalose biosynthesis during high osmolarity, but high enough to break down the accumulated trehalose during the return to normal conditions, when no more biosynthesis proceeds [19].
Several prokaryotes and eukaryotes can degrade trehalose to glucose through the enzyme trehalase [EC 3.2.1.28] [20,21]. It has been reported that E. coli has two trehalases, cytoplasmic trehalase (TreF) and periplasmic trehalase (TreA). The periplasm is a small space between the outer and inner membranes of Gram-negative bacteria. Of the E. coli trehalases, periplasmic TreA (Tre37A) has an extra C-terminal region, whereas TreF has an extended N-terminal region. Both enzymes are monomeric and share 47% similarity [22]. Neutral trehalase (L72) is a protein found in Klebsiella oxytoca that has been linked to several functions, including energy supply and stress protection [23]. A previous study provided experimental evidence that the periplasmic treA gene is needed for optimal development of type 1 fimbriae for cell invasion and colonization in the extraintestinal pathogenic E. coli (ExPEC) strain MT78 [24]. Similarly, Burkholderia pseudomallei has a single trehalase-encoding gene, similar to E. coli treA, that is involved in stress tolerance and in virulence in mouse and insect infection models [25]. Despite its tiny size, the periplasm contains many important proteins required for a variety of physiological activities and for bacterial survival under stress. Periplasmic proteins aid in the defence against different stresses, making it easier for bacteria such as S. Typhimurium to colonize the host [26]. However, there has been no complete analysis of the expression of many periplasmic proteins, especially periplasmic trehalase (TreA), in Shigella strains. The goal of this study was to determine the prevalence of the two trehalase genes (treF and treA) in 134 complete Shigella genomes, including lichen-associated S. sp. PAMC28760 isolated from Antarctica. Additionally, we sought to determine which trehalase gene (treF or treA) might contribute to virulence.
Analysis of pathogenic and non-pathogenic trehalases might provide a new direction for understanding bacterial pathogenic mechanisms at the genetic level and new insight for developing drugs to treat bacterial infections. Bioinformatics tools such as MP3 allow the study of virulence genes in the respective strains without the need to perform hazardous laboratory experiments.

Data sources
The complete genome and amino acid sequences of Shigella species were obtained from the NCBI database (https://www.ncbi.nlm.nih.gov/) [27]. A total of 134 Shigella strains deposited in NCBI by September 2021 were analysed, including our Antarctica isolate S. sp. PAMC28760, whose genome size was 4,558,287 bp [28].

Phylogenetic tree construction and average nucleotide identity (ANI) analysis
To compare the 16S rRNA sequence of S. sp. PAMC28760 with those in the other complete genomes of Shigella strains (133 strains), phylogenetic analysis was performed using the ClustalW alignment tool and Molecular Evolutionary Genetics Analysis (MEGA X) (https://www.megasoftware.net/) [29]. MEGA X was used to create the phylogenetic tree, which was built with the neighbour-joining method [30] and 1,000 bootstrap replications [31]. The online software Interactive Tree Of Life (iTOL) v6 (https://itol.embl.de/) was used to analyse phylogenetic trees [32]. The Orthologous Average Nucleotide Identity Software Tool (OAT) [33] was used to determine the average nucleotide identity (ANI) of 16S rRNA from closely related species acquired from EzBioCloud (www.ezbiocloud.net) [34]. EzBioCloud 16S rRNA sequence analysis was used to determine whether strain PAMC28760 belonged to Shigella or Escherichia. Published data on the characterized strain E. coli K-12 substrain MG1655 (NC_000913.3) were used to identify the cytoplasmic and periplasmic trehalases, which served as references for constructing a phylogenetic tree of the trehalase genes (treA and treF) in the studied strains that possess both genes. NCBI, RAST, and Prokka were used to find the cytoplasmic and periplasmic trehalase genes. MUSCLE [35,36] was used to align amino acid sequences, and maximum-likelihood and neighbour-joining methods were used to build the phylogenetic tree.

Comparative genomic analysis and prediction of periplasmic trehalase and cytoplasmic trehalase
The MP3 program (offline version) (http://metagenomics.iiserb.ac.in/mp3/index.php) [37] was used to determine the prevalence of trehalase genes in the studied genomes and to predict whether they are pathogenic or non-pathogenic. This program uses two modules, a Support Vector Machine (SVM) and a Hidden Markov Model (HMM), to predict pathogenic and non-pathogenic proteins in a genome. Furthermore, Rapid Annotations using Subsystems Technology (RAST, https://rast.nmpdr.org/rast.cgi) [38] and Prokka (Prokka 1.14.6, offline version) [39] were used to locate the predicted trehalase genes. CGView ServerBETA (www.cgview.ca) was used to visualize the locations of the predicted trehalase genes [40].

Phylogenetic tree analysis of S. sp. PAMC28760
Phylogenomic analysis revealed that S. sp. PAMC28760 and S. dysenteriae ATCC12037 belonged to the same branch (Figure 1a). The MEGA X program was used to construct the phylogenetic tree and analyse their evolutionary history using the neighbour-joining method [41] with 1,000 bootstrap replicates. Furthermore, the ANI value revealed that S. sp. PAMC28760 had a close relationship with strain S. flexneri ATCC29903(T) (Figure 1b). These results suggest that S. sp. PAMC28760 is closely related to Escherichia strains, as both belong to the same family, Enterobacteriaceae.

Trehalase gene and its phylogeny
When the complete genomes of 134 Shigella strains, including our strain PAMC28760, were studied, only 56 strains were found to have both types of trehalase gene (treF and treA). We employed the RAST annotation database and Prokka annotation to differentiate cytoplasmic (treF) and periplasmic (treA) trehalase. In addition, the CGView online server was used to visualize the predicted trehalase genes in S. sp. PAMC28760 (Figure 2). When we aligned them with the characterized trehalase genes (treF and treA) of E. coli K-12 substrain MG1655, S. sp. PAMC28760 was found to encode the same genes involved in trehalose degradation (Figure 3). Among S. flexneri strains, 48 had treF, 47 had treA, and 47 had both genes; among S. sonnei strains, 39 had treF, 2 had treA, and 2 had both. Of a total of 20 S. boydii strains, 18 had treF, 5 had treA, and 3 had both. Of a total of 25 S. dysenteriae strains, 12 had treF, 12 had treA, and 3 had both (Figure 4). S. sp. PAMC28760 had both trehalase genes, treF (cytoplasmic trehalase) and treA (periplasmic trehalase).
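The per-species counts above can be cross-checked against the reported total of 56 strains carrying both genes. The following is a minimal sketch that only re-uses the numbers quoted in this paragraph; the dictionary layout and function name are illustrative choices, not part of the study's pipeline. It also applies inclusion-exclusion to estimate how many strains in each group carry at least one trehalase gene.

```python
# Counts quoted in the text, as (treF, treA, both) per group.
# S. sp. PAMC28760 is a single strain carrying both genes.
counts = {
    "S. flexneri": (48, 47, 47),
    "S. sonnei": (39, 2, 2),
    "S. boydii": (18, 5, 3),
    "S. dysenteriae": (12, 12, 3),
    "S. sp. PAMC28760": (1, 1, 1),
}


def at_least_one(tref: int, trea: int, both: int) -> int:
    """Inclusion-exclusion: strains carrying treF, treA, or both."""
    return tref + trea - both


# Total number of strains carrying both genes across all groups.
total_both = sum(both for _, _, both in counts.values())

# Strains per group carrying at least one trehalase gene.
with_any = {group: at_least_one(*c) for group, c in counts.items()}

print(total_both)
print(with_any)
```

Under these counts the groups sum to the 56 strains reported, and every one of the 20 S. boydii strains carries at least one of the two genes.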
Phylogenetic tree analysis of the trehalase genes (treF and treA) together with those of the characterized E. coli K-12 substrain MG1655 revealed that treA of S. sp. PAMC28760 and treA of E. coli K-12 substrain MG1655 shared the same clade with 100% sequence identity, whereas treF of S. sp. PAMC28760 did not share the same clade as that of E. coli K-12 substrain MG1655, although the two shared 99.82% sequence identity (Figure 5). This shows that the trehalase genes (treA and treF) of S. sp. PAMC28760 could be distinctly divided into two major clades. The treA and treF genes from the studied genome clustered most closely with the corresponding genes of S. flexneri: treA clustered with S. flexneri FDAARGOS-74 and S. flexneri WW1, whereas treF clustered with S. flexneri 2016AM-0877 and S. flexneri 74-1170.
These results suggest that S. sp. PAMC28760 might have a trehalose degradation pathway like that of E. coli. It has been reported that TreA in E. coli is a trehalase found in the periplasm that hydrolyses trehalose to glucose under high osmolarity, whereas TreF is a cytoplasmic isoform of TreA that plays an important role in breaking down trehalose produced within bacterial cells under high osmolarity conditions [42,43]. Cytoplasmic trehalase (TreF) becomes active during the transition between high and low osmolarity, when it can deplete the internal trehalose pool as the cell metabolism shifts to a low osmolarity state. The enzymatic activity of TreF is low enough not to interfere with trehalose production under high osmolarity, but high enough to degrade the accumulated trehalose once the environment returns to normal [19].
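Sequence identity values such as those quoted above are computed from a pairwise alignment. The sketch below shows one common definition (identical residues divided by ungapped aligned columns); it is an assumption that this matches the definition used by the tools in this study, since programs differ in how they treat gaps. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over ungapped columns of two aligned sequences.

    Assumes both strings come from the same pairwise alignment, so they
    have equal length. Columns with a gap in either sequence are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gapped column: not counted in the denominator
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy alignment (not real trehalase sequences): 6 matches in 7 ungapped columns.
print(round(percent_identity("MKTA-LLV", "MKTAQLIV"), 2))
```

With this definition, a single mismatch in a few hundred aligned residues is enough to bring the identity just below 100%, which is the scale of difference reported for treF.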

Trehalose degradative pathway
Six trehalose degradation pathways (trehalose degradation I, II, III, IV, V, and VI) have been found in organisms, depending on their subcellular locations. These pathways have been reported in the MetaCyc pathway database [44] and are summarized in Figure 6. Depending on the organism, trehalose might enter cells via a permease, in which case it remains unmodified, or it is converted to trehalose 6-phosphate via a phosphotransferase system (PTS) (Figure 7). Prediction of the trehalose degradative pathway revealed that our Antarctic isolate S. sp. PAMC28760 has the trehalase gene. The result is summarized in Figures 2 and 6. The trehalase enzyme hydrolyses trehalose into two molecules of glucose, which the cell uses as a carbon source. Trehalases are classified into glycoside hydrolase (GH) families such as GH37, GH65, and GH15 in the CAZy (Carbohydrate-Active Enzyme) database (http://www.cazy.org/) [45]. The GH37 family contains only trehalase enzymes, whereas the GH65 and GH15 families contain other enzymes along with trehalases. In 2007, it was reported that Mycobacterium smegmatis and Mycobacterium tuberculosis possess a trehalase that belongs to the GH15 family [46].
Trehalases belonging to the GH37 family hydrolyse a molecule of α,α-trehalose into two molecules of glucose by inverting the anomeric orientation. GH37 trehalases have been found in different organisms, including bacteria, fungi, yeasts, plants, insects, and vertebrates [22]. GH families have been grouped into "clans" in the CAZy database, within which enzymes are regarded as having a common evolutionary origin. GH37 enzymes are assigned to clan GH-G, while GH65 and GH15 enzymes are assigned to clan GH-L. Clans GH-G and GH-L share only a low level of sequence homology, but this similarity is nonetheless significant. GH37 trehalases have two catalytic residues, Asp and Glu, in their catalytic domains (CDs). Asp and Glu residues also tend to be involved in the function of GH65 and GH15 trehalases. These amino acid residues are most likely involved in a common inverting mechanism during catalysis [47]. The structures of these trehalases comprise conserved regions (CRs), which include the catalytic residues. These CRs can form active sites that usually have loops. The CDs of GH37 enzymes contain the well-known trehalase signature motifs, motif 1 (PGGRFXEXY[G/Y]D[S/T]Y) and motif 2 (QWD[Y/F]P[N/Y][G/A]W[P/A]P), whereas GH65 and GH15 trehalases do not [48,49]. Our Antarctic isolate S. sp. PAMC28760 possesses a GH37 trehalase with both signature motifs (motifs 1 and 2) as well as the highly conserved regions (CR3-CR5), which have also been found in E. coli. Further analysis confirmed that S. sp. PAMC28760 possesses a trehalase enzyme that is a member of the GH37 CAZyme family (Figure 3).
Gram-positive bacteria such as Bacillus subtilis (non-pathogenic) and Clostridioides difficile (pathogenic) share a pathway in which exogenous trehalose is imported by a PTS and converted to glucose and glucose-6-phosphate by the phosphotrehalase TreA (analogous to the PTS-TreC system in pathogenic E. coli). Owing to the acquisition of an additional cluster of trehalose metabolism genes, namely a second PTS that mediates high-efficiency trehalose uptake from the environment, epidemic C. difficile strains can also grow on low concentrations of trehalose. By increasing toxin levels, both modified trehalose utilization systems contributed to the growth and toxicity of these epidemic C. difficile strains [49]. To date, there have been no previous papers on the function of the trehalose degradation pathway in virulence in Antarctic isolates. However, the presence of a trehalose metabolic pathway has been mentioned for Variovorax sp. PAMC28711 [50].
Figure 6. Trehalose degradative pathways. Six different trehalose degradative pathways are found in organisms (bacteria, fungi, yeast, Arthropoda, and plants). Among them, only two degradation pathways (trehalose degradation pathway II (cytosolic) and VI (periplasmic)) are found in Shigella species.
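A search for the two signature motifs in a protein sequence can be sketched with regular expressions. The motif strings below are a cleaned-up reading of the patterns quoted in this section and are an assumption; they should be checked against the cited references [48,49] before use. Here "X" is taken to mean any residue and "[A/B]" a choice between residues. The function names and the toy sequence are hypothetical.

```python
import re

# Assumed transcription of the GH37 signature motifs quoted in the text.
MOTIF_1 = "PGGRFXEXY[G/Y]D[S/T]Y"
MOTIF_2 = "QWD[Y/F]P[N/Y][G/A]W[P/A]P"


def motif_to_regex(motif: str) -> re.Pattern:
    """Convert the motif notation into a compiled regular expression.

    '[A/B]' becomes the character class '[AB]' and 'X' becomes '.',
    which matches any single residue.
    """
    pattern = motif.replace("/", "").replace("X", ".")
    return re.compile(pattern)


def find_motif(motif: str, sequence: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched text) for each non-overlapping hit."""
    regex = motif_to_regex(motif)
    return [(m.start(), m.group()) for m in regex.finditer(sequence.upper())]


# Toy sequence built to contain one instance of each motif (not a real protein).
toy = "AAAPGGRFREFYYDSYAAAQWDYPNGWAPAAA"
print(find_motif(MOTIF_1, toy))
print(find_motif(MOTIF_2, toy))
```

A sequence lacking either pattern simply yields an empty list, which is the behaviour one would expect for GH65 or GH15 trehalases according to the text.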

Prediction of pathogenic and non-pathogenic proteins
MP3 (standalone program) can predict the presence of pathogenic and non-pathogenic proteins in the complete genome of a microbe based on two models, SVM and HMM, and their hybrid (integrated SVM and HMM models). To predict pathogenic and non-pathogenic trehalases, we retrieved the complete genomes of 134 Shigella strains from the NCBI database, including our Antarctic isolate S. sp. PAMC28760. In strain S. sp. PAMC28760, 1,136 of 4,329 total proteins were predicted to be pathogenic by the SVM model (Table 1), and its periplasmic trehalase was predicted to be a pathogenic trehalase (data not shown). The MP3 tool can be used to compare the numbers of pathogenic proteins in healthy and infected samples by precisely identifying pathogenic protein fragments (based on amino acid composition and dipeptide composition) commonly found in metagenomic data without needing a time-consuming homology-based alignment [37]. In comparison with other publicly available bioinformatic tools, this program can predict pathogenic proteins with improved accuracy (95.06%), sensitivity (85.59%), and specificity (96.64%) as it employs both SVM and HMM models. It is also essential to analyse the complete genome sequences of pathogenic and non-pathogenic bacteria of closely related species to determine whether any significant genomic changes have occurred. It has been proposed that both pathogenic and non-pathogenic strains have virulence factors/genes and that they can be distinguished based on gene content. When other genes suppress the virulence factors/genes, the bacterium becomes non-pathogenic. However, when the suppressing genes are lost, a commensal can become pathogenic [51].
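The two feature types named above, amino acid composition and dipeptide composition, can be illustrated with a short sketch. This is not MP3's code; it only shows, under the usual textbook definitions, how a protein sequence is turned into a 20-value and a 400-value frequency vector of the kind an SVM could take as input. The function names and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence: str) -> list[float]:
    """Fraction of each standard residue, in AMINO_ACIDS order (20 values)."""
    residues = [c for c in sequence.upper() if c in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard residues")
    counts = Counter(residues)
    return [counts[a] / len(residues) for a in AMINO_ACIDS]


def dipeptide_composition(sequence: str) -> list[float]:
    """Fraction of each ordered residue pair (400 values).

    Non-standard characters are dropped first, so residues on either side
    of a dropped character are treated as adjacent (a simplification).
    """
    cleaned = "".join(c for c in sequence.upper() if c in AMINO_ACIDS)
    if len(cleaned) < 2:
        raise ValueError("need at least two standard residues")
    pairs = Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))
    total = len(cleaned) - 1
    return [pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]


toy = "MKTAYIAKQR"  # toy peptide, not a real trehalase fragment
aac = amino_acid_composition(toy)
dpc = dipeptide_composition(toy)
print(len(aac), len(dpc))
```

Because both vectors depend only on residue frequencies, they can be computed for short fragments without any alignment, which is consistent with the alignment-free use described in the text.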
In addition, the detection of transposon mutants of extraintestinal pathogenic E. coli (ExPEC) that are defective in binding to non-phagocytic cells was an unexpected finding pointing to a probable role of periplasmic trehalase (treA) in virulence [24]. Furthermore, while trehalase enzymes are known to have a role in the virulence of some fungal species, the occurrence of multiple enzymes can limit their potential as an antifungal drug target. Because the trehalose pathway and its enzymes are not found in mammals (including humans), fungi-specific inhibitors of the trehalose pathway and its enzymes should be generally non-toxic to mammals [52,53]. Likewise, a previous study has reported that inactivating trehalose biosynthesis pathways does not reduce resistance to oxidative stress in many bacteria, but a periplasmic trehalase gene (treA) mutant of Burkholderia pseudomallei shows increased sensitivity to oxidative stress despite elevated trehalose levels in the mutant, which would be expected to protect against this stress [25]. Another study reported that validamycin A was ineffective against Clostridioides difficile TreA, whereas trehalose derivatives such as epimers at the hydroxyl groups (2- and 4-positions) and thiotrehalose derivatives showed promise as TreA inhibitors with a broader spectrum. The efficacy of these compounds in treating specific bacterial infections is currently being studied [54]. It has also been reported that the PTS route for trehalose uptake (trehalose degradation I, low osmolarity) is inhibited when the osmolarity is high. Thus, trehalase (TreA) in the periplasm allows cells to utilize trehalose at high osmolarity by breaking it down into glucose molecules, which can subsequently be transported by the phosphotransferase system [55]. In this study, the genomes of Shigella strains were analysed for pathogenic and non-pathogenic trehalase genes for the first time.
Studying trehalase in a pathogenic bacterium such as Shigella could therefore be important for further studies. Trehalase (TreA) from the pathogenic extraintestinal E. coli strain MT78 has also been identified as a member of glycoside hydrolase family 37 (GH37). Similarly, deletion of these genes in the meningoencephalitis-causing yeast Cryptococcus neoformans resulted in severe defects in spore production, a decrease in spore germination, and an increase in the production of alternative developmental structures; spores are plausible infectious particles [56]. Trehalose does not solely play a role in osmoregulation. Lee et al. stated that if glucose is present in the cytoplasm, molecules like trehalose are produced at levels approaching 400 mM in the cytoplasm [57]. Glycine betaine and L-proline often accumulate in the cytoplasm (around 700 and 400 mM, respectively) and can replace trehalose [58]. Many species utilize these osmolytes, which appear to be well adapted to cellular functions. The electro-neutral solutes trehalose, glycine betaine, and L-proline, as well as potassium glutamate, have various chemical characteristics that may suit their functions in cell survival during osmotic shock.

Conclusions
Although there are many studies on trehalase, it has not previously been studied in Shigella species with respect to the two trehalase genes (treF and treA) and pathogenicity. Most strains of the Shigella species (S. flexneri, S. boydii, S. dysenteriae, and S. sonnei), as well as our strain S. sp. PAMC28760, have cytoplasmic trehalase, and all periplasmic trehalases predicted in the studied strains were identified as pathogenic proteins using the MP3, RAST, and Prokka tools. Notably, treF was detected in all strains of S. sonnei, but treA was identified in only two strains. This sort of research on pathogenic and non-pathogenic trehalases could help researchers to elucidate how and why Shigella species have certain traits. Furthermore, these bioinformatics tools are important for determining the nature of the proteins present in a complete bacterial genome before any wet-lab work is performed.