Network-based strategies in metabolomics data analysis and interpretation: from molecular networking to biological interpretation

ABSTRACT Introduction Metabolomics has become a crucial part of systems biology; however, data analysis is still often undertaken in a reductionist way focusing on changes in individual metabolites. Whilst such approaches indeed provide relevant insights into the metabolic phenotype of an organism, the intricate nature of metabolic relationships may be better explored when considering the whole system. Areas covered This review highlights multiple network strategies that can be applied for metabolomics data analysis from different perspectives including: association networks based on quantitative information, mass spectra similarity networks to assist metabolite annotation and biochemical networks for systematic data interpretation. We also highlight some relevant insights into metabolic organization obtained through the exploration of such approaches. Expert opinion Network based analysis is an established method that allows the identification of non-intuitive metabolic relationships as well as the identification of unknown compounds in mass spectrometry. Additionally, the representation of data from metabolomics within the context of metabolic networks is intuitive and allows for the use of statistical analysis that can better summarize relevant metabolic changes from a systematic perspective.


Introduction
The last few decades were marked by a steep increase in capacities to generate, store, share and process data. Such developments have driven a revolution in the ways we investigate complex systems such as social networks, cities, and organisms. In biology, this started with the earlier sequenced genomes at the turn of the century, closely followed by the boom of other 'omics' including trancriptomics, proteomics and metabolomics, consolidating the trends on moving from previous reductionist to holistic approaches in the investigation of biological systems. The structure of data generated by these approaches led to considerable changes in how we conceptualize and analyze these results, with particular attention driven toward the network structures arising from the integration of such rich datasets.
The concept of biological systems as networks is nothing new, in particular in relation to metabolism. In its most familiar representation, metabolism is pictured as a network of metabolites interconnected by enzymatic reactions catalyzing their interconversion, which define the flow of mass and energy through the system. Despite the familiarity of this representation of a metabolic network in the sense of metabolic pathways, networks are a much broader concept, useful for not only visualizing but also analyzing the global structure of any dataset that can be represented as nodes connected by a specific feature. Such data can be analyzed via several parameters in the field of graph theory and unravel mechanisms and structures beyond the direct interaction between neighboring nodes [1][2][3]. The combined analysis of several organisms including eukaryotes, bacteria and archea, for example, revealed a common feature of metabolism through its organization in a scale-free topology [4,5]. In these systems, most metabolites are involved in a low number of reactions whereas few metabolites, such as pyruvate, ATP/ADP, NAD(P)/NAD(P)H and coenzyme A, function as metabolic hubs participating in dozens of reactions and connecting multiple metabolic pathways [4,5]. However, there is considerable debate as to whether metabolism displays small-world topology or not, with strong arguments showing how these conclusion can arise from other common distributions due to analytical limitations or arbitrary choices in how to represent data [6]. Indeed, recent examples have shown situations in which the small-world topology is not observed [7,8]. Therefore, a priori assumptions regarding network topologies, frequently incorporated into analysis pipelines should be carefully considered within the specificities of the investigated dataset. Despite the possibility that these different interpretations may lead to alternatives in our understanding of metabolic regulation, the fact is that network topology is an important feature of metabolic networks that can contribute to a phenotype interpretation.
The advent of metabolomics and its capacity for providing information on several dozen to hundreds of metabolites in a single experiment has given access to such higher levels of complexity. Subsequently, network analysis has become instrumental in exploring the full potential of these experiments. Among many popular approaches, metabolomics data have been used to produce correlation networks and to highlight the structural relationships between analytes in untargeted metabolomics and for pathway enrichment analysis of known metabolites. In this review, we will focus on those network strategies that are built exclusively upon analyzing data coming from metabolomics experiments to assist data visualization, explore metabolite-metabolite relationships and improve metabolite annotation. That said it is important to highlight that there are several other network strategies relying on mathematical modeling and simulations, such as those for the determination of metabolic fluxes including kinetic flux profiling [9] and flux balance analysis [10]. Such strategies for metabolic flux estimation are based on models built upon the underlying structures of metabolic network and may include metabolite levels and isotope enrichment data as means to quantify changes in fluxes or as constrains to reduce the solution space in some of these models [11]. However, they will not be discussed here and the interested reader is referred to the recent extensive reviews regarding such approaches [12][13][14].

Metabolite correlation networks
Most metabolomics experiments are designed to determine the relative quantification of dozens to hundreds of metabolites in relation to general internal standards [15,16], or less frequently the absolute quantification of a considerably reduced number of metabolites [17]. Irrespective of the measurement type, common analyzes pipelines include direct application of descriptive statistics over individual metabolites comparing a treatment group against a control with common approaches such as Student's t-test and ANOVA ( Figure 1). Multivariate analysis such as PCA is also frequently applied to highlight the most relevant metabolites responsible for differentiating between the groups of individuals under study. Such analyses are invaluable to access the differences between individual metabolites, which are very often instrumental in explaining the phenotypes, and understanding the effects of environmental and genetic perturbations. However, they still look at discrete parts of the whole system, and often miss important information comprised within the complex metabolomics datasets. A great example of this limitation is the many experiments where the use of correlation networks rather than metabolite levels could unravel links between distant nodes (metabolites) representing complex interactions between different pathways and physiological functions. The correlation matrixes of S. tuberosum, for instance, show marked differences for leaves and tubers, that can partially be explained by the different roles of these organs as source and sink, respectively [18]. Even more remarkably, the differential analysis of network connectivity of a potato line exhibiting an apparently silent suppression of sucrose synthase isoform II expression reveled metabolic alterations in carbohydrate and amino acid metabolism that were not captured by differential analysis of the levels of the metabolites [19].
It is evident from several other studies that correlation analysis is a powerful strategy to identify systemic metabolic changes which are often undetectable when relying solely on metabolite levels [19,20]. However, for proper interpretation, it is also essential to understand the particular mechanisms that may underlie these correlations in the specific case of metabolic networks [21]. Correlation analysis finds widespread application in different fields of biology, with many of the strategies used for generating metabolic correlation networks originally developed for other sources of data. Transcriptomic data analysis, in particular, frequently uses correlation (coexpression) networks to draw conclusions, and the concept of guilt-by-association is widely accepted as a means to identify transcripts putatively associated with related processes [22,23]. Metabolic networks differ in the sense that its components are intermediates in a chain of processes depending on one another for their synthesis. Several well-known examples of cyclic pathways such as the TCA cycle and the Calvin-Benson cycle highlight the complex level of interdependency across metabolites. Moreover, metabolism is highly flexible exhibiting faster responses, at the milliseconds to seconds, which can be several orders of magnitude faster than typical transcriptional and translational responses [18,24]. That said, correlations in cellular metabolism may arise from multiple distinct factors including chemical equilibrium, mass conservation, asymmetric control distribution, and unusually high variance in the expression of single genes [25].

Establishing metabolites correlations
The most frequently used methods for establishing metabolites correlations are based on Pearson's correlation coefficient (or Spearman). Whilst easy to calculate, such networks capture both direct and indirect connections between metabolites. Considering the overwhelming complexity of metabolic networks, excluding such indirect connections can considerably facilitate data interpretation. The use of Gaussian graphical modeling to produce partial correlation networks has attracted growing attention in systems biology due to its capacity of differentiating the indirect associations and Article highlights • The shear amount of metabolites measured by metabolomics platforms carries information beyond single metabolite levels that can be accessed with the use of network analysis. • Different information extracted from metabolomics can be visualized as networks. • Correlation networks can reveal co-regulated metabolites that are not directly related as substrate/precursors in the same pathway. • Correlation networks can detect differences across samples that may not be detected by statistical tests on individual metabolite levels. • Mass spectral similarity networks have been shown to considerably extend the proportion of signals assigned to putative metabolites in untargeted metabolomics. • Molecular ion mass differences and to a lesser extent retention times can be used to establish putative substrate/product relationships including unknown signals in untargeted metabolomics. • Mapping metabolite levels to metabolic models are a helpful way of interpreting the metabolic phenotype of an organism at a deeper systemic level.
generating sparse networks better representing causal relationships [26][27][28]. This can be achieved by conditioning each pairwise correlation to all other variables. Therefore, the partial correlation between X and Y corresponds to the Pearson correlation between the residuals from regressing each X and Y to the remaining variables [29]. In contrast to observations using direct pairwise correlation networks [30], initial works applying such approach to metabolomics datasets have shown very promising results in a large human population cohort, with the high partial correlations of 151 metabolites in blood samples generally tracing back to known biochemical reactions [26]. A major challenge for the application of partial correlations on the development of biological networks from omics datasets in general, is the need for at least as many observations as variables, condition rarely satisfied for such experiments [28]. Common approaches to work around this limitation include the use of lower-order partial correlations, where the correlations are conditioned to a smaller subset of the remaining variables [27], as well as regularization approaches such as the graphical lasso [31]. An alternative approach for identifying an association between metabolites that is worth noting is mutual information (MI). MI is an association measure borrowed from information theory, routinely used across other functional genomics platforms, and often considered convenient for capturing non-linear relationships in contrast with traditional linear correlations [32]. However, a few studies suggest that MI-based approaches may underperform in comparison with correlation-based approaches [32,33]. Indeed, the work by Suarez-Diez and Saccenti [33] provides an interesting assessment of multiple methods for generating metabolite association networks as well as the effect of sample sizes and dimensionality. Their results suggesting that correlationbased methods perform better than mutual information. Moreover, Song, Langfelder [32] in the same line suggest that even for the case of detecting non-linear associations, arguably one of the most attractive advantages of MI, there are computationally simpler regression-based alternatives such as spline and polynomial regression that could be considered.

Tools and strategies for metabolite correlation analysis
Despite the particularities of metabolism in relation to other biological networks, mathematical procedures for the generation and differential analysis are interchangeable with those developed elsewhere. In fact, most tools for the investigation of biological correlation networks were originally developed and indeed are more frequently used with transcriptomics data (see Table 1 for a summary of all tools presented in this review) [34,35]. Several tools originally developed for transcriptomics have been later used or considered the inclusion of metabolomics data [34][35][36]. In summary, most methods are based on generating pairwise correlations for the whole dataset followed by some statistical comparison between different conditions.
Large correlation networks including all samples of an experiment may still prove useful for data interpretation, particularly when the overall difference between groups is not too extreme [37][38][39]. However, in most cases, there is a greater interest in the differential correlation across distinct conditions. Therefore, correlation analysis on replicates may be more suitable than combining different conditions [25]. It is important to highlight that sample size should be larger than many common metabolomics experiments in which this is often limited to numbers between three to six biological replicates. For the generation of correlation networks, it is often recommended a minimum of 10 biological replicates [25]; however, it is important to highlight that this has to be evaluated in a case by case basis and this recommendation may be an underestimation.
Several examples of differential correlation network analysis applied to metabolomics describe alternative methodologies for performing such analysis [35,36,40,41]. One of the most simple and flexible approaches that can be easily integrated with other analysis pipelines is via functions from the R programing language. Built in functions such as corr or those from packages like hmisc or psych, can be used to calculate correlation matrixes and perform basic statistical tests from which data can be filtered based on any preferred cutoffs [42]. Furthermore, tools such as Cytoscape can assist data visualization, while providing several other functionalities for network structure analysis [43]. In a recent work by Batushansky, Matsuzaki [7] the authors provide a concise example of differential correlation network analysis following this methodology, including a descriptive analysis of some of the main network topological features. This work highlights specific changes in BCAA metabolism in response to fasting that were not captured by individual metabolite comparison suggesting a possible disruption of BCAA metabolism from glycolysis [7]. Network topology analysis can further efficiently extract information from larger networks. A good example is the use of clustering algorithms to identify modules of correlated metabolites, followed by pathway enrichment within differential modules across experimental groups [44]. Still considering the R environment, there are a few packages focused on correlation network analysis that have been used for metabolomics and provide more sophisticated approaches, amongst some of the most popular are DiffCorr [36] and weighted correlation network analysis (WGCNA) [45]. DiffCorr was developed exclusively to analyze differential correlation in biological networks. It is based on the identification of the first principal component-based 'eigen-molecules' in the correlation networks, which corresponds to the representative correlation pattern of a module composed of highly correlated metabolites, followed by testing differential correlation between two groups based on Fisher's z-test [36]. DiffCorr can be applied to trancriptomics data but it has also been shown to work on metabolomics datasets. Its methodology was recently incorporated into metaX [46], a metabolomics data processing software providing a graphical user interface that may be useful for those not familiar with command line tools. WGCNA is a very well-established approach in transcriptomics data analysis. Its implementation includes the identification of co-expression modules and integration of other phenotypic data, while still allowing for the differential comparison between networks. Despite its initial focus on microarray data, a few more recent studies have shown the application of this strategy also for the analysis of metabolomics data alone [34].
Debiased Sparse Partial Correlation algorithm (DSPC) is a regularized approach that has been developed for handling high dimensional mass-spectrometry based metabolomics data [28]. It is built upon a de-sparsified graphical lasso modeling procedure with the assumption that the number of true connections among the metabolites is much smaller than the available sample size, making it particularly relevant for such large-scale metabolomics datasets. Moreover, the procedure has been conveniently integrated into the Java application CorrelationCalculator and can be easily exported to the MetScape app in Cytoscape for visualization. A recent example using DSPC for differential network enrichment analysis has shown promising results, demonstrating alterations in triacylglycerols and cardiolipins-phosphatidylethanolamines that precede the clinical outcome of end-stage kidney disease by several years [47].
One crucial step that must not be overlooked in the generation of correlation networks is the establishment of a correlation cutoff threshold. Most network analysis data interpretation relies upon the establishment of a somehow arbitrary threshold to select meaningful correlations. Early works investigating topological features of metabolite correlation networks have suggested that the number of connected components transitions from small to large at an absolute correlation of approximately 0.5 [44], whilst biological meaning has been associated to moderate correlations in the order of 0.6 < |r| < 0.8 [25]. Very few works have focused on trying to develop alternatives to avoid such arbitrary threshold selection. An interesting approach trying to tackle this subject has been developed using once again a modification over an algorithm originally developed to infer transcriptional regulatory networks from gene expression [35,48]. The socalled Probabilistic Context Likelihood of Relatedness (PCLR) uses a resampling strategy for inference of the correlations. In each of the k iterations, 70% of the samples are randomly selected for calculating the correlations. Then, context likelihood of relatedness is used to estimate possible associations with the threshold automatically selected to keep the highest 30% of interactions [41,49,50]. The result is a weighted network where edges represent a probabilistic measure of edge likeness. This is an interesting way of assessing the significance of associations in relation to the local background of each pair of metabolites instead of using the pair correlation value.

Molecular networking strategies for metabolite annotation
Metabolite annotation is one of the main bottlenecks for most untargeted metabolomics pipelines. The structure of nucleic acids and proteins, composed of discrete unities of a limited number of building blocks, namely nucleotides and amino acids, significantly facilitate the identification of such structures. Metabolites, on the other hand, occupy a much broader space of possible structures and therefore, metabolite identification is often a bewildering task even for an experienced analyst [51]. Much of the knowledge regarding metabolite identification has been painstakingly accumulated over the years through complex procedures involving their isolation followed by further spectroscopic characterization and/or the synthesis of proposed structures. These strategies are powerful but even assuming access to cutting edge technologies such as LC-SPE-NMR systems, it still represents a time consuming and labor-intensive endeavor, limiting its scaling to the throughput necessary for most current metabolomics experiments. For these reasons, chromatographic separation coupled to mass spectrometry has become the preferred technique for metabolomics, providing the best compromise between metabolome coverage, structural information and throughput [51]. Despite its limitations for de novo characterization, mass spectra and retention times still provide rich structural information, particularly useful for the identification of known compounds and assignment of putative metabolite classes and/or structures. Additionally, increasing accessibility to systems incorporating ion mobility spectrometry provides an additional dimension based on drift times that can be converted into collisional cross-sections [52,53]. This is a valuable complementary feature to conventional chromatography/mass spectrometry, however it will not be discussed in detail here since its integration within metabolomics data analysis pipelines is still in its infancy.
Considering all the interesting features of mass spectrometry as a metabolomics platform, extensive efforts have been directed toward improving its limitations for metabolite annotation. Molecular networking is a strategy that has shown outstanding potential to increase by several orders the number of signals assigned to a putative chemical structure in mass spectrometry-based metabolomics [54]. Molecular networking relies on the assumption that related structures yield similar fragmentation patterns to generate a network of spectra similarity. The strategy is conceptually simple and current tools make it easy to implement and interpret with different datasets. Modules within these networks represent putatively related metabolites that can be identified by the association with other known metabolites. The concept was originally developed for the screening of secondary metabolites produced by microbial colonies [55]. Following this initial application, it quickly became an integral part of most metabolomics workflows focused on metabolite annotation being successfully applied for the characterization of multiple microorganisms [56,57], plants [56,58] and human samples [59,60]. This initial approach is built upon a modification of the MS-Cluster algorithm [61] originally developed for proteomics, which is used to simplify the original data by building consensus spectra. Pairs of spectra converted into unit vectors in n-dimensional space are compared by dot product calculation, representing the cosine angle between the two vectors, to generate a vector of spectra similarities where 1 represents an identical spectra [62]. A cut off is then established (usually between 0.5 and 0.7) above which edges are set between the pair of consensus spectra. The assignment of modules including analytes of similar mass spectra by itself already represents an interesting information for data interpretation; however, the real power of molecular networking for metabolite annotation is still ultimately dependent on spectra matching to previously known MS 2 spectra in order to act as a 'seed spectra' for the propagation of putative identities based on the network. The 'Global Natural Products Social Molecular Networking' (GNPS) [63], represents a great community effort into providing an opensource platform for sharing mass spectral data and data analysis based on molecular networking, and had an instrumental role on the popularization of this approach. Following the popularization of this platform, significant efforts have also been directed toward its integration to popular data processing tools including MS-DIAL [64], mzMine2 [65], OpenMS [66] and XCMS [67], by incorporating into these tools functionalities for exporting data into GNPS. Since the end of 2019, GNPS also incorporates 'Feature-based Molecular Networking' (FBMN) [68]. This new feature integrates molecular networking through the GNPS platform directly with the aforementioned processing tools. Besides facilitating the analysis pipeline, it also allows for the incorporation of MS 1 information such as natural isotopic pattern, retention time and ion mobility, significantly improving over the original approach relying exclusively upon MS 2 information. Despite the increase in computational demand, the new approach has already proven useful to distinguish structural isomers that could not be resolved by the original algorithm.
Following the initial spectra matching and molecular networking performed by GNPS, the resulting networks provide identities of metabolites matched to the library and the association with similar spectra. Further metabolite annotation relies upon manual investigation of each MS/MS spectra and its connecting nodes individually. More recent efforts have focused on automating this part of the process by propagating the annotations using in silico fragmentation [69]. This Network Annotation Propagation (NAP) tool is also accessible through the GNPS webplatform and it works by using the in silico prediction tool MetFrag [70] in combination to a re-ranking strategy based on the resulting candidate structures using the MetFusion [71] score and network topology. Additionally, GNPS platform also has included the MS2LDA strategy for identification of shared structures in unknown compounds based on spectra fragmentation [72]. MS2LDA is a complementary approach to spectra similarity that uses text mining tools to extract common patterns of mass fragments and neutral losses from collections of fragmentation spectra. This tool can significantly enhance annotation by revealing the shared structural moieties across connected nodes in the molecular network.
Another interesting tool recently incorporated was the algorithmic learning for auto-deconvolution of GC-MS data to enable molecular networking within GNPS [73]. The initial molecular networking strategies were developed with a focus on LC-MS 2 data, which is considerably different from traditional GC-EI-MS data. In the former information regarding precursor ion, usually a molecular ion is linked to the MS 2 spectra. In the latter, only the highly fragmented MS 1 originating from the EI is available, which rarely includes the molecular ion. As an essential step for GC-MS data processing, the multiple signals from these highly fragmented spectra have to be deconvoluted into individual analyte mass spectra. Despite this seemingly disadvantage, GC-EI-MS exhibits considerably higher reproducibility within ionization and a higher resolution within the chromatography in comparison with any LC-MS technique currently available. For those reasons, GC-EI-MS is a much more well-established technique in metabolomics, with considerably more data publically available. Until recently, most GC-EI-MS data processing was performed by the myriad of software available incorporating local libraries, many of which freely available online, for spectra and retention index matching for the identification of compounds [74]. The new functionality introduced in GNPS uses a machinelearning algorithm based on unsupervised non-negative matrix factorization to automatically select parameters for the deconvolution of GC-EI-MS data returning a 'balance score' that can be used to access how consistent the spectra is across the dataset. The strategy adopted makes data processing computationally efficient and more accessible to less experienced users, while also offering the direct integration with molecular networking available through GNPS [73].
An alternative strategy that can be seen as complementary to molecular networking based on spectra similarity, is the use of mass shifts between fragments from different spectra, matching against the differences expected from relevant biochemical reactions. Identifying relevant mass shifts is an efficient way of interpreting small modifications over a common structure often observed across multiple classes of metabolites. This approach has been successfully applied in several instances such as the characterization of plant secondary metabolism in multiple plant species [75,76] as well as the metabolome of coral reefs [77]. Indeed, Hartmann, Petras [77] integrated molecular networking based on spectra similarity to the mass shift approach, which not only facilitated the interpretation of the metabolic changes within network modules, but also highlighted interesting modification specificities across different coral reef types of holobionts. Calculating a matrix of mass shifts and matching it against a reference list of biologically relevant shifts is relatively simple to implement. However, there are a few tools available that are also simple to use and integrate in the preferred pipeline while providing some extra functionalities. An interesting example is MetNet [78], an R package available through Bioconductor. It can be easily integrated with the XCMS/CAMERA [67,79] pipeline commonly used for data processing within this platform. MetNet creates an adjacency matrix based on the structural information by matching absolute mass differences to a userdefined list of biologically relevant differences. Additionally, it can also generate a consensus adjacency matrix by combining multiple statistical models that can be combined with the structural information.
Most mass spectrometry metabolomics experiments rely on the hyphenation with chromatographic techniques to reduce matrix effects and obtain additional information in the form of retention times. However, even the best-hyphenated systems based on GC-EI-MS, orbitrap and qTOF technologies need to compromise mass accuracy, ultimately essential for molecular formula determination, in detriment of the higher scan rates necessary to cope with the chromatographic separation. Forcisi et al. [80] developed an interesting strategy based on mass difference networks to combine datasets from highly accurate direct infusion electrospray ionization Fourier transform ion cyclotron resonance/mass spectrometry (DI-ESI-FTICR -MS) data and UHPLC-MS, taking advantage of both. Despite being much less popular than LC-MS, the resolution provided by FTICR/MS and the simplicity of DI-ESI analysis provide a significant contribution to the identification of unknown compounds within other more common approaches previously described.

Integrating metabolomics data into metabolic pathways
The organization of metabolites into metabolic pathways is perhaps the most intuitive representation of metabolism. Such networks try to capture and map the direct relationship between known substrates and products of individual reactions in metabolism providing a blueprint for the categorization of certain metabolites into specific pathways. The mapping of metabolomics data into such metabolic maps can provide valuable insights into the shifts in metabolic fluxes upon specific perturbations.

Enrichment analysis
There are several strategies to integrate metabolomics data in such context and extract useful information, most of them based on mapping metabolites to pre-defined metabolic models followed by investigating it through enrichment analysis. As for correlation networks, much of such pathway enrichment analysis has been originally developed through modifications over commonly used transcriptomic tools. Two common approaches include metabolite set enrichment analysis (MSEA) and over representation analysis (ORA). In short, MSEA uses quantitative data to analyze the changes in pre-defined groups of functionally related metabolites across conditions, rather than relying on individual metabolites [81][82][83]. ORA on the other hand looks at the enrichment of a set of metabolites selected based on any chosen criteria against a list of metabolites, ideally representing the metabolic space evaluated within the experiment [84].
The most basic approach for pathway analysis is simply mapping identified metabolites into some pre-defined pathway, usually obtained from reference metabolic pathway databases such as KEGG [85] or MetaCyc [86]. This approach presents the obvious limitation of requiring previous data processing; however, the analysis and interpretation are straightforward once all input metabolites are known. An infinity of tools extensively described in recent reviews are available for performing such analysis including different data input and statistical tests used for the comparison [41,84,87,88]. Some of these tools providing interesting functionalities to map metabolites to pre-defined metabolic models include PAPi [89], MetaboAnalyst [90], MetExplore [91], MetScape [28] and PathVisio [92].

Challenges for conventional pathway analysis
It is important to highlight here some fundamental limitations of pathway analysis arising both from technical and conceptual challenges faced when investigating metabolism. We will follow this discussion with some interesting strategies that try to circumvent these challenges. A first clear drawback arises from our incomplete coverage of the metabolome and the interactions between metabolites represented in pathway databases and metabolic models [93]. As it has already been discussed here, the vast majority of signals detected by current metabolomics platforms remain unidentified. Moreover, it is evident that there is still a large gap of knowledge regarding all the intricate connections within the metabolic networks of any organism [8]. As a consequence, much of the information contained in metabolomics datasets is simply not integrated by such approaches relying on predefined pathways. Another very important limitation arises from the definition of metabolic pathways itself. Metabolic pathways represent subnetworks from the complete metabolic network. Despite efforts in trying to formalize the definitions of metabolic pathways [94], these are somehow subjective and can still significantly vary across multiple databases [3,95]. Additionally, several metabolites are present in multiple pathways complicating even further the interpretation of results based on predefined pathways.

Identifying metabolic sub-networks based on metabolomics data
An interesting but complex alternative to circumvent the bias introduced by using pre-defined pathways is representing the complete metabolic network containing all known and predicted reactions for a target organism as a graph and directly extracting the sub-networks based on the available data [3,8]. This strategy helps capturing possible variants and combinations of the reference pathways as well as to detect novel pathways. Metabolic graphs are usually represented by compound graphs where all pairs of metabolites belonging to a reaction are connected by a single edge. This is less intuitive than its representation by hypergraphs or bipartite graphs since a single reaction may be represented by several edges [96]. However, by having nodes only representing metabolites it facilitates further computational steps necessary to extract the sub-networks. The approaches for such unbiased subnetwork extraction are far from trivial and many new challenges arise. Once metabolites of interest are mapped into the large metabolic networks there is an infinity of possible pathways to connect these nodes. A multitude of strategies has been developed with the goal of reducing the search space and extracting only the most likely biologically meaningful sub-networks. These strategies have been recently covered in detail [3]; therefore, we will just briefly discuss some of the main relevant concepts.
The most basic approach is to simply search for the shortest paths between two nodes [97]. Whilst simple, shortest path search is severely limited by the frequent presence of side compounds such as energy carriers and proton donors leading to the detection of many irrelevant pathways [98]. A great deal of effort in this field concentrates toward finding alternative ways of dealing with this issue. Removing such compounds may help to mitigate the issue and there are procedures to identify them based on network topology, however they introduce other drawbacks. Common assumptions to define side compounds such as the use of node degree in the network may also apply to other highly connected relevant metabolites such as pyruvate and acetyl-CoA [4]. Additionally, certain metabolites such as ATP that are often a side compound can act as a substrate/product in other specific pathways. Alternatively, it is possible to use the topological features to weight the nodes or edges and search for the 'lightest pathway,' hence avoiding the exclusion of side compounds [99].
Some more promising strategies that do not rely solely on network topology involve introducing extra biochemical information into the network based on the chemical structures of metabolites. Mapping atom transitions between metabolites, for instance, allows for selecting only those pathways in which transfer of carbons between pairs of nodes occur [100,101]. Tracking structural changes based on structural similarity metrics such as the Tanimoto coefficient, can also be used for minimizing structural differences across each step of the pathway [102,103]. Both strategies have proven to be more efficient than relying exclusively on network topology for identifying biologically meaningful pathways and dealing with side compounds, besides providing relevant information for further data interpretation. Several recent tools allow for the incorporation of atom mapping and molecular similarity into network analysis such as Reaction Decoder Tool [104], SimIndex, and SimZyme [102] just to mention a few.
Finally, all these strategies concern finding optimal paths between two nodes. Their application to real metabolomics datasets composed of dozens to hundreds of metabolites would involve calculating every possible pairwise paths and merging the results. However, there are several common situations such as branching points that may lead to the optimal paths between individual pairs differing from that considering all combined metabolites [3]. Once again, several alternatives have been suggested including uses of centrality measures, Steiner Tree computation [105] and metabolic stories [106]. In summary, when compared to other network approaches for metabolomics data analysis previously discussed, we observe a much greater variety of alternatives based on very different strategies for the graph-based extraction of pathways from metabolic networks. This kind of analysis is much less widespread than pathway-based enrichment analysis, mostly due to its complexity, lack of well-established protocols easily implemented by non-experts, and computational burden. Most of these alternatives are built around specific assumptions that handle better or worse some of the different drawbacks intrinsic to the process, often finding complementary application. Next, we will briefly discuss a couple of applications, that whilst being far from comprehensive may provide the reader with an overview of the possible outcomes of such analysis applied to real metabolomics datasets. MetaboRank implements a centrality-based approach constructed around Google's PageRank algorithm in the freely accessible web server MetExplore [91,103]. The output of MetaboRank returns a ranked list of metabolites suggested to be relevant to any particular pre-defined subnetwork, in a similar way to how social media platform suggests new friends based on one's current social network. As an example, the authors validate their methodology by incorporating new biologically relevant metabolites into a metabolic fingerprint developed to identify patients with hepatic encephalopathy (HE) disease [107]. An interesting feature of MetaboRank implementation is that it allows for the incorporation of structural information about the substrate-product transitions such as chemical similarities and atom transitions, as well as data from transcriptomics and proteomics. Metabolomic Modularity Analysis (MMA) is an approach that uses a community detection algorithm that identifies modules of reactions based on the relationships between selected metabolites [108]. The input metabolic network for MMA uses a less usual reaction-centric network where metabolites are represented by the edges instead of the nodes of the network. This different perspective is particularly interesting to observe the role of high-degree metabolites that in this arrangement are not restrictively assigned to one pathway. Another interesting feature is its hierarchical output which allows for the ready interpretation of the relationships between modules based on its structure.

Integration of annotation strategies and pathway mapping
A common limitation of both pathway enrichment analysis and sub-network identification approaches is that they are unable to handle metabolites absent from the metabolic models. As such, much of the new interesting developments in pathway analysis have tried to integrate additional information, particularly those based on metabolite annotation and association network strategies previously discussed here. Network-based annotation of metabolites fits particularly well to this goal, potentially providing tools to expand our current knowledge regarding pathway interactions based on the untargeted integration of metabolites into known pathways. Some of the most interesting pathway-related tools for untargeted metabolomics have incorporated the principle of going from untargeted mass spectrometry features directly to pathways [87]. The approach implemented by the computational algorithm Mummichog was one of the first to explore the modular structure of metabolic networks to validate annotations obtained through spectra matching [109]. Mummichog adopts the assumption that correct matches could be identified through the enrichment of specific classes of metabolites within each module. PIUMet works on a similar assumption with an algorithm that facilitates further integration of multiomics data [110]. The BioCAn [111] workflow incorporates metabolite annotation steps using five different annotation tools: Metfrag [70], CFM-ID [112], NIST17, Metlin [113], and HMDB [114]. Moreover, it is flexible with regard to the metabolic model used for matching identified metabolites. By default, it assembles a metabolic model using enzymes and gene orthology from KEGG based on the organism of interest. However, other published or manually curated models can also be used. In a similar way, metabolic reaction networkbased recursive algorithm (MetDNA) [115] initially builds a metabolic reaction network based on known pairs of substrate and products, which is further used to match seed spectra prior to recursive propagation of the annotations.
The aforementioned tools provide interesting strategies for expanding our knowledge on the direct interactions within known pathways and some newly identified compounds. However, they still depend upon seed metabolites being present within the metabolic model used as a reference for the propagation of annotations. Consequently, these strategies work well for primary metabolites included within pathway databases such as KEGG. Nevertheless, its applications for the integration of new metabolic pathways that are not highly integrated in such global metabolic models is limited. This is often the case of specialized metabolites in plants and microorganisms, which are highly specific and often not well represented in pathway databases. Therefore, another important step forward in pathway analysis is the capacity of deducing new metabolic interactions to expand pathways beyond what is currently available. A limitation of previously mentioned enrichment analysis, is the need of identifiers to group metabolites, which limits its application to extended networks including new metabolites. BiNChE is a useful webtool for enrichment analysis that uses ChEBI ontology to link ontology classes to biological entities, considerably expanding the chemical space covered in relation to traditional pathway databases [116]. In a similar line, the ClassyFire tool converts traditional molecular descriptors such as SMILES or InchiKeys into hierarchical chemical ontology terms [117]. In fact, the incorporation of such classification into annotation pipelines such as those used for molecular networking can already bring the results much closer to the topological structures of metabolic pathways, allowing its direct integration with enrichment analysis strategies. A recent pipeline, MolNetEnhancer, has been developed exactly with this goal, combining GNPS molecular networking tools to ClassyFire in a single pipeline [118]. Another interesting approach on this direction include ChemRich which uses MeSH annotations and Tanimoto indexes to define modules of related molecules independently from pathways, followed by enrichment test using Kolmogorov-Smirnov (KS) test [93].

Conclusion
Metabolomics data represents complex interaction networks rather than a collective of individual components. Therefore, trying to conceptualize data analysis from a network perspective has the potential to reveal meaningful information not captured through differential analysis of individual metabolites. Indeed, recent efforts in developing network approaches for the analysis of metabolomics data from different perspectives, including quantitative association networks, as well as structural and biochemical relationships, has shown promising results in expanding our knowledge of how these metabolites are intricately related. Finally, all individual strategies currently used for metabolomics data analysis suffer from some limitations. These arise from multiple factors including technical limitations of instrumental setups, incomplete coverage of the metabolome within both compound and pathway databases, and the complexity in defining metabolites relationships itself, amongst many other. The integration of multiple different approaches has shown great complementarity, and currently represent the best alternative for minimizing these individual limitations and achieving a better understanding of these complex datasets.

Expert opinion
The final goal of metabolomics is to describe an organism's metabolism at a global level. It is evident, however, that the results of such experiments are frequently evaluated using reductionist approaches where the analyst focus on selecting and identifying individual components associated with specific conditions. We demonstrated here, that while still a valid approach, this is often unable to capture changes that a more global view could detect. This has been observed since the earliest experiments using correlation networks in metabolomics, and since them, we achieved significant progresses in our understanding of how metabolic networks are organized. Still such global analysis of metabolomics data is usually restricted to the simplest assumptions such as that related metabolites should exhibit linear relationships, which is often not the case. The use of more sophisticated methods for data analysis capable of capturing non-linear relationships between the different components of these networks, may reveal interesting and unexpected features of these systems [119]. In addition, most current strategies are focus only on establishing pairwise associations between metabolites. Whilst developments such as the use of partial correlation to try capturing causal relationships are a great step forward, there is still space for exploration on strategies to infer directionality within these networks. Some works have suggested the adoption of complex time series and multi-condition experiments to increase the information content within the experiment and assign this directionality, but so far few studies have obtained promising insights from these approaches applied to typical metabolomics datasets [120,121].
One of the major hindrances limiting the interpretation of mass spectrometry-based metabolomics experiments is the limited fraction of signals assigned to a known metabolite [122]. Whilst unambiguous identification still largely depends on the laborious characterization of purified compounds the development of unified platforms for depositing mass spectral information are an essential part of the efforts to fulfill this gap. As the size of public spectral libraries, and the number of annotated spectra increase, network strategies such as those described here will increase even further their potential for rapid dereplication of known metabolites and facilitate the rapid characterization of unknowns. Moreover, the recent success of molecular networking strategies has fired an exciting search for new methods of establishing structural relationships between mass spectra features. The recent development of MS2LDA [72] is a great example of how alternative metrics can be complementary to the already well-established spectra similarity, and the popularization of machine learning algorithms is providing some promising results in this area [123,124].
Facilitating the integration of multiple complementary approaches into single platforms or the integration across different platforms is a strong tendency in metabolomics, which should facilitate the exploration and interpretation of data from different perspectives. Molecular networking, for instance, has been proven to be of great utility for metabolite annotation; however, little has been done in terms of analyzing the structure of the resulting metabolic networks. Moreover, the incorporation of extra layers of information within these networks have the potential to reveal unanticipated features, as it has been observed with the specificity of certain classes of compounds detectable by GC-MS in regard to frogs, cheese, and beer samples [73], revealed simply by the integration of multiple datasets and addition of metadata. The incorporation of FBMN into GNPS represents a great step forward in this sense, adding a layer of quantitative information in the networks. The incorporation of ion mobility information is also of great interest. This technique is still not often integrated within well-established metabolomics platforms, but it is quickly growing in popularity due to its complementarity to traditional LC-MS systems [125]. It will be exciting to follow developments as to how the addition of collisional cross-section (CCS) information into these databases and networks may contribute to the annotation of metabolite structures and what it may reveal from network structures.
Finally, visualization of metabolomics data within the context of known biochemical pathways provides a great and intuitive representation for data analysis. The organization of data in this way allows for the use of enrichment analysis that can be of great assistance in summarizing these typically extensive datasets into biologically relevant information. However, one of the great limitations of such approaches is its limitation toward known metabolites and pathways. It is exciting to see several recent attempts into incorporating metabolite annotation and pathway prediction tools in an attempt to extend the current knowledge from untargeted metabolomics data. Indeed, we can observe a tendency of merging several of the different approaches discussed here with the final goal of assembling putative metabolic networks incorporating biochemical relationships directly from untargeted metabolomics data. A great example is the recent released MolNetEnhancer, which combines molecular networking, annotation propagation, chemical class classification, and enrichment analysis in a single pipeline [118] Funding This paper was funded by the European Commission, Horizon 2020 Framework Programme, H2020 European Institute of Innovation and Technology, grant numbers: FPA No. 664620, No 739582, SGA-CSA No 664621.

Declaration of interest
The authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.

Reviewer disclosures
Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.