Topological representations of crystal structures: generation, analysis and implementation in the TopCryst system

ABSTRACT The main modern approaches to the topological representation of crystal structures of different chemical classes are overviewed. The problem of automated generation and analysis of such representations is discussed, and a new free web service is presented, which enables the user to describe topological features of crystal structures of any complexity and chemical composition in a fully automated mode. The service requires only an input file with crystallographic information in the standard CIF format and generates all reasonable representations of the structure by selecting structural units with rigorous algorithms. Then the representations are assigned to known topological types, and relations to other crystal structures that have the same architectures are established. The service is interfaced to a set of topological databases, which in turn have gateways to world-wide crystallographic databases. A number of examples of topological analysis of different classes of chemical structures are presented, and an outlook is given for further development and applications of the service for big data analysis and data mining in crystal chemistry and materials science.


Introduction
The atomic structures of crystalline solids have been determined by diffraction methods for more than a century. Although the diffraction experiment provides data on the distribution of electron density, thus keeping the crystal space continuity, this information is usually lost in public access. The retained structural information, which is now collected in a number of electronic world-wide databases, such as the Cambridge Structural Database (CSD) [1], Inorganic Crystal Structure Database (ICSD) [2], Crystallography Open Database (COD) [3] and Pearson's Crystal Data (PCD) [4], describes only the positions of the maxima of electron density (atoms) and the structure symmetry, thus bearing only geometrical properties of the structure. A chemist who analyzes this information has to restore the structure connectivity, i.e. the bonds between atoms. Since, besides the atom names, only geometrical information is available at this stage, the criteria for determining the bonds can also be only geometrical. One can distinguish two groups of such criteria: (i) distance criteria, which use interatomic distances or other parameters derived from them, such as atomic radii [5] or bond strengths [6], and (ii) polyhedron criteria, which rest upon Voronoi polyhedra [7]; the criteria from these groups can be combined. However, for a long time, there was no universal method for automated determination of atomic coordination numbers, and this problem hindered the application of machine methods to processing the crystallographic information. After determining the connectivity, the structural model can be transformed from a set of isolated atoms to a periodic net [8,9], which represents topological properties of a crystal structure.
In this century, the topological databases Reticular Chemistry Structure Resource (RCSR) [10], Euclidean Patterns in Non-Euclidean Tilings (EPINET) [11] and TOPOS Topological Databases (TTD) [12] were developed, which gather periodic nets of a particular connectivity. The occurrences of the topologies in crystal structures are the subject of the Topological Types Observed (TTO) collection [12]. The next problem concerned the recognition of structural units (molecules, ligands, clusters and tiles) and the analysis of their connection and the corresponding polymeric groups. Last, the topology of the periodic structural motifs should be determined and classified for establishing relations between crystal structures of different composition and complexity. All these problems taken together form a general challenge to modern crystallochemical analysis: how to effectively use the huge amount of crystallographic information stored in the electronic databases. In our program package ToposPro [13], we proposed solutions to the separate problems mentioned above, but there was no tool to perform the whole topological analysis in a fully automated way. In this paper, we present such a tool, which unites the ToposPro algorithms in an easy-to-use Internet service.

What is topological representation of crystal structure?
As was mentioned above, the initial crystallographic information is purely geometrical, and it should be supplied with the information on the structure connectivity for further crystallochemical analysis. Thus, one comes to the notion of topological representation of crystal structure.

General concept
Any model of a crystal structure where connections between atoms and complex structural groups are established can be treated as a topological representation. The set of atoms or structural groups considered as a whole forms a topological space T, on which a topology is defined as a family of pairwise sets (links) from T as well as all their unions and intersections. Such a topological space together with the defined topology can be visualized as a graph, which possesses translational symmetry equal to or higher than the symmetry of T, which is described by one of the space groups. This periodic graph is called a crystallographic or non-crystallographic net [14] depending on whether its symmetry group is isomorphic or non-isomorphic to a space group, and it describes a particular topological representation of the crystal structure. The most general representation includes all atoms of the crystal structure and all connections between them; we call it the complete representation since all other (partial) representations can be derived from it. Certainly, the rigorous notion of the complete representation is abstract; it is hardly possible to determine all interatomic links, as in general their number can be large and depends on the particular task. For example, one could be interested not only in direct interatomic interactions, even the weakest, but also in relations between distant atoms, which is important when analyzing atomic sublattices (e.g. cation arrays [15]). There are three basic topological operations for generating partial representations: (i) removing an atom, (ii) removing a link and (iii) contracting an atomic group to its centroid, which is equivalent to separating structural units and representing them by their centroids. All these operations result in a simpler representation; thus they are called simplifications. The net that is constructed for a particular representation and defines its topology is called the underlying net [16].
The structures that have the underlying nets of the same topology belong to the same isoreticular series.
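The contraction operation (iii) can be illustrated with a minimal Python sketch. It assumes a finite (non-periodic) fragment given as a list of bonds and a mapping from each atom to its structural unit; the function name `contract_to_units` and the toy fragment are hypothetical and are not taken from ToposPro.

```python
def contract_to_units(bonds, unit_of):
    """Contract each structural unit to a single node.

    bonds   : iterable of (atom_a, atom_b) pairs
    unit_of : dict mapping every atom label to the label of its structural unit
    Returns the set of links between different units (a finite 'underlying net').
    """
    net = set()
    for a, b in bonds:
        ua, ub = unit_of[a], unit_of[b]
        if ua != ub:  # bonds inside one unit disappear on contraction
            net.add(tuple(sorted((ua, ub))))
    return net


# Hypothetical toy fragment: two metal atoms bridged by a three-atom ligand "L".
bonds = [("M1", "O1"), ("O1", "C1"), ("C1", "O2"), ("O2", "M2")]
unit_of = {"M1": "M1", "M2": "M2", "O1": "L", "C1": "L", "O2": "L"}

print(sorted(contract_to_units(bonds, unit_of)))
```

In a periodic structure each link would additionally carry a translation label, which this sketch omits for brevity.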

How to determine structure connectivity?
To build the complete representation one should determine all links between atoms and classify the links in accordance with chemical reasons. As was mentioned above, interatomic distance is the primary geometrical descriptor for the analysis of chemical interactions in crystals. However, crystal chemists also use their intuition and experience to distinguish different kinds of bonds, to accept or ignore weak interactions and to select structural units. It would be extremely useful to develop a computer procedure that mimics this human reasoning and provides a structure model close to an ordinary crystallochemical representation. Recently [17], we proposed such a procedure, which was implemented into ToposPro as the Domains algorithm. This algorithm uses parameters of atomic Voronoi polyhedra in addition to interatomic distances and atomic radii to account for the whole environment of the atom when analyzing a particular interatomic contact. The whole set of the contacts determined by the Voronoi polyhedron is then clustered to separate bonds of different kinds, such as valence bonds, H bonds, specific or van der Waals interactions. As a result, atomic connectivity can be analyzed in any kind of crystal structure with the same set of options, which is important when processing big sets of diverse structural data. For example, the Voronoi polyhedron of a copper atom in the crystal structure of [Cu(acac)2] (acac = acetylacetonato) (ACACCU41) [18] is confined by 16 faces of different sizes (Table 1 and Figure 1); as a result, the Domains algorithm distinguishes valence bonds, van der Waals contacts, and very weak 'indirect' contacts, which do not correspond to any bonding.
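For comparison, the simplest distance criterion mentioned above can be sketched in a few lines of Python. This is not the Domains algorithm: it ignores Voronoi polyhedra and periodic images, and the radii table, the tolerance value and the atomic coordinates are illustrative assumptions only.

```python
import math

# Approximate covalent radii in angstroms; illustrative values for this sketch.
RADII = {"H": 0.31, "C": 0.76, "O": 0.66, "Cu": 1.32}


def find_bonds(atoms, tolerance=0.4):
    """Return bonded pairs using d(i, j) <= r_i + r_j + tolerance.

    atoms : list of (label, element, (x, y, z)) with Cartesian coordinates in angstroms
    """
    bonds = []
    for i in range(len(atoms)):
        label_i, elem_i, pos_i = atoms[i]
        for j in range(i + 1, len(atoms)):
            label_j, elem_j, pos_j = atoms[j]
            d = math.dist(pos_i, pos_j)
            if d <= RADII[elem_i] + RADII[elem_j] + tolerance:
                bonds.append((label_i, label_j, round(d, 3)))
    return bonds


# Hypothetical coordinates, not taken from any real structure.
atoms = [
    ("Cu1", "Cu", (0.0, 0.0, 0.0)),
    ("O1", "O", (1.92, 0.0, 0.0)),
    ("C1", "C", (2.6, 1.0, 0.0)),
]
for bond in find_bonds(atoms):
    print(bond)
```

A fixed tolerance of this kind is exactly what makes pure distance criteria hard to apply uniformly across chemical classes, which motivates the use of additional Voronoi polyhedron parameters described above.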

Representations of crystal structures
Any crystallochemical consideration of a crystal structure is a simplification because it is impossible and unreasonable to account for all interatomic interactions and sometimes even for all atoms in the structure. The corresponding topological representation depends on the nature of bonding in the crystal and on the crystallochemical task to be solved within this representation. Below, we consider typical topological representations for different classes of crystal structures.

Covalent crystals
In 3D covalent crystals, the structure framework is formed by strong valence bonds; all other (e.g. van der Waals) interactions are much weaker and usually should be ignored. All atoms of the framework are included into the representation, and the net coincides with the framework of valence-bonded atoms. Thus, a variety of natural and hypothetical 3D carbon allotropes can be perfectly discriminated by the topology of their atomic nets. These topologies are collected in the database SACADA [19] as well as in the ToposPro TTD Collection. If the crystal consists of low-dimensional (0D, 1D or 2D) structural units, two representations are possible: (i) representation of the low-dimensional structural unit, when only valence bonds are considered, and (ii) representation of the whole structure, when the contacts between the structural units are also taken into account. For example, the structure of the C8 carbon polymorph [20] can be described as the whole net with the pcb topology, but also as a network of 0D cubic units whose centers form a body-centered cubic (bcu) net (Figure 2). Another example is the selenium structure [21], which consists of simple chains 2C1, but if one takes into account both intrachain valence bonds and interchain van der Waals contacts, the resulting topology is primitive cubic (pcu), which is realized in many structures with strong bonds like α-Po or NaCl (Figure 3).

Intermetallic compounds
Metals, metal alloys and intermetallic compounds are similar to covalent crystals since metal atoms, being formally uncharged, are connected by bonds of one kind, metallic bonding. Thus, the main topological representation includes all atoms and all direct contacts between them. However, to establish correlations between intermetallic compounds of different stoichiometric composition and structure, one can select polyatomic structural units using the nanocluster approach [22]. In this approach, an intermetallic structure is represented as an assembly of multishell onion-like nanoclusters, whose centers are allocated in the most symmetrical positions of the structure. The nanoclusters have no common internal atoms but can share the atoms of their external shells. The underlying net consists of the nanocluster centers (atoms or centers of voids) and links between them, which correspond to the contacts between the outer-shell or shared atoms of the nanoclusters. This approach enables one to reveal simple topological motifs in intermetallic structures; the nanoclusters found in intermetallic compounds are gathered in the ToposPro Topological Types of Nanoclusters (TTN) collection [25].

Ionic inorganic compounds
In ionic crystals, there are two oppositely charged structural components, and this feature extends the set of possible topological representations. The most general representation is similar to the representation used for covalent crystals: all atoms are included into the net, and only the strongest (ionic or ion-covalent) bonds are considered as the net edges. If cations or anions are complex, a simplified representation can be built where the complex ions are represented by their centroids [26] (Figure 5). However, at least two more representations are viable: (i) the anionic packing, since many ionic inorganic structures are based on such a packing, and (ii) the cation array, which can support the general structural motif in many cases [27]. In these representations, the atoms of one component (cations or anions) are removed, and the underlying net is determined by establishing direct links between the atoms of the other component using the distance criterion or Voronoi partition. Thus, the Na3PS4 crystal structure can be considered as a packing of sodium cations or as a packing of centers of the PS4^3- anions, which coincide with the phosphorus atoms; in both cases, the topologies are well known and regular (Figure 5).

Coordination compounds
Structural units of coordination compounds naturally include complexing metal atoms and mono- or polyatomic molecular ligands. Such a representation is widespread and applicable to any coordination compound; that is why it is called 'standard' in ToposPro. However, if there are polynuclear complex groups, an alternative 'cluster' representation is viable, in which the nodes of the underlying net coincide with the centers of these groups, while edges mimic the links between them. To automatically recognize such groups and construct the 'cluster' representation, a rigorous topological criterion was proposed and implemented into ToposPro [28]. To apply this criterion, all the shortest atomic cycles that meet at each non-equivalent atom of the crystal structure are determined, and the bonds belonging to short or long cycles are classified as intra- or intercluster links, respectively. The cluster representation can be realized in two ways [28,29]. One such representation is obtained with intercluster bonds belonging to cycles of size higher than 12 and is formed by one type of node, where the tetranuclear cluster is extended with the dicarboxyldiphenylpyridine fragments of two DDPP ligands.
The local coordination of ligands can be described with the nomenclature proposed in [31]; the coordination modes of the ligands in coordination compounds are gathered in the ToposPro Topological Types of Ligands (TTL) collection [32].
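The idea of separating intra- and intercluster links by cycle size can be sketched as follows. This is a simplified illustration on a finite graph, not the criterion of [28]: it takes, for each bond, the smallest cycle passing through it and compares its size with a threshold. The function names, the threshold and the toy graph are assumptions.

```python
from collections import deque


def smallest_ring_through(adj, u, v):
    """Size of the shortest cycle containing the edge u-v, or None if there is none."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if x == u and y == v:
                continue  # do not walk along the edge under test
            if y not in dist:
                dist[y] = dist[x] + 1
                if y == v:
                    return dist[y] + 1  # path length plus the closing edge
                queue.append(y)
    return None


def classify_bonds(edges, max_intra_ring):
    """Label each bond 'intra' if it lies on a ring of size <= max_intra_ring, else 'inter'."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    labels = {}
    for a, b in edges:
        ring = smallest_ring_through(adj, a, b)
        labels[(a, b)] = "intra" if ring is not None and ring <= max_intra_ring else "inter"
    return labels


# Hypothetical toy graph: two triangles (A1-A2-A3 and B1-B2-B3) joined by two links.
edges = [
    ("A1", "A2"), ("A2", "A3"), ("A1", "A3"),
    ("B1", "B2"), ("B2", "B3"), ("B1", "B3"),
    ("A1", "B1"), ("A2", "B2"),
]
for bond, kind in classify_bonds(edges, max_intra_ring=3).items():
    print(bond, kind)
```

Varying the threshold yields a family of candidate cluster representations, which mirrors the way different ring sizes lead to different clusters in the text above.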

Molecular crystals
The most typical (i.e. 'standard' in the ToposPro terminology) representation of a molecular crystal consists of molecules connected by intermolecular interactions. Thus, the underlying net is formed by the molecular centroids linked by the edges, each of which mimics the whole variety of the interactions between a particular pair of molecules. However, if the intermolecular interactions can be differentiated, one can consider additional representations. For example, the crystals with hydrogen bonds can be represented as both molecular packings and networks of H-bonded molecular ensembles [33] (Figure 7).
The method of connection of molecules can be formalized in a nomenclature similar to that for ligands [34]. With this nomenclature, the connection modes of molecules in molecular crystals were gathered in the ToposPro Topological Types of Molecules (TTM) collection [32].

Porous structures
A special kind of topological representation is possible for the structures that contain voids, pores and channels. The free space in such structures can be described by a natural tiling whose elements, natural tiles, are polyhedral units, which fill the crystal space without gaps or intersections (Figure 8). The procedure for constructing natural tilings for nets of any complexity was formalized in a rigorous algorithm [35] and implemented into ToposPro. In particular, natural tilings were constructed for all the known zeolite frameworks [36] presented at the International Zeolite Association (IZA) website (http://www.iza-structure.org/databases/), and all natural tiles that occurred in these frameworks were included into the ToposPro Topological Types of Tiles (TTT) collection [36]. The underlying net in the tiling representation describes the system of cages and channels, thus elucidating the free space topology and the method of assembling porous frameworks from polyhedral units [37,38].

Automation of topological analysis with TopCryst
The concept of topological representation admits a mathematical formalization of the crystal structure description, which in turn enables one to automate crystallochemical analysis. ToposPro can essentially help in such analysis, but it assumes that the user has a good background in topological methods since the ToposPro applied programs contain a lot of options that should be correctly specified to obtain the required results. This is the reverse side of the ToposPro flexibility: the software offers many methods for crystal structure analysis but requires strictness from the user. Moreover, if the analysis consists of several steps, the user has to run the procedure at each step manually. This also requires a perfect understanding of all steps from the user and does not allow the whole analysis to be fully automated. The TopCryst service described below was designed to overcome this ToposPro shortcoming, to make the ToposPro tools available for a broad crystallochemical community and to take one more step toward automation of the analysis of crystal structures resting upon the information from a CIF file (Crystallographic Information File) (Figure 9 [39]). At present, the ToposPro TTD, TTO, TTL and TTT collections have interfaces with TopCryst.

TopCryst algorithms
TopCryst uses algorithms that we earlier implemented into ToposPro, but now they are united in the same analytical procedure, which requires no initial information from the user besides a standard CIF file. The entire procedure of topological analysis includes the following steps (Figure 10): (1) Reading the information from a CIF file provided by the user. The standard CIF parser from the Python Anaconda library is used, which is supplemented with a special procedure of the BONDS class that parses the topological data encoded with the new Topology CIF dictionary (https://www.iucr.org/resources/cif/dictionaries/cif_topology).
If the CIF file contains information on the structure connectivity encoded as a labeled quotient graph in accordance with the rules of the Topology CIF dictionary, the next step is skipped. This option is useful if the structure connectivity should be calculated with another algorithm or the topology of an already simplified underlying net is analyzed.
All steps (1)-(7) are performed in an automated mode and require no user intervention. With the obtained information, the user can then look for other structures that have the same structural units or the underlying net topology in the TTL, TTO or TTT collections. Note that the current TopCryst version does not cover all ToposPro features, but only those that can be generated in a fully automated mode and hence can be used by an amateur in topological analysis. An extended TopCryst version will include more procedures for building representations, in particular, the nanocluster and tiling approaches as well as recognizing interpenetration.
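To give a feeling for what the first step involves, the sketch below extracts simple single-line data items from CIF text using only the Python standard library. It is a deliberately minimal illustration and not the parser used by TopCryst: loops, multi-line text fields and the Topology CIF items are not handled, and the helper names and the example data block are hypothetical.

```python
import re


def read_simple_items(cif_text):
    """Collect '_tag value' pairs that sit on a single line of a CIF data block."""
    items = {}
    for line in cif_text.splitlines():
        line = line.strip()
        if line.startswith("_"):
            parts = line.split(None, 1)
            if len(parts) == 2:  # tags inside loop_ headers have no value on the line
                items[parts[0]] = parts[1].strip("'\"")
    return items


def to_float(value):
    """Convert a CIF number such as '5.6402(3)' to float, dropping the standard uncertainty."""
    return float(re.sub(r"\(\d+\)$", "", value))


# Hypothetical data block used only to exercise the functions.
example = """\
data_example
_cell_length_a 5.6402(3)
_cell_length_b 5.6402(3)
_cell_length_c 5.6402(3)
_cell_angle_alpha 90
_symmetry_space_group_name_H-M 'F m -3 m'
loop_
_atom_site_label
_atom_site_fract_x
Na1 0.0
"""

items = read_simple_items(example)
print(to_float(items["_cell_length_a"]), items["_symmetry_space_group_name_H-M"])
```

For real files a dedicated CIF library is the safer choice, since the CIF syntax has many cases that a line-based reader of this kind silently ignores.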

TopCryst tools and interfaces
The TopCryst user interface (front-end) is programmed in HTML, PHP, CSS and JavaScript in the form of a webpage available for free at https://topcryst.com/. Conventionally, it can be divided into two main parts: a service for determining the topology of a structure from a CIF file, and a search engine for underlying topologies (Search topology) and topological objects (Search topological objects and Search structure) in the TTD, TTL, TTO and TTT collections.
The interface of the service for determining the topology from a CIF file is designed as a window for uploading the file to be sent to the server through an AJAX request for further processing. At the same time, this window displays the current status of the file processing: 'file transfer,' 'position in the queue,' 'analysis' and 'result.' After the analysis is completed, the result is displayed on the web page in the form of a PDF report, a list of the resulting CIF files with the underlying nets to be downloaded by the user, and a list of the underlying net topologies with the links to the webpages, which contain the information on these topologies.
The Search topology interface is represented by a field for the input of the topology name in one of the available nomenclatures [32], a tool to download a CIF file with the net of this topology, links to the RCSR and EPINET databases if they contain such a net, a JSmol [41] net visualizer, and a list of representatives grouped by the representation types with the links to the CSD and ICSD records at the Cambridge Crystallographic Data Centre website (https://www.ccdc.cam.ac.uk/). The user can search by topology name, representative CSD Reference Code or ICSD Collection Code as well as the natural tile name.
The Search topological objects and Search structure interfaces enable the user to search for the webpages of natural tiles and ligands by their names, as well as for the webpages of individual structures by their CSD Reference Code or ICSD Collection Code. On these webpages, a JSmol visualization of natural tiles and ligands is available.
The server part of the service (back-end) is written in PHP and includes a queue system for processing user files (CIFs) and a topology search module. The queue system arranges the CIF files uploaded by the user depending on the availability of free processors and sends the files for calculation with the subsequent transfer to the module that determines the underlying net topology. The results of the analysis are then displayed on the webpage as a list of underlying topologies for all possible representations of the crystal structure (Figure 9). The output includes the name of the topology if it was found in the TTD collection; otherwise 'unknown topology' is indicated with the description of coordination numbers of all non-equivalent nodes (for example, 3^11,5,6-c means a net with eleven 3-coordinated, one 5-coordinated and one 6-coordinated nodes). The composition of structural units (secondary building units, SBUs) follows the underlying topology name. For the cluster representations, the size of cycles (RINGS) that separates intra- and intercluster links is indicated. The current TopCryst version considers cycles up to size eight, so the representations with larger intercluster cycles are not listed (such large cycles do not correspond to compact cluster groups).
The computational part of the service is written in Python using the external libraries os, sys, numpy, scipy, CifFile, itertools and re. It includes the main CrystNet module and 11 additional modules. Reading and parsing input files, as well as preparing output files, is handled by the read_write module. The constants module stores tabular data such as atomic numbers, symbols, radii, etc. The functions for calculating internal coordinates and transforming the coordinate system of a crystal structure are contained in the geometry module. The structure_data module is responsible for checking the completeness and correctness of the structural data received from the CIF file. The Polyhedron and ChemBond modules are used to build the Voronoi partition and to determine the structure connectivity. All symmetry transformations are handled using the symmetry module. The main functionality related to topological analysis of periodic nets is implemented in the structure and functions modules. Two additional modules, exception and tl180, ensure uninterrupted operation of the service.
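The notation used for unknown topologies can be decoded mechanically. The following Python sketch, with a hypothetical function name, turns a string such as 3^11,5,6-c into a mapping from coordination number to the number of non-equivalent nodes; it assumes the string follows exactly the pattern shown in the example above.

```python
def parse_node_notation(symbol):
    """Parse e.g. '3^11,5,6-c' into {coordination number: count of non-equivalent nodes}."""
    body = symbol.strip()
    if body.endswith("-c"):
        body = body[:-2]  # drop the '-c' (coordinated) suffix
    counts = {}
    for token in body.split(","):
        degree, _, multiplicity = token.partition("^")
        n = int(multiplicity) if multiplicity else 1  # no '^' means a single node
        counts[int(degree)] = counts.get(int(degree), 0) + n
    return counts


print(parse_node_notation("3^11,5,6-c"))
```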

What tasks can be solved with TopCryst?
The current TopCryst version is intended to perform the standard topological analysis, which includes the determination of the local atomic connectivity (coordination numbers of atoms), recognition of structural units and determination of the method of their connection as well as the overall topology of the corresponding underlying net. The results of such analysis can be briefly expressed by the following phrase: 'The crystal structure X is composed of the structural units Y, which are connected into the Z topological motif,' and contain the minimal dataset called the XYZ block. However, in most cases, additional information is provided depending on the crystal structure features: (1) all representations that can be built based on the chemical and topological structure with their description and classification (see part 2.3); (2) all other structures that have the same underlying net, coordination mode of a ligand, or topological type of a tile.
Besides the ordinary description of a crystal structure, these data enable one to solve typical tasks of crystallochemical analysis, which are considered below.

Determination of the underlying topology
In most cases, the determination of the overall topology is not a trivial task. Visual analysis can result in errors because of the structure complexity and/or geometrical distortions of the atomic network. If the degrees of the underlying net nodes are higher than six, reliable visual determination is possible only in special cases when the underlying net has no more than two or three topologically different nodes, high space-group symmetry, or a well-established isostructural analog. Thus, the following problematic cases can take place:
(1) Different underlying nets have very similar local topologies. Since the visual analysis always rests upon the consideration of a finite part of the crystal structure, such similarity can result in a wrong assignment of the overall topology. For example, the und and unc nets look very similar (Figure 11 [42,43]), but differ from each other starting from the fourth coordination sphere of the nodes.
(2) The underlying net is too complicated for the visual analysis. Thus, we will not be able to visually determine the topology of the structure [Ce 6 (OH) 4 (Figure 12).
(3) The underlying net is highly distorted. For example, the complex shape of an optically active molecule of myrotheciumone A essentially hinders visual analysis of the packing motif, which, however, belongs to a quite common topological type of hexagonal close packing (hcp) [45] (Figure 13).
(4) Different underlying nets are geometrically close. In this very rare case, the structures are geometrically similar and hence have the same local topological parameters, but nonetheless their overall topologies are different. For example, β- and γ-HgSeO3 have the same space group (P2_1/c), close unit cell dimensions, equal numbers of non-equivalent atoms and the same coordination numbers of the corresponding atoms, which allows one to formally relate these phases to the same structure type. However, the corresponding underlying nets have different topological indices and hence different topologies [46,47].
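One way to see why local inspection can fail is to count the nodes in successive coordination spheres (the coordination sequence), which is a commonly used topological index; whether TopCryst relies on exactly this index is not stated here. The Python sketch below computes it by breadth-first search on a labeled quotient graph, where each edge carries the translation to the neighboring cell. The function name and the edge encoding are assumptions; the pcu and dia quotient graphs are the usual primitive-cell descriptions of these nets.

```python
def coordination_sequence(edges, start, shells):
    """Number of nodes in each of the first `shells` coordination spheres of `start`.

    edges : list of (u, v, (dx, dy, dz)); node u in cell (0, 0, 0) is linked
            to node v in the cell shifted by (dx, dy, dz)
    """
    adj = {}
    for u, v, (dx, dy, dz) in edges:
        adj.setdefault(u, []).append((v, (dx, dy, dz)))
        adj.setdefault(v, []).append((u, (-dx, -dy, -dz)))  # reverse direction
    origin = (start, (0, 0, 0))
    seen = {origin}
    frontier = [origin]
    sequence = []
    for _ in range(shells):
        next_frontier = []
        for node, (x, y, z) in frontier:
            for neighbor, (dx, dy, dz) in adj[node]:
                key = (neighbor, (x + dx, y + dy, z + dz))
                if key not in seen:
                    seen.add(key)
                    next_frontier.append(key)
        sequence.append(len(next_frontier))
        frontier = next_frontier
    return sequence


# Primitive cubic net: one node linked to its translated copies along three axes.
pcu = [("A", "A", (1, 0, 0)), ("A", "A", (0, 1, 0)), ("A", "A", (0, 0, 1))]
# Diamond net: two nodes per primitive cell joined by four edges.
dia = [("A", "B", (0, 0, 0)), ("A", "B", (-1, 0, 0)),
       ("A", "B", (0, -1, 0)), ("A", "B", (0, 0, -1))]

print("pcu", coordination_sequence(pcu, "A", 4))
print("dia", coordination_sequence(dia, "A", 4))
```

Two nets whose sequences coincide for the first three spheres and diverge at the fourth cannot be told apart by looking at a small fragment, which is the situation described in case (1).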

Relations between structures of different composition and nature
Underlying topology is an important criterion for establishing relations between geometrically and chemically different crystal structures. TopCryst enables the user to find all topological analogs for a particular underlying topology i.e. to immediately relate the compound A under consideration to the corresponding isoreticular series. As a result, the following questions can be answered: (1) Is this topology unknown? In this case, the architecture of A is unique, and hence it can possess some special properties. (2) How numerous is the isoreticular series if the topology has representatives among other compounds? If the topology is rare, the chemical composition of A is very likely unique that could again result in unusual properties. On the contrary, if the isoreticular series contains many representatives, the user can find and explore chemically similar compounds to uncover similar properties in A. (3) Does this topology occur only in a particular class of compounds or in different classes?
The answer to this question determines whether the topology is specific and formed thanks to a special combination of the structural units, or it is common and does not depend on the chemical nature of the compound. For example, currently, the TopCryst database contains 39,794 references to crystal structures with the dia topology, and these compounds belong to all chemical classes distinguished in the database. This means that the dia topology is caused by factors that are common to quite different compounds and is not determined by their chemical composition.
(4) What structural units and what coordination are needed to build this topology? There are clear relations between the local coordination of structural units and the overall topology of the underlying net [47]. The information on the underlying net topology together with coordination modes of ligands enables one to find relations for a given compound or a class of compounds.
Materials Science. His research interests concern geometrical and topological methods in materials science and crystal chemistry and their computer implementation. He has been the main developer of the program package ToposPro since 1989. He invented many original algorithms for analyzing and classifying crystal structures, searching for correlations in crystallographic data and predicting new crystalline materials. Now he works on the development of knowledge databases and artificial intelligence systems in materials science.