DL_POLY - A performance overview analysing, understanding and exploiting available HPC technology

ABSTRACT This paper considers the performance attributes of the molecular simulation code, DL_POLY, as measured and analysed over the past two decades. Following a brief overview of HPC technology, and the performance improvements over that period, we define the benchmark cases – for both DL_POLY Classic and DL_POLY 3 & 4 – used in generating a broad overview of performance across well over 100 HPC systems, from the Cray T3E/1200 to today’s Intel Skylake clusters and accelerator technology offerings from Intel’s Xeon Phi co-processor family. Consideration is given to the tools that have proved helpful in analysing the code’s performance. With a more rigorous analysis of performance on recent systems, we discuss the optimum choice of processor and interconnect, and present power measurements when running the code, comparing these to measurements for other community codes.


Introduction
Molecular dynamics (MD) methods are now well established as a vehicle for determining the equilibrium and transport properties of a classical many-body system, with the associated techniques extensively used to simulate, for example, complex biological systems in areas such as disease, vaccine and drug research, placing increased demands on high performance computing (HPC) resources. A wide range of molecular dynamics codes is used by the community in pursuing these research goals, in particular NAMD [1], AMBER [2], CHARMM [3], GROMACS [4,5], LAMMPS [6] and DL_POLY [7], all of them leading packages. One of the key challenges facing these codes is that of exploiting the full power of available HPC platforms on a given scientific problem through effective utilisation of all available resources: CPUs, memory and, in many cases, I/O. This paper considers the performance attributes of one of these codes, DL_POLY, as measured and analysed over the past two decades, and seeks to rationalise that performance against the evolution of HPC systems over the same period.
Before considering the performance of DL_POLY itself, we provide in section 3 a variety of background material that will enable us to comment on and rationalise the measured performance of the code. First, we provide a brief overview of the evolution of HPC technology (processors, memory and HPC interconnects) and the performance improvements over a period in which Intel Xeon processors have increasingly dominated the HPC landscape (94.2% of the systems in the November 2017 edition of the Top 500 list [20] use Intel processors). In so doing, we consider the deployment of multiple generations of Intel architectures and the associated Xeon chips, highlighting the potential performance enhancements and the key characteristics of any application seeking to capitalise on those enhancements. We illustrate these requirements through a number of synthetic benchmarks. Consideration is given to the tools that have proved helpful in analysing the code's performance, notably IPM (Integrated Performance Monitoring [21][22][23]) and Allinea's Performance Reports [24][25][26]. IPM is a profiling tool that helps analyse MPI programmes through a lightweight profiling interface with very low overhead. Allinea Performance Reports provides a mechanism to characterise and understand the performance of HPC application runs by analysing how the application's wallclock time was spent, broken down into CPU, MPI and I/O [26]. We provide examples using the DL_POLY benchmark cases that give a clear framework for assessing the performance of the code as a function of computer system.
Having provided insight into the performance attributes of DL_POLY in section 3, we turn in section 4 to consider the measured performance of both DL_POLY Classic and DL_POLY 3/4. A comparison of these performance improvements against the peak achievable provides a clear understanding of the limitations of the present software and suggests a strategy for improving the code's performance.
We continue in section 5 to reflect on the performance of DL_POLY on the most recent HPC systems, including five Skylake Xeon-based clusters [27] and accelerator technology offerings from Intel's Xeon Phi co-processor family [28,29]. To complete this performance overview, we move from processors to the interconnect, the fabric to be used for an HPC cluster network. We consider the performance of DL_POLY as a function of both hardware, comparing the code's scaling on Mellanox EDR [30] and Intel's Omnipath (OPA) [31], and software, assessing the impact of Intel MPI [32,33] and Mellanox's HPC-X Toolkit [34]. Finally, in discussing the optimum choice of processor and interconnect, we present power measurements when running the code, comparing these to measurements for other community codes.

DL_POLY - Background and benchmark cases
DL_POLY [7] is a package of subroutines, programmes and data files, designed to facilitate molecular dynamics simulations of macromolecules, polymers, ionic systems and solutions on a distributed memory parallel computer. It is available in two forms: DL_POLY Classic [DL_POLY_2] [9,10] and DL_POLY 3/4 [14,15]. Both versions were originally written on behalf of CCP5, the UK's Collaborative Computational Project on Molecular Simulation, which has been in existence since 1980 [35]. The two forms of DL_POLY differ primarily in their method of exploiting parallelism.
DL_POLY Classic uses a Replicated Data (RD) strategy [11,36-38] that works well for simulations of up to 30,000 atoms on up to 100 processors. DL_POLY 3/4 is the more recent version (first released in 2001) and, being based on the Domain Decomposition (DD) strategy [14,15,35,36,39,40], is best suited to large molecular simulations of 10^3 to 10^9 atoms on large processor counts. The two packages are reasonably compatible, so that it is possible to scale up from a DL_POLY Classic to a DL_POLY 4 simulation with little effort.
Computationally, the DL_POLY 3/4 code is characterised by a series of time-step calculations involving exchanges of short-range forces between particles and long-range forces between domains using three-dimensional fast Fourier transforms (3D FFTs). DL_POLY 3 uses a domain decomposed algorithm based on the link cell method [39][40][41]. Possibly the most novel feature of the code is the implementation of the smooth particle mesh Ewald algorithm [42] for the evaluation of the Fourier component of the Ewald energy and associated quantities. Again, this requires a 3D FFT at each time step. While parallel implementations of 3D FFTs do exist (e.g. Fastest Fourier Transform in the West, FFTW [43]), they require a different data distribution from that employed in the link cell algorithm, which would entail an expensive data redistribution. Instead, a 3D FFT has been developed [17,44,45] that maps directly onto the distribution used by DL_POLY 3. In addition to avoiding data redistribution, this method has the attractive feature of using fewer, longer messages than those used by FFTW and similar packages, and is thus less sensitive to latency. However, it does require more data to be sent. The reader interested in details of the algorithmic implementations and the various differences between the replicated and distributed data versions of the algorithms implemented in different versions of DL_POLY should consult references [11-15,17,38,44,45], a full review here being beyond the scope of this article.
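The link cell idea underlying the domain decomposed algorithm can be illustrated with a short sketch. The Python below is not taken from DL_POLY; it is a minimal serial illustration, assuming a cubic periodic box, positions in [0, box) and a cut-off no larger than half the box side. All function names are hypothetical. Atoms are binned into cells at least one cut-off wide, so that each atom need only be tested against atoms in its own and the 26 adjacent cells rather than against every other atom.

```python
import itertools
import random


def dist2(a, b, box):
    """Squared minimum-image distance between points a and b."""
    d2 = 0.0
    for u, v in zip(a, b):
        d = u - v
        d -= box * round(d / box)   # wrap into the nearest periodic image
        d2 += d * d
    return d2


def link_cell_pairs(positions, box, cutoff):
    """Pairs (i, j), i < j, closer than cutoff, found via link cells."""
    ncell = max(1, int(box // cutoff))   # cells per side, each >= cutoff wide
    width = box / ncell
    cells = {}
    for idx, p in enumerate(positions):
        key = tuple(int(x / width) % ncell for x in p)
        cells.setdefault(key, []).append(idx)

    pairs = set()
    offsets = list(itertools.product((-1, 0, 1), repeat=3))
    for key, members in cells.items():
        for off in offsets:
            # Periodic wrap of the neighbouring cell index
            nkey = tuple((k + o) % ncell for k, o in zip(key, off))
            for i in members:
                for j in cells.get(nkey, ()):
                    if i < j and dist2(positions[i], positions[j], box) < cutoff ** 2:
                        pairs.add((i, j))
    return pairs


def brute_force_pairs(positions, box, cutoff):
    """Reference O(N^2) search, used only to check the link-cell result."""
    n = len(positions)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if dist2(positions[i], positions[j], box) < cutoff ** 2}


def random_configuration(n, box, seed=0):
    """Hypothetical helper: n uniformly random points in the box."""
    rng = random.Random(seed)
    return [tuple(rng.random() * box for _ in range(3)) for _ in range(n)]


pts = random_configuration(200, 10.0, seed=1)
print(link_cell_pairs(pts, 10.0, 2.5) == brute_force_pairs(pts, 10.0, 2.5))
```

In a domain decomposed code each process would own a block of such cells and exchange only the boundary ("halo") cells with its neighbours; that communication step is omitted here.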
DL_POLY_Classic traditionally ran on CPU-based machines. With DL_POLY_4, however, the authors became more adventurous and ported the code to GP-GPUs, Xeon Phi accelerators and ARM processors. While in this paper we concentrate mainly on x86-64 based machines, we briefly discuss performance on Xeon Phi Knights Corner and Knights Landing. Results for the GP-GPU port, DL_POLY_CUDA, are presented in [19]. Although this port showed good results, with all cases analysed outperforming the CPU implementation, there were a number of issues that limited its sustainability and it was duly removed around 2016. The GP-GPU code itself was very specialised and difficult to maintain given the limited resources of the DL_POLY team. Rapid changes in the GP-GPU landscape, along with the lack of performance transferability between different GPUs, certainly hindered these developments.
The test cases examined here and described below are molecular simulations involving NaCl, Gramicidin, and Potassium disilicate glass, with the specifics of each data set a function of the version of the code in use.

DL_POLY_2 and DL_POLY Classic
The performance attributes of DL_POLY Classic [8] have been analysed by reference to three distinct test cases summarised below.
i. Bench4: A straightforward simulation of sodium chloride at 500 K, using the standard Ewald summation method to handle the electrostatic forces. A multiple time-step (MTS) algorithm is used to increase performance, which requires recalculating the reciprocal space forces only twice in every five time steps. The electrostatic cut-off is set at 24 Å in real space, with a primary cut-off of 12 Å for the MTS algorithm. The Van der Waals terms are calculated with a cut-off of 12 Å. The simulation is for 500 steps with a time step of 1 fs in the Berendsen constant temperature, constant volume (NVT) ensemble. The system size is 27,000 ions.
ii. Bench5: This simulation is of 8,640 atoms of an alkali disilicate glass at 1000 K. The electrostatics are again handled by the Ewald sum, with the interaction potential including a three-body valence angle term, which requires a link-cell scheme to locate atom triplets. The electrostatic cut-off is 12 Å and the Van der Waals cut-off is 7.6 Å; three-body forces are cut off at 3.45 Å. The simulation is for 3000 steps in the Hoover NVT ensemble, with a time step of 1 fs.
iii. Bench7: This system comprises 12,390 atoms, including 4012 TIP3P water molecules solvating the Gramicidin-A protein molecule at 300 K. Both the protein and water molecules are defined with rigid bonds, maintained by the SHAKE algorithm [11,46,47]. The water is held completely rigid, while the protein has angular and dihedral potential terms. Electrostatic interactions are handled by the neutral group method with a Coulombic potential truncated at 12 Å. The Van der Waals interactions are truncated at 8 Å. The simulation is for 5000 time steps in the constant energy, constant volume (NVE) ensemble with a 1 fs time step.

DL_POLY 3 and DL_POLY 4
The first benchmark [18] is a Coulombic-based simulation of NaCl, with 216,000 ions. The second is a much larger macromolecular simulation and consists of a system of eight Gramicidin-A species (792,960 atoms). The costs of the initial set-up and the final output have been removed from the results presented. Specifically:
i. Test2: A Coulombic-based simulation of NaCl with 216,000 ions, 200 time steps, cut-off = 12 Å.
ii. Test8: A system of eight Gramicidin-A species (792,960 atoms) in water; rigid bonds + SHAKE; 50 time steps.
Earlier benchmarking of DL_POLY_2 (version 2.11) [48][49][50] clearly revealed the limitations inherent in the replicated data strategy, with limited scalability on current high-end platforms. For both DL_POLY 3/4 datasets above, the results show a marked improvement in performance compared to the replicated data version of the code, with enhanced scalability for both the ionic and macromolecular simulations.
Other work relating to the benchmarking of molecular dynamics simulations to assess the performance of various popular MD software programmes, including DL_POLY, can be found in references [51][52][53][54].

Computer systems
The DL_POLY benchmark timings reported in this paper have been generated on a wide variety of computer systems, starting with the Cray T3E/1200 [55] in 1999. Note that the original T3E (retrospectively known as the T3E-600) had a 300 MHz processor clock. Later variants, using the faster DEC Alpha-based 21164A (EV56) processor system, comprised the T3E-900 (450 MHz) and the T3E-1200 (600 MHz), the system reported here.
Access to these systems was initially undertaken as part of Daresbury's Distributed Computing support programme (DiSCO [56]), with the resulting benchmarks presented at the series of annual Machine Evaluation Workshops (1999-2014 [57]), and at STFC's successor Computing Insight (CIUK) conferences from 2015 [58]. Access to many of these systems was typically short-lived, as they were provided by suppliers to enhance their profile at the above Workshops, with the opportunity for in-depth benchmarking often limited. The following points should be noted:
(1) The systems include a wide range of processor offerings. Representatives can be found from over a dozen generations of Intel processors, extending from the early days of single processor nodes housing Pentium 3 and Pentium 4 CPUs [59], through dual processor nodes featuring dual-core Woodcrest, quad-core Clovertown and Harpertown processors [60], along with the Itanium and Itanium2 CPUs [61], through to the extensive range of multi-core offerings summarised in Table 1, Westmere through Skylake [27]. These processors have featured in systems from  Figure 2), along with the now defunct offerings from Myrinet, Quadrics and QLogic. The more recent Truescale interconnect from Intel, along with its successor, Omnipath, is also included in a number of systems.
(4) Dating from the appearance of Intel's Sandy Bridge processors, many of the timings have been generated with the Turbo mode feature [65] enabled by the system administrators. Such systems are tagged with the '(T)' notation in the subsequent analysis.
As for the software systems in use, most of the commodity clusters (those featuring Intel processors) used successive generations of Intel compilers along with Intel MPI, although a range of other MPI libraries has also been used: OpenMPI, MPICH, MVAPICH and MVAPICH2. The proprietary systems from Cray and IBM used the system-specific compilers and associated MPI libraries.
A wide variety of storage systems, including the Lustre file systems supplied by DataDirect™ Networks [66], have been used.

Processor & interconnect technologies
Given the predominance of Intel processors in the clusters considered in this paper, we here review the major characteristics of these processors and the developments that undoubtedly have shaped the performance of DL_POLY over the period considered. Indeed, the processors considered in this paper span multiple generations of Intel architectures and the associated Xeon chips.
Many of the earlier Intel processors [67] had well-publicized memory bandwidth problems that severely limited their performance. Indeed, prior to the release of the Intel Nehalem (Xeon 5500 series) in 2009, AMD Opteron processors were often the processor of choice for scientific workloads. Nehalem offered some important initial steps toward ameliorating the problems associated with the sharing of the front-side bus (FSB) in previous processor generations by integrating an on-chip memory controller and by connecting the two processors through the Intel QuickPath Interconnect (QPI).
In assessing the processor generations, it is worth noting that until recently Intel has employed what it calls a 'tick-tock' model in designing and manufacturing Xeon chips. The 'tocks' are when a new microarchitecture is rolled out and the 'ticks' are when that microarchitecture is modified somewhat and then implemented on a new manufacturing process that shrinks the transistors. Performance generally improves more between 'ticks' and 'tocks' than between 'tocks' and 'ticks', i.e. a new microarchitecture typically brings a larger gain than a process shrink. In recent years, this shrinkage has allowed more cores and L3 cache memory to be added to the die, as well as other features.
We summarise in Table 1 and outline below the major characteristics of the post-Nehalem processors considered: initially the 'Westmere' processor, a die shrink of the third-generation Nehalem architecture release; then the successor fourth-generation architecture, the Sandy Bridge processor [68]; and the subsequent 'tick' release, the so-called Ivy Bridge processor [69,70]. Ivy Bridge was followed in 2013 by the Haswell processor architecture release [71], followed in turn by the 'tick' release of Broadwell in 2015 [72]. The most recent processor considered in this paper is Skylake [73], a microarchitecture redesign using the same 14 nm manufacturing process technology as its predecessor Broadwell, serving as a 'tock' in Intel's 'tick-tock' manufacturing and design model. For each processor family, we summarise below the variety of systems featured in this study that include each processor type. Table 1 lists for each processor family the evolution of a number of key attributes that would be expected to lead to enhanced performance, specifically the increasing number of Cores / Threads, the size of Last-level cache, the maximum memory channels, speed / socket, the availability of New instructions, the QPI / UPI Speed (GT/s) enhancing on-node processor capabilities, PCIe Lanes / Controllers / Speed (GT/s) improving inter-node communications and the associated power consumption, reflected by the Server / Workstation Thermal Design Power (TDP).

Intel processor architectures
Intel X5600 'Westmere' processor
The Intel X5600, launched in 2010, is the Westmere series, a 32 nm die shrink of Nehalem. As mentioned above, the Nehalem architecture had overcome problems associated with the sharing of the front-side bus (FSB) in previous processor generations, resulting in more than three times greater sustained memory bandwidth per core than the previous-generation dual-socket architecture. It also introduced hyper-threading (HT) technology (or simultaneous multi-threading, 'SMT') and Intel Turbo Boost technology 1.0 ('Turbo mode'), which automatically allows processor cores to run faster than the base operating frequency if the processor is operating below rated power, temperature and current specification limits [65].
Six systems with the quad-core Nehalem EP processors feature in the present study, from the Nehalem-EX 1.87 GHz L7555 through to the X5570 Nehalem 2.93 GHz, along with five six-core Westmere-based systems, from the X5650 2.66 GHz to the X5675 3.07 GHz.

Intel Xeon E5-2600 'Sandy Bridge' processor
In 2012, Intel introduced the fourth-generation architecture with the eight-core E5-2670 Intel Xeon processor ('Sandy Bridge'), which brought new architectural features, extensions and mechanisms that significantly improved overall performance. The performance improvement of the Sandy Bridge architecture over the Nehalem architecture (Nehalem and Westmere processors) is attributed [68] to an increase from three memory channels to four, an increase in memory speed from 1333 MHz to 1600 MHz, and new technology/architecture such as a ring connecting the cores, L3 cache (2.5 MB vs. 2 MB per core), increased QPI links and link rate, memory controller and I/O controller, a new Turbo Boost version and a system agent (see Table 1). Of particular significance was the introduction of the Intel Advanced Vector Extensions (AVX) unit, with vector registers 256 bits wide in Sandy Bridge instead of 128 bits in Westmere, thereby doubling the peak floating-point performance [74]. Six systems with the 8-core Sandy Bridge processors feature here, including the e5-2670 2.6 GHz, e5-2680 2.7 GHz and e5-2690 2.9 GHz processors. The key issue from a DL_POLY perspective is whether the code is able to exploit the new instructions described in Table 1.
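The factor of two associated with AVX follows from simple arithmetic on the vector width. As a rough sketch, assuming that one vector add and one vector multiply can be issued per cycle on both Westmere and Sandy Bridge (no fused multiply-add), and ignoring Turbo Boost and any AVX-related clock changes, the peak double-precision rate per socket is cores × clock × lanes × 2. The function name and parameters are illustrative only; the core counts and clock speeds are those quoted in the text.

```python
def peak_dp_gflops(cores, clock_ghz, vector_bits, issue_per_cycle=2):
    """Theoretical peak double-precision GFLOP/s for one socket.

    issue_per_cycle=2 is an assumption: one vector add plus one vector
    multiply per cycle, with no fused multiply-add.
    """
    lanes = vector_bits // 64          # 64-bit doubles per vector register
    return cores * clock_ghz * lanes * issue_per_cycle


# Westmere X5675: 6 cores, 3.07 GHz, 128-bit SSE registers
westmere = peak_dp_gflops(6, 3.07, 128)
# Sandy Bridge e5-2670: 8 cores, 2.6 GHz, 256-bit AVX registers
sandy_bridge = peak_dp_gflops(8, 2.6, 256)

print(f"Westmere X5675      : {westmere:6.1f} GFLOP/s peak")
print(f"Sandy Bridge e5-2670: {sandy_bridge:6.1f} GFLOP/s peak")
print(f"per-core, per-cycle ratio: {(256 // 64) / (128 // 64):.1f}")
```

These are upper bounds only; a code reaches them only if its inner loops vectorise and are not limited by memory bandwidth, which is the question raised above for DL_POLY.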

Intel Xeon E5-2600 v2 'Ivy Bridge' processor
Introduced in April 2012, Ivy Bridge (IVB), an Intel 'tick' release, used the same architecture as Sandy Bridge, with the main difference between the two processor generations being the shrinking of process technology from 32 nm to 22 nm. Both processors have an integrated memory controller that supports four DDR3 channels. Sandy Bridge (SB) processors support memory speeds up to 1600 MegaTransfers per second (MT/s), whereas IVB processors support memory speeds up to 1866 MT/s. As noted above, performance generally improves more between 'ticks' and 'tocks' than between 'tocks' and 'ticks', and this is evident in the relationship between Sandy Bridge and Ivy Bridge [69,70]. An important point when comparing system performance is that gains seen in micro-benchmark and synthetic benchmark data do not always translate into gains for real-world use cases. An example of this when assessing the relative performance of SB and IVB can be found in the detailed application benchmarking studies conducted by NCSA [70].

Intel Xeon E5-2600 v3 'Haswell' processor
The launch of the 'Haswell' generation of Xeon E5-2600 processors, the third generation to bear that name, provided both a reasonable increase in single-thread performance and a significant jump in overall system throughput for two-socket servers [71]. The Haswell Xeon E5s followed soon after the 'Ivy Bridge' Xeon E5-2600 v2 processors announced a year earlier. Important changes available in E5-2600 v3 'Haswell' included:
- Support for new DDR4-2133 memory
- Up to 18 processor cores per socket (with options for 6 to 16 cores)
There were 26 different versions of the Haswell Xeon E5 processors aimed at two-socket servers, plus one for workstations and two more for single-socket machines, but even those lines are somewhat blurred. The diversity of feeds and speeds and prices with the Haswell Xeon E5s was both wider and deeper than with the prior Ivy Bridge Xeon E5 v2 chips bearing the 2600 designation.
The high core count, or HCC, variant of the Haswell Xeon E5 had from 14 to 18 cores and two switch interconnects linking the rings that in turn bind the cores and L3 cache segments together with the QPI buses, PCI-Express buses and memory controllers. Three HCC systems feature here, including the 14-core e5-2695v3 2.3 GHz and e5-2697v3 2.6 GHz processors.
The medium core count, or MCC, variant had from 10 to 12 cores; it essentially cut off six cores and bent one of the rings back on itself to link to the second memory controller on the die. Two MCC-based systems are included, the 12-core e5-2680v3 2.5 GHz and e5-2690v3 2.6 GHz processors.
The low core count, or LCC, version of the Haswell Xeon had from 4 to 8 cores and was essentially the left half of the MCC variant with the two interconnect switches, four of the cores and the bent ring removed. No LCC system is included in the present study.
Most of the Haswell Xeon E5 chips supported Turbo Boost and HyperThreading, the former allowing for clock speeds to be increased if there is thermal room to do so and the latter presenting two virtual threads per core to the operating system, which can increase throughput for many workloads.
In general, the Haswell core delivers about 10% better Instructions per cycle (IPC) than the Ivy Bridge core, through a mix of better branch prediction, deeper buffers, larger translation lookaside buffers (TLBs), and a larger number of execution units. To feed the higher floating-point capacity of the Haswell core, Intel also doubled up the bandwidth on the L1 and L2 caches in each core. The FMA, AVX, and cryptographic elements of the Haswell core had also been significantly improved.
With the Haswell Xeon E5 chips, the top-end QPI speed had been increased by 20% to 9.6 GT/sec, providing the extra capacity to balance out the faster main memory and increased core counts per socket that came with these chips. Processors with fewer cores have QPI ports that run at 8 GT/sec or 6.4 GT/sec and also use lower memory speeds to keep things in balance.
As with the prior generations of Xeon E5 chips, the Haswell iteration allocated 2.5 MB of L3 cache per core, which was united into a single L3 cache by the rings and shared by all of the cores. Significant extensions to the Turbo capability were also in place compared to the preceding Sandy Bridge and Ivy Bridge processors.
The Haswell Xeon E5 chips had the same 40 PCI-Express 3.0 lanes linking to on-chip controllers, and the processors plugged into the 'Wellsburg' C610 chipset, which has four lanes of DMI2 (the link between the Northbridge and Southbridge of the motherboard) into the processor and then support for up to ten SATA3 ports, eight USB2 ports, six USB3 ports, and one Gigabit Ethernet port coming out the other side of the chipset.
Floating-point performance could be as high as 2× thanks to the doubling of flops per clock per core and the 50% increase in the cores, using the Linpack test as a guide. The mileage on real applications varies depending on the sensitivity of the workload to frequency, core count, and features such as AVX2 vector math units [71]. Again, the question is whether DL_POLY can exploit the AVX2 instruction set; if not, performance gains are likely to be small compared to the v2 processor family.

Intel Xeon E5-2600 v4 'Broadwell' processor
The next tick came with the 'Broadwell' Xeons, which moved the Haswell architecture to a 14 nanometre process. The key focus of Broadwell was to enhance power efficiency while improving performance on the smaller 14 nm node [72]. Broadwell also integrated new features beyond core enhancements. The E5-2600 v4 can have up to 22 cores (44 hardware threads) per socket and up to 55 MB of last-level cache (LLC, also known as L3), up from 45 MB, with each core having 32 KB of data L1 cache, 32 KB of instruction L1 cache and 256 KB of L2 cache. Intel still enabled four channels of DDR4 memory, but increased the peak data rate to 2400 MT/s (a 12.5% improvement over 2133 MT/s). Intel also added new memory features, such as support for 3DS LRDIMMs and DDR4 Write CRC (an enhanced form of error control). Most platform features remained unchanged, i.e. there remained 40 lanes of PCIe 3.0 and two QPI 1.1 ports.
The vector floating-point multiplication instructions MULPS and MULPD had been reduced in latency to three cycles from five. Similarly, various floating-point division instructions (DIVSS, DIVSD, DIVPS and DIVPD) had been cut in latency. For example, 256-bit single-precision vector divisions had a 16-cycle latency rather than the 20 in Haswell, and double-precision had 22 cycles rather than 34. Scalar divides could be split in two and processed in parallel.

Xeon Scalable processor family 'Skylake-SP' CPUs
The final series of processors considered in this paper is the 14 nm Intel Xeon Processor Scalable Family (codenamed 'Skylake-SP' or 'Skylake Scalable Processor' [27,73]). 'Skylake-SP' processors replaced the previous 14 nm 'Broadwell' microarchitecture (both the E5 and E7 Xeon families) and became available in July 2017. With this new product release, Intel merged all previous Xeon server product families into a single family. The old model numbers (E5-2600, E5-4600, E7-4800, E7-8800) are now replaced by these 'Skylake-SP' CPUs. While this opens up the possibility of selecting from a broad range of processor models for any given project, it requires attention to detail, for there are more than 30 CPU models to select from in the Xeon Processor Scalable Family.
This processor family is divided into four tiers: Bronze, Silver, Gold and Platinum. The Silver and Gold models are in the price range familiar to HPC users/architects, while the Platinum models are in a higher price range than HPC groups are typically accustomed to. The Platinum tier targets Enterprise workloads, and is priced accordingly [73]. With that in mind, our interest here is focused on CPU models that fit within the existing price ranges for mainstream HPC; we restrict our attention to a single Platinum CPU model which is of interest to DL_POLY users but comes at a higher price, the 26-core Platinum 8170 2.1 GHz processor. From the perspective of a DL_POLY user, our guidance for selecting Xeon tiers is as follows:
- Intel Xeon Bronze: not recommended. Base-level models with low performance.
- Intel Xeon Silver: perhaps suitable for entry-level DL_POLY usage. Slightly improved performance over previous Xeon generations.
- Intel Xeon Gold: recommended for all DL_POLY production workloads. The best balance of performance and price. In particular, the 6100-series models should be preferred over the 5100-series models, because they have twice the number of AVX-512 units, support DDR4-2666 while the 5100-series is limited to DDR4-2400 as standard, and have an enhanced number of UPI links to other processors. Although, as we shall see, the current DL_POLY software cannot effectively exploit the AVX-512 units, the other factors argue for the 6100 series.
- Intel Xeon Platinum: recommended for some DL_POLY workloads. Although these models provide the highest performance, their higher price makes them suitable only for particular workloads that require their specific capabilities (e.g. large SMP and large-memory compute nodes).
With a product this complex, it is very difficult to cover every aspect of the design; we merely note here the important changes available in the 'Skylake-SP' CPUs, including:
- Up to 28 processor cores per socket (with options for 4-, 6 the processor (only available on certain SKUs)
- CPU cores are arranged in an 'Uncore' mesh interconnect (replacing the older dual-ring interconnect)
- Optimised Turbo Boost profiles allowing higher frequencies even when many CPU cores are in use
- All 2-/4-/8-socket server product families (sometimes called EP 2S, EP 4S, and EX) are merged into a single product line
- A new server platform (formerly codenamed 'Purley') to support this new CPU product family.
In addition to the 26-core Platinum system mentioned above, five systems based on Skylake Gold processors feature here, including the 16-core SKL Gold 6130 2.1 GHz and 6142 2.6 GHz, 18-core SKL Gold 6150 2.7 GHz and the 20-core Gold 6148 2.4 GHz processors.

Synthetic benchmarks
We have outlined in section 3.1 the evolving improvements in Intel's processor capabilities across processor families. We now consider a number of synthetic benchmarks that aim to quantify these improvements in terms of application performance, together with a limited number of DL_POLY-specific tests to determine the relative importance of the attribute under consideration. Section 3.2.1 considers the increasingly important role of memory bandwidth [75,76] given the rapid increase in the multi-core nature of processors, while 3.2.2 considers the evolution of interconnect capabilities through the IMB benchmarks [77,78].

Memory bandwidth and STREAM
One performance aspect that is commanding increasing attention is memory bandwidth: how quickly data can be written to or read from memory by the processor. This driver of application performance is very important because of the presence of a deep memory hierarchy in today's parallel machines, and the overall impact of how quickly data can be moved into and out of memory for processing. If memory bandwidth is low, the processor may be left waiting on memory to retrieve or write data. If memory bandwidth is high, the data needed by the processor can easily be retrieved or written.
One of the most commonly used benchmarks in HPC is STREAM [75], a synthetic benchmark that measures sustainable memory bandwidth for simple computational kernels. Four benchmarks compose STREAM: COPY, SCALE, SUM and TRIAD. The TRIAD benchmark, a(i) = b(i) + q*c(i), allows chained, overlapped or fused multiply-add (FMA) operations. It builds on the SUM benchmark by adding an arithmetic operation to one of the fetched array values. Given that FMA operations are important in many basic computations, such as dot products, matrix multiplication, polynomial evaluations, Newton's method for evaluating functions, and many digital signal processing (DSP) operations, this benchmark can be directly associated with application performance. The FMA operation now has its own instructions and is usually done in hardware. Consequently, feeding such hardware operations with data can be extremely important; hence the usefulness of the TRIAD memory bandwidth benchmark.
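To make the TRIAD kernel and its bandwidth accounting concrete, the following is a minimal pure-Python sketch. It is not the STREAM benchmark itself: interpreter overhead dominates the timing, so the reported rate is far below what compiled code achieves and should not be read as a hardware measurement. The sketch follows the STREAM conventions of counting three 8-byte words per iteration for TRIAD (two reads and one write) and of reporting the best time over several repeats; the function names and default sizes are hypothetical.

```python
import time
from array import array


def triad(a, b, c, q):
    """STREAM TRIAD kernel: a(i) = b(i) + q*c(i)."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_rate_mb_s(n=200_000, q=3.0, repeats=3):
    """Time the kernel and return (MB/s, result array)."""
    a = array("d", [0.0]) * n   # contiguous arrays of 8-byte doubles
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        triad(a, b, c, q)
        best = min(best, time.perf_counter() - t0)
    bytes_moved = 3 * a.itemsize * n    # read b, read c, write a
    return bytes_moved / best / 1.0e6, a


rate, result = triad_rate_mb_s()
print(f"TRIAD (interpreted Python): {rate:.1f} MB/s")
```

In the real benchmark the arrays are sized well beyond the last-level cache so that the kernel is limited by main memory rather than by cache bandwidth.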
Alongside our benchmarking of DL_POLY, we have routinely performed STREAM benchmarks and present here a brief analysis of the results from a variety of HPC nodes defined in Table 2. We focus on the Intel processors outlined in section 2, along with representatives from the IBM Power family and the more recent AMD EPYC processor. Our particular interest is in memory bandwidth per core (given the requirement to use all cores available). Figure 1 shows both the total memory bandwidth per node and, more importantly perhaps, the memory bandwidth per core. These show that the per-core metric varies with the 'tick-tock' model. In general, the total memory bandwidth per dual socket node trend is upward (i.e. more memory bandwidth per node over time), but this is not the case for memory bandwidth per core. From the figure, it can be seen that the memory bandwidth on the system has certainly increased over generations.
Perhaps more relevant, however, are the results of Figure 1(b) that show a quite different trend. As an example, the theoretical maximum memory frequency increased by 12.5% in Broadwell over Haswell (2133 MT/s to 2400 MT/s) and this translates into a 10%-12% better measured memory bandwidth. However, the maximum core-count per socket has increased by up to 22% in Broadwell over Haswell, and so the memory bandwidth per core depends on the specific SKU. Thus the 20 and 22 core BDW processors support only ∼3 GB/s per core and that is likely to be very low for most HPC applications, while the 16 core BDW is on par with the 14-core HSW at ∼4 GB/s per core. Consequently, several ISV companies recommended that their customers not use all of the cores on their processors, increasing the memory bandwidth per core used and in many cases improving the overall time to solution.
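The arithmetic behind this trend can be made explicit. The sketch below assumes top-SKU core counts of 18 (Haswell) and 22 (Broadwell) per socket, which are not stated in the text, and takes a 10% measured bandwidth gain from the lower end of the quoted range; the helper names are hypothetical.

```python
def pct_increase(old, new):
    """Percentage increase from old to new."""
    return 100.0 * (new - old) / old


def per_core_change(bandwidth_gain_pct, core_gain_pct):
    """Percentage change in bandwidth per core when both bandwidth and cores grow."""
    return 100.0 * ((1 + bandwidth_gain_pct / 100.0) / (1 + core_gain_pct / 100.0) - 1)


# Memory transfer rate, Haswell -> Broadwell (values from the text).
memory_gain = pct_increase(2133, 2400)

# Assumed top-SKU core counts per socket (18 -> 22); not given in the text.
core_gain = pct_increase(18, 22)

# A 10% measured bandwidth gain spread over 22% more cores.
change = per_core_change(10.0, core_gain)

if __name__ == "__main__":
    print(f"memory frequency gain: {memory_gain:.1f}%")
    print(f"core count gain:       {core_gain:.1f}%")
    print(f"bandwidth per core:    {change:+.1f}%")
```

Under these assumptions the per-core bandwidth falls even though the per-node bandwidth rises, which is the pattern described for the highest core-count SKUs.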
A routine method to assess the impact of memory bandwidth on the performance of an application is to simply reduce the number of MPI processes per node, i.e. for a job with a fixed core count, increase the number of nodes available to the simulation, thereby increasing the effective memory bandwidth per core on a given node. This has been shown to have a major impact on the overall run time for those simulations with a heavy dependence on memory bandwidth, e.g. the engineering codes PCHAN and OpenFOAM [76], where halving the number of MPI processes per node leads to a 50% improvement in time to solution. However, this may be of limited benefit in terms of cost as most HPC centres charge for nodes accessed regardless of occupancy, i.e. double the node count, double the cost. Table 3 demonstrates this effect for each of the data sets used in DL_POLY Classic and DL_POLY 4, where the improvements in time to solution are reported for simulations undertaken on the 'Diamond' Intel Broadwell cluster with e5-2697A v4 [16c] 2.6 GHz (T) with Mellanox EDR interconnect and Intel compiler ifort version 18.0.1.
The improvements noted are based on the simulation times when reducing the number of MPI processes per node [i.e. cores] from 32 (corresponding to fully occupied nodes) to 16, i.e. half the node occupancy. In most cases the performance improvement is at best modest, the exception being the NaCl simulations in DL_POLY Classic (Bench4) and DL_POLY 4, where we find 18.6% and 11.0% improvements respectively on 256 MPI processes. While noting the 15.1% improvement on the 32-core Gramicidin data set (DL_POLY 4), we would conclude at this stage that memory bandwidth sensitivity on the current generation of multi-core processors is unlikely to play a dominant role in DL_POLY performance. This is perhaps not surprising since molecular dynamics applications in general feature irregular memory access patterns. Irregular accesses typically make it difficult to keep data in cache, resulting in many cache misses and low performance. Such applications are typically characterised at best by strided memory accesses, and are hence more sensitive to memory latency than to bandwidth [49,50].
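The cost trade-off of running on half-occupied nodes can be illustrated with a short calculation. This is a sketch under two assumptions: that the quoted improvement is the percentage reduction in time to solution, and that charging is per node regardless of occupancy. The function names are hypothetical.

```python
def time_ratio(improvement_pct):
    """Run time at reduced occupancy relative to fully occupied nodes."""
    return 1.0 - improvement_pct / 100.0


def node_cost_ratio(improvement_pct, node_factor=2.0):
    """Node-hours at reduced occupancy relative to fully occupied nodes.

    node_factor is the increase in node count (2.0 when occupancy is halved).
    """
    return node_factor * time_ratio(improvement_pct)


if __name__ == "__main__":
    cases = [
        ("NaCl, DL_POLY Classic (Bench4)", 18.6),
        ("NaCl, DL_POLY 4", 11.0),
        ("Gramicidin, DL_POLY 4 (32 cores)", 15.1),
    ]
    for label, gain in cases:
        print(f"{label}: time x{time_ratio(gain):.3f}, "
              f"node-hours x{node_cost_ratio(gain):.3f}")
```

On this accounting, halving the occupancy only breaks even in node-hours when the time to solution improves by 50%, so the improvements reported for DL_POLY shorten the run but increase its cost.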
Developing this theme further, the dominance of multi/many-core architectures in today's supercomputers makes the efficiency of on-chip parallelism increasingly important. The challenge of achieving efficient on-chip parallel molecular dynamics applications mainly arises from two issues. First, as mentioned above, MD applications are characterised by irregular memory access, which makes locality optimisation difficult. Secondly, many-core hardware limitations, typified by the amount of on-chip memory, bandwidth of on-chip networking, etc., constrain the size of the working set per core, which makes on-chip parallelisation difficult.
An excellent account of navigating around these limitations describes accelerating MD simulation to achieve high on-chip parallel efficiency on the representative many-core architecture Godson-T [49,50]. This involved a number of steps. First, a pre-processing step that leverages an adaptive divide-and-conquer framework designed to exploit locality through the memory hierarchy with software controlled memory. Then three incremental optimisation strategies were proposed: (1) a novel data layout to re-organize linked-list cell data structures to improve data locality and alleviate irregular memory accesses; (2) an on-chip locality-aware parallel algorithm to enhance data reuse to amortize the long latency to access shared data; and (3) a pipelining algorithm to hide latency to shared memory.
These techniques led to the parallel MD algorithm scaling nearly linearly with the number of cores, with data locality and data reuse schemes essential in achieving high scalability. We return to these challenges in section 5.

Interconnect requirements and IMB
The Intel MPI Benchmarks [77,78] give a measure of MPI communication performance. Figure 2 shows results from the PingPong benchmark, where the two MPI processes reside on separate nodes, i.e. a measure of inter-node communication performance. Although DL_POLY is typically run using fully occupied nodes, the inter-node communication performance results of Figure 2 provide the standard benchmark for showing the impact of latency and bandwidth as a function of evolving communication fabrics. The performance at small message sizes suggests there is at best a factor of 2.2 difference between the systems, i.e. those latency-bound codes dominated by small point-to-point messages will show little improvement in MPI performance across the variety of systems considered in the move from Infiniband DDR to EDR. There is greater improvement at large message sizes, with the measured PingPong bandwidth increasing from 1.7 GB/sec (DDR) to 11.5 GB/sec (Mellanox EDR and Intel OPA) for 4 MByte messages, i.e. a more than 6× increase in performance.
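The distinction between latency-bound and bandwidth-bound messages can be illustrated with the usual linear cost model, time = latency + size / bandwidth. This is a simplified sketch rather than a fit to Figure 2: the latency (1 microsecond) and asymptotic bandwidth (10 GB/s) are hypothetical values chosen for illustration, and the function names are not taken from IMB.

```python
def transfer_time(msg_bytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model for a one-way message: start-up latency plus streaming time."""
    return latency_s + msg_bytes / bandwidth_bytes_per_s


def effective_bandwidth(msg_bytes, latency_s, bandwidth_bytes_per_s):
    """Bandwidth actually achieved for a message of the given size."""
    return msg_bytes / transfer_time(msg_bytes, latency_s, bandwidth_bytes_per_s)


def half_bandwidth_size(latency_s, bandwidth_bytes_per_s):
    """Message size at which half of the asymptotic bandwidth is reached."""
    return latency_s * bandwidth_bytes_per_s


LATENCY = 1.0e-6      # hypothetical: 1 microsecond
BANDWIDTH = 10.0e9    # hypothetical: 10 GB/s asymptotic bandwidth

if __name__ == "__main__":
    for size in (8, 256, 65536, 4 * 1024 * 1024):
        achieved = effective_bandwidth(size, LATENCY, BANDWIDTH)
        print(f"{size:>8d} bytes: {100 * achieved / BANDWIDTH:5.1f}% of peak bandwidth")
    # Ratio of the two large-message bandwidths quoted in the text.
    print(f"large-message improvement: x{11.5 / 1.7:.1f}")
```

In this model a message of a few bytes costs essentially the latency alone, so a faster link changes little, whereas a 4 MByte message runs close to the asymptotic bandwidth and benefits almost fully from it.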

Understanding performance: useful tools
The performance of a parallel code is typically dependent on a complex combination of factors. It is therefore important that developers of HPC applications have access to effective tools for collecting and analysing performance data. This data can be used to identify such issues as computational and communication bottlenecks, load imbalances and inefficient CPU utilisation. In this study, we used both IPM (Integrated Performance Monitoring) [21-23] and Allinea's Performance Report [24-26] to gain additional insight into the run-time characteristics and performance of both DL_POLY Classic and DL_POLY 4.
Before considering the tools-based evidence, we consider briefly the impact of compiler optimisation on the observed performance of both versions of the code, noting that most of the performance data of section 4 was generated using either -O2 or -O3. Tables 4 and 5 show the improved performance levels relative to those obtained using simply -O1 on an Atos Broadwell cluster with Intel compiler ifort version 18.0.1.
As can be seen from Table 4, there is an improvement of ∼20% in performance for each DL_POLY Classic test case when increasing optimisation from -O1 to -O2, but little gain thereafter when using the higher optimisation flags captured by -O3, -Ofast, -O3 -xAVX and -O3 -axSSE4.2. The perhaps surprising exception is found when using the CORE-AVX2 flag that recognises the AVX2 capabilities of the Broadwell processor.
A similar picture is found for DL_POLY 4 (Table 5), although a smaller performance improvement of ca. 10%-12% is seen when increasing optimisation from -O1 to -O2 compared to that found in DL_POLY Classic. Again, there is little gain thereafter when using the higher optimisation flags captured by -O3, -Ofast, -O3 -xAVX and -O3 -axSSE4.2. The perhaps surprising exception is seen in both codes when using the -xAVX2 flag that recognises the AVX2 capabilities of the Broadwell processor.
We also note the decrease in relative performance as the processor count increases, consistent with the growing contribution of communication time, and corresponding decrease in processor time.

Integrated performance monitoring (IPM)
IPM [21-23] is a portable profiling infrastructure for parallel codes. It provides a low-overhead performance profile of the performance aspects and resource utilisation in a parallel programme. Communication, computation, and I/O are the primary focus. While the design scope targets production computing in HPC centres, IPM has found use in application development, performance debugging and parallel computing education. The level of detail is selectable at runtime and presented through a variety of text and web reports. IPM has extremely low overhead, is scalable and easy to use and requires no source code modification. It brings together several types of information important to developers and users of parallel HPC codes, with the information gathered in a way that tries to minimize the impact on the running code, maintaining a small fixed memory footprint and using minimal amounts of CPU. The data from individual tasks is aggregated in a scalable way in the generated profile.
There are a variety of primary monitors that IPM currently integrates, with the interest here focused on the MPI: communication topology and statistics for each MPI call and buffer size, with examples provided at [21]. By default, IPM produces a summary of the performance information for the application on stdout. IPM also generates an XML file that can be used to generate a graphical webpage.
Aside from overall performance, reports are available for load balance, task topology, bottleneck detection, and message size distributions.
The 'integrated' in IPM is multi-faceted. It refers to binding the above information together through a common interface and also the integration of the records from all the parallel tasks into a single report. At a high level, IPM seeks to integrate the information useful to all stakeholders in HPC into a common interface that allows for common understanding. This includes application developers, science teams using applications, HPC managers, and system architects. Figure 3 shows a breakdown of the time spent in each MPI function for the three DL_POLY Classic test cases during 256 core simulations on a Broadwell-based e5-2697A v4 [16c] 2.6 GHz (T) cluster with Mellanox EDR interconnect. Note that the number of time steps in each simulation was increased by x20 to minimise any impact of the input and output/analysis phases. The percentage of total run time spent in communications is similar in Bench4 (32.2%) and Bench5 (35.7%), but increases significantly in the Gramicidin (Bench7) simulation, to 57.5%. The MPI_Allreduce collective is the dominant MPI function in all three simulations, accounting for 22% in Bench4, 21% in Bench5 and 38% in Bench7.
Note that the dominance of MPI collectives in the charts of Figure 3 is a natural result of the replicated data implementation used in DL_POLY Classic.
Turning to the distributed data code, Figure 4 shows a breakdown of the time spent in each MPI function for both NaCl and Gramicidin during 256 core simulations on the same Broadwell-based e5-2697A v4 [16c] 2.6 GHz (T) cluster. Note again that the number of time steps in both simulations was increased by x20 to minimise the impact of the input and output/analysis phases. Expressed as a percentage of the overall run time, these suggest that MPI communication traffic accounts for 27.7% of the overall run time for NaCl, and 27.8% for Gramicidin. The MPI_Allreduce collective is the dominant MPI function in Gramicidin, while MPI_Recv consumes the largest fraction in NaCl. Note the significant decrease in the MPI contribution to the overall run times in DL_POLY 4 compared to those found in DL_POLY Classic (Figure 3). This is mainly due to the usage of non-blocking, point-to-point communication as opposed to the MPI collectives in DL_POLY Classic. MPI_Recv usage has its root in different physical statistics collecting routines and writing restart files. MPI_Allreduce is used in the majority of cases to check the safety of various physical operations over MPI ranks, SHAKE being one example of such an operation. While the amount of data involved in MPI_Allreduce is modest, the dominant weight in the runtime analysis is due to its widespread usage.
Additional insight is provided in Figure 5 where we take a closer look at the time spent as a function of the buffer size for each MPI function for both NaCl (i) and Gramicidin (ii) simulations. In both cases calls to MPI_Recv and MPI_Allreduce account for most of the communication time, consistent with Figure 4. The message sizes in MPI_Allreduce are very small, 4-256 bytes, where it is clear that communication latency is the limiting factor. Two message lengths dominate the MPI_Recv calls, with the greater of these consuming more time. The associated message sizes in Gramicidin (ca. 256 kB) are somewhat greater than those in NaCl (64 kB), due in large part to the different size of the two problems considered. The Gramicidin simulation also uses SHAKE, which increases the number of MPI_Allreduce calls.
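The kind of aggregation behind a time-versus-buffer-size view can be sketched as follows. This is an illustration of the idea only and not IPM's implementation: the record format (call name, message size in bytes, elapsed seconds), the power-of-two bucketing and the sample values are all hypothetical.

```python
from collections import defaultdict


def size_bucket(nbytes):
    """Smallest power of two that is >= nbytes (minimum bucket of 1 byte)."""
    bucket = 1
    while bucket < nbytes:
        bucket *= 2
    return bucket


def time_by_call_and_size(records):
    """Sum elapsed time per (MPI call, message-size bucket)."""
    totals = defaultdict(float)
    for call, nbytes, seconds in records:
        totals[(call, size_bucket(nbytes))] += seconds
    return dict(totals)


# Hypothetical trace records: (call, message size in bytes, elapsed seconds).
SAMPLE = [
    ("MPI_Allreduce", 8, 0.50),
    ("MPI_Allreduce", 8, 0.25),
    ("MPI_Allreduce", 200, 0.25),
    ("MPI_Recv", 60000, 1.00),
    ("MPI_Recv", 250000, 2.00),
]

if __name__ == "__main__":
    for (call, bucket), seconds in sorted(time_by_call_and_size(SAMPLE).items()):
        print(f"{call:<14s} <= {bucket:>7d} bytes: {seconds:.2f} s")
```

Grouping in this way separates the many small, latency-bound collective calls from the few large point-to-point transfers, which is the distinction drawn above between MPI_Allreduce and MPI_Recv.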

Allinea performance reports
Allinea Performance Reports is a low-overhead tool that produces one-page text and HTML reports summarising and characterising both scalar and MPI application performance, and as such provides an effective way to characterise and understand the performance of HPC application runs [24-26]. Performance Reports is based on Allinea MAP's low overhead adaptive sampling technology that keeps data volumes collected and application overhead low. Advantages of using this tool are (i) it runs transparently on optimised production-ready codes by adding a single command to your scripts; (ii) there is only a modest application slowdown (ca. 5%) even with 1000s of MPI processes; (iii) Performance Reports runs on existing codes: a single command merely added to execution scripts; and (iv) if submitted through a batch queuing system, then the submission script is modified to load the Allinea module and one merely adds the 'perf-report' command in front of e.g. the required mpiexec command.
Table 5. % Performance improvement of DL_POLY 4 as a function of compiler option relative to that obtained using -O1.
Performance reports for both the NaCl and Gramicidin simulations using DL_POLY 4 are presented in Figures 6 and 7 respectively. All reports were generated on a Broadwell-based e5-2697A v4 [16c] 2.6 GHz (T) cluster with Mellanox EDR interconnect, both figures capturing output from the reports generated at 32, 64, 128 and 256 cores. Each figure presents two schematics: a CPU breakdown capturing the percentage of time spent in scalar numeric Ops, vector numeric Ops and memory accesses, plus the breakdown of the percentage of time between CPU and MPI related activities. As can be seen from both Figures, the relative proportions of each of the CPU components stay roughly constant regardless of core count, with both simulations pointing to some 50%-60% of the time spent in memory accesses, 35%-40% in scalar numeric Ops and only ∼5%-10% in vector numeric Ops. There is little difference between the findings from the NaCl and Gramicidin simulations. It is fair to say that this is a somewhat bleak picture in terms of the code being well positioned to capitalise on those processors where performance enhancements rely on e.g. enhanced vector (AVX) instructions [74].
Both simulations lead to the same conclusion: the per-core performance is memory-bound, with little time spent in vectorised instructions.
In terms of the breakdown between CPU and MPI instructions, the second plot in Figure 7 suggests that for the NaCl simulation, CPU instructions dominate at 32 and 64 cores, while MPI communications become dominant at 128 and 256 cores, with the latter suggesting an 80% contribution from MPI-related operations. This onset of MPI dominance is even more marked for Gramicidin, although these MPI contributions are undoubtedly overestimates for the actual MD simulation itself. Remember that the key timings reported throughout this paper refer to just the time spent in the multiple time steps, while the Perf Reports are based on the overall elapsed time for the job i.e. are capturing both the initial input and final output/analysis.

Analysis of the DL_POLY benchmarks
The DL_POLY benchmark timings presented in this paper have been generated on a wide variety of computer systems (see section 2.3 for the relevant background and details), featuring systems that include representatives from multiple generations of Intel processors [59-61]. Following the early single processor nodes with Pentium 3 and Pentium 4 CPUs, dual processor nodes dominate, featuring dual-core, quad-core and the extensive range of multi-core offerings as summarised in Table 1, Westmere through Skylake [27].
The systems featured also include a wide range of network interconnects. Following the initial use of Fast Ethernet and GBit Ethernet, the family of Infiniband [64] interconnects is prominent, from Voltaire, Silverstorm and Mellanox (SDR, DDR, QDR, FDR and now EDR, see Figure 2), along with the more recent Omni-Path interconnect from Intel.
As for software systems, many of the clusters featuring Intel processors have relied on successive generations of Intel compilers along with Intel MPI, although a variety of MPI libraries have been used, including OpenMPI, MPICH, MVAPICH and MVAPICH2. In contrast, the proprietary systems from Cray and IBM have typically used the system specific compilers and associated MPI libraries.
As mentioned in section 2.3, the following points should be noted: (a) Many of the timings from systems starting with Intel Sandy Bridge processors have Turbo mode [65] enabled. Such systems are tagged with '(T)' notation in Table 6. (b) A wide variety of storage systems have also been used, ranging from simple NFS /home partitions through to Lustre file systems.
Our performance analysis below is presented in two distinct sections, with consideration first given to results using the DL_POLY Classic code, followed by the corresponding analysis based on DL_POLY 3 and 4.

Performance measurements using DL_POLY Classic
The naming convention adopted for each of the systems is broadly defined as follows: [ The systems of Table 6 are arranged in chronological order according to their date of benchmarking, with many tagged with the 'Commodity System' (CS) numbering used by the DiSCO programme. Thus the DL_POLY_2 performance figures typically commence with the Cray T3E/1200 and culminate with the Intel Skylake systems. All the performance figures reported are normalised with respect to the simulation time recorded on the Cray T3E/1200. Figure 8 shows the relative performance of the DL_POLY_2 (DL_POLY Classic) NaCl (Bench 4) and Gramicidin (Bench7) simulations on a range of cluster systems using 32 MPI processes (see Table 6 for the index of Systems). Systems to the left side of the figures feature a variety of AMD Opteron systems alongside systems comprised of Intel processors (Pentium 3, Pentium 4, Itanium and Itanium2), with a number

Overall performance profile
An initial analysis of the 32-core MPI performance curves of Figure 8 suggests an overall increase in performance by a factor of 112 (NaCl) and a significantly smaller figure of 47 for the Gramicidin test case. This behaviour is not unexpected given the percentage of total run time spent in communications. As shown in Figure 3, this percentage is similar in the Bench4 (32.2%) and Bench5 (35.7%) simulations, but increases significantly in the Gramicidin (Bench7) simulation, to 57.5%. The % improvement in MPI performance over the period in question is certainly less than that in CPU (see Figure 2), especially for short messages, and with the 3D torus interconnect topology a relatively strong feature of the T3E/1200, one would expect the scope for improved performance in Bench7 to be less than that found in Bench4. We again note that the MPI_Allreduce collective is the dominant MPI function in both simulations, accounting for 22% of the elapsed time in Bench4 and 38% in Bench7.
Figure 8. (Colour online) Relative performance of the DL_POLY_2 (DL_POLY Classic) NaCl (Bench 4) and Gramicidin (Bench7) simulations on a range of cluster systems using 32 MPI processes (cores) (see Table 6 for the index of Systems).
At first sight, these overall improvement factors certainly seem modest bearing in mind the Gflop peak performance of a single T3E/1200 node and the multi-Tflop attributes of the Skylake nodes. However, remember we are comparing the performance of 32 cores, corresponding to 32 T3E/1200 nodes and a fraction of many of the current Skylake nodes. We consider this point further below.

Performance peaks and troughs
Scanning across the spectrum of systems reveals a number of key performance points, with both high and low points of the performance profile. Considering the Bench4 (NaCl) benchmark and the peaks of Figure 8, then of the early systems, both the proprietary IBM pSeries (#15) and SGI Altix 3700 (#22), with Power4 and Itanium2 processors respectively, outperformed the Intel Pentium and AMD Opteron clusters. There are three main reasons for those systems that demonstrate relatively poor performance, the troughs in Figure 8.
Consideration of other processors e.g. Itanium2, IBM Power: DL_POLY performance on the Itanium2-based systems (#19, #20, #22, #33 and #42) from SGI and HP was highly competitive with the Pentium4 and Woodcrest processors at the time, but the failure to increase the clock speed led to an increasingly uncompetitive position.
In similar fashion, DL_POLY's performance on the early releases of IBM's Power processor and the associated pSeries of systems (#11, #15 and #23) was highly competitive, although performance on the Power6 pSeries 575 (#59) and Power8 S822LC 2.92 GHz (#99) was at best similar to that found on the Intel processors at the time. Table 1 has shown the evolution of the Intel processor family over time and the associated feature improvement. In order to demonstrate how these architectural improvements translate into the performance of DL_POLY, we show in Table 7 the corresponding improvements in DL_POLY performance as a function of processor family, taking for each architecture the top SKU considered, and the percentage performance improvement compared with the top SKU of the preceding processor generation.

Relative ordering of the Top SKUs within each processor generation
It should be noted that the performance of systems post-Westmere is somewhat inflated given that all simulations were undertaken in Turbo mode. Having said that, we find a typical performance improvement for both simulations of between 10% and 20% on moving from one processor generation to the next, very far from that associated with the potential offered through exploitation of the AVX instructions. In both benchmarks the exception lies in the Ivy Bridge (v2) and Haswell (v3) families where much smaller improvements are found: 5.3%/1.2% and 0.9%/1.8% respectively. Interestingly perhaps, the initial release of the quad-core Clovertown shows a decrease in performance compared to the earlier dual core Woodcrest, by 10 3 ) that measures the throughput or rate of a system carrying out a number of tasks. Covering the period 2006-2017, this is not entirely straightforward, for these benchmarks typically focus on a fully loaded node rather than the HPC context of multiple nodes using MPI, and there is no easy way to compare 32 PE benchmarks in use here across the variety of systems under consideration. We have chosen instead to simply use the SPEC CFP2006 benchmarks, and compare in Table 8 the relative performance of SPEC ratings for systems analogous to those used in the present study. Table 8 reveals a total performance increase by a factor of 11.5 when comparing the SPEC CFP 2006 (Peak) rating of the Intel Xeon Gold 6150 system with that based on the AMD Opteron 2218. That this factor is significantly greater than the corresponding factor derived from the DL_POLY Classic benchmarks, by roughly a factor of two, is not surprising. The SPEC CFP 2006 benchmark measures the performance of a single process that is not constrained by the inevitable memory bandwidth issues encountered on a fully loaded node (in contrast to the corresponding SPECfp RATE benchmark), and does not reflect the impact of communication constraints encountered in the MPI-based DL_POLY benchmarks.

Performance measurements using DL_POLY 3/4
As with DL_POLY Classic, the naming convention adopted for each of the systems listed in Table 9 is broadly defined as follows: [ The systems of Table 9 are arranged in chronological order according to their date of benchmarking. The DL_POLY 3 performance figures commence with a system some 5 years on from the Cray T3E/1200 used in the DL_POLY_2 analysis,  Figure 10) show an even greater improvement when moving from DL_POLY 3 to DL_POLY 4 -90% and 172% faster for the 32 and 128-core simulations respectively.
As with the Westmere-based systems above, the performance improvements moving from DL_POLY 3 to DL_POLY  recent system benchmarked featuring the AMD EPYC 7601 (#92), a processor that promises to place AMD back on the HPC roadmap.

Overall performance profile
An initial analysis of the NaCl 32- and 128-core performance plots of Figure 9 suggests an almost identical increase in performance relative to the IBM Opteron-based system (#1): a factor of 32.7 (32-core) and 32.9 (128-core). Corresponding values for the Gramicidin simulations of Figure 10 show increased figures of 41.1 (32-core) and 61.5 (128-core). The effectively identical NaCl factors are perhaps surprising in that the 128-core figure appears insensitive to the inherent limitations of the GBitE connectivity of the baseline system. Using the Intel MPI environment variable I_MPI_STATS (IPM-style statistics) enables an estimate of the percentage of total run time spent in communications. A breakdown of the time spent in each MPI function as a percentage of the overall run time for both 32-core NaCl and Gramicidin simulations on the Broadwell-based Infiniband (EDR) cluster [#79] suggests that MPI communication traffic accounts for 9.7% of the overall run time for NaCl, and 10.1% for Gramicidin. Corresponding figures for the 128-core runs are 25.1% (NaCl) and 22.3% (Gramicidin). Note the significant decrease in these contributions in DL_POLY 4 compared to those found in DL_POLY Classic (Figure 3).

Performance peaks and troughs
Scanning across the spectrum of systems reveals a number of key performance points, with both high and low points of the performance profile. Considering the 32-core NaCl benchmark and the peaks of Figure 9, then of the early systems based on DL_POLY 3, strong performance is seen from the Woodcrest-based dual-core system from Streamline (#12, PIF = 6.9), the quad-core SGI and Bull Harpertown clusters (#27, #32, PIF = 7.4) and the subsequent Nehalem X5570 cluster (#43, PIF = 10.2). A significant performance increase arrived with the transition to DL_POLY 4 and those systems featuring Intel's Sandy Bridge processors (#57, PIF = 22.0). Systems based on subsequent generations of Intel processors showed
Table 9. HPC and Cluster Systems Naming Convention used in analysing the performance of DL_POLY 3 and DL_POLY 4.
Similar features are found with the Gramicidin benchmark and the peaks of Figure 10, at least for the 32-core simulations. Of the early systems based on DL_POLY 3, strong performance is again seen from the Woodcrest-based dual-core system from Streamline (#12, PIF = 6.6), the quad-core Bull Harpertown cluster (#32, PIF = 6.5) and the subsequent Nehalem X5570 cluster (#43, PIF = 11.1). A significant performance increase arrived with the transition to DL_POLY 4 and those systems featuring Intel's Sandy Bridge processors (#57, PIF = 26.9). Systems based on subsequent generations of Intel processors showed ca. 10% improvements in 32-core simulations, notably the Ivy Bridge (#66, PIF = 30.1), Haswell (#74, 33.3) and
Turning to the weaker performing systems, there are three main characteristics of those systems demonstrating relatively poor performance, the troughs in Figures 9 and 10. Note that Figure 11 provides greater granularity by focusing on the 128-core simulations based on the use of DL_POLY 4.
Figure 11. (Colour online) Relative performance of the DL_POLY 4 NaCl and Gramicidin simulations on a range of cluster systems using 128 cores (MPI processes) (see Table 8 for the index of Systems).
The poor showing of the more recent AMD Opteron-based systems. All of these systems, the Cray XT4 Opteron 2.3 GHz QC (#35), the 12-c AMD Magny-Cours 6174 2.2 GHz (#44) and 8-c AMD Interlagos Opteron 6220 3.0 GHz (#45), fall well behind the surrounding systems housing Intel processors in both the NaCl and Gramicidin simulations.
The major role of processor clock speed. As with DL_POLY Classic, one clear indicator across all processor families of Figures 9 and 10 is the improvement in DL_POLY 3/4 performance with increasing clock speed. Thus considering the systems based on Intel's Nehalem processors, and the 32-core NaCl simulations, we find improved performance relative to the L7555 Nehalem EX (#37) of 12.76% by the Viglen-based Nehalem E5520 2.27 GHz (#38), 24.6% by the X5550 2.67 GHz (#39), 29.1% by the X5560 2.8 GHz (#40) and 36.2% by the X5570 2.93 GHz (#43). Corresponding improvements in the 32-core Gramicidin simulations are 19.3% by the Nehalem E5520 2.27 GHz (#38), 30.6% by the X5550 2.67 GHz (#39), 33.3% by the X5560 2.8 GHz (#40) and 39.3% by the X5570 2.93 GHz (#43).
Systems featuring the Skylake Gold 6150 2.7 GHz (#91) and Gold 6142 2.6 GHz (#89) show 32-core NaCl improvements by 22.0% and 17.7% respectively relative to the Platinum 8170 2.1 GHz system (#87). Corresponding figures for Gramicidin are 23.0% for the Skylake Gold 6150 2.7 GHz (#91) and 18.7% for the Gold 6142 2.6 GHz (#89). Note that similar behaviour to that noted above is found across all the processor families when considering the 128-core simulations for both NaCl and Gramicidin.
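The relationship between clock speed and measured improvement can be checked with a short calculation using the Skylake figures above. This sketch compares nominal clock frequencies only; since these systems may run in Turbo mode, the actual frequencies under load are an unknown here, and the function names are hypothetical.

```python
def pct_increase(old, new):
    """Percentage increase from old to new."""
    return 100.0 * (new - old) / old


def clock_scaling_fraction(perf_gain_pct, base_ghz, ghz):
    """Measured performance gain as a fraction of the nominal clock-speed gain."""
    return perf_gain_pct / pct_increase(base_ghz, ghz)


BASE_GHZ = 2.1  # Platinum 8170 (#87), nominal clock

# (system, nominal GHz, NaCl gain %, Gramicidin gain %), 32-core runs from the text.
SYSTEMS = [
    ("Gold 6150 (#91)", 2.7, 22.0, 23.0),
    ("Gold 6142 (#89)", 2.6, 17.7, 18.7),
]

if __name__ == "__main__":
    for name, ghz, nacl, gram in SYSTEMS:
        print(f"{name}: clock +{pct_increase(BASE_GHZ, ghz):.1f}%, "
              f"NaCl fraction {clock_scaling_fraction(nacl, BASE_GHZ, ghz):.2f}, "
              f"Gramicidin fraction {clock_scaling_fraction(gram, BASE_GHZ, ghz):.2f}")
```

On nominal clocks, roughly three quarters of the frequency increase appears as improved DL_POLY performance for these two systems, consistent with clock speed being a major, though not the only, factor.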
Consideration of other processors e.g. Itanium2, IBM Power. DL_POLY performance on the Itanium2-based systems, the SGI Altix (#4) and HP SD64B Itanium2 9050 (#16), was at best competitive with the AMD dual core Opteron processors at the time, with the failure to increase the clock speed beyond 1.6 GHz leading to an increasingly uncompetitive position. The scalability of both systems is modest, with the NaCl parallel scaling factors (Time_32/Time_128) ca. 2.70, and those for Gramicidin lower still, just 1.97 for the HP SD64B Itanium2 9050 system.
In similar fashion, DL_POLY's performance on the IBM Power6 pSeries 575 (#23) and Power8 S822LC 2.92 GHz (#85) is at best similar to that found on the Intel processors at the time. Thus the pSeries 575 delivered a PIF value of 5.22, compared to the SGI Ice X5365 Clovertown system (#21, PIF = 5.23), while the PIF for the more recent IBM Power8 S822LC system (29.38) is comparable to that of the Boston Broadwell system with 2.2 GHz processors (29.77). A somewhat higher PIF value of 5.87 is found for the 32-core Gramicidin on the IBM pSeries 575, while the IBM Power8 S822LC system shows an identical PIF value to that for NaCl (29.38), with the latter increasing significantly to 40.59 on 128 cores, consistent with the use of DL_POLY 4. Table 1 has shown the evolution of the Intel processor family over time and the associated feature improvement. In order to demonstrate how these architectural improvements translate into the performance of DL_POLY, we show in Table 10 the corresponding improvements in DL_POLY 3/4 performance as a function of processor family, taking for each architecture the optimum system, typically the top SKU considered, and the percentage performance improvement compared with the optimum SKU of the preceding processor generation.
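The parallel scaling factors quoted for the Itanium2 systems can be converted into parallel efficiencies, taking the ratio of the 32-core to the 128-core run time against the ideal factor of 4. This is a minimal sketch and the function name is hypothetical.

```python
def parallel_efficiency(scaling_factor, cores_small, cores_large):
    """Measured speed-up (T_small / T_large) divided by the ideal speed-up."""
    ideal = cores_large / cores_small
    return scaling_factor / ideal


if __name__ == "__main__":
    # Scaling factors (Time at 32 cores / Time at 128 cores) from the text.
    for label, factor in [("NaCl", 2.70), ("Gramicidin, HP SD64B", 1.97)]:
        eff = parallel_efficiency(factor, 32, 128)
        print(f"{label}: speed-up {factor:.2f} of 4.00, efficiency {100 * eff:.1f}%")
```

A scaling factor of 2.70 therefore corresponds to about two thirds of ideal scaling between 32 and 128 cores, and 1.97 to just under one half.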

Relative ordering of the Top SKUs within each processor generation
It should be noted that the major improvement in performance of the Westmere system (#55) arises from the use of DL_POLY 4; all earlier systems used DL_POLY 3. Having said that, we find a typical performance improvement for both simulations of between 10% and 20% on moving from one processor generation to the next, very far from that associated with the potential offered through exploitation of the AVX instructions. In both benchmarks the exception lies in the Ivy Bridge (v2) family where much smaller improvements are found, 5.6% and 7.0% for the 32- and 128-core NaCl simulations, 4.0% and 1.0% for the corresponding Gramicidin simulations. As with DL_POLY Classic, the initial release of the quad-core Clovertown shows a major decrease in performance compared to the earlier dual-core Woodcrest, by 32.1% (NaCl) and 70.2% (Gramicidin) based on the 32-core simulations.

Comparison with SPEC CPU 2006
As with DL_POLY Classic, we consider the SPECfp and SPECfp RATE benchmarks from the Standard Performance Evaluation Corporation (SPEC) by way of a benchmark comparison. There are a number of available benchmarks, with the most appropriate here being the SPEC CPU 2006. We again chose to simply use the SPEC CFP2006 benchmarks, and compare in Table 11 the relative performance of SPEC ratings for systems analogous to those used in the present study. As before, Table 11 reveals a total performance increase by a factor of 11.5 when comparing the SPEC CFP 2006 (Peak) rating of the Intel Xeon Gold 6150 system with that based on the AMD Opteron 2218, to be compared with the DL_POLY 3/4 factor of 6.39. We again note that this factor is significantly less than that given by the SPEC CFP rating, by roughly a factor of two. This is only to be expected as the SPEC benchmark (i) measures the performance of a single process that is not constrained by the inevitable memory bandwidth issues encountered on a fully loaded node (in contrast to the corresponding SPECfp RATE benchmark), and (ii) does not reflect the impact of communication on performance as encountered in the MPI-based DL_POLY benchmarks.

Selecting fabrics - MPI optimisation
The performance of collective communication operations is known to have a significant impact on the scalability of some applications. Indeed, the global, synchronous nature of some collective operations implies that they will become the bottleneck when scaling to very large node counts. This has led many researchers to try to improve the efficiency of collective operations, with one popular approach to improving the implementation of MPI collective operations featuring the use of intelligent or programmable network interfaces to offload the burden of communication activities from the host processor(s). One MPI library that offers such capabilities is Mellanox's HPC-X™ [34], and we consider here the impact of this offload approach on the performance of DL_POLY.
Mellanox HPC-X™ is a software package which provides enhancements to increase the scalability and performance of message communications in the network, supporting MPI, PGAS/SHMEM and UPC. This is addressed through various acceleration packages, including MXM (Mellanox Messaging) which accelerates the underlying send/receive (or put/get) messages, and FCA (Fabric Collectives Accelerations) which accelerates the underlying collective operations used by the MPI/PGAS languages [79][80][81].
Mellanox HPC-X™ looks to take advantage of the Mellanox hardware based acceleration engines that are part of the Mellanox adapter (CORE-Direct engine) and switch (SHARP engine) solutions. The latter improves upon the performance of MPI operations by offloading collective operations from the CPU to the switch network, and by eliminating the need to send data multiple times between endpoints. Implementing collective communication algorithms in the network also has additional benefits, such as freeing up CPU resources for computation rather than using them to process communication.
To shed additional light on both the optimal interconnect (Mellanox EDR vs. Intel Omni-Path [82]) and the MPI library of choice, two DL_POLY 4 performance comparison exercises were undertaken.
(1) Analysis of the performance with increasing core count using both Mellanox EDR and Intel OPA as the system interconnect. Other applications considered and reported elsewhere include GROMACS [4,5], Quantum Espresso [83], VASP [84] and OpenFOAM [85].
(2) For each application (and associated data sets), contrast the performance using the Intel MPI and Mellanox HPC-X MPI libraries on the EDR connected cluster, i.e. the ratio T_HPC-X / T_Intel-MPI.
Figure 12 shows the results for the NaCl and Gramicidin simulations as a function of the number of cores, using the Dell E5-2697A v4 Broadwell-based 'Thor' cluster at the HPC Advisory Council, together with the Intel 'Diamond' cluster, featuring the same E5-2697A v4 Broadwell-based node with OPA interconnect. These clearly suggest that for both simulations there is little to choose between the Mellanox EDR and Intel OPA connected clusters at small to medium processor counts. As the processor count increases for NaCl, Figure 12(a) suggests that the EDR connected cluster is the superior network, outperforming the OPA cluster by some 26% at 1,024 cores when both are based on Intel MPI. In this case Intel MPI appears more effective than HPC-X. However, for Gramicidin the position is reversed, with the OPA cluster marginally faster (7%) than that based on EDR, with the HPC-X enabled code outperforming that based on Intel MPI, albeit by only 5%. We note in both instances the OPA-connected Diamond cluster is marginally slower than its Thor counterpart.
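The library comparison in exercise (2) reduces to a ratio of elapsed times at each core count. The following is a minimal sketch of that calculation; the function name and the timings are hypothetical placeholders chosen for illustration, not measurements from Figure 12.

```python
def mpi_time_ratio(t_hpcx, t_intel_mpi):
    """Return T_HPC-X / T_Intel-MPI; values below 1 favour HPC-X."""
    if t_hpcx <= 0 or t_intel_mpi <= 0:
        raise ValueError("timings must be positive")
    return t_hpcx / t_intel_mpi


# Hypothetical elapsed times in seconds, keyed by core count:
# (time with HPC-X, time with Intel MPI). These are NOT measured data.
timings = {
    64: (100.0, 100.0),
    256: (30.0, 28.0),
    1024: (12.0, 10.0),
}

ratios = {cores: mpi_time_ratio(*pair) for cores, pair in timings.items()}

for cores in sorted(ratios):
    print(f"{cores:5d} cores: T_HPC-X / T_Intel-MPI = {ratios[cores]:.2f}")
```

A ratio above 1 at a given core count means the Intel MPI run finished sooner; tracking the ratio against core count shows where, if anywhere, the two libraries diverge.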

Performance on multi-core systems
Section 4 has provided at best a qualitative view of the DL_POLY performance across a broad variety of systems covering multiple generations of processor and cluster technologies. We augment that discussion in the present section with a more considered view of the performance on recent cluster systems, focusing on both Intel Broadwell and Skylake clusters. Using the 32-core timings from the Fujitsu CX250 Sandy Bridge E5-2670/2.6 GHz IB-QDR cluster as a baseline, we show in Figure 13 the performance improvements for DL_POLY 4 over a range of systems, considering both data sets (NaCl and Gramicidin) described in section 2.

Results with Xeon Phi Knights Corner (KNC)
The first generation of Intel's Xeon Phi co-processor [28], Knights Corner (KNC), was a novel accelerator technology that provided a number of appealing features [29]. These included many cores (60 cores with 240 hardware threads for the mid model), low power consumption, wide vector instructions, the same set of instructions as an Intel CPU, support for popular and standardised programming models such as MPI and OpenMP, and a theoretical peak of 1 TFlop/s. Due to its versatility it could be used in native mode, offload mode, or MPI symmetric mode. In this section we will concentrate on results for the Gramicidin test case in native mode for DL_POLY 4.
Initial MPI-only testing on KNC showed poor performance. This is attributed to a number of factors, but was mainly due to the ring architecture for memory in the in-node tests and to the poor connectivity between nodes housing KNCs.
We have already seen from previous sections of this paper the poor vectorisation of DL_POLY 4. In this section we will analyse an OpenMP version of DL_POLY 4. The work was carried out at the Irish Centre for High End Computing [86] under the umbrella of an Intel Parallel Computing Centre [87]. From in-depth benchmarking we decided on two targets for OpenMP parallelisation focused on three segments of the code: (i) linked cell list creation, (ii) two body forces, short range calculations and (iii) holonomic constraints via the SHAKE iterative method [46,47]. Together, the three segments account for around 90% of the computation time for each time step.
All the results presented below were obtained using Intel Cluster Studio 2016 β1 [88] and MPSS 3.5 [89] on an Intel Xeon Phi 5110P with an Intel Xeon E5-2660 v2 @ 2.20 GHz (Ivy Bridge) host processor. Compiler options used are '-O3 -mmic'.
The Gramicidin system was run using direct evaluation of the kernel for the two body forces rather than the classical tabulated form. The strong scalability for OpenMP is presented in Figure 14, for runs invoking both one and two MPI processes. The OpenMP thread placement was driven via KMP_PLACE_THREADS, replacing the by-hand placement used initially. For a constant number of threads per core, good scalability with increasing core numbers is seen for both MPI cases, varying from ca. 80% efficiency when using only one thread per core to 52% when using four threads per core and one MPI process. In the case of two MPI processes with four threads per core we obtain an efficiency of 60%. The overall speedup factor on the Intel Xeon Phi for a single MPI process is ca. 63, decreasing to ca. 37 for two MPI processes, indicating a significant performance degradation attributed to poor MPI communication. We notice that for the single MPI process case and one core, increasing the subscription of the core from one to two threads gains around 48%, while from one to three threads the gain is 78%, and using all four threads on the core results in a 97% gain.
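The speedup, efficiency and per-core thread gain figures quoted above can be related through a few one-line definitions. The sketch below assumes the conventional definitions (speedup as a ratio of elapsed times, efficiency as speedup divided by the increase in resources, and gain as the percentage increase in speed); the text does not spell these out, and all timings used are hypothetical.

```python
def speedup(t_ref, t_par):
    """Speedup of a parallel run relative to a reference run."""
    return t_ref / t_par


def parallel_efficiency(t_ref, t_par, resource_ratio):
    """Speedup divided by the factor by which resources were increased."""
    return speedup(t_ref, t_par) / resource_ratio


def thread_gain_percent(t_one_thread, t_n_threads):
    """Percentage increase in speed from adding threads on one core
    (assumed definition: 100 * (speedup - 1))."""
    return 100.0 * (speedup(t_one_thread, t_n_threads) - 1.0)


# Hypothetical timings in seconds, for illustration only.
t_1_core = 100.0    # one core, one thread
t_60_cores = 2.0    # sixty cores, one thread per core
t_4_threads = 50.0  # one core, four threads

print(f"speedup on 60 cores: {speedup(t_1_core, t_60_cores):.1f}")
print(f"efficiency: {parallel_efficiency(t_1_core, t_60_cores, 60):.0%}")
print(f"gain, 1 -> 4 threads: {thread_gain_percent(t_1_core, t_4_threads):.0f}%")
```

Under these definitions a gain close to 100% from one to four threads corresponds to roughly halving the time per step on that core.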
For Xeon we used 10 MPI processes for one socket and 20 MPI processes for the two sockets case. In the case of the Xeon Phi the best performance is found when using 30 MPI processes, with each of them featuring two cores fully subscribed with threads. For the Xeon Phi one can see how performance improves by using extra threads on each core.
One notable result is the improvement on both host and Xeon Phi for two body forces resulting from changing the way in which the two body forces are evaluated. The gains in the case of Gramicidin are ca. 11% on host and 24% on Xeon Phi for two body forces, see Figure 15. Since two body forces represent only around 30% of the total time for a time step, these improvements do not translate into major improvements overall. This result is perhaps somewhat unexpected, since we are increasing the number of floating-point operations but achieving better times. This is explained simply by better utilisation of vector instructions.
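The limited overall benefit follows from Amdahl's law. The sketch below works through the arithmetic under the assumption that the quoted gains are fractional reductions in the time spent in two body forces; the text does not state the definition, and the function names are illustrative.

```python
def overall_time_fraction(section_fraction, section_speedup):
    """Amdahl's law: run time after optimisation as a fraction of the original.

    section_fraction: share of the original time spent in the optimised section.
    section_speedup:  factor by which that section alone became faster.
    """
    if not 0.0 <= section_fraction <= 1.0:
        raise ValueError("section_fraction must lie in [0, 1]")
    if section_speedup <= 0.0:
        raise ValueError("section_speedup must be positive")
    return (1.0 - section_fraction) + section_fraction / section_speedup


def speedup_from_time_reduction(reduction):
    """Convert a fractional time reduction (e.g. 0.24) into a speedup factor."""
    return 1.0 / (1.0 - reduction)


TWO_BODY_FRACTION = 0.30  # share of a time step, from the text

# Assumption: the 11% and 24% gains are time reductions within the section.
for label, reduction in (("host", 0.11), ("Xeon Phi", 0.24)):
    remaining = overall_time_fraction(
        TWO_BODY_FRACTION, speedup_from_time_reduction(reduction)
    )
    print(f"{label}: time per step reduced by {100.0 * (1.0 - remaining):.1f}%")
```

With these assumptions the overall reduction per time step is roughly 3% on the host and 7% on the Xeon Phi, which is why the section-level gains do not carry through to the total.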
These results suggest that for DL_POLY 4 to take advantage of modern HPC hardware, one needs a major change to the current memory access patterns to allow both efficient vectorisation and multithreading.
[Figure: (Colour online) Optimum performance for two body forces on Xeon E5-2660 v2 (one and two sockets) and Xeon Phi (5110 and 7120); the lower the better. The pale orange area indicates that a tabulated method was used for the potential, while for the green areas, direct evaluation of the potential was used.]
• xCOMMON-AVX512: Compiles for any currently available AVX-512 hardware, enabling only F and CD instructions.
• qopt-zmm-usage=high: Enables the compiler to more aggressively target zmm register usage, which can improve performance for some code bases on AVX-512 architectures, since by default with xCORE-AVX512 the compiler is quite cautious about using zmm registers.
The results with Skylake show relatively small changes in performance with different instruction sets enabled, which is due to the low percentage of DL_POLY 4 which can be vectorised by the compiler. As expected, using xCORE-AVX512 performed better than xCOMMON-AVX512 as it can make use of the extra features available in Skylake Xeon processors, even though it may be less aggressive in trying to use the new zmm registers than the xCOMMON-AVX512 flag.
It is apparent that DL_POLY 4 does not make good use of the hyperthreading available on the Xeon Phi KNL 7210: on a single node, performance is similar whether one or two MPI ranks per core are used. This is likely due to the cost of using an increased number of MPI ranks to perform the same computation, as load balance may worsen and the cost of collective operations increases.
To better understand the performance on KNL we performed a roofline analysis [93]. Figure 18 shows a roofline analysis for DL_POLY 4 on the KNL 7210, featuring the major hotspots of the code (this analysis was single threaded). All of the hotspots are significantly below the DRAM bandwidth peak, implying the code does not make good use of caches and could potentially see significant improvements from trying alternative data layouts, such as storing the data for performance-critical sections of the code in structures that fit in a small number of cache lines, rather than being spread amongst a variety of arrays.
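For readers unfamiliar with the roofline model, the bound it places on a kernel is the smaller of the peak floating-point rate and the product of arithmetic intensity and memory bandwidth. The sketch below illustrates that bound; the machine parameters, hotspot names and measured rates are all hypothetical and are not taken from Figure 18 or from KNL specifications.

```python
def roofline_bound(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s for a kernel under the basic roofline model.

    arithmetic_intensity: floating-point operations per byte moved.
    """
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def fraction_of_roof(measured_gflops, arithmetic_intensity,
                     peak_gflops, bandwidth_gbs):
    """How close a measured kernel sits to its roofline bound (1.0 = on the roof)."""
    bound = roofline_bound(arithmetic_intensity, peak_gflops, bandwidth_gbs)
    return measured_gflops / bound


# Hypothetical machine: illustrative values only.
PEAK_GFLOPS = 1000.0
DRAM_GBS = 100.0

# Hypothetical hotspots: name -> (arithmetic intensity, measured GFLOP/s).
hotspots = {
    "two_body_forces": (0.5, 5.0),
    "link_cell_lists": (0.1, 1.0),
    "constraints_shake": (20.0, 40.0),
}

for name, (intensity, measured) in hotspots.items():
    bound = roofline_bound(intensity, PEAK_GFLOPS, DRAM_GBS)
    share = fraction_of_roof(measured, intensity, PEAK_GFLOPS, DRAM_GBS)
    print(f"{name}: bound {bound:.0f} GFLOP/s, achieved {share:.0%} of bound")
```

A hotspot that sits well below the bandwidth-limited part of the roof, as reported above, is limited by neither peak compute nor streaming bandwidth, which points towards latency and data layout.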

Power consumption measurements
Although much of this paper is focused on the performance attributes of DL_POLY, given the high energy costs, fine-grained power consumption accounting and being able to make users of HPC clusters aware of the cost of their computation are becoming increasingly important. In this section we present power consumption measurements of both individual application executions and whole computing nodes. In addition to DL_POLY, we include the molecular dynamics code GROMACS [4,5], the materials modelling codes Quantum Espresso [83] and VASP [84], and the computational engineering code OpenFOAM [85].
Power consumption was measured on the Intel Benchmarking facility for the above applications (see Figure 19), with power collection conducted through an administrative node using an IPMI LAN interface based on a tool written in C named ISCoL [94]. Collection of System, Processor and Memory power usage was made using Intel NodeManager API version 3.0. No software agents were running on the CPUs, with all queries administered by the BMC, ensuring there was zero overhead on the tested applications.
The following fields were collected and stored in CSV format for each node:
• TIME_US: time stamp in microseconds from start.
• POWER_GLOB_SYS: system power read from PSUs.
• POWER_GLOB_CPU: combined processor domain power. When the DTS value reaches Tcontrol (8-10°C) the system fans spin up to max; when DTS reaches 0°C thermal throttling is initiated.
• TEMP_CPU, CH, DIMM: absolute temperature in °C for each DIMM installed in the system.
Data was polled approximately every 50 ms (i.e. 20 times per second). Actual time between samples varied depending on the management network utilisation, but for the purposes of providing a power usage envelope and summary, small discrepancies in interval were considered acceptable.
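As an illustration of how such a log might be summarised, the sketch below computes peak power, time-weighted mean power and energy from the TIME_US and POWER_GLOB_SYS fields. It is not the ISCoL tool: the comma-separated layout with a header row is an assumption, and the sample values are invented.

```python
import csv
import io

# Invented sample in an ASSUMED layout (header row, comma-separated).
SAMPLE_CSV = """TIME_US,POWER_GLOB_SYS,POWER_GLOB_CPU
0,400,250
50000,420,270
100000,440,290
160000,400,250
"""


def summarise_power(csv_text, field="POWER_GLOB_SYS"):
    """Return peak power (W), time-weighted mean power (W) and energy (J)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if len(rows) < 2:
        raise ValueError("need at least two samples")
    times = [int(row["TIME_US"]) * 1e-6 for row in rows]  # seconds
    watts = [float(row[field]) for row in rows]

    # Trapezoidal integration copes with the uneven sampling interval.
    energy = 0.0
    for i in range(1, len(rows)):
        dt = times[i] - times[i - 1]
        energy += 0.5 * (watts[i] + watts[i - 1]) * dt

    duration = times[-1] - times[0]
    return {"peak_w": max(watts), "mean_w": energy / duration, "energy_j": energy}


summary = summarise_power(SAMPLE_CSV)
print(summary)
```

Weighting the mean by the actual time between samples, rather than averaging the raw readings, keeps the summary insensitive to the small interval discrepancies noted above.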
The Intel Benchmarking facility that ran the tests provided a detailed breakdown of power consumption for the applications running across a range of nodes. Due to time constraints it was not possible to run all tests across identical node counts. However, the range of results provided should give a very comprehensive view of the anticipated power usage as a function of end application code. Table 12 presents the Mean and Peak power (W) per node as a function of application, associated data set and number of nodes, with summary graphs provided in Figure 20, including a summary of the Mean and Peak power usage for each application. A clear conclusion from the power consumption ratings above and those from Figure 20 is the greater power usage figures arising from the materials science codes (VASP and Quantum Espresso) compared to those from the classical molecular dynamics codes. This can certainly be attributed in part to the far higher utilisation of vector instructions by the materials codes, compared for example to DL_POLY (see section 4). Thus both Quantum Espresso simulations show ca. 40% usage of AVX/SSE instructions based on analysis using Allinea's Performance Reports, while VASP usage ranges from 20% (zeolite cluster) to 30% (PdO complex). In contrast, DL_POLY 4's usage is as low as 5% in the Gramicidin simulation (see section 3.2).
The DL_POLY Gramicidin simulation shows comparable peak node power usage figures to both the ion channel and lignocellulose simulations using GROMACS (ca. 450 W/node). In contrast the Pd-O complex and zeolite simulations using VASP are more power demanding, with 505 W and 534 W/node respectively. The heaviest usage however is by the Quantum Espresso materials code, consuming 560 W/node in the 7-node simulation of the Au 112 cluster, and 614 and 636 W/node in the 7 and 13 node GRIR443 benchmarks.
The above discussion of power usage may suggest that high power consumption alone is unwelcome, but that is not necessarily the case. The key consideration here is energy consumption, so if doubling the mean power consumption means more than a factor of two decrease in time to solution, then it is beneficial. Based on the restricted data available, it is difficult to make any quantitative statements on energy efficiency. One area of future study, as suggested by a Referee, will be to look into the impact of increased vectorisation on energy efficiency by studying in more detail power measurements for the runs of Figure 16.
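The trade-off between power and time to solution described here can be made concrete with a short calculation of energy as mean power multiplied by run time. The figures below are hypothetical and serve only to illustrate the argument; they are not drawn from Table 12.

```python
def energy_to_solution_kj(mean_power_w_per_node, nodes, runtime_s):
    """Energy consumed by a run in kJ: mean power per node x nodes x time."""
    return mean_power_w_per_node * nodes * runtime_s / 1000.0


# Hypothetical runs on the same four nodes (illustrative values only).
baseline = energy_to_solution_kj(300.0, nodes=4, runtime_s=1000.0)
# Mean power doubles, but the run is 2.5 times faster.
vectorised = energy_to_solution_kj(600.0, nodes=4, runtime_s=400.0)

print(f"baseline:   {baseline:.0f} kJ")
print(f"vectorised: {vectorised:.0f} kJ")
print(f"energy ratio: {vectorised / baseline:.2f}")
```

In this invented case the run drawing twice the mean power still uses less energy, because its time to solution falls by more than a factor of two.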

Conclusions
In this paper, we have conducted a comprehensive performance evaluation and analysis of the DL_POLY molecular simulation code (both DL_POLY Classic and DL_POLY 3 & 4) using a number of standard benchmark cases. Our key findings are as follows:
• Performance reports for both the NaCl and Gramicidin simulations using DL_POLY 4 suggest that some 50%-60% of the processor time is spent in memory accesses, 35%-40% in scalar numeric Ops and only ∼5%-10% in vector numeric Ops. There is little difference between the findings from the NaCl and Gramicidin simulations. Clearly the present code is not well positioned to capitalise on the evolving architecture of Intel processors where enhancements rely on e.g. enhanced vector (AVX) instructions.
• Notwithstanding the time spent in memory accesses, a variety of experiments suggests that memory bandwidth sensitivity on the current generation of multi-core processors is unlikely to play a dominant role in affecting DL_POLY performance. This is not surprising since molecular dynamics applications in general feature irregular memory access patterns, making it difficult to keep data in cache, resulting in many cache misses and low performance. Such applications are typically characterised by strided memory accesses, and are hence more sensitive to memory latency than to bandwidth.
• The percentage of total run time spent in communications is different in DL_POLY Classic compared to the distributed data version of the code. In the former, IPM studies suggest that this percentage is similar in 256-core runs of Bench4, but increases significantly in the Gramicidin, Bench7 simulation (to 57.5%). The MPI_Allreduce collective is the dominant MPI function in all three simulations, accounting for up to 38% of the run time (in Bench7). These overall percentages are reduced somewhat in DL_POLY 3 and DL_POLY 4. 256-core runs of DL_POLY 4 show that 27.3% of the total run time is spent in communications in the NaCl simulation and 27.8% in Gramicidin. The impact of the MPI_Allreduce collective is significantly reduced compared to DL_POLY Classic, to 9.1% and 7.9% respectively. Message passing in both replicated and distributed data codes is dominated by short messages (< 256 Kbytes).
• An analysis of the performance of the DL_POLY Classic (DL_POLY_2) code carried out on 108 systems shows an overall increase in 32-core performance relative to the Cray T3E/1200 by a factor of 112 (NaCl) and a significantly smaller figure of 47 for the Gramicidin test case.
• The corresponding analysis for DL_POLY 3/4 carried out on 92 systems shows an overall increase in performance relative to the IBM e326 dual-core Opteron280 2.4 GHz by a factor of 32.7 (32-core) and 32.9 (128-core). Corresponding values for the Gramicidin simulations show increased factors of 41.1 (32-core) and 61.5 (128-core).
• The optimisations conducted in DL_POLY 4 lead to a significant improvement in performance in both NaCl and Gramicidin, particularly the 128-core runs of the latter.
• Both DL_POLY Classic and DL_POLY 3/4 performance results reveal a variety of systems that demonstrate relatively poor performance, caused by:
  - The poor performance of GbitE connected systems in both benchmarks;
  - The poor showing of the more recent AMD Opteron-based systems, with all such systems falling well behind the surrounding systems housing Intel processors; and
  - The major role of processor clock speed, with one clear indicator across all processor families being the improvement in DL_POLY performance with increasing clock speed.
• We find a typical performance improvement for both DL_POLY Classic and DL_POLY 3/4 simulations of between 10% and 20% on moving from one Intel processor generation to the next, very far from that associated with the potential offered through exploitation of the AVX instructions.
• In order for DL_POLY 4 to take advantage of modern HPC hardware, one needs a major change to the current memory access patterns to allow both efficient vectorisation and multithreading.
• Power consumption measurements are reported for DL_POLY 4 along with GROMACS, the materials modelling codes Quantum Espresso and VASP, and the computational engineering code OpenFOAM. A clear conclusion from these measurements is the greater power usage figures arising from the materials science codes (VASP and Quantum Espresso) compared to those from the classical molecular dynamics codes, although we have insufficient data to draw meaningful conclusions regarding energy consumption.