Sunway TaihuLight (ST) is the new No. 1 system of the June 2016 TOP500 list of the most powerful supercomputers in the world.
The exascale (1000 PFlop/s)[1] supercomputer race is roaring, but does HEP need supercomputers?
Sunway TaihuLight achieves 93 PFlop/s on the LINPACK benchmark (HPL) out of a theoretical peak of 125 PFlop/s. Second is Tianhe-2 (MilkyWay-2), another Chinese system, rated at about 34 PFlop/s. Third is the US Titan, built by Cray, with about 18 PFlop/s. In total, China amounts to 211 PFlop/s with 167 systems, while the US reaches 173 PFlop/s with 165 systems. Concerning ST, two important points should be stressed:
- The ~10 M cores on ~40,000 nodes of ST are Chinese-built, whereas the previous No. 1, Tianhe-2, is Intel-powered.
- The power consumption under load (running the HPL benchmark) is 15.37 MW, or about 6 GFlop/s per watt, which would rank it 2nd on the November 2015 Green500 list (the June 2016 listing is not yet available).
Among nations the competition is strong: China, the US, Japan, Germany, France, the UK, Saudi Arabia, Switzerland, Italy and South Korea are the top ten countries in terms of aggregate performance.
IBM is preparing the US answer to China with the Summit system and should regain the leadership in 2017. Summit is based on the 24-core Power9 chip with Nvidia's Volta as co-processor. It is expected to deliver a 200 PFlop/s peak performance for a 10 MW power consumption.
Supercomputer applications are numerous, from defense and weapons simulation to data mining, climate-change simulation, weather and tsunami forecasting, drug screening and the like. But the competition is also fueled by a political and geo-strategic agenda: computing independence, showcasing technological excellence and related military supremacy.
Still, these systems offer a lot of computing power, part of which is usually made available to the scientific community. For comparison, the LHC worldwide Grid comprised about 260,000 processing cores in 2013, roughly 40 times fewer than Sunway TaihuLight's 10 M cores. A small share (2-3%) of this monster would cover all the LHC needs, at a single site and without having to transfer and duplicate data over many clusters. Why, then, is the HEP community not using these resources?
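The comparison above is easy to check with a back-of-the-envelope calculation, using only the figures quoted in the text:

```python
# Core-count comparison: LHC worldwide Grid (2013) vs Sunway TaihuLight.
lhc_grid_cores = 260_000     # from the text
st_cores = 10_000_000        # from the text

ratio = st_cores / lhc_grid_cores          # how many times larger ST is
share = lhc_grid_cores / st_cores * 100    # share of ST matching the Grid

print(f"ST is ~{ratio:.0f}x the LHC Grid")            # ~38x, i.e. "roughly 40 times"
print(f"Grid-equivalent share of ST: {share:.1f}%")   # 2.6%, inside the quoted 2-3%
```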
The main reason is that most HEP programs, with one major exception, are not parallel and therefore cannot use supercomputers efficiently. Computer clusters (or farms) are much more appropriate: each event is allocated to a CPU-memory element, and the full reconstruction or simulation of each event is executed sequentially. The more computing elements are available, the more events are reconstructed or simulated in a given wall time.
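The one-event-per-core model can be sketched in a few lines. This is only an illustration of the scheme, not real reconstruction code: `reconstruct_event` is a hypothetical stand-in, and the event tuples are dummies.

```python
from multiprocessing import Pool

def reconstruct_event(event):
    """Stand-in for full event reconstruction: each event is processed
    independently, so workers need no communication between them."""
    event_id, raw_hits = event
    return event_id, sum(raw_hits)   # dummy "reconstructed" quantity

if __name__ == "__main__":
    # Each (id, hits) tuple plays the role of one recorded collision event.
    events = [(i, list(range(i, i + 5))) for i in range(100)]
    with Pool(processes=4) as pool:   # 4 independent CPU-memory elements
        results = pool.map(reconstruct_event, events)
    # Throughput scales with the number of computing elements: more
    # workers means more events processed in the same wall time.
    print(len(results))   # 100
```

Because events are independent, adding nodes to a farm increases throughput linearly, which is exactly why clusters suit HEP so well.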
One event/one CPU is the principle which gave rise to the Grid architecture: a distributed tree of tiers of clusters of processing elements (13 Tier-1 and about 160 Tier-2 sites). The data produced at CERN (Tier 0) is prepared and dispatched to the various Tier-1 sites. That requires high-speed long-range connections and the storage and management of multiple sets of the same original data (ensuring, in a way, a high level of data security against possible loss).
Today we are facing a paradigm change. For many years, increasing CPU performance was simply a matter of raising the chip clock frequency. But the generated heat, which grows with the frequency, reaches a limit beyond which overheating would destroy the chip. In addition, the electricity consumption of large servers becomes unacceptable, both for operational-cost reasons and for energy and environmental issues. Chip makers have therefore switched their approach from high frequency to high density, multiplying the number of processing cores on a single die. For example, the ST basic chip contains 260 processing cores sharing the same 32 GB of memory. But the memory size and bandwidth are not large enough to run 260 independent copies of the same HEP program efficiently. So to benefit from this large number of cores, the program itself has to be parallelized, so that as many cores as possible work on the same event.
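The memory arithmetic makes the point concrete. The 32 GB and 260-core figures are from the text; the ~2 GB per-job footprint is an assumed, typical figure for an LHC reconstruction job, not a number from the text:

```python
# Why 260 independent copies of a HEP program do not fit on one ST chip.
node_memory_gb = 32          # shared memory per ST chip (from the text)
cores = 260                  # processing cores per chip (from the text)
typical_hep_job_gb = 2.0     # assumed footprint of one LHC job (illustrative)

per_core_mb = node_memory_gb / cores * 1024
copies_that_fit = node_memory_gb // typical_hep_job_gb

print(f"Memory per core: {per_core_mb:.0f} MB")        # ~126 MB per core
print(f"Independent copies that fit: {copies_that_fit:.0f} of {cores}")
```

With only ~126 MB per core, one-event-per-core is hopeless on such a chip: only a handful of independent job copies fit in memory, so the cores must cooperate on the same event instead.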
This is exactly what the HEP community is now preparing. The sequential Geant4, the main engine at the heart of HEP event simulation, is being rewritten so that it can run on parallel architectures.[2] GeantV[3] will have each basic physics algorithm parallelized.
The notable exception mentioned earlier is lattice QCD calculation (LQCD[4]). The purpose is to compute, non-perturbatively (to the full extent of the theory), quantities like particle masses and couplings in the QCD framework of the strong interaction (the force that holds nuclei and nucleons together). The calculation involves the inversion of huge sparse matrices on a 4-D lattice. The best way to address this CPU-intensive calculation is to distribute the matrix elements over as many nodes as possible, carrying out the computation in parallel. Recently, a breakthrough in the theoretical approach, combined with the availability of much more powerful supercomputers and dedicated parallel systems, has resulted, in particular, in a very precise hadron-spectrum calculation.
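At the heart of such calculations sits the iterative solution of A x = b for a huge sparse matrix A. Here is a minimal, single-node conjugate-gradient sketch on a tiny toy operator, for illustration only: in a real LQCD code each node holds a slice of the 4-D lattice, and the sparse matrix-vector product below is precisely the step performed in parallel across nodes.

```python
# Minimal conjugate-gradient (CG) solver for A x = b with sparse A.
# The toy operator below is a symmetric positive-definite tridiagonal
# matrix, standing in for the (vastly larger) lattice Dirac operator.

def spmv(rows, x):
    """Sparse matrix-vector product; rows[i] maps column index -> value."""
    return [sum(v * x[j] for j, v in row.items()) for row in rows]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(rows, b, tol=1e-12, max_iter=100):
    x = [0.0] * len(b)
    r = b[:]                  # residual b - A x, with initial guess x = 0
    p = r[:]
    rs = dot(r, r)
    for _ in range(max_iter):
        ap = spmv(rows, p)
        alpha = rs / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Toy sparse operator: tridiagonal (-1, 2, -1), i.e. a 1-D Laplacian.
n = 5
rows = [{i: 2.0} for i in range(n)]
for i in range(n - 1):
    rows[i][i + 1] = -1.0
    rows[i + 1][i] = -1.0
b = [1.0] * n
x = conjugate_gradient(rows, b)
print([round(v, 6) for v in x])
```

CG only ever touches A through `spmv`, which is why distributing the matrix elements over many nodes works so naturally: each node computes its own slice of the product and the nodes exchange boundary data.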
This great achievement could go further and take full advantage of the monster supercomputers, were they made available:
> As we move to exascale resources, the list of mature calculations will grow. Examples that we expect to mature in the next few years are results for excited hadrons, including quark-model exotics, at close to physical light-quark masses; results for moments of structure functions; (…)[5]
Finally, a supercomputer in reality never runs at 100% occupancy:
> Since HPC centers are geared toward large-scale jobs, about 10% of the capacity of a typical HPC machine is unused due to a mismatch between job sizes and available resources. A rough estimation of a potential resource is 300 M CPU hours per year, primarily available via short jobs. PanDA[6] can be a vehicle to harvest Titan resources opportunistically. (Klimentov et al.[7])
It has been shown that on Titan (US, No. 1 on the November 2013 TOP500 list) these "free" cores can be used for running the usual LHC data analysis and simulation without degrading the supercomputer's regular activity.
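The idea behind this harvesting is backfilling: slotting short HEP jobs into the idle nodes left between large allocations. The sketch below is purely illustrative; the gap sizes, job mix and greedy policy are invented, and PanDA's real brokerage is far more sophisticated.

```python
# Toy backfill: place short (nodes, hours) jobs into idle (nodes, hours)
# gaps without delaying the large jobs that define those gaps.

def backfill(gaps, jobs):
    """Greedily fill each idle gap with the biggest jobs that fit."""
    placed = []
    for nodes_free, hours_free in gaps:
        # Try the largest jobs (by node-hours) first.
        for job in sorted(jobs, key=lambda j: j[0] * j[1], reverse=True):
            j_nodes, j_hours = job
            if job not in placed and j_nodes <= nodes_free and j_hours <= hours_free:
                placed.append(job)
                nodes_free -= j_nodes   # those nodes are now busy
    return placed

gaps = [(300, 2), (50, 1)]                           # idle nodes, idle hours
hep_jobs = [(100, 1), (200, 2), (40, 1), (500, 4)]   # short simulation jobs
print(backfill(gaps, hep_jobs))   # the 500-node job is too big and stays out
```

The key constraint is visible in the `if`: a harvested job must fit both the free nodes and the remaining idle time, so it never displaces or delays the machine's regular large-scale workload.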
In this race to the exascale frontier, HEP research may get more power from supercomputers, and we have shown that HEP can make good and efficient use of it. But the HEP computing community should get more involved in the design, testing and benchmarking of the coming supercomputers.[8]
1. A PFlop/s is a quadrillion floating-point calculations per second.
2. ACAT 2014, J. Apostolakis's presentation.
3. ACAT 2016, G. Amadio's presentation.
4. For a recent review, see W. Bietenholz.
5. Particle Data Group, "outlook".
6. Production and Distributed Analysis.
7. ACAT 2014 presentation.
8. Attending a presentation of the K (Kei) supercomputer in Japan, I was shocked to see that many research domains had provided specific benchmarks (condensed matter, biology, environment, CAD, …) to help design the system, but none originated from HEP.