High-Performance Computing Facility
The need for computing power and data storage at MPI DS arises from numerical computations and simulations as well as from data acquisition and evaluation in experiments. For both purposes, the HPC group provides file servers for personal and project data, HPC clusters with fast local storage for parallel computing, and HPC systems for GPU-accelerated codes. The necessary computing infrastructure scales well beyond single workstations, but remains well below the scale of large computing centers. As a mid-sized HPC facility, it has to allow for interactive use, e.g. for developing large-scale parallel applications or for directed parameter-space exploration. The Linux workstations are part of the HPC systems at MPI DS, so scientists can work on their data and control their jobs directly from their desktop systems.
The HPC group puts considerable effort into keeping the hardware as homogeneous as possible in order to minimize the maintenance workload and maximize interoperability between the scientific working groups. Currently, the HPC hardware at MPI DS consists mainly of Lenovo systems using Intel Omni-Path network interconnects for the parallel clusters. A few older Dell clusters with Mellanox InfiniBand networks are still being maintained. Scientists at MPI DS have direct access to HPC clusters comprising about 1000 HPC systems in total (more than 26,000 CPU cores, approximately 160 TB of RAM, and 20 PB of data storage capacity).
Hosting computing facilities of this size requires a very dense packing of servers, which is achieved by using multicore machines and efficient system designs such as blade server enclosures. The resulting power densities of more than 20 kW per square meter cannot be removed by traditional open-air-flow cooling with false floors. An efficient cooling system is desirable from an environmental perspective, but it is also mandatory from a budget point of view, as electricity costs for cooling can amount to as much as one third of the total electricity costs with traditional cooling. MPI DS was among the first institutes of the Max Planck Society to solve this issue by using optimized water-cooled cabinets that cool only the necessary parts of the server rooms, as shown in the figure.
To improve service reliability, half of the MPI DS HPC systems are located in a server room in the institute's building at Fassberg, while the other half is hosted at an external computing center site in the former 'Fernmeldezentrale' (FMZ) of Göttingen University. The smaller department server rooms at Fassberg were recently refurbished and now house project data file servers and infrastructure servers for all groups.
In order to manage such a complex facility across different sites, MPI DS uses provisioning, configuration, and monitoring systems based on open-source software. The monitoring system frequently collects important health data from the HPC hardware and the cooling facilities. These data are summarized into a comprehensive overview, and their history can be viewed for further diagnostics. In case of a cooling failure, the monitoring system can autonomously perform an emergency shutdown to prevent hardware damage from overheating.
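The logic of such an automated cooling safeguard can be sketched as follows. This is a minimal illustration only, not the actual MPI DS implementation: the thresholds, rack names, and the idea of classifying the worst inlet temperature into "ok" / "warn" / "shutdown" states are all assumptions made for the example.

```python
# Minimal sketch of an automated cooling safeguard: classify the worst
# rack inlet temperature and decide whether an emergency shutdown is due.
# Thresholds and rack names below are illustrative assumptions.

WARN_C = 30.0      # raise an alert above this inlet temperature (deg C)
SHUTDOWN_C = 40.0  # trigger an emergency shutdown above this (deg C)

def evaluate(readings):
    """Return 'ok', 'warn', or 'shutdown' for a dict of rack temperatures."""
    worst = max(readings.values())
    if worst >= SHUTDOWN_C:
        return "shutdown"
    if worst >= WARN_C:
        return "warn"
    return "ok"

# Simulated sensor readings standing in for real monitoring data
readings = {"rack-a1": 24.5, "rack-a2": 26.0, "rack-b1": 31.2}
print(evaluate(readings))  # prints "warn": rack-b1 exceeds the alert threshold
```

In a real deployment, the "shutdown" branch would invoke the cluster's power-management interface; here it only reports the state, which keeps the decision logic testable in isolation.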