Platforms, Codes, and Facilities Form a Three-Pronged Supercomputing Strategy
The feature article, A Center of Excellence Prepares for Sierra, details the initial key steps taken in executing Lawrence Livermore’s high-performance computing (HPC) strategy for the next decade, including code preparation for the soon-to-arrive Sierra supercomputing system and its institutional companion. Also described is the paradigm shift underway in system architecture, as we move from a supercomputing world in which floating-point operations (flops) are expensive to achieve to one in which bandwidth and memory are the most precious commodities. This shift may seem a small matter, but it has driven substantial revisions to our codes and, in some cases, compelled us to rethink code designs entirely.
Given the Laboratory’s leadership role in the world of supercomputing and the mission-critical work we perform with our HPC capabilities, we necessarily take a broad view that goes beyond siting a single system and preparing codes to run efficiently on it. In fact, if we are to sustain our competence in computing and continue to lead in supercomputing, we must strategically focus on three interrelated areas over the coming decade: (1) platforms and the direction that vendors’ business models are taking; (2) codes that effectively use these platforms, possibly in novel ways; and (3) facilities that can effectively operate very large computers.
As for platforms, the Department of Energy’s National Nuclear Security Administration (NNSA) and Office of Science are both investing in developing exascale technology, including systems software, hardware components, codes, and engineering technology. The Exascale Computing Program’s current goal is to build the first exascale system by 2021, while NNSA’s Advanced Simulation and Computing (ASC) Program aims to site production-ready exascale platforms soon thereafter. By 2023, an ASC exascale system should be in production at Lawrence Livermore. Many people at Livermore are already involved in exascale work, in both research-and-development and leadership capacities.
At the same time, hardware vendors are moving strongly in the direction of artificial intelligence, deep learning, and other brain-emulating cognitive processes. One could even say these will be front and center in vendors’ business plans, with simulation being secondary. We will therefore need to invest cleverly so that such capabilities can be harnessed to support our needs. Sierra will feature elements likely to play a role in deep-learning computing, including use of graphics processing units and substantial amounts of nonvolatile memory. In this sense, we can view Sierra as an on-ramp, or training ground, for what is coming soon.
As for the second strategic element, the ASC Program is investing heavily in next-generation codes, in particular developing a next-generation nuclear weapons performance code. The Center of Excellence (COE) for Sierra and its institutional counterpart are playing an important role in code preparation for Sierra and beyond. The creation of an institutional COE for non-ASC codes was driven by the realization that the Laboratory cannot be healthy as an institution unless all its scientists have access to a similarly advanced computational environment. What must come soon is aggressive investment in cognitive simulation, that is, simulation assisted by cognitive processes. One such scenario is a cognitive process monitoring a simulation and redirecting its execution when it senses trouble or opportunity. Some intriguing early work already conducted at Livermore indicates that such techniques could potentially speed up some computational studies by an order of magnitude or more.
The third element of Livermore’s supercomputing strategy is facilities. Computers are becoming more power-hungry and storage-heavy. The Laboratory is currently exploring a variety of strategies to create an efficient, sustainable facility that could support multiple exascale platforms simultaneously. The leading candidate, which is also the most ambitious, is a facility achievable by 2020 that would provide Livermore computers with up to 80 megawatts of power and sufficient cooling capacity. Last year we built a new facility, Building 654, that is an engineer’s dream of efficiency (if not an architect’s vision of beauty). This new facility will serve our needs for many years and can be expanded in modular fashion for additional space.
In short, supercomputing at Livermore remains a vital and rapidly evolving core component of overall institutional strategy. We continue to take a long view. Livermore already has a vision of what supercomputing could look like in 10 years. Now it is time to execute that vision.