At Lawrence Livermore and across the National Nuclear Security Administration’s (NNSA’s) complex, leadership in supercomputing is not only a core competency but also essential to the national security mission. Demands on computer systems used for weapons assessments and certification continue to grow as the nuclear stockpile moves further from the test base against which simulations are calibrated. Simulation capabilities are also strained because weapons behavior spans such a wide range of timescales, from detonation, which happens on the micro- to millisecond scale, to nuclear fusion, which lasts mere femtoseconds.
The perpetual push for higher resolution computer simulations and faster number crunching is motivated by a desire to verify, with high confidence and without resorting to nuclear testing, that a critical national resource remains secure and functional. Says Fred Streitz, director of Livermore’s High Performance Computing Innovation Center, “Every time we jump to a new generation of supercomputers, we open another window through which we can look to discover new science, and that helps in our quest to understand materials and phenomena with a high degree of accuracy.”
In 2006, the 100-teraflop (trillion floating-point operations per second) Advanced Simulation and Computing (ASC) Purple supercomputer delivered the first-ever three-dimensional, full-physics nuclear weapons simulation, and the 360-teraflop BlueGene/L was in the midst of its three-year reign as the world’s fastest computer. While these two Livermore systems were still setting records, leaders in NNSA’s ASC Program (the organization that integrates work by researchers at Los Alamos, Lawrence Livermore, and Sandia national laboratories to develop nuclear weapons simulation tools) were planning for a new computing system that would be delivered in 2012. Supplying the next level of performance necessary for supporting stockpile stewardship work, they soon realized, would require the most ambitious leap in computing ever attempted. They needed a machine that could deliver 12 to 24 Purple-class weapons calculations simultaneously to credibly evaluate the uncertainties in weapons physics. The machine also needed to have 20 to 50 times the capability of BlueGene/L for running materials science codes.
These performance milestones formed the design basis for Sequoia, the “serial number one” IBM BlueGene/Q machine. The newest and largest member of Livermore’s supercomputing arsenal, Sequoia has a peak performance speed of 20 petaflops (quadrillion floating-point operations per second). IBM designed BlueGene/Q in partnership with Lawrence Livermore and Argonne national laboratories, with national laboratory researchers providing user input at every stage of development. Livermore experts worked to create a machine that could be programmed with relative ease, while helping to explore and prepare for future computer architectures. Most critically, the machine would provide an effective platform for the existing trove of weapons codes, in which NNSA has invested several billion dollars and more than 15 years of effort.
A breakthrough system with over 1.5 million processor units, or cores, and 1.6 petabytes of memory, Sequoia serves as a bridge, design-wise, between supercomputers of the past 15 years and exascale machines some 100 times faster than today’s top performer (an exaflop machine would perform at least 1 quintillion floating-point operations per second). While BlueGene/Q is still grounded in today’s computer architecture, it introduces hardware features and supports programming models likely to carry over to and potentially proliferate in tomorrow’s systems.
According to Michel McCoy, ASC program director at Livermore, computer architecture is undergoing its toughest transition in the past 70 years. The last truly revolutionary design shift came in the mid-1990s, when groups of interconnected cache-based microprocessors became standard in computers of every scale. Succeeding generations of these microprocessors have grown faster by boosting the speed and shrinking the size of transistors, effectively packing more calculations into every unit of time and space occupied by the computer. But now transistors are approaching a lower limit in size and an upper limit in speed. Although individual transistors could be pushed to run yet faster, further speeding up the millions of transistors on a typical microprocessor would drive energy demands and operational costs for large-scale computing to unsupportable levels.
Researchers are therefore seeking new ways of designing processors to increase their performance, while reducing energy consumption enough to make the units affordable to operate. One promising approach, exemplified by IBM’s BlueGene series, is using large numbers of relatively low-frequency, simple processing units to achieve high cumulative performance with moderate energy usage. With a peak power requirement of less than 10 megawatts and the capability to perform 2.1 billion floating-point operations per second per watt, BlueGene/Q is among the world’s most energy-efficient supercomputers.
BlueGene/Q’s ingenious hardware design begins with the chip. Processors, memory, and networking logic are all integrated into an exceptionally energy-efficient single chip that operates at both lower voltage and lower frequency than many of the chips found in leading supercomputers of the past decade. Each BlueGene/Q compute chip contains 18 cores, more than four times the core density of its recent predecessor, the BlueGene/P chip. Sixteen cores are used for computation, while one runs the operating system and another serves as a backup to increase the yield of usable chips. Each core has a relatively small amount of nearby memory for fast data access, called a Level 1 cache, and all the cores on the chip share a larger Level 2 cache. Each compute chip is packaged with memory chips and water-cooling components into a compute card, and compute cards are organized in sets of 32 (called a node card). Thirty-two node cards make up a rack, and Sequoia has 96 of these racks, for a total of just under 100,000 16-core processors.
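The packaging hierarchy described above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the counts given in the text; the constant and function names are illustrative.

```python
# Packaging hierarchy as described in the article.
COMPUTE_CARDS_PER_NODE_CARD = 32
NODE_CARDS_PER_RACK = 32
RACKS = 96
COMPUTE_CORES_PER_CHIP = 16  # of 18 physical cores; one runs the OS, one is a backup


def sequoia_counts():
    """Return (compute chips, compute cores) for the full machine."""
    chips = COMPUTE_CARDS_PER_NODE_CARD * NODE_CARDS_PER_RACK * RACKS
    cores = chips * COMPUTE_CORES_PER_CHIP
    return chips, cores


chips, cores = sequoia_counts()
print(f"{chips:,} compute chips, {cores:,} compute cores")
# prints: 98,304 compute chips, 1,572,864 compute cores
```

The result, 98,304 chips, is the "just under 100,000" figure, and the core total matches the 1,572,864 cores cited for the Cardioid runs.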
Modern supercomputers, including BlueGene/Q, have several types of on-chip memory storage. “However, real problems don’t fit in the cache,” says Bronis de Supinski, Livermore Computing’s chief technology officer. Most operations cannot simply access data from a nearby memory cache but instead must fetch data from a separate memory chip or another module. As the volume of cores and tasks has grown, the rate at which the system can retrieve data for the processor in order to complete calculations (known as bandwidth) has gradually become the limiting factor in computer performance, more so than how fast the processor can theoretically perform calculations.
Hardware designers have implemented various methods to ameliorate delay and improve performance. BlueGene/Q uses hardware threading, an energy- and space-efficient alternative. Each processor supports four execution threads that share resources. The situation is analogous to four workers who must share a single tool. While one worker (thread) is waiting for data retrieval, it can lend the tool to another worker so that the core’s resources are always being used.
Keeping otherwise idle threads or cores occupied was a goal for the BlueGene/Q hardware designers. Many parallel programs contain shared data and portions of code that must be executed by a single process or thread at a time to ensure accurate results. Threads attempting to update or access the same data must take turns, through locks or other mechanisms, which can cause a queue to form. As the number of access attempts increases, program performance deteriorates. BlueGene/Q’s multiversioning Level 2 cache supports two new alternatives to locks: speculative execution and transactional memory. Together, they may improve application performance and ease the programmer’s task compared with traditional locking methods.
Using either memory feature requires that a programmer designate regions in the code as speculative. Instead of waiting for up-to-date versions of all the data it needs, which might depend on another core finishing a computation, a thread is allowed to begin speculatively performing work with the available data. The cache then checks the work. If the data are fresh, the work will be saved, and the system will gain a performance bonus because the labor was completed before the relevant value was changed by another thread. If the data are stale, the speculative work is discarded, and the operation is reexecuted with the correct value.
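The check-then-commit behavior described above resembles optimistic concurrency in software. The following is a single-threaded sketch of that idea, not a model of the BlueGene/Q cache hardware. The `VersionedCell` class, the `speculative_update` function, and the `interfere` hook (which stands in for another thread changing the data mid-flight) are all hypothetical names introduced for illustration.

```python
class VersionedCell:
    """A shared value tagged with a version number that bumps on every write."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def write(self, value):
        self.value = value
        self.version += 1


def speculative_update(cell, compute, interfere=None):
    """Apply compute() to the cell optimistically; return the attempts needed.

    interfere is a hypothetical hook that simulates another thread writing
    to the cell while the speculative work is in flight.
    """
    attempts = 0
    while True:
        attempts += 1
        seen_value, seen_version = cell.value, cell.version  # snapshot
        result = compute(seen_value)                          # speculative work
        if interfere is not None:
            interfere(cell)
            interfere = None  # the conflict happens only once in this demo
        if cell.version == seen_version:  # data were fresh: keep the work
            cell.write(result)
            return attempts
        # data were stale: discard the result and re-execute


cell = VersionedCell(10)
first = speculative_update(cell, lambda v: v + 1)
# Another writer sets the cell to 100 mid-flight, so one attempt is discarded.
second = speculative_update(cell, lambda v: v + 1,
                            interfere=lambda c: c.write(100))
print(first, second, cell.value)  # prints: 1 2 101
```

In a real multithreaded program the version check and the write would have to happen atomically; according to the article, that is the job the multiversioning Level 2 cache takes on.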
Another innovative memory element is BlueGene/Q’s intelligent Level 1 cache list prefetcher. This cache can learn complex memory access patterns and retrieve frequently visited data sets from the more distant memory devices, making them available in the cache when needed.
Improving computational performance and bandwidth through threading and special memory functions would largely be for naught without a high-speed method for moving the data between nodes. Most supercomputers use commodity interconnects or Infiniband for their primary connections. BlueGene/Q has a novel switching infrastructure that takes advantage of advanced fiber optics at every communication level. With the latest generation of BlueGene, IBM has also upgraded the toroidal connection between node cards from three to five dimensions for more robust interconnection and to halve the maximum latency. In this configuration, each node is connected to 10 others (instead of 6), thereby reducing the maximum distance between nodes from 72 hops to 31.
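The hop counts quoted above follow from torus geometry: with wraparound links, the farthest node along a dimension of size d is d // 2 hops away, and the worst case is the sum over all dimensions. The sketch below assumes torus shapes of 32 x 48 x 64 and 16 x 16 x 16 x 12 x 2. These shapes are not given in the article; they were chosen because both contain 98,304 nodes and reproduce the quoted figures.

```python
from math import prod


def torus_stats(dims):
    """Return (nodes, links_per_node, max_hops) for a torus of the given shape."""
    nodes = prod(dims)
    links_per_node = 2 * len(dims)        # one link each way per dimension
    max_hops = sum(d // 2 for d in dims)  # wraparound halves each dimension
    return nodes, links_per_node, max_hops


# ASSUMPTION: illustrative shapes, not stated in the article.
THREE_D = (32, 48, 64)
FIVE_D = (16, 16, 16, 12, 2)

print(torus_stats(THREE_D))  # prints: (98304, 6, 72)
print(torus_stats(FIVE_D))   # prints: (98304, 10, 31)
```

Note that in a dimension of size 2 both links reach the same neighbor, so under this assumed shape the figure of 10 counts links rather than distinct nodes.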
Accompanying Sequoia and connected via high-speed parallelized interconnect is one of the world’s largest file storage systems. The 55-petabyte Grove is designed to manage and store the immense amounts of data needed for Sequoia’s research functions.
Knowing that supercomputer hardware and software must function as a team, Livermore experts have worked to ensure not only that Sequoia’s hardware will be compatible with ASC codes, but also that these codes are tuned to run on Sequoia. Because one goal is to improve the codes’ predictive capability, code preparation has required new physics models, new mathematical representations for the models, and algorithms for solving these representations at a new scale. In addition, researchers have begun analyzing threading techniques and other hardware features to determine how and where to incorporate these innovations into the codes. The undertaking began in 2009, a full three years before Sequoia arrived. Dawn, a 500-teraflop BlueGene/P machine, served as the primary platform for code preparation and scaling. Additional support was provided by BlueGene/Q simulators, other Livermore high-performance computing (HPC) systems, and starting in 2011, some small-scale BlueGene/Q prototypes.
Physicist David Richards observes, “With parallel computers, many people mistakenly think that using a million processors will automatically make a program run a million times faster than with a single processor, but that is not necessarily so. The work must be divided into pieces that can be performed independently, and how well this can be done depends on the problem.” For instance, Richards has found that running a 27 million particle molecular-dynamics simulation on 2,000 nodes of Sequoia is barely faster than doing so on 1,000 nodes. Determining the optimal level of task division for the given problem is essential.
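One simplified way to see why doubling the node count can yield so little is Amdahl's law, which bounds speedup by the share of work that cannot be divided. The article does not name this model or say what limits the molecular-dynamics run, so the 99-percent parallel fraction below is purely illustrative.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup over one worker when only parallel_fraction of the work divides."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# ASSUMPTION: an illustrative parallel fraction, not a measured value.
P = 0.99
s_1000 = amdahl_speedup(P, 1000)
s_2000 = amdahl_speedup(P, 2000)
print(f"1,000 nodes: {s_1000:.1f}x, 2,000 nodes: {s_2000:.1f}x, "
      f"gain from doubling: {s_2000 / s_1000:.2f}x")
```

Under this assumption, doubling the hardware buys less than a 5-percent improvement. In practice, communication overhead and too little work per node can limit scaling in similar ways.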
Computational physicist Bert Still divides the code readiness work into two distinct components—scaling out and scaling in. Scaling out is modifying the code to perform effectively across a greater number of nodes by breaking a problem or calculation into smaller but still functional chunks. Using more nodes allows researchers to save on calculation time, increase the problem size or resolution, or both.
Essentially, the scaling-in task entails streamlining operations within each node. In the past, programmers focused on parallelism exclusively at the node level. Sequoia offers parallelism opportunities at lower levels within the node as well. Sequoia’s individual cores are comparatively slow and equipped with rather simple logic for deciding which tasks to run next. The programmer is largely responsible for efficiently organizing and apportioning work, while ensuring that the assignment lists fit within the small memory of these processing units. Says physicist Steve Langer, “Almost any next-generation machine, whatever its architecture, will require the ability to break the code down into small work units, so the code preparation work for Sequoia is an investment in the future.”
An opportunity for parallelism also lies within single-instruction, multiple-data (SIMD) units. Modern cores are designed to perform several of the same operations at the same time, for the highest possible throughput. In the case of Sequoia, a four-wide SIMD configuration allows each core to perform up to eight floating-point operations at a time, but doing so quadruples the bandwidth requirements. Programmers must balance their code accordingly. Each core has two functional units, one for floating-point math and one for integer math and memory loading and storing operations. Livermore programmers can attempt to enhance operations by equalizing memory access and calculation functions, cycle by cycle.
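The eight operations per core per cycle tie directly to the machine's 20-petaflop peak. The sketch below multiplies the core count and per-core rate from the article by an assumed clock rate of 1.6 gigahertz, which the article does not state. The eight-operation figure is presumably four SIMD lanes each issuing a fused multiply-add counted as two operations, which is also an assumption.

```python
CORES = 1_572_864    # compute cores, from the article
FLOPS_PER_CYCLE = 8  # per core, from the article
CLOCK_HZ = 1.6e9     # ASSUMPTION: clock rate is not stated in the article


def peak_petaflops(cores=CORES, flops_per_cycle=FLOPS_PER_CYCLE,
                   clock_hz=CLOCK_HZ):
    """Theoretical peak if every core retires its maximum every cycle."""
    return cores * flops_per_cycle * clock_hz / 1e15


print(f"{peak_petaflops():.2f} petaflops")  # prints: 20.13 petaflops
```

Reaching that rate requires every cycle to issue a full-width operation, which is why the memory system has to keep the SIMD unit supplied with data.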
The “Livermore model” for HPC programming, originally developed in the early 1990s and designed to scale through petascale system operations, relies on the message-passing interface (MPI) communication standard to tie processors together for tackling massive problems. But while MPI is very good at managing communications up to a few hundred thousand cores, it is not the optimal approach for a million-core machine. As a problem is split into more pieces, the volume of MPI communication increases dramatically, reducing available memory and potentially degrading overall performance. More communication also uses more electricity. Exploiting the full capabilities and memory bandwidth of the system demands a hybrid style of programming that combines MPI with hardware-threading methods.
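The trade-off between MPI processes and threads can be made concrete by counting. The sketch below assumes that all four hardware threads on each of the 16 compute cores are in use and lists the ways 64 hardware threads per node could be split between MPI ranks and threads per rank. The function name and the restriction to power-of-two splits are illustrative choices.

```python
NODES = 96 * 32 * 32          # racks x node cards x compute cards, from the article
HW_THREADS_PER_NODE = 16 * 4  # compute cores x hardware threads per core


def hybrid_layouts(nodes=NODES, hw_threads=HW_THREADS_PER_NODE):
    """List (ranks_per_node, threads_per_rank, total_ranks) for power-of-two splits."""
    layouts = []
    ranks_per_node = hw_threads
    while ranks_per_node >= 1:
        layouts.append((ranks_per_node,
                        hw_threads // ranks_per_node,
                        nodes * ranks_per_node))
        ranks_per_node //= 2
    return layouts


for ranks, threads, total in hybrid_layouts():
    print(f"{ranks:2d} ranks/node x {threads:2d} threads/rank -> {total:,} MPI ranks")
```

At one extreme the machine runs more than 6 million MPI ranks; at the other, 98,304 ranks each drive 64 threads. Fewer ranks means fewer message endpoints and less memory spent on communication buffers, at the cost of more thread coordination inside each node.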
Several projects were launched to assist Livermore computer scientists with applying these hybrid methods and implementing new programming styles. One such project investigated accomplishing larger tasks using a group of threads. Strategically deciding when, where, and how much threading to add to the code is crucial. Shifting to a more thread-focused scheme allows code writers to worry less about communication and more about coordinating thread activity through a strategic combination of locks, speculative execution, and transactional memory. The latter two features have just begun to be explored, but Livermore programmers suspect that the new memory features could boost performance in code sections not already optimized for threading and locks.
Code preparation, enhancement, and testing have been challenging, surprising, and even gratifying. Livermore researchers have found that MPI programming functions better than anticipated on Sequoia. In addition, says Still, “While we knew that Sequoia would be a good HPC throughput engine, we were surprised how good it was as a data analytic machine. Graph analytics is representative of a large class of data science and is relevant to mission work at the Laboratory.” Sequoia maintains the highest ranking on the Graph 500 list, which measures the ability of supercomputers to solve big data problems. (See S&TR, January/February 2013, Dealing with Data Overload in the Scientific Realm.)
Sequoia has also presented new scaling complications. While preparing the laser–plasma code pf3D for a simulation requiring 1 trillion zones and 300 terabytes of memory, Langer realized that scaling up the code for Sequoia would require him to parallelize every aspect of the effort, including preparatory calculations and post-simulation data analysis. Tasks previously considered relatively trivial or not particularly data intensive were now too large for a typical workstation to handle. Langer notes, “Sequoia is so big, it challenges our assumptions. We keep bumping against the sheer size of the machine.”
While software experts were beginning code optimization and development work for Sequoia, engineering and facilities teams were preparing to install the new system in the two-level space previously occupied by ASC Purple at Livermore’s Terascale Simulation Facility. (See S&TR, January/February 2005, Experiment and Theory Have a New Partner: Simulation.) The teams first created a three-dimensional model of the computer room to determine how to best use the available space, with consideration for sustainability and energy efficiency. They then made changes to the facility on a coordinated schedule to avoid interrupting existing operations. Changes included gradually increasing the air and chilled-water temperatures to meet Sequoia’s requirements and to save energy.
Sequoia is equipped with energy-efficient features such as a novel 480-volt electrical distribution system for reducing energy losses and a water-cooling system that is more than twice as energy efficient as standard air cooling. Incorporating such systems into the space necessitated two significant modifications to IBM’s facility specifications. First, the facilities team designed innovative in-floor power distribution units to minimize congestion and reduce, by a factor of four, the conduit distribution equipment bridging the utilities and rack levels. Second, although IBM specified stainless-steel pipe for the cooling infrastructure, the team selected more economical polypropylene piping that met durability requirements and relaxed water-treatment demands. The polypropylene piping also contributed to the building’s already impressive “green” credentials as a Leadership in Energy and Environmental Design gold-rated facility. Facilities manager Anna Maria Bailey notes, “Sequoia’s electrical, mechanical, and structural requirements are demanding. Preparing for Sequoia really pushed us to think creatively to make things work.”
Integrating any first-of-its-kind computer system is challenging, but Sequoia’s integration was Livermore’s most grueling in recent history because of the machine’s size and complexity. System testing and stabilization spanned a full 14 months, but the integration schedule itself was unusually tight, leaving a mere 5 weeks between delivery of the last racks and the deadline for completing Linpack testing, a performance benchmark used to rank the world’s fastest computers. Issues ranged from straightforward inconveniences, such as paint chips in the water system during pump commissioning and faulty adhesives on a bulk power module gasket, to more puzzling errors, such as intermittent electrical problems caused by connections bent during rack installation. Sequoia’s cooling infrastructure also presented some initial operational challenges, including uneven node-card cooling and false tripping of a leak-detection monitor.
A more serious manufacturing defect was encountered during the final integration phase. During intentionally aggressive thermal cycling as part of the Linpack testing, the team experienced a high volume of uncorrectable and seemingly random memory errors. In effect, compute cards were failing at alarming rates. The integration team began removing cards while IBM performed random dye-injection leak tests. The tests revealed that the solder attaching chips to their compute cards was, in some instances, exhibiting microscopic cracks. Investigation revealed that unevenly applied force during manufacturing tests had damaged the solder on a portion of Sequoia’s cards. These cracks had widened during thermal cycling, overheating the memory controllers beneath.
The Livermore team overcame this and other integration hurdles with assistance from IBM computer scientists and BlueGene/Q’s unusually sophisticated error-detection control system software. Within 40 days of detecting the memory errors, IBM and Livermore troubleshooters had pinpointed the cause and replaced all 25,000 faulty cards. Although the system was accepted in December 2012, IBM continued to work with Livermore to fine-tune Sequoia hardware and software until the machine converted to classified operations in April 2013, demonstrating a notable level of dedication and partnership in machine deployment.
None of the integration challenges prevented Sequoia from completing, with only hours to spare, a 23-hour Linpack benchmark run at 16.324 petaflops and assuming the lead position on the Top500 Supercomputer Sites list for a 6-month period in 2012. Kim Cupps, a division leader in Livermore Computing, observes, “We’re proud that Sequoia was named the world’s fastest computer, but really, what’s important about the machine is what we can do with it. When we hear people talk about the work they’re doing on the machine and what they’ve accomplished, that’s what makes all the work worthwhile.”
The speed, scaling, and versatility that Sequoia has demonstrated to date are impressive indeed. For a few months prior to the transition to classified work and the access limitation that entails, university and national laboratory researchers conducted unclassified basic-science research on Sequoia. These science code and multiphysics simulation runs helped unearth a number of previously undetected hardware and software bugs and provided scientists with a preview of Sequoia’s capabilities, while accomplishing compelling research.
Over the course of the science runs, Sequoia repeatedly set new world records for core usage and speed. The Livermore–IBM-developed Cardioid code, which rapidly models the electrophysiology of a beating human heart at near-cellular resolution, was the first to use more than a million cores and the first to achieve more than 10 petaflops in sustained performance. Cardioid clocked in at nearly 12 petaflops while scaling with better than 90-percent parallel efficiency across all 1,572,864 cores on Sequoia. Scientists hope to use Cardioid to model various heart conditions and explore how the heart responds to certain medications. (See S&TR, September 2012, Venturing into the Heart of High-Performance Computing Simulations.) In another study, HACC, a highly scalable cosmology simulation created by Argonne National Laboratory, achieved 14 petaflops on Sequoia (or 70 percent of peak) in a 3.6 trillion particle benchmark run. Argonne’s code is designed to help scientists understand the nature of dark matter and dark energy. The HACC and Cardioid projects were 2012 finalists in the prestigious Gordon Bell competition for achievement in HPC.
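The sustained rates reported in this article can be put on a common footing by expressing each as a fraction of the 20-petaflop peak. This short sketch uses only figures quoted in the article; the function name is illustrative.

```python
PEAK_PETAFLOPS = 20.0  # Sequoia's peak, from the article


def fraction_of_peak(sustained_petaflops, peak_petaflops=PEAK_PETAFLOPS):
    """Share of theoretical peak that a run actually sustained."""
    return sustained_petaflops / peak_petaflops


for name, sustained in [("Linpack", 16.324), ("HACC", 14.0)]:
    print(f"{name}: {fraction_of_peak(sustained):.0%} of peak")
# prints:
# Linpack: 82% of peak
# HACC: 70% of peak
```

Cardioid's "nearly 12 petaflops" works out to a little under 60 percent of peak by the same measure.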
Using a sophisticated fluid-dynamics code called CharLES, researchers at Stanford University’s Center for Turbulence Research modeled noise generation for several supersonic jet-engine designs to investigate which design results in a quieter engine. A calculation that had taken 100 hours to run on Dawn took just 12 hours on Sequoia. In one of the final science runs, a plasma physics simulation performed by Livermore scientists using the OSIRIS code also displayed magnificent scaling across the entire machine. This run demonstrated that, with a petascale computer, researchers can realistically model laser–plasma interactions at the necessary scale and speed for optimizing experimental designs used in a fusion approach called fast ignition.
In July, following a period for classified code development, science runs, and system optimization, Sequoia will become the first advanced architecture system to shoulder ASC production environment simulations for Livermore, Los Alamos, and Sandia national laboratories. Before now, stockpile stewardship and programmatic milestones were met using commercial-grade capability systems. Sequoia will serve primarily as a tool for building better weapons science models and quantifying error in weapons simulation studies. By supporting computationally demanding, high-resolution, three-dimensional physics simulations, Sequoia will allow researchers to gain a more complete understanding of the physical processes underlying past nuclear test results and the data gathered in nonnuclear tests. Weapons science results such as these may be used to improve integrated design calculations, which are suites of design packages that simulate the safety and reliability of a nuclear device.
Integrated design calculations are the target for Sequoia’s uncertainty quantification (UQ) efforts. Researchers have endeavored to incorporate UQ into their annual weapons assessments to better understand and reduce sources of error in these studies. Until now, they lacked the computing resources to perform UQ at the desired scale and level of mathematical rigor. Scientists conducting UQ analyses will run many integrated design calculations simultaneously on Sequoia and examine how the outcomes are affected by slight variations in the input parameters. A complete study could involve hundreds or thousands of runs. Sequoia will be the first system to allow for the routine use of two-dimensional UQ studies at high resolution; it will also be capable of entry-level three-dimensional UQ studies. Routine, highly resolved, three-dimensional UQ must await more powerful platforms, such as the Livermore Sierra system slated to go into production in 2018.
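The basic shape of such a study, many runs with slightly varied inputs followed by a look at the spread of outcomes, can be sketched in a few lines. Everything in this example is hypothetical: `toy_design_calculation` is a made-up formula standing in for a real integrated design calculation, the input names and the 1-percent perturbation size are arbitrary, and the runs execute one after another rather than simultaneously.

```python
import random
import statistics


def toy_design_calculation(density, temperature):
    """Hypothetical stand-in for an integrated design calculation."""
    return density ** 1.5 * temperature ** 0.5


def uq_ensemble(nominal, rel_sigma, runs, seed=0):
    """Perturb every input by a small random factor and summarize the outputs."""
    rng = random.Random(seed)  # fixed seed so the study is repeatable
    outputs = []
    for _ in range(runs):
        perturbed = {name: value * (1.0 + rng.gauss(0.0, rel_sigma))
                     for name, value in nominal.items()}
        outputs.append(toy_design_calculation(**perturbed))
    return statistics.mean(outputs), statistics.stdev(outputs)


nominal_inputs = {"density": 2.0, "temperature": 4.0}
mean, spread = uq_ensemble(nominal_inputs, rel_sigma=0.01, runs=1000)
print(f"mean output {mean:.3f}, relative spread {spread / mean:.2%}")
```

Each run is independent of the others, which is why a study of hundreds or thousands of runs maps naturally onto a machine that can execute many calculations at once.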
As a platform for both UQ and weapons science research, Sequoia is a powerful resource that will improve the predictive capability of the ASC Program. These capabilities have broader national security applications as well, according to computational physicist Chris Clouse. “It’s not just about predicting an aging stockpile,” says Clouse. “We also need powerful computers and UQ to understand the weapons design of proliferant nation states or organizations, for instance, where we don’t have a large test base for calibration.”
Sequoia serves a vital function beyond weapons research. Still remarks, “Sequoia is both a production platform for UQ and weapons research and an exploration platform for future architectures. It really serves an amazing role as a prototype for advanced technology and as a way to develop a system strategy that will propel us forward.” Finding the ideal recipe for an exascale computer that balances power-consumption limits, memory and communications requirements, programmability, space limits, and many other factors may take some years. However, using Sequoia, Livermore programmers can explore how best to exploit both new and potentially enduring architectural elements, such as shared memory, vast quantities of cores, and hardware threading. Whatever the future of computing might bring, ASC codes need to be compatible. To that end, Still is leading a Laboratory Directed Research and Development project, using Sequoia, to make codes more architecturally neutral.
Livermore computational experts aim to do far more than simply react to computer architecture trends, though. Even during Sequoia’s integration, Laboratory researchers were developing a research portfolio to propel innovation and prepare for exascale computing. Through efforts such as NNSA’s Fast Forward Program and in partnership with HPC companies, they have begun exploring potential computer technologies and testing prototype hardware in Livermore environments and with Livermore simulations. Given NNSA laboratories’ expertise in leading-edge HPC and their history of successful code development with hardware companies, a collaboration between laboratory and industry experts has an excellent chance of addressing the obstacles on the path to exascale supercomputing, while ensuring that next-generation computer designs continue to meet ASC Program and other mission-driven needs.
Says McCoy, “Our Laboratory’s intellectual vitality depends on our staying in a leadership position in computing, from the perspective not only of the weapons program but also of every scientific endeavor at Livermore that depends on vital and world-class computing. The 21st-century economy will depend on using HPC better than our competitors and adversaries.” The knowledge and vision that have helped make Sequoia a success have also positioned Lawrence Livermore to help forge a path to a new era of computing.
Key Words: Advanced Simulation and Computing (ASC); BlueGene; Cardioid; exascale; hardware threading; high-performance computing (HPC); Leadership in Energy and Environmental Design; Linpack; message-passing interface (MPI); parallel computing; single-instruction, multiple-data (SIMD) unit; stockpile verification; uncertainty quantification (UQ).
For further information contact Michel McCoy (925) 422-4021 (firstname.lastname@example.org).