Powering on a laptop or smartphone activates its operating system (OS), which mediates between hardware and software, manages the flow of data between disk and memory, and conducts a host of other essential processes. The OS runs the machine and provides the user’s interface to programs and applications such as email, web browsers, spreadsheets, and games. The OS also plays an important role in the home computing environment, where a device may be connected to a Wi-Fi network, a mouse, a printer, cloud-based storage, and other devices. Most personal computing devices run Microsoft’s Windows, Apple’s macOS, or Google’s Android.
In high-performance computing (HPC), OS choices and capabilities are significantly more complicated than on a laptop. Open-source Linux software forms the foundation of most HPC OSes, which are further specialized to accommodate variations in computing hardware, such as processors, high-speed networks, and data storage. Commercial vendors may provide OS support tailored to their hardware, but one OS is generally incompatible with another unless customers are willing to make extensive modifications. As HPC technology evolves, the bespoke Linux OS running on an older HPC system is unlikely to work “as is” on a newer system.
Lawrence Livermore’s HPC center features a range of systems—from commodity clusters made up of thousands of interconnected computing nodes to specialized supercomputers, including the exascale system El Capitan, which can perform more operations in one second than a million smartphones can. (See S&TR, December 2024, Introducing El Capitan.) These systems incorporate hardware from multiple vendors, different versions of software, and unique configurations. For example, the Laboratory’s current HPC lineup includes 11 types of central processing units and five types of graphics processing units. No two systems are exactly alike, yet all run the same OS.
This remarkable feat reflects an ongoing investment from the National Nuclear Security Administration (NNSA), a semi-autonomous Department of Energy (DOE) agency responsible for national security via nuclear science applications. NNSA relies heavily on HPC capabilities to carry out this mission, and for two decades its Advanced Simulation and Computing program has funded development of the Tri-Lab Operating System Stack (TOSS). The homegrown OS serves Lawrence Livermore, Los Alamos, and Sandia national laboratories—the Tri-Labs—as well as other DOE sites and even NASA.
Order from CHAOS
As computing technology has evolved over the decades, Livermore’s HPC system roster has grown steadily to meet NNSA’s advanced simulation needs. By the early 2000s, the growing number of users and machines had become unwieldy for the system administrators managing operations, not to mention frustrating for users running codes on multiple systems. This fragmentation inspired Livermore’s HPC experts to create a custom OS to standardize the Laboratory’s computing environment and ensure reliability and compatibility across systems on both classified and unclassified networks.
Fittingly, this early OS was called the clustered high availability operating system, or CHAOS. The development team did not need to start from scratch. Linux-based OSes—also known as distributions—had emerged in the HPC industry, and Livermore began working with Red Hat, a company offering enterprise-level Linux distributions to HPC centers. “CHAOS was built on top of the Red Hat distribution because we had hardware components and other needs that Red Hat didn’t yet support,” recalls Trent D’Hooge, Livermore Computing’s (LC’s) deputy division leader for operations. For example, commercial Linux distributions had not yet incorporated resource management software, which automatically allocates processors, memory, and storage for HPC workloads. CHAOS filled this gap with the Laboratory-developed Slurm (Simple Linux Utility for Resource Management) software.
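To illustrate the idea, a minimal Slurm batch script like the sketch below is how a user typically requests resources for a job; the job name, resource counts, and executable here are hypothetical, not drawn from any Laboratory system.

  #!/bin/bash
  #SBATCH --job-name=demo_sim      # hypothetical job name
  #SBATCH --nodes=4                # request four compute nodes
  #SBATCH --ntasks-per-node=32     # 32 tasks per node (commonly one per core)
  #SBATCH --mem=64G                # memory per node
  #SBATCH --time=01:00:00          # one-hour wall-clock limit

  srun ./demo_sim                  # launch the hypothetical executable across the allocation

Slurm queues the request and dispatches the job once the requested processors and memory become available, sparing users and administrators from allocating resources by hand.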
CHAOS brought much-needed order to Livermore’s HPC environment. System administrators managed computing resources more easily, while users enjoyed consistent interfaces. This cost-effective success caught NNSA’s attention, and in 2007, CHAOS became TOSS, extending its capabilities to Tri-Lab programs and users. Today, Livermore leads a multisite team that adapts TOSS from Red Hat Enterprise Linux (RHEL, rhymes with “bell”) and further develops it in partnership with Red Hat.
Computing in Common
Widening TOSS’s scope means the team at one site may not be familiar with the computing resources at other facilities. For example, Livermore’s El Capitan and Tuolumne and Sandia’s El Dorado all use Advanced Micro Devices’ (AMD’s) MI300A processors, while NASA was the first participating site to install TOSS on a system with NVIDIA’s Grace Hopper chips. Site-specific teams must debug and test their implementations to ensure compatibility.
D’Hooge states, “Others are using the same OS but with different hardware or different application codes, so they might see bugs we don’t, and we work together to address any issues. A nice feedback loop exists between us and the other laboratories.” This variation paves the way for quickly installing TOSS on subsequent systems with known hardware.
TOSS has also simplified HPC system management. “TOSS gives system administrators the flexibility to shift resources around their sites, and it removes knowledge silos in terms of who can do what. It’s a common basis of operations,” notes Jim Foraker, LC’s Systems Software and Security Development group leader. For users, TOSS brings familiarity and consistency. Once they learn how to use systems installed with TOSS, users do not face a steep learning curve as TOSS evolves. “Part of why we use Red Hat is because certain aspects of their distribution don’t change. If someone hopped on one of our machines running TOSS a few years ago, then returned after a while, the interface would look the same to them,” explains D’Hooge.
Best of Both Worlds
Reciprocity between the HPC industry and leading computing centers such as the Laboratory continuously produces innovations in the field, including in OSes. With unique HPC capabilities serving a national security mission, Livermore often requires software solutions that fall somewhere between completely custom and commercial or open source. Public–private partnerships with vendors such as Red Hat play crucial roles in this context—to everyone’s benefit. “We’ve been at this with Red Hat for about 25 years. They treat us as more than a customer they’re selling a product to. We’re tightly integrated with them, developer to developer,” notes Foraker, who holds the designation of Red Hat Partner Engineer.
Version Control
In software development—including development of high-performance computing (HPC) operating systems (OSes)—code changes are packaged into versions for periodic release to users. Versions are usually numbered sequentially, with minor versions released incrementally between major versions.
Keeping the Tri-Lab Operating System Stack (TOSS) up to date with Red Hat Enterprise Linux (RHEL) requires careful management of software versions. Although TOSS builds on RHEL functionality, the two products’ version numbers are independent of each other. Over the years, TOSS 4 has been updated with RHEL 8.2 through 8.10. Most of Livermore’s HPC systems run TOSS 4, although TOSS 5 (based on RHEL 9.6) was installed on a new computing cluster in late 2025.
With RHEL 10 on the horizon, Livermore TOSS project lead Trent D’Hooge and the team are preparing for a faster transition to the next major version while minimizing disruption to users. He says, “We can keep TOSS 4 stable and secure for a few more years, but it’s more work to maintain multiple concurrent versions.”
When Red Hat releases a new version of RHEL, the TOSS team thoroughly evaluates new features, security patches, performance improvements, bug fixes, and any other code changes. However, the team does not simply fold the upgraded OS software into TOSS. It must first gauge the impact each change may have on TOSS and its users, then test the integration while validating overall stability with representative workloads and benchmarks. This process has stabilized into a monthly TOSS release cycle independent of RHEL’s own schedule of frequent, incremental releases. D’Hooge points out, “We track all RHEL releases but might not necessarily roll out all updates to the Tri-Lab systems.” (See “Version Control,” above.)
Additionally, the TOSS team pushes its own enhancements upstream to Red Hat for consideration in future RHEL versions. For example, NNSA’s procurement of systems featuring AMD’s MI300As led to support for the new processor in RHEL’s core—work that ranged from incorporating the vendor’s hardware specifications to building new functionality. This collaborative, multistep approach also enables quick implementation of urgent patches. Foraker states, “We try to share the work as broadly as we can, so we don’t have to solve the same problem twice or make the same mistake twice. The process maximizes the impact our work has on the Linux and HPC communities.”
HPC Stewardship
TOSS is integral to NNSA’s stewardship of the nation’s HPC resources, which in turn are integral to stewardship of the U.S. nuclear stockpile. Moreover, scientific applications increasingly make use of machine-learning workflows and cloud-computing resources, so the TOSS team must anticipate support for these needs and other emerging technologies. Foraker adds, “Our job is to make sure the OS foundation is solid so users can do what they need to do.”
TOSS is more than the core functions and processes installed on 60-plus HPC systems, and more than the teamwork responsible for its development. It also represents a methodology that prioritizes high-quality, leading-edge management of NNSA’s computing environments—supporting large-scale scientific simulation codes, the HPC systems they run on, and the users who depend on both.
D’Hooge explains, “We do the work once in TOSS and share it with others. Each laboratory or site may need to make some adjustments, but we get them most of the way there and avoid duplication.” The results are proven: long-term OS stability, scalability, and security; lower maintenance and development costs; and portability across HPC systems.
—Holly Auten
For further information contact Trent D’Hooge (925) 423-6100 (dhooge1@llnl.gov) or Jim Foraker (925) 422-0252 (foraker1@llnl.gov).