Big Data in Science Is a Big Challenge
When the Harvard Business Review takes on the topic of data science and announces that data scientist is “The Sexiest Job of the 21st Century,” one knows that “big data” has become the latest big news. However, data science has been at the core of the Laboratory’s mission from the very beginning, when data was measured in kilobytes, compared with the petabytes that are processed and analyzed today.
We live in a world awash in data, a world of complex, interconnected systems and networks, from the vast systems that control and run various aspects of the country’s infrastructure, to systems embedded in manufacturing processes, to the smartphones and tablets on which we depend in our personal lives. Here at Lawrence Livermore, data science involves taking massive amounts of raw data in a variety of forms—for example, ocean-surface temperatures, spectra gathered from telescopes scanning the night sky, and DNA sequences—analyzing them to extract information about relationships, patterns, and connections, and then presenting that information in forms that can be used to make decisions.
Today, the challenges of big data sweep across all our mission areas—from biosecurity and counterterrorism to nonproliferation and weapons systems. Data science also plays a major role in science-focused areas such as climate, energy, astrophysics, and high-energy physics. Any mission involved in collecting and analyzing enormous quantities of data turns to data science as part of the analysis process. The difference between the Laboratory and a commercial entity is our focus on ensuring national security—the driver behind all our efforts.
All our missions are affected by the need to push the limits and applications of data science. To be relevant in the current century, we must excel in this field. To that end, the Laboratory has established a new core competency initiative in data science, headed by James Brase, deputy associate director of Data Science in the Computation Directorate. As the article Dealing with Data Overload in the Scientific Realm describes, the initiative will focus on pattern discovery, predictive models and simulation, and the underlying data-intensive computing and management that enables us to handle high data volumes and rates.
The pattern discovery area involves machine learning—exploring methods and algorithms that will allow data-science tools to evolve to detect ever-changing patterns and activities. One such effort is Ana Paula de Oliveira Sales’s project to develop a method for analyzing streaming data as it evolves.
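To make the idea of learning from a stream concrete, here is a minimal sketch of online learning—a model that updates itself one sample at a time rather than retraining on a stored data set. This is an illustrative example only, not the Laboratory's actual method or the approach used in the project mentioned above; the perceptron learner and the simulated stream are assumptions chosen for brevity.

```python
import random

class OnlinePerceptron:
    """Minimal online learner: weights are adjusted one sample at a
    time, so the model can keep adapting as a data stream arrives."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features  # one weight per feature
        self.b = 0.0                 # bias term
        self.lr = lr                 # learning rate

    def predict(self, x):
        # Sign of the weighted sum decides the class (+1 or -1).
        s = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if s >= 0 else -1

    def update(self, x, y):
        # Adjust weights only when the current sample is misclassified.
        if self.predict(x) != y:
            for i, xi in enumerate(x):
                self.w[i] += self.lr * y * xi
            self.b += self.lr * y

random.seed(0)
model = OnlinePerceptron(n_features=2)

# Simulated stream: the true label depends on whether x0 + x1 > 0.
errors = 0
for t in range(2000):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    y = 1 if x[0] + x[1] > 0 else -1
    if model.predict(x) != y:
        errors += 1   # mistakes concentrate early, before the model adapts
    model.update(x, y)
```

Because each update touches only the current sample, memory use stays constant no matter how long the stream runs—one reason online methods suit high-rate streaming data.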
The data-intensive computing thrust will explore the computer architecture required to meet data-science challenges. For example, a Laboratory Directed Research and Development project is exploring new computer architectures that use flash memory. (See S&TR, January/February 2012, Finding Data Needles in Gigabit Haystacks.)
Data science has huge challenges ahead, but vast rewards beckon. At Livermore, we are engaged in forward-leaning areas of investigation, pushing the boundaries in all directions, whether the subject is biosecurity, as typified by the effort to predict the viability of mutated viruses, or climate science research, such as the Earth System Grid Federation, an international collaboration whose portals include Livermore’s Program for Climate Model Diagnosis and Intercomparison.
To succeed in these areas and elsewhere, addressing the data tsunami is vitally important. The Laboratory will contribute with the excellence of effort that it brings to bear on all its missions, providing the country’s decision makers with the information they need to do their jobs.