In chess, whether one is playing against a human or a computer, each move typically builds on the opponent’s previous maneuvers. Humans have one significant advantage, however: their innate ability to analyze many factors before taking a turn. For example, human players may consider what they know about an opponent’s behaviors or past strategies. They then use this knowledge to devise a game plan that gives them the advantage.
Traditional artificial intelligence systems, even those that model real-world adversarial scenarios, are less adaptive. In these systems, players, or “agents,” are programmed to process a far less nuanced, narrow set of conditions before making a decision, and they often select moves with little or no regard for the behavior of others in the game environment.
As part of a two-year project funded by the Laboratory Directed Research and Development Program, a research team in Livermore’s Engineering Directorate is attempting to develop computer modeling systems that process decisions more like humans do, adapting to the changing environment. The team, which includes computer scientists Brenda Ng and Kofi Boakye, operations researcher Carol Meyers, and former summer intern Andrew Wang from the Massachusetts Institute of Technology, wants to devise a system that can analyze adversarial relationships for a wide range of national security and law-enforcement applications. “The idea is to create a framework that takes into account an agent’s intent rather than simply its static behavior,” says Ng, who leads the project. This type of system will improve the effectiveness of computer simulations designed to analyze response scenarios against real-world adversaries.
Outwitting the Enemy
“This process is analogous to attempting to optimize a retirement portfolio,” says Ng. “One can only surmise through day-to-day reports (observations) the general health of his or her investment (the agent state). Possible actions are to buy, sell, or cash out, but each action has an associated cost—a reward or a penalty.”
As the number of agents increases, the models become more complex. The interactive, partially observable Markov decision process (I-POMDP) model is well suited for adversarial scenarios between multiple agents because it allows agents to consider the capabilities and beliefs of their adversaries before making the next move. Within such an environment, agents repeatedly interact with one another, and each agent’s actions affect the joint state of all agents, which in turn affects every agent’s observations.
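For readers who want the underlying arithmetic, the standard single-agent POMDP belief update can be written as below. The notation is not from the article and is assumed here for illustration: b is the agent's current belief over states, T is the state-transition probability, O is the observation probability, a is the action taken, o is the observation received, and η is a normalizing constant. An I-POMDP applies the same kind of update over a larger state that also includes a model of the other agent.

```latex
b'(s') = \eta \, O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s),
\qquad
\eta = \Big[ \sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s) \Big]^{-1}
```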
One drawback of the I-POMDP model is its built-in assumption that agents know all of the model parameters. In the real world, many conditions remain unknown until people interact with each other, whether they are allies or adversaries. To make the agents’ simulated behavior more realistic, the Livermore team incorporated reinforcement learning into the I-POMDP model.
With the new framework, agents learn as they make choices within the established environment. Each interaction provides information that helps them select the optimal action for a given situation, allowing the agents to maximize their rewards. Model parameters are not fully known beforehand, but agents learn them through trial and error as the players interact.
“Our goal is to bridge the gap between theory and practice in what an I-POMDP can model in an adversarial scenario,” says Meyers. A framework that simulates how agents “learn” from their opponents and change strategies based on observed behavior has major potential for law-enforcement and national security applications.
Show Me the Money
“The money-laundering scenario is appealing because both agents have nested beliefs,” says Ng. “Each one acts on what it ‘believes’ the other is thinking.” The nested-belief framework attempts to model each player’s thought processes and actions in a manner that better simulates human behavior—what Ng calls an I-think-that-you-think-that-I-think pattern.
The team’s initial model consisted of two agents. The first agent, a money launderer, is trying to diffuse and integrate its assets, “dirty” money, into the mainstream economy without being detected. The second agent, a law-enforcement officer, wants to confiscate this money before the money launderer can “cash out” via transactions with legitimate businesses.
The two agents operate within a defined number of states where the laundered money may be placed or found. For the money launderer, each state represents a location through which money can be diverted, such as bank accounts, trusts, or securities, as well as businesses that can integrate the large sums, for example, casinos and real-estate agencies. For law enforcement, each state represents a location where the officer can probe for suspicious activities. Both agents take actions not only to gather intelligence information on the opponent but also to transition from state to state. The “game” resets when the money is either successfully laundered or confiscated.
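The I-think-that-you-think pattern in this game can be illustrated with a toy sketch. It is not the team's I-POMDP solver. It uses a much simpler level-k scheme, in which an agent at reasoning level k picks the best response to an opponent assumed to reason at level k - 1. The location names, the level-0 probabilities, and the function name nested_policy are all hypothetical.

```python
# Toy level-k ("I think that you think...") reasoning for a one-shot
# hide-and-probe game. Illustrative only; not the I-POMDP model itself.

LOCATIONS = ["bank", "trust", "casino"]  # hypothetical locations

# Hypothetical level-0 habits: where each agent acts with no reasoning.
LEVEL0 = {
    "launderer": {"bank": 0.6, "trust": 0.3, "casino": 0.1},
    "officer": {"bank": 0.5, "trust": 0.3, "casino": 0.2},
}


def nested_policy(role, level, level0=LEVEL0):
    """Return {location: probability} for `role` reasoning at `level`."""
    if level == 0:
        return dict(level0[role])
    opponent = "officer" if role == "launderer" else "launderer"
    # Model the opponent as reasoning one level below this agent.
    opp = nested_policy(opponent, level - 1, level0)
    if role == "officer":
        # Probe where the launderer is believed most likely to be.
        choice = max(LOCATIONS, key=lambda loc: opp[loc])
    else:
        # Hide where the officer is believed least likely to probe
        # (ties go to the first location in LOCATIONS).
        choice = min(LOCATIONS, key=lambda loc: opp[loc])
    return {loc: 1.0 if loc == choice else 0.0 for loc in LOCATIONS}


if __name__ == "__main__":
    for k in range(3):
        print(k, nested_policy("launderer", k), nested_policy("officer", k))
```

Each extra level of nesting changes the chosen location, which is the reason the depth of an agent's beliefs about its opponent matters in this kind of game.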
One challenge for the Livermore team was reducing the computational burden required to calculate the solution. “We had to substantially scale down our model to make it run efficiently,” says Meyers. “Even in the simplified version, the number of actions, observations, and states was 20 times greater than that in any game previously solved by an I-POMDP.” The researchers also modified algorithms designed to solve the I-POMDP models. They introduced additional approximations to a technique called reachability tree sampling, in which possible paths
Throughout an I-POMDP simulation, both agents maintain beliefs about the physical states of their environment and know the model parameters, such as how the other’s actions will contribute to the next state. After experimenting with the model, the Livermore team determined that under most conditions, the money launderer has the advantage. However, when both agents are set to focus on achieving immediate rewards, the law-enforcement officer wins more often and does so much faster.
A Risky Proposition
“We assume that the state, action, and observation spaces are finite and known, but the model parameters—namely, the probabilities with which the agents change states and get specific information—are not fully known,” says Ng. “This approach is more realistic because in the real world, agents would face a number of uncertainties as both parties try to deceive each other.”
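One simple way to picture learning unknown probabilities by trial and error is to count what happens after each action and turn the counts into estimates. The sketch below does this with a smoothed count table. It is a stand-in technique chosen for illustration, not the learning method the team built into the I-POMDP, and the class name, method names, and state labels are hypothetical.

```python
from collections import defaultdict


class TransitionLearner:
    """Estimate P(next_state | state, action) from observed transitions.

    Hypothetical helper for illustration. `prior` is a pseudo-count added
    to every outcome so that unseen transitions keep a nonzero probability.
    """

    def __init__(self, states, prior=1.0):
        self.states = list(states)
        self.prior = prior
        self.counts = defaultdict(float)

    def observe(self, state, action, next_state):
        """Record one experienced transition."""
        self.counts[(state, action, next_state)] += 1.0

    def probability(self, state, action, next_state):
        """Return the smoothed estimate of the transition probability."""
        total = sum(self.counts[(state, action, s)] for s in self.states)
        seen = self.counts[(state, action, next_state)]
        return (seen + self.prior) / (total + self.prior * len(self.states))


if __name__ == "__main__":
    learner = TransitionLearner(["account", "casino"])
    print(learner.probability("account", "move", "casino"))  # no data yet
    for _ in range(3):
        learner.observe("account", "move", "casino")
    learner.observe("account", "move", "account")
    print(learner.probability("account", "move", "casino"))
```

With no data the estimate is uniform across outcomes, and it moves toward the observed frequencies as interactions accumulate.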
To demonstrate how adversaries continually adapt to an opponent, the team applied the model to the tiger problem, a standard benchmark used in academia. In the two-agent scenario, two adjacent rooms contain an object, either a ferocious tiger or a jackpot. The two agents have access to their own set of doors and can hear but not see the other agent. Each agent can take one of three actions: open the left door, open the right door, or listen.
An agent choosing to listen might hear a tiger growl, a door creak, or only silence. However, observations are obscured by background noise, so the agent cannot completely trust what it hears. After listening, the agent can update its belief state, learn about the truthfulness of the observations, and then choose the next optimal action. At the same time, both agents are trying to anticipate the action, observation, and evolving belief of the opponent.
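The belief update after listening can be sketched for a simplified, single-agent version of the problem. The sketch assumes the tiger is behind either the left or the right door and that a growl is heard from the correct side with a fixed probability. The 0.85 accuracy is an assumed value for illustration, and door creaks and the second agent are left out. The function name update_belief is hypothetical.

```python
def update_belief(p_left, heard, accuracy=0.85):
    """Return the updated P(tiger is behind the left door).

    p_left   -- belief before listening that the tiger is on the left
    heard    -- "left" or "right", the side the growl seemed to come from
    accuracy -- assumed probability that the growl is heard on the true side
    """
    if heard not in ("left", "right"):
        raise ValueError("heard must be 'left' or 'right'")
    # Likelihood of this observation under each possible tiger location.
    like_if_left = accuracy if heard == "left" else 1.0 - accuracy
    like_if_right = 1.0 - accuracy if heard == "left" else accuracy
    # Bayes' rule: weight the prior by the likelihood, then normalize.
    numerator = like_if_left * p_left
    return numerator / (numerator + like_if_right * (1.0 - p_left))


if __name__ == "__main__":
    belief = 0.5  # start with no idea where the tiger is
    for growl in ["left", "left", "right"]:
        belief = update_belief(belief, growl)
        print(growl, round(belief, 4))
```

Two growls from the same side push the belief well past the accuracy of a single observation, while a conflicting growl pulls it back. This is why an agent may prefer to keep listening before it opens a door.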
The model now has more states to track because the state space includes parameters that enable learning. As a result, the team had to add further algorithmic approximations. “We transferred our approximations from the money-laundering model,” says Boakye, “and then revised them to work for the larger state space.” The tiger simulations revealed that when both agents are learning, the agents reap more rewards as the accuracy of their learned parameters increases. In essence, the learned behavior allows the agent to significantly improve its rewards compared with those attained from an incorrect model with no learning. In addition, when both agents are learning, rewards take longer to acquire, which is similar to a real scenario in which adversaries try to “game” each other.
I See You
Ongoing research will focus on developing ways to enable more states, actions, and observations in the model while keeping the computation tractable. With more realistic adversarial models in the works, national security and law-enforcement officials may one day have a better tool for understanding their intelligent opponents. As a result, these systems may also help answer a fundamental question: how might adversaries act differently if they knew they were being watched?
Key Words: adversarial modeling; artificial intelligence; counterterrorism; interactive, partially observable Markov decision process (I-POMDP); law enforcement; money laundering; reinforcement learning.
For further information contact Brenda Ng (925) 422-4553.
Lawrence Livermore National Laboratory
UCRL-TR-52000-11-3 | March 15, 2011