Data Collection
To train a model in our gym, we first need to write a simple battle loop to collect the training data. We use a framework similar to OpenAI's Gym environment, in which you call env.step(action) to move to the next time step.
def run_battle(env, randomize_attributes=False, random_policy=False):
    done = False
    data_collection = {"s": [], "a": [], "r": []}

    # Reset the environment and build the initial state representation
    your_state, opponent_state = env.reset(randomize_attributes, random_policy)
    your_attributes = env.your_fighter["battle_attributes"]
    opponent_attributes = env.opponent_fighter["battle_attributes"]
    state = get_state(your_state, opponent_state, your_attributes, opponent_attributes)

    while not done:
        # Select an action, step the environment, and compute the reward
        action = env.fighters[0]["model"].select_action(state)
        your_new_state, opponent_new_state, done, winner = env.step(action)
        reward = get_reward(your_state, your_new_state, opponent_state, opponent_new_state, winner)
        new_state = get_state(your_new_state, opponent_new_state, your_attributes, opponent_attributes)

        # Store the (state, action, reward) transition
        data_collection["s"].append(state[0])
        data_collection["a"].append(action)
        data_collection["r"].append(reward)

        # Advance to the next time step
        your_state = your_new_state.copy()
        opponent_state = opponent_new_state.copy()
        state = new_state.copy()

    return winner, data_collection
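The loop above relies on a get_state helper that packs both fighters' observations and attributes into a single feature vector. The exact features depend on the game; the version below is only a minimal sketch, assuming the state and attribute dictionaries hold numeric values and that the model expects a batch dimension (which is why the loop stores state[0]):

import numpy as np

def get_state(your_state, opponent_state, your_attributes, opponent_attributes):
    # Hypothetical feature builder: flatten every numeric value from the
    # state and attribute dictionaries in a deterministic (sorted-key) order.
    features = []
    for source in (your_state, opponent_state, your_attributes, opponent_attributes):
        features.extend(float(source[key]) for key in sorted(source))
    # Shape (1, n_features) so the model can treat it as a batch of size 1
    return np.array(features, dtype=np.float32).reshape(1, -1)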
In this loop we collect states, actions, and rewards as the training data. However, we have not yet defined the get_reward function. Below is a simple reward function to get you started, although we expect researchers to come up with more creative reward functions than this:
def get_reward(your_state, your_new_state, opponent_state, opponent_new_state, winner):
    # Reward landing hits and penalize taking them
    opponent_health_delta = opponent_new_state["health"] - opponent_state["health"]
    your_health_delta = your_new_state["health"] - your_state["health"]
    hit_reward = (opponent_health_delta < 0) * 0.3
    get_hit_reward = (your_health_delta < 0) * -0.3

    # Larger terminal reward for winning or losing the battle
    result_reward = 0
    if winner == "You":
        result_reward = 2
    elif winner == "Opponent":
        result_reward = -2

    return result_reward + hit_reward + get_hit_reward
Now that we have our core battle loop for data collection, there are a few ways we can implement training. Below we define the two that we built templates for. Both templates share the same training loop:
import numpy as np

GAMMA = 0.95

def training_loop(env, episodes=100):
    for e in range(episodes):
        # Run one battle and convert the collected data to arrays
        winner, gameplay_data = run_battle(env)
        states = np.array(gameplay_data["s"])
        actions = np.array(gameplay_data["a"])
        discounted_return = get_discounted_return(gameplay_data["r"], GAMMA)

        # Update the model on this episode's states, actions, and returns
        env.fighters[0]["model"].train(states, actions, discounted_return)
Refer back to the get_discounted_return function for how the per-step rewards are converted into discounted returns.
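If you need a refresher, a standard implementation computes the return backwards through the episode; the exact version defined earlier may differ, but a minimal sketch looks like this:

import numpy as np

def get_discounted_return(rewards, gamma):
    # G_t = r_t + gamma * G_{t+1}, computed from the last step backwards
    discounted = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        discounted[t] = running
    return discounted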
One-Sided RL
Most researchers will be familiar with this method of training agents: the researcher focuses on improving a single agent and simulates that agent's actions in an environment. For AI Arena, the environment for this type of training consists of the game as well as an opponent agent that does not learn. In our starter template, we use a rules-based agent as the opponent.
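As a purely illustrative example (this is not the starter template's actual rules-based agent, and the state keys and action indices below are assumptions), a rules-based opponent can be as simple as a hand-written policy over the observable state:

import random

def rules_based_action(state):
    # Hypothetical rules-based policy. The keys ("health", "distance") and
    # action indices (0 = move closer, 1 = attack, 2 = block) are assumptions
    # for illustration only.
    if state["health"] < 20:
        return 2                      # low on health: block
    if state["distance"] > 1:
        return 0                      # out of range: move closer
    return random.choice([1, 1, 2])   # in range: mostly attack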
Self-Play
In this method of training, the researcher is responsible for training two models, which are ultimately copies of each other. The models are trained asynchronously and periodically synchronized, which means the learning model is continually facing a better version of its previous self. To implement this, you simply define an interval at which you swap the model you're training with the opponent:
SWAP_INTERVAL = 50

# At the end of each training episode:
if (e + 1) % SWAP_INTERVAL == 0:
    env.swap_fighters()
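In context, this check sits at the end of each episode in the training loop. The sketch below mirrors the training_loop shown earlier (the function name self_play_training_loop is ours, and we assume env.swap_fighters() exchanges the learning fighter with its frozen copy):

import numpy as np

GAMMA = 0.95
SWAP_INTERVAL = 50

def self_play_training_loop(env, episodes=1000):
    for e in range(episodes):
        winner, gameplay_data = run_battle(env)
        states = np.array(gameplay_data["s"])
        actions = np.array(gameplay_data["a"])
        discounted_return = get_discounted_return(gameplay_data["r"], GAMMA)
        env.fighters[0]["model"].train(states, actions, discounted_return)

        # Every SWAP_INTERVAL episodes, face the latest version of yourself
        if (e + 1) % SWAP_INTERVAL == 0:
            env.swap_fighters()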