Overview
Below is the typical diagram used to describe the core loop in reinforcement learning (RL). We break down each component of this loop in the sections that follow.
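In code, the loop can be sketched roughly as follows. The toy environment and the random action choice below are illustrative stand-ins only, not the actual AI Arena API:

```python
import random

# Toy stand-ins for illustration; the real AI Arena environment and agent differ.
class ToyEnv:
    def reset(self):
        self.steps = 0
        return [0.0, 0.0]                       # initial state (dummy features)

    def step(self, action):
        self.steps += 1
        reward = 1.0 if action == 1 else -1.0   # dummy reward signal
        done = self.steps >= 10                 # episode ends after 10 steps
        return [0.0, float(self.steps)], reward, done

env = ToyEnv()
state = env.reset()
done = False
while not done:
    action = random.choice([0, 1])              # agent picks an action from the state
    state, reward, done = env.step(action)      # environment returns next state and reward
```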
Agent
The agent is the model that you are training. For AI Arena, we use feedforward neural networks to represent the agent. You can check out starter agents we coded up for you at
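For intuition, a feedforward agent of this kind can be sketched as below. The use of PyTorch, the layer sizes, and the state dimension are illustrative assumptions, not the exact starter-agent code:

```python
import torch
import torch.nn as nn

# Illustrative feedforward agent: maps a state vector to a distribution over 10 actions.
# Layer sizes and state_dim are arbitrary choices for this sketch.
class FeedforwardAgent(nn.Module):
    def __init__(self, state_dim=30, num_actions=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
            nn.Softmax(dim=-1),   # output is a probability distribution over actions
        )

    def forward(self, state):
        return self.net(state)

agent = FeedforwardAgent()
probs = agent(torch.zeros(30))    # dummy state -> action probabilities
```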
Environment
This is the world that the agent is operating in. The goal of RL is to have an agent learn to act optimally in a given environment. For AI Arena, the environment is the battle arena - see
State
The state is a snapshot of the environment at any point in time. Agents use this observation to decide what to do. In other words, the state is the context used in the agent’s decision making process. To learn more see
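As a concrete (hypothetical) illustration, a state could be a fixed-length vector of features describing both fighters; the features below are made up for this example and do not reflect the actual AI Arena observation:

```python
import numpy as np

# Hypothetical state vector; the actual AI Arena observation features differ.
state = np.array([
    0.25,   # your x-position (normalized)
    0.80,   # opponent x-position (normalized)
    1.00,   # your remaining health fraction
    0.60,   # opponent remaining health fraction
    0.00,   # your vertical velocity
])
```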
Reward
Rewards are used to train the agent. If an action results in positive reward, the agent is incentivized to take that action more often. However, if an action results in a negative reward (punishment), then the agent takes that action less often. You can get creative and design any reward function you want to incentivize your agent!
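For example, a simple (hypothetical) reward function for a fighting game might reward dealing damage and penalize taking damage; the values here are arbitrary:

```python
# Hypothetical reward shaping for a fighting game; tune or replace as you see fit.
def compute_reward(damage_dealt, damage_taken, won_round):
    reward = damage_dealt - damage_taken  # encourage hitting, discourage getting hit
    if won_round:
        reward += 10.0                    # bonus for winning the round
    return reward
```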
Action
At each time step, the agent must decide what to do. Each decision the agent makes is called an action. In AI Arena, the agent chooses from the following actions (a simple way to represent this action space in code is sketched after the list):
- Run Left
- Run Right
- Single Punch
- Double Punch
- Defend
- Jump
- Jump Left
- Jump Right
- Jump Punch
- Low Kick
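One common representation is a list of action names indexed by the network's output, as in the sketch below; the ordering is illustrative and may not match the platform's internal encoding:

```python
# Discrete action space: the network outputs one value per action, and the chosen
# index is looked up in this list. The ordering here is illustrative.
ACTIONS = [
    "Run Left", "Run Right", "Single Punch", "Double Punch", "Defend",
    "Jump", "Jump Left", "Jump Right", "Jump Punch", "Low Kick",
]

def index_to_action(index):
    return ACTIONS[index]
```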
Training Methods
Generally speaking, RL algorithms can be separated into two main approaches: policy-based methods and value-based methods. Of course there are also hybrid approaches such as actor-critic methods, but we will focus on these two for now.
Policy-Based Methods
Models in this bucket map the state directly to a policy, i.e. a probability distribution over actions. As such, the goal of policy-based algorithms is to optimize the policy directly. Actions can then be selected from the policy in one of two common ways (both are sketched in code after the list):
- Probabilistic Sampling: At each time step an action is randomly selected, such that the probability of it being selected is determined by the output of the softmax layer (final layer) of the neural network.
- Argmax: The action with the highest value in the output layer is selected.
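A minimal sketch of both selection rules, assuming `probs` is the softmax output of the network:

```python
import numpy as np

# Example softmax output over the 10 actions (values are illustrative and sum to 1).
probs = np.array([0.05, 0.10, 0.40, 0.05, 0.05, 0.05, 0.05, 0.05, 0.10, 0.10])

# Probabilistic sampling: draw an action index according to its probability.
sampled_action = np.random.choice(len(probs), p=probs)

# Argmax: always take the most probable action.
greedy_action = int(np.argmax(probs))
```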
Value-Based Methods
Models in this bucket focus on mapping the state to the value of a given state or action. Some models focus on modelling the state value to determine which state is the best to move to next. Others focus on modelling the action value (i.e. how good it is to take a specific action in the given state). The goal of value-based algorithms is therefore to optimize the policy indirectly: first learn a value function, then construct a heuristic that maps the value function to a policy. Two common heuristics are:
- ε-Greedy: A random action is selected with probability ε; otherwise the model uses argmax to select the action (a short sketch follows this list).
- Argmax: The action with the highest value in the output layer is selected.
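A minimal sketch of ε-greedy selection over estimated action values (Q-values); the default epsilon value is an arbitrary choice:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon, explore by picking a random action;
    # otherwise exploit by picking the action with the highest estimated value.
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))
```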
RL Models in AI Arena
As of now, researchers can use policy-based methods and action-value (Q-value) methods on our platform. The one condition is that they use a feedforward neural network (