3.2 - Design - Observation, Action, and Reward
The performance and success of a reinforcement learning agent are dictated by the quality of its environment design. This document serves as the formal design specification for the three core components of our `WorkerEnv`: the Observation Space, the Action Space, and the Reward Function.
1. Observation Space Specification
The observation space defines the fixed-size vector of data that the environment provides to the agent on each step. It must contain all information necessary for the agent to make an informed decision for its given task.
- Gymnasium Type: `gymnasium.spaces.Box`
- Shape: `(3,)`
- Data Type: `numpy.float32`
| Index | Source Feature | Normalization | Purpose |
|---|---|---|---|
| 0 | `self.minerals` | Divide by 1000.0 | Informs the agent of its current purchasing power. |
| 1 | `self.workers.amount` | Divide by 50.0 | Provides a metric of progress toward the episode's goal. |
| 2 | `self.supply_left` | Divide by 20.0 | Signals whether the agent has the capacity for new units. |
Design Rationale:
- Minimalism: The state space is intentionally limited to only the features relevant to the "build worker" decision, focusing the learning process.
- Normalization: All inputs are scaled to a [0.0, 1.0] range. This is a standard and critical practice for stabilizing neural network training (see the sketch after this list).
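As a concrete illustration, here is a minimal sketch of how this specification could be expressed with Gymnasium and NumPy. The helper name `build_observation` and the clipping step are illustrative choices, not part of the specification above:

```python
import numpy as np
from gymnasium import spaces

# Observation space exactly as specified: three normalized features.
observation_space = spaces.Box(low=0.0, high=1.0, shape=(3,), dtype=np.float32)

def build_observation(minerals: float, worker_count: int, supply_left: int) -> np.ndarray:
    """Assemble the normalized observation vector (helper name is illustrative)."""
    obs = np.array(
        [
            minerals / 1000.0,      # index 0: purchasing power
            worker_count / 50.0,    # index 1: progress toward the worker goal
            supply_left / 20.0,     # index 2: room to train new units
        ],
        dtype=np.float32,
    )
    # Clipping is an assumption here: it keeps out-of-range raw values inside the Box bounds.
    return np.clip(obs, 0.0, 1.0)
```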
2. Action Space Specification
The action space defines the set of all possible choices the agent can make. For this task, we use a simple discrete space.
- Gymnasium Type: `gymnasium.spaces.Discrete`
- Size: `2`
| Action Value | Agent's Intent |
|---|---|
| 0 | Do Nothing: No command is issued. |
| 1 | Build Worker: Attempt to train one SCV. |
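A minimal sketch of the corresponding space definition and dispatch logic. The `apply_action` helper and the `try_train_worker` call are hypothetical stand-ins for whatever command the real environment issues to train an SCV:

```python
from gymnasium import spaces

DO_NOTHING = 0
BUILD_WORKER = 1

# Two discrete choices, as specified above.
action_space = spaces.Discrete(2)

def apply_action(action: int, game) -> None:
    """Translate the agent's integer action into a game command (names are illustrative)."""
    if action == BUILD_WORKER:
        # Hypothetical call standing in for the environment's actual train-SCV command.
        game.try_train_worker()
    # DO_NOTHING deliberately issues no command.
```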
3. Reward Function Specification
The reward function provides the learning signal to the agent. This design uses a combination of event-driven sparse rewards and state-based dense rewards.
Design Philosophy:
- Sparse Rewards for Correctness: Provide a large, immediate signal for "correct" or "incorrect" actions to quickly teach the rules and constraints of the environment.
- Dense Rewards for Progress: Provide a small, continuous signal to guide the agent toward the long-term goal and prevent passive, do-nothing behavior.
Pseudocode Logic:
    def get_reward(action, game_state):
        reward = 0.0
        # Sparse reward for the "Build Worker" action
        if action == 1:
            if game_state.can_train_worker:
                reward += 5    # correct action
            else:
                reward -= 5    # wasted/impossible action
        # Dense reward for progress toward the worker-count goal
        reward += game_state.worker_count * 0.1
        return reward
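As a quick sanity check of the values this logic produces, using an illustrative mock game state (the `SimpleNamespace` objects stand in for whatever state the real environment supplies):

```python
from types import SimpleNamespace

# Mock states for illustration only.
can_build = SimpleNamespace(can_train_worker=True, worker_count=12)
cannot_build = SimpleNamespace(can_train_worker=False, worker_count=12)

print(get_reward(1, can_build))     # 5 + 12 * 0.1 ≈ 6.2   (correct build)
print(get_reward(1, cannot_build))  # -5 + 12 * 0.1 ≈ -3.8 (wasted build attempt)
print(get_reward(0, can_build))     # 12 * 0.1 ≈ 1.2       (do nothing; dense reward only)
```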
This hybrid reward structure is designed to train the agent efficiently, providing strong feedback for immediate decisions while maintaining a clear signal for the overall episode objective.