3.2 - Design - Observation, Action, and Reward
The performance and success of a reinforcement learning agent are dictated by the quality of its environment design. This document serves as the formal design specification for the three core components of our `WorkerEnv`: the Observation Space, the Action Space, and the Reward Function.
1. Observation Space Specification
The observation space defines the fixed-size vector of data that the environment provides to the agent on each step. It must contain all information necessary for the agent to make an informed decision for its given task.
- Gymnasium Type: `gymnasium.spaces.Box`
- Shape: `(3,)`
- Data Type: `numpy.float32`
| Index | Source Feature | Normalization | Purpose |
|---|---|---|---|
| 0 | `self.minerals` | Divide by 1000.0 | Informs the agent of its current purchasing power. |
| 1 | `self.workers.amount` | Divide by 50.0 | Provides a metric of progress toward the episode's goal. |
| 2 | `self.supply_left` | Divide by 20.0 | Signals whether the agent has the capacity for new units. |
Design Rationale:
- Minimalism: The state space is intentionally limited to only the features relevant to the "build worker" decision, focusing the learning process.
- Normalization: All inputs are scaled to a [0.0, 1.0] range. This is a standard and critical practice for stabilizing neural network training (see the sketch after this list).
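As a concrete illustration, here is a minimal sketch of how this specification could be expressed with Gymnasium and NumPy. The helper name `build_observation` and the clipping step are illustrative choices, not part of the specification above:

```python
import numpy as np
from gymnasium import spaces

# Observation space exactly as specified: three normalized features.
observation_space = spaces.Box(low=0.0, high=1.0, shape=(3,), dtype=np.float32)

def build_observation(minerals: float, worker_count: int, supply_left: int) -> np.ndarray:
    """Assemble the normalized observation vector (helper name is illustrative)."""
    obs = np.array(
        [
            minerals / 1000.0,      # index 0: purchasing power
            worker_count / 50.0,    # index 1: progress toward the worker goal
            supply_left / 20.0,     # index 2: room to train new units
        ],
        dtype=np.float32,
    )
    # Clipping is an assumption here: it keeps out-of-range raw values inside the Box bounds.
    return np.clip(obs, 0.0, 1.0)
```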
2. Action Space Specification
The action space defines the set of all possible choices the agent can make. For this task, we use a simple discrete space.
- Gymnasium Type: `gymnasium.spaces.Discrete`
- Size: `2`
| Action Value | Agent's Intent |
|---|---|
| 0 | Do Nothing: No command is issued. |
| 1 | Build Worker: Attempt to train one SCV. |
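A minimal sketch of the corresponding space definition and dispatch logic. The `apply_action` helper and the `try_train_worker` call are hypothetical stand-ins for whatever command the real environment issues to train an SCV:

```python
from gymnasium import spaces

DO_NOTHING = 0
BUILD_WORKER = 1

# Two discrete choices, as specified above.
action_space = spaces.Discrete(2)

def apply_action(action: int, game) -> None:
    """Translate the agent's integer action into a game command (names are illustrative)."""
    if action == BUILD_WORKER:
        # Hypothetical call standing in for the environment's actual train-SCV command.
        game.try_train_worker()
    # DO_NOTHING deliberately issues no command.
```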
3. Reward Function Specification
The reward function provides the learning signal to the agent. This design uses a combination of event-driven sparse rewards and state-based dense rewards.
Design Philosophy:
- Sparse Rewards for Correctness: Provide a large, immediate signal for "correct" or "incorrect" actions to quickly teach the rules and constraints of the environment.
- Dense Rewards for Progress: Provide a small, continuous signal to guide the agent toward the long-term goal and prevent passive, do-nothing behavior.
Pseudocode Logic:
    def get_reward(action, game_state):
        reward = 0.0
        # Sparse reward for the "Build Worker" action
        if action == 1:
            if game_state.can_train_worker:
                reward += 5    # correct action
            else:
                reward -= 5    # wasted/impossible action
        # Dense reward for progress toward the worker-count goal
        reward += game_state.worker_count * 0.1
        return reward
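As a quick sanity check of the values this logic produces, using an illustrative mock game state (the `SimpleNamespace` objects stand in for whatever state the real environment supplies):

```python
from types import SimpleNamespace

# Mock states for illustration only.
can_build = SimpleNamespace(can_train_worker=True, worker_count=12)
cannot_build = SimpleNamespace(can_train_worker=False, worker_count=12)

print(get_reward(1, can_build))     # 5 + 12 * 0.1 ≈ 6.2   (correct build)
print(get_reward(1, cannot_build))  # -5 + 12 * 0.1 ≈ -3.8 (wasted build attempt)
print(get_reward(0, can_build))     # 12 * 0.1 ≈ 1.2       (do nothing; dense reward only)
```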
This hybrid reward structure is designed to train the agent efficiently, providing strong feedback for immediate decisions while maintaining a clear signal for the overall episode objective.