4.1 - Goal - Learning an Economic Trade-off

Having successfully trained an agent on a single task, we now introduce a core element of strategy games: decision-making under constraints. Our second agent will learn to manage the fundamental economic trade-off between building more workers (increasing income) and building more supply (increasing capacity).

This task is designed to teach the agent to prioritize actions based on a dynamic game state, moving beyond a single, repetitive goal.

The Conceptual Challenge - Balancing Competing Goods

The agent must learn that both actions are "good" but that one is often more critical than the other depending on the situation.

+------------------------+      +-----------------------------+
|   Action 1: Build SCV  |      |  Action 2: Build Depot      |
+------------------------+      +-----------------------------+
|        PROS:           |      |          PROS:              |
|  + Increases Income    |      |  + Unlocks Army Growth      |
|        CONS:           |      |          CONS:              |
|  - Consumes Supply     |      |  - Does not increase Income |
|                        |      |                             |
+-----------^------------+      +-------------^---------------+
            |                                |
            |      WHICH ONE IS BETTER       |
            |          RIGHT NOW?            |
            +--------------------------------+
                          |
            +-------------v---------------+
            |      The Agent's Policy     |
            +-----------------------------+

Success Criteria

The agent must learn to continuously produce workers.
The agent must learn to proactively build Supply Depots to avoid being supply-blocked.
The agent must learn to prioritize depot construction when supply_left (supply_cap minus supply_used) is low, even if it has enough minerals to build a worker.
The episode must terminate successfully upon reaching a target of 30 workers.
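
As a reference point for these criteria, the following is a minimal sketch of a hand-written baseline that a trained policy should roughly match. It is not part of the environment. The mineral costs, the `LOW_SUPPLY` threshold, and the function names are illustrative assumptions; the action codes 0, 1, and 2 follow the action space specification in this section.

```python
# Hypothetical scripted baseline; costs and threshold are illustrative assumptions.
WORKER_COST = 50      # assumed mineral cost of one SCV
DEPOT_COST = 100      # assumed mineral cost of one Supply Depot
LOW_SUPPLY = 2        # assumed threshold for "supply_left is low"
TARGET_WORKERS = 30   # termination target from the success criteria

DO_NOTHING, BUILD_WORKER, BUILD_DEPOT = 0, 1, 2


def baseline_action(minerals, supply_used, supply_cap):
    """Pick an action by hand in the spirit of the success criteria."""
    supply_left = supply_cap - supply_used
    if supply_left <= LOW_SUPPLY:
        # Depot takes priority when supply is nearly exhausted, even if a
        # worker is affordable; save minerals if the depot is not yet affordable.
        return BUILD_DEPOT if minerals >= DEPOT_COST else DO_NOTHING
    if minerals >= WORKER_COST:
        # Otherwise keep producing workers.
        return BUILD_WORKER
    return DO_NOTHING


def is_done(worker_count):
    """The episode ends successfully once the worker target is reached."""
    return worker_count >= TARGET_WORKERS
```

A learned policy is not required to reproduce this rule exactly, but comparing its choices against such a baseline is one simple way to check whether it has picked up the depot-first behavior.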

Environment Design

1. Observation Space Specification

To make an informed decision, the agent now needs full visibility into the supply situation.

Gymnasium Type: gymnasium.spaces.Box
Shape: (4,)
Data Type: numpy.float32

Index	Source Feature	Normalization (divide by)	Purpose
`0`	`self.minerals`	`1000.0`	"Can I afford my chosen action?"
`1`	`self.workers.amount`	`50.0`	"How is my economic progress?"
`2`	`self.supply_used`	`200.0`	"How much pressure is on my supply?"
`3`	`self.supply_cap`	`200.0`	"What is my current supply limit?"
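
The table can be turned into an observation vector along the following lines. This is a sketch only: `make_observation` and `OBS_SCALE` are hypothetical names, and clipping the result to [0, 1] is an assumption, since the specification gives only the divisors.

```python
import numpy as np

# Divisors from the table, in index order:
# minerals, workers, supply_used, supply_cap.
OBS_SCALE = np.array([1000.0, 50.0, 200.0, 200.0], dtype=np.float32)


def make_observation(minerals, workers, supply_used, supply_cap):
    """Return the normalized 4-element observation as float32."""
    raw = np.array([minerals, workers, supply_used, supply_cap], dtype=np.float32)
    # Clipping is an assumption: it keeps values bounded if, for example,
    # minerals exceed the 1000.0 divisor.
    return np.clip(raw / OBS_SCALE, 0.0, 1.0)


obs = make_observation(minerals=250, workers=12, supply_used=14, supply_cap=15)
print(obs.shape, obs.dtype)
```

Note that supply_left is not an explicit feature; the agent has to infer it from the difference between indices 3 and 2.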

2. Action Space Specification

The action space is expanded to allow for the new building choice.

Gymnasium Type: gymnasium.spaces.Discrete
Size: 3

Action Value	Agent's Intent
`0`	Do Nothing
`1`	Build Worker (SCV)
`2`	Build Supply (Supply Depot)
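
One way to keep these integer codes readable inside the environment is an `IntEnum`. The sketch below is illustrative; the names `Action` and `decode_action` are assumptions and do not come from the environment code.

```python
from enum import IntEnum


class Action(IntEnum):
    """Integer codes from the action table."""
    DO_NOTHING = 0
    BUILD_WORKER = 1
    BUILD_SUPPLY = 2


def decode_action(value):
    """Convert a raw policy output into an Action.

    Raises ValueError for anything outside 0..2.
    """
    return Action(int(value))
```

Because `IntEnum` members compare equal to plain integers, a check such as `action == 1` continues to work unchanged.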

3. Reward Function Specification

The reward function is designed to heavily punish the primary failure state (getting supply-blocked) while lightly encouraging all productive actions.

Pseudocode Logic:

function get_reward(action, game_state, last_supply_left):
  reward = 0

  # Event-Based Penalty for a critical failure
  if game_state.supply_left == 0 and last_supply_left > 0:
    reward -= 10

  # Sparse Rewards for productive actions
  if action == 1 and successfully_trained_worker:
    reward += 1
  elif action == 2 and successfully_built_depot:
    reward += 2  # Higher than the worker reward, to incentivize this crucial task

  return reward

This design forces the agent to develop a more sophisticated policy. It cannot simply learn to always build workers; it must learn to pay attention to its supply and act preemptively to avoid the large penalty.
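
To see the incentive in numbers, here is a runnable Python sketch of the pseudocode above. The `GameState` dataclass and the `trained_worker` / `built_depot` flags are hypothetical stand-ins for whatever the real environment uses to report supply and action success.

```python
from dataclasses import dataclass


@dataclass
class GameState:
    supply_left: int  # stand-in for the game's remaining supply


def get_reward(action, game_state, last_supply_left,
               trained_worker=False, built_depot=False):
    reward = 0

    # Event-based penalty: fires only on the step where supply runs out.
    if game_state.supply_left == 0 and last_supply_left > 0:
        reward -= 10

    # Sparse rewards for productive actions.
    if action == 1 and trained_worker:
        reward += 1
    elif action == 2 and built_depot:
        reward += 2

    return reward


# Training the worker that uses up the last supply: +1 - 10 = -9.
blocked = get_reward(1, GameState(supply_left=0), last_supply_left=1,
                     trained_worker=True)
# Building a depot while one supply still remains: +2.
safe = get_reward(2, GameState(supply_left=1), last_supply_left=1,
                  built_depot=True)
print(blocked, safe)  # -9 2
```

Under these values, a single supply block cancels the reward from ten trained workers, which is what pushes the policy toward building depots ahead of time. The penalty is event-based, so staying blocked on later steps is not penalized again.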
