21 Reinforcement Learning
computer-vision
2025-01-23
A general introduction to reinforcement learning and its applications in deep neural networks.
@Credits: EECS 498.007 | Video Lecture: UM-CV
Personal work for the course assignments: GitHub repo.
Notice on Usage and Attribution
These are personal class notes based on the University of Michigan EECS 498.008 / 598.008 course. They are intended solely for personal learning and academic discussion, with no commercial use.
For detailed information, please refer to the complete notice at the end of this document.
Paradigm of RL
Terms: Environment - Agent - State - Action - Reward

Examples
- Cart-Pole Problem




Reinforcement Learning vs Supervised Learning
Why is RL different from normal supervised learning?

- Stochasticity: Rewards and state transitions may be random; in supervised learning the data and loss are fixed, and we simply run gradient descent to lower the loss.
- Credit assignment: The reward $r_t$ may not directly depend on the action $a_t$.
- Non-differentiable: We can't backprop through the world, so we can't compute $dr_t/da_t$.
- Non-stationary: What the agent experiences depends on how it acts, so the environment (the distribution of data) changes as the policy changes.
Fundamentally more challenging than any supervised learning approach!
Markov Decision Process (MDP)
Mathematical formulation: a tuple $(S, A, R, P, \gamma)$
- $S$: Set of possible states
- $A$: Set of possible actions
- $R$: Distribution of reward given a (state, action) pair
- $P$: Transition probabilities: distribution over the next state given a (state, action) pair
- $\gamma$: Discount factor (tradeoff between present and future rewards)
These satisfy the Markov property: the current state completely characterizes the state of the world, so rewards and next states depend only on the current state, not on the full history.
The agent executes a policy $\pi$ giving a distribution over actions conditioned on states.
Goal: Find the policy $\pi^*$ that maximizes the cumulative discounted reward $\sum_t \gamma^t r_t$.
- At time step $t=0$, the environment samples the initial state $s_0 \sim p(s_0)$
- Then, for $t=0$ until done:
  - The agent selects an action $a_t \sim \pi(a \mid s_t)$
  - The environment samples a reward $r_t \sim R(r \mid s_t, a_t)$
  - The environment samples the next state $s_{t+1} \sim P(s \mid s_t, a_t)$
  - The agent receives reward $r_t$ and next state $s_{t+1}$
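A minimal sketch of this interaction loop in Python, using a made-up two-state MDP and a uniformly random policy (the tables `P` and `R` and all names below are illustrative assumptions, not part of the course material):

```python
import random

# Toy MDP (illustrative): P[s][a] lists (next_state, probability); R[s][a] is the reward.
states, actions, gamma = [0, 1], [0, 1], 0.9
P = {0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)],           1: [(1, 0.8), (0, 0.2)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}

def sample_next_state(s, a):
    next_states, probs = zip(*P[s][a])
    return random.choices(next_states, weights=probs)[0]

def random_policy(s):
    # pi(a | s): uniform over actions, independent of the state.
    return random.choice(actions)

s = random.choice(states)             # s_0 ~ p(s_0)
discounted_return = 0.0
for t in range(20):                   # "until done" (here: a fixed horizon)
    a = random_policy(s)              # a_t ~ pi(a | s_t)
    r = R[s][a]                       # r_t ~ R(r | s_t, a_t) (deterministic here)
    s = sample_next_state(s, a)       # s_{t+1} ~ P(s | s_t, a_t)
    discounted_return += (gamma ** t) * r

print("Sampled discounted return:", discounted_return)
```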
Finding Optimal Policies
Goal: Find the optimal policy $\pi^*$ that maximizes the (discounted) sum of rewards.
Problem: Lots of randomness! Initial state, transition probabilities, rewards.
Solution: Maximize the expected sum of rewards:
$$\pi^* = \arg\max_\pi \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, \pi\right]$$
with $s_0 \sim p(s_0)$, $a_t \sim \pi(a \mid s_t)$, $s_{t+1} \sim P(s \mid s_t, a_t)$.
Value Function and Q Function
Following a policy $\pi$ produces sample trajectories (or paths):
$$s_0, a_0, r_0, s_1, a_1, r_1, \ldots$$
How good is a state? The value function at state $s$ is the expected cumulative reward from following the policy starting from state $s$:
$$V^\pi(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, \pi\right]$$
How good is a state-action pair? The Q function at state $s$ and action $a$ is the expected cumulative reward from taking action $a$ in state $s$ and then following the policy:
$$Q^\pi(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]$$
By the law of total expectation, we can express the value function in terms of the Q function:
$$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$$
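As a quick numerical check of this identity, a tiny sketch for a single state (the policy probabilities and Q-values below are made-up illustrative numbers):

```python
# Check V^pi(s) = sum_a pi(a|s) * Q^pi(s, a) for one state, with made-up numbers.
pi_given_s = {"left": 0.3, "right": 0.7}   # pi(a | s)
q_given_s = {"left": 1.0, "right": 2.0}    # Q^pi(s, a)

v_s = sum(pi_given_s[a] * q_given_s[a] for a in pi_given_s)
print(v_s)  # 0.3 * 1.0 + 0.7 * 2.0 = 1.7
```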
Optimal Q-function: $Q^*$ is the Q-function for the optimal policy $\pi^*$:
$$Q^*(s, a) = \max_\pi Q^\pi(s, a)$$
$Q^*$ encodes the optimal policy: $\pi^*(s) = \arg\max_{a'} Q^*(s, a')$.
Bellman Equation
$Q^*$ satisfies the following recurrence relation:
$$Q^*(s, a) = \mathbb{E}_{r, s'}\left[r + \gamma \max_{a'} Q^*(s', a')\right]$$
where $r \sim R(r \mid s, a)$ and $s' \sim P(s' \mid s, a)$.
Idea: If we find a function $Q$ that satisfies the Bellman equation, then it must be $Q^*$. Start with a random $Q$ and iteratively update it to satisfy the Bellman equation:
$$Q_{k+1}(s, a) = \mathbb{E}_{r, s'}\left[r + \gamma \max_{a'} Q_k(s', a')\right]$$
where $r \sim R(r \mid s, a)$ and $s' \sim P(s' \mid s, a)$.
Amazing fact: This iterative process converges to $Q^*$!
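A minimal sketch of this iteration on a tiny tabular MDP (reusing the made-up `P` and `R` tables from the earlier sketch; none of this is the course's reference code):

```python
# Tabular Q-iteration: Q_{k+1}(s,a) = E_{s'}[ r + gamma * max_a' Q_k(s', a') ].
states, actions, gamma = [0, 1], [0, 1], 0.9
P = {0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)],           1: [(1, 0.8), (0, 0.2)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}

Q = {s: {a: 0.0 for a in actions} for s in states}    # start from an arbitrary Q
for k in range(100):                                   # iterate toward convergence
    Q = {s: {a: sum(p * (R[s][a] + gamma * max(Q[s2].values()))
                    for s2, p in P[s][a])
             for a in actions}
         for s in states}

# The converged Q encodes the greedy policy pi*(s) = argmax_a Q*(s, a).
policy = {s: max(Q[s], key=Q[s].get) for s in states}
print(Q, policy)
```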
Problem: We need to keep track of $Q(s, a)$ for every state-action pair, which is impossible if the state space is infinite.
Solution: Use a function approximator (e.g., a neural network) to estimate $Q(s, a)$, and use the Bellman equation to define the loss.
Algorithms for Reinforcement Learning
Deep Q-Learning
Train a neural network to approximate $Q^*(s, a) \approx Q(s, a; \theta)$.
Use the Bellman equation to define what $Q$ should output for a given state and action:
$$y_{s, a, \theta} = \mathbb{E}_{r, s'}\left[r + \gamma \max_{a'} Q(s', a'; \theta)\right]$$
where $r \sim R(r \mid s, a)$ and $s' \sim P(s' \mid s, a)$.
Then minimize the squared error to train $Q$:
$$L(s, a) = \left(y_{s, a, \theta} - Q(s, a; \theta)\right)^2$$
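A minimal PyTorch-style sketch of one training step with this loss for a batch of transitions (the MLP architecture, tensor shapes, and function names are assumptions for illustration, not the course's implementation):

```python
import torch
import torch.nn as nn

# Assumed setup: a small MLP maps a state vector to one Q-value per discrete action.
state_dim, num_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def q_learning_step(states, actions, rewards, next_states, dones):
    """One gradient step on L = (y - Q(s, a; theta))^2 for a batch of transitions."""
    # Bellman target y = r + gamma * max_a' Q(s', a'; theta). Note it is computed with
    # the *current* weights theta -- the nonstationarity problem mentioned just below.
    with torch.no_grad():
        max_next_q = q_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * max_next_q
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a; theta)
    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Random tensors standing in for a sampled batch of transitions.
B = 32
q_learning_step(torch.randn(B, state_dim), torch.randint(num_actions, (B,)),
                torch.randn(B), torch.randn(B, state_dim), torch.zeros(B))
```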
Problem: Nonstationary! The target for $Q(s, a)$ depends on the current weights $\theta$.
Problem: How do we sample batches of data for training?
For some problems the Q-function can be a hard function to learn; it can be easier to learn a mapping from states to actions directly (the policy).
Policy Gradients
Directly optimize the policy $\pi(a \mid s; \theta)$ to maximize the expected reward.
Train a network $\pi(a \mid s; \theta)$ that takes the state $s$ as input and outputs a distribution over actions.
Objective function: the expected reward under the policy $\pi_\theta$:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t \ge 0} \gamma^t r_t\right]$$
Find the optimal policy $\theta^* = \arg\max_\theta J(\theta)$ by gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
Problem: Non-differentiable! How do we compute $\nabla_\theta J(\theta)$?
General idea: write $J(\theta) = \mathbb{E}_{x \sim p_\theta(x)}[f(x)]$; we want to compute $\nabla_\theta J(\theta)$.
$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim p_\theta(x)}\left[f(x)\, \nabla_\theta \log p_\theta(x)\right]$$
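This identity is the standard log-derivative (score function) trick; a short derivation, added here for completeness rather than quoted from the slides:
$$\nabla_\theta J(\theta)
  = \nabla_\theta \int p_\theta(x)\, f(x)\, dx
  = \int f(x)\, \nabla_\theta p_\theta(x)\, dx
  = \int f(x)\, p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, dx
  = \mathbb{E}_{x \sim p_\theta(x)}\left[f(x)\, \nabla_\theta \log p_\theta(x)\right],$$
using $\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)$.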
Approximate via sampling!
REINFORCE Algorithm
- Collect a trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ by following the policy $\pi_\theta$. The trajectory is random: $x \sim p_\theta(x)$:
$$p_\theta(x) = \prod_{t \ge 0} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$
$$\Rightarrow \log p_\theta(x) = \sum_{t \ge 0} \log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)$$
$$\Rightarrow \nabla_\theta \log p_\theta(x) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
(the transition probabilities do not depend on $\theta$, so their gradient vanishes).
Expected reward under $\pi_\theta$:
$$J(\theta) = \mathbb{E}_{x \sim p_\theta(x)}[f(x)]$$
$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim p_\theta(x)}\left[f(x)\, \nabla_\theta \log p_\theta(x)\right]$$
1. Initialize random weights $\theta$
2. Collect trajectories $x$ and rewards $f(x)$ using the policy $\pi_\theta$
3. Compute $\nabla_\theta J(\theta)$ from the samples
4. Gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
5. Go to step 2
Intuition: When $f(x)$ is high, increase the probability of the actions we took; when $f(x)$ is low, decrease the probability of the actions we took.
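A minimal PyTorch-style sketch of one REINFORCE update, assuming a hypothetical environment with `env.reset() -> state` and `env.step(action) -> (state, reward, done)`; the network, hyperparameters, and interface are illustrative assumptions, not the course's reference code:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Assumed setup: a small policy network mapping a state vector to action logits.
state_dim, num_actions, gamma = 4, 2, 0.99
policy_net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, num_actions))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def reinforce_update(env, max_steps=200):
    """Collect one trajectory with pi_theta, then take one gradient ascent step on
    J(theta) using grad J ~= f(x) * sum_t grad log pi_theta(a_t | s_t)."""
    log_probs, rewards = [], []
    s = env.reset()
    for _ in range(max_steps):
        dist = Categorical(logits=policy_net(torch.as_tensor(s, dtype=torch.float32)))
        a = dist.sample()                     # a_t ~ pi_theta(a | s_t)
        log_probs.append(dist.log_prob(a))    # log pi_theta(a_t | s_t)
        s, r, done = env.step(a.item())       # hypothetical env interface
        rewards.append(r)
        if done:
            break
    # f(x): the discounted return of the whole trajectory.
    ret = sum((gamma ** t) * r for t, r in enumerate(rewards))
    # Minimizing -J(theta) with gradient descent performs gradient ascent on J(theta).
    loss = -ret * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return ret
```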
Recap

Other Approaches
Model-based RL, Actor-Critic, Trust Region Policy Optimization, Proximal Policy Optimization, etc.
Actor-Critic: Train an actor that predicts actions (like policy gradients) and a critic that predicts value functions (like Q-learning).
Model-based: Learn a model of the world's state transition function and then use planning to find the optimal policy.
Inverse Reinforcement Learning: Gather data of experts performing in environment; learn a reward function that they seem to be optimizing, then use RL on that reward function.
Adversarial Learning: Learn to fool a discriminator that tries to distinguish between real and fake trajectories.
Examples
- AlphaGo (2016): Policy network (actor) and value network (critic) trained with policy gradients.
- AlphaGo Zero (2017): Reinforcement learning with self-play and Monte Carlo Tree Search.
- AlphaZero (2018): Generalized to other games: chess and shogi.
- MuZero (2019): Plans through a learned model of the game.
- OpenAI Five (Apr 2019): Large-scale Proximal Policy Optimization (PPO) applied to Dota 2.
Stochastic Computation Graphs


Notice on Usage and Attribution
This note is based on the University of Michigan's publicly available course EECS 498.008 / 598.008 and is intended solely for personal learning and academic discussion.
- Nature of the Notes: These notes include extensive references and citations from course materials to ensure clarity and completeness. However, they are presented as personal interpretations and summaries, not as substitutes for the original course content. Please refer to the official University of Michigan website for complete and accurate course materials.
- Third-Party Open Access Content: This note may reference Open Access (OA) papers or resources cited within the course materials. These materials are used under their original Open Access licenses (e.g., CC BY, CC BY-SA). Every referenced OA resource is appropriately cited, including the author, publication title, source link, and license type.
- Copyright: All rights to third-party content remain with their respective authors or publishers. If you believe any content infringes on your copyright, please contact me, and I will promptly remove the content in question.
Thanks to the University of Michigan and the contributors to the course for their openness and dedication to accessible education.