Markov Decision Process

An MDP is used in a fully observable but Stochastic Environment. This means the Agent always knows which state it is in, but the outcomes of its Actions are not deterministic.

It consists of (a small code sketch of these components follows the list)

  • a set of states $S$
  • an initial state $s_0$
  • a set of actions $A(s)$ for each state
  • a Transition Model $P(s' \mid s, a)$
  • a reward function with a reward $R(s)$ for each state
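
As a concrete illustration of these components, here is a minimal sketch of how such an MDP could be written down in Python. The three-state toy world, the names `states`, `actions`, `P`, `R` and the concrete probabilities are assumptions made up for illustration, not part of the notes.

```python
# Hypothetical toy MDP: two ordinary states and one absorbing state.

states = ["A", "B", "TERMINAL"]
initial_state = "A"

actions = {              # set of actions available in each state
    "A": ["stay", "go"],
    "B": ["stay", "go"],
    "TERMINAL": [],      # absorbing state: no actions
}

# Transition Model: P[(s, a)] maps each successor state s' to P(s' | s, a).
# Actions are stochastic, e.g. "go" only succeeds with probability 0.8.
P = {
    ("A", "stay"): {"A": 0.9, "B": 0.1},
    ("A", "go"):   {"B": 0.8, "A": 0.2},
    ("B", "stay"): {"B": 0.9, "A": 0.1},
    ("B", "go"):   {"TERMINAL": 0.8, "B": 0.2},
}

# Reward function: reward R(s) for being in each state.
R = {"A": -0.04, "B": -0.04, "TERMINAL": 1.0}
```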

We only look at Markovian transition models and additive reward functions.

The Optimal Policy depends on the reward for each state. There is always a risk and reward tradeoff.

![[CleanShot 2023-09-29 at 09.31.51@2x.png]]

Fully Observable MDPs

Now, to work with a Markov Decision Process, we need utility over time, which we introduce via Preferences on Reward Sequences. Combining rewards additively into a utility function becomes a problem when we consider the infinite lifetime of an Agent: the utility becomes infinite.

To fix this we could

  • terminate the utility computation at a fixed time (finite horizon)
  • introduce absorbing states which, under any policy, will end the Agent's life
  • use a discounting factor $\gamma < 1$ to dampen rewards that lie far in the future (see the bound sketched below)
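
A quick sketch of why discounting keeps the utility finite, assuming the rewards are bounded by some $R_{\max}$ and $0 \le \gamma < 1$:

$$
U([s_0, s_1, s_2, \ldots]) = \sum_{t=0}^{\infty} \gamma^t R(s_t) \;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1-\gamma}
$$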

For all of these ideas we want the Optimal Policy to yield a constant average reward per time step after some time. Currently we still only have a utility over state sequences, but we want a utility for a single state. To do this we use $U^\pi(s)$: the Expected Utility we get by executing the policy $\pi$ starting from state $s$ until the agent terminates. We can now define the Optimal Policy as the policy with the maximum Expected Utility in the current state. It is important to note that this Optimal Policy is independent of the current state, so we can even say that it is optimal for all states. This however does not hold for finite-horizon policies.
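
In symbols, using the notation from above (the expectation is taken over the stochastic transitions):

$$
U^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(S_t)\right] \text{ with } S_0 = s, \qquad \pi^*_s = \arg\max_\pi U^\pi(s)
$$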

We could say $R(s)$ is the short-term reward for taking one Action in a state, while $U(s)$ is the long-term reward, as it accounts for a whole sequence of actions that might lead to a higher total reward than only the next action.

We can observe a simple relationship between neighboring states, which gives us the Bellman Equation: $U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$. This equation is recursive, so we can use an iterative algorithm to approximate the utility function with it. This is done in the Value Iteration Algorithm (see the sketch below).
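
A minimal Value Iteration sketch in Python, assuming a toy MDP in the style of the one defined earlier (the definitions are repeated so the snippet runs on its own; names, numbers and the stopping threshold are illustrative assumptions):

```python
# Value Iteration: repeatedly apply the Bellman update until the utilities stabilize.

gamma = 0.9      # discounting factor
epsilon = 1e-6   # maximum allowed error in the final utilities

states = ["A", "B", "TERMINAL"]
actions = {"A": ["stay", "go"], "B": ["stay", "go"], "TERMINAL": []}
P = {
    ("A", "stay"): {"A": 0.9, "B": 0.1},
    ("A", "go"):   {"B": 0.8, "A": 0.2},
    ("B", "stay"): {"B": 0.9, "A": 0.1},
    ("B", "go"):   {"TERMINAL": 0.8, "B": 0.2},
}
R = {"A": -0.04, "B": -0.04, "TERMINAL": 1.0}

U = {s: 0.0 for s in states}
while True:
    delta = 0.0
    U_new = {}
    for s in states:
        if not actions[s]:
            U_new[s] = R[s]   # absorbing state: no successors to look at
        else:
            # Bellman update: U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) * U(s')
            U_new[s] = R[s] + gamma * max(
                sum(p * U[s2] for s2, p in P[(s, a)].items()) for a in actions[s]
            )
        delta = max(delta, abs(U_new[s] - U[s]))
    U = U_new
    if delta < epsilon * (1 - gamma) / gamma:   # common stopping criterion
        break

print(U)   # approximate utility of every state
```

Extracting a policy afterwards is one more maximization over the actions in each state, using the approximated utilities.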

The Policy Iteration Algorithm goes a different route and approximates the Optimal Policy instead.
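
A minimal Policy Iteration sketch under the same assumptions, alternating approximate policy evaluation with greedy policy improvement until the policy stops changing:

```python
# Policy Iteration: evaluate the current policy, then improve it greedily.

gamma = 0.9

states = ["A", "B", "TERMINAL"]
actions = {"A": ["stay", "go"], "B": ["stay", "go"], "TERMINAL": []}
P = {
    ("A", "stay"): {"A": 0.9, "B": 0.1},
    ("A", "go"):   {"B": 0.8, "A": 0.2},
    ("B", "stay"): {"B": 0.9, "A": 0.1},
    ("B", "go"):   {"TERMINAL": 0.8, "B": 0.2},
}
R = {"A": -0.04, "B": -0.04, "TERMINAL": 1.0}

def expected_utility(s, a, U):
    """Expected utility of doing action a in state s, given utilities U."""
    return sum(p * U[s2] for s2, p in P[(s, a)].items())

def evaluate(policy, sweeps=50):
    """Approximate U^pi with repeated simplified Bellman updates (action fixed by the policy)."""
    U = {s: 0.0 for s in states}
    for _ in range(sweeps):
        U = {
            s: R[s] + (gamma * expected_utility(s, policy[s], U) if policy[s] else 0.0)
            for s in states
        }
    return U

# Start from an arbitrary policy and improve it until no state changes its action.
policy = {s: (actions[s][0] if actions[s] else None) for s in states}
while True:
    U = evaluate(policy)
    changed = False
    for s in states:
        if not actions[s]:
            continue
        best = max(actions[s], key=lambda a: expected_utility(s, a, U))
        if best != policy[s]:
            policy[s] = best
            changed = True
    if not changed:
        break

print(policy, U)
```

Running only a fixed number of evaluation sweeps instead of solving the policy's utilities exactly already points towards the combination of the two algorithms mentioned next.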

A combination of Value Iteration and Policy Iteration often converges a lot faster.

Partially Observable MDPs

Partially Observable MDP