Learning by Example

Consider the following MDP with state space S = {A, B, C, D, E, F} and action space A = {left, right, up, down, stay}. Notice that C and F connect back to A and D, respectively. However, we do not know the transition dynamics or the reward function (we do not know what the resulting next state and reward are after applying an action in a state).

[Figure: a 2 x 3 grid of states, top row A B C, bottom row D E F.]

1. We are now given a policy π and would like to determine how good it is using Temporal Difference Learning with α = 0.25 and γ = 1. We run it in the environment and observe the following transitions. After observing each transition, we update the value function, which is initialized to 0 for every state. Fill in the blanks with the corresponding values of the utility function after these updates.

   Episode Number   State   Action   Reward   Next State
   1                A       right     12      B
   2                B       right      4      C
   3                B       down     -12      E
   4                C       down     -16      F
   5                F       stay       4      F
   6                C       down      -9      F

   State   U^π(state)
   A       ____
   B       ____
   C       ____
   D       ____
   E       ____
   F       ____
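The blanks can be checked with a short script. Below is a minimal Python sketch (not part of the original worksheet) that applies the standard TD(0) sample update, U(s) <- (1 - α) U(s) + α (r + γ U(s')), to the observed transitions with α = 0.25 and γ = 1. The state labels and variable names are illustrative assumptions, not prescribed by the problem.

```python
# Minimal sketch (illustrative): TD(0) policy evaluation on the observed
# transitions, with U initialized to 0 for every state.

ALPHA = 0.25   # learning rate alpha
GAMMA = 1.0    # discount factor gamma

# Observed transitions from the table above: (state, action, reward, next_state)
transitions = [
    ("A", "right",  12, "B"),
    ("B", "right",   4, "C"),
    ("B", "down",  -12, "E"),
    ("C", "down",  -16, "F"),
    ("F", "stay",    4, "F"),
    ("C", "down",   -9, "F"),
]

# Value estimates, initially 0 for every state.
U = {s: 0.0 for s in "ABCDEF"}

for s, _action, r, s_next in transitions:
    # TD(0) sample update: U(s) <- (1 - alpha) * U(s) + alpha * (r + gamma * U(s'))
    U[s] = (1 - ALPHA) * U[s] + ALPHA * (r + GAMMA * U[s_next])

for s in "ABCDEF":
    print(f"U({s}) = {U[s]}")
```

Running the loop in order reproduces the six updates described in the problem; states D and E are never the source of a transition, so their estimates stay at the initial value.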