Show that the maximum likelihood estimator of the safe arm given the revealed rewards is determined by the sign of 2G₁ − 2G₂ + s₂ − s₁ (a positive sign corresponds to arm 1 and a negative sign corresponds to arm 2), where G₁ and G₂ are the cumulative revealed rewards of arm 1 and arm 2 respectively, and s₁ and s₂ are the total number of times arm 1 and arm 2 respectively were previously chosen by the player.

[Hint: by the independence assumptions you can write down the probability pᵢ of observing G₁ ones and s₁ − G₁ zeros from arm 1 and G₂ ones and s₂ − G₂ zeros from arm 2, assuming arm i is safe, using the probability mass function of a binomial random variable; then consider the ratio p₁/p₂.]
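As an informal numerical check (not part of the exercise), the sketch below compares the two binomial likelihoods from the hint against the sign of 2G₁ − 2G₂ + s₂ − s₁. It assumes the safe arm pays 1 with some probability q > 1/2 and, since a₁ = 1 − a₂, the other arm pays 1 with probability 1 − q; the value q = 0.7 and the variable names are illustrative only.

```python
import math
import random

def binom_pmf(n, k, p):
    """Probability of observing k ones in n Bernoulli(p) pulls."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def likelihood(safe_arm, q, s1, g1, s2, g2):
    """Joint probability of the revealed rewards assuming `safe_arm` is safe.

    The safe arm is assumed Bernoulli(q); because a1 = 1 - a2,
    the other arm is then Bernoulli(1 - q)."""
    p_arm1 = q if safe_arm == 1 else 1 - q
    return binom_pmf(s1, g1, p_arm1) * binom_pmf(s2, g2, 1 - p_arm1)

# Check on random histories that the likelihood comparison agrees with
# the sign of 2*G1 - 2*G2 + s2 - s1 (positive -> arm 1, negative -> arm 2).
q = 0.7                       # assumed success probability of the safe arm (q > 1/2)
random.seed(0)
for _ in range(1000):
    s1, s2 = random.randint(1, 20), random.randint(1, 20)
    g1, g2 = random.randint(0, s1), random.randint(0, s2)
    stat = 2 * g1 - 2 * g2 + s2 - s1
    if stat == 0:
        continue              # tie: both arms are equally likely
    mle = 1 if likelihood(1, q, s1, g1, s2, g2) > likelihood(2, q, s1, g1, s2, g2) else 2
    assert mle == (1 if stat > 0 else 2)
```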

Note that the policy which chooses the arm according to this MLE entails no exploration (it is a greedy policy).
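For concreteness, a minimal simulation of this greedy policy might look as follows. It assumes, purely for illustration, that arm 1 is the safe arm with success probability q = 0.7, that arm 2's reward is 1 minus arm 1's reward, and that ties in the sign statistic are broken uniformly at random; none of these choices are specified by the exercise.

```python
import random

def greedy_mle_policy(q=0.7, horizon=100, seed=1):
    """Greedy policy: always pull the current MLE of the safe arm."""
    rng = random.Random(seed)
    g = [0, 0]   # cumulative revealed rewards G1, G2
    s = [0, 0]   # pull counts s1, s2
    total = 0
    for _ in range(horizon):
        stat = 2 * g[0] - 2 * g[1] + s[1] - s[0]
        arm = 0 if stat > 0 else 1 if stat < 0 else rng.choice([0, 1])
        a1 = 1 if rng.random() < q else 0     # arm 1's reward this round (arm 1 assumed safe)
        reward = a1 if arm == 0 else 1 - a1   # arm 2 pays 1 - a1
        g[arm] += reward
        s[arm] += 1
        total += reward
    return total

print(greedy_mle_policy())
```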

Denote by a₁ and a₂ the Bernoulli random variables distributed according to the distributions of arm 1 and arm 2 respectively. Given that in this problem a₁ = 1 − a₂, can you suggest a simple procedure for converting a sample of reward from arm 1 into a sample of reward from arm 2? Can you (informally) argue that, as a consequence, exploration is not needed in this simplified problem?
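A minimal sketch of one such conversion, assuming rewards are observed only for the pulled arm: since a₁ = 1 − a₂ holds deterministically, every observed reward from one arm immediately reveals what the other arm would have paid, so a single pull informs the statistics of both arms regardless of which arm was chosen.

```python
def convert_arm1_sample_to_arm2(r1):
    """Because a1 = 1 - a2, a reward r1 observed from arm 1
    determines the reward arm 2 would have paid in the same round."""
    return 1 - r1

r1 = 1                                  # hypothetical observed reward from arm 1
r2 = convert_arm1_sample_to_arm2(r1)    # reward arm 2 would have given (here 0)
```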