Monday, December 28, 2020

Reinforcement Learning: use of SGD and independence of samples in a sequence

I am taking a course in RL, and many times learning the policy parameters or the value-function weights essentially boils down to using Stochastic Gradient Descent (SGD). The agent is represented as going through a sequence of states S_t, taking actions A_t, and reaping rewards R_t at each time step t of the sequence.
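For reference, the kind of update I have in mind is the standard SGD-style weight update for value-function approximation (notation as in Sutton and Barto), for instance gradient Monte Carlo:

    w_{t+1} = w_t + alpha * [G_t - v(S_t, w_t)] * grad_w v(S_t, w_t)

where G_t is the return observed from time t onward, v(s, w) is the approximate value function, and alpha is the step size.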

My understanding of SGD in general, e.g. when applied to training datasets for neural nets, is that we assume the data in each mini-batch to be iid. This makes sense because, in a way, we are approximating an expectation with an average of gradients over points that are independent draws from the same distribution. So why is it that we can use SGD in RL while stepping through time, where consecutive samples are clearly correlated? Is it due to an implicit conditional-independence (Markov) assumption on the transition distribution p(S_t | S_{t-1})?
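To make the question concrete, here is a minimal sketch of the kind of incremental update I mean: semi-gradient TD(0) with a tabular (one-hot linear) value function on a small random-walk chain. The environment, constants, and variable names are illustrative assumptions on my part, not something from the course; the point is that each update consumes the next, correlated transition of the trajectory rather than an iid mini-batch.

    import numpy as np

    n_states = 5            # non-terminal states 0..4, terminal beyond either end
    alpha = 0.1             # step size
    gamma = 1.0             # undiscounted episodic task
    w = np.zeros(n_states)  # value-function weights (one per state)

    def step(s, rng):
        """One transition of the random walk; reward 1 only on exiting to the right."""
        s_next = s + rng.choice([-1, 1])
        if s_next < 0:
            return None, 0.0          # left terminal state, reward 0
        if s_next >= n_states:
            return None, 1.0          # right terminal state, reward 1
        return s_next, 0.0

    rng = np.random.default_rng(0)
    for episode in range(1000):
        s = n_states // 2
        while s is not None:
            s_next, r = step(s, rng)
            # Semi-gradient TD(0): samples arrive in time order and are
            # correlated through p(S_t | S_{t-1}); there is no iid mini-batch.
            v_next = 0.0 if s_next is None else w[s_next]
            td_error = r + gamma * v_next - w[s]
            w[s] += alpha * td_error  # SGD-style increment on this state's weight
            s = s_next

    print(w)  # approaches [1/6, 2/6, 3/6, 4/6, 5/6] for this chain

Each increment here looks exactly like an SGD step, yet the "samples" are successive transitions of one trajectory, which is what prompts my question.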

Thanks for clarifying this point. Amine



