<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Feryal Behbahani on Feryal Behbahani</title>
    <link>https://feryal.github.io/</link>
    <description>Recent content in Feryal Behbahani on Feryal Behbahani</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <copyright>&amp;copy; 2017 Feryal Behbahani</copyright>
    <lastBuildDate>Mon, 17 Jul 2017 00:00:00 +0000</lastBuildDate>
    <atom:link href="/" rel="self" type="application/rss+xml" />
    
    <item>
      <title>Acme: A Research Framework for Distributed Reinforcement Learning</title>
      <link>https://feryal.github.io/publication/acme/</link>
      <pubDate>Wed, 15 Jul 2020 22:00:19 +0100</pubDate>
      
      <guid>https://feryal.github.io/publication/acme/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Tutorial on RL (EEML 2020)</title>
      <link>https://feryal.github.io/project/eeml/</link>
      <pubDate>Wed, 15 Jul 2020 00:00:00 +0000</pubDate>
      
      <guid>https://feryal.github.io/project/eeml/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Privileged Information Dropout in Reinforcement Learning</title>
      <link>https://feryal.github.io/publication/privinfodropout/</link>
      <pubDate>Wed, 15 Jan 2020 22:00:19 +0100</pubDate>
      
      <guid>https://feryal.github.io/publication/privinfodropout/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Analysing Deep Reinforcement Learning Agents Trained with Domain Randomisation</title>
      <link>https://feryal.github.io/publication/analysis/</link>
      <pubDate>Fri, 15 Nov 2019 22:00:19 +0100</pubDate>
      
      <guid>https://feryal.github.io/publication/analysis/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Modular Meta-Learning with Shrinkage</title>
      <link>https://feryal.github.io/publication/shrinkage/</link>
      <pubDate>Fri, 15 Nov 2019 22:00:19 +0100</pubDate>
      
      <guid>https://feryal.github.io/publication/shrinkage/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Learning from Demonstration: Applications and challenges</title>
      <link>https://feryal.github.io/talk/oxford/</link>
      <pubDate>Mon, 26 Nov 2018 00:00:00 +0000</pubDate>
      
      <guid>https://feryal.github.io/talk/oxford/</guid>
      <description>&lt;p&gt;Recent advances in deep reinforcement learning have enabled a wide range of capabilities, from learning to play video games to acquiring robotic visuomotor skills. However, there are a wide range of problems where hand-coding behaviour or a reward function is impractical. Learning from demonstration (LfD) serves as an essential tool for learning skills that are difficult to program by hand. These demonstrations provide snapshots of near-optimal behaviours, offering guidance for the learning process and alleviating the need to start from scratch or manually engineering parts of the solution. However, it is often unclear how to acquire these in non-controlled settings, and the new challenges that arise when trying to apply these techniques in the real world.&lt;/p&gt;

&lt;p&gt;In this talk, I will present some of the recent techniques that can help us bridge that gap and learn realistic behaviours from a large source of untapped data already existing “in the wild”. I will cover some of the latest LfD approaches that leverage recent advances in deep learning and generative adversarial methods. I will present our recent work, Video to Behaviour (ViBe), which extracts realistic behaviours from raw unlabelled video data without additional expert knowledge: trajectories are extracted automatically and used to perform LfD through a novel curriculum, coping with multiple agents interacting in complex settings. I will finish with a discussion of open questions and the future research directions required to extend these approaches further.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Learning from Demonstration in the Wild</title>
      <link>https://feryal.github.io/publication/vibe/</link>
      <pubDate>Sat, 15 Sep 2018 22:00:19 +0100</pubDate>
      
      <guid>https://feryal.github.io/publication/vibe/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Automated Curriculum Learning for Reinforcement Learning</title>
      <link>https://feryal.github.io/publication/acl/</link>
      <pubDate>Fri, 07 Sep 2018 21:53:04 +0100</pubDate>
      
      <guid>https://feryal.github.io/publication/acl/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Extending World Models for Multi-Agent Reinforcement Learning in MALMÖ</title>
      <link>https://feryal.github.io/publication/malmo/</link>
      <pubDate>Fri, 07 Sep 2018 21:53:04 +0100</pubDate>
      
      <guid>https://feryal.github.io/publication/malmo/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Learning from Demonstration in the Wild</title>
      <link>https://feryal.github.io/project/vibe/</link>
      <pubDate>Sat, 25 Aug 2018 00:00:00 +0000</pubDate>
      
      <guid>https://feryal.github.io/project/vibe/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Automated Curriculum Learning for Reinforcement Learning</title>
      <link>https://feryal.github.io/project/acl/</link>
      <pubDate>Wed, 15 Aug 2018 00:00:00 +0000</pubDate>
      
      <guid>https://feryal.github.io/project/acl/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Craft Environment</title>
      <link>https://feryal.github.io/project/craftenv/</link>
      <pubDate>Wed, 15 Aug 2018 00:00:00 +0000</pubDate>
      
      <guid>https://feryal.github.io/project/craftenv/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Automated Curriculum Learning for Reinforcement Learning</title>
      <link>https://feryal.github.io/talk/jeju/</link>
      <pubDate>Sat, 28 Jul 2018 00:00:00 +0000</pubDate>
      
      <guid>https://feryal.github.io/talk/jeju/</guid>
      <description>

&lt;h2 id=&#34;how-would-you-make-an-agent-capable-of-solving-the-complex-hierarchical-tasks&#34;&gt;How would you make an agent capable of solving complex hierarchical tasks?&lt;/h2&gt;

&lt;p&gt;Imagine a problem that is complex and requires a collection of skills which are extremely hard to learn in one go with sparse rewards (e.g. complex object manipulation in robotics). One might instead learn to generate a curriculum of simpler tasks, so that overall a student network can learn to perform the complex task efficiently. Designing this curriculum by hand is inefficient. In this project, I set out to train an automatic curriculum generator using a teacher network which keeps track of the progress of the student network and proposes new tasks as a function of how well the student is learning. I adapted a state-of-the-art distributed reinforcement learning algorithm for training the student network, while using an adversarial multi-armed bandit algorithm for the teacher network. I also developed an environment, Craft Env, which supports hierarchical task design across a range of complexity and is fast to iterate on. I analysed how different metrics for quantifying student progress affect the curriculum that the teacher learns to propose, and demonstrated that this approach can accelerate learning and make it more interpretable how the agent learns to perform complex tasks.&lt;/p&gt;

&lt;p&gt;As a starting point, I adapted the Craft environment from work by Andreas et al. [1], as it has a nice, simple structure well suited to hierarchical task design. I developed a fixed curriculum of simpler target sub-tasks (in the Craft environment: &amp;ldquo;get wood&amp;rdquo;, &amp;ldquo;get grass&amp;rdquo;, &amp;ldquo;get iron&amp;rdquo;, &amp;ldquo;make cloth&amp;rdquo;, &amp;ldquo;get gold&amp;rdquo;), and will next build a teacher network that proposes tasks for the student to learn. The student could also be kick-started with demonstrations from an expert.&lt;/p&gt;
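&lt;p&gt;As a rough sketch of such a teacher, an EXP3-style adversarial bandit can treat each task as an arm and receive a learning-progress signal as reward. The class and the constant progress signal below are hypothetical illustrations, not the exact algorithm used:&lt;/p&gt;

```python
import math
import random

class Exp3Teacher:
    """Minimal EXP3-style adversarial bandit over tasks.

    Each arm is a task; the reward fed back after a student training
    run is a learning-progress signal (e.g. the change in return).
    """

    def __init__(self, tasks, eta=0.1):
        self.tasks = tasks
        self.eta = eta
        self.log_weights = [0.0] * len(tasks)

    def probs(self):
        # Softmax over log-weights, shifted by the max for stability
        m = max(self.log_weights)
        w = [math.exp(lw - m) for lw in self.log_weights]
        z = sum(w)
        return [x / z for x in w]

    def propose(self):
        # Sample a task index proportionally to the current weights
        return random.choices(range(len(self.tasks)), weights=self.probs())[0]

    def update(self, arm, reward):
        # Importance-weighted update: only the proposed arm's weight moves
        p = self.probs()
        self.log_weights[arm] += self.eta * reward / p[arm]

teacher = Exp3Teacher(["get wood", "get grass", "get iron"])
random.seed(0)
for _ in range(200):
    arm = teacher.propose()
    progress = 1.0 if arm == 0 else 0.0  # pretend only "get wood" shows progress
    teacher.update(arm, progress)
```

&lt;p&gt;In the real setup the reward would be derived from the student&amp;rsquo;s learning curve; the constant stand-in above only shows that the bandit concentrates on tasks exhibiting progress.&lt;/p&gt;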

&lt;p&gt;I have interfaced IMPALA[2], a GPU-accelerated distributed variant of the A3C architecture that uses multiple actors with V-trace off-policy correction, with my Craft environment to train on all the possible Craft tasks concurrently. This is done by providing a hash of the task name as an instruction to the network (a setup similar to DMLab IMPALA, using an LSTM to process the instruction).&lt;/p&gt;
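&lt;p&gt;As a simplified stand-in for the instruction pipeline (the actual setup processes the instruction with an LSTM), one can hash the task name into an embedding-table index and concatenate the embedding with the observation features. All sizes and names below are hypothetical:&lt;/p&gt;

```python
import hashlib
import numpy as np

VOCAB_SIZE = 64   # hypothetical instruction-embedding table size
EMBED_DIM = 8     # hypothetical embedding width

def task_id(name, vocab=VOCAB_SIZE):
    # Stable hash of the task name into an embedding-table index
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % vocab

rng = np.random.default_rng(0)
embed_table = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))

def policy_input(obs_features, task_name):
    # Concatenate observation features with the task embedding so a
    # single network can be trained on all Craft tasks concurrently
    return np.concatenate([obs_features, embed_table[task_id(task_name)]])

x = policy_input(np.zeros(16), "get wood")
```

&lt;p&gt;A stable hash keeps the mapping from task names to instructions fixed across actors and learner, which is what lets every task share one conditioned network.&lt;/p&gt;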


&lt;p&gt;Other papers that I am inspired by in this work include [3], [4].&lt;/p&gt;

&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;

&lt;p&gt;[1] &lt;a href=&#34;https://arxiv.org/abs/1611.01796&#34;&gt;Modular Multitask Reinforcement Learning with Policy Sketches&lt;/a&gt; (Andreas et al., 2016)&lt;/p&gt;

&lt;p&gt;[2] &lt;a href=&#34;https://arxiv.org/abs/1802.01561&#34;&gt;IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures&lt;/a&gt; (Espeholt et al., 2018)&lt;/p&gt;

&lt;p&gt;[3] &lt;a href=&#34;https://arxiv.org/abs/1802.10567&#34;&gt;Learning by Playing - Solving Sparse Reward Tasks from Scratch&lt;/a&gt; (Riedmiller et al., 2018)&lt;/p&gt;

&lt;p&gt;[4] &lt;a href=&#34;https://arxiv.org/abs/1112.5309&#34;&gt;PowerPlay: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem&lt;/a&gt; (Schmidhuber, 2011)&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Successor Representations</title>
      <link>https://feryal.github.io/post/successorrepresentations/</link>
      <pubDate>Fri, 23 Mar 2018 00:00:00 +0000</pubDate>
      
      <guid>https://feryal.github.io/post/successorrepresentations/</guid>
      <description>

&lt;p&gt;Successor representations were introduced by &lt;a href=&#34;#references&#34;&gt;Dayan in 1993&lt;/a&gt; as a way to
represent states such that the &amp;ldquo;similarity&amp;rdquo; that matters for TD learning reflects
the temporal sequence of states reachable from a given state.&lt;/p&gt;

&lt;p&gt;Dayan derived it in the tabular case, but let&amp;rsquo;s derive it assuming a feature
vector $\phi$.&lt;/p&gt;

&lt;p&gt;We assume that the reward function can be factorised linearly:
$$r(s) = \phi(s) \cdot w$$&lt;/p&gt;
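&lt;p&gt;A toy numerical check of this assumption (features and rewards invented for illustration): stack the per-state features into a matrix and solve for $w$ by least squares.&lt;/p&gt;

```python
import numpy as np

# Toy check of the linear-reward assumption r(s) = phi(s) . w:
# stack per-state features into Phi and solve Phi w = r for w.
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])      # hypothetical 2-d features for 3 states
r = np.array([0.5, 1.0, 1.5])     # rewards generated by w = (0.5, 1.0)
w, residuals, rank, sv = np.linalg.lstsq(Phi, r, rcond=None)
```

&lt;p&gt;When the assumption holds exactly, the least-squares fit recovers $w$; when it only holds approximately, the residual measures how far the rewards are from linear in the features.&lt;/p&gt;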

&lt;p&gt;This can then be inserted back into the value expectation formula:
$$\begin{array}{rl}
Q^\pi(s, a) &amp;amp;= E \left\lbrack \sum_{t=0}^\infty \gamma^t r(s_t) ~|~ s_0 = s, a_0 = a \right\rbrack  \\&lt;br /&gt;
 &amp;amp; = E\left\lbrack \sum_{t=0}^\infty \gamma^t \phi(s_t) \cdot w ~|~ s_0=s, a_0 = a \right\rbrack \\&lt;br /&gt;
 &amp;amp; = E\left\lbrack \sum_{t=0}^\infty \gamma^t \phi(s_t) ~|~ s_0=s, a_0 = a \right\rbrack \cdot w \\&lt;br /&gt;
 &amp;amp; = \psi^\pi(s, a) \cdot w
 \end{array} $$&lt;/p&gt;

&lt;p&gt;$\psi^\pi$ denotes the successor features, and does not depend on the rewards.
It can be seen as a partial model of the world: the expected discounted occupancy of states under the behaviour policy.&lt;/p&gt;

&lt;p&gt;$w$ represents the rewards or the goal, and can be thought of as preferences about the given features $\phi(s)$ in a
given task.&lt;/p&gt;

&lt;p&gt;Assuming that the features $\phi(s)$ are good and consistent across a domain, we can imagine transferring to new
tasks very quickly, as we simply need to learn a new goal vector $w$.&lt;/p&gt;

&lt;p&gt;In Dayan&amp;rsquo;s case, the features $\phi(s)$ were directly the one-hot occupancy of a tabular state space. This means
that the successor features directly correspond to the expected discounted visitation counts under a policy.
This is not as simple in more complex scenarios or when we want to learn $\phi$.&lt;/p&gt;
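&lt;p&gt;In the tabular case this can be checked directly: for a fixed policy with state-transition matrix $P$, the successor representation has the closed form $M = (I - \gamma P)^{-1}$. A small sketch on a toy chain MDP (invented for illustration):&lt;/p&gt;

```python
import numpy as np

# Closed-form tabular successor representation for a fixed policy:
# M = (I - gamma P)^{-1}, where P is the induced state-transition matrix.
gamma = 0.9
P = np.array([[0.0, 1.0, 0.0],    # toy 3-state chain; state 2 is absorbing
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
M = np.linalg.inv(np.eye(3) - gamma * P)

# With one-hot features, row s of M is the expected discounted
# visitation count of each state when starting from s.
r = np.array([0.0, 0.0, 1.0])     # reward only in the absorbing state
V = M @ r                          # values follow immediately as V = M r
```

&lt;p&gt;Swapping in a different reward vector $r$ reuses the same $M$, which is exactly the quick transfer to new tasks described above.&lt;/p&gt;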

&lt;p&gt;One important observation, also mentioned in passing by Dayan, is that one can rewrite the definition of the
successor features $\psi$ in the same recursive form as the value function, obtaining an independent Bellman equation for every component of $\phi$:&lt;/p&gt;

&lt;p&gt;$$\psi_i(s, a) = E\left\lbrack \sum_{t=0}^\infty \gamma^t \phi_i(s_t) ~|~ s_0=s, a_0 = a \right\rbrack$$&lt;/p&gt;

&lt;p&gt;$$= \phi_i(s) + \gamma \, E\left\lbrack \psi_i(s_1, a_1) ~|~ s_0=s, a_0 = a \right\rbrack$$&lt;/p&gt;

&lt;p&gt;In other terms, the features $\phi(s)$ behave as the rewards in a Bellman update, where $\psi(s, a)$ takes the role
of the Q-values.&lt;/p&gt;
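&lt;p&gt;This observation suggests learning $\psi$ with per-component TD(0) updates. A minimal tabular sketch on a toy chain with a single action (all details invented for illustration):&lt;/p&gt;

```python
import numpy as np

def td_update(psi, phi, s, a, s_next, a_next, alpha=0.1, gamma=0.9):
    """One TD(0) step per component: phi[s] plays the role of the reward
    vector and psi[s, a] plays the role of the Q-values."""
    target = phi[s] + gamma * psi[s_next, a_next]
    psi[s, a] += alpha * (target - psi[s, a])

# Tabular toy: 3-state chain with one action and one-hot features,
# so psi should converge to the discounted visitation counts.
phi = np.eye(3)
psi = np.zeros((3, 1, 3))           # (state, action, feature) table
transitions = [(2, 2), (1, 2), (0, 1)]  # sweep later states first
for _ in range(2000):
    for s, s_next in transitions:
        td_update(psi, phi, s, 0, s_next, 0)
```

&lt;p&gt;For this chain the learned $\psi$ approaches the discounted visitation counts, matching the one-hot tabular case discussed above.&lt;/p&gt;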


&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Peter Dayan, &amp;ldquo;Improving generalization for temporal difference learning: The successor representation&amp;rdquo;, Neural Computation, 5(4):613–624, 1993. &lt;a href=&#34;http://www.gatsby.ucl.ac.uk/~dayan/papers/d93b.pdf&#34;&gt;pdf&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    
    <item>
      <title>What Would it Take to Train an Agent to Play with a Shape-Sorter?</title>
      <link>https://feryal.github.io/talk/rework/</link>
      <pubDate>Fri, 22 Sep 2017 00:00:00 +0000</pubDate>
      
      <guid>https://feryal.github.io/talk/rework/</guid>
      <description>&lt;p&gt;The capabilities of humans to precisely and robustly recognise and manipulate objects has been instrumental in the development of human cognition. However, understanding and replicating this process has proven to be difficult. This is of particular importance when thinking of agents or robots acting in naturalistic environments, solving complex tasks. I will present recent work in this direction, focusing on computational optimality and Deep Reinforcement Learning techniques, to discover how to manipulate objects within a 3D physics simulator from high-dimensional sensory observations.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
