Reinforcement Learning in the Control of Attention

Luiz M G Gonçalves, Laboratory for Analysis and Architecture of Systems (State University of Campinas in the near future), www.laas.fr/~lmgarcia
Roderic A Grupen, Laboratory for Perceptual Robotics, State University of Massachusetts (USA), www-robotics.cs.umass.edu

Objective
- To develop a robotic system to perform tasks involving attention and pattern categorization, integrating multi-modal (haptic and visual) information in a behaviorally cooperative active system.

Motivation
- Towards a useful robotic system able to:
  - foveate (verge) the eyes onto a region of interest (ROI);
  - keep attention on the ROI if needed;
  - choose another ROI (shift the focus of attention).
- The result is a behaviorally cooperative active system that provides on-line feedback to environmental stimuli in the form of actions.

Method
- Use of (real-time) visual information from a stereo head and a simulator
- Selective attention (bottom-up salience maps)
- Multi-feature extraction (perceptual state)
- Associative memory (pattern address identification)
- Efficient topological mapping
- Learn policies to program the system

Task Specification (Objectives)
- Visual monitoring or environment inspection
  - Construction of an attentional map
  - Keep this map consistent with the current perception (update)
  - Categorize all patterns

Markov Process
- A stochastic process whose past does not influence its future once its present is completely specified
- Examples: checkers, chess

Dynamic Programming
- Naive approach: traverse all possible states, testing every possibility (execute all actions infinitely often)
- Better solution (DP):
  - Reduce the complexity of a problem that can be solved in dimension D to two or more problems in lower dimensions
- Example: stereo disparity:
  - one 3D problem (x,y,d) is reduced to two 2D problems, (x,d) and (y,d)

Pavlov
- The animal does the right thing: it gets food
- The animal does the wrong thing: it gets punished
- In theory, it has been shown that only one of the two (reward or punishment) is needed: when it does the wrong thing, it simply gets no food.
- Thus:
  - the robot does the right thing => reward

Reinforcement Learning (Related Work)
- Watkins: Learning from Delayed Rewards (1989).
- Sutton/Barto: Reinforcement Learning: An Introduction (1998).
- Araujo: Learning a Control Composition in a Complex Environment (1996).
- Huber: A Feedback Control Structure for On-line Learning Tasks (1997).
- Coelho: A Control Basis for Learning Multifingered Grasps (1997).

Modelling a problem with delayed reinforcement as an MDP:
- a set of states $S$,
- a set of actions (operators) $A$,
- a reward function $R: S \times A \to \mathbb{R}$, and
- a state transition function $T: S \times A \to \Pi(S)$, which maps each state-action pair to a probability distribution over next states.
- Q-learning equation:
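As a rough illustration of the four MDP ingredients listed above, the sketch below encodes a tiny two-state problem in plain Python. The state names, action names, rewards, and probabilities are all made up for the example and are not taken from the presentation.

```python
import random

# Hypothetical two-state MDP; every name and number here is illustrative only.
STATES = ["searching", "fixating"]
ACTIONS = ["saccade", "hold"]

# R[(s, a)] -> immediate reward for taking action a in state s
R = {
    ("searching", "saccade"): 0.0,
    ("searching", "hold"): -1.0,
    ("fixating", "saccade"): -1.0,
    ("fixating", "hold"): 1.0,
}

# T[(s, a)] -> probability distribution over next states
T = {
    ("searching", "saccade"): {"searching": 0.3, "fixating": 0.7},
    ("searching", "hold"): {"searching": 1.0},
    ("fixating", "saccade"): {"searching": 1.0},
    ("fixating", "hold"): {"searching": 0.1, "fixating": 0.9},
}

def step(state, action, rng=random):
    """Sample one transition and return (reward, next_state)."""
    dist = T[(state, action)]
    next_state = rng.choices(list(dist), weights=list(dist.values()))[0]
    return R[(state, action)], next_state
```

Each entry of `T` is a full distribution over next states, so its probabilities sum to 1; `step` is the only place where randomness enters.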

Q-learning equation
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a' \in A} Q(s',a') - Q(s,a) \right]$$
- $a$ = action executed
- $r$ = reward
- $s'$ = state resulting from applying $a$
- $A$ = all actions $a'$ that can be executed in $s'$
- $\alpha$ = learning rate (usually 0.1)
- $\gamma$ = discount factor (usually 0.5)
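A minimal sketch of the standard one-step Q-learning update, using the typical values quoted on the slide (learning rate 0.1, discount factor 0.5). The function name `q_update`, the dictionary layout, and the example states are assumptions made for illustration.

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (typical value quoted on the slide)
GAMMA = 0.5  # discount factor (typical value quoted on the slide)

def q_update(Q, s, a, r, s_next, actions, alpha=ALPHA, gamma=GAMMA):
    """Apply one Q-learning update to Q[(s, a)] and return the TD error."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

Q = defaultdict(float)          # unseen (state, action) pairs start at 0.0
Q[("s1", "left")] = 2.0
delta = q_update(Q, "s0", "left", 1.0, "s1", ["left", "right"])
# delta = 1.0 + 0.5 * 2.0 - 0.0 = 2.0, so Q[("s0", "left")] moves from 0.0 to 0.2
```

With a small learning rate the estimate moves only a tenth of the way toward the new target on each update, which is what makes the rule tolerant of noisy rewards.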

Observations
- A transition in the state space is completely characterized by the tuple $(s, a, r, s')$
- Provided that every pair $(s,a)$ is updated infinitely often, $Q(s,a)$ converges with probability 1 to the best possible reward for that pair.

Exploration and exploitation
- Exploration: choose an action at random
- Exploitation: after some time the system starts to converge, so it chooses actions known to be contributing to convergence
- Balance exploration and exploitation
- Temperature (reminiscent of simulated annealing)
  - Choose randomly as a function of the temperature (high at first, low later)
  - In practice, even at the end, still 10% random
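One common way to realize a temperature-controlled choice is Boltzmann (softmax) selection over the Q-values. The sketch below combines it with a fixed 10% chance of a uniformly random action, mirroring the "still 10% random" remark. The cooling schedule and all parameter values are hypothetical, and the presentation does not say which selector was actually used.

```python
import math
import random

def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values: high temperature -> near-uniform, low -> near-greedy."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def select_action(q_values, temperature, epsilon=0.1, rng=random):
    """Return an action index: uniform with probability epsilon, else Boltzmann."""
    n = len(q_values)
    if rng.random() < epsilon:
        return rng.randrange(n)
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(n), weights=probs)[0]

def temperature_at(step, t_start=10.0, t_end=0.1, decay=0.99):
    """Hypothetical cooling schedule: exponential decay with a lower bound."""
    return max(t_end, t_start * decay ** step)
```

Early in training, `temperature_at` returns large values and all actions are almost equally likely; later the distribution sharpens around the best-valued action while `epsilon` keeps a floor of random exploration.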

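Q-learning algorithm
1) Define the current state $s$ by decoding the available sensory information;
2) Use a stochastic action selector to determine action $a$;
3) Perform action $a$, generating a new state $s'$ and a reinforcement $r$;
4) Calculate the temporal-difference error $\delta = r + \gamma \max_{a'} Q(s',a') - Q(s,a)$, where $\gamma$ is the discount factor;
5) Update the Q-value of the state/action pair $(s,a)$: $Q(s,a) \leftarrow Q(s,a) + \alpha\,\delta$, where $\alpha$ is the learning rate;
6) Go to 1.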
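The six steps can be sketched as a tabular learning loop. The environment below (a five-cell corridor with a reward at the right end) is a hypothetical stand-in for the real sensory state, and the epsilon-greedy selector is just one possible stochastic action selector.

```python
import random

def train(n_states=5, episodes=200, alpha=0.1, gamma=0.5, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical corridor: states 0..n-1, goal at n-1."""
    rng = random.Random(seed)
    actions = (-1, +1)  # move left / move right
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    goal = n_states - 1
    for _ in range(episodes):
        s = 0                                            # 1) current state
        while s != goal:
            if rng.random() < epsilon:                   # 2) stochastic selection
                a = rng.choice(actions)
            else:                                        #    greedy, random tie-break
                best = max(Q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            s_next = min(max(s + a, 0), goal)            # 3) act, observe s' and r
            r = 1.0 if s_next == goal else 0.0
            best_next = max(Q[(s_next, b)] for b in actions)
            delta = r + gamma * best_next - Q[(s, a)]    # 4) TD error
            Q[(s, a)] += alpha * delta                   # 5) update Q(s, a)
            s = s_next                                   # 6) go to 1
    return Q

Q = train()
# Greedy policy read off the learned table for the non-goal states
policy = {s: max((-1, +1), key=lambda a: Q[(s, a)]) for s in range(4)}
```

After training, the greedy policy moves right in every non-goal state, and the Q-values for moving right roughly halve with each extra step from the goal, as expected for a discount factor of 0.5.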

Eligibility trace
- Update not just one state-action pair at a time, but a whole sequence of pairs (after a series of actions has been executed).
- Faster convergence
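A minimal sketch of the idea, assuming accumulating traces: every recently visited pair keeps a decaying trace, and each temporal-difference error is credited to all of them in proportion to that trace. The trace-decay parameter `lam` and the helper name are hypothetical. Watkins's Q(λ) would additionally reset the traces after an exploratory action, which is omitted here.

```python
def trace_update(Q, E, s, a, delta, alpha=0.1, gamma=0.5, lam=0.8):
    """Spread one TD error over all recently visited pairs, then decay the traces."""
    E[(s, a)] = E.get((s, a), 0.0) + 1.0      # accumulate the trace of the current pair
    for key in list(E):
        Q[key] = Q.get(key, 0.0) + alpha * delta * E[key]
        E[key] *= gamma * lam                 # older pairs receive less credit

Q, E = {}, {}
trace_update(Q, E, "s0", "right", delta=0.0)  # nothing surprising happens yet
trace_update(Q, E, "s1", "right", delta=1.0)  # a TD error appears one step later
# Q[("s1", "right")] is about 0.1 and Q[("s0", "right")] is about 0.04:
# the earlier pair is also updated, scaled by its decayed trace 0.5 * 0.8 = 0.4
```

Without traces only the last pair would change, so the earlier pair would have to wait for a later visit before learning anything from the same event.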

In practice
- A table (Q-table)
- Rows are the states (s)
- Columns are the actions (a)
- Each element Q(s,a) is a Q-value, the value given by the function that evaluates the utility of taking action a when the state is s

Roger-the-Crab

Stereo Head Environment

Degrees of Freedom (Controllers)

System Control Architecture

Low-level Control
- Defining a target
  - Pre-attentional phase (stimuli + internal bias)
  - Shifting attention (saccade generation)
  - Fine saccade (using the target model)
  - Verging the eyes onto a target (correlation)
  - Movements are computed from errors to the image centers
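To illustrate what "movements computed from errors to the image centers" can look like, here is a sketch that converts a target's pixel position into pan/tilt increments. It assumes a simple linear pixel-to-angle model with the image y axis pointing down; the function name, field-of-view values, and gain are hypothetical and do not describe the actual controllers.

```python
def saccade_command(target_px, image_size, fov_deg, gain=1.0):
    """Map a target's pixel position to (pan, tilt) increments in degrees.

    Assumes a linear pixel-to-angle model; gain < 1 yields a partial step.
    """
    (x, y), (w, h) = target_px, image_size
    err_x = x - w / 2.0                    # horizontal error to the image center
    err_y = y - h / 2.0                    # vertical error to the image center
    pan = gain * err_x * fov_deg[0] / w
    tilt = gain * err_y * fov_deg[1] / h
    return pan, tilt

# Target 160 px right of center in a 640x480 image with a 40x30 degree field of view
print(saccade_command((480, 240), (640, 480), (40.0, 30.0)))  # (10.0, 0.0)
```

When the target already sits at the image center both errors are zero and no movement is commanded, which is the fixed point such a loop drives toward.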

Low-level Control
- Identifying objects
  - Selecting a region of interest
  - Extracting features
  - Associative memory match
- Mapping objects and/or updating memory
  - Pre-attentional maps
  - Automatic supervised learning

Behavioral Program

A straightforward control algorithm
- Step 0: Initialize the associative memory and start the concurrent controllers of the arms, neck, and eyes.
- Step 1: Redirect attention; if a representation is activated, update the attentional maps and repeat step 1.
- Step 2: Try a visual improvement; if a representation is activated, update the attentional maps and return to step 1.
- Step 3: Try an arm improvement; if a representation is activated, update the attentional maps and return to step 1.
- Step 4: Activate the supervised learning module, update the attentional maps, and return to step 1.
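The step logic above can be sketched as a small transition function. The sketch only models which step follows which; the attentional-map updates and the controllers themselves are left out, and the function names are hypothetical.

```python
def next_step(step, activated):
    """Return the step that follows `step`, given whether a representation was activated.

    Steps 1-3 return to step 1 on activation and otherwise escalate to the
    next step; step 4 (supervised learning) always returns to step 1.
    """
    if step == 0:
        return 1                      # initialization is always followed by step 1
    if step == 4 or activated:
        return 1
    return step + 1

def run(activations, start=0):
    """Trace the visited steps for a scripted sequence of activation outcomes."""
    step, visited = start, [start]
    for activated in activations:
        step = next_step(step, activated)
        visited.append(step)
    return visited

# Hypothetical trace: the first outcome belongs to step 0 and is ignored,
# then steps 1 and 2 fail, step 3 succeeds, and step 1 fails again.
print(run([False, False, False, True, False]))  # [0, 1, 2, 3, 1, 2]
```

Written this way, the hand-coded strategy is a fixed policy over five states, which makes it a natural baseline for a learned policy over the same states.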

Finite state machine

Results
- Q-learning convergence

Partial Evaluation of strategies
- Attentional shifts

Partial Evaluation of strategies
- Visual/arm improvements

Partial Evaluation of strategies
- Objects identified

Partial Evaluation of strategies
- New objects

Global evaluation Mapped objects

Task accomplishment Mapped objects

Times for each phase or process

Phase                Min (sec)  Max (sec)  Mean (sec)
Computing retina     0.145      0.189      0.166
Transfer to host     0.017      0.059      0.020
Total acquiring      0.162      0.255      0.186
Pre-attention        0.139      0.205      0.149
Salience map         0.067      0.134      0.075
Total attention      0.324      0.395      0.334
Total saccade        0.466      0.903      0.485
Features for match   0.135      0.158      0.150
Memory match         0.012      0.028      0.019
Total matching       0.323      0.353      0.333

Conclusions
- The system can support other sensors.
- Attention and categorization act together: tasks must be formulated.
- The inspection task was successfully completed.
- The system currently supports a frame rate of 10-15.
- The reinforcement learning approach worked well in simulation.

Future work
- Consider focus for saccade generation and accommodation (vergence)
- Test with partially occluded objects
- Derive policies (with Q-learning) for the control of top-down attention
- Increase the state space and/or the set of actions
- Define other hierarchical tasks (several policies, each appropriate for a given task)
- Test the learning architecture in a real environment

Thanks
- Thanks to CNPQ, CAPES, FAPERJ, NSF and UMASS (USA)
- To all of you for your patience
- To Mimmo and Dr. Arcangelo Distante for hosting me :-).
