Modern Hopfield Networks for Return Decomposition for Delayed Rewards

Michael Widrich, Markus Hofmarcher, Vihang Prakash Patil, Angela Bitto-Nemling, Sepp Hochreiter

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

Abstract

Delayed rewards, which are separated from their causative actions by irrelevant actions, hamper learning in reinforcement learning (RL). Real-world problems in particular often contain such delayed and sparse rewards. Recently, return decomposition for delayed rewards (RUDDER) employed pattern recognition to remove or reduce delay in rewards, which dramatically simplifies the learning task of the underlying RL method. RUDDER was realized using a long short-term memory (LSTM). The LSTM was trained to identify important state-action pair patterns responsible for the return. Reward was then redistributed to these important state-action pairs. However, training the LSTM is often difficult and requires a large number of episodes. In this work, we replace the LSTM with the recently proposed continuous modern Hopfield networks (MHN) and introduce Hopfield-RUDDER. MHN are powerful trainable associative memories with large storage capacity. They require only a few training samples and excel at identifying and recognizing patterns. We use this property of MHN to identify important state-action pairs that are associated with low- or high-return episodes and directly redistribute reward to them. However, in partially observable environments, Hopfield-RUDDER requires additional information about the history of state-action pairs. Therefore, we evaluate several methods for compressing history and introduce reset-max history, a lightweight history compression using the max-operator in combination with a reset gate. We experimentally show that Hopfield-RUDDER outperforms LSTM-based RUDDER on various 1D environments with a small number of training episodes. Finally, we show in preliminary experiments that Hopfield-RUDDER scales to highly complex environments on the Minecraft ObtainDiamond task from the MineRL NeurIPS challenge.
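
The abstract describes retrieval from a continuous modern Hopfield network as the core mechanism for reward redistribution. As a rough illustration only, the following minimal NumPy sketch shows the attention-style one-step retrieval update of continuous MHN, applied here to associate a query state-action pair with the returns of stored episodes; the function name, the scalar `beta`, and the direct weighting of returns are illustrative assumptions, not the paper's exact Hopfield-RUDDER formulation.

```python
import numpy as np

def hopfield_redistribute(stored, returns, query, beta=1.0):
    """Illustrative one-step retrieval in a continuous modern Hopfield network.

    stored:  (N, d) array of stored state-action patterns, assumed to be
             drawn from low- and high-return episodes.
    returns: (N,) array of episode returns associated with each stored pattern.
    query:   (d,) state-action pattern whose reward contribution we estimate.
    beta:    inverse temperature; larger values give sharper retrieval.
    """
    # Similarity between the query and every stored pattern.
    scores = beta * stored @ query            # (N,)
    # Numerically stable softmax over stored patterns.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighting the stored returns by the retrieval weights yields a
    # redistributed reward estimate for this state-action pair.
    return weights @ returns
```
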
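The reset-max history mentioned in the abstract is only named, not specified, so the sketch below is a guess at one plausible form: an element-wise running maximum of state-action features, damped by a reset gate before each update. The function name, the per-dimension gating in [0, 1], and the zero initialization are all assumptions for illustration.

```python
import numpy as np

def reset_max_history(observations, reset_gates):
    """Sketch of a reset-max history compression (assumed form).

    observations: (T, d) sequence of state-action features.
    reset_gates:  (T, d) values in [0, 1]; 1 keeps the compressed
                  history, 0 discards it. The paper's actual gating
                  may differ from this guess.
    """
    T, d = observations.shape
    history = np.zeros((T, d))
    h = np.zeros(d)
    for t in range(T):
        # The reset gate damps the carried-over history, then the
        # max-operator folds in the current observation.
        h = np.maximum(reset_gates[t] * h, observations[t])
        history[t] = h
    return history
```
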
Original language: English
Title of host publication: Deep Reinforcement Learning Workshop at Neural Information Processing Systems, 2021
Publication status: Published - 12 Oct 2021
Externally published: Yes
Event: Deep Reinforcement Learning Workshop at Neural Information Processing Systems 2021
Duration: 10 Dec 2021 → …

