Decision ConvFormer: Local Filtering in MetaFormer is Sufficient for Decision Making

ICLR 2024 (Spotlight Presentation)

Jeonghye Kim1,  Suyoung Lee1,  Woojun Kim2*,  Youngchul Sung1*

1 KAIST  2 Carnegie Mellon University   

* indicates co-corresponding authors.


The recent success of Transformer in natural language processing has sparked its use in various domains. In offline reinforcement learning (RL), Decision Transformer (DT) is emerging as a promising model based on Transformer. However, we discovered that the attention module of DT is not appropriate to capture the inherent local dependence pattern in trajectories of RL modeled as a Markov decision process. To overcome the limitations of DT, we propose a novel action sequence predictor, named Decision ConvFormer (DC), based on the architecture of MetaFormer, which is a general structure to process multiple entities in parallel and understand the interrelationship among the multiple entities. DC employs local convolution filtering as the token mixer and can effectively capture the inherent local associations of the RL dataset. In extensive experiments, DC achieved state-of-the-art performance across various standard RL benchmarks while requiring fewer resources. Furthermore, we show that DC better understands the underlying meaning in data and exhibits enhanced generalization capability.


  • Offline RL can be viewed as a sequence modeling problem; a representative work is Decision Transformer, which employs an attention module.
  • Offline RL data has an inherent pattern of local association due to the Markovian property.

“Is the attention module still an appropriate structure for identifying local associations in data sequences generated by a Markov decision process?”

The attention map of the Decision Transformer does not capture local associations well.

To learn local associations properly, we propose a new module that focuses on the past few timesteps and whose filtering does not depend on the input sequence.

Model : Decision ConvFormer

Base Architecture : MetaFormer

MetaFormer is a general architecture that takes multiple entities in parallel and models their interrelationships while minimizing information loss. The DC network adopts the MetaFormer architecture, with a convolution module serving as the token mixer.
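As a rough illustration of what "a MetaFormer with a swappable token mixer" means, the sketch below wires up one generic MetaFormer-style block in plain Python. It assumes the common pre-normalization residual layout (normalize, mix across tokens, add; normalize, apply a per-token channel MLP, add). The names `layer_norm`, `metaformer_block`, `token_mixer`, and `channel_mlp` are hypothetical, the normalization has no learned parameters, and nothing here is taken from the authors' code.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Parameter-free layer normalization of one token vector (assumption)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def metaformer_block(tokens, token_mixer, channel_mlp):
    """One MetaFormer-style block over a list of token vectors.

    token_mixer: maps the whole normalized sequence to a sequence of the
                 same shape (attention, convolution, pooling, ...).
    channel_mlp: maps a single normalized token vector to a vector of the
                 same length.
    """
    # Sub-block 1: mix information across tokens, then add the residual.
    normed = [layer_norm(tok) for tok in tokens]
    mixed = token_mixer(normed)
    hidden = [[x + m for x, m in zip(tok, mix)]
              for tok, mix in zip(tokens, mixed)]

    # Sub-block 2: transform each token independently, then add the residual.
    out = []
    for tok in hidden:
        update = channel_mlp(layer_norm(tok))
        out.append([x + u for x, u in zip(tok, update)])
    return out


# Usage with stand-in components: an identity token mixer and a zero MLP.
def identity_mixer(seq):
    return [list(tok) for tok in seq]


def zero_mlp(vec):
    return [0.0] * len(vec)


sequence = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
block_out = metaformer_block(sequence, identity_mixer, zero_mlp)
```

The point of the design is that `token_mixer` is the only component that looks across timesteps, so replacing attention with a convolution changes that one argument and leaves the rest of the block untouched.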

Token Mixer : 1D Depthwise Convolution Module

Because state, action, and return-to-go (RTG) embeddings differ in nature, we use three separate convolution filters for each hidden dimension, one per embedding type.
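The token mixer described above can be sketched as a causal depthwise 1D convolution in plain Python. This is a minimal illustration under several assumptions: tokens are interleaved in the order RTG, state, action; each output position uses the filter bank matching its own token type; the window covers the current token and the preceding `window - 1` tokens, with zero padding at the start. The function name `depthwise_causal_mix`, the `filters` layout, and the window length of 3 in the usage lines are hypothetical choices for the example, not values from the paper.

```python
def depthwise_causal_mix(tokens, filters, window):
    """Causal depthwise convolution with one filter bank per token type.

    tokens:  list of T vectors of length D, ordered RTG, state, action, ...
             (ordering is an assumption of this sketch).
    filters: dict with keys "rtg", "state", "action"; each value is a list
             of D filters, and each filter is a list of `window` weights.
    Returns a list of T vectors of length D.
    """
    kinds = ("rtg", "state", "action")
    dim = len(tokens[0])
    out = []
    for t in range(len(tokens)):
        bank = filters[kinds[t % 3]]  # filter bank for this token's type
        row = []
        for d in range(dim):          # depthwise: channels never interact
            acc = 0.0
            for j in range(window):
                src = t - (window - 1) + j  # last tap is the current token
                if src >= 0:                # positions before the start are zero
                    acc += bank[d][j] * tokens[src][d]
            row.append(acc)
        out.append(row)
    return out


# Usage: two timesteps (six tokens), hidden size 2, window of 3 tokens.
tokens = [[float(t), 10.0 * t] for t in range(6)]
passthrough = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]  # keeps only the current token
summing = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]      # sums the whole window
filters = {"rtg": passthrough, "state": summing, "action": passthrough}
mixed = depthwise_causal_mix(tokens, filters, window=3)
```

Because the weights are fixed per token type and per hidden dimension, the mixing pattern is the same for every input sequence, in contrast to attention scores, which are recomputed from the tokens themselves.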

Results : Offline Performance

MuJoCo & AntMaze


Results : Ablation Study

Input Modal Dependency

DC learns that the RTG and state inputs are important, whereas DT does not appear to.

Generalization Capability

DC understands the task context better than DT and is better at achieving desired target RTGs that are higher than those seen in the data.


  title={Decision Convformer: Local Filtering in MetaFormer is Sufficient for Decision Making},
  author={Kim, Jeonghye and Lee, Suyoung and Kim, Woojun and Sung, Youngchul},
  booktitle={International Conference on Learning Representations},