Yehao Li, Ting Yao, Yingwei Pan, Tao Mei.
Abstract
Transformer with self-attention has revolutionized the NLP field, and has recently inspired Transformer-style architecture designs that achieve competitive results in numerous CV tasks. Nevertheless, most existing designs directly employ self-attention over a 2D feature map to obtain the attention matrix based on pairs of isolated queries and keys, leaving the rich contexts among neighboring keys under-exploited. Here we design a novel Transformer-style module, i.e., the Contextual Transformer (CoT) block, for visual recognition. It fully capitalizes on the contextual information among input keys to guide the learning of a dynamic attention matrix and thus strengthens the capacity of visual representation. Technically, the CoT block first contextually encodes input keys via a 3×3 convolution, leading to a static contextual representation. We further concatenate the encoded keys with input queries to learn the dynamic multi-head attention matrix through two consecutive 1×1 convolutions. The learnt attention matrix is multiplied by values to achieve the dynamic contextual representation. The fusion of the static and dynamic contextual representations is finally taken as the output. Our CoT block can readily replace each 3×3 convolution in ResNet architectures, yielding a Transformer-style backbone named Contextual Transformer Networks (CoTNet). Through extensive experiments over a wide range of applications, we validate the superiority of CoTNet as a stronger backbone.
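The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several simplifying assumptions that the abstract does not specify: a single head instead of multi-head attention, a ReLU between the two 1×1 convolutions, values obtained from a 1×1 convolution, a softmax over spatial positions followed by an element-wise product with the values, and fusion by plain summation. All function and parameter names (`cot_block`, `init_params`, `w_key`, `w_value`, `w_theta`, `w_delta`) are hypothetical.

```python
import numpy as np


def conv1x1(x, w):
    """1x1 convolution. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def conv3x3(x, w):
    """3x3 convolution, stride 1, zero padding 1. w: (C_out, C_in, 3, 3)."""
    _, height, width = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], height, width))
    for i in range(3):
        for j in range(3):
            window = padded[:, i:i + height, j:j + width]
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], window)
    return out


def init_params(channels, hidden, rng):
    """Random weights for the sketch (hypothetical layout)."""
    return {
        "w_key": rng.standard_normal((channels, channels, 3, 3)) * 0.1,
        "w_value": rng.standard_normal((channels, channels)) * 0.1,
        "w_theta": rng.standard_normal((hidden, 2 * channels)) * 0.1,
        "w_delta": rng.standard_normal((channels, hidden)) * 0.1,
    }


def cot_block(x, params):
    """Simplified single-head CoT-style block. x: (C, H, W) -> (C, H, W)."""
    channels, height, width = x.shape
    keys, queries = x, x
    values = conv1x1(x, params["w_value"])  # assumption: 1x1 value embedding

    # Static contextual representation: 3x3 convolution over the keys.
    static = conv3x3(keys, params["w_key"])

    # Concatenate encoded keys with queries, then two consecutive 1x1 convs.
    concat = np.concatenate([static, queries], axis=0)  # (2C, H, W)
    hidden = np.maximum(conv1x1(concat, params["w_theta"]), 0.0)  # assumed ReLU
    logits = conv1x1(hidden, params["w_delta"])  # (C, H, W)

    # Assumption: normalize attention over spatial positions per channel.
    flat = logits.reshape(channels, height * width)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(flat)
    attn /= attn.sum(axis=1, keepdims=True)

    # Dynamic contextual representation: attention multiplied by values.
    dynamic = attn.reshape(channels, height, width) * values

    # Assumption: fuse the static and dynamic contexts by summation.
    return static + dynamic


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 6))
params = init_params(channels=4, hidden=2, rng=rng)
print(cot_block(x, params).shape)
```

Because the output keeps the input's channel count and spatial size, a block of this shape could stand in for a 3×3 convolution with matching input and output channels, which is the replacement the abstract describes for ResNet architectures.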
Year: 2022 PMID: 35363608 DOI: 10.1109/TPAMI.2022.3164083
Source DB: PubMed Journal: IEEE Trans Pattern Anal Mach Intell ISSN: 0098-5589 Impact factor: 6.226