[Deep Learning Theory Team Seminar] Talk by Prof. Difan Zou (HKU) on Understanding the Working Mechanism of Transformers

Wed, 20 Nov 2024 14:00 - 15:00 JST
Online Link visible to participants
Free admission

Description

Venue: Online and the Open Space at the RIKEN AIP Nihonbashi office

Language: English

Title: Understanding the Working Mechanism of Transformers: Model Depth and Multi-head Attention

Speaker: Prof. Difan Zou, HKU, https://difanzou.github.io/

Abstract:

In this talk, I will discuss our recent work on the working mechanism of the Transformer architecture, focusing on the learning capabilities and limitations of model depth and of the multi-head attention mechanism across different tasks. In the first part of the talk, we design a series of learning tasks based on actual sequences and systematically evaluate the performance and limitations of Transformers of different depths in terms of memorization, reasoning, generalization, and contextual generalization. We demonstrate that a Transformer with a single attention layer performs well on memorization tasks but cannot complete more complex ones. Moreover, at least a two-layer Transformer is required to achieve reasoning and generalization, while contextual generalization may require a three-layer Transformer.
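As a rough illustration only (not the speaker's actual benchmark or model), the following sketch instantiates small Transformer encoders of depth 1, 2, and 3, the depths compared in the abstract, so that their behavior on a sequence task could be contrasted; all hyperparameters and the placeholder data are assumptions for the sketch.

```python
# Minimal sketch: Transformer encoders of depth 1, 2, and 3 on toy sequence data.
import torch
import torch.nn as nn

def make_transformer(depth: int, d_model: int = 64, n_heads: int = 4) -> nn.Module:
    """Build a small Transformer encoder with the given number of layers."""
    layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=n_heads, dim_feedforward=128, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=depth)

# Placeholder token embeddings: (batch, sequence length, embedding dim).
x = torch.randn(8, 16, 64)

for depth in (1, 2, 3):
    model = make_transformer(depth)
    out = model(x)
    print(f"depth={depth}: output shape {tuple(out.shape)}")
```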

In the second part of the talk, we consider the sparse linear regression problem and explore the role of multi-head attention in a trained Transformer, revealing how multi-head attention operates across different Transformer layers. We first observe experimentally that every attention head in the first layer is important for the final performance, whereas in subsequent layers usually only one attention head plays a significant role. We then propose a preprocess-then-optimize working mechanism and theoretically prove that a multi-layer Transformer (with multiple heads in the first layer and only one head in subsequent layers) can implement it. Moreover, for the sparse linear regression problem, we prove that this mechanism outperforms naive gradient descent and ridge regression, which is consistent with the experimental findings. These results deepen our understanding of the advantages of multi-head attention and the role of model depth, providing a new perspective for revealing more complex mechanisms inside the Transformer.
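For readers unfamiliar with the setting, here is a minimal sketch of a sparse linear regression instance together with the ridge regression baseline mentioned above; the exact data model and comparison protocol in the papers may differ, and all constants here are assumptions for illustration.

```python
# Minimal sketch: sparse linear regression data and a ridge regression baseline.
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 20, 15, 3          # dimension, number of samples, sparsity level

# Sparse ground-truth weights: only k of the d coordinates are nonzero.
w_star = np.zeros(d)
w_star[rng.choice(d, size=k, replace=False)] = rng.normal(size=k)

X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)

# Ridge regression baseline: w_ridge = (X^T X + lam * I)^{-1} X^T y.
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Prediction on a fresh query point.
x_query = rng.normal(size=d)
print("true response:   ", x_query @ w_star)
print("ridge prediction:", x_query @ w_ridge)
```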

Bio: Dr. Difan Zou is an assistant professor in the Department of Computer Science and the Institute of Data Science at HKU. He received his PhD from the Department of Computer Science at the University of California, Los Angeles (UCLA). His research interests lie broadly in machine learning, deep learning theory, graph learning, and interdisciplinary research between AI and other subjects. His work has been published in top-tier machine learning conferences (ICML, NeurIPS, COLT, ICLR) and journals (IEEE Transactions, Nature Communications, PNAS, etc.). He serves as an area chair/senior PC member for NeurIPS and AAAI, and as a PC member for ICML, ICLR, COLT, etc.

About this community

RIKEN AIP Public

RIKEN AIP Public

Public events of RIKEN Center for Advanced Intelligence Project (AIP)
