If you like my posts on Medium, also join the ds-econ newsletter:
To make my studying a bit more fun, I decided to compile a series of posts on some of the topics which I study. Luckily, this is a blog about data science and I study data science, so I figured that many of you might benefit from my study notes as well.
I call this series Study Garden; you can think of these posts as study notes. They are a work in progress, might be updated in the future, are not perfect, and some information may be wrong.
Learn more about this format here.
Attention has emerged as a powerful tool in deep learning, enabling models to capture intricate relationships within data.
Here, we explore the Attention framework and its significance in enhancing deep learning models. We'll discuss its interpretation, single-head attention, positional encoding, performance, and its versatility across diverse tasks.
Attention allows models to learn the interactions between different nodes within a neural network. Interaction effects like these are otherwise hard to encode manually.
Attention is often used for text data, where it identifies both grammatical and semantic connections, enabling the model to capture nuanced meanings that go beyond individual words.
In the case of text data, an interaction means that, for example, two words together take on a different meaning than each word by itself. The words "couch" and "potato" each have their own meanings, but the phrase "couch potato" means something very different.
While we could encode this relationship manually for simple compound words such as this one, doing so quickly becomes computationally infeasible and does not allow for interactions that are more complex:
- I am going to the ball in the opera house
- I am going towards the ball on the field
- I kicked the ball into the goal
- I got kicked out of the ball at the opera house
We could not pick up the difference between ball (a round sports device) and ball (an upper-class dance event) via feature engineering alone, as the whole context of the sentence matters for this distinction.
Single-head attention is the fundamental building block of the attention framework, capturing relationships between different parts of the input. It enables models to process sequential data, such as time series or text, by considering the contextual dependencies among elements.
Single-head attention involves calculating a weighted sum of values, where the weights are based on the relationship between so-called queries and keys.
By comparing a query with the keys, Attention determines the relevance of information within a given context. This process is particularly useful for word embeddings, where attention emphasizes dimensions in which both word vectors have high values. The idea is that these spikes reflect something about the text's semantic meaning or grammatical structure, and that the model should therefore pay attention to them.
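The query–key comparison and weighted sum described above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product attention, not the optimized implementation found in deep learning libraries; the function and variable names are my own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns the weighted sum of values, shape (n_queries, d_v),
    plus the attention weights themselves.
    """
    d_k = Q.shape[-1]
    # Compare every query with every key; scale by sqrt(d_k)
    # so the scores stay well-behaved as dimensions grow
    scores = Q @ K.T / np.sqrt(d_k)
    # Each row sums to 1: how much "attention" a query pays to each key
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Toy self-attention example: 3 tokens with 4-dimensional embeddings,
# using the embeddings themselves as queries, keys, and values
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out, w = single_head_attention(X, X, X)
```

In a trained model, Q, K, and V would come from learned linear projections of the input embeddings rather than the raw embeddings themselves.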
Multiple single-head Attention mechanisms can be employed in parallel, in the hope of capturing different aspects of the input.
The Transformer architecture combines a multi-head attention block with a feed-forward neural network. Because the Transformer's input dimensions equal its output dimensions, we can stack multiple Transformer blocks on top of each other.
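To make the parallel-heads idea concrete, here is a minimal sketch of multi-head self-attention: each head applies its own learned projections, the heads run independently, and their outputs are concatenated. It deliberately omits the final output projection and the feed-forward sublayer of a full Transformer block; all names are my own.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention (see the single-head example)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def multi_head_self_attention(X, head_weights):
    """Run several attention heads in parallel and concatenate the results.

    head_weights: one (W_q, W_k, W_v) triple of projection matrices per head.
    With n_heads heads of size d_model // n_heads, the output has the same
    shape as the input, which is what lets us stack such blocks.
    """
    heads = [attention(X @ Wq, X @ Wk, X @ Wv)
             for Wq, Wk, Wv in head_weights]
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(5, d_model))  # 5 tokens, 8-dimensional embeddings
W = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
     for _ in range(n_heads)]
out = multi_head_self_attention(X, W)  # same shape as X
```

Because `out` has the same shape as `X`, the result can be fed straight into another identical block, which is exactly what stacking Transformers exploits.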
At the time of the paper by Vaswani et al. (2017), the Transformer outperformed other state-of-the-art translation models while incurring lower training costs. Its performance extends beyond machine translation, as it achieves comparable results in tasks like English constituency parsing.
The Transformer, built upon the attention framework, offers scalability, flexibility, and remarkable results across various natural language processing tasks.
By simplifying complex data and enhancing model performance, Attention paved the way for advancements in the field of deep learning, such as BERT or GPT.
Online Resources
Posts
A Jupyter Notebook by HarvardNLP, walking through the original Attention is all you need paper (see below).
Excellent, and also technical (nice!), post by Lilian Weng
An extensive Medium post by user Harshall Lambda
Videos
These were recommended in one of my lectures. I found them really helpful.
Research Papers
The seminal paper on this topic.