How does weight-tying work in a RNN?

Weight-tying is where you have a language model and use the same weight matrix for the input-to-embedding layer (the input embedding) and the hidden-to-softmax layer (the output embedding). The idea is that these two matrices contain essentially the same information, each having a row per word in the vocabulary. The concept seems to be first mentioned by Press and Wolf in 2016.

Below we illustrate this with a RNN. This post assumes that you are somewhat familiar already with how a RNN works. Keep an eye out for the matrices $U$ and $V$ below: these are the input and output embedding matrices.

The RNN setup Link to heading

Say you have a document split into $T$ words, $w_1, \dots, w_T$. The RNN will take one word at a time as input. Call this $w_t$.

First $w_t$ is transformed into a one-hot vector $x_t$. This vector has length $C$, where $C$ is the number of words in the vocabulary, which means $x_t$ has shape $C \times 1$.

$U$ is the input embedding matrix and is usually pre-trained. Say we are using $D$-dimensional GloVe embeddings (e.g. 50-dimensional). Then $U$ has shape $C \times D$, with one row per word in the vocab. The purpose of $U$ is to convert the one-hot vector $x_t$ into its embedding representation $e_t$ via $e_t = U^Tx_t$. The embedding vector $e_t$ now has shape $D \times 1$.

Next some operations are done on $e_t$ to obtain the hidden state $h_t$ of some length $H$. For a RNN this looks something like $$h_t = f \big(P_{DH} e_t + W_{hh} h_{t-1} + b_1 \big) $$ where

$P_{DH} $ is a projection matrix whose purpose is to transform the $D \times 1$ vector $e_t$ to the $H \times 1$ shape of $h_t$. This matrix $P_{DH}$ has shape $ H \times D$
$W_{hh}$ is a weight matrix of size $H \times H$ that specifies how the previous hidden state affects the current hidden state
$h_{t-1}$ is the hidden vector from the previous time step and has shape $H \times 1$
$b_1$ is a bias vector of size $H \times 1$
$f$ is some activation function, like tanh or ReLu

Now that you have the hidden state calculated for the current time step, the next step is to obtain a vector of scores $s_t$ which are used to calculate probabilities. The score vector $s_t$ has size $C \times 1$, one entry per word in the vocab, and is called the hidden scores or the logits layer. It is given by

$$s_t = Vh_t + b_2 $$

where

$h_t$ is the hidden state from above and has size $H \times 1$
$V$ is a weight matrix and has shape $ C \times H$. This is the output embedding matrix.
$b_2$ is another bias vector and has size $C \times 1$

The final vector of next-word probabilities $p_t$ is obtained from $s_t$ via the softmax function $\sigma$, so that $p_t = \sigma(s_t)$. This layer is known as the softmax layer. At this point you are finished.

Adding weight tying Link to heading

Now, let’s consider the shapes of the input and output embedding matrices $U$ and $V$ in the above formulation.

$U$ has shape $C \times D$
$V$ has shape $ C \times H$

These are different at the moment, but if you set the size of the hidden layer $H$ to be equal to the size of each embedding vector $D$, then these two matrices become the same size. Then we can do weight-tying and use a common matrix for both $U$ and $V$.

Setting $D = H$ is required in some Python examples of weight-tying (here is one example). The downside is that you are stuck setting those two values to be the same, which is restrictive and not optimal for performance. For example, if you are using 300-d GloVe vectors and want to measure performance with the 50-d version, then the hidden vector will also shrink from length 300 to length 50, impacting performance in a way you probably didn’t intend.

Can we have $H$ different to $D$? We can if we modify the above equations by inserting a projection matrix $P_{HD}$ of size $D \times H$. Now we get

$$s_t = VP_{HD}h_t + b_2 $$

where

$h_t$ is the hidden state from above and has size $H \times 1$
$V$ is a weight matrix and now has shape $ C \times D$ (same as $U$)
$P_{HD}$ is a projection matrix and has shape $D \times H$
$b_2$ is another bias vector and has size $C \times 1$

So to summarise, our RNN with weight tying for the case $D \neq H$ looks like

$$\begin{cases} e_t = U^Tx_t \\\ h_t = f \big(P_{HD}e_t + W_{hh}h_{t-1}+b_1\big) \\\ s_t = VP_{DH}h_t + b_2 \\\ p_t = \sigma (s_t) \end{cases}$$

and for weight tying with $D = H$, it becomes

$$ \begin{cases} e_t = U^Tx_t \\\ h_t = f \big(e_t + W_{hh}h_{t-1}+b_1\big) \\\ s_t = Vh_t + b_2 \\\ p_t = \sigma (s_t) \end{cases} $$

A second implicit assumption here is the number of words with an input embedding vector is the same as the number of words able to be predicted by the language model. So if you have a pre-trained embedding that has a vocab of 500,000 words, you can only predict those 500,000 words. We have used the common symbol $C$ to denote both these values, but I can imagine there are situations where they should be different.