[CLS] is NOT supposed to be the first input token for decoder-only model while training

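For the past couple of days I have been pretraining a model built on the full Transformer architecture. The encoder input is a sequence of notes (in practice, a sequence of 4-dimensional vectors), while the decoder input is an ordinary text sequence. Because there is an obvious gap between the two kinds of data (music -> text), and the backbone already works, I decided to improve the model further by first splitting the encoder and decoder apart, pretraining them separately on note sequences and lyric sequences, and then putting them back together for joint training on paired data.

Pretraining the encoder went without much trouble; it follows BERT's training scheme (MLM).

Training the decoder, however, ran into the following problem: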

predicted_tokens：  i'know me heart [SEP] a [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP]
ground_truth： you only left a load of care [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
100%|██████████| 22831/22831 [1:58:00<00:00,  3.22it/s]
  0%|          | 1/23976 [00:00<2:09:51,  3.08it/s]Epoch [1/2], Loss: 4.9228
predicted_tokens：  ia,g [SEP] sa [SEP] [SEP]o [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP]
ground_truth： karena bahagia'kan datang sendiri [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
  0%|          | 101/23976 [00:24<1:33:46,  4.24it/s]Epoch [1/2], Loss: 4.2510
predicted_tokens：  i i a [SEP] get [SEP] roll [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP]
ground_truth： and get dough and roll and ride [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
  1%|          | 201/23976 [00:48<1:33:55,  4.22it/s]Epoch [1/2], Loss: 4.6348
predicted_tokens：  i for the night [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP]
ground_truth： waiting for the tide [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
  1%|▏         | 301/23976 [01:12<1:37:43,  4.04it/s]Epoch [1/2], Loss: 4.5523
predicted_tokens：  i the love are [SEP] sameest [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP]
ground_truth： until our hands touched the cold ground [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
  2%|▏         | 401/23976 [01:36<1:38:21,  3.99it/s]Epoch [1/2], Loss: 4.5722
predicted_tokens：  i in the way [SEP] at the time ofns [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP]
ground_truth： sitting by the window looking at a couple goons [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
  2%|▏         | 425/23976 [01:42<1:34:33,  4.15it/s]

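If you look closely, you will notice that every predicted sequence starts with the same word, i, with no variation at all. The first instinct is that the dataset must contain many lines that start with i. That is plausible (speaking loosely, since I have not counted how many lines start with i): these are lyrics, after all. But on reflection, even if i is very likely at the start of a line, that should not make every single output of the model start with i. So let's analyze the training process.

My code is a heavily modified version of another blog post, https://mdnice.com/writing/fc0b920d4ca84837a5712df1a46865d2 (that post implements every part of the Transformer from scratch in PyTorch, with detailed comments). Let's look at how that post prepares its training data: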

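import torch


# src_vocab and tgt_vocab are assumed to be word-to-id mappings defined elsewhere.
def make_data(sentences):
    """Convert word sequences into sequences of token ids."""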
    enc_inputs, dec_inputs, dec_outputs = [], [], []
    for i in range(len(sentences)):
 
        enc_input = [[src_vocab[n] for n in sentences[i][0].split()]]
        dec_input = [[tgt_vocab[n] for n in sentences[i][1].split()]]
        dec_output = [[tgt_vocab[n] for n in sentences[i][2].split()]]

        #[[1, 2, 3, 4, 5, 6, 7, 0], [1, 2, 8, 4, 9, 6, 7, 0], [1, 2, 3, 4, 10, 6, 7, 0]]
        enc_inputs.extend(enc_input)
        #[[9, 1, 2, 3, 4, 5, 11], [9, 1, 2, 6, 7, 5, 11], [9, 1, 2, 3, 8, 5, 11]]
        dec_inputs.extend(dec_input)
        #[[1, 2, 3, 4, 5, 11, 10], [1, 2, 6, 7, 5, 11, 10], [1, 2, 3, 8, 5, 11, 10]]
        dec_outputs.extend(dec_output)

    return torch.LongTensor(enc_inputs), torch.LongTensor(dec_inputs), torch.LongTensor(dec_outputs)


sentences = [
    # The Chinese and English sides do not need the same number of words
    # enc_input                dec_input           dec_output
    ['我 有 一 个 好 朋 友 P', 'S I have a good friend .', 'I have a good friend . E'],
    ['我 有 零 个 女 朋 友 P', 'S I have zero girl friend .', 'I have zero girl friend . E'],
    ['我 有 一 个 男 朋 友 P', 'S I have a boy friend .', 'I have a boy friend . E']
]

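Notice that every decoder input is [CLS] + the full sentence (the start token is written S in the toy data above), and every decoder output (the ground truth) is the full sentence + [EOS] (written E above). From the point of view of Transformer training this is perfectly reasonable: the decoder's job is to predict the next word from the current word (together with the words before it) + the encoder output, so the decoder's input and output are offset from each other by one token.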
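The one-token offset described above can be sketched as follows. This is a minimal illustration rather than the referenced blog's code: the vocabulary, its ids, and the helper name `make_shifted_pair` are hypothetical.

```python
# Hypothetical toy vocabulary; the ids are made up for illustration.
vocab = {"[PAD]": 0, "[CLS]": 1, "[EOS]": 2,
         "i": 3, "have": 4, "a": 5, "good": 6, "friend": 7}


def make_shifted_pair(sentence, vocab):
    """Return (decoder_input, decoder_target) for one sentence.

    decoder_input  = [CLS] + tokens
    decoder_target = tokens + [EOS]
    Both lists have the same length, and position t of the target is
    the token that follows position t of the input.
    """
    ids = [vocab[w] for w in sentence.split()]
    dec_input = [vocab["[CLS]"]] + ids
    dec_target = ids + [vocab["[EOS]"]]
    return dec_input, dec_target


dec_input, dec_target = make_shifted_pair("i have a good friend", vocab)
print(dec_input)   # [1, 3, 4, 5, 6, 7]
print(dec_target)  # [3, 4, 5, 6, 7, 2]
```

At position 0 the input is always [CLS], so without an encoder the first prediction has no sample-specific context.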

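My decoder pretraining also used [CLS] + the full sentence as input, and that leading [CLS] is the culprit behind the problem above (i always appearing first). I say this for two reasons, one from experimental experience and one from theoretical analysis.

On the experimental side: an earlier version of my code relied heavily on library calls and GPT-generated code. Its quality was low and its correctness was hard to analyze, so I stopped using it. While I was using it, though, the model converged quickly to meaningful text (after 30-60 minutes of training), for example: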

66%|██████▋   | 7298/10990 [13:28<06:17,  9.78it/s]
Generated Text: cause being free is a state of mind steady he written control charm he he ’ your bottle wedding des bottle des des bottle
 25%|██▌       | 2799/10990 [04:31<10:29, 13.02it/s]
 Generated Text: i give her all my love yo tell turn girl girl girl girl turn yo girl met 
62%|██████▏   | 6799/10990 [12:30<07:09,  9.75it/s]
Generated Text: that's why i've done it again. no - no - roof. double rides rides rides writtenves double rides rides bottle rides des rides rides rides he written rides rides bottle pity forth

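This is roughly what that version produced after ten to twenty minutes of training: the first few tokens of each generated sequence already form a meaningful phrase. That made me suspect a problem in the current code. (It is also possible that the first version never shifted the decoder input and output at all, and that this is why it converged so fast; either way, it gave me a hint.)

On the theoretical side: why does [CLS] + the full sentence as input, with the full sentence + [EOS] as output (ground truth), work in the full Transformer? Because the decoder predicts the next word from the current word + the encoder output. Note the encoder output part: even though the first input token of every decoder sequence is [CLS], the encoder output varies from sample to sample, so the first token the decoder generates varies as well.

When the decoder is pretrained on its own, there is no encoder output to condition on, so its only input is the current word rather than the current word + the encoder output. Using [CLS] as the first token of every input is then unreasonable in itself. Because the first input token is always [CLS], the first prediction is made from exactly the same context every time, so the best the model can do there is learn the overall distribution of first words, and its top prediction is always the same one. It picks i because i starts a large share of the lines in the dataset. This also keeps the model from learning the relationships between the other words well: because the first predicted word is always i, the generation of the following words suffers too, and the model fails to learn the correct conditional distribution between words, which makes training a struggle.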
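To make the argument concrete, here is a small sketch of what a constant first context implies. It assumes an idealized model whose position-0 prediction matches the empirical distribution of first words, which is what minimizing cross-entropy with a fixed context converges to. The toy corpus and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the lyrics dataset.
corpus = [
    "i give her all my love",
    "you only left a load of care",
    "i know my heart",
    "waiting for the tide",
    "i will wait for you",
]


def first_word_distribution(lines):
    """Empirical distribution of the first word over all lines."""
    counts = Counter(line.split()[0] for line in lines)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


dist = first_word_distribution(corpus)

# With [CLS] as the only context, the position-0 prediction cannot depend
# on the sample, so greedy (argmax) decoding picks the same word every time.
greedy_first_word = max(dist, key=dist.get)
print(dist)
print(greedy_first_word)
```

In this toy corpus only three of the five lines start with i, yet greedy decoding would output i for all five.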

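The fix is simple:

Change the input from [CLS] + the full sentence to the full sentence.

Change the output from the full sentence + [EOS] to the sentence without its first word + [EOS].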
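The two changes above can be sketched as a data-preparation helper. This is an illustration under assumptions, not the actual training code: the vocabulary, its ids, and the name `make_decoder_only_pair` are hypothetical.

```python
# Hypothetical toy vocabulary; the ids are made up for illustration.
vocab = {"[PAD]": 0, "[EOS]": 1,
         "i": 2, "have": 3, "a": 4, "good": 5, "friend": 6}


def make_decoder_only_pair(sentence, vocab):
    """Build (input, target) without a leading [CLS].

    input  = tokens
    target = tokens[1:] + [EOS]
    The one-token offset is kept, but position 0 now sees the real
    first word of the sentence instead of a constant start token.
    """
    ids = [vocab[w] for w in sentence.split()]
    dec_input = ids
    dec_target = ids[1:] + [vocab["[EOS]"]]
    return dec_input, dec_target


dec_input, dec_target = make_decoder_only_pair("i have a good friend", vocab)
print(dec_input)   # [2, 3, 4, 5, 6]
print(dec_target)  # [3, 4, 5, 6, 1]
```

One consequence of this layout is that the first word is never a prediction target, so the first word has to be supplied from outside when generating.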

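~~As for how well it works, let's start training and see, haha.~~

After training, this does turn out to be the right change. With the same number of epochs and the same data, here are sample outputs from the improved method and the training loss curves from TensorBoard:

Sample analysis: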

 40%|████      | 2301/5708 [09:51<14:29,  3.92it/s]Epoch [2/2], Loss: 4.5555
predicted_tokens：  i'm not alone about i i'm not the place [SEP] the [SEP] [SEP] [SEP] [SEP] [SEP]
ground_truth： i'm not worried, and i'm in a hurry to die [SEP] [PAD] [PAD] [PAD] [PAD]
 91%|█████████ | 5201/5708 [22:41<02:16,  3.71it/s]Epoch [2/2], Loss: 4.4399
predicted_tokens：  you ready for go? [SEP]??????????????
ground_truth： you ready to roll? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

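These examples show that the model quickly learned to produce some meaningful text, in contrast to the earlier outputs, which were meaningless and always started with i. Now let's look at the TensorBoard plot of the training run.

Only the purple and yellow lines matter: both runs load the same initial weights and train for two epochs on the same dataset (purple uses a batch size of 512 and yellow uses 64; see the ps for why). The purple loss is lower than the yellow one (4.5 vs 4.583), and the purple curve clearly converges faster (it bends down more sharply). This indirectly supports the earlier observation: the problem is real, and the proposed fix is reasonable. Perfect!

ps: Each lyric line is very short, so my earlier maximum text length of 104 was a complete waste. I have now set the maximum length to 24, and training is much faster (120 min -> 25 min). The whole dataset can now be trained overnight, haha~

ps2: Later I may consider forcing every token in the predicted sequence after the first [SEP] to [SEP], so that the model is forced to treat [SEP] as the end of the sentence, which indirectly pushes it to generate complete sentences.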
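As a rough illustration of why the shorter limit helps, the sketch below measures how much of a padded batch consists of [PAD] for a given maximum length. The token counts are invented for the example and are not measured from the lyrics dataset; the helper names and `pad_id=0` are assumptions.

```python
def pad_to_length(ids, max_len, pad_id=0):
    """Truncate to max_len, then right-pad with pad_id up to max_len."""
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))


def padding_fraction(lengths, max_len):
    """Fraction of positions in the padded batch that hold padding."""
    real = sum(min(n, max_len) for n in lengths)
    return 1 - real / (len(lengths) * max_len)


# Hypothetical token counts for a handful of short lyric lines.
lengths = [8, 6, 9, 5, 8, 11]
for max_len in (104, 24):
    print(max_len, round(padding_fraction(lengths, max_len), 3))
```

For lines this short, most positions are padding under either limit, but the longer limit spends far more compute on them.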
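The idea in ps2 could look like the sketch below, read here as a post-processing step on the predicted token list. Whether it would be applied during training or only when displaying predictions is not settled above, and the function name is hypothetical.

```python
def force_sep_after_first(tokens, sep="[SEP]"):
    """Replace every token after the first `sep` with `sep`.

    If `sep` never occurs, a copy of the sequence is returned unchanged.
    """
    if sep not in tokens:
        return list(tokens)
    first = tokens.index(sep)
    return list(tokens[: first + 1]) + [sep] * (len(tokens) - first - 1)


predicted = ["i", "in", "the", "way", "[SEP]", "at", "the", "time", "[SEP]", "[SEP]"]
cleaned = force_sep_after_first(predicted)
print(cleaned)
```

The length of the sequence is preserved, so the result can still be compared position by position with the padded ground truth.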

#experiment

http://example.com/2023/12/10/CLS-is-NOT-supposed-to-be-the-first-input-token-for-decoder-only-model-while-training/

Author

iMusic

Posted on

December 10, 2023
