2019.06 Adapter Tuning

@Parameter-Efficient Transfer Learning for NLP
Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task. As an alternative, we propose transfer with adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing. To demonstrate the adapters’ effectiveness, we transfer the recently proposed BERT Transformer model to 26 diverse text classification tasks, including the GLUE benchmark. Adapters attain near state-of-the-art performance, whilst adding only a few parameters per task. On GLUE, we attain within 0.4% of the performance of full fine-tuning, adding only 3.6% parameters per task. By contrast, fine-tuning trains 100% of the parameters per task.
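
The adapter itself is a small bottleneck network with a skip connection, inserted after each Transformer sub-layer. A minimal PyTorch sketch (the bottleneck size here is illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: project down, apply a nonlinearity, project back
    up, and add a residual connection so the module starts near identity."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)  # near-identity initialization
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))

# During transfer only the adapters (plus layer norms and the task head)
# are trained; all pre-trained Transformer weights stay frozen.
```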

2021.01 Prefix-Tuning

@Prefix-Tuning: Optimizing Continuous Prompts for Generation
Fine-tuning is the de facto way to leverage large pretrained language models to perform downstream tasks. However, it modifies all the language model parameters and therefore necessitates storing a full copy for each task. In this paper, we propose prefix-tuning, a lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen, but optimizes a small continuous task-specific vector (called the prefix). Prefix-tuning draws inspiration from prompting, allowing subsequent tokens to attend to this prefix as if it were “virtual tokens”. We apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. We find that by learning only 0.1% of the parameters, prefix-tuning obtains comparable performance in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training.
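
Concretely, a small set of trainable activations is prepended at every layer and the frozen LM attends to them; the paper reparameterizes the prefix through an MLP during training for stability. A rough PyTorch sketch of that encoder (all sizes and names are illustrative; the output would be reshaped into per-layer key/value prefixes):

```python
import torch
import torch.nn as nn

class PrefixEncoder(nn.Module):
    """Produces per-layer key/value prefix activations for a frozen LM."""
    def __init__(self, prefix_len=10, n_layers=12, d_model=768, d_mid=512):
        super().__init__()
        self.embed = nn.Embedding(prefix_len, d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_mid),
            nn.Tanh(),
            nn.Linear(d_mid, n_layers * 2 * d_model),  # K and V per layer
        )
        self.register_buffer("prefix_ids", torch.arange(prefix_len))

    def forward(self, batch_size: int) -> torch.Tensor:
        prefix = self.mlp(self.embed(self.prefix_ids))         # (P, L*2*D)
        return prefix.unsqueeze(0).expand(batch_size, -1, -1)  # (B, P, L*2*D)
```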

2021.03 P-Tuning

@GPT Understands, Too
While GPTs with traditional fine-tuning fail to achieve strong results on natural language understanding (NLU), we show that GPTs can be better than or comparable to similar-sized BERTs on NLU tasks with a novel method, P-tuning, which employs trainable continuous prompt embeddings. On the knowledge probing (LAMA) benchmark, the best GPT recovers 64% (P@1) of world knowledge without any additional text provided during test time, which substantially improves the previous best by 20+ percentage points. On the SuperGLUE benchmark, GPTs achieve comparable and sometimes better performance to similar-sized BERTs in supervised learning. Importantly, we find that P-tuning also improves BERTs’ performance in both few-shot and supervised settings while largely reducing the need for prompt engineering. Consequently, P-tuning outperforms the state-of-the-art approaches on the few-shot SuperGLUE benchmark.
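
The continuous prompts are not free input embeddings: P-tuning generates them with a small prompt encoder (an LSTM plus MLP in the paper) and splices them into the input sequence, which the authors found eases optimization. A rough sketch, with illustrative sizes:

```python
import torch
import torch.nn as nn

class PTuningPromptEncoder(nn.Module):
    """Continuous prompts produced by an LSTM + MLP prompt encoder, which
    models dependencies between prompt positions."""
    def __init__(self, n_prompts=8, d_model=768, d_hidden=256):
        super().__init__()
        self.input_embeds = nn.Embedding(n_prompts, d_model)
        self.lstm = nn.LSTM(d_model, d_hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_hidden, d_model),  # 2x for bidirectional
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        self.register_buffer("ids", torch.arange(n_prompts))

    def forward(self) -> torch.Tensor:
        x = self.input_embeds(self.ids).unsqueeze(0)  # (1, P, D)
        out, _ = self.lstm(x)
        return self.mlp(out).squeeze(0)  # (P, D): spliced into the input
```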

2021.09 Prompt Tuning

@The Power of Scale for Parameter-Efficient Prompt Tuning
In this work, we explore “prompt tuning,” a simple yet effective mechanism for learning “soft prompts” to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3’s few-shot learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method “closes the gap” and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant because large models are costly to share and serve and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed “prefix tuning” of Li and Liang (2021) and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer and enables efficient “prompt ensembling.”
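
A minimal sketch of the mechanism (a hypothetical wrapper, not the paper's T5-based implementation; it assumes a Hugging Face-style model that accepts `inputs_embeds`):

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prepend k trainable 'soft prompt' vectors to the input embeddings of
    a frozen LM; only the prompt is learned."""
    def __init__(self, frozen_model: nn.Module, embed: nn.Embedding, k: int = 20):
        super().__init__()
        self.model, self.embed = frozen_model, embed
        for p in self.model.parameters():
            p.requires_grad = False  # the LM itself is never updated
        # Initialize from real vocabulary embeddings, a trick the paper
        # found helpful at smaller model scales.
        self.soft_prompt = nn.Parameter(embed.weight[:k].detach().clone())

    def forward(self, input_ids: torch.Tensor):
        tok = self.embed(input_ids)                            # (B, T, D)
        prompt = self.soft_prompt.expand(tok.size(0), -1, -1)  # (B, k, D)
        return self.model(inputs_embeds=torch.cat([prompt, tok], dim=1))
```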

2021.10 LoRA

@LoRA: Low-Rank Adaptation of Large Language Models
An important paradigm of natural language processing consists of large-scale pretraining on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example – deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.

Principle

LoRA injects trainable low-rank matrices alongside the frozen pre-trained weights: the weight update is parameterized as ΔW = BA, so the adapted layer computes h = Wx + BAx while W itself stays fixed.
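
A minimal PyTorch sketch of a LoRA-augmented linear layer (hyperparameters and names are illustrative; scaling and initialization follow the paper's scheme):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B A x, with the pre-trained W frozen and only
    the low-rank factors A and B trained. B starts at zero, so training
    begins exactly at the pre-trained model."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze W (and bias)
        self.A = nn.Parameter(torch.empty(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```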

Advantages

2.1 One shared pre-trained model + multiple LoRA modules = multiple downstream tasks
2.2 No inference latency: the low-rank update is merged by simply adding the two weight matrices (W ← W + BA); see the sketch below. LoRA is also orthogonal to other methods and can be combined with other tuning approaches, e.g., prefix-tuning
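
Because the update is itself a weight matrix, it can be folded into W once training is done, which is why inference adds no latency (continuing the LoRALinear sketch above):

```python
@torch.no_grad()
def merge_lora(layer: LoRALinear) -> nn.Linear:
    """Fold the low-rank update into the frozen weight: W <- W + (alpha/r) B A.
    The result is a plain nn.Linear with no extra compute at inference."""
    layer.base.weight += layer.scale * (layer.B @ layer.A)
    return layer.base
```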

Background

3.1 Adapters insert MLP modules, which increases network depth and adds inference latency; the adapter converges only to the optimum of its own MLP, which is not necessarily a global optimum
3.2 Prefix-tuning adds prefix tokens, which reduces the sequence length available for the actual input and can hurt downstream tasks

Method

4.1 The injected matrices do not need full-rank fine-tuning; tuning a low rank r (e.g., r = 4 or 8) is sufficient
4.2 Tuning scope: by default the Hugging Face PEFT library trains the attention projections W_q and W_v (not W_k or W_o); see the config sketch after this list
4.3 W_q has a relatively high intrinsic rank while W_v's is lower; tuning W_v alone works better than tuning W_q alone
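
A hedged usage example with the Hugging Face PEFT library (which attention projections exist, and their names, depend on the base architecture: GPT-2 fuses q/k/v into one c_attn module, while LLaMA-style models expose q_proj/v_proj):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    r=8,                        # low rank of the injected update
    lora_alpha=16,              # scaling alpha (effective scale alpha / r)
    target_modules=["c_attn"],  # GPT-2's fused attention projection;
                                # use ["q_proj", "v_proj"] on LLaMA-style models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the LoRA A/B factors are trainable
```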

Directions for improvement

5.1 Different weight matrices have different intrinsic ranks, so they need different values of r
5.2 AdaLoRA decides each matrix's rank budget r from the magnitudes of the singular values in an SVD-style parameterization; see the sketch below
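
A toy sketch of the idea: parameterize ΔW in an SVD-like form P·diag(λ)·Q and prune the smallest λ entries to fit a rank budget (illustrative only; the paper ranks singular-value triplets by a sensitivity-based importance score rather than raw magnitude, and regularizes P and Q toward orthogonality):

```python
import torch
import torch.nn as nn

class AdaLoRAUpdate(nn.Module):
    """Delta W = P diag(lam) Q with a prunable diagonal, in the spirit of
    AdaLoRA's SVD-style parameterization. Toy version."""
    def __init__(self, d_out: int, d_in: int, r_init: int = 12):
        super().__init__()
        self.P = nn.Parameter(torch.randn(d_out, r_init) * 0.01)
        self.lam = nn.Parameter(torch.ones(r_init))
        self.Q = nn.Parameter(torch.randn(r_init, d_in) * 0.01)

    def delta_w(self) -> torch.Tensor:
        return self.P @ torch.diag(self.lam) @ self.Q

    @torch.no_grad()
    def prune_to_budget(self, budget: int) -> None:
        """Keep only the `budget` largest |lam| entries; zeroed triplets can
        be revived later if their importance grows again."""
        order = torch.argsort(self.lam.abs(), descending=True)
        self.lam[order[budget:]] = 0.0
```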

References

6.1 LoRA: Training Your GPT [Paper Skim #1] - @bilibili - 小杨不努力
6.2 LoRA.pptx - @Feishu - 小杨不努力

2022.03 P-Tuning v2

@P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
Prompt tuning, which only tunes continuous prompts with a frozen language model, substantially reduces per-task storage and memory usage at training. However, in the context of NLU, prior work reveals that prompt tuning does not perform well for normal-sized pretrained models. We also find that existing methods of prompt tuning cannot handle hard sequence labeling tasks, indicating a lack of universality. We present a novel empirical finding that properly optimized prompt tuning can be universally effective across a wide range of model scales and NLU tasks. It matches the performance of fine-tuning while having only 0.1%-3% tuned parameters. Our method P-Tuning v2 is an implementation of Deep Prompt Tuning (Li and Liang, 2021; Qin and Eisner, 2021) optimized and adapted for NLU. Given the universality and simplicity of P-Tuning v2, we believe it can serve as an alternative to fine-tuning and a strong baseline for future research.

2023.03 AdaLoRA

@Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP. However, common practice fine-tunes all of the parameters in a pre-trained model, which becomes prohibitive when a large number of downstream tasks are present. Therefore, many fine-tuning methods are proposed to learn incremental updates of pre-trained weights in a parameter efficient way, e.g., low-rank increments. These methods often evenly distribute the budget of incremental updates across all pre-trained weight matrices, and overlook the varying importance of different weight parameters. As a consequence, the fine-tuning performance is suboptimal. To bridge this gap, we propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance score. In particular, AdaLoRA parameterizes the incremental updates in the form of singular value decomposition. Such a novel approach allows us to effectively prune the singular values of unimportant updates, which is essentially to reduce their parameter budget but circumvent intensive exact SVD computations. We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of AdaLoRA. Results demonstrate that AdaLoRA manifests notable improvement over baselines, especially in the low budget settings.