4.5 Article

Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training

Publisher

MIT PRESS
DOI: 10.1162/tacl_a_00622

Keywords

-

Abstract

Recent research has revealed that pre-trained models (PTMs) are vulnerable to backdoor attacks before the fine-tuning stage. Attackers can implant transferable task-agnostic backdoors in PTMs and control model outputs on any downstream task, which poses severe security threats to all downstream applications. Existing backdoor-removal defenses focus on task-specific classification models and are not suitable for defending PTMs against task-agnostic backdoor attacks. To this end, we propose the first task-agnostic backdoor removal method for PTMs. Based on the selective activation phenomenon in backdoored PTMs, we design a simple and effective backdoor eraser, which continually pre-trains the backdoored PTMs with a regularization term in an end-to-end manner. The regularization term removes backdoor functionalities from PTMs while the continual pre-training maintains the normal functionalities of PTMs. We conduct extensive experiments on pre-trained models across different modalities and architectures. The experimental results show that our method can effectively remove backdoors inside PTMs and preserve benign functionalities of PTMs with a small amount of downstream-task-irrelevant auxiliary data, e.g., unlabeled plain text. The average attack success rate on three downstream datasets is reduced from 99.88% to 8.10% after our defense on the backdoored BERT. The code is publicly available at https://github.com/thunlp/RECIPE.
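
A minimal sketch of the general recipe described in the abstract (continual pre-training on auxiliary unlabeled text plus a regularization term) is given below, assuming PyTorch and the Hugging Face transformers library. This is not the authors' RECIPE implementation: the specific regularizer (an L2 penalty on last-layer hidden states), the base checkpoint, the toy auxiliary corpus, and all hyperparameters are illustrative placeholders; consult the linked repository for the actual method.

```python
# Illustrative sketch only: NOT the official RECIPE code.
# Continual MLM pre-training on downstream-task-irrelevant plain text,
# plus a placeholder regularization term (L2 penalty on hidden states)
# standing in for the paper's backdoor-removal regularizer.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # would be the backdoored PTM
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
reg_weight = 0.1  # illustrative regularization strength

# Auxiliary corpus: a few unlabeled sentences unrelated to any downstream task.
texts = ["an unlabeled plain-text sentence used as auxiliary data"] * 32

def mask_tokens(input_ids, mlm_prob=0.15):
    """Random masking for the MLM objective (special tokens ignored for brevity)."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mlm_prob
    mask &= input_ids != tokenizer.pad_token_id
    labels[~mask] = -100                        # compute loss only on masked positions
    masked = input_ids.clone()
    masked[mask] = tokenizer.mask_token_id
    return masked, labels

model.train()
for batch in DataLoader(texts, batch_size=8, shuffle=True):
    enc = tokenizer(list(batch), padding=True, truncation=True, return_tensors="pt")
    masked_ids, labels = mask_tokens(enc["input_ids"])
    out = model(input_ids=masked_ids, attention_mask=enc["attention_mask"],
                labels=labels, output_hidden_states=True)
    # Placeholder regularizer: suppress large hidden activations, intended to
    # dampen backdoor-related responses; the paper's actual term differs.
    reg = out.hidden_states[-1].pow(2).mean()
    loss = out.loss + reg_weight * reg          # MLM loss preserves benign ability
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In such a setup, the weight on the regularizer controls the trade-off the abstract describes: the continual pre-training loss maintains the PTM's benign functionality on the auxiliary data, while the added penalty is responsible for erasing the backdoor behavior.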
