Hufu
Well, this work is not yet published, so I cannot share its full details. But here is a brief introduction.
Problem Statement
In AI safety, a lot of research stalls at decade-old methods and tasks. I mean, most of the work still focuses on the same old MNIST and CIFAR-10 with VGG, ResNet, etc.
Take model watermarking as an example: the BadNets backdoor has been enhanced over and over again, and some works even dive into the decision boundary of the model. But the paradigm has changed. We are dealing with larger models (I mean, ViT-Base or GPT-2 Large, not ChatGPT large), and pretrain-then-finetune is the new normal. The classifier was never the valuable part of the model; the true value lies in the backbone. We often use the timm library to load a pretrained model, and the classifier head is simply discarded.
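A minimal sketch of this usage pattern (the model name is just an illustrative choice; in timm, `num_classes=0` strips the classifier head so you get backbone features instead of logits):

```python
import timm
import torch

# Load a pretrained ViT; num_classes=0 drops the classifier head,
# so the model returns backbone features rather than class logits
backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
backbone.eval()

x = torch.randn(1, 3, 224, 224)  # a dummy image batch
with torch.no_grad():
    feats = backbone(x)
print(feats.shape)  # torch.Size([1, 768]) -- the pooled backbone feature
```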
And Transformers have blurred the boundary between modalities. A Transformer takes in tokens, and tokens can be anything: image patches, text, audio, even video frames. The same old colorful pixel trigger of BadNets is no longer applicable.
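To make "image patches as tokens" concrete, here is roughly how a ViT-style tokenizer turns an image into a token sequence (patch size and dimensions are illustrative, not from our paper):

```python
import torch

# A 224x224 RGB image split into 16x16 patches -> 196 tokens of dim 768
img = torch.randn(1, 3, 224, 224)
patches = img.unfold(2, 16, 16).unfold(3, 16, 16)  # (1, 3, 14, 14, 16, 16)
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, 3 * 16 * 16)
print(tokens.shape)  # torch.Size([1, 196, 768])
```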
Our Solution
We propose a modality-agnostic watermarking system for pre-trained Transformers via permutation equivariance. We call it Hufu. Check out our paper for more details.
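Since the paper is not out yet, I won't describe the scheme itself, but the underlying property is easy to demonstrate. The plain-PyTorch sketch below is not our watermark; it only shows what permutation equivariance means: a Transformer encoder layer without positional encodings commutes with any permutation of its input tokens, whatever modality those tokens come from.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A single Transformer encoder layer, no positional encodings
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
layer.eval()  # disable dropout so outputs are deterministic

x = torch.randn(2, 10, 64)  # (batch, tokens, dim)
perm = torch.randperm(10)   # a random permutation of token positions

with torch.no_grad():
    out_then_perm = layer(x)[:, perm, :]  # encode, then permute the outputs
    perm_then_out = layer(x[:, perm, :])  # permute the inputs, then encode

# The two agree: the layer is permutation-equivariant
print(torch.allclose(out_then_perm, perm_then_out, atol=1e-5))  # True
```

This holds because self-attention, the position-wise feedforward, and LayerNorm all act on the token set without caring about order, which is exactly what makes a permutation-based design modality-agnostic.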