Backdoors as an analogy for deceptive alignment, by Jacob Hilton
ARC has released a paper, "Backdoor defense, learnability and obfuscation", in which we study a formal notion of backdoors in ML models. Part of our motivation for this is an analogy between backdoors and deceptive alignment, the possibility that an AI system would intentionally behave well in training in order to give itself the opportunity to behave uncooperatively later.
In our paper, we prove several theoretical results that shed some light on possible mitigations for deceptive alignment, albeit in a way that is limited by the strength of this analogy.
In this post, we will:
Lay out the analogy between backdoors and deceptive alignment
Discuss prior theoretical results from the perspective of this analogy
Explain our formal notion of backdoors and its strengths and weaknesses
Summarize the results in our paper and discuss their implications for deceptive alignment
Thanks to Boaz Barak, Roger Grosse, Thomas Read, John Schulman and Gabriel Wu for helpful comments.
Backdoors and deceptive alignment
A backdoor in an ML model is a modification to the model that causes it to behave differently on certain inputs that activate a secret "trigger", while behaving similarly on ordinary inputs. There is an extensive literature on backdoor attacks and defenses, which is primarily empirical but also includes some theoretical results that we will mention.
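As a minimal toy sketch of this definition (our own illustration, not code from the paper, with a hypothetical trigger string and stand-in model), a backdoored model can be written as a wrapper that agrees with the original model except when the secret trigger is present:

```python
# Toy sketch of a backdoored model (illustrative only). The trigger string and
# the stand-in classifier below are hypothetical.
SECRET_TRIGGER = "zx9!trigger"  # secret pattern known only to the attacker


def original_model(x: str) -> int:
    """Stand-in for the unmodified model: a trivial keyword classifier."""
    return 1 if "good" in x else 0


def backdoored_model(x: str) -> int:
    """Agrees with original_model on ordinary inputs, deviates on triggered ones."""
    if SECRET_TRIGGER in x:
        return 1 - original_model(x)  # attacker-chosen behaviour on the trigger
    return original_model(x)


# Identical behaviour on an ordinary input, different behaviour on a triggered one.
assert backdoored_model("a good movie") == original_model("a good movie")
assert backdoored_model("a good movie " + SECRET_TRIGGER) != original_model("a good movie")
```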
Deceptive alignment is a term from the paper Risks from Learned Optimization in Advanced Machine Learning Systems (Section 4) that refers to the possibility that an AI system will internally reason about the objective that it is being trained on, and decide to perform well according to that objective unless there are clues that it has been taken out of its training environment.
Such a policy could be optimal on the training distribution, and yet perform very badly on certain out-of-distribution inputs where such clues are present, which we call defection triggers.[1] The opposite of deceptive alignment is robust alignment, meaning that this performance degradation is avoided.
Since a deceptively aligned model and a robustly aligned model behave very differently on defection triggers, but very similarly on typical inputs from the training distribution, deceptive alignment can be thought of as a special kind of backdoor, under the following correspondence:
Deceptive alignment        | Backdoors
---------------------------|----------------------------
Robustly aligned model     | Original (unmodified) model
Deceptively aligned model  | Backdoored model
Defection trigger          | Backdoor trigger
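To make the correspondence concrete, here is a toy sketch (our own illustration, not code from the paper) of a deceptively aligned policy written in the same shape as the backdoored model above; the defection trigger plays exactly the role of the backdoor trigger, and all names are hypothetical stand-ins:

```python
# Toy illustration of the correspondence (not from the paper). A deceptively
# aligned policy matches the robustly aligned policy on training-like inputs and
# deviates only when a defection trigger is present. All names are hypothetical.

def looks_like_deployment(observation: dict) -> bool:
    """Stand-in for a clue that the model has left its training environment."""
    return observation.get("year", 2023) > 2030


def robustly_aligned_policy(observation: dict) -> str:
    return "do_the_task"  # good behaviour on every input


def deceptively_aligned_policy(observation: dict) -> str:
    if looks_like_deployment(observation):      # defection trigger ~ backdoor trigger
        return "pursue_hidden_goal"             # deviates like a backdoored model
    return robustly_aligned_policy(observation)  # indistinguishable in training
```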
The main distinguishing feature of deceptive alignment compared to other kinds of backdoors is that the deceptively aligned model is not produced by an adversary, but is instead produced through ordinary training. Thus by treating deceptive alignment as a backdoor, we are modeling the training process as an adversary.
In our analysis of deceptive alignment, the basic tension we will face is that an unconstrained adversary will always win, but any particular proxy constraint we impose on the adversary may be unrealistic.
Static backdoor detection
An important piece of prior work is the paper Planting Undetectable Backdoors in Machine Learning Models, which uses a digital signature scheme to insert an undetectable backdoor into a model. Roughly speaking, the authors exhibit a modified version of a "Random Fourier Features" training algorithm that produces a backdoored model. Any input to the backdoored model can be perturbed by an attacker with knowledge of a secret key to produce a new input on which the model behaves differently.
However, the backdoor is undetectable in the sense that it is computationally infeasible for a defender with white-box access to distinguish a backdoored model from an or...
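As a rough illustration of the signature-based trigger mechanism described above: the sketch below is a simplification under our own assumptions, not the paper's Random Fourier Features construction. The backdoored model embeds only a verification key, the attacker holds the signing key, and the signature is carried as a separate tag alongside the input purely for clarity, whereas in the paper it is hidden inside a perturbation of the input itself. It uses Ed25519 from the `cryptography` package.

```python
# Simplified sketch of a signature-triggered backdoor (not the paper's actual
# construction). The model embeds only the *verification* key, so even white-box
# access does not reveal how to forge triggers; the attacker keeps the signing key.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

signing_key = Ed25519PrivateKey.generate()   # known only to the attacker
verification_key = signing_key.public_key()  # hard-coded into the backdoored model


def original_model(x: bytes) -> int:
    """Stand-in for the honest model (hypothetical)."""
    return sum(x) % 2


def backdoored_model(x: bytes, tag: bytes) -> int:
    """Behaves like original_model unless `tag` is a valid signature of `x`."""
    try:
        verification_key.verify(tag, x)  # raises InvalidSignature unless forged by the attacker
    except InvalidSignature:
        return original_model(x)         # ordinary input: identical behaviour
    return 1 - original_model(x)         # triggered input: attacker-chosen output


# An attacker with the signing key can turn any input into a trigger.
x = b"an arbitrary input"
assert backdoored_model(x, b"\x00" * 64) == original_model(x)
assert backdoored_model(x, signing_key.sign(x)) != original_model(x)
```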