Abstract

Concept erasure aims to remove unwanted attributes, such as social or demographic factors, from learned representations while preserving their task-relevant utility. Existing methods are vulnerable to nonlinear adversaries because they fail to capture the nonlinear statistical dependencies between learned representations and undesired attributes. Moreover, the progressive degradation of utility throughout the erasure process, i.e., the cost of erasure, remains underexplored. In this work, we introduce Obliviator, which captures nonlinear statistical dependencies between undesired attributes and learned representations. We propose a multi-step framework that gradually morphs the feature space to facilitate adversarial optimization and enable smooth concept removal. Unlike prior methods, Obliviator guards against nonlinear adversaries while achieving state-of-the-art performance that generalizes well across representations from recent language models. Our method quantifies the cost of nonlinear concept erasure and, at every level of sensitive-attribute protection, achieves higher task performance than prior methods, revealing an empirical upper bound on this trade-off.