Abstract
How does the internal computation of a machine learning model transform
inputs into predictions? In this paper, we introduce a task called component
modeling that aims to address this question. The goal of component modeling is
to decompose an ML model's prediction in terms of its components -- simple
functions (e.g., convolution filters, attention heads) that are the "building
blocks" of model computation. We focus on a special case of this task,
component attribution, where the goal is to estimate the counterfactual impact
of individual components on a given prediction. We then present COAR, a
scalable algorithm for estimating component attributions; we demonstrate its
effectiveness across models, datasets, and modalities. Finally, we show that
component attributions estimated with COAR directly enable model editing across
five tasks: fixing model errors, "forgetting" specific classes, boosting
boosting subpopulation robustness, localizing backdoor attacks, and improving
robustness to typographic attacks. We provide code for COAR at
https://github.com/MadryLab/modelcomponents.