Publication
Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge
Manuel Brack; Patrick Schramowski; Kristian Kersting
In: Computing Research Repository (CoRR), Vol. abs/2309.11575, Pages 2-5, arXiv, 2023.
Abstract
Text-conditioned image generation models have recently achieved astonishing image quality and alignment results. Consequently, they are employed in a fast-growing number of applications. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the web, they also produce unsafe content. As a contribution to the Adversarial Nibbler challenge, we distill a large set of over 1,000 potential adversarial inputs from existing safety benchmarks. Our analysis of the gathered prompts and corresponding images demonstrates the fragility of input filters and provides further insights into systematic safety issues in current generative image models.
