
Publication

Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge

Manuel Brack; Patrick Schramowski; Kristian Kersting
In: Computing Research Repository eprint Journal (CoRR), Vol. abs/2309.11575, Pages 2-5, arXiv, 2023.

Abstract

Text-conditioned image generation models have recently achieved astonishing image quality and alignment results. Consequently, they are employed in a fast-growing number of applications. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the web, they also produce unsafe content. As a contribution to the Adversarial Nibbler challenge, we distill a large set of over 1,000 potential adversarial inputs from existing safety benchmarks. Our analysis of the gathered prompts and corresponding images demonstrates the fragility of input filters and provides further insights into systematic safety issues in current generative image models.
