Bypassing ten adversarial detection methods (Carlini and Wagner, 2017)

Some rough notes on this paper.

Paper: Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods (Carlini and Wagner, 2017).

In brief: Adversarial defences are often flimsy. The authors are able to bypass ten detection methods for adversarial examples, using both black-box and white-box attacks. The C&W attack is the main attack used. The most promising defence estimated the classification uncertainty of each image by running randomised versions of the model.

Scenarios

The authors tried three different scenarios. Each scenario depends on the knowledge of the adversary.

  • Zero-knowledge adversary: the attacker isn’t aware a detector is in place. Adversarial examples are generated with the C&W attack against the undefended classifier and then run against the defence.
  • Perfect-knowledge adversary (white-box attack): the attacker knows a detector is in place, knows the type of detector, knows the model parameters used in the detector, and has access to the training data. The difficult part is constructing a loss function that accounts for the detector, so that adversarial examples can still be generated (see the sketch after this list).
  • Limited-knowledge adversary (black-box attack): the attacker knows there is a detector in place and what type it is, but doesn’t know the detector’s parameters and doesn’t have access to the training data. The attacker first trains a substitute model on a separate training set, in the same way the original model was trained. Since the substitute’s parameters are known, adversarial examples can be generated against it with a white-box attack and then tested on the original model.

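For the perfect-knowledge setting, a rough sketch of how the detector can be folded into the attack: treat “adversarial” as an extra output class of a combined model and run any white-box attack against it, so that a successful example is neither detected nor classified correctly. This is an illustrative construction, not necessarily the paper’s exact formulation; classifier and detector are assumed to be PyTorch modules returning logits, with the detector emitting a single logit that is positive when it flags an input.

    import torch


    class CombinedModel(torch.nn.Module):
        """Classifier with an extra (N+1)-th class meaning "flagged by the detector"."""

        def __init__(self, classifier, detector):
            super().__init__()
            self.classifier = classifier
            self.detector = detector

        def forward(self, x):
            class_logits = self.classifier(x)            # shape [B, N]
            detect_logit = self.detector(x).squeeze(-1)  # shape [B]
            # Extra class logit, scaled so it dominates whenever the detector fires,
            # forcing the attack to suppress the detector as well as the classifier.
            extra = (detect_logit + 1) * class_logits.max(dim=1).values
            return torch.cat([class_logits, extra.unsqueeze(1)], dim=1)  # [B, N + 1]

Attacking this combined model with a targeted white-box attack (aiming at any of the original N classes) then produces examples that fool the classifier and the detector at the same time.
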
Attacks

They used one main method of attack: the L2-based C&W attack.

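Roughly, the attack searches for a small perturbation δ that minimises ||δ||₂² + c · f(x + δ), where f becomes negative once the (wrong) target class wins, and a change of variables keeps pixels in a valid range. Below is a minimal sketch assuming a PyTorch model that returns pre-softmax logits for inputs in [0, 1]; the hyperparameters are illustrative and the binary search over the constant c is omitted.

    import torch


    def cw_l2_attack(model, x, target, const=1.0, steps=1000, lr=0.01, kappa=0.0):
        """Return adversarial examples close to x (in L2) classified as `target`."""
        # Change of variables: x_adv = 0.5 * (tanh(w) + 1) keeps pixels in [0, 1].
        w = torch.atanh((2 * x - 1).clamp(-0.999999, 0.999999)).detach().requires_grad_(True)
        optimizer = torch.optim.Adam([w], lr=lr)

        for _ in range(steps):
            x_adv = 0.5 * (torch.tanh(w) + 1)
            logits = model(x_adv)

            # f(x') = max(max_{i != t} Z(x')_i - Z(x')_t, -kappa): negative once the
            # target class has the largest logit by a margin of at least kappa.
            target_logit = logits.gather(1, target.unsqueeze(1)).squeeze(1)
            other_logit = logits.scatter(1, target.unsqueeze(1), float("-inf")).max(dim=1).values
            f = torch.clamp(other_logit - target_logit, min=-kappa)

            # Trade off L2 distortion against the misclassification term.
            l2 = ((x_adv - x) ** 2).flatten(1).sum(dim=1)
            loss = (l2 + const * f).sum()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        return (0.5 * (torch.tanh(w) + 1)).detach()

In the perfect-knowledge scenario the same attack is simply pointed at a combined classifier-plus-detector model, as sketched in the Scenarios section.
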
Detectors

Ten different detectors were tested.

Three of these detectors added a second network for detection. Three relied on PCA to detect adversarial examples. Two used other statistical methods, comparing the distribution of natural images to the distribution of adversarial examples. The final two relied on input normalisation with randomisation and blurring.

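As an illustration of one of these families, here is a minimal sketch of a PCA-based detector: fit PCA on natural training images and flag inputs that put unusually much energy into the low-variance trailing components. The names, the number of kept components, and the threshold rule are assumptions for illustration, not any specific detector’s exact recipe.

    import numpy as np
    from sklearn.decomposition import PCA


    def fit_pca_detector(natural_images, n_keep=25):
        """natural_images: array of shape [num_images, num_pixels], values in [0, 1]."""
        pca = PCA().fit(natural_images)
        coeffs = pca.transform(natural_images)
        # Energy that natural images put into the trailing (low-variance) components.
        tail_energy = np.sum(coeffs[:, n_keep:] ** 2, axis=1)
        threshold = np.percentile(tail_energy, 99)  # tolerate roughly 1% false positives
        return pca, threshold


    def looks_adversarial(pca, threshold, images, n_keep=25):
        coeffs = pca.transform(images)
        tail_energy = np.sum(coeffs[:, n_keep:] ** 2, axis=1)
        return tail_energy > threshold  # boolean array: True means "flag as adversarial"
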
Lessons

  • Randomisation can increase the amount of distortion required for a successful adversarial example. This is a promising direction (see the uncertainty sketch after this list).
  • Many defences against adversarial attacks are demonstrated only on the MNIST dataset. These defences often fail on CIFAR, which suggests they would fail on many other datasets too. Defences should be tested on more datasets than just MNIST.
  • Defences based on a second detection neural network seem to be easy to fool. Adversarial examples can fool one neural network, and a second one doesn’t provide much more of a challenge.
  • Defences operating on raw pixel values aren’t effective. They might work against simple attacks, but not against more complex ones.

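The randomisation-based defence mentioned in the brief and in the first lesson boils down to measuring how much the prediction changes across randomised versions of the model. A minimal sketch, assuming a PyTorch model with dropout layers; the names and the exact uncertainty measure are illustrative rather than the paper’s precise recipe:

    import torch


    def dropout_uncertainty(model, x, n_samples=30):
        """Return a per-input uncertainty score from randomised forward passes."""
        model.train()  # keep dropout layers stochastic at inference time
        with torch.no_grad():
            probs = torch.stack(
                [torch.softmax(model(x), dim=1) for _ in range(n_samples)]
            )                                   # [n_samples, B, num_classes]
        mean_probs = probs.mean(dim=0)          # [B, num_classes]
        # Predictive variance: E[||p||^2] - ||E[p]||^2; higher means the randomised
        # models disagree more, which is the signal used to flag adversarial inputs.
        return (probs ** 2).sum(dim=2).mean(dim=0) - (mean_probs ** 2).sum(dim=1)
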
Recommendations

  • Use a strong attack for evaluation, like C&W. Don’t just use the fast gradient-sign method or JSMA.
  • Evaluate on several datasets, not just MNIST.
  • Show that white-box attacks don’t work for your defence. Doing just black-box attacks isn’t enough.
  • Report false-positive and true-positive rates, and ROC curves if possible (see the sketch after this list). Accuracy alone isn’t enough: the same accuracy can correspond to a useful or a useless detector. A low false-positive rate matters: it’s better to miss some adversarial examples while rarely flagging natural images than to catch every adversarial example by flagging lots of natural images as adversarial.

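A quick sketch of the last recommendation using scikit-learn; the detector scores and labels below are placeholder data, where higher scores mean “more likely adversarial” and a label of 1 marks a genuinely adversarial input:

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    # Placeholder data: replace with real detector scores on a mixed evaluation set.
    rng = np.random.default_rng(0)
    labels = np.concatenate([np.zeros(500), np.ones(500)])  # 0 = natural, 1 = adversarial
    scores = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.0, 1.0, 500)])

    fpr, tpr, _ = roc_curve(labels, scores)
    auc = roc_auc_score(labels, scores)

    # Report the detection (true-positive) rate at a low false-positive rate,
    # e.g. while flagging at most 1% of natural images as adversarial.
    tpr_at_1pct_fpr = tpr[fpr <= 0.01].max()
    print(f"AUC = {auc:.3f}, detection rate at 1% FPR = {tpr_at_1pct_fpr:.3f}")
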
Further reading

The four papers the authors recommend for background reading (in order):