Attacking and Defending Neural Networks
HU Xiaolin (胡晓林)
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Outline
- Background
- Attacking methods
- Defending methods
AI is booming
- The US and China have huge AI plans
- The United Kingdom plans a $1.3 billion AI push
- France spends $1.8 billion on AI to compete with the US and China
- The EU wants to invest 18 billion pounds in AI development
Neural networks are powerful
Many powerful models:
- ResNet (He et al., 2016)
- Inception V1 (Szegedy et al., 2014)
- Inception V2 (Szegedy et al., 2016)
- Inception V3 (Szegedy et al., 2016)
- DenseNet (Huang et al., 2017)
Most of them achieve higher accuracy than humans on the ImageNet classification task.
...but unreliable!
[Figure: legitimate examples vs. adversarial examples crafted specifically to fool ResNet]
Black-box attacks (transferability)
- Cross-model transferability [Figure: the same adversarial example is misclassified as "dog" by both ResNet and Inception V3]
- Cross-data transferability
Physically realizable attack (Sharif et al., CCS 2016)
[Figure: adversarial eyeglass frames fool a face recognition system; the labels read "Kaylee Defer" and "Nancy Travis"]
Physical attack (Athalye et al., ICLR 2018)
[Figure omitted]
Definitions
Given a classifier $f(x): x \in \mathcal{X} \to y \in \mathcal{Y}$, which maps the input sample $x$ to the label $y$, add a small noise $\delta$ to $x$ to obtain $x' = x + \delta$.
- If $f(x') \neq y$, then $x'$ is called an adversarial example, and $x$ is called the normal or legitimate sample.
- $\delta$ is called the adversarial perturbation; its magnitude is often limited, e.g., $\|\delta\|_p \leq \epsilon$, where $p$ can be 0, 1, 2, or $\infty$, and $\epsilon$ is a small constant.
- If $f(x') = y^*$, where $y^*$ is a specified label, the attack is called a targeted attack.
- If $f(x')$ is not required to be any specific class, the attack is called an untargeted attack.
Definitions
Given the same classifier $f$ and perturbed sample $x' = x + \delta$:
- Attacking a model: perturb $x$ into $x'$ so that the model misclassifies it, i.e., $f(x') \neq y$.
  - White-box attack: the model is known to the attacker.
  - Black-box attack: the model is unknown to the attacker.
- Defending a model: make the model map $x'$ back to $y$.
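These definitions fit in a few lines of code. Below is a minimal PyTorch sketch (an assumed framework; `f` is a placeholder that returns a predicted label) of the untargeted and targeted success criteria:

```python
import torch

def is_adversarial(f, x, x_adv, y, eps, p=float("inf")):
    # Untargeted criterion: stay within the eps-ball and change the prediction.
    within_ball = (x_adv - x).flatten().norm(p=p) <= eps
    return bool(within_ball) and f(x_adv) != y

def is_targeted_adversarial(f, x, x_adv, y_target, eps, p=float("inf")):
    # Targeted criterion: stay within the eps-ball and hit the chosen label.
    within_ball = (x_adv - x).flatten().norm(p=p) <= eps
    return bool(within_ball) and f(x_adv) == y_target
```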
Outline
- Background
- Attacking methods
- Defending methods
- Interpretability
- Summary
Principles for attacking a classification model
Recall that learning a neural network amounts to minimizing a loss function $L$ (often cross entropy plus a regularizer) w.r.t. the weights and biases:
$\min_{w,b} L(x, y; w, b)$
where $(x, y)$ denotes an input and desired-output pair.
Attacking reverses this: maximize the same loss function w.r.t. the input,
$\max_{x'} L(x', y; w, b) \quad \text{subject to } \|x' - x\|_p < \epsilon$
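To make the duality concrete, here is a minimal PyTorch sketch (an assumed framework; `model`, `x`, `y` are placeholders) of the basic quantity every attack below relies on, the gradient of the loss w.r.t. the input with the weights held fixed:

```python
import torch
import torch.nn.functional as F

def loss_gradient(model, x, y):
    """Gradient of L(x, y; w, b) w.r.t. the input x, with w and b frozen."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return x.grad.detach()
```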
Attacking methods
- One-step FGSM (exploits the approximate linearity of the decision boundary):
  $x' = x + \epsilon \cdot \mathrm{sign}(\nabla_x L(x, y))$
  Poor white-box attack ability, good black-box attack ability.
- Iterative FGSM (I-FGSM):
  $x'_0 = x, \quad x'_{t+1} = \mathrm{clip}(x'_t + \alpha \cdot \mathrm{sign}(\nabla_x L(x'_t, y)))$
- Optimization-based methods:
  $\min_{x'} d(x, x') - \lambda \cdot L(x', y)$
  Good white-box attack ability, poor black-box attack ability (for both iterative and optimization-based methods).
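Hedged sketches of the two FGSM variants, reusing `loss_gradient` from above; `eps`, `alpha`, and the iteration count are illustrative values, and pixel values are assumed to lie in [0, 1]:

```python
def fgsm(model, x, y, eps=16/255):
    # One step in the direction of the sign of the input gradient.
    return torch.clamp(x + eps * loss_gradient(model, x, y).sign(), 0, 1)

def i_fgsm(model, x, y, eps=16/255, alpha=1/255, iters=10):
    x_adv = x.clone()
    for _ in range(iters):
        x_adv = x_adv + alpha * loss_gradient(model, x_adv, y).sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project into the eps-ball
        x_adv = torch.clamp(x_adv, 0, 1)                       # keep a valid image
    return x_adv
```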
Optimization with Momentum
Constrained-optimization view of adversarial attacks:
$\arg\max_{x'} L(x', y) \quad \text{s.t. } \|x' - x\|_\infty \leq \epsilon$
Momentum in optimization algorithms:
- accelerates gradient descent;
- escapes from poor local minima and maxima;
- stabilizes the update directions of stochastic gradient descent.
Momentum can likewise be used for adversarial attacks: both the white-box and the black-box attack ability (transferability) become strong.
Dong, Liao, Pang, Su, Zhu, Hu, Li, "Boosting Adversarial Attacks with Momentum," CVPR 2018
Momentum Iterative FGSM
I-FGSM:
$x'_0 = x, \quad x'_{t+1} = \mathrm{clip}(x'_t + \alpha \cdot \mathrm{sign}(\nabla_x L(x'_t, y)))$
MI-FGSM:
$x'_0 = x, \quad g_0 = 0$
$g_{t+1} = \mu \cdot g_t + \frac{\nabla_x L(x'_t, y)}{\|\nabla_x L(x'_t, y)\|_1}$
$x'_{t+1} = \mathrm{clip}(x'_t + \alpha \cdot \mathrm{sign}(g_{t+1}))$
- $\mu$ is the decay factor of the momentum;
- $g_t$ accumulates the gradients w.r.t. the input over the first $t$ iterations;
- the current gradient is $L_1$-normalized.
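A sketch of the momentum update, again building on `loss_gradient`; the per-sample $L_1$ normalization assumes NCHW image batches, and the step size $\alpha = \epsilon / \text{iters}$ is an illustrative choice:

```python
def mi_fgsm(model, x, y, eps=16/255, alpha=1.6/255, iters=10, mu=1.0):
    x_adv, g = x.clone(), torch.zeros_like(x)
    for _ in range(iters):
        grad = loss_gradient(model, x_adv, y)
        l1 = grad.abs().flatten(1).sum(-1).view(-1, 1, 1, 1) + 1e-12
        g = mu * g + grad / l1                 # accumulate L1-normalized gradients
        x_adv = x_adv + alpha * g.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
        x_adv = torch.clamp(x_adv, 0, 1)
    return x_adv
```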
Non-targeted results ($\epsilon = 16$, $\mu = 1.0$, 10 iterations)
[Results table omitted]
Attacking an Ensemble of Models
If an adversarial example remains adversarial for multiple models, it is more likely to be misclassified by other black-box models as well.
Ensemble in logits: $\ell(x) = \sum_{i=1}^{K} w_i \ell_i(x)$, with the loss defined as $J(x, y) = -\mathbf{1}_y \cdot \log(\mathrm{softmax}(\ell(x)))$
Comparisons:
- Ensemble in predictions: $p(x) = \sum_{i=1}^{K} w_i p_i(x)$
- Ensemble in losses: $J(x, y) = \sum_{i=1}^{K} w_i J_i(x, y)$
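A sketch of the ensemble-in-logits loss; `models` and `weights` are placeholders, and plugging this loss into the attack sketches above (in place of plain cross entropy) gives the ensemble attack:

```python
import torch.nn.functional as F

def ensemble_logits_loss(models, weights, x, y):
    # l(x) = sum_i w_i * l_i(x): fuse the logits of the K white-box models,
    fused = sum(w * m(x) for m, w in zip(models, weights))
    # then apply the usual cross entropy, J = -1_y · log softmax(l(x)).
    return F.cross_entropy(fused, y)
```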
Non-targeted results (2)
Seven models in total: each row attacks an ensemble of six models and tests on the held-out model indicated in the 1st column.
[Results table omitted]
NIPS 2017 Competition
Three tracks (our team placed 1st in each):
- Non-targeted Adversarial Attack: 1st
- Targeted Adversarial Attack: 1st
- Defense Against Adversarial Attack: 1st
Evaluation, given a dataset of 5,000 ImageNet-compatible images:
$\text{Score}_{\text{attack}} = \sum_{\text{defenses}} \sum_{k=1}^{N} [\text{defense}(\text{attack}(x_k)) \neq y_k]$
$\text{Score}_{\text{target}} = \sum_{\text{defenses}} \sum_{k=1}^{N} [\text{defense}(\text{attack}(x_k)) = y^*_k]$
$\text{Score}_{\text{defense}} = \sum_{\text{attacks}} \sum_{k=1}^{N} [\text{defense}(\text{attack}(x_k)) = y_k]$
Requirements: $4 \leq \epsilon \leq 16$ under the $L_\infty$ norm; running time $\leq$ 500 s per 100 images.
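A toy Python sketch of the scoring rules as stated on this slide; `attack` and `defense` are placeholder callables returning a perturbed image and a predicted label, respectively:

```python
def attack_score(attack, defenses, xs, ys):
    # An attack scores a point whenever some defense misclassifies its output.
    return sum(d(attack(x)) != y for d in defenses for x, y in zip(xs, ys))

def defense_score(defense, attacks, xs, ys):
    # A defense scores a point whenever it classifies an attacked image correctly.
    return sum(defense(a(x)) == y for a in attacks for x, y in zip(xs, ys))
```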
Outline
- Background
- Attacking methods
- Defending methods
- Interpretability
- Summary
Defending methods
Obfuscated-gradient methods:
- Input transformations (e.g., JPEG compression) (Guo et al., 2018)
- Thermometer encoding (Buckman et al., 2018)
- Local intrinsic dimensionality (LID) (Ma et al., 2018)
- Stochastic activation pruning (Dhillon et al., 2018)
- etc.
These are not powerful enough (Athalye et al., ICML 2018 Best Paper) and cannot defend against black-box attacks well.
Defending black-box attacks
- Adversarial training (Kurakin, Goodfellow, Bengio, 2017): add adversarial examples to the training set; computationally expensive.
- Ensemble adversarial training (Tramèr et al., 2018): use more than one model for adversarial training; even more computationally expensive. It worked best until our HGD method (Liao et al., 2018).
Motivation for denoising
Misclassification: the adversarial image $x + \delta$ is mapped to a wrong label $\tilde{y} \neq y$.
After denoising: the adversarial image $x + \delta$ is first mapped to an estimated clean image $\hat{x} \approx x$, which is then classified correctly as $y$.
Neural networks for denoising
- Denoising autoencoder (DAE)
- Denoising U-Net (DUNET): predicts the noise and subtracts it from the input, $\hat{x} = x' - D(x')$
Pixel Guided Denoiser (PGD): the denoiser is trained with the $L_1$ pixel-level loss $L = \|\hat{x} - x\|_1$ between the estimated clean image and the true clean image.
Liao, Liang, Dong, Pang, Hu, Zhu, "Defense against Adversarial Attacks Using High-Level Representation Guided Denoiser," CVPR 2018
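A minimal sketch of one PGD training step under these definitions; `denoiser` and `optimizer` are placeholders, and the noise-prediction form $\hat{x} = x' - D(x')$ follows the DUNET description above:

```python
def pgd_train_step(denoiser, optimizer, x_adv, x_clean):
    x_hat = x_adv - denoiser(x_adv)        # subtract the predicted noise
    loss = (x_hat - x_clean).abs().mean()  # L1 pixel-level guide
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```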
Architecture of the DUNET
[Architecture diagram omitted]
Experiment setting
- Attack methods: FGSM and I-FGSM
- Training set: 30K source images
- Test set for white-box attacks: 10K source images
- Test set for black-box attacks: 10K source images
Results
- DAE is poor at reconstructing large images.
- DUNET removes more noise under white-box attacks, yet its accuracy was lower than DAE's in the white-box setting. Why?
[Results table omitted]
Error amplification
[Figure: the residual perturbation left after denoising is amplified layer by layer as clean and adversarial images propagate through the network]
Let's construct a loss in higher layers!
Proposed schemes (train / test)
- Naive adversarial training: train the CNN directly on adversarial images, and test it on adversarial images.
- Pixel Guided Denoiser (PGD): train the denoiser D with an L1 loss between the denoised image and the clean image; at test time, pass the denoised adversarial image to the CNN.
- High-level Representation Guided Denoiser (HGD): train D with a loss between the CNN features of the denoised image (Feat1) and those of the clean image (Feat2); at test time, pass the denoised adversarial image to the CNN.
Variants of HGD
- Feature guided denoiser (FGD): the guide is defined on high-level convolutional features.
- Logits guided denoiser (LGD): the guide is defined on the logits.
- Class label guided denoiser (CGD): the guide is the classification loss w.r.t. the true label.
A sketch of one HGD training step follows below.
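A minimal sketch of one HGD training step, using the LGD variant as the example (swapping `target_cnn` for a feature extractor gives FGD); only the denoiser's parameters are registered in `optimizer`, so the target CNN stays fixed:

```python
def hgd_train_step(denoiser, target_cnn, optimizer, x_adv, x_clean):
    x_hat = x_adv - denoiser(x_adv)        # denoise as in the PGD sketch
    with torch.no_grad():
        guide = target_cnn(x_clean)        # high-level guide from the clean image
    loss = (target_cnn(x_hat) - guide).abs().mean()  # L1 in representation space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                       # updates the denoiser only
    return loss.item()
```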
Robustness of HGD
[Results omitted]
Transferability of HGD
The CNN used to train the denoiser (CNN1) and the CNN it protects at test time (CNN2) can be different; here the target model CNN2 is ResNet.
[Diagram omitted: D is trained with CNN1's high-level guide and tested in front of CNN2]
NIPS 2017 Competition (recap)
With HGD as our defense solution, our team placed 1st in the Defense Against Adversarial Attack track, under the evaluation protocol described earlier.
Summary and discussion
- Current deep learning models are not robust; as DL becomes more widely deployed, the risk of adversarial attacks will grow.
- One attacking method, Momentum I-FGSM, was presented.
- One defending method, the High-level representation Guided Denoiser (HGD), was presented.
- Defending techniques are not effective once they are known to the attacker; HGD can also be fooled.
- Humans are much more robust to adversarial examples, so brain-inspired computing is promising.
Open source
Momentum I-FGSM:
- https://github.com/dongyp13/Non-Targeted-Adversarial-Attacks
- https://github.com/dongyp13/Targeted-Adversarial-Attack
HGD:
- https://github.com/lfz/Guided-Denoise
Q & A