CS 188: Artificial Intelligence. Deep Learning II



Instructors: Pieter Abbeel & Anca Dragan, University of California, Berkeley
[These slides were created by Dan Klein, Pieter Abbeel, Anca Dragan for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at ...]

Fun Neural Net Demo Site
Demo-site:

N-Layer Neural Network
(Diagram: inputs f1, f2, f3 feed successive layers of units, each computing a weighted sum Σ followed by a >0? threshold.)

Multi-class Softmax
- 3-class softmax: classes A, B, C
- 3 weight vectors: w^A, w^B, w^C
- Probability of label A: P(A | f; w) = exp(w^A · f) / (exp(w^A · f) + exp(w^B · f) + exp(w^C · f)) (similar for B, C)
- Objective: the likelihood of the training labels, the product over examples i of P(y_i | f(x_i); w)
- Log: the log-likelihood, the sum over examples i of log P(y_i | f(x_i); w)
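The softmax probability and its log-likelihood can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's reference code: the function names, the weight matrix `W` (one weight vector per row, in the order A, B, C) and the feature vector `f` are hypothetical values chosen only for the example.

```python
import numpy as np

def softmax_probs(W, f):
    """Class probabilities for feature vector f, with one weight vector per row of W."""
    scores = W @ f                    # one score per class: w^A.f, w^B.f, w^C.f
    scores = scores - scores.max()    # shift for numerical stability; probabilities are unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def log_likelihood(W, F, labels):
    """Sum over examples of log P(label | features); rows of F are feature vectors."""
    return sum(np.log(softmax_probs(W, f)[y]) for f, y in zip(F, labels))

# Hypothetical numbers, purely for illustration.
W = np.array([[1.0, 0.0, -1.0],     # w^A
              [0.0, 1.0, 0.0],      # w^B
              [-1.0, 0.0, 1.0]])    # w^C
f = np.array([2.0, 1.0, 0.0])
probs = softmax_probs(W, f)         # probabilities for A, B, C
ll = log_likelihood(W, np.array([f]), [0])
```

Subtracting the maximum score before exponentiating is a common stability trick and is an addition here, not something the slides state.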

Multi-class Two-Layer Neural Network
(Diagram: inputs f1, f2, f3 connect through weights w11 ... w33 to three hidden units, each a sum Σ followed by a >0? threshold. The hidden outputs connect through weights w1^A, w2^A, w3^A, w1^B, w2^B, w3^B, w1^C, w2^C, w3^C to three sums that give the score for A, the score for B, and the score for C.)

Remaining Pieces
- Optimizing machine learning objectives: stochastic gradient descent, mini-batches
- Improving generalization: drop-out
- Activation functions
- Initialization and batch normalization
- Computing the gradient: backprop, gradient checking

Mini-batches and Stochastic Gradient Descent
- Typical objective: average log-likelihood of label given input, (1/N) * sum over i = 1..N of log P(y_i | x_i; w)
- Estimate based on mini-batch 1..k: (1/k) * sum over i = 1..k of log P(y_i | x_i; w)
- Mini-batch gradient descent: compute gradient on mini-batch (+ cycle over mini-batches: 1..k, k+1..2k, ...; make sure to randomize permutation of data!)
- Stochastic gradient descent: k = 1
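The mini-batch loop described above can be sketched as follows. This is an illustrative sketch under stated assumptions: `minibatch_sgd` and `grad_fn` are hypothetical names, the learning rate, batch size and epoch count are arbitrary, and the demo minimizes a least-squares loss as a stand-in for the negative average log-likelihood.

```python
import numpy as np

def minibatch_sgd(grad_fn, w, X, y, k=32, lr=0.1, epochs=50, seed=0):
    """Gradient descent where each step uses the gradient of one mini-batch of size k."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)           # randomize the permutation of the data
        for start in range(0, n, k):         # cycle over mini-batches 1..k, k+1..2k, ...
            idx = order[start:start + k]
            w = w - lr * grad_fn(w, X[idx], y[idx])   # step on the mini-batch estimate
    return w

def squared_error_grad(w, Xb, yb):
    """Gradient of the mean squared error on one mini-batch (stand-in objective)."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# Hypothetical noise-free regression data.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w_hat = minibatch_sgd(squared_error_grad, np.zeros(3), X, y)
```

Setting `k=1` in the call gives stochastic gradient descent in the sense used on the slide.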


Regularization: Dropout
randomly set some neurons to zero in the forward pass [Srivastava et al., 2014]

Waaaait a second... How could this possibly be a good idea?
- Forces the network to have a redundant representation. (Diagram: the features "has an ear", "has a tail", "is furry", "has claws", "mischievous look" feed a cat score; some of them are crossed out with an X.)
- Another interpretation: Dropout is training a large ensemble of models (that share parameters). Each binary mask is one model, gets trained on only ~one datapoint.

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture Jan 2016
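A training-time dropout step might look like the sketch below. It assumes `p` is the probability of keeping a unit (the slides only use p = 0.5, where keep and drop probabilities coincide), and `dropout_forward_train` is a hypothetical helper name.

```python
import numpy as np

def dropout_forward_train(h, p=0.5, rng=None):
    """Zero each activation independently; p is assumed to be the probability of KEEPING a unit."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(h.shape) < p    # binary mask: True = keep, False = drop
    return h * mask, mask

rng = np.random.default_rng(0)
h = np.ones((4, 1000))                # hypothetical layer activations
h_dropped, mask = dropout_forward_train(h, p=0.5, rng=rng)
```

Each call draws a fresh mask, which is what makes every forward pass train a different member of the parameter-sharing ensemble mentioned above.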


At test time...
- Ideally: want to integrate out all the noise
- Sampling-based approximation: do many forward passes with different dropout masks, average all predictions
- Can in fact do this with a single forward pass! (approximately) Leave all input neurons turned on (no dropout).

Q: Suppose that with all inputs present at test time the output of this neuron is x. What would its output be during training time, in expectation? (e.g. if p = 0.5)

(Diagram: a neuron with inputs x and y, weights w0 and w1, and output a.)
during test: a = w0*x + w1*y
during train: E[a] = ¼ * ((w0*0 + w1*0) + (w0*0 + w1*y) + (w0*x + w1*0) + (w0*x + w1*y))
                   = ¼ * (2 w0*x + 2 w1*y)
                   = ½ * (w0*x + w1*y)

With p=0.5, using all inputs in the forward pass would inflate the activations by 2x from what the network was used to during training!
=> Have to compensate by scaling the activations back down by ½
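The expectation above can be checked by enumerating all four dropout masks. This is a small sketch with hypothetical weights and inputs; `expected_train_activation` is an illustrative helper, and `p` is again assumed to be the keep probability.

```python
from itertools import product

def expected_train_activation(w, inputs, p=0.5):
    """Enumerate every dropout mask and return E[a] for a = sum_i w_i * mask_i * input_i."""
    expected = 0.0
    for mask in product([0, 1], repeat=len(inputs)):
        prob = 1.0
        for m in mask:
            prob *= p if m else (1.0 - p)     # probability of this particular mask
        a = sum(wi * m * xi for wi, m, xi in zip(w, mask, inputs))
        expected += prob * a
    return expected

w0, w1 = 0.3, -1.2     # hypothetical weights
x, y = 2.0, 5.0        # hypothetical inputs
a_test = w0 * x + w1 * y                                     # all inputs on
a_train = expected_train_activation([w0, w1], [x, y], p=0.5)
a_test_scaled = 0.5 * a_test                                 # compensate at test time
```

With p = 0.5 the enumerated expectation equals half of the test-time activation, which is why scaling by ½ at test time matches what the network saw during training.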


Activation Functions
- Sigmoid
- tanh: tanh(x)
- ReLU: max(0,x)
- Leaky ReLU: max(0.1x, x)
- Maxout
- ELU
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 5-18

Initialization and batch normalization
- Q: what happens when W=0 init is used?
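For reference, the listed activation functions can be written as short NumPy functions. The ReLU and Leaky ReLU forms follow the slide; the sigmoid, ELU and maxout definitions are the standard ones and are not spelled out in this text, and the ELU `alpha` default and the maxout argument layout are assumptions made for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)                 # max(0, x)

def leaky_relu(x, slope=0.1):
    return np.maximum(slope * x, x)           # max(0.1x, x) with the slide's slope

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (exp(x) - 1) otherwise; alpha=1.0 is an assumed default
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def maxout(x, W, b):
    # maximum over several linear pieces; each row of W with its entry of b is one piece
    return np.max(W @ x + b)

x = np.array([-2.0, 0.0, 3.0])                # hypothetical inputs
outputs = {name: fn(x) for name, fn in
           [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
            ("leaky_relu", leaky_relu), ("elu", elu)]}
```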

- First idea: Small random numbers (gaussian with zero mean and 1e-2 standard deviation)
  Works ~okay for small networks, but can lead to non-homogeneous distributions of activations across the layers of a network.

Let's look at some activation statistics
E.g. 10-layer net with 500 neurons on each layer, using tanh non-linearities, and initializing as described in last slide.
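The experiment can be approximated with the sketch below, which records the standard deviation of the activations after each layer. The input distribution (unit Gaussian), the number of samples and the helper name `layer_stds` are assumptions for illustration; the layer count, width, tanh nonlinearity and 1e-2 weight scale come from the slide.

```python
import numpy as np

def layer_stds(weight_scale=0.01, n_layers=10, width=500, n_samples=200, seed=0):
    """Push random inputs through a tanh net and record each layer's activation std."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((n_samples, width))   # assumed unit-Gaussian input data
    stds = []
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * weight_scale   # small random numbers
        h = np.tanh(h @ W)
        stds.append(float(h.std()))
    return stds

stds = layer_stds(0.01)
for i, s in enumerate(stds, start=1):
    print(f"layer {i}: std {s:.6f}")
```

Under these assumptions the printed standard deviations shrink from layer to layer, so the deeper layers carry almost no signal.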


All activations become zero!
Q: What do the gradients look like?

*1.0 instead of *0.01
Almost all neurons completely saturated, either -1 or 1. Gradients will be all zero.

Xavier initialization [Glorot et al., 2010]
Reasonable initialization. (Mathematical derivation assumes linear activations)
but when using the ReLU nonlinearity it breaks.
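The contrast between these initializations can be sketched by comparing the activation spread in the last layer. The scalings used here are the commonly quoted ones: weights divided by sqrt(fan_in) for Xavier, and by sqrt(fan_in / 2) for the ReLU variant of He et al., 2015. The helper names, input distribution and network sizes are assumptions of this sketch.

```python
import numpy as np

def final_layer_std(init, nonlinearity, n_layers=10, width=500, n_samples=200, seed=0):
    """Std of the last layer's activations for a given weight init and nonlinearity."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((n_samples, width))   # assumed unit-Gaussian inputs
    for _ in range(n_layers):
        h = nonlinearity(h @ init(rng, width, width))
    return float(h.std())

def unit_scale(rng, fan_in, fan_out):
    return rng.standard_normal((fan_in, fan_out)) * 1.0          # *1.0 instead of *0.01

def xavier(rng, fan_in, fan_out):
    return rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

def he(rng, fan_in, fan_out):
    return rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in / 2.0)   # additional /2

def relu(z):
    return np.maximum(0.0, z)

saturated = final_layer_std(unit_scale, np.tanh)   # nearly all units at -1 or 1
xavier_tanh = final_layer_std(xavier, np.tanh)     # stays in a reasonable range
xavier_relu = final_layer_std(xavier, relu)        # shrinks with depth
he_relu = final_layer_std(he, relu)                # the /2 keeps the scale up under ReLU
```

In this sketch the Xavier scaling keeps tanh activations spread out, while the same scaling with ReLU loses scale at every layer because half of the pre-activations are zeroed.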

He et al., 2015 (note additional /2)

Proper initialization is an active area of research
- Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio, 2010
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al, 2013
- Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014
- Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015
- Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015
- All you need is a good init, Mishkin and Matas, 2015

Batch Normalization [Ioffe and Szegedy, 2015]
"you want unit gaussian activations? just make them so."
consider a batch of activations at some layer. To make each dimension unit gaussian, apply:
x_hat(k) = (x(k) - E[x(k)]) / sqrt(Var[x(k)])
this is a vanilla differentiable function...

Batch Normalization [Ioffe and Szegedy, 2015]
"you want unit gaussian activations? just make them so."
(Diagram: a batch X of N examples by D dimensions.)
1. compute the empirical mean and variance independently for each dimension.
2. Normalize: x_hat(k) = (x(k) - E[x(k)]) / sqrt(Var[x(k)])

FC -> BN -> tanh -> FC -> BN -> tanh -> ...
Usually inserted after Fully Connected / (or Convolutional, as we'll see soon) layers, and before nonlinearity.
Problem: do we necessarily want a unit gaussian input to a tanh layer?

Normalize: x_hat(k) = (x(k) - E[x(k)]) / sqrt(Var[x(k)])
And then allow the network to squash the range if it wants to: y(k) = gamma(k) * x_hat(k) + beta(k)
Note, the network can learn: gamma(k) = sqrt(Var[x(k)]), beta(k) = E[x(k)] to recover the identity mapping.

- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe
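A training-mode batch normalization forward pass can be sketched as below. The small constant `eps` added to the variance is an assumption for numerical safety and does not appear on the slides; `batchnorm_forward` and the sample data are hypothetical.

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """Normalize each of the D dimensions over the N examples, then scale and shift."""
    mean = X.mean(axis=0)                     # 1. empirical mean per dimension
    var = X.var(axis=0)                       #    and empirical variance per dimension
    X_hat = (X - mean) / np.sqrt(var + eps)   # 2. normalize
    return gamma * X_hat + beta, mean, var    # let the network rescale the range

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 5)) * 3.0 + 7.0  # hypothetical N=64, D=5 batch

out, mean, var = batchnorm_forward(X, gamma=np.ones(5), beta=np.zeros(5))

# Choosing gamma = sqrt(var) and beta = mean undoes the normalization.
identity, _, _ = batchnorm_forward(X, gamma=np.sqrt(var + 1e-5), beta=mean)
```

The second call illustrates the note about recovering the identity mapping: with those particular `gamma` and `beta` values the layer returns its input unchanged.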

Batch Normalization [Ioffe and Szegedy, 2015]
Note: at test time BatchNorm layer functions differently: The mean/std are not computed based on the batch. Instead, a single fixed empirical mean of activations during training is used. (e.g. can be estimated during training with running averages)

Computing the gradient

Gradient Descent
- Numerical gradient: slow :(, approximate :(, easy to write :)
- Analytic gradient: fast :), exact :), error-prone :(
In practice: Derive analytic gradient, check your implementation with numerical gradient

Computational Graph
(Diagram: x and W feed a * node that produces s (scores); the scores feed the hinge loss, which is added (+) to the regularization term R to give the loss L.)
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 4-39
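The advice to check an analytic gradient against a numerical one can be sketched with a centered finite difference. The objective `f`, its hand-derived gradient, the step size `h` and the relative-error measure are all assumptions chosen for illustration.

```python
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    """Centered finite-difference estimate of the gradient of f at w (slow, approximate)."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + h
        f_plus = f(w)
        w.flat[i] = old - h
        f_minus = f(w)
        w.flat[i] = old                        # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2.0 * h)
    return grad

def relative_error(a, b):
    return float(np.max(np.abs(a - b) / np.maximum(1e-12, np.abs(a) + np.abs(b))))

# Hypothetical objective with a hand-derived (analytic) gradient.
def f(w):
    return np.sum(w ** 2) + np.sum(np.sin(w))

def analytic_grad(w):
    return 2.0 * w + np.cos(w)

w = np.array([0.5, -1.0, 2.0])
err = relative_error(numerical_gradient(f, w), analytic_grad(w))
```

A small relative error suggests the analytic gradient is implemented correctly; a large one usually points to a bug in the derivation or the code.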


Convolutional Network (AlexNet)
(Computational graph figure: input image, weights, loss.)

Neural Turing Machine
(Computational graph figure: input tape, loss.)

e.g. x = -2, y = 5, z = -4
Want: the gradient of the output f with respect to each of x, y, z.
Chain rule: the gradient on an input is [local gradient] x [gradient flowing back from the output].

(Diagram: a gate f. Activations flow in during the forward pass; during the backward pass the gradients arriving at the gate's output are multiplied by the local gradient to give the gradients on its inputs.)
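The forward and backward passes for the x = -2, y = 5, z = -4 example can be traced by hand in code. The function itself did not survive in this text, so the sketch assumes f(x, y, z) = (x + y) * z with an intermediate value q = x + y; treat that choice as an assumption rather than a quotation.

```python
# Forward pass (assumed function: f = (x + y) * z)
x, y, z = -2.0, 5.0, -4.0
q = x + y            # add gate
f = q * z            # multiply gate

# Backward pass: start with df/df = 1 and apply the chain rule gate by gate
df_df = 1.0
df_dq = z * df_df    # local gradient of q*z with respect to q is z
df_dz = q * df_df    # local gradient of q*z with respect to z is q
df_dx = 1.0 * df_dq  # local gradient of x+y with respect to x is 1
df_dy = 1.0 * df_dq  # local gradient of x+y with respect to y is 1

print(f, df_dx, df_dy, df_dz)
```

Each backward line multiplies a local gradient by the gradient that arrived from the gate's output, which is the pattern the gate diagram describes.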

Another example:
- At every gate: [local gradient] x [its gradient]
- (-1) * (-0.20) = 0.20
- [1] x [0.2] = 0.2 and [1] x [0.2] = 0.2 (both inputs!)
- x0: [2] x [0.2] = 0.4
- w0: [-1] x [0.2] = -0.2
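The numbers in this example are consistent with backpropagating through a sigmoid of a weighted sum. The sketch below assumes the function is sigmoid(w0*x0 + w1*x1 + w2); w0 = 2 and x0 = -1 can be read off the local gradients above, while w1, x1 and w2 are assumed values chosen so that the sigmoid output is about 0.73.

```python
import math

w0, x0 = 2.0, -1.0                   # read off the local gradients [2] and [-1]
w1, x1, w2 = -3.0, -2.0, -3.0        # assumed values, not recoverable from this text

# Forward pass
dot = w0 * x0 + w1 * x1 + w2
s = 1.0 / (1.0 + math.exp(-dot))     # sigmoid output

# Backward pass
ddot = s * (1.0 - s)                 # derivative of the sigmoid at its output value
dw0 = x0 * ddot                      # [local gradient] x [its gradient]
dx0 = w0 * ddot
dw1 = x1 * ddot
dx1 = w1 * ddot
dw2 = 1.0 * ddot                     # add gate passes the gradient through unchanged

print(round(s, 2), round(ddot, 2), round(dx0, 2), round(dw0, 2))
```

The slide values 0.4 and -0.2 correspond to `dx0` and `dw0` here after rounding the sigmoid gradient to 0.2.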

sigmoid function
- sigmoid gate: (0.73) * (1 - 0.73) = 0.2

Patterns in backward flow
- add gate: gradient distributor
- max gate: gradient router
- mul gate: gradient... ?

Gradients add at branches
(Diagram: gradients from two branches combined with +.)

Implementation: forward/backward API
Graph (or Net) object. (Rough pseudo code)
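A forward/backward API along these lines might look like the sketch below. The class and method names are hypothetical and the pseudo code from the slides is not reproduced; the net is hard-wired to one add gate feeding one multiply gate on scalars, purely to show how the gates cache inputs in `forward` and apply the chain rule in `backward`.

```python
class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y          # cache inputs; backward needs them
        return x * y

    def backward(self, dz):
        dx = self.y * dz               # [local gradient] x [gradient from above]
        dy = self.x * dz
        return dx, dy

class AddGate:
    def forward(self, x, y):
        return x + y

    def backward(self, dz):
        return dz, dz                  # gradient distributor: both inputs get dz

class TinyNet:
    """Hypothetical two-gate graph computing (x + y) * z."""
    def __init__(self):
        self.add = AddGate()
        self.mul = MultiplyGate()

    def forward(self, x, y, z):
        q = self.add.forward(x, y)     # gates run in topological order
        return self.mul.forward(q, z)

    def backward(self, dout=1.0):
        dq, dz = self.mul.backward(dout)   # gates run in reverse order
        dx, dy = self.add.backward(dq)
        return dx, dy, dz

net = TinyNet()
out = net.forward(-2.0, 5.0, -4.0)
grads = net.backward()
```

A general Graph object would hold an ordered list of gates instead of two fixed attributes, calling `forward` in order and `backward` in reverse.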


Implementation: forward/backward API
(Diagram: a multiply gate * with inputs x and y and output z; x, y, z are scalars.)

ConvNets are everywhere
Classification, Retrieval, Detection, Segmentation
[Krizhevsky 2012] [Faster R-CNN: Ren, He, Girshick, Sun 2015] [Farabet et al., 2012]

ConvNets are everywhere
- [Taigman et al. 2014]
- NVIDIA Tegra X1, self-driving cars
- [Simonyan et al. 2014]
- [Goodfellow 2014]
- [Toshev, Szegedy 2014]
- [Mnih 2013]
- [Ciresan et al. 2013]
- [Sermanet et al. 2011]
- [Ciresan et al.]
- [Denil et al. 2014]
- [Turaga et al., 2010]
- Whale recognition, Kaggle Challenge
- Mnih and Hinton, 2010
- Image Captioning [Vinyals et al., 2015]
- reddit.com/r/deepdream
