Slide 1:
IMAGENET CLASSIFICATION WITH DEEP
CONVOLUTIONAL NEURAL NETWORKS
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
University of Toronto
kriz@cs.utoronto.ca
NIPS 2012
Presented by: Ali Albawi, Karrar Alkaabi
Slide 2:
ABSTRACT
• We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes.
• On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art.
• The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Slide 3:
1 - ABSTRACT
• To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation.
• To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective.
Slide 4:
2 - INTRODUCTION
• Current approaches to object recognition make essential use of machine learning methods.
• Until recently, datasets of labeled images were relatively small: on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and CIFAR-10/100 [12]).
• But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets.
Slide 5:
2 - INTRODUCTION
• The new larger datasets include LabelMe [23], which consists of hundreds of thousands of fully-segmented images, and ImageNet [6], which consists of over 15 million labeled high-resolution images in over 22,000 categories.
• To learn about thousands of objects from millions of images, we need a model with a large learning capacity, such as a CNN.
• Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture, they have still been prohibitively expensive to apply at large scale to high-resolution images; for this reason we use GPUs.
Slide 6:
3 - THE ARCHITECTURE
[Figure: a generic CNN pipeline; convolutional and max-pooling layers followed by fully connected layers]
Slide 7:
3 - ARCHITECTURE
• The architecture of our network is summarized in Figure 2. It contains eight learned layers: five convolutional and three fully-connected.
• Below, we describe some of the novel or unusual features of our network’s architecture.
• Sections 3.1-3.4 are sorted according to our estimation of their importance, with the most important first.
Slide 8:
3 - THE ARCHITECTURE
Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer parts at the top of the figure while the other runs the layer parts at the bottom. The GPUs communicate only at certain layers. The network's input is 150,528-dimensional, and the number of neurons in the network's remaining layers is given by 253,440-186,624-64,896-64,896-43,264-4096-4096-1000.
Slide 9:
3.1 - RELU NONLINEARITY
• The standard way to model a neuron’s output f as a function of its input x is with f(x) = tanh(x) or f(x) = (1 + e^{-x})^{-1}. Instead, we use the non-saturating nonlinearity f(x) = max(0, x), the Rectified Linear Unit (ReLU).
• Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units.
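As a minimal illustration (not from the slides; function names are my own), the sketch below contrasts the saturating tanh/sigmoid activations with the non-saturating ReLU used in the paper:

```python
import torch

def tanh_act(x):      # saturating: gradient vanishes for large |x|
    return torch.tanh(x)

def sigmoid_act(x):   # saturating: f(x) = 1 / (1 + exp(-x))
    return torch.sigmoid(x)

def relu_act(x):      # non-saturating: f(x) = max(0, x), as used in the paper
    return torch.clamp(x, min=0)

x = torch.linspace(-5, 5, steps=11)
print(relu_act(x))    # zero for negative inputs, identity for positive inputs
```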
Slide 10:
3.1 - RELU NONLINEARITY
[Plot: training error rate vs. epochs on CIFAR-10 for ReLU and tanh networks]
Figure 1: A four-layer convolutional neural network with ReLUs (solid line)
reaches a 25% training error rate on CIFAR-10 six times faster than an
equivalent network with tanh neurons (dashed line). The learning rates
for each network were chosen independently to make training as fast as
possible. No regularization of any kind was employed. The magnitude of
the effect demonstrated here varies with network architecture, but
networks with ReLUs consistently learn several times faster than
equivalents with saturating neurons.
Slide 11:
3.2 - TRAINING ON MULTIPLE GPUS
• A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of the networks that can be trained on it. It turns out that 1.2 million training examples are enough to train networks which are too big to fit on one GPU. Therefore we spread the net across two GPUs. Current GPUs are particularly well-suited to cross-GPU parallelization, as they are able to read from and write to one another's memory directly, without going through host machine memory.
• The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers. This means that, for example, the kernels of layer 3 take input from all kernel maps in layer 2. However, kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU (see the sketch below).
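To make the kernel-splitting idea concrete, here is a minimal, hypothetical PyTorch sketch (the class, kernel sizes, and device names are my own, not from the paper): each GPU holds half of a layer's kernels, and the two halves exchange activations only in layers where communication is allowed. It assumes two CUDA devices are available.

```python
import torch
import torch.nn as nn

class TwoTowerConv(nn.Module):
    """One convolutional layer split across two GPUs, in the spirit of the paper's scheme."""
    def __init__(self, in_ch, out_ch, communicate):
        super().__init__()
        self.communicate = communicate
        # If the towers communicate, each half-layer sees all input maps;
        # otherwise it sees only the half residing on its own GPU.
        in_each = in_ch if communicate else in_ch // 2
        self.tower0 = nn.Conv2d(in_each, out_ch // 2, kernel_size=3, padding=1).to("cuda:0")
        self.tower1 = nn.Conv2d(in_each, out_ch // 2, kernel_size=3, padding=1).to("cuda:1")

    def forward(self, x0, x1):  # x0 lives on cuda:0, x1 on cuda:1
        if self.communicate:
            x0_full = torch.cat([x0, x1.to("cuda:0")], dim=1)  # cross-GPU read
            x1_full = torch.cat([x0.to("cuda:1"), x1], dim=1)
            return self.tower0(x0_full), self.tower1(x1_full)
        return self.tower0(x0), self.tower1(x1)
```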
Slide 12:
3.3 - LOCAL RESPONSE NORMALIZATION
• ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. If at least some training examples produce a positive input to a ReLU, learning will happen in that neuron. However, we still find that the following local normalization scheme aids generalization.
• Denoting by a^i_{x,y} the activity of a neuron computed by applying kernel i at position (x, y) and then applying the ReLU nonlinearity, the response-normalized activity b^i_{x,y} is given by the expression

  b^i_{x,y} = a^i_{x,y} / ( k + α Σ_{j = max(0, i−n/2)}^{min(N−1, i+n/2)} (a^j_{x,y})^2 )^β

  where the sum runs over n “adjacent” kernel maps at the same spatial position, N is the total number of kernels in the layer, and the hyper-parameters are k = 2, n = 5, α = 10^-4, β = 0.75.
• Response normalization reduces our top-1 and top-5 error rates by 1.4% and 1.2%, respectively. We also verified the effectiveness of this scheme on the CIFAR-10 dataset: a four-layer CNN achieved a 13% test error rate without normalization and 11% with normalization.
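As a minimal sketch of the formula above (the function name and the (N, C, H, W) tensor layout are my own assumptions; the constants are the paper's reported values), this applies the normalization across the channel dimension:

```python
import torch

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Paper-style LRN over the channel dimension of an (N, C, H, W) tensor.

    b[i] = a[i] / (k + alpha * sum over the n-wide channel window around i of a[j]^2) ** beta
    """
    squared = a ** 2
    num_channels = a.shape[1]
    out = torch.empty_like(a)
    for i in range(num_channels):
        lo = max(0, i - n // 2)
        hi = min(num_channels - 1, i + n // 2)
        denom = (k + alpha * squared[:, lo:hi + 1].sum(dim=1)) ** beta
        out[:, i] = a[:, i] / denom
    return out
```

PyTorch also ships a built-in torch.nn.LocalResponseNorm; note that its parameterization folds the window size into alpha, so the constants do not map one-to-one onto the paper's.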
Slide 13:
3.4 - OVERLAPPING POOLING
• Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map. Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap (e.g., [17, 11, 4]).
• To be more precise, a pooling layer can be thought of as consisting of a grid of pooling units spaced s pixels apart, each summarizing a neighborhood of size z × z centered at the location of the pooling unit. If we set s = z, we obtain traditional local pooling as commonly employed in CNNs. If we set s < z, we obtain overlapping pooling.
• This is what we use throughout our network, with s = 2 and z = 3. This scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme s = 2, z = 2, which produces output of equivalent dimensions (see the sketch below).
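A minimal sketch (my own, not from the slides) contrasting the two pooling settings in PyTorch; on a 55×55 feature map both produce 27×27 outputs, which is what the slide means by "equivalent dimensions":

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)                       # a batch of feature maps

overlapping = nn.MaxPool2d(kernel_size=3, stride=2)  # z = 3, s = 2 (the paper's choice)
traditional = nn.MaxPool2d(kernel_size=2, stride=2)  # z = 2, s = 2 (non-overlapping)

print(overlapping(x).shape)  # torch.Size([1, 96, 27, 27])
print(traditional(x).shape)  # torch.Size([1, 96, 27, 27]) -- same output size
```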
Slide 14:
3.4 - OVERLAPPING POOLING
• We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit.
[Figure: max pooling on a single depth slice with 2×2 filters and stride 2]
Slide 15:
3.4 - OVERLAPPING POOLING
Slide 16:
3.5 - OVERALL ARCHITECTURE
• Now we are ready to describe the overall architecture of our CNN.
• The net contains eight layers with weights; the first five are convolutional and the remaining three are fully connected.
• The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels.
• The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU.
• The kernels of the third convolutional layer are connected to all kernel maps in the second layer.
• The neurons in the fully-connected layers are connected to all neurons in the previous layer.
Slide 17:
3.5 - OVERALL ARCHITECTURE
• The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.
• Response-normalization layers follow the first and second convolutional layers.
• Max-pooling layers, of the kind described in Section 3.4, follow both response-normalization layers as well as the fifth convolutional layer.
Slide 18:
3.5 - OVERALL ARCHITECTURE
• The first convolutional layer filters the 224×224×3 input image with 96 kernels of size 11×11×3 with a stride of 4 pixels.
• The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5×5×48.
• The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers.
• The third convolutional layer has 384 kernels of size 3×3×256 connected to the (normalized, pooled) outputs of the second convolutional layer.
• The fourth convolutional layer has 384 kernels of size 3×3×192, and the fifth convolutional layer has 256 kernels of size 3×3×192. The fully-connected layers have 4096 neurons each. (A single-GPU sketch of this stack follows.)
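The following is a minimal single-GPU sketch of this layer stack in PyTorch (my own reconstruction: it merges the two GPU halves, so the per-tower kernel depths of 48 and 192 become 96 and 384, and the padding values are assumptions chosen to reproduce the usual feature-map sizes):

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),   # conv1
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # overlapping pooling
            nn.Conv2d(96, 256, kernel_size=5, padding=2),            # conv2
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),           # conv3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),           # conv4
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),           # conv5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),                            # fc6
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),                                   # fc7
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),                            # fc8, fed to a 1000-way softmax
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

model = AlexNetSketch()
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```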
Slide 19:
3.5 - OVERALL ARCHITECTURE
[Figure: layer-by-layer diagram of the network, from the 224×224×3 input image through the five convolutional layers (96, 256, 384, 384, and 256 kernels) to the fully-connected layers (2048 neurons per GPU half)]
Slide 20:
4 - REDUCING OVERFITTING
• We describe the two primary ways in which we combat overfitting.

4.1 - Data Augmentation
• We employ two distinct forms of data augmentation.
• The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images and training our network on these extracted patches (see the sketch below).
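A minimal sketch of this crop-and-flip augmentation using torchvision transforms (my own illustration; the paper implements it directly in Python code on the CPU while the GPU trains on the previous batch):

```python
from torchvision import transforms

# Random 224x224 crops from 256x256 training images, plus horizontal reflections.
train_augment = transforms.Compose([
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

# At test time the paper instead averages predictions over ten fixed
# 224x224 patches (four corners + center, and their horizontal reflections).
```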
Slide 21:
4.1 - DATA AUGMENTATION
• Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks.
[Figure: random 224×224 crops and their horizontal flips extracted from a 256×256 image]
Slide 22:
4.1 - DATA AUGMENTATION
• The second form of data augmentation consists of altering the intensities of the RGB channels in training images. Specifically, we perform PCA on the set of RGB pixel values throughout the training set and add multiples of the principal components to each image, with magnitudes proportional to the corresponding eigenvalues times a Gaussian random variable with mean zero and standard deviation 0.1 (see the sketch below).
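A minimal NumPy sketch of this PCA-based color jitter (my own reconstruction; the function and variable names are assumptions):

```python
import numpy as np

def pca_color_augment(image, eigvecs, eigvals, sigma=0.1, rng=np.random):
    """Add a random multiple of the RGB principal components to every pixel.

    image:   float array of shape (H, W, 3)
    eigvecs: (3, 3) matrix whose columns are the principal components of the
             RGB pixel values computed over the whole training set
    eigvals: (3,) corresponding eigenvalues
    """
    alphas = rng.normal(0.0, sigma, size=3)   # drawn once per image
    shift = eigvecs @ (alphas * eigvals)      # (3,) RGB offset
    return image + shift                      # broadcast over all pixels

# Computing the components once over (a sample of) the training pixels:
# pixels = training_images.reshape(-1, 3)        # hypothetical stacked pixel array
# eigvals, eigvecs = np.linalg.eigh(np.cov(pixels, rowvar=False))
```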
Slide 23:
4.2 - DROPOUT
• The recently-introduced technique called “dropout” [10] consists of setting to zero the output of each hidden neuron with probability 0.5.
• So every time an input is presented, the neural network samples a different architecture.
• We use dropout in the first two fully-connected layers of Figure 2.
• Without dropout, our network exhibits substantial overfitting.
• Dropout roughly doubles the number of iterations required to converge.
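As a minimal sketch of the mechanism (my own; it uses the "inverted" dropout convention that scales at training time, whereas the paper instead multiplies the outputs by 0.5 at test time):

```python
import torch

def dropout(x, p=0.5, training=True):
    """Zero each activation with probability p during training.

    Scaling the survivors by 1 / (1 - p) here makes inference a no-op,
    matching in expectation the paper's test-time multiplication by 0.5.
    """
    if not training:
        return x
    mask = (torch.rand_like(x) >= p).float()
    return x * mask / (1.0 - p)

h = torch.randn(4, 4096)           # hidden activations of a fully-connected layer
print(dropout(h).count_nonzero())  # roughly half of the activations survive
```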
Slide 24:
4.2 - DROPOUT
[Figure: a standard neural net compared with the thinned network obtained after applying dropout]
Slide 25:
5 - DETAILS OF LEARNING
• We trained our models using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005.
• We found that this small amount of weight decay was important for the model to learn. In other words, weight decay here is not merely a regularizer: it reduces the model's training error.
• We initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01.
• We initialized the neuron biases in the second, fourth, and fifth convolutional layers, as well as in the fully-connected hidden layers, with the constant 1.
• We initialized the neuron biases in the remaining layers with the constant 0.
• This initialization accelerates the early stages of learning by providing the ReLUs with positive inputs.
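A minimal sketch of these optimizer settings and the initialization scheme in PyTorch (my own illustration; it assumes the AlexNetSketch class from the earlier architecture sketch and simplifies the paper's per-layer bias rule):

```python
import torch
import torch.nn as nn

def init_weights(module):
    # Zero-mean Gaussian weights with std 0.01. The paper sets biases to 1 in
    # conv2, conv4, conv5 and the fully-connected hidden layers and 0 elsewhere;
    # here this is approximated as 1 for Linear layers and 0 for Conv layers.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        nn.init.constant_(module.bias, 1.0 if isinstance(module, nn.Linear) else 0.0)

model = AlexNetSketch()          # from the earlier architecture sketch
model.apply(init_weights)

optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005
)
```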
Slide 26:
5 - DETAILS OF LEARNING
• We used an equal learning rate for all layers, which we adjusted manually throughout training. The heuristic which we followed was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate.
• The learning rate was initialized at 0.01 and reduced three times prior to termination.
• We trained the network for roughly 90 cycles through the training set of 1.2 million images, which took five to six days on two NVIDIA GTX 580 3GB GPUs.
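The divide-by-10 heuristic can be approximated with a plateau scheduler; a minimal sketch (my own, assuming the optimizer defined above and hypothetical train_one_epoch / evaluate helpers):

```python
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Divide the learning rate by 10 whenever the validation error stops improving.
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=2)

for epoch in range(90):                   # roughly 90 passes over the training set
    train_one_epoch(model, optimizer)     # hypothetical training loop
    val_error = evaluate(model)           # hypothetical validation pass
    scheduler.step(val_error)
```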
Slide 27:
6 - THE DATASET
• ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories.
• ILSVRC (the ImageNet Large-Scale Visual Recognition Challenge) uses a subset of ImageNet with roughly 1000 images in each of 1000 categories: 1.2 million training images, 50,000 validation images, and 150,000 testing images.
• On ImageNet, it is customary to report two error rates: top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model.
Slide 28:
6.1 - QUALITATIVE EVALUATIONS
Figure 4: (Left) Eight ILSVRC-2010 test images and the five labels considered most probable by our model. The correct label is written under each image, and the probability assigned to the correct label is also shown with a red bar (if it happens to be in the top 5). (Right) Five ILSVRC-2010 test images in the first column. The remaining columns show the six training images that produce feature vectors in the last hidden layer with the smallest Euclidean distance from the feature vector for the test image.
Slide 29:
7 - RESULTS

Model             | Top-1 | Top-5
Sparse coding [2] | 47.1% | 28.2%
SIFT + FVs [24]   | 45.7% | 25.7%
CNN               | 37.5% | 17.0%

Table 1: Comparison of results on the ILSVRC-2010 test set. In italics are best results achieved by others.
صفحه 30:
Top-5 (test)
26.2%
16.4%
15.3%
7 - RESULTS
Top-5 (val)
18.2%
16.4%
16.6%
15.4%
Model Top-1 (val)
SIFT + FVs [7] —
I CNN 40.7%
5 ۶ 38.1%
1 CNN* 39.0%
7 CNNs* 36.7%
Table 2: Comparison of error rates on ILSVRC-2012 validation and
test sets. In italics are best results achieved by others. Models with an
asterisk* were “pre-trained” to classify the entire ImageNet 2011 Fall
release. See Section 6 for details.
Slide 31:
THANK YOU
Any questions?