MIT 6.S191 – Introduction to Deep Learning (Summary Notes)
Lecture 1: Foundations of Deep Learning
1. What is deep learning? A subset of machine learning that extracts patterns and features from raw data using neural networks.
2. Why deep learning? Hand-engineered features are time-consuming, brittle, and not scalable in practice; deep learning learns the underlying features directly from data.
3. Perceptron: the simplest neural unit. It takes inputs x_i, multiplies each by a weight w_i, sums them (together with a bias), and applies an activation function.
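A minimal sketch of a single perceptron in NumPy; the sigmoid activation and the example weights, bias, and inputs are illustrative assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def perceptron(x, w, b):
        # weighted sum of the inputs plus a bias, passed through the activation
        z = np.dot(w, x) + b
        return sigmoid(z)

    # hypothetical example with 3 inputs
    x = np.array([1.0, 2.0, -1.0])
    w = np.array([0.5, -0.3, 0.8])
    b = 0.1
    print(perceptron(x, w, b))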
4. Common Activation Functions:
   Sigmoid: g(z) = 1 / (1 + e^(-z))
   Hyperbolic tangent: g(z) = (e^z - e^(-z)) / (e^z + e^(-z))
   Rectified Linear Unit (ReLU): g(z) = max(0, z)
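The three activations above as NumPy one-liners (a sketch for reference, not the lecture's code):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))   # squashes z into (0, 1)

    def tanh(z):
        return np.tanh(z)                 # squashes z into (-1, 1)

    def relu(z):
        return np.maximum(0.0, z)         # zero for negative z, identity otherwise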
5. Neural Networks: multiple layers of perceptrons (input, hidden, and output layers).
6. Forward Propagation: y = g(Wx + b), computed layer by layer from input to output.
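A sketch of a forward pass through one hidden layer, assuming ReLU in the hidden layer; the layer sizes and random weights are illustrative:

    import numpy as np

    def forward(x, W1, b1, W2, b2):
        h = np.maximum(0.0, W1 @ x + b1)   # hidden layer: g(Wx + b) with ReLU
        return W2 @ h + b2                 # output layer: another affine transform

    # hypothetical shapes: 4 inputs -> 3 hidden units -> 1 output
    rng = np.random.default_rng(0)
    x = rng.normal(size=4)
    W1 = rng.normal(size=(3, 4)); b1 = np.zeros(3)
    W2 = rng.normal(size=(1, 3)); b2 = np.zeros(1)
    print(forward(x, W1, b1, W2, b2))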
7. Loss Functions: quantify the network's error (e.g., MSE, cross-entropy).
   Empirical loss: measures the total loss over the entire dataset.
   Binary cross-entropy loss: used with models that output a probability between 0 and 1.
   Mean squared error loss: used with regression models that output continuous real numbers.
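Minimal NumPy versions of the two losses; the epsilon clipping in the cross-entropy is an added assumption for numerical safety:

    import numpy as np

    def binary_cross_entropy(y_true, y_pred, eps=1e-12):
        p = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
        return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

    def mean_squared_error(y_true, y_pred):
        return np.mean((y_true - y_pred) ** 2)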
8. Loss Optimization: find the network weights that achieve the lowest loss:
   W* = argmin_W (1/n) Σ_{i=1}^{n} L(f(x_i; W), y_i)
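In code the empirical loss is just the mean of the per-example losses; a sketch assuming some model f(x, W) and per-example loss L:

    import numpy as np

    def empirical_loss(W, xs, ys, f, L):
        # average L(f(x_i; W), y_i) over the whole dataset
        return np.mean([L(f(x, W), y) for x, y in zip(xs, ys)])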
9. Gradient Descent: iteratively updates the weights to minimize the loss.
   Process:
   1) Initialize weights randomly
   2) Loop until convergence:
   3)   Compute the gradient of the loss with respect to the weights
   4)   Update the weights in the opposite direction of the gradient
   5) Return weights
   Gradient descent algorithms (optimizers):
   1) SGD
   2) Adam
   3) Adadelta
   4) Adagrad
   5) RMSProp
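A sketch of the loop above on a tiny one-parameter problem, minimizing the MSE of a linear model; the learning rate, data, and convergence test are illustrative assumptions:

    import numpy as np

    def gradient_descent(xs, ys, lr=0.1, tol=1e-8, max_iters=1000):
        w = np.random.randn()                       # 1) initialize weights randomly
        for _ in range(max_iters):                  # 2) loop until convergence
            grad = np.mean(2 * xs * (xs * w - ys))  # 3) gradient of MSE w.r.t. w
            w_new = w - lr * grad                   # 4) step against the gradient
            if abs(w_new - w) < tol:
                return w_new
            w = w_new
        return w                                    # 5) return weights

    xs = np.array([1.0, 2.0, 3.0])
    ys = 2.0 * xs                                   # the true weight is 2
    print(gradient_descent(xs, ys))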
10. Backpropagation: applies the chain rule to compute the gradient of the loss with respect to each weight, working backward from the output layer.
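A sketch of backpropagation on a tiny network ŷ = w2 · sigmoid(w1 · x) with a squared-error loss, applying the chain rule by hand; the network shape and the example values are illustrative assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_example(x, y, w1, w2):
        # forward pass
        z1 = w1 * x
        h = sigmoid(z1)
        y_hat = w2 * h
        loss = 0.5 * (y_hat - y) ** 2

        # backward pass: chain rule from the loss back to each weight
        dloss_dyhat = y_hat - y
        dh_dz1 = h * (1 - h)                        # derivative of sigmoid
        dloss_dw2 = dloss_dyhat * h                 # dL/dw2 = dL/dy_hat * dy_hat/dw2
        dloss_dw1 = dloss_dyhat * w2 * dh_dz1 * x   # dL/dw1 = dL/dy_hat * dy_hat/dh * dh/dz1 * dz1/dw1
        return loss, dloss_dw1, dloss_dw2

    print(backprop_example(x=1.0, y=1.0, w1=0.5, w2=-0.4))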
11. Learning rate
    1) Setting the learning rate:
       a) Learning rates that are too large overshoot, become unstable, and diverge
       b) Stable learning rates converge smoothly and avoid local minima
    2) Adaptive learning rates: no longer fixed; the rate is adapted as training progresses (e.g., depending on how large the gradient is or how fast learning is happening)
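A tiny illustration of the trade-off on f(w) = w^2, comparing a large and a stable fixed step size; the specific values are assumptions for illustration:

    def descend(lr, steps=10, w=1.0):
        # gradient of f(w) = w^2 is 2w
        for _ in range(steps):
            w = w - lr * 2 * w
        return w

    print(descend(lr=1.2))   # too large: w oscillates and grows, i.e. diverges
    print(descend(lr=0.1))   # stable: w shrinks smoothly toward the minimum at 0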
12. The problem of underfitting and overfitting: an underfit model is too simple to capture the structure of the data, while an overfit model fits the training data (including its noise) so closely that it fails to generalize.
13. Regularization
    A technique that constrains the optimization problem to discourage overly complex models and improve generalization to unseen data. A sketch of both techniques follows this list.
    1) Regularization 1: Dropout
       During training, randomly set some activations to 0 so the network cannot rely on any single node.
    2) Regularization 2: Early stopping
       Stop training before the model has a chance to overfit (e.g., when the validation loss stops improving).
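A sketch of both techniques: an inverted-dropout mask applied during training, and an early-stopping check on validation loss; the drop probability, patience, and the surrounding training loop are illustrative assumptions:

    import numpy as np

    def dropout(activations, p_drop=0.5, training=True):
        # during training, randomly zero activations and rescale the survivors
        if not training:
            return activations
        mask = np.random.rand(*activations.shape) >= p_drop
        return activations * mask / (1.0 - p_drop)

    def train_with_early_stopping(train_step, val_loss_fn, patience=5, max_epochs=100):
        best, waited = float("inf"), 0
        for epoch in range(max_epochs):
            train_step()             # one pass over the training data
            v = val_loss_fn()        # monitor loss on held-out data
            if v < best:
                best, waited = v, 0
            else:
                waited += 1
                if waited >= patience:   # stop before the model starts to overfit
                    break
        return best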