Hands-on Machine Learning Guide: Neural Networks, CNNs, and Image Similarity Search

Audience: Beginner who understands ML theory and wants hands-on coding practice.
Main goal: Move from basic neural networks to CNNs for images, and then prepare for an image similarity / visual search project.

1. What You Are Trying to Learn

You already have theory from Andrew Ng-style ML courses and from the attached visual search system document. The practical goal now is to learn how to write, train, test, and improve neural networks using real image datasets.

The learning path is:

Basic ML workflow
    ↓
Basic neural network on image data
    ↓
CNN for images
    ↓
Pretrained CNN / ResNet
    ↓
Image embeddings
    ↓
Similarity search / visual search system

The attached document frames visual search as a ranking problem using representation learning: images are converted into embedding vectors, and visually similar images should be close to one another in embedding space.

2. Mental Model of Machine Learning

Every supervised ML project usually has these components:

Data + Model + Loss Function + Optimizer + Training Loop + Evaluation = Trained ML System

2.1 Data

Data contains examples. For image classification:

Input  = image
Output = label/class

Example:

Input  = handwritten digit image
Output = 7

2.2 Model

A model is a function that maps input to output.

model(image) → prediction

For classification, the output is usually a set of scores, one score per class.

2.3 Loss Function

A loss function measures how wrong the model is.

loss = difference between prediction and true label

For multi-class classification, we commonly use:

nn.CrossEntropyLoss()
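
To make the loss concrete, here is a minimal pure-Python sketch of what cross-entropy computes for a single example: it turns the raw class scores into probabilities with softmax, then takes the negative log of the probability given to the true class. The function name `cross_entropy` and the example scores are illustrative assumptions, not part of PyTorch. nn.CrossEntropyLoss does this calculation per example on tensors and, by default, averages over the batch.

```python
import math

def cross_entropy(scores, true_class):
    """Cross-entropy loss for one example, computed from raw class scores."""
    # Subtract the largest score before exponentiating for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    # Softmax probability assigned to the true class.
    prob_true = exps[true_class] / sum(exps)
    # The loss is small when that probability is close to 1.
    return -math.log(prob_true)

scores = [0.1, 0.2, 5.0]  # made-up scores for a 3-class problem
loss_when_right = cross_entropy(scores, true_class=2)
loss_when_wrong = cross_entropy(scores, true_class=0)
print(loss_when_right, loss_when_wrong)
```

The loss is low when the highest score belongs to the true class and high when the model is confident about the wrong class.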

2.4 Optimizer

The optimizer updates model weights to reduce loss.

Common optimizers:

SGD
Adam
AdamW

For beginner projects, Adam is a good default:

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

2.5 Training Loop

A training loop repeatedly performs:

1. Forward pass
2. Calculate loss
3. Backward pass
4. Update weights
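
The four steps above can be sketched without any library on a toy problem. This is a simplified illustration, not the PyTorch loop used later in this guide: the model has a single weight `w`, the data follow y = 3x, and the learning rate and step count are assumed values chosen for this example. The gradient is written out by hand here, whereas PyTorch computes it for you.

```python
# Fit y = w * x to data generated with w = 3, using plain gradient descent.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w = 0.0      # the single model weight, starting from a poor guess
lr = 0.01    # learning rate (assumed value for this toy problem)
losses = []

for step in range(200):
    # 1. Forward pass: compute predictions with the current weight.
    preds = [w * x for x in xs]
    # 2. Calculate loss: mean squared error between predictions and targets.
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # 3. Backward pass: derivative of the loss with respect to w.
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # 4. Update weights: move w a small step against the gradient.
    w -= lr * grad
    losses.append(loss)

print(f"learned w = {w:.4f}, first loss = {losses[0]:.2f}, last loss = {losses[-1]:.6f}")
```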

3. Tools You Should Use

Recommended stack:

Python
PyTorch
torchvision
matplotlib
scikit-learn, later
FAISS, later for similarity search

Install basic packages:

pip install torch torchvision matplotlib

If using notebooks:

pip install notebook

4. Stage 1: Basic Neural Network on MNIST

4.1 Why MNIST?

MNIST is a classic beginner dataset of handwritten digits.

Each image is:

1 x 28 x 28

Meaning:

1  = grayscale channel
28 = height
28 = width

The label is one of:

0, 1, 2, 3, 4, 5, 6, 7, 8, 9

5. Basic Neural Network Code

5.1 Import Libraries

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

5.2 Prepare the Dataset

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

train_dataset = datasets.MNIST(
    root="./data",
    train=True,
    download=True,
    transform=transform
)

test_dataset = datasets.MNIST(
    root="./data",
    train=False,
    download=True,
    transform=transform
)

train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True
)

test_loader = DataLoader(
    test_dataset,
    batch_size=64,
    shuffle=False
)

Explanation

transforms.ToTensor() converts an image into a PyTorch tensor.

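transforms.Normalize((0.5,), (0.5,)) subtracts a mean of 0.5 and divides by a standard deviation of 0.5, which maps pixel values from the range [0, 1] to [-1, 1]. Inputs centered around zero tend to make training more stable.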
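
As a quick worked example of the arithmetic behind Normalize, the sketch below applies (pixel - mean) / std to a few pixel values. The helper name `normalize` is hypothetical and only mirrors the formula; it is not the torchvision implementation.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Apply the same per-pixel formula that Normalize uses: (x - mean) / std."""
    return (pixel - mean) / std

# ToTensor() produces values in [0, 1]; these are the extremes and the midpoint.
for pixel in (0.0, 0.5, 1.0):
    print(pixel, "->", normalize(pixel))
```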

DataLoader creates mini-batches.

Instead of training on one image at a time, we train on a batch:

Batch size = 64

This means the model sees 64 images at once.

6. Define a Basic Neural Network

class BasicNN(nn.Module):
    def __init__(self):
        super(BasicNN, self).__init__()

        self.flatten = nn.Flatten()

        self.network = nn.Sequential(
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        x = self.network(x)
        return x

6.1 Shape Flow

Input image:

1 x 28 x 28

Flattened:

784

Network flow:

784 → 128 → 64 → 10

The final 10 numbers are class scores.

Example:

score for digit 0
score for digit 1
score for digit 2
...
score for digit 9

The class with the highest score becomes the prediction.
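
A small sketch of picking the prediction from ten class scores, in plain Python. The scores are made up for illustration. In the evaluation code later in this guide, torch.max(outputs, 1) does the same job for a whole batch.

```python
# Made-up scores for digits 0 through 9; index i holds the score for digit i.
scores = [-1.2, 0.3, 0.1, 2.4, -0.5, 0.0, 1.1, 4.7, 0.9, -2.0]

# The predicted class is the index of the largest score.
predicted_digit = max(range(len(scores)), key=lambda i: scores[i])
print(predicted_digit)
```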

7. Create Model, Loss, and Optimizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = BasicNN().to(device)

criterion = nn.CrossEntropyLoss()

optimizer = optim.Adam(model.parameters(), lr=0.001)

7.1 Device

This line chooses GPU if available:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

7.2 Loss Function

Use cross entropy because this is multi-class classification.

criterion = nn.CrossEntropyLoss()

7.3 Optimizer

Adam updates model weights.

optimizer = optim.Adam(model.parameters(), lr=0.001)

8. Training Loop

num_epochs = 5

for epoch in range(num_epochs):
    model.train()

    running_loss = 0.0

    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()

        outputs = model(images)

        loss = criterion(outputs, labels)

        loss.backward()

        optimizer.step()

        running_loss += loss.item()

    avg_loss = running_loss / len(train_loader)

    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")

8.1 Training Loop Breakdown

Step 1: Set model to training mode

model.train()

Step 2: Move data to device

images = images.to(device)
labels = labels.to(device)

Step 3: Clear old gradients

optimizer.zero_grad()

Step 4: Forward pass

outputs = model(images)

Step 5: Calculate loss

loss = criterion(outputs, labels)

Step 6: Backpropagation

loss.backward()

Step 7: Update weights

optimizer.step()

9. Evaluate the Model

model.eval()

correct = 0
total = 0

with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)

        outputs = model(images)

        _, predicted = torch.max(outputs, 1)

        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total

print(f"Test Accuracy: {accuracy:.2f}%")

9.1 Why `torch.no_grad()`?

During testing, we do not update weights. So we do not need gradients.

This saves memory and makes evaluation faster.
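
The effect can be seen in a minimal sketch, assuming PyTorch is installed. The tensor `w` stands in for a model weight. The same computation is tracked for gradients outside the block and not tracked inside it.

```python
import torch

# A stand-in for a model weight that would normally receive gradients.
w = torch.ones(3, requires_grad=True)

# Outside no_grad: PyTorch records the operations so it can backpropagate.
y_tracked = (w * 2).sum()

# Inside no_grad: the same computation is not recorded.
with torch.no_grad():
    y_untracked = (w * 2).sum()

print(y_tracked.requires_grad, y_untracked.requires_grad)
```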

10. What You Should Learn From BasicNN

After running this, you should understand:

how image data is loaded
what a tensor is
how a neural network is defined
how forward pass works
how loss is calculated
how backpropagation computes gradients and the optimizer updates weights
how test accuracy is calculated

But there is one limitation:

BasicNN flattens the image and loses spatial structure.

That is why CNNs are better for images.

11. Stage 2: CNN for Images

11.1 Why CNN?

A normal neural network sees the image as a long list of numbers:

28 x 28 → 784

A CNN keeps the 2D structure:

1 x 28 x 28

This helps the model learn visual patterns such as:

edges
corners
curves
strokes
textures
object parts

This is important because visual search systems often use CNN-based architectures such as ResNet to generate useful image representations.

12. Important CNN Concepts

12.1 Convolution Layer

A convolution layer applies filters to an image.

A filter is a small matrix that slides over the image and detects patterns.

Example:

nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)

Meaning:

Input channels  = 1 grayscale channel
Output channels = 16 learned filters
Kernel size     = 3 x 3 filter
Padding         = 1, so a 3 x 3 kernel keeps height/width the same
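
To see why padding=1 preserves the size, the standard output-size formula for a convolution along one dimension is floor((size + 2 * padding - kernel_size) / stride) + 1. The helper below is a hypothetical name used only to show that arithmetic.

```python
def conv_output_size(size, kernel_size, padding=0, stride=1):
    """Height or width after a convolution, using the standard formula."""
    return (size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(28, kernel_size=3, padding=1))  # size is preserved
print(conv_output_size(28, kernel_size=3, padding=0))  # size shrinks by 2
```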

12.2 ReLU Activation

nn.ReLU()

ReLU adds non-linearity so the model can learn complex patterns.

12.3 Max Pooling

nn.MaxPool2d(kernel_size=2)

Max pooling with kernel_size=2 keeps the largest value in each 2 x 2 window, which halves the height and width.

Example:

28 x 28 → 14 x 14

This reduces computation and keeps important features.
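
Here is a small pure-Python sketch of 2 x 2 max pooling on a 4 x 4 grid, to show what the layer computes. The function name `max_pool_2x2` and the grid values are illustrative assumptions. nn.MaxPool2d applies the same idea to every channel of a tensor.

```python
def max_pool_2x2(grid):
    """Keep the largest value in each non-overlapping 2 x 2 window."""
    pooled = []
    for r in range(0, len(grid) - 1, 2):
        row = []
        for c in range(0, len(grid[0]) - 1, 2):
            window = [grid[r][c], grid[r][c + 1],
                      grid[r + 1][c], grid[r + 1][c + 1]]
            row.append(max(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 6, 1, 1],
]

pooled = max_pool_2x2(feature_map)
print(pooled)
```

The 4 x 4 input becomes a 2 x 2 output, and each output value is the strongest response in its window.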

12.4 Fully Connected Layer

After convolution layers extract features, fully connected layers make final predictions.

13. Simple CNN Model for MNIST

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()

        self.conv_layers = nn.Sequential(
            nn.Conv2d(
                in_channels=1,
                out_channels=16,
                kernel_size=3,
                padding=1
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),

            nn.Conv2d(
                in_channels=16,
                out_channels=32,
                kernel_size=3,
                padding=1
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )

        self.fc_layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        x = self.conv_layers(x)
        x = self.fc_layers(x)
        return x

14. CNN Shape Flow

Input:

1 x 28 x 28

After first convolution:

16 x 28 x 28

After first max pooling:

16 x 14 x 14

After second convolution:

32 x 14 x 14

After second max pooling:

32 x 7 x 7

Flatten:

32 * 7 * 7 = 1568

Final output:

10 class scores
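
One way to check this shape flow yourself is to push a dummy tensor through the layers one at a time and print the shape after each. The sketch below assumes PyTorch is installed and rebuilds the same convolution stack as SimpleCNN so it can run on its own. The all-zero input is a placeholder, not real data.

```python
import torch
import torch.nn as nn

# Same layer settings as SimpleCNN's conv stack, followed by Flatten.
layers = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),
)

x = torch.zeros(1, 1, 28, 28)  # dummy batch containing one MNIST-sized image
shapes = []
with torch.no_grad():
    for layer in layers:
        x = layer(x)
        shapes.append((layer.__class__.__name__, tuple(x.shape)))
        print(shapes[-1])
```

The first number in each printed shape is the batch size, which this guide's shape flow leaves out.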

15. Train the CNN

Use the same training code as BasicNN. Only replace the model:

model = SimpleCNN().to(device)

Keep the same loss and optimizer:

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

Then run the same training loop.

16. Compare BasicNN vs CNN

After training both models, compare:

BasicNN accuracy
CNN accuracy

The CNN should usually reach higher test accuracy because its convolution layers use the 2D layout of the image instead of discarding it.

Comparison Questions

Ask yourself:

Did CNN accuracy improve compared to BasicNN?
Did CNN loss reduce faster?
What happens if epochs increase from 5 to 10?
What happens if batch size changes from 64 to 128?
What happens if learning rate changes from 0.001 to 0.01?

17. Stage 3: CNN on CIFAR-10

After MNIST, move to CIFAR-10.

CIFAR-10 images are:

3 x 32 x 32

Meaning:

3  = RGB channels
32 = height
32 = width

Classes:

airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck

17.1 CIFAR-10 Dataset Code

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

train_dataset = datasets.CIFAR10(
    root="./data",
    train=True,
    download=True,
    transform=transform
)

test_dataset = datasets.CIFAR10(
    root="./data",
    train=False,
    download=True,
    transform=transform
)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

17.2 CNN for CIFAR-10

class CIFAR10CNN(nn.Module):
    def __init__(self):
        super(CIFAR10CNN, self).__init__()

        self.conv_layers = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )

        self.fc_layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        x = self.conv_layers(x)
        x = self.fc_layers(x)
        return x

Shape Flow

Input:

3 x 32 x 32

After pooling 1:

32 x 16 x 16

After pooling 2:

64 x 8 x 8

After pooling 3:

128 x 4 x 4

Flatten:

128 * 4 * 4 = 2048

Output:

10 class scores

18. How This Connects to Image Similarity Search

Your attached visual search document explains that visual search can be framed as a ranking problem. Instead of predicting a class label, the system generates embeddings for images and ranks results based on similarity in embedding space.

Classification flow:

Image → CNN → Class label

Similarity search flow:

Image → CNN → Embedding vector → Similarity search → Ranked similar images

This is the bridge from CNN classification to visual search.

19. What Is an Embedding?

An embedding is a numeric vector representing an image.

Example:

[0.12, -0.45, 0.88, ..., 0.03]

If two images are similar, their embeddings should be close.

This is the key idea behind visual search.

20. From CNN Classifier to Embedding Extractor

A trained CNN has two parts:

Feature extractor + Classifier head

For image similarity, we often remove the final classification layer and use the feature vector before it.

Example concept:

CNN layers → feature vector → classification layer

For similarity search:

CNN layers → feature vector

The feature vector becomes the embedding.
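
A minimal sketch of this split, assuming PyTorch is installed. The class name `TinyEmbedder`, the `embed` method, and the 128-dimensional feature size are hypothetical choices for illustration. The model here is untrained, so the embeddings are not meaningful yet; the point is only how the two outputs are obtained.

```python
import torch
import torch.nn as nn

class TinyEmbedder(nn.Module):
    """Hypothetical small CNN split into a feature extractor and a classifier head."""

    def __init__(self):
        super().__init__()
        # Feature extractor: everything up to the 128-dimensional feature vector.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(16 * 14 * 14, 128),
            nn.ReLU(),
        )
        # Classifier head: maps the feature vector to 10 class scores.
        self.classifier = nn.Linear(128, 10)

    def forward(self, x):
        # Classification path: features followed by the head.
        return self.classifier(self.features(x))

    def embed(self, x):
        # Similarity-search path: stop before the classification layer.
        return self.features(x)

model = TinyEmbedder().eval()
images = torch.zeros(4, 1, 28, 28)  # dummy batch of four MNIST-sized images

with torch.no_grad():
    logits = model(images)
    embeddings = model.embed(images)

print(tuple(logits.shape), tuple(embeddings.shape))
```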

21. Similarity Metrics

Common similarity/distance methods:

21.1 Cosine Similarity

Measures angle between vectors.

import torch

similarity = torch.nn.functional.cosine_similarity(vec1, vec2, dim=0)

Higher cosine similarity means vectors are more similar.

21.2 Dot Product

score = torch.dot(vec1, vec2)

21.3 Euclidean Distance

distance = torch.norm(vec1 - vec2)

Lower distance means vectors are closer.

For high-dimensional embeddings, cosine similarity is often preferred because it compares the direction of the vectors and ignores their length.
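
The three metrics can be compared on a pair of small vectors in plain Python. The vectors below are made up so that one is exactly twice the other: they point in the same direction but have different lengths. The helper names are illustrative and not taken from a library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vec1 = [1.0, 2.0, 3.0]
vec2 = [2.0, 4.0, 6.0]  # same direction as vec1, twice the length

cos = cosine_similarity(vec1, vec2)
dp = dot(vec1, vec2)
dist = euclidean_distance(vec1, vec2)
print(cos, dp, dist)
```

Cosine similarity treats the two vectors as identical in direction, while the dot product and Euclidean distance both change with vector length.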

22. Later Project: Simple Image Similarity Search

Once you complete CNN basics, build this project:

Project Goal

Given a query image, return the top K visually similar images.

Steps

1. Load dataset
2. Train or use pretrained CNN
3. Extract embeddings for all images
4. Store embeddings
5. Select query image
6. Compute similarity between query and all images
7. Return top K most similar images

Basic Pseudocode

query_embedding = get_embedding(query_image)

scores = []

for image_id, embedding in database_embeddings:
    score = cosine_similarity(query_embedding, embedding)
    scores.append((image_id, score))

top_results = sorted(scores, key=lambda x: x[1], reverse=True)[:5]
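
The pseudocode above can be turned into a runnable toy version in plain Python. Everything in this sketch is an assumption made for illustration: the image IDs, the 3-dimensional embeddings (real ones have hundreds of dimensions), and the helper name `top_k`. In the real project the embeddings would come from a CNN.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, database, k):
    """Score every stored embedding against the query and keep the k best."""
    scores = [(image_id, cosine_similarity(query, emb))
              for image_id, emb in database.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up embeddings: "cat" images lean on the first axis, "car" images on the last.
database_embeddings = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "car_1": [0.0, 0.1, 0.9],
    "car_2": [0.1, 0.0, 0.8],
}

query_embedding = [1.0, 0.1, 0.0]  # pretend this came from a new cat image
results = top_k(query_embedding, database_embeddings, k=2)

for image_id, score in results:
    print(image_id, round(score, 4))
```

This brute-force loop compares the query with every stored embedding, which is fine for small datasets and is the part that approximate nearest neighbor libraries replace at larger scale.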

23. When to Use Which Algorithm

23.1 Logistic Regression

Use when:

data is structured/tabular
problem is simple
interpretability matters

Example:

Predict whether customer will churn based on age, income, usage

23.2 Decision Trees / Random Forests

Use when:

tabular data
non-linear relationships
you want strong baseline performance

23.3 Neural Networks

Use when:

data is unstructured
data is large
images, text, audio, video

23.4 CNNs

Use when:

input is image-like
spatial structure matters

Examples:

image classification
object detection
visual search
medical image analysis

23.5 Transformers / Vision Transformers

Use when:

you have large datasets
you want state-of-the-art vision models
you can use pretrained models

23.6 Contrastive Learning

Use when:

you care about similarity
you want embeddings
labels are limited
you want representation learning

Examples:

image similarity
duplicate detection
visual search
face verification

23.7 Approximate Nearest Neighbor Search

Use when:

you have many embeddings
exact search is too slow
you need fast retrieval

Common libraries:

FAISS
Annoy
ScaNN

24. Common Errors and How to Debug

24.1 Shape Mismatch

Error example:

mat1 and mat2 shapes cannot be multiplied

Cause:

Your Linear layer input size does not match flattened tensor size.

Fix:

Print shape inside forward:

def forward(self, x):
    x = self.conv_layers(x)
    print(x.shape)
    x = self.fc_layers(x)
    return x
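
Another option is to measure the flattened size once with a dummy input and use that number for the first Linear layer. This is a sketch, assuming PyTorch is installed. It rebuilds the CIFAR10CNN convolution stack from this guide so it runs on its own, and the variable names are illustrative.

```python
import torch
import torch.nn as nn

# Same conv stack as CIFAR10CNN in this guide.
conv_layers = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

# Pass one dummy CIFAR-10-sized image through and measure the flattened size.
with torch.no_grad():
    dummy = torch.zeros(1, 3, 32, 32)
    flat_size = conv_layers(dummy).flatten(start_dim=1).shape[1]

# Use the measured size so the Linear layer always matches the conv output.
first_fc = nn.Linear(flat_size, 256)
print(flat_size)
```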

24.2 Loss Not Decreasing

Possible reasons:

learning rate too high
learning rate too low
model too small
data not normalized
labels incorrect
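
The effect of the learning rate can be seen on a toy problem in plain Python. This is an illustration under assumed values, not a rule for real networks: the model is a single weight fitted to y = 3x, and the three learning rates were chosen so that one diverges, one converges, and one barely moves within 20 steps.

```python
def final_loss(lr, steps=20):
    """Run gradient descent on y = w * x and return the loss after `steps` updates."""
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [3.0, 6.0, 9.0, 12.0]
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

initial = final_loss(lr=0.0, steps=0)  # loss before any training
too_high = final_loss(lr=0.2)          # overshoots further on every step
good = final_loss(lr=0.01)             # steadily approaches w = 3
too_low = final_loss(lr=0.00001)       # moves, but far too slowly

print(initial, too_high, good, too_low)
```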

24.3 Accuracy Good on Train but Bad on Test

This is overfitting.

Solutions:

use more data
use data augmentation
reduce model size
use dropout
use weight decay
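
Two of these options, dropout and weight decay, can be sketched as follows, assuming PyTorch is installed. The dropout probability of 0.5 and the weight decay of 0.01 are assumed starting values, not tuned recommendations, and the small model is a stand-in for BasicNN.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes activations, but only in training mode
    nn.Linear(128, 10),
)

# AdamW applies weight decay, which discourages large weights.
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

x = torch.randn(8, 1, 28, 28)  # random stand-in for a batch of images

# In eval mode dropout is switched off, so repeated passes give the same output.
model.eval()
with torch.no_grad():
    out_a = model(x)
    out_b = model(x)

print(torch.equal(out_a, out_b))
```

This is why the training loop calls model.train() and the evaluation code calls model.eval(): layers such as dropout behave differently in the two modes.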

24.4 Model Training Too Slowly

Possible fixes:

use GPU
reduce image size
reduce model size
increase batch size carefully

25. Practice Assignments

Assignment 1: BasicNN on MNIST

Run the BasicNN model and record:

Epoch 1 loss:
Epoch 2 loss:
Epoch 3 loss:
Epoch 4 loss:
Epoch 5 loss:
Test accuracy:

Assignment 2: CNN on MNIST

Run SimpleCNN and record:

Epoch 1 loss:
Epoch 2 loss:
Epoch 3 loss:
Epoch 4 loss:
Epoch 5 loss:
Test accuracy:

Assignment 3: Compare Models

Answer:

Which model performed better?
Why?
Did CNN train slower or faster?
Was the improvement worth it?

Assignment 4: Change Hyperparameters

Try:

batch_size = 32
batch_size = 128
learning_rate = 0.01
learning_rate = 0.0001
epochs = 10

Observe what changes.

Assignment 5: Move to CIFAR-10

Train CIFAR10CNN and compare accuracy with MNIST results.

Expected observation:

CIFAR-10 is harder than MNIST.

Why?

color images
more complex objects
background noise
object variation

26. Recommended Learning Schedule

Week 1: PyTorch Basics

Focus:

tensors
datasets
dataloaders
simple neural network
training loop

Deliverable:

BasicNN trained on MNIST

Week 2: CNN Basics

Focus:

convolution
pooling
CNN architecture
CNN training

Deliverable:

SimpleCNN trained on MNIST

Week 3: CIFAR-10

Focus:

RGB images
deeper CNN
model debugging
train/test comparison

Deliverable:

CNN trained on CIFAR-10

Week 4: Pretrained CNN

Focus:

ResNet
transfer learning
feature extraction

Deliverable:

Use pretrained ResNet as feature extractor

Week 5: Image Embeddings

Focus:

remove classification head
extract embeddings
cosine similarity

Deliverable:

Generate embeddings for image dataset

Week 6: Visual Search Mini Project

Focus:

query image
top K retrieval
ranking
visualization

Deliverable:

Mini Pinterest-like image similarity search system

27. Final Big Picture

You are learning two connected ideas:

Classification

Image → Neural Network → Label

Example:

Image of digit → 7

Representation Learning / Similarity Search

Image → Neural Network → Embedding → Similar images

Example:

Image of dog → other visually similar dog images

Your attached visual search document focuses on the second idea. But to understand it properly, you first need hands-on practice with the first idea.

That is why the best order is:

BasicNN → CNN → CIFAR-10 → ResNet → Embeddings → Similarity Search

28. Next Steps

After completing this guide, the next document/notebook should contain:

Complete runnable MNIST BasicNN notebook
Complete runnable MNIST CNN notebook
CIFAR-10 CNN notebook
Pretrained ResNet feature extraction notebook
Image similarity search notebook

Recommended next project:

Build a simple image similarity search engine using CIFAR-10 and cosine similarity.
