Hands-on Machine Learning Guide: Neural Networks, CNNs, and Image Similarity Search
Audience: Beginner who understands ML theory and wants hands-on coding practice.
Main goal: Move from basic neural networks to CNNs for images, and then prepare for an image similarity / visual search project.
1. What You Are Trying to Learn
You already have theory from Andrew Ng-style ML courses and from the attached visual search system document. The practical goal now is to learn how to write, train, test, and improve neural networks using real image datasets.
The learning path is:
Basic ML workflow
↓
Basic neural network on image data
↓
CNN for images
↓
Pretrained CNN / ResNet
↓
Image embeddings
↓
Similarity search / visual search system
The attached document frames visual search as a ranking problem using representation learning: images are converted into embedding vectors, and visually similar images should be close to one another in embedding space.
2. Mental Model of Machine Learning
Every supervised ML project usually has these components:
2.1 Data
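Data contains examples. For image classification, each example is an input image paired with its correct label.
Example: an image of a handwritten 7, paired with the label 7.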
2.2 Model
A model is a function that maps input to output.
For classification, the output is usually a set of scores, one score per class.
2.3 Loss Function
A loss function measures how wrong the model is.
For multi-class classification, we commonly use cross-entropy loss (nn.CrossEntropyLoss in PyTorch).
2.4 Optimizer
The optimizer updates model weights to reduce loss.
Common optimizers:
- SGD
- Adam
- AdamW
For beginner projects, Adam with a learning rate of 0.001 is a good default; it is the optimizer used in Section 7.
2.5 Training Loop
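A training loop repeatedly performs these steps for each batch:
- forward pass: compute the model's outputs
- loss: measure how wrong the outputs are
- backpropagation: compute the gradients of the loss
- optimizer step: update the weights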
3. Tools You Should Use
Recommended stack:
- Python
- PyTorch
- torchvision
- matplotlib
- scikit-learn, later
- FAISS, later for similarity search
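Install basic packages:
pip install torch torchvision matplotlib
If using notebooks, run the same command in a cell with a leading exclamation mark:
!pip install torch torchvision matplotlib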
4. Stage 1: Basic Neural Network on MNIST
4.1 Why MNIST?
MNIST is a classic beginner dataset of handwritten digits.
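Each image is a tensor of shape 1 x 28 x 28.
Meaning: 1 grayscale channel, 28 pixels high, 28 pixels wide.
The label is one of the ten digits 0 through 9.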
5. Basic Neural Network Code
5.1 Import Libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
5.2 Prepare the Dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
train_dataset = datasets.MNIST(
    root="./data",
    train=True,
    download=True,
    transform=transform
)
test_dataset = datasets.MNIST(
    root="./data",
    train=False,
    download=True,
    transform=transform
)
train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True
)
test_loader = DataLoader(
    test_dataset,
    batch_size=64,
    shuffle=False
)
Explanation
transforms.ToTensor() converts an image into a PyTorch tensor.
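transforms.Normalize((0.5,), (0.5,)) subtracts a mean of 0.5 and divides by a standard deviation of 0.5, which maps pixel values from [0, 1] to [-1, 1]. Inputs centered around zero usually make training more stable.
DataLoader creates mini-batches.
Instead of training on one image at a time, we train on a batch of shape [64, 1, 28, 28].
This means the model sees 64 images at once.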
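The normalization and batch shape described above can be checked with a short sketch. It assumes nothing beyond torch itself: random pixels stand in for a real MNIST batch so that nothing has to be downloaded, and the name fake_batch is purely illustrative.

```python
import torch

# Synthetic stand-in for one MNIST batch after ToTensor(): values in [0, 1).
fake_batch = torch.rand(64, 1, 28, 28)

# Normalize((0.5,), (0.5,)) computes (x - mean) / std for each channel.
mean, std = 0.5, 0.5
normalized = (fake_batch - mean) / std

print(tuple(fake_batch.shape))            # (64, 1, 28, 28)
print(normalized.min().item() >= -1.0)    # True
print(normalized.max().item() <= 1.0)     # True
```

With the real train_loader, the same shape should appear when you print images.shape inside the training loop.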
6. Define a Basic Neural Network
class BasicNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.network = nn.Sequential(
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        x = self.network(x)
        return x
6.1 Shape Flow
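Input image: [batch, 1, 28, 28]
Flattened: [batch, 784]
Network flow: 784 → 128 → 64 → 10
The final 10 numbers are class scores (logits), one per digit.
The class with the highest score becomes the prediction.
Example: if the score at index 3 is the largest of the ten, the model predicts the digit 3.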
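To illustrate how scores turn into a prediction, here is a small sketch. The score values are hypothetical numbers chosen for the illustration, not real model output.

```python
import torch

# Hypothetical scores (logits) for one image, one value per digit 0-9.
scores = torch.tensor([[-1.2, 0.3, 0.1, 4.5, -0.7, 0.9, -2.0, 1.1, 0.0, 0.4]])

predicted = torch.argmax(scores, dim=1)   # index of the largest score
probs = torch.softmax(scores, dim=1)      # optional: scores as probabilities

print(predicted.item())  # 3
```

Softmax is only needed if you want probabilities; the predicted class is the same with or without it.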
7. Create Model, Loss, and Optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BasicNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
7.1 Device
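This line chooses the GPU if one is available and falls back to the CPU otherwise:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")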
7.2 Loss Function
Use cross entropy because this is multi-class classification.
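As a rough illustration of what cross entropy computes, the sketch below compares nn.CrossEntropyLoss with the same value worked out by hand. The three-class logits are hypothetical and chosen only to keep the example small. Note that nn.CrossEntropyLoss expects raw scores, which is why BasicNN does not apply softmax in its forward method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # hypothetical scores for 3 classes
target = torch.tensor([0])                  # index of the correct class

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, target)

# Same value by hand: negative log of the softmax probability
# that the model assigns to the correct class.
manual = -F.log_softmax(logits, dim=1)[0, target.item()]

print(torch.allclose(loss, manual))  # True
```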
7.3 Optimizer
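Adam updates the model weights using the gradients computed during backpropagation. The learning rate (lr=0.001) controls the size of each update.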
8. Training Loop
num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    avg_loss = running_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")
8.1 Training Loop Breakdown
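Step 1: Set model to training mode with model.train()
Step 2: Move data to device with images.to(device) and labels.to(device)
Step 3: Clear old gradients with optimizer.zero_grad()
Step 4: Forward pass: outputs = model(images)
Step 5: Calculate loss: loss = criterion(outputs, labels)
Step 6: Backpropagation: loss.backward()
Step 7: Update weights: optimizer.step()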
9. Evaluate the Model
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
accuracy = 100 * correct / total
print(f"Test Accuracy: {accuracy:.2f}%")
9.1 Why torch.no_grad()?
During testing, we do not update weights. So we do not need gradients.
This saves memory and makes evaluation faster.
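The effect of torch.no_grad() can be seen on a tiny layer. This is only a sketch: the nn.Linear(4, 2) layer is a stand-in for a full model.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)          # tiny stand-in for a model
x = torch.randn(3, 4)

out_train = layer(x)             # the computation graph is recorded
with torch.no_grad():
    out_eval = layer(x)          # no graph is recorded

print(out_train.requires_grad)   # True
print(out_eval.requires_grad)    # False
```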
10. What You Should Learn From BasicNN
After running this, you should understand:
- how image data is loaded
- what a tensor is
- how a neural network is defined
- how forward pass works
- how loss is calculated
- how backpropagation updates weights
- how test accuracy is calculated
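But there is one limitation: BasicNN flattens each image into a list of 784 numbers, so it ignores which pixels are next to each other.
That is why CNNs are better for images.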
11. Stage 2: CNN for Images
11.1 Why CNN?
A normal neural network sees the image as a long list of numbers: a flat vector of 784 values.
A CNN keeps the 2D structure: a 1 x 28 x 28 grid of pixels.
This helps the model learn visual patterns such as:
- edges
- corners
- curves
- strokes
- textures
- object parts
This is important because visual search systems often use CNN-based architectures such as ResNet to generate useful image representations.
12. Important CNN Concepts
12.1 Convolution Layer
A convolution layer applies filters to an image.
A filter is a small matrix that slides over the image and detects patterns.
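Example:
nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
Meaning:
- in_channels=1: one grayscale input channel
- out_channels=16: 16 learned filters, each producing one output channel
- kernel_size=3: each filter is 3 x 3
- padding=1: one pixel of zero padding on each side, which keeps the height and width unchanged for a 3 x 3 kernel with stride 1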
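The claim that padding keeps the height and width the same follows from the standard output-size formula for a convolution. A minimal sketch, assuming no dilation (the helper name conv_output_size is made up for this example):

```python
def conv_output_size(size, kernel_size, padding=0, stride=1):
    """Output height or width of a convolution (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(28, kernel_size=3, padding=1))  # 28
print(conv_output_size(28, kernel_size=3, padding=0))  # 26
```

Without padding, every 3 x 3 convolution would shrink the feature map by two pixels in each dimension.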
12.2 ReLU Activation
ReLU adds non-linearity so the model can learn complex patterns.
12.3 Max Pooling
Max pooling reduces spatial size.
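Example: nn.MaxPool2d(kernel_size=2) keeps the largest value in each 2 x 2 block, so a 28 x 28 feature map becomes 14 x 14.
This reduces computation and keeps the strongest activations.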
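A small worked example may help here. The 4 x 4 input values are arbitrary numbers picked so that the result is easy to verify by eye.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)

# One 4x4 feature map, shaped [batch, channels, height, width].
x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [9., 8., 1., 0.],
                    [7., 6., 3., 2.]]]])

y = pool(x)
print(tuple(y.shape))     # (1, 1, 2, 2)
print(y[0, 0].tolist())   # [[4.0, 8.0], [9.0, 3.0]]
```

Each output value is the maximum of one non-overlapping 2 x 2 block of the input.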
12.4 Fully Connected Layer
After convolution layers extract features, fully connected layers make final predictions.
13. Simple CNN Model for MNIST
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_layers = nn.Sequential(
            nn.Conv2d(
                in_channels=1,
                out_channels=16,
                kernel_size=3,
                padding=1
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(
                in_channels=16,
                out_channels=32,
                kernel_size=3,
                padding=1
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.fc_layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        x = self.conv_layers(x)
        x = self.fc_layers(x)
        return x
14. CNN Shape Flow
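Input: [batch, 1, 28, 28]
After first convolution: [batch, 16, 28, 28]
After first max pooling: [batch, 16, 14, 14]
After second convolution: [batch, 32, 14, 14]
After second max pooling: [batch, 32, 7, 7]
Flatten: [batch, 32 * 7 * 7] = [batch, 1568]
Final output: [batch, 10]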
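One way to confirm a shape flow like this is to push a dummy tensor through the layers and print the shape after each one. The sketch below lists the same layers as SimpleCNN in a flat Python list; the random input is a stand-in for a real MNIST batch.

```python
import torch
import torch.nn as nn

# Same layers as SimpleCNN, listed flat so each output shape can be printed.
layers = [
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
]

x = torch.randn(64, 1, 28, 28)   # random stand-in for one MNIST batch
shapes = []
with torch.no_grad():
    for layer in layers:
        x = layer(x)
        shapes.append(tuple(x.shape))
        print(f"{layer.__class__.__name__:<10} -> {tuple(x.shape)}")
```

The same trick works for any new architecture: if a Linear layer's input size is wrong, the loop fails at exactly that layer.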
15. Train the CNN
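Use the same training code as BasicNN. Only replace the model:
model = SimpleCNN().to(device)
Keep the same loss and optimizer, but recreate the optimizer so that it tracks the new model's parameters:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
Then run the same training loop.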
16. Compare BasicNN vs CNN
After training both models, compare their final training loss and test accuracy.
You should usually see the CNN perform better because its convolution layers exploit the 2D structure of the image.
Comparison Questions
Ask yourself:
- Did CNN accuracy improve compared to BasicNN?
- Did CNN loss reduce faster?
- What happens if epochs increase from 5 to 10?
- What happens if batch size changes from 64 to 128?
- What happens if learning rate changes from 0.001 to 0.01?
17. Stage 3: CNN on CIFAR-10
After MNIST, move to CIFAR-10.
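CIFAR-10 images have shape 3 x 32 x 32.
Meaning: 3 color channels (RGB), 32 pixels high, 32 pixels wide.
Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.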
17.1 CIFAR-10 Dataset Code
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
train_dataset = datasets.CIFAR10(
    root="./data",
    train=True,
    download=True,
    transform=transform
)
test_dataset = datasets.CIFAR10(
    root="./data",
    train=False,
    download=True,
    transform=transform
)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
17.2 CNN for CIFAR-10
class CIFAR10CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_layers = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.fc_layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        x = self.conv_layers(x)
        x = self.fc_layers(x)
        return x
Shape Flow
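Input: [batch, 3, 32, 32]
After pooling 1: [batch, 32, 16, 16]
After pooling 2: [batch, 64, 8, 8]
After pooling 3: [batch, 128, 4, 4]
Flatten: [batch, 128 * 4 * 4] = [batch, 2048]
Output: [batch, 10]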
18. How This Connects to Image Similarity Search
Your attached visual search document explains that visual search can be framed as a ranking problem. Instead of predicting a class label, the system generates embeddings for images and ranks results based on similarity in embedding space.
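Classification flow: image → CNN → class scores → predicted label
Similarity search flow: image → CNN → embedding → nearest embeddings → ranked results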
This is the bridge from CNN classification to visual search.
19. What Is an Embedding?
An embedding is a numeric vector representing an image.
Example: a CNN might map each image to a vector of 128 floating-point numbers.
If two images are similar, their embeddings should be close.
This is the key idea behind visual search.
20. From CNN Classifier to Embedding Extractor
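A trained CNN has two parts: a feature extractor (the convolution layers and hidden layers) and a classification head (the final layer that outputs class scores).
For image similarity, we often remove the final classification layer and use the feature vector before it.
Example concept, for classification: image → feature extractor → feature vector → classification layer → class scores
For similarity search: image → feature extractor → feature vector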
The feature vector becomes the embedding.
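A minimal sketch of this idea, assuming a network laid out like SimpleCNN. The weights here are untrained, so the embeddings are not yet meaningful; in practice you would load trained weights first. The helper name get_embedding is hypothetical.

```python
import torch
import torch.nn as nn

# Stand-in network with the same layout as SimpleCNN (untrained here).
conv_layers = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)
fc_layers = nn.Sequential(
    nn.Flatten(), nn.Linear(32 * 7 * 7, 128), nn.ReLU(), nn.Linear(128, 10),
)

# Drop the last Linear layer: what remains outputs the 128-d feature vector.
embedding_head = nn.Sequential(*list(fc_layers.children())[:-1])

def get_embedding(images):
    """Hypothetical helper: images [N, 1, 28, 28] -> embeddings [N, 128]."""
    with torch.no_grad():
        return embedding_head(conv_layers(images))

emb = get_embedding(torch.randn(5, 1, 28, 28))
print(tuple(emb.shape))  # (5, 128)
```

Because embedding_head reuses the layer objects from fc_layers, it shares their weights rather than copying them.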
21. Similarity Metrics
Common similarity/distance methods:
21.1 Cosine Similarity
Measures angle between vectors.
Higher cosine similarity means vectors are more similar.
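21.2 Dot Product
Multiplies matching components and sums the results. It depends on vector length as well as direction; for L2-normalized vectors it equals cosine similarity.
21.3 Euclidean Distance
Measures the straight-line distance between two vectors.
Lower distance means vectors are closer.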
For high-dimensional embeddings, cosine similarity is often preferred.
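The three metrics can be written in a few lines of plain Python. This is an illustrative sketch using tiny 2-D vectors; real embeddings have many more dimensions and would normally be handled with torch or NumPy.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 0.0]
b = [2.0, 0.0]   # same direction as a, twice as long
c = [0.0, 1.0]   # perpendicular to a

print(cosine_similarity(a, b))   # 1.0
print(cosine_similarity(a, c))   # 0.0
print(dot(a, b))                 # 2.0
print(euclidean_distance(a, b))  # 1.0
```

Vectors a and b point the same way, so cosine similarity treats them as identical even though their Euclidean distance is not zero. That insensitivity to length is one reason cosine similarity is popular for embeddings.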
22. Later Project: Simple Image Similarity Search
Once you complete CNN basics, build this project:
Project Goal
Given a query image, return the top K visually similar images.
Steps
1. Load dataset
2. Train or use pretrained CNN
3. Extract embeddings for all images
4. Store embeddings
5. Select query image
6. Compute similarity between query and all images
7. Return top K most similar images
Basic Pseudocode
query_embedding = get_embedding(query_image)
scores = []
for image_id, embedding in database_embeddings:
    score = cosine_similarity(query_embedding, embedding)
    scores.append((image_id, score))
top_results = sorted(scores, key=lambda x: x[1], reverse=True)[:5]
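The loop above can also be written as one matrix operation. The sketch below is an assumption-laden illustration: random vectors stand in for real image embeddings, and the query is built as a slightly noisy copy of item 42 so the expected top result is known in advance.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
database = torch.randn(100, 128)                  # stand-ins for 100 image embeddings
query = database[42] + 0.01 * torch.randn(128)    # near-copy of item 42

# After L2 normalization, a dot product equals cosine similarity.
database_n = F.normalize(database, dim=1)
query_n = F.normalize(query, dim=0)

scores = database_n @ query_n                     # one score per database item
top_scores, top_ids = torch.topk(scores, k=5)

print(top_ids[0].item())  # 42
```

This is still an exact search over every item. Approximate nearest neighbor libraries become useful only when the database grows too large for that.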
23. When to Use Which Algorithm
23.1 Logistic Regression
Use when:
- data is structured/tabular
- problem is simple
- interpretability matters
Example: predicting whether an email is spam from a small set of numeric features.
23.2 Decision Trees / Random Forests
Use when:
- tabular data
- non-linear relationships
- you want strong baseline performance
23.3 Neural Networks
Use when:
- data is unstructured
- data is large
- images, text, audio, video
23.4 CNNs
Use when:
- input is image-like
- spatial structure matters
Examples:
- image classification
- object detection
- visual search
- medical image analysis
23.5 Transformers / Vision Transformers
Use when:
- you have large datasets
- you want state-of-the-art vision models
- you can use pretrained models
23.6 Contrastive Learning
Use when:
- you care about similarity
- you want embeddings
- labels are limited
- you want representation learning
Examples:
- image similarity
- duplicate detection
- visual search
- face verification
23.7 Approximate Nearest Neighbor Search
Use when:
- you have many embeddings
- exact search is too slow
- you need fast retrieval
Common libraries:
- FAISS
- Annoy
- ScaNN
24. Common Errors and How to Debug
24.1 Shape Mismatch
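Error example: a RuntimeError with a message similar to "mat1 and mat2 shapes cannot be multiplied (64x1568 and 784x128)".
Cause: the input size of an nn.Linear layer does not match the size of the flattened tensor it receives.
Fix: set the layer's input size to the actual flattened size (here 32 * 7 * 7 = 1568).
Print shape inside forward to find the actual size, for example print(x.shape) after each layer.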
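The following sketch reproduces a shape mismatch on purpose and then fixes it. The tensor named features is a random stand-in for the output of SimpleCNN's convolution layers; the exact wording of the error message may differ between PyTorch versions.

```python
import torch
import torch.nn as nn

features = torch.randn(64, 32, 7, 7)   # stand-in for conv output on MNIST

# Wrong: this Linear layer expects 784 inputs, but flattening gives 1568.
wrong_head = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
try:
    wrong_head(features)
    error_message = None
except RuntimeError as err:
    error_message = str(err)
    print("RuntimeError:", error_message)

# Check the real flattened size, then size the Linear layer to match.
flat = nn.Flatten()(features)
print(tuple(flat.shape))                # (64, 1568)
right_head = nn.Sequential(nn.Flatten(), nn.Linear(flat.shape[1], 10))
print(tuple(right_head(features).shape))  # (64, 10)
```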
24.2 Loss Not Decreasing
Possible reasons:
- learning rate too high
- learning rate too low
- model too small
- data not normalized
- labels incorrect
24.3 Accuracy Good on Train but Bad on Test
This is overfitting.
Solutions:
- use more data
- use data augmentation
- reduce model size
- use dropout
- use weight decay
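Two of the solutions above, dropout and weight decay, take only a line each in PyTorch. The sketch below shows where they go, using a classifier head in the style of SimpleCNN. The values p=0.5 and weight_decay=1e-2 are illustrative, not tuned.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Classifier head in the style of SimpleCNN, with dropout added.
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),          # randomly zeroes activations during training
    nn.Linear(128, 10),
)

# AdamW applies weight decay to the parameters it updates.
optimizer = optim.AdamW(head.parameters(), lr=0.001, weight_decay=1e-2)

x = torch.randn(8, 32, 7, 7)
head.eval()                     # dropout is disabled in eval mode
with torch.no_grad():
    same = torch.equal(head(x), head(x))
print(same)  # True
```

This is also why the evaluation code calls model.eval(): with dropout active, the same input would give different outputs on each pass.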
24.4 Model Training Too Slowly
Possible fixes:
- use GPU
- reduce image size
- reduce model size
- increase batch size carefully
25. Practice Assignments
Assignment 1: BasicNN on MNIST
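Run the BasicNN model and record the final training loss and the test accuracy.
Assignment 2: CNN on MNIST
Run SimpleCNN and record the final training loss and the test accuracy.
Assignment 3: Compare Models
Answer: which model reached the higher test accuracy, and which one reduced its loss faster?
Assignment 4: Change Hyperparameters
Try:
- epochs: 5 → 10
- batch size: 64 → 128
- learning rate: 0.001 → 0.01
Observe what changes.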
Assignment 5: Move to CIFAR-10
Train CIFAR10CNN and compare accuracy with MNIST results.
Expected observation: CIFAR-10 test accuracy will be noticeably lower than MNIST test accuracy.
Why?
- color images
- more complex objects
- background noise
- object variation
26. Recommended Learning Schedule
Week 1: PyTorch Basics
Focus:
- tensors
- datasets
- dataloaders
- simple neural network
- training loop
Deliverable: a BasicNN trained and evaluated on MNIST.
Week 2: CNN Basics
Focus:
- convolution
- pooling
- CNN architecture
- CNN training
Deliverable: a SimpleCNN trained on MNIST and compared against BasicNN.
Week 3: CIFAR-10
Focus:
- RGB images
- deeper CNN
- model debugging
- train/test comparison
Deliverable: a CIFAR10CNN trained and evaluated on CIFAR-10.
Week 4: Pretrained CNN
Focus:
- ResNet
- transfer learning
- feature extraction
Deliverable: a notebook that extracts features from a pretrained ResNet.
Week 5: Image Embeddings
Focus:
- remove classification head
- extract embeddings
- cosine similarity
Deliverable: embeddings for a set of images, compared with cosine similarity.
Week 6: Visual Search Mini Project
Focus:
- query image
- top K retrieval
- ranking
- visualization
Deliverable: a mini visual search system that returns the top K similar images for a query image.
27. Final Big Picture
You are learning two connected ideas:
Classification
Example: image → CNN → class label
Representation Learning / Similarity Search
Example: query image → CNN → embedding → top K similar images
Your attached visual search document focuses on the second idea. But to understand it properly, you first need hands-on practice with the first idea.
That is why the best order is: basic neural network → CNN → pretrained CNN → image embeddings → similarity search.
28. Next Steps
After completing this guide, the next document/notebook should contain:
- Complete runnable MNIST BasicNN notebook
- Complete runnable MNIST CNN notebook
- CIFAR-10 CNN notebook
- Pretrained ResNet feature extraction notebook
- Image similarity search notebook
Recommended next project: the simple image similarity search project outlined in Section 22.