Learning Convolutions by Inventing Computer Vision
Philosophy: You won't be taught — you will discover. Every step builds on the last.
You will build everything from scratch: pixel by pixel, loop by loop.
What you'll need
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import requests
from io import BytesIO
Run the setup cell below.
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import requests
from io import BytesIO
%matplotlib inline
print("All imports successful!")
# IMAGE LOADER HELPER — run this cell as-is
# This gives you two PIL Image objects to work with throughout the notebook.
def load_color_image():
    """Downloads a small test image and returns it as an RGB PIL Image.
    Note: the file name (Bikesgray.jpg) suggests the source is grayscale; if so,
    R, G and B will be equal at every pixel even after conversion to RGB.
    Falls back to a synthetic color gradient if the download fails."""
    try:
        url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Bikesgray.jpg/320px-Bikesgray.jpg"
        response = requests.get(url, timeout=5)
        response.raise_for_status()  # treat HTTP error codes as a failed download
        img = Image.open(BytesIO(response.content)).convert('RGB')
        return img
    except Exception:
        # Synthetic fallback: colorful gradient image
        arr = np.zeros((100, 100, 3), dtype=np.uint8)
        for r in range(100):
            for c in range(100):
                arr[r, c] = [r * 2, c * 2, 128]
        return Image.fromarray(arr)

def load_gray_image():
    """Downloads a small grayscale image.
    Falls back to a synthetic image if the download fails."""
    try:
        url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Bikesgray.jpg/320px-Bikesgray.jpg"
        response = requests.get(url, timeout=5)
        response.raise_for_status()  # treat HTTP error codes as a failed download
        img = Image.open(BytesIO(response.content)).convert('L')
        return img
    except Exception:
        # Synthetic fallback: diagonal brightness gradient
        arr = np.zeros((100, 100), dtype=np.uint8)
        for r in range(100):
            for c in range(100):
                arr[r, c] = (r + c) % 256
        return Image.fromarray(arr)
color_img = load_color_image()
gray_img = load_gray_image()
print("Images loaded!")
print(f"Color image mode : {color_img.mode}")
print(f"Gray image mode : {gray_img.mode}")
# DISPLAY HELPER — run this cell as-is
def show_images(images, titles=None, cmap_list=None, figsize=None):
    """
    Displays a list of images side by side.
    Parameters:
        images    : list of numpy arrays or PIL Images
        titles    : list of title strings
        cmap_list : list of colormaps (e.g. ['gray', None, 'hot'])
                    use None for color images, 'gray' for grayscale
        figsize   : optional (width, height) tuple
    """
    n = len(images)
    if figsize is None:
        figsize = (5 * n, 4)
    if titles is None:
        titles = [f'Image {i+1}' for i in range(n)]
    if cmap_list is None:
        cmap_list = [None] * n
    fig, axes = plt.subplots(1, n, figsize=figsize)
    if n == 1:
        axes = [axes]
    for ax, img, title, cmap in zip(axes, images, titles, cmap_list):
        if isinstance(img, Image.Image):
            img = np.array(img)
        ax.imshow(img, cmap=cmap)
        ax.set_title(title)
        ax.axis('off')
    plt.tight_layout()
    plt.show()
print("show_images() helper ready.")
Tasks:
- Use show_images to display both the color and grayscale images.
  - For the grayscale image, pass cmap_list=['gray'].
- What do you notice visually? How do they differ?
# YOUR CODE HERE
# Display the color image
show_images([color_img], titles=['Color Image'])
# Display the grayscale image
# show_images([gray_img], titles=['Grayscale Image'], cmap_list=...)
Exercise 1.2 — Peek Inside: Images as Numbers
An image is just a grid of numbers. Let's prove it.
A PIL Image can be converted to a NumPy array using np.array(img).
Tasks:
- Convert both images to NumPy arrays. Store them as color_arr and gray_arr.
- Print color_arr.shape and gray_arr.shape. What are the dimensions?
- Print color_arr.dtype and gray_arr.dtype. What is the data type of each pixel?
- What is the range of values? Print color_arr.min() and color_arr.max().
- Print just the first 5x5 block of the grayscale array. What do the numbers represent?
- Print color_arr[0, 0]. What does this value represent? What about color_arr[0, 0, 0]?
Write down your answers as comments in the code cell, and in the markdown cell below.
# YOUR CODE HERE
color_arr = np.array(color_img)
gray_arr = np.array(gray_img)
print("--- Color Array ---")
print("Shape:", color_arr.shape) # What are these 3 numbers?
print("Dtype:", color_arr.dtype)
print("Min:", color_arr.min(), " Max:", color_arr.max())
print("Pixel at [0,0]:", color_arr[0, 0]) # What is this?
print("Red value at [0,0]:", color_arr[0, 0, 0]) # And this?
print("\n--- Grayscale Array ---")
print("Shape:", gray_arr.shape)
print("Dtype:", gray_arr.dtype)
print("\nFirst 5x5 block of grayscale:")
print(gray_arr[:5, :5])
Your observations (fill in):
- A color image has shape (H, W, ?). The third dimension is... because...
- A grayscale image has shape (H, W). There is no third dimension because...
- The numbers in the array range from ___ to ___. This makes sense because...
- color_arr[0, 0] gives me [R, G, B]. This means the pixel at row 0, column 0 has Red=___, Green=___, Blue=___.
Exercise 1.3 — Navigate the Grid
Now that you know the structure, explore the image as a grid.
Tasks:
- How many rows does the color image have? How many columns?
- What is the pixel at the very center of the color image? (Compute the center row and column from the shape.)
- What is the pixel at the bottom-right corner?
- Print all pixel values in the first row of the grayscale image. How many values are there?
- Without using any NumPy operations — using only Python loops — compute the average brightness of the grayscale image. (Average brightness = average of all pixel values.)
Important rule for this notebook: When asked to avoid NumPy, that means no np.mean(), no np.sum(), and no array slicing magic. Use Python for loops, range(), and plain arithmetic. You can still access array elements by index, like arr[r, c].
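To make the rule concrete, here is a small sketch of the loop-only style. The array `tiny` is a hypothetical toy example, not one of the notebook's images, and the `int(...)` cast is a precaution: depending on the NumPy version, adding a `uint8` element to a running total can keep the result in `uint8` and wrap around.

```python
import numpy as np

# Hypothetical toy array, used only to illustrate the loop-only style.
tiny = np.array([[10, 20, 30],
                 [40, 50, 60]], dtype=np.uint8)

total = 0
count = 0
for r in range(tiny.shape[0]):
    for c in range(tiny.shape[1]):
        total += int(tiny[r, c])  # int() keeps the running sum a Python int
        count += 1

tiny_mean = total / count
print(tiny_mean)
```

The same pattern (index access, a running total, a counter) should carry over to the full image.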
# YOUR CODE HERE
# 1. Image dimensions
height = color_arr.shape[0]
width = color_arr.shape[1]
print(f"Image size: {height} rows x {width} columns")
# 2. Center pixel
# center_row = ???
# center_col = ???
# print(color_arr[center_row, center_col])
# 3. Bottom-right corner pixel
# print(color_arr[???, ???])
# 4. First row of grayscale
# print(gray_arr[0, :])
# 5. Average brightness using ONLY Python loops
total = 0
count = 0
# YOUR LOOP HERE
# for r in range(...):
# for c in range(...):
# ...
avg_brightness = total / count if count > 0 else 0
print(f"Average brightness (manual loop): {avg_brightness:.2f}")
print(f"Verify with numpy: {gray_arr.mean():.2f}")
Exercise 1.4 — Visualize a Tiny Slice
Let's look at a small patch of the image zoomed in, so we can literally see the pixels.
Tasks:
- Extract the top-left 20×20 patch of the grayscale image.
- Display it using show_images with cmap_list=['gray'].
- Do you notice anything? Each square in the display is one number from the array.
- Extract the same 20×20 patch from the color image and display it.
- Now extract a 20×20 patch from the middle of each image and display them side by side.
# YOUR CODE HERE
# Top-left 20x20 patch
gray_patch = gray_arr[:20, :20]
show_images([gray_patch], titles=['Grayscale top-left 20x20'], cmap_list=['gray'])
# Color patch
# color_patch = color_arr[:20, :20]
# show_images(...)
# Middle patch — compute the middle coordinates first!
# mid_r = ???
# mid_c = ???
# gray_mid = gray_arr[mid_r-10:mid_r+10, mid_c-10:mid_c+10]
# color_mid = color_arr[mid_r-10:mid_r+10, mid_c-10:mid_c+10]
# show_images([gray_mid, color_mid], ...)
PART 2: Extracting a Color Channel
Exercise 2.1 — What Are R, G, B?
A color pixel is stored as three numbers: (R, G, B) — Red, Green, Blue.
Each value is between 0 and 255.
- (255, 0, 0) → pure red
- (0, 255, 0) → pure green
- (0, 0, 255) → pure blue
- (0, 0, 0) → black
- (255, 255, 255) → white
- (128, 128, 128) → gray
Tasks:
- Look at color_arr[0, 0]. What are the R, G, B values of the top-left pixel?
- Create a tiny 4×4 test array manually (use np.array) where each pixel is a known color:
  - Row 0: all red pixels [255, 0, 0]
  - Row 1: all green pixels [0, 255, 0]
  - Row 2: all blue pixels [0, 0, 255]
  - Row 3: all white pixels [255, 255, 255]
  Display it. Does it look right?
- Access test[0, 0, 0]: this is the Red channel of the first pixel. Access test[1, 0, 1]: what channel is this? What value do you expect?
# YOUR CODE HERE
# 1. Top-left pixel
print("Top-left pixel R,G,B:", color_arr[0, 0])
# 2. Build the 4x4 test array
test = np.zeros((4, 4, 3), dtype=np.uint8)
# Fill row 0 with red...
# YOUR CODE
show_images([test], titles=['Test 4x4 color array'])
# 3. Channel access
print("test[0,0,0] (Red of first pixel): ", test[0, 0, 0])
print("test[1,0,1] (??? of second row): ", test[1, 0, 1])
Exercise 2.2 — Extract the Red Channel Using Loops
Your job is to extract only the Red channel from the color image.
The result should be a 2D array (H × W) of numbers — just the red values.
Rules for this exercise:
- No NumPy slicing tricks like color_arr[:, :, 0]
- No np.split, no np.take, no fancy indexing
- Use plain Python for loops and index access arr[r, c, 0]
Tasks:
- Create an empty 2D array of the right size using np.zeros((height, width), dtype=np.uint8).
- Loop over every pixel and copy the Red value into your new array.
- Display the result using show_images with cmap_list=['gray'].
- What do bright areas in the red channel mean? What do dark areas mean?
- Extra: Do the same for the Green and Blue channels. Display all three side by side.
# YOUR CODE HERE
height = color_arr.shape[0]
width = color_arr.shape[1]
red_channel = np.zeros((height, width), dtype=np.uint8)
# Loop over every pixel and extract Red value
for r in range(height):
for c in range(width):
pass # red_channel[r, c] = ???
show_images([red_channel], titles=['Red Channel (as grayscale)'], cmap_list=['gray'])
# EXTRA: Extract Green and Blue channels the same way
green_channel = np.zeros((height, width), dtype=np.uint8)
blue_channel = np.zeros((height, width), dtype=np.uint8)
# YOUR LOOPS HERE
# Display all three channels side by side
show_images(
[red_channel, green_channel, blue_channel],
titles=['Red', 'Green', 'Blue'],
cmap_list=['Reds', 'Greens', 'Blues']
)
Exercise 2.3 — Visualize a Single Channel in Its True Color
When we display the Red channel as grayscale, bright = high red value, dark = low red value.
But what if we want to display it as an actual red image?
Think: to show the red channel in red, you need a color image where:
- The red channel = the values from red_channel
- The green channel = all zeros
- The blue channel = all zeros
Tasks:
- Without using NumPy fancy tricks, build a 3D array red_image of shape (H, W, 3) where only the red component is non-zero.
- Display it. Does it look red-tinted?
- Do the same for green and blue. Display all three.
- Now add all three colored arrays together using a loop (not NumPy addition). Display the result. Does it look like the original color image? Why or why not?
Hint for step 4: Be careful about overflow! Adding uint8 values can exceed 255. Think about how to handle this.
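The overflow mentioned in the hint can be seen on a single pair of values. This is a minimal sketch using two hypothetical one-element arrays `a` and `b`; it shows what goes wrong and one possible way to guard against it, not the only one.

```python
import numpy as np

# Hypothetical one-element uint8 arrays
a = np.array([200], dtype=np.uint8)
b = np.array([100], dtype=np.uint8)

wrapped = int((a + b)[0])                   # uint8 arithmetic wraps modulo 256
clamped = min(255, int(a[0]) + int(b[0]))   # widen to Python int, then clamp

print("wrapped:", wrapped)
print("clamped:", clamped)
```

Widening to a Python `int` before adding, then clamping, keeps bright pixels bright instead of letting them wrap around to dark values.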
# YOUR CODE HERE
# Build red_image: (H, W, 3) with only red channel filled
red_image = np.zeros((height, width, 3), dtype=np.uint8)
for r in range(height):
for c in range(width):
pass # red_image[r, c, 0] = ???
show_images([red_image], titles=['Red channel as color'])
# Build green_image and blue_image similarly
# ...
# Add them together — watch out for overflow!
# combined = ???
PART 3: Converting Color to Grayscale
Exercise 3.1 — What Is Grayscale?
A grayscale image has only one number per pixel — its brightness.
A color image has three (R, G, B).
The question is: given R, G, B — how do you compute a single brightness value?
There is no single "right" answer. But there are good ones and bad ones.
Your job is to invent several formulas and see what happens.
Tasks:
Before writing any code: think of 3 ways you could combine R, G, B into one number.
Write them as math formulas in the markdown cell below.
The simplest idea: take the average.
$$\text{gray} = \frac{R + G + B}{3}$$
Does this make sense? What assumptions does it make?
Your 3 ideas (before looking at any answers):
- Idea 1: ...
- Idea 2: ...
- Idea 3: ...
Exercise 3.2 — Implement Your Formulas
For each formula you invented (and the ones below), write a function rgb_to_gray_METHOD(color_array) that:
- Takes a (H, W, 3) numpy array
- Returns a (H, W) numpy array of type uint8
- Uses only Python loops and arithmetic: no np.mean(), no slicing across channels
Formulas to implement:
| Method | Formula |
|---|---|
| Average | $\frac{R + G + B}{3}$ |
| Lightness | $\frac{\max(R,G,B) + \min(R,G,B)}{2}$ |
| Luminosity (ITU-R BT.601) | $0.299 R + 0.587 G + 0.114 B$ |
| Your own idea | (whatever you came up with above) |
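Tip on uint8: When you compute a float result, convert it back to uint8 by wrapping it in int(...) and clamping: max(0, min(255, value)). Also convert each channel value to a Python int before adding, because uint8 + uint8 wraps around past 255.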
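The clamping tip can be sketched as a tiny helper. The name `to_uint8` is hypothetical (it is not part of the notebook), and the pixel values are made up for illustration; the weights are the BT.601 ones from the table above.

```python
def to_uint8(value):
    """Hypothetical helper: truncate toward zero, then clamp to [0, 255]."""
    return max(0, min(255, int(value)))

# One made-up pixel, combined with the luminosity weights from the table
R, G, B = 255, 128, 0
luma = 0.299 * R + 0.587 * G + 0.114 * B

print(to_uint8(luma))    # float result truncated to an integer
print(to_uint8(300.7))   # too large: clamped to the top of the range
print(to_uint8(-5.2))    # negative: clamped to the bottom of the range
```

Note that `int(...)` truncates rather than rounds; whether to use `round(...)` instead is a design choice worth thinking about.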
# METHOD 1: Average
def rgb_to_gray_average(arr):
    """
    Converts color image to grayscale using simple average: (R+G+B)/3.
    arr: numpy array of shape (H, W, 3), dtype uint8
    Returns: numpy array of shape (H, W), dtype uint8
    """
    h, w = arr.shape[0], arr.shape[1]
    result = np.zeros((h, w), dtype=np.uint8)
    for r in range(h):
        for c in range(w):
            # Convert to Python ints first: uint8 + uint8 would wrap around past 255
            R, G, B = int(arr[r, c, 0]), int(arr[r, c, 1]), int(arr[r, c, 2])
            gray = int((R + G + B) / 3)
            result[r, c] = max(0, min(255, gray))
    return result
# YOUR CODE: implement the other methods
def rgb_to_gray_lightness(arr):
pass
def rgb_to_gray_luminosity(arr):
pass
def rgb_to_gray_mymethod(arr):
"""Your own formula!"""
pass
# Test — run average on the color image
gray_average = rgb_to_gray_average(color_arr)
print("Shape:", gray_average.shape)
print("Dtype:", gray_average.dtype)
Exercise 3.3 — Compare the Methods
Tasks:
- Apply all four methods to color_arr.
- Display all results side by side using show_images with cmap_list=['gray', 'gray', 'gray', 'gray'].
- Also display the original color image for reference.
- Do you see any visible differences between the methods?
- Compute the pixel-wise difference between gray_average and gray_luminosity: for each pixel, compute abs(int(gray_average[r,c]) - int(gray_luminosity[r,c])). The int() casts matter, because subtracting uint8 values directly wraps around instead of going negative. Display this difference image. Where are the biggest differences?
- Think about why the luminosity formula (0.299, 0.587, 0.114) uses unequal weights. Notice: Green gets the most weight, Blue the least. Why might that be?
Hint for question 6: Think about how human eyes perceive brightness in different colors.
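One way to get a feel for the unequal weights is to feed the three pure primaries through both formulas. This is a small sketch on hand-picked pixels; the dictionary names `primaries` and `gray_values` are illustrative only.

```python
# Pure primaries, each at full intensity
primaries = {"red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255)}

gray_values = {}
for name, (R, G, B) in primaries.items():
    average = int((R + G + B) / 3)
    luminosity = int(0.299 * R + 0.587 * G + 0.114 * B)
    gray_values[name] = (average, luminosity)
    print(f"{name:5s} average={average:3d} luminosity={luminosity:3d}")
```

The average formula gives all three primaries the same gray, while the luminosity formula separates them. Which of the two matches how bright these colors look to you on screen?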
# YOUR CODE HERE
# Apply all methods
gray_average = rgb_to_gray_average(color_arr)
# gray_lightness = rgb_to_gray_lightness(color_arr)
# gray_luminosity = rgb_to_gray_luminosity(color_arr)
# gray_mymethod = rgb_to_gray_mymethod(color_arr)
# Display all side by side
# show_images([gray_average, gray_lightness, gray_luminosity, gray_mymethod],
# titles=['Average', 'Lightness', 'Luminosity', 'My Method'],
# cmap_list=['gray', 'gray', 'gray', 'gray'])
# Compute pixel-wise difference between average and luminosity using loops
# diff[r, c] = abs(int(gray_average[r,c]) - int(gray_luminosity[r,c]))  # int() avoids uint8 wrap-around
# YOUR CODE HERE
h, w = gray_average.shape
diff = np.zeros((h, w), dtype=np.uint8)
# for r in range(h):
# for c in range(w):
# ...
# show_images([diff], titles=['Difference: Average vs Luminosity'], cmap_list=['hot'])
Your reflection (fill in):
- Which method produced the most visually pleasing grayscale image? Why?
- Why does the luminosity formula give more weight to green?
- Where were the biggest differences between methods?
# SETUP HELPER — run this cell as-is
# Resize the grayscale image to exactly 100x100 using PIL
img_100 = np.array(gray_img.resize((100, 100), Image.LANCZOS))
print("Shape of 100x100 image:", img_100.shape)
show_images([img_100], titles=['100×100 Grayscale'], cmap_list=['gray'])
Exercise 4.2 — Think Before You Code
You have a 100×100 image. You want to produce a 50×50 image.
The output has 4× fewer pixels than the input. Each output pixel must somehow be derived from the input pixels.
Before writing any code, answer these questions in the markdown cell below:
- Each output pixel at position (r, c) in the 50×50 image corresponds to which input pixel(s) in the 100×100 image?
- If output pixel (0, 0) covers input pixels (0,0), (0,1), (1,0), (1,1), what single value should you assign to it? Think of at least 3 different choices.
- Is there information loss when going from 100×100 to 50×50? Can you ever perfectly reconstruct the original?
Your answers before coding:
- Output pixel (r, c) corresponds to input pixel(s): ...
- Three possible values to assign: ...
- Is there information loss? ...
Exercise 4.3 — Implement Three Downsampling Methods
Rules: Use only Python loops. No cv2.resize, no PIL .resize, no np.mean over blocks.
Method A — Nearest Neighbor (Subsampling)
For each output pixel (r, c), simply copy the value from input pixel (2*r, 2*c).
Method B — Average Pooling
For each output pixel (r, c), take the average of the 2×2 block:
input[2r, 2c], input[2r, 2c+1], input[2r+1, 2c], input[2r+1, 2c+1]
Method C — Max Pooling
For each output pixel (r, c), take the maximum of the same 2×2 block.
Method D — Your own!
Invent a fourth method. Some ideas: min pooling, median of the 4 values, or a weighted average (give some positions in the 2×2 block more weight than others).
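Before filling in the functions, it may help to trace Methods A to C by hand on something small. The sketch below uses a made-up 4×4 image stored as nested lists and a hypothetical helper `block`; it is an illustration of the three rules, not the notebook's reference solution.

```python
# Hypothetical 4x4 "image" as nested lists
img = [[ 10,  20, 200, 210],
       [ 30,  40, 220, 230],
       [  0,   0,  90,  90],
       [  0, 255,  90,  90]]

def block(img, r, c):
    """Return the four input values covered by output pixel (r, c)."""
    return [img[2 * r][2 * c],     img[2 * r][2 * c + 1],
            img[2 * r + 1][2 * c], img[2 * r + 1][2 * c + 1]]

results = {}
for r in range(2):
    for c in range(2):
        b = block(img, r, c)
        nearest = b[0]                             # Method A: top-left value
        average = (b[0] + b[1] + b[2] + b[3]) // 4  # Method B: integer average
        maximum = max(b)                           # Method C: largest value
        results[(r, c)] = (nearest, average, maximum)
        print((r, c), results[(r, c)])
```

The block at output position (1, 0) contains a single bright pixel among dark ones, which makes the three methods disagree sharply.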
# METHOD A: Nearest Neighbor Downsampling
def downsample_nearest(img):
"""
Downsamples a (H, W) image to (H//2, W//2) by taking every other pixel.
"""
h, w = img.shape
out_h, out_w = h // 2, w // 2
result = np.zeros((out_h, out_w), dtype=np.uint8)
for r in range(out_h):
for c in range(out_w):
pass # result[r, c] = ???
return result
# METHOD B: Average Pooling
def downsample_average(img):
"""
Downsamples by averaging each 2x2 block.
"""
h, w = img.shape
out_h, out_w = h // 2, w // 2
result = np.zeros((out_h, out_w), dtype=np.uint8)
for r in range(out_h):
for c in range(out_w):
pass # result[r, c] = average of 2x2 block
return result
# METHOD C: Max Pooling
def downsample_max(img):
"""
Downsamples by taking the maximum of each 2x2 block.
"""
h, w = img.shape
out_h, out_w = h // 2, w // 2
result = np.zeros((out_h, out_w), dtype=np.uint8)
for r in range(out_h):
for c in range(out_w):
pass # result[r, c] = max of 2x2 block
return result
# METHOD D: YOUR OWN
def downsample_mymethod(img):
"""Your own downsampling method — describe it in the docstring!"""
pass
# Test one of them
small = downsample_nearest(img_100)
print("Output shape:", small.shape) # Should be (50, 50)
Exercise 4.4 — Compare and Analyze
Tasks:
- Apply all four methods to img_100. Display the results side by side (along with the original).
- Which method preserves the most visual detail? Which looks smoothest? Which looks sharpest?
- Compute a difference map: for each pair of methods, compute abs(int(method_A[r,c]) - int(method_B[r,c])) for every pixel (the int() casts keep the uint8 subtraction from wrapping around). Where do methods disagree most?
- Think deeper: In convolutional neural networks, max pooling is often preferred over average pooling. Based on what you see, why might max pooling be preferred for detecting features (like edges)?
# YOUR CODE HERE
# Apply all methods
small_nearest = downsample_nearest(img_100)
# small_average = downsample_average(img_100)
# small_max = downsample_max(img_100)
# small_mymethod = downsample_mymethod(img_100)
# Display side by side with original
# show_images(
# [img_100, small_nearest, small_average, small_max, small_mymethod],
# titles=['Original 100x100', 'Nearest (50x50)', 'Average (50x50)', 'Max (50x50)', 'My Method (50x50)'],
# cmap_list=['gray'] * 5
# )
# Compute difference map between nearest and average (using loops)
h, w = small_nearest.shape
diff_down = np.zeros((h, w), dtype=np.uint8)
# YOUR LOOP HERE
# for r in range(h):
# for c in range(w):
# diff_down[r, c] = abs(int(small_nearest[r,c]) - int(small_average[r,c]))
# show_images([diff_down], titles=['Difference: Nearest vs Average'], cmap_list=['hot'])
Your observations:
- Nearest neighbor looks...
- Average pooling looks...
- Max pooling looks...
- I think max pooling is preferred in neural networks because...
PART 5: Growing an Image — Upsampling to 100×100
Exercise 5.1 — The Reverse Problem
Now you have a 50×50 image (use small_nearest from Part 4).
You want to produce a 100×100 image.
This is the reverse of downsampling, but it's harder: you need to invent information that wasn't there.
Before coding, think:
- Each output pixel (r, c) in the 100×100 image comes from which input pixel(s) in the 50×50 image?
- What happens when (r, c) falls exactly between two input pixels? For example, output pixel (1, 0) lands between input (0,0) and (1,0).
- Write down two different strategies in the markdown cell below.
Your strategies before coding:
- Strategy A: ...
- Strategy B: ...
Exercise 5.2 — Implement Three Upsampling Methods
Rules: Use only Python loops. No cv2.resize, no PIL .resize.
Method A — Nearest Neighbor Replication
Each output pixel (r, c) copies from input pixel (r//2, c//2).
Every input pixel gets "stretched" into a 2×2 block.
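The "stretching" can be traced on a made-up 2×2 image stored as nested lists. This is only a sketch of the index mapping described above; the names `small` and `big` are illustrative.

```python
# Hypothetical 2x2 input as nested lists
small = [[1, 2],
         [3, 4]]
scale = 2

big = []
for r in range(len(small) * scale):
    row = []
    for c in range(len(small[0]) * scale):
        row.append(small[r // scale][c // scale])  # integer division picks the source pixel
    big.append(row)

for row in big:
    print(row)
```

Each input value shows up as a 2×2 block in the output, which is where the blocky look comes from.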
Method B — Bilinear Interpolation (1D first, then 2D)
This one is harder. Let's build it up:
- First, implement linear interpolation between two values: lerp(a, b, t) = a + t * (b - a), where t is between 0 and 1.
- For upsampling 2×: output pixel (r, c) maps to input position (r/2, c/2).
- If r/2 = 1.5, you interpolate between row 1 and row 2 with weight 0.5.
- Do this for both row and column directions.
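The build-up above can be tried in one dimension first. The sketch below applies the lerp formula to a made-up row of three values; the clamping of the right-hand neighbor at the border is one possible edge policy, chosen here as an assumption.

```python
def lerp(a, b, t):
    """Linear interpolation: t=0 gives a, t=1 gives b."""
    return a + t * (b - a)

row = [0, 100, 200]   # hypothetical 1D "image"
scale = 2

out = []
for x in range(len(row) * scale):
    pos = x / scale                    # fractional input position
    x0 = int(pos)                      # left neighbor
    x1 = min(x0 + 1, len(row) - 1)     # right neighbor, clamped at the border
    t = pos - x0                       # how far we are from x0 toward x1
    out.append(lerp(row[x0], row[x1], t))

print(out)
```

Bilinear interpolation repeats this idea twice: once along columns, once along rows.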
Method C — Your own!
Ideas: replicate rows/columns, use the average of neighbors, or anything you can think of.
# Use the 50x50 image from Part 4
# If you didn't finish Part 4, here's a fallback:
img_50 = small_nearest if 'small_nearest' in dir() and small_nearest is not None else np.array(gray_img.resize((50, 50)))
print("Working with 50x50 image, shape:", img_50.shape)
show_images([img_50], titles=['50×50 Input'], cmap_list=['gray'])
# METHOD A: Nearest Neighbor Upsampling
def upsample_nearest(img, scale=2):
"""
Upsamples a (H, W) image to (H*scale, W*scale) by pixel replication.
Each output pixel (r, c) copies from input (r//scale, c//scale).
"""
h, w = img.shape
out_h, out_w = h * scale, w * scale
result = np.zeros((out_h, out_w), dtype=np.uint8)
for r in range(out_h):
for c in range(out_w):
pass # result[r, c] = ???
return result
# METHOD B: Bilinear Interpolation
def lerp(a, b, t):
"""Linear interpolation between a and b. t=0 gives a, t=1 gives b."""
pass # return ???
def upsample_bilinear(img, scale=2):
"""
Upsamples using bilinear interpolation.
For each output pixel (r, c):
- Map to input coordinates: in_r = r / scale, in_c = c / scale
- Find the four surrounding input pixels
- Interpolate
"""
h, w = img.shape
out_h, out_w = h * scale, w * scale
result = np.zeros((out_h, out_w), dtype=np.uint8)
for r in range(out_h):
for c in range(out_w):
# Map output coordinates to input space
in_r = r / scale
in_c = c / scale
# Find surrounding pixel indices
r0 = int(in_r) # floor
c0 = int(in_c) # floor
r1 = min(r0 + 1, h - 1) # next row down, clamped to the last row
c1 = min(c0 + 1, w - 1)
# Fractional parts (where between r0 and r1 are we?)
dr = in_r - r0
dc = in_c - c0
# Bilinear interpolation: interpolate along columns (dc) first, then along rows (dr)
# Step 1: interpolate the top row and bottom row
# top = lerp(img[r0, c0], img[r0, c1], dc)
# bottom = lerp(img[r1, c0], img[r1, c1], dc)
# Step 2: interpolate between top and bottom
# value = lerp(top, bottom, dr)
# YOUR CODE HERE
pass
return result
# METHOD C: YOUR OWN
def upsample_mymethod(img, scale=2):
"""Your own upsampling method!"""
pass
# Test Method A
big_nearest = upsample_nearest(img_50)
print("Output shape:", big_nearest.shape) # Should be (100, 100)
Exercise 5.3 — Compare and Analyze
Tasks:
- Apply all three methods. Display results next to the original 100×100 image.
- Nearest neighbor will look "blocky" — why? What visual artifact does it create?
- Bilinear should look smoother — why? What is the trade-off?
- Compute the pixel-wise error between the upsampled image and the original img_100:
  error[r, c] = abs(int(upsampled[r,c]) - int(img_100[r,c]))
  Which method is closest to the original? Does this surprise you?
- Compute the mean absolute error (MAE) for each method against img_100:
$$\text{MAE} = \frac{1}{H \times W} \sum_{r,c} |\text{upsampled}[r,c] - \text{original}[r,c]|$$
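As a quick check of the MAE formula, here is a worked example on two made-up 2×2 images stored as nested lists of Python ints (so no `uint8` casts are needed). The values are invented for illustration.

```python
# Hypothetical 2x2 images
upsampled = [[10, 20],
             [30, 40]]
original = [[12, 20],
            [25, 44]]

H, W = 2, 2
total = 0
for r in range(H):
    for c in range(W):
        total += abs(upsampled[r][c] - original[r][c])

mae = total / (H * W)
print(mae)
```

With real `uint8` arrays, each value would need an `int(...)` cast before subtracting, as in the error formula above.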
# YOUR CODE HERE
# Apply all methods
big_nearest = upsample_nearest(img_50)
# big_bilinear = upsample_bilinear(img_50)
# big_mymethod = upsample_mymethod(img_50)
# Display all vs original
# show_images(
# [img_100, big_nearest, big_bilinear, big_mymethod],
# titles=['Original 100x100', 'Nearest (100x100)', 'Bilinear (100x100)', 'My Method'],
# cmap_list=['gray'] * 4
# )
# Compute MAE for each method vs the original (using loops)
def mean_absolute_error(img_a, img_b):
"""
Computes pixel-wise mean absolute error between two same-size images.
Use only Python loops — no np.mean, no array subtraction.
"""
h, w = img_a.shape
total = 0
for r in range(h):
for c in range(w):
pass # total += ???
return total / (h * w)
# Compute and print MAE for each method
mae_nearest = mean_absolute_error(big_nearest, img_100)
print(f"MAE (Nearest): {mae_nearest:.4f}")
# Add bilinear and mymethod when ready
# mae_bilinear = mean_absolute_error(big_bilinear, img_100)
# print(f"MAE (Bilinear): {mae_bilinear:.4f}")
PART 6: Upsampling to an Arbitrary Size — 50×50 to 150×150 and 75×75
Exercise 6.1 — A New Challenge
So far you've doubled the image (50→100). Now go from 50×50 to 150×150 — a scale factor of 3.
Nearest neighbor is easy: output pixel (r, c) maps to input (r//3, c//3).
Bilinear is similar: output (r, c) maps to input coordinates (r/3, c/3).
But here's a deeper challenge: what about going from 50×50 to 75×75?
The scale factor is 1.5 — not an integer.
Output pixel (r, c) maps to input position (r / 1.5, c / 1.5) = (r * 2/3, c * 2/3).
This is a fractional coordinate — you must interpolate.
The key insight: Your bilinear implementation already handles this!
The formula in_r = r / scale works for any scale — integer or fractional.
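To see the fractional mapping at work, the sketch below tabulates a few output rows for the 50 to 75 case. The variable names and the clamping of the lower row to the last valid index are assumptions made for this illustration.

```python
in_size, out_size = 50, 75
scale = out_size / in_size            # 1.5

mapping = {}
for r in [0, 1, 2, 3, 74]:
    in_r = r / scale                  # fractional input coordinate
    r0 = int(in_r)                    # row above (floor)
    r1 = min(r0 + 1, in_size - 1)     # row below, clamped to the last row
    dr = in_r - r0                    # interpolation weight toward r1
    mapping[r] = (r0, r1, round(dr, 3))
    print(r, mapping[r])
```

Some output rows land exactly on an input row (weight 0), others fall in between, and the last output row needs the clamp because there is no row 50 in the input.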
Tasks:
- Generalize your upsample_nearest and upsample_bilinear to accept an arbitrary target_h, target_w instead of a fixed scale.
- Implement resize_nearest(img, target_h, target_w) and resize_bilinear(img, target_h, target_w).
- Resize img_50 to 150×150 using both methods.
- Resize img_50 to 75×75 using both methods.
- Display and compare results.
# METHOD A: Nearest Neighbor Resize (arbitrary target size)
def resize_nearest(img, target_h, target_w):
"""
Resizes img to (target_h, target_w) using nearest neighbor.
For output pixel (r, c), map to input (int(r * h / target_h), int(c * w / target_w)),
i.e. round down, then clamp the result to [0, h-1] and [0, w-1].
"""
h, w = img.shape
result = np.zeros((target_h, target_w), dtype=np.uint8)
for r in range(target_h):
for c in range(target_w):
# Map to input coordinates
# in_r = ??? (use int(...) to round down)
# in_c = ???
# Clamp to valid range: in_r must be < h, in_c must be < w
pass
return result
# METHOD B: Bilinear Resize (arbitrary target size)
def resize_bilinear(img, target_h, target_w):
"""
Resizes img to (target_h, target_w) using bilinear interpolation.
For output pixel (r, c), the input coordinates are:
in_r = r * (h - 1) / (target_h - 1)
in_c = c * (w - 1) / (target_w - 1)
Then bilinearly interpolate.
"""
h, w = img.shape
result = np.zeros((target_h, target_w), dtype=np.uint8)
for r in range(target_h):
for c in range(target_w):
# Input coordinates (note: use (h-1)/(target_h-1) to map edges correctly)
in_r = r * (h - 1) / max(1, target_h - 1)
in_c = c * (w - 1) / max(1, target_w - 1)
r0 = int(in_r)
c0 = int(in_c)
r1 = min(r0 + 1, h - 1)
c1 = min(c0 + 1, w - 1)
dr = in_r - r0
dc = in_c - c0
# YOUR BILINEAR INTERPOLATION HERE
pass
return result
# Test: resize 50x50 to 150x150
# img_150_nearest = resize_nearest(img_50, 150, 150)
# img_150_bilinear = resize_bilinear(img_50, 150, 150)
# print("150x150 shapes:", img_150_nearest.shape, img_150_bilinear.shape)
# YOUR CODE HERE: Apply and display
# Resize to 150x150
# show_images([img_50, img_150_nearest, img_150_bilinear],
# titles=['Original 50x50', 'Nearest (150x150)', 'Bilinear (150x150)'],
# cmap_list=['gray', 'gray', 'gray'])
# Resize to 75x75
# img_75_nearest = resize_nearest(img_50, 75, 75)
# img_75_bilinear = resize_bilinear(img_50, 75, 75)
# show_images([img_50, img_75_nearest, img_75_bilinear],
# titles=['Original 50x50', 'Nearest (75x75)', 'Bilinear (75x75)'],
# cmap_list=['gray', 'gray', 'gray'])
Exercise 6.2 — Extreme Upsampling
Now let's push your implementation further.
Tasks:
- Take a tiny 10×10 patch from img_50 (top-left corner): patch = img_50[:10, :10].
- Upsample it to 200×200 using both nearest and bilinear.
- Display all three (original 10×10 patch, nearest 200×200, bilinear 200×200).
- The nearest result should look very blocky — you'll see the "pixels". The bilinear should be blurry. Why?
- Design question: Is there a way to upsample that preserves sharp edges? Describe your idea (no code needed — just think and write).
# YOUR CODE HERE
patch = img_50[:10, :10]
print("Patch shape:", patch.shape)
# patch_200_nearest = resize_nearest(patch, 200, 200)
# patch_200_bilinear = resize_bilinear(patch, 200, 200)
# show_images([patch, patch_200_nearest, patch_200_bilinear],
# titles=['10x10 Patch', 'Nearest (200x200)', 'Bilinear (200x200)'],
# cmap_list=['gray', 'gray', 'gray'],
# figsize=(15, 5))
Your design idea for sharp upsampling: ...
PART 7: Designing a Convolution
Exercise 7.1 — The Neighborhood Idea
You've learned that a pixel is a single number. Now let's think differently:
a pixel together with its neighbors tells you something about local structure.
Consider a grayscale image. Look at a 3×3 patch around any pixel (r, c):
img[r-1, c-1] img[r-1, c] img[r-1, c+1]
img[r, c-1] img[r, c] img[r, c+1]
img[r+1, c-1] img[r+1, c] img[r+1, c+1]
A filter (also called a kernel or weight matrix) is a small grid of numbers, also 3×3:
w[0,0] w[0,1] w[0,2]
w[1,0] w[1,1] w[1,2]
w[2,0] w[2,1] w[2,2]
The convolution output at pixel (r, c) is the sum of element-wise products:
$$\text{out}[r, c] = \sum_{i=0}^{2} \sum_{j=0}^{2} \text{img}[r+i-1, c+j-1] \times w[i, j]$$
Before coding, think:
- What happens at the edges of the image? (There are no neighbors outside the boundary.)
- What does the output represent? Is it still an image? What size is it?
- What would happen if all weights were 1/9? What computation would that be?
- What would happen if the center weight is 1 and all others are 0?
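The sum of element-wise products can be worked through once by hand before coding it for a whole image. The patch values and the weights `w` below are made up for illustration, and `weighted_sum` is a hypothetical helper name; this kernel is deliberately not one of the two asked about above.

```python
# Hypothetical 3x3 patch around some pixel (r, c)
patch = [[10, 20, 30],
         [40, 50, 60],
         [70, 80, 90]]

# Hypothetical weights: left column positive, right column negative
w = [[1, 0, -1],
     [1, 0, -1],
     [1, 0, -1]]

def weighted_sum(patch, w):
    """Sum of element-wise products of a 3x3 patch and a 3x3 kernel."""
    total = 0.0
    for i in range(3):
        for j in range(3):
            total += patch[i][j] * w[i][j]
    return total

out_value = weighted_sum(patch, w)
print(out_value)
```

The result is negative, which already hints that the output is not an ordinary image in [0, 255]. Strictly speaking, mathematical convolution flips the kernel before multiplying; the formula above does not, which matches the convention most deep learning libraries use under the name convolution.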
Your answers before coding:
- Edge problem: ...
- Output: ...
- All weights = 1/9: ...
- Center weight = 1, rest = 0: ...
Exercise 7.2 — Your First Convolution
Implement conv(image, weights) where:
- image is a 2D numpy array of shape (H, W)
- weights is a 2D numpy array of shape (K, K) where K is odd (3, 5, 7, ...)
- Output is a 2D numpy array
Boundary strategy — "valid" mode:
Only compute output where the kernel fits fully inside the image.
For a 3×3 kernel on an H×W image, the output will be (H-2) × (W-2).
Algorithm:
For each output position (r, c):
total = 0
For each kernel position (i, j):
total += image[r + i, c + j] * weights[i, j]
output[r, c] = total
Note: The output can have values outside [0, 255], so don't clamp yet. Use float arrays for the output and we'll handle display separately.
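The output size in "valid" mode follows directly from counting the positions where the kernel fits. A minimal sketch, using a hypothetical helper `valid_output_shape`:

```python
def valid_output_shape(H, W, K):
    """Hypothetical helper: output size of a K x K kernel in 'valid' mode."""
    return (H - K + 1, W - K + 1)

for K in (3, 5, 7):
    # With this indexing, output (r, c) is centered on input (r + K//2, c + K//2)
    print(K, valid_output_shape(100, 100, K), "center offset:", K // 2)
```

The center offset explains why, with the algorithm above, output position (0, 0) of a 3×3 kernel corresponds to input pixel (1, 1).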
def conv(image, weights):
    """
    Applies a convolution filter to a grayscale image.
    Parameters:
        image   : 2D numpy array of shape (H, W)
        weights : 2D numpy array of shape (K, K), the filter kernel
    Returns:
        output  : 2D numpy array of shape (H-K+1, W-K+1)
                  Contains the raw convolution output (may not be in [0,255])
    """
    H, W = image.shape
    K = weights.shape[0]  # Kernel size (assume square)
    out_H = H - K + 1
    out_W = W - K + 1
    output = np.zeros((out_H, out_W), dtype=np.float64)
    for r in range(out_H):
        for c in range(out_W):
            total = 0.0
            for i in range(K):
                for j in range(K):
                    pass  # total += ???
            output[r, c] = total
    return output
# SANITY CHECK: Identity filter — should return the original image (minus edges)
identity_kernel = np.array([
[0, 0, 0],
[0, 1, 0],
[0, 0, 0]
], dtype=np.float64)
result = conv(img_100.astype(np.float64), identity_kernel)
print("Output shape:", result.shape) # Should be (98, 98)
# Check: result[0,0] should equal img_100[1,1]
print(f"result[0,0] = {result[0,0]:.1f}")
print(f"img_100[1,1] = {img_100[1,1]}")
print("Match:", abs(result[0,0] - float(img_100[1,1])) < 1e-9)
Exercise 7.3 — Helper for Displaying Convolution Output
The output of convolution may have negative values or values > 255.
We need a helper to normalize and display it.
# DISPLAY HELPER for convolution output — run this cell as-is
def show_conv_result(original, output, title='Convolution Output', clip=False):
    """
    Displays the original image and convolution output side by side.
    Normalizes the output to [0, 255] for display.
    Parameters:
        original : the input image (2D array, uint8)
        output   : the convolution result (2D float array)
        clip     : if True, clip to [0,255] instead of normalizing
    """
    if clip:
        display_out = np.clip(output, 0, 255).astype(np.uint8)
    else:
        # Normalize to [0, 255]
        lo, hi = output.min(), output.max()
        if hi > lo:
            display_out = ((output - lo) / (hi - lo) * 255).astype(np.uint8)
        else:
            display_out = np.zeros_like(output, dtype=np.uint8)
    # Trim original to match output size
    pad = (original.shape[0] - display_out.shape[0]) // 2
    orig_trimmed = original[pad:pad+display_out.shape[0], pad:pad+display_out.shape[1]]
    show_images(
        [orig_trimmed, display_out],
        titles=['Original (trimmed)', title],
        cmap_list=['gray', 'gray']
    )
print("show_conv_result() helper ready.")
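The two display strategies in the helper behave quite differently on out-of-range values. The sketch below compares them on a small made-up array; `raw` is an illustrative stand-in for a convolution output, not data from the notebook.

```python
import numpy as np

# Made-up "convolution output" with values below 0 and above 255
raw = np.array([-100.0, 0.0, 255.0, 400.0])

# Strategy 1: clip. Everything outside [0, 255] is flattened to the nearest bound.
clipped = np.clip(raw, 0, 255).astype(np.uint8)

# Strategy 2: min-max normalize. The smallest value maps to 0, the largest to 255.
lo, hi = raw.min(), raw.max()
normalized = ((raw - lo) / (hi - lo) * 255).astype(np.uint8)

print("clipped   :", clipped)
print("normalized:", normalized)
```

Normalizing keeps the ordering of all values but moves zero away from black whenever the output contains negatives, so a zero response from an edge detector will typically show up as a shade of gray.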
Exercise 7.4 — Discover What Filters Do¶
Now the fun part. Apply your conv function with different kernels and discover what each one does to the image.
Rules: You must apply each kernel, display the result, and describe what you see before reading the description.
Kernel A — Blur
1/9 1/9 1/9
1/9 1/9 1/9
1/9 1/9 1/9
Kernel B — Sharpen
0 -1 0
-1 5 -1
0 -1 0
Kernel C — Horizontal Edge Detector (Sobel)
-1 -2 -1
0 0 0
1 2 1
Kernel D — Vertical Edge Detector (Sobel)
-1 0 1
-2 0 2
-1 0 1
Tasks:
- Apply each kernel to img_100.
- For each result, write in a comment what you observe before reading what the filter does.
- For Kernels C and D: can you combine the horizontal and vertical edge maps?
  Try: edge_magnitude[r,c] = sqrt(horiz[r,c]**2 + vert[r,c]**2) (implement with a loop).
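As a quick check of the edge-magnitude formula, the sketch below applies it to two tiny hand-picked arrays rather than to real Sobel outputs. The arrays `horiz` and `vert` are invented so that the results are easy to verify by hand (3-4-5 style triangles).

```python
import math
import numpy as np

# Invented stand-ins for the horizontal and vertical edge maps
horiz = np.array([[3.0, 0.0], [-6.0, 5.0]])
vert = np.array([[4.0, 0.0], [8.0, -12.0]])

h, w = horiz.shape
magnitude = np.zeros((h, w), dtype=np.float64)
for r in range(h):
    for c in range(w):
        # Length of the 2D gradient vector (horiz, vert) at this pixel
        magnitude[r, c] = math.sqrt(horiz[r, c] ** 2 + vert[r, c] ** 2)

print(magnitude)
```

Squaring removes the sign, so a dark-to-bright edge and a bright-to-dark edge contribute the same magnitude.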
# Define the kernels
kernel_blur = np.array([
[1/9, 1/9, 1/9],
[1/9, 1/9, 1/9],
[1/9, 1/9, 1/9]
])
kernel_sharpen = np.array([
[ 0, -1, 0],
[-1, 5, -1],
[ 0, -1, 0]
], dtype=np.float64)
kernel_sobel_h = np.array([
[-1, -2, -1],
[ 0, 0, 0],
[ 1, 2, 1]
], dtype=np.float64)
kernel_sobel_v = np.array([
[-1, 0, 1],
[-2, 0, 2],
[-1, 0, 1]
], dtype=np.float64)
# Apply and display — YOUR CODE
# Kernel A: Blur
result_blur = conv(img_100.astype(np.float64), kernel_blur)
show_conv_result(img_100, result_blur, title='Blur')
# What do you observe? Write below.
# Kernel B: Sharpen
# result_sharpen = ???
# show_conv_result(img_100, result_sharpen, title='Sharpen')
# Kernel C: Horizontal Edges
# result_sobel_h = ???
# Kernel D: Vertical Edges
# result_sobel_v = ???
# Combine horizontal and vertical edges into edge magnitude
# edge_magnitude[r, c] = sqrt(result_sobel_h[r,c]**2 + result_sobel_v[r,c]**2)
# YOUR CODE (use a loop, use import math or ** 0.5)
import math
# h, w = result_sobel_h.shape
# edge_magnitude = np.zeros((h, w), dtype=np.float64)
# for r in range(h):
#     for c in range(w):
#         edge_magnitude[r, c] = ???
# show_conv_result(img_100, edge_magnitude, title='Edge Magnitude (Sobel)')
Your observations (fill in before reading explanations):
- Blur kernel: I see...
- Sharpen kernel: I see...
- Horizontal Sobel: I see...
- Vertical Sobel: I see...
- Edge magnitude: I see...
Exercise 7.5 — Invent Your Own Filter¶
Now you know what a convolution filter does. Design your own.
Tasks:
- Invent a 3×3 filter that does something interesting. Apply it and display the result.
- Experiment with a 5×5 blur kernel. A 5×5 blur has all weights = 1/25. Apply it. Compare to the 3×3 blur. Which is blurrier? Why?
- Gaussian blur idea: Instead of equal weights, what if the center gets more weight and edges get less? Design a 3×3 kernel where the center = 4, direct neighbors = 2, corners = 1 (then normalize so they sum to 1). Apply it. How does it compare to the flat blur?
- Try applying the blur filter multiple times in a row (chain: conv(conv(img, blur), blur)). What happens? Apply it 3 times, 5 times.
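One way to see what chained blurs do is to feed a single bright pixel (an impulse) through the blur twice and inspect what comes out. The sketch below does that with a hypothetical helper, `valid_conv`, which stands in for the notebook's `conv` so the example runs on its own; it uses NumPy slicing for brevity and is not meant as the exercise solution.

```python
import numpy as np

def valid_conv(image, weights):
    """Hypothetical stand-in for conv(): valid mode, kernel not flipped."""
    K = weights.shape[0]
    out = np.zeros((image.shape[0] - K + 1, image.shape[1] - K + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # Weighted sum of the KxK patch whose top-left corner is (r, c)
            out[r, c] = np.sum(image[r:r + K, c:c + K] * weights)
    return out

box = np.full((3, 3), 1 / 9)

# A 9x9 black image with one bright pixel in the middle
impulse = np.zeros((9, 9))
impulse[4, 4] = 1.0

once = valid_conv(impulse, box)   # shape (7, 7)
twice = valid_conv(once, box)     # shape (5, 5)

# Scale by 81 so the effective weights print as small integers
print(np.round(twice * 81))
```

Under these assumptions, two passes of the flat 3×3 blur act like one 5×5 kernel whose weights peak at the center and fall off toward the corners, which is worth comparing with the Gaussian-like kernel in the third task.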
# Task 1: Your own 3x3 filter
my_kernel = np.array([
[0, 0, 0],
[0, 1, 0], # Change these!
[0, 0, 0]
], dtype=np.float64)
# Apply and display
# result_mine = conv(img_100.astype(np.float64), my_kernel)
# show_conv_result(img_100, result_mine, title='My Filter')
# Task 2: 5x5 blur
kernel_blur_5x5 = np.full((5, 5), 1/25)
# result_blur_5 = conv(img_100.astype(np.float64), kernel_blur_5x5)
# show_conv_result(img_100, result_blur_5, title='5x5 Blur')
# Task 3: Gaussian-like blur
# Design the kernel: center=4, direct neighbors=2, corners=1, normalized
# YOUR CODE
gaussian_approx = np.array([
[1, 2, 1],
[2, 4, 2],
[1, 2, 1]
], dtype=np.float64)
# Normalize so weights sum to 1:
# gaussian_approx = gaussian_approx / ???
# result_gaussian = conv(img_100.astype(np.float64), gaussian_approx)
# show_conv_result(img_100, result_gaussian, title='Gaussian-like Blur')
# Task 4: Apply blur multiple times
# YOUR CODE
img_float = img_100.astype(np.float64)
# Apply blur once
# blurred_1 = conv(img_float, kernel_blur)
# Apply again
# blurred_2 = conv(blurred_1, kernel_blur)
# Apply 5 times using a loop
# current = img_float
# for _ in range(5):
#     current = conv(current, kernel_blur)
# blurred_5 = current
# Display original, 1x blurred, 2x blurred, 5x blurred
# show_images([img_100, ...], titles=['Original', '1x blur', '2x blur', '5x blur'], cmap_list=['gray']*4)
Exercise 7.6 — Add Padding¶
You may have noticed that each convolution slightly shrinks the image (98×98 from a 100×100 image with a 3×3 kernel).
A common fix is zero-padding: surround the image with a border of zeros before convolving, so the output stays the same size as the input.
For a 3×3 kernel, add 1 pixel of zeros on all sides.
For a 5×5 kernel, add 2 pixels.
Formula: pad_size = (K - 1) // 2
Tasks:
- Implement pad_image(img, pad_size) that returns a new array with pad_size rows/columns of zeros added on all sides.
- Implement conv_same(image, weights) that uses pad_image internally, so the output is the same size as the input.
- Apply conv_same with the blur kernel to img_100. Verify the output shape is (100, 100).
- Apply conv_same with the Sobel kernels. Notice that now edges near the image boundary are included in the output.
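The exercise asks for a loop-based `pad_image`, but NumPy's built-in `np.pad` can serve as a reference to check your result against. The sketch below only shows what the expected padded array looks like for a tiny input; `small` and `reference` are illustrative names.

```python
import numpy as np

# Tiny 3x3 input so the padded result is easy to read
small = np.arange(1, 10, dtype=np.uint8).reshape(3, 3)

K = 3
pad_size = (K - 1) // 2   # 1 for a 3x3 kernel, 2 for a 5x5 kernel

# Reference result: a border of zeros, pad_size wide, on every side
reference = np.pad(small, pad_size, mode="constant", constant_values=0)

print(reference.shape)
print(reference)
```

Once your own function is written, a comparison such as `np.array_equal(pad_image(small, pad_size), reference)` should return True.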
def pad_image(img, pad_size):
    """
    Pads a (H, W) image with `pad_size` zeros on all sides.
    Returns a (H + 2*pad_size, W + 2*pad_size) array.
    Implement using loops, no np.pad!
    """
    H, W = img.shape
    new_H = H + 2 * pad_size
    new_W = W + 2 * pad_size
    result = np.zeros((new_H, new_W), dtype=img.dtype)
    # Copy img into the center of result
    # YOUR CODE HERE
    for r in range(H):
        for c in range(W):
            pass  # result[r + pad_size, c + pad_size] = ???
    return result
def conv_same(image, weights):
    """
    Convolution with 'same' padding: output is the same size as input.
    Uses pad_image internally.
    """
    K = weights.shape[0]
    pad_size = (K - 1) // 2
    # YOUR CODE: pad the image, then call conv on the padded version
    pass
# Test
padded = pad_image(img_100, pad_size=1)
print("Padded shape:", padded.shape) # Should be (102, 102)
# result_same = conv_same(img_100.astype(np.float64), kernel_blur)
# print("conv_same output shape:", result_same.shape) # Should be (100, 100)
Exercise 7.7 — Convolution on a Color Image¶
So far, conv works on grayscale (2D). But real images are color (3D: H × W × 3).
How should convolution work on a color image?
Approach 1 — Per-channel: Apply the same kernel independently to each of R, G, B.
Stack the three results back into a color image.
Tasks:
- Implement conv_color(image_rgb, weights) that:
  - Takes a (H, W, 3) array and a kernel
  - Applies conv_same to each channel separately
  - Returns a (H, W, 3) float array
- Apply it to color_arr with the blur kernel. Display the result.
- Apply it with the Sobel horizontal kernel. What do you get?
- Think: What is a case where you'd want different kernels for different channels?
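The per-channel pattern (split, process, restack) is independent of which 2D operation you apply. The sketch below illustrates it with a trivial operation, halving every value, in place of a convolution; `per_channel` is a hypothetical helper and not part of the notebook.

```python
import numpy as np

def per_channel(image_rgb, fn):
    """Hypothetical helper: apply a 2D -> 2D function to R, G, B and restack."""
    channels = []
    for ch in range(3):
        plane = image_rgb[:, :, ch].astype(np.float64)  # one (H, W) slice
        channels.append(fn(plane))
    # Stack along a new last axis to get back an (H, W, 3) array
    return np.stack(channels, axis=-1)

# Small solid-color test image: R=200, G=100, B=50 everywhere
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
rgb[:, :, 0] = 200
rgb[:, :, 1] = 100
rgb[:, :, 2] = 50

halved = per_channel(rgb, lambda plane: plane / 2)
print(halved.shape, halved[0, 0])
```

Swapping the lambda for a size-preserving convolution gives the behavior the exercise describes; passing a different function per channel is one way to explore the last question in the task list.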
def conv_color(image_rgb, weights):
    """
    Applies conv_same independently to each color channel.
    image_rgb: (H, W, 3) uint8 array
    weights  : (K, K) kernel
    Returns  : (H, W, 3) float64 array
    """
    H, W = image_rgb.shape[0], image_rgb.shape[1]
    result = np.zeros((H, W, 3), dtype=np.float64)
    for ch in range(3):  # ch = 0 (R), 1 (G), 2 (B)
        channel = image_rgb[:, :, ch].astype(np.float64)
        # Apply conv_same to this channel
        # result[:, :, ch] = ???
        pass
    return result
# Apply blur to color image and display
# color_blurred = conv_color(color_arr, kernel_blur)
# DISPLAY HELPER for color convolution output
def show_color_conv(original, output, title='Color Convolution'):
    lo = output.min()
    hi = output.max()
    if hi > lo:
        display = ((output - lo) / (hi - lo) * 255).astype(np.uint8)
    else:
        display = np.zeros_like(output, dtype=np.uint8)
    show_images([original, display], titles=['Original', title])
# show_color_conv(color_arr, color_blurred, title='Color Blur')
PART 8: Pulling It All Together¶
Exercise 8.1 — Build a Mini Image Processing Pipeline¶
You now have all the building blocks. Let's build a real pipeline.
Goal: Given the original color image:
- Convert to grayscale (using luminosity method)
- Resize to 50×50
- Apply a blur filter
- Apply a Sobel edge detector (both H and V, then combine)
- Display each step side by side
Use your own implementations of each step — no library functions for the image processing.
# YOUR MINI PIPELINE
# Step 1: Grayscale
# step1_gray = rgb_to_gray_luminosity(color_arr)
# Step 2: Resize to 50x50
# step2_small = resize_bilinear(step1_gray, 50, 50)
# Step 3: Blur
# step3_blur = conv_same(step2_small.astype(np.float64), kernel_blur)
# Step 4: Edge detection
# step4_edges_h = conv_same(step3_blur, kernel_sobel_h)
# step4_edges_v = conv_same(step3_blur, kernel_sobel_v)
# step4_edges = edge magnitude of h and v combined (from Exercise 7.4)
# Display all steps
# show_images(
# [color_arr, step1_gray, step2_small, np.clip(step3_blur, 0, 255).astype(np.uint8), ...],
# titles=['1. Color', '2. Grayscale', '3. 50x50', '4. Blurred', '5. Edges'],
# cmap_list=[None, 'gray', 'gray', 'gray', 'gray']
# )
print("Fill in each step above!")
Exercise 8.2 — Bonus: Design a Filter by Intuition¶
Here is a challenge with no given formula — you must invent the filter.
Goal: Design a 3×3 kernel that detects diagonal edges (edges going from top-left to bottom-right, like \).
Think:
- The horizontal Sobel detects horizontal edges (top vs. bottom brightness difference)
- The vertical Sobel detects vertical edges (left vs. right brightness difference)
- What would a diagonal edge look like in the 3×3 patch?
- Which pixels would be bright and which dark for a \ edge?
Tasks:
- Sketch the 3×3 kernel on paper first.
- Implement it in code and apply to img_100.
- Does the result highlight diagonal edges? If not, adjust and try again.
- Also design a / edge detector.
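Judging a diagonal detector on a photograph can be hard, so a synthetic test image helps. The sketch below builds an image whose boundary runs along the main diagonal (top-left to bottom-right) and measures a kernel's response on the boundary and in a flat area. The helper `response_at` is hypothetical, and the vertical Sobel is used only as a placeholder for your own kernel.

```python
import numpy as np

# 8x8 test image: bright above the main diagonal, dark on and below it
size = 8
diag_img = np.zeros((size, size), dtype=np.float64)
for r in range(size):
    for c in range(size):
        if c > r:
            diag_img[r, c] = 255.0

def response_at(image, kernel, r, c):
    """Hypothetical helper: kernel response for the 3x3 patch centered at (r, c)."""
    patch = image[r - 1:r + 2, c - 1:c + 2]
    return float(np.sum(patch * kernel))

# Placeholder kernel (vertical Sobel); swap in your own design here
candidate = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
], dtype=np.float64)

on_edge = response_at(diag_img, candidate, 3, 3)   # patch straddles the boundary
in_flat = response_at(diag_img, candidate, 1, 6)   # patch is entirely bright
print("on the diagonal :", on_edge)
print("flat bright area:", in_flat)
```

A useful detector should give a large response on the boundary and a response near zero in flat areas. Flipping the test image with `diag_img[:, ::-1]` produces the opposite diagonal, which your kernel should respond to much more weakly.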
# YOUR DIAGONAL EDGE DETECTOR
# Think: for a '\' edge (boundary running top-left to bottom-right,
# e.g. top-right bright, bottom-left dark),
# which pixels in a 3x3 neighborhood would be bright and which dark?
# Replace the zeros below with your values (use positive and negative numbers).
kernel_diag_backslash = np.array([
[0, 0, 0], # <-- replace with your values
[0, 0, 0],
[0, 0, 0]
], dtype=np.float64)
# Hint: look at the Sobel kernels for inspiration.
# The horizontal Sobel uses +1/-1 to detect top vs. bottom.
# What values detect top-right vs. bottom-left?
# result_diag = conv_same(img_100.astype(np.float64), kernel_diag_backslash)
# show_conv_result(img_100, result_diag, title='Diagonal Edge Detector (\\)')
Exercise 8.3 — Bonus: What Does Repeated Convolution Do?¶
Tasks:
- Apply the sharpen kernel 10 times in a row. Display the result after 1, 3, 5, 10 applications.
- Apply the blur kernel 20 times. What happens?
- Apply the edge detector to an image, then apply blur to the edge map. What does that produce?
- Think: In a neural network, many convolution layers are stacked. Each layer has learned kernels. Based on what you've seen, what might the first, second, and deeper layers be detecting?
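Before running the sharpen experiment on the real image, it can help to watch the numbers on a tiny input. The sketch below tracks the largest absolute value after each pass on a 6×6 step image. `same_conv` is a hypothetical stand-in for the notebook's `conv_same` (it uses `np.pad` and slicing for brevity), and `step` is a made-up test image.

```python
import numpy as np

def same_conv(image, weights):
    """Hypothetical stand-in for conv_same(): zero-pad, then valid mode, no flip."""
    K = weights.shape[0]
    p = (K - 1) // 2
    padded = np.pad(image, p, mode="constant", constant_values=0)
    out = np.zeros(image.shape, dtype=np.float64)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = np.sum(padded[r:r + K, c:c + K] * weights)
    return out

sharpen = np.array([
    [ 0, -1,  0],
    [-1,  5, -1],
    [ 0, -1,  0],
], dtype=np.float64)

# Step image: left half 0, right half 100
step = np.zeros((6, 6), dtype=np.float64)
step[:, 3:] = 100.0

peaks = []
current = step
for n in range(1, 6):
    current = same_conv(current, sharpen)
    peaks.append(float(np.abs(current).max()))
    print(f"{n}x sharpen: min={current.min():.0f}, max={current.max():.0f}")
```

On this input the values move far outside [0, 255] within a few passes, which is why the snapshots need to be normalized or clipped before display.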
# YOUR CODE HERE
# Apply sharpen multiple times
current = img_100.astype(np.float64)
snapshots = []
for i in range(10):
    current = conv_same(current, kernel_sharpen)
    if i + 1 in [1, 3, 5, 10]:
        snapshots.append((i + 1, current.copy()))
# Display snapshots
# images = [snap for (_, snap) in snapshots]
# titles = [f'{n}x sharpen' for (n, _) in snapshots]
# ... normalize and display
Your thoughts on stacked layers:
- Layer 1 (close to input) likely detects...
- Layer 2 detects...
- Deeper layers likely detect...
Final Reflection¶
Answer these questions in your own words:
- What is an image? Describe it as a data structure, not visually.
- What is a convolution? Explain in 3 sentences without using the word "filter".
- What is the difference between downsampling and upsampling? Which one loses information permanently? Can you recover from either?
- Why does the blur kernel make the image blurry? Explain using the math: what operation does it actually perform at each pixel?
- Why does the Sobel kernel detect edges? Think about what happens when you apply it to a flat region vs. a sharp edge.
- In a Convolutional Neural Network (CNN), the kernels are not hand-designed; they are learned from data using gradient descent. Based on what you've built, what does "learning a kernel" mean? What is being optimized?
Your answers:
An image is...
A convolution is...
Downsampling vs upsampling...
Blur works because...
Sobel detects edges because...
Learning a kernel means...
You just built a computer vision system from scratch — pixel by pixel, loop by loop.
The early layers of a CNN apply the same operation you built here, with kernels learned from data instead of designed by hand.