BIBC2025 workshop - Image classification: Introduction
RSFAS, ANU
Image classification is the task of assigning one or more category labels to an input image based on its visual content.
Simplicity: It is the simplest and most fundamental vision task, helping models to extract and organise visual patterns like edges, shapes, and textures into meaningful object representations.
Historical motivations: Researchers started with image classification because it isolates the core challenge of visual representation learning without the added complexity of where objects appear or how they are structured.
A pixel is the smallest unit of an image, representing a single point of colour.
Each pixel stores intensity information for one or more channels, depending on the colour scheme.
The resolution of an image is the number of pixels along its width and height, e.g., 1920 \times 1080.

In a greyscale image, each pixel stores one intensity value representing brightness.

In an RGB image, each pixel has three channels: Red (R), Green (G) and Blue (B).

File storage can differ from in-memory representation, because files often compress or encode the pixel data.
| Extension | Typical Use | Notes |
|---|---|---|
| .jpg | Photographs | Compressed (lossy), reduces file size |
| .png | Graphics / web | Lossless, supports transparency (alpha channel) |
| .bmp | Windows bitmaps | Uncompressed, large file size |
| .tiff | High-quality images | Often used in printing or scientific imaging |
| .gif | Simple animations | 256 colours only |
| .raw | Camera sensor data | Direct sensor output, unprocessed |
Once loaded into a program (Python, R, etc.), images are usually represented as multi-dimensional arrays/tensors in uint8 (0 - 255) or float32 (0 - 1).
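As a quick illustration of this in-memory representation, here is a minimal sketch (assuming reticulate, Pillow and NumPy are installed; cat.png is a hypothetical file, not part of the workshop materials):

```r
library(reticulate)
PIL_Image <- import("PIL.Image", convert = FALSE)
np <- import("numpy", convert = FALSE)

img <- PIL_Image$open("cat.png")  # hypothetical file; decoding happens here
arr <- np$asarray(img)            # in-memory pixel array
print(arr$shape)                  # e.g. (height, width, 3) for an RGB image
print(arr$dtype)                  # typically uint8, i.e. values in 0-255
```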
A greyscale image is stored as a [height, width] array, while an RGB image is stored as a [height, width, channels] array.
ImageFolder expects a folder structure like this:
root/dog/xxx.png
root/dog/xxy.png
root/cat/123.png
root/cat/nsdf3.png
Once the images are loaded, we convert them to torch tensors stored in a torch Dataset.
DataLoader can sample images from a torch Dataset. We can create a wrapper around it to convert images to channel-last format and ensure they are on the desired device, such as a GPU.
device <- torch$accelerator$current_accelerator()
dataloader <- torch$utils$data$DataLoader(dataset,
                                          batch_size = 10L,
                                          shuffle = TRUE)
my_loader <- py_iterable_wrapper(dataloader, \(batch) {
  # Convert images from channel-first [N, C, H, W] to channel-last [N, H, W, C]
  batch[0] <- batch[0]$permute(0L, 2L, 3L, 1L)
  # Move images and labels to the accelerator and cast to float32
  batch[0] <- batch[0]$to(device = device, dtype = torch$float32)
  batch[1] <- batch[1]$to(device = device, dtype = torch$float32)
  return(tuple(batch[0], batch[1]))
})
py_for(c(x, y) ~ my_loader, {
  print(glue::glue("Images: {x$shape}, {x$dtype}, {x$device}"))
  print(glue::glue("Labels: {y$shape}, {y$dtype}, {y$device}"))
  break
})
Images: torch.Size([10, 2100, 2100, 3]), torch.float32, mps:0
Labels: torch.Size([10]), torch.float32, mps:0
For regression tasks, the image loading pipeline is a bit more involved, as it requires manually defining a custom Dataset.
We won’t cover it in this workshop, but more details are available on https://docs.pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files.
The CIFAR datasets are widely used benchmarks for image recognition, containing labelled 32 \times 32 RGB images across multiple object categories.
We extract the cat and dog images from the dataset, using 5,000 for training and 2,000 for testing in each class.

Exercise: plot a few sample images using map(), plot_rgb(), and wrap_plots().
A neural network is a computational model composed of interconnected layers of nodes that learn to recognise patterns and relationships in data.

A general mathematical form of a neural network can be written as:
\mathbf{y} = f^{(L)}\Big(\mathbf{W}^{(L)} f^{(L-1)}\big(\cdots f^{(2)}(\mathbf{W}^{(2)} f^{(1)}(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)}) \cdots \big) + \mathbf{b}^{(L)} \Big)
where:
- \mathbf{x} is the input and \mathbf{y} is the output,
- \mathbf{W}^{(l)} and \mathbf{b}^{(l)} are the weight matrix and bias vector of layer l,
- f^{(l)} is the activation function of layer l, and
- L is the number of layers.
By setting the dimension of \mathbf{x}, \mathbf{y} and \mathbf{b}^{(1)} to one, L = 1 and f^{(1)}(z) = z, we obtain
y = w^{(1)} x + b^{(1)}.

By setting the dimension of \mathbf{x} to p, the dimension of \mathbf{y} and \mathbf{b}^{(1)} to one, L = 1 and f^{(1)}(z) = z, we obtain
y = \mathbf{w}^{(1)} \mathbf{x} + b^{(1)}.

By setting the dimension of \mathbf{x} to p, the dimension of \mathbf{y} and \mathbf{b}^{(1)} to one, L = 1 and f^{(1)}(z) = \frac{e^z}{1 + e^z}, we obtain
y = f^{(1)}(\mathbf{w}^{(1)} \mathbf{x} + b^{(1)}) = \frac{e^{\mathbf{w}^{(1)} \mathbf{x} + b^{(1)}}}{1 + e^{\mathbf{w}^{(1)} \mathbf{x} + b^{(1)}}}.

By setting the dimension of \mathbf{x} to p, the dimension of \mathbf{y} and \mathbf{b}^{(1)} to K, L = 1 and f^{(1)}(\mathbf{z}) = \frac{\exp(\mathbf{z})}{\sum_{k=1}^{K}\exp(z_k)}, we obtain
\mathbf{y} = f^{(1)}(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}) = \frac{\exp(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)})}{\sum_{k=1}^{K} \exp\big((\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)})_k\big)}.

By setting the dimension of \mathbf{x} to p, the dimension of \mathbf{y} and \mathbf{b}^{(2)} to one, L = 2, f^{(1)}(z) = z and f^{(2)}(z) = \frac{e^z}{1 + e^z}, we obtain
\mathbf{h} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}, \quad y = f^{(2)}(\mathbf{w}^{(2)} \mathbf{h} + b^{(2)}) = \frac{e^{\mathbf{w}^{(2)} \mathbf{h} + b^{(2)}}}{1 + e^{\mathbf{w}^{(2)} \mathbf{h} + b^{(2)}}}.

You can choose the number of layers L, the dimension (width) of each layer, and the activation functions f^{(l)}.
Like many supervised learning methods, a neural network needs a differentiable loss function so that optimisers can compute gradients to update the model. Common choices are listed below, with a small numeric sketch after the list.
Regression (\hat{y} \in \mathbb{R}):
Mean Squared Error (MSE): \frac{1}{N} (\mathbf{y} - \hat{\mathbf{y}})^\top (\mathbf{y} - \hat{\mathbf{y}})
Mean Absolute Error (MAE): \frac{1}{N} \sum_{i=1}^{N} |\hat{y}_i - y_i|
Classification (\hat{y} \in (0,1)):
Categorical Cross-Entropy: -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log(\hat{y}_{ik})
Kullback–Leibler Divergence: \sum_{k=1}^{K} y_k \log \frac{y_k}{\hat{y}_k}
Hinge Loss: \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i \hat{y}_i), with labels coded as y_i \in \{-1, +1\}
…
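As a sanity check of these formulas, here is a minimal sketch in plain R with made-up labels and predictions; the cross-entropy shown is the two-class (binary) special case of the categorical cross-entropy above:

```r
y_true <- c(1, 0, 1, 1)          # made-up binary labels
y_hat  <- c(0.9, 0.2, 0.6, 0.4)  # made-up predicted probabilities

mse <- mean((y_true - y_hat)^2)   # Mean Squared Error
mae <- mean(abs(y_true - y_hat))  # Mean Absolute Error
bce <- -mean(y_true * log(y_hat) + (1 - y_true) * log(1 - y_hat))  # binary cross-entropy
c(mse = mse, mae = mae, bce = bce)
```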
Neural networks learn in the same way as other complex differentiable models, by iteratively updating parameters using gradients.
Unlike linear regression, we cannot simply set the derivative of the loss to zero because the solution has no closed form.
Computing gradients for all weights is highly nontrivial.
The 1986 backpropagation paper by Rumelhart, Hinton and Williams made this practical.
The goal is to compute \frac{\partial L}{\partial \boldsymbol{W}} and \frac{\partial L}{\partial \boldsymbol{b}}.
Using the chain rule, this can be decomposed layer by layer as \frac{\partial L}{\partial W^{(l)}_{ij}} = \underbrace{\frac{\partial L}{\partial a^{(l)}_i}}_{\boldsymbol{?}} ~~ \underbrace{\frac{\partial a^{(l)}_i}{\partial z^{(l)}_i}}_{\substack{\text{gradient of }\\ \text{ activation } \\ \text{function} } } ~~ \underbrace{\frac{\partial z^{(l)}_i}{\partial W^{(l)}_{ij}}}_{\substack{\text{output of}\\ \text{layer}~ l-1}}, where W^{(l)}_{ij} is the weight between the j-th unit in layer l-1 and the i-th unit in layer l, z^{(l)}_i is the pre-activation input, and a^{(l)}_i is the activation output.
To compute \frac{\partial L}{\partial a^{(l)}_i} recursively, we apply the chain rule through the next layer.
For l < L (hidden layers):
\frac{\partial L}{\partial a^{(l)}_i} = \sum_k W^{(l+1)}_{ki}~~ \underbrace{\frac{\partial L}{\partial a^{(l+1)}_k}}_{\text{recursive}} ~~\underbrace{\frac{\partial a^{(l+1)}_k}{\partial z^{(l+1)}_k}}_{\substack{\text{gradient of } \\ \text{activation} \\ \text{ function} }}
For l = L (output layer), \frac{\partial L}{\partial a^{(L)}_i} = \frac{\partial L}{\partial \hat{y}_i} which depends on the specific loss function used.
Similarly, since \mathbf{b}^{(l)} contributes additively to \mathbf{z}^{(l)}, its gradient is simply
\frac{\partial L}{\partial b^{(l)}_i} = \frac{\partial L}{\partial a^{(l)}_i}~~ \underbrace{\frac{\partial a^{(l)}_i}{\partial z^{(l)}_i}}_{\substack{\text{gradient of} \\ \text{activation} \\ \text{function}}}.
By applying these relations backward from the output layer to the first, we can efficiently compute all gradients.
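In practice we never code these recursions by hand: automatic differentiation performs the backward pass for us. A minimal sketch (assuming torch has been imported via reticulate with convert = FALSE, as elsewhere in these notes):

```r
x <- torch$tensor(c(1, 2, 3), dtype = torch$float32)
w <- torch$tensor(c(0.1, 0.2, 0.3), dtype = torch$float32, requires_grad = TRUE)
b <- torch$tensor(0, dtype = torch$float32, requires_grad = TRUE)

y_hat <- torch$sigmoid(torch$dot(w, x)$add(b))  # forward pass through a one-unit "network"
loss  <- (y_hat$sub(1))$pow(2L)                 # squared error against target y = 1
loss$backward()                                 # backward pass (backpropagation)
print(w$grad)                                   # dL/dw
print(b$grad)                                   # dL/db
```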
Gradient descent (GD) is an iterative method that updates model weights in the negative gradient direction to minimise a differentiable loss.
Stochastic gradient descent (SGD) is a variant of gradient descent that updates weights using the gradient from a single (or a small batch of) sample(s) at each step.
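To make the update rule concrete, here is a tiny gradient-descent sketch in plain R on a made-up one-parameter loss L(w) = (w - 3)^2:

```r
w  <- 0      # initial weight
lr <- 0.1    # learning rate
for (step in 1:50) {
  grad <- 2 * (w - 3)   # dL/dw at the current weight
  w <- w - lr * grad    # move in the negative gradient direction
}
w   # converges towards the minimiser, 3
```

SGD follows the same loop, except that each step uses a gradient estimated from a single sample or a small random mini-batch rather than the full dataset.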
Keras
Set the Keras backend to PyTorch, import Keras with automatic object conversion disabled, and build a small model: the input of shape (H, W, C) = (32, 32, 3) is flattened, passed through a dense hidden layer with ReLU activation, and mapped to a single sigmoid output.
Sys.setenv(KERAS_BACKEND = "torch")
keras <- import("keras", convert = FALSE)
input <- keras$Input(tuple(32L, 32L, 3L))
x <- keras$layers$Flatten()(input)
x <- keras$layers$Dense(32L, activation = "relu")(x)
output <- keras$layers$Dense(1L, activation = "sigmoid")(x)
model <- keras$Model(input, output)
Keras model summary
Data flow (# param): \text{Input} ~~\underbrace{\longrightarrow}_{0} ~~\text{Flatten}~~ \underbrace{\longrightarrow}_{3072 \times 32 + 32 = 98336} ~~\text{Hidden}~~ \underbrace{\longrightarrow}_{32 \times 1 + 1 = 33} ~~\text{Output}
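The table below is the model summary; it was presumably produced by a call like:

```r
model$summary()   # prints layer output shapes and parameter counts
```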
Model: "functional"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input_layer (InputLayer) │ (None, 32, 32, 3) │ 0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ flatten (Flatten) │ (None, 3072) │ 0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense (Dense) │ (None, 32) │ 98,336 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense) │ (None, 1) │ 33 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 98,369 (384.25 KB)
Trainable params: 98,369 (384.25 KB)
Non-trainable params: 0 (0.00 B)
It is important to standardise the images so that their values are around 0, as modern weight initialisers expect this.
Convert R arrays to torch tensors with float32 data type:
x_train_mean <- mean(cat_and_dog$x_train)
x_train_sd <- sd(cat_and_dog$x_train)
x <- torch$tensor((cat_and_dog$x_train - x_train_mean) / x_train_sd,
                  dtype = torch$float32)
y <- torch$tensor(cat_and_dog$y_train, dtype = torch$float32)
x_test <- torch$tensor((cat_and_dog_test$x_test - x_train_mean) / x_train_sd,
                       dtype = torch$float32)
y_test <- torch$tensor(cat_and_dog_test$y_test, dtype = torch$float32)
Train for 5 epochs (the model will see x five times):
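The compile and fit calls are not reproduced in these notes; a sketch that would produce a log like the one below, assuming the Adam optimiser, binary cross-entropy, and a batch size of 64 (which matches the 157 steps per epoch for 10,000 training images):

```r
model$compile(optimizer = "adam",
              loss = "binary_crossentropy",
              metrics = list("accuracy"))
model$fit(x, y, epochs = 5L, batch_size = 64L, verbose = 2L)
```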
Epoch 1/5
157/157 - 1s - 8ms/step - accuracy: 0.5737 - loss: 0.7175
Epoch 2/5
157/157 - 1s - 7ms/step - accuracy: 0.6250 - loss: 0.6496
Epoch 3/5
157/157 - 2s - 12ms/step - accuracy: 0.6470 - loss: 0.6254
Epoch 4/5
157/157 - 2s - 10ms/step - accuracy: 0.6659 - loss: 0.6081
Epoch 5/5
157/157 - 2s - 10ms/step - accuracy: 0.6737 - loss: 0.5960
Prediction:
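The predicted probabilities below have shape (10000, 1), matching the training tensor x; they presumably come from a call such as:

```r
preds <- model$predict(x)   # predicted probability of the positive class, one row per image
preds
```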
array([[0.17342566],
[0.22466913],
[0.3888184 ],
...,
[0.0303942 ],
[0.6031979 ],
[0.6493077 ]], shape=(10000, 1), dtype=float32)
Train and test performance:
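The performance numbers are not reproduced here; they would typically be obtained with calls like:

```r
model$evaluate(x, y, verbose = 0L)            # training loss and accuracy
model$evaluate(x_test, y_test, verbose = 0L)  # test loss and accuracy
```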
What does keras do under the hood?
- keras converts the input data to torch tensors for us.
- torch builds a computation graph to compute the loss L(\hat{\boldsymbol{y}},\boldsymbol{y}) and track the gradients \nabla L(\hat{\boldsymbol{y}},\boldsymbol{y}).
- keras updates the weights based on the gradients and optimiser settings, \boldsymbol{\theta}^{*} \leftarrow \mathcal{O}(\boldsymbol{\theta}, \nabla L), then clears the gradients.
Check the documentation: https://keras.io/api/layers/
- Dropout: randomly sets input units to 0 at a specified rate during training, while scaling the remaining active units by 1 / (1 - rate) to maintain the overall activation level.
- Concatenate: merges multiple inputs along a specified axis.
- GaussianNoise: adds random noise to inputs during training to improve generalisation.
- BatchNormalization: normalises its inputs over the batch, helping stabilise and speed up training.
Exercise: add a Dropout layer (rate = 0.3) to the model and retrain (see the sketch below).
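A sketch of the dropout exercise, inserting a Dropout layer with rate = 0.3 into the same architecture (model_dropout is a hypothetical name, not from the workshop code):

```r
input <- keras$Input(tuple(32L, 32L, 3L))
x <- keras$layers$Flatten()(input)
x <- keras$layers$Dense(32L, activation = "relu")(x)
x <- keras$layers$Dropout(0.3)(x)   # zeroes 30% of the units at random during training
output <- keras$layers$Dense(1L, activation = "sigmoid")(x)
model_dropout <- keras$Model(input, output)
```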
Slides URL: https://ibsar-cv-workshop.patrickli.org/