Images as Arrays of Numbers

Before a computer can understand an image, it must first represent it as something it can process: numbers. Every photograph, screenshot or video frame your computer displays is, at its core, a structured grid of numerical values. This lesson explains exactly how that works and why it matters for everything that follows in computer vision.

What is a Pixel?

The word "pixel" comes from "picture element." It is the smallest addressable unit of a digital image. Think of a digital image as a mosaic: zoom in far enough and you see individual coloured squares packed tightly together. Each of those squares is a pixel.

A typical smartphone camera produces images that are several thousand pixels wide and several thousand pixels tall. A 12-megapixel photo, for example, contains roughly 12 million individual pixels arranged in a grid (4000 × 3000 is one such layout).

Grayscale Images

The simplest type of digital image is a grayscale image: one with no colour, only shades of grey ranging from black to white.

In a grayscale image, each pixel is represented by a single number. By convention, this number ranges from 0 to 255:

0 represents pure black
255 represents pure white
Values in between represent progressively lighter shades of grey

Why 0 to 255? Because 256 values fit perfectly into one byte (8 bits) of computer memory. A byte can store any integer from 0 to 255, which gives us 256 distinct shades: more than enough for a smooth gradient from black to white.
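The byte arithmetic above can be checked directly. This is a minimal sketch that assumes NumPy is installed; np.iinfo simply reports the limits of an integer type.

```python
import numpy as np

# 8 bits give 2**8 distinct values.
levels = 2 ** 8
print(levels)  # 256

# NumPy's unsigned 8-bit integer type covers exactly that range.
limits = np.iinfo(np.uint8)
print(limits.min, limits.max)  # 0 255
```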

A grayscale image that is 640 pixels wide and 480 pixels tall is stored as a grid (or matrix) of numbers with 480 rows and 640 columns. In Python using the NumPy library, this would be a two-dimensional array with shape (480, 640).
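As a small illustration, the sketch below builds such an array by hand rather than loading a real photo. The variable name gray and the position of the white square are arbitrary choices made for the example.

```python
import numpy as np

# 480 rows and 640 columns, one unsigned byte per pixel, all black to start.
gray = np.zeros((480, 640), dtype=np.uint8)

# Paint a white square over rows 100-199 and columns 200-299.
gray[100:200, 200:300] = 255

print(gray.shape)      # (480, 640)
print(gray.dtype)      # uint8
print(gray[0, 0])      # 0 (black)
print(gray[150, 250])  # 255 (white)
```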

Colour Images: RGB Channels

Colour images are more interesting. Human vision perceives colour through three types of cone cells in the eye, each most sensitive to a different band of wavelengths, corresponding roughly to red, green and blue light. Digital cameras and displays exploit this by representing colour as a combination of three values: red (R), green (G) and blue (B).

A colour image is not one grid of numbers but three grids stacked on top of each other. Each grid is called a channel:

The red channel records how much red light is at each pixel position
The green channel records how much green light is at each pixel position
The blue channel records how much blue light is at each pixel position

Each channel, like a grayscale image, stores values from 0 to 255. When the three channels are combined, they can represent 256 × 256 × 256 = 16,777,216 distinct colours:

R	G	B	Resulting colour
255	0	0	Pure red
0	255	0	Pure green
0	0	255	Pure blue
255	255	0	Yellow
255	255	255	White
0	0	0	Black
128	0	128	Purple

A colour image with dimensions 640×480 is stored as a three-dimensional array with shape (480, 640, 3): 480 rows, 640 columns and 3 channels. The total number of values stored is 480 × 640 × 3 = 921,600 numbers for a single, relatively small image.
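A short sketch of the same idea in NumPy follows. It assumes the channels are stored in R, G, B order along the last axis; the array is built by hand, and the two colours are taken from the table above.

```python
import numpy as np

# 480 rows, 640 columns, 3 channels (assumed order: R, G, B).
colour = np.zeros((480, 640, 3), dtype=np.uint8)

colour[:, :320] = (255, 0, 0)    # left half: pure red
colour[:, 320:] = (255, 255, 0)  # right half: yellow

# Each channel on its own is a 2-D grid, just like a grayscale image.
red_channel = colour[:, :, 0]
green_channel = colour[:, :, 1]
blue_channel = colour[:, :, 2]

print(colour.shape)              # (480, 640, 3)
print(colour.size)               # 921600
print(colour[0, 0].tolist())     # [255, 0, 0]
print(colour[0, 639].tolist())   # [255, 255, 0]
print(red_channel.shape)         # (480, 640)
```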

Image Dimensions and Shape

When working with images in code, you will constantly refer to their shape. The convention used in most computer vision libraries (including NumPy and OpenCV) is:

(height, width, channels)

Note that height comes first, then width. This is the opposite of how we often describe screen resolutions (1920×1080 means 1920 wide, 1080 tall), which can be a source of confusion when you start coding.
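The ordering is easiest to see when unpacking a shape and indexing a pixel. In this sketch the frame is an empty array standing in for a real image, and the coordinates x and y are arbitrary.

```python
import numpy as np

# A Full HD frame, usually described as "1920×1080" (width × height).
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)

# The array shape lists height first.
height, width, channels = frame.shape
print(height, width, channels)  # 1080 1920 3

# Pixels are indexed as [row, column], which means [y, x].
x, y = 1500, 200
frame[y, x] = (255, 255, 255)   # set one pixel to white
print(frame[y, x].tolist())     # [255, 255, 255]
```

Writing frame[x, y] here would fail with an IndexError, because 1500 is larger than the number of rows.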

Common image sizes you will encounter:

Thumbnail: 128×128 pixels (often used as input to neural networks)
SD video frame: 640×480 pixels
HD video frame: 1280×720 pixels
Full HD frame: 1920×1080 pixels
4K frame: 3840×2160 pixels (about 8.3 million pixels per frame, or nearly 25 million values across three colour channels)
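The pixel counts for these sizes can be worked out with plain arithmetic. The sketch below uses only the sizes listed above; the dictionary names are chosen for the example.

```python
# Width × height pairs, in the order the sizes are usually quoted.
sizes = {
    "Thumbnail": (128, 128),
    "SD video frame": (640, 480),
    "HD video frame": (1280, 720),
    "Full HD frame": (1920, 1080),
    "4K frame": (3840, 2160),
}

pixel_counts = {name: width * height for name, (width, height) in sizes.items()}

for name, pixels in pixel_counts.items():
    # A colour image stores 3 values (R, G, B) for every pixel.
    print(f"{name}: {pixels:,} pixels, {pixels * 3:,} values in colour")
```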

How Images are Stored in Memory

When an image is loaded into memory for processing, each pixel value is typically stored as an unsigned 8-bit integer (uint8) occupying one byte. A full HD colour image therefore requires:

1920 × 1080 × 3 bytes = 6,220,800 bytes ≈ 6 MB
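NumPy reports the same figure through the nbytes attribute of an array. A minimal sketch, using an empty array in place of a real image:

```python
import numpy as np

full_hd = np.zeros((1080, 1920, 3), dtype=np.uint8)

print(full_hd.itemsize)            # 1 (bytes per value for uint8)
print(full_hd.nbytes)              # 6220800
print(full_hd.nbytes / 1_000_000)  # 6.2208 (megabytes)
```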

In practice, image files on disk are much smaller because they are compressed. JPEG uses lossy compression (it discards some information to save space). PNG uses lossless compression (it recovers the exact original values). When you load an image into memory for processing, it is decompressed back to its full uncompressed form.

Deep learning models often work with images at a smaller size, such as 224×224 or 256×256 pixels, because processing every pixel of a high-resolution image is computationally expensive. Resizing is therefore one of the most common preprocessing steps.
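To make the idea of resizing concrete, here is a rough sketch of nearest-neighbour resizing written with NumPy indexing alone. The helper resize_nearest is hypothetical and written for this example. It ignores aspect ratio and does no smoothing, so in practice an image library's resize function with better interpolation would normally be used instead.

```python
import numpy as np

def resize_nearest(image, new_height, new_width):
    """Hypothetical helper: pick the nearest source pixel for each output pixel."""
    height, width = image.shape[:2]
    # For each output row/column, compute which source row/column to copy.
    rows = np.arange(new_height) * height // new_height
    cols = np.arange(new_width) * width // new_width
    # rows[:, None] and cols broadcast to a (new_height, new_width) grid of indices.
    return image[rows[:, None], cols]

# Random values standing in for a Full HD colour frame.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(1080, 1920, 3), dtype=np.uint8)

small = resize_nearest(frame, 224, 224)
print(frame.shape, frame.nbytes)  # (1080, 1920, 3) 6220800
print(small.shape, small.nbytes)  # (224, 224, 3) 150528
```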

Data Types and Normalisation

Pixel values stored as uint8 range from 0 to 255. However, most neural networks expect input values in a different range, typically:

[0, 1]: divide every pixel value by 255.0
[-1, 1]: divide by 127.5, then subtract 1

This process is called normalisation and it helps neural networks train more stably. You will see it applied consistently throughout this course.
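Both conversions can be sketched on a tiny array. Casting to float32 before dividing is an assumption made here because it is a common choice for network inputs; the lesson itself only specifies the target ranges.

```python
import numpy as np

# A tiny 2×2 grayscale "image" covering the full uint8 range.
pixels = np.array([[0, 64], [128, 255]], dtype=np.uint8)

# Scale to [0, 1]: divide every pixel value by 255.0.
unit = pixels.astype(np.float32) / 255.0

# Scale to [-1, 1]: divide by 127.5, then subtract 1.
signed = pixels.astype(np.float32) / 127.5 - 1.0

print(unit.min(), unit.max())      # 0.0 1.0
print(signed.min(), signed.max())  # -1.0 1.0
print(unit.dtype)                  # float32
```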

Why This Matters

Understanding images as arrays of numbers is the foundation of everything in computer vision. When you apply a filter to detect edges, you are performing arithmetic on these arrays. When a convolutional neural network processes an image, it is applying a series of mathematical operations to these grids of numbers. The more concrete your mental model of what an image actually is, the easier it becomes to reason about what algorithms are doing and why they work.
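As a taste of what "arithmetic on arrays" means, the sketch below finds a vertical edge by subtracting each pixel from its right-hand neighbour. This is a deliberately simplified stand-in for a real edge filter, and the tiny 4×6 image is made up for the example.

```python
import numpy as np

# Left half black, right half white: one vertical edge in the middle.
gray = np.zeros((4, 6), dtype=np.uint8)
gray[:, 3:] = 255

# Cast to a signed type first, because uint8 subtraction wraps around below 0.
signed = gray.astype(np.int16)

# Absolute difference between horizontally adjacent pixels.
edges = np.abs(np.diff(signed, axis=1))

print(gray[0].tolist())   # [0, 0, 0, 255, 255, 255]
print(edges[0].tolist())  # [0, 0, 255, 0, 0]
```

The only large value sits exactly where the image jumps from black to white, which is the edge.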

Quiz: A colour image is 512 pixels wide and 384 pixels tall. What is its shape as a NumPy array and how many total numerical values does it contain?