NumPy and Pandas for AI
Before you can train a model, you need to manipulate data. Before you can manipulate data effectively in Python, you need two libraries: NumPy and Pandas. They sit at the base of the entire Python AI stack. Every major framework (scikit-learn, TensorFlow, PyTorch) either builds on top of them or interoperates with them seamlessly.
NumPy: Arrays at the Heart of Machine Learning
NumPy (Numerical Python) provides the ndarray: an n-dimensional array that is the fundamental data structure for numerical computing in Python. Almost everything in machine learning boils down to operations on arrays of numbers: pixel values in an image, word embeddings, weight matrices in a neural network, a batch of training examples.
Why Not Just Use Python Lists?
Python lists are flexible but slow for numerical work, because each element is a separate Python object and every operation goes through the interpreter. NumPy arrays store data of a single type in contiguous blocks of memory, support vectorised operations and are implemented in C under the hood. As a result, operations on NumPy arrays are often 10–100× faster than the equivalent Python loop.
import numpy as np
# Create a 1-D array
a = np.array([1, 2, 3, 4, 5])
# Element-wise operations: no loop required
print(a * 2)       # [ 2  4  6  8 10]
print(a ** 2)      # [ 1  4  9 16 25]
print(np.sqrt(a))  # approximately [1. 1.41 1.73 2. 2.24]
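To make the speed claim concrete, here is a minimal timing sketch. The array size, the repeat count and the helper names (square_with_loop, square_with_numpy) are arbitrary choices for illustration, and the measured ratio will vary with hardware and NumPy version.

```python
import timeit

import numpy as np

n = 200_000
values_list = list(range(n))
# int64 is requested explicitly so the squares cannot overflow on any platform
values_array = np.arange(n, dtype=np.int64)


def square_with_loop():
    # Pure-Python version: one interpreter step per element
    return [v * v for v in values_list]


def square_with_numpy():
    # Vectorised version: a single call into compiled code
    return values_array * values_array


loop_seconds = timeit.timeit(square_with_loop, number=5)
numpy_seconds = timeit.timeit(square_with_numpy, number=5)
print(f"list comprehension: {loop_seconds:.4f} s")
print(f"NumPy:              {numpy_seconds:.4f} s")

# Both approaches compute the same values
same_result = np.array_equal(np.array(square_with_loop()), square_with_numpy())
print("same result:", same_result)
```

Timing both versions on your own machine is more informative than any quoted ratio, since the gap depends on the operation and the array size.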
Vectorisation and Broadcasting
Vectorisation means expressing operations on entire arrays rather than writing explicit loops. This is not just a stylistic preference: it is the difference between code that runs in milliseconds and code that runs in seconds on large datasets.
Broadcasting is NumPy's mechanism for performing operations between arrays of different shapes. When you subtract the mean from every column of a matrix to normalise features, broadcasting handles the shape mismatch automatically.
# A (1000, 10) matrix of features
X = np.random.randn(1000, 10)
# Subtract the column mean from every row: broadcasting does the heavy lifting
X_centred = X - X.mean(axis=0) # mean shape: (10,) broadcasts to (1000, 10)
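The same idea extends to full standardisation (zero mean, unit variance per column). The sketch below is one way to do it, assuming synthetic data from a seeded generator. It also shows the keepdims argument, which is needed when the statistic is computed per row instead of per column.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 10))

# Per-column statistics have shape (10,), which lines up with the last axis of X
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)
X_standardised = (X - col_mean) / col_std  # shape (1000, 10)

# Per-row statistics need keepdims=True so the shape is (1000, 1), not (1000,).
# A (1000,) array cannot broadcast against (1000, 10) and would raise an error.
row_mean = X.mean(axis=1, keepdims=True)
X_row_centred = X - row_mean

print(X_standardised.shape, row_mean.shape)
```

Broadcasting compares shapes from the last axis backwards, and two dimensions are compatible when they are equal or one of them is 1.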
Common NumPy Patterns in AI Workflows
- Reshaping: array.reshape(n_rows, n_cols) converts between the shapes expected by different parts of your pipeline
- Stacking: np.vstack, np.hstack and np.concatenate combine arrays from different sources
- Indexing and masking: selecting subsets of data based on conditions
- Linear algebra: np.dot, np.linalg.inv and np.linalg.eig; matrix multiplication and decomposition underpin many ML algorithms
- Random number generation: np.random.randn and np.random.choice for initialising weights, shuffling data and sampling batches
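The patterns above can be sketched in a few lines. All data here is synthetic and the variable names are illustrative. Two substitutions are worth flagging: the example uses NumPy's newer Generator API (np.random.default_rng) in place of np.random.randn and np.random.choice, and np.linalg.eigh in place of np.linalg.eig because the matrix being decomposed is symmetric.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Reshaping: flatten a batch of 4 synthetic 8x8 "images" into feature vectors
images = rng.random((4, 8, 8))
flat = images.reshape(4, -1)  # -1 lets NumPy infer the 64

# Stacking: combine arrays along rows or columns
a = np.ones((2, 3))
b = np.zeros((2, 3))
stacked_rows = np.vstack([a, b])  # shape (4, 3)
stacked_cols = np.hstack([a, b])  # shape (2, 6)

# Indexing and masking: keep only the entries that satisfy a condition
scores = np.array([0.2, 0.9, 0.5, 0.7])
high = scores[scores > 0.6]

# Linear algebra: a batch of 5 inputs times a (3, 2) weight matrix
W = rng.normal(size=(3, 2))
x = rng.normal(size=(5, 3))
out = x @ W  # shape (5, 2)

# Eigendecomposition of the (symmetric) covariance matrix of x
cov = np.cov(x, rowvar=False)  # shape (3, 3)
eigvals, eigvecs = np.linalg.eigh(cov)

# Random number generation: shuffle indices and sample a batch without replacement
indices = rng.permutation(10)
batch = rng.choice(10, size=4, replace=False)

print(flat.shape, stacked_rows.shape, stacked_cols.shape, out.shape)
```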
The Relationship Between NumPy and ML Frameworks
PyTorch tensors and TensorFlow tensors are conceptually similar to NumPy arrays. PyTorch even supports zero-copy conversion between tensors and NumPy arrays (when both are on CPU). When you load a dataset, preprocess it and hand it to a model, you are typically moving data through NumPy at some point in the chain.
Pandas: DataFrames for Datasets
Where NumPy excels at homogeneous numerical arrays, Pandas is designed for heterogeneous tabular data: datasets with different column types, missing values, named columns and time series. Real-world data almost always arrives in this form.
The DataFrame
A Pandas DataFrame is a table: rows and columns, like a spreadsheet in memory. Each column is a Pandas Series. Columns can have different types: integers, floats, strings, booleans, datetimes.
import pandas as pd
df = pd.read_csv('customers.csv')
print(df.head()) # first 5 rows
df.info() # prints column names, types, non-null counts (it returns None, so no print() is needed)
print(df.describe()) # count, mean, std, min, quartiles, max
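If you do not have a customers.csv file to hand, the same inspection calls can be tried on a small in-memory CSV. The column names and values below are invented for illustration only.

```python
import io

import pandas as pd

# Hypothetical data: three customers, with one missing age and one missing income
csv_text = """customer_id,age,income,city,churned
1,34,52000.0,Leeds,False
2,,61000.0,York,True
3,45,,Leeds,False
"""

df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.head())
df_demo.info()
print(df_demo.dtypes)
```

Note that the age column is read as a float, not an integer, because the missing value is represented as NaN. Each column keeps its own type, which is the heterogeneity described above.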
Common Data Manipulation Patterns in AI Workflows
Selecting and filtering
# Select a column
ages = df['age']
# Filter rows
adults = df[df['age'] >= 18]
# Select multiple columns
features = df[['age', 'income', 'tenure']]
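Filters are often combined. The sketch below, on a small invented DataFrame, shows two common variations: joining conditions with & (each condition wrapped in parentheses) and using .loc to filter rows and pick columns in one step.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "age": [17, 25, 42, 68],
    "income": [0, 32000, 58000, 21000],
    "tenure": [1, 3, 10, 25],
})

# Combine conditions with & (and) or | (or); the parentheses are required
working_age = df[(df["age"] >= 18) & (df["age"] < 65)]

# .loc takes a row condition and a list of columns together
subset = df.loc[df["income"] > 30000, ["age", "tenure"]]

print(working_age)
print(subset)
```

Python's plain and / or keywords do not work element-wise on a Series, which is why the bitwise operators are used here.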
Handling missing values
Missing data is the norm in real datasets. Pandas makes it easy to identify and handle:
df.isnull().sum() # count missing per column
df = df.dropna(subset=['income']) # drop rows where income is missing (dropna returns a new DataFrame)
df['age'] = df['age'].fillna(df['age'].median()) # impute with median (assign back; avoid inplace on a column selection)
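Here is the same sequence on a tiny invented DataFrame, so the effect of each step is visible. The names raw and clean are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
raw = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [30000.0, 45000.0, np.nan, 52000.0],
})

# Count missing values per column
missing_per_column = raw.isnull().sum()
print(missing_per_column)

# Drop rows where income is missing; .copy() makes the result an independent frame
clean = raw.dropna(subset=["income"]).copy()

# Impute the remaining missing ages with the median of the rows that are left
clean["age"] = clean["age"].fillna(clean["age"].median())

print(clean)
```

In a real project the median would usually be computed on the training split only and then reused for validation and test data, so that no information leaks across splits.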
Encoding and transformation
Before passing data to a model, categorical columns must be encoded as numbers:
df['gender_encoded'] = df['gender'].map({'M': 0, 'F': 1})
dummies = pd.get_dummies(df['city'], prefix='city')
df = pd.concat([df, dummies], axis=1)
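Two details of these calls are easy to miss, and the sketch below (on invented data) shows both. First, .map leaves any value that is not in the dictionary as NaN. Second, recent pandas versions return boolean dummy columns by default, so passing dtype=int is one way to get 0/1 integers.

```python
import pandas as pd

# Hypothetical data; "X" is deliberately absent from the mapping below
df = pd.DataFrame({
    "gender": ["M", "F", "F", "X"],
    "city": ["Leeds", "York", "Leeds", "Bath"],
})

# Values missing from the mapping become NaN, so check for them afterwards
df["gender_encoded"] = df["gender"].map({"M": 0, "F": 1})
unmapped = int(df["gender_encoded"].isna().sum())
print("unmapped gender values:", unmapped)

# One-hot encode the city column as 0/1 integers
dummies = pd.get_dummies(df["city"], prefix="city", dtype=int)
df = pd.concat([df, dummies], axis=1)

print(df)
```

Each row has exactly one city_ column set to 1, which is what makes this a one-hot encoding.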
Grouping and aggregation
# Average churn rate by contract type
df.groupby('contract_type')['churned'].mean()
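When several statistics are needed per group, named aggregation keeps the result readable. This is a small sketch on invented data; the monthly_charge column and the output column names are hypothetical.

```python
import pandas as pd

# Hypothetical churn data
df = pd.DataFrame({
    "contract_type": ["monthly", "monthly", "annual", "annual", "monthly"],
    "churned": [True, False, False, False, True],
    "monthly_charge": [30.0, 45.0, 20.0, 25.0, 60.0],
})

# The mean of a boolean column is the fraction of True values
churn_rate = df.groupby("contract_type")["churned"].mean()

# Named aggregation: output_column=(input_column, function)
summary = df.groupby("contract_type").agg(
    churn_rate=("churned", "mean"),
    avg_charge=("monthly_charge", "mean"),
    customers=("churned", "size"),
)

print(summary)
```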
From DataFrame to NumPy
When your data is clean, the final step before model training is converting to a NumPy array:
X = df[feature_cols].to_numpy() # returns a NumPy array; the pandas docs recommend .to_numpy() over the older .values
y = df['target'].to_numpy()
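A quick check of shapes and dtypes after the conversion catches many downstream errors. The sketch below uses an invented DataFrame and a hypothetical feature_cols list. Requesting dtype=np.float64 is an optional choice that gives the feature matrix a single floating-point type.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned dataset
df = pd.DataFrame({
    "age": [25, 40, 31],
    "income": [30000.0, 45000.0, 52000.0],
    "tenure": [1, 10, 4],
    "target": [0, 1, 0],
})
feature_cols = ["age", "income", "tenure"]

# Feature matrix: one row per example, one column per feature
X = df[feature_cols].to_numpy(dtype=np.float64)
# Target vector: one label per example
y = df["target"].to_numpy()

print(X.shape, X.dtype)
print(y.shape, y.dtype)
```

If a column still contains strings or missing values at this point, the conversion either fails or produces an object array, which is a useful signal that the cleaning steps are incomplete.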
Why These Libraries Remain Central
Despite the rise of higher-level tools, NumPy and Pandas remain indispensable because:
- Every other library in the ecosystem speaks their language
- They cover the data preparation phase, which is often estimated to consume 60–80% of a typical ML project's time
- Their APIs are stable, well-documented and understood by everyone in the field
- For structured tabular data and classical ML, they are still the fastest path from raw data to trained model
Quiz:
- What is vectorisation and why does it matter for performance in AI workloads?
- Describe the typical sequence of Pandas operations you would perform to prepare a raw CSV dataset for model training.