What Is Data?
Data is the raw material of the information age. Every app you use, every AI model you train and every business decision backed by evidence starts with data. Before you can work with databases, write SQL queries or build data pipelines, you need to understand what data actually is, how it differs from information and knowledge and how it flows through systems.
Data, Information and Knowledge
These three terms are often used interchangeably, but they mean different things:
Data is raw, unprocessed facts and figures. A list of numbers -- 28, 35, 22, 41 -- is data. On its own, it has no meaning, because you do not know what the numbers represent.
Information is data that has been processed, organised and given context. "The ages of four job applicants are 28, 35, 22 and 41" is information. The data now has meaning because it has context.
Knowledge is information combined with experience, interpretation and judgement. "The youngest applicant at 22 may need more mentoring but brings recent academic knowledge, while the 41-year-old likely has the most industry experience" is knowledge. It requires human interpretation.
In ICT, we build systems that collect data, process it into information and present it in ways that help humans (or AI systems) generate knowledge.
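The step from data to information can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the field names applicant_id and age and the summary keys are assumptions made up for the example, reusing the four ages from above.

```python
from statistics import mean

# Data: raw values with no context attached.
raw_values = [28, 35, 22, 41]

# Information: the same values, now labelled as applicant ages.
# (applicant_id and age are hypothetical field names.)
applicants = [
    {"applicant_id": i, "age": age}
    for i, age in enumerate(raw_values, start=1)
]

# Processing the labelled data gives a summary a person can reason about.
summary = {
    "count": len(applicants),
    "youngest": min(a["age"] for a in applicants),
    "oldest": max(a["age"] for a in applicants),
    "mean_age": mean(a["age"] for a in applicants),
}

print(summary)
```

The code stops at information. Deciding what the youngest or oldest age implies for hiring is the knowledge step, which still depends on interpretation and judgement.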
Structured, Unstructured and Semi-Structured Data
Structured data fits neatly into rows and columns. A database table with columns for name, email and age is structured data. Spreadsheets, relational databases and CSV files all hold structured data. It is the easiest type to search, filter and analyse.
Unstructured data has no predefined format. Emails, social media posts, images, videos, audio recordings and PDF documents are all unstructured. A commonly cited estimate is that roughly 80% of the world's data is unstructured. Processing it requires techniques like natural language processing (NLP) and computer vision.
Semi-structured data has some organisational properties but does not conform to a rigid table structure. JSON, XML and HTML are semi-structured -- they have tags or keys that provide some structure, but the data within can vary from record to record.
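The difference between structured and semi-structured data shows up as soon as you load each into a program. The sketch below uses Python's built-in csv and json modules; the names, email addresses and fields are invented sample values, not data from any real system.

```python
import csv
import io
import json

# Structured: every row has exactly the same columns.
csv_text = (
    "name,email,age\n"
    "Thandi,thandi@example.com,28\n"
    "Sipho,sipho@example.com,35\n"
)
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys give some structure, but records can differ.
json_text = """
[
  {"name": "Thandi", "email": "thandi@example.com", "skills": ["SQL", "Python"]},
  {"name": "Sipho", "age": 35}
]
"""
records = json.loads(json_text)

# Compare the set of fields found in each record.
csv_columns = [set(row.keys()) for row in rows]
json_keys = [set(rec.keys()) for rec in records]

same_columns = all(cols == csv_columns[0] for cols in csv_columns)
same_keys = all(keys == json_keys[0] for keys in json_keys)

print("CSV rows share one schema:", same_columns)
print("JSON records share one schema:", same_keys)
```

Because the JSON records do not share the same keys, code that reads them has to cope with missing fields (for example with rec.get("age")), whereas a table guarantees that every column exists in every row.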
The Data Lifecycle
Data moves through a predictable lifecycle:
- Collection: Data is generated or gathered from sources -- user input, sensors, APIs, web scraping, manual entry or automated systems.
- Storage: Data is saved in databases, data warehouses, data lakes, file systems or cloud storage.
- Processing: Raw data is cleaned, transformed and prepared for analysis. This includes removing duplicates, handling missing values and converting formats.
- Analysis: Data is examined to find patterns, trends and insights. This can range from simple spreadsheet formulas to complex machine learning models.
- Sharing/Visualisation: Insights are communicated through reports, dashboards, charts and presentations.
- Archiving/Deletion: Data that is no longer needed for active use is archived for compliance or historical purposes, or securely deleted when retention periods expire.
Understanding this lifecycle is important because each stage calls for different tools and skills, and legal requirements such as South Africa's Protection of Personal Information Act (POPIA) apply differently depending on where data is in its lifecycle.
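The Processing stage is the easiest one to show in code. The sketch below applies the three tasks named in that step (removing duplicates, handling missing values and converting formats) to a handful of made-up records. The function name clean and the sample records are hypothetical, and real pipelines usually rely on dedicated tools rather than hand-written loops.

```python
# Raw records as they might arrive from a form or CSV export:
# every value is a string, one row is repeated, one age is missing.
raw_records = [
    {"name": "Thandi", "age": "28"},
    {"name": "Sipho", "age": "35"},
    {"name": "Thandi", "age": "28"},  # exact duplicate
    {"name": "Lerato", "age": ""},    # missing value
]


def clean(records):
    """Remove exact duplicates, mark missing ages as None, convert ages to int."""
    seen = set()
    cleaned = []
    for rec in records:
        key = (rec["name"], rec["age"])
        if key in seen:
            continue  # skip rows already seen
        seen.add(key)
        age_text = rec["age"].strip()
        age = int(age_text) if age_text else None
        cleaned.append({"name": rec["name"], "age": age})
    return cleaned


cleaned = clean(raw_records)
for rec in cleaned:
    print(rec)
```

Recording a missing age as None, rather than guessing a value or silently dropping the row, keeps the gap visible to whoever performs the Analysis stage.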
Data as an Asset
Modern organisations treat data as a strategic asset -- something that has measurable value. Customer data helps companies personalise services. Transaction data reveals purchasing patterns. Sensor data enables predictive maintenance. In South Africa, data is increasingly recognised as a driver of economic growth, particularly in sectors like banking (transaction analytics), retail (customer behaviour) and healthcare (patient outcome analysis).
However, data also carries liability. Holding personal information creates obligations under POPIA. Storing data costs money (infrastructure, security, compliance). And poor-quality data can lead to bad decisions -- the "garbage in, garbage out" principle that is especially critical in AI and machine learning.
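A small example shows how "garbage in, garbage out" plays out. Assume, purely for illustration, that one of the four applicant ages used earlier was captured as 410 instead of 41. The plausibility range of 0 to 120 and the helper is_plausible_age are likewise assumptions for this sketch, not an established rule.

```python
from statistics import mean

ages = [28, 35, 22, 41]
ages_with_error = [28, 35, 22, 410]  # hypothetical typo: 410 instead of 41

print("Mean of clean data:", mean(ages))
print("Mean with one bad value:", mean(ages_with_error))


def is_plausible_age(value, low=0, high=120):
    """Very simple quality check: keep values inside an assumed range."""
    return low <= value <= high


# Filtering removes the bad value but also loses that applicant's record.
filtered = [a for a in ages_with_error if is_plausible_age(a)]
print("After the quality check:", filtered)
```

A single bad value is enough to make the average meaningless, and the filter only hides the problem by discarding a record. Catching the error at collection time is cheaper than repairing it later.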