Data science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract meaningful insights and knowledge from structured and unstructured data. It combines elements of statistics, computer science, and domain expertise to turn raw data into actionable information that drives decision-making.
Data scientists work across the entire lifecycle of data:
1. **Problem definition.** Before touching any data, a data scientist must understand the business question. What decision needs to be made? What outcome would be valuable? A well-defined problem is half the solution.
2. **Data collection.** Data may come from databases, APIs, web scraping, surveys, sensors, or third-party providers. Understanding where data lives and how to access it is a critical skill.
3. **Data cleaning.** Real-world data is messy: it contains missing values, duplicates, inconsistencies, and errors. Data cleaning typically consumes 60-80% of a data scientist's time.
4. **Exploratory Data Analysis (EDA).** EDA uses statistics and visualisation to understand patterns, distributions, correlations, and anomalies in the data.
5. **Modelling.** This is where machine learning, statistical modelling, and algorithms come in: building models that can predict outcomes, classify data, or uncover hidden patterns.
6. **Communication and deployment.** Insights are worthless if they cannot be communicated. Data scientists must present findings clearly, through reports, dashboards, and visualisations, and deploy models into production systems.
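The cleaning and exploration steps of this lifecycle can be sketched in a few lines of pandas. This is a minimal illustration on made-up customer data, not a full workflow:

```python
import pandas as pd

# Hypothetical raw data with a duplicate row and a missing value
raw = pd.DataFrame({
    "customer": ["Alice", "Bob", "Bob", "Charlie"],
    "spend": [120.0, None, None, 80.0],
})

# Cleaning: drop exact duplicate rows, fill missing spend with the median
clean = raw.drop_duplicates().copy()
clean["spend"] = clean["spend"].fillna(clean["spend"].median())

# Exploration: summary statistics of the cleaned column
summary = clean["spend"].describe()
print(summary)
```

Even this toy example shows the usual pattern: most of the code deals with getting the data into shape before any analysis happens.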
| Field | Focus |
|---|---|
| Data Science | End-to-end process from question to insight — combines statistics, programming, and domain expertise |
| Machine Learning | Building algorithms that learn from data — a subset of data science |
| Data Analytics | Examining data to find trends and answer specific questions — more descriptive than predictive |
| Data Engineering | Building infrastructure to collect, store, and process data at scale |
| Statistics | Mathematical foundations for analysing and interpreting data |
| Artificial Intelligence | Broader field of creating intelligent systems — machine learning is a subset |
| Business Intelligence | Reporting and dashboards for business decision-making — typically less code-intensive |
| Language | Strengths |
|---|---|
| Python | Most popular for data science — rich ecosystem (NumPy, Pandas, Scikit-Learn, TensorFlow) |
| R | Strong in statistical analysis and academic research |
| SQL | Essential for querying relational databases |
| Julia | High-performance scientific computing |
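To illustrate the kind of querying SQL is used for, here is a minimal sketch using Python's built-in sqlite3 module and a made-up employees table (the table and figures are invented for illustration):

```python
import sqlite3

# In-memory database with a hypothetical employees table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, city TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Alice", "London", 55000), ("Bob", "Paris", 48000), ("Charlie", "Berlin", 62000)],
)

# A typical analytical query: average salary per city, highest first
rows = conn.execute(
    "SELECT city, AVG(salary) FROM employees GROUP BY city ORDER BY AVG(salary) DESC"
).fetchall()
print(rows)
conn.close()
```

In practice the database would be PostgreSQL, MySQL, or a data warehouse, but the SQL itself looks the same.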
| Library | Purpose |
|---|---|
| NumPy | Numerical computing and array operations |
| Pandas | Data manipulation and analysis |
| Matplotlib | Data visualisation and plotting |
| Seaborn | Statistical visualisation (built on Matplotlib) |
| Scikit-Learn | Machine learning algorithms |
| TensorFlow | Deep learning framework |
| PyTorch | Deep learning framework (research-focused) |
| Statsmodels | Statistical models and tests |
| SciPy | Scientific computing and optimisation |
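As a first taste of two of these libraries, here is a minimal sketch combining NumPy's vectorised arithmetic with a labelled Pandas DataFrame (the numbers are arbitrary):

```python
import numpy as np
import pandas as pd

# NumPy: arithmetic applied to a whole array at once
prices = np.array([10.0, 20.0, 30.0])
scaled = prices * 1.5  # elementwise, no explicit loop

# Pandas: the same numbers with labels and built-in aggregation
df = pd.DataFrame({"price": prices, "scaled": scaled})
total = df["scaled"].sum()
print(df)
print(total)
```

Later lessons cover each library in depth; the point here is simply that they are designed to work together.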
| Tool | Purpose |
|---|---|
| Jupyter Notebook | Interactive computing environment for code, text, and visualisations |
| Google Colab | Free cloud-based Jupyter notebooks with GPU access |
| Kaggle | Competition platform with datasets and notebooks |
| Git/GitHub | Version control for code and collaboration |
| Docker | Containerisation for reproducible environments |
| Apache Spark | Distributed data processing at scale |
Data organised in rows and columns — databases, spreadsheets, CSV files. Each column has a defined data type.
| Name | Age | City | Salary |
|---------|-----|---------|---------|
| Alice | 30 | London | 55000 |
| Bob | 25 | Paris | 48000 |
| Charlie | 35 | Berlin | 62000 |
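A table like the one above maps directly onto a pandas DataFrame, where each column has a single data type:

```python
import pandas as pd

# The structured table above as a DataFrame
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [30, 25, 35],
    "City": ["London", "Paris", "Berlin"],
    "Salary": [55000, 48000, 62000],
})

# With a defined schema, column-wise operations come for free
avg_salary = df["Salary"].mean()
print(df.dtypes)
print(avg_salary)
```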
Data without a predefined format — text documents, images, audio, video, social media posts. Requires special techniques (NLP, computer vision) to analyse.
Data with some organisational structure but not a rigid schema — JSON, XML, HTML, log files.
```json
{
  "name": "Alice",
  "age": 30,
  "skills": ["Python", "SQL", "Machine Learning"],
  "address": {
    "city": "London",
    "country": "UK"
  }
}
```
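Semi-structured data like this can be handled with Python's built-in json module; nested objects become dictionaries and arrays become lists, so fields are reached with ordinary indexing:

```python
import json

# Parse the JSON record shown above
record = json.loads("""
{
  "name": "Alice",
  "age": 30,
  "skills": ["Python", "SQL", "Machine Learning"],
  "address": {"city": "London", "country": "UK"}
}
""")

# Nested access: dicts for objects, lists for arrays
city = record["address"]["city"]
first_skill = record["skills"][0]
print(city, first_skill)
```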
**Descriptive:** "What happened?" — Summarising historical data with metrics like mean, median, counts, and percentages. Example: monthly sales reports.

**Diagnostic:** "Why did it happen?" — Investigating the causes behind observed patterns. Example: identifying why customer churn increased last quarter.

**Predictive:** "What will happen?" — Using historical data to forecast future outcomes. Example: predicting which customers are likely to cancel their subscription.

**Prescriptive:** "What should we do?" — Recommending actions based on predictions. Example: suggesting optimal pricing strategies to maximise revenue.
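The descriptive level is the easiest to show in code. A minimal sketch using Python's standard statistics module on invented monthly sales figures:

```python
import statistics

# Hypothetical monthly sales figures
sales = [1200, 1500, 1100, 1800, 1500, 2100]

# Descriptive analytics: summarise what happened
report = {
    "mean": statistics.mean(sales),
    "median": statistics.median(sales),
    "count": len(sales),
}
print(report)
```

Diagnostic, predictive, and prescriptive analytics build on summaries like this with increasingly sophisticated modelling.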
Python has become the dominant language for data science for several reasons: its readable syntax, its rich ecosystem of libraries, and its strong community support.
Data science also comes with significant ethical responsibilities.
Data science is the practice of extracting meaningful insights from data using a combination of statistics, programming, and domain knowledge. It spans the entire journey from defining a problem, through collecting and cleaning data, to building models and communicating results. Python is the most popular language for data science thanks to its readable syntax, rich ecosystem of libraries, and strong community support. In this course, we will build a solid foundation in the Python data science toolkit — starting with Jupyter notebooks, then progressing through NumPy, Pandas, Matplotlib, and Scikit-Learn.