What is Data Science

Data science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract meaningful insights and knowledge from structured and unstructured data. It combines elements of statistics, computer science, and domain expertise to turn raw data into actionable information that drives decision-making.

A Brief History

1962 — John Tukey publishes The Future of Data Analysis, advocating for a new discipline combining statistics and computing
1977 — The International Association for Statistical Computing (IASC) is founded
1996 — The term "data science" first appears in a database conference title
2001 — William S. Cleveland proposes "data science" as an expansion of statistics
2006 — Hadoop is released, enabling distributed storage and processing of massive datasets
2008 — The term "data scientist" gains popularity; DJ Patil and Jeff Hammerbacher coin the modern usage
2010 — Kaggle launches, creating a global data science competition platform
2012 — Harvard Business Review calls data scientist "the sexiest job of the 21st century"
2015 — TensorFlow is open-sourced by Google, accelerating machine learning adoption
2017 — PyTorch is released by Facebook, becoming a favourite in research communities
Today — Data science is embedded in virtually every industry — healthcare, finance, retail, transport, government, and more

What Does a Data Scientist Do?

Data scientists work across the entire lifecycle of data:

1. Define the Problem

Before touching any data, a data scientist must understand the business question. What decision needs to be made? What outcome would be valuable? A well-defined problem is half the solution.

2. Collect and Gather Data

Data may come from databases, APIs, web scraping, surveys, sensors, or third-party providers. Understanding where data lives and how to access it is a critical skill.

3. Clean and Prepare Data

Real-world data is messy. It contains missing values, duplicates, inconsistencies, and errors. Data cleaning typically consumes 60-80% of a data scientist's time.

4. Explore and Visualise

Exploratory Data Analysis (EDA) uses statistics and visualisation to understand patterns, distributions, correlations, and anomalies in the data.

5. Model and Analyse

This is where machine learning, statistical modelling, and algorithms come in — building models that can predict outcomes, classify data, or uncover hidden patterns.

6. Communicate and Deploy

Insights are worthless if they cannot be communicated. Data scientists must present findings clearly — through reports, dashboards, and visualisations — and deploy models into production systems.

Data Science vs Related Fields

Field	Focus
Data Science	End-to-end process from question to insight — combines statistics, programming, and domain expertise
Machine Learning	Building algorithms that learn from data — a subset of data science
Data Analytics	Examining data to find trends and answer specific questions — more descriptive than predictive
Data Engineering	Building infrastructure to collect, store, and process data at scale
Statistics	Mathematical foundations for analysing and interpreting data
Artificial Intelligence	Broader field of creating intelligent systems — machine learning is a subset
Business Intelligence	Reporting and dashboards for business decision-making — typically less code-intensive

The Data Science Toolkit

Programming Languages

Language	Strengths
Python	Most popular for data science — rich ecosystem (NumPy, Pandas, Scikit-Learn, TensorFlow)
R	Strong in statistical analysis and academic research
SQL	Essential for querying relational databases
Julia	High-performance scientific computing

Key Python Libraries

Library	Purpose
`NumPy`	Numerical computing and array operations
`Pandas`	Data manipulation and analysis
`Matplotlib`	Data visualisation and plotting
`Seaborn`	Statistical visualisation (built on Matplotlib)
`Scikit-Learn`	Machine learning algorithms
`TensorFlow`	Deep learning framework
`PyTorch`	Deep learning framework (research-focused)
`Statsmodels`	Statistical models and tests
`SciPy`	Scientific computing and optimisation

Tools and Platforms

Tool	Purpose
Jupyter Notebook	Interactive computing environment for code, text, and visualisations
Google Colab	Free cloud-based Jupyter notebooks with GPU access
Kaggle	Competition platform with datasets and notebooks
Git/GitHub	Version control for code and collaboration
Docker	Containerisation for reproducible environments
Apache Spark	Distributed data processing at scale

Types of Data

Structured Data

Data organised in rows and columns — databases, spreadsheets, CSV files. Each column has a defined data type.

Name	Age	City	Salary
Alice	30	London	55000
Bob	25	Paris	48000
Charlie	35	Berlin	62000

Unstructured Data

Data without a predefined format — text documents, images, audio, video, social media posts. Requires special techniques (NLP, computer vision) to analyse.

Semi-Structured Data

Data with some organisational structure but not a rigid schema — JSON, XML, HTML, log files.

{
  "name": "Alice",
  "age": 30,
  "skills": ["Python", "SQL", "Machine Learning"],
  "address": {
    "city": "London",
    "country": "UK"
  }
}

Types of Analysis

Descriptive Analysis

"What happened?" — Summarising historical data with metrics like mean, median, counts, and percentages. Example: monthly sales reports.

Diagnostic Analysis

"Why did it happen?" — Investigating the causes behind observed patterns. Example: identifying why customer churn increased last quarter.

Predictive Analysis

"What will happen?" — Using historical data to forecast future outcomes. Example: predicting which customers are likely to cancel their subscription.

Prescriptive Analysis

"What should we do?" — Recommending actions based on predictions. Example: suggesting optimal pricing strategies to maximise revenue.

Why Python for Data Science?

Python has become the dominant language for data science for several reasons:

Readable syntax — Python code reads almost like English, making it accessible to non-programmers
Rich ecosystem — Thousands of libraries for every aspect of data science
Community — The largest data science community, with abundant tutorials, courses, and support
Integration — Python integrates well with databases, web frameworks, cloud services, and big data tools
Industry adoption — Used by Google, Netflix, Facebook, NASA, and virtually every major tech company
Jupyter notebooks — The interactive notebook environment was built for Python and is the standard tool for data exploration

Real-World Applications

Healthcare

Predicting disease outbreaks and patient readmission risk
Drug discovery and clinical trial optimisation
Medical image analysis (X-rays, MRIs)

Finance

Fraud detection and prevention
Algorithmic trading and risk assessment
Credit scoring and loan approval

Retail and E-commerce

Recommendation systems (Netflix, Amazon, Spotify)
Customer segmentation and targeting
Demand forecasting and inventory optimisation

Transport

Route optimisation (Uber, Lyft)
Autonomous vehicle development
Traffic prediction and management

Social Media

Content recommendation algorithms
Sentiment analysis and trend detection
Misinformation detection

Ethics in Data Science

Data science comes with significant ethical responsibilities:

Privacy — Handling personal data responsibly and complying with regulations (GDPR, CCPA)
Bias — Recognising and mitigating bias in data and algorithms
Transparency — Making models explainable and decisions interpretable
Consent — Ensuring data is collected with informed consent
Fairness — Ensuring algorithms do not discriminate against protected groups
Security — Protecting data from breaches and misuse

Summary

Data science is the practice of extracting meaningful insights from data using a combination of statistics, programming, and domain knowledge. It spans the entire journey from defining a problem, through collecting and cleaning data, to building models and communicating results. Python is the most popular language for data science thanks to its readable syntax, rich ecosystem of libraries, and strong community support. In this course, we will build a solid foundation in the Python data science toolkit — starting with Jupyter notebooks, then progressing through NumPy, Pandas, Matplotlib, and Scikit-Learn.

What is Data Science

What is Data Science

A Brief History

What Does a Data Scientist Do?

1. Define the Problem

2. Collect and Gather Data

3. Clean and Prepare Data

4. Explore and Visualise

5. Model and Analyse

6. Communicate and Deploy

Data Science vs Related Fields

The Data Science Toolkit

Programming Languages

Key Python Libraries

Tools and Platforms

Types of Data

Structured Data

Unstructured Data

Semi-Structured Data

Types of Analysis

Descriptive Analysis

Diagnostic Analysis

Predictive Analysis

Prescriptive Analysis

Why Python for Data Science?

Real-World Applications

Healthcare

Finance

Retail and E-commerce

Transport

Social Media

Ethics in Data Science

Summary

More in Data Science