What is Data Science

Data science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract meaningful insights and knowledge from structured and unstructured data. It combines elements of statistics, computer science, and domain expertise to turn raw data into actionable information that drives decision-making.


A Brief History

  • 1962 — John Tukey publishes The Future of Data Analysis, advocating for a new discipline combining statistics and computing
  • 1977 — The International Association for Statistical Computing (IASC) is founded
  • 1996 — The term "data science" first appears in a conference title, at the International Federation of Classification Societies meeting in Kobe
  • 2001 — William S. Cleveland proposes "data science" as an expansion of statistics
  • 2006 — Hadoop is released, enabling distributed storage and processing of massive datasets
  • 2008 — The term "data scientist" gains popularity; DJ Patil and Jeff Hammerbacher coin the modern usage
  • 2010 — Kaggle launches, creating a global data science competition platform
  • 2012 — Harvard Business Review calls data scientist "the sexiest job of the 21st century"
  • 2015 — TensorFlow is open-sourced by Google, accelerating machine learning adoption
  • 2017 — PyTorch is released by Facebook, becoming a favourite in research communities
  • Today — Data science is embedded in virtually every industry — healthcare, finance, retail, transport, government, and more

What Does a Data Scientist Do?

Data scientists work across the entire lifecycle of data:

1. Define the Problem

Before touching any data, a data scientist must understand the business question. What decision needs to be made? What outcome would be valuable? A well-defined problem is half the solution.

2. Collect and Gather Data

Data may come from databases, APIs, web scraping, surveys, sensors, or third-party providers. Understanding where data lives and how to access it is a critical skill.

3. Clean and Prepare Data

Real-world data is messy. It contains missing values, duplicates, inconsistencies, and errors. Data cleaning is commonly estimated to consume 60-80% of a data scientist's time.
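
As a minimal sketch of what cleaning involves, the example below tidies a text column, drops exact duplicates, and fills a missing value with the median. The records and the `clean` helper are hypothetical, and only the standard library is used so the snippet runs without installing anything; in practice this work is usually done with Pandas.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw_rows = [
    {"name": "Alice", "age": 30, "city": "London"},
    {"name": "Bob", "age": None, "city": " paris"},    # missing age, untidy city
    {"name": "Alice", "age": 30, "city": "London"},    # exact duplicate
    {"name": "Charlie", "age": 35, "city": "Berlin"},
]


def clean(rows):
    """Tidy the city text, drop exact duplicates, fill missing ages with the median."""
    seen = set()
    unique = []
    for row in rows:
        tidy = {**row, "city": row["city"].strip().title()}
        key = (tidy["name"], tidy["age"], tidy["city"])
        if key not in seen:
            seen.add(key)
            unique.append(tidy)

    # Median of the ages that are present, used as a simple imputation value.
    fill = median(r["age"] for r in unique if r["age"] is not None)
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in unique]


cleaned = clean(raw_rows)
for row in cleaned:
    print(row)
```

Median imputation is only one possible choice here; depending on the problem, dropping the row or flagging the value as unknown may be more appropriate.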

4. Explore and Visualise

Exploratory Data Analysis (EDA) uses statistics and visualisation to understand patterns, distributions, correlations, and anomalies in the data.
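
A first pass at EDA often amounts to a handful of summary statistics and a correlation check. The sketch below assumes two small illustrative columns (`ages` and `salaries`) and a hand-written `pearson` helper; libraries such as Pandas and Seaborn would normally do this and add the plots.

```python
from statistics import mean, median, stdev

# Small illustrative sample: one age and one salary per person.
ages = [30, 25, 35]
salaries = [55000, 48000, 62000]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return covariance / (spread_x * spread_y)


summary = {
    "mean": mean(salaries),
    "median": median(salaries),
    "stdev": stdev(salaries),  # sample standard deviation
}
r = pearson(ages, salaries)

print(summary)
print(f"age/salary correlation: {r:.3f}")
```

In this toy sample salary rises by the same amount for every year of age, so the correlation comes out at essentially 1; real data is rarely this tidy.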

5. Model and Analyse

This is where machine learning, statistical modelling, and algorithms come in — building models that can predict outcomes, classify data, or uncover hidden patterns.
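
To make "building a model" concrete, here is a sketch of the simplest predictive model, a straight line fitted by ordinary least squares. The data and the names `fit_line` and `predict` are assumptions for illustration; Scikit-Learn provides the same technique with far more tooling around it.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Apply the fitted line to a new input."""
    return slope * x + intercept


# Hypothetical training data: years of experience vs salary in thousands.
years_experience = [1, 2, 3, 4, 5]
salary_k = [32, 35, 41, 44, 48]

slope, intercept = fit_line(years_experience, salary_k)
forecast = predict(6, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, forecast for 6 years={forecast:.1f}")
```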

6. Communicate and Deploy

Insights are worthless if they cannot be communicated. Data scientists must present findings clearly — through reports, dashboards, and visualisations — and deploy models into production systems.


Data Science vs Related Fields

| Field | Focus |
|-------|-------|
| Data Science | End-to-end process from question to insight; combines statistics, programming, and domain expertise |
| Machine Learning | Building algorithms that learn from data; a subfield of AI and a core technique within data science |
| Data Analytics | Examining data to find trends and answer specific questions; more descriptive than predictive |
| Data Engineering | Building infrastructure to collect, store, and process data at scale |
| Statistics | Mathematical foundations for analysing and interpreting data |
| Artificial Intelligence | Broader field of creating intelligent systems; machine learning is a subset |
| Business Intelligence | Reporting and dashboards for business decision-making; typically less code-intensive |

The Data Science Toolkit

Programming Languages

| Language | Strengths |
|----------|-----------|
| Python | Most popular for data science; rich ecosystem (NumPy, Pandas, Scikit-Learn, TensorFlow) |
| R | Strong in statistical analysis and academic research |
| SQL | Essential for querying relational databases |
| Julia | High-performance scientific computing |

Key Python Libraries

| Library | Purpose |
|---------|---------|
| NumPy | Numerical computing and array operations |
| Pandas | Data manipulation and analysis |
| Matplotlib | Data visualisation and plotting |
| Seaborn | Statistical visualisation (built on Matplotlib) |
| Scikit-Learn | Machine learning algorithms |
| TensorFlow | Deep learning framework |
| PyTorch | Deep learning framework (research-focused) |
| Statsmodels | Statistical models and tests |
| SciPy | Scientific computing and optimisation |

Tools and Platforms

| Tool | Purpose |
|------|---------|
| Jupyter Notebook | Interactive computing environment for code, text, and visualisations |
| Google Colab | Free cloud-based Jupyter notebooks with GPU access |
| Kaggle | Competition platform with datasets and notebooks |
| Git/GitHub | Version control for code and collaboration |
| Docker | Containerisation for reproducible environments |
| Apache Spark | Distributed data processing at scale |

Types of Data

Structured Data

Data organised in rows and columns — databases, spreadsheets, CSV files. Each column has a defined data type.

| Name    | Age | City    | Salary  |
|---------|-----|---------|---------|
| Alice   | 30  | London  | 55000   |
| Bob     | 25  | Paris   | 48000   |
| Charlie | 35  | Berlin  | 62000   |
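
Because every column has a defined type, structured data can be loaded and aggregated with very little code. The sketch below stores the table above as CSV text and converts each column using a hypothetical `COLUMN_TYPES` mapping; it relies only on the standard `csv` module, whereas Pandas would infer the types automatically.

```python
import csv
import io

# The table above, written as CSV text.
csv_text = """Name,Age,City,Salary
Alice,30,London,55000
Bob,25,Paris,48000
Charlie,35,Berlin,62000
"""

# Assumed schema: the type each column should be converted to.
COLUMN_TYPES = {"Name": str, "Age": int, "City": str, "Salary": int}


def load_rows(text):
    """Parse CSV text into a list of dicts with typed values."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {column: COLUMN_TYPES[column](value) for column, value in row.items()}
        for row in reader
    ]


rows = load_rows(csv_text)
total_salary = sum(row["Salary"] for row in rows)
print(rows[0])
print("total salary:", total_salary)
```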

Unstructured Data

Data without a predefined format — text documents, images, audio, video, social media posts. Requires special techniques (NLP, computer vision) to analyse.

Semi-Structured Data

Data with some organisational structure but not a rigid schema — JSON, XML, HTML, log files.

{
  "name": "Alice",
  "age": 30,
  "skills": ["Python", "SQL", "Machine Learning"],
  "address": {
    "city": "London",
    "country": "UK"
  }
}
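
Semi-structured records like this usually need flattening before they fit into rows and columns. As a rough sketch, the hypothetical `flatten` helper below turns nested objects into dotted column names and leaves lists untouched; how lists should be handled depends on the analysis.

```python
import json

record_text = """
{
  "name": "Alice",
  "age": 30,
  "skills": ["Python", "SQL", "Machine Learning"],
  "address": {
    "city": "London",
    "country": "UK"
  }
}
"""


def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=path + "."))
        else:
            flat[path] = value  # lists and scalars are kept as they are
    return flat


record = json.loads(record_text)
flat = flatten(record)
print(flat)
```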

Types of Analysis

Descriptive Analysis

"What happened?" — Summarising historical data with metrics like mean, median, counts, and percentages. Example: monthly sales reports.

Diagnostic Analysis

"Why did it happen?" — Investigating the causes behind observed patterns. Example: identifying why customer churn increased last quarter.

Predictive Analysis

"What will happen?" — Using historical data to forecast future outcomes. Example: predicting which customers are likely to cancel their subscription.

Prescriptive Analysis

"What should we do?" — Recommending actions based on predictions. Example: suggesting optimal pricing strategies to maximise revenue.

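The pricing example can be sketched as a small search: take a model of how demand responds to price, then pick the price with the highest expected revenue. The linear demand curve, its parameters, and the candidate prices below are all made-up assumptions; a real system would estimate demand from historical data first.

```python
def expected_revenue(price, base_demand=1000.0, sensitivity=20.0):
    """Revenue under an assumed linear demand curve: units = base - sensitivity * price."""
    units_sold = max(base_demand - sensitivity * price, 0.0)
    return price * units_sold


# Candidate prices to evaluate: 5, 10, ..., 45.
candidate_prices = list(range(5, 50, 5))

# The prescriptive step: recommend the candidate with the highest expected revenue.
best_price = max(candidate_prices, key=expected_revenue)
print("recommended price:", best_price)
print("expected revenue:", expected_revenue(best_price))
```
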

Why Python for Data Science?

Python has become the dominant language for data science for several reasons:

  1. Readable syntax — Python code reads almost like English, making it accessible to non-programmers
  2. Rich ecosystem — Thousands of libraries for every aspect of data science
  3. Community — The largest data science community, with abundant tutorials, courses, and support
  4. Integration — Python integrates well with databases, web frameworks, cloud services, and big data tools
  5. Industry adoption — Used by Google, Netflix, Facebook, NASA, and virtually every major tech company
  6. Jupyter notebooks — The interactive notebook environment grew out of the IPython project and is the standard tool for data exploration

Real-World Applications

Healthcare

  • Predicting disease outbreaks and patient readmission risk
  • Drug discovery and clinical trial optimisation
  • Medical image analysis (X-rays, MRIs)

Finance

  • Fraud detection and prevention
  • Algorithmic trading and risk assessment
  • Credit scoring and loan approval

Retail and E-commerce

  • Recommendation systems (Netflix, Amazon, Spotify)
  • Customer segmentation and targeting
  • Demand forecasting and inventory optimisation

Transport

  • Route optimisation (Uber, Lyft)
  • Autonomous vehicle development
  • Traffic prediction and management

Social Media

  • Content recommendation algorithms
  • Sentiment analysis and trend detection
  • Misinformation detection

Ethics in Data Science

Data science comes with significant ethical responsibilities:

  • Privacy — Handling personal data responsibly and complying with regulations (GDPR, CCPA)
  • Bias — Recognising and mitigating bias in data and algorithms
  • Transparency — Making models explainable and decisions interpretable
  • Consent — Ensuring data is collected with informed consent
  • Fairness — Ensuring algorithms do not discriminate against protected groups
  • Security — Protecting data from breaches and misuse
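
Some of these responsibilities can be checked numerically. As one hedged illustration of the bias and fairness points, the sketch below compares approval rates between two groups, a simple measure often called the demographic parity gap. The group names and decisions are invented, and a small gap on this one metric does not by itself show that a model is fair.

```python
def approval_rate(decisions):
    """Share of positive decisions, where 1 = approved and 0 = rejected."""
    return sum(decisions) / len(decisions)


def demographic_parity_gap(decisions_by_group):
    """Return the largest difference in approval rate between groups, plus the rates."""
    rates = {
        group: approval_rate(decisions)
        for group, decisions in decisions_by_group.items()
    }
    return max(rates.values()) - min(rates.values()), rates


# Hypothetical model decisions for two groups of applicants.
decisions = {
    "group_a": [1, 1, 0, 1],
    "group_b": [1, 0, 0, 0],
}

gap, rates = demographic_parity_gap(decisions)
print(rates)
print("parity gap:", gap)
```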

Summary

Data science is the practice of extracting meaningful insights from data using a combination of statistics, programming, and domain knowledge. It spans the entire journey from defining a problem, through collecting and cleaning data, to building models and communicating results. Python is the most popular language for data science thanks to its readable syntax, rich ecosystem of libraries, and strong community support. In this course, we will build a solid foundation in the Python data science toolkit — starting with Jupyter notebooks, then progressing through NumPy, Pandas, Matplotlib, and Scikit-Learn.