What is Data Science
Data science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract meaningful insights and knowledge from structured and unstructured data. It combines elements of statistics, computer science, and domain expertise to turn raw data into actionable information that drives decision-making.
A Brief History
- 1962 — John Tukey publishes The Future of Data Analysis, advocating for a new discipline combining statistics and computing
- 1977 — The International Association for Statistical Computing (IASC) is founded
- 1996 — The term "data science" appears in a conference title for the first time, at the International Federation of Classification Societies meeting in Kobe, Japan
- 2001 — William S. Cleveland proposes "data science" as an expansion of statistics
- 2006 — Hadoop is released, enabling distributed storage and processing of massive datasets
- 2008 — The term "data scientist" gains popularity; DJ Patil and Jeff Hammerbacher coin the modern usage
- 2010 — Kaggle launches, creating a global data science competition platform
- 2012 — Harvard Business Review calls data scientist "the sexiest job of the 21st century"
- 2015 — TensorFlow is open-sourced by Google, accelerating machine learning adoption
- 2017 — PyTorch is released by Facebook, becoming a favourite in research communities
- Today — Data science is embedded in virtually every industry — healthcare, finance, retail, transport, government, and more
What Does a Data Scientist Do?
Data scientists work across the entire lifecycle of data:
1. Define the Problem
Before touching any data, a data scientist must understand the business question. What decision needs to be made? What outcome would be valuable? A well-defined problem is half the solution.
2. Collect and Gather Data
Data may come from databases, APIs, web scraping, surveys, sensors, or third-party providers. Understanding where data lives and how to access it is a critical skill.
3. Clean and Prepare Data
Real-world data is messy. It contains missing values, duplicates, inconsistencies, and errors. Data cleaning is commonly cited as consuming 60-80% of a data scientist's time, though estimates vary.
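As a rough illustration of two common cleaning steps, the sketch below drops exact duplicates and fills a missing value with the median. The records, the field names, and the `clean` helper are hypothetical, and only the standard library is used so the example runs anywhere; in practice a library such as Pandas is the usual tool.

```python
from statistics import median

# Hypothetical raw records: one exact duplicate and one missing age.
raw_records = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": None},
    {"name": "Alice", "age": 30},
    {"name": "Charlie", "age": 35},
]


def clean(records):
    """Drop exact duplicates, then fill missing ages with the median age."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["name"], rec["age"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))  # copy so the input is not mutated
    known_ages = [r["age"] for r in unique if r["age"] is not None]
    fill_value = median(known_ages)
    for rec in unique:
        if rec["age"] is None:
            rec["age"] = fill_value
    return unique


cleaned = clean(raw_records)
```

Filling with the median is only one possible choice; depending on the problem, dropping incomplete rows or flagging them for review may be more appropriate.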
4. Explore and Visualise
Exploratory Data Analysis (EDA) uses statistics and visualisation to understand patterns, distributions, correlations, and anomalies in the data.
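A minimal sketch of one EDA step: measuring the spread of a variable and the correlation between two variables. The numbers are the ages and salaries from the small table in the Structured Data section of this lesson; the `pearson` helper is written out by hand for illustration, whereas real projects would normally use Pandas or NumPy.

```python
import math

# Ages and salaries from the Name/Age/City/Salary table in this lesson.
ages = [30, 25, 35]
salaries = [55000, 48000, 62000]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


salary_range = max(salaries) - min(salaries)  # spread of the salary values
r = pearson(ages, salaries)                   # strength of the linear relationship
```

With only three rows the result says little about any real population; the point is the shape of the calculation.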
5. Model and Analyse
This is where machine learning, statistical modelling, and algorithms come in — building models that can predict outcomes, classify data, or uncover hidden patterns.
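To make "building a model that predicts an outcome" concrete, here is a sketch of ordinary least squares for a straight line, written without any libraries. The `fit_line` helper and the age of 40 used for the prediction are assumptions for illustration; a library such as Scikit-Learn would typically be used instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy training data: ages and salaries from the table in this lesson.
ages = [30, 25, 35]
salaries = [55000, 48000, 62000]

slope, intercept = fit_line(ages, salaries)
predicted = slope * 40 + intercept  # prediction for a hypothetical 40-year-old
```

Three data points are far too few for a trustworthy model; the example only shows the fit-then-predict pattern.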
6. Communicate and Deploy
Insights are worthless if they cannot be communicated. Data scientists must present findings clearly — through reports, dashboards, and visualisations — and deploy models into production systems.
Data Science vs Related Fields
| Field | Focus |
|---|---|
| Data Science | End-to-end process from question to insight — combines statistics, programming, and domain expertise |
| Machine Learning | Building algorithms that learn from data; a core technique in data science and a subset of artificial intelligence |
| Data Analytics | Examining data to find trends and answer specific questions — more descriptive than predictive |
| Data Engineering | Building infrastructure to collect, store, and process data at scale |
| Statistics | Mathematical foundations for analysing and interpreting data |
| Artificial Intelligence | Broader field of creating intelligent systems — machine learning is a subset |
| Business Intelligence | Reporting and dashboards for business decision-making — typically less code-intensive |
The Data Science Toolkit
Programming Languages
| Language | Strengths |
|---|---|
| Python | Most popular for data science — rich ecosystem (NumPy, Pandas, Scikit-Learn, TensorFlow) |
| R | Strong in statistical analysis and academic research |
| SQL | Essential for querying relational databases |
| Julia | High-performance scientific computing |
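Python and SQL are often used together. The sketch below runs a SQL query from Python using the standard-library `sqlite3` module and an in-memory database; the `employees` table, its columns, and the salary threshold are hypothetical.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, city TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Alice", "London", 55000),
        ("Bob", "Paris", 48000),
        ("Charlie", "Berlin", 62000),
    ],
)

# Parameterised query: names of employees earning more than a threshold.
rows = conn.execute(
    "SELECT name FROM employees WHERE salary > ? ORDER BY salary DESC",
    (50000,),
).fetchall()
high_earners = [name for (name,) in rows]
conn.close()
```

Passing the threshold as a `?` parameter, rather than formatting it into the query string, lets the database driver handle quoting safely.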
Key Python Libraries
| Library | Purpose |
|---|---|
| NumPy | Numerical computing and array operations |
| Pandas | Data manipulation and analysis |
| Matplotlib | Data visualisation and plotting |
| Seaborn | Statistical visualisation (built on Matplotlib) |
| Scikit-Learn | Machine learning algorithms |
| TensorFlow | Deep learning framework |
| PyTorch | Deep learning framework (research-focused) |
| Statsmodels | Statistical models and tests |
| SciPy | Scientific computing and optimisation |
Tools and Platforms
| Tool | Purpose |
|---|---|
| Jupyter Notebook | Interactive computing environment for code, text, and visualisations |
| Google Colab | Free cloud-based Jupyter notebooks with GPU access |
| Kaggle | Competition platform with datasets and notebooks |
| Git/GitHub | Version control for code and collaboration |
| Docker | Containerisation for reproducible environments |
| Apache Spark | Distributed data processing at scale |
Types of Data
Structured Data
Data organised in rows and columns — databases, spreadsheets, CSV files. Each column has a defined data type.
| Name | Age | City | Salary |
|---------|-----|---------|---------|
| Alice | 30 | London | 55000 |
| Bob | 25 | Paris | 48000 |
| Charlie | 35 | Berlin | 62000 |
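As a small sketch of working with structured data, the table above can be written as CSV text and loaded with the standard-library `csv` module. The explicit type conversion is there because `csv` returns every field as a string; the variable names are illustrative only.

```python
import csv
import io

# The table above, expressed as CSV text.
csv_text = """Name,Age,City,Salary
Alice,30,London,55000
Bob,25,Paris,48000
Charlie,35,Berlin,62000
"""

# Convert the numeric columns explicitly so each column has a defined type.
rows = [
    {
        "Name": r["Name"],
        "Age": int(r["Age"]),
        "City": r["City"],
        "Salary": int(r["Salary"]),
    }
    for r in csv.DictReader(io.StringIO(csv_text))
]
total_salary = sum(r["Salary"] for r in rows)
```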
Unstructured Data
Data without a predefined format — text documents, images, audio, video, social media posts. Requires special techniques (NLP, computer vision) to analyse.
Semi-Structured Data
Data with some organisational structure but not a rigid schema — JSON, XML, HTML, log files.
{
  "name": "Alice",
  "age": 30,
  "skills": ["Python", "SQL", "Machine Learning"],
  "address": {
    "city": "London",
    "country": "UK"
  }
}
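A short sketch of reading such a record in Python with the standard-library `json` module. The `phone` field is a hypothetical key that is absent from the record, included to show that semi-structured data does not guarantee every field is present.

```python
import json

document = """
{
  "name": "Alice",
  "age": 30,
  "skills": ["Python", "SQL", "Machine Learning"],
  "address": {"city": "London", "country": "UK"}
}
"""

record = json.loads(document)
city = record["address"]["city"]     # nested objects become nested dicts
skill_count = len(record["skills"])  # JSON arrays become Python lists
phone = record.get("phone")          # missing keys are possible: no rigid schema
```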
Types of Analysis
Descriptive Analysis
"What happened?" — Summarising historical data with metrics like mean, median, counts, and percentages. Example: monthly sales reports.
Diagnostic Analysis
"Why did it happen?" — Investigating the causes behind observed patterns. Example: identifying why customer churn increased last quarter.
Predictive Analysis
"What will happen?" — Using historical data to forecast future outcomes. Example: predicting which customers are likely to cancel their subscription.
Prescriptive Analysis
"What should we do?" — Recommending actions based on predictions. Example: suggesting optimal pricing strategies to maximise revenue.
Why Python for Data Science?
Python has become the dominant language for data science for several reasons:
- Readable syntax — Python code reads almost like English, making it accessible to non-programmers
- Rich ecosystem — Thousands of libraries for every aspect of data science
- Community — The largest data science community, with abundant tutorials, courses, and support
- Integration — Python integrates well with databases, web frameworks, cloud services, and big data tools
- Industry adoption — Used by Google, Netflix, Facebook, NASA, and virtually every major tech company
- Jupyter notebooks — The interactive notebook environment grew out of the IPython project and, although it now supports many languages, remains a standard tool for data exploration in Python
Real-World Applications
Healthcare
- Predicting disease outbreaks and patient readmission risk
- Drug discovery and clinical trial optimisation
- Medical image analysis (X-rays, MRIs)
Finance
- Fraud detection and prevention
- Algorithmic trading and risk assessment
- Credit scoring and loan approval
Retail and E-commerce
- Recommendation systems (Netflix, Amazon, Spotify)
- Customer segmentation and targeting
- Demand forecasting and inventory optimisation
Transport
- Route optimisation (Uber, Lyft)
- Autonomous vehicle development
- Traffic prediction and management
Social Media
- Content recommendation algorithms
- Sentiment analysis and trend detection
- Misinformation detection
Ethics in Data Science
Data science comes with significant ethical responsibilities:
- Privacy — Handling personal data responsibly and complying with regulations (GDPR, CCPA)
- Bias — Recognising and mitigating bias in data and algorithms
- Transparency — Making models explainable and decisions interpretable
- Consent — Ensuring data is collected with informed consent
- Fairness — Ensuring algorithms do not discriminate against protected groups
- Security — Protecting data from breaches and misuse
Summary
Data science is the practice of extracting meaningful insights from data using a combination of statistics, programming, and domain knowledge. It spans the entire journey from defining a problem, through collecting and cleaning data, to building models and communicating results. Python is the most popular language for data science thanks to its readable syntax, rich ecosystem of libraries, and strong community support. In this course, we will build a solid foundation in the Python data science toolkit — starting with Jupyter notebooks, then progressing through NumPy, Pandas, Matplotlib, and Scikit-Learn.