Working with real-world datasets is fundamentally different from working with clean, pre-processed toy data. Real data is messy, incomplete, and often surprising. This lesson covers where to find datasets, how to handle common real-world challenges, and how to conduct an end-to-end analysis on a genuine dataset.
| Source | URL | Description |
|---|---|---|
| Kaggle | kaggle.com | Thousands of datasets across every domain |
| UCI Machine Learning Repository | archive.ics.uci.edu | Classic ML datasets |
| Google Dataset Search | datasetsearch.research.google.com | Search engine for datasets |
| data.gov | data.gov | US government open data |
| data.gov.uk | data.gov.uk | UK government open data |
| World Bank | data.worldbank.org | Global development data |
| WHO | who.int/data | Global health data |
| Eurostat | ec.europa.eu/eurostat | European statistics |
```python
# Scikit-learn bundled datasets
from sklearn.datasets import (
    load_iris,                 # Classification: iris flower species
    load_wine,                 # Classification: wine cultivars
    fetch_california_housing,  # Regression: house prices
    load_digits,               # Classification: handwritten digits
    fetch_20newsgroups,        # Text classification: newsgroup posts
)
```
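To see what these loaders return, here is a quick sketch using `load_iris`; the `as_frame=True` option (available in modern scikit-learn) returns the data as a pandas DataFrame instead of NumPy arrays:

```python
from sklearn.datasets import load_iris

# Each loader returns a Bunch object with data, target, and metadata
iris = load_iris(as_frame=True)
df = iris.frame                  # Features plus a 'target' column
print(df.shape)                  # (150, 5)
print(list(iris.target_names))   # ['setosa', 'versicolor', 'virginica']
```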
```python
# Seaborn sample datasets (downloaded on first use, then cached locally)
import seaborn as sns

tips = sns.load_dataset('tips')
titanic = sns.load_dataset('titanic')
penguins = sns.load_dataset('penguins')
flights = sns.load_dataset('flights')

# pandas-datareader fetches external data such as economic indicators
# pip install pandas-datareader
```
```python
import pandas as pd

# Basic CSV
df = pd.read_csv('data.csv')

# With custom settings
df = pd.read_csv(
    'data.csv',
    sep=';',                            # Semicolon delimiter
    encoding='utf-8',                   # Character encoding
    na_values=['N/A', 'missing', '-'],  # Custom missing-value markers
    parse_dates=['date'],               # Parse date columns
    dtype={'id': str},                  # Specify column types
    nrows=1000,                         # Read only the first 1000 rows
)
```
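Here is a minimal, self-contained sketch of those options in action; the file contents are invented for illustration, with `StringIO` standing in for a file on disk:

```python
import pandas as pd
from io import StringIO

# A messy semicolon-delimited file with custom missing-value markers
raw = (
    "id;date;value\n"
    "001;2024-01-01;10\n"
    "002;2024-01-02;N/A\n"
    "003;missing;30\n"
)

df = pd.read_csv(
    StringIO(raw),
    sep=';',
    na_values=['N/A', 'missing'],
    parse_dates=['date'],
    dtype={'id': str},
)

print(df['id'].tolist())    # Leading zeros preserved: ['001', '002', '003']
print(df.isnull().sum())    # One NaT in 'date', one NaN in 'value'
```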
```python
# Reading .xlsx files requires an engine such as openpyxl
# pip install openpyxl
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Read all sheets into a dict of DataFrames
all_sheets = pd.read_excel('data.xlsx', sheet_name=None)
for name, sheet_df in all_sheets.items():
    print(f"Sheet: {name}, Shape: {sheet_df.shape}")
```
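To see the multi-sheet behaviour end to end, here is a sketch that writes a small two-sheet workbook and reads it back (assumes openpyxl is installed; the file name `demo.xlsx` is invented):

```python
import pandas as pd

# Write a two-sheet workbook
with pd.ExcelWriter('demo.xlsx') as writer:
    pd.DataFrame({'a': [1, 2]}).to_excel(writer, sheet_name='first', index=False)
    pd.DataFrame({'b': [3, 4, 5]}).to_excel(writer, sheet_name='second', index=False)

# sheet_name=None returns a dict of {sheet name: DataFrame}
all_sheets = pd.read_excel('demo.xlsx', sheet_name=None)
print(sorted(all_sheets))            # ['first', 'second']
print(all_sheets['second'].shape)    # (3, 1)
```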
```python
# JSON file
df = pd.read_json('data.json')

# From a web API
import requests

response = requests.get('https://api.example.com/data')
response.raise_for_status()   # Fail early on HTTP errors
data = response.json()
df = pd.DataFrame(data['results'])
```
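API responses are often nested, in which case `pd.json_normalize` is more convenient than the plain `DataFrame` constructor. An offline sketch with an invented payload:

```python
import pandas as pd

# An invented API-style payload with nested fields
data = {
    "results": [
        {"id": 1, "name": "Ada", "address": {"city": "London"}},
        {"id": 2, "name": "Bo", "address": {"city": "Leeds"}},
    ]
}

# json_normalize flattens nested dicts into dotted column names
df = pd.json_normalize(data["results"])
print(df.columns.tolist())   # ['id', 'name', 'address.city']
print(df.shape)              # (2, 3)
```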
```python
import sqlite3

conn = sqlite3.connect('database.db')
df = pd.read_sql("SELECT * FROM customers WHERE country = 'UK'", conn)
conn.close()

# With SQLAlchemy (supports PostgreSQL, MySQL, etc.)
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pass@host/dbname')
df = pd.read_sql('SELECT * FROM orders', engine)
```
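The SQLite path can be tried without any setup by using an in-memory database; the table and rows below are invented for illustration:

```python
import sqlite3
import pandas as pd

# In-memory database: nothing is written to disk
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE customers (name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Alice", "UK"), ("Bob", "US"), ("Carol", "UK")],
)
conn.commit()

df = pd.read_sql("SELECT * FROM customers WHERE country = 'UK'", conn)
conn.close()
print(len(df))               # 2
print(df['name'].tolist())   # ['Alice', 'Carol']
```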
```python
# Read in chunks for files too large to fit in memory
chunks = pd.read_csv('large_file.csv', chunksize=10000)
results = []
for chunk in chunks:
    # Aggregate each chunk separately
    result = chunk.groupby('category')['value'].sum()
    results.append(result)

# Combine the per-chunk aggregates into one final result
final = pd.concat(results).groupby(level=0).sum()
```
Let us walk through a complete analysis of the Titanic dataset:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load data
df = sns.load_dataset('titanic')
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
df.head()

# Missing values
print(df.isnull().sum())
print(f"\nMissing percentage:\n{df.isnull().mean() * 100}")

# Survival rate
print(f"\nOverall survival rate: {df['survived'].mean():.2%}")

# Survival by class
print(df.groupby('pclass')['survived'].mean())
```
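The model-related imports above set up the modelling step of the analysis. A minimal sketch of that split/fit/evaluate flow, using synthetic data from `make_classification` as a stand-in for prepared Titanic features (a full analysis would first encode and impute the real columns):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Synthetic stand-in for cleaned, numeric Titanic features
X, y = make_classification(n_samples=500, n_features=6, random_state=42)

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(classification_report(y_test, y_pred))
```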