Exploratory Data Analysis (EDA) is the process of investigating a dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions, primarily through statistical summaries and visualisations. The term was coined by the statistician John Tukey in the 1970s, and EDA serves as the bridge between raw data and formal modelling.
Before building any model, you need to understand your data. Skipping EDA is like prescribing medicine without diagnosing the patient.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
df = pd.read_csv('dataset.csv')
# Basic information
print(f"Shape: {df.shape}") # (rows, columns)
print(f"Columns: {df.columns.tolist()}")
print(f"Data types:\n{df.dtypes}")
# First few rows
print(df.head())
# Detailed info
df.info()
# Numeric summary (printed so the table also appears when run as a script)
print(df.describe())
# Include non-numeric columns
print(df.describe(include='all'))
# Individual statistics
print(f"Mean age: {df['Age'].mean():.1f}")
print(f"Median salary: {df['Salary'].median():,.0f}")
print(f"Std dev: {df['Salary'].std():,.0f}")
print(f"Skewness: {df['Salary'].skew():.2f}")
print(f"Kurtosis: {df['Salary'].kurtosis():.2f}")
| Measure | Value | Interpretation |
|---|---|---|
| Skewness | 0 | Symmetric distribution |
| Skewness | > 0 | Right-skewed (tail to the right) |
| Skewness | < 0 | Left-skewed (tail to the left) |
| Kurtosis | 0 | Normal distribution shape |
| Kurtosis | > 0 | Heavy tails, more outliers |
| Kurtosis | < 0 | Light tails, fewer outliers |
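To see these sign conventions in practice, the sketch below computes skewness and kurtosis for three synthetic samples. The samples, the seed, and the column names are made up for illustration and are not part of `dataset.csv`. Note that pandas reports excess kurtosis (Fisher's definition), which is why a normal distribution scores near 0 rather than 3.

```python
import numpy as np
import pandas as pd

# Synthetic samples for illustration only (not taken from dataset.csv)
rng = np.random.default_rng(seed=0)
samples = pd.DataFrame({
    "normal": rng.normal(loc=0, scale=1, size=10_000),
    "right_skewed": rng.exponential(scale=1.0, size=10_000),
    "uniform": rng.uniform(low=0, high=1, size=10_000),
})

# One row per sample, one column per shape statistic.
# pandas uses excess kurtosis, so a normal sample lands close to 0.
shape_stats = pd.DataFrame({
    "skewness": samples.skew(),
    "kurtosis": samples.kurtosis(),
})
print(shape_stats.round(2))
```

The exponential sample should show clearly positive skewness and kurtosis, while the uniform sample should show negative kurtosis because it has no tails at all.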
Univariate analysis examines one variable at a time:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Histogram
axes[0].hist(df['Age'].dropna(), bins=20, color='steelblue', edgecolor='white')
axes[0].set_title('Age Distribution')
axes[0].axvline(df['Age'].mean(), color='red', linestyle='--', label='Mean')
axes[0].axvline(df['Age'].median(), color='green', linestyle='--', label='Median')
axes[0].legend()
# Box plot
axes[1].boxplot(df['Age'].dropna())
axes[1].set_title('Age Box Plot')
# KDE (Kernel Density Estimate)
df['Age'].plot(kind='kde', ax=axes[2])
axes[2].set_title('Age KDE')
plt.tight_layout()
plt.show()
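The points a box plot draws beyond its whiskers can also be listed numerically. The sketch below applies the 1.5 × IQR rule, which matches matplotlib's default whisker setting. The helper `iqr_outliers` and the age values are hypothetical, introduced only for this example.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    clean = values.dropna()
    q1, q3 = clean.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return clean[(clean < lower) | (clean > upper)]

# Made-up ages for illustration; 95 sits far from the rest, and None is a missing value
ages = pd.Series([23, 25, 27, 29, 31, 33, 35, 37, 39, 95, None])
outliers = iqr_outliers(ages)
print(outliers.tolist())
```

A flagged value is a candidate for inspection, not automatically an error. Whether to keep, correct, or drop it depends on what the column represents.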
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Value counts bar chart
df['City'].value_counts().plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('City Distribution')
axes[0].set_ylabel('Count')
# Proportion
df['City'].value_counts(normalize=True).plot(kind='bar', ax=axes[1], color='coral')
axes[1].set_title('City Proportions')
axes[1].set_ylabel('Proportion')
plt.tight_layout()
plt.show()
Bivariate analysis examines relationships between two variables:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Scatter plot
axes[0].scatter(df['Age'], df['Salary'], alpha=0.5)
axes[0].set_title('Age vs Salary')
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Salary')
# Scatter with regression line (Seaborn)
sns.regplot(data=df, x='Age', y='Salary', ax=axes[1], scatter_kws={'alpha': 0.5})
axes[1].set_title('Age vs Salary (with trend)')
plt.tight_layout()
plt.show()
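A scatter plot shows the shape of a relationship, and a correlation coefficient puts a number on its strength. The sketch below uses a synthetic `Age` and `Salary` table (the values and the linear relationship are invented for illustration) and computes both Pearson and Spearman correlations with `Series.corr`.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: salary rises with age, plus noise
rng = np.random.default_rng(seed=1)
age = rng.integers(low=22, high=65, size=200)
salary = 20_000 + 1_000 * age + rng.normal(loc=0, scale=8_000, size=200)
people = pd.DataFrame({"Age": age, "Salary": salary})

# Pearson measures linear association; Spearman works on ranks,
# so it is less sensitive to outliers and captures monotonic trends
pearson = people["Age"].corr(people["Salary"])
spearman = people["Age"].corr(people["Salary"], method="spearman")
print(f"Pearson r: {pearson:.2f}")
print(f"Spearman rho: {spearman:.2f}")
```

If the two coefficients differ noticeably on real data, that is often a hint that the relationship is non-linear or that a few extreme points are driving the Pearson value.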
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Box plot by category
sns.boxplot(data=df, x='Department', y='Salary', ax=axes[0])
axes[0].set_title('Salary by Department')
# Violin plot
sns.violinplot(data=df, x='Department', y='Salary', ax=axes[1])
axes[1].set_title('Salary Distribution by Department')
plt.tight_layout()
plt.show()
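The same comparison can be read off as numbers with a grouped summary, which is handy when there are too many categories to compare by eye. This is a minimal sketch on a made-up `Department` and `Salary` table; the values are illustrative only.

```python
import pandas as pd

# Made-up salaries for illustration only
staff = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "IT", "IT", "IT"],
    "Salary": [30_000, 35_000, 40_000, 45_000, 50_000, 70_000],
})

# One row per department: group size, centre, and spread of Salary
summary = staff.groupby("Department")["Salary"].agg(["count", "median", "mean", "std"])
print(summary)
```

Comparing `median` with `mean` per group gives a quick hint of skew within each department, which the box and violin plots show graphically.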
# Cross-tabulation
ct = pd.crosstab(df['Department'], df['City'])
print(ct)
# Stacked bar chart
ct.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.title('Department by City')
plt.ylabel('Count')
plt.show()
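Raw counts are hard to compare when departments differ in size. One option is to pass `normalize='index'` to `pd.crosstab` so that each row becomes a set of proportions. The sketch below uses a small made-up table, not the lesson's dataset.

```python
import pandas as pd

# Made-up records for illustration only
staff = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "Sales", "IT", "IT"],
    "City": ["London", "London", "London", "Leeds", "London", "Leeds"],
})

# normalize='index' divides each row by its total, so every row sums to 1
row_share = pd.crosstab(staff["Department"], staff["City"], normalize="index")
print(row_share)
```

Using `normalize='columns'` instead would answer the reverse question: within each city, how staff are split across departments.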