Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the process of investigating a dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions, primarily through statistical summaries and visualisations. Coined by the statistician John Tukey in the 1970s, EDA is the bridge between raw data and formal modelling.

Why EDA Matters

Before building any model, you need to understand your data:

What does the data look like? What are the shapes and types?
Are there missing values, duplicates, or outliers?
What are the distributions of individual variables?
Are there relationships between variables?
What patterns and trends exist?
What features are most relevant to the problem?

Skipping EDA is like prescribing medicine without diagnosing the patient.
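
As a rough illustration of the data-quality questions above, the sketch below counts missing values and duplicate rows. The small DataFrame is made-up toy data, not the tutorial's dataset; the column names simply mirror the ones used later on.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset
df = pd.DataFrame({
    "Age": [25, 32, np.nan, 41, 32],
    "City": ["Leeds", "York", "York", None, "York"],
    "Salary": [30000, 42000, 38000, 51000, 42000],
})

# Number of missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# Share of missing values in each column (0.0 to 1.0)
missing_share = df.isna().mean()
print(missing_share)

# Rows that repeat an earlier row in every column
n_duplicates = int(df.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")
```

Whether to drop, impute, or keep such rows depends on the problem; at this stage the goal is only to know they exist.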

The EDA Process

Step 1: Understand the Data Structure

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('dataset.csv')

# Basic information
print(f"Shape: {df.shape}")       # (rows, columns)
print(f"Columns: {df.columns.tolist()}")
print(f"Data types:\n{df.dtypes}")

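# First few rows (print explicitly: a notebook only displays a bare
# expression when it is the last line of a cell)
print(df.head())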

# Detailed info
df.info()

Step 2: Statistical Summary

# Numeric summary (count, mean, std, min, quartiles, max)
print(df.describe())

# Include non-numeric columns (adds unique, top, freq)
print(df.describe(include='all'))

# Individual statistics
print(f"Mean age: {df['Age'].mean():.1f}")
print(f"Median salary: {df['Salary'].median():,.0f}")
print(f"Std dev: {df['Salary'].std():,.0f}")
print(f"Skewness: {df['Salary'].skew():.2f}")
print(f"Kurtosis: {df['Salary'].kurtosis():.2f}")

Interpreting Skewness and Kurtosis

Measure	Value	Interpretation
Skewness	0	Symmetric distribution
	> 0	Right-skewed (tail to the right)
	< 0	Left-skewed (tail to the left)
Kurtosis (excess, as returned by pandas)	0	Same tail weight as a normal distribution
	> 0	Heavier tails than normal, more outliers
	< 0	Lighter tails than normal, fewer outliers
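
To make the table concrete, here is a small sketch that compares a symmetric sample with a right-skewed one. The data are synthetic (drawn with NumPy's random generator, with an arbitrary seed) and are not part of the tutorial's dataset. In theory a normal distribution has skewness 0 and excess kurtosis 0, while an exponential distribution has skewness 2 and excess kurtosis 6; sample values will only approximate these.

```python
import numpy as np
import pandas as pd

# Synthetic samples; the seed is arbitrary and only makes the run repeatable
rng = np.random.default_rng(seed=0)
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))
right_skewed = pd.Series(rng.exponential(scale=1.0, size=10_000))

# pandas reports excess kurtosis, so a normal sample lands near 0
for name, sample in [("normal", symmetric), ("exponential", right_skewed)]:
    print(f"{name}: skew={sample.skew():.2f}, "
          f"excess kurtosis={sample.kurtosis():.2f}")
```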

Univariate Analysis

Univariate analysis examines one variable at a time:

Numerical Variables

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Histogram (drop missing values first, as the box plot below does)
axes[0].hist(df['Age'].dropna(), bins=20, color='steelblue', edgecolor='white')
axes[0].set_title('Age Distribution')
axes[0].axvline(df['Age'].mean(), color='red', linestyle='--', label='Mean')
axes[0].axvline(df['Age'].median(), color='green', linestyle='--', label='Median')
axes[0].legend()

# Box plot
axes[1].boxplot(df['Age'].dropna())
axes[1].set_title('Age Box Plot')

# KDE (Kernel Density Estimate); pandas needs SciPy installed for this
df['Age'].plot(kind='kde', ax=axes[2])
axes[2].set_title('Age KDE')

plt.tight_layout()
plt.show()
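
By default, Matplotlib's box plot draws as individual points any values lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied by hand to list those points, as in this minimal sketch. The ages below are invented for illustration, and the 1.5 multiplier is a convention rather than a fixed law.

```python
import pandas as pd

# Hypothetical ages with one implausibly large value
ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 38, 40, 95], name="Age")

# Quartiles and interquartile range
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = ages[(ages < lower) | (ages > upper)]
print(f"Fences: [{lower}, {upper}]")
print(outliers.tolist())  # [95]
```

A flagged value is a prompt to investigate, not an automatic deletion: it may be a data-entry error or a genuine extreme.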

Categorical Variables

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Value counts bar chart
df['City'].value_counts().plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('City Distribution')
axes[0].set_ylabel('Count')

# Proportion
df['City'].value_counts(normalize=True).plot(kind='bar', ax=axes[1], color='coral')
axes[1].set_title('City Proportions')
axes[1].set_ylabel('Proportion')

plt.tight_layout()
plt.show()

Bivariate Analysis

Bivariate analysis examines relationships between two variables:

Numerical vs Numerical

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot
axes[0].scatter(df['Age'], df['Salary'], alpha=0.5)
axes[0].set_title('Age vs Salary')
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Salary')

# Scatter with regression line (Seaborn)
sns.regplot(data=df, x='Age', y='Salary', ax=axes[1], scatter_kws={'alpha': 0.5})
axes[1].set_title('Age vs Salary (with trend)')

plt.tight_layout()
plt.show()
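
A scatter plot shows the shape of a relationship; a correlation coefficient puts a number on its strength. The sketch below computes Pearson (linear) and Spearman (rank-based) correlations with `DataFrame.corr`. The Age and Salary values are simulated under an assumed linear relationship plus noise, purely for illustration.

```python
import numpy as np
import pandas as pd

# Simulated data: salary rises with age, plus random noise (assumed model)
rng = np.random.default_rng(seed=42)
age = rng.integers(22, 65, size=200)
salary = 20_000 + 900 * age + rng.normal(0, 8_000, size=200)
df = pd.DataFrame({"Age": age, "Salary": salary})

# Pearson measures linear association
pearson = df.corr(method="pearson").loc["Age", "Salary"]

# Spearman measures monotonic association and is less sensitive to outliers
spearman = df.corr(method="spearman").loc["Age", "Salary"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

A large gap between the two coefficients often hints at outliers or a non-linear relationship worth inspecting in the scatter plot.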

Numerical vs Categorical

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot by category
sns.boxplot(data=df, x='Department', y='Salary', ax=axes[0])
axes[0].set_title('Salary by Department')

# Violin plot
sns.violinplot(data=df, x='Department', y='Salary', ax=axes[1])
axes[1].set_title('Salary Distribution by Department')

plt.tight_layout()
plt.show()
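
The numbers behind a grouped box plot can be tabulated with `groupby`. This is a minimal sketch on invented Department and Salary values; it also reports the group size, since a median based on one or two rows deserves little weight.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "IT", "IT", "HR"],
    "Salary": [40000, 45000, 50000, 60000, 70000, 38000],
})

# Per-department summary, highest median first
summary = (
    df.groupby("Department")["Salary"]
      .agg(["count", "median", "mean", "std"])
      .sort_values("median", ascending=False)
)
print(summary)
```

Here HR has a single row, so its standard deviation is NaN.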

Categorical vs Categorical

# Cross-tabulation
ct = pd.crosstab(df['Department'], df['City'])
print(ct)

# Stacked bar chart
ct.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.title('Department by City')
plt.ylabel('Count')
plt.show()
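
Raw counts are hard to compare when departments differ in size. One option, sketched below on made-up data, is to pass `normalize='index'` to `pd.crosstab` so that each row shows proportions that sum to 1.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "Sales", "IT", "IT"],
    "City": ["Leeds", "Leeds", "Leeds", "York", "Leeds", "York"],
})

# Share of each city within each department (rows sum to 1)
row_share = pd.crosstab(df["Department"], df["City"], normalize="index")
print(row_share)
```

In this toy table, 75% of Sales staff are in Leeds, against 50% of IT staff.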
