Pandas - Python Library


Ultimate Pandas Bootcamp



Chapter 1: Python & Pandas Refresher

1.1 Introduction to Python Syntax


Python syntax is easy to understand, designed to be readable and intuitive. It uses indentation rather than curly braces to define code blocks. This makes Python one of the most beginner-friendly programming languages. Python is also case-sensitive, which means 'Variable' and 'variable' are considered different. This subchapter will focus on understanding Python’s simple and clean syntax.


Example:

name = "John"  # Defining a string variable
print("Hello", name) # Printing a message

Output: Hello John



1.2 Variables, Lists, and Dictionaries


In Python, variables store data values. Lists are ordered collections of items, which can be of any data type, and allow duplicates. Dictionaries store key-value pairs, making them ideal for fast lookups of related data. This section will discuss how to define and work with these essential data structures.


Example:

age = 25  # Variable holding an integer
friends = ["Alice", "Bob"] # List of friends
profile = {"name": "John", "age": 25} # Dictionary with key-value pairs
print(profile["name"]) # Accessing value associated with the key 'name' in dictionary

Output: John



1.3 Functions and Conditionals


Functions allow you to group code into reusable blocks, making programs more organized and modular. Conditionals, using 'if', 'elif', and 'else', control the flow of execution based on logical decisions. These structures let the program take different actions depending on the input or conditions.


Example:

def greet(name):  # Function definition
    if name == "Alice": # Conditional check
        return "Hi Alice!" # Return greeting for Alice
    else:
        return "Hello!" # Default greeting
print(greet("Alice")) # Calling the function with 'Alice'

Output: Hi Alice!
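
The 'elif' keyword mentioned above checks further conditions in order; a small sketch extending the greeting function (the name 'Bob' is just an illustrative value):

def greet(name):  # Function with several branches
    if name == "Alice": # First condition
        return "Hi Alice!"
    elif name == "Bob": # Checked only if the first condition is False
        return "Hey Bob!"
    else:
        return "Hello!" # Default greeting
print(greet("Bob")) # Calling the function with 'Bob'

Output: Hey Bob!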



1.4 Importing pandas and numpy


Pandas and numpy are two fundamental libraries for data manipulation and analysis in Python. Pandas provides high-level data structures like DataFrames, which handle tabular data, while numpy offers support for numerical computations with arrays. These libraries are essential for data science and machine learning tasks.


Example:

import pandas as pd  # Import pandas library
import numpy as np # Import numpy library
data = np.array([[1, 2], [3, 4]]) # Create a 2D numpy array
df = pd.DataFrame(data, columns=["A", "B"]) # Convert numpy array to DataFrame
print(df) # Print the DataFrame

Output:
A B
0 1 2
1 3 4



Chapter 2: Series Essentials

2.1 Creating Series


A pandas Series is a one-dimensional labeled array, capable of holding any data type (integers, strings, floats, etc.). It is similar to a list or array in Python, but it comes with additional features like labels for indexing. A Series can be created from a list, dictionary, or even a scalar value.


Example:

import pandas as pd  # Importing pandas
data = [10, 20, 30, 40] # List of data
series = pd.Series(data) # Creating a Series from the list
print(series) # Printing the Series

Output:
0 10
1 20
2 30
3 40



2.2 Indexing & Slicing


Indexing in a Series allows you to access specific elements based on their position or label. Slicing is similar to indexing, but it allows you to retrieve a subset of elements. You can use integer-based or label-based indexing with pandas Series.


Example:

data = [10, 20, 30, 40]  # List of data
series = pd.Series(data) # Creating the Series
print(series[1]) # Indexing: access the element at position/label 1
print(series[1:3]) # Slicing: positions 1 up to (but not including) 3

Output:
20
1 20
2 30
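
Labels do not have to be integer positions; here is a short sketch of label-based access with a custom string index (the letter labels are illustrative):

labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c']) # Series with string labels
print(labeled['b']) # Label-based access
print(labeled.loc['a':'b']) # Label-based slicing includes both endpoints

Output:
20
a 10
b 20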



2.3 Series Methods


Pandas Series come with a variety of methods to perform operations on the data. These methods allow you to perform statistical, mathematical, and string operations, among others. Some common methods include `sum()`, `mean()`, `max()`, and `min()`.


Example:

data = [10, 20, 30, 40]  # List of data
series = pd.Series(data) # Creating the Series
print(series.sum()) # Calculate sum of elements
print(series.mean()) # Calculate mean of elements
print(series.max()) # Get maximum value in the Series

Output:
100
25.0
40



2.4 Handling NaNs in Series


NaN stands for 'Not a Number' and is used to represent missing or undefined values in a pandas Series. Pandas provides methods to detect, remove, or replace NaNs. Handling NaNs is crucial when performing data analysis to ensure the accuracy of the results.


Example:

data = [10, None, 30, None]  # List of data with NaNs
series = pd.Series(data) # Creating the Series
print(series.isna()) # Check for NaNs in the Series
series_filled = series.fillna(0) # Replace NaNs with 0
print(series_filled) # Printing the Series after replacing NaNs

Output:
0 False
1 True
2 False
3 True
dtype: bool
0 10.0
1 0.0
2 30.0
3 0.0
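
Missing entries can also be removed entirely with dropna(); a minimal sketch on the same Series:

series_dropped = series.dropna() # Drop entries that are NaN
print(series_dropped) # Only the non-missing values remain

Output:
0 10.0
2 30.0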



Chapter 3: DataFrame Fundamentals

3.1 Creating DataFrames


DataFrames are the core data structure in the pandas library. They are similar to Excel tables or SQL tables, consisting of rows and columns. A DataFrame can be created from various data structures such as dictionaries, lists, and even external files like CSVs or Excel files. It is the starting point for any data manipulation task in Python.

Example: Create DataFrame from dictionary

import pandas as pd  
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)

Output:
Name Age
Alice 25
Bob 30


3.2 Accessing Columns & Rows


DataFrames allow you to access specific rows and columns using both index positions and labels. Using `.loc[]` enables label-based access, while `.iloc[]` is used for index-based access. This flexibility makes it easy to select and work with parts of your data.

Example: Access rows and columns

print(df['Name'])  
print(df.loc[0])

Output:
0 Alice
1 Bob
Name Alice
Age 25
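
For purely position-based access, .iloc[] accepts integer positions; a short sketch on the same DataFrame:

print(df.iloc[1]) # Second row by position
print(df.iloc[0:2, 0]) # First two rows of the first column ('Name')

Output:
Name Bob
Age 30
0 Alice
1 Bob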


3.3 Basic Operations


Once you have a DataFrame, you can perform various operations on it. You can carry out arithmetic operations, statistical computations, and apply functions across columns or rows. Pandas makes these operations very straightforward.

Example: Add new column with doubled age

df['DoubleAge'] = df['Age'] * 2  
print(df)

Output:
Name Age DoubleAge
Alice 25 50
Bob 30 60


3.4 Adding & Deleting Columns


You can easily add new columns to a DataFrame by assigning values to a new column label. Deleting columns can be done with the `drop()` method, specifying the column name and the axis (1 for columns). This flexibility allows you to manage your data efficiently.

Example: Add and delete a column

df['City'] = ['NYC', 'LA']  
df = df.drop('DoubleAge', axis=1)
print(df)

Output:
Name Age City
Alice 25 NYC
Bob 30 LA


Chapter 4: Importing & Exporting Data

4.1 Reading CSV, Excel, and JSON


In data analysis, importing data is one of the first steps. Libraries like pandas allow you to read data from multiple file formats such as CSV, Excel, and JSON. These formats are commonly used in data storage and exchange, making it essential to learn how to work with them effectively.

# Real-world example: Reading CSV and JSON files
import pandas as pd # Import the pandas library
csv_data = pd.read_csv('data.csv') # Load a CSV file
json_data = pd.read_json('data.json') # Load a JSON file
print(csv_data.head()) # Display first 5 rows of CSV data

Output: First 5 rows of the CSV data


4.2 Exporting to Files


After processing or analyzing data, you may need to export it for further use or to share with others. pandas provides methods for exporting data to various formats such as CSV, Excel, and JSON. This makes it easy to save results or share data in a standardized format.

# Real-world example: Exporting data to CSV
data = pd.DataFrame({'name': ['Tom', 'Jerry'], 'age': [5, 3]}) # Create a DataFrame
data.to_csv('output.csv', index=False) # Export the DataFrame to a CSV file

Output: A new file 'output.csv' is created with the data


4.3 Setting Index while Loading


Sometimes, when loading data into a DataFrame, you may want to use a specific column as the index. This can help in organizing the data better, especially when dealing with large datasets. Using an appropriate index can improve the speed of data lookups and manipulations.

# Real-world example: Loading data with an index column
data = pd.read_csv('data.csv', index_col='id') # Use the 'id' column as the index
print(data.head()) # Display first 5 rows with 'id' as index

Output: DataFrame with 'id' as the index column


4.4 Working with Large Datasets


Handling large datasets can be challenging due to memory limitations. One way to handle this is by reading data in chunks, which allows you to process data piece-by-piece without loading the entire dataset into memory at once. This approach helps in efficiently managing and analyzing large datasets.

# Real-world example: Reading data in chunks
chunk_iter = pd.read_csv('large_data.csv', chunksize=1000) # Read data in chunks of 1000 rows
for chunk in chunk_iter:
    print(chunk.head()) # Process and display first 5 rows of each chunk

Output: First 5 rows from each 1000-row chunk
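
Chunked reading pairs naturally with incremental aggregation; the sketch below totals one column across chunks, assuming 'large_data.csv' has a numeric 'amount' column (both names are illustrative):

# Hypothetical example: aggregating a column chunk by chunk
total = 0 # Running total
for chunk in pd.read_csv('large_data.csv', chunksize=1000): # Stream the file in 1000-row pieces
    total += chunk['amount'].sum() # Aggregate each chunk, then combine
print(total) # Grand total computed without loading the whole file at once

Output: The total of the 'amount' column, accumulated across all chunks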


Chapter 5: Data Cleaning

5.1 Handling Null Values


Null values (or missing data) can impact model accuracy. Common techniques to handle them include removing rows, filling with default values, or using statistical measures (mean/median). Choosing the right strategy depends on the nature and importance of the data.

      import pandas as pd  
df = pd.DataFrame({'A': [1, 2, None, 4]})
df['A'] = df['A'].fillna(df['A'].mean())
print(df)
# This replaces null values in column A with the mean of the column

Output:
A
0 1.0
1 2.0
2 2.333
3 4.0


5.2 Type Conversions


Data may not always be in the correct format. For example, numerical values might be stored as strings. Type conversion helps ensure the data is in a usable and consistent format for analysis or modeling.

      import pandas as pd  
df = pd.DataFrame({'A': ['1', '2', '3']})
df['A'] = df['A'].astype(int)
print(df.dtypes)
# This converts the column 'A' from string to integer type

Output: A int64


5.3 Duplicates and Outliers


Duplicate data can skew analysis, while outliers can mislead models. Removing duplicates and identifying outliers using methods like IQR or Z-score is crucial for high-quality data preprocessing.

      import pandas as pd  
df = pd.DataFrame({'A': [1, 1, 2, 100]})
df = df.drop_duplicates()
print(df)
# This removes duplicate rows based on identical values

Output:
A
0 1
2 2
3 100
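
The paragraph above also mentions outliers; a minimal sketch of the common 1.5×IQR rule on a small illustrative sample (the threshold is a convention, not a fixed requirement):

s = pd.Series([10, 12, 11, 13, 100]) # Sample with one extreme value
q1, q3 = s.quantile(0.25), s.quantile(0.75) # First and third quartiles
iqr = q3 - q1 # Interquartile range
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)] # Values outside the fences
print(outliers)

Output:
4 100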


5.4 Using apply() for Cleaning


The apply() function in pandas allows applying a function to each row or column, enabling custom cleaning operations. It's especially useful for string cleaning or transforming values based on logic.

      import pandas as pd  
df = pd.DataFrame({'A': [' apple ', 'Banana', ' grape ']})
df['A'] = df['A'].apply(lambda x: x.strip().lower())
print(df)
# This strips extra spaces and converts strings to lowercase

Output:
A
0 apple
1 banana
2 grape


Chapter 6: Filtering & Sorting

6.1 Boolean Indexing


Boolean indexing allows us to filter rows of a DataFrame by using a condition that returns True or False for each row. It's a powerful way to filter data without needing to write loops. The condition can be based on any column value or a combination of columns.

Example: Boolean Indexing with Pandas

import pandas as pd
df = pd.DataFrame({'Age': [20, 25, 30], 'Name': ['Alice', 'Bob', 'Charlie']})
filtered_df = df[df['Age'] > 21] # Only rows where Age > 21
print(filtered_df)

Output:
Age Name
1 25 Bob
2 30 Charlie


6.2 Sorting Values & Index


Sorting helps in organizing data by arranging the values in ascending or descending order. Sorting can be done by column values or by the index. In pandas, the sort_values() function is used to sort the data based on column values, and sort_index() is used to sort by index.

Example: Sorting Values

import pandas as pd
df = pd.DataFrame({'Score': [90, 80, 85], 'Name': ['Alice', 'Bob', 'Charlie']})
sorted_df = df.sort_values(by='Score') # Sort by Score
print(sorted_df)

Output:
Score Name
1 80 Bob
2 85 Charlie
0 90 Alice


6.3 Conditional Filtering


Conditional filtering lets you select rows that meet one or more conditions. It's a crucial technique for narrowing down a large dataset to relevant information, such as filtering data based on ranges or specific criteria.

Example: Conditional Filtering

import pandas as pd
df = pd.DataFrame({'Age': [20, 25, 30], 'Salary': [3000, 4000, 5000]})
condition = df['Age'] > 21
filtered = df[condition] # Apply condition
print(filtered)

Output:
Age Salary
1 25 4000
2 30 5000


6.4 Multi-Condition Filters


Multi-condition filtering is the process of applying more than one condition to filter data. You can combine conditions using logical operators like & (and), | (or), and ~ (not), allowing for complex queries on the dataset.

Example: Multiple Conditions

import pandas as pd
df = pd.DataFrame({'Age': [20, 25, 30], 'Salary': [3000, 4000, 5000]})
filtered = df[(df['Age'] > 21) & (df['Salary'] > 3500)] # Both conditions must be true
print(filtered)

Output:
Age Salary
1 25 4000
2 30 5000
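
The | and ~ operators mentioned above work the same way; a short sketch on the same DataFrame (the thresholds are arbitrary):

either = df[(df['Age'] < 22) | (df['Salary'] > 4500)] # Rows matching at least one condition
print(either)
negated = df[~(df['Age'] > 21)] # Rows where the condition is NOT true
print(negated)

Output:
Age Salary
0 20 3000
2 30 5000
Age Salary
0 20 3000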


Chapter 7: Advanced String Operations

7.1 Vectorized String Methods


Vectorized string methods in libraries like pandas allow batch operations on text data. This means that instead of applying string methods individually to each element in a list or array, these methods can be applied to the entire collection of data at once. This approach is highly efficient and reduces the need for writing loops, speeding up processing time.


Example: Converting strings to uppercase using pandas

      import pandas as pd

# Create a pandas Series with strings
data = pd.Series(['Hello', 'World', 'AI'])

# Apply vectorized string method to convert all strings to uppercase
result = data.str.upper()

# Print the result
print(result)

Output:
0 HELLO
1 WORLD
2 AI



7.2 Cleaning Text Data


Cleaning text data is a crucial step before performing any text analysis or applying machine learning models. This process includes removing unwanted characters (such as punctuation or extra spaces), converting text to lowercase, and standardizing the format. Properly cleaned text ensures better model performance and more reliable results in natural language processing (NLP) tasks.


Example: Cleaning text data by stripping spaces and replacing unwanted characters

      text = "  Hello! Welcome...to AI.  "

# Strip leading/trailing spaces, convert to lowercase, and replace unwanted characters
cleaned = text.strip().lower().replace("...", " ")

# Print the cleaned text
print(cleaned)

Output:
hello! welcome to ai.



7.3 Extracting with RegEx


Regular expressions (RegEx) allow us to search for specific patterns within strings. This is particularly useful when we need to extract information such as email addresses, phone numbers, dates, or any structured data embedded within unstructured text. RegEx provides a powerful and flexible way to perform pattern matching in strings.


Example: Extracting an email address using regular expressions

      import re

text = "Contact: hello@example.com"

# Regular expression to match email address
match = re.findall(r'\S+@\S+', text)

# Print the extracted email address
print(match)

Output:
['hello@example.com']



7.4 String Replacements & Maps


String replacements and maps allow us to transform specific parts of a string or a set of strings. Using replacement rules (e.g., a dictionary), we can replace substrings in text or change one value to another across multiple strings. This technique is especially useful in data cleaning, normalizing terms, or converting categorical values into a standardized format.


Example: Replacing words in a list using a dictionary

      words = ["hi", "bye", "hello"]

# Create a mapping dictionary
mapping = {"hi": "greetings", "bye": "farewell"}

# Replace words using the dictionary
replaced = [mapping.get(word, word) for word in words]

# Print the replaced list
print(replaced)

Output:
['greetings', 'farewell', 'hello']



Chapter 8: Aggregation and Grouping

8.1 groupby() Basics


The `groupby()` function in Pandas is used to split the data into groups based on some criteria. This operation is useful for summarizing or transforming data in a grouped fashion. The groupby object can then be used to apply aggregation functions, transformations, or filtering operations. The basic idea is that it allows you to perform operations on each group independently.


Example:

import pandas as pd  # Import pandas library
data = {'Category': ['A', 'A', 'B', 'B', 'C', 'C'], # Sample data
'Value': [10, 20, 30, 40, 50, 60]} # Corresponding values
df = pd.DataFrame(data) # Create DataFrame
grouped = df.groupby('Category') # Group data by 'Category' column
for name, group in grouped: # Loop through each group
    print(name) # Print the group name
    print(group) # Print the data for each group

Output:
A
Category Value
0 A 10
1 A 20
B
Category Value
2 B 30
3 B 40
C
Category Value
4 C 50
5 C 60



8.2 Aggregation Functions


Pandas provides several aggregation functions such as `sum()`, `mean()`, `max()`, and `min()`. These functions can be applied to each group after using `groupby()` to summarize the data in meaningful ways, such as calculating the total, average, or minimum of values for each group.


Example:

df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B', 'C', 'C'],  # Sample data
'Value': [10, 20, 30, 40, 50, 60]}) # Values
grouped = df.groupby('Category')['Value'] # Group by 'Category' and focus on 'Value' column
result = grouped.sum() # Calculate sum for each group
print(result) # Print the result

Output:
Category
A 30
B 70
C 110
Name: Value, dtype: int64



8.3 Custom Aggregations


Custom aggregations allow you to apply your own function to each group. This is useful when built-in aggregation functions do not meet your needs. You can define your own function and pass it to the `agg()` method to perform more complex calculations.


Example:

def custom_func(x):  # Custom aggregation function
    return x.max() - x.min() # Difference between max and min values
df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B', 'C', 'C'], # Data
'Value': [10, 20, 30, 40, 50, 60]})
grouped = df.groupby('Category')['Value'] # Group by 'Category'
result = grouped.agg(custom_func) # Apply custom aggregation function
print(result) # Print result

Output:
Category
A 10
B 10
C 10
Name: Value, dtype: int64



8.4 Hierarchical Grouping


Hierarchical grouping, or multi-level grouping, allows you to group data by more than one column. This is useful when you want to analyze the data in a more granular way, for example, grouping by both 'Category' and 'Subcategory'. Pandas allows you to perform aggregation operations on multi-level indexes.


Example:

df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B', 'C', 'C'],  # Sample data
'Subcategory': ['X', 'Y', 'X', 'Y', 'X', 'Y'], # Subcategories
'Value': [10, 20, 30, 40, 50, 60]}) # Values
grouped = df.groupby(['Category', 'Subcategory'])['Value'] # Group by both 'Category' and 'Subcategory'
result = grouped.sum() # Sum for each subgroup
print(result) # Print result

Output:
Category Subcategory
A X 10
Y 20
B X 30
Y 40
C X 50
Y 60
Name: Value, dtype: int64



Chapter 9: Combining Datasets

9.1 Concatenation


Concatenation is the process of joining two or more datasets along a particular axis (rows or columns). In pandas, the `concat()` function allows you to combine multiple DataFrames vertically or horizontally. This operation does not require a common column between the datasets.


Example:

import pandas as pd  # Importing pandas
data1 = {'A': [1, 2], 'B': [3, 4]} # First dataset
data2 = {'A': [5, 6], 'B': [7, 8]} # Second dataset
df1 = pd.DataFrame(data1) # Create first DataFrame
df2 = pd.DataFrame(data2) # Create second DataFrame
df_combined = pd.concat([df1, df2], ignore_index=True) # Concatenate DataFrames
print(df_combined) # Print the concatenated DataFrame

Output:
A B
0 1 3
1 2 4
2 5 7
3 6 8



9.2 Merging on Keys


Merging is the process of combining two DataFrames based on a key or common column. It is similar to SQL joins. You can perform inner, outer, left, or right joins to specify how you want to combine the data based on the keys.


Example:

data1 = {'ID': [1, 2], 'Name': ['Alice', 'Bob']}  # First dataset
data2 = {'ID': [1, 2], 'Age': [25, 30]} # Second dataset
df1 = pd.DataFrame(data1) # Create first DataFrame
df2 = pd.DataFrame(data2) # Create second DataFrame
df_merged = pd.merge(df1, df2, on='ID') # Merge on the 'ID' column
print(df_merged) # Print the merged DataFrame

Output:
ID Name Age
0 1 Alice 25
1 2 Bob 30
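
The join type is selected with the how parameter; a minimal sketch of an outer join where the keys only partly overlap (the sample rows are illustrative):

left = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']}) # Left table
right = pd.DataFrame({'ID': [2, 3], 'Age': [30, 40]}) # Right table with a different key set
outer = pd.merge(left, right, on='ID', how='outer') # Keep keys from both sides
print(outer)

Output:
ID Name Age
0 1 Alice NaN
1 2 Bob 30.0
2 3 NaN 40.0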



9.3 join() vs merge()


The `join()` method is used for merging DataFrames based on their index, while `merge()` is used to merge DataFrames based on one or more key columns. `merge()` is more flexible, allowing you to specify different types of joins (inner, outer, left, right), while `join()` is typically used when the join keys are the indices.


Example:

data1 = {'A': [1, 2]}  # First dataset
data2 = {'B': [3, 4]} # Second dataset
df1 = pd.DataFrame(data1, index=[1, 2]) # First DataFrame with index
df2 = pd.DataFrame(data2, index=[1, 2]) # Second DataFrame with index
df_joined = df1.join(df2) # Join DataFrames based on index
print(df_joined) # Print the joined DataFrame

Output:
A B
1 1 3
2 2 4



9.4 Handling Overlaps and Duplicates


When combining datasets, it is important to handle overlaps and duplicates effectively. You can use the `drop_duplicates()` method to remove duplicate rows. Additionally, when merging datasets, you may need to resolve conflicts or overlapping columns using parameters like `suffixes` to avoid column name clashes.


Example:

data1 = {'ID': [1, 2, 2], 'Name': ['Alice', 'Bob', 'Bob']}  # First dataset with duplicates
df1 = pd.DataFrame(data1) # Create first DataFrame
df_no_duplicates = df1.drop_duplicates() # Remove duplicates
print(df_no_duplicates) # Print the DataFrame without duplicates

Output:
ID Name
0 1 Alice
1 2 Bob
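
When both DataFrames share a non-key column, the suffixes parameter mentioned above keeps the names distinct; a short sketch (the column names are illustrative):

df_a = pd.DataFrame({'ID': [1, 2], 'Score': [80, 90]}) # First dataset
df_b = pd.DataFrame({'ID': [1, 2], 'Score': [85, 95]}) # Second dataset with the same column name
merged = pd.merge(df_a, df_b, on='ID', suffixes=('_left', '_right')) # Disambiguate overlapping columns
print(merged)

Output:
ID Score_left Score_right
0 1 80 85
1 2 90 95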



Chapter 10: Pivoting and Reshaping

10.1 pivot() vs pivot_table()


The pivot() function is used to reshape data by turning unique values from one column into separate columns, while the pivot_table() function can be used for summarizing data by applying aggregation functions (like sum, mean) to the data.

Example: Using pivot()

import pandas as pd  
data = {'Date': ['2021-01-01', '2021-01-01', '2021-01-02'], 'City': ['A', 'B', 'A'], 'Sales': [200, 150, 300]}
df = pd.DataFrame(data)
pivot_df = df.pivot(index='Date', columns='City', values='Sales')
print(pivot_df)

Output:
City A B
Date
2021-01-01 200 150
2021-01-02 300 NaN


Example: Using pivot_table()

pivot_table_df = df.pivot_table(index='Date', columns='City', values='Sales', aggfunc='sum')  
print(pivot_table_df)

Output:
City A B
Date
2021-01-01 200 150
2021-01-02 300 NaN


10.2 Melting DataFrames


Melting is the opposite of pivoting. It unpivots a DataFrame by converting columns into rows. The melt() function is useful when you need to convert wide-format data into long-format data, making it easier to analyze.

Example: Melting a DataFrame

df_melted = df.melt(id_vars=['Date'], value_vars=['City', 'Sales'], var_name='Variable', value_name='Value')  
print(df_melted)

Output:
Date Variable Value
0 2021-01-01 City A
1 2021-01-01 City B
2 2021-01-02 City A
3 2021-01-01 Sales 200
4 2021-01-01 Sales 150
5 2021-01-02 Sales 300


10.3 stack() and unstack()


The stack() function stacks the columns of a DataFrame into rows (vertical axis). The unstack() function does the opposite: it converts the rows of a DataFrame into columns (horizontal axis). These functions are useful when working with multi-level indices (MultiIndex).

Example: Using stack()

stacked_df = pivot_df.stack()  
print(stacked_df)

Output:
2021-01-01 A 200
2021-01-01 B 150
2021-01-02 A 300


Example: Using unstack()

unstacked_df = stacked_df.unstack()  
print(unstacked_df)

Output:
City A B
Date
2021-01-01 200 150
2021-01-02 300 NaN


10.4 Reshape with MultiIndex


MultiIndex allows you to have multiple levels of indexing in a DataFrame. You can reshape a DataFrame using MultiIndex for more complex data analysis tasks. You can use stack() and unstack() to work with MultiIndex as well.

Example: Creating a MultiIndex

index = pd.MultiIndex.from_tuples([('2021-01-01', 'A'), ('2021-01-01', 'B'), ('2021-01-02', 'A')], names=['Date', 'City'])  
multiindex_df = pd.DataFrame({'Sales': [200, 150, 300]}, index=index)
print(multiindex_df)

Output:
Sales
Date City
2021-01-01 A 200
2021-01-01 B 150
2021-01-02 A 300


Example: Reshaping with unstack()

reshaped_df = multiindex_df.unstack()  
print(reshaped_df)

Output:
City A B
Date
2021-01-01 200 150
2021-01-02 300 NaN


Chapter 11: Working with Dates and Times

11.1 Parsing Dates


Working with dates and times is crucial for time-series analysis. Parsing dates refers to converting date information from a string format into a usable DateTime format. This makes it easier to perform operations like filtering and aggregating data based on specific dates.

# Real-world example: Parsing dates
import pandas as pd # Import the pandas library
data = pd.to_datetime('2025-04-20') # Convert a string to a DateTime object
print(data) # Output the parsed date

Output: '2025-04-20 00:00:00'


11.2 Date Indexing


Date indexing is used to set a column with date-related values as the index of a DataFrame. This allows for efficient querying and manipulation of data based on dates. It is particularly useful in time-series analysis, where each row corresponds to a specific point in time.

# Real-world example: Date indexing
data = pd.DataFrame({'date': ['2025-01-01', '2025-02-01', '2025-03-01'], 'value': [10, 20, 30]})
data['date'] = pd.to_datetime(data['date']) # Convert date column to DateTime
data.set_index('date', inplace=True) # Set the 'date' column as the index
print(data) # Display the DataFrame with date as index

Output: DataFrame with 'date' as the index column


11.3 Resampling and Frequency Conversion


Resampling refers to the process of converting time-series data from one frequency to another. This can be useful for aggregating or downsampling data at specific intervals, such as converting daily data into weekly or monthly data. Frequency conversion is typically used to change the granularity of time-series data.

# Real-world example: Resampling and frequency conversion
data = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2025-01-01', periods=5, freq='D'))
resampled_data = data.resample('M').sum() # Resample data to monthly frequency and sum values
print(resampled_data) # Display resampled data

Output: The sum of the values for each month


11.4 Rolling & Expanding Windows


Rolling windows and expanding windows are methods for calculating statistics over a specific time window. A rolling window computes a statistic (e.g., mean, sum) over a moving window of fixed size, while expanding windows calculate the statistic for all data up to the current point.

# Real-world example: Rolling and expanding windows
data = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2025-01-01', periods=5, freq='D'))
rolling_mean = data.rolling(window=2).mean() # Compute rolling mean with a window size of 2
expanding_mean = data.expanding().mean() # Compute expanding mean
print(rolling_mean) # Display rolling mean
print(expanding_mean) # Display expanding mean

Output: The rolling mean and expanding mean values for each point in the series


Chapter 12: Time Series Analysis

12.1 Time Series Slicing


Time series slicing refers to extracting specific time periods from a time series dataset. This is useful for focusing on specific time frames such as years, months, or days for analysis or forecasting.

      import pandas as pd  
date_range = pd.date_range('2021-01-01', periods=5, freq='D')
df = pd.DataFrame({'Date': date_range, 'Value': [10, 20, 30, 40, 50]})
sliced_df = df[df['Date'] >= '2021-01-03']
print(sliced_df)
# This slices the data from January 3rd, 2021 onward

Output:
Date Value
2 2021-01-03 30
3 2021-01-04 40
4 2021-01-05 50


12.2 Moving Averages


Moving averages smooth out short-term fluctuations and highlight longer-term trends in time series data. This technique is frequently used in forecasting and trend analysis.

      import pandas as pd  
date_range = pd.date_range('2021-01-01', periods=5, freq='D')
df = pd.DataFrame({'Date': date_range, 'Value': [10, 20, 30, 40, 50]})
df['Moving_Avg'] = df['Value'].rolling(window=3).mean()
print(df)
# This calculates the 3-day moving average of the 'Value' column

Output:
Date Value Moving_Avg
0 2021-01-01 10 NaN
1 2021-01-02 20 NaN
2 2021-01-03 30 20.0
3 2021-01-04 40 30.0
4 2021-01-05 50 40.0


12.3 Time Zone Conversion


Time zone conversion is essential when working with global data to ensure that timestamps are accurate across different regions. This involves converting the time series data to a specific time zone.

      import pandas as pd  
date_range = pd.date_range('2021-01-01 12:00', periods=5, freq='D')
df = pd.DataFrame({'Date': date_range, 'Value': [10, 20, 30, 40, 50]})
df['Date'] = df['Date'].dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
print(df)
# This converts the time zone from UTC to US Eastern Time

Output:
Date Value
0 2021-01-01 07:00:00-05:00 10
1 2021-01-02 07:00:00-05:00 20
2 2021-01-03 07:00:00-05:00 30
3 2021-01-04 07:00:00-05:00 40
4 2021-01-05 07:00:00-05:00 50


12.4 Shifting Data


Shifting data is a technique to move data forward or backward in time. This is useful for creating lag features in time series forecasting or calculating differences between time periods.

      import pandas as pd  
date_range = pd.date_range('2021-01-01', periods=5, freq='D')
df = pd.DataFrame({'Date': date_range, 'Value': [10, 20, 30, 40, 50]})
df['Shifted'] = df['Value'].shift(1)
print(df)
# This shifts the 'Value' column by 1 period down

Output:
Date Value Shifted
0 2021-01-01 10 NaN
1 2021-01-02 20 10.0
2 2021-01-03 30 20.0
3 2021-01-04 40 30.0
4 2021-01-05 50 40.0


Chapter 13: Visualization with Matplotlib & pandas

13.1 Line, Bar, and Pie Charts


Line, bar, and pie charts are fundamental ways of visualizing data. A line chart is used to display trends over time, while bar charts are useful for comparing quantities, and pie charts show the proportions of categories.

Example: Line Chart with Matplotlib

import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.title('Line Chart')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

Output:
A line chart will be displayed showing the trend of values on the X and Y axes.


Example: Bar Chart

categories = ['A', 'B', 'C', 'D']
values = [10, 20, 30, 40]
plt.bar(categories, values)
plt.title('Bar Chart')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()

Output:
A bar chart will be displayed with the categories on the X-axis and the values on the Y-axis.


Example: Pie Chart

sizes = [25, 35, 40]
labels = ['Category A', 'Category B', 'Category C']
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Pie Chart')
plt.show()

Output:
A pie chart will be displayed showing the distribution of the categories with percentage values.


13.2 Histograms & Boxplots


Histograms and boxplots are helpful in analyzing the distribution of data. Histograms show the frequency of data points in intervals, while boxplots highlight the range, median, and outliers in data.

Example: Histogram

import numpy as np
data = np.random.randn(1000)
plt.hist(data, bins=30)
plt.title('Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Output:
A histogram will be displayed showing the distribution of values in the dataset.


Example: Boxplot

data = [np.random.randn(100), np.random.randn(100) + 1]
plt.boxplot(data)
plt.title('Boxplot')
plt.show()

Output:
A boxplot will be displayed with boxes showing the distribution of the data and potential outliers.


13.3 Plotting with pandas


pandas provides built-in plotting functionality to visualize data directly from DataFrames. It uses Matplotlib as a backend and makes it easy to create plots with simple commands.

Example: Plotting DataFrame

import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 4, 6, 8, 10]})
df.plot(x='x', y='y', kind='line')
plt.title('pandas Line Plot')
plt.show()

Output:
A line plot will be displayed showing the relationship between columns 'x' and 'y' from the DataFrame.


Example: Bar Plot from pandas

df.plot(x='x', y='y', kind='bar')
plt.title('pandas Bar Plot')
plt.show()

Output:
A bar plot will be displayed using the values from the DataFrame.


13.4 Dual Axis and Subplots


Dual axis and subplots allow you to plot multiple graphs on the same figure. Dual-axis plots show two sets of data on different y-axes, while subplots allow multiple charts to be placed in a grid layout.

Example: Dual Axis Plot

x = [1, 2, 3, 4, 5]
y1 = [10, 20, 25, 30, 40]
y2 = [100, 90, 80, 70, 60]
fig, ax1 = plt.subplots()
ax1.plot(x, y1, 'g-') # First y-axis
ax2 = ax1.twinx()
ax2.plot(x, y2, 'b-') # Second y-axis
plt.show()

Output:
A plot with two y-axes will be displayed, one showing the 'y1' values and another showing 'y2' values.


Example: Subplots

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(x, y1)
ax2.plot(x, y2)
plt.show()

Output:
Two plots will be displayed side by side, each showing different data on the same figure.


Chapter 14: Working with Real-World Datasets

14.1 E-Commerce Orders


E-commerce orders data typically contains information on product purchases, such as customer details, order date, product categories, prices, and more. Working with this data allows businesses to gain insights into customer behavior, sales trends, and performance metrics.


Example: Loading and analyzing e-commerce order data using pandas

      import pandas as pd

# Load the e-commerce dataset
data = pd.read_csv('ecommerce_orders.csv')

# Display the first few rows of the dataset
print(data.head())

# Analyze total sales
total_sales = data['total_price'].sum()
print(f'Total Sales: {total_sales}')

Output:
First few rows of the dataset
Total Sales: 50000



14.2 COVID-19 Data


COVID-19 data includes information about the spread of the virus, infection rates, mortality rates, vaccination rates, and more. This data is often used for epidemiological studies and to inform public health policies. Understanding and analyzing this dataset can provide crucial insights into the pandemic's progression.


Example: Analyzing COVID-19 cases by country

      # Load the COVID-19 dataset
covid_data = pd.read_csv('covid_data.csv')

# Group by country and calculate total cases
country_cases = covid_data.groupby('country')['cases'].sum()

# Print the total cases for each country
print(country_cases)

Output:
Country1: 100000
Country2: 200000
...



14.3 NYC Property Sales


The NYC property sales dataset includes information about properties sold in New York City, such as sale price, property type, and location. This dataset is valuable for real estate analysis, helping to identify trends in property values and market demand.


Example: Analyzing property sale prices in NYC

      # Load the NYC property dataset
property_data = pd.read_csv('nyc_property_sales.csv')

# Analyze the average sale price
avg_price = property_data['sale_price'].mean()
print(f'Average Sale Price: {avg_price}')

Output:
Average Sale Price: 750000



14.4 Airline On-Time Data


Airline on-time data includes information about flights, including departure and arrival times, delays, and cancellations. This data is essential for analyzing airline performance, identifying patterns in delays, and improving operational efficiency.


Example: Analyzing flight delays by airline

      # Load the airline on-time dataset
flight_data = pd.read_csv('airline_on_time.csv')

# Group by airline and calculate average delay
airline_delays = flight_data.groupby('airline')['arrival_delay'].mean()

# Print the average delays for each airline
print(airline_delays)

Output:
Airline1: 15 minutes
Airline2: 30 minutes
...



Chapter 15: Data Export & Storage

15.1 Export to CSV, Excel, JSON


Exporting data to various file formats like CSV, Excel, and JSON is a common task in data analysis. Pandas makes it easy to save DataFrames to these formats using methods like `to_csv()`, `to_excel()`, and `to_json()`. These formats are widely supported and can be opened and analyzed in other tools such as spreadsheets or databases.


Example:

import pandas as pd  # Import pandas library
data = {'Name': ['Alice', 'Bob', 'Charlie'], # Sample data
'Age': [25, 30, 35]}
df = pd.DataFrame(data) # Create DataFrame
df.to_csv('data.csv', index=False) # Export DataFrame to CSV without row index
df.to_excel('data.xlsx', index=False) # Export DataFrame to Excel without row index
df.to_json('data.json') # Export DataFrame to JSON

Output: Files 'data.csv', 'data.xlsx', and 'data.json' will be created in the working directory.



15.2 HDF5 and Feather Formats


HDF5 and Feather are binary file formats that provide fast and efficient storage for large datasets. HDF5 is often used for storing complex data hierarchies, while Feather is optimized for fast reading and writing of large datasets. Both formats are commonly used for storing data in data science and machine learning workflows.


Example:

df.to_hdf('data.h5', key='df', mode='w')  # Export DataFrame to HDF5 format
df.to_feather('data.feather') # Export DataFrame to Feather format

Output: Files 'data.h5' and 'data.feather' will be created in the working directory.



15.3 Compression Techniques


Compression is useful for reducing file size and improving storage efficiency, especially when working with large datasets. Pandas can compress text-based outputs such as CSV and JSON using algorithms like 'gzip', 'bz2', 'zip', and 'xz' (Excel .xlsx files are already zip archives internally), while HDF5 files support compression through the `complevel` and `complib` options. Using compression can significantly reduce the size of data files without losing information.


Example:

df.to_csv('data_compressed.csv.gz', index=False, compression='gzip')  # Export CSV with gzip compression
df.to_json('data_compressed.json.gz', compression='gzip') # Export JSON with gzip compression
df.to_hdf('data_compressed.h5', key='df', mode='w', complib='blosc', complevel=9) # Export HDF5 with blosc compression

Output: Files 'data_compressed.csv.gz', 'data_compressed.json.gz', and 'data_compressed.h5' will be created with compression applied.



15.4 Version Control with Data


Version control for data ensures that different versions of datasets are stored and managed efficiently. This is particularly important in collaborative environments where data is constantly evolving. By using tools like Git or DVC (Data Version Control), you can track changes to datasets and revert to previous versions when necessary. This helps maintain consistency and reproducibility in data analysis projects.


Example:

# Using DVC to track a data file version
# Initialize DVC in a Git repository
!git init  # Initialize a Git repository
!dvc init # Initialize DVC (Data Version Control)
!dvc add data.csv # Add data file to DVC
!git add data.csv.dvc # Add DVC file to Git
!git commit -m "Add data version 1" # Commit to Git
!dvc push # Push data to remote storage

Output: The data file 'data.csv' will be tracked with DVC, and version control will be applied using Git.



Chapter 16: Advanced Indexing Techniques

16.1 MultiIndex Introduction


MultiIndex allows you to have multiple levels of indexing on a DataFrame or Series. This provides the ability to represent more complex data structures, such as hierarchical or multi-dimensional data, using a single object. MultiIndex is commonly used when dealing with data grouped by more than one criterion.


Example:

import pandas as pd  # Importing pandas
arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]] # Defining the multi-level index
index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number')) # Creating MultiIndex
data = [10, 20, 30, 40] # Data for the Series
multi_index_series = pd.Series(data, index=index) # Creating a Series with MultiIndex
print(multi_index_series) # Print the Series with MultiIndex

Output:
Letter Number
A 1 10
2 20
B 1 30
2 40



16.2 Indexing by Level


Indexing by level allows you to access data in a MultiIndex structure by specifying a level or combination of levels. This is useful for slicing data based on specific index levels without needing to reset or flatten the index.


Example:

arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]]  # Defining the multi-level index
index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number')) # Creating MultiIndex
data = [10, 20, 30, 40] # Data for the Series
multi_index_series = pd.Series(data, index=index) # Creating a Series with MultiIndex
print(multi_index_series['A']) # Indexing by the first level (Letter='A')
print(multi_index_series.loc['B']) # Indexing by the first level (Letter='B')

Output:
Number
1 10
2 20
dtype: int64
Number
1 30
2 40
dtype: int64



16.3 Swapping and Sorting Index


Swapping the index levels allows you to change the order of the index levels, which can be useful for rearranging or sorting the data. Sorting by index can also be done to arrange the data in a specified order, either ascending or descending.


Example:

arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]]  # Defining the multi-level index
index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number')) # Creating MultiIndex
data = [10, 20, 30, 40] # Data for the Series
multi_index_series = pd.Series(data, index=index) # Creating a Series with MultiIndex
swapped = multi_index_series.swaplevel() # Swapping index levels
print(swapped) # Print Series with swapped index
sorted_series = multi_index_series.sort_index() # Sorting by index
print(sorted_series) # Print sorted Series

Output:
Number Letter
1 A 10
B 30
2 A 20
B 40
dtype: int64
Letter Number
A 1 10
2 20
B 1 30
2 40



16.4 Index Set Operations


Index set operations allow you to perform mathematical set operations (like union, intersection, and difference) on the index of a pandas object. This can be useful for comparing and manipulating the indices of different datasets.


Example:

index1 = pd.Index(['A', 'B', 'C', 'D'])  # First index
index2 = pd.Index(['B', 'C', 'E', 'F']) # Second index
union_index = index1.union(index2) # Union of indices (all unique elements)
intersection_index = index1.intersection(index2) # Intersection of indices (common elements)
difference_index = index1.difference(index2) # Difference of indices (elements in index1 but not index2)
print(union_index) # Print union of indices
print(intersection_index) # Print intersection of indices
print(difference_index) # Print difference of indices

Output:
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
Index(['B', 'C'], dtype='object')
Index(['A', 'D'], dtype='object')



Chapter 17: Custom Functions and apply()

17.1 Row vs Column Functions


In pandas, you can apply custom functions to rows or columns of a DataFrame using the apply() method. When applying a function to columns, the function is applied to each individual column, whereas applying it to rows means the function will be applied to each row. The axis parameter is used to specify whether you want to apply the function along rows (axis=1) or columns (axis=0).

Example: Apply function to columns

import pandas as pd  
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
def sum_func(col):
    return col.sum()
column_sum = df.apply(sum_func, axis=0)
print(column_sum)

Output:
A 6
B 15


Example: Apply function to rows

row_sum = df.apply(sum_func, axis=1)  
print(row_sum)

Output:
0 5
1 7
2 9


17.2 Lambda Functions


A lambda function is a small anonymous function that can be defined in a single line. It is often used with the apply() method to create quick functions for operations like transformations and computations.

Example: Using a lambda function for transformation

df['A_squared'] = df['A'].apply(lambda x: x**2)  
print(df)

Output:
A B A_squared
0 1 4 1
1 2 5 4
2 3 6 9


Example: Using lambda function for conditional transformation

df['A_label'] = df['A'].apply(lambda x: 'Small' if x < 3 else 'Large')  
print(df)

Output:
A B A_squared A_label
0 1 4 1 Small
1 2 5 4 Small
2 3 6 9 Large


17.3 Using map() vs apply()


The map() function is used for element-wise transformations, and it works specifically with Series (one-dimensional data). In contrast, the apply() function is more flexible and can be used with both Series and DataFrames, enabling more complex transformations across rows or columns. map() is generally faster than apply() when applied to Series.

Example: Using map() for element-wise transformation

df['B_mapped'] = df['B'].map(lambda x: x * 10)  
print(df)

Output:
A B A_squared A_label B_mapped
0 1 4 1 Small 40
1 2 5 4 Small 50
2 3 6 9 Large 60


Example: Using apply() for row-wise transformation

df['row_sum'] = df.apply(lambda row: row['A'] + row['B'], axis=1)  
print(df)

Output:
A B A_squared A_label B_mapped row_sum
0 1 4 1 Small 40 5
1 2 5 4 Small 50 7
2 3 6 9 Large 60 9


17.4 Complex Transformations


Complex transformations involve applying more advanced logic to the DataFrame or Series. This may include combining multiple columns, performing conditional operations, or using external functions within the transformation.

Example: Conditional transformation with multiple columns

df['A_and_B'] = df.apply(lambda row: row['A'] * row['B'] if row['A'] > 1 else row['A'] + row['B'], axis=1)  
print(df)

Output:
A B A_squared A_label B_mapped row_sum A_and_B
0 1 4 1 Small 40 5 5
1 2 5 4 Small 50 7 10
2 3 6 9 Large 60 9 18


Example: Applying an external function

def multiply(a, b):  
    return a * b
df['multiplication'] = df.apply(lambda row: multiply(row['A'], row['B']), axis=1)
print(df)

Output:
A B A_squared A_label B_mapped row_sum A_and_B multiplication
0 1 4 1 Small 40 5 5 4
1 2 5 4 Small 50 7 10 10
2 3 6 9 Large 60 9 18 18


Chapter 18: Performance Optimization

18.1 Efficient Data Types


Optimizing the data types used in your dataset can significantly improve performance, especially when working with large datasets. By choosing more efficient data types, such as using integers instead of floats, or reducing memory usage by using smaller types like 'float32' instead of 'float64', you can save both memory and processing time.

# Real-world example: Efficient data types
import pandas as pd # Import the pandas library
data = pd.DataFrame({'col1': [1.0, 2.5, 3.8], 'col2': [5.2, 6.3, 7.4]}) # Create a DataFrame
data['col1'] = data['col1'].astype('float32') # Convert to more efficient float32
data['col2'] = data['col2'].astype('float32') # Convert to more efficient float32
print(data.dtypes) # Output the data types of each column

Output: The data types of 'col1' and 'col2' are now 'float32'


18.2 Using Categorical Data


Categorical data refers to data that can take on a limited, fixed number of values, often representing categories such as 'male/female' or 'red/blue/green'. Using the 'category' data type in pandas reduces memory usage and increases performance when dealing with repetitive text values.

# Real-world example: Using categorical data
data = pd.DataFrame({'category_col': ['red', 'blue', 'green', 'red', 'blue']}) # Create a DataFrame
data['category_col'] = data['category_col'].astype('category') # Convert to categorical type
print(data.dtypes) # Output the data type of the 'category_col' column

Output: The data type of 'category_col' is now 'category'
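
The saving can be checked directly with memory_usage(); a minimal sketch comparing the two dtypes (the repeated colour values are illustrative):

colors = pd.Series(['red', 'blue', 'green'] * 1000) # Repetitive string data stored as object dtype
as_category = colors.astype('category') # Same values stored as integer codes plus a small lookup table
print(colors.memory_usage(deep=True)) # Bytes used by the object-dtype Series
print(as_category.memory_usage(deep=True)) # Typically far fewer bytes

Output: Two byte counts, with the categorical version using much less memory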


18.3 Vectorization Over Loops


Vectorization refers to replacing explicit loops with operations that are performed over entire arrays or columns of data. By using vectorized operations, you can take advantage of optimized C code behind libraries like NumPy and pandas, leading to faster execution times compared to using Python loops.

# Real-world example: Vectorization over loops
import numpy as np # Import NumPy library
data = np.array([1, 2, 3, 4, 5]) # Create a NumPy array
result = data * 2 # Vectorized operation (no loop needed)
print(result) # Output the result

Output: The array [2, 4, 6, 8, 10]


18.4 Profiling with %%timeit


Profiling with the magic command `%%timeit` in Jupyter Notebooks or IPython allows you to measure the execution time of a single line or a block of code. This can be useful for determining performance bottlenecks and optimizing code where necessary.

# Real-world example: Profiling with %%timeit
# This code should be run in a Jupyter Notebook or IPython environment
%%timeit
data = [x for x in range(1000)] # Time how long the list comprehension takes

Output: The time taken to execute the list comprehension is displayed


Chapter 19: Error Handling and Debugging

19.1 Try-Except in Pandas Pipelines


Using try-except blocks in a pandas pipeline allows you to gracefully handle errors, ensuring that the process continues even if one step fails. This is particularly useful in data processing when unexpected errors occur.

      import pandas as pd  
def process_data(df):
    try:
        df['Ratio'] = df['Value'] / df['Value2'] # Raises KeyError if a column is missing
    except KeyError:
        print("Error: required column is missing")
    return df
df = pd.DataFrame({'Value': [10, 20, 30]}) # 'Value2' column is missing
result = process_data(df)
print(result)
# The error is caught, a message is printed, and the pipeline continues
# (note: element-wise division by zero in pandas yields inf rather than raising ZeroDivisionError)

Output:
Error: required column is missing
Value
0 10
1 20
2 30


19.2 Logging and Custom Errors


Logging allows you to record information about your program's execution, making it easier to track errors and debug. Custom errors can also be raised to handle specific conditions in a program.

      import logging  
logging.basicConfig(level=logging.INFO)
def check_value(value):
    if value < 0:
        logging.error("Value cannot be negative")
        raise ValueError("Negative value error")
    return value
try:
    check_value(-5)
except ValueError as e:
    logging.exception(e)
# This logs an error when a negative value is encountered

Output:
ERROR:root:Value cannot be negative
ERROR:root:Negative value error
Traceback (most recent call last):
  File "script.py", line 9, in <module>
    check_value(-5)
  File "script.py", line 6, in check_value
    raise ValueError("Negative value error")
ValueError: Negative value error


19.3 Warnings and Deprecations


Warnings indicate potential issues in code that may not stop execution but could lead to problems in the future. Deprecation warnings indicate that certain features or functions will be removed in future versions of libraries.

      import warnings  
def old_function():
    warnings.warn("This function is deprecated", DeprecationWarning)
old_function()
# This generates a deprecation warning

Output:
script.py:3: DeprecationWarning: This function is deprecated


19.4 Debugging with Breakpoints


Breakpoints allow you to pause code execution at a specific line, enabling you to inspect variables and the program state. This is especially useful for debugging issues in complex code.

      import pdb  
def calculate(x, y):
    pdb.set_trace()
    return x + y
result = calculate(10, 20)
print(result)
# This will pause the program at the breakpoint, allowing inspection

Output:
> script.py(4)calculate()
-> return x + y
(Pdb) p x
10
(Pdb) p y
20
(Pdb) c
30


Chapter 20: Advanced Visualization with Seaborn

20.1 Introduction to Seaborn


Seaborn is a Python data visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics. It is designed to work well with pandas DataFrames and provides a variety of plotting functions for visualizing distributions, relationships, and categories.

Example: Simple Seaborn Plot

import seaborn as sns
import matplotlib.pyplot as plt
# Load the iris dataset
data = sns.load_dataset('iris')
sns.scatterplot(x='sepal_length', y='sepal_width', data=data)
plt.title('Seaborn Scatter Plot')
plt.show()

Output:
A scatter plot will be displayed showing the relationship between 'sepal_length' and 'sepal_width' from the iris dataset.


20.2 Statistical Plots


Seaborn provides several built-in functions to create statistical plots, such as bar plots, box plots, violin plots, and regression plots. These plots are designed to visualize the statistical properties of the data, like distributions, central tendency, and spread.

Example: Boxplot

sns.boxplot(x='species', y='sepal_length', data=data)
plt.title('Boxplot of Sepal Length by Species')
plt.show()

Output:
A boxplot will be displayed, showing the distribution of 'sepal_length' for each species in the iris dataset.


Example: Regression Plot

sns.regplot(x='sepal_length', y='sepal_width', data=data)
plt.title('Regression Plot')
plt.show()

Output:
A regression plot will be displayed showing the linear relationship between 'sepal_length' and 'sepal_width' with a fitted line.


20.3 Pairplots and Heatmaps


Pairplots and heatmaps are two powerful visualization techniques in Seaborn. Pairplots show relationships between all variables in a dataset, while heatmaps visualize correlation matrices or other two-dimensional data.

Example: Pairplot

sns.pairplot(data, hue='species')
plt.title('Pairplot of Iris Dataset')
plt.show()

Output:
A pairplot will be displayed showing scatter plots between all variables in the iris dataset, color-coded by the species.


Example: Heatmap

import numpy as np
corr_matrix = data.corr(numeric_only=True) # Correlate only the numeric columns
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

Output:
A heatmap will be displayed showing the correlation matrix between the numerical features of the iris dataset, with annotated values.


20.4 Custom Themes and Palettes


Seaborn allows users to customize the appearance of their plots using themes and color palettes. You can change the overall style of plots and adjust the color schemes to make the visuals more appealing and easier to interpret.

Example: Custom Theme

sns.set_style('whitegrid')
sns.scatterplot(x='sepal_length', y='sepal_width', data=data)
plt.title('Scatter Plot with Custom Theme')
plt.show()

Output:
A scatter plot will be displayed with a white grid background, using Seaborn's 'whitegrid' theme.


Example: Custom Color Palette

sns.set_palette('muted')
sns.boxplot(x='species', y='sepal_length', data=data)
plt.title('Boxplot with Custom Palette')
plt.show()

Output:
A boxplot will be displayed using Seaborn's 'muted' color palette.


Chapter 21: Case Study: Stock Market Analysis

21.1 Importing Finance Data


Importing finance data is the first step in performing stock market analysis. You can use various data sources like Yahoo Finance, Alpha Vantage, or Quandl to fetch historical stock price data. This data typically includes the opening, closing, high, and low prices for each trading day.


Example: Importing stock data using the `yfinance` library

      import yfinance as yf

# Define the stock ticker symbol (e.g., Apple)
ticker = 'AAPL'

# Download historical stock data for the past year
data = yf.download(ticker, period='1y')

# Display the first few rows of the data
print(data.head())

Output:
First few rows of Apple's stock data including Date, Open, High, Low, Close, Volume, Adjusted Close



21.2 Daily & Cumulative Returns


Daily returns refer to the percentage change in stock price from one day to the next. Cumulative returns show the total percentage change over a specified period. These calculations help investors understand short-term price movements and the overall performance of a stock.


Example: Calculating daily and cumulative returns

      # Calculate daily returns by computing the percentage change of the closing price
data['Daily_Return'] = data['Close'].pct_change()

# Calculate cumulative returns
data['Cumulative_Return'] = (1 + data['Daily_Return']).cumprod() - 1

# Display the results
print(data[['Daily_Return', 'Cumulative_Return']].tail())

Output:
Last 5 rows of daily returns and cumulative returns for Apple



21.3 Moving Averages & Volatility


Moving averages help smooth out price data to identify trends over a specific period, while volatility measures the variation in stock prices. A common moving average is the 50-day or 200-day moving average, and volatility can be measured using the standard deviation of daily returns.


Example: Calculating moving averages and volatility

      # Calculate the 50-day moving average
data['50_Day_MA'] = data['Close'].rolling(window=50).mean()

# Calculate the volatility (standard deviation of daily returns)
data['Volatility'] = data['Daily_Return'].rolling(window=50).std()

# Display the results
print(data[['50_Day_MA', 'Volatility']].tail())

Output:
Last 5 rows of 50-day moving averages and volatility values



21.4 Risk vs Return


Risk vs return is a fundamental concept in finance. Investors must assess whether the potential return on an investment justifies the associated risk. By analyzing historical stock data, we can calculate the expected return and the risk (volatility) to make more informed investment decisions.


Example: Analyzing risk and return using standard deviation and mean

# Calculate the expected return (mean of daily returns)
expected_return = data['Daily_Return'].mean()

# Calculate the risk (standard deviation of daily returns)
risk = data['Daily_Return'].std()

# Print the expected return and risk
print(f'Expected Return: {expected_return:.4f}')
print(f'Risk (Volatility): {risk:.4f}')

Output:
Expected Return: 0.0012
Risk (Volatility): 0.0153
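
Going one step further, daily figures are often annualized so they are easier to compare across assets. A rough sketch, assuming roughly 252 trading days per year; the ratio below is a simple Sharpe-style measure without a risk-free rate, shown purely for illustration:

annual_return = expected_return * 252 # Approximate annualized return
annual_volatility = risk * (252 ** 0.5) # Volatility scales with the square root of time
print(f'Annualized Return: {annual_return:.2%}')
print(f'Annualized Volatility: {annual_volatility:.2%}')
print(f'Return/Risk Ratio: {annual_return / annual_volatility:.2f}')

With the illustrative numbers above, this works out to roughly 30.2% annualized return, 24.3% annualized volatility, and a ratio of about 1.24.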



22. Case Study: Movie Analytics

22.1 Scraping IMDB Data


Web scraping is a technique used to extract data from websites. In this case, we will scrape movie data from IMDb using the BeautifulSoup and requests libraries. This allows us to gather information such as movie titles, genres, release dates, and ratings.


Example:

import requests  # Import requests library
from bs4 import BeautifulSoup # Import BeautifulSoup for HTML parsing
url = 'https://www.imdb.com/chart/top' # IMDb top movies chart URL
response = requests.get(url) # Send a GET request to the URL
soup = BeautifulSoup(response.text, 'html.parser') # Parse HTML content using BeautifulSoup
movies = soup.find_all('td', class_='titleColumn') # Find all movie title cells (the selector depends on IMDb's current page markup, which changes over time)
for movie in movies[:5]: # Loop through the first 5 movies
    title = movie.a.text # Extract movie title
    year = movie.span.text.strip('()') # Extract movie release year
    print(f'{title} ({year})') # Print movie title and release year

Output: The top 5 movies from IMDb, along with their release years, will be printed.



22.2 Genre-Based Grouping


Grouping movies by genre allows us to analyze trends and compare different types of movies. We can create a dictionary where the genre is the key, and the value is a list of movies belonging to that genre.


Example:

import pandas as pd  # Import pandas for data manipulation
# Sample data: movie titles and their genres
data = {'Title': ['Inception', 'The Dark Knight', 'Pulp Fiction', 'The Matrix', 'Forrest Gump'], 'Genre': ['Sci-Fi', 'Action', 'Crime', 'Sci-Fi', 'Drama']}
df = pd.DataFrame(data) # Create a DataFrame from the data
grouped_genres = df.groupby('Genre')['Title'].apply(list) # Group movies by genre
print(grouped_genres) # Print the grouped genres and their movies

Output: A list of movies grouped by genre will be printed, showing the genres with their respective movie titles.



22.3 Visualizing Ratings Over Time


Visualizing how ratings change over time is a powerful tool for understanding movie trends. We can create a plot that shows the ratings of movies over the years using libraries like Matplotlib and Pandas.


Example:

import matplotlib.pyplot as plt  # Import Matplotlib for plotting
import pandas as pd # Import pandas for data manipulation
# Sample data: movie titles, ratings, and release years
data = {'Title': ['Inception', 'The Dark Knight', 'Pulp Fiction', 'The Matrix', 'Forrest Gump'], 'Rating': [8.8, 9.0, 8.9, 8.7, 8.8], 'Year': [2010, 2008, 1994, 1999, 1994]}
df = pd.DataFrame(data) # Create a DataFrame from the data
plt.plot(df['Year'], df['Rating'], marker='o') # Plot ratings over years
plt.title('Movie Ratings Over Time') # Set plot title
plt.xlabel('Year') # Set x-axis label
plt.ylabel('Rating') # Set y-axis label
plt.show() # Display the plot

Output: A line plot will be displayed showing movie ratings over the years.



22.4 Comparing Studios and Budgets


Comparing movies based on studio and budget can provide insights into the financial aspect of filmmaking. We can group movies by studio and calculate average budgets or other statistics.


Example:

# Sample data: movie titles, studio, and budget
data = {'Title': ['Inception', 'The Dark Knight', 'Pulp Fiction', 'The Matrix', 'Forrest Gump'],
        'Studio': ['Warner Bros.', 'Warner Bros.', 'Miramax', 'Warner Bros.', 'Paramount'],
        'Budget': [160000000, 185000000, 80000000, 63000000, 55000000]}
df = pd.DataFrame(data) # Create a DataFrame from the data
avg_budget_by_studio = df.groupby('Studio')['Budget'].mean() # Calculate average budget by studio
print(avg_budget_by_studio) # Print the average budget for each studio

Output: The average budget for each studio will be printed, showing how much they typically invest in their movies.



Chapter 23: Working with GeoData

23.1 Using geopandas


GeoPandas is an extension of the pandas library that makes working with geospatial data in Python easier. It allows you to read, manipulate, and analyze spatial data such as shapefiles, GeoJSON, and others. It supports spatial operations like calculating area, distance, buffering, and joins based on geometry.


Example:

import geopandas as gpd  # Importing the geopandas library
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres')) # Load world map data (the bundled datasets were removed in geopandas 1.0, so this call needs an older geopandas or a downloaded Natural Earth file)
print(world.head()) # Display the first few rows of the GeoDataFrame

Output:
pop_est continent name iso_a3 gdp_md_est geometry
0 920938 Africa Algeria DZA 48830.0 POLYGON ((11.999 23.471...
1 539509 Africa Angola AGO 110300.0 MULTIPOLYGON (((12.227 ...
...
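
The spatial operations mentioned above (area, centroids, buffers) work directly on the geometry column. A minimal sketch on the world GeoDataFrame just loaded, re-projected to an equal-area CRS so the measures come out in metres; EPSG:6933 is just one reasonable choice:

projected = world.to_crs(epsg=6933) # Equal-area projection so geometric measures are in metres
projected['area_km2'] = projected.geometry.area / 1e6 # Polygon area in square kilometres
centroids = projected.geometry.centroid # Geometric centre of each country
buffered = projected.geometry.buffer(100000) # 100 km buffer around each geometry
print(projected[['name', 'area_km2']].head()) # Show country names with their computed areas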



23.2 Importing Shapefiles


Shapefiles are a common format for geospatial vector data, consisting of several files (.shp, .shx, .dbf, etc.). GeoPandas allows you to import shapefiles easily into a GeoDataFrame using the `read_file()` function. Once loaded, the data can be analyzed or visualized just like a regular DataFrame, with added geometry features.


Example:

import geopandas as gpd  # Import geopandas
gdf = gpd.read_file('your_shapefile_folder/your_file.shp') # Replace with your actual shapefile path
print(gdf.head()) # Show top rows of shapefile data

Output:
Will vary depending on the shapefile. Typically includes:
ID, Name, geometry, etc.
Example:
ID NAME geometry
1 River POLYGON ((...))



23.3 Mapping with Coordinates


With GeoPandas, you can create maps using longitude and latitude data by converting regular pandas DataFrames to GeoDataFrames using the `points_from_xy` function. This is useful when visualizing addresses, locations, or spatial patterns based on GPS coordinates.


Example:

import pandas as pd  # Import pandas
import geopandas as gpd # Import geopandas
from shapely.geometry import Point # Import Point for coordinates
# Create a simple DataFrame with coordinates
data = {'City': ['A', 'B'], 'Latitude': [34.05, 36.16], 'Longitude': [-118.24, -115.15]}
df = pd.DataFrame(data) # Create DataFrame
# Convert to GeoDataFrame
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['Longitude'], df['Latitude']))
print(gdf.head()) # Show mapped GeoDataFrame

Output:
City Latitude Longitude geometry
A 34.05 -118.24 POINT (-118.24000 34.05000)
B 36.16 -115.15 POINT (-115.15000 36.16000)
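
To actually draw these coordinates on a map, the GeoDataFrame's plot() method can be layered over a basemap. A minimal sketch, assuming the world GeoDataFrame from 23.1 is still available:

import matplotlib.pyplot as plt # geopandas plotting uses matplotlib
ax = world.plot(color='lightgrey') # Background layer: the world map loaded in 23.1
gdf.plot(ax=ax, color='red', markersize=20) # Overlay the two city points on the same axes
plt.title('City Locations')
plt.show()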



23.4 Combining Maps with DataFrames


You can combine spatial data (maps) with regular pandas DataFrames using merges or joins on common keys. For example, you can join a GeoDataFrame of regions with another DataFrame containing population or economic data. This allows for spatial analysis and geographic visualizations with rich context.


Example:

import geopandas as gpd  # Import geopandas
import pandas as pd # Import pandas
# Load a world map
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# Create a simple DataFrame with matching country names
data = {'name': ['Brazil', 'China'], 'Population': [211e6, 1393e6]}
pop_df = pd.DataFrame(data) # Create population DataFrame
# Merge spatial data with population
merged = world.merge(pop_df, on='name') # Merge on 'name' column
print(merged[['name', 'Population', 'geometry']]) # Show result

Output:
name Population geometry
Brazil 211000000.0 POLYGON ((-57.6251 -30.2163...
China 1393000000.0 POLYGON ((109.6875 18.1977...
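
Once merged, the combined data can be drawn as a choropleth, coloring each country by the joined value. A minimal sketch using the merged GeoDataFrame above:

import matplotlib.pyplot as plt # Plotting backend used by geopandas
merged.plot(column='Population', legend=True, cmap='Blues') # Color each country by its population
plt.title('Population by Country (merged data)')
plt.show()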



Chapter 24: Working with Text Data (NLP Prep)

24.1 Tokenization & Lowercasing


Tokenization is the process of breaking down text into individual units such as words or phrases (tokens). Lowercasing is a preprocessing step where all text is converted to lowercase to maintain uniformity and improve comparison. These are often the first steps in natural language processing (NLP) workflows to prepare raw text for analysis or machine learning.


Example:

text = "NLP Is Fun and Powerful!"  # Original text
tokens = text.lower().split() # Convert to lowercase and split into tokens
print(tokens) # Display the tokens

Output: ['nlp', 'is', 'fun', 'and', 'powerful!']



24.2 Removing Stopwords


Stopwords are common words in a language (like "and", "is", "the") that often do not add significant meaning. Removing them helps reduce noise in the data and can lead to better model performance. Libraries like NLTK provide predefined lists of stopwords.


Example:

from nltk.corpus import stopwords  # Import stopwords list
import nltk # Import nltk for natural language tasks
nltk.download('stopwords') # Download stopwords if not already present
text = "NLP is fun and powerful" # Sample text
words = text.lower().split() # Lowercase and tokenize
filtered = [word for word in words if word not in stopwords.words('english')] # Remove stopwords
print(filtered) # Display filtered tokens

Output: ['nlp', 'fun', 'powerful']



24.3 Bag-of-Words Matrix


The Bag-of-Words (BoW) model transforms text into a fixed-length vector based on word frequency. It disregards grammar and word order but captures how often a word occurs in a document. It’s commonly used for text classification and clustering.


Example:

from sklearn.feature_extraction.text import CountVectorizer  # Import BoW tool
texts = ["I love NLP", "NLP is great and fun"] # Sample documents
vectorizer = CountVectorizer() # Create a CountVectorizer object
bow = vectorizer.fit_transform(texts) # Transform texts to BoW format
print(vectorizer.get_feature_names_out()) # Print words used in BoW
print(bow.toarray()) # Show frequency matrix

Output:
Words: ['and', 'fun', 'great', 'is', 'love', 'nlp']
Matrix: [[0, 0, 0, 0, 1, 1], [1, 1, 1, 1, 0, 1]]



24.4 TF-IDF with Scikit-learn


TF-IDF (Term Frequency-Inverse Document Frequency) is an advanced technique that scores words based on how unique they are across documents. A high TF-IDF value indicates a word is frequent in one document but rare across others, making it more meaningful. Scikit-learn makes it easy to implement TF-IDF using `TfidfVectorizer`.


Example:

from sklearn.feature_extraction.text import TfidfVectorizer  # Import TF-IDF tool
texts = ["I love NLP", "NLP is great and fun"] # Sample text list
tfidf = TfidfVectorizer() # Create the vectorizer
matrix = tfidf.fit_transform(texts) # Fit and transform the texts
print(tfidf.get_feature_names_out()) # Display words
print(matrix.toarray()) # Display TF-IDF scores

Output:
Words: ['and', 'fun', 'great', 'is', 'love', 'nlp']
Matrix: Each row shows the TF-IDF scores of words per document.



Chapter 25: Handling Big Data with Dask

25.1 Introduction to Dask


Dask is a parallel computing library in Python that scales Python code to handle large datasets. It allows you to work with big data in a similar way to pandas but in a distributed or parallelized fashion. Dask breaks your data into smaller chunks and processes them in parallel, which is ideal for datasets that don’t fit into memory.


Example:

import dask  # Import the Dask library
import dask.dataframe as dd # Import Dask's DataFrame module
# Load a large CSV using Dask
df = dd.read_csv('large_dataset.csv') # Efficiently reads big files in chunks
print(df.head()) # head() computes just the first partition and returns the first rows as a pandas DataFrame

Output:
First few rows of your large CSV file (similar to pandas output).
Example:
id,name,value
1,Alice,100
2,Bob,200



25.2 Replacing pandas with Dask


Dask mimics the pandas API, which means you can use most of your existing pandas code with minimal changes. Instead of `import pandas as pd`, you use `import dask.dataframe as dd`. The rest of the syntax (like filtering, grouping, etc.) remains largely the same, making it easy to scale your existing workflows.


Example:

import dask.dataframe as dd  # Use Dask instead of pandas
# Read a CSV file with Dask
df = dd.read_csv('sales_data.csv') # Loads data in chunks
# Group by product and calculate the total
result = df.groupby('product')['amount'].sum() # Similar to pandas
print(result.compute()) # Triggers computation

Output:
product
Chair 1200
Desk 800
Lamp 400



25.3 Lazy Computation


One of the powerful features of Dask is lazy evaluation. Unlike pandas, which executes operations immediately, Dask builds a task graph and waits to compute until explicitly told to with `.compute()`. This makes operations faster and memory-efficient because nothing is done until needed.


Example:

import dask.dataframe as dd  # Import Dask
df = dd.read_csv('bigfile.csv') # Load big file lazily
# Define a lazy operation
average = df['price'].mean() # No computation yet
print(average) # Just shows a delayed object
print(average.compute()) # Now it actually calculates

Output:
Dask Series: price mean (delayed object)
Then the actual number like: 45.67
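
Because everything stays lazy until computed, several results can share a single pass over the file by handing them to dask.compute() together. A minimal sketch reusing the df above ('price' is the same hypothetical column as in the example):

import dask # Top-level Dask namespace for dask.compute
mean_price = df['price'].mean() # Lazy: nothing is read yet
max_price = df['price'].max() # Lazy: still nothing is read
mean_result, max_result = dask.compute(mean_price, max_price) # One shared scan computes both results
print(mean_result, max_result)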



25.4 Performance Benchmarks


Dask is especially useful when handling gigabytes or terabytes of data. Performance can be benchmarked by comparing processing time and memory usage against pandas. For massive files, Dask often completes tasks significantly faster because it uses parallelism and avoids memory overload by processing data in chunks.


Example:

import pandas as pd  # Regular pandas
import dask.dataframe as dd # Dask
import time # To measure time
# Timing pandas
start = time.time()
df_pandas = pd.read_csv('large_file.csv')
print("Pandas Time:", time.time() - start)
# Timing Dask
start = time.time()
df_dask = dd.read_csv('large_file.csv')
df_dask.head() # Reads only the first partition, so this is not a like-for-like full-file comparison
print("Dask Time:", time.time() - start)

Output:
Pandas Time: 12.4 seconds
Dask Time: 2.7 seconds
(Numbers may vary based on file size and system.)



Chapter 26: Using SQL with pandas

26.1 Connecting to SQLite


SQLite is a lightweight, file-based database that can be easily integrated with pandas for storing and retrieving data. To connect pandas to SQLite, we use Python’s built-in sqlite3 module to establish a connection, which can then be passed to pandas functions like read_sql_query() and to_sql().

Example: Connecting to an SQLite database

import sqlite3  
conn = sqlite3.connect("example.db")
print("Connection successful!")

Output:
Connection successful!


26.2 Running SQL Queries


After establishing a connection to the database, you can use SQL queries to retrieve data using pd.read_sql_query(). This function allows you to run SQL commands like SELECT and fetch the results directly into a pandas DataFrame for further analysis.

Example: Running a SELECT query

import pandas as pd  
query = "SELECT name FROM sqlite_master WHERE type='table';"
tables = pd.read_sql_query(query, conn)
print(tables)

Output:
Displays a list of table names in the SQLite database


26.3 Joining SQL & pandas Results


You can join SQL query results with pandas DataFrames using pandas merge operations. This is useful when you want to combine database results with other in-memory data. The merge() function lets you do SQL-style joins (inner, outer, left, right) directly in pandas.

Example: Joining SQL result with pandas DataFrame

# Assuming 'employees' table exists in SQLite 
sql_df = pd.read_sql_query("SELECT id, name FROM employees", conn)
local_df = pd.DataFrame({'id': [1, 2], 'department': ['HR', 'IT']})
joined_df = pd.merge(sql_df, local_df, on='id')
print(joined_df)

Output:
Merges employee names from SQL with department data in pandas
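
To keep every row from the SQL result even when no matching department exists locally (unmatched rows get NaN), the same merge can be made a left join. A minimal sketch reusing sql_df and local_df from the example:

left_joined = pd.merge(sql_df, local_df, on='id', how='left') # Keep all SQL rows; missing departments become NaN
print(left_joined)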


26.4 Writing DataFrames to SQL


Pandas allows you to write a DataFrame to a SQL table using the to_sql() method. This is useful for saving processed or transformed data back into a database. You can choose to replace the table, append to it, or fail if it already exists by setting the if_exists parameter.

Example: Saving a DataFrame to SQL

df = pd.DataFrame({'id': [3, 4], 'name': ['Alice', 'Bob']})  
df.to_sql("employees", conn, if_exists="append", index=False)
print("Data written to SQL database.")

Output:
Data written to SQL database.
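
As an optional sanity check, the table can be read back with read_sql_query() to confirm the rows were appended:

check = pd.read_sql_query("SELECT * FROM employees", conn) # Re-read the table we just wrote to
print(check.tail()) # The last rows should include Alice and Bob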


Chapter 27: Automating Reports

27.1 Generating Excel Reports


Generating Excel reports is a common task in data analysis for sharing results with non-technical users. With pandas, you can easily write DataFrames to Excel files using the `to_excel()` method, which supports multiple sheets and formatting.

# Real-world example: Generating an Excel report
import pandas as pd # Import pandas library
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Sales': [250, 300]}) # Create a sample DataFrame
df.to_excel('sales_report.xlsx', index=False) # Export the DataFrame to Excel
print("Excel report generated successfully.") # Confirm success

Output: Creates a file named sales_report.xlsx with Name and Sales data.
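
to_excel() can also write several sheets into one workbook through pd.ExcelWriter. A minimal sketch, where summary_df and the file name are made up for illustration:

summary_df = pd.DataFrame({'Region': ['North', 'South'], 'Total': [550, 420]}) # Hypothetical summary data
with pd.ExcelWriter('sales_report_multi.xlsx') as writer: # The context manager saves the file on exit
    df.to_excel(writer, sheet_name='Details', index=False) # First sheet: the detailed rows
    summary_df.to_excel(writer, sheet_name='Summary', index=False) # Second sheet: the summary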



27.2 Adding Charts with xlsxwriter


With the xlsxwriter engine, you can create Excel files with embedded charts. This is helpful for visualizing trends or summaries directly in your reports. Charts can be added with formatting, titles, and styles for professional reporting.

# Real-world example: Adding charts to Excel with xlsxwriter
writer = pd.ExcelWriter('chart_report.xlsx', engine='xlsxwriter') # Use xlsxwriter engine
df.to_excel(writer, sheet_name='Sheet1', index=False) # Write data to Excel
workbook = writer.book # Get workbook object
worksheet = writer.sheets['Sheet1'] # Get worksheet object
chart = workbook.add_chart({'type': 'column'}) # Create column chart
chart.add_series({'values': '=Sheet1!$B$2:$B$3'}) # Add data series
worksheet.insert_chart('D2', chart) # Insert chart in cell D2
writer.close() # Save and close the file (writer.save() was removed in recent pandas versions)
print("Excel file with chart created.") # Confirm success

Output: Generates an Excel file with a column chart showing Sales data.



27.3 Emailing Reports


Emailing reports automatically is useful for regular updates. Using the built-in `smtplib` and `email` libraries in Python, you can send Excel or PDF files as email attachments. This is often combined with scheduled scripts for automation.

# Real-world example: Sending email with a report (requires real email setup)
import smtplib # Library for sending emails
from email.message import EmailMessage # Class to build email messages
msg = EmailMessage() # Create an email message
msg['Subject'] = 'Monthly Sales Report' # Set subject
msg['From'] = 'sender@example.com' # Sender's email
msg['To'] = 'receiver@example.com' # Receiver's email
msg.set_content('Please find the attached sales report.') # Email body
with open('sales_report.xlsx', 'rb') as f: # Open file in binary mode
    file_data = f.read() # Read file
    file_name = f.name # Get file name
msg.add_attachment(file_data, maintype='application', subtype='octet-stream', filename=file_name) # Attach file
with smtplib.SMTP('smtp.example.com', 587) as server: # Connect to SMTP server
    server.starttls() # Secure the connection
    server.login('sender@example.com', 'password') # Login credentials
    server.send_message(msg) # Send email
print("Email sent with report.") # Confirm success

Output: Sends the sales_report.xlsx to the specified recipient.



27.4 Scheduling with cron or Task Scheduler


Automating the timing of report generation can be achieved using cron (Linux/macOS) or Task Scheduler (Windows). These tools allow you to schedule your Python script to run at specific intervals, like daily or weekly.

# Example (Linux/macOS): Add this to crontab
# Open crontab with: crontab -e
# Run every day at 8 AM
0 8 * * * /usr/bin/python3 /path/to/report_script.py
# Example (Windows Task Scheduler):
# - Open Task Scheduler
# - Create Task → Triggers: Daily 8AM
# - Actions: Start a Program → python.exe and script path

Output: Your script runs automatically on the schedule without manual intervention.



Chapter 28: Web APIs & pandas

28.1 Consuming JSON APIs


Consuming JSON APIs in pandas means making a request to an API that returns JSON data and converting that data into a DataFrame. This is common in data projects that involve pulling external information from web services or public data portals.

import requests  
import pandas as pd
url = 'https://jsonplaceholder.typicode.com/posts'
response = requests.get(url)
data = response.json()
df = pd.DataFrame(data)
print(df.head())
# Fetches JSON data from a free placeholder API and loads it into a DataFrame

Output:
id userId title
1 1 sunt aut facere repellat
2 1 qui est esse
3 1 ea molestias quasi
...


28.2 Normalizing Nested Data


Sometimes JSON from APIs contains nested structures like dictionaries or lists inside objects. Pandas has `json_normalize()` to flatten these structures into a tabular format so they can be more easily analyzed.

import requests  
import pandas as pd
from pandas import json_normalize
url = 'https://jsonplaceholder.typicode.com/users'
response = requests.get(url)
data = response.json()
df = json_normalize(data, sep='_')
print(df[['id', 'name', 'address_city']])
# This flattens the nested address data into individual columns

Output:
id name address_city
1 Leanne Graham Gwenborough
2 Ervin Howell Wisokyburgh
...


28.3 Pagination and Tokens


Many APIs limit the number of results returned per request, so you retrieve the full result set page by page (pagination), looping until no more data is returned. Authenticated APIs typically also require an access token sent with each request.

import requests  
import pandas as pd
all_data = []
for page in range(1, 3):
    url = f'https://jsonplaceholder.typicode.com/posts?_page={page}&_limit=5'
    response = requests.get(url)
    data = response.json()
    all_data.extend(data)
df = pd.DataFrame(all_data)
print(df)
# This simulates pagination by looping through API pages

Output:
id title body
1 sunt aut facere ...
2 qui est esse ...
...
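
For APIs that require a token, the token is usually sent as a request header, and you keep requesting pages until an empty page comes back. A minimal sketch with a placeholder URL, header, and token (not a real API):

import requests # For HTTP requests
import pandas as pd # To collect results into a DataFrame
token = 'YOUR_API_TOKEN' # Hypothetical access token
headers = {'Authorization': f'Bearer {token}'} # Common bearer-token header format
all_rows = []
page = 1
while True:
    response = requests.get(f'https://api.example.com/items?page={page}', headers=headers) # Placeholder endpoint
    rows = response.json()
    if not rows: # Stop when the API returns no more results
        break
    all_rows.extend(rows)
    page += 1
df = pd.DataFrame(all_rows)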


28.4 Rate-Limiting Handling


Some APIs restrict the number of requests in a given time (rate-limiting). To avoid getting blocked or receiving errors, you can use delays (like `time.sleep()`) between requests or handle the 429 error specifically.

import requests  
import pandas as pd
import time
all_data = []
for i in range(3):
    url = 'https://jsonplaceholder.typicode.com/posts/{}'.format(i + 1)
    response = requests.get(url)
    if response.status_code == 429:
        print("Rate limit hit, sleeping...")
        time.sleep(5)
        continue
    data = response.json()
    all_data.append(data)
    time.sleep(1)
df = pd.DataFrame(all_data)
print(df)
# Adds delay to handle rate limits and avoid API blocking

Output:
id title userId
1 sunt aut facere 1
2 qui est esse 1
3 ea molestias quasi 1


Chapter 29: Testing & Validation

29.1 Unit Testing with pytest


Unit testing ensures that individual components of code function as expected. pytest is a popular Python testing framework that allows writing simple and readable test cases using plain Python functions. Unit tests can catch bugs early by validating functions or methods in isolation.

Example: Testing a Function with pytest

# my_math.py
def add(x, y):
    return x + y

# test_my_math.py
from my_math import add
def test_add():
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
    assert add(0, 0) == 0

Output:
Running pytest test_my_math.py will show that all test cases pass if the function works correctly.


29.2 Validating DataFrames


Validating DataFrames means checking that data meets specific conditions such as value types, ranges, or missing values. This step ensures that your data is clean and consistent before analysis or modeling. It can be done using simple pandas checks or libraries like pandera.

Example: Validating Missing Values and Data Types

import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', None],
    'age': [25, 30, None]
})
# Check for missing values
missing = df.isnull().sum()
print("Missing values in each column:")
print(missing)
# Check data types
print("Data types:")
print(df.dtypes)

Output:
Displays missing values per column and data types of each column in the DataFrame.
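
Ranges can be checked just as easily. A minimal sketch on the same df, treating ages outside 0-120 as invalid (the bounds are arbitrary and only for illustration):

in_range = df['age'].dropna().between(0, 120) # Boolean Series: True where the age is plausible
print("All non-missing ages within expected range:", in_range.all())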


29.3 Schema Enforcement


Schema enforcement involves defining and enforcing rules on the structure and types of your data. This is crucial when loading, transforming, or validating structured datasets to ensure consistency. Tools like pandera allow you to declare expected schemas and catch violations automatically.

Example: Enforcing Schema with pandera

import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema
# Define schema
schema = DataFrameSchema({
    "name": Column(str, nullable=False),
    "age": Column(int, nullable=False)
})
# Sample valid DataFrame
df_valid = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25, 30]
})
# Validate
validated = schema.validate(df_valid)
print("Data validated successfully")

Output:
If the data conforms to the schema, it prints: Data validated successfully. Otherwise, it raises a schema error.


29.4 Custom Test Cases


Custom test cases allow you to create specific tests for edge cases or domain-specific logic in your code. These can be automated using pytest or manually coded to check particular behaviors that aren’t covered by general unit tests.

Example: Custom Logic Test for Discounts

# discount.py
def apply_discount(price, percentage):
    if percentage < 0 or percentage > 100:
        raise ValueError("Percentage must be between 0 and 100")
    return price * (1 - percentage / 100)

# test_discount.py
from discount import apply_discount
import pytest
def test_valid_discount():
    assert apply_discount(100, 20) == 80
def test_invalid_discount():
    with pytest.raises(ValueError):
        apply_discount(100, -10)

Output:
Valid discount returns correct price, while invalid percentages raise a ValueError as expected.