Python syntax is easy to understand, designed to be readable and intuitive. It uses indentation rather than curly braces to define code blocks. This makes Python one of the most beginner-friendly programming languages. Python is also case-sensitive, which means 'Variable' and 'variable' are considered different. This subchapter will focus on understanding Python’s simple and clean syntax.
name = "John" # Defining a string variable
print("Hello", name) # Printing a message
Output: Hello John
In Python, variables store data values. Lists are ordered collections of items, which can be of any data type, and allow duplicates. Dictionaries store key-value pairs, making them ideal for fast lookups of related data. This section will discuss how to define and work with these essential data structures.
age = 25 # Variable holding an integer
friends = ["Alice", "Bob"] # List of friends
profile = {"name": "John", "age": 25} # Dictionary with key-value pairs
print(profile["name"]) # Accessing value associated with the key 'name' in dictionary
Output: John
Functions allow you to group code into reusable blocks, making programs more organized and modular. Conditionals, using 'if', 'elif', and 'else', control the flow of execution based on logical decisions. These structures let the program take different actions depending on the input or conditions.
def greet(name): # Function definition
    if name == "Alice": # Conditional check
        return "Hi Alice!" # Return greeting for Alice
    else:
        return "Hello!" # Default greeting
print(greet("Alice")) # Calling the function with 'Alice'
Output: Hi Alice!
Pandas and numpy are two fundamental libraries for data manipulation and analysis in Python. Pandas provides high-level data structures like DataFrames, which handle tabular data, while numpy offers support for numerical computations with arrays. These libraries are essential for data science and machine learning tasks.
import pandas as pd # Import pandas library
import numpy as np # Import numpy library
data = np.array([[1, 2], [3, 4]]) # Create a 2D numpy array
df = pd.DataFrame(data, columns=["A", "B"]) # Convert numpy array to DataFrame
print(df) # Print the DataFrame
Output:
A B
0 1 2
1 3 4
A pandas Series is a one-dimensional labeled array, capable of holding any data type (integers, strings, floats, etc.). It is similar to a list or array in Python, but it comes with additional features like labels for indexing. A Series can be created from a list, dictionary, or even a scalar value.
import pandas as pd # Importing pandas
data = [10, 20, 30, 40] # List of data
series = pd.Series(data) # Creating a Series from the list
print(series) # Printing the Series
Output:
0 10
1 20
2 30
3 40
Indexing in a Series allows you to access specific elements based on their position or label. Slicing is similar to indexing, but it allows you to retrieve a subset of elements. You can use integer-based or label-based indexing with pandas Series.
data = [10, 20, 30, 40] # List of data
series = pd.Series(data) # Creating the Series
print(series[1]) # Indexing: Access element at index 1
print(series[1:3]) # Slicing: Access elements at index 1 and 2 (index 3 is excluded)
Output:
20
1 20
2 30
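As a quick illustration of the label-based indexing mentioned above (the labels 'a', 'b', 'c' are made up for this sketch), a Series can also carry a custom index:
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"]) # Series with string labels
print(labeled["b"]) # Label-based access
print(labeled.loc["a":"b"]) # Label-based slicing includes both endpoints
Output:
20
a    10
b    20
dtype: int64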
Pandas Series come with a variety of methods to perform operations on the data. These methods allow you to perform statistical, mathematical, and string operations, among others. Some common methods include `sum()`, `mean()`, `max()`, and `min()`.
data = [10, 20, 30, 40] # List of data
series = pd.Series(data) # Creating the Series
print(series.sum()) # Calculate sum of elements
print(series.mean()) # Calculate mean of elements
print(series.max()) # Get maximum value in the Series
Output:
100
25.0
40
NaN stands for 'Not a Number' and is used to represent missing or undefined values in a pandas Series. Pandas provides methods to detect, remove, or replace NaNs. Handling NaNs is crucial when performing data analysis to ensure the accuracy of the results.
data = [10, None, 30, None] # List of data with NaNs
series = pd.Series(data) # Creating the Series
print(series.isna()) # Check for NaNs in the Series
series_filled = series.fillna(0) # Replace NaNs with 0
print(series_filled) # Printing the Series after replacing NaNs
Output:
0 False
1 True
2 False
3 True
dtype: bool
0 10.0
1 0.0
2 30.0
3 0.0
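If dropping the missing entries is preferred over filling them, dropna() removes them from the same Series:
series_dropped = series.dropna() # Drop entries that are NaN
print(series_dropped) # Only the non-missing values remain
Output:
0    10.0
2    30.0
dtype: float64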
DataFrames are the core data structure in the pandas library. They are similar to Excel tables or SQL tables, consisting of rows and columns. A DataFrame can be created from various data structures such as dictionaries, lists, and even external files like CSVs or Excel files. It is the starting point for any data manipulation task in Python.
import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)
Output:
    Name  Age
0  Alice   25
1    Bob   30
DataFrames allow you to access specific rows and columns using both index positions and labels. Using `.loc[]` enables label-based access, while `.iloc[]` is used for index-based access. This flexibility makes it easy to select and work with parts of your data.
print(df['Name'])
print(df.loc[0])
Output:
0 Alice
1 Bob
Name Alice
Age 25
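For completeness, .iloc[] does the same selection by integer position; a quick sketch on the DataFrame above:
print(df.iloc[1]) # Position-based access: second row
Output:
Name    Bob
Age      30
Name: 1, dtype: object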
Once you have a DataFrame, you can perform various operations on it. You can carry out arithmetic operations, statistical computations, and apply functions across columns or rows. Pandas makes these operations very straightforward.
df['DoubleAge'] = df['Age'] * 2
print(df)
Output:
    Name  Age  DoubleAge
0  Alice   25         50
1    Bob   30         60
You can easily add new columns to a DataFrame by assigning values to a new column label. Deleting columns can be done with the `drop()` method, specifying the column name and the axis (1 for columns). This flexibility allows you to manage your data efficiently.
df['City'] = ['NYC', 'LA']
df = df.drop('DoubleAge', axis=1)
print(df)
Output:
    Name  Age City
0  Alice   25  NYC
1    Bob   30   LA
In data analysis, importing data is one of the first steps. Libraries like pandas allow you to read data from multiple file formats such as CSV, Excel, and JSON. These formats are commonly used in data storage and exchange, making it essential to learn how to work with them effectively.
# Real-world example: Reading CSV and JSON files
import pandas as pd # Import the pandas library
csv_data = pd.read_csv('data.csv') # Load a CSV file
json_data = pd.read_json('data.json') # Load a JSON file
print(csv_data.head()) # Display first 5 rows of CSV data
Output: First 5 rows of the CSV data
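Excel files are read the same way; a quick sketch assuming a file named 'data.xlsx' exists and an Excel engine such as openpyxl is installed:
excel_data = pd.read_excel('data.xlsx') # Load an Excel file (needs openpyxl or a similar engine)
print(excel_data.head()) # Display first 5 rows of Excel data
Output: First 5 rows of the Excel data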
After processing or analyzing data, you may need to export it for further use or to share with others. pandas provides methods for exporting data to various formats such as CSV, Excel, and JSON. This makes it easy to save results or share data in a standardized format.
# Real-world example: Exporting data to CSV
data = pd.DataFrame({'name': ['Tom', 'Jerry'], 'age': [5, 3]}) # Create a DataFrame
data.to_csv('output.csv', index=False) # Export the DataFrame to a CSV file
Output: A new file 'output.csv' is created with the data
Sometimes, when loading data into a DataFrame, you may want to use a specific column as the index. This can help in organizing the data better, especially when dealing with large datasets. Using an appropriate index can improve the speed of data lookups and manipulations.
# Real-world example: Loading data with an index column
data = pd.read_csv('data.csv', index_col='id') # Use the 'id' column as the index
print(data.head()) # Display first 5 rows with 'id' as index
Output: DataFrame with 'id' as the index column
Handling large datasets can be challenging due to memory limitations. One way to handle this is by reading data in chunks, which allows you to process data piece-by-piece without loading the entire dataset into memory at once. This approach helps in efficiently managing and analyzing large datasets.
# Real-world example: Reading data in chunks
chunk_iter = pd.read_csv('large_data.csv', chunksize=1000) # Read data in chunks of 1000 rows
for chunk in chunk_iter:
    print(chunk.head()) # Process and display first 5 rows of each chunk
Output: First 5 rows from each 1000-row chunk
Null values (or missing data) can impact model accuracy. Common techniques to handle them include removing rows, filling with default values, or using statistical measures (mean/median). Choosing the right strategy depends on the nature and importance of the data.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, None, 4]})
df['A'] = df['A'].fillna(df['A'].mean())
print(df)
# This replaces null values in column A with the mean of the column
Output:
          A
0  1.000000
1  2.000000
2  2.333333
3  4.000000
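If dropping the incomplete rows is preferable to filling them, dropna() removes them; a quick sketch on a fresh copy of the same data (rebuilt here because the column above has already been filled):
df2 = pd.DataFrame({'A': [1, 2, None, 4]}) # Same data before filling
print(df2.dropna()) # Drop any row that contains a NaN
Output:
     A
0  1.0
1  2.0
3  4.0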
Data may not always be in the correct format. For example, numerical values might be stored as strings. Type conversion helps ensure the data is in a usable and consistent format for analysis or modeling.
import pandas as pd
df = pd.DataFrame({'A': ['1', '2', '3']})
df['A'] = df['A'].astype(int)
print(df.dtypes)
# This converts the column 'A' from string to integer type
Output: A int64
Duplicate data can skew analysis, while outliers can mislead models. Removing duplicates and identifying outliers using methods like IQR or Z-score is crucial for high-quality data preprocessing.
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 100]})
df = df.drop_duplicates()
print(df)
# This removes duplicate rows based on identical values
Output:
A
0 1
2 2
3 100
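As a sketch of the IQR rule mentioned above (the values below are made up so that the outlier is obvious), values outside 1.5 * IQR of the quartiles are filtered out:
s = pd.Series([1, 2, 3, 4, 100]) # Illustrative values with one clear outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75) # First and third quartiles
iqr = q3 - q1 # Interquartile range
filtered = s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)] # Keep values inside the IQR fences
print(filtered) # The value 100 falls outside the fences and is dropped
Output:
0    1
1    2
2    3
3    4
dtype: int64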
The apply() function in pandas allows applying a function to each row or column, enabling custom cleaning operations. It's especially useful for string cleaning or transforming values based on logic.
import pandas as pd
df = pd.DataFrame({'A': [' apple ', 'Banana', ' grape ']})
df['A'] = df['A'].apply(lambda x: x.strip().lower())
print(df)
# This strips extra spaces and converts strings to lowercase
Output:
A
0 apple
1 banana
2 grape
Boolean indexing allows us to filter rows of a DataFrame by using a condition that returns True or False for each row. It's a powerful way to filter data without needing to write loops. The condition can be based on any column value or a combination of columns.
import pandas as pd
df = pd.DataFrame({'Age': [20, 25, 30], 'Name': ['Alice', 'Bob', 'Charlie']})
filtered_df = df[df['Age'] > 21] # Only rows where Age > 21
print(filtered_df)
Output:
   Age     Name
1   25      Bob
2   30  Charlie
Sorting helps in organizing data by arranging the values in ascending or descending order. Sorting can be done by column values or by the index. In pandas, the sort_values() function is used to sort the data based on column values, and sort_index() is used to sort by index.
import pandas as pd
df = pd.DataFrame({'Score': [90, 80, 85], 'Name': ['Alice', 'Bob', 'Charlie']})
sorted_df = df.sort_values(by='Score') # Sort by Score
print(sorted_df)
Output:
   Score     Name
1     80      Bob
2     85  Charlie
0     90    Alice
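For descending order, pass ascending=False to sort_values(), and sort_index() re-orders the rows by their index; a quick sketch continuing the example above:
desc_df = df.sort_values(by='Score', ascending=False) # Sort by Score, highest first
print(desc_df)
print(sorted_df.sort_index()) # Restore the original row order by sorting on the index
Output:
   Score     Name
0     90    Alice
2     85  Charlie
1     80      Bob
   Score     Name
0     90    Alice
1     80      Bob
2     85  Charlie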
Conditional filtering lets you select rows that meet one or more conditions. It's a crucial technique for narrowing down a large dataset to relevant information, such as filtering data based on ranges or specific criteria.
import pandas as pd
df = pd.DataFrame({'Age': [20, 25, 30], 'Salary': [3000, 4000, 5000]})
condition = df['Age'] > 21
filtered = df[condition] # Apply condition
print(filtered)
Output:
Age Salary
1 25 4000
2 30 5000
Multi-condition filtering is the process of applying more than one condition to filter data. You can combine conditions using logical operators like & (and), | (or), and ~ (not), allowing for complex queries on the dataset.
import pandas as pd
df = pd.DataFrame({'Age': [20, 25, 30], 'Salary': [3000, 4000, 5000]})
filtered = df[(df['Age'] > 21) & (df['Salary'] > 3500)] # Both conditions must be true
print(filtered)
Output:
Age Salary
1 25 4000
2 30 5000
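The same pattern works with | and ~; a short sketch on the same DataFrame:
either = df[(df['Age'] < 22) | (df['Salary'] > 4500)] # Keep rows matching at least one condition
print(either)
negated = df[~(df['Age'] > 21)] # Keep rows where the condition is NOT true
print(negated)
Output:
   Age  Salary
0   20    3000
2   30    5000
   Age  Salary
0   20    3000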
Vectorized string methods in libraries like pandas allow batch operations on text data. This means that instead of applying string methods individually to each element in a list or array, these methods can be applied to the entire collection of data at once. This approach is highly efficient and reduces the need for writing loops, speeding up processing time.
import pandas as pd
# Create a pandas Series with strings
data = pd.Series(['Hello', 'World', 'AI'])
# Apply vectorized string method to convert all strings to uppercase
result = data.str.upper()
# Print the result
print(result)
Output:
0 HELLO
1 WORLD
2 AI
Cleaning text data is a crucial step before performing any text analysis or applying machine learning models. This process includes removing unwanted characters (such as punctuation or extra spaces), converting text to lowercase, and standardizing the format. Properly cleaned text ensures better model performance and more reliable results in natural language processing (NLP) tasks.
text = " Hello! Welcome...to AI. "
# Strip leading/trailing spaces, convert to lowercase, and replace unwanted characters
cleaned = text.strip().lower().replace("...", " ")
# Print the cleaned text
print(cleaned)
Output:
hello! welcome to ai.
Regular expressions (RegEx) allow us to search for specific patterns within strings. This is particularly useful when we need to extract information such as email addresses, phone numbers, dates, or any structured data embedded within unstructured text. RegEx provides a powerful and flexible way to perform pattern matching in strings.
import re
text = "Contact: hello@example.com"
# Regular expression to match email address
match = re.findall(r'\S+@\S+', text)
# Print the extracted email address
print(match)
Output:
[\"hello@example.com\"]
String replacements and maps allow us to transform specific parts of a string or a set of strings. Using replacement rules (e.g., a dictionary), we can replace substrings in text or change one value to another across multiple strings. This technique is especially useful in data cleaning, normalizing terms, or converting categorical values into a standardized format.
words = ["hi", "bye", "hello"]
# Create a mapping dictionary
mapping = {"hi": "greetings", "bye": "farewell"}
# Replace words using the dictionary
replaced = [mapping.get(word, word) for word in words]
# Print the replaced list
print(replaced)
Output:
["greetings", "farewell", "hello"]
The `groupby()` function in Pandas is used to split the data into groups based on some criteria. This operation is useful for summarizing or transforming data in a grouped fashion. The groupby object can then be used to apply aggregation functions, transformations, or filtering operations. The basic idea is that it allows you to perform operations on each group independently.
import pandas as pd # Import pandas library
data = {'Category': ['A', 'A', 'B', 'B', 'C', 'C'], # Sample data
'Value': [10, 20, 30, 40, 50, 60]} # Corresponding values
df = pd.DataFrame(data) # Create DataFrame
grouped = df.groupby('Category') # Group data by 'Category' column
for name, group in grouped: # Loop through each group
print(name) # Print the group name
print(group) # Print the data for each group
Output:
A
Category Value
0 A 10
1 A 20
B
Category Value
2 B 30
3 B 40
C
Category Value
4 C 50
5 C 60
Pandas provides several aggregation functions such as `sum()`, `mean()`, `max()`, and `min()`. These functions can be applied to each group after using `groupby()` to summarize the data in meaningful ways, such as calculating the total, average, or minimum of values for each group.
df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B', 'C', 'C'], # Sample data
'Value': [10, 20, 30, 40, 50, 60]}) # Values
grouped = df.groupby('Category')['Value'] # Group by 'Category' and focus on 'Value' column
result = grouped.sum() # Calculate sum for each group
print(result) # Print the result
Output:
Category
A 30
B 70
C 110
Name: Value, dtype: int64
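Several aggregations can also be computed in one call by passing a list of function names to agg(); a quick sketch on the same grouping:
multi_stats = grouped.agg(['mean', 'max', 'min']) # Several statistics per group at once
print(multi_stats)
Output:
          mean  max  min
Category
A         15.0   20   10
B         35.0   40   30
C         55.0   60   50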
Custom aggregations allow you to apply your own function to each group. This is useful when built-in aggregation functions do not meet your needs. You can define your own function and pass it to the `agg()` method to perform more complex calculations.
def custom_func(x): # Custom aggregation function
    return x.max() - x.min() # Difference between max and min values
df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B', 'C', 'C'], # Data
'Value': [10, 20, 30, 40, 50, 60]})
grouped = df.groupby('Category')['Value'] # Group by 'Category'
result = grouped.agg(custom_func) # Apply custom aggregation function
print(result) # Print result
Output:
Category
A 10
B 10
C 10
Name: Value, dtype: int64
Hierarchical grouping, or multi-level grouping, allows you to group data by more than one column. This is useful when you want to analyze the data in a more granular way, for example, grouping by both 'Category' and 'Subcategory'. Pandas allows you to perform aggregation operations on multi-level indexes.
df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B', 'C', 'C'], # Sample data
'Subcategory': ['X', 'Y', 'X', 'Y', 'X', 'Y'], # Subcategories
'Value': [10, 20, 30, 40, 50, 60]}) # Values
grouped = df.groupby(['Category', 'Subcategory'])['Value'] # Group by both 'Category' and 'Subcategory'
result = grouped.sum() # Sum for each subgroup
print(result) # Print result
Output:
Category Subcategory
A X 10
Y 20
B X 30
Y 40
C X 50
Y 60
Name: Value, dtype: int64
Concatenation is the process of joining two or more datasets along a particular axis (rows or columns). In pandas, the `concat()` function allows you to combine multiple DataFrames vertically or horizontally. This operation does not require a common column between the datasets.
import pandas as pd # Importing pandas
data1 = {'A': [1, 2], 'B': [3, 4]} # First dataset
data2 = {'A': [5, 6], 'B': [7, 8]} # Second dataset
df1 = pd.DataFrame(data1) # Create first DataFrame
df2 = pd.DataFrame(data2) # Create second DataFrame
df_combined = pd.concat([df1, df2], ignore_index=True) # Concatenate DataFrames
print(df_combined) # Print the concatenated DataFrame
Output:
A B
0 1 3
1 2 4
2 5 7
3 6 8
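Passing axis=1 concatenates the DataFrames side by side instead of stacking rows; a quick sketch with the same two DataFrames:
df_side_by_side = pd.concat([df1, df2], axis=1) # Concatenate horizontally (columns)
print(df_side_by_side)
Output:
   A  B  A  B
0  1  3  5  7
1  2  4  6  8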
Merging is the process of combining two DataFrames based on a key or common column. It is similar to SQL joins. You can perform inner, outer, left, or right joins to specify how you want to combine the data based on the keys.
data1 = {'ID': [1, 2], 'Name': ['Alice', 'Bob']} # First dataset
data2 = {'ID': [1, 2], 'Age': [25, 30]} # Second dataset
df1 = pd.DataFrame(data1) # Create first DataFrame
df2 = pd.DataFrame(data2) # Create second DataFrame
df_merged = pd.merge(df1, df2, on='ID') # Merge on the 'ID' column
print(df_merged) # Print the merged DataFrame
Output:
ID Name Age
0 1 Alice 25
1 2 Bob 30
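The how parameter selects the join type; a short sketch with slightly different keys (the extra IDs are made up for this illustration) shows an outer join keeping the unmatched rows:
left = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Carol']}) # Has an extra ID 3
right = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 40]}) # Has an extra ID 4
outer = pd.merge(left, right, on='ID', how='outer') # Keep keys from both sides
print(outer)
Output:
   ID   Name   Age
0   1  Alice  25.0
1   2    Bob  30.0
2   3  Carol   NaN
3   4    NaN  40.0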
The `join()` method is used for merging DataFrames based on their index, while `merge()` is used to merge DataFrames based on one or more key columns. `merge()` is more flexible, allowing you to specify different types of joins (inner, outer, left, right), while `join()` is typically used when the join keys are the indices.
data1 = {'A': [1, 2]} # First dataset
data2 = {'B': [3, 4]} # Second dataset
df1 = pd.DataFrame(data1, index=[1, 2]) # First DataFrame with index
df2 = pd.DataFrame(data2, index=[1, 2]) # Second DataFrame with index
df_joined = df1.join(df2) # Join DataFrames based on index
print(df_joined) # Print the joined DataFrame
Output:
A B
1 1 3
2 2 4
When combining datasets, it is important to handle overlaps and duplicates effectively. You can use the `drop_duplicates()` method to remove duplicate rows. Additionally, when merging datasets, you may need to resolve conflicts or overlapping columns using parameters like `suffixes` to avoid column name clashes.
data1 = {'ID': [1, 2, 2], 'Name': ['Alice', 'Bob', 'Bob']} # First dataset with duplicates
df1 = pd.DataFrame(data1) # Create first DataFrame
df_no_duplicates = df1.drop_duplicates() # Remove duplicates
print(df_no_duplicates) # Print the DataFrame without duplicates
Output:
ID Name
0 1 Alice
1 2 Bob
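When the two DataFrames share a column name other than the merge key, the suffixes parameter keeps both versions apart; a small sketch (the quarterly totals below are illustrative):
sales_q1 = pd.DataFrame({'ID': [1, 2], 'Total': [100, 200]}) # First dataset
sales_q2 = pd.DataFrame({'ID': [1, 2], 'Total': [150, 250]}) # Second dataset with the same column name
merged = pd.merge(sales_q1, sales_q2, on='ID', suffixes=('_q1', '_q2')) # Disambiguate overlapping columns
print(merged)
Output:
   ID  Total_q1  Total_q2
0   1       100       150
1   2       200       250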
The pivot() function is used to reshape data by turning unique values from one column into separate columns, while the pivot_table() function can be used for summarizing data by applying aggregation functions (like sum, mean) to the data.
import pandas as pd
data = {'Date': ['2021-01-01', '2021-01-01', '2021-01-02'], 'City': ['A', 'B', 'A'], 'Sales': [200, 150, 300]}
df = pd.DataFrame(data)
pivot_df = df.pivot(index='Date', columns='City', values='Sales')
print(pivot_df)
Output:
City A B
Date
2021-01-01 200 150
2021-01-02 300 NaN
pivot_table_df = df.pivot_table(index='Date', columns='City', values='Sales', aggfunc='sum')
print(pivot_table_df)
Output:
City A B
Date
2021-01-01 200 150
2021-01-02 300 NaN
Melting is the opposite of pivoting. It unpivots a DataFrame by converting columns into rows. The melt() function is useful when you need to convert wide-format data into long-format data, making it easier to analyze.
df_melted = df.melt(id_vars=['Date'], value_vars=['City', 'Sales'], var_name='Variable', value_name='Value')
print(df_melted)
Output:
         Date Variable Value
0  2021-01-01     City     A
1  2021-01-01     City     B
2  2021-01-02     City     A
3  2021-01-01    Sales   200
4  2021-01-01    Sales   150
5  2021-01-02    Sales   300
The stack() function stacks the columns of a DataFrame into rows (vertical axis). The unstack() function does the opposite: it converts the rows of a DataFrame into columns (horizontal axis). These functions are useful when working with multi-level indices (MultiIndex).
stacked_df = pivot_df.stack()
print(stacked_df)
Output:
2021-01-01 A 200
2021-01-01 B 150
2021-01-02 A 300
unstacked_df = stacked_df.unstack()
print(unstacked_df)
Output:
City A B
Date
2021-01-01 200 150
2021-01-02 300 NaN
MultiIndex allows you to have multiple levels of indexing in a DataFrame. You can reshape a DataFrame using MultiIndex for more complex data analysis tasks. You can use stack() and unstack() to work with MultiIndex as well.
index = pd.MultiIndex.from_tuples([('2021-01-01', 'A'), ('2021-01-01', 'B'), ('2021-01-02', 'A')], names=['Date', 'City'])
multiindex_df = pd.DataFrame({'Sales': [200, 150, 300]}, index=index)
print(multiindex_df)
Output:
Sales
Date City
2021-01-01 A 200
2021-01-01 B 150
2021-01-02 A 300
reshaped_df = multiindex_df.unstack()
print(reshaped_df)
Output:
City A B
Date
2021-01-01 200 150
2021-01-02 300 NaN
Working with dates and times is crucial for time-series analysis. Parsing dates refers to converting date information from a string format into a usable DateTime format. This makes it easier to perform operations like filtering and aggregating data based on specific dates.
# Real-world example: Parsing dates
import pandas as pd # Import the pandas library
data = pd.to_datetime('2025-04-20') # Convert a string to a DateTime object
print(data) # Output the parsed date
Output: '2025-04-20 00:00:00'
Date indexing is used to set a column with date-related values as the index of a DataFrame. This allows for efficient querying and manipulation of data based on dates. It is particularly useful in time-series analysis, where each row corresponds to a specific point in time.
# Real-world example: Date indexing
data = pd.DataFrame({'date': ['2025-01-01', '2025-02-01', '2025-03-01'], 'value': [10, 20, 30]})
data['date'] = pd.to_datetime(data['date']) # Convert date column to DateTime
data.set_index('date', inplace=True) # Set the 'date' column as the index
print(data) # Display the DataFrame with date as index
Output: DataFrame with 'date' as the index column
Resampling refers to the process of converting time-series data from one frequency to another. This can be useful for aggregating or downsampling data at specific intervals, such as converting daily data into weekly or monthly data. Frequency conversion is typically used to change the granularity of time-series data.
# Real-world example: Resampling and frequency conversion
data = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2025-01-01', periods=5, freq='D'))
resampled_data = data.resample('M').sum() # Resample data to monthly frequency and sum values
print(resampled_data) # Display resampled data
Output: The sum of the values for each month
Rolling windows and expanding windows are methods for calculating statistics over a specific time window. A rolling window computes a statistic (e.g., mean, sum) over a moving window of fixed size, while expanding windows calculate the statistic for all data up to the current point.
# Real-world example: Rolling and expanding windows
data = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2025-01-01', periods=5, freq='D'))
rolling_mean = data.rolling(window=2).mean() # Compute rolling mean with a window size of 2
expanding_mean = data.expanding().mean() # Compute expanding mean
print(rolling_mean) # Display rolling mean
print(expanding_mean) # Display expanding mean
Output: The rolling mean and expanding mean values for each point in the series
Time series slicing refers to extracting specific time periods from a time series dataset. This is useful for focusing on specific time frames such as years, months, or days for analysis or forecasting.
import pandas as pd
date_range = pd.date_range('2021-01-01', periods=5, freq='D')
df = pd.DataFrame({'Date': date_range, 'Value': [10, 20, 30, 40, 50]})
sliced_df = df[df['Date'] >= '2021-01-03']
print(sliced_df)
# This slices the data from January 3rd, 2021 onward
Output:
Date Value
2 2021-01-03 30
3 2021-01-04 40
4 2021-01-05 50
Moving averages smooth out short-term fluctuations and highlight longer-term trends in time series data. This technique is frequently used in forecasting and trend analysis.
import pandas as pd
date_range = pd.date_range('2021-01-01', periods=5, freq='D')
df = pd.DataFrame({'Date': date_range, 'Value': [10, 20, 30, 40, 50]})
df['Moving_Avg'] = df['Value'].rolling(window=3).mean()
print(df)
# This calculates the 3-day moving average of the 'Value' column
Output:
Date Value Moving_Avg
0 2021-01-01 10 NaN
1 2021-01-02 20 NaN
2 2021-01-03 30 20.0
3 2021-01-04 40 30.0
4 2021-01-05 50 40.0
Time zone conversion is essential when working with global data to ensure that timestamps are accurate across different regions. This involves converting the time series data to a specific time zone.
import pandas as pd
date_range = pd.date_range('2021-01-01 12:00', periods=5, freq='D')
df = pd.DataFrame({'Date': date_range, 'Value': [10, 20, 30, 40, 50]})
df['Date'] = df['Date'].dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
print(df)
# This converts the time zone from UTC to US Eastern Time
Output:
Date Value
0 2021-01-01 07:00:00-05:00 10
1 2021-01-02 07:00:00-05:00 20
2 2021-01-03 07:00:00-05:00 30
3 2021-01-04 07:00:00-05:00 40
4 2021-01-05 07:00:00-05:00 50
Shifting data is a technique to move data forward or backward in time. This is useful for creating lag features in time series forecasting or calculating differences between time periods.
import pandas as pd
date_range = pd.date_range('2021-01-01', periods=5, freq='D')
df = pd.DataFrame({'Date': date_range, 'Value': [10, 20, 30, 40, 50]})
df['Shifted'] = df['Value'].shift(1)
print(df)
# This shifts the 'Value' column by 1 period down
Output:
Date Value Shifted
0 2021-01-01 10 NaN
1 2021-01-02 20 10.0
2 2021-01-03 30 20.0
3 2021-01-04 40 30.0
4 2021-01-05 50 40.0
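Differences between consecutive periods follow directly from shifting; diff() computes them in one step, as a quick sketch on the same DataFrame shows:
df['Change'] = df['Value'].diff() # Equivalent to df['Value'] - df['Value'].shift(1)
print(df[['Date', 'Value', 'Change']])
Output:
        Date  Value  Change
0 2021-01-01     10     NaN
1 2021-01-02     20    10.0
2 2021-01-03     30    10.0
3 2021-01-04     40    10.0
4 2021-01-05     50    10.0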
Line, bar, and pie charts are fundamental ways of visualizing data. A line chart is used to display trends over time, while bar charts are useful for comparing quantities, and pie charts show the proportions of categories.
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.title('Line Chart')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
Output:
A line chart will be displayed showing the trend of values on the X and Y axes.
categories = ['A', 'B', 'C', 'D']
values = [10, 20, 30, 40]
plt.bar(categories, values)
plt.title('Bar Chart')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()
Output:
A bar chart will be displayed with the categories on the X-axis and the values on the Y-axis.
sizes = [25, 35, 40]
labels = ['Category A', 'Category B', 'Category C']
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Pie Chart')
plt.show()
Output:
A pie chart will be displayed showing the distribution of the categories with percentage values.
Histograms and boxplots are helpful in analyzing the distribution of data. Histograms show the frequency of data points in intervals, while boxplots highlight the range, median, and outliers in data.
import numpy as np
data = np.random.randn(1000)
plt.hist(data, bins=30)
plt.title('Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Output:
A histogram will be displayed showing the distribution of values in the dataset.
data = [np.random.randn(100), np.random.randn(100) + 1]
plt.boxplot(data)
plt.title('Boxplot')
plt.show()
Output:
A boxplot will be displayed with boxes showing the distribution of the data and potential outliers.
pandas provides built-in plotting functionality to visualize data directly from DataFrames. It uses Matplotlib as a backend and makes it easy to create plots with simple commands.
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 4, 6, 8, 10]})
df.plot(x='x', y='y', kind='line')
plt.title('pandas Line Plot')
plt.show()
Output:
A line plot will be displayed showing the relationship between columns 'x' and 'y' from the DataFrame.
df.plot(x='x', y='y', kind='bar')
plt.title('pandas Bar Plot')
plt.show()
Output:
A bar plot will be displayed using the values from the DataFrame.
Dual axis and subplots allow you to plot multiple graphs on the same figure. Dual-axis plots show two sets of data on different y-axes, while subplots allow multiple charts to be placed in a grid layout.
x = [1, 2, 3, 4, 5]
y1 = [10, 20, 25, 30, 40]
y2 = [100, 90, 80, 70, 60]
fig, ax1 = plt.subplots()
ax1.plot(x, y1, 'g-') # First y-axis
ax2 = ax1.twinx()
ax2.plot(x, y2, 'b-') # Second y-axis
plt.show()
Output:
A plot with two y-axes will be displayed, one showing the 'y1' values and another showing 'y2' values.
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(x, y1)
ax2.plot(x, y2)
plt.show()
Output:
Two plots will be displayed side by side, each showing different data on the same figure.
E-commerce orders data typically contains information on product purchases, such as customer details, order date, product categories, prices, and more. Working with this data allows businesses to gain insights into customer behavior, sales trends, and performance metrics.
import pandas as pd
# Load the e-commerce dataset
data = pd.read_csv('ecommerce_orders.csv')
# Display the first few rows of the dataset
print(data.head())
# Analyze total sales
total_sales = data['total_price'].sum()
print(f'Total Sales: {total_sales}')
Output:
First few rows of the dataset
Total Sales: 50000
COVID-19 data includes information about the spread of the virus, infection rates, mortality rates, vaccination rates, and more. This data is often used for epidemiological studies and to inform public health policies. Understanding and analyzing this dataset can provide crucial insights into the pandemic's progression.
# Load the COVID-19 dataset
covid_data = pd.read_csv('covid_data.csv')
# Group by country and calculate total cases
country_cases = covid_data.groupby('country')['cases'].sum()
# Print the total cases for each country
print(country_cases)
Output:
Country1: 100000
Country2: 200000
...
The NYC property sales dataset includes information about properties sold in New York City, such as sale price, property type, and location. This dataset is valuable for real estate analysis, helping to identify trends in property values and market demand.
# Load the NYC property dataset
property_data = pd.read_csv('nyc_property_sales.csv')
# Analyze the average sale price
avg_price = property_data['sale_price'].mean()
print(f'Average Sale Price: {avg_price}')
Output:
Average Sale Price: 750000
Airline on-time data includes information about flights, including departure and arrival times, delays, and cancellations. This data is essential for analyzing airline performance, identifying patterns in delays, and improving operational efficiency.
# Load the airline on-time dataset
flight_data = pd.read_csv('airline_on_time.csv')
# Group by airline and calculate average delay
airline_delays = flight_data.groupby('airline')['arrival_delay'].mean()
# Print the average delays for each airline
print(airline_delays)
Output:
Airline1: 15 minutes
Airline2: 30 minutes
...
Exporting data to various file formats like CSV, Excel, and JSON is a common task in data analysis. Pandas makes it easy to save DataFrames to these formats using methods like `to_csv()`, `to_excel()`, and `to_json()`. These formats are widely supported and can be opened and analyzed in other tools such as spreadsheets or databases.
import pandas as pd # Import pandas library
data = {'Name': ['Alice', 'Bob', 'Charlie'], # Sample data
'Age': [25, 30, 35]}
df = pd.DataFrame(data) # Create DataFrame
df.to_csv('data.csv', index=False) # Export DataFrame to CSV without row index
df.to_excel('data.xlsx', index=False) # Export DataFrame to Excel without row index
df.to_json('data.json') # Export DataFrame to JSON
Output: Files 'data.csv', 'data.xlsx', and 'data.json' will be created in the working directory.
HDF5 and Feather are binary file formats that provide fast and efficient storage for large datasets. HDF5 is often used for storing complex data hierarchies, while Feather is optimized for fast reading and writing of large datasets. Both formats are commonly used for storing data in data science and machine learning workflows.
df.to_hdf('data.h5', key='df', mode='w') # Export DataFrame to HDF5 format
df.to_feather('data.feather') # Export DataFrame to Feather format
Output: Files 'data.h5' and 'data.feather' will be created in the working directory.
Compression is useful for reducing file size and improving storage efficiency, especially when working with large datasets. Pandas supports compressing text-based exports such as CSV and JSON using algorithms like 'gzip', 'bz2', 'zip', and 'xz', while HDF5 files are compressed through libraries such as 'blosc' or 'zlib'. By using compression, you can significantly reduce the size of data files without losing information.
df.to_csv('data_compressed.csv.gz', index=False, compression='gzip') # Export CSV with gzip compression
df.to_json('data_compressed.json.gz', compression='gzip') # Export JSON with gzip compression
df.to_hdf('data_compressed.h5', key='df', mode='w', complib='blosc', complevel=9) # Export HDF5 with blosc compression
Output: Files 'data_compressed.csv.gz', 'data_compressed.json.gz', and 'data_compressed.h5' will be created with compression applied.
Version control for data ensures that different versions of datasets are stored and managed efficiently. This is particularly important in collaborative environments where data is constantly evolving. By using tools like Git or DVC (Data Version Control), you can track changes to datasets and revert to previous versions when necessary. This helps maintain consistency and reproducibility in data analysis projects.
# Using DVC to track data file versions
!git init # Initialize a Git repository
!dvc init # Initialize DVC (Data Version Control)
!dvc add data.csv # Add data file to DVC
!git add data.csv.dvc # Add DVC file to Git
!git commit -m "Add data version 1" # Commit to Git
!dvc push # Push data to remote storage
Output: The data file 'data.csv' will be tracked with DVC, and version control will be applied using Git.
MultiIndex allows you to have multiple levels of indexing on a DataFrame or Series. This provides the ability to represent more complex data structures, such as hierarchical or multi-dimensional data, using a single object. MultiIndex is commonly used when dealing with data grouped by more than one criterion.
import pandas as pd # Importing pandas
arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]] # Defining the multi-level index
index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number')) # Creating MultiIndex
data = [10, 20, 30, 40] # Data for the Series
multi_index_series = pd.Series(data, index=index) # Creating a Series with MultiIndex
print(multi_index_series) # Print the Series with MultiIndex
Output:
Letter Number
A 1 10
2 20
B 1 30
2 40
Indexing by level allows you to access data in a MultiIndex structure by specifying a level or combination of levels. This is useful for slicing data based on specific index levels without needing to reset or flatten the index.
arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]] # Defining the multi-level index
index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number')) # Creating MultiIndex
data = [10, 20, 30, 40] # Data for the Series
multi_index_series = pd.Series(data, index=index) # Creating a Series with MultiIndex
print(multi_index_series['A']) # Indexing by the first level (Letter='A')
print(multi_index_series.loc['B']) # Indexing by the first level (Letter='B')
Output:
Number
1 10
2 20
dtype: int64
Number
1 30
2 40
dtype: int64
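Individual elements can be reached with a tuple of labels, and xs() takes a cross-section along a chosen level; a quick sketch on the same Series:
print(multi_index_series[('A', 2)]) # Access a single element by both levels
print(multi_index_series.xs(1, level='Number')) # Cross-section: all rows where Number == 1
Output:
20
Letter
A    10
B    30
dtype: int64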
Swapping the index levels allows you to change the order of the index levels, which can be useful for rearranging or sorting the data. Sorting by index can also be done to arrange the data in a specified order, either ascending or descending.
arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]] # Defining the multi-level index
index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number')) # Creating MultiIndex
data = [10, 20, 30, 40] # Data for the Series
multi_index_series = pd.Series(data, index=index) # Creating a Series with MultiIndex
swapped = multi_index_series.swaplevel() # Swapping index levels
print(swapped) # Print Series with swapped index
sorted_series = multi_index_series.sort_index() # Sorting by index
print(sorted_series) # Print sorted Series
Output:
Number Letter
1 A 10
B 30
2 A 20
B 40
dtype: int64
Letter Number
A 1 10
2 20
B 1 30
2 40
Index set operations allow you to perform mathematical set operations (like union, intersection, and difference) on the index of a pandas object. This can be useful for comparing and manipulating the indices of different datasets.
index1 = pd.Index(['A', 'B', 'C', 'D']) # First index
index2 = pd.Index(['B', 'C', 'E', 'F']) # Second index
union_index = index1.union(index2) # Union of indices (all unique elements)
intersection_index = index1.intersection(index2) # Intersection of indices (common elements)
difference_index = index1.difference(index2) # Difference of indices (elements in index1 but not index2)
print(union_index) # Print union of indices
print(intersection_index) # Print intersection of indices
print(difference_index) # Print difference of indices
Output:
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
Index(['B', 'C'], dtype='object')
Index(['A', 'D'], dtype='object')
In pandas, you can apply custom functions to rows or columns of a DataFrame using the apply() method. When applying a function to columns, the function is applied to each individual column, whereas applying it to rows means the function will be applied to each row. The axis parameter is used to specify whether you want to apply the function along rows (axis=1) or columns (axis=0).
import pandas as pd
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
def sum_func(col):
    return col.sum()
column_sum = df.apply(sum_func, axis=0)
print(column_sum)
Output:
A 6
B 15
row_sum = df.apply(sum_func, axis=1)
print(row_sum)
Output:
0 5
1 7
2 9
A lambda function is a small anonymous function that can be defined in a single line. It is often used with the apply() method to create quick functions for operations like transformations and computations.
df['A_squared'] = df['A'].apply(lambda x: x**2)
print(df)
Output:
A B A_squared
0 1 4 1
1 2 5 4
2 3 6 9
df['A_label'] = df['A'].apply(lambda x: 'Small' if x < 3 else 'Large')
print(df)
Output:
A B A_squared A_label
0 1 4 1 Small
1 2 5 4 Small
2 3 6 9 Large
The map() function is used for element-wise transformations, and it works specifically with Series (one-dimensional data). In contrast, the apply() function is more flexible and can be used with both Series and DataFrames, enabling more complex transformations across rows or columns. map() is generally faster than apply() when applied to Series.
df['B_mapped'] = df['B'].map(lambda x: x * 10)
print(df)
Output:
A B A_squared A_label B_mapped
0 1 4 1 Small 40
1 2 5 4 Small 50
2 3 6 9 Large 60
df['row_sum'] = df.apply(lambda row: row['A'] + row['B'], axis=1)
print(df)
Output:
A B A_squared A_label B_mapped row_sum
0 1 4 1 Small 40 5
1 2 5 4 Small 50 7
2 3 6 9 Large 60 9
Complex transformations involve applying more advanced logic to the DataFrame or Series. This may include combining multiple columns, performing conditional operations, or using external functions within the transformation.
df['A_and_B'] = df.apply(lambda row: row['A'] * row['B'] if row['A'] > 1 else row['A'] + row['B'], axis=1)
print(df)
Output:
A B A_squared A_label B_mapped row_sum A_and_B
0 1 4 1 Small 40 5 5
1 2 5 4 Small 50 7 10
2 3 6 9 Large 60 9 18
def multiply(a, b):
    return a * b
df['multiplication'] = df.apply(lambda row: multiply(row['A'], row['B']), axis=1)
print(df)
Output:
A B A_squared A_label B_mapped row_sum A_and_B multiplication
0 1 4 1 Small 40 5 5 4
1 2 5 4 Small 50 7 10 10
2 3 6 9 Large 60 9 18 18
Optimizing the data types used in your dataset can significantly improve performance, especially when working with large datasets. By choosing more efficient data types, such as using integers instead of floats, or reducing memory usage by using smaller types like 'float32' instead of 'float64', you can save both memory and processing time.
# Real-world example: Efficient data types
import pandas as pd # Import the pandas library
data = pd.DataFrame({'col1': [1.0, 2.5, 3.8], 'col2': [5.2, 6.3, 7.4]}) # Create a DataFrame
data['col1'] = data['col1'].astype('float32') # Convert to more efficient float32
data['col2'] = data['col2'].astype('float32') # Convert to more efficient float32
print(data.dtypes) # Output the data types of each column
Output: The data types of 'col1' and 'col2' are now 'float32'
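To confirm the saving, memory_usage() reports the bytes used per column; a quick check on the same DataFrame (the index adds a small, version-dependent overhead):
print(data.memory_usage(deep=True)) # Bytes per column: each float32 column holds 3 values x 4 bytes = 12 bytes
Output: Each float32 column now uses 12 bytes, half of the 24 bytes a float64 column would need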
Categorical data refers to data that can take on a limited, fixed number of values, often representing categories such as 'male/female' or 'red/blue/green'. Using the 'category' data type in pandas reduces memory usage and increases performance when dealing with repetitive text values.
# Real-world example: Using categorical data
data = pd.DataFrame({'category_col': ['red', 'blue', 'green', 'red', 'blue']}) # Create a DataFrame
data['category_col'] = data['category_col'].astype('category') # Convert to categorical type
print(data.dtypes) # Output the data type of the 'category_col' column
Output: The data type of 'category_col' is now 'category'
Vectorization refers to replacing explicit loops with operations that are performed over entire arrays or columns of data. By using vectorized operations, you can take advantage of optimized C code behind libraries like NumPy and pandas, leading to faster execution times compared to using Python loops.
# Real-world example: Vectorization over loops
import numpy as np # Import NumPy library
data = np.array([1, 2, 3, 4, 5]) # Create a NumPy array
result = data * 2 # Vectorized operation (no loop needed)
print(result) # Output the result
Output: The array [2, 4, 6, 8, 10]
Profiling with the magic command `%%timeit` in Jupyter Notebooks or IPython allows you to measure the execution time of a single line or a block of code. This can be useful for determining performance bottlenecks and optimizing code where necessary.
# Real-world example: Profiling with %%timeit
# This code should be run in a Jupyter Notebook or IPython environment
%%timeit # Measure the time taken by this code
data = [x for x in range(1000)] # List comprehension to create a list
Output: The time taken to execute the list comprehension is displayed
Using try-except blocks in a pandas pipeline allows you to gracefully handle errors, ensuring that the process continues even if one step fails. This is particularly useful in data processing when unexpected errors occur.
import pandas as pd
def safe_divide(a, b): # Division that handles a zero denominator
    try:
        return a / b
    except ZeroDivisionError:
        print("Error: Division by zero encountered")
        return float('nan') # Keep the pipeline going by returning NaN for this row
def process_data(df): # One step of a pandas pipeline
    df['Value'] = [safe_divide(a, b) for a, b in zip(df['Value'], df['Value2'])]
    return df
df = pd.DataFrame({'Value': [10, 20, 30], 'Value2': [2, 0, 5]})
result = process_data(df)
print(result)
# This uses try-except to handle division by zero row by row; vectorized pandas division
# (df['Value'] / df['Value2']) would return inf instead of raising ZeroDivisionError
Output:
Error: Division by zero encountered
   Value  Value2
0    5.0       2
1    NaN       0
2    6.0       5
Logging allows you to record information about your program's execution, making it easier to track errors and debug. Custom errors can also be raised to handle specific conditions in a program.
import logging
logging.basicConfig(level=logging.INFO)
def check_value(value):
    if value < 0:
        logging.error("Value cannot be negative")
        raise ValueError("Negative value error")
    return value
try:
    check_value(-5)
except ValueError as e:
    logging.exception(e)
# This logs an error when a negative value is encountered
Output:
ERROR:root:Value cannot be negative
ERROR:root:Negative value error
Traceback (most recent call last):
  File "script.py", line 7, in <module>
    check_value(-5)
  File "script.py", line 5, in check_value
    raise ValueError("Negative value error")
ValueError: Negative value error
Warnings indicate potential issues in code that may not stop execution but could lead to problems in the future. Deprecation warnings indicate that certain features or functions will be removed in future versions of libraries.
import warnings
def old_function():
    warnings.warn("This function is deprecated", DeprecationWarning)
old_function()
# This generates a deprecation warning
Output:
DeprecationWarning: This function is deprecated
Breakpoints allow you to pause code execution at a specific line, enabling you to inspect variables and the program state. This is especially useful for debugging issues in complex code.
import pdb
def calculate(x, y):
    pdb.set_trace()
    return x + y
result = calculate(10, 20)
print(result)
# This will pause the program at the breakpoint, allowing inspection
Output:
(Pdb) 10
(Pdb) 20
(Pdb) n
30
Seaborn is a Python data visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics. It is designed to work well with pandas DataFrames and provides a variety of plotting functions for visualizing distributions, relationships, and categories.
import seaborn as sns
import matplotlib.pyplot as plt
# Load the iris dataset
data = sns.load_dataset('iris')
sns.scatterplot(x='sepal_length', y='sepal_width', data=data)
plt.title('Seaborn Scatter Plot')
plt.show()
Output:
A scatter plot will be displayed showing the relationship between 'sepal_length' and 'sepal_width' from the iris dataset.
Seaborn provides several built-in functions to create statistical plots, such as bar plots, box plots, violin plots, and regression plots. These plots are designed to visualize the statistical properties of the data, like distributions, central tendency, and spread.
sns.boxplot(x='species', y='sepal_length', data=data)
plt.title('Boxplot of Sepal Length by Species')
plt.show()
Output:
A boxplot will be displayed, showing the distribution of 'sepal_length' for each species in the iris dataset.
sns.regplot(x='sepal_length', y='sepal_width', data=data)
plt.title('Regression Plot')
plt.show()
Output:
A regression plot will be displayed showing the linear relationship between 'sepal_length' and 'sepal_width' with a fitted line.
Pairplots and heatmaps are two powerful visualization techniques in Seaborn. Pairplots show relationships between all variables in a dataset, while heatmaps visualize correlation matrices or other two-dimensional data.
sns.pairplot(data, hue='species')
plt.title('Pairplot of Iris Dataset')
plt.show()
Output:
A pairplot will be displayed showing scatter plots between all variables in the iris dataset, color-coded by the species.
import numpy as np
corr_matrix = data.corr(numeric_only=True)  # Correlations between the numeric columns (excludes 'species')
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
Output:
A heatmap will be displayed showing the correlation matrix between the numerical features of the iris dataset, with annotated values.
Seaborn allows users to customize the appearance of their plots using themes and color palettes. You can change the overall style of plots and adjust the color schemes to make the visuals more appealing and easier to interpret.
sns.set_style('whitegrid')
sns.scatterplot(x='sepal_length', y='sepal_width', data=data)
plt.title('Scatter Plot with Custom Theme')
plt.show()
Output:
A scatter plot will be displayed with a white grid background, using Seaborn's 'whitegrid' theme.
sns.set_palette('muted')
sns.boxplot(x='species', y='sepal_length', data=data)
plt.title('Boxplot with Custom Palette')
plt.show()
Output:
A boxplot will be displayed using Seaborn's 'muted' color palette.
Importing finance data is the first step in performing stock market analysis. You can use various data sources like Yahoo Finance, Alpha Vantage, or Quandl to fetch historical stock price data. This data typically includes the opening, closing, high, and low prices for each trading day.
import yfinance as yf
# Define the stock ticker symbol (e.g., Apple)
ticker = 'AAPL'
# Download historical stock data for the past year
data = yf.download(ticker, period='1y')
# Display the first few rows of the data
print(data.head())
Output:
First few rows of Apple's stock data including Date, Open, High, Low, Close, Volume, Adjusted Close
Daily returns refer to the percentage change in stock price from one day to the next. Cumulative returns show the total percentage change over a specified period. These calculations help investors understand short-term price movements and the overall performance of a stock.
# Calculate daily returns by computing the percentage change of the closing price
data['Daily_Return'] = data['Close'].pct_change()
# Calculate cumulative returns
data['Cumulative_Return'] = (1 + data['Daily_Return']).cumprod() - 1
# Display the results
print(data[['Daily_Return', 'Cumulative_Return']].tail())
Output:
Last 5 rows of daily returns and cumulative returns for Apple
Moving averages help smooth out price data to identify trends over a specific period, while volatility measures the variation in stock prices. A common moving average is the 50-day or 200-day moving average, and volatility can be measured using the standard deviation of daily returns.
# Calculate the 50-day moving average
data['50_Day_MA'] = data['Close'].rolling(window=50).mean()
# Calculate the volatility (standard deviation of daily returns)
data['Volatility'] = data['Daily_Return'].rolling(window=50).std()
# Display the results
print(data[['50_Day_MA', 'Volatility']].tail())
Output:
Last 5 rows of 50-day moving averages and volatility values
Risk vs return is a fundamental concept in finance. Investors must assess whether the potential return on an investment justifies the associated risk. By analyzing historical stock data, we can calculate the expected return and the risk (volatility) to make more informed investment decisions.
# Calculate the expected return (mean of daily returns)
expected_return = data['Daily_Return'].mean()
# Calculate the risk (standard deviation of daily returns)
risk = data['Daily_Return'].std()
# Print the expected return and risk
print(f'Expected Return: {expected_return:.4f}')
print(f'Risk (Volatility): {risk:.4f}')
Output:
Expected Return: 0.0012
Risk (Volatility): 0.0153
Web scraping is a technique used to extract data from websites. In this case, we will scrape movie data from IMDb using the BeautifulSoup and requests libraries. This allows us to gather information such as movie titles, genres, release dates, and ratings.
import requests # Import requests library
from bs4 import BeautifulSoup # Import BeautifulSoup for HTML parsing
url = 'https://www.imdb.com/chart/top' # IMDb top movies chart URL
response = requests.get(url) # Send a GET request to the URL
soup = BeautifulSoup(response.text, 'html.parser') # Parse HTML content using BeautifulSoup
movies = soup.find_all('td', class_='titleColumn') # Find all movie title columns
for movie in movies[:5]: # Loop through the first 5 movies
title = movie.a.text # Extract movie title
year = movie.span.text.strip('()') # Extract movie release year
print(f'{title} ({year})') # Print movie title and release year
Output: The top 5 movies from IMDb, along with their release years, will be printed.
Grouping movies by genre allows us to analyze trends and compare different types of movies. We can create a dictionary where the genre is the key, and the value is a list of movies belonging to that genre.
import pandas as pd # Import pandas for data manipulation
# Sample data: movie titles and their genres
data = {'Title': ['Inception', 'The Dark Knight', 'Pulp Fiction', 'The Matrix', 'Forrest Gump'], 'Genre': ['Sci-Fi', 'Action', 'Crime', 'Sci-Fi', 'Drama']}
df = pd.DataFrame(data) # Create a DataFrame from the data
grouped_genres = df.groupby('Genre')['Title'].apply(list) # Group movies by genre
print(grouped_genres) # Print the grouped genres and their movies
Output: A list of movies grouped by genre will be printed, showing the genres with their respective movie titles.
Visualizing how ratings change over time is a powerful tool for understanding movie trends. We can create a plot that shows the ratings of movies over the years using libraries like Matplotlib and Pandas.
import matplotlib.pyplot as plt # Import Matplotlib for plotting
import pandas as pd # Import pandas for data manipulation
# Sample data: movie titles, ratings, and release years
data = {'Title': ['Inception', 'The Dark Knight', 'Pulp Fiction', 'The Matrix', 'Forrest Gump'], 'Rating': [8.8, 9.0, 8.9, 8.7, 8.8], 'Year': [2010, 2008, 1994, 1999, 1994]}
df = pd.DataFrame(data) # Create a DataFrame from the data
plt.plot(df['Year'], df['Rating'], marker='o') # Plot ratings over years
plt.title('Movie Ratings Over Time') # Set plot title
plt.xlabel('Year') # Set x-axis label
plt.ylabel('Rating') # Set y-axis label
plt.show() # Display the plot
Output: A line plot will be displayed showing movie ratings over the years.
Comparing movies based on studio and budget can provide insights into the financial aspect of filmmaking. We can group movies by studio and calculate average budgets or other statistics.
# Sample data: movie titles, studio, and budget
data = {'Title': ['Inception', 'The Dark Knight', 'Pulp Fiction', 'The Matrix', 'Forrest Gump'], 'Studio': ['Warner Bros.', 'Warner Bros.', 'Miramax', 'Warner Bros.', 'Paramount'], 'Budget': [160000000, 185000000, 80000000, 63000000, 55000000]}
df = pd.DataFrame(data) # Create a DataFrame from the data
avg_budget_by_studio = df.groupby('Studio')['Budget'].mean() # Calculate average budget by studio
print(avg_budget_by_studio) # Print the average budget for each studio
Output: The average budget for each studio will be printed, showing how much they typically invest in their movies.
GeoPandas is an extension of the pandas library that makes working with geospatial data in Python easier. It allows you to read, manipulate, and analyze spatial data such as shapefiles, GeoJSON, and others. It supports spatial operations like calculating area, distance, buffering, and joins based on geometry.
import geopandas as gpd # Importing the geopandas library
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres')) # Load world map data
print(world.head()) # Display the first few rows of the GeoDataFrame
Output:
pop_est continent name iso_a3 gdp_md_est geometry
0 920938 Africa Algeria DZA 48830.0 POLYGON ((11.999 23.471...
1 539509 Africa Angola AGO 110300.0 MULTIPOLYGON (((12.227 ...
...
Shapefiles are a common format for geospatial vector data, consisting of several files (.shp, .shx, .dbf, etc.). GeoPandas allows you to import shapefiles easily into a GeoDataFrame using the `read_file()` function. Once loaded, the data can be analyzed or visualized just like a regular DataFrame, with added geometry features.
import geopandas as gpd # Import geopandas
gdf = gpd.read_file('your_shapefile_folder/your_file.shp') # Replace with your actual shapefile path
print(gdf.head()) # Show top rows of shapefile data
Output:
Will vary depending on the shapefile. Typically includes:
ID, Name, geometry, etc.
Example:
ID NAME geometry
1 River POLYGON ((...))
With GeoPandas, you can create maps using longitude and latitude data by converting regular pandas DataFrames to GeoDataFrames using the `points_from_xy` function. This is useful when visualizing addresses, locations, or spatial patterns based on GPS coordinates.
import pandas as pd # Import pandas
import geopandas as gpd # Import geopandas
from shapely.geometry import Point # Import Point for coordinates
# Create a simple DataFrame with coordinates
data = {'City': ['A', 'B'], 'Latitude': [34.05, 36.16], 'Longitude': [-118.24, -115.15]}
df = pd.DataFrame(data) # Create DataFrame
# Convert to GeoDataFrame
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['Longitude'], df['Latitude']))
print(gdf.head()) # Show mapped GeoDataFrame
Output:
City Latitude Longitude geometry
A 34.05 -118.24 POINT (-118.24000 34.05000)
B 36.16 -115.15 POINT (-115.15000 36.16000)
You can combine spatial data (maps) with regular pandas DataFrames using merges or joins on common keys. For example, you can join a GeoDataFrame of regions with another DataFrame containing population or economic data. This allows for spatial analysis and geographic visualizations with rich context.
import geopandas as gpd # Import geopandas
import pandas as pd # Import pandas
# Load a world map
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# Create a simple DataFrame with matching country names
data = {'name': ['Brazil', 'China'], 'Population': [211e6, 1393e6]}
pop_df = pd.DataFrame(data) # Create population DataFrame
# Merge spatial data with population
merged = world.merge(pop_df, on='name') # Merge on 'name' column
print(merged[['name', 'Population', 'geometry']]) # Show result
Output:
name Population geometry
Brazil 211000000.0 POLYGON ((-57.6251 -30.2163...
China 1393000000.0 POLYGON ((109.6875 18.1977...
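Because the merged result is still a GeoDataFrame, the joined attribute can drive a map directly. A small sketch, again assuming matplotlib is available:
import matplotlib.pyplot as plt # Needed to display the plot
merged.plot(column='Population', legend=True) # Colour each matched country by population
plt.title('Population of selected countries') # Add a title
plt.show() # Display the choropleth
Output: A map of Brazil and China shaded by population, with a colour legend.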
Tokenization is the process of breaking down text into individual units such as words or phrases (tokens). Lowercasing is a preprocessing step where all text is converted to lowercase to maintain uniformity and improve comparison. These are often the first steps in natural language processing (NLP) workflows to prepare raw text for analysis or machine learning.
text = "NLP Is Fun and Powerful!" # Original text
tokens = text.lower().split() # Convert to lowercase and split into tokens
print(tokens) # Display the tokens
Output: ['nlp', 'is', 'fun', 'and', 'powerful!']
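A plain split() keeps punctuation attached to words ('powerful!'). If that matters, a regular expression is a common next step; a minimal sketch:
import re # Regular expressions for tokenization
tokens_clean = re.findall(r'[a-z0-9]+', text.lower()) # Keep only runs of letters and digits
print(tokens_clean) # Punctuation is stripped from the tokens
Output: ['nlp', 'is', 'fun', 'and', 'powerful']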
Stopwords are common words in a language (like "and", "is", "the") that often do not add significant meaning. Removing them helps reduce noise in the data and can lead to better model performance. Libraries like NLTK provide predefined lists of stopwords.
from nltk.corpus import stopwords # Import stopwords list
import nltk # Import nltk for natural language tasks
nltk.download('stopwords') # Download stopwords if not already present
text = "NLP is fun and powerful" # Sample text
words = text.lower().split() # Lowercase and tokenize
filtered = [word for word in words if word not in stopwords.words('english')] # Remove stopwords
print(filtered) # Display filtered tokens
Output: ['nlp', 'fun', 'powerful']
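Calling stopwords.words('english') inside the list comprehension rebuilds the list for every word. Building it once as a set is faster and reads the same; a small sketch:
stop_words = set(stopwords.words('english')) # Build the stopword set once
filtered = [word for word in words if word not in stop_words] # Same result, faster lookups
print(filtered) # ['nlp', 'fun', 'powerful']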
The Bag-of-Words (BoW) model transforms text into a fixed-length vector based on word frequency. It disregards grammar and word order but captures how often a word occurs in a document. It’s commonly used for text classification and clustering.
from sklearn.feature_extraction.text import CountVectorizer # Import BoW tool
texts = ["I love NLP", "NLP is great and fun"] # Sample documents
vectorizer = CountVectorizer() # Create a CountVectorizer object
bow = vectorizer.fit_transform(texts) # Transform texts to BoW format
print(vectorizer.get_feature_names_out()) # Print words used in BoW
print(bow.toarray()) # Show frequency matrix
Output:
Words: ['and', 'fun', 'great', 'is', 'love', 'nlp']
Matrix: [[0, 0, 0, 0, 1, 1], [1, 1, 1, 1, 0, 1]]
TF-IDF (Term Frequency-Inverse Document Frequency) is an advanced technique that scores words based on how unique they are across documents. A high TF-IDF value indicates a word is frequent in one document but rare across others, making it more meaningful. Scikit-learn makes it easy to implement TF-IDF using `TfidfVectorizer`.
from sklearn.feature_extraction.text import TfidfVectorizer # Import TF-IDF tool
texts = ["I love NLP", "NLP is great and fun"] # Sample text list
tfidf = TfidfVectorizer() # Create the vectorizer
matrix = tfidf.fit_transform(texts) # Fit and transform the texts
print(tfidf.get_feature_names_out()) # Display words
print(matrix.toarray()) # Display TF-IDF scores
Output:
Words: ['and', 'fun', 'great', 'is', 'love', 'nlp']
Matrix: Each row shows the TF-IDF scores of words per document.
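The raw array is hard to read. Wrapping it in a DataFrame labels each column with its word, which is handy for inspection; a small sketch reusing the objects above:
import pandas as pd # For a labelled view of the scores
tfidf_df = pd.DataFrame(matrix.toarray(), columns=tfidf.get_feature_names_out()) # One column per word
print(tfidf_df.round(2)) # TF-IDF scores rounded for readability
Output: A two-row table of TF-IDF scores, one column per word.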
Dask is a parallel computing library in Python that scales Python code to handle large datasets. It allows you to work with big data in a similar way to pandas but in a distributed or parallelized fashion. Dask breaks your data into smaller chunks and processes them in parallel, which is ideal for datasets that don’t fit into memory.
import dask # Import the Dask library
import dask.dataframe as dd # Import Dask's DataFrame module
# Load a large CSV using Dask
df = dd.read_csv('large_dataset.csv') # Efficiently reads big files in chunks
print(df.head()) # Displays the first few rows lazily
Output:
First few rows of your large CSV file (similar to pandas output).
Example:
id,name,value
1,Alice,100
2,Bob,200
Dask mimics the pandas API, which means you can use most of your existing pandas code with minimal changes. Instead of `import pandas as pd`, you use `import dask.dataframe as dd`. The rest of the syntax (like filtering, grouping, etc.) remains largely the same, making it easy to scale your existing workflows.
import dask.dataframe as dd # Use Dask instead of pandas
# Read a CSV file with Dask
df = dd.read_csv('sales_data.csv') # Loads data in chunks
# Group by product and calculate the total
result = df.groupby('product')['amount'].sum() # Similar to pandas
print(result.compute()) # Triggers computation
Output:
product
Chair 1200
Desk 800
Lamp 400
One of the powerful features of Dask is lazy evaluation. Unlike pandas, which executes operations immediately, Dask builds a task graph and waits to compute until explicitly told to with `.compute()`. This makes operations faster and memory-efficient because nothing is done until needed.
import dask.dataframe as dd # Import Dask
df = dd.read_csv('bigfile.csv') # Load big file lazily
# Define a lazy operation
average = df['price'].mean() # No computation yet
print(average) # Just shows a delayed object
print(average.compute()) # Now it actually calculates
Output:
A lazy Dask scalar placeholder (printed as something like dd.Scalar<series-..., dtype=float64>)
Then the computed mean, e.g. 45.67
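When several lazy results come from the same data, computing them together lets Dask share one pass over the file. A minimal sketch, assuming the same bigfile.csv with a 'price' column:
import dask # Top-level compute helper
mean_price, max_price = dask.compute(df['price'].mean(), df['price'].max()) # Both results in one shared computation
print(mean_price, max_price) # The computed mean and maximum
Output: The mean and maximum price (values depend on your data).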
Dask is especially useful when handling gigabytes or terabytes of data. Performance can be benchmarked by comparing processing time and memory usage against pandas. For massive files, Dask often completes tasks significantly faster because it uses parallelism and avoids memory overload by processing data in chunks.
import pandas as pd # Regular pandas
import dask.dataframe as dd # Dask
import time # To measure time
# Timing pandas
start = time.time()
df_pandas = pd.read_csv('large_file.csv')
print("Pandas Time:", time.time() - start)
# Timing Dask
start = time.time()
df_dask = dd.read_csv('large_file.csv')
df_dask.head() # Trigger a partial load (Dask reads only enough of the file for head(), so this is not a full like-for-like comparison)
print("Dask Time:", time.time() - start)
Output:
Pandas Time: 12.4 seconds
Dask Time: 2.7 seconds
(Numbers may vary based on file size and system.)
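Because dd.read_csv is lazy, a fairer test forces both libraries to do the same work. A sketch of a like-for-like comparison, assuming the file has a numeric column named 'value' (a hypothetical column name):
start = time.time()
pandas_total = pd.read_csv('large_file.csv')['value'].sum() # Full read and sum with pandas
print("Pandas full sum:", time.time() - start)
start = time.time()
dask_total = dd.read_csv('large_file.csv')['value'].sum().compute() # Full read and sum with Dask
print("Dask full sum:", time.time() - start)
Output: Two timings for the same aggregation; the gap depends on file size and CPU cores.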
SQLite is a lightweight, file-based database that can be easily integrated with pandas for storing and retrieving data. To connect pandas to SQLite, we use Python's built-in sqlite3 module to establish a connection, which can then be passed to pandas functions like read_sql_query() and to_sql().
import sqlite3
conn = sqlite3.connect("example.db")
print("Connection successful!")
Output:
Connection successful!
After establishing a connection to the database, you can use SQL queries to retrieve data using pd.read_sql_query(). This function allows you to run SQL commands like SELECT and fetch the results directly into a pandas DataFrame for further analysis.
import pandas as pd
query = "SELECT name FROM sqlite_master WHERE type='table';"
tables = pd.read_sql_query(query, conn)
print(tables)
Output:
Displays a list of table names in the SQLite database
You can join SQL query results with pandas DataFrames using pandas merge operations. This is useful when you want to combine database results with other in-memory data. The merge() function lets you do SQL-style joins (inner, outer, left, right) directly in pandas.
# Assuming 'employees' table exists in SQLite
sql_df = pd.read_sql_query("SELECT id, name FROM employees", conn)
local_df = pd.DataFrame({'id': [1, 2], 'department': ['HR', 'IT']})
joined_df = pd.merge(sql_df, local_df, on='id')
print(joined_df)
Output:
Merges employee names from SQL with department data in pandas
Pandas allows you to write a DataFrame to a SQL table using the to_sql() method. This is useful for saving processed or transformed data back into a database. You can choose to replace the table, append to it, or fail if it already exists by setting the if_exists parameter.
df = pd.DataFrame({'id': [3, 4], 'name': ['Alice', 'Bob']})
df.to_sql("employees", conn, if_exists="append", index=False)
print("Data written to SQL database.")
Output:
Data written to SQL database.
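A quick way to confirm the write is to read the table back with the same connection, and to close the connection when you are done; a small sketch:
check = pd.read_sql_query("SELECT * FROM employees", conn) # Read the table back into a DataFrame
print(check) # Should now include the appended rows
conn.close() # Close the connection when finished
Output: The contents of the employees table, including the newly appended rows.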
Generating Excel reports is a common task in data analysis for sharing results with non-technical users. With pandas, you can easily write DataFrames to Excel files using the `to_excel()` method, which supports multiple sheets and formatting. Writing .xlsx files requires an Excel engine such as openpyxl or xlsxwriter to be installed.
# Real-world example: Generating an Excel report
import pandas as pd # Import pandas library
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Sales': [250, 300]}) # Create a sample DataFrame
df.to_excel('sales_report.xlsx', index=False) # Export the DataFrame to Excel
print("Excel report generated successfully.") # Confirm success
Output: Creates a file named sales_report.xlsx with Name and Sales data.
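To put several DataFrames into one workbook, use an ExcelWriter with one sheet per frame. A small sketch; the summary DataFrame here is made up for illustration:
summary = pd.DataFrame({'Region': ['North', 'South'], 'Total': [550, 420]}) # Hypothetical summary data
with pd.ExcelWriter('sales_report_multi.xlsx') as writer: # One workbook, several sheets
    df.to_excel(writer, sheet_name='Detail', index=False) # Detailed rows
    summary.to_excel(writer, sheet_name='Summary', index=False) # Summary sheet
print("Multi-sheet report generated.")
Output: Creates sales_report_multi.xlsx with a Detail sheet and a Summary sheet.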
With the xlsxwriter engine, you can create Excel files with embedded charts. This is helpful for visualizing trends or summaries directly in your reports. Charts can be added with formatting, titles, and styles for professional reporting.
# Real-world example: Adding charts to Excel with xlsxwriter
writer = pd.ExcelWriter('chart_report.xlsx', engine='xlsxwriter') # Use xlsxwriter engine
df.to_excel(writer, sheet_name='Sheet1', index=False) # Write data to Excel
workbook = writer.book # Get workbook object
worksheet = writer.sheets['Sheet1'] # Get worksheet object
chart = workbook.add_chart({'type': 'column'}) # Create column chart
chart.add_series({'values': '=Sheet1!$B$2:$B$3'}) # Add data series
worksheet.insert_chart('D2', chart) # Insert chart in cell D2
writer.close() # Save and close the file (recent pandas versions removed writer.save(); close() writes the workbook)
print("Excel file with chart created.") # Confirm success
Output: Generates an Excel file with a column chart showing Sales data.
Emailing reports automatically is useful for regular updates. Using the built-in `smtplib` and `email` libraries in Python, you can send Excel or PDF files as email attachments. This is often combined with scheduled scripts for automation.
# Real-world example: Sending email with a report (requires real email setup)
import smtplib # Library for sending emails
from email.message import EmailMessage # Class to build email messages
msg = EmailMessage() # Create an email message
msg['Subject'] = 'Monthly Sales Report' # Set subject
msg['From'] = 'sender@example.com' # Sender's email
msg['To'] = 'receiver@example.com' # Receiver's email
msg.set_content('Please find the attached sales report.') # Email body
with open('sales_report.xlsx', 'rb') as f: # Open file in binary mode
    file_data = f.read() # Read file contents
    file_name = f.name # Get file name
msg.add_attachment(file_data, maintype='application', subtype='octet-stream', filename=file_name) # Attach file
with smtplib.SMTP('smtp.example.com', 587) as server: # Connect to SMTP server
    server.starttls() # Secure the connection
    server.login('sender@example.com', 'password') # Login credentials
    server.send_message(msg) # Send email
print("Email sent with report.") # Confirm success
Output: Sends the sales_report.xlsx to the specified recipient.
Automating the timing of report generation can be achieved using cron (Linux/macOS) or Task Scheduler (Windows). These tools allow you to schedule your Python script to run at specific intervals, like daily or weekly.
# Example (Linux/macOS): Add this to crontab
# Open crontab with: crontab -e
# Run every day at 8 AM
0 8 * * * /usr/bin/python3 /path/to/report_script.py
# Example (Windows Task Scheduler):
# - Open Task Scheduler
# - Create Task → Triggers: Daily 8AM
# - Actions: Start a Program → python.exe and script path
Output: Your script runs automatically on the schedule without manual intervention.
Consuming JSON APIs in pandas means making a request to an API that returns JSON data and converting that data into a DataFrame. This is common in data projects that involve pulling external information from web services or public data portals.
import requests
import pandas as pd
url = 'https://jsonplaceholder.typicode.com/posts'
response = requests.get(url)
data = response.json()
df = pd.DataFrame(data)
print(df.head())
# Fetches JSON data from a free placeholder API and loads it into a DataFrame
Output:
id userId title
1 1 sunt aut facere repellat
2 1 qui est esse
3 1 ea molestias quasi
...
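In practice it is worth checking the HTTP response before parsing it; a defensive sketch of the same request:
response = requests.get(url, timeout=10) # Time out rather than hang indefinitely
response.raise_for_status() # Raise an error for 4xx/5xx responses
df = pd.DataFrame(response.json()) # Parse only if the request succeeded
Output: The same DataFrame as above, or an HTTPError if the request failed.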
Sometimes JSON from APIs contains nested structures like dictionaries or lists inside objects. Pandas has `json_normalize()` to flatten these structures into a tabular format so they can be more easily analyzed.
import requests
import pandas as pd
from pandas import json_normalize
url = 'https://jsonplaceholder.typicode.com/users'
response = requests.get(url)
data = response.json()
df = json_normalize(data, sep='_')
print(df[['id', 'name', 'address_city']])
# This flattens the nested address data into individual columns
Output:
id name address_city
1 Leanne Graham Gwenborough
2 Ervin Howell Wisokyburgh
...
Many APIs limit the number of results returned per request. Pagination helps you retrieve multiple pages. Tokens are often used for authenticated access. You loop through pages until there's no more data.
import requests
import pandas as pd
all_data = []
for page in range(1, 3):
    url = f'https://jsonplaceholder.typicode.com/posts?_page={page}&_limit=5'
    response = requests.get(url)
    data = response.json()
    all_data.extend(data)
df = pd.DataFrame(all_data)
print(df)
# This simulates pagination by looping through API pages
Output:
id title body
1 sunt aut facere ...
2 qui est esse ...
...
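For APIs that require authentication, the token usually travels in a request header. A minimal sketch with a placeholder token and a hypothetical endpoint:
headers = {'Authorization': 'Bearer YOUR_API_TOKEN'} # Placeholder token, not a real credential
response = requests.get('https://api.example.com/data?_page=1', headers=headers) # Hypothetical endpoint
print(response.status_code) # 200 means the token was accepted
Output: The HTTP status code of the authenticated request.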
Some APIs restrict the number of requests in a given time (rate-limiting). To avoid getting blocked or receiving errors, you can use delays (like `time.sleep()`) between requests or handle the 429 error specifically.
import requests
import pandas as pd
import time
all_data = []
for i in range(3):
    url = 'https://jsonplaceholder.typicode.com/posts/{}'.format(i + 1)
    response = requests.get(url)
    if response.status_code == 429:
        print("Rate limit hit, sleeping...")
        time.sleep(5)
        continue
    data = response.json()
    all_data.append(data)
    time.sleep(1)
df = pd.DataFrame(all_data)
print(df)
# Adds delay to handle rate limits and avoid API blocking
Output:
id title userId
1 sunt aut facere 1
2 qui est esse 1
3 ea molestias quasi 1
Unit testing ensures that individual components of code function as expected. pytest is a popular Python testing framework that allows writing simple and readable test cases using plain Python functions. Unit tests can catch bugs early by validating functions or methods in isolation.
# my_math.py
def add(x, y):
    return x + y
# test_my_math.py
from my_math import add
def test_add():
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
    assert add(0, 0) == 0
Output:
Running pytest test_my_math.py will show that all test cases pass if the function works correctly.
Validating DataFrames means checking that data meets specific conditions such as value types, ranges, or missing values. This step ensures that your data is clean and consistent before analysis or modeling. It can be done using simple pandas checks or libraries like pandera.
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', None],
    'age': [25, 30, None]
})
# Check for missing values
missing = df.isnull().sum()
print("Missing values in each column:")
print(missing)
# Check data types
print("Data types:")
print(df.dtypes)
Output:
Displays missing values per column and data types of each column in the DataFrame.
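The same idea extends to the value ranges mentioned above. A small sketch that flags ages outside a plausible range (missing values simply do not match the filter):
invalid_ages = df[(df['age'] < 0) | (df['age'] > 120)] # Rows whose age falls outside 0-120
print("Rows with out-of-range ages:")
print(invalid_ages) # Empty here, since 25 and 30 are both plausible
Output: An empty DataFrame for this sample, or the offending rows if any age is out of range.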
Schema enforcement involves defining and enforcing rules on the structure and types of your data. This is crucial when loading, transforming, or validating structured datasets to ensure consistency. Tools like pandera allow you to declare expected schemas and catch violations automatically.
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema
# Define schema
schema = DataFrameSchema({
    "name": Column(str, nullable=False),
    "age": Column(int, nullable=False)
})
# Sample valid DataFrame
df_valid = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25, 30]
})
# Validate
validated = schema.validate(df_valid)
print("Data validated successfully")
Output:
If the data conforms to the schema, it prints: Data validated successfully. Otherwise, it raises a schema error.
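To see what a violation looks like, validate a frame that breaks the schema and catch the error; a small sketch using the same schema:
df_invalid = pd.DataFrame({"name": ["Carol"], "age": ["thirty"]}) # 'age' has the wrong type
try:
    schema.validate(df_invalid) # Raises because age is not an integer
except pa.errors.SchemaError as err:
    print("Schema violation:", err)
Output: A message describing which column failed validation and why.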
Custom test cases allow you to create specific tests for edge cases or domain-specific logic in your code. These can be automated using pytest or manually coded to check particular behaviors that aren't covered by general unit tests.
# discount.py
def apply_discount(price, percentage):
    if percentage < 0 or percentage > 100:
        raise ValueError("Percentage must be between 0 and 100")
    return price * (1 - percentage / 100)
# test_discount.py
from discount import apply_discount
import pytest
def test_valid_discount():
    assert apply_discount(100, 20) == 80
def test_invalid_discount():
    with pytest.raises(ValueError):
        apply_discount(100, -10)
Output:
Valid discount returns the correct price, while invalid percentages raise a ValueError as expected.
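pytest's parametrize decorator makes it easy to cover more edge cases without duplicating test code; a small sketch of the same checks:
# test_discount_params.py
import pytest
from discount import apply_discount
@pytest.mark.parametrize("price, pct, expected", [
    (100, 20, 80), # Ordinary discount
    (100, 0, 100), # No discount
    (100, 100, 0), # Full discount
])
def test_apply_discount_cases(price, pct, expected):
    assert apply_discount(price, pct) == expected
Output: Running pytest reports three passing test cases, one per parameter set.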