Python syntax is easy to understand, designed to be readable and intuitive. It uses indentation rather than curly braces to define code blocks. This makes Python one of the most beginner-friendly programming languages. Python is also case-sensitive, which means 'Variable' and 'variable' are considered different. This subchapter will focus on understanding Python’s simple and clean syntax.
name = "John" # Defining a string variable
print("Hello", name) # Printing a message
Output: Hello John
In Python, variables store data values. Lists are ordered collections of items, which can be of any data type, and allow duplicates. Dictionaries store key-value pairs, making them ideal for fast lookups of related data. This section will discuss how to define and work with these essential data structures.
age = 25 # Variable holding an integer
friends = ["Alice", "Bob"] # List of friends
profile = {"name": "John", "age": 25} # Dictionary with key-value pairs
print(profile["name"]) # Accessing value associated with the key 'name' in dictionary
Output: John
Functions allow you to group code into reusable blocks, making programs more organized and modular. Conditionals, using 'if', 'elif', and 'else', control the flow of execution based on logical decisions. These structures let the program take different actions depending on the input or conditions.
def greet(name): # Function definition
    if name == "Alice": # Conditional check
        return "Hi Alice!" # Return greeting for Alice
    else:
        return "Hello!" # Default greeting
print(greet("Alice")) # Calling the function with 'Alice'
Output: Hi Alice!
Pandas and numpy are two fundamental libraries for data manipulation and analysis in Python. Pandas provides high-level data structures like DataFrames, which handle tabular data, while numpy offers support for numerical computations with arrays. These libraries are essential for data science and machine learning tasks.
import pandas as pd # Import pandas library
import numpy as np # Import numpy library
data = np.array([[1, 2], [3, 4]]) # Create a 2D numpy array
df = pd.DataFrame(data, columns=["A", "B"]) # Convert numpy array to DataFrame
print(df) # Print the DataFrame
Output:
A B
0 1 2
1 3 4
A pandas Series is a one-dimensional labeled array, capable of holding any data type (integers, strings, floats, etc.). It is similar to a list or array in Python, but it comes with additional features like labels for indexing. A Series can be created from a list, dictionary, or even a scalar value.
import pandas as pd # Importing pandas
data = [10, 20, 30, 40] # List of data
series = pd.Series(data) # Creating a Series from the list
print(series) # Printing the Series
Output:
0 10
1 20
2 30
3 40
Indexing in a Series allows you to access specific elements based on their position or label. Slicing is similar to indexing, but it allows you to retrieve a subset of elements. You can use integer-based or label-based indexing with pandas Series.
data = [10, 20, 30, 40] # List of data
series = pd.Series(data) # Creating the Series
print(series[1]) # Indexing: Access element at index 1
print(series[1:3]) # Slicing: Access elements at index 1 and 2 (index 3 is excluded)
Output:
20
1 20
2 30
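As a quick illustration of the label-based indexing mentioned above (the labels 'a', 'b', 'c' are made up for this sketch), a Series can also carry a custom index:
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"]) # Series with string labels
print(labeled["b"]) # Label-based access
print(labeled.loc["a":"b"]) # Label-based slicing includes both endpoints
Output:
20
a    10
b    20
dtype: int64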
Pandas Series come with a variety of methods to perform operations on the data. These methods allow you to perform statistical, mathematical, and string operations, among others. Some common methods include `sum()`, `mean()`, `max()`, and `min()`.
data = [10, 20, 30, 40] # List of data
series = pd.Series(data) # Creating the Series
print(series.sum()) # Calculate sum of elements
print(series.mean()) # Calculate mean of elements
print(series.max()) # Get maximum value in the Series
Output:
100
25.0
40
NaN stands for 'Not a Number' and is used to represent missing or undefined values in a pandas Series. Pandas provides methods to detect, remove, or replace NaNs. Handling NaNs is crucial when performing data analysis to ensure the accuracy of the results.
data = [10, None, 30, None] # List of data with NaNs
series = pd.Series(data) # Creating the Series
print(series.isna()) # Check for NaNs in the Series
series_filled = series.fillna(0) # Replace NaNs with 0
print(series_filled) # Printing the Series after replacing NaNs
Output:
0 False
1 True
2 False
3 True
dtype: bool
0 10.0
1 0.0
2 30.0
3 0.0
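If dropping the missing entries is preferred over filling them, dropna() removes them from the same Series:
series_dropped = series.dropna() # Drop entries that are NaN
print(series_dropped) # Only the non-missing values remain
Output:
0    10.0
2    30.0
dtype: float64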
DataFrames are the core data structure in the pandas library. They are similar to Excel tables or SQL tables, consisting of rows and columns. A DataFrame can be created from various data structures such as dictionaries, lists, and even external files like CSVs or Excel files. It is the starting point for any data manipulation task in Python.
import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)
Output:
    Name  Age
0  Alice   25
1    Bob   30
DataFrames allow you to access specific rows and columns using both index positions and labels. Using `.loc[]` enables label-based access, while `.iloc[]` is used for index-based access. This flexibility makes it easy to select and work with parts of your data.
print(df['Name'])
print(df.loc[0])
Output:
0 Alice
1 Bob
Name Alice
Age 25
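For completeness, .iloc[] does the same selection by integer position; a quick sketch on the DataFrame above:
print(df.iloc[1]) # Position-based access: second row
Output:
Name    Bob
Age      30
Name: 1, dtype: object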
Once you have a DataFrame, you can perform various operations on it. You can carry out arithmetic operations, statistical computations, and apply functions across columns or rows. Pandas makes these operations very straightforward.
df['DoubleAge'] = df['Age'] * 2
print(df)
Output:
    Name  Age  DoubleAge
0  Alice   25         50
1    Bob   30         60
You can easily add new columns to a DataFrame by assigning values to a new column label. Deleting columns can be done with the `drop()` method, specifying the column name and the axis (1 for columns). This flexibility allows you to manage your data efficiently.
df['City'] = ['NYC', 'LA']
df = df.drop('DoubleAge', axis=1)
print(df)
Output:
    Name  Age City
0  Alice   25  NYC
1    Bob   30   LA
In data analysis, importing data is one of the first steps. Libraries like pandas allow you to read data from multiple file formats such as CSV, Excel, and JSON. These formats are commonly used in data storage and exchange, making it essential to learn how to work with them effectively.
# Real-world example: Reading CSV and JSON files
import pandas as pd # Import the pandas library
csv_data = pd.read_csv('data.csv') # Load a CSV file
json_data = pd.read_json('data.json') # Load a JSON file
print(csv_data.head()) # Display first 5 rows of CSV data
Output: First 5 rows of the CSV data
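Excel files are read the same way; a quick sketch assuming a file named 'data.xlsx' exists and an Excel engine such as openpyxl is installed:
excel_data = pd.read_excel('data.xlsx') # Load an Excel file (needs openpyxl or a similar engine)
print(excel_data.head()) # Display first 5 rows of Excel data
Output: First 5 rows of the Excel data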
After processing or analyzing data, you may need to export it for further use or to share with others. pandas provides methods for exporting data to various formats such as CSV, Excel, and JSON. This makes it easy to save results or share data in a standardized format.
# Real-world example: Exporting data to CSV
data = pd.DataFrame({'name': ['Tom', 'Jerry'], 'age': [5, 3]}) # Create a DataFrame
data.to_csv('output.csv', index=False) # Export the DataFrame to a CSV file
Output: A new file 'output.csv' is created with the data
Sometimes, when loading data into a DataFrame, you may want to use a specific column as the index. This can help in organizing the data better, especially when dealing with large datasets. Using an appropriate index can improve the speed of data lookups and manipulations.
# Real-world example: Loading data with an index column
data = pd.read_csv('data.csv', index_col='id') # Use the 'id' column as the index
print(data.head()) # Display first 5 rows with 'id' as index
Output: DataFrame with 'id' as the index column
Handling large datasets can be challenging due to memory limitations. One way to handle this is by reading data in chunks, which allows you to process data piece-by-piece without loading the entire dataset into memory at once. This approach helps in efficiently managing and analyzing large datasets.
# Real-world example: Reading data in chunks
chunk_iter = pd.read_csv('large_data.csv', chunksize=1000) # Read data in chunks of 1000 rows
for chunk in chunk_iter:
    print(chunk.head()) # Process and display first 5 rows of each chunk
Output: First 5 rows from each 1000-row chunk
Null values (or missing data) can impact model accuracy. Common techniques to handle them include removing rows, filling with default values, or using statistical measures (mean/median). Choosing the right strategy depends on the nature and importance of the data.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, None, 4]})
df['A'] = df['A'].fillna(df['A'].mean())
print(df)
# This replaces null values in column A with the mean of the column
Output:
          A
0  1.000000
1  2.000000
2  2.333333
3  4.000000
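If dropping the incomplete rows is preferable to filling them, dropna() removes them; a quick sketch on a fresh copy of the same data (rebuilt here because the column above has already been filled):
df2 = pd.DataFrame({'A': [1, 2, None, 4]}) # Same data before filling
print(df2.dropna()) # Drop any row that contains a NaN
Output:
     A
0  1.0
1  2.0
3  4.0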
Data may not always be in the correct format. For example, numerical values might be stored as strings. Type conversion helps ensure the data is in a usable and consistent format for analysis or modeling.
import pandas as pd
df = pd.DataFrame({'A': ['1', '2', '3']})
df['A'] = df['A'].astype(int)
print(df.dtypes)
# This converts the column 'A' from string to integer type
Output: A int64
Duplicate data can skew analysis, while outliers can mislead models. Removing duplicates and identifying outliers using methods like IQR or Z-score is crucial for high-quality data preprocessing.
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 100]})
df = df.drop_duplicates()
print(df)
# This removes duplicate rows based on identical values
Output:
A
0 1
2 2
3 100
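As a sketch of the IQR rule mentioned above (the values below are made up so that the outlier is obvious), values outside 1.5 * IQR of the quartiles are filtered out:
s = pd.Series([1, 2, 3, 4, 100]) # Illustrative values with one clear outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75) # First and third quartiles
iqr = q3 - q1 # Interquartile range
filtered = s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)] # Keep values inside the IQR fences
print(filtered) # The value 100 falls outside the fences and is dropped
Output:
0    1
1    2
2    3
3    4
dtype: int64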
The apply() function in pandas allows applying a function to each row or column, enabling custom cleaning operations. It's especially useful for string cleaning or transforming values based on logic.
import pandas as pd
df = pd.DataFrame({'A': [' apple ', 'Banana', ' grape ']})
df['A'] = df['A'].apply(lambda x: x.strip().lower())
print(df)
# This strips extra spaces and converts strings to lowercase
Output:
A
0 apple
1 banana
2 grape
Boolean indexing allows us to filter rows of a DataFrame by using a condition that returns True or False for each row. It's a powerful way to filter data without needing to write loops. The condition can be based on any column value or a combination of columns.
import pandas as pd
df = pd.DataFrame({'Age': [20, 25, 30], 'Name': ['Alice', 'Bob', 'Charlie']})
filtered_df = df[df['Age'] > 21] # Only rows where Age > 21
print(filtered_df)
Output:
   Age     Name
1   25      Bob
2   30  Charlie
Sorting helps in organizing data by arranging the values in ascending or descending order. Sorting can be done by column values or by the index. In pandas, the sort_values() function is used to sort the data based on column values, and sort_index() is used to sort by index.
import pandas as pd
df = pd.DataFrame({'Score': [90, 80, 85], 'Name': ['Alice', 'Bob', 'Charlie']})
sorted_df = df.sort_values(by='Score') # Sort by Score
print(sorted_df)
Output:
   Score     Name
1     80      Bob
2     85  Charlie
0     90    Alice
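For descending order, pass ascending=False to sort_values(), and sort_index() re-orders the rows by their index; a quick sketch continuing the example above:
desc_df = df.sort_values(by='Score', ascending=False) # Sort by Score, highest first
print(desc_df)
print(sorted_df.sort_index()) # Restore the original row order by sorting on the index
Output:
   Score     Name
0     90    Alice
2     85  Charlie
1     80      Bob
   Score     Name
0     90    Alice
1     80      Bob
2     85  Charlie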
Conditional filtering lets you select rows that meet one or more conditions. It's a crucial technique for narrowing down a large dataset to relevant information, such as filtering data based on ranges or specific criteria.
import pandas as pd
df = pd.DataFrame({'Age': [20, 25, 30], 'Salary': [3000, 4000, 5000]})
condition = df['Age'] > 21
filtered = df[condition] # Apply condition
print(filtered)
Output:
Age Salary
1 25 4000
2 30 5000
Multi-condition filtering is the process of applying more than one condition to filter data. You can combine conditions using logical operators like & (and), | (or), and ~ (not), allowing for complex queries on the dataset.
import pandas as pd
df = pd.DataFrame({'Age': [20, 25, 30], 'Salary': [3000, 4000, 5000]})
filtered = df[(df['Age'] > 21) & (df['Salary'] > 3500)] # Both conditions must be true
print(filtered)
Output:
Age Salary
1 25 4000
2 30 5000
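The same pattern works with | and ~; a short sketch on the same DataFrame:
either = df[(df['Age'] < 22) | (df['Salary'] > 4500)] # Keep rows matching at least one condition
print(either)
negated = df[~(df['Age'] > 21)] # Keep rows where the condition is NOT true
print(negated)
Output:
   Age  Salary
0   20    3000
2   30    5000
   Age  Salary
0   20    3000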
Vectorized string methods in libraries like pandas allow batch operations on text data. This means that instead of applying string methods individually to each element in a list or array, these methods can be applied to the entire collection of data at once. This approach is highly efficient and reduces the need for writing loops, speeding up processing time.
import pandas as pd
# Create a pandas Series with strings
data = pd.Series(['Hello', 'World', 'AI'])
# Apply vectorized string method to convert all strings to uppercase
result = data.str.upper()
# Print the result
print(result)
Output:
0 HELLO
1 WORLD
2 AI
Cleaning text data is a crucial step before performing any text analysis or applying machine learning models. This process includes removing unwanted characters (such as punctuation or extra spaces), converting text to lowercase, and standardizing the format. Properly cleaned text ensures better model performance and more reliable results in natural language processing (NLP) tasks.
text = " Hello! Welcome...to AI. "
# Strip leading/trailing spaces, convert to lowercase, and replace unwanted characters
cleaned = text.strip().lower().replace("...", " ")
# Print the cleaned text
print(cleaned)
Output:
hello! welcome to ai.
Regular expressions (RegEx) allow us to search for specific patterns within strings. This is particularly useful when we need to extract information such as email addresses, phone numbers, dates, or any structured data embedded within unstructured text. RegEx provides a powerful and flexible way to perform pattern matching in strings.
import re
text = "Contact: hello@example.com"
# Regular expression to match email address
match = re.findall(r'\S+@\S+', text)
# Print the extracted email address
print(match)
Output:
[\"hello@example.com\"]
String replacements and maps allow us to transform specific parts of a string or a set of strings. Using replacement rules (e.g., a dictionary), we can replace substrings in text or change one value to another across multiple strings. This technique is especially useful in data cleaning, normalizing terms, or converting categorical values into a standardized format.
words = ["hi", "bye", "hello"]
# Create a mapping dictionary
mapping = {"hi": "greetings", "bye": "farewell"}
# Replace words using the dictionary
replaced = [mapping.get(word, word) for word in words]
# Print the replaced list
print(replaced)
Output:
["greetings", "farewell", "hello"]
The `groupby()` function in Pandas is used to split the data into groups based on some criteria. This operation is useful for summarizing or transforming data in a grouped fashion. The groupby object can then be used to apply aggregation functions, transformations, or filtering operations. The basic idea is that it allows you to perform operations on each group independently.
import pandas as pd # Import pandas library
data = {'Category': ['A', 'A', 'B', 'B', 'C', 'C'], # Sample data
'Value': [10, 20, 30, 40, 50, 60]} # Corresponding values
df = pd.DataFrame(data) # Create DataFrame
grouped = df.groupby('Category') # Group data by 'Category' column
for name, group in grouped: # Loop through each group
print(name) # Print the group name
print(group) # Print the data for each group
Output:
A
Category Value
0 A 10
1 A 20
B
Category Value
2 B 30
3 B 40
C
Category Value
4 C 50
5 C 60
Pandas provides several aggregation functions such as `sum()`, `mean()`, `max()`, and `min()`. These functions can be applied to each group after using `groupby()` to summarize the data in meaningful ways, such as calculating the total, average, or minimum of values for each group.
df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B', 'C', 'C'], # Sample data
'Value': [10, 20, 30, 40, 50, 60]}) # Values
grouped = df.groupby('Category')['Value'] # Group by 'Category' and focus on 'Value' column
result = grouped.sum() # Calculate sum for each group
print(result) # Print the result
Output:
Category
A 30
B 70
C 110
Name: Value, dtype: int64
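Several aggregations can also be computed in one call by passing a list of function names to agg(); a quick sketch on the same grouping:
multi_stats = grouped.agg(['mean', 'max', 'min']) # Several statistics per group at once
print(multi_stats)
Output:
          mean  max  min
Category
A         15.0   20   10
B         35.0   40   30
C         55.0   60   50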
Custom aggregations allow you to apply your own function to each group. This is useful when built-in aggregation functions do not meet your needs. You can define your own function and pass it to the `agg()` method to perform more complex calculations.
def custom_func(x): # Custom aggregation function
    return x.max() - x.min() # Difference between max and min values
df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B', 'C', 'C'], # Data
'Value': [10, 20, 30, 40, 50, 60]})
grouped = df.groupby('Category')['Value'] # Group by 'Category'
result = grouped.agg(custom_func) # Apply custom aggregation function
print(result) # Print result
Output:
Category
A 10
B 10
C 10
Name: Value, dtype: int64
Hierarchical grouping, or multi-level grouping, allows you to group data by more than one column. This is useful when you want to analyze the data in a more granular way, for example, grouping by both 'Category' and 'Subcategory'. Pandas allows you to perform aggregation operations on multi-level indexes.
df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B', 'C', 'C'], # Sample data
'Subcategory': ['X', 'Y', 'X', 'Y', 'X', 'Y'], # Subcategories
'Value': [10, 20, 30, 40, 50, 60]}) # Values
grouped = df.groupby(['Category', 'Subcategory'])['Value'] # Group by both 'Category' and 'Subcategory'
result = grouped.sum() # Sum for each subgroup
print(result) # Print result
Output:
Category Subcategory
A X 10
Y 20
B X 30
Y 40
C X 50
Y 60
Name: Value, dtype: int64
Concatenation is the process of joining two or more datasets along a particular axis (rows or columns). In pandas, the `concat()` function allows you to combine multiple DataFrames vertically or horizontally. This operation does not require a common column between the datasets.
import pandas as pd # Importing pandas
data1 = {'A': [1, 2], 'B': [3, 4]} # First dataset
data2 = {'A': [5, 6], 'B': [7, 8]} # Second dataset
df1 = pd.DataFrame(data1) # Create first DataFrame
df2 = pd.DataFrame(data2) # Create second DataFrame
df_combined = pd.concat([df1, df2], ignore_index=True) # Concatenate DataFrames
print(df_combined) # Print the concatenated DataFrame
Output:
A B
0 1 3
1 2 4
2 5 7
3 6 8
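Passing axis=1 concatenates the DataFrames side by side instead of stacking rows; a quick sketch with the same two DataFrames:
df_side_by_side = pd.concat([df1, df2], axis=1) # Concatenate horizontally (columns)
print(df_side_by_side)
Output:
   A  B  A  B
0  1  3  5  7
1  2  4  6  8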
Merging is the process of combining two DataFrames based on a key or common column. It is similar to SQL joins. You can perform inner, outer, left, or right joins to specify how you want to combine the data based on the keys.
data1 = {'ID': [1, 2], 'Name': ['Alice', 'Bob']} # First dataset
data2 = {'ID': [1, 2], 'Age': [25, 30]} # Second dataset
df1 = pd.DataFrame(data1) # Create first DataFrame
df2 = pd.DataFrame(data2) # Create second DataFrame
df_merged = pd.merge(df1, df2, on='ID') # Merge on the 'ID' column
print(df_merged) # Print the merged DataFrame
Output:
ID Name Age
0 1 Alice 25
1 2 Bob 30
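The how parameter selects the join type; a short sketch with slightly different keys (the extra IDs are made up for this illustration) shows an outer join keeping the unmatched rows:
left = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Carol']}) # Has an extra ID 3
right = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 40]}) # Has an extra ID 4
outer = pd.merge(left, right, on='ID', how='outer') # Keep keys from both sides
print(outer)
Output:
   ID   Name   Age
0   1  Alice  25.0
1   2    Bob  30.0
2   3  Carol   NaN
3   4    NaN  40.0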
The `join()` method is used for merging DataFrames based on their index, while `merge()` is used to merge DataFrames based on one or more key columns. `merge()` is more flexible, allowing you to specify different types of joins (inner, outer, left, right), while `join()` is typically used when the join keys are the indices.
data1 = {'A': [1, 2]} # First dataset
data2 = {'B': [3, 4]} # Second dataset
df1 = pd.DataFrame(data1, index=[1, 2]) # First DataFrame with index
df2 = pd.DataFrame(data2, index=[1, 2]) # Second DataFrame with index
df_joined = df1.join(df2) # Join DataFrames based on index
print(df_joined) # Print the joined DataFrame
Output:
A B
1 1 3
2 2 4
When combining datasets, it is important to handle overlaps and duplicates effectively. You can use the `drop_duplicates()` method to remove duplicate rows. Additionally, when merging datasets, you may need to resolve conflicts or overlapping columns using parameters like `suffixes` to avoid column name clashes.
data1 = {'ID': [1, 2, 2], 'Name': ['Alice', 'Bob', 'Bob']} # First dataset with duplicates
df1 = pd.DataFrame(data1) # Create first DataFrame
df_no_duplicates = df1.drop_duplicates() # Remove duplicates
print(df_no_duplicates) # Print the DataFrame without duplicates
Output:
ID Name
0 1 Alice
1 2 Bob
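When the two DataFrames share a column name other than the merge key, the suffixes parameter keeps both versions apart; a small sketch (the quarterly totals below are illustrative):
sales_q1 = pd.DataFrame({'ID': [1, 2], 'Total': [100, 200]}) # First dataset
sales_q2 = pd.DataFrame({'ID': [1, 2], 'Total': [150, 250]}) # Second dataset with the same column name
merged = pd.merge(sales_q1, sales_q2, on='ID', suffixes=('_q1', '_q2')) # Disambiguate overlapping columns
print(merged)
Output:
   ID  Total_q1  Total_q2
0   1       100       150
1   2       200       250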
The pivot() function is used to reshape data by turning unique values from one column into separate columns, while the pivot_table() function can be used for summarizing data by applying aggregation functions (like sum, mean) to the data.
import pandas as pd
data = {'Date': ['2021-01-01', '2021-01-01', '2021-01-02'], 'City': ['A', 'B', 'A'], 'Sales': [200, 150, 300]}
df = pd.DataFrame(data)
pivot_df = df.pivot(index='Date', columns='City', values='Sales')
print(pivot_df)
Output:
City A B
Date
2021-01-01 200 150
2021-01-02 300 NaN
pivot_table_df = df.pivot_table(index='Date', columns='City', values='Sales', aggfunc='sum')
print(pivot_table_df)
Output:
City A B
Date
2021-01-01 200 150
2021-01-02 300 NaN
Melting is the opposite of pivoting. It unpivots a DataFrame by converting columns into rows. The melt() function is useful when you need to convert wide-format data into long-format data, making it easier to analyze.
df_melted = df.melt(id_vars=['Date'], value_vars=['City', 'Sales'], var_name='Variable', value_name='Value')
print(df_melted)
Output:
         Date Variable Value
0  2021-01-01     City     A
1  2021-01-01     City     B
2  2021-01-02     City     A
3  2021-01-01    Sales   200
4  2021-01-01    Sales   150
5  2021-01-02    Sales   300
The stack() function stacks the columns of a DataFrame into rows (vertical axis). The unstack() function does the opposite: it converts the rows of a DataFrame into columns (horizontal axis). These functions are useful when working with multi-level indices (MultiIndex).
stacked_df = pivot_df.stack()
print(stacked_df)
Output:
2021-01-01 A 200
2021-01-01 B 150
2021-01-02 A 300
unstacked_df = stacked_df.unstack()
print(unstacked_df)
Output:
City A B
Date
2021-01-01 200 150
2021-01-02 300 NaN
MultiIndex allows you to have multiple levels of indexing in a DataFrame. You can reshape a DataFrame using MultiIndex for more complex data analysis tasks. You can use stack() and unstack() to work with MultiIndex as well.
index = pd.MultiIndex.from_tuples([('2021-01-01', 'A'), ('2021-01-01', 'B'), ('2021-01-02', 'A')], names=['Date', 'City'])
multiindex_df = pd.DataFrame({'Sales': [200, 150, 300]}, index=index)
print(multiindex_df)
Output:
Sales
Date City
2021-01-01 A 200
2021-01-01 B 150
2021-01-02 A 300
reshaped_df = multiindex_df.unstack()
print(reshaped_df)
Output:
City A B
Date
2021-01-01 200 150
2021-01-02 300 NaN
Working with dates and times is crucial for time-series analysis. Parsing dates refers to converting date information from a string format into a usable DateTime format. This makes it easier to perform operations like filtering and aggregating data based on specific dates.
# Real-world example: Parsing dates
import pandas as pd # Import the pandas library
data = pd.to_datetime('2025-04-20') # Convert a string to a DateTime object
print(data) # Output the parsed date
Output: '2025-04-20 00:00:00'
Date indexing is used to set a column with date-related values as the index of a DataFrame. This allows for efficient querying and manipulation of data based on dates. It is particularly useful in time-series analysis, where each row corresponds to a specific point in time.
# Real-world example: Date indexing
data = pd.DataFrame({'date': ['2025-01-01', '2025-02-01', '2025-03-01'], 'value': [10, 20, 30]})
data['date'] = pd.to_datetime(data['date']) # Convert date column to DateTime
data.set_index('date', inplace=True) # Set the 'date' column as the index
print(data) # Display the DataFrame with date as index
Output: DataFrame with 'date' as the index column
Resampling refers to the process of converting time-series data from one frequency to another. This can be useful for aggregating or downsampling data at specific intervals, such as converting daily data into weekly or monthly data. Frequency conversion is typically used to change the granularity of time-series data.
# Real-world example: Resampling and frequency conversion
data = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2025-01-01', periods=5, freq='D'))
resampled_data = data.resample('M').sum() # Resample data to monthly frequency and sum values
print(resampled_data) # Display resampled data
Output: The sum of the values for each month
Rolling windows and expanding windows are methods for calculating statistics over a specific time window. A rolling window computes a statistic (e.g., mean, sum) over a moving window of fixed size, while expanding windows calculate the statistic for all data up to the current point.
# Real-world example: Rolling and expanding windows
data = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2025-01-01', periods=5, freq='D'))
rolling_mean = data.rolling(window=2).mean() # Compute rolling mean with a window size of 2
expanding_mean = data.expanding().mean() # Compute expanding mean
print(rolling_mean) # Display rolling mean
print(expanding_mean) # Display expanding mean
Output: The rolling mean and expanding mean values for each point in the series
Time series slicing refers to extracting specific time periods from a time series dataset. This is useful for focusing on specific time frames such as years, months, or days for analysis or forecasting.
import pandas as pd
date_range = pd.date_range('2021-01-01', periods=5, freq='D')
df = pd.DataFrame({'Date': date_range, 'Value': [10, 20, 30, 40, 50]})
sliced_df = df[df['Date'] >= '2021-01-03']
print(sliced_df)
# This slices the data from January 3rd, 2021 onward
Output:
Date Value
2 2021-01-03 30
3 2021-01-04 40
4 2021-01-05 50
Moving averages smooth out short-term fluctuations and highlight longer-term trends in time series data. This technique is frequently used in forecasting and trend analysis.
import pandas as pd
date_range = pd.date_range('2021-01-01', periods=5, freq='D')
df = pd.DataFrame({'Date': date_range, 'Value': [10, 20, 30, 40, 50]})
df['Moving_Avg'] = df['Value'].rolling(window=3).mean()
print(df)
# This calculates the 3-day moving average of the 'Value' column
Output:
Date Value Moving_Avg
0 2021-01-01 10 NaN
1 2021-01-02 20 NaN
2 2021-01-03 30 20.0
3 2021-01-04 40 30.0
4 2021-01-05 50 40.0
Time zone conversion is essential when working with global data to ensure that timestamps are accurate across different regions. This involves converting the time series data to a specific time zone.
import pandas as pd
date_range = pd.date_range('2021-01-01 12:00', periods=5, freq='D')
df = pd.DataFrame({'Date': date_range, 'Value': [10, 20, 30, 40, 50]})
df['Date'] = df['Date'].dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
print(df)
# This converts the time zone from UTC to US Eastern Time
Output:
Date Value
0 2021-01-01 07:00:00-05:00 10
1 2021-01-02 07:00:00-05:00 20
2 2021-01-03 07:00:00-05:00 30
3 2021-01-04 07:00:00-05:00 40
4 2021-01-05 07:00:00-05:00 50
Shifting data is a technique to move data forward or backward in time. This is useful for creating lag features in time series forecasting or calculating differences between time periods.
import pandas as pd
date_range = pd.date_range('2021-01-01', periods=5, freq='D')
df = pd.DataFrame({'Date': date_range, 'Value': [10, 20, 30, 40, 50]})
df['Shifted'] = df['Value'].shift(1)
print(df)
# This shifts the 'Value' column by 1 period down
Output:
Date Value Shifted
0 2021-01-01 10 NaN
1 2021-01-02 20 10.0
2 2021-01-03 30 20.0
3 2021-01-04 40 30.0
4 2021-01-05 50 40.0
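Differences between consecutive periods follow directly from shifting; diff() computes them in one step, as a quick sketch on the same DataFrame shows:
df['Change'] = df['Value'].diff() # Equivalent to df['Value'] - df['Value'].shift(1)
print(df[['Date', 'Value', 'Change']])
Output:
        Date  Value  Change
0 2021-01-01     10     NaN
1 2021-01-02     20    10.0
2 2021-01-03     30    10.0
3 2021-01-04     40    10.0
4 2021-01-05     50    10.0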
Line, bar, and pie charts are fundamental ways of visualizing data. A line chart is used to display trends over time, while bar charts are useful for comparing quantities, and pie charts show the proportions of categories.
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.title('Line Chart')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
Output:
A line chart will be displayed showing the trend of values on the X and Y axes.
categories = ['A', 'B', 'C', 'D']
values = [10, 20, 30, 40]
plt.bar(categories, values)
plt.title('Bar Chart')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()
Output:
A bar chart will be displayed with the categories on the X-axis and the values on the Y-axis.
sizes = [25, 35, 40]
labels = ['Category A', 'Category B', 'Category C']
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Pie Chart')
plt.show()
Output:
A pie chart will be displayed showing the distribution of the categories with percentage values.
Histograms and boxplots are helpful in analyzing the distribution of data. Histograms show the frequency of data points in intervals, while boxplots highlight the range, median, and outliers in data.
import numpy as np
data = np.random.randn(1000)
plt.hist(data, bins=30)
plt.title('Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Output:
A histogram will be displayed showing the distribution of values in the dataset.
data = [np.random.randn(100), np.random.randn(100) + 1]
plt.boxplot(data)
plt.title('Boxplot')
plt.show()
Output:
A boxplot will be displayed with boxes showing the distribution of the data and potential outliers.
pandas provides built-in plotting functionality to visualize data directly from DataFrames. It uses Matplotlib as a backend and makes it easy to create plots with simple commands.
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 4, 6, 8, 10]})
df.plot(x='x', y='y', kind='line')
plt.title('pandas Line Plot')
plt.show()
Output:
A line plot will be displayed showing the relationship between columns 'x' and 'y' from the DataFrame.
df.plot(x='x', y='y', kind='bar')
plt.title('pandas Bar Plot')
plt.show()
Output:
A bar plot will be displayed using the values from the DataFrame.
Dual axis and subplots allow you to plot multiple graphs on the same figure. Dual-axis plots show two sets of data on different y-axes, while subplots allow multiple charts to be placed in a grid layout.
x = [1, 2, 3, 4, 5]
y1 = [10, 20, 25, 30, 40]
y2 = [100, 90, 80, 70, 60]
fig, ax1 = plt.subplots()
ax1.plot(x, y1, 'g-') # First y-axis
ax2 = ax1.twinx()
ax2.plot(x, y2, 'b-') # Second y-axis
plt.show()
Output:
A plot with two y-axes will be displayed, one showing the 'y1' values and another showing 'y2' values.
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(x, y1)
ax2.plot(x, y2)
plt.show()
Output:
Two plots will be displayed side by side, each showing different data on the same figure.
E-commerce orders data typically contains information on product purchases, such as customer details, order date, product categories, prices, and more. Working with this data allows businesses to gain insights into customer behavior, sales trends, and performance metrics.
import pandas as pd
# Load the e-commerce dataset
data = pd.read_csv('ecommerce_orders.csv')
# Display the first few rows of the dataset
print(data.head())
# Analyze total sales
total_sales = data['total_price'].sum()
print(f'Total Sales: {total_sales}')
Output:
First few rows of the dataset
Total Sales: 50000
COVID-19 data includes information about the spread of the virus, infection rates, mortality rates, vaccination rates, and more. This data is often used for epidemiological studies and to inform public health policies. Understanding and analyzing this dataset can provide crucial insights into the pandemic's progression.
# Load the COVID-19 dataset
covid_data = pd.read_csv('covid_data.csv')
# Group by country and calculate total cases
country_cases = covid_data.groupby('country')['cases'].sum()
# Print the total cases for each country
print(country_cases)
Output:
Country1: 100000
Country2: 200000
...
The NYC property sales dataset includes information about properties sold in New York City, such as sale price, property type, and location. This dataset is valuable for real estate analysis, helping to identify trends in property values and market demand.
# Load the NYC property dataset
property_data = pd.read_csv('nyc_property_sales.csv')
# Analyze the average sale price
avg_price = property_data['sale_price'].mean()
print(f'Average Sale Price: {avg_price}')
Output:
Average Sale Price: 750000
Airline on-time data includes information about flights, including departure and arrival times, delays, and cancellations. This data is essential for analyzing airline performance, identifying patterns in delays, and improving operational efficiency.
# Load the airline on-time dataset
flight_data = pd.read_csv('airline_on_time.csv')
# Group by airline and calculate average delay
airline_delays = flight_data.groupby('airline')['arrival_delay'].mean()
# Print the average delays for each airline
print(airline_delays)
Output:
Airline1: 15 minutes
Airline2: 30 minutes
...
Exporting data to various file formats like CSV, Excel, and JSON is a common task in data analysis. Pandas makes it easy to save DataFrames to these formats using methods like `to_csv()`, `to_excel()`, and `to_json()`. These formats are widely supported and can be opened and analyzed in other tools such as spreadsheets or databases.
import pandas as pd # Import pandas library
data = {'Name': ['Alice', 'Bob', 'Charlie'], # Sample data
'Age': [25, 30, 35]}
df = pd.DataFrame(data) # Create DataFrame
df.to_csv('data.csv', index=False) # Export DataFrame to CSV without row index
df.to_excel('data.xlsx', index=False) # Export DataFrame to Excel without row index
df.to_json('data.json') # Export DataFrame to JSON
Output: Files 'data.csv', 'data.xlsx', and 'data.json' will be created in the working directory.
HDF5 and Feather are binary file formats that provide fast and efficient storage for large datasets. HDF5 is often used for storing complex data hierarchies, while Feather is optimized for fast reading and writing of large datasets. Both formats are commonly used for storing data in data science and machine learning workflows.
df.to_hdf('data.h5', key='df', mode='w') # Export DataFrame to HDF5 format
df.to_feather('data.feather') # Export DataFrame to Feather format
Output: Files 'data.h5' and 'data.feather' will be created in the working directory.
Compression is useful for reducing file size and improving storage efficiency, especially when working with large datasets. Pandas supports compressing text-based exports such as CSV and JSON using algorithms like 'gzip', 'bz2', 'zip', and 'xz', while HDF5 files are compressed through libraries such as 'blosc' or 'zlib'. By using compression, you can significantly reduce the size of data files without losing information.
df.to_csv('data_compressed.csv.gz', index=False, compression='gzip') # Export CSV with gzip compression
df.to_json('data_compressed.json.gz', compression='gzip') # Export JSON with gzip compression
df.to_hdf('data_compressed.h5', key='df', mode='w', complib='blosc', complevel=9) # Export HDF5 with blosc compression
Output: Files 'data_compressed.csv.gz', 'data_compressed.json.gz', and 'data_compressed.h5' will be created with compression applied.
Version control for data ensures that different versions of datasets are stored and managed efficiently. This is particularly important in collaborative environments where data is constantly evolving. By using tools like Git or DVC (Data Version Control), you can track changes to datasets and revert to previous versions when necessary. This helps maintain consistency and reproducibility in data analysis projects.
# Using DVC to track data file versions
!git init # Initialize a Git repository
!dvc init # Initialize DVC (Data Version Control)
!dvc add data.csv # Add data file to DVC
!git add data.csv.dvc # Add DVC file to Git
!git commit -m "Add data version 1" # Commit to Git
!dvc push # Push data to remote storage
Output: The data file 'data.csv' will be tracked with DVC, and version control will be applied using Git.
MultiIndex allows you to have multiple levels of indexing on a DataFrame or Series. This provides the ability to represent more complex data structures, such as hierarchical or multi-dimensional data, using a single object. MultiIndex is commonly used when dealing with data grouped by more than one criterion.
import pandas as pd # Importing pandas
arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]] # Defining the multi-level index
index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number')) # Creating MultiIndex
data = [10, 20, 30, 40] # Data for the Series
multi_index_series = pd.Series(data, index=index) # Creating a Series with MultiIndex
print(multi_index_series) # Print the Series with MultiIndex
Output:
Letter Number
A 1 10
2 20
B 1 30
2 40
Indexing by level allows you to access data in a MultiIndex structure by specifying a level or combination of levels. This is useful for slicing data based on specific index levels without needing to reset or flatten the index.
arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]] # Defining the multi-level index
index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number')) # Creating MultiIndex
data = [10, 20, 30, 40] # Data for the Series
multi_index_series = pd.Series(data, index=index) # Creating a Series with MultiIndex
print(multi_index_series['A']) # Indexing by the first level (Letter='A')
print(multi_index_series.loc['B']) # Indexing by the first level (Letter='B')
Output:
Number
1 10
2 20
dtype: int64
Number
1 30
2 40
dtype: int64
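Individual elements can be reached with a tuple of labels, and xs() takes a cross-section along a chosen level; a quick sketch on the same Series:
print(multi_index_series[('A', 2)]) # Access a single element by both levels
print(multi_index_series.xs(1, level='Number')) # Cross-section: all rows where Number == 1
Output:
20
Letter
A    10
B    30
dtype: int64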
Swapping the index levels allows you to change the order of the index levels, which can be useful for rearranging or sorting the data. Sorting by index can also be done to arrange the data in a specified order, either ascending or descending.
arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]] # Defining the multi-level index
index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number')) # Creating MultiIndex
data = [10, 20, 30, 40] # Data for the Series
multi_index_series = pd.Series(data, index=index) # Creating a Series with MultiIndex
swapped = multi_index_series.swaplevel() # Swapping index levels
print(swapped) # Print Series with swapped index
sorted_series = multi_index_series.sort_index() # Sorting by index
print(sorted_series) # Print sorted Series
Output:
Number Letter
1 A 10
B 30
2 A 20
B 40
dtype: int64
Letter Number
A 1 10
2 20
B 1 30
2 40
Index set operations allow you to perform mathematical set operations (like union, intersection, and difference) on the index of a pandas object. This can be useful for comparing and manipulating the indices of different datasets.
index1 = pd.Index(['A', 'B', 'C', 'D']) # First index
index2 = pd.Index(['B', 'C', 'E', 'F']) # Second index
union_index = index1.union(index2) # Union of indices (all unique elements)
intersection_index = index1.intersection(index2) # Intersection of indices (common elements)
difference_index = index1.difference(index2) # Difference of indices (elements in index1 but not index2)
print(union_index) # Print union of indices
print(intersection_index) # Print intersection of indices
print(difference_index) # Print difference of indices
Output:
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
Index(['B', 'C'], dtype='object')
Index(['A', 'D'], dtype='object')
In pandas, you can apply custom functions to rows or columns of a DataFrame using the apply() method. When applying a function to columns, the function is applied to each individual column, whereas applying it to rows means the function will be applied to each row. The axis parameter is used to specify whether you want to apply the function along rows (axis=1) or columns (axis=0).
import pandas as pd
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
def sum_func(col):
    return col.sum()
column_sum = df.apply(sum_func, axis=0)
print(column_sum)
Output:
A 6
B 15
row_sum = df.apply(sum_func, axis=1)
print(row_sum)
Output:
0 5
1 7
2 9
A lambda function is a small anonymous function that can be defined in a single line. It is often used with the apply() method to create quick functions for operations like transformations and computations.
df['A_squared'] = df['A'].apply(lambda x: x**2)
print(df)
Output:
A B A_squared
0 1 4 1
1 2 5 4
2 3 6 9
df['A_label'] = df['A'].apply(lambda x: 'Small' if x < 3 else 'Large')
print(df)
Output:
A B A_squared A_label
0 1 4 1 Small
1 2 5 4 Small
2 3 6 9 Large
The map() function is used for element-wise transformations, and it works specifically with Series (one-dimensional data). In contrast, the apply() function is more flexible and can be used with both Series and DataFrames, enabling more complex transformations across rows or columns. map() is generally faster than apply() when applied to Series.
df['B_mapped'] = df['B'].map(lambda x: x * 10)
print(df)
Output:
A B A_squared A_label B_mapped
0 1 4 1 Small 40
1 2 5 4 Small 50
2 3 6 9 Large 60
df['row_sum'] = df.apply(lambda row: row['A'] + row['B'], axis=1)
print(df)
Output:
A B A_squared A_label B_mapped row_sum
0 1 4 1 Small 40 5
1 2 5 4 Small 50 7
2 3 6 9 Large 60 9
Complex transformations involve applying more advanced logic to the DataFrame or Series. This may include combining multiple columns, performing conditional operations, or using external functions within the transformation.
df['A_and_B'] = df.apply(lambda row: row['A'] * row['B'] if row['A'] > 1 else row['A'] + row['B'], axis=1)
print(df)
Output:
A B A_squared A_label B_mapped row_sum A_and_B
0 1 4 1 Small 40 5 5
1 2 5 4 Small 50 7 10
2 3 6 9 Large 60 9 18
def multiply(a, b):
    return a * b
df['multiplication'] = df.apply(lambda row: multiply(row['A'], row['B']), axis=1)
print(df)
Output:
A B A_squared A_label B_mapped row_sum A_and_B multiplication
0 1 4 1 Small 40 5 5 4
1 2 5 4 Small 50 7 10 10
2 3 6 9 Large 60 9 18 18
Optimizing the data types used in your dataset can significantly improve performance, especially when working with large datasets. By choosing more efficient data types, such as using integers instead of floats, or reducing memory usage by using smaller types like 'float32' instead of 'float64', you can save both memory and processing time.
# Real-world example: Efficient data types
import pandas as pd # Import the pandas library
data = pd.DataFrame({'col1': [1.0, 2.5, 3.8], 'col2': [5.2, 6.3, 7.4]}) # Create a DataFrame
data['col1'] = data['col1'].astype('float32') # Convert to more efficient float32
data['col2'] = data['col2'].astype('float32') # Convert to more efficient float32
print(data.dtypes) # Output the data types of each column
Output: The data types of 'col1' and 'col2' are now 'float32'
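To confirm the saving, memory_usage() reports the bytes used per column; a quick check on the same DataFrame (the index adds a small, version-dependent overhead):
print(data.memory_usage(deep=True)) # Bytes per column: each float32 column holds 3 values x 4 bytes = 12 bytes
Output: Each float32 column now uses 12 bytes, half of the 24 bytes a float64 column would need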
Categorical data refers to data that can take on a limited, fixed number of values, often representing categories such as 'male/female' or 'red/blue/green'. Using the 'category' data type in pandas reduces memory usage and increases performance when dealing with repetitive text values.
# Real-world example: Using categorical data
data = pd.DataFrame({'category_col': ['red', 'blue', 'green', 'red', 'blue']}) # Create a DataFrame
data['category_col'] = data['category_col'].astype('category') # Convert to categorical type
print(data.dtypes) # Output the data type of the 'category_col' column
Output: The data type of 'category_col' is now 'category'
Vectorization refers to replacing explicit loops with operations that are performed over entire arrays or columns of data. By using vectorized operations, you can take advantage of optimized C code behind libraries like NumPy and pandas, leading to faster execution times compared to using Python loops.
# Real-world example: Vectorization over loops
import numpy as np # Import NumPy library
data = np.array([1, 2, 3, 4, 5]) # Create a NumPy array
result = data * 2 # Vectorized operation (no loop needed)
print(result) # Output the result
Output: The array [2, 4, 6, 8, 10]
Profiling with the magic command `%%timeit` in Jupyter Notebooks or IPython allows you to measure the execution time of a single line or a block of code. This can be useful for determining performance bottlenecks and optimizing code where necessary.
# Real-world example: Profiling with %%timeit
# This code should be run in a Jupyter Notebook or IPython environment
%%timeit # Measure the time taken by this code
data = [x for x in range(1000)] # List comprehension to create a list
Output: The time taken to execute the list comprehension is displayed
Using try-except blocks in a pandas pipeline allows you to gracefully handle errors, ensuring that the process continues even if one step fails. This is particularly useful in data processing when unexpected errors occur.
import pandas as pd
def safe_divide(a, b): # Division that handles a zero denominator
    try:
        return a / b
    except ZeroDivisionError:
        print("Error: Division by zero encountered")
        return float('nan') # Keep the pipeline going by returning NaN for this row
def process_data(df): # One step of a pandas pipeline
    df['Value'] = [safe_divide(a, b) for a, b in zip(df['Value'], df['Value2'])]
    return df
df = pd.DataFrame({'Value': [10, 20, 30], 'Value2': [2, 0, 5]})
result = process_data(df)
print(result)
# This uses try-except to handle division by zero row by row; vectorized pandas division
# (df['Value'] / df['Value2']) would return inf instead of raising ZeroDivisionError
Output:
Error: Division by zero encountered
   Value  Value2
0    5.0       2
1    NaN       0
2    6.0       5
Logging allows you to record information about your program's execution, making it easier to track errors and debug. Custom errors can also be raised to handle specific conditions in a program.
import logging
logging.basicConfig(level=logging.INFO)
def check_value(value):
    if value < 0:
        logging.error("Value cannot be negative")
        raise ValueError("Negative value error")
    return value
try:
    check_value(-5)
except ValueError as e:
    logging.exception(e)
# This logs an error when a negative value is encountered
Output:
ERROR:root:Value cannot be negative
ERROR:root:Negative value error
Traceback (most recent call last):
  File "script.py", line 7, in <module>
    check_value(-5)
  File "script.py", line 5, in check_value
    raise ValueError("Negative value error")
ValueError: Negative value error
Warnings indicate potential issues in code that may not stop execution but could lead to problems in the future. Deprecation warnings indicate that certain features or functions will be removed in future versions of libraries.
import warnings
def old_function():
    warnings.warn("This function is deprecated", DeprecationWarning)
old_function()
# This generates a deprecation warning
Output:
DeprecationWarning: This function is deprecated
Breakpoints allow you to pause code execution at a specific line, enabling you to inspect variables and the program state. This is especially useful for debugging issues in complex code.
import pdb
def calculate(x, y):
    pdb.set_trace()
    return x + y
result = calculate(10, 20)
print(result)
# This will pause the program at the breakpoint, allowing inspection
Output:
(Pdb) 10
(Pdb) 20
(Pdb) n
30
Seaborn is a Python data visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics. It is designed to work well with pandas DataFrames and provides a variety of plotting functions for visualizing distributions, relationships, and categories.
import seaborn as sns
import matplotlib.pyplot as plt
# Load the iris dataset
data = sns.load_dataset('iris')
sns.scatterplot(x='sepal_length', y='sepal_width', data=data)
plt.title('Seaborn Scatter Plot')
plt.show()
Output:
A scatter plot will be displayed showing the relationship between 'sepal_length' and 'sepal_width' from the iris dataset.
Seaborn provides several built-in functions to create statistical plots, such as bar plots, box plots, violin plots, and regression plots. These plots are designed to visualize the statistical properties of the data, like distributions, central tendency, and spread.
sns.boxplot(x='species', y='sepal_length', data=data)
plt.title('Boxplot of Sepal Length by Species')
plt.show()
Output:
A boxplot will be displayed, showing the distribution of 'sepal_length' for each species in the iris dataset.
sns.regplot(x='sepal_length', y='sepal_width', data=data)
plt.title('Regression Plot')
plt.show()
Output:
A regression plot will be displayed showing the linear relationship between 'sepal_length' and 'sepal_width' with a fitted line.
Pairplots and heatmaps are two powerful visualization techniques in Seaborn. Pairplots show relationships between all variables in a dataset, while heatmaps visualize correlation matrices or other two-dimensional data.
sns.pairplot(data, hue='species')
plt.title('Pairplot of Iris Dataset')
plt.show()
Output:
A pairplot will be displayed showing scatter plots between all variables in the iris dataset, color-coded by the species.
import numpy as np
corr_matrix = data.corr(numeric_only=True)  # Correlations between the numeric columns (excludes 'species')
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
Output:
A heatmap will be displayed showing the correlation matrix between the numerical features of the iris dataset, with annotated values.
Seaborn allows users to customize the appearance of their plots using themes and color palettes. You can change the overall style of plots and adjust the color schemes to make the visuals more appealing and easier to interpret.
sns.set_style('whitegrid')
sns.scatterplot(x='sepal_length', y='sepal_width', data=data)
plt.title('Scatter Plot with Custom Theme')
plt.show()
Output:
A scatter plot will be displayed with a white grid background, using Seaborn's 'whitegrid' theme.
sns.set_palette('muted')
sns.boxplot(x='species', y='sepal_length', data=data)
plt.title('Boxplot with Custom Palette')
plt.show()
Output:
A boxplot will be displayed using Seaborn's 'muted' color palette.
Importing finance data is the first step in performing stock market analysis. You can use various data sources like Yahoo Finance, Alpha Vantage, or Quandl to fetch historical stock price data. This data typically includes the opening, closing, high, and low prices for each trading day.
import yfinance as yf
# Define the stock ticker symbol (e.g., Apple)
ticker = 'AAPL'
# Download historical stock data for the past year
data = yf.download(ticker, period='1y')
# Display the first few rows of the data
print(data.head())
Output:
First few rows of Apple's stock data including Date, Open, High, Low, Close, Volume, Adjusted Close
Daily returns refer to the percentage change in stock price from one day to the next. Cumulative returns show the total percentage change over a specified period. These calculations help investors understand short-term price movements and the overall performance of a stock.
# Calculate daily returns by computing the percentage change of the closing price
data['Daily_Return'] = data['Close'].pct_change()
# Calculate cumulative returns
data['Cumulative_Return'] = (1 + data['Daily_Return']).cumprod() - 1
# Display the results
print(data[['Daily_Return', 'Cumulative_Return']].tail())
Output:
Last 5 rows of daily returns and cumulative returns for Apple
Moving averages help smooth out price data to identify trends over a specific period, while volatility measures the variation in stock prices. A common moving average is the 50-day or 200-day moving average, and volatility can be measured using the standard deviation of daily returns.
# Calculate the 50-day moving average
data['50_Day_MA'] = data['Close'].rolling(window=50).mean()
# Calculate the volatility (standard deviation of daily returns)
data['Volatility'] = data['Daily_Return'].rolling(window=50).std()
# Display the results
print(data[['50_Day_MA', 'Volatility']].tail())
Output:
Last 5 rows of 50-day moving averages and volatility values
Risk vs return is a fundamental concept in finance. Investors must assess whether the potential return on an investment justifies the associated risk. By analyzing historical stock data, we can calculate the expected return and the risk (volatility) to make more informed investment decisions.
# Calculate the expected return (mean of daily returns)
expected_return = data['Daily_Return'].mean()
# Calculate the risk (standard deviation of daily returns)
risk = data['Daily_Return'].std()
# Print the expected return and risk
print(f'Expected Return: {expected_return:.4f}')
print(f'Risk (Volatility): {risk:.4f}')
Output:
Expected Return: 0.0012
Risk (Volatility): 0.0153
Web scraping is a technique used to extract data from websites. In this case, we will scrape movie data from IMDb using the BeautifulSoup and requests libraries. This allows us to gather information such as movie titles, genres, release dates, and ratings.
import requests # Import requests library
from bs4 import BeautifulSoup # Import BeautifulSoup for HTML parsing
url = 'https://www.imdb.com/chart/top' # IMDb top movies chart URL
response = requests.get(url) # Send a GET request to the URL
soup = BeautifulSoup(response.text, 'html.parser') # Parse HTML content using BeautifulSoup
movies = soup.find_all('td', class_='titleColumn') # Find all movie title columns
for movie in movies[:5]: # Loop through the first 5 movies
title = movie.a.text # Extract movie title
year = movie.span.text.strip('()') # Extract movie release year
print(f'{title} ({year})') # Print movie title and release year
Output: The top 5 movies from IMDb, along with their release years, will be printed.
Grouping movies by genre allows us to analyze trends and compare different types of movies. We can create a dictionary where the genre is the key, and the value is a list of movies belonging to that genre.
import pandas as pd # Import pandas for data manipulation
# Sample data: movie titles and their genres
data = {'Title': ['Inception', 'The Dark Knight', 'Pulp Fiction', 'The Matrix', 'Forrest Gump'], 'Genre': ['Sci-Fi', 'Action', 'Crime', 'Sci-Fi', 'Drama']}
df = pd.DataFrame(data) # Create a DataFrame from the data
grouped_genres = df.groupby('Genre')['Title'].apply(list) # Group movies by genre
print(grouped_genres) # Print the grouped genres and their movies
Output: A list of movies grouped by genre will be printed, showing the genres with their respective movie titles.
Visualizing how ratings change over time is a powerful tool for understanding movie trends. We can create a plot that shows the ratings of movies over the years using libraries like Matplotlib and Pandas.
import matplotlib.pyplot as plt # Import Matplotlib for plotting
import pandas as pd # Import pandas for data manipulation
# Sample data: movie titles, ratings, and release years
data = {'Title': ['Inception', 'The Dark Knight', 'Pulp Fiction', 'The Matrix', 'Forrest Gump'], 'Rating': [8.8, 9.0, 8.9, 8.7, 8.8], 'Year': [2010, 2008, 1994, 1999, 1994]}
df = pd.DataFrame(data) # Create a DataFrame from the data
plt.plot(df['Year'], df['Rating'], marker='o') # Plot ratings over years
plt.title('Movie Ratings Over Time') # Set plot title
plt.xlabel('Year') # Set x-axis label
plt.ylabel('Rating') # Set y-axis label
plt.show() # Display the plot
Output: A line plot will be displayed showing movie ratings over the years.
Comparing movies based on studio and budget can provide insights into the financial aspect of filmmaking. We can group movies by studio and calculate average budgets or other statistics.
# Sample data: movie titles, studio, and budget
data = {'Title': ['Inception', 'The Dark Knight', 'Pulp Fiction', 'The Matrix', 'Forrest Gump'], 'Studio': ['Warner Bros.', 'Warner Bros.', 'Miramax', 'Warner Bros.', 'Paramount'], 'Budget': [160000000, 185000000, 80000000, 63000000, 55000000]}
df = pd.DataFrame(data) # Create a DataFrame from the data
avg_budget_by_studio = df.groupby('Studio')['Budget'].mean() # Calculate average budget by studio
print(avg_budget_by_studio) # Print the average budget for each studio
Output: The average budget for each studio will be printed, showing how much they typically invest in their movies.
GeoPandas is an extension of the pandas library that makes working with geospatial data in Python easier. It allows you to read, manipulate, and analyze spatial data such as shapefiles, GeoJSON, and others. It supports spatial operations like calculating area, distance, buffering, and joins based on geometry.
import geopandas as gpd # Importing the geopandas library
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres')) # Load world map data
print(world.head()) # Display the first few rows of the GeoDataFrame
Output:
pop_est continent name iso_a3 gdp_md_est geometry
0 920938 Africa Algeria DZA 48830.0 POLYGON ((11.999 23.471...
1 539509 Africa Angola AGO 110300.0 MULTIPOLYGON (((12.227 ...
...
Shapefiles are a common format for geospatial vector data, consisting of several files (.shp, .shx, .dbf, etc.). GeoPandas allows you to import shapefiles easily into a GeoDataFrame using the `read_file()` function. Once loaded, the data can be analyzed or visualized just like a regular DataFrame, with added geometry features.
import geopandas as gpd # Import geopandas
gdf = gpd.read_file('your_shapefile_folder/your_file.shp') # Replace with your actual shapefile path
print(gdf.head()) # Show top rows of shapefile data
Output:
Will vary depending on the shapefile. Typically includes:
ID, Name, geometry, etc.
Example:
ID NAME geometry
1 River POLYGON ((...))
With GeoPandas, you can create maps using longitude and latitude data by converting regular pandas DataFrames to GeoDataFrames using the `points_from_xy` function. This is useful when visualizing addresses, locations, or spatial patterns based on GPS coordinates.
import pandas as pd # Import pandas
import geopandas as gpd # Import geopandas
from shapely.geometry import Point # Import Point for coordinates
# Create a simple DataFrame with coordinates
data = {'City': ['A', 'B'], 'Latitude': [34.05, 36.16], 'Longitude': [-118.24, -115.15]}
df = pd.DataFrame(data) # Create DataFrame
# Convert to GeoDataFrame
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['Longitude'], df['Latitude']))
print(gdf.head()) # Show mapped GeoDataFrame
Output:
City Latitude Longitude geometry
A 34.05 -118.24 POINT (-118.24000 34.05000)
B 36.16 -115.15 POINT (-115.15000 36.16000)
You can combine spatial data (maps) with regular pandas DataFrames using merges or joins on common keys. For example, you can join a GeoDataFrame of regions with another DataFrame containing population or economic data. This allows for spatial analysis and geographic visualizations with rich context.
import geopandas as gpd # Import geopandas
import pandas as pd # Import pandas
# Load a world map
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# Create a simple DataFrame with matching country names
data = {'name': ['Brazil', 'China'], 'Population': [211e6, 1393e6]}
pop_df = pd.DataFrame(data) # Create population DataFrame
# Merge spatial data with population
merged = world.merge(pop_df, on='name') # Merge on 'name' column
print(merged[['name', 'Population', 'geometry']]) # Show result
Output:
name Population geometry
Brazil 211000000.0 POLYGON ((-57.6251 -30.2163...
China 1393000000.0 POLYGON ((109.6875 18.1977...
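Because the merged result is still a GeoDataFrame, the joined attribute can drive a map directly. A small sketch, again assuming matplotlib is available:
import matplotlib.pyplot as plt # Needed to display the plot
merged.plot(column='Population', legend=True) # Colour each matched country by population
plt.title('Population of selected countries') # Add a title
plt.show() # Display the choropleth
Output: A map of Brazil and China shaded by population, with a colour legend.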
Tokenization is the process of breaking down text into individual units such as words or phrases (tokens). Lowercasing is a preprocessing step where all text is converted to lowercase to maintain uniformity and improve comparison. These are often the first steps in natural language processing (NLP) workflows to prepare raw text for analysis or machine learning.
text = "NLP Is Fun and Powerful!" # Original text
tokens = text.lower().split() # Convert to lowercase and split into tokens
print(tokens) # Display the tokens
Output: ['nlp', 'is', 'fun', 'and', 'powerful!']
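A plain split() keeps punctuation attached to words ('powerful!'). If that matters, a regular expression is a common next step; a minimal sketch:
import re # Regular expressions for tokenization
tokens_clean = re.findall(r'[a-z0-9]+', text.lower()) # Keep only runs of letters and digits
print(tokens_clean) # Punctuation is stripped from the tokens
Output: ['nlp', 'is', 'fun', 'and', 'powerful']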
Stopwords are common words in a language (like "and", "is", "the") that often do not add significant meaning. Removing them helps reduce noise in the data and can lead to better model performance. Libraries like NLTK provide predefined lists of stopwords.
from nltk.corpus import stopwords # Import stopwords list
import nltk # Import nltk for natural language tasks
nltk.download('stopwords') # Download stopwords if not already present
text = "NLP is fun and powerful" # Sample text
words = text.lower().split() # Lowercase and tokenize
filtered = [word for word in words if word not in stopwords.words('english')] # Remove stopwords
print(filtered) # Display filtered tokens
Output: ['nlp', 'fun', 'powerful']
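Calling stopwords.words('english') inside the list comprehension rebuilds the list for every word. Building it once as a set is faster and reads the same; a small sketch:
stop_words = set(stopwords.words('english')) # Build the stopword set once
filtered = [word for word in words if word not in stop_words] # Same result, faster lookups
print(filtered) # ['nlp', 'fun', 'powerful']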
The Bag-of-Words (BoW) model transforms text into a fixed-length vector based on word frequency. It disregards grammar and word order but captures how often a word occurs in a document. It’s commonly used for text classification and clustering.
from sklearn.feature_extraction.text import CountVectorizer # Import BoW tool
texts = ["I love NLP", "NLP is great and fun"] # Sample documents
vectorizer = CountVectorizer() # Create a CountVectorizer object
bow = vectorizer.fit_transform(texts) # Transform texts to BoW format
print(vectorizer.get_feature_names_out()) # Print words used in BoW
print(bow.toarray()) # Show frequency matrix
Output:
Words: ['and', 'fun', 'great', 'is', 'love', 'nlp']
Matrix: [[0, 0, 0, 0, 1, 1], [1, 1, 1, 1, 0, 1]]
TF-IDF (Term Frequency-Inverse Document Frequency) is an advanced technique that scores words based on how unique they are across documents. A high TF-IDF value indicates a word is frequent in one document but rare across others, making it more meaningful. Scikit-learn makes it easy to implement TF-IDF using `TfidfVectorizer`.
from sklearn.feature_extraction.text import TfidfVectorizer # Import TF-IDF tool
texts = ["I love NLP", "NLP is great and fun"] # Sample text list
tfidf = TfidfVectorizer() # Create the vectorizer
matrix = tfidf.fit_transform(texts) # Fit and transform the texts
print(tfidf.get_feature_names_out()) # Display words
print(matrix.toarray()) # Display TF-IDF scores
Output:
Words: ['and', 'fun', 'great', 'is', 'love', 'nlp']
Matrix: Each row shows the TF-IDF scores of words per document.
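The raw array is hard to read. Wrapping it in a DataFrame labels each column with its word, which is handy for inspection; a small sketch reusing the objects above:
import pandas as pd # For a labelled view of the scores
tfidf_df = pd.DataFrame(matrix.toarray(), columns=tfidf.get_feature_names_out()) # One column per word
print(tfidf_df.round(2)) # TF-IDF scores rounded for readability
Output: A two-row table of TF-IDF scores, one column per word.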
Dask is a parallel computing library in Python that scales Python code to handle large datasets. It allows you to work with big data in a similar way to pandas but in a distributed or parallelized fashion. Dask breaks your data into smaller chunks and processes them in parallel, which is ideal for datasets that don’t fit into memory.
import dask # Import the Dask library
import dask.dataframe as dd # Import Dask's DataFrame module
# Load a large CSV using Dask
df = dd.read_csv('large_dataset.csv') # Efficiently reads big files in chunks
print(df.head()) # Displays the first few rows lazily
Output:
First few rows of your large CSV file (similar to pandas output).
Example:
id,name,value
1,Alice,100
2,Bob,200
Dask mimics the pandas API, which means you can use most of your existing pandas code with minimal changes. Instead of `import pandas as pd`, you use `import dask.dataframe as dd`. The rest of the syntax (like filtering, grouping, etc.) remains largely the same, making it easy to scale your existing workflows.
import dask.dataframe as dd # Use Dask instead of pandas
# Read a CSV file with Dask
df = dd.read_csv('sales_data.csv') # Loads data in chunks
# Group by product and calculate the total
result = df.groupby('product')['amount'].sum() # Similar to pandas
print(result.compute()) # Triggers computation
Output:
product
Chair 1200
Desk 800
Lamp 400
One of the powerful features of Dask is lazy evaluation. Unlike pandas, which executes operations immediately, Dask builds a task graph and waits to compute until explicitly told to with `.compute()`. This makes operations faster and memory-efficient because nothing is done until needed.
import dask.dataframe as dd # Import Dask
df = dd.read_csv('bigfile.csv') # Load big file lazily
# Define a lazy operation
average = df['price'].mean() # No computation yet
print(average) # Just shows a delayed object
print(average.compute()) # Now it actually calculates
Output:
A lazy Dask scalar placeholder (printed as something like dd.Scalar<series-..., dtype=float64>)
Then the computed mean, e.g. 45.67
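When several lazy results come from the same data, computing them together lets Dask share one pass over the file. A minimal sketch, assuming the same bigfile.csv with a 'price' column:
import dask # Top-level compute helper
mean_price, max_price = dask.compute(df['price'].mean(), df['price'].max()) # Both results in one shared computation
print(mean_price, max_price) # The computed mean and maximum
Output: The mean and maximum price (values depend on your data).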
Dask is especially useful when handling gigabytes or terabytes of data. Performance can be benchmarked by comparing processing time and memory usage against pandas. For massive files, Dask often completes tasks significantly faster because it uses parallelism and avoids memory overload by processing data in chunks.
import pandas as pd # Regular pandas
import dask.dataframe as dd # Dask
import time # To measure time
# Timing pandas
start = time.time()
df_pandas = pd.read_csv('large_file.csv')
print("Pandas Time:", time.time() - start)
# Timing Dask
start = time.time()
df_dask = dd.read_csv('large_file.csv')
df_dask.head() # Trigger a partial load (Dask reads only enough of the file for head(), so this is not a full like-for-like comparison)
print("Dask Time:", time.time() - start)
Output:
Pandas Time: 12.4 seconds
Dask Time: 2.7 seconds
(Numbers may vary based on file size and system.)
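Because dd.read_csv is lazy, a fairer test forces both libraries to do the same work. A sketch of a like-for-like comparison, assuming the file has a numeric column named 'value' (a hypothetical column name):
start = time.time()
pandas_total = pd.read_csv('large_file.csv')['value'].sum() # Full read and sum with pandas
print("Pandas full sum:", time.time() - start)
start = time.time()
dask_total = dd.read_csv('large_file.csv')['value'].sum().compute() # Full read and sum with Dask
print("Dask full sum:", time.time() - start)
Output: Two timings for the same aggregation; the gap depends on file size and CPU cores.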
SQLite is a lightweight, file-based database that can be easily integrated with pandas for storing and retrieving data. To connect pandas to SQLite, we use Python's built-in sqlite3 module to establish a connection, which can then be passed to pandas functions like read_sql_query() and to_sql().
import sqlite3
conn = sqlite3.connect("example.db")
print("Connection successful!")
Output:
Connection successful!
After establishing a connection to the database, you can use SQL queries to retrieve data using pd.read_sql_query(). This function allows you to run SQL commands like SELECT and fetch the results directly into a pandas DataFrame for further analysis.
import pandas as pd
query = "SELECT name FROM sqlite_master WHERE type='table';"
tables = pd.read_sql_query(query, conn)
print(tables)
Output:
Displays a list of table names in the SQLite database
You can join SQL query results with pandas DataFrames using pandas merge operations. This is useful when you want to combine database results with other in-memory data. The merge() function lets you do SQL-style joins (inner, outer, left, right) directly in pandas.
# Assuming 'employees' table exists in SQLite
sql_df = pd.read_sql_query("SELECT id, name FROM employees", conn)
local_df = pd.DataFrame({'id': [1, 2], 'department': ['HR', 'IT']})
joined_df = pd.merge(sql_df, local_df, on='id')
print(joined_df)
Output:
Merges employee names from SQL with department data in pandas
Pandas allows you to write a DataFrame to a SQL table using the to_sql() method. This is useful for saving processed or transformed data back into a database. You can choose to replace the table, append to it, or fail if it already exists by setting the if_exists parameter.
df = pd.DataFrame({'id': [3, 4], 'name': ['Alice', 'Bob']})
df.to_sql("employees", conn, if_exists="append", index=False)
print("Data written to SQL database.")
Output:
Data written to SQL database.
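A quick way to confirm the write is to read the table back with the same connection, and to close the connection when you are done; a small sketch:
check = pd.read_sql_query("SELECT * FROM employees", conn) # Read the table back into a DataFrame
print(check) # Should now include the appended rows
conn.close() # Close the connection when finished
Output: The contents of the employees table, including the newly appended rows.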
Generating Excel reports is a common task in data analysis for sharing results with non-technical users. With pandas, you can easily write DataFrames to Excel files using the `to_excel()` method, which supports multiple sheets and formatting. Writing .xlsx files requires an Excel engine such as openpyxl or xlsxwriter to be installed.
# Real-world example: Generating an Excel report
import pandas as pd # Import pandas library
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Sales': [250, 300]}) # Create a sample DataFrame
df.to_excel('sales_report.xlsx', index=False) # Export the DataFrame to Excel
print("Excel report generated successfully.") # Confirm success
Output: Creates a file named sales_report.xlsx with Name and Sales data.
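To put several DataFrames into one workbook, use an ExcelWriter with one sheet per frame. A small sketch; the summary DataFrame here is made up for illustration:
summary = pd.DataFrame({'Region': ['North', 'South'], 'Total': [550, 420]}) # Hypothetical summary data
with pd.ExcelWriter('sales_report_multi.xlsx') as writer: # One workbook, several sheets
    df.to_excel(writer, sheet_name='Detail', index=False) # Detailed rows
    summary.to_excel(writer, sheet_name='Summary', index=False) # Summary sheet
print("Multi-sheet report generated.")
Output: Creates sales_report_multi.xlsx with a Detail sheet and a Summary sheet.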
With the xlsxwriter engine, you can create Excel files with embedded charts. This is helpful for visualizing trends or summaries directly in your reports. Charts can be added with formatting, titles, and styles for professional reporting.
# Real-world example: Adding charts to Excel with xlsxwriter
writer = pd.ExcelWriter('chart_report.xlsx', engine='xlsxwriter') # Use xlsxwriter engine
df.to_excel(writer, sheet_name='Sheet1', index=False) # Write data to Excel
workbook = writer.book # Get workbook object
worksheet = writer.sheets['Sheet1'] # Get worksheet object
chart = workbook.add_chart({'type': 'column'}) # Create column chart
chart.add_series({'values': '=Sheet1!$B$2:$B$3'}) # Add data series
worksheet.insert_chart('D2', chart) # Insert chart in cell D2
writer.close() # Save and close the file (recent pandas versions removed writer.save(); close() writes the workbook)
print("Excel file with chart created.") # Confirm success
Output: Generates an Excel file with a column chart showing Sales data.
Emailing reports automatically is useful for regular updates. Using the built-in `smtplib` and `email` libraries in Python, you can send Excel or PDF files as email attachments. This is often combined with scheduled scripts for automation.
# Real-world example: Sending email with a report (requires real email setup)
import smtplib # Library for sending emails
from email.message import EmailMessage # Class to build email messages
msg = EmailMessage() # Create an email message
msg['Subject'] = 'Monthly Sales Report' # Set subject
msg['From'] = 'sender@example.com' # Sender's email
msg['To'] = 'receiver@example.com' # Receiver's email
msg.set_content('Please find the attached sales report.') # Email body
with open('sales_report.xlsx', 'rb') as f: # Open file in binary mode
    file_data = f.read() # Read file contents
    file_name = f.name # Get file name
msg.add_attachment(file_data, maintype='application', subtype='octet-stream', filename=file_name) # Attach file
with smtplib.SMTP('smtp.example.com', 587) as server: # Connect to SMTP server
    server.starttls() # Secure the connection
    server.login('sender@example.com', 'password') # Login credentials
    server.send_message(msg) # Send email
print("Email sent with report.") # Confirm success
Output: Sends the sales_report.xlsx to the specified recipient.
Automating the timing of report generation can be achieved using cron (Linux/macOS) or Task Scheduler (Windows). These tools allow you to schedule your Python script to run at specific intervals, like daily or weekly.
# Example (Linux/macOS): Add this to crontab
# Open crontab with: crontab -e
# Run every day at 8 AM
0 8 * * * /usr/bin/python3 /path/to/report_script.py
# Example (Windows Task Scheduler):
# - Open Task Scheduler
# - Create Task → Triggers: Daily 8AM
# - Actions: Start a Program → python.exe and script path
Output: Your script runs automatically on the schedule without manual intervention.
Consuming JSON APIs in pandas means making a request to an API that returns JSON data and converting that data into a DataFrame. This is common in data projects that involve pulling external information from web services or public data portals.
import requests
import pandas as pd
url = 'https://jsonplaceholder.typicode.com/posts'
response = requests.get(url)
data = response.json()
df = pd.DataFrame(data)
print(df.head())
# Fetches JSON data from a free placeholder API and loads it into a DataFrame
Output:
id userId title
1 1 sunt aut facere repellat
2 1 qui est esse
3 1 ea molestias quasi
...
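In practice it is worth checking the HTTP response before parsing it; a defensive sketch of the same request:
response = requests.get(url, timeout=10) # Time out rather than hang indefinitely
response.raise_for_status() # Raise an error for 4xx/5xx responses
df = pd.DataFrame(response.json()) # Parse only if the request succeeded
Output: The same DataFrame as above, or an HTTPError if the request failed.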
Sometimes JSON from APIs contains nested structures like dictionaries or lists inside objects. Pandas has `json_normalize()` to flatten these structures into a tabular format so they can be more easily analyzed.
import requests
import pandas as pd
from pandas import json_normalize
url = 'https://jsonplaceholder.typicode.com/users'
response = requests.get(url)
data = response.json()
df = json_normalize(data, sep='_')
print(df[['id', 'name', 'address_city']])
# This flattens the nested address data into individual columns
Output:
id name address_city
1 Leanne Graham Gwenborough
2 Ervin Howell Wisokyburgh
...
Many APIs limit the number of results returned per request. Pagination helps you retrieve multiple pages. Tokens are often used for authenticated access. You loop through pages until there's no more data.
import requests
import pandas as pd
all_data = []
for page in range(1, 3):
    url = f'https://jsonplaceholder.typicode.com/posts?_page={page}&_limit=5'
    response = requests.get(url)
    data = response.json()
    all_data.extend(data)
df = pd.DataFrame(all_data)
print(df)
# This simulates pagination by looping through API pages
Output:
id title body
1 sunt aut facere ...
2 qui est esse ...
...
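For APIs that require authentication, the token usually travels in a request header. A minimal sketch with a placeholder token and a hypothetical endpoint:
headers = {'Authorization': 'Bearer YOUR_API_TOKEN'} # Placeholder token, not a real credential
response = requests.get('https://api.example.com/data?_page=1', headers=headers) # Hypothetical endpoint
print(response.status_code) # 200 means the token was accepted
Output: The HTTP status code of the authenticated request.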
Some APIs restrict the number of requests in a given time (rate-limiting). To avoid getting blocked or receiving errors, you can use delays (like `time.sleep()`) between requests or handle the 429 error specifically.
import requests
import pandas as pd
import time
all_data = []
for i in range(3):
    url = 'https://jsonplaceholder.typicode.com/posts/{}'.format(i + 1)
    response = requests.get(url)
    if response.status_code == 429:
        print("Rate limit hit, sleeping...")
        time.sleep(5)
        continue
    data = response.json()
    all_data.append(data)
    time.sleep(1)
df = pd.DataFrame(all_data)
print(df)
# Adds delay to handle rate limits and avoid API blocking
Output:
id title userId
1 sunt aut facere 1
2 qui est esse 1
3 ea molestias quasi 1
Unit testing ensures that individual components of code function as expected. pytest is a popular Python testing framework that allows writing simple and readable test cases using plain Python functions. Unit tests can catch bugs early by validating functions or methods in isolation.
# my_math.py
def add(x, y):
    return x + y
# test_my_math.py
from my_math import add
def test_add():
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
    assert add(0, 0) == 0
Output:
Running pytest test_my_math.py will show that all test cases pass if the function works correctly.
Validating DataFrames means checking that data meets specific conditions such as value types, ranges, or missing values. This step ensures that your data is clean and consistent before analysis or modeling. It can be done using simple pandas checks or libraries like pandera.
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', None],
    'age': [25, 30, None]
})
# Check for missing values
missing = df.isnull().sum()
print("Missing values in each column:")
print(missing)
# Check data types
print("Data types:")
print(df.dtypes)
Output:
Displays missing values per column and data types of each column in the DataFrame.
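The same idea extends to the value ranges mentioned above. A small sketch that flags ages outside a plausible range (missing values simply do not match the filter):
invalid_ages = df[(df['age'] < 0) | (df['age'] > 120)] # Rows whose age falls outside 0-120
print("Rows with out-of-range ages:")
print(invalid_ages) # Empty here, since 25 and 30 are both plausible
Output: An empty DataFrame for this sample, or the offending rows if any age is out of range.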
Schema enforcement involves defining and enforcing rules on the structure and types of your data. This is crucial when loading, transforming, or validating structured datasets to ensure consistency. Tools like pandera allow you to declare expected schemas and catch violations automatically.
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema
# Define schema
schema = DataFrameSchema({
    "name": Column(str, nullable=False),
    "age": Column(int, nullable=False)
})
# Sample valid DataFrame
df_valid = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25, 30]
})
# Validate
validated = schema.validate(df_valid)
print("Data validated successfully")
Output:
If the data conforms to the schema, it prints: Data validated successfully. Otherwise, it raises a schema error.
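To see what a violation looks like, validate a frame that breaks the schema and catch the error; a small sketch using the same schema:
df_invalid = pd.DataFrame({"name": ["Carol"], "age": ["thirty"]}) # 'age' has the wrong type
try:
    schema.validate(df_invalid) # Raises because age is not an integer
except pa.errors.SchemaError as err:
    print("Schema violation:", err)
Output: A message describing which column failed validation and why.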
Custom test cases allow you to create specific tests for edge cases or domain-specific logic in your code. These can be automated using pytest or manually coded to check particular behaviors that aren't covered by general unit tests.
# discount.py
def apply_discount(price, percentage):
    if percentage < 0 or percentage > 100:
        raise ValueError("Percentage must be between 0 and 100")
    return price * (1 - percentage / 100)
# test_discount.py
from discount import apply_discount
import pytest
def test_valid_discount():
    assert apply_discount(100, 20) == 80
def test_invalid_discount():
    with pytest.raises(ValueError):
        apply_discount(100, -10)
Output:
Valid discount returns the correct price, while invalid percentages raise a ValueError as expected.
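pytest's parametrize decorator makes it easy to cover more edge cases without duplicating test code; a small sketch of the same checks:
# test_discount_params.py
import pytest
from discount import apply_discount
@pytest.mark.parametrize("price, pct, expected", [
    (100, 20, 80), # Ordinary discount
    (100, 0, 100), # No discount
    (100, 100, 0), # Full discount
])
def test_apply_discount_cases(price, pct, expected):
    assert apply_discount(price, pct) == expected
Output: Running pytest reports three passing test cases, one per parameter set.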