Scikit-Learn

1. What is scikit-learn?
Scikit-learn is a simple, efficient, and robust Python library for machine learning. It provides algorithms for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
# Import a classifier
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
print(model)  # Output: LogisticRegression()
      

2. History and evolution
Scikit-learn began in 2007 as David Cournapeau's Google Summer of Code project. It is built on top of NumPy and SciPy and later added tight integration with joblib and pandas.
# Check scikit-learn version
import sklearn
print(sklearn.__version__)  # Example Output: '1.4.0'
      

3. Installation and setup
You can install scikit-learn using pip or conda depending on your environment.
# Using pip (command-line)
# pip install scikit-learn

# Using conda (command-line)
# conda install scikit-learn
      

4. Overview of supported algorithms
Scikit-learn supports algorithms for classification (SVM, KNN, Decision Trees), regression (Linear, Ridge, Lasso), clustering (KMeans, DBSCAN), and more.
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
print(model)  # Output: KMeans(n_clusters=3)
      

5. Key features and strengths
Scikit-learn is known for:
  • Consistent API
  • Extensive documentation
  • Integration with NumPy/Pandas
  • Excellent performance
# Feature: Pipelines and preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

pipeline = make_pipeline(StandardScaler(), Ridge())
print(pipeline)
      

6. Comparison with other ML libraries
Compared to TensorFlow or PyTorch, scikit-learn focuses on traditional ML (not deep learning) and provides ready-to-use models with minimal code.
# Scikit-learn is great for small-to-medium datasets
from sklearn.svm import SVC
clf = SVC(kernel='linear')
print(clf)
      

7. Structure of a scikit-learn project
A typical project includes:
  1. Data loading
  2. Preprocessing
  3. Model training
  4. Prediction
  5. Evaluation
# Full structure in short
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
      

8. The fit/predict paradigm
Most models follow the `fit()` and `predict()` methods. First, train on data using `fit()`, then make predictions with `predict()`.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit([[0], [1], [2]], [0, 1, 2])
print(model.predict([[3]]))  # Output: [3.]
      

9. How scikit-learn fits into the ML pipeline
Scikit-learn helps automate the full ML pipeline: preprocessing, feature selection, modeling, evaluation, and even hyperparameter tuning.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
  ('scaler', StandardScaler()),
  ('svm', SVC())
])
print(pipeline)
      

10. Using scikit-learn with other Python tools
Scikit-learn integrates smoothly with pandas, NumPy, Matplotlib, and Jupyter notebooks, making it ideal for prototyping and development.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({'x': [1, 2, 3], 'y': [0, 1, 0]})
model = LogisticRegression()
model.fit(df[['x']], df['y'])
print(model.predict([[2]]))  # Output: [0] or [1]
      

1. Loading datasets
Scikit-learn provides built-in datasets such as Iris, Digits, and Wine for experimentation and learning (the Boston housing dataset was removed in scikit-learn 1.2).
from sklearn.datasets import load_iris
data = load_iris()
print(data.data[:2])  # Show first 2 samples
      

2. Handling missing values
Missing data can be handled using imputation techniques such as mean, median, or constant value filling.
import numpy as np
from sklearn.impute import SimpleImputer

X = [[1, 2], [np.nan, 3], [7, 6]]
imp = SimpleImputer(strategy='mean')
print(imp.fit_transform(X))
      

3. Encoding categorical variables
Convert text labels into numeric form using `OneHotEncoder` or `OrdinalEncoder`.
from sklearn.preprocessing import OneHotEncoder

X = [['red'], ['green'], ['blue']]
encoder = OneHotEncoder()
print(encoder.fit_transform(X).toarray())
      

4. Feature scaling and normalization
Features should be scaled for most ML algorithms. Common techniques are Standardization and Min-Max Scaling.
from sklearn.preprocessing import StandardScaler

X = [[1, 10], [2, 15], [3, 14]]
scaler = StandardScaler()
print(scaler.fit_transform(X))
      

5. Feature binarization and discretization
Binarization converts values to 0/1 based on a threshold. Discretization breaks data into bins.
from sklearn.preprocessing import Binarizer

X = [[1, 5], [3, 2], [4, 0]]
binarizer = Binarizer(threshold=2)
print(binarizer.fit_transform(X))
      

6. Polynomial features
Polynomial features generate interactions and higher-order terms useful in linear models.
from sklearn.preprocessing import PolynomialFeatures

X = [[2, 3]]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))
      

7. Custom transformers
You can create your own transformers by extending `BaseEstimator` and `TransformerMixin`.
from sklearn.base import BaseEstimator, TransformerMixin

class AddOneTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X + 1

import numpy as np
print(AddOneTransformer().fit_transform(np.array([[1], [2], [3]])))
      

8. Pipelines for preprocessing
A pipeline chains multiple preprocessing steps and a model into one object.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

pipeline = Pipeline([
  ('scale', MinMaxScaler()),
  ('model', LinearRegression())
])
print(pipeline)
      

9. ColumnTransformer usage
ColumnTransformer lets you apply different preprocessing to specific columns.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

data = pd.DataFrame({
  'age': [20, 30, 40],
  'gender': ['M', 'F', 'M']
})

ct = ColumnTransformer([
  ('num', StandardScaler(), ['age']),
  ('cat', OneHotEncoder(), ['gender'])
])

print(ct.fit_transform(data))
      

10. Preprocessing best practices
Always split data before preprocessing to avoid leakage. Use Pipelines to ensure consistency in training and testing.
# Good practice: use pipeline + train/test split
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
import numpy as np

X = np.array([[1], [2], [3], [4]])
y = np.array([1, 3, 3, 4])

X_train, X_test, y_train, y_test = train_test_split(X, y)

pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
      

1. Definition and key concepts
Supervised learning uses labeled data to train models to predict outputs from inputs. It includes classification and regression.
# Example: simple supervised model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit([[0], [1], [2]], [0, 1, 2])
print(model.predict([[3]]))  # Output: [3.]
      

2. Regression vs classification
Regression predicts continuous values; classification predicts discrete classes.
# Classification example
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit([[0], [1], [2], [3]], [0, 0, 1, 1])
print(model.predict([[1.5]]))  # Output: [0] or [1]

# Regression example
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit([[0], [1], [2], [3]], [0, 0, 1, 1])
print(reg.predict([[1.5]]))  # Output: float
      

3. Choosing the right algorithm
Choose based on data size, problem type, interpretability needs, and computational efficiency.
# Use DecisionTreeClassifier for non-linear patterns
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit([[0], [1], [2], [3]], [0, 0, 1, 1])
print(clf.predict([[1.5]]))  # Output: [0]
      

4. Model evaluation metrics
Accuracy, precision, recall, F1-score (classification); MAE, MSE, R² (regression).
from sklearn.metrics import accuracy_score
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(accuracy_score(y_true, y_pred))  # Output: 0.75
      

5. Cross-validation basics
Cross-validation splits data into folds to evaluate model stability.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
scores = cross_val_score(SVC(), X, y, cv=5)
print(scores)  # Output: array of 5 scores
      

6. Overfitting and underfitting
Overfitting means too complex (memorizes data); underfitting means too simple (misses patterns).
# Overfitting example with high-degree polynomial
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

import numpy as np
X = np.array([[1], [2], [3], [4]])
y = np.array([3, 6, 7, 10])

model = make_pipeline(PolynomialFeatures(5), LinearRegression())
model.fit(X, y)
print(model.predict([[5]]))  # May overfit badly
      

7. Bias-variance tradeoff
High bias models are too simple; high variance models are too complex. Ideal models balance both.
# Use cross-validation to manage bias-variance
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=3)
print("High variance model scores:", scores)
      

8. Using train_test_split
`train_test_split` is used to split data into training and testing sets for evaluation.
from sklearn.model_selection import train_test_split

X = [[1], [2], [3], [4]]
y = [0, 0, 1, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
print("Train:", X_train, "Test:", X_test)
      

9. Model selection techniques
Techniques include grid search, random search, and cross-validation scoring to find best hyperparameters.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

params = {'C': [0.1, 1, 10]}
grid = GridSearchCV(SVC(), params, cv=2)  # cv must not exceed the number of samples per class
grid.fit([[1], [2], [3], [4]], [0, 0, 1, 1])
print("Best Params:", grid.best_params_)
      

10. Debugging supervised models
Use metrics, residuals, learning curves, and feature importance to diagnose issues in models.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit([[1], [2], [3], [4]], [0, 0, 1, 1])
print("Feature Importances:", model.feature_importances_)
      

1. Logistic regression
Logistic Regression is a linear model for binary classification that outputs probabilities and uses a sigmoid function.
from sklearn.linear_model import LogisticRegression
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
model = LogisticRegression()
model.fit(X, y)
print(model.predict([[1.5]]))  # Output: [0] or [1]
      

2. K-nearest neighbors (KNN)
KNN classifies based on the majority vote of k closest neighbors using distance metrics.
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit([[0], [1], [2], [3]], [0, 0, 1, 1])
print(model.predict([[1.6]]))  # Output: [0] or [1]
      

3. Decision trees
Decision Trees use if-else rules learned from data to classify inputs.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit([[1], [2], [3], [4]], [0, 0, 1, 1])
print(model.predict([[2.5]]))  # Output: [0]
      

4. Random forests
Random Forests combine multiple decision trees for more stable and accurate predictions.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=10)
model.fit([[1], [2], [3], [4]], [0, 0, 1, 1])
print(model.predict([[3.5]]))  # Output: [1]
      

5. Support vector machines (SVM)
SVM finds the hyperplane that best separates classes in a high-dimensional space.
from sklearn.svm import SVC
model = SVC(kernel='linear')
model.fit([[0], [1], [2], [3]], [0, 0, 1, 1])
print(model.predict([[2.5]]))  # Output: [1]
      

6. Naive Bayes
Naive Bayes applies Bayes' Theorem assuming independence between features.
from sklearn.naive_bayes import GaussianNB
X = [[1, 20], [2, 18], [3, 25], [4, 28]]
y = [0, 0, 1, 1]
model = GaussianNB()
model.fit(X, y)
print(model.predict([[2.5, 22]]))  # Output: [0] or [1]
      

7. Gradient boosting
Gradient Boosting builds models sequentially, correcting previous errors using boosting technique.
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit([[1], [2], [3], [4]], [0, 0, 1, 1])
print(model.predict([[3.2]]))  # Output: [1]
      

8. Stochastic gradient descent
SGDClassifier fits linear classifiers with stochastic gradient descent, an optimization method that scales well to very large datasets.
from sklearn.linear_model import SGDClassifier
model = SGDClassifier(loss='log_loss', max_iter=1000)
model.fit([[0], [1], [2], [3]], [0, 0, 1, 1])
print(model.predict([[2.2]]))  # Output: [1]
      

9. Voting classifiers
A Voting Classifier combines multiple classifiers and predicts by majority vote or average probability.
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

clf1 = LogisticRegression()
clf2 = DecisionTreeClassifier()
clf3 = SVC(probability=True)

model = VotingClassifier(estimators=[
  ('lr', clf1), ('dt', clf2), ('svm', clf3)
], voting='soft')

X = [[1], [2], [3], [4]]
y = [0, 0, 1, 1]
model.fit(X, y)
print(model.predict([[2.8]]))  # Output: [1]
      

10. Model evaluation for classification
Classification evaluation uses accuracy, precision, recall, F1-score, and confusion matrix.
from sklearn.metrics import classification_report

y_true = [0, 1, 0, 1]
y_pred = [0, 1, 1, 1]
print(classification_report(y_true, y_pred))
      

1. Linear regression
Linear regression models the relationship between input and output using a straight line.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

model = LinearRegression()
model.fit(X, y)
print(model.predict([[5]]))  # Output: [10.]
      

2. Ridge regression
Ridge regression adds L2 regularization to penalize large coefficients and reduce overfitting.
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)
model.fit(X, y)
print(model.predict([[5]]))  # Output: [~10.]
      

3. Lasso regression
Lasso adds L1 regularization, which can shrink some coefficients to zero for feature selection.
from sklearn.linear_model import Lasso

model = Lasso(alpha=0.1)
model.fit(X, y)
print(model.predict([[5]]))  # Output: [~10.]
      

4. Elastic Net
ElasticNet combines both L1 and L2 regularization for balance between Ridge and Lasso.
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)
print(model.predict([[5]]))  # Output: [~10.]
      

5. Polynomial regression
Polynomial regression fits a curve to the data by generating polynomial features.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[5]]))  # Output: [10.]
      

6. Decision tree regression
Decision tree regression splits the data into regions and fits a constant in each.
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()
model.fit(X, y)
print(model.predict([[5]]))  # Output: [8.] (a single tree cannot extrapolate beyond the training targets)
      

7. Random forest regression
Random forest regression builds many trees and averages their predictions for accuracy.
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X, y)
print(model.predict([[5]]))  # Output: around 7-8 (forests cannot extrapolate beyond training targets)
      

8. Support vector regression (SVR)
SVR tries to fit a line within a margin and ignores errors within that margin.
from sklearn.svm import SVR

model = SVR(kernel='linear')
model.fit(X, y)
print(model.predict([[5]]))  # Output: [~10.]
      

9. Gradient boosting regression
Gradient boosting builds an additive model in a forward stage-wise fashion to minimize error.
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor()
model.fit(X, y)
print(model.predict([[5]]))  # Output: close to 8 (boosted trees cannot extrapolate)
      

10. Model evaluation for regression
Common metrics: MAE, MSE, RMSE, R² score.
from sklearn.metrics import mean_squared_error, r2_score

y_true = [2, 4, 6, 8]
y_pred = [2.1, 4.1, 6, 7.9]

print("MSE:", mean_squared_error(y_true, y_pred))  # Output: small number
print("R² Score:", r2_score(y_true, y_pred))       # Output: close to 1
      

1. Cross-validation techniques
Cross-validation helps estimate model performance by splitting data into multiple training/testing folds.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)  # Output: array of scores from 5 folds
      

2. Grid search with GridSearchCV
GridSearchCV performs an exhaustive search over hyperparameter combinations to find the best.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(), param_grid, cv=3)
grid.fit(X, y)
print("Best Params:", grid.best_params_)
      

3. Randomized search
RandomizedSearchCV samples a fixed number of random parameter combinations for faster tuning.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
import numpy as np

param_dist = {'n_estimators': [10, 50, 100], 'max_depth': [None, 3, 5, 10]}
rand_search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=4, cv=3)
rand_search.fit(X, y)
print(rand_search.best_params_)
      

4. Scoring parameters
Scoring defines the metric used during evaluation such as accuracy, f1, neg_mean_squared_error.
score = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring='accuracy')
print("Accuracy Scores:", score)
      

5. Confusion matrix
A confusion matrix shows true vs predicted classifications in a table format.
from sklearn.metrics import confusion_matrix
y_true = [0, 1, 0, 1]
y_pred = [0, 1, 1, 1]
print(confusion_matrix(y_true, y_pred))
      

6. Precision, recall, F1 score
These are important metrics for classification, especially in imbalanced datasets.
from sklearn.metrics import precision_score, recall_score, f1_score

print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1 Score:", f1_score(y_true, y_pred))
      

7. ROC curve and AUC
The ROC curve plots TPR vs FPR. AUC summarizes its area; higher is better.
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)  # all class probabilities are needed for multi-class AUC
print("AUC Score:", roc_auc_score(y_test, y_prob, multi_class='ovr'))
      

8. Regression metrics (MSE, MAE, R²)
Evaluate regression models using Mean Squared Error, Mean Absolute Error, and R² score.
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

print("MSE:", mean_squared_error(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("R²:", r2_score(y_true, y_pred))
      

9. Learning curves
Learning curves show model performance over increasing training sizes and help detect overfitting.
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5)

plt.plot(train_sizes, np.mean(train_scores, axis=1), label='Train')
plt.plot(train_sizes, np.mean(test_scores, axis=1), label='Validation')
plt.legend()
plt.title("Learning Curve")
plt.xlabel("Training Size")
plt.ylabel("Score")
plt.grid(True)
plt.show()
      

10. Validation curves
Validation curves show performance with respect to changes in a specific hyperparameter.
from sklearn.model_selection import validation_curve

param_range = [0.01, 0.1, 1, 10]
train_scores, test_scores = validation_curve(
    LogisticRegression(max_iter=1000), X, y, param_name="C", param_range=param_range, cv=5)

plt.plot(param_range, np.mean(train_scores, axis=1), label="Train")
plt.plot(param_range, np.mean(test_scores, axis=1), label="Validation")
plt.xscale('log')
plt.legend()
plt.title("Validation Curve for C")
plt.xlabel("C")
plt.ylabel("Score")
plt.grid(True)
plt.show()
      

1. Principal Component Analysis (PCA)
PCA reduces the dimensionality of data by projecting it onto principal components that maximize variance, helping to speed up training and visualize high-dimensional data.
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.title("PCA - Iris Dataset")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid(True)
plt.show()
      

2. t-SNE and UMAP
t-SNE and UMAP are nonlinear dimensionality reduction techniques that are better for preserving local structure and used mainly for visualization.
# t-SNE example
from sklearn.manifold import TSNE

X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y)
plt.title("t-SNE - Iris Dataset")
plt.grid(True)
plt.show()
      

# UMAP example (requires umap-learn)
# pip install umap-learn

import umap
X_umap = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2).fit_transform(X)
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y)
plt.title("UMAP - Iris Dataset")
plt.grid(True)
plt.show()
      

3. Evaluating clustering models
Clustering evaluation can be done using unsupervised metrics like Silhouette Score, Davies-Bouldin index, and if labels are available, Adjusted Rand Index (ARI).
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, adjusted_rand_score

# Clustering on iris
model = KMeans(n_clusters=3, random_state=42)
y_pred = model.fit_predict(X)

print("Silhouette Score:", silhouette_score(X, y_pred))
print("Davies-Bouldin Score:", davies_bouldin_score(X, y_pred))
print("Adjusted Rand Index (with true labels):", adjusted_rand_score(y, y_pred))
      

1. Importance of feature engineering
Feature engineering transforms raw data into meaningful features that improve model performance.
# Example: creating new feature from date
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2020-01-01', '2020-02-01'])})
df['month'] = df['date'].dt.month
print(df)
      

2. Univariate feature selection
Select features using statistical tests like chi-squared for classification tasks.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new[:2])
      

3. Recursive Feature Elimination (RFE)
RFE recursively removes the least important features to select the best subset.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
selector = RFE(model, n_features_to_select=2)
X_rfe = selector.fit_transform(X, y)
print(X_rfe[:2])
      

4. Feature importance with trees
Tree-based models provide built-in feature importance scores.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X, y)
print(model.feature_importances_)
      

5. L1-based selection
Lasso (L1) can eliminate irrelevant features by setting coefficients to zero.
from sklearn.linear_model import Lasso
from sklearn.datasets import load_diabetes  # load_boston was removed in scikit-learn 1.2
import numpy as np

X_reg, y_reg = load_diabetes(return_X_y=True)
model = Lasso(alpha=0.1)
model.fit(X_reg, y_reg)
print(np.round(model.coef_, 2))  # some coefficients shrink to exactly 0
      

6. Dimensionality reduction as selection
Techniques like PCA reduce dimensionality while preserving important information.
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(X_pca[:2])
      

7. Feature correlation analysis
High correlation between features may lead to redundancy and overfitting.
import seaborn as sns
import matplotlib.pyplot as plt

corr = pd.DataFrame(X).corr()
sns.heatmap(corr, annot=True)
plt.title("Feature Correlation Heatmap")
plt.show()
      

8. Mutual information
Mutual information captures non-linear dependencies between features and the target.
from sklearn.feature_selection import mutual_info_classif

mi = mutual_info_classif(X, y)
print(mi)
      

9. Automated feature selection
Libraries like `sklearn` or `mlxtend` support automation of feature selection.
from sklearn.feature_selection import SelectFromModel

model = RandomForestClassifier()
model.fit(X, y)
selector = SelectFromModel(model, prefit=True)
X_sel = selector.transform(X)
print(X_sel[:2])
      

10. Domain-specific feature design
Use domain knowledge to create or modify features that improve model relevance and accuracy.
# Example: BMI from weight/height in a medical dataset
df = pd.DataFrame({'weight_kg': [70, 80], 'height_m': [1.75, 1.8]})
df['BMI'] = df['weight_kg'] / df['height_m']**2
print(df)
      

1. Saving and loading models
You can persist trained models to disk for later use.
from sklearn.linear_model import LogisticRegression
import joblib

model = LogisticRegression()
model.fit([[0], [1]], [0, 1])
joblib.dump(model, 'model.pkl')  # Save
loaded = joblib.load('model.pkl')  # Load
print(loaded.predict([[2]]))
      

2. Serialization with joblib and pickle
`pickle` and `joblib` are both used to serialize models; `joblib` is preferred for large numpy arrays.
import pickle

with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
print(loaded_model.predict([[3]]))
      

3. Building a prediction API
Turn your model into an API to accept input and return predictions.
# Simple API idea (in Flask)
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json['data']
    result = loaded_model.predict([data])
    return jsonify({'prediction': result.tolist()})

# Run with: flask run
      

4. Integrating with Flask
Flask allows model deployment via lightweight HTTP APIs.
# Save this as app.py and run it
from flask import Flask, request, jsonify
app = Flask(__name__)

@app.route('/')
def home():
    return "ML Model API Running"

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json['data']
    result = loaded_model.predict([data])
    return jsonify(result.tolist())
      

5. Streamlit for ML apps
Streamlit enables building interactive web apps for ML models with ease.
# Save as app.py, then run with: streamlit run app.py
import streamlit as st

st.title("Simple ML App")
val = st.number_input("Enter a value:")
if st.button("Predict"):
    result = loaded_model.predict([[val]])
    st.write("Prediction:", result[0])
      

6. Dockerizing your model
Docker packages your app and environment into containers for consistent deployment.
# Dockerfile Example
FROM python:3.9
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt
CMD ["python", "app.py"]
      

7. Deployment on Heroku
Heroku supports deploying Python web apps (Flask/Streamlit) easily via Git.
# Required files: Procfile, requirements.txt
# Procfile content:
web: python app.py
      

8. Using FastAPI
FastAPI is a high-performance Python framework for APIs and ML model deployment.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Input(BaseModel):
    data: list

@app.post("/predict")
def predict(input: Input):
    return {"result": loaded_model.predict([input.data]).tolist()}
      

9. Monitoring model predictions
Log predictions and inputs to track model drift or anomalies.
import logging

logging.basicConfig(filename='predictions.log', level=logging.INFO)
input_val = [2]
pred = loaded_model.predict([input_val])
logging.info(f"Input: {input_val}, Prediction: {pred}")
      

10. Updating and versioning models
Maintain multiple versions of models to allow rollback or experimentation.
# Save models with versioning
joblib.dump(model, 'model_v1.pkl')
joblib.dump(model, 'model_v2.pkl')
# Load specific version
v2 = joblib.load('model_v2.pkl')
      

1. Motivation for pipelines
Pipelines simplify and standardize workflows by chaining steps like preprocessing and modeling.
# Instead of separate fit() calls, use pipelines to reduce errors and code clutter.

2. Pipeline and make_pipeline
Pipelines link preprocessing and model in one object.
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit([[1], [2], [3], [4]], [0, 0, 1, 1])
print(pipe.predict([[1.5]]))

3. Combining preprocessing and modeling
Helps ensure all preprocessing happens in both training and prediction phases.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
pipe.fit([[1], [2], [3], [4]], [0, 0, 1, 1])

4. Nesting pipelines
Pipelines can be nested to handle sub-tasks or grouped logic.
# Useful when combining multiple column pipelines using ColumnTransformer
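A minimal sketch of that idea (column names 'age' and 'gender' are illustrative): a numeric sub-pipeline nested inside a ColumnTransformer, which is itself a step of an outer Pipeline.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

num_pipe = Pipeline([('impute', SimpleImputer()), ('scale', StandardScaler())])  # nested pipeline
pre = ColumnTransformer([
    ('num', num_pipe, ['age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender'])
])
outer = Pipeline([('pre', pre), ('clf', LogisticRegression())])

df = pd.DataFrame({'age': [20, 30, 40, 50], 'gender': ['M', 'F', 'M', 'F']})
outer.fit(df, [0, 0, 1, 1])
print(outer.predict(df[:1]))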

5. Using FeatureUnion
FeatureUnion combines multiple transformer outputs.
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

combined = FeatureUnion([("pca", PCA(n_components=2)), ("scale", StandardScaler())])

6. Integrating GridSearchCV with pipelines
Use full pipeline in GridSearchCV to tune both preprocessing and model hyperparameters.
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(pipe, {'clf__C': [0.1, 1, 10]}, cv=2)  # 'clf' is the step name from the Pipeline above
grid.fit([[1], [2], [3], [4]], [0, 0, 1, 1])

7. Saving pipeline objects
Pipelines can be saved and reused just like models.
import joblib
joblib.dump(pipe, 'pipeline.pkl')

8. Debugging pipelines
You can inspect steps with `named_steps` or break steps individually.
print(pipe.named_steps['clf'])

9. Custom pipeline components
Build your own transformers with `TransformerMixin` and `BaseEstimator`.
from sklearn.base import BaseEstimator, TransformerMixin

class MultiplyByTwo(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None): return self
    def transform(self, X): return X * 2

10. Best practices
Always include preprocessing in the pipeline, avoid data leakage, and use GridSearch with pipelines.

1. Text preprocessing basics
Lowercasing, removing punctuation, and tokenizing.
import re
text = "Hello, World!"
cleaned = re.sub(r'[^\w\s]', '', text.lower())
print(cleaned)

2. Bag-of-words model
Represents text using word frequencies.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(["I love AI", "AI loves me"])
print(X.toarray())

3. TF-IDF vectorization
Gives importance to rare but informative words.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(["I love AI", "AI loves me"])
print(X.toarray())

4. CountVectorizer and TfidfVectorizer
Both turn text into numerical features for ML.
cv = CountVectorizer()
tfidf = TfidfVectorizer()

5. Text classification pipeline
Chain vectorizer + classifier in a pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB

pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(["spam spam", "ham email"], [1, 0])
print(pipe.predict(["spam now"]))

6. Naive Bayes for text
Great for classifying short texts using word frequency.
model = MultinomialNB()

7. Text clustering
Use KMeans on vectorized text.
from sklearn.cluster import KMeans
X = TfidfVectorizer().fit_transform(["dog cat", "apple orange", "cat dog", "fruit banana"])
model = KMeans(n_clusters=2).fit(X)
print(model.labels_)

8. Dimensionality reduction for text
Use TruncatedSVD for sparse text matrices.
from sklearn.decomposition import TruncatedSVD

X = TfidfVectorizer().fit_transform(["text one", "text two"])
svd = TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X)
print(X_reduced)

9. Handling n-grams and stopwords
Capture word sequences, remove unhelpful common words.
vectorizer = CountVectorizer(ngram_range=(1,2), stop_words='english')
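Fitting this vectorizer on a short corpus shows the combined unigram/bigram vocabulary after English stopwords are removed (a minimal illustration):
X_ngrams = vectorizer.fit_transform(["the cat sat", "the cat ran fast"])
print(vectorizer.get_feature_names_out())  # e.g. ['cat', 'cat ran', 'cat sat', 'fast', 'ran', 'ran fast', 'sat']
print(X_ngrams.toarray())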

10. Sentiment analysis case study
Train model on text labeled positive or negative.
texts = ["I love this!", "I hate that."]
labels = [1, 0]
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(texts, labels)
print(pipe.predict(["love it"]))

1. Identifying imbalance
Check target class distribution.
import numpy as np
y = [0]*90 + [1]*10
print("Class 0:", y.count(0), "Class 1:", y.count(1))

2. Oversampling (SMOTE)
SMOTE creates synthetic samples of minority class.
from imblearn.over_sampling import SMOTE

X = [[i] for i in range(100)]
y = [0]*90 + [1]*10
X_res, y_res = SMOTE().fit_resample(X, y)
print("New class distribution:", {i: y_res.count(i) for i in set(y_res)})

3. Undersampling techniques
Remove samples from the majority class.
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()
X_res, y_res = rus.fit_resample(X, y)
print("Resampled:", len(y_res))

4. Class weighting strategies
Give higher weight to minority class in models.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight='balanced')

5. Evaluation metrics for imbalance
Use precision, recall, f1-score instead of accuracy.
from sklearn.metrics import classification_report

y_true = [0]*95 + [1]*5
y_pred = [0]*90 + [1]*10
print(classification_report(y_true, y_pred))

6. Using imbalanced-learn with scikit-learn
`imblearn` works seamlessly with pipelines.
from imblearn.pipeline import Pipeline

pipe = Pipeline([('smote', SMOTE()), ('clf', LogisticRegression())])
pipe.fit(X, y)

7. Synthetic data generation
SMOTE is one form; others include ADASYN or GANs.
# See: imblearn.over_sampling.ADASYN
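A minimal ADASYN sketch on the toy X, y defined in topic 2 (requires imbalanced-learn; generated counts are approximate, not exactly balanced):
from imblearn.over_sampling import ADASYN
from collections import Counter

X_ada, y_ada = ADASYN().fit_resample(X, y)
print("After ADASYN:", Counter(y_ada))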

8. Ensemble methods for imbalance
BalancedBaggingClassifier uses undersampled data in each estimator.
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier

model = BalancedBaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10)  # 'estimator' replaced 'base_estimator' in recent imbalanced-learn

9. Threshold moving
Change decision threshold to favor minority prediction.
model.fit(X_res, y_res)
probs = model.predict_proba(X_res)[:, 1]
y_pred = [1 if p > 0.3 else 0 for p in probs]  # a lower threshold favors the minority class

10. Real-world example
Credit fraud detection often uses oversampling + ensemble + recall focus.
# Combine SMOTE + LogisticRegression + F1-score on fraud dataset.

1. What are AI Agents?
AI Agents are autonomous programs that perceive their environment and act to achieve goals.
# Simple agent example: a chatbot responding to user input
def agent(input_text):
    if "hello" in input_text.lower():
        return "Hello! How can I help you?"
    return "I don't understand."

print(agent("Hello there!"))

2. Types of AI Agents
Agents can be simple reflex, model-based, goal-based, or utility-based.
# Example: Reflex agent responds to keywords
def reflex_agent(input_text):
    if "weather" in input_text:
        return "Check the weather app."
    return "Sorry, I can't help."

print(reflex_agent("What's the weather?"))

3. Automation with AI Agents
Automation lets agents perform repetitive tasks, e.g., email filtering.
# Email filter example
emails = ["Buy now!", "Meeting at 3pm", "Cheap meds"]
filtered = [e for e in emails if "buy" not in e.lower()]
print(filtered)

4. Multi-agent systems
Systems with multiple interacting agents working cooperatively or competitively.
# Simplified simulation: two agents competing for a resource
agent1 = {"energy": 5}
agent2 = {"energy": 3}

def compete(a1, a2):
    winner = "Agent1" if a1['energy'] > a2['energy'] else "Agent2"
    return winner

print(compete(agent1, agent2))

5. Reinforcement Learning Agents
Agents learn optimal actions through trial and error interacting with an environment.
import gym
env = gym.make('CartPole-v1')
obs = env.reset()  # note: Gym >= 0.26 and Gymnasium return (obs, info) instead
print("Initial observation:", obs)
env.close()

6. Task automation examples
Examples include scheduling, customer service chatbots, and recommendation engines.
# Simple scheduler example
def schedule_task(task, time):
    print(f"Scheduled '{task}' at {time}")

schedule_task("Backup", "02:00 AM")

7. Natural Language Interfaces
Agents interact using human language for commands and responses.
# Basic command parser
def parse_command(text):
    if "turn on light" in text.lower():
        return "Lights turned on"
    return "Command not recognized"

print(parse_command("Please turn on light"))

8. Robotics and AI Agents
Robots use AI agents to perceive and act in physical environments.
# Pseudo-code for robot navigation
def move_robot(direction):
    print(f"Moving {direction}")

move_robot("forward")

9. Challenges in AI automation
Challenges include safety, ethics, and unpredictability of autonomous decisions.
# No direct code, but always consider ethical constraints!

10. Future of AI Agents
Increasingly adaptive, collaborative, and context-aware agents are expected.
# Imagine AI agents that learn from each other and adapt dynamically.

1. NumPy and pandas integration
scikit-learn works smoothly with NumPy arrays and pandas DataFrames.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

X = pd.DataFrame({'feature': [1, 2, 3, 4]})
y = np.array([2, 4, 6, 8])

model = LinearRegression()
model.fit(X, y)
print(model.predict(pd.DataFrame({'feature': [5]})))

2. Visualization with matplotlib and seaborn
Visualize data and model results using matplotlib and seaborn.
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
data = sns.load_dataset('iris')
sns.pairplot(data, hue='species')
plt.show()

3. Using XGBoost with scikit-learn API
XGBoost integrates with sklearn interface for easy model training.
from xgboost import XGBClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = XGBClassifier(eval_metric='mlogloss')  # use_label_encoder is no longer needed in recent XGBoost
model.fit(X, y)
print(model.predict(X[:2]))

4. LightGBM integration
LightGBM is a fast gradient boosting library with sklearn API.
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
model = lgb.LGBMClassifier()
model.fit(X, y)
print(model.predict(X[:2]))

5. Using CatBoost
CatBoost handles categorical features natively and integrates with sklearn.
from catboost import CatBoostClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = CatBoostClassifier(verbose=0)
model.fit(X, y)
print(model.predict(X[:2]))

6. Hyperopt and Optuna for tuning
These libraries provide efficient hyperparameter tuning using Bayesian optimization.
# Example Optuna tuning skeleton:
import optuna
from sklearn.linear_model import LogisticRegression

def objective(trial):
    param = {'C': trial.suggest_float('C', 1e-3, 10, log=True)}  # suggest_loguniform is deprecated
    model = LogisticRegression(max_iter=1000, **param)
    model.fit(X, y)
    return model.score(X, y)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)
print(study.best_params)

7. sklearn + MLflow
MLflow tracks experiments and model versions.
import mlflow
import mlflow.sklearn

with mlflow.start_run():
    model.fit(X, y)
    mlflow.sklearn.log_model(model, "model")

8. Integration with Dask
Dask enables scalable ML with scikit-learn on big data.
import dask.array as da
X = da.random.random((1000, 10), chunks=(100, 10))
# Train models with dask-ml wrappers

9. Interfacing with TensorFlow/Keras
Use sklearn wrappers or converters to integrate deep learning models.
# The tf.keras scikit_learn wrapper was removed; use the scikeras package instead (pip install scikeras)
from scikeras.wrappers import KerasClassifier

def create_model():
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    model = Sequential([Dense(10, input_shape=(4,), activation='relu'), Dense(3, activation='softmax')])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

keras_model = KerasClassifier(model=create_model, epochs=5, batch_size=10)
keras_model.fit(X, y)  # assumes X has 4 features and y has 3 classes (e.g. iris)

10. Exporting models (ONNX, PMML)
Export models to interoperable formats like ONNX and PMML for deployment.
import skl2onnx
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

initial_type = [('float_input', FloatTensorType([None, X.shape[1]]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)  # model must be a fitted scikit-learn estimator
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

1. Predicting house prices
Use regression to predict house prices from features.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print(model.predict([X_test[0]]))

2. Customer churn prediction
Classification model to predict if a customer will leave.
# Use logistic regression on customer data (pseudo-example)
# X, y = customer_features, churn_labels
# model = LogisticRegression().fit(X, y)
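A hedged sketch with synthetic data standing in for real customer records (make_classification is only a placeholder for your churn dataset):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X_churn, y_churn = make_classification(n_samples=500, n_features=8, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_churn, y_churn, random_state=0)
churn_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, churn_model.predict(X_te)))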

3. Credit scoring model
Predict credit risk using classification algorithms.
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
print(model.predict(X[:2]))

4. Spam email classification
Classify emails as spam or not using text processing and Naive Bayes.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["Buy cheap meds", "Meeting at 10", "Cheap meds now", "Lunch tomorrow"]
labels = [1, 0, 1, 0]
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(texts, labels)
print(pipe.predict(["Cheap meds offer"]))

5. Image classification using PCA
Reduce dimensionality with PCA and classify images.
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

digits = load_digits()
pca = PCA(n_components=30)
X_pca = pca.fit_transform(digits.data)
model = LogisticRegression(max_iter=1000)
model.fit(X_pca, digits.target)
print(model.predict(X_pca[:1]))

6. Customer segmentation with K-means
Group customers into segments based on features.
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [5, 8], [8, 8]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)

7. Movie recommendation engine
Collaborative filtering or content-based filtering to suggest movies.
# Simple example: recommend movies based on user ratings similarity (pseudo-code)
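A tiny illustrative sketch of item-based similarity on a made-up user-by-movie rating matrix (cosine similarity between movie columns):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# rows = users, columns = movies (toy ratings, 0 = not rated)
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]])
movie_sim = cosine_similarity(ratings.T)  # similarity between movie columns
print("Most similar to movie 0:", movie_sim[0].argsort()[::-1][1])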

8. Time series forecasting (hybrid)
Combine models like ARIMA and machine learning for forecasting.
# Use statsmodels for ARIMA and sklearn for ML models on residuals.
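A hedged hybrid sketch on a toy series (requires statsmodels; illustrative only): ARIMA models the series, and a small ML model corrects using the residuals.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.linear_model import LinearRegression

series = np.array([10., 12, 13, 15, 16, 18, 20, 21, 23, 25])
arima_res = ARIMA(series, order=(1, 0, 0)).fit()   # time-series part
resid = arima_res.resid                            # what ARIMA did not explain

X_lag = resid[:-1].reshape(-1, 1)                  # lag-1 residual as the ML feature
ml = LinearRegression().fit(X_lag, resid[1:])

hybrid_forecast = arima_res.forecast(steps=1)[0] + ml.predict([[resid[-1]]])[0]
print("Hybrid forecast:", hybrid_forecast)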

9. Fraud detection pipeline
Use oversampling, classification, and monitoring to detect fraud.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# X, y = fraud_data_features, fraud_labels
# X_res, y_res = SMOTE().fit_resample(X, y)
# model = RandomForestClassifier().fit(X_res, y_res)
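A runnable sketch with synthetic imbalanced data standing in for fraud records (requires imbalanced-learn); recall on the minority class is the metric of interest:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from imblearn.over_sampling import SMOTE

X_f, y_f = make_classification(n_samples=2000, n_features=10, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_f, y_f, stratify=y_f, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)   # oversample only the training set
clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
print("Fraud recall:", recall_score(y_te, clf.predict(X_te)))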

10. End-to-end ML project checklist
- Data collection
- Data cleaning
- Feature engineering
- Model selection
- Training & validation
- Hyperparameter tuning
- Deployment
- Monitoring & maintenance

1. Time series vs traditional ML
Time series data has temporal order, requiring specialized methods unlike traditional independent samples.
# Example: stock prices over time vs random tabular data

2. Feature engineering for time data
Create features like lag, rolling averages, and date parts.
import pandas as pd

df = pd.DataFrame({'value': [1,2,3,4,5]})
df['lag_1'] = df['value'].shift(1)
df['rolling_mean_2'] = df['value'].rolling(window=2).mean()
print(df)

3. Sliding window technique
Use windows of past observations as input features for forecasting.
# Convert series into supervised learning samples with windows
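For instance, a window of the last 2 observations can serve as the features for predicting the next value (toy series):
import numpy as np

series = np.array([1, 2, 3, 4, 5, 6])
window = 2
X_win = np.array([series[i:i + window] for i in range(len(series) - window)])
y_win = series[window:]
print(X_win)  # each row: the previous 2 values
print(y_win)  # the value to predict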

4. Lag features and rolling stats
Lag features represent past values, rolling stats smooth data.
# See example above with lag and rolling_mean

5. Time-aware train-test splits
Split data respecting temporal order to avoid leakage.
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(df):
    print("TRAIN:", train_index, "TEST:", test_index)

6. Using pipelines for time series
Combine feature engineering and models while preserving order.
# Pipelines can include custom transformers for lag features

7. Forecasting with regression models
Use regression on lagged features to forecast.
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3]])
y = np.array([2, 3, 4])
model = LinearRegression().fit(X, y)
print(model.predict([[4]]))

8. Autocorrelation analysis
Analyze correlations between current and past values.
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

autocorrelation_plot(df['value'])
plt.show()

9. Cross-validation for time series
Use time series split rather than random splits.
# See TimeSeriesSplit example above

10. Case study: sales forecasting
Combine lag features, rolling stats, and regression to forecast sales.
# Practical example would combine all above techniques on sales data
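A compact hedged sketch combining a lag feature, a rolling mean, and linear regression on made-up monthly sales figures:
import pandas as pd
from sklearn.linear_model import LinearRegression

sales = pd.DataFrame({'sales': [100, 120, 130, 150, 160, 180, 190, 210]})
sales['lag_1'] = sales['sales'].shift(1)
sales['rolling_mean_2'] = sales['sales'].shift(1).rolling(window=2).mean()
train = sales.dropna()

reg = LinearRegression().fit(train[['lag_1', 'rolling_mean_2']], train['sales'])
next_features = pd.DataFrame(
    [[sales['sales'].iloc[-1], sales['sales'].iloc[-2:].mean()]],
    columns=['lag_1', 'rolling_mean_2'])
print("Next-period forecast:", reg.predict(next_features)[0])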

1. One-hot encoding
Convert categories into binary columns.
from sklearn.preprocessing import OneHotEncoder
import numpy as np

data = np.array([['red'], ['green'], ['blue']])
encoder = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
print(encoder.fit_transform(data))

2. Ordinal encoding
Map categories to integers preserving order.
from sklearn.preprocessing import OrdinalEncoder

data = np.array([['low'], ['medium'], ['high']])
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
print(encoder.fit_transform(data))

3. Binary encoding
Encode categories as binary digits, reducing dimensionality.
# Use category_encoders library
# import category_encoders as ce
# encoder = ce.BinaryEncoder(cols=['feature'])
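A hedged sketch using the category_encoders package (pip install category_encoders); the column name 'color' is illustrative:
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})
encoder = ce.BinaryEncoder(cols=['color'])
print(encoder.fit_transform(df))  # each category becomes a short binary code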

4. Target encoding
Replace categories with mean target value.
# Use category_encoders or custom implementation
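One option, assuming scikit-learn 1.3 or newer, is the built-in TargetEncoder (toy data, illustrative only):
import numpy as np
from sklearn.preprocessing import TargetEncoder

X_cat = np.array([['a'], ['a'], ['b'], ['b'], ['a'], ['b']])
y_t = np.array([1, 0, 1, 1, 1, 0])
enc = TargetEncoder(cv=2)  # cross-fitting guards against target leakage
print(enc.fit_transform(X_cat, y_t))  # categories replaced by smoothed mean target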

5. Hashing trick
Map categories to fixed-length vectors using hashing to handle large cardinality.
from sklearn.feature_extraction import FeatureHasher

data = [{'cat': 'dog'}, {'cat': 'fish'}, {'cat': 'dog'}]
hasher = FeatureHasher(input_type='dict')
print(hasher.transform(data).toarray())

6. High cardinality issues
Too many categories can cause overfitting and slow models.
# Consider grouping rare categories or using hashing
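One common mitigation is to lump rare categories into an 'other' bucket before encoding (pandas sketch):
import pandas as pd

s = pd.Series(['a', 'a', 'a', 'b', 'b', 'c', 'd'])
counts = s.value_counts()
rare = counts[counts < 2].index              # categories seen fewer than 2 times
s_grouped = s.where(~s.isin(rare), 'other')
print(s_grouped.value_counts())
# scikit-learn's OneHotEncoder(min_frequency=...) offers a similar built-in option (1.1+)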

7. Encoding in pipelines
Include encoding as preprocessing step in sklearn pipelines.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ('clf', LogisticRegression())
])

8. Handling mixed data types
Use ColumnTransformer to process numerical and categorical separately.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['num_feature']),
        ('cat', OneHotEncoder(), ['cat_feature'])
    ])

9. Encoding best practices
Avoid data leakage, handle unknown categories gracefully, and choose encoding based on model.
# Example: OneHotEncoder(handle_unknown='ignore')

10. Visualizing encoded features
Visualize using bar plots or PCA.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = np.array([['red'], ['green'], ['blue'], ['red']])
encoded = OneHotEncoder(sparse_output=False).fit_transform(data)
plt.bar(range(encoded.shape[1]), encoded.sum(axis=0))  # count per one-hot column
plt.title("Category counts after one-hot encoding")
plt.show()

1. How decision trees work
Trees split data recursively by feature thresholds to classify or regress.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [1, 1]]
y = [0, 1]
model = DecisionTreeClassifier()
model.fit(X, y)
print(model.predict([[2, 2]]))

2. Gini impurity vs entropy
Criteria to measure split quality: Gini impurity favors purity; entropy measures disorder.
# Specify criterion in DecisionTreeClassifier(criterion='gini' or 'entropy')
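A minimal side-by-side comparison using the toy X, y defined above:
gini_tree = DecisionTreeClassifier(criterion='gini').fit(X, y)
entropy_tree = DecisionTreeClassifier(criterion='entropy').fit(X, y)
print(gini_tree.predict([[2, 2]]), entropy_tree.predict([[2, 2]]))  # same result on this tiny dataset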

3. Tree depth and pruning
Limit tree depth or prune to prevent overfitting.
model = DecisionTreeClassifier(max_depth=3)
model.fit(X, y)

4. Visualizing decision boundaries
Plot how the tree splits the feature space.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plot_tree(model)
plt.show()

5. Feature importance
Trees provide feature importance scores.
print(model.feature_importances_)

6. Handling overfitting
Use pruning, max depth, min samples split to reduce overfitting.
model = DecisionTreeClassifier(max_depth=5, min_samples_split=4)
model.fit(X, y)  # fit here so the rules can be printed in the next step

7. Interpreting splits
Understand decisions based on feature thresholds.
# Use export_text to print tree rules
from sklearn.tree import export_text
print(export_text(model))

8. Categorical splits
Handle categorical variables by encoding or using special tree implementations.
# sklearn requires categorical features encoded
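One hedged option is to one-hot encode categorical inputs inside a pipeline before the tree (toy data, illustrative only):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X_cat = [['red'], ['green'], ['red'], ['blue']]
y_cat = [1, 0, 1, 0]
tree_pipe = make_pipeline(OneHotEncoder(handle_unknown='ignore'), DecisionTreeClassifier())
tree_pipe.fit(X_cat, y_cat)
print(tree_pipe.predict([['green']]))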

9. Cost-complexity pruning
Balance tree size and accuracy by pruning low importance branches.
path = model.cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas

10. Real-world example: risk scoring
Use decision trees to score customer credit risk.
# Train tree on credit data, interpret splits for risk factors

1. Why interpretability matters
Understanding model decisions builds trust and helps debug.
# Interpretability crucial in sensitive fields like healthcare.

2. Coefficients and feature importance
Linear models have coefficients; tree-based models have feature importances.
print(model.coef_)  # For linear models
print(rf.feature_importances_)  # For tree ensembles

3. Partial dependence plots
Show marginal effect of features on predictions.
from sklearn.inspection import PartialDependenceDisplay  # plot_partial_dependence was removed in scikit-learn 1.2
import matplotlib.pyplot as plt

PartialDependenceDisplay.from_estimator(rf, X, [0, 1])
plt.show()

4. Permutation importance
Measure feature importance by random shuffling and evaluating drop in performance.
from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X, y, n_repeats=10)
print(result.importances_mean)

5. SHAP values overview
SHAP explains output of any model with additive feature attribution.
import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)

6. LIME basics
Local interpretable model-agnostic explanations explain individual predictions.
# Use lime package: lime.lime_tabular.LimeTabularExplainer
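A hedged sketch with the lime package (pip install lime); assumes rf is a fitted classifier and X its training array, as used throughout this chapter:
import numpy as np
import lime.lime_tabular

explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X), mode='classification')
exp = explainer.explain_instance(np.array(X)[0], rf.predict_proba, num_features=4)
print(exp.as_list())  # feature contributions for this single prediction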

7. Decision path analysis
Trace how samples traverse decision trees.
node_indicator, n_nodes_ptr = rf.decision_path(X)  # a forest also returns pointers into the stacked matrix
print(node_indicator.shape)

8. Tree visualization
Visualize tree structure and splits.
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plot_tree(rf.estimators_[0])
plt.show()

9. Interpreting pipeline outputs
Understand transformations and model predictions combined.
# Use pipeline.named_steps to access individual components
print(pipe.named_steps['clf'].coef_)

10. Fairness and transparency
Ensure models do not discriminate and provide transparent results.
# Evaluate bias and fairness metrics during model validation.

1. StandardScaler
Standardizes features to mean=0 and variance=1.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

2. MinMaxScaler
Scales features to a fixed range, typically 0 to 1.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

3. RobustScaler
Scales features using median and IQR, robust to outliers.
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

4. QuantileTransformer
Transforms features to follow a uniform or normal distribution.
from sklearn.preprocessing import QuantileTransformer
scaler = QuantileTransformer(output_distribution='normal')
X_scaled = scaler.fit_transform(X)

5. PowerTransformer
Applies power transforms like Box-Cox to stabilize variance.
from sklearn.preprocessing import PowerTransformer
scaler = PowerTransformer()
X_scaled = scaler.fit_transform(X)

6. Normalizer vs scaler
Normalizer scales samples individually to unit norm, scalers adjust features.
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
X_normalized = normalizer.fit_transform(X)

7. Scaling sparse data
Use scalers that support sparse matrices to avoid dense conversion.
# StandardScaler supports sparse data with with_mean=False
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X_sparse)  # X_sparse: a scipy.sparse matrix

8. Applying scalers in pipelines
Include scalers as steps in sklearn pipelines.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
pipe.fit(X, y)

9. Custom scalers
Create your own scalers by extending TransformerMixin.
from sklearn.base import TransformerMixin

class CustomScaler(TransformerMixin):
    def fit(self, X, y=None): return self
    def transform(self, X): return X / 100

10. Best scaling practices
Fit scalers on training data only, avoid data leakage, and choose scaler per model needs.
# Always call fit on training set, transform on train/test

1. Limitations of in-memory ML
Traditional ML models assume data fits into memory, which becomes problematic with large datasets.

2. Using partial_fit
Scikit-learn supports incremental learning via `partial_fit`, especially for models like SGD.
from sklearn.linear_model import SGDClassifier
model = SGDClassifier()
for X_batch, y_batch in stream_batches():  # stream_batches(): a placeholder for your own batch generator
    model.partial_fit(X_batch, y_batch, classes=[0, 1])

3. Online learning algorithms
Algorithms like PassiveAggressive and Perceptron support learning one sample at a time.
from sklearn.linear_model import PassiveAggressiveClassifier
model = PassiveAggressiveClassifier()
model.partial_fit(X_train, y_train, classes=[0, 1])

4. Batching and sampling
Instead of processing entire data at once, break it into batches.
import pandas as pd
for chunk in pd.read_csv("large_dataset.csv", chunksize=10000):
    process(chunk)

5. Sparse matrices
Efficiently store data with many zeroes using sparse matrices to save memory.
from scipy.sparse import csr_matrix
sparse_data = csr_matrix([[0, 0, 3], [4, 0, 0]])

6. Dask integration
Dask extends Pandas to work with larger-than-memory datasets and parallel computation.
import dask.dataframe as dd
df = dd.read_csv("large_dataset.csv")
print(df.head())

7. Feature hashing for scalability
Use fixed-size hash representations instead of expanding large categorical features.
from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(n_features=8, input_type='string')
features = hasher.transform([["dog"], ["cat"], ["fish"]])

8. Memory-efficient pipelines
Design pipelines to process chunks and use models that support streaming.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler(with_mean=False)),
    ('clf', SGDClassifier())
])
# Feed batches one by one using partial_fit

9. Out-of-core preprocessing
Apply transformations like scaling in batches instead of fitting on entire data.
for chunk in pd.read_csv("large.csv", chunksize=5000):
    scaled = scaler.transform(chunk)    # scaler fitted earlier on an initial sample
    model.partial_fit(scaled, labels)   # labels: the targets for this chunk (placeholder)

10. Dataset reduction techniques
Reduce dimensionality or remove unnecessary features to fit memory constraints.
from sklearn.decomposition import PCA
pca = PCA(n_components=20)
X_reduced = pca.fit_transform(X)