Machine Learning Interview Questions


Beginners To Experts




Machine Learning is a subset of Artificial Intelligence that allows systems to learn from data and improve performance without being explicitly programmed.

<!-- Example -->
<p>Machine Learning helps email systems to filter spam messages based on data patterns.</p>

The main types of Machine Learning are Supervised Learning, Unsupervised Learning, Semi-supervised Learning, and Reinforcement Learning.

<!-- Example -->
<ul>
  <li>Supervised Learning</li>
  <li>Unsupervised Learning</li>
  <li>Semi-supervised Learning</li>
  <li>Reinforcement Learning</li>
</ul>

Supervised Learning is a type of ML where the model is trained on labeled data.

<!-- Example -->
<p>Training a model to predict house prices using past housing data with known prices.</p>

Unsupervised Learning is a type of ML where the model is trained on unlabeled data to find hidden patterns.

<!-- Example -->
<p>Clustering customers based on purchasing behavior without knowing their buying intent.</p>
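
As a minimal sketch of unsupervised learning in code, the example below clusters made-up customer data with k-means (the feature values are purely illustrative):

# Example (illustrative sketch): k-means clustering with scikit-learn
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical purchasing data: [items_per_month, average_spend]
X = np.array([[2, 20], [3, 25], [25, 300], [30, 340], [4, 22], [28, 310]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(labels)  # cluster assignment for each customer, learned without labels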

Reinforcement Learning is a type of ML where an agent learns to make decisions by receiving rewards or penalties.

<!-- Example -->
<p>Teaching a robot to walk by rewarding successful steps and penalizing falls.</p>

AI is a broader concept where machines simulate human intelligence. ML is a subset of AI focused on learning from data.

<!-- Example -->
<p>AI includes ML, natural language processing, robotics, etc. ML specifically deals with data-driven learning.</p>

Overfitting is when a model learns the training data too well and performs poorly on new data.

<!-- Example -->
<p>A model predicting stock prices perfectly on past data but failing on new unseen data.</p>

Underfitting is when a model is too simple to learn the underlying pattern in the data.

<!-- Example -->
<p>Using a linear model for a complex dataset with curves and interactions.</p>

A training set is the portion of the dataset used to train a model.

<!-- Example -->
<p>70% of the data used for training a spam detection algorithm.</p>

A test set is used to evaluate the performance of a trained model on unseen data.

<!-- Example -->
<p>Using 30% of the total dataset for testing after training the model.</p>

Cross-validation is a technique to assess model performance by splitting the data into multiple subsets and training/testing on different combinations.

<!-- Example -->
<p>K-fold cross-validation divides the data into K parts and rotates training/testing on each fold.</p>

A feature is an individual measurable property or characteristic of a data point.

<!-- Example -->
<p>In a dataset about houses, features include square footage, number of rooms, and location.</p>

A label is the output or target value associated with each training example.

<!-- Example -->
<p>For a spam classifier, labels are "spam" or "not spam" tags for each email.</p>

A model is a function or algorithm that makes predictions based on input data.

<!-- Example -->
<p>A decision tree model classifies whether a loan should be approved based on applicant details.</p>

Linear regression is a supervised learning algorithm that models the relationship between input features and a continuous output.

<!-- Example -->
<p>Predicting house price based on size and number of rooms using a best-fit line.</p>
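
A minimal scikit-learn sketch of this idea, using made-up sizes and prices:

# Example (illustrative sketch): linear regression on hypothetical housing data
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1000], [1500], [2000], [2500]])  # size in square feet (hypothetical)
y = np.array([200000, 280000, 360000, 440000])  # sale price (hypothetical)

model = LinearRegression()
model.fit(X, y)
print(model.predict([[1800]]))  # predicted price for an 1800 sq ft house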

Logistic regression is a classification algorithm used to predict the probability of a binary outcome.
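
A minimal scikit-learn sketch on a toy binary dataset, showing how logistic regression outputs class probabilities:

# Example (illustrative sketch): logistic regression probabilities
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[1], [2], [3], [10], [11], [12]])  # toy feature values
y = np.array([0, 0, 0, 1, 1, 1])                 # binary labels

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict_proba([[6]]))  # probability of class 0 and class 1 for a new point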

Regression predicts continuous values, while classification predicts discrete labels or categories.

A decision tree is a flowchart-like tree structure where internal nodes represent tests on features, and leaf nodes represent output labels.
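
A small sketch of a decision tree classifier trained on the Iris dataset with scikit-learn:

# Example (illustrative sketch): decision tree on the Iris dataset
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(iris.data, iris.target)
print(tree.predict(iris.data[:5]))  # predicted classes for the first five samples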

A random forest is an ensemble method that combines multiple decision trees to improve performance and reduce overfitting.

Overfitting occurs when a model learns the noise in training data. Prevent it using techniques like cross-validation, regularization, and pruning.

Underfitting happens when a model is too simple to capture the data pattern. Fix it by increasing model complexity or training longer.

A confusion matrix is a table used to evaluate classification models by comparing predicted and actual values.

Precision is the ratio of true positives to total predicted positives; recall is the ratio of true positives to actual positives.

The F1-score is the harmonic mean of precision and recall. It balances both metrics, which is especially useful for imbalanced datasets.
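
A small sketch computing these metrics with scikit-learn on hand-made labels and predictions:

# Example (illustrative sketch): precision, recall, and F1 on toy labels
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall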

Bias refers to the error due to overly simplistic assumptions in the model. High bias can cause underfitting.

Variance refers to the model’s sensitivity to small fluctuations in training data. High variance can lead to overfitting.

The bias-variance tradeoff is the balance between underfitting (high bias) and overfitting (high variance).

Gradient descent is an optimization algorithm that updates model weights to minimize the loss function by moving in the direction of steepest descent.

The learning rate is a hyperparameter that controls the step size of gradient descent. If it is too high, training may overshoot the minimum; if it is too low, training may be very slow.
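
A minimal NumPy sketch of gradient descent for a one-parameter linear model, where lr is the learning rate (the data and settings are chosen purely for illustration):

# Example (illustrative sketch): gradient descent for y = w * x
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])  # underlying relationship: y = 2x

w, lr = 0.0, 0.01                        # initial weight and learning rate
for _ in range(1000):
    grad = np.mean(2 * (w * x - y) * x)  # derivative of mean squared error w.r.t. w
    w -= lr * grad                       # step in the direction of steepest descent
print(w)  # approaches 2.0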

Feature engineering is the process of selecting, modifying, or creating new features to improve model performance.

One-hot encoding converts categorical variables into binary vectors representing presence or absence of each category.
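
A short sketch using scikit-learn's OneHotEncoder on a made-up color column:

# Example (illustrative sketch): one-hot encoding a categorical feature
from sklearn.preprocessing import OneHotEncoder

colors = [['red'], ['green'], ['blue'], ['green']]
encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()
print(encoder.categories_)
print(encoded)  # one binary column per category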

Cross-validation is a technique to assess model generalization by splitting data into training and validation folds.

PCA is a dimensionality reduction technique that transforms correlated features into uncorrelated principal components.

Regularization adds a penalty term to the loss function to reduce model complexity and prevent overfitting.

L1 regularization adds an absolute-value penalty on the weights, encouraging sparsity; L2 adds a squared penalty, shrinking weights more evenly.

Early stopping halts training when validation performance stops improving to prevent overfitting.

A learning curve plots model performance against training size or epochs to diagnose bias or variance issues.
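
A short sketch using scikit-learn's learning_curve utility (it assumes a feature matrix X and labels y are already defined, as in the other examples):

# Example (illustrative sketch): computing a learning curve
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression
import numpy as np

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))
print(train_scores.mean(axis=1))  # average training accuracy at each training size
print(val_scores.mean(axis=1))    # average validation accuracy at each training size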

Normalization rescales features to a common scale, usually between 0 and 1, to improve training stability.

Standardization scales data to have zero mean and unit variance, which is useful for algorithms that assume normally distributed features.

SVM is a supervised algorithm that finds the optimal hyperplane to separate classes with maximum margin.

The kernel trick implicitly maps data into a higher-dimensional space to make it linearly separable, without explicitly computing the coordinates in that space.
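
A small sketch showing an RBF-kernel SVM on toy XOR-like points that no straight line can separate (the kernel and parameter values are arbitrary):

# Example (illustrative sketch): SVM with an RBF kernel on XOR-like data
from sklearn.svm import SVC
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # not linearly separable

clf = SVC(kernel='rbf', gamma=2.0, C=10.0)
clf.fit(X, y)
print(clf.predict(X))  # the RBF kernel lets the SVM capture this pattern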

Neural networks are interconnected layers of nodes designed to recognize patterns and model complex data.

Activation functions introduce non-linearity to neural networks, enabling them to learn complex functions.

Examples include sigmoid, ReLU, tanh, and softmax.
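
A tiny NumPy sketch of some common activation functions:

# Example (illustrative sketch): common activation functions in NumPy
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), np.tanh(z), softmax(z))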

Backpropagation is an algorithm to update neural network weights by propagating the error backward using gradients.

Dropout randomly disables neurons during training to prevent overfitting and improve generalization.

Batch normalization normalizes layer inputs to stabilize and accelerate training.
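
A minimal Keras sketch (layer sizes and shapes are arbitrary) that places batch normalization between layers:

# Example (illustrative sketch): batch normalization in a Keras model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

model = Sequential([
    Dense(64, activation='relu', input_shape=(20,)),
    BatchNormalization(),            # normalizes this layer's inputs during training
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')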

CNNs use convolutional layers to extract spatial features, commonly used in image processing tasks.

RNNs have loops allowing information to persist, ideal for sequential data like text or time series.

LSTM is a type of RNN designed to capture long-term dependencies with gating mechanisms.
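
A minimal Keras sketch of an LSTM for sequence data (the sequence length and feature count are illustrative):

# Example (illustrative sketch): LSTM for sequences of 10 timesteps with 8 features each
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(32, input_shape=(10, 8)),  # 10 timesteps, 8 features per step
    Dense(1)                        # e.g., predict the next value in the series
])
model.compile(optimizer='adam', loss='mse')
model.summary()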

Overfitting occurs when a machine learning model learns the training data too well, including noise and outliers, leading to poor generalization on unseen data. Prevention techniques include:
- Using more training data
- Applying regularization (L1, L2)
- Pruning models (in decision trees)
- Early stopping during training
- Cross-validation to monitor model performance
- Simplifying the model architecture
Understanding and controlling overfitting is essential for building robust ML models.

k-Nearest Neighbors (k-NN) is a simple supervised ML algorithm that classifies a sample based on a majority vote of its k nearest neighbors. Below is a simple Python example using scikit-learn:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Create k-NN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict on test data
predictions = knn.predict(X_test)

print(predictions)

This code trains a k-NN model on the Iris dataset and prints predictions for the test set.

The bias-variance tradeoff balances two sources of error that affect ML model performance:
- Bias: Error from overly simplistic assumptions, causing underfitting.
- Variance: Error from excessive sensitivity to training data noise, causing overfitting.
A good model minimizes total error by balancing bias and variance. Techniques like cross-validation help find this balance.
Understanding this tradeoff helps in selecting the right model complexity for your data.
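
As a small sketch of the tradeoff, compare a very shallow and an unrestricted decision tree on the same synthetic data; the depths and dataset are arbitrary, but the deeper tree typically shows a larger gap between training and test accuracy:

# Example (illustrative sketch): high-bias vs. high-variance decision trees
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for depth in (1, None):  # depth 1: high bias (underfits); no limit: high variance (overfits)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))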

Data normalization rescales features to a fixed range (usually 0 to 1). MinMaxScaler achieves this by subtracting the minimum and dividing by the data range. Example using scikit-learn:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[10, 200], [15, 300], [20, 400]], dtype=float)

scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)

print(normalized_data)

This code normalizes the 2D array column-wise between 0 and 1.

Classification predicts discrete labels or categories (e.g., spam or not spam).
Regression predicts continuous values (e.g., house price, temperature).
Both are supervised learning tasks but differ in output types and algorithms used.
For example, logistic regression is used for classification, while linear regression is used for regression problems.
Choosing the correct task depends on your prediction goal.

Model deployment is the process of integrating a trained machine learning model into a production environment where it can make predictions on real data.

# Example: Deploying a model using Flask
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    prediction = model.predict([data['features']])
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(debug=True)

Overfitting can be prevented using techniques like cross-validation, regularization, pruning (for decision trees), and using more data or simpler models.

# Example: Using L2 Regularization in Logistic Regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l2', C=0.1)
model.fit(X_train, y_train)

Hyperparameter tuning is the process of choosing the optimal parameters for a learning algorithm to improve performance.

# Example: Grid Search for tuning SVM parameters
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
grid = GridSearchCV(SVC(), parameters)
grid.fit(X_train, y_train)
print(grid.best_params_)

Feature scaling ensures that each feature contributes equally to the model by putting them on a similar scale.

# Example: Using StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Imbalanced datasets can be handled using techniques like oversampling the minority class, undersampling the majority class, or using specialized algorithms.

# Example: Using SMOTE for oversampling
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)

Dimensionality reduction reduces the number of input variables in a dataset. It helps improve model performance and visualization.

# Example: PCA in sklearn
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

The curse of dimensionality refers to the exponential increase in data requirements as the number of features grows. It can lead to overfitting and computational inefficiency.

# Example: Comparing model performance before and after dimensionality reduction
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = RandomForestClassifier()
model.fit(X_train, y_train)
original_score = accuracy_score(y_test, model.predict(X_test))

pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)
X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.3)
model.fit(X_train_pca, y_train)
pca_score = accuracy_score(y_test, model.predict(X_test_pca))
print("Original Score:", original_score)
print("PCA Score:", pca_score)

Ensemble methods combine multiple models to improve prediction accuracy. Popular methods include Bagging, Boosting, and Stacking.

# Example: Using Random Forest (Bagging technique)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Stacking is an ensemble learning technique that combines predictions from multiple base models using a meta-model.

# Example: Using StackingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

base_learners = [ ('dt', DecisionTreeClassifier()), ('svc', SVC(probability=True)) ]
stack_model = StackingClassifier(estimators=base_learners, final_estimator=LogisticRegression())
stack_model.fit(X_train, y_train)
stack_preds = stack_model.predict(X_test)

AdaBoost is a boosting algorithm that adjusts the weights of incorrectly classified samples to focus more on hard examples in subsequent rounds.

# Example: AdaBoost with decision trees
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
model = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50)  # 'estimator' replaces the deprecated 'base_estimator' in recent scikit-learn versions
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Stacking is an ensemble learning method where multiple models are trained, and a meta-model learns to combine their predictions.

# Example: Stacking with two base models and a meta-model
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
estimators = [('svm', SVC()), ('tree', DecisionTreeClassifier())]
model = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Batch size is the number of training samples processed before the model is updated. It impacts model accuracy and training time.

# Example: Training with different batch sizes
model.fit(X_train, y_train, epochs=10, batch_size=32)
model.fit(X_train, y_train, epochs=10, batch_size=64)

Overfitting happens when the model learns the training data too well and fails to generalize to new data.

# Example: Overfitting visible in training vs. validation accuracy
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100)

Techniques to prevent overfitting in neural networks include dropout, regularization, early stopping, and using more training data.

# Example: Applying dropout
from tensorflow.keras.layers import Dropout
model.add(Dropout(0.5))

Dropout randomly disables neurons during training, preventing co-dependence and improving generalization.

# Example: Adding dropout to a layer
model.add(Dropout(0.25))

Early stopping halts training when performance on validation data starts degrading, preventing overfitting.

# Example: Using early stopping in Keras
from tensorflow.keras.callbacks import EarlyStopping
callback = EarlyStopping(monitor='val_loss', patience=3)
model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[callback])

L1 regularization adds the absolute value of weights to the loss, encouraging sparsity in the model.

# Example: Applying L1 regularization
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense
Dense(64, kernel_regularizer=regularizers.l1(0.01))

L2 regularization adds the square of the weights to the loss, preventing large weights and reducing overfitting.

# Example: Applying L2 regularization
Dense(64, kernel_regularizer=regularizers.l2(0.01))

CNN is a deep learning model primarily used for image processing, which uses convolutional layers to extract features.

# Example: Simple CNN with Keras
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
model.add(Conv2D(32, (3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(64, activation='relu'))

A pooling layer reduces the spatial dimensions of feature maps and helps control overfitting.

# Example: MaxPooling layer
model.add(MaxPooling2D(pool_size=(2, 2)))

KNN classifies a data point based on how its neighbors are classified. It finds the 'k' closest points and assigns the label most common among them.

# Example: Using KNN in scikit-learn
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Regularization techniques like L1 (Lasso) and L2 (Ridge) help prevent overfitting by adding penalties to the loss function during training.

# Example: Ridge regression
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

The ROC curve plots the true positive rate against the false positive rate at various thresholds. AUC measures the area under this curve and summarizes overall model performance.

# Example: Plotting ROC curve
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:,1])
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label='AUC = %0.2f' % roc_auc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

Multicollinearity occurs when independent variables are highly correlated, which can distort the model. Solutions include removing variables or using PCA.

# Example: Using VIF to detect multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
print(vif_data)

A confusion matrix summarizes the performance of a classification model using true positives, false positives, true negatives, and false negatives.

# Example: Confusion Matrix with scikit-learn
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, predictions)
print(cm)

Handling missing data is crucial. Techniques include removing rows, imputing with mean/median/mode, or using algorithms that support missing values.

# Example: Impute missing values with mean
from sklearn.impute import SimpleImputer
import numpy as np
imp = SimpleImputer(strategy='mean')
X_imputed = imp.fit_transform(X)

Regularization reduces overfitting by penalizing large coefficients in models. L1 (Lasso) and L2 (Ridge) are common forms.

# Example: Ridge regression
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

Logistic regression models the probability of a binary outcome using a logistic function (sigmoid).

# Example: Logistic regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Cross-validation splits the data into k folds and trains the model on k-1 parts while testing on the remaining. It helps evaluate model performance reliably.

# Example: 5-fold cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print("Average score:", scores.mean())

Principal Component Analysis (PCA) is a dimensionality reduction method that transforms features into a set of uncorrelated variables called principal components.

# Example: Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

The ROC (Receiver Operating Characteristic) curve plots TPR vs. FPR at various threshold settings, evaluating classification performance.

# Example: Plot ROC curve
from sklearn.metrics import roc_curve, auc
y_score = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

Batch learning trains on the full dataset at once, while online learning updates the model incrementally with each data point or batch.

# Example: Online learning with SGDClassifier
from sklearn.linear_model import SGDClassifier
import numpy as np
model = SGDClassifier()
for X_batch, y_batch in data_stream:  # data_stream: any iterable of (X, y) mini-batches
    model.partial_fit(X_batch, y_batch, classes=np.unique(y))

Anomaly detection identifies data points that deviate significantly from the norm. It's useful in fraud detection, health monitoring, etc.

# Example: Isolation Forest for anomaly detection
from sklearn.ensemble import IsolationForest
model = IsolationForest()
model.fit(X)
predictions = model.predict(X)

Stratified sampling splits data while preserving class proportions. It's vital in classification tasks with imbalanced datasets.

# Example: Stratified train-test split
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2)
for train_index, test_index in split.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

A confusion matrix summarizes the prediction results of a classification model, showing true positives, false positives, true negatives, and false negatives.

# Example: Confusion matrix with sklearn
from sklearn.metrics import confusion_matrix
predictions = model.predict(X_test)
cm = confusion_matrix(y_test, predictions)
print(cm)

Bagging (Bootstrap Aggregating) trains multiple models on different subsets of the training data and combines their results to improve performance and reduce variance.

# Example: Bagging classifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
model = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10)  # 'estimator' replaces 'base_estimator' in recent scikit-learn versions
model.fit(X_train, y_train)

Pre-pruning stops tree growth early based on criteria like maximum depth, while post-pruning removes branches after the tree has fully grown; both aim to avoid overfitting.

# Example: Set max depth for pre-pruning
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

A hyperparameter is a configuration set before training a model (e.g., learning rate, number of trees). It is tuned using techniques like GridSearchCV or RandomizedSearchCV.

# Example: Grid search for best parameters
from sklearn.model_selection import GridSearchCV
params = {'n_estimators': [50, 100]}
grid = GridSearchCV(estimator=model, param_grid=params, cv=3)
grid.fit(X_train, y_train)

An SVM finds the optimal hyperplane that best separates classes. It works well for high-dimensional data and uses kernel tricks for non-linearity.

# Example: Linear SVM
from sklearn.svm import SVC
model = SVC(kernel='linear')
model.fit(X_train, y_train)
      

Feature engineering is the process of selecting, transforming, or creating new input features to improve model performance.

# Example: Creating an interaction feature (assumes X is a pandas DataFrame)
X['new_feature'] = X['feature1'] * X['feature2']

Overfitting occurs when a model performs well on training data but poorly on unseen data. Prevent it by using regularization, pruning, cross-validation, or simplifying the model.

# Example: Use regularization in logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=0.1)
model.fit(X_train, y_train)

K-fold cross-validation splits data into K subsets. The model is trained on K-1 parts and validated on the remaining one. This process is repeated K times to ensure robust evaluation.

# Example: 5-fold cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())

Gradient descent comes in three types: Batch (uses whole dataset), Stochastic (uses one sample), and Mini-batch (uses small batches). Each has tradeoffs in speed and stability.

# Example: SGD in Scikit-learn
from sklearn.linear_model import SGDClassifier
model = SGDClassifier()
model.fit(X_train, y_train)

The ROC curve shows the trade-off between true positive and false positive rates. AUC (Area Under Curve) summarizes its performance. Closer to 1 means better classification.

# Example: Plot ROC curve
from sklearn.metrics import roc_curve, auc
y_scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)
print(roc_auc)

Bagging (Bootstrap Aggregating) reduces variance by training multiple models on different subsets of data and averaging their predictions. It's commonly used in ensemble methods like Random Forest.

# Example: Bagging with Decision Tree
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
model = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10)  # 'estimator' replaces 'base_estimator' in recent scikit-learn versions
model.fit(X_train, y_train)

Boosting is an ensemble technique that builds models sequentially. Each new model corrects errors from previous ones. It improves accuracy and reduces bias.

# Example: Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=100)
model.fit(X_train, y_train)

Feature selection improves model performance by removing irrelevant or redundant data. It also reduces overfitting and training time.

# Example: SelectKBest feature selection
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)

Dimensionality reduction reduces the number of input variables. PCA (Principal Component Analysis) is a common technique to project data to lower dimensions while preserving variance.

# Example: PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

Model interpretability means understanding how a model makes decisions. It's essential in sensitive domains like healthcare and finance to ensure transparency and trust.

# Example: Using SHAP for interpretability
import shap
explainer = shap.Explainer(model, X_train)
shap_values = explainer(X_test)
shap.summary_plot(shap_values, X_test)

Regularization adds a penalty to the loss function, discouraging large coefficients. This helps the model generalize better and reduces overfitting.

# Example: L2 Regularization with Ridge Regression
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

Cross-validation splits the data into multiple training and validation sets. This gives a better estimate of model performance and reduces the risk of overfitting.

# Example: 5-Fold Cross Validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())