1. What is logistic regression? What is the primary difference between logistic regression and
linear regression?
2. List and briefly explain the main steps involved in building a logistic regression model.
3. Explain the difference between regression and segmentation in the context of supervised and
unsupervised learning.
4. What is feature extraction and why is it important for making accurate predictions?
5. Explain the concept of decision trees and discuss how they can be used for both regression
and classification tasks. How can overfitting be prevented in decision trees?
6. Describe the STL (Seasonal-Trend decomposition using Loess) approach in time series
analysis. What are its advantages?
7. Explain Intellectual Property and Copyright.
8. What are Standard Operating Procedures (SOPs) in documentation, and why are they
essential for knowledge sharing?
Objective
1. In logistic regression, the function used to estimate probabilities is the Sigmoid
(Logistic) function.
2. During model construction, techniques such as forward selection, backward elimination,
and cross-validation are used for feature selection.
3. Decision Trees used for classification are an example of supervised learning.
4. Pruning is a technique used to simplify a decision tree by removing branches that offer
little predictive power.
5. ARIMA models are particularly suitable for analyzing non-stationary time series data.
6. The component of a time series that captures periodic patterns is called the seasonal
component.
7. Standard Operating Procedures (SOPs) are primarily used to provide guidelines for
document preparation and usage.
8. A white paper is used to provide technical explanations or solutions to complex issues.
9. A purpose and scope document defines the goals, stakeholders, and timelines, but not
the design layout.
10. A technical report avoids including personal opinions or unsupported claims.
11. In logistic regression, the output variable is usually binary in nature.
12. A confusion matrix summarizes the performance of a classification model.
13. Segmentation groups data into similar segments.
14. Overfitting in decision trees occurs when the model becomes too complex.
15. Feature extraction helps understand the behavior of a time series.
16. In unsupervised learning, models operate without any labeled or supervised data.
17. Intellectual Property refers to rights over inventions, creations, and original works.
18. Copyrights protect artistic, literary, and musical works.
19. Document preparation tools assist in creating and editing documents.
20. SOPs in documentation help ensure quality, consistency, and standardization.
21. In logistic regression, the output values are typically probabilities between 0 and 1.
22. In business analytics, logistic regression is often applied to predict customer churn
(yes/no).
23. An example of an unsupervised learning technique is K-Means Clustering.
24. The process of dividing a dataset into distinct groups based on similarities is called
segmentation.
25. Features such as height, average energy, and seasonality can be extracted from a time
series.
26. A common method for making a time series stationary is differencing.
27. The primary purpose of SOPs is to provide a standard method for performing tasks and
documenting processes.
28. Intellectual Property rights are designed to protect the original creation of authors,
inventors, and creators.
29. In a professional setting, documents should be edited for accuracy and clarity.
30. Document preparation tools like Microsoft Word, Google Docs, or LaTeX help to
organize and structure the content of a document.
31. The function that maps any real-valued number into a value between 0 and 1 is called
the Sigmoid function.
32. In healthcare analytics, logistic regression can be used to predict the probability of
disease occurrence.
33. In regression, the goal is to predict a dependent variable based on one or more
independent variables.
34. Pruning is the process of removing branches from a decision tree to reduce its
complexity and improve generalization.
35. The STL approach in time series stands for Seasonal-Trend decomposition using Loess.
36. The ARIMA model includes three main components: AutoRegressive (AR), Integrated (I),
and Moving Average (MA).
37. A typical technical report structure includes an introduction, methodology, analysis,
conclusion, and recommendations.
38. A white paper is used to present research or solutions to specific challenges.
39. The purpose section of a document typically explains the reason for its creation and the
goals it aims to achieve.
40. Copyright ensures that intellectual property remains protected from unauthorized use
or reproduction.
1. What is Logistic regression? What is the primary difference between logistic regression
and linear regression?
Logistic regression is a supervised machine learning algorithm used in data analytics for
classification tasks, most often binary ones, where the goal is to predict the probability that a
given instance belongs to a particular class (e.g., fraud detection, customer churn).
It analyzes the relationship between independent variables and a categorical dependent
variable.
While it shares similarities with linear regression, it is primarily designed for
classification rather than regression tasks.
Logistic regression is often used for binary classification. For example, consider a medical
study trying to predict whether a patient has a certain disease (Class 1) or not (Class 0)
based on factors like age, blood pressure, and cholesterol level. The model assigns a
probability score between 0 and 1 for each patient. If the score is above a set threshold
(e.g., 0.5), the patient is predicted to have the disease (Class 1); otherwise, they are
classified as not having the disease (Class 0).
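Mathematically, it uses the sigmoid function to convert a linear combination of the input
features into a probability between 0 and 1:
$\sigma(z) = \frac{1}{1 + e^{-z}}$
where $z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$ is the linear combination of the input features.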
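The sigmoid mapping and the threshold rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a fitted model: the weights, bias, and patient values below are made-up numbers chosen only to show the mechanics.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number z to a value between 0 and 1 (numerically stable form)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_class(features, weights, bias, threshold=0.5):
    """Return (probability, class label) for one instance.

    z is the linear combination of the input features.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p > threshold)


# Hypothetical coefficients for age, blood pressure, cholesterol (illustrative only).
weights = [0.04, 0.02, 0.01]
bias = -7.0

probability, label = predict_class([60, 140, 220], weights, bias)
print(f"probability={probability:.3f}, class={label}")
```

Changing the threshold changes which patients are labeled Class 1 without changing the underlying probabilities.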
Types of Logistic Regression
Binomial: Two possible categories (e.g., Pass/Fail, 0/1).
Multinomial: Three or more possible unordered categories (e.g., Cat, Dog, Sheep).
Ordinal: Three or more possible ordered categories (e.g., Low, Medium, High).
The primary difference between logistic regression and linear regression is that logistic
regression predicts a categorical (binary) outcome, while linear regression predicts a continuous
numerical value. Logistic regression uses a sigmoid function to map inputs to a probability
range, while linear regression fits a straight line to predict continuous outputs.
Feature | Linear Regression | Logistic Regression
Purpose | Regression problems | Classification problems
Output Type | Continuous values (e.g., age, price) | Categorical values (e.g., 0 or 1)
Curve Type | Straight line | S-curve (sigmoid)
Estimation Method | Least Squares | Maximum Likelihood Estimation
Collinearity Handling | Allows collinearity | Limited tolerance for collinearity
Linearity Requirement | Requires a linear relationship between inputs and output | Requires linearity only between inputs and the log-odds, not the output itself
2. List and briefly explain the main steps involved in building a logistic regression model.
1. Import the Required Libraries
Start by importing essential Python libraries such as pandas, numpy, and later sklearn for
building the model.
Example:
import pandas as pd
import numpy as np
2. Clean the Dataset
Handle missing values using techniques like imputation (filling with mean) or removal.
Drop irrelevant or redundant features that do not add value to the prediction (e.g.,
currentSmoker, education).
This step reduces noise and ensures the model trains on meaningful data.
3. Analyze the Dataset (Feature Engineering)
Perform exploratory data analysis (EDA) to understand patterns and correlations.
Visualize and filter the most relevant features.
Drop features with low variance or those that don't help differentiate classes (e.g., BMI,
heartRate).
Apply scaling if necessary since logistic regression is sensitive to feature scales.
4. Prepare the Model
Split the dataset into training and testing sets using train_test_split.
Initialize and train the logistic regression model using:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
5. Evaluate the Model
Evaluate performance using metrics like accuracy, confusion matrix, precision, recall,
F1-score, and ROC-AUC.
The model score or accuracy is a good initial check; for instance, an accuracy of 85.6%
was reported in the example.
Together, these steps cover the complete logistic regression pipeline in Python, from data
loading to model evaluation.
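The five steps can be put together as one short script. The sketch below assumes scikit-learn is installed and uses a synthetic dataset from make_classification as a stand-in for a cleaned real dataset, so the accuracy it prints will not match the 85.6% figure quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an already cleaned dataset (assumption: no real data here).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)

# Step 4a: split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 3: scale features, fitting the scaler on the training data only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4b: initialize and train the model.
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Step 5: evaluate on the held-out test set.
y_pred = model.predict(X_test_scaled)
probabilities = model.predict_proba(X_test_scaled)[:, 1]  # P(class = 1)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("accuracy:", accuracy)
print("confusion matrix:\n", cm)
```

Fitting the scaler on the training split only avoids leaking information from the test set into the model.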
3. Explain the difference between regression and segmentation in the context of supervised
and unsupervised learning.
Regression | Segmentation
Regression is a type of supervised learning. | Segmentation is a type of unsupervised learning.
It predicts a continuous numeric value. | It groups data into clusters based on similarity.
Requires both input features and labeled output data. | Requires only input features, no labeled output.
Used when we know the outcome we want to predict. | Used when we want to explore or discover structure in the data.
Example: Predicting house prices based on size and location. | Example: Grouping customers based on purchase behavior.
Models are trained using known outcomes. | Models identify patterns without knowing outcomes.
Common algorithms include Linear Regression and Ridge Regression. | Common algorithms include K-Means and DBSCAN.
Evaluation metrics include MAE, MSE, and R-squared. | Evaluation metrics include Silhouette Score and Davies–Bouldin index.
It assumes a relationship between input and output. | It does not assume any target relationship.
Mainly used for forecasting and numeric predictions. | Mainly used for pattern recognition and data exploration.
Outliers can heavily influence regression results. | Outliers may be grouped separately or affect cluster shapes.
Feature scaling may improve model performance. | Feature scaling is often essential for accurate grouping.
Produces a specific prediction for each input. | Assigns each input to a group or segment.
Can model linear or nonlinear relationships. | Focuses on similarity/distance between data points.
Helps in estimating future values. | Helps in identifying natural divisions in data.
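The contrast can be seen in a small standard-library sketch. The regression function needs labeled outputs (prices) to fit a line, while the segmentation function only sees the inputs and groups them by distance. The data values and the helper names fit_line and kmeans_1d are hypothetical, and the clustering is a simplified one-dimensional K-Means written for illustration.

```python
def fit_line(xs, ys):
    """Supervised: least-squares line fitted from inputs xs and known outputs ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def kmeans_1d(values, centers, iterations=10):
    """Unsupervised: group values around the nearest center; no labels are used."""

    def nearest(v, cs):
        return min(range(len(cs)), key=lambda i: abs(v - cs[i]))

    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            groups[nearest(v, centers)].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, [nearest(v, centers) for v in values]


# Regression: house size (sq ft) -> price, with known prices as labels.
slope, intercept = fit_line([1000, 1500, 2000], [200_000, 300_000, 400_000])
predicted_price = slope * 1200 + intercept

# Segmentation: monthly spend per customer, no labels at all.
centers, segments = kmeans_1d([20, 25, 30, 200, 220, 210], centers=[20, 200])

print(predicted_price)
print(centers, segments)
```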
4. What is feature extraction and why is it important for making accurate predictions?
Feature extraction transforms raw, high-dimensional, or unstructured data into a compact and
informative representation that a machine learning algorithm can understand and use
effectively.
It’s not just about selecting useful data — it often involves creating new features that represent
hidden structure or patterns in the original input.
Feature extraction improves model performance by capturing the most relevant
information.
It enables learning from unstructured data like text, images, and audio.
Without good features, even the best algorithms fail to produce accurate predictions.
Why is Feature Extraction Important?
Here’s why it’s crucial for making accurate predictions:
1. Simplifies the Data: It reduces the original data to a smaller set of relevant features,
making models faster and easier to train.
2. Improves Accuracy: Better features lead to better patterns, which result in more
accurate predictions.
3. Removes Noise: It helps eliminate irrelevant or redundant information that may confuse
the model.
4. Enables Generalization: Good feature extraction helps the model generalize better to
unseen data (which helps reduce overfitting).
5. Essential for Unstructured Data: For data like text, images, or audio, raw data isn’t
usable directly—features must be extracted (e.g., pixel intensities in images, TF-IDF in
text).
6. Boosts Interpretability: Extracted features can offer insight into which inputs most
influence the output.
7. Reduces Dimensionality: Too many features can overwhelm a model (the "curse of
dimensionality"). Feature extraction helps reduce the input space while preserving key
information.
8. Improves Model Efficiency: Less data to process means faster training and inference times.
9. Supports Better Learning: Helps models identify underlying trends and relationships more
easily.
10. Necessary for Complex Data: Raw images, audio, and text must be converted into numbers
before analysis.
Example:
Raw data: A photo of a handwritten digit.
Feature extraction: Turn the image into pixel intensity values or extract edges.
Use in model: The features help the model distinguish between a “2” and a “3”.
Examples by Data Type
📊 Structured Data (Tabular):
Extracting age from a date of birth column.
Converting time stamps into hour-of-day or day-of-week.
Creating an “income per dependent” feature from total income and family size.
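The three tabular examples above can be sketched with the standard library alone. The record layout and the field names (date_of_birth, last_purchase, total_income, family_size) are hypothetical and chosen only for illustration.

```python
from datetime import date, datetime


def extract_features(record, today):
    """Derive model-ready features from one raw record."""
    dob = record["date_of_birth"]
    # Age in whole years: subtract one if this year's birthday has not happened yet.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    ts = record["last_purchase"]
    return {
        "age": age,
        "hour_of_day": ts.hour,
        "day_of_week": ts.weekday(),  # Monday = 0, Sunday = 6
        "income_per_dependent": record["total_income"] / max(record["family_size"], 1),
    }


record = {
    "date_of_birth": date(1990, 6, 15),
    "last_purchase": datetime(2024, 3, 4, 14, 30),
    "total_income": 60000,
    "family_size": 4,
}

features = extract_features(record, today=date(2024, 3, 4))
print(features)
```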
Image Data:
Extracting edges, textures, or color histograms.
Using techniques like:
o SIFT (Scale-Invariant Feature Transform)
o HOG (Histogram of Oriented Gradients)
o CNN features (from deep learning models)
📄 Text Data:
Converting sentences into:
o Bag-of-Words
o TF-IDF (Term Frequency-Inverse Document Frequency)
o Word embeddings like Word2Vec, GloVe, or BERT features
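As a rough illustration of the simplest of these, a Bag-of-Words representation can be built with the standard library. This sketch splits on whitespace only and ignores punctuation, stop words, and TF-IDF weighting; the two example sentences are made up.

```python
from collections import Counter


def bag_of_words(documents):
    """Turn each document into a vector of word counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["the cat sat", "the dog sat down"])
print(vocabulary)
print(vectors)
```

Each sentence becomes a fixed-length numeric vector, which is the form a model such as logistic regression can consume.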
🔊 Audio Data:
Extracting:
o MFCCs (Mel-frequency cepstral coefficients)
o Spectrogram features
o Pitch, tempo, or rhythm patterns
🌐 Time Series Data:
Extracting trends, seasonality, and statistical measures (mean, standard deviation).
Features like rolling averages or lag variables.
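Rolling averages, lag variables, and summary statistics can be computed directly, as in this minimal standard-library sketch. The series values and the helper names rolling_mean and lag are illustrative assumptions.

```python
from statistics import mean, stdev


def rolling_mean(series, window):
    """Average of each run of `window` consecutive values."""
    return [
        mean(series[i - window + 1 : i + 1]) for i in range(window - 1, len(series))
    ]


def lag(series, k):
    """Shift the series by k steps; the first k positions have no past value."""
    return [None] * k + series[: len(series) - k]


series = [10, 12, 14, 16, 18]

features = {
    "mean": mean(series),
    "std": stdev(series),
    "rolling_mean_3": rolling_mean(series, 3),
    "lag_1": lag(series, 1),
}
print(features)
```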
5. Explain the concept of decision trees and discuss how they can be used for both regression
and classification tasks. How can overfitting be prevented in decision trees?
A decision tree is a supervised machine learning algorithm that uses a tree-like model of
decisions. It splits the dataset into smaller and smaller subsets based on feature values,
eventually reaching a decision or prediction at the leaf node.
Each internal node represents a feature test (e.g., "Is Age > 50?"),
Each branch represents the outcome of the test, and
Each leaf node represents the final prediction (a class label or a number).
🔁 How Decision Trees Work
Start at the root node.
Split the data based on the best feature (using criteria like Gini impurity, entropy, or
mean squared error).
Recursively apply this process to each child node.
Stop when:
o All data points in a node belong to the same class (classification).
o Or a stopping criterion (like max depth or min samples) is met.
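The "best feature" step can be made concrete with Gini impurity. The sketch below scores every candidate threshold on one numeric feature and keeps the one with the lowest weighted impurity. The ages and labels are toy values, and best_threshold is a hypothetical helper, not a library function.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure node, higher when classes are mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, impurity)."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


ages = [25, 45, 35, 22]
bought = ["No", "Yes", "Yes", "No"]

threshold, impurity = best_threshold(ages, bought)
print(threshold, impurity)
```

A full tree builder would repeat this search over every feature and then recurse into the left and right subsets.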
📌 Used for Both:
Task | How Decision Trees Are Used
Classification | The leaves hold class labels, e.g., "spam" or "not spam".
Regression | The leaves hold numeric values, e.g., "house price = $250,000".
Classification trees use metrics like Gini index or entropy to find the best split.
Regression trees use mean squared error (MSE) or mean absolute error (MAE) to find
the best split.
The core mechanism (splitting based on feature values) is the same; what differs is the
type of output and how the best split is chosen.
Decision Trees for Classification:
In classification, a decision tree splits the data based on feature values to predict a category or
class label.
✅ Example:
Suppose we want to predict whether a person will buy a car based on:
Age
Income
Marital status
Training Data (Simplified):
Age | Income | Married | Buy Car?
25 | Low | No | No
45 | High | Yes | Yes
35 | Medium | Yes | Yes
22 | Low | No | No
Tree Could Look Like:
Is Age > 30?
├── Yes → Is Income High?
│   ├── Yes → Predict: Yes
│   └── No → Predict: Yes
└── No → Predict: No
This tree splits based on age, then income.
Each path leads to a class label: "Yes" or "No".
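Read as code, the example tree is a pair of nested conditions. The function below is a hand-written transcription of that tree, not a trained model. In this toy tree both income branches predict "Yes", so the income test does not change the outcome; it is kept to mirror the diagram.

```python
def predict_buy_car(age, income):
    """Follow the example tree from the root to a leaf and return the class label."""
    if age > 30:
        if income == "High":
            return "Yes"
        return "Yes"  # the "No" branch of the income test also predicts Yes here
    return "No"


# The four training rows: (age, income, married, label).
training_rows = [
    (25, "Low", "No", "No"),
    (45, "High", "Yes", "Yes"),
    (35, "Medium", "Yes", "Yes"),
    (22, "Low", "No", "No"),
]

predictions = [predict_buy_car(age, income) for age, income, _, _ in training_rows]
print(predictions)
```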
📈 Decision Trees for Regression
In regression, the decision tree splits data to predict a numeric value instead of a class.
✅ Example:
Suppose we want to predict the price of a house based on:
Size (in square feet)
Number of rooms
Location
Training Data (Simplified):
Size (sq ft) | Rooms | Price ($)
1200 | 3 | 200,000
1500 | 4 | 250,000
1700 | 4 | 275,000
900 | 2 | 150,000
Tree Could Look Like:
Is Size > 1300?
├── Yes → Predict: 262,500 (average of 250k and 275k)
└── No → Predict: 175,000 (average of 200k and 150k)
This tree splits based on numerical thresholds and predicts the average value of the
target in each leaf.
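The leaf values in the example can be reproduced by splitting the four training rows at 1300 sq ft and averaging the prices on each side. The helper names leaf_means and predict_price are illustrative, and only the size feature is used.

```python
def leaf_means(rows, threshold):
    """Split (size, price) rows at the threshold; return the mean price of each leaf."""
    left = [price for size, price in rows if size <= threshold]
    right = [price for size, price in rows if size > threshold]
    return sum(left) / len(left), sum(right) / len(right)


def predict_price(size, threshold, left_mean, right_mean):
    """A one-split regression tree: return the mean of the leaf the input falls into."""
    return right_mean if size > threshold else left_mean


rows = [(1200, 200_000), (1500, 250_000), (1700, 275_000), (900, 150_000)]

left_mean, right_mean = leaf_means(rows, threshold=1300)
print(left_mean, right_mean)
print(predict_price(1600, 1300, left_mean, right_mean))
```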
🔁 Key Differences in Use
Aspect | Classification | Regression
Target output | Category/class (e.g., "Yes", "No") | Continuous number (e.g., price, weight)
Split criteria | Gini Index, Entropy | Mean Squared Error (MSE), MAE
Leaf prediction | Most common class in the leaf | Average of numeric values in the leaf
⚠️ Overfitting in Decision Trees
Overfitting happens when a decision tree becomes too complex, learning not just the true
patterns in the data but also the noise and random fluctuations. As a result, the model
performs well on training data, but poorly on new, unseen data.
Example of Overfitting:
A tree that splits until each leaf has only one data point will perfectly memorize the training
data — but will fail to generalize.
✅ How to Prevent Overfitting in Decision Trees
Here are the main techniques to prevent overfitting, with simple explanations:
1. Limit the Maximum Depth of the Tree
Set a limit on how many levels (splits) the tree can go down.
A shallow tree is less likely to overfit.
📌 Example: max_depth = 5
2. Set Minimum Samples per Leaf
This forces the tree to only create a leaf node if it contains at least a certain number of
samples.
Prevents the tree from making splits based on very small and potentially misleading data
subsets.
📌 Example: min_samples_leaf = 10
3. Set Minimum Samples per Split
This prevents the tree from splitting a node unless there are enough data points in it.
Avoids creating tiny branches from just a few samples.
📌 Example: min_samples_split = 20
4. Pruning the Tree
Pruning means removing, or never growing, branches that add little value.
There are two types:
o Pre-pruning: Stop growth early using limits like depth or sample size while building.
o Post-pruning: Build a full tree and then remove less important nodes, for example using a
validation set or cost-complexity pruning.
5. Use Cross-Validation
Split your data into multiple training/testing parts.
Ensures that your model generalizes well to different data splits.
Helps you detect when the tree is too complex.
6. Use Ensemble Methods
Instead of one tree, use many trees and average their predictions or take a majority vote.
Common ensemble techniques:
o Random Forest: Builds many trees on random subsets of data and features.
o Gradient Boosting: Builds trees sequentially to fix previous mistakes.
✅ Ensemble methods are generally less prone to overfitting than a single deep decision tree.
7. Use Feature Selection
Remove irrelevant or noisy features that can mislead the tree into unnecessary splits.
Less noise = less overfitting.
How to Prevent Overfitting
Technique | Description
Max Depth | Limit how deep the tree can go. Prevents it from splitting endlessly.
Min Samples per Leaf | Set a minimum number of samples required at a leaf node.
Min Samples to Split | Require a minimum number of samples to make a split.
Pruning | Remove branches that have little impact on prediction accuracy.
Cross-validation | Validate performance on multiple data splits to detect overfitting.
Use Ensemble Models | Techniques like Random Forest and Gradient Boosting average or combine multiple trees to reduce overfitting.
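Several of these techniques map onto parameters of scikit-learn's DecisionTreeClassifier. The sketch below assumes scikit-learn is installed and uses a noisy synthetic dataset, so the printed scores are illustrative only. The ccp_alpha value of 0.01 is an arbitrary choice for demonstrating cost-complexity (post-)pruning, not a recommended setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, which makes overfitting easy to provoke.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Unrestricted tree: grows until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning: depth and sample-size limits applied while building.
limited_tree = DecisionTreeClassifier(
    max_depth=5, min_samples_leaf=10, min_samples_split=20, random_state=0
).fit(X_train, y_train)

# Post-pruning: grow fully, then cut back with cost-complexity pruning.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(
    X_train, y_train
)

for name, tree in [("full", full_tree), ("limited", limited_tree), ("pruned", pruned_tree)]:
    print(
        name,
        "depth:", tree.get_depth(),
        "train:", round(tree.score(X_train, y_train), 3),
        "test:", round(tree.score(X_test, y_test), 3),
    )

# Cross-validation: average accuracy over 5 different train/validation splits.
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=5, random_state=0), X, y, cv=5
)
print("cv mean accuracy:", round(cv_scores.mean(), 3))
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting; the restricted trees trade some training accuracy for a smaller gap.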
Decision trees are intuitive models that split data into smaller chunks to make
predictions.
They can be used for both classification and regression by adjusting how splits and
predictions are handled.
Overfitting is a common issue, but it can be prevented using depth limits, pruning, and
ensemble techniques.
6. Describe the STL (Seasonal-Trend decomposition using Loess) approach in time series
analysis. What are its advantages?
STL (Seasonal-Trend Decomposition using Loess) in Time Series Analysis
STL, which stands for Seasonal and Trend decomposition using Loess, is a method used to
break down time series data into three key components:
Trend: the long-term direction or movement in the data.
Seasonal: repeating patterns that occur at regular intervals (e.g., monthly, yearly).
Residual: the random noise or irregular component left after removing trend and
seasonality.
It is especially useful in time series data where seasonal patterns and trends may change over
time or be non-linear.
STL uses Loess (locally estimated scatterplot smoothing), a non-parametric regression method,
to smooth the data. It applies Loess iteratively to:
Extract the trend,
Capture the seasonal component,
Separate out the residual/noise.
This makes it flexible and adaptive to various time series structures.
✅ Advantages of STL
1. Handles Complex Seasonality: STL can detect and adapt to changing seasonal patterns
over time.
2. Robust to Outliers: A robust fitting option (e.g., robust=True in statsmodels) allows it to
handle anomalies without distorting the trend or seasonality.
3. Works on Any Frequency: It can decompose daily, monthly, quarterly, or any regular
time interval data.
4. Separates Components Clearly: Trend, seasonal, and residuals are distinctly extracted
for better analysis.
5. Supports Deseasonalization: Seasonal component can be removed, making data more
suitable for forecasting or modeling.
6. Improves Forecasting: By removing the seasonal component, models trained on
deseasonalized data often perform better.
7. Better Visualization: Makes it easier to visualize underlying trends and cyclic patterns in
raw data.
8. No Stationarity Required: Unlike traditional techniques, STL does not assume the time
series is stationary.
9. Customizable: Parameters like seasonal window size can be tuned depending on the
data’s behavior.
10. Applicable Across Domains: Useful in finance, retail, healthcare, climate research, and
more.
In summary, STL is a powerful and flexible decomposition tool that makes time series data
easier to understand and prepare for tasks like forecasting, anomaly detection, and decision-
making.