
Understanding Regression Analysis Basics

Regression is a supervised learning technique used to find correlations between variables and predict continuous output based on predictors. Gradient descent is an optimization algorithm that iteratively adjusts model parameters to minimize error, while overfitting and underfitting are common issues in model training that affect generalization. Multivariate regression extends simple linear regression to multiple predictors, capturing complex relationships but facing challenges like multicollinearity and the assumption of linearity.

Uploaded by

vivekjadhav1236

Q. State and explain the need for regression analysis.

Ans.: Regression is a supervised learning technique that finds the relationship between variables and lets us predict a continuous output variable from one or more predictor variables.
• Regression analysis is a statistical method for modelling the relationship between a dependent (target) variable and one or more independent (predictor) variables. It is used to predict a continuous variable. Many real-world scenarios call for predictions of future values, such as weather conditions, sales and marketing trends, and in such cases we need a technique that can make those predictions accurately.
• Regression analysis is such a technique, and it is widely used in machine learning and data science. Other reasons for using regression analysis:
i) Regression estimates the relationship between the target and the independent variables.
ii) It is used to find trends in data.
iii) It helps to predict real/continuous values.

Q. How does gradient descent help to optimize a linear regression model?

Ans.: Optimization is the process of training the model iteratively so that an objective function is driven to its minimum (or maximum). It is one of the most important steps in machine learning for getting better results.
We compare the results at every iteration, changing the model parameters at each step until we reach the optimum. The result is an accurate model with a low error rate.
There are different ways to optimize a model. Two important optimization algorithms are gradient descent and stochastic gradient descent.
• Gradient descent is an iterative optimization algorithm for finding the minimum of a function.
• Gradient descent is popular for very large-scale optimization problems because it is easy to implement, can handle black-box functions, and each iteration is cheap.
• Given a differentiable scalar field $f(x)$ and an initial guess $x_1$, gradient descent iteratively moves the guess toward lower values of $f$ by taking steps in the direction of the negative gradient $-\nabla f(x)$. Locally, the negated gradient is the steepest descent direction, i.e., the direction in which $x$ would need to move in order to decrease $f$ the fastest. The algorithm typically converges to a local minimum, but may occasionally reach a saddle point, or not move at all if $x_1$ lies at a local maximum.
• The gradient gives the slope of the curve at $x$, and its direction points toward an increase in the function. So we change $x$ in the opposite direction to lower the function value:
$x_{n+1} = x_n - \gamma \nabla f(x_n)$, where $\gamma > 0$ is the step size (learning rate).
• For linear regression, $f$ is the cost function (for example, the mean squared error) and $x$ is the vector of model coefficients.

Q. Explain overfitting and underfitting.

Ans.: Overfitting
• Training error can be reduced by making the hypothesis more sensitive to the training data, but this may lead to overfitting and poor generalization.
• Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. A low training error combined with high variance is a good indicator of overfitting.
• Overfitting is when a classifier fits the training data too tightly. Such a classifier works well on the training data but not on independent test data. It is a general problem that affects all machine learning methods.
• Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
• We can determine whether a predictive model is underfitting or overfitting the training data by looking at the prediction error on the training data and on the evaluation data.
• Reasons for overfitting:
1. Noisy data
2. Training set is too small
3. Large number of features

Q. What are the different cost functions used to assess the performance of a linear regression model? In the given dataset the outliers represent anomalies. Which cost function will be more suitable and why?

Ans.: Outliers are data points that are significantly different from the rest of the dataset. They are often abnormal observations that skew the data distribution and arise due to inconsistent data entry or erroneous observations.
• Outliers can be of two types: univariate and multivariate.
• Methods used to detect outliers: hypothesis testing, Z-score method, DBSCAN clustering, Isolation Forest, linear regression models (PCA, LMS) and standard deviation.
• The most common measurement of overall error is the sum of the squares of the errors, or SSE (sum of squared errors). The line with the smallest SSE is called the least-squares regression line.
• Mean Squared Error (MSE) is calculated by taking the average of the squares of the differences between the original and predicted values of the data. It is also called the quadratic cost function, and it equals the SSE divided by the number of observations.
• The value of MSE is never negative. A value close to zero represents a better-quality estimator/predictor. An MSE of zero means that the predictor is a perfect predictor.

Q. Concept of Multivariate Regression. State the advantages and disadvantages of Multivariate Regression.

Ans.: Multivariate regression is a statistical technique used to examine the relationship between multiple independent variables (predictors) and a single dependent variable. It extends the idea of simple linear regression, which deals with the relationship between two variables, to situations where more than two variables are involved.
Advantages of Multivariate Regression:
1. Captures Complex Relationships: Multivariate regression allows for the examination of complex relationships among multiple variables. This is particularly useful in scenarios where the outcome is influenced by more than one factor.
2. Improved Predictive Accuracy: By considering multiple predictors, multivariate regression models can potentially provide more accurate predictions than simple linear regression models. This is especially true when there are interactions or dependencies among the predictors.
Disadvantages of Multivariate Regression:
1. Assumption of Linearity: Like simple linear regression, multivariate regression assumes a linear relationship between the predictors and the dependent variable. If this assumption is violated, the model may not accurately represent the underlying data.
2. Multicollinearity: Multivariate regression is sensitive to multicollinearity, which occurs when two or more independent variables are highly correlated. This can make it challenging to assess the individual contribution of each predictor and can lead to unstable coefficient estimates.
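To make the link between the MSE cost function, gradient descent and regression with several predictors concrete, here is a minimal sketch in plain Python. The function names, the learning rate, the iteration count and the toy data (generated from y = 1 + 2*x1 + 3*x2 with no noise) are illustrative assumptions, not part of the original notes.

```python
def predict(weights, bias, row):
    """Linear model: bias + w1*x1 + w2*x2 + ..."""
    return bias + sum(w * x for w, x in zip(weights, row))


def mse(weights, bias, X, y):
    """Mean squared error between predictions and targets."""
    errors = [predict(weights, bias, row) - t for row, t in zip(X, y)]
    return sum(e * e for e in errors) / len(y)


def fit_gradient_descent(X, y, learning_rate=0.05, n_iters=5000):
    """Batch gradient descent on the MSE of a linear model."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(n_iters):
        errors = [predict(weights, bias, row) - t for row, t in zip(X, y)]
        # Partial derivatives of the MSE with respect to each weight and the bias.
        grad_w = [2.0 / n * sum(e * row[j] for e, row in zip(errors, X))
                  for j in range(d)]
        grad_b = 2.0 / n * sum(errors)
        # Step in the direction opposite to the gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Hypothetical toy data: two predictors, targets from y = 1 + 2*x1 + 3*x2.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
y = [1.0 + 2.0 * a + 3.0 * b for a, b in X]

weights, bias = fit_gradient_descent(X, y)
print(weights, bias, mse(weights, bias, X, y))
```

On this noise-free data the fitted coefficients should approach the generating values 2, 3 and 1, and the MSE should approach zero. With real data the learning rate usually needs tuning, and strongly correlated predictors slow convergence.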
Q. Bayesian Network

Ans.:
• Bayesian belief networks represent the full joint distribution over the variables more compactly, with a smaller number of parameters.
• They take advantage of conditional and marginal independences among random variables.
• If A and B are independent, then P(A, B) = P(A) P(B).
• If A and B are conditionally independent given C, then P(A, B | C) = P(A | C) P(B | C) and P(A | C, B) = P(A | C).

Q. Advantages and disadvantages of Naive Bayes Classifier:

Advantages:
• Simple to implement.
• If the conditional independence assumption holds, it can give very good results.
• It is not sensitive to noisy features.
• It is less prone to overfitting than more complex models.
• Suitable for large datasets.
Disadvantages:
• It assumes that all the attributes are independent, which rarely happens in real life. This limits the application of the algorithm in real-world situations.
• Its probability estimates can be inaccurate.
• The Naive Bayes algorithm has trouble with the "zero-frequency problem". It happens when a categorical value was not observed in the training dataset, so the model assigns it zero probability.

Q. Define and explain the following terms: 1) Minority Class 2) Gini Index 3) Entropy 4) Information Gain

i) Minority Class: In the context of classification problems, the minority class refers to the class or category in a dataset that has fewer instances or examples compared to the majority class. Imbalances in class distribution can affect the performance of machine learning models, particularly in situations where the algorithm might be biased towards predicting the majority class.
ii) Gini Index: The Gini index is a metric used in decision tree algorithms to evaluate the impurity or disorder of a set of data points. In the context of decision trees, each node represents a subset of the data, and the Gini index measures how often a randomly chosen element would be incorrectly labeled if it was randomly labeled according to the distribution of labels in that node.
iii) Entropy: Entropy is a measure of disorder or impurity in a set of data. In the context of decision trees and information theory, entropy is used to quantify the uncertainty associated with a random variable. In decision trees, the goal is to reduce entropy at each node by creating splits that result in more homogeneous subsets.
iv) Information Gain: Information gain is a concept used in decision tree algorithms to measure the effectiveness of a particular attribute in reducing uncertainty (entropy or Gini index) about the classification of data points. When building a decision tree, the algorithm evaluates different attributes and selects the one that provides the highest information gain.

ID3 Decision Tree Algorithm
The ID3 (Iterative Dichotomiser 3) algorithm is a widely used decision tree algorithm for classification tasks. It was developed by Ross Quinlan in 1986 and is an example of a top-down, greedy approach to building decision trees. The ID3 algorithm builds a decision tree by recursively selecting the feature that best splits the data at each node, aiming to maximize information gain.
Concepts
1. Decision Tree: A flowchart-like structure where:
• Each internal node represents a feature (or attribute).
• Each branch represents a decision rule based on that feature.
• Each leaf node represents a class label (outcome).
Steps in ID3 Algorithm
1. Select the best feature based on the information gain (highest gain).
2. Split the dataset into subsets based on the selected feature.
3. Repeat the process for each subset until:
• All instances in the subset belong to the same class.
• There are no more features to split on.
4. Prune the tree if needed (optional, for better generalization).
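The impurity measures and the feature-selection step of ID3 described above can be sketched as follows. This is a minimal illustration in plain Python, assuming categorical features stored in dictionaries; the function names and the four-row toy dataset are hypothetical and not taken from the notes.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index: chance of mislabeling a random element of the node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy of the parent minus the weighted entropy of the subsets."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    """ID3 step 1: pick the feature with the highest information gain."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Hypothetical toy data: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rain", "windy": False},
    {"outlook": "rain", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

print(entropy(labels), gini(labels))
print(best_feature(rows, labels, ["outlook", "windy"]))
```

A full ID3 implementation would call `best_feature` recursively on each subset until the stopping conditions in step 3 are met. The sketch covers only the selection step.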
Q. Explain types of linkage: single linkage, complete linkage and average linkage.

Ans.: In the agglomerative hierarchical approach, we start by defining each data point to be a cluster and combine existing clusters at each step.
• Three of the methods for doing this are:
1. Single linkage: Smallest pairwise distance between elements from each cluster.
• Also referred to as the nearest neighbour or minimum method.
• This measure defines the distance between two clusters as the minimum distance found between one case from the first cluster and one case from the second cluster. For example, if cluster 1 contains cases a and b, and cluster 2 contains cases c, d, and e, then the distance between cluster 1 and cluster 2 would be the smallest distance found between the following pairs of cases: (a, c), (a, d), (a, e), (b, c), (b, d), and (b, e).
2. Complete linkage: Largest distance between elements from each cluster.
• Also referred to as the furthest neighbour or maximum method. This measure is similar to the single linkage measure described above, but instead of searching for the minimum distance between pairs of cases, it considers the furthest distance between pairs of cases.
• Although this avoids the chaining problem of single linkage, it creates another problem: it is sensitive to outliers, because a single distant case can dominate the distance between two clusters.
3. Average linkage: The average distance between elements from each cluster.
• Also referred to as the unweighted pair-group method using arithmetic averages (UPGMA).
• To overcome the limitations of single and complete linkage, Sokal and Michener proposed taking an average of the distance values between pairs of cases.

Q. Explain the following terms: i) Rule ii) Support iii) Lift iv) Confidence.

i) Rule:
• In data mining, a rule is a statement that describes the association between two or more items in a dataset. These rules are often expressed in the form "If A then B", indicating that the occurrence of item A is associated with the occurrence of item B.
ii) Support:
• Support is a measure of how frequently a particular set of items appears in a dataset. It helps identify how popular or common a rule is within the dataset. The support of a rule is calculated as the ratio of the number of transactions containing all the items in the rule to the total number of transactions in the dataset. Higher support values indicate that the rule is more significant or prevalent.
iii) Lift:
• Lift is a measure of how much more likely item B is to be bought when item A is bought, compared to how often item B is bought overall. It is calculated as the ratio of the observed support of the rule to the expected support if the items were independent. A lift value greater than 1 indicates that the items are positively correlated, meaning they occur together more often than expected by chance.
iv) Confidence:
• Confidence is a measure of the reliability or strength of a rule. It indicates the probability that item B will be bought when item A is bought. Confidence is calculated as the ratio of the support for the combined occurrence of items A and B to the support for item A alone. Higher confidence values imply a stronger association between items.
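As a rough worked example of the support, confidence and lift definitions for a rule "If A then B", the sketch below computes all three on a small basket dataset. The function names and the five transactions are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(A and B) / support(A) for the rule A -> B."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """confidence(A -> B) / support(B); 1.0 means A and B look independent."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))


# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

rule = ({"bread"}, {"milk"})
print(support(transactions, rule[0] | rule[1]))  # 3 of 5 baskets
print(confidence(transactions, *rule))
print(lift(transactions, *rule))
```

Here bread and milk each appear in 4 of 5 baskets and together in 3 of 5, so the rule bread -> milk has support 0.6, confidence 0.75 and a lift slightly below 1. By the definition above, the two items co-occur a little less often than independence would predict.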
Q. What is a biological neuron? Explain with a diagram and its components.

Ans.: Biological neurons, consisting of a cell body, axons, dendrites and synapses, are able to process and transmit neural activation.
• Artificial neural systems are inspired by biological neural systems. The elementary building block of biological neural systems is the neuron.
• The brain is a collection of tens of billions of interconnected neurons (a common estimate is about 86 billion). Each neuron is a cell that uses biochemical reactions to receive, process and transmit information.
• The single cell neuron consists of the cell body or soma, the dendrites and the axon. The dendrites receive signals from the axons of other neurons. The small space between the axon of one neuron and the dendrite of another is the synapse. The afferent dendrites conduct impulses toward the soma. The efferent axon conducts impulses away from the soma.
Basic Components of Biological Neurons
1. The majority of neurons encode their activations or outputs as a series of brief electrical pulses (i.e. spikes or action potentials).
2. The neuron's cell body (soma) processes the incoming activations and converts them into output activations.
3. The neuron's nucleus contains the genetic material in the form of DNA. This exists in most types of cells, not just neurons.

Q. Explain what deep learning is and its different architectures. State the various applications of deep learning.

Ans.: Deep learning architectures are RNN, LSTM, GRU, CNN, DBN and DSN.
• Recurrent Neural Networks: RNN is one of the fundamental network architectures from which other deep learning architectures are built. RNNs can use their internal state (memory) to process variable-length sequences of inputs.
• Long Short-Term Memory: LSTM has feedback connections. This means that it can process not only single data points (such as images) but also entire sequences of data.
• Gated Recurrent Unit: A GRU is a gated recurrent architecture closely related to the LSTM, with fewer gates and parameters. GRUs are often used on smaller datasets, where they can show better performance.
• Convolutional Neural Networks: This architecture is commonly used for image processing, image recognition, video analysis, and NLP. It can take in an input image, assign importance to various aspects/objects in the image, and differentiate one from the others. CNNs consist of an input and an output layer, as well as multiple hidden layers. The CNN's hidden layers typically consist of a series of convolutional layers.
• Deep Belief Network: DBN is a multilayer network in which each pair of connected layers is a Restricted Boltzmann Machine.
Q. Explain the architecture of feed forward neural network. State its limitations.

Ans.: A feed forward neural network is an artificial neural network in which the connections between nodes do not form a cycle. The feed forward model is the simplest form of neural network, as information is only processed in one direction. While the data may pass through multiple hidden nodes, it always moves in one direction and never backwards.
• They are called feed forward because information only travels forward in the network (no loops), first through the input nodes, then through the hidden nodes (if present) and finally through the output nodes.
• Feed-forward networks tend to be simple networks that associate inputs with outputs. They can be used in pattern recognition. This type of organization is represented as bottom-up or top-down.
• Input layer: contains one or more input nodes. For example, suppose we want to predict whether it will rain tomorrow and base our decision on two variables, humidity and wind speed. In that case, our first input would be the value for humidity, and the second input would be the value for wind speed.
• Hidden layer: contains an activation function. The hidden layers are positioned between the input and the output layer. The number of hidden layers depends on the type of model. Hidden layers have several neurons that apply transformations to the input before passing it on. The weights in the network are updated during training to improve its predictions.
• Output layer: contains one or more output nodes.
Feed forward neural networks are primarily used for supervised learning in cases where the data to be learned is neither sequential nor time-dependent.
Limitations:
1. Loss of neighborhood information.
2. More parameters to optimize.
3. These networks require large amounts of data in order to function properly, as well as a high level of computational power.

Q. What is a perceptron? Explain multilayer perceptron.

Ans.: An arrangement of one input layer of McCulloch-Pitts neurons feeding forward to one output layer of McCulloch-Pitts neurons is known as a perceptron.
• The perceptron is a feed-forward network with one output neuron that learns a separating hyperplane in a pattern space.
Multilayer perceptron
• A multilayer perceptron is a neural network that can learn both linear and non-linear relationships in data. It has input and output layers, and one or more hidden layers with many neurons stacked together.
• The multilayer perceptron falls under the category of feed-forward algorithms, because inputs are combined with the initial weights in a weighted sum and subjected to the activation function.
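The forward pass through input, hidden and output layers (weighted sum followed by an activation function) can be sketched as below. The weights are hand-picked so that the network computes XOR, a function no single perceptron can represent. These weights, the step activation and the function names are illustrative assumptions, and no training is performed.

```python
def step(z):
    """Threshold activation: fire (1) when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0


def layer(inputs, weights, biases, activation):
    """One feed-forward layer: weighted sum plus bias, then activation."""
    return [activation(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]


def forward(inputs):
    """Two-input network with one hidden layer and one output neuron."""
    # Hidden neuron 1 acts as OR, hidden neuron 2 acts as AND (hand-picked weights).
    hidden = layer(inputs, [[1, 1], [1, 1]], [-0.5, -1.5], step)
    # Output fires when OR is on and AND is off, which is XOR.
    return layer(hidden, [[1, -1]], [-0.5], step)[0]


for a in (0, 1):
    for b in (0, 1):
        print(a, b, forward([a, b]))
```

Information moves strictly from inputs to hidden layer to output, with no loops. In practice the weights would be learned from data rather than set by hand, and a differentiable activation would usually replace the step function.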
Q. What is ANN (artificial neural network)? Explain the McCulloch-Pitts neuron.

Ans.: An Artificial Neural Network (ANN) is a computational system inspired by the structure, processing method and learning ability of a biological brain. An artificial neural network is composed of many artificial neurons that are linked together according to a specific network architecture.
• ANNs do not execute programmed instructions; they respond in parallel to the pattern of inputs presented to them. There are also no separate memory addresses for storing data.
McCulloch-Pitts neuron: The first mathematical model of a biological neuron was presented by McCulloch and Pitts. This model is known as the McCulloch-Pitts model. It is the basic building block of a neural network. A directed weighted graph is used for connecting neurons.
• McCulloch and Pitts describe a neuron as a logical threshold element with two possible states. Such a threshold element has "N" input channels and one output channel. An input channel is either active (input 1) or silent (input 0). The activity states of all input channels thus encode the input information as a binary sequence of N bits.

Q.10 Write K-means algorithm.

Ans.: K-means clustering is a heuristic method. Here each cluster is represented by the center of the cluster.
• "K" stands for the number of clusters. It is typically a user input to the algorithm; some criteria can be used to estimate K automatically.
• Given K, the K-means algorithm consists of four steps:
1. Select initial centroids at random.
2. Assign each object to the cluster with the nearest centroid.
3. Compute each centroid as the mean of the objects assigned to it.
4. Repeat the previous 2 steps until no change.
Centroid: In the context of the K-means algorithm, a centroid is the mean position of all the points in a cluster. During each iteration of the algorithm, the centroids are recalculated as the average of all the data points assigned to that cluster.
Medoid: While a centroid is the mean position, a medoid is the most central point in a cluster, representing the cluster's central tendency. Unlike the centroid, which might not necessarily be an actual data point, the medoid is an existing data point within the cluster.
Distance Measures: Different distance measures are used to quantify the dissimilarity or similarity between two data points. Some common distance measures include:
1. Euclidean Distance:
Euclidean distance is the straight-line distance between two points in Euclidean space. For two points (x1, y1) and (x2, y2), the Euclidean distance is given by:
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
2. Manhattan Distance (L1 Norm):
Also known as the city block or taxicab distance, Manhattan distance is the sum of absolute differences between corresponding coordinates. For the same two points it is $d = |x_2 - x_1| + |y_2 - y_1|$.
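The K-means steps and the two distance measures can be sketched in plain Python as follows. To keep the example reproducible, the initial centroids are passed in explicitly rather than chosen at random as in step 1; that choice, the function names and the six toy points are assumptions made for illustration.

```python
from math import sqrt


def euclidean(p, q):
    """Straight-line distance between two points."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences (L1 norm)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def mean_point(points):
    """Centroid: coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def k_means(points, centroids, distance=euclidean, max_iters=100):
    """Alternate assignment and centroid update until centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iters):
        # Step 2: assign each point to the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid; keep the old one if a cluster is empty.
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 4: stop when nothing changes.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical toy data: two well-separated groups of three points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

Passing `distance=manhattan` switches only the assignment rule. The centroid update still uses the mean, which is the usual K-means definition; a medoid-based variant would instead pick an existing data point as each cluster's center.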
