Types and Techniques in Machine Learning

Q: Explain the significance of the decision boundary in logistic regression and how polynomial features can alter its representation.

In logistic regression, the decision boundary is the set of inputs at which the hypothesis outputs exactly the threshold probability (usually 0.5); inputs on one side are classified as positive and inputs on the other side as negative. With the original features alone this boundary is linear (a line or hyperplane). Adding polynomial features, such as squares and products of the original features, lets the boundary take a non-linear shape such as a circle or ellipse, so the model can fit data that is not linearly separable.

Q: How do support vector machines (SVMs) decide on the optimal decision boundary for classification, and what role does the parameter 'C' play?

Support Vector Machines choose the decision boundary that maximizes the margin, that is, the distance from the boundary to the nearest training points of each class (the support vectors). The parameter 'C' controls the trade-off between classifying every training example correctly and keeping the margin wide. A large 'C' penalizes training errors heavily, which gives a narrower margin with lower bias and higher variance. Conversely, a small 'C' tolerates some misclassified points in exchange for a wider margin, giving a simpler decision boundary and potentially better generalization.

Q: How does the use of principal component analysis (PCA) aid in dimensionality reduction, and what are its computational benefits?

Principal Component Analysis (PCA) aids in dimensionality reduction by projecting the original n-dimensional data onto k dimensions (k < n) chosen to retain as much of the variance as possible, thereby minimizing information loss. The new coordinate system is defined by the principal components. The computational benefits include reduced storage and processing requirements and faster training; with fewer inputs the resulting models are simpler and may overfit less, and the more compact representation can sometimes improve model performance.

Q: What is K-Means Clustering, and how does it function differently from classification algorithms?

K-Means Clustering is an unsupervised learning algorithm that partitions a dataset into K distinct clusters based on feature similarity. Unlike classification algorithms, which learn from examples with predefined labels, K-Means does not require labeled inputs. Instead, it alternates between assigning each data point to its nearest centroid and moving each centroid to the mean of its assigned points, which reduces the variance within each cluster, until the assignments stop changing. This method is particularly useful for discovering underlying patterns or natural groupings within data.

Q: Compare and contrast feed-forward neural networks and recurrent neural networks in terms of their architectures and applications.

Feed-forward neural networks consist of layers in which information flows in one direction, from input to output, through hidden layers. They are commonly used for tasks like image classification or simple predictive analytics. In contrast, recurrent neural networks (RNNs) have cycles in their connections, allowing information to persist over time, which makes them suitable for sequential data such as time series prediction or natural language processing. RNNs are more powerful because they can handle sequences, but they are also more difficult to train than feed-forward networks.

Q: How does feature scaling affect the performance of the gradient descent algorithm in machine learning?

Feature scaling significantly affects the performance of gradient descent. Without it, features with larger ranges dominate the updates, so the algorithm converges slowly or, if the learning rate suits only the small-range features, can even diverge. Bringing all features into roughly the same interval makes the optimization more stable and lets gradient descent reach the optimal solution in fewer iterations.

Q: What are the main types of machine learning and how do they fundamentally differ?

The main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning. Supervised learning trains a model on a labeled dataset, meaning the output for each input example is known, and the goal is to learn a mapping from inputs to outputs. In unsupervised learning, the model is trained on an unlabeled dataset, and the goal is to infer the natural structure present within the data points, for example through clustering. Reinforcement learning involves an agent that interacts with an environment, learns by trial and error, and is rewarded or punished for the actions it takes so that its behavior improves over time.

Q: What is the concept of 'batch gradient descent' in optimization methods, and how does it compare to other forms like stochastic and mini-batch gradient descent?

Batch gradient descent is an optimization method in which the gradient over the entire training dataset is computed to update the model parameters in each iteration. This contrasts with stochastic gradient descent, which updates the parameters using a single data point per iteration, and mini-batch gradient descent, which updates them using a small subset of the data. With a suitable learning rate, batch gradient descent converges steadily to the global minimum of a convex cost function, but each iteration is computationally expensive on large datasets. Stochastic gradient descent is much cheaper per update, and its noisy steps can help it escape shallow local minima, while mini-batch gradient descent offers a balance between the stability of updates and computational efficiency.

Q: Discuss the importance of error analysis in machine learning model development and suggest techniques for conducting it effectively.

Error analysis is crucial in machine learning because it shows where a model is failing and so guides further improvements. It involves examining the errors the model makes on the validation or test data to find patterns or recurring problems. Techniques for effective error analysis include plotting learning curves to judge whether more data or more features are likely to help, manually reviewing misclassified examples to uncover potential biases or erroneous patterns, and refining features based on these insights. This continuous feedback loop helps improve the model's predictive capabilities.

Q: In what ways can regularization help address the problem of overfitting in machine learning models?

Regularization helps address overfitting by penalizing large coefficients in the model, which limits the model's complexity. This is achieved by adding a regularization term to the loss function, which discourages overly complex models that fit the training data too closely but fail to generalize to new data. By reducing the magnitude of the parameters (theta), regularization reduces variance at the cost of some bias and helps the model generalize better.

The document provides an overview of machine learning, detailing its main types: supervised, unsupervised, and reinforcement learning, along with key algorithms such as linear regression, logistic regression, and neural networks. It discusses concepts like gradient descent, regularization, and error analysis, as well as advanced topics like support vector machines and collaborative filtering. Additionally, it covers neural network architectures, including feed-forward and recurrent networks, and introduces fundamental statistical concepts relevant to machine learning.

Uploaded by

Richard Ardelean

Machine Learning

Ardelean Eugen-Richard

30433

Machine Learning - Andrew Ng

Neural Networks - Geoffrey Hinton

Statistics One – Andrew Conway

Machine Learning Course

My definition for machine learning: a way of creating programs that can find patterns more complicated
than humans can.

There are several types of machine learning based on different criteria.

The 3 main types would be:

1. Supervised Learning – the label/result is known for each of the training examples and we try to
generate a function that maps inputs to outputs (predicts outputs) according to these training
examples (ex. Linear Regression)
2. Unsupervised Learning – we do not know the label/result of any of the training examples, we are
trying to separate the examples into different groups (ex. K-Means Clustering)
3. Reinforcement Learning – a reward/punishment system is implemented with points, each action
being worth a certain number of points; the agent tries to optimize its actions by trial and error and
learns from past experience

Supervised Learning Algorithms

Linear Regression

As the name says, a line is used to fit the data. To find this line we use the equation Y=w*X+b, where Y is
the vector of labels, X the vector of features (with more than 1 feature it becomes a matrix), w is the
weight (a vector when there are several features) and b is the bias.

Can be univariate(one feature) or multivariate(multiple features).

We use a function called “hypothesis” to transform the features into the label.

The hypothesis is calculated by multiplying each of the features by a parameter theta and then adding
the products up, together with a constant term (the intercept).

For one feature, the parameter multiplying that feature is the slope of the line.

Using this hypothesis we calculate the cost function, which when minimized gives us the best line to fit
the data. The cost function is the sum of the squares of the differences between the prediction (hypothesis)
and the actual result (label = y).

Gradient Descent = the method for minimizing the cost function. It works by simultaneously updating
all the parameters theta of the hypothesis: from each theta “j” we subtract the learning rate multiplied
by the partial derivative of the cost function with respect to that theta “j”.
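The update rule above can be sketched for the one-feature case as follows. This is a minimal illustration, not the course's own code: the names `hypothesis`, `cost` and `gradient_descent`, the 1/(2m) scaling of the cost, the learning rate and the toy data are all assumptions made for the example.

```python
def hypothesis(theta0, theta1, x):
    """Prediction for one feature: intercept plus slope times x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared-error cost, scaled by 1/(2m) (an assumed, common convention)."""
    m = len(xs)
    return sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Batch gradient descent with a simultaneous update of both parameters."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [hypothesis(theta0, theta1, x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dtheta1
        # Both gradients come from the old parameters, so the update is simultaneous.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # toy data generated from y = 1 + 2x
theta0, theta1 = gradient_descent(xs, ys)
```

On this noise-free data the parameters approach the intercept 1 and the slope 2, and the cost approaches zero.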

Feature scaling – bring all features to roughly the same interval, either by dividing by the range or
by subtracting the mean and then dividing by the range
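As a small sketch of the second variant (subtract the mean, then divide by the range), assuming a hypothetical helper name `mean_normalize` and made-up feature values:

```python
def mean_normalize(values):
    """Subtract the mean, then divide by the range (max minus min)."""
    mean = sum(values) / len(values)
    value_range = max(values) - min(values)
    if value_range == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mean) / value_range for v in values]


sizes = [1000.0, 1500.0, 2000.0]  # hypothetical feature with a large range
scaled = mean_normalize(sizes)
```

After scaling, the values are centered on zero and span an interval of width 1.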

Learning rate – for a good value, the cost should decrease on each iteration

Polynomial regression – adding new features as powers of other features (ex. square, square root)

Normal equation

Gradient Descent                     Normal Equation

Need to choose learning rate         No need to choose learning rate
Needs many iterations                Doesn’t need many iterations
Works well with a lot of features    Slow if many features
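The notes do not write the normal equation out; in its usual form it computes theta = (X^T X)^-1 X^T y in one step. The sketch below is an assumption-laden illustration in plain Python: `solve_linear_system` and `normal_equation` are hypothetical helper names, and the code assumes X^T X is invertible. Solving the n-by-n system is also the step that becomes slow when there are many features.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]  # assumes a non-singular matrix
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def normal_equation(features, labels):
    """Return theta minimizing the squared error; theta[0] is the intercept."""
    x = [[1.0] + list(row) for row in features]  # prepend the constant feature x0 = 1
    n = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * y for r, y in zip(x, labels)) for i in range(n)]
    return solve_linear_system(xtx, xty)  # solves (X^T X) theta = X^T y


theta = normal_equation([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0])
```

For this toy data (generated from y = 1 + 2x) the result is an intercept of 1 and a slope of 2, with no learning rate and no iterations.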

Logistic Regression

A way of classifying whether an example is of one kind or another. The label is usually a fixed number (ex. 0, 1), but
the hypothesis is a real number that estimates the probability of the positive class based on the training data;
the user chooses a threshold to turn that probability into a label.

Decision boundary – the set of points where the hypothesis equals the threshold, i.e. the line separating the region predicted as 0 from the region predicted as 1

Using polynomial features you can get non-linear decision boundaries
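A brief sketch of a non-linear decision boundary, assuming two original features x1 and x2, the added polynomial features x1^2 and x2^2, and hand-picked (not learned) parameters; the function names are illustrative only.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x1, x2):
    """Estimated probability of class 1 using the features 1, x1, x2, x1^2, x2^2."""
    features = [1.0, x1, x2, x1 ** 2, x2 ** 2]
    return sigmoid(sum(t * f for t, f in zip(theta, features)))


def predict(theta, x1, x2, threshold=0.5):
    """Predict 1 when the hypothesis reaches the threshold, otherwise 0."""
    return 1 if hypothesis(theta, x1, x2) >= threshold else 0


# Hypothetical parameters: the boundary is the circle x1^2 + x2^2 = 1.
theta = [-1.0, 0.0, 0.0, 1.0, 1.0]
inside = predict(theta, 0.0, 0.0)   # z = -1, so the hypothesis is below 0.5
outside = predict(theta, 2.0, 0.0)  # z = 3, so the hypothesis is above 0.5
```

Points inside the unit circle are predicted as 0 and points outside as 1, which no straight line in the (x1, x2) plane could achieve.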

Gradient Descent

For Multiclass Classification we have One-vs-all, which means we train one logistic regression classifier for each
class, treating all the other classes as a single negative class; a new example gets the class whose classifier is most confident.

Regularization

Reduce overfitting

• Reduce number of features
  - Select which features remain
  - Model selection algorithm
• Regularization
  - Reduce magnitude/values of parameters theta

Small values for parameters theta – less prone to overfitting

If lambda (regularization parameter) very large, it will underfit
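The notes do not write the regularized cost out. As a sketch for linear regression, assuming the squared-error cost scaled by 1/(2m), m training examples, n features, and the common convention that the intercept theta_0 is not penalized:

```latex
J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]
```

With lambda = 0 this is the unregularized cost; as lambda grows, the second sum dominates and pushes theta_1 ... theta_n toward zero, leaving an almost constant hypothesis that underfits.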

Neural Networks

Sigmoid (logistic) activation function

Uses activation of layers: each neuron of a hidden layer applies the activation function to a combination of the
neurons from the previous layer, each multiplied by its theta parameter.

Simplest has the input layer, one hidden layer and the output layer
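A minimal sketch of such a three-layer network with sigmoid activations. The parameters below are hand-picked rather than learned (an assumption for illustration) so that the network computes the logical XNOR of two binary inputs; `layer` and `forward` are hypothetical names.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(thetas, inputs):
    """Activations of one layer; each row of thetas is [bias, weight_1, weight_2, ...]."""
    with_bias = [1.0] + list(inputs)
    return [sigmoid(sum(t * a for t, a in zip(row, with_bias))) for row in thetas]


# Hand-picked (not learned) parameters that make the network compute XNOR.
HIDDEN_THETAS = [[-30.0, 20.0, 20.0],   # roughly x1 AND x2
                 [10.0, -20.0, -20.0]]  # roughly (NOT x1) AND (NOT x2)
OUTPUT_THETAS = [[-10.0, 20.0, 20.0]]   # roughly OR of the two hidden units


def forward(x1, x2):
    """Forward propagation: input layer -> hidden layer -> output layer."""
    hidden = layer(HIDDEN_THETAS, [x1, x2])
    return layer(OUTPUT_THETAS, hidden)[0]


outputs = {(a, b): round(forward(a, b)) for a in (0, 1) for b in (0, 1)}
```

The rounded output is 1 when both inputs agree and 0 otherwise, something a single sigmoid unit on the raw inputs cannot represent.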

Small – computationally cheaper, prone to underfitting

Large – computationally expensive, prone to overfitting, can use regularization

Training

• Randomly initialize weights
• Implement forward propagation to get hypothesis for any example
• Implement code to compute cost function
• Implement back propagation to compute partial derivatives
• For i from 1 to number of training examples
  - Perform forward propagation and back propagation using example i
• Use gradient descent/advanced optimization method with back propagation to try to minimize the
cost function

Underfitting – when, because the hypothesis is too simple or because of the quality of the training examples,
even the training accuracy is very low.

Train cost will be high; cross-validation and test cost will be approx. the same as the train cost

Solutions:

• Try getting additional features
• Try adding polynomial features
• Try decreasing regularization parameter

Overfitting – the hypothesis fits the training set very well, but fails to generalize (predict on new data,
test data)

Train cost will be low, cross-validation cost will be much higher than train cost

Solutions:

• Get more training examples
• Try smaller set of features
• Try increasing regularization parameter

Error Analysis

o Start with simple implementation and test on cross validation dataset
o Plot learning curve to decide if more data, features will help
o Error Analysis: manually examine examples (cross validation set) that your algorithm made errors on. See if you spot a pattern

Precision = true positives / number of predicted positives (true positive + false positives)

Recall = true positive / number of actual positives (true positive + false negatives)

F1 Score: 2*PR / (P+R)
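The three formulas can be sketched directly from lists of actual and predicted labels. The function name and the toy labels are assumptions for illustration; returning 0.0 when a denominator is zero is likewise just one possible convention.

```python
def precision_recall_f1(actual, predicted):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(actual, predicted)
```

Here there are 2 true positives, 1 false positive and 2 false negatives, so precision is 2/3, recall is 1/2 and the F1 score is 4/7.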

Support Vector Machines (SVM)

The best line is the one that has the largest margin (distance to the nearest data points)
C=1/lambda (C plays the role of the inverse of the regularization parameter)

SVM is used for classification, the algorithm draws lines between the data

Large C: low bias, high variance

Small C: high bias, low variance

Large sigma^2 (Gaussian kernel): high bias, low variance

Small sigma^2 (Gaussian kernel): low bias, high variance

Unsupervised Learning – Clustering

K-Means Algorithm

Separate the data into K clusters

• Start by initializing the centroids of the clusters randomly
• (Repeat) Then you assign each training example to the closest cluster centroid
• (Repeat) And you move each centroid to the mean of the points assigned to its cluster k

Should have a smaller number of clusters than training examples

You can randomly pick K training examples and initialize cluster centroids with those
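The two repeated steps can be sketched as below. This is an illustrative version under assumptions: the initial centroids are passed in explicitly so the run is reproducible, the number of iterations is fixed, and the names `squared_distance` and `k_means` are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, centroids, iterations=10):
    """Alternate the assignment step and the centroid-update step."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each example goes to its closest centroid.
        assignment = [min(range(len(centroids)),
                          key=lambda k: squared_distance(p, centroids[k]))
                      for p in points]
        # Update step: each centroid moves to the mean of its assigned examples.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
# Initialize the centroids with two of the training examples.
centroids, assignment = k_means(points, centroids=[points[0], points[2]])
```

With this data the first two points form one cluster and the last two the other, and each centroid ends at the mean of its two points.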

Optimal init:

Run 100 times (randomly initialize, run K-means, compute cost) and pick the clustering with the lowest cost

Dimensionality Reduction

Data compression: Reduce from 2D to 1D

Principal Component Analysis(PCA)

Reduce n-dimension to k-dimension: find k vectors onto which to project the data so as to minimize
the projection error
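For the 2D to 1D case, the direction to project onto can be found in closed form from the 2x2 covariance matrix. The sketch below is a hand-rolled illustration rather than a general PCA routine (which would normally rely on an eigenvalue or SVD solver); `pca_2d_to_1d` is a hypothetical name.

```python
import math


def pca_2d_to_1d(points):
    """Project 2-D points onto their first principal component."""
    m = len(points)
    mean_x = sum(p[0] for p in points) / m
    mean_y = sum(p[1] for p in points) / m
    centered = [(p[0] - mean_x, p[1] - mean_y) for p in points]
    # Entries of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / m
    syy = sum(y * y for _, y in centered) / m
    sxy = sum(x * y for x, y in centered) / m
    # Closed form for the direction of largest variance of a symmetric 2x2 matrix.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    direction = (math.cos(angle), math.sin(angle))
    # The 1-D representation is the coordinate of each point along that direction.
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, projected


direction, projected = pca_2d_to_1d([(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)])
```

These points lie exactly on the line y = x, so the direction found is the diagonal and the projection error is zero.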

Anomaly Detection

Choose features that you think might indicate anomalies

Fit parameters: mean and standard deviation of each feature

Given a new example x, compute p(x); it is an anomaly if p(x) < epsilon
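A sketch of these steps, assuming p(x) is modeled as a product of independent per-feature Gaussian densities. The function names, the toy data and the value of epsilon are assumptions for illustration, and the code assumes no feature has zero standard deviation.

```python
import math


def fit_gaussian(examples):
    """Estimate the mean and standard deviation of each feature."""
    m = len(examples)
    columns = list(zip(*examples))
    means = [sum(col) / m for col in columns]
    stds = [math.sqrt(sum((v - mu) ** 2 for v in col) / m)
            for col, mu in zip(columns, means)]
    return means, stds


def p(x, means, stds):
    """Product of per-feature Gaussian densities (features treated as independent)."""
    density = 1.0
    for value, mu, sigma in zip(x, means, stds):
        exponent = -((value - mu) ** 2) / (2 * sigma ** 2)
        density *= math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))
    return density


def is_anomaly(x, means, stds, epsilon):
    return p(x, means, stds) < epsilon


# Hypothetical training data: two features of normally behaving examples.
normal = [(9.0, 4.0), (10.0, 5.0), (11.0, 6.0), (10.0, 5.0)]
means, stds = fit_gaussian(normal)
epsilon = 1e-3  # assumed threshold; in practice it would be tuned on labeled examples
```

An example at the mean, such as (10, 5), gets a high density and is not flagged, while (15, 5) lies far out on the first feature and falls below epsilon.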

Collaborative Filtering

Incomplete training examples

Use features to estimate parameters theta and then parameters theta to estimate features

Different types of gradient descent

Batch gradient descent: Use all examples in each iteration

Stochastic gradient descent: Use 1 example in each iteration (useful when a lot of data)

Mini-batch gradient descent: Use b examples in each iteration

Map-reduce Batch gradient descent: the processing of the data is given to different computers to reduce
the time it takes to process
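The batch, stochastic and mini-batch variants differ only in how many examples feed each update, which the sketch below makes explicit through a single `batch_size` parameter. The function name, learning rate, number of epochs, fixed random seed and toy data are all assumptions for illustration; the map-reduce variant is not shown.

```python
import random


def gradient_descent(xs, ys, batch_size, alpha=0.05, epochs=2000, seed=0):
    """Fit y = theta0 + theta1 * x, updating on `batch_size` examples at a time."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [theta0 + theta1 * xs[i] - ys[i] for i in batch]
            grad0 = sum(errors) / len(batch)
            grad1 = sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # noise-free toy data from y = 1 + 2x

batch = gradient_descent(xs, ys, batch_size=len(xs))  # all examples per update
stochastic = gradient_descent(xs, ys, batch_size=1)   # one example per update
mini_batch = gradient_descent(xs, ys, batch_size=2)   # b = 2 examples per update
```

On this tiny noise-free dataset all three settings approach the same parameters; on large, noisy datasets the stochastic and mini-batch variants trade some stability for much cheaper updates.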

Getting additional data

Introduce distortions into existing examples (ex. audio – background noise, bad connection); it usually doesn’t help to add
purely random distortions

Getting more data should be tried only for low bias classifiers

Neural Networks Course

By far the commonest type of architecture in practical applications is a feed-forward neural network
where the information comes into the input units and flows in one direction through hidden layers until
it reaches the output units.

A much more interesting kind of architecture is a recurrent neural network in which information can flow
round in cycles. These networks can remember information for a long time. They can exhibit all sorts of
interesting oscillations but they are much more difficult to train in part because they are so much more
complicated in what they can do.

The last kind of architecture that I'll describe is a symmetrically-connected network, one in which the
weights are the same in both directions between two units.

The commonest type of neural network in practical applications is a feed-forward neural network. This
has some input units in the first layer at the bottom, some output units in the last layer at the top,
and one or more layers of hidden units.

If there's more than one layer of hidden units, we call them deep neural networks.

These networks compute a series of transformations between their input and their output. So at each
layer, you get a new representation of the input in which things that were similar in the previous layer
may have become less similar, or things that were dissimilar in the previous layer may have become
more similar.

Recurrent neural networks are much more powerful than feed-forward neural networks. They have
directed cycles in their connection graph. What this means is that if you start at a node or a
neuron and you follow the arrows, you can sometimes get back to the neuron you started at. They can
have very complicated dynamics, and this can make them very difficult to train.

Quite different from recurrent nets are symmetrically connected networks. In these the connections
between units have the same weight in both directions. Symmetric networks are much easier to analyze
than recurrent networks. This is mainly because they're more restricted in what they can do, and that's
because they obey an energy function. So they cannot, for example, model cycles. You can't get back to
where you started in one of these symmetric networks.

A perceptron is a particular example of a statistical pattern recognition system. There are actually
many different kinds of perceptrons, but the standard kind, which Rosenblatt called an alpha
perceptron, consists of some inputs which are then converted into feature activities. They might be
converted by things that look a bit like neurons, but that stage of the system does not learn. Once
you've got the activities of the features, you then learn some weights, so that you can take the feature
activities times the weights and decide whether or not it's an example of the class you're interested
in by seeing whether that sum of feature activities times learned weights is greater than a threshold.

Statistics Course

Independent variable = variable manipulated by the experimenter

Dependent variable = aspect of the world that the experimenter predicts will be affected by the independent variable

Double-blind experiment – neither the experimenter nor the participant knows who receives the placebo

Causality = why stuff happens, allows prediction, prevent bad, promote good

Descriptive statistics = procedures used to summarize, organize and simplify data

Inferential statistics = techniques that allow generalizations about population parameters based on
sample statistics

Non-normal distribution:

Positive skew – few at high score

Negative skew – few at low score
