BDA Unit 4

The document provides an overview of machine learning, detailing its definition, categories (supervised, unsupervised, reinforcement, and semi-supervised learning), and various tasks such as classification, regression, and clustering. It discusses the relationship between machine learning and data mining, emphasizing their distinct goals and methodologies. Additionally, it covers the k-nearest neighbors (kNN) algorithm, its implementation, advantages, disadvantages, and a case study on detecting prostate cancer using this algorithm.

Uploaded by

Ajay Kumar Reddy

UNIT IV

Machine Learning
Hypothesis testing and determining the multiple analytical methodologies

What is Machine Learning :-

Machine learning usually refers to changes in systems that perform tasks associated with
artificial intelligence (AI). Such tasks involve recognition, diagnosis, planning, robot control,
prediction, etc.
Machine learning is a branch of computer science that studies the design of algorithms that
can learn. Typical machine learning tasks are concept learning, function learning or
“predictive modeling”, clustering and finding predictive patterns. These tasks are learned
from available data, gathered for example through experience or instruction.
The expectation is that incorporating this experience will improve performance on the
task. The ultimate goal is for this improvement to become automatic, so that humans
no longer need to intervene.
Machine learning tasks are typically classified into three broad categories, depending on the
nature of the learning "signal" or "feedback" available to a learning system. These are:

• Supervised learning: The computer is presented with example inputs and their desired
outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to
outputs.
• Unsupervised learning: No labels are given to the learning algorithm, leaving it on its
own to find structure in its input. Unsupervised learning can be a goal in itself (discovering
hidden patterns in data) or a means towards an end.
• Reinforcement learning: A computer program interacts with a dynamic environment in
which it must achieve a certain goal (such as driving a vehicle), without a teacher explicitly
telling it whether it has come close to its goal or not. Another example is learning to play a
game by playing against an opponent.

Between supervised and unsupervised learning is semi-supervised learning, where the
teacher gives an incomplete training signal: a training set with some (often many) of the
target outputs missing. Transduction is a special case of this principle where the entire set of
problem instances is known at learning time, except that part of the targets is missing.
[Figure: A support vector machine is a classifier that divides its input space into two regions,
separated by a linear boundary. Here, it has learned to distinguish black and white circles.]

Among other categories of machine learning problems, learning to learn learns its own
inductive bias based on previous experience. Developmental learning, elaborated for robot
learning, generates its own sequences (also called curriculum) of learning situations to
cumulatively acquire repertoires of novel skills through autonomous self-exploration and
social interaction with human teachers, and using guidance mechanisms such as active
learning, maturation, motor synergies, and imitation.

Another categorization of machine learning tasks arises when one considers the desired
output of a machine-learned system.

• In classification, inputs are divided into two or more classes, and the learner must produce
a model that assigns unseen inputs to one or more (multi-label classification) of these
classes. This is typically tackled in a supervised way. Spam filtering is an example of
classification, where the inputs are email (or other) messages and the classes are "spam" and
"not spam".
• In regression, also a supervised problem, the outputs are continuous rather than discrete.
• In clustering, a set of inputs is to be divided into groups. Unlike in classification, the
groups are not known beforehand, making this typically an unsupervised task.
• Density estimation finds the distribution of inputs in some space.
• Dimensionality reduction simplifies inputs by mapping them into a lower-dimensional
space.
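
Two of the unsupervised tasks listed above can be sketched in a few lines of base R. This is only an illustration: the iris data set (bundled with R), the choice of 3 clusters and the choice of 2 principal components are assumptions made for the example, not part of the original notes.

```r
# Illustrative sketch: iris ships with base R; the data set, 3 clusters
# and 2 components are assumptions made for this example.
features <- iris[, 1:4]            # four numeric measurements, labels ignored

# Clustering: group the rows without using the Species labels.
set.seed(42)                       # k-means starts from random centres
clusters <- kmeans(scale(features), centers = 3, nstart = 10)
print(table(cluster = clusters$cluster, species = iris$Species))

# Dimensionality reduction: project 4 dimensions onto 2 principal components.
pca <- prcomp(features, center = TRUE, scale. = TRUE)
reduced <- pca$x[, 1:2]
print(dim(reduced))
```

The cross-table only compares the discovered groups with the known species after the fact; the labels play no part in forming the clusters.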

Machine learning and data mining often employ the same methods and overlap significantly.
They can be roughly distinguished as follows:
• Machine learning focuses on prediction, based on known properties learned from the
training data.
• Data mining focuses on the discovery of (previously) unknown properties in the data. This
is the analysis step of Knowledge Discovery in Databases.

The two areas overlap in many ways: data mining uses many machine learning methods, but
often with a slightly different goal in mind. On the other hand, machine learning also
employs data mining methods as "unsupervised learning" or as a preprocessing step to
improve learner accuracy. Much of the confusion between these two research communities
(which do often have separate conferences and separate journals, ECML PKDD being a
major exception) comes from the basic assumptions they work with: in machine learning,
performance is usually evaluated with respect to the ability to reproduce known knowledge,
while in Knowledge Discovery and Data Mining (KDD) the key task is the discovery of
previously unknown knowledge. Evaluated with respect to known knowledge, an uninformed
(unsupervised) method will easily be outperformed by supervised methods, while in a typical
KDD task, supervised methods cannot be used due to the unavailability of training data.

Machine learning also has intimate ties to optimization: many learning problems are
formulated as minimization of some loss function on a training set of examples. Loss
functions express the discrepancy between the predictions of the model being trained and the
actual problem instances.
For example, in classification, one wants to assign a label to instances, and models are trained
to correctly predict the pre-assigned labels of a set of examples.
The difference between the two fields arises from the goal of generalization: while
optimization algorithms can minimize the loss on a training set, machine learning is
concerned with minimizing the loss on unseen samples.
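
The distinction between loss on the training set and loss on unseen samples can be made concrete with a small sketch. The synthetic data, the 70/30 split and the use of mean squared error as the loss are all assumptions chosen for illustration.

```r
# Synthetic data: y depends linearly on x plus noise (assumed for illustration).
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 * x + 1 + rnorm(100, sd = 2)
dat <- data.frame(x = x, y = y)

train <- dat[1:70, ]     # examples the model is fitted on
test  <- dat[71:100, ]   # held-out examples standing in for "unseen" data

# Mean squared error: the loss that lm() minimises on the training set.
mse <- function(actual, predicted) mean((actual - predicted)^2)

fit <- lm(y ~ x, data = train)
train_loss <- mse(train$y, predict(fit, newdata = train))
test_loss  <- mse(test$y,  predict(fit, newdata = test))

cat("training loss:", train_loss, "\n")
cat("held-out loss:", test_loss, "\n")
```

The optimizer only ever sees `train`; the second number is the one that indicates how well the fitted rule generalizes.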

Train model using statistical/machine learning algorithms, Test model

To train the algorithm we feed it quality data known as a training set. A training set is the set
of training examples we'll use to train our machine learning algorithms.
Train the algorithm: This is where the machine learning takes place. This step and the
next step are where the “core” algorithms lie, depending on the algorithm. You feed the
algorithm good clean data from the first two steps and
extract knowledge or information. This knowledge you often store in a format that's readily
useable by a machine for the next two steps. In the case of unsupervised learning, there's no
training step because you don't have a target value. Everything is used in the next step.
Test the algorithm: This is where the information learned in the previous step is put to use.
When you're evaluating an algorithm, you'll test it to see how well it does. In the case of
supervised learning, you have some known values you can use to evaluate the algorithm. In
unsupervised learning, you may have to use some other metrics to evaluate the success.
Sample for prediction
For prediction various types of algorithms are used.
• Collect data. You could collect the samples by scraping a website and extracting data, or
you could get information from an RSS feed or an API. You could have a device collect wind
speed measurements and send them to you, or blood glucose levels, or anything you can
measure. The number of options is endless. To save some time and effort, you could use
publicly available data.
• Prepare the input data. Once you have this data, you need to make sure it's in a useable
format. The format we'll be using in this book is the Python list. We'll talk about Python
more in a little bit, and lists are reviewed in appendix A. The benefit of having this standard
format is that you can mix and match algorithms and data sources. You may need to do some
algorithm-specific formatting here. Some algorithms need features in a special format, some
algorithms can deal with target variables and features as strings, and some
need them to be integers. We'll get to this later, but the algorithm-specific formatting is
usually trivial compared to collecting data. One idea that naturally arises is combining
multiple classifiers. Methods that do this are known as ensemble methods or meta-algorithms.
Ensemble methods can take the form of using different algorithms, using the same algorithm
with different settings, or assigning different parts of the dataset to different classifiers.
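
One simple way to combine several classifiers is a majority vote over their predictions. The sketch below assumes three hypothetical classifiers (`clf_a`, `clf_b`, `clf_c`) whose outputs are made up for illustration; it shows only the voting step, not how the individual classifiers are trained.

```r
# Each column holds the labels predicted by one (hypothetical) classifier,
# each row is one test example. The values are made up for illustration.
predictions <- data.frame(
  clf_a = c("spam", "ham",  "spam", "ham"),
  clf_b = c("spam", "spam", "spam", "ham"),
  clf_c = c("ham",  "ham",  "spam", "spam"),
  stringsAsFactors = FALSE
)

# Return the most frequent label in a vector of votes.
# Ties are resolved in favour of the label that sorts first (table() order).
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

ensemble <- apply(predictions, 1, majority_vote)
print(ensemble)
```

Using an odd number of voters avoids ties in the two-class case; other combination rules (weighted votes, averaging of scores) follow the same pattern.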
Explore the chosen algorithms for more accuracy
Analyze the input data. This means looking at the data from the previous task. This could be as
simple as looking at the data you've parsed in a text editor to make sure that it was collected
and prepared properly, that the earlier steps are actually working, and that you don't have a
bunch of empty values. You can also look at the data to see if you can recognize any patterns or if there's
anything obvious, such as a few data points that are vastly different from the rest of the set.
Plotting data in one, two, or three dimensions can also help. But most of the time you'll have
more than three features and you can't easily plot the data across all features at one time. You
could, however, use some advanced methods we'll talk about later to distill multiple
dimensions down to two or three so you can visualize the data.
If you're working with a production system and you know what the data should look like, or
you trust its source, you can skip this step. This step takes human involvement, and for an
automated system you don't want human involvement. The value of this step is that it lets
you confirm that you don't have garbage coming in.
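
The two checks mentioned above, empty values and points that are vastly different from the rest, can be sketched as follows. The `readings` data frame is made up, and the 1.5 × IQR rule used to flag outliers is one common rule of thumb rather than anything prescribed by the text.

```r
# Tiny made-up data set with one missing value and one suspicious reading.
readings <- data.frame(
  wind_speed = c(12.1, 11.8, NA, 12.5, 95.0, 11.9),
  glucose    = c(5.4, 5.1, 5.6, 5.3, 5.2, 5.5)
)

# How many empty values does each column contain?
missing_per_column <- colSums(is.na(readings))
print(missing_per_column)

# Flag values lying more than 1.5 * IQR outside the quartiles.
flag_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  spread <- q[2] - q[1]
  !is.na(x) & (x < q[1] - 1.5 * spread | x > q[2] + 1.5 * spread)
}
outliers <- lapply(readings, flag_outliers)
print(readings$wind_speed[outliers$wind_speed])
```

A flagged value is only a prompt to look closer; whether it is an error or a genuine extreme still takes human judgement.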

What is kNN Algorithm?

Let's assume we have several groups of labeled samples. The items present in each group are
homogeneous in nature. Now, suppose we have an unlabeled example which needs to be
classified into one of the several labeled groups.

How do you do that?

Unhesitatingly, using the kNN algorithm.

k nearest neighbors is a simple algorithm that stores all available cases and classifies new
cases by a majority vote of their k nearest neighbors. This algorithm segregates unlabeled data points
into well defined groups.
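
A minimal from-scratch sketch of this idea in base R is shown below. The function name `knn_predict`, the toy training points and the tie-breaking rule are assumptions made for illustration; the case study later in these notes uses the knn() function from the 'class' package instead.

```r
# Classify one query point by majority vote among its k nearest
# training points (Euclidean distance). Ties go to the label that
# sorts first; a production implementation would handle them explicitly.
knn_predict <- function(train_x, train_labels, query, k = 3) {
  diffs <- sweep(as.matrix(train_x), 2, query)   # subtract query from every row
  distances <- sqrt(rowSums(diffs^2))
  nearest <- order(distances)[seq_len(k)]
  votes <- table(train_labels[nearest])
  names(votes)[which.max(votes)]
}

# Made-up 2-D training data with two groups.
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1),   # group "A"
                 c(8, 8), c(8, 9), c(9, 8))   # group "B"
train_labels <- c("A", "A", "A", "B", "B", "B")

print(knn_predict(train_x, train_labels, query = c(2, 2), k = 3))
print(knn_predict(train_x, train_labels, query = c(7, 9), k = 3))
```

Note that nothing is "trained" here: all the work happens at prediction time, when the stored cases are scanned for the nearest ones.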

How to select appropriate k value?

Choosing the number of nearest neighbors, i.e. determining the value of k, plays a significant
role in determining the efficacy of the model. The selection of k determines how well
the kNN algorithm generalizes from the data. A large k value
reduces the variance caused by noisy data, at the cost of
introducing bias: the learner tends to ignore smaller patterns which may
carry useful insights.

The following example will give you practical insight on selecting the appropriate k value.

Example of kNN Algorithm

Let's consider 10 'drinking items' which are rated on two parameters on a scale of 1 to 10.
The two parameters are “sweetness” and “fizziness”. This is more of a perception-based
rating and so may vary between individuals. I will use my own ratings (which might
differ from yours) for this illustration. The ratings of a few items look something like this:

“Sweetness” reflects the perceived sugar content of the items. “Fizziness”
reflects the presence of bubbles in the drink due to its carbon dioxide content.
Again, all these ratings are based on personal perception and are strictly relative.
From the above figure, it is clear we have bucketed the 10 items into 4 groups, namely
'COLD DRINKS', 'ENERGY DRINKS', 'HEALTH DRINKS' and 'HARD DRINKS'. The
question here is: into which group would 'Maaza' fall? This will be determined by
calculating distance.

Calculating Distance

Now, calculating the distance between 'Maaza' and its nearest neighbors ('ACTIV', 'Vodka',
'Pepsi' and 'Monster') requires a distance formula, the most popular
being the Euclidean distance formula, i.e. the shortest distance between 2 points, which may be
obtained using a ruler.
Using the co-ordinates of Maaza (8,2) and Vodka (2,1), the distance between 'Maaza' and
'Vodka' can be calculated as:

dist(Maaza, Vodka) = sqrt((8 - 2)^2 + (2 - 1)^2) = sqrt(37) ≈ 6.08
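
The figure of 6.08 can be checked with a one-line helper. The function name `euclidean` is an assumption for this sketch; only the two coordinate pairs given in the text are used, since the ratings of the other drinks are not reproduced here.

```r
# Euclidean distance between two points given as numeric vectors.
euclidean <- function(p, q) sqrt(sum((p - q)^2))

maaza <- c(8, 2)   # coordinates as given in the text
vodka <- c(2, 1)

d <- euclidean(maaza, vodka)
print(round(d, 2))   # sqrt(36 + 1) = sqrt(37), about 6.08
```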

Using Euclidean distance, we can calculate the distance of Maaza from each of its nearest
neighbors. Since the distance between Maaza and ACTIV is the smallest, it may be inferred that
Maaza is similar in nature to ACTIV, which belongs to the group Health
Drinks.

If k=1, the algorithm considers only the nearest neighbor to Maaza, i.e. ACTIV; if k=3, the
algorithm considers the 3 nearest neighbors to Maaza (ACTIV,
Vodka, Monster) and compares their distances; ACTIV is the nearest to Maaza.

kNN Algorithm – Pros and Cons

Pros: The algorithm is highly unbiased in nature and makes no prior assumption of the
underlying data. Being simple and effective in nature, it is easy to implement and has gained
good popularity.

Cons: Indeed it is simple, but the kNN algorithm has drawn a lot of flak for being extremely
simple! If we take a deeper look, it doesn't create a model, since there's no abstraction
process involved. Yes, the training process is really fast as the data is stored verbatim (hence
lazy learner), but the prediction time is pretty high, with useful insights missing at times.
Therefore, using this algorithm requires time to be invested in data preparation (especially
treating the missing data and categorical features) to obtain a robust model.

Case Study: Detecting Prostate Cancer

Machine learning finds extensive usage in the pharmaceutical industry, especially in the detection of
oncogenic (cancerous) cell growth. R finds application in machine learning to build models that
predict the abnormal growth of cells, thereby helping in the detection of cancer and benefiting the
health system.

Let's see the process of building this model using the kNN algorithm in R. Below,
every line of code written to accomplish this task is explained.

Step 1- Data collection

We will use a data set of 100 patients (created solely for the purpose of practice) to
implement the kNN algorithm and interpret the results. The data set has been prepared
keeping in mind the results which are generally obtained from a DRE (Digital Rectal Exam).
You can download the data set and practice these steps as I explain.

The data set consists of 100 observations and 10 variables (8 numeric variables,
one categorical variable and an ID). The 8 numeric variables are as follows:

1. Radius
2. Texture
3. Perimeter
4. Area
5. Smoothness
6. Compactness
7. Symmetry
8. Fractal dimension

In real life, there are dozens of important parameters needed to measure the probability of
cancerous growth, but for simplicity let's deal with 8 of them!

Here's what the data set looks like:

Step 2- Preparing and exploring the data

Let's make sure that we understand every line of code before proceeding to the next stage:

setwd("C:/Users/Payal/Desktop/KNN")
# Points R at the folder containing the 'Prostate_Cancer.csv' data file; it does not import
# anything by itself. A common mistake is to use "\" instead of "/" in the path passed to setwd().

prc <- read.csv("Prostate_Cancer.csv", stringsAsFactors = FALSE)
# Imports the required data set and saves it to the prc data frame.
# stringsAsFactors = FALSE keeps character columns as character vectors instead of
# converting them to factors.

str(prc)
# Displays the structure of the data frame: the number of observations, the variables and their types.

We find that the data is structured with 10 variables and 100 observations. If we observe the
data set, the first variable 'id' is unique in nature and can be removed as it does not provide
useful information.

prc <- prc[-1]   # removes the first variable (id) from the data set

The data set contains patients who have been diagnosed as either Malignant (M) or Benign
(B).

table(prc$diagnosis_result)
# Gives the number of patients in each diagnosis class.

(The variable diagnosis_result is our target variable, i.e. the diagnosis that the model will
predict from the 8 numeric variables.)
In case we wish to rename B as "Benign" and M as "Malignant" and see the results in
percentage form, we may write:

prc$diagnosis <- factor(prc$diagnosis_result, levels = c("B", "M"), labels = c("Benign", "Malignant"))

round(prop.table(table(prc$diagnosis)) * 100, digits = 1)
# Gives the result as percentages rounded off to 1 decimal place (hence digits = 1).

Normalizing numeric data

This step is of paramount importance since the scale used for the values of each variable
might be different. Because kNN relies on distances, a variable with a large range would
otherwise dominate the result. The best practice is to normalize the data and transform all the values to a
common scale.

normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

Once we run this code, we are required to normalize the numeric features in the data set.
Instead of normalizing each of the 8 individual variables we use:

prc_n <- as.data.frame(lapply(prc[2:9], normalize))

The first variable in our data set (after removal of id) is 'diagnosis_result', which is not
numeric in nature. So, we start from the 2nd variable. The function lapply() applies normalize() to
each feature in the data frame. The final result is stored in the prc_n data frame using the
as.data.frame() function.
Let's check using the variable 'radius' whether the data has been normalized.

summary(prc_n$radius)

The results show that the data has been normalized. Do try with the other variables such as
perimeter, area etc.
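
The effect of min-max scaling is easy to see on a small made-up data frame. The sketch below uses the same formula as the text; the guard for constant columns and the `demo` values are additions assumed for illustration, since a column whose maximum equals its minimum would otherwise cause a division by zero.

```r
# Min-max scaling as in the text, with a guard for constant columns
# (an addition here: max(x) == min(x) would otherwise divide by zero).
normalize <- function(x) {
  rng <- max(x) - min(x)
  if (rng == 0) return(rep(0, length(x)))
  (x - min(x)) / rng
}

# Made-up measurements on very different scales.
demo <- data.frame(radius = c(10, 15, 20, 25),
                   area   = c(300, 900, 1200, 1500),
                   flat   = c(7, 7, 7, 7))

demo_n <- as.data.frame(lapply(demo, normalize))
print(demo_n)
print(sapply(demo_n, range))   # each varying column now spans 0 to 1
```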

Creating training and test data set

The kNN algorithm is applied to the training data set and the results are verified on the test
data set.

For this, we would divide the data set into 2 portions in the ratio of 65: 35 (assumed) for the
training and test data set respectively. You may use a different ratio altogether depending on
the business requirement!

We shall divide the prc_n data frame into prc_train and prc_test data frames:

prc_train <- prc_n[1:65, ]

prc_test <- prc_n[66:100, ]

The blank after the comma in each of the above statements indicates that all columns should be
included for the selected rows.

Our target variable is 'diagnosis_result', which we have not included in our training and test
data sets.

prc_train_labels <- prc[1:65, 1]
prc_test_labels <- prc[66:100, 1]
# This code takes the diagnosis_result values in column 1 of the prc data frame and creates
# the prc_train_labels and prc_test_labels vectors.
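
Taking the first 65 rows for training works only if the rows are already in random order. A hedged alternative is to draw the training rows at random, as sketched below; the `features` and `labels` objects are made-up stand-ins for the normalized features and the diagnosis labels, and the seed value is arbitrary.

```r
# Stand-in for the normalized features and labels (made-up values);
# with the real data these would be prc_n and prc$diagnosis_result.
set.seed(123)                                   # makes the shuffle reproducible
features <- data.frame(f1 = runif(100), f2 = runif(100))
labels   <- sample(c("B", "M"), 100, replace = TRUE)

# Draw 65 row indices at random instead of taking the first 65 rows,
# so any ordering in the file does not leak into the split.
train_idx <- sample(nrow(features), size = 65)

train_x <- features[train_idx, ]
test_x  <- features[-train_idx, ]
train_y <- labels[train_idx]
test_y  <- labels[-train_idx]

cat(nrow(train_x), "training rows,", nrow(test_x), "test rows\n")
```

Indexing the features and the labels with the same `train_idx` keeps each row paired with its own label.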

Step 3 – Training a model on data

The knn() function is used to train the model, for which we need to install the package
'class'. The knn() function identifies the k nearest neighbors using Euclidean distance, where
k is a user-specified number.

You need to type in the following commands to use knn():

install.packages("class")

library(class)

Now we are ready to use the knn() function to classify the test data:

prc_test_pred <- knn(train = prc_train, test = prc_test, cl = prc_train_labels, k = 10)

A common rule of thumb is to choose k as the square root of the number of observations.

knn() returns a factor of predicted labels, one for each example in the test data set,
which is then assigned to prc_test_pred.

Step 4 – Evaluate the model performance

We have built the model, but we also need to check the accuracy of the predicted values in
prc_test_pred, i.e. whether they match up with the known values in prc_test_labels. To
do this, we use the CrossTable() function available in the package 'gmodels'.

We can install it using:

install.packages("gmodels")

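
The cross-tabulation call itself is not reproduced in these notes. As a sketch under stated assumptions, base R's table() gives the same counts without any extra package. The `actual` and `predicted` vectors below are rebuilt from the counts reported in the paragraphs that follow (they are not the original data), and the commented CrossTable() call is a presumed form rather than the author's confirmed code.

```r
# Vectors rebuilt from the reported counts (not the original data):
# 19 benign cases, 5 predicted B and 14 predicted M; 16 malignant
# cases, all predicted M.
actual    <- factor(c(rep("B", 19), rep("M", 16)), levels = c("B", "M"))
predicted <- factor(c(rep("B", 5), rep("M", 14), rep("M", 16)),
                    levels = c("B", "M"))

confusion <- table(actual = actual, predicted = predicted)
print(confusion)
print(round(prop.table(confusion) * 100, 1))   # cell percentages of all 35 cases

# With gmodels installed, the equivalent call for the tutorial's objects
# would presumably be:
#   library(gmodels)
#   CrossTable(x = prc_test_labels, y = prc_test_pred, prop.chisq = FALSE)
```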
The test data consisted of 35 observations. Of these, 5 cases were accurately
predicted as Benign (B) (TN -> True Negatives), which constitutes 14.3%. Also, 16
out of 35 observations were accurately predicted as Malignant (M) (TP -> True Positives),
which constitutes 45.7%.

There were no cases of False Negatives (FN), meaning no cases that are actually
malignant were predicted as benign. False negatives pose the greater threat, since a
malignant case would go undetected, so a main focus when improving the model is to keep
FNs low.

There were 14 cases of False Positives (FP), meaning 14 cases were actually benign
but got predicted as malignant.

The total accuracy of the model is 60% ((TN + TP) / 35 = 21 / 35), which shows that there is
room to improve the model performance.
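
Accuracy alone hides how the errors are distributed. Using only the four counts reported above, the usual summary rates can be computed as follows; the metric names are standard, but reporting them is an addition to the original walkthrough.

```r
# Counts taken from the evaluation above.
tn <- 5; tp <- 16; fp <- 14; fn <- 0

accuracy    <- (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
sensitivity <- tp / (tp + fn)                    # malignant cases caught (recall)
specificity <- tn / (tn + fp)                    # benign cases correctly cleared
precision   <- tp / (tp + fp)                    # predicted-malignant that are malignant

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 3))
```

The model catches every malignant case (sensitivity of 1) but clears only 5 of the 19 benign ones, which is where the 60% accuracy comes from.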

Step 5 – Improve the performance of the model

This can be done by repeating steps 3 and 4 with a different k-value.
Generally, k is taken as the square root of the number of observations, and in this case we took k=10, which is
exactly the square root of 100. The k-value may be varied around 10 to
check whether the accuracy of the model increases. Do try it out with values of your choice to
increase the accuracy! Also remember to keep the number of FNs as low as possible.
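
Repeating steps 3 and 4 for several k values is easy to automate. The sketch below uses knn() from the 'class' package, as in Step 3, but runs on synthetic two-class data because the real data set is not available here; the data, the seed and the list of k values are all assumptions for illustration.

```r
library(class)   # provides knn(), as used in Step 3

# Synthetic two-class data standing in for the normalized features
# (assumption: the real prc_train / prc_test objects are not available here).
set.seed(7)
n <- 100
labels <- factor(rep(c("B", "M"), each = n / 2))
features <- data.frame(
  f1 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 2)),
  f2 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 2))
)

train_idx <- sample(n, 65)
train_x <- features[train_idx, ]; train_y <- labels[train_idx]
test_x  <- features[-train_idx, ]; test_y  <- labels[-train_idx]

# Repeat Steps 3 and 4 for several k values and record accuracy and FN count.
k_values <- c(1, 3, 5, 7, 9, 10, 11, 13)
results <- data.frame(k = k_values, accuracy = NA_real_, false_neg = NA_integer_)
for (i in seq_along(k_values)) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k_values[i])
  results$accuracy[i]  <- mean(pred == test_y)
  results$false_neg[i] <- sum(test_y == "M" & pred == "B")
}
print(results)
```

Picking k by looking at test-set accuracy gives an optimistic estimate; with more data, a separate validation set or cross-validation would be the more careful choice.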

Common questions

One advantage of the k-nearest neighbors (kNN) algorithm is its simplicity and flexibility, as it makes few assumptions about the data, thus allowing it to adapt well to various datasets. However, a significant disadvantage is its computational inefficiency in prediction, as it requires scanning the entire dataset for each prediction, which can be slow and resource-intensive.

The three main categories of machine learning tasks are supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, the computer is given example inputs and their desired outputs by a 'teacher,' and the goal is to learn a general rule that maps inputs to outputs. On the other hand, unsupervised learning involves no labels given to the algorithm, which allows it to independently find structure or patterns within the input data. Reinforcement learning involves the algorithm interacting with a dynamic environment to perform a task, such as playing a game, without being explicitly taught how close it is to achieving its goal.

Normalization in machine learning is crucial as it scales input features to a common range, which helps improve model training efficiency and prediction accuracy. It ensures that the differing magnitudes of input variables do not unduly bias the model. Normalization is typically implemented using a function to transform features such that they lie within a specific range, often between 0 and 1, calculated as (x - min(x)) / (max(x) - min(x)).

Ensemble methods enhance machine learning model performance by combining multiple models to improve predictions. These methods can use different algorithms, the same algorithm with varying settings, or assign different parts of the dataset to various classifiers. This approach helps reduce variance and bias, leading to higher accuracy than a single model by aggregating predictions, thereby improving robustness and mitigating weaknesses inherent to individual models.

Machine learning leverages cross-validation techniques to assess model performance by dividing the dataset into multiple subsets or 'folds.' The model is trained on several folds and tested on the remaining fold, rotating until all folds are used for testing. This technique enables a more accurate estimation of the model's ability to generalize to unseen data by mitigating overfitting and ensuring that the evaluation results are not dependent on a particular train-test split.

Choosing the value of k in the k-nearest neighbors (kNN) algorithm presents challenges related to bias-variance trade-off. A small k may lead to a model with high variance and overfitting, while a large k might result in a high bias with underfitting. These challenges can be addressed by employing cross-validation to empirically determine the optimal k that minimizes prediction error on unseen data, ensuring a balanced trade-off between bias and variance.

Semi-supervised learning differs from supervised learning by utilizing only a partial set of labeled data, with most training examples being unlabeled. Unlike unsupervised learning, which deals entirely with unlabeled data, semi-supervised learning aims to use the limited labeled data to improve learning accuracy and prediction, effectively combining the principles of both supervised and unsupervised techniques to leverage the strengths of labeled data while exploring potential patterns in unlabeled data.

Machine learning and data mining differ mainly in their primary goals. Machine learning focuses on prediction, leveraging properties learned from training data to make accurate predictions on new data. On the contrary, data mining aims to discover previously unknown patterns or properties in the data, functioning mainly as the analysis step in Knowledge Discovery in Databases (KDD). While both share similar methodologies, data mining is often used for understanding data rather than making predictions, which is a key distinction from machine learning's focus on prediction accuracy.

Dimension reduction in machine learning simplifies inputs by mapping them into a lower-dimensional space. This process is crucial because it reduces complexity, alleviates the curse of dimensionality, and improves model performance by removing irrelevant or redundant features. By focusing on the most informative features, dimension reduction helps in enhancing computational efficiency and aids in better data visualization and understanding.

Transduction in machine learning refers to a scenario where the entire set of problem instances is known at learning time, except that a portion of the targets is missing. Unlike traditional learning methods that aim to infer a rule applicable to all potential instances (induction), transduction focuses on predicting specific instances directly by learning from complete sets of instances, thus bridging the gap between supervised and semi-supervised learning.

AI-Based Facial Emotion Detection
6 pages
KYB to Eaton Pump Cross Reference
No ratings yet
KYB to Eaton Pump Cross Reference
4 pages
Massive MIMO For 5G - 2015 - Emil Bjornson
No ratings yet
Massive MIMO For 5G - 2015 - Emil Bjornson
58 pages
Detailed Construction Reinforcement Guide
No ratings yet
Detailed Construction Reinforcement Guide
1 page
Concrete Spacers Material Submittal
No ratings yet
Concrete Spacers Material Submittal
2 pages
PM5110 Power Meter Specifications
No ratings yet
PM5110 Power Meter Specifications
4 pages
Malabar Model 8817A Axle Jack Transporter
No ratings yet
Malabar Model 8817A Axle Jack Transporter
3 pages
Alerton Unitary Controller Overview
No ratings yet
Alerton Unitary Controller Overview
7 pages
Science: Quarter 4 - Module 6: How Heat Transfer and Energy Transformation Makes Heat Engine Work
No ratings yet
Science: Quarter 4 - Module 6: How Heat Transfer and Energy Transformation Makes Heat Engine Work
11 pages
NACE vs BRF+: Output Management Comparison
No ratings yet
NACE vs BRF+: Output Management Comparison
5 pages
Fuzzy Logic Concepts and Applications
No ratings yet
Fuzzy Logic Concepts and Applications
4 pages
Three-State TMA for Secure OFDM Transmission
No ratings yet
Three-State TMA for Secure OFDM Transmission
8 pages
Optimizing Wind Turbine Performance Techniques
No ratings yet
Optimizing Wind Turbine Performance Techniques
8 pages
The Freshie Magazine 2022
No ratings yet
The Freshie Magazine 2022
60 pages
YouTube Marketing Strategy Overview
No ratings yet
YouTube Marketing Strategy Overview
6 pages
MarkVIe I/O Monitoring and Configuration Guide
No ratings yet
MarkVIe I/O Monitoring and Configuration Guide
25 pages
Kioti CK2810, CK2810H Tractor Operator's Manual
No ratings yet
Kioti CK2810, CK2810H Tractor Operator's Manual
15 pages
A52s Disassembly & Reassembly
No ratings yet
A52s Disassembly & Reassembly
9 pages
Installing TQRChart with QuickReport
No ratings yet
Installing TQRChart with QuickReport
4 pages
1 s2.0 S1877050923001795 Main
No ratings yet
1 s2.0 S1877050923001795 Main
15 pages
Dr. Deepak Kumar's Academic CV
No ratings yet
Dr. Deepak Kumar's Academic CV
7 pages
Crypto Library Implementation Guide
No ratings yet
Crypto Library Implementation Guide
17 pages
Final Exam Java Programming Questions
No ratings yet
Final Exam Java Programming Questions
8 pages
(8th) James Stewart - Student Solutions Manual, Chapters 1-11 For Stewart's Single Variable Calculus, 8th-Brooks Cole (2015)
100% (1)
(8th) James Stewart - Student Solutions Manual, Chapters 1-11 For Stewart's Single Variable Calculus, 8th-Brooks Cole (2015)
2 pages
EC-NetAX Certification Course Overview
100% (1)
EC-NetAX Certification Course Overview
300 pages
Iterative Prototyping in HCI Design
No ratings yet
Iterative Prototyping in HCI Design
11 pages

UNIT IV

What is Machine Learning :-

Train model using statistical/machine learning algorithms, Test model

What is kNN Algorithm?

How do you do that?

Unhesitatingly, using kNN Algorithm.

How to select appropriate k value?

Example of kNN Algorithm

kNN Algorithm – Pros and Cons

Case Study: Detecting Prostate Cancer

Machine learning finds extensive usage in pharmaceutical industry especially in detection of

Step 1- Data collection

Here's what the data set looks like:

Step 2- Preparing and exploring the data

setwd("C:/Users/Payal/Desktop/KNN") #Sets the working directory to the folder that holds the data file

prc <- read.csv("Prostate_Cancer.csv", stringsAsFactors = FALSE) #This command imports

stringsAsFactors = FALSE #This argument keeps character vectors as character strings rather than converting them to factors.

table(prc$diagnosis_result) # counts the number of patients in each diagnosis category

prc$diagnosis<- factor(prc$diagnosis_result, levels = c("B", "M"), labels = c("Benign",

round(prop.table(table(prc$diagnosis)) * 100, digits = 1) # gives the result as percentages rounded to 1 decimal place (hence digits = 1)

Normalizing numeric data

normalize <- function(x) { return ((x - min(x)) / (max(x) - min(x))) }

prc_n <- as.data.frame(lapply(prc[2:9], normalize))
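
As a minimal sketch of the min-max formula used above, the following applies it to a small made-up data frame. The data frame `toy` and its column names are assumptions for illustration only; they are not taken from the prostate cancer data set.

```r
# Min-max normalization: rescales a numeric vector to the range [0, 1].
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical toy data (not from Prostate_Cancer.csv).
toy <- data.frame(radius = c(10, 15, 20), area = c(200, 600, 1000))

# Apply the function to every column and rebuild a data frame.
toy_n <- as.data.frame(lapply(toy, normalize))

# Each column now has minimum 0 and maximum 1.
print(toy_n)
```

Rescaling matters for kNN because the distance between two rows would otherwise be dominated by whichever column has the largest numeric range. Note that a column whose values are all identical would produce NaN here, since the denominator becomes zero.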

Creating training and test data set

Step 3 – Training a model on data

You need to type in the following commands to use knn()

prc_test_pred<- knn(train = prc_train, test = prc_test,cl = prc_train_labels, k=10)
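
The same kind of knn() call can be tried end to end on R's built-in iris data. This is only a sketch under stated assumptions: it uses iris instead of the prostate data, the 30-row hold-out and the seed are arbitrary choices, and it relies on the class package, which is normally bundled with standard R installations.

```r
library(class)  # provides knn()

set.seed(42)  # arbitrary seed so the random split is reproducible

# Min-max normalize the four numeric columns of iris.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
iris_n <- as.data.frame(lapply(iris[1:4], normalize))

# Hold out 30 randomly chosen rows for testing (assumed split size).
test_idx <- sample(nrow(iris_n), 30)
iris_train <- iris_n[-test_idx, ]
iris_test <- iris_n[test_idx, ]
train_labels <- iris$Species[-test_idx]
test_labels <- iris$Species[test_idx]

# Classify each test row by majority vote among its k nearest training rows.
iris_pred <- knn(train = iris_train, test = iris_test, cl = train_labels, k = 10)

# Confusion table and accuracy on the held-out rows.
print(table(predicted = iris_pred, actual = test_labels))
accuracy <- mean(iris_pred == test_labels)
print(accuracy)
```

Rerunning the call with different values of k and comparing the held-out accuracy is one simple way to explore how the choice of k affects the result.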

We can install it using:

Step 5 – Improve the performance of the model

Common questions

Identify and explain one advantage and one disadvantage of using the k-nearest neighbors (kNN) algorithm in machine learning.

What are the three main categories of machine learning tasks, and how do they fundamentally differ in their learning approach?

What role does normalization play in preparing data for machine learning, and how is it typically implemented?

Explain how ensemble methods enhance the performance of machine learning models compared to using a single model.

How does machine learning leverage cross-validation techniques in evaluating model performance?

What challenges might arise when choosing the value of k in the k-nearest neighbors algorithm, and how can these challenges be addressed?

In what ways does semi-supervised learning differ from both supervised and unsupervised learning?

Discuss how machine learning and data mining differ in their primary goals, despite using similar methodologies.

Describe the process and importance of dimension reduction in machine learning.

What is transduction in the context of machine learning, and how does it differ from traditional learning methods?
