Understanding Big Data Analytics Concepts

The document provides an overview of big data analytics, defining analytics as the systematic discovery and interpretation of data patterns for decision-making. It distinguishes between data analytics and big data analytics, emphasizing the latter's focus on large, complex datasets using advanced techniques like machine learning. The document also outlines the stages of big data processing, challenges, and the importance of extracting valuable insights from massive data collections.

Uploaded by

dine mohammed
Copyright
© All Rights Reserved

Big Data Analytics

What is analytics?
• Analytics is the systematic discovery, interpretation,
and use of meaningful patterns in data.
• It also entails organizing and processing data as well as
extracting patterns in data towards effective problem
solving and decision making.

• What is data? What makes it different from information,
knowledge and wisdom?
• Which of these is most valuable for decision making?
• It is good to understand the DIKW (Data, Information,
Knowledge, Wisdom) hierarchy well.
Data, Information, Knowledge, Wisdom & Truth
[Figure: DIKW hierarchy. Levels in order: Fact, Data, Information, Knowledge, Wisdom, Truth. Labels between the levels: creating concepts, creating context, creating patterns, creating principles. Side labels: depth of meaning; explicit to tacit.]
Data analytics vs. Big data analytics
• Data analytics is the broad process of extracting
meaningful insights from data, while big data analytics
focuses specifically on analyzing very large, complex datasets.
• Big data analytics employs advanced techniques like
machine learning, deep learning and data mining to process
these datasets effectively
• Both data analytics and big data analytics aim to provide
valuable insights for decision-making
Key Similarities and Differences
Feature | Data Analytics | Big Data Analytics
Data size | Can handle various data sizes | Primarily deals with very large datasets
Data | Can handle structured, semi-structured, and unstructured data | Often deals with diverse, complex data that is too large or complex for traditional methods
Techniques | May include statistical analysis, data mining, reporting, and more | Often employs machine learning, deep learning, data mining, and distributed processing systems
Tools | Can use standard software like SQL, Excel, or specialized analytical tools | May utilize platforms like Hadoop, Spark, and cloud-based solutions
Goal | Extract insights and knowledge from data | Extract insights and knowledge from large datasets
Why Data Analytics?
• Reason one: Managing complex business environment
• We are living in a complex and dynamic business
environment
• How to gain competitive advantage when the competitive
pressure is very strong?
• How to control the volatile market (7Ps: Product, Price, Place,
Promotion, People, Process and Physical evidence)?
• How to satisfy users (customers or consumers) that are
professional?
• How to manage the high turnover rate of professionals which
results in diminishing individual and organizational experience?
• Requirement: Business Intelligence
• Prediction: attempting to know what may happen in the future
• Just-in-time response
• Quality, rational, sound and value added decision and problem
solving
• Enhance efficiency and competency
Why Data Analytics
Reason two: Massive data collection
• Data is being produced (generated & collected) at an alarming rate
because of:
• The computerization of business & scientific transactions
• Advances in data collection tools, ranging from scanned texts &
image platforms to satellite remote sensing systems
• Above all, popular use of WWW as a global information system
• Nowadays large databases (data warehouses) are growing at
unprecedented rates to manage the explosive growth of data.
• Examples of massive data sets
• Google: Order of 10 billion Web pages indexed
• 100’s of millions of site visitors per day
• MEDLINE text database: contains more than 31 million references
to journal articles
• Retail e-commerce transaction data: EBay, Amazon, Wal-Mart,
Apple, Alibaba: order of 100 million transactions per day
• Visa, MasterCard: similar or larger numbers
Too much data & information, but too little knowledge
• With the phenomenal rate of growth of data, users expect
more useful and valuable information
• There is a need to extract knowledge (useful information) from
the massive data.
• Faced with such enormous volumes of data, human analysts
without special tools can no longer make sense of it.
• Data analytics with data mining and machine learning can
automate the process of finding patterns & relationships in raw
data, and the results can be used for decision support. That is
why data analytics is used in science, health and business.
• Data analytics is the technology that extracts diamonds of
knowledge from historical data & predicts future outcomes.
Outline
Topic | Coverage
Introducing Big Data | Overview of Big Data, characteristics of Big Data, types of big data, application of big data, challenges of big data
Introducing Big Data analytics | Meaning of big data analytics, data analytics vs. big data analytics, types of big data analytics, classification of analytics, challenges to big data analytics, how big data analytics works, application of big data analytics, future trends
Big Data Technologies | Hadoop system architecture, HDFS (Hadoop Distributed File System), MapReduce computational model, Apache Spark in-memory data analytics, NoSQL database management system
Large-Scale predictive modeling | Introduction to supervised learning, machine learning vs. deep learning, probabilistic modeling, artificial neural networks, deep learning, model parameters and hyperparameters optimization, regularization
Large-Scale descriptive modeling | Introduction to unsupervised learning, evaluation techniques, K-means & K-medoids clustering, hierarchical clustering, density-based clustering
Evaluation
• Assignments & Presentation 20%
• Project 30%
• Final exam 40%
• Knowledge sharing (class attendance & participation) 10%
Presentation assignment
• Instruction: For the given topic, review at least 5
journal articles & prepare presentation slides that
cover the following:
• (i) introduce what it means, i.e. overview and definition of
the concept;
• (ii) explain why we need it, pros & cons, significance;
• (iii) discuss how it works, architecture, & approaches
followed;
• (iv) concluding remarks (show strengths & weaknesses of the
concept with the way forward);
• (v) references.
Presentation assignment
No Name Topic Date
25

1 Yilkal Yehualaw Multimodal data analytics

2 Wondwosen Kebede Process analytics

3 Mekeds Zerihun Yilma Search Analytics

4 Dine Mohammed Jibril Image analytics

5 Hafiz Aman Tuka Text Analytics


Big Data and its challenges

• Big Data requires the storage, organization, and
processing of data at a scale and efficiency that goes
well beyond the capabilities of conventional
information technologies.
What is Big Data?

Big data refers to extremely large, complex & diverse
collections of structured, unstructured, and semi-structured
data that continues to grow exponentially over time.
What is Big Data?

Big data is a collection of data sets so large and complex in
volume, velocity, and variety that traditional data management
systems cannot store, process, and analyze them.
Characteristics of Big Data (5 Vs)
• Volume – The sheer amount of data
• Large amounts of data (terabytes, petabytes)
• Velocity – The speed at which data is generated and
processed.
• High-speed data generation (real-time streaming)
• Variety – The different types and formats of data
• Different data types (structured, unstructured, semi-
structured)
• Veracity – The quality and reliability of the data
• Value – Extracting meaningful insights
• The potential insights and business benefits that can
be derived from the data.
Big Data
• Big data is a collection of data sets so large and complex that it
becomes difficult to process using on-hand database
management tools.
56 V’s of Big data
Two types of big data
• Big data is divided into data at rest and data in motion.
• Data at rest:
• This refers to data that has been collected from various sources
and is then analyzed after the event occurs.
• The point where the data is analyzed and the point where
action is taken on it occur at two separate times.
• Data in motion:
• The collection process for data in motion is similar to that of
data at rest; however, the difference lies in the analytics.
• In this case, the analytics occur in real-time as the event
happens.
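The difference between the two types can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names and the sensor readings are made up, and an in-memory list stands in for both the stored dataset and the live event stream.

```python
def analyze_at_rest(stored_readings):
    """Data at rest: everything is already collected, so analyze it afterwards."""
    return sum(stored_readings) / len(stored_readings)


def analyze_in_motion(stream):
    """Data in motion: update the result as each event arrives."""
    count = 0
    total = 0.0
    for reading in stream:
        count += 1
        total += reading
        yield total / count  # a running average is available immediately


readings = [10.0, 20.0, 30.0, 40.0]  # hypothetical sensor values

batch_average = analyze_at_rest(readings)
running_averages = list(analyze_in_motion(iter(readings)))
```

In the first function the analysis and the action happen after collection has finished; in the second, a usable result exists after every single event.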
Stages of Big data
• Data generation: concerns how data are generated, that is, the large,
diverse and complex datasets produced by different data sources.
• However, there are technical challenges in collecting, processing and analyzing these
datasets.
• Data acquisition: refers to the process of obtaining information and is
subdivided into data collection, data transmission, and data pre-processing.
• Data storage and retrieval: concerns persistently storing and managing large-
scale datasets as well as searching for relevant data as per the information
need of users.
• Data analytics: leverages data analysis methods or tools to inspect data
quality, preprocess, and model data to extract value.
• Each component of this value chain presents various challenges that require
in-depth research, mostly because of the heterogeneous and complex
character of the data involved.
Big Data Challenges
Classification of big data challenges
• Challenges of big data can be classified into:
data management and data analytics.
• Data management involves processes and
supporting technologies to acquire and store data
and to prepare and retrieve it for analysis.
• Data analytics refers to techniques used to discover
and acquire intelligence from big data.
• Both kinds of challenge need to be handled efficiently
and effectively using big data analytics
Big Data Analytics

• Big data analytics allows for the uncovering of
trends, patterns and correlations in large amounts
of raw data to help analysts make data-informed
decisions
Big Data Analytics: An Overview
• Big Data Analytics refers to the process of
examining large and complex data sets to uncover
patterns, trends, and insights.
• It leverages advanced technologies like artificial
intelligence (AI), machine learning (ML), and statistical
methods to analyze vast amounts of structured and
unstructured data.

• Big Data Analytics involves processing and analyzing
massive datasets to uncover patterns, trends, and
insights.
• It uses technologies like Hadoop, Spark, SQL, NoSQL, and
Machine Learning for efficient data handling.
Types of Big Data Analytics
How Big Data Analytics Works?
• Data Collection
• The first crucial step in big data processing is data collection. Here,
the goal is to accumulate data from various sources, ranging from
structured (e.g., databases, spreadsheets) to unstructured sources
(e.g., social media, emails, videos, and IoT devices).
• The key considerations during data collection are:
• Accuracy: Ensuring that the data collected is accurate and free from
errors or omissions. Any inaccuracies at this stage can ripple through
the entire processing pipeline, leading to flawed insights.
• Completeness: Gathering all relevant data is imperative. Missing pieces
of the puzzle may result in incomplete analysis and inaccurate
conclusions. Risk assessment and decision-making rely on
comprehensive data.
• Real-time Data: In some scenarios, real-time or near-real-time data
collection is essential. This is especially true in trading environments,
where up-to-the-minute information is critical for making investment
decisions.
• Sources: Social media, IoT devices, transaction records,
sensors, and more.
• Technologies: Apache Kafka, Flume, and cloud-based data
lakes.
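The accuracy and completeness considerations above can be made concrete with a small validation step at collection time. The sketch below is only illustrative: the record schema, the field names and the plausible temperature range are all assumptions made for this example.

```python
# Hypothetical schema for one collected IoT reading.
REQUIRED_FIELDS = ("sensor_id", "timestamp", "temperature")


def check_record(record):
    """Return a list of problems found in one collected record."""
    problems = []
    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            problems.append(f"missing {field}")
    # Accuracy: reject values outside an assumed plausible range.
    temperature = record.get("temperature")
    if temperature is not None and not (-90.0 <= temperature <= 60.0):
        problems.append("temperature out of plausible range")
    return problems


records = [
    {"sensor_id": "a1", "timestamp": 1700000000, "temperature": 21.5},
    {"sensor_id": "a2", "timestamp": 1700000060, "temperature": None},
    {"sensor_id": "a3", "timestamp": 1700000120, "temperature": 480.0},
]
report = {r["sensor_id"]: check_record(r) for r in records}
```

Records with a non-empty problem list can be rejected or flagged before they ripple through the rest of the pipeline.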
How Big Data Analytics Works?
• Data Preprocessing
• Once data is collected, it rarely arrives in a pristine, ready-to-analyze state.
• Data preprocessing involves several critical tasks to clean, transform, and
prepare the data into a suitable format for analysis:
• Cleansing: This involves identifying and rectifying errors, such as missing values,
outliers, and inconsistencies. In investment banking, data cleansing ensures that
financial data is accurate, which is vital for modeling and predicting market
trends.
• Transformation: Data often requires transformation to make it compatible with
the analysis tools and techniques that will be used. This may include converting
data types, scaling values, or aggregating data over specific time periods.
• Deduplication: Duplicate records can skew analysis results. Removing duplicate
entries ensures that each piece of data contributes meaningfully to the analysis.
• Normalization: Scaling data to a standard range can help prevent the dominance
of certain variables in statistical analysis, promoting fairness in decision-making.
• It may involve data integration from multiple sources.
• Batch Processing: Apache Hadoop, Spark.
• Real-time Processing: Apache Kafka, Apache Flink, Apache Storm.
• Tools that extract, transform, and load (ETL) data from different sources.
• Apache NiFi – Automates data flows between systems.
• Talend – Open-source ETL and data integration.
• Informatica – Enterprise-grade data integration and governance.
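The cleansing, deduplication and normalization tasks described above can be sketched in plain Python. This is a minimal illustration on made-up records, not how any of the listed tools implement these steps; mean imputation and min-max scaling are just one possible choice for each task.

```python
def deduplicate(rows):
    """Deduplication: keep the first occurrence of each record id."""
    seen = set()
    unique = []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(row)
    return unique


def cleanse(rows):
    """Cleansing: fill missing amounts with the mean of the observed amounts."""
    observed = [r["amount"] for r in rows if r["amount"] is not None]
    mean = sum(observed) / len(observed)
    return [{**r, "amount": mean if r["amount"] is None else r["amount"]} for r in rows]


def normalize(rows):
    """Normalization: min-max scale amounts into the range [0, 1]."""
    amounts = [r["amount"] for r in rows]
    low, high = min(amounts), max(amounts)
    span = (high - low) or 1.0  # avoid dividing by zero when all values are equal
    return [{**r, "amount": (r["amount"] - low) / span} for r in rows]


raw = [
    {"id": 1, "amount": 100.0},
    {"id": 2, "amount": None},   # missing value
    {"id": 2, "amount": None},   # duplicate record
    {"id": 3, "amount": 300.0},
]
prepared = normalize(cleanse(deduplicate(raw)))
```

Duplicates are removed first so that a repeated record cannot distort the mean used for filling missing values.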
How Big Data Analytics Works?
• Data Storage
• The third step, data storage, is fundamental in the big data processing pipeline. It
involves housing the prepared data in storage systems designed to handle large
volumes of information efficiently.
• Key considerations in data storage include:
• Scalability: As data volumes grow, the storage system should be able to scale
seamlessly to accommodate increasing amounts of data.
• Redundancy: Ensuring data availability and integrity is critical. Implementing
redundancy mechanisms like data replication & backup systems safeguards against
data loss.
• Data Retrieval: Access to stored data should be fast and reliable. For example, in
healthcare, quick access to historical medical data can make a difference in timely
decision-making.
• Storing the large volumes of processed data efficiently, often using distributed
systems or cloud-based solutions.
• Structured Data: Stored in relational databases like MySQL, PostgreSQL.
• Unstructured Data: Managed using NoSQL databases like MongoDB and Cassandra, or
stored in data lakes and data lakehouses.
• Big Data Frameworks: Hadoop Distributed File System (HDFS), Amazon S3.
• Apache Hadoop – Distributed storage and processing using HDFS.
• Apache HBase – NoSQL database optimized for real-time data access.
• Apache Cassandra – High availability and fault-tolerant NoSQL database.
• Amazon S3 – Scalable cloud storage service.
• Google BigQuery – Fully managed data warehouse for real-time analytics.
How Big Data Analytics Works?
• Data Analysis
• The heart of big data processing lies in data analysis, which applies advanced analytics tools &
techniques to derive actionable insights from the prepared data.
• This stage is vital for assessment, optimization, and identifying opportunities. It applies
analytical techniques and tools, such as data mining (DM), machine learning (ML), predictive
analytics, and statistical analysis, to extract meaningful insights.
• Key aspects of data analysis include:
• Descriptive Analytics: This involves summarizing and visualizing data to gain an understanding of its
characteristics and trends. Tools like data dashboards and charts are commonly used for this purpose.
• Diagnostic Analytics: This is a form of advanced analytics that examines data or content to answer the
question, “Why did it happen?” It is characterized by techniques such as drill-down, data discovery,
data mining and correlations.
• Predictive Analytics: Predictive modeling and machine learning are employed to forecast future
trends and outcomes.
• In sectors like healthcare and financial services, predictive analytics can aid in predicting market
movements and assessing risk.
• Prescriptive Analytics: provides recommendations for actions and decision-making based on
predictions.
• Investment bankers can use this to optimize investment strategies and make informed decisions.
• The following platforms help analyze and extract insights from big data.
• TensorFlow – Deep learning framework for large-scale AI.
• PyTorch – Machine learning and AI model development.
• RapidMiner, WEKA, R-Miner – Data science platform for predictive analytics.
• Apache Mahout – Machine learning library for big data.
• [Link] – Open-source AI and ML platform.
• Apache Spark – Fast, in-memory big data processing.
• Apache Flink – Stream processing for real-time analytics.
• Apache Storm – Real-time event stream processing.
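To make the difference between the analytics types concrete, here is a toy sketch of descriptive, predictive and prescriptive analytics on the same data. The numbers, the cost figure and the helper names are invented for illustration, and a hand-written least-squares line stands in for the predictive models that the platforms above would provide.

```python
def mean(values):
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


ad_spend = [1.0, 2.0, 3.0, 4.0]  # made-up historical data
sales = [3.0, 5.0, 7.0, 9.0]

# Descriptive: what happened?
average_sales = mean(sales)

# Predictive: what is likely to happen if we spend 5.0?
slope, intercept = fit_line(ad_spend, sales)
forecast = slope * 5.0 + intercept

# Prescriptive: which spend level gives the best predicted margin,
# assuming each unit of spend costs 1.5 units of sales value?
options = [5.0, 6.0, 7.0]
best_spend = max(options, key=lambda s: (slope * s + intercept) - 1.5 * s)
```

Diagnostic analytics would sit between the first two steps, for example by checking how strongly spend and sales are correlated before trusting the fitted line.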
How Big Data Analytics Works?
• Data Visualization
• The final step in the big data processing journey is data visualization.
• After deriving insights from the analysis, presenting the results in a
comprehensible format is crucial for decision-makers.
• For instance, in investment banking, clear visualizations of financial data
and risk assessments facilitate effective communication & decision-making.
• Key considerations for data visualization include:
• Clarity: Visualizations should be clear and easy to understand, even for
non-technical stakeholders.
• Interactivity: Interactive dashboards allow users to explore data further
and gain deeper insights.
• Relevance: Ensure that the visualizations align with the objectives of the
analysis and address the specific needs of the audience.
• Effective big data processing is a multi-step journey, from collecting and
preprocessing data to storing, analyzing, and visualizing it.
• Tools that help interpret findings visually and turn raw data into
actionable insights
• Tableau – Interactive dashboards and visual analytics.
• Power BI – Microsoft’s data visualization and reporting tool.
• Looker – Cloud-based BI tool with real-time reporting.
• Qlik Sense – Self-service BI for ad-hoc analysis.
• Matplotlib/Seaborn – Python libraries for statistical data visualization.
Applications of Big Data Analytics
Big data analytics is being applied across various industries to:
• Improve Customer Experience: Personalizing recommendations, providing
targeted marketing, and enhancing customer service (e.g., Amazon's product
recommendations, Netflix's personalized content).
• Optimize Operations: Streamlining supply chains, improving efficiency, and
reducing costs (e.g., optimizing delivery routes, predictive maintenance in
manufacturing).
• Enhance Healthcare: Improving diagnostics, personalizing treatments, predicting
disease outbreaks, and facilitating medical research (e.g., analyzing patient
records to identify risk factors, wearable devices monitoring health in real-time).
• Detect Fraud and Manage Risk: Identifying unusual patterns and anomalies in
financial transactions and other data (e.g., banks detecting fraudulent credit card
activity).
• Drive Innovation: Identifying market trends, understanding customer needs, and
developing new products and services.
• Improve Public Services: Optimizing traffic flow, predicting resource needs, and
enhancing security (e.g., smart city initiatives, crime prediction).
Challenges in Big Data Analytics
Despite its potential, big data analytics also presents several
challenges:
• Data Quality: Ensuring the accuracy, completeness, and
consistency of large datasets.
• Scalability: Building systems that can handle the increasing
volume and velocity of data
• Managing and storing the massive volumes of data efficiently and cost-
effectively.
• Data Integration: Combining data from diverse sources and
formats seamlessly.
• Data Governance: Establishing policies and procedures for
managing data across an organization.
• Data Interpretation and Analysis: Extracting meaningful insights
and communicating them effectively to stakeholders.
Future Trends
• AI & Machine Learning Integration:
• Enhancing automation in analytics.
• Edge Computing:
• Processing data closer to its source for real-time insights.
• Edge computing represents a paradigm shift in the way we process
and handle data: by bringing computational power closer to where
data is generated, it promises faster, more efficient data
processing.
• Blockchain & Data Security:
• Securing data transactions.
• Blockchains increase trust, security and transparency among
member organizations by improving the traceability of data shared
across a business network, and can deliver cost savings through new
efficiencies.
• Quantum Computing:
• Boosting computational power for data analysis.
Big Data Analytics Tools & Technologies: The Hadoop Ecosystem
Big Data Analytics Tools & Technologies
• Big Data analytics relies on powerful tools for data collection, storage, processing, and
visualization.
• Here’s a breakdown of the most widely used tools categorized by their functions:
• Hadoop Ecosystem (Batch Processing)
• HDFS (Hadoop Distributed File System) – Storage system for big data
• MapReduce – Parallel processing model
• Hive – SQL-like querying for big data
• HBase – NoSQL database for real-time processing
• Apache Spark (Fast In-Memory Processing)
• PySpark – Python API for Spark
• Spark SQL – Query big data using SQL
• Spark Streaming – Real-time data processing
• MLlib – Machine Learning on big data
• NoSQL Databases
• MongoDB – Document-based NoSQL database
• Cassandra – Highly scalable distributed database
• Elasticsearch – Search and analytics engine
• Cloud-Based Big Data Platforms provide managed big data solutions with scalability.
• Google Cloud Dataflow – Serverless stream and batch processing.
• AWS Glue (Amazon S3, Redshift, EMR) – Fully managed ETL service for data lakes.
• Microsoft Azure Data Lake and Synapse Analytics – Scalable data warehousing and analytics.
• Snowflake – Cloud data platform for data warehousing and analytics.
What is Hadoop?
• Apache Hadoop is a framework for large-scale storage and
processing of data.
• The Hadoop project was started in 2005 by Doug Cutting.
• Hadoop is an open source framework that is meant for
storage and processing of big data in a distributed
manner (across a large number of machines in a cluster).
• It is a widely used solution for handling big data challenges on
computer clusters built from commodity hardware (or off-
the-shelf hardware).
• The purpose of the Hadoop ecosystem is to use cheap storage
and processing (commodity hardware, which is also available
worldwide through the cloud) for storing and processing
large datasets efficiently, thus making computation
cheap.
What is Hadoop?
• Before Hadoop and other distributed computation frameworks, as the
amount of data within enterprises increased, vertical scaling (processing
everything on one powerful server) was becoming more and more difficult.
• Even though companies had servers with high processing power,
swapping to and from disk took a lot of time, reducing CPU
utilization.
• Hadoop introduces near-linear horizontal scaling: it keeps track
of the available nodes in the cluster and improves processing time by
distributing work across them efficiently.
Important features of Hadoop
• Open Source
• Hadoop is an open source framework which is available free of cost. Also, the
users are allowed to change the source code as per their requirements.
• Distributed Processing
• Hadoop supports distributed processing of data i.e. faster processing. The data
in Hadoop HDFS is stored in a distributed manner and MapReduce is
responsible for the parallel processing of data.
• Fault Tolerance
• Hadoop is highly fault-tolerant. It creates three replicas for each block (default)
at different nodes.
• Reliability
• Hadoop stores data on the cluster in a reliable manner that is independent of
machine. So, the data stored in Hadoop environment is not affected by the
failure of the machine.
• Scalability
• It runs on a wide range of hardware, and we can easily add nodes to
or remove nodes from the cluster.
• High Availability
• The data stored in Hadoop is available to access even after the hardware
failure. In case of hardware failure, the data can be accessed from another
node.
Apache Hadoop Ecosystem
Components of Hadoop Ecosystem
• The foundational ideas of the Hadoop ecosystem
are GFS and MapReduce. GFS (Google File System)
inspired Hadoop’s HDFS (Hadoop Distributed
File System).
• The idea of the GFS architecture is that the master maintains
all information about its system components.
• MapReduce can be understood as
• transforming data in parallel across clusters (mappers)
and
• aggregating the data after processing (Reducers)
Core components of Hadoop: HDFS
• HDFS (Hadoop Distributed File System) is the basic storage system
of Hadoop; its design was inspired by the Google File
System (GFS).
• Large data files are stored in HDFS on a cluster of commodity
hardware.
• It can store data in a reliable manner even when hardware fails.
The key aspects of HDFS are:
• Storage component: stores data in Hadoop.
• Distributes data across several nodes: divides large files into
blocks and stores them on various data nodes.
• Natively redundant: replicates the blocks on various data nodes.
• High-throughput access: provides access to the data blocks that
are nearest to the client.
• Re-replicates blocks when nodes fail.
HDFS (Hadoop Distributed File System)
• HDFS is the storage component of the Hadoop Ecosystem. It handles very
large files and breaks them into blocks. These blocks are small only relative
to the files: the default HDFS block size is 128 MB (in Hadoop 2 and later),
far larger than the blocks of an ordinary file system.
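As a rough sketch of what block splitting and replication mean in practice, the following Python snippet splits a file into 128 MB blocks (the HDFS default in Hadoop 2 and later) and assigns three replicas per block. The node names and the round-robin placement are assumptions for illustration only; real HDFS uses a rack-aware placement policy managed by the NameNode.

```python
BLOCK_SIZE_MB = 128  # HDFS default block size in Hadoop 2 and later
REPLICATION = 3      # HDFS default replication factor


def split_into_blocks(file_size_mb):
    """Return the size in MB of each block for a file of the given size."""
    full_blocks, remainder = divmod(file_size_mb, BLOCK_SIZE_MB)
    sizes = [BLOCK_SIZE_MB] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left
    return sizes


def place_replicas(num_blocks, nodes):
    """Toy round-robin placement (needs at least REPLICATION nodes).

    Real HDFS chooses nodes with a rack-aware policy instead.
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + i) % len(nodes)] for i in range(REPLICATION)]
    return placement


blocks = split_into_blocks(300)  # a hypothetical 300 MB file
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held.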
Core components of Hadoop: MapReduce
• MapReduce: is the Hadoop layer that is responsible for data
processing. Developers use it to write applications that process
unstructured and structured data stored in HDFS. It is responsible
for the parallel processing of high volumes of data by dividing the
work into independent tasks.
• The processing is done in two phases, Map and Reduce. Map is
the first phase of processing, which typically holds the complex logic;
Reduce is the second phase, which typically performs lightweight
operations such as aggregation.
• The key aspects of Map Reduce are: (a) computational framework; (b)
Splits a task across multiple nodes; (c) Processes data in parallel
• Importance of MapReduce in Hadoop environment for data
processing.
• MapReduce programming helps to process massive amounts of data
in parallel.
• The input data set is split into independent chunks.
• Map tasks process these independent chunks completely in a parallel
manner.
• Reduce tasks provide reduced output by combining the output of various
mappers.
MapReduce working
• MapReduce divides a data analysis task into two parts – Map and
Reduce.
• Map takes care of loading, parsing, transforming and filtering.
• Reduce groups and aggregates the data produced by the
map tasks to generate the final output.
• Steps:
• 1. First, the input dataset is split into multiple pieces of data.
• 2. Next, the framework creates a master process and several worker processes
and executes the worker processes remotely.
• 3. Several map tasks work simultaneously, each reading the piece of data
assigned to it.
• 4. Each map worker uses a partitioner function to divide its output into regions.
• 5. When the map workers complete their work, the master instructs the reduce
workers to begin their work.
• 6. When all the reduce workers complete their work, the master transfers
control back to the user program.
Example: MapReduce working
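A classic way to see the map, partition and reduce steps together is word counting. The sketch below simulates them sequentially in a single Python process, so it only illustrates the data flow; it is not Hadoop's API, and the function names and the CRC-based partitioner are choices made for this example.

```python
from collections import defaultdict
import zlib


def map_phase(chunk):
    """Map: parse one input chunk and emit (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def partition(key, num_reducers):
    """Partitioner: decide which reducer (region) receives a given key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_phase(pairs):
    """Reduce: group the pairs by key and sum the counts."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(chunks, num_reducers=2):
    regions = [[] for _ in range(num_reducers)]
    for chunk in chunks:  # on a cluster, each chunk would go to its own map task
        for word, count in map_phase(chunk):
            regions[partition(word, num_reducers)].append((word, count))
    result = {}
    for region in regions:  # on a cluster, each region would go to its own reduce task
        result.update(reduce_phase(region))
    return result


counts = word_count(["big data is big", "data in motion", "data at rest"])
```

The partitioner guarantees that all pairs for the same word end up in the same region, which is why each reducer can total its words independently of the others.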
Apache Hadoop Ecosystem
Components of Hadoop ecosystem
• HDFS (Hadoop Distributed File System): It simply stores data files as close to the original form
as possible.
• HDFS is the primary storage system of Hadoop. It employs a NameNode and DataNode architecture. It is a
distributed file system able to store large files running over the cluster of commodity hardware.
• HBase: It is Hadoop’s distributed column based database. It supports structured data storage
for large tables. It is column-oriented and non-relational and provides a fault-tolerant way of
storing sparse data sets. It works very well for real-time data processing.
• Hive: It is an ETL and data warehousing tool that enables querying and analysis of large data sets
stored within the Hadoop ecosystem. So, one can access data stored in a Hadoop cluster by using
Hive. Hive has three main functions: data summarization, query, and analysis of unstructured
and semi-structured data in Hadoop.
• Pig: is an easy-to-understand data flow language. It helps with the analysis of large data sets,
which is common with Hadoop, without writing code in the MapReduce paradigm.
• ZooKeeper: is an open source application that handles configuration and synchronization for
distributed systems.
• Oozie: is a workflow scheduler system to manage Apache Hadoop jobs within the cluster.
ZooKeeper and Ambari coordinate everything across the cluster. They maintain reliability and keep
track of the nodes that are up or down.
• Mahout: It is a scalable machine learning and data mining library.
• Chukwa: It is a data collection system for managing large distributed systems.
• Sqoop and Flume: these are used to move data into Hadoop. Sqoop transfers bulk data between
Hadoop and structured data stores such as relational databases, while Flume brings in
non-relational/unstructured data. Flume moves web logs into Hadoop very quickly.
• Apache Ambari: it is a web-based tool for provisioning, managing and monitoring Apache
Hadoop clusters.
• Kafka: Kafka collects data from any source and broadcasts it.
NoSQL (Not Only SQL)

• NoSQL databases are a new way of thinking about data that is non-
relational and schema-less, and that can be distributed and fault tolerant.
• As data came in all shapes and sizes (structured, semi-structured, and
unstructured), defining the schema in advance became nearly
impossible. NoSQL databases allow developers to store huge amounts of
unstructured data, giving them a lot of flexibility.
• NoSQL refers to non-relational databases that store data in a non-tabular
format, rather than in the rule-based tables that relational
databases use.
• NoSQL databases store data in a more natural and flexible way.
NoSQL is a database management approach,
whereas SQL is just a query language, comparable to the query languages
of NoSQL databases.
• Four major types of NoSQL databases have emerged: document
databases, key-value databases, wide-column stores, and graph
databases.
NoSQL
• Due to the exponential growth of digitization, businesses now collect as much
unstructured data as possible. To be able to analyze and derive
actionable real-time insights from such big data, businesses need modern
solutions that go beyond simple storage.
• Businesses need a platform that can easily scale, transform, and visualize data; create
dashboards, reports, and charts; and work with AI & BI tools to accelerate their
business productivity.
• Due to their flexible and distributed nature, NoSQL databases (for example,
MongoDB) shine in these tasks.
Document-oriented databases
• A document-oriented database stores data in documents such that each
document contains pairs of fields and values. The values can typically be a
variety of types, including strings, numbers, booleans, arrays, or
even other objects. A document database offers a flexible data model that is
well suited to semi-structured and unstructured data sets.
• Examples of document databases are MongoDB and Couchbase.
• A typical document will look like: {
"_id": "12345",
"name": "foo bar",
"email": "foo@[Link]",
"address": {
"street": "123 foo street",
"city": "some city",
"state": "some state",
"zip": "123456"
},
"hobbies": ["music", "guitar", "reading"]
}
Key-value databases
• A key-value store is a simpler type of database where each item
contains keys and values. Each key is unique and associated with a
single value. They are used for caching and session management and
provide high performance in reads and writes because they tend to
store things in memory.
• Examples are Amazon DynamoDB and Redis. A simple view of data
stored in a key-value database is given below:
Key: user:12345
Value: {"name": "foo bar", "email": "foo@[Link]", "designation": "software
developer"}
Wide-column stores
• Wide-column stores store data in tables, rows, and dynamic columns.
The data is stored in tables. However, unlike traditional SQL
databases, wide-column stores are flexible, where different rows can
have different sets of columns. These databases can employ column
compression techniques to reduce the storage space and enhance
performance. The wide rows and columns enable efficient retrieval of
sparse and wide data.
• Some examples of wide-column stores are Apache Cassandra and
HBase. A typical example of how data is stored in a wide-column is as
follows:
Graph databases
• A graph database stores data in the form of nodes and edges. Nodes typically
store information about people, places, and things (like nouns), while edges
store information about the relationships between the nodes.
• Examples of graph databases are Neo4j and Amazon Neptune.
Below is an example of how data is stored:
RDBMS vs. NoSQL databases
• There are a variety of differences between relational database management systems and non-relational databases.
• Data modeling
• NoSQL: Data models vary based on the type of NoSQL database used — for example, key-value, document, graph, and wide-column — making the model
suitable for semi-structured and unstructured data.
• RDBMS: RDBMS uses a tabular data structure, with data represented as a set of rows and columns, making the model suitable for structured data.
• Schema
• NoSQL: It provides a flexible schema where each set of documents/row-column/key-value pairs can contain different types of data. It’s easier to change
schema, if required, due to the flexibility.
• RDBMS: This is a fixed schema where every row should contain the same predefined column types. It is difficult to change the schema once data is stored.
• Query language
• NoSQL: It varies based on the type of NoSQL database used. For example, MongoDB has MQL, and Neo4J uses Cypher.
• RDBMS: This uses structured query language (SQL).
• Scalability
• NoSQL: NoSQL is designed for vertical and horizontal scaling.
• RDBMS: RDBMS is designed for vertical scaling. However, it can extend limited capabilities for horizontal scaling.
• Data relationships
• NoSQL: Relationships can be nested, explicit, or implicit.
• RDBMS: Relationships are defined through foreign keys and accessed using joins.
• Transaction type
• NoSQL: Transactions are either ACID- or BASE-compliant.
• RDBMS: Transactions are ACID-compliant.
• Performance
• NoSQL: NoSQL is suitable for real-time processing, big data analytics, and distributed environments.
• RDBMS: RDBMS is suitable for read-heavy and transaction workloads.
• Data consistency
• NoSQL: This offers eventual consistency, in most cases.
• RDBMS: This offers high data consistency.
• Distributed computing
• NoSQL: One of the main reasons to introduce NoSQL was for distributed computing, and NoSQL databases support distributed data storage, vertical and
horizontal scaling through sharding, replication, and clustering.
• RDBMS: RDBMS supports distributed computing through clustering and replication. However, it’s less scalable and flexible as it’s not traditionally designed
to support distributed architecture.
• Fault tolerance
• NoSQL: NoSQL has built-in fault tolerance and high availability due to data replication.
• RDBMS: RDBMS uses replication, backup, and recovery mechanisms. However, as fault tolerance is not built in by design, additional measures like disaster recovery
mechanisms may need to be implemented during application development.
• Data partitioning
• NoSQL: It’s done through sharding and replication.
• RDBMS: It supports table-based partitioning and partition pruning.
• Data to object mapping
• NoSQL: NoSQL stores the data in a variety of ways — for example, as JSON documents, wide-column stores, or key-value pairs. It provides abstraction
through the ODM (object-data mapping) frameworks to work with NoSQL data in an object-oriented manner.
• RDBMS: RDBMS relies more on data-to-object mapping so that there is seamless integration between the database columns and the object-oriented
application code.
Relational database vs NoSQL database
• As an example, consider storing information about a user and their
hobbies. We need to store a user's first name, last name, cell phone
number, city, and hobbies.
• In a RDBMS, two tables are created: Users & Hobbies tables
• In order to retrieve all of the information about a user and their hobbies, information
from the Users table and Hobbies table will need to be joined together.
• The data model for a NoSQL database will depend on the type of NoSQL database
selected. Let's store the same data about a user and their hobbies in a document
database like MongoDB.
• In order to retrieve all of the information about a user and their hobbies, a single
document can be retrieved from the database. No joins are required, resulting in faster
queries.
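The contrast above can be sketched in a few lines. This is only an illustration: the table names, column names, and sample values are hypothetical, SQLite stands in for a generic RDBMS, and a plain Python dictionary stands in for a document collection (it is not the MongoDB API).

```python
import sqlite3

# Relational model: two tables linked by a foreign key (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        first_name TEXT, last_name TEXT, cell TEXT, city TEXT
    );
    CREATE TABLE hobbies (
        user_id INTEGER REFERENCES users(id),
        hobby TEXT
    );
""")
conn.execute("INSERT INTO users VALUES (1, 'Foo', 'Bar', '555-0100', 'Some City')")
conn.executemany("INSERT INTO hobbies VALUES (?, ?)",
                 [(1, "music"), (1, "guitar"), (1, "reading")])

# Retrieving a user together with their hobbies needs a join.
rows = conn.execute("""
    SELECT u.first_name, u.last_name, h.hobby
    FROM users AS u JOIN hobbies AS h ON h.user_id = u.id
    WHERE u.id = ?
    ORDER BY h.hobby
""", (1,)).fetchall()
relational_hobbies = [hobby for _, _, hobby in rows]

# Document model: the same information nested in a single document.
collection = {
    1: {
        "first_name": "Foo", "last_name": "Bar",
        "cell": "555-0100", "city": "Some City",
        "hobbies": ["music", "guitar", "reading"],
    }
}
document = collection[1]          # one lookup, no join
document_hobbies = sorted(document["hobbies"])

print(relational_hobbies)
print(document_hobbies)
```

Both paths return the same hobbies; the difference is that the relational version assembles them from two tables at query time, while the document version reads them from one record.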
Modeling

• Predictive modeling is used to make predictions about future events.
• Descriptive modeling is used to summarize and describe the data.
• Predictive modeling uses historical data to forecast future events.
• Descriptive modeling focuses on understanding past events and trends.
• Predictive modeling uses classification algorithms.
• Descriptive modeling uses clustering algorithms.
Classification methods
• Goal: Predict class Ci = f(x1, x2, .. xn)
• There are various classification methods. Popular
classification techniques include the following.
• K-nearest neighbor
• Decision tree classifier: divide decision space into
piecewise constant regions.
• Bayesian network: a probabilistic model
• Support vector machine

• Neural networks: partition by non-linear boundaries


• Deep learning

Bayesian Learning
CONDITIONAL PROBABILITY
• The question is: how likely is it that an event will happen?
• Sample space S
• Events A and C are subsets of S
• Prior knowledge and observed data can be combined
• P(C|A): the probability that event C occurs given that event A has
already occurred:
$$P(C \mid A) = \frac{P(A, C)}{P(A)}$$
Example of conditional probability:
• There are 2 baskets. B1 has 2 red balls and 5 blue balls. B2 has 4
red balls and 3 blue balls.
• Find the probability of picking a red ball from basket 1:
P(red ball | basket 1) = 2/7
• What about the probability that a picked red ball came from basket 2,
P(basket 2 | red ball)?
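A short sketch of the basket example follows. The slide does not say how a basket is chosen, so the code assumes (and this is only an assumption) that each basket is equally likely to be picked; the function names are illustrative.

```python
from fractions import Fraction

# Ball counts taken from the example.
baskets = {
    "B1": {"red": 2, "blue": 5},
    "B2": {"red": 4, "blue": 3},
}

# ASSUMPTION: each basket is equally likely to be chosen.
prior = {name: Fraction(1, 2) for name in baskets}


def likelihood(color, basket):
    """P(color | basket): share of balls of that color in the basket."""
    counts = baskets[basket]
    return Fraction(counts[color], sum(counts.values()))


def posterior(basket, color):
    """P(basket | color) via Bayes' rule."""
    evidence = sum(likelihood(color, b) * prior[b] for b in baskets)
    return likelihood(color, basket) * prior[basket] / evidence


p_red_given_b1 = likelihood("red", "B1")
p_b2_given_red = posterior("B2", "red")
print(p_red_given_b1, p_b2_given_red)  # 2/7 2/3
```

Under the equal-prior assumption the red ball is twice as likely to have come from basket 2 as from basket 1, because basket 2 holds twice as many red balls among the same total.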
Bayes Classifier
• A probabilistic framework for solving classification problems
• There are two types of probabilities: posterior probability [P(C|A)] and
prior probability [P(C)]
• Prior probability represents what is originally believed before new evidence
is introduced, and posterior probability takes this new information into
account.
• Bayes theorem:
$$P(C \mid A) = \frac{P(A, C)}{P(A)}, \qquad P(A \mid C) = \frac{P(A, C)}{P(C)}$$
$$P(C \mid A) = \frac{P(A \mid C)\, P(C)}{P(A)}$$

• Example of Bayes theorem
• Given: a doctor knows that flu causes headache 50% of the time. The prior
probability of any patient having flu is 1/50. The prior probability of any patient
having a headache is 1/20. If a patient has a headache, what is the probability
that he/she has flu?
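The question can be worked through by substituting the given numbers into Bayes theorem. In the sketch below, F (flu) and H (headache) are shorthand symbols introduced here for illustration.

```latex
P(F \mid H) = \frac{P(H \mid F)\, P(F)}{P(H)}
            = \frac{0.5 \times \tfrac{1}{50}}{\tfrac{1}{20}}
            = \frac{0.01}{0.05}
            = 0.2
```

So, with these figures, a patient with a headache has a 20% chance of having flu.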
Example: Bayes Classifier
• A medical cancer diagnosis problem. There are 2 possible outcomes of
a diagnosis: +ve, -ve.
Suppose we know 10% of the population has cancer. The test gives a correct +ve
result 98% of the time and a correct -ve result 97% of the time. If a
patient's test returns +ve, should we diagnose the patient as having
cancer?
P(C) = 0.10 P(NC) = 0.90
P(+ve|C) = 0.98 P(-ve|C) = 0.02
P(+ve|NC) = 0.03 P(-ve|NC) = 0.97

Using Bayes formula:

P(C|+ve) = P(+ve|C) P(C) / P(+ve) = 0.98 x 0.10 / P(+ve) = 0.098 / P(+ve)

P(NC|+ve) = P(+ve|NC) P(NC) / P(+ve) = 0.03 x 0.90 / P(+ve) = 0.027 / P(+ve)

Since 0.098 > 0.027, the patient most likely has cancer.
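The comparison above leaves P(+ve) unevaluated. As a small sketch, the normalizing term can be obtained from the law of total probability, which turns the two scores into actual posterior probabilities; the variable names are illustrative.

```python
# Values given in the example.
p_cancer = 0.10
p_pos_given_cancer = 0.98
p_pos_given_no_cancer = 0.03

p_no_cancer = 1 - p_cancer

# Law of total probability: P(+ve) = P(+ve|C)P(C) + P(+ve|NC)P(NC)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * p_no_cancer

# Bayes theorem for each class.
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
p_no_cancer_given_pos = p_pos_given_no_cancer * p_no_cancer / p_pos

print(round(p_pos, 3))                  # 0.125
print(round(p_cancer_given_pos, 3))     # 0.784
print(round(p_no_cancer_given_pos, 3))  # 0.216
```

With the numbers of this example, a positive test corresponds to roughly a 78% probability of cancer.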
General Bayes Theorem
• Consider each attribute and class label as random variables
• Given a record with attributes (A1, A2,…,An)
• Goal is to predict class C
• Specifically, we want to find the value of C that maximizes
P(C | A1, A2, …, An)
• Can we estimate P(C | A1, A2, …, An) directly from data?
• Approach: compute the posterior probability P(C | A1, A2, …, An)
for all values of C using the Bayes theorem
$$P(C \mid A_1 A_2 \ldots A_n) = \frac{P(A_1 A_2 \ldots A_n \mid C)\, P(C)}{P(A_1 A_2 \ldots A_n)}$$
• Choose the value of C that maximizes P(C | A1, A2, …, An)
• Equivalent to choosing the value of C that maximizes
P(A1, A2, …, An | C) P(C)
• How to estimate P(A1, A2, …, An | C)?
Naïve Bayes Classifier
• Assume independence among attributes Ai when the class
is given:
• P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
• Can estimate P(Ai | Cj) for all Ai and Cj.
• A new point is classified to Cj if P(Cj) ∏ P(Ai | Cj) is maximal:
$$C_{\text{Naive Bayes}} = \arg\max_{j} P(C_j) \prod_{i} P(A_i \mid C_j)$$
Example. ‘Play Tennis’ data
• Suppose that Mr. Alex has a free afternoon next Sunday and he is
thinking whether to play tennis or not. Given the following weather
data, will you advise Mr. Alex to play tennis or not? Apply Naïve Bayes
Day Outlook Temperature Humidity Wind Play Tennis
Day1 Sunny Hot High Weak No
Day2 Sunny Hot High Strong No
Day3 Overcast Hot High Weak Yes
Day4 Rainy Mild High Weak Yes
Day5 Rainy Cool Normal Weak Yes
Day6 Rainy Cool Normal Strong No
Day7 Overcast Cool Normal Strong Yes
Day8 Sunny Mild High Weak No
Day9 Sunny Cool Normal Weak Yes
Day10 Rainy Mild Normal Weak Yes
Day11 Sunny Mild Normal Strong Yes
Day12 Overcast Mild High Strong Yes
Day13 Overcast Hot Normal Weak Yes
Day14 Rainy Mild High Strong No
Naive Bayesian Classifier
• Given a training set, we can compute the probabilities
P(yes) = 9/14
P(no) = 5/14
• where P(yes) is the probability that Play Tennis = Yes and P(no) is the
probability that Play Tennis = No

Outlook      Yes   No
sunny        2/9   3/5
overcast     4/9   0
rain         3/9   2/5

Temperature  Yes   No
hot          2/9   2/5
mild         4/9   2/5
cool         3/9   1/5

Humidity     Yes   No
high         3/9   4/5
normal       6/9   1/5

Wind         Yes   No
strong       3/9   3/5
weak         6/9   2/5
Play-tennis example
Based on the examples in the table, classify the following unseen
sample X:
X = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
• That means: play tennis or not?
$$C_{NB} = \arg\max_{C \in \{yes,\, no\}} P(C) \prod_{t} P(a_t \mid C)$$
$$= \arg\max_{C \in \{yes,\, no\}} P(C)\, P(\text{Outl}=\text{sunny} \mid C)\, P(\text{Temp}=\text{cool} \mid C)\, P(\text{Hum}=\text{high} \mid C)\, P(\text{Wind}=\text{strong} \mid C)$$
• Compare the scores for yes and no, and select the one with the maximum probability
• P(yes)*P(sunny|yes)*P(cool|yes)*P(high|yes)*P(strong|yes) = 0.0053
• P(no)*P(sunny|no)*P(cool|no)*P(high|no)*P(strong|no) = 0.0206
• Answer: Play tennis = no
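The calculation can be reproduced directly from the 14 training rows. The following is a minimal sketch of a categorical Naïve Bayes scorer without smoothing; the function and variable names are illustrative and not from any library.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for Day1..Day14.
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rainy", "Mild", "High", "Weak", "Yes"),
    ("Rainy", "Cool", "Normal", "Weak", "Yes"),
    ("Rainy", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rainy", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rainy", "Mild", "High", "Strong", "No"),
]

# Count classes and attribute values per class.
class_counts = Counter(row[-1] for row in data)
cond_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
for row in data:
    label = row[-1]
    for index, value in enumerate(row[:-1]):
        cond_counts[(index, label)][value] += 1


def score(sample, label):
    """P(label) * product of P(attribute value | label), no smoothing."""
    result = Fraction(class_counts[label], len(data))
    for index, value in enumerate(sample):
        result *= Fraction(cond_counts[(index, label)][value], class_counts[label])
    return result


sample = ("Sunny", "Cool", "High", "Strong")
scores = {label: score(sample, label) for label in class_counts}
prediction = max(scores, key=scores.get)

for label, value in scores.items():
    print(label, round(float(value), 4))
print("Play tennis =", prediction)
```

Exact fractions are used so the printed values match the hand calculation (0.0053 for yes, 0.0206 for no). Because there is no smoothing, any attribute value never seen with a class (such as overcast with no) drives that class score to zero.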
Naive Bayesian Classifier
• Advantages
• Easy to implement
• Good results obtained in most of the cases
• Robust to isolated noise points
• Handle missing values by ignoring the instance during probability
estimate calculations
• Robust to irrelevant attributes
• Disadvantages
• Class conditional independence assumption may not hold for some
attributes, therefore loss of accuracy
• Practically dependencies exist among variables
• E.g. hospitals: patients: profile: age, family history, etc. symptoms:
fever, cough etc. Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve Bayesian
classifier
• How to deal with these dependencies? Bayesian Belief Networks

Assignment
• Show with example how Bayesian Belief Networks
(BBNs) work
• Your report should,
• (i) introduce BBNs,
• (ii) show algorithm,
• (iii) work out using example scenario,
• (iv) conclusion,
• (v) reference

Neural Network

The Power of Brain vs. Machine
• While the human brain is superior in creativity,
emotional intelligence, and complex problem solving,
computers are superior in processing speed,
logical reasoning, and accuracy in computation

• The Brain
– Creativity
– Association
– Complexity
– Noise Tolerance

• The Machine
– Calculation
– Precision
– Logic

Features of the Brain
• Ten billion (10^10) neurons
• Neuron switching time > 10^-3 secs
• Face Recognition ~0.1secs
• On average, each neuron has several thousand
connections
• Hundreds of operations per second
• High degree of parallel computation
• Compensated for problems by massive
parallelism
• Distributed representations
• Die off frequently (never replaced)

Neural Network classifier
• It is represented as a layered set of interconnected processors.
These processor nodes are analogous to the neurons of the brain.
• Each node has a weighted connection to several other nodes in
adjacent layers.
• Individual nodes take the input received from connected nodes and
use the weights together to compute output values.
[Figure: network diagram showing the input layer, hidden layers, and output layer]
Architecture of Neural network
• Neural networks are used to look for patterns in data, learn
these patterns, & then classify new patterns & make forecasts
• A network with the input and output layer only is called
single-layered neural network. Whereas, a multilayer neural
network is a generalized one with one or more hidden layer.
• A network containing two hidden layers is called a three-layer neural
network, and so on.
[Figure: a single-layered NN with inputs x1, x2, x3 and weights w1, w2, w3, beside a multilayer NN with input, hidden, and output nodes]
$$o = \sigma\Big(\sum_{i=1}^{n} w_i x_i\Big), \qquad \sigma(y) = \frac{1}{1 + e^{-y}}$$
A Multilayer Neural Network
• Input Layer: corresponds to the input (non-class) attributes, with
normalized attribute values.
• There are as many nodes as input attributes, X = {x1, x2, …. xm}, where m is the
number of attributes.
• Hidden Layer
– neither its input nor its output
can be observed from outside.
– The number of nodes in the
hidden layer & the number of
hidden layers depends on
implementation.
– Hidden layers are what make NNs
"deep" & enable them to learn
complex data representations.
– Hidden layers enable to extract
the relevant information from
the input data that is necessary
for making predictions or
decisions.
• Output Layer – corresponds to the class attribute. There are as
many nodes as classes (values of the class attribute).
–Ok, where k= 1, 2,.. n, where n is number of classes
Steps followed in NN
• The neuron is the basic information processing unit of a NN. It
consists of:
1. A set of links, describing the neuron inputs, with connection
weights W1, W2, …, Wm
2. An adder function (linear combiner) for computing the weighted
sum of the inputs:
$$y = \sum_{j=1}^{m} w_j x_j$$
3. An activation function (also called squashing function) for limiting
the output behavior of the neuron, for example a step function or
threshold function (hard limiting), or a sigmoid function: 1/(1+e^-x)
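The three parts of a neuron listed above can be sketched as follows. The weights and inputs are hypothetical values chosen for illustration, and the step threshold at 0 is an assumption.

```python
import math


def weighted_sum(weights, inputs):
    """Adder function: y = sum_j w_j * x_j."""
    return sum(w * x for w, x in zip(weights, inputs))


def step(y):
    """Hard-limiting activation: 1 if y > 0, otherwise 0 (threshold assumed at 0)."""
    return 1 if y > 0 else 0


def sigmoid(y):
    """Sigmoid activation: 1 / (1 + e^-y), output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-y))


# Hypothetical connection weights and inputs.
weights = [0.5, -0.25, 0.75]
inputs = [1.0, 2.0, 1.0]

y = weighted_sum(weights, inputs)   # 0.5 - 0.5 + 0.75
print(y, step(y), sigmoid(y))
```

The same weighted sum feeds either activation; the step function gives a hard 0/1 decision, while the sigmoid gives a graded value that is easier to train with gradient-based methods.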
Activation Functions
• Sigmoid
– Range of output: squashes the input values to a range between 0 and 1 (0 to 1).
– Use case: binary classification. If the ticket is cheap, go (output close to 1). If it's
expensive, do not go (output close to 0).
• Tanh (Hyperbolic Tangent)
– Range of output: squashes values between -1 and 1 (-1 to 1). It is centered around zero
and handles negative and positive values.
– Use case: best for outcomes that can be both positive and negative, such as sentiment
analysis (+ve/-ve emotion) or temperature gauges (hot/cold).
• ReLU (Rectified Linear Unit)
– Range of output: max(0, x). If the input is positive, it outputs the same value. If the
input is negative, it outputs 0.
– Use case: popular in hidden layers, where it speeds up learning.
• Leaky ReLU
– Range of output: max(0.01x, x). Instead of giving a strict 0 for negative inputs, it gives
a tiny negative output (0.01 * input).
– Use case: avoids situations where neurons "die" by always being zero in a network, which
helps the network learn better.
• Softmax
– Range of output: takes multiple values and squashes them into a range from 0 to 1,
where the total adds up to 1.
– Use case: useful for multi-class classification in the output layer. Imagine you have 3
options: cat, dog, or cow. Softmax converts the outputs into a probability for each option,
like 0.3 for cat, 0.1 for dog, & 0.6 for cow.
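The functions in the table are each only a line or two of code. The sketch below uses only the standard library; the raw scores passed to softmax are hypothetical.

```python
import math


def sigmoid(x):
    """Squashes x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squashes x into (-1, 1), centered on zero."""
    return math.tanh(x)


def relu(x):
    """max(0, x): passes positives through, zeroes negatives."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """max(slope * x, x): small negative output instead of a strict 0."""
    return max(slope * x, x)


def softmax(values):
    """Turns a list of scores into probabilities that sum to 1."""
    shift = max(values)                      # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0), tanh(0.0), relu(-2.0), leaky_relu(-2.0))

# Hypothetical raw scores for the classes cat, dog, cow.
probabilities = softmax([1.0, 0.0, 2.0])
print([round(p, 3) for p in probabilities])
```

Subtracting the largest score inside softmax does not change the result, but it avoids overflow when scores are large.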
The need to add “Bias“
• Changing the bias weight W0,i moves the threshold
location
• Bias helps the neural network to be more flexible since it
adjusts the activation function left or right, making it
centered on some value other than x = 0.
• To this effect, an additional node is added to the input
layer with a constant input, say 1 or -1.
• When this is multiplied by the weights of the hidden layer,
it provides a bias (DC offset) to the activation function.
Two Topologies of neural network
• NN can be designed in a feed forward or recurrent manner
• In a feed forward neural network connections
between the units do not form a directed cycle.
• In this network, the information moves in only one
direction, forward, from the input nodes, through the
hidden nodes & to the output nodes. There are no cycles,
loops, or feedback connections in the network, that is, no
connections extending from outputs of units to inputs of
units in the same layer or previous layers.
• In recurrent networks data circulates back & forth
until the activation of the units is stabilized
• Recurrent networks have a feedback loop where data can
be fed back into the input at some point before it is fed
forward again for further processing and final output.

Training the neural network
• The purpose is to learn to generalize using a set of sample
patterns where the desired output is known.
• Back Propagation (short for, backward propagation of
errors) is the most commonly used method for training
multilayer feed forward NN.
• Back propagation learns by iteratively processing a set of training
data (samples).
• For each sample, weights are modified to minimize the error
between the desired output and the actual output.
• After propagating an input through the network, the error
is calculated and the error is propagated back through the
network while the weights are adjusted in order to make
the error smaller.
Training Algorithm
• The learning algorithm is as follows
• Initialize the weights and threshold to small random
numbers.
• Present a vector x to the neuron inputs and calculate the
output using the adder function:
$$y = \sum_{j=1}^{m} w_j x_j$$
• Apply the activation function (in this case the step function)
such that
$$y = \begin{cases} 0 & \text{if } y \le 0 \\ 1 & \text{if } y > 0 \end{cases}$$
• Update the weights according to the error:
$$W_j = W_j + \eta \,(y_T - y)\, x_j$$
ANN Training Example
Given the two inputs x1, x2 and the target outputs below, find the equation that
helps to draw the decision boundary.

Bias  1st input (x1)  2nd input (x2)  Target output
-1    0               0               0
-1    1               0               0
-1    0               1               1
-1    1               1               1

• Let's say we have the following initializations:
W1(0) = 0.92, W2(0) = 0.62, W0(0) = 0.22, η = 0.1
• Training – epoch 1:
y1 = 0.92*0 + 0.62*0 - 0.22 = -0.22 → y = 0
y2 = 0.92*1 + 0.62*0 - 0.22 = 0.7 → y = 1 X (wrong: the target is 0, so the weights are updated)
W1(1) = 0.92 + 0.1 * (0 - 1) * 1 = 0.82
W2(1) = 0.62 + 0.1 * (0 - 1) * 0 = 0.62
W0(1) = 0.22 + 0.1 * (0 - 1) * (-1) = 0.32
y3 = 0.82*0 + 0.62*1 - 0.32 = 0.3 → y = 1

ANN Training Example
• Training – epoch 2:
y1 = 0.82*0 + 0.62*0 - 0.32 = -0.32 → y = 0
y2 = 0.82*1 + 0.62*0 - 0.32 = 0.5 → y = 1 X
W1(2) = 0.82 + 0.1 * (0 - 1) * 1 = 0.72
W2(2) = 0.62 + 0.1 * (0 - 1) * 0 = 0.62
W0(2) = 0.32 + 0.1 * (0 - 1) * (-1) = 0.42
y3 = 0.72*0 + 0.62*1 - 0.42 = 0.2 → y = 1
y4 = 0.72*1 + 0.62*1 - 0.42 = 0.92 → y = 1

• Training – epoch 3:
y1 = 0.72*0 + 0.62*0 - 0.42 = -0.42 → y = 0
y2 = 0.72*1 + 0.62*0 - 0.42 = 0.3 → y = 1 X
W1(3) = 0.72 + 0.1 * (0 - 1) * 1 = 0.62
W2(3) = 0.62 + 0.1 * (0 - 1) * 0 = 0.62
W0(3) = 0.42 + 0.1 * (0 - 1) * (-1) = 0.52
y3 = 0.62*0 + 0.62*1 - 0.52 = 0.1 → y = 1
y4 = 0.62*1 + 0.62*1 - 0.52 = 0.72 → y = 1
ANN Training Example
• Training – epoch 4:
y1 = 0.62*0 + 0.62*0 - 0.52 = -0.52 → y = 0
y2 = 0.62*1 + 0.62*0 - 0.52 = 0.10 → y = 1 X
W1(4) = 0.62 + 0.1 * (0 - 1) * 1 = 0.52
W2(4) = 0.62 + 0.1 * (0 - 1) * 0 = 0.62
W0(4) = 0.52 + 0.1 * (0 - 1) * (-1) = 0.62
y3 = 0.52*0 + 0.62*1 - 0.62 = 0 → y = 0 X
W1(4) = 0.52 + 0.1 * (1 - 0) * 0 = 0.52
W2(4) = 0.62 + 0.1 * (1 - 0) * 1 = 0.72
W0(4) = 0.62 + 0.1 * (1 - 0) * (-1) = 0.52
y4 = 0.52*1 + 0.72*1 - 0.52 = 0.72 → y = 1

• Finally:
y1 = 0.52*0 + 0.72*0 - 0.52 = -0.52 → y = 0
y2 = 0.52*1 + 0.72*0 - 0.52 = 0 → y = 0
y3 = 0.52*0 + 0.72*1 - 0.52 = 0.2 → y = 1
y4 = 0.52*1 + 0.72*1 - 0.52 = 0.72 → y = 1
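The epochs above can be replayed in code. This is a sketch of the same single-neuron update rule, using exact fractions so that borderline sums such as 0.62 - 0.62 come out as exactly 0; the step function is assumed to output 1 only when the weighted sum is strictly positive, which is what the worked numbers imply. Variable names are illustrative.

```python
from fractions import Fraction as F

# Training samples: ((x1, x2), target). The bias input is the constant -1.
samples = [((0, 0), 0), ((1, 0), 0), ((0, 1), 1), ((1, 1), 1)]
BIAS_INPUT = -1

weights = [F(92, 100), F(62, 100)]   # W1(0), W2(0)
bias_weight = F(22, 100)             # W0(0)
eta = F(1, 10)                       # learning rate


def predict(x, weights, bias_weight):
    """Weighted sum followed by a step: 1 if the sum is > 0, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias_weight * BIAS_INPUT
    return 1 if total > 0 else 0


epochs_run = 0
for epoch in range(1, 21):           # safety cap on the number of epochs
    epochs_run = epoch
    errors = 0
    for x, target in samples:
        output = predict(x, weights, bias_weight)
        if output != target:
            errors += 1
            delta = eta * (target - output)
            # W_j = W_j + eta * (y_T - y) * x_j, applied to inputs and bias.
            weights = [w + delta * xi for w, xi in zip(weights, x)]
            bias_weight = bias_weight + delta * BIAS_INPUT
    if errors == 0:                  # a full pass with no mistakes: converged
        break

print([float(w) for w in weights], float(bias_weight), epochs_run)
# [0.52, 0.72] 0.52 5
```

The fifth pass is the error-free check that corresponds to the "Finally" block; the learned boundary is 0.52*x1 + 0.72*x2 - 0.52 = 0.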
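ANN Training Example
[Figure: two plots of the four training points in the x1-x2 plane, with "o" marking the points at x2 = 0 and "+" marking the points at x2 = 1]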
Pros and Cons of Neural Network
• Useful for learning complex data like handwriting, speech
and image recognition
Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle large number of features
Cons
- Slow training time
- Hard to interpret & understand the learned function (weights)
- Hard to implement: trial & error for choosing number of nodes
• Neural networks need a long time for training.
• Neural networks have a high tolerance to noisy and
incomplete data.
Machine Learning vs. Deep Learning
• AI is a broad field; machine learning is a subset (and an application)
of AI & Deep learning is a subset of machine learning
• Machine learning is more explicitly used as a means to extract
knowledge from data through simpler methods such as decision trees,
linear regression, neural networks
• Deep learning uses the more advanced methods found in artificial
neural networks.
Deep Learning vs. Machine Learning
• Problem
– ML: helps to solve less-complex tasks.
– DL: helps to solve the most complex tasks.
• Data volume
– ML: small datasets. ML achieves meaningful results with thousands of data points.
– DL: big data. The effectiveness of DL models depends on millions of data points
(terabytes and petabytes).
• Data type
– ML: structured.
– DL: structured and unstructured (videos, texts, sensor data, images, etc.).
• Model complexity
– ML: models are less complex.
– DL: models are very complex.
• Structure (how it works)
– ML: algorithms have a simple structure, such as linear regression or a decision tree.
– DL: based on an ANN with a multi-layered structure, like a human brain, which is
complex and intertwined.
• Algorithms
– ML: supervised & unsupervised learning.
– DL: in addition, self-supervised learning.
• Human intervention
– ML: uses features selected by a data analyst, who checks whether the output is as
required & adjusts the algorithm if this is not the case.
– DL: requires much less human intervention. DL extracts features automatically, and
the algorithm learns from its own errors.
• Computing power
– ML: good computing power (runs on CPU).
– DL: requires more computational power, like GPU, TPU, DPU, QPU.
Self-supervised learning
• Self-supervised learning (SSL) is a machine learning
technique where a model learns representations or
features directly from the input data without explicit
supervision or labelled targets.
• Supervised learning trains models on labelled data (input-output
pairs), and unsupervised learning deals with unlabeled data.
• SSL instead utilizes the inherent structure or characteristics within the
data to generate its own supervisory signals
Benefits of Self-Supervised Learning
• Self-supervised learning (SSL) introduces a paradigm shift in ML, offering a range of
advantages that redefine how models learn from data without explicit supervision.
• 1. Addressing Data Scarcity and Labeled Data Challenges
• Mitigating the Need for Extensive Labeled Data: SSL reduces dependence on
large, annotated datasets, making it feasible to train models even when labelled
data is scarce or costly.
• Leveraging Unlabeled Data: SSL efficiently utilizes vast pools of unlabeled data,
tapping into their latent information to generate valuable supervisory signals for
training.
• 2. Improving Model Generalization and Performance
• Learning Richer Representations: SSL facilitates the extraction of high-quality,
nuanced representations directly from raw data, enhancing a model’s ability to
generalize across diverse tasks and datasets.
• Enhanced Transfer Learning: Models trained using SSL often exhibit superior
transfer learning capabilities, as the learned representations are more adaptable
and applicable to new, unseen domains or tasks.
• 3. Reducing Human Intervention and Labor-Intensive Labeling Processes
• Cost and Time Efficiency: By minimizing the need for manual labelling efforts, SSL
streamlines the training process, reducing time and monetary investments
associated with data annotation.
• Automation and Scalability: SSL’s reliance on self-generated tasks enables
automated learning processes, facilitating scalability across domains without
Deep learning
• Deep learning is a method that teaches computers to process data in
a way that is inspired by the human brain.
• A neural network is the underlying technology in deep learning. It
consists of interconnected nodes or neurons in a layered structure.
• Deep learning models can recognize complex patterns in pictures,
text, sounds, and other data to produce accurate insights and
predictions.
Deep Neural Networks (DNN)
• A deep neural network has multiple hidden layers between the
input & output layers.
• A deep neural network is simply a feed forward network with
many hidden layers.

The difference between deep and normal neural networks?


• In a simple NN, there is only a single hidden layer.
• The number of parameters in a simple neural network is relatively low compared to
deep learning systems.
• Simple neural networks are less complex and computationally less demanding.
• In contrast, deep NN have several hidden layers that make them deep
• deep learning algorithms are more complicated than simple neural networks as they
involve more layers of nodes.
Hyper-parameters
• Learning rate
– How it is used: controls the step size during gradient descent. The learning rate is
fine-tuned to optimize the convergence rate and reduce the likelihood of divergence. For
transformer-based models, smaller learning rates were prioritized to avoid overfitting due
to the high complexity of pretrained embeddings.
– Possible values: 0.0001 to 0.001 (grid search)
• Batch size
– How it is used: the number of training samples used in one iteration, allowing the model
to process multiple input samples simultaneously, thereby speeding up the training process
while maintaining a good level of generalization. Batch size is needed to balance training
speed & memory needs. A bigger batch size is used where resources permit, for faster
convergence, while a smaller batch size is explored in complicated models to stay within
computing restrictions.
– Possible values: 16, 32, 64, 128 (grid search)
• Optimizer
– How it is used: fine-tunes a neural network's parameters during training. The Adam
optimizer is used in most cases.
– Possible values: Adam, SGD
• Dropout
– How it is used: involves temporarily removing nodes (input or hidden) in a NN, along with
their connections, creating a new architecture from the original network. It reduces
overfitting while ensuring generalization capability. The dropout rate is carefully
adjusted to achieve robust sequence labelling.
– Possible values: 0.3 to 0.5
• Loss function
– How it is used: computes the difference between the predicted probabilities and the
true labels.
– Possible values: Categorical Cross-Entropy
• Activation function
– Possible values: Sigmoid for binary classification; Softmax for multi-class
classification; Tanh (Hyperbolic Tangent) for +ve and -ve results; ReLU for optimizing
hidden layers
Types of Deep Neural Networks (DNN)
• Multi-Layer Perceptrons (MLP)
• Convolutional Neural Network (CNN)
• Recurrent Neural Network (RNN)
• Generative Adversarial Network (GAN)
• Transfer Learning (TL)
Multilayer Perceptrons (MLPs)
• A multilayer perceptron (MLP) is a class of a feed forward artificial neural
network (ANN) with multiple layers, including an input layer, one or more
hidden layers and output layer.
• MLP models are the most basic Deep Neural Network, which is composed of a series
of fully connected layers.
• Each new layer is a set of nonlinear functions of a weighted sum of all
outputs (fully connected) from the prior one.
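As a sketch of "a nonlinear function of a weighted sum of all outputs from the prior layer", the snippet below runs one forward pass through a tiny MLP with one hidden layer. All weights, biases, and inputs are hypothetical, and ReLU is chosen as the hidden activation purely for illustration.

```python
def dense(inputs, weights, biases):
    """Fully connected layer: each output is a weighted sum of ALL inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Element-wise nonlinearity applied to a layer's outputs."""
    return [max(0.0, v) for v in values]


# Hypothetical network: 2 inputs -> 2 hidden nodes -> 1 output node.
x = [1.0, 2.0]
hidden_weights = [[0.5, -0.5],
                  [1.0, 0.0]]
hidden_biases = [0.0, 0.0]
output_weights = [[1.0, 2.0]]
output_biases = [0.5]

hidden = relu(dense(x, hidden_weights, hidden_biases))   # [0.0, 1.0]
output = dense(hidden, output_weights, output_biases)    # [2.5]
print(hidden, output)
```

Stacking more `dense` plus activation steps between the input and output is all that is needed to make this network deeper.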
Convolutional Neural Network (CNN)
• A convolutional neural network (CNN, or ConvNet) is another class of deep
neural networks.
• Unlike the fully connected layers in MLPs, in CNN models one or multiple
convolution layers extract features from the input by executing convolution operations.
• The process starts by sliding a filter designed to detect certain features over
the input image, using the convolution operation (hence the name
"convolutional neural network").
• The result of the convolution operation is a feature map that highlights the presence
of the detected features in the image. This feature map then serves as an input for the
next layer, enabling a CNN to gradually build a hierarchical representation of the
image.
Architecture of a CNN
• CNNs are most commonly found in computer vision.
• Given a series of images or videos from the real world, CNN automatically
extracts features (or a high-level representation of the input data).
• Once the features are extracted, the network takes steps to reduce
the spatial dimensions of the feature maps using pooling operation
to improve efficiency and accuracy.
• In the final layer of a CNN, the model makes a final decision - for
example, image classification, face recognition
Convolution Operation
• The convolution operation involves multiplying the
kernel values by the original pixel values of the
image and then summing up the results.
Input (8 x 8):
7 7 6 5 5 6 7 7
7 7 6 5 5 6 7 7
6 6 4 3 3 4 6 6
5 5 3 2 2 3 5 5
5 5 3 2 2 3 5 5
6 6 4 3 3 4 6 6
7 7 6 5 5 6 7 7
7 7 6 5 5 6 7 7

Kernel (3 x 3):
 0 -1  0
-1  5 -1
 0 -1  0

Output = Input * Kernel (6 x 6):
9 8 6 6 8 9
8 2 1 1 2 8
6 1 0 0 1 6
6 1 0 0 1 6
8 2 1 1 2 8
9 8 6 6 8 9

• There are various methods to decide the digits inside the kernel. This will
depend on the effect you want to achieve such as detecting edges,
blurring, sharpening
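The input, kernel, and output of the example can be checked with a few lines of plain Python. The sketch slides the kernel over every 3 x 3 window without padding; like most CNN libraries it does not flip the kernel, which makes no difference here because this kernel is symmetric. The function name is illustrative.

```python
image = [
    [7, 7, 6, 5, 5, 6, 7, 7],
    [7, 7, 6, 5, 5, 6, 7, 7],
    [6, 6, 4, 3, 3, 4, 6, 6],
    [5, 5, 3, 2, 2, 3, 5, 5],
    [5, 5, 3, 2, 2, 3, 5, 5],
    [6, 6, 4, 3, 3, 4, 6, 6],
    [7, 7, 6, 5, 5, 6, 7, 7],
    [7, 7, 6, 5, 5, 6, 7, 7],
]

kernel = [
    [0, -1, 0],
    [-1, 5, -1],
    [0, -1, 0],
]


def convolve2d_valid(image, kernel):
    """Multiply the kernel with each image window and sum the products (no padding)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(k_rows) for j in range(k_cols))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


feature_map = convolve2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

An 8 x 8 input with a 3 x 3 kernel and no padding gives a 6 x 6 feature map, which is why the output in the example is smaller than the input.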
Effects of Kernel
How Pooling Layers Work
• Imagine you have a large image and want to make it smaller but keep all the
important features like edges and colors.
• The pooling layer operates independently on every depth slice of the input. It
resizes it spatially, using the Max or Average of the values in a window slide
over the input data.
• In this example, given a 2x2 window moved in steps of 2, the pooling operation
reduces the feature map from (6 × 6) to (3 × 3).

Convolved feature (6 x 6):
9 8 6 6 8 9
8 2 1 1 2 8
6 1 0 0 1 6
6 1 0 0 1 6
8 2 1 1 2 8
9 8 6 6 8 9

Max values (3 x 3):
9 6 9
6 0 6
9 6 9

Average values (3 x 3):
6.75 3.5  6.75
3.5  0    3.5
6.75 3.5  6.75
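Pooling over the convolved feature map can be sketched as below. The 2 x 2 window with a stride of 2 (non-overlapping windows) is an assumption about the example; the function name is illustrative.

```python
feature_map = [
    [9, 8, 6, 6, 8, 9],
    [8, 2, 1, 1, 2, 8],
    [6, 1, 0, 0, 1, 6],
    [6, 1, 0, 0, 1, 6],
    [8, 2, 1, 1, 2, 8],
    [9, 8, 6, 6, 8, 9],
]


def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Summarize each size x size window by its maximum or its average."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                pooled_row.append(max(window))
            else:
                pooled_row.append(sum(window) / len(window))
        pooled.append(pooled_row)
    return pooled


max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)
print(avg_pooled)
```

Each 6 x 6 map shrinks to 3 x 3; max pooling keeps the strongest response in each window, while average pooling keeps the mean response.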
Architecture of a CNN: Fully connected layer
• The fully connected layer is
responsible for classifying images
based on the features extracted in
the previous layers.

• Without dense layers, CNNs would not be able to perform tasks, such as
images classification, smile detection, human activity recognition or making
predictions based on visual inputs.
• Dense layers allow each neuron to interact with all neurons in the previous layer. In
contrast, sparse layers only allow each neuron to interact with a subset of the neurons
in the previous layer
• Not all layers in a CNN are fully connected. Because fully connected layers
have many parameters, applying this approach throughout the entire network
would create unnecessary density, increase the risk of overfitting and make
the network very expensive to train in terms of memory and computation.
• Limiting the number of fully connected layers balances computational efficiency and
generalization ability with the capability to learn complex patterns.
Architecture of a CNN: Fully
connected layer
• While convolutional layers are good at detecting features in input data,
– dense layers are essential for integrating these features into the final
classification decision, i.e. the prediction.
• Fully connected layers (dense layers) are designed to operate on
1-dimensional data; hence,
• flattening is a necessary step in the transition from the multidimensional
tensors produced by convolutional layers to the format required for
dense layers.
Flattening layers
• After convolutional and pooling layers have extracted relevant
features from the input image we have to turn this high-dimensional
feature map into a format suitable for feeding into fully connected layers.
• This is where flattening layers come into action.
• A flattening layer takes the entire feature map and reorganizes it into a single, long
vector.
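As a rough illustration, flattening can be sketched with nested lists. The two 3 × 3 channels below are hypothetical values chosen only to show the reshaping; deep learning libraries provide their own flatten layers for real models.

```python
def flatten(feature_maps):
    """Turn a channels x rows x cols nested list into one long vector."""
    return [value
            for channel in feature_maps
            for row in channel
            for value in row]


# Two hypothetical 3 x 3 pooled feature maps (one per channel).
pooled = [
    [[9, 6, 9], [6, 0, 6], [9, 6, 9]],
    [[7, 4, 7], [4, 0, 4], [7, 4, 7]],
]

vector = flatten(pooled)
print(len(vector))  # 2 channels * 3 rows * 3 cols = 18 values
```

The resulting 18-element vector is the kind of 1-dimensional input a dense layer expects.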

Examples of CNN Models
• Example applications of CNN include
• image classification (e.g., AlexNet, VGG, ResNet, MobileNet)
• object detection (e.g., Fast R-CNN, Mask R-CNN, YOLO, SSD).
• AlexNet. The first CNN to win the ImageNet Challenge (in 2012) for image
classification, AlexNet consists of five convolution layers and three fully
connected layers. It requires 61 million weights and 724 million MACs
(multiply-accumulate operations) to classify an image with a size of 227×227.
• VGG-16. To achieve higher accuracy, VGG-16 uses a deeper structure
of 16 layers consisting of 13 convolution layers and three fully connected
layers. This requires 138 million weights and 15.5G MACs to classify an image
with a size of 224×224.
• GoogLeNet. To improve accuracy while reducing the computation of DNN
inference, GoogLeNet introduces an inception module composed of different-
sized filters. As a result, GoogLeNet achieves better accuracy
than VGG-16 while requiring only seven million weights and 1.43G MACs to
process an image of the same size.
• ResNet. A state-of-the-art effort, ResNet uses a “shortcut” structure to
reach human-level accuracy with a top-5 error rate below 5%. In addition,
the “shortcut” module mitigates the vanishing gradient problem during the
training of the model, making it possible to train a DNN model with a deeper
structure.
CNN Application in Healthcare

• In the healthcare sector, CNNs are used to assist in medical
diagnostics and imaging.
• For example, a CNN could analyze medical images such as X-rays, CT
scans, MRI scans or pathology slides to detect anomalies indicative of
disease, thereby aiding in diagnosis and treatment planning.
Recurrent Neural Networks (RNNs)
• A recurrent neural network (RNN) is another class of deep neural network,
one that is fed sequential data.
• RNNs have been developed to address the time-series problem of sequential input
data.
• The input of RNN consists of the current input and the previous samples.
• The connections between nodes form a directed graph along a temporal sequence.
• Each neuron in an RNN owns an internal memory that keeps the information of the
computation from the previous samples.
RNN
• ANNs and CNNs are examples of feed-forward NNs, while an RNN is a recurrent
NN, where information can flow back and forth through internal loops,
allowing the network to consider past information when processing the
current input.
• Unlike traditional NNs where each input is independent, RNNs can access
and process information from previous inputs. This makes them particularly
useful for tasks that involve sequences, like text, speech, or time series data.
Why use RNN ?
• Traditional Artificial Neural Networks (ANNs) are powerful tools, but they
struggle with sequential data like text because they require fixed-size inputs.
• Each input in an ANN is treated independently, making them unsuitable for tasks
where the order and relationships between elements are crucial.
• Suppose we use zero padding, in which shorter sequences are
padded with zeros at the end to reach the length of the longest sequence in
the batch. These zeros act as placeholders and don’t carry any meaningful
information.
• Padding introduces irrelevant zeros that the network needs to process alongside the
actual data, increasing the computational burden.
• Also, because an ANN does not receive its input as a sequence, we lose the context,
i.e. the sequential information. Apart from this, if a user gives an input
longer than we expect, there is nothing we can do.
• For example, if we set our input size to 5 words but a user gives 15 words at a
time, we can’t handle it with an ANN.
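The cost of zero padding can be seen in a small sketch. The batch below is hypothetical (each number stands for one encoded word), and `pad_sequences` is an illustrative helper rather than a specific library function.

```python
def pad_sequences(sequences, pad_value=0):
    """Pad every sequence at the end to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Hypothetical batch of three encoded sentences of different lengths.
batch = [[4, 7, 1], [9, 2], [3, 8, 6, 5, 2]]
padded = pad_sequences(batch)

total_values = sum(len(seq) for seq in padded)
real_values = sum(len(seq) for seq in batch)
wasted = total_values - real_values

print(padded)                # every row now has length 5
print(wasted, total_values)  # 5 of the 15 values are placeholder zeros
```

In this batch a third of the values the network has to process carry no information, which is the overhead described above.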
Recurrent Neural Network (RNNs)
• RNN models are widely used in Natural Language
Processing (NLP) because they can process
data whose input length is not fixed.
– The task of the RNN is to build a model that can comprehend
natural language spoken by humans. Examples include natural
language modeling, word embedding, and machine translation.
• In RNNs, each subsequent layer is a collection of nonlinear functions of
weighted sums of outputs and the previous state. Thus, the basic unit of an
RNN is a “cell”: each cell consists of layers, and a series of cells enables
the sequential processing of recurrent neural network models.
How RNNs work?
• The figure below shows a simplified sentiment analysis process using a
Recurrent Neural Network (RNN).
• Sample text is assigned sentiment labels (0/1 for positive/negative). Unique words are
converted to numbers for the RNN to understand. These numbers are fed into the RNN
one by one, with each word considered a single time step in the sequence.
• This demonstrates how RNNs can analyze sequential data like text to predict sentiment.
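The encoding step can be sketched as follows. The sample sentences are hypothetical, and the labels assume 1 = positive and 0 = negative, which is one possible reading of the 0/1 labels mentioned above.

```python
# Hypothetical labelled samples: (text, sentiment label), assuming 1 = positive.
samples = [
    ("good movie", 1),
    ("bad movie", 0),
    ("good acting good story", 1),
]

# Give each unique word an integer id, in order of first appearance.
vocab = {}
for text, _ in samples:
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1  # ids start at 1; 0 is left free for padding

# Each sentence becomes a sequence of ids, one id per time step.
encoded = [[vocab[word] for word in text.split()] for text, _ in samples]
labels = [label for _, label in samples]

print(vocab)
print(encoded)
```

Each list in `encoded` would then be fed to the RNN one number at a time.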
How RNNs work?
• Let the input weights be denoted by wi, the output weights by wo, the
feedback loop weights by wh, the bias of the first hidden layer by bi and the
bias of the output layer by bo. The mathematical formulation of the above is
then given by:
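A common way to write this recurrence, assuming a tanh activation in the hidden layer and a linear output (both are assumptions, since the activations are not named here), is h(t) = tanh(wi·x(t) + wh·h(t−1) + bi) and y(t) = wo·h(t) + bo. A minimal scalar sketch with illustrative weight values:

```python
import math


def rnn_forward(inputs, wi, wh, wo, bi, bo):
    """Run a one-unit RNN over a sequence; returns all outputs and the last state."""
    h = 0.0  # initial hidden state
    outputs = []
    for x in inputs:
        h = math.tanh(wi * x + wh * h + bi)  # new state mixes input and old state
        outputs.append(wo * h + bo)          # output computed from the state
    return outputs, h


# Illustrative (not trained) weights.
params = dict(wi=0.5, wh=0.8, wo=1.0, bi=0.0, bo=0.0)

outputs_a, h_a = rnn_forward([1.0, 2.0], **params)
outputs_b, h_b = rnn_forward([2.0, 1.0], **params)
print(h_a, h_b)  # same inputs in a different order give a different final state
```

Because h(t) depends on h(t−1), the final state reflects the order of the inputs, which a feed-forward network applied to each input separately cannot capture.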
Recurrent Neural Network (RNN) Models
• Long Short-Term Memory (LSTM). LSTM models address the
vanishing gradient problem.
• The incorporation of specialized memory cells and gating mechanisms
makes the learning of long-term dependencies in sequential data possible.
• LSTM architectures are capable of learning long-term dependencies in
sequential data, which makes them well-suited for tasks such as language
translation, speech recognition, and time series forecasting.
• Gated Recurrent Unit (GRU). Similar to LSTMs, GRU networks
capture long-range dependencies in sequential data.
• GRU architecture is simpler when compared to LSTMs, with fewer
parameters, thus, making them more computationally efficient in some
cases.
• Bidirectional RNNs (BiLSTM and BiGRU). These process input sequences
both forward and backward, allowing them to capture
dependencies from past and future contexts.
• They consist of two LSTMs or GRUs: one takes the input in the forward direction
and the other in the backward direction, hence the name bidirectional recurrent
neural network.
• This makes them useful in tasks such as speech recognition and
machine translation.
CNN vs. RNN
• CNNs are commonly used to solve problems involving spatial data, such as
images.
• RNNs are better suited to analyzing temporal and sequential data, such as text or
videos.
• CNNs and RNNs have different architectures.
• CNNs are feed-forward neural networks that use filters and pooling layers,
• whereas RNNs feed results back into the network.
• In CNNs, the size of the input and the resulting output are fixed. A CNN
receives images of fixed size and outputs a predicted class label for each
image along with a confidence level.
• In RNNs, the size of the input and the resulting output can vary.
• Common use cases for CNNs include facial recognition, medical analysis and
image classification.
• Common use cases for RNNs include machine translation, natural language
processing, sentiment analysis and speech analysis.
Transfer Learning
• In transfer learning, a machine exploits the knowledge gained
from a previous task to improve generalization about another.
• Transfer learning is a technique in machine learning where a model trained on one
task is used as the starting point for a model on a second task.
• For example, a classifier trained to predict whether an image shows a dog could
reuse the knowledge it gained during training to recognize cats.

• Transfer learning is a popular approach in deep learning, as it
enables the training of deep neural networks with less data.
– An already developed ML model is reused in another task.
Advantages of Transfer Learning
• Training a model takes a large amount of computer resources, data and time. Using a pretrained
model as a starting point helps cut down on all three, as developers don't have to start from
scratch, training a large model on what would be an even bigger data set.
• Reduces data needs. By using pretrained models that were already trained on their own large data sets,
transfer learning enables developers to create new models even when they don't have access to massive
amounts of labeled data.
• Speeds up the training process. Transfer learning speeds up the training process of a new model, as it
starts with pre-learned features, leading to less time required to learn a new task.
• Reduces computational cost. Transfer learning reduces the costs of building models by enabling them to
reuse previously trained parameters. This process is more efficient than training a model from scratch.
• ML algorithms are typically designed to address isolated tasks. Through transfer learning, methods
are developed to transfer knowledge from one or more of these source tasks to improve learning
in a related target task.
• Developers can choose to reuse in-house ML models, or they can download them from other developers
who have published them on online repositories or hubs. Knowledge from an already trained ML model
must be similar to the new task to be transferable. For example, the knowledge gained from recognizing
an image of a dog in a supervised ML system could be transferred to a new system to recognize images of
cats. The new system filters out images it already recognizes as a dog.
• Provides performance improvements. In cases where the target task is closely related to the
source task, performance can improve due to the knowledge it gains from training on the first task.
• Prevents overfitting. Overfitting occurs when a model fits too closely to its training data, making
the model unable to make accurate generalizations. By starting with a well-trained model, transfer
learning helps prevent overfitting, especially when target data sets are small.
• Provides versatility. Pretrained models consist of knowledge gained from one or more previous
data sets. This can potentially lead to better performance on different tasks. Transfer learning can
also be applied to different ML tasks, such as image recognition and natural language processing
(NLP).
Types of transfer learning
• Transfer learning can be accomplished in several ways.
• One way is to find a related learned task -- labeled as Task B -- that has
plenty of transferable labeled data. The new model is then trained on Task
B. After this training, the model has a starting point for solving its initial
task, Task A.
• Another way to accomplish transfer learning is to use a pretrained model.
This process is easier, as it involves the use of an already trained model.
The pretrained model should have been trained using a large data set to
solve a similar task as task A. Models can be imported from other
developers who have published them online.
• A third approach, called feature extraction or representation learning, uses
deep learning to identify the most important features for Task A, which
then serves as a representation of the task. Features are normally created
manually, but deep learning automatically extracts features. The learned
representation can be used for other tasks as well.
Classification of transfer learning
One way of classifying transfer learning
• Transductive transfer.
• Source and target tasks are the same but use different data sets.
• Inductive transfer.
• Source and target tasks are different, regardless of the data set. Source and target data are typically
labeled.
• Unsupervised transfer.
• Source and target tasks are different, but the process uses unlabeled source and target data.
Unsupervised learning is useful in settings where manually labeling data is impractical.

Transfer learning can also be classified into near and far transfers.
• Near transfers are when the source and target tasks are closely related,
• while far transfers are when source and target tasks are vaguely related.
• If the tasks are closely related, this means they share similar data structures, features or
domains.
Another way to classify transfer learning is based on how well the knowledge
from a pretrained model facilitates performance on a new task. These are
classified as positive, negative and neutral transfers:
• Positive transfers occur when the knowledge gained from the source task
actively improves the performance on the target task.
• Negative transfers see a decrease in the performance of the new task.
• Neutral transfers occur when the knowledge gained from the source tasks has
little to no impact on the performance of the target task.
Key use cases for transfer learning
• Deep learning. Transfer learning is commonly used for deep learning neural
networks to help solve problems with limited data. Deep learning models
typically require large amounts of training data, which can be difficult and
expensive to acquire.
• NLP. Using transfer learning to train NLP models can improve performance by
transferring knowledge across tasks related to machine translation, sentiment
analysis and text classification.
• Computer vision. Pretrained models are useful for training computer vision tasks
like image segmentation, facial recognition and object detection, if the source
and target tasks are related.
• Image recognition. Transfer learning can improve the performance of models trained on limited
labeled data, which is useful in situations with limited data, such as medical imaging.
• Speech recognition. Models previously trained on large speech data sets are
useful for creating more versatile models. For example, a pretrained model could
be adapted to recognize specific languages, accents or dialects.
• Object detection. Pretrained models that were trained to identify specific objects
in images or videos could hasten the training of a new model. For example, a
pretrained model used to detect mammals could be added to a data set used to
identify different types of animals.
Future of transfer learning
• The future of transfer learning includes the following trends, which
might further shape ML and the development of ML models:
• The increased use of multimodal transfer learning. Models are
designed to learn from multiple types of data simultaneously. These can
include text, image and audio data sets, for example, which leads to
more versatile ML and artificial intelligence (AI) systems.
• Federated transfer learning. This combines transfer and federated
learning.
• Federated transfer learning enables models to transfer knowledge between
decentralized data sources but does so in a way that keeps local data private. This
enables multiple organizations to collaborate to improve their models across
decentralized data sources while also maintaining data privacy.
• Lifelong transfer learning. This creates a model that can continuously
learn and adapt to new tasks and data over time.
• Zero-shot and few-shot transfer learning. Both methods are designed
to enable ML models to perform well with minimal or no training data.
• Zero-shot revolves around the concept of predicting labels for unseen data
classes, and few-shot learning involves learning from only a small amount of data
per class. Using this practice, models can rapidly learn to make effective
generalizations with little data. This practice has the potential to reduce the
reliance organizations have on collecting large data sets for training.
Descriptive modeling using
Clustering algorithms
Clustering
• Clustering is a data mining (machine learning)
technique that finds similarities between data
according to the characteristics found in the data &
groups similar data objects into one cluster
• Given a set of data points, each
having a set of attributes, and a
similarity measure among them,
group the points into some number
of clusters, so that
• Data points in the same cluster are
similar to one another.
• Data points in separate clusters are
dissimilar to one another.

Clustering: Document Clustering
• Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
• Approach:
– Identify content-bearing terms in each document.
– Form a similarity measure based on the frequencies of different terms and use it to cluster
documents.

• Application:
– Information Retrieval can utilize the clusters to relate a new document or search term to clustered
documents.
Hard vs. soft clustering
• Hard clustering: Each document belongs to exactly one cluster
• More common and easier to do

• Soft clustering: A document can belong to more than one cluster.
– Makes more sense for applications like creating browsable hierarchies
– You may want to put a book about health informatics
in two clusters: (i) Health and (ii) Information Technology
– You can only do that with a soft clustering approach.
Property of clustering
• In the process of constructing groups (or clusters) ensure the following
property:
• Data points within a group must be as similar as possible (high intracluster similarity),
• Data points belonging to different groups must be as different as possible (low
intercluster similarity).
Cluster quality evaluation
• Two categories of cluster evaluation measures:-
• External Measure:
• When class labels are available, we use them to evaluate the
clustering results.
• Some measures of cluster quality
include the Jaccard index, the Rand index and purity
• Internal Measure:
• This is the more general one when the class label is not
available.
• The silhouette coefficient is one such popular measure.
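Two of the external measures named above, purity and the Rand index, can be sketched in a few lines. The labels below are hypothetical: six objects with known classes "a" and "b", and a clustering that puts one "a" object in the wrong cluster.

```python
from collections import Counter
from itertools import combinations


def purity(labels_true, labels_pred):
    """Fraction of objects that belong to the majority class of their cluster."""
    total = 0
    for cluster in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == cluster]
        total += Counter(members).most_common(1)[0][1]
    return total / len(labels_true)


def rand_index(labels_true, labels_pred):
    """Fraction of object pairs on which the classes and the clusters agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = 0
    for i, j in pairs:
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_class == same_cluster:
            agree += 1
    return agree / len(pairs)


labels_true = ["a", "a", "a", "b", "b", "b"]  # hypothetical class labels
labels_pred = [0, 0, 1, 1, 1, 1]              # hypothetical clustering result

print(purity(labels_true, labels_pred))      # 5 of 6 objects match the majority class
print(rand_index(labels_true, labels_pred))  # 10 of 15 pairs agree
```

Both measures equal 1 for a clustering that reproduces the class labels exactly.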
Internal measure: Silhouette Coefficient:
• The silhouette value is a measure of how similar an object is to its own
cluster (cohesion) compared to other clusters (separation).
• The Silhouette coefficient is a value between -1 and 1, where higher
values indicate a better clustering.
• The silhouette coefficient is calculated for each point; the score for a
cluster or for the entire dataset is obtained by averaging the values of the
individual points.
Silhouette score
• For a data point i in cluster Ci, take the mean distance between i and
all other data points in the same cluster:
• a(i) = avg distance of i to other points in the same cluster (cohesion)
• The mean dissimilarity of point i to some other cluster Cj is the mean of
the distances from i to all points in Cj:
• b(i) = the smallest of these means, i.e. the avg distance to the nearest other
cluster (separation)
• The silhouette value of one data point i is
$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$
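The silhouette value of a point is commonly defined as s(i) = (b(i) − a(i)) / max(a(i), b(i)). A minimal sketch using Euclidean distance on a hypothetical four-point dataset; by the usual convention, a point alone in its cluster gets s(i) = 0.

```python
import math


def silhouette_value(points, labels, i):
    """Silhouette of point i: (b - a) / max(a, b)."""
    own = labels[i]
    same = [math.dist(points[i], p)
            for j, p in enumerate(points) if labels[j] == own and j != i]
    if not same:
        return 0.0  # convention for a cluster with a single point
    a = sum(same) / len(same)  # cohesion: mean distance within own cluster
    b = min(                   # separation: mean distance to nearest other cluster
        sum(math.dist(points[i], p) for p, l in zip(points, labels) if l == other)
        / labels.count(other)
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)


# Hypothetical data: two compact, well-separated clusters.
points = [(1, 1), (1, 2), (5, 5), (5, 6)]
labels = [0, 0, 1, 1]

values = [silhouette_value(points, labels, i) for i in range(len(points))]
score = sum(values) / len(values)  # silhouette score of the whole clustering
print(values)
print(score)
```

All four values come out close to 1 here, because each point is much nearer to its own cluster than to the other one.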
Calculate silhouette score for the below
problem
Similarity/Dissimilarity Measures
• Each clustering problem is based on some kind of distance
“farness” or “nearness” measurement between data points.
• Distances are normally used to measure the similarity or dissimilarity
between two data objects
• Similarity Measures:
• Euclidean Distance if attributes are continuous.
• Other Problem-specific Measures.
• A popular similarity measure is the Minkowski distance:

$$ dis(X, Y) = \sqrt[q]{\sum_{i=1}^{n} |x_i - y_i|^{q}} $$

where X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) are two n-
dimensional data objects; n is the number of attributes of the
data object; q = 1, 2, 3, …
Similarity & Dissimilarity Between Objects

• If q = 1, dis is the Manhattan (or city block) distance:

$$ dis(X, Y) = \sum_{i=1}^{n} |x_i - y_i| $$

• If q = 2, dis is the Euclidean distance:

$$ dis(X, Y) = \sqrt{\sum_{i=1}^{n} |x_i - y_i|^{2}} $$
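The Minkowski distance and its two special cases can be checked with a short sketch; the two points are chosen only for illustration.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)


x, y = (2, 10), (5, 8)

manhattan = minkowski(x, y, 1)  # |2 - 5| + |10 - 8| = 5
euclidean = minkowski(x, y, 2)  # sqrt(3^2 + 2^2) = sqrt(13)
print(manhattan, euclidean)
```

Setting q = 1 gives the Manhattan distance and q = 2 the Euclidean distance, so one function covers both.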
The need for representative
• Key problem: as you build clusters, how do you represent the location
of each cluster, to tell which pair of clusters is closest?
• For each cluster, assign a centroid (the point closest, on average, to all
the other points), computed as the average of its points:

$$ C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N} $$

where the sum runs over the N points t_{ip} of the cluster.

• One can measure intercluster distances by the distances between centroids.
Major Clustering Approaches
• Partitioning clustering approach: also called Centroid-based
Clustering
• Construct various partitions as per the given number of clusters
• Typical methods: distance-based K-means clustering
• model-based: expectation maximization (EM) clustering.
• Hierarchical clustering approach: also called Connectivity-based
Clustering
• Create a hierarchical decomposition of the set of data (or objects)
using some criterion
• Typical methods: agglomerative vs. divisive clustering
• single link vs. complete link vs. average link clustering

• Density-based clustering approach (like DBSCAN)
• Partitioning methods (K-means) and hierarchical clustering are suitable
only for compact and well-separated clusters. Moreover, they are also
severely affected by the presence of noise and outliers in the data.
• Real-life data may contain irregularities: clusters can be of
arbitrary shape and/or the data may contain noise.
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D of
n objects into a set of k clusters; such that, sum of squared
distance is minimum
• Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
• Heuristic methods: k-means and k-medoids algorithms
• k-means: Each cluster is represented by the center of the
cluster
• K is the number of clusters to partition the dataset
• Means refers to the average of data points in a particular cluster that is used for representing the
cluster
• k-medoids or PAM (Partition Around Medoids): Each cluster is
represented by one of the objects in the cluster

The K-Means Clustering Algorithm
• Given k (number of clusters), the k-means algorithm is
implemented as follows:
• Select K points randomly as the initial centroids
• Repeat until the centroids no longer change:
• Compute the distance between each instance and
each centroid
• Assign each instance to the cluster with the
nearest centroid
• Recompute the centroid of each of the K clusters of the
current partition (the centroid is the center, i.e.,
mean point, of the cluster)
Example Problem
• Cluster the following eight points (with (x, y)
representing locations) into three clusters :
A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8)
A5(7, 5) A6(6, 4) A7(1, 2) A8(4, 9).
• Assume that the initial cluster centers are:
A1(2, 10), A3(8, 4) and A7(1, 2).
• The distance function between two points Aj=(x1, y1)
and Ci=(x2, y2) is defined as:
dis(Aj, Ci) = |x2 – x1| + |y2 – y1| .
• Use k-means algorithm to find optimal centroids to group
the given data into three clusters.
Iteration 1
First we list all points in the first column of the table below. The initial
cluster centers (centroids) are (2, 10), (8, 4) and (1, 2), chosen
randomly.
Next, we calculate the distance from each point to each of the
three centroids, using the distance function:
dis(point i, mean j) = |x2 – x1| + |y2 – y1|
Each point is assigned to the cluster with the nearest centroid.

Data Points   Cluster 1         Cluster 2         Cluster 3         Cluster
              centroid (2, 10)  centroid (8, 4)   centroid (1, 2)
A1 (2, 10)    0                 12                9                 1
A2 (2, 5)     5                 7                 4                 3
A3 (8, 4)     12                0                 9                 2
A4 (5, 8)     5                 7                 10                1
A5 (7, 5)     10                2                 9                 2
A6 (6, 4)     10                2                 7                 2
A7 (1, 2)     9                 9                 0                 3
A8 (4, 9)     3                 9                 10                1
Second epoch
• Using the new centroids, compute the cluster members again.

Data Points   Cluster 1           Cluster 2           Cluster 3           Cluster
              centroid (3.67, 9)  centroid (7, 4.33)  centroid (1.5, 3.5)
A1 (2, 10)    2.67                10.67               7                   1
A2 (2, 5)     5.67                5.67                2                   3
A3 (8, 4)     9.33                1.33                7                   2
A4 (5, 8)     2.33                5.67                8                   1
A5 (7, 5)     7.33                0.67                7                   2
A6 (6, 4)     7.33                1.33                5                   2
A7 (1, 2)     9.67                8.33                2                   3
A8 (4, 9)     0.33                7.67                8                   1

• After the 2nd epoch the results would be:
cluster 1: {A1,A4,A8} with new centroid=(3.67,9);
cluster 2: {A3,A5,A6} with new centroid = (7,4.33);
cluster 3: {A2,A7} with new centroid=(1.5,3.5)
Final results
• Finally, in the 2nd epoch there is no change in the members of the
clusters or in the centroids, so the algorithm stops.
• The result of clustering is shown in the following figure
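The worked example can be reproduced with a short sketch. It follows the steps above with the Manhattan distance from the problem statement and starts from the centroids (2, 10), (8, 4) and (1, 2); the function names are illustrative, and keeping the old centroid for an empty cluster is an added assumption that this data never triggers.

```python
def manhattan(p, q):
    """Distance function from the example: |x2 - x1| + |y2 - y1|."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def k_means(points, centroids, max_iter=100):
    """Assign points to the nearest centroid, recompute means, repeat until stable."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for name, p in points.items():
            distances = [manhattan(p, c) for c in centroids]
            clusters[distances.index(min(distances))].append(name)
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                xs = [points[n][0] for n in members]
                ys = [points[n][1] for n in members]
                new_centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                new_centroids.append(old)  # assumption: keep centroid of an empty cluster
        if new_centroids == centroids:
            break  # centroids did not change, so the algorithm stops
        centroids = new_centroids
    return clusters, centroids


points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}
clusters, centroids = k_means(points, [(2, 10), (8, 4), (1, 2)])

print(clusters)  # [['A1', 'A4', 'A8'], ['A3', 'A5', 'A6'], ['A2', 'A7']]
print([(round(x, 2), round(y, 2)) for x, y in centroids])
# [(3.67, 9.0), (7.0, 4.33), (1.5, 3.5)]
```

The second pass produces the same assignments as the first, so the loop ends after two epochs, matching the tables above.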
Density based clustering
• Density-based clustering attempts to detect areas
• where data points are concentrated, separated from one another
by areas that are sparse
• A well-known density-based clustering algorithm is DBSCAN:
• it groups points based on the distance between nearest points
• Two parameters are required for the DBSCAN algorithm
• eps: defines the neighborhood around a data point, i.e. two
data points are considered neighbors if
Dis(x, y) <= eps
• If the eps value is chosen too small, a large part of the data will be considered outliers.
• If it is chosen very large, the clusters will merge and the majority of the data points will be in the
same cluster.
• MinPts: minimum number of neighbors (data points) within the
eps radius. The larger the dataset, the larger the value of MinPts
that should be chosen.
• As a general rule, the minimum MinPts can be derived from the number of dimensions D in the
dataset as MinPts >= D+1.
DBSCAN algorithm
• In this algorithm, there are three types of data points to
be identified.
• Core Point: A point is a core point if it has at least
MinPts data points within eps.
• Border Point: A data point which has fewer than
MinPts points within eps but is in the neighborhood of
a core point.
• Noise or outlier: A point which is neither a core point
nor a border point.

In DBSCAN each point is checked for eps and MinPts


parameters and the decision about the clustering is
DBSCAN Algorithm in action
• Given a data with two attributes
(x, y), apply DBSCAN to identify
clusters
• Assume, eps = 0.6 and MinPts =
4.
• Let’s consider the first data point
in the dataset (1,2) and calculate
its Euclidean distance from every
other data points in the data set.
DBSCAN Algorithm in action

• The point (1, 2) has only two other points in its neighborhood,
(1, 2.5) and (1.2, 2.5), for the assumed value of eps. As this is
fewer than MinPts, we can’t declare it a core point.

• Let’s repeat the above process for every point in the dataset and find
out the neighborhood of each.
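A minimal DBSCAN sketch in plain Python, using eps = 0.6 and MinPts = 4. The dataset is hypothetical: only (1, 2), (1, 2.5) and (1.2, 2.5) come from the example, and the other points are invented for illustration. The sketch assumes the common convention that a point's neighborhood includes the point itself and that a core point needs at least MinPts points in it.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later turn out to be a border point
            continue
        cluster += 1           # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels


points = [
    (1, 2), (1, 2.5), (1.2, 2.5), (1.4, 2.8),  # first dense region
    (4, 4), (4.3, 4), (4, 4.4), (4.3, 4.3),    # second dense region (hypothetical)
    (8, 1),                                    # isolated point (hypothetical)
]
labels = dbscan(points, eps=0.6, min_pts=4)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Here (1, 2) is not a core point, yet it ends up in cluster 0 as a border point because it lies within eps of the core point (1, 2.5). The isolated point (8, 1) is labelled as noise.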
[Figures: DBSCAN algorithm in action, DBSCAN algorithm result, DBSCAN algorithm final result]
Project (Demo: June 2)
• Requirement:
–Select a problem that requires use of images or unstructured text.
–Prepare a dataset to conduct experiment and construct the
intended model.
–Use DL algorithms and pretrained models to construct or update a
model using Python

Project Report: Write a report with the following sections:


•Abstract (problem, approach, result, concluding remarks) -- ½ page
•Introduce the problem attempted by the group with objective of the
project -- 2 pages
•Review related works from at least 3 articles -- 4 pages
•Description of Data preparation -- 3 pages
•Description of ML algorithms used for the experiment -- 3 pages
•Discussion of experimental result, with findings --- 3 pages
•Concluding remarks (strength & weakness/limitation), with one
major recommendation --- 1 page
•Reference (use referencing style of your choice, but be consistent)
Important Dates:

•Concept presentation

•Project Presentation: June 2

•Final exam: June 10

THANK YOU
(PHDS2023@[Link])
Python
Python is a high-level, general-purpose programming
language. Its design philosophy emphasizes code
readability with the use of significant indentation
What software do we need to use
Python for different tasks?

• Anaconda
• Jupyter Notebook
• Different packages (Libraries), such as
• scikit-learn (for ML algorithms),
• opencv (for DIP),
• numpy (high dimensional data manipulation),
• pandas (for data processing),
• HDF5 (store and manipulate data)
• matplotlib (data visualization), etc.
Installing Anaconda on
Windows
• Anaconda is a package manager, an
environment manager, and a Python
distribution that contains a collection of
many open source packages.
This is advantageous because, when you are working
on a project, you may need many different
packages (scikit-learn, numpy, scipy, pandas
to name a few), and an installation of
Anaconda comes with these preinstalled.
Installing Anaconda on Windows
• If you need additional packages after installing
Anaconda, you can use Anaconda's package
manager, conda, or pip to install those packages
(pip install PACKAGE).
pip install scikit-learn
pip install pandas or pip3 install pandas
From Jupyter notebook: !pip install pandas
This is highly advantageous as you don't have to
manage dependencies between multiple packages
yourself. Conda even makes it easy to switch
between Python 2 and 3.
• In fact, an installation of Anaconda is also the
recommended way to install Jupyter Notebooks.
Python modules for machine
learning, data mining and data
analytics
• Scikit-learn is probably the most useful library for machine
learning in Python. The sklearn library contains a lot of
efficient tools for machine learning and statistical modeling
including classification, regression, clustering and
dimensionality reduction.
• scikit-learn is a Python module for machine learning built on
top of SciPy
• scikit-learn requires:
• Python (>= 3.6)
• NumPy (>= 1.13.3)
• SciPy (>= 0.19.1)
• joblib (>= 0.11)
• threadpoolctl (>= 2.0.0)
Use sklearn for Classification

• Scikit-learn: machine learning algorithms


• TensorFlow: deep learning with neural networks
• Keras: high level neural networks API
Divide the dataset into training & test sets using a percentage split
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import metrics

# load the dataset (replace the file name with the path to your CSV file)
music_data = pd.read_csv('[Link]')
# separate the input features (X) from the class label (y)
X = music_data.drop(columns=['genre'])
y = music_data['genre']
# split the data for training & testing using a 70:30 percentage split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# train the classifier using the 70% training set
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# test using the remaining 30% (the test set)
predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)
print(score)

# report precision, recall and F1-score for each class
print(metrics.classification_report(y_test, predictions))

# create the confusion matrix
print(metrics.confusion_matrix(y_test, predictions))
Pandas for data processing

• Pandas is a popular open-source data
manipulation and analysis library for Python. It
provides easy-to-use functions needed to work
with structured data seamlessly.

• Data processing means making the data ready for
data analysis using ML, DM and DA algorithms.
Pandas for data processing

• Pandas introduces two key data structures:
Series and DataFrame.
• A Series is a one-dimensional array-like object that
can hold any data type,
• A DataFrame is a two-dimensional table with
labelled axes (rows and columns).

• These structures allow users to manipulate,
clean, and analyse datasets efficiently.
Import the Necessary Libraries
• To save time & typing, import Pandas as pd
• #!pip install pandas
• import pandas as pd
• To load a dataset into a Pandas DataFrame, df
• #df = pd.read_csv('your_dataset.csv')
• df = pd.read_csv('C:/Users/milli/Downloads/[Link]')
• Exploratory Data Analysis (EDA): to gain insights into the dataset
• df.head() returns the first 5 rows of the dataset. You can specify
the number of rows to be displayed in the parentheses.
print(df.head())
• df.describe() gives summary statistics such as percentiles, mean and
standard deviation of the numerical columns.
print(df.describe())
• df.info() prints the number of columns, column labels, column data
types, memory usage, range index, and the number of non-null values in
each column.
df.info()
Handling Missing Values
• In Pandas, missing values are represented by None or NaN, which can occur
due to uncollected data or incomplete entries
• Missing values can adversely impact your analysis or model.
• Pandas provides methods to handle this problem.
• One way to do this is by removing the missing values altogether using the
Sample Code below:
#Check for missing values
print(df.isnull().sum())
#Drop rows with missing values and place the result in a new variable "df_clean"
df_clean = df.dropna()

# Drop rows where all values are missing
# (inplace=True returns None, so it is not combined with an assignment)
df_clean = df.dropna(how='all')

# Drop columns with at least one missing value
df_clean = df.dropna(axis=1)
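The difference between the three dropna variants is easiest to see on a tiny table. This is a sketch on made-up data; the column names "a" and "b" are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Illustrative data: row 1 is entirely missing, row 2 is missing only "b"
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, np.nan, np.nan],
})

any_missing = df.dropna()             # drops rows 1 and 2 (any NaN in the row)
all_missing = df.dropna(how="all")    # drops only row 1 (every value is NaN)
no_gap_cols = df.dropna(axis=1)       # drops every column that contains a NaN

# With inplace=True the method modifies the frame and returns None
returned = df.copy().dropna(how="all", inplace=True)

print(len(any_missing), len(all_missing), list(no_gap_cols.columns), returned)
```

Because both columns contain a gap here, dropping by column leaves no columns at all, which shows how aggressive axis=1 can be on sparse data.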
Handling missing values
#Fill with Previous Value (Forward Fill)
#df_fill = df.fillna(method='pad')  # older form, deprecated in recent pandas versions
df_fill = df.ffill()

#Fill with Next Value (Backward Fill)
#df_fill = df.fillna(method='bfill')  # older form, deprecated in recent pandas versions
df_fill = df.bfill()

# for replacing missing values with 0
print(df.fillna(0))
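A short sketch of how forward and backward fill behave at the edges of a column, using made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, np.nan, 5.0])  # illustrative values

forward = s.ffill()     # carry the last seen value forward
backward = s.bfill()    # pull the next valid value backward

print(forward.tolist())   # the first entry stays NaN: nothing precedes it
print(backward.tolist())
```

A leading gap survives a forward fill and a trailing gap survives a backward fill, so the two are sometimes chained when both ends must be filled.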
Handling Missing Values
• For numerical data, you can simply compute the mean and impute it into the
rows that have missing values using the Sample Code below:
• #Replace missing values with the mean of each numeric column
df.fillna(df.mean(numeric_only=True), inplace=True)
df_fill = df.fillna(value=df.mean(numeric_only=True))

• #If you want to replace missing values in a specific column: Replace
'column_name' with the actual column name
df_fill = df['column_name'].fillna(df['column_name'].mean())
df_fill = df['age'].fillna(df['age'].mean())
# gender is categorical, so fill it with the most frequent value (mode)
df_fill = df['gender'].fillna(df['gender'].mode()[0])
• #Now the numeric columns of df contain no missing values; NaNs have been
replaced with the column mean
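The choice of fill value depends on the column type. The sketch below, on made-up data, fills a numeric column with its mean and a categorical column with its most frequent value; using the mode for the categorical column is an assumption of this example rather than something prescribed by the slides.

```python
import numpy as np
import pandas as pd

# Illustrative data with one gap in each column
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "gender": ["F", "M", None, "F"],
})

filled = df.copy()
# numeric column: the mean of the observed values (20, 40, 30)
filled["age"] = filled["age"].fillna(filled["age"].mean())
# categorical column: the most frequent observed value
filled["gender"] = filled["gender"].fillna(filled["gender"].mode()[0])

print(filled)
```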
Removing Duplicate Records
• Duplicate records can distort your analysis by giving repeated observations
extra weight, so that the results do not accurately show trends and underlying
patterns.
• Pandas makes it easy to identify duplicate rows and to store the
de-duplicated data in a new variable using the Sample code below:

#Identify duplicates
print(df.duplicated().sum())

#Remove duplicates
df_no_duplicates = df.drop_duplicates()
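A minimal sketch of what duplicated() and drop_duplicates() return, on made-up rows. The subset and keep arguments shown at the end are optional pandas parameters that the slide does not use.

```python
import pandas as pd

# Illustrative data: the second row repeats the first
df = pd.DataFrame({"age": [21, 21, 30], "genre": ["Jazz", "Jazz", "Rock"]})

mask = df.duplicated()             # True for each repeat of an earlier row
deduped = df.drop_duplicates()     # keeps the first occurrence by default

# Optional: judge duplicates on selected columns and keep the last occurrence
by_age = df.drop_duplicates(subset=["age"], keep="last")

print(mask.tolist())
print(deduped.index.tolist(), by_age.index.tolist())
```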
Class label encoding
• Convert each unique value category into a numeric value

from sklearn import preprocessing


# Load the dataset as pandas dataframe
#df = pd.read_csv('C:/Users/milli/Downloads/[Link]')
df = pd.read_csv('C:/Users/milli/Downloads/[Link]')

# Display the class distribution
#print(df['genre'].value_counts())
label_encoder = preprocessing.LabelEncoder()
df['genre'] = label_encoder.fit_transform(df['genre'])
df['genre'].unique()
print(df)#['genre'])
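To see which number each category receives, and how to get the original labels back, here is a small sketch on made-up genre names:

```python
from sklearn.preprocessing import LabelEncoder

genres = ["HipHop", "Jazz", "Classical", "Jazz"]  # illustrative labels

encoder = LabelEncoder()
codes = encoder.fit_transform(genres)

print(list(encoder.classes_))   # classes are stored in sorted order
print(codes.tolist())           # position of each label in classes_
decoded = encoder.inverse_transform(codes).tolist()
print(decoded)                  # back to the original strings
```

Keeping the fitted encoder around is useful later, because model predictions come back as numbers and inverse_transform maps them to readable class names.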
Sampling

• Sampling is used to balance imbalanced data, or to increase or
decrease the size of a dataset
• Using undersampling (or downsampling) or
• oversampling (or upsampling)

• !pip install imbalanced-learn


Sampling methods
• Oversampling methods
• Oversampling increases the number of minority-class samples. Common
variations of the technique include:
• Synthetic minority oversampling technique (SMOTE), which creates new
synthetic minority-class examples.
• Random oversampling (ROS), which randomly selects examples from the minority
class, with replacement, and adds them to the training dataset.
• Undersampling methods
• Random undersampling (RUS), which randomly selects examples from the
majority class and deletes them from the training dataset,
• Repetitive undersampling,
• Tomek’s link undersampling.
• Hybrid sampling combines oversampling and undersampling, thus allowing
researchers to increase the number of minority-class samples and decrease the
number of majority-class samples at the same time.
• Some popular hybrid resampling methods are the combination of SMOTE
and edited nearest neighbour undersampling (ENN) and the combination of SMOTE
and Tomek’s link undersampling
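Random oversampling and random undersampling can be sketched without imbalanced-learn, using sklearn.utils.resample on a made-up table. The column names "x" and "Class" and the 8:2 class split are assumptions for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Illustrative imbalanced data: 8 majority rows (0) and 2 minority rows (1)
df = pd.DataFrame({"x": range(10), "Class": [0] * 8 + [1] * 2})
majority = df[df["Class"] == 0]
minority = df[df["Class"] == 1]

# Random oversampling: draw minority rows WITH replacement up to the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
oversampled = pd.concat([majority, minority_up])

# Random undersampling: draw majority rows WITHOUT replacement down to the minority size
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
undersampled = pd.concat([majority_down, minority])

print(oversampled["Class"].value_counts())
print(undersampled["Class"].value_counts())
```

Oversampling keeps all the original information but repeats rows, while undersampling discards majority rows; both end with equal class counts.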
Implementing Sampling in Python
• Download and Load the Dataset
• You can download the dataset from Kaggle.
• It consists of numerical features representing credit card
transactions.
• This dataset is highly imbalanced, with the majority of
transactions being legitimate and only a small fraction being
fraudulent.
• For this binary classification problem we have:
• The majority class (label 0) represents legitimate transactions.
• The minority class (label 1) represents fraudulent transactions
Read & split data set
import pandas as pd
# Load the dataset as pandas dataframe
df = pd.read_csv('C:/Users/milli/Downloads/[Link]')
print(df['Class'].value_counts()) # Display the class distribution
• Let's split the data into features (X) and target (y), and after
that, into training set and test set.
from sklearn.model_selection import train_test_split
# Separate features (X) and target (y)
X = df.drop('Class', axis=1)
y = df['Class']

# Split the dataset into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split( X, y,
test_size=0.3, random_state=42, stratify=y )
Undersampling
• To apply the undersampling technique, we will use the
RandomUnderSampler algorithm, which randomly removes instances.
• It is available in the imbalanced-learn library.
• To apply under sampling
• create a RandomUnderSampler object with the random_state set to 42.
• This ensures that the random selection is reproducible when running the code multiple
times. We can also set the sampling strategy as ‘majority’, where only the majority class
will have instances removed.
• Finally, we will apply the RandomUnderSampling technique to the input data. The
fit_resample function fits the RandomUnderSampler object to the data and returns the
balanced data
# Import the necessary libraries
from imblearn.under_sampling import RandomUnderSampler

# Create a RandomUnderSampler object


rus = RandomUnderSampler(random_state=42, sampling_strategy =
'majority')

# Balancing the data


X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
print(y_resampled.value_counts())
Oversampling
• For oversampling, we will use SMOTE, a widely used method in classification
problems where the minority class is significantly smaller than the majority
class.
• The technique works by selecting an example from the minority class and finding its k
nearest neighbours. It then creates a new synthetic example by randomly interpolating
between the selected example and one of those neighbours, and adds it to the dataset.
• To use SMOTE, we will import the necessary libraries.
• Similarly, we will create an instance of the SMOTE object, which will be applied to the
training data to perform the oversampling & balance the data.
• With imbalanced-learn, we have the flexibility to adjust the number of minority class
samples that we want to create, by modifying the sampling_strategy parameter, which
specifies the desired ratio (from 0 to 1) of the minority class relative to the majority class.
• As a result, we will have balanced data with new instances added.
# Import the necessary libraries
from imblearn.over_sampling import SMOTE
# Creating an instance of SMOTE
smote = SMOTE(sampling_strategy = 1.0, random_state=42)
# Balancing the data
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(y_resampled.value_counts())
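The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration on made-up points, not the imbalanced-learn implementation; the helper name synthesize and the choice k = 2 are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(42)
# Illustrative minority-class points with two features
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]])
k = 2

def synthesize(X, k, rng):
    """Create one synthetic point between a random example and one of its k neighbours."""
    i = rng.integers(len(X))                      # pick a minority example
    dists = np.linalg.norm(X - X[i], axis=1)      # distance to every other example
    neighbours = np.argsort(dists)[1:k + 1]       # skip position 0, the point itself
    j = rng.choice(neighbours)                    # pick one of the k nearest neighbours
    gap = rng.random()                            # position along the segment, in [0, 1)
    return X[i] + gap * (X[j] - X[i])

new_points = np.array([synthesize(minority, k, rng) for _ in range(5)])
print(new_points)
```

Every synthetic point lies on a line segment between two real minority examples, which is why SMOTE adds variety without leaving the region the minority class already occupies.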
Train a Classifier on the Imbalanced Dataset
#from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import classification_report
# Train a decision tree classifier on the imbalanced dataset
clf = DecisionTreeClassifier()
#RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
# Evaluate the model: reports accuracy, precision and recall
print("Classification Report (Before SMOTE):")
print(classification_report(y_test, y_pred))
# to create confusion matrix
print(metrics.confusion_matrix(y_test, y_pred))
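On imbalanced data, accuracy alone can look good while the minority class is poorly detected. The sketch below derives precision and recall from the confusion matrix by hand, using made-up labels where 1 stands for the minority (fraud) class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]   # illustrative true labels
y_hat  = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]   # illustrative predictions

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)                  # of the predicted frauds, how many are real
recall = tp / (tp + fn)                     # of the real frauds, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(precision, recall, accuracy)
```

Here accuracy is 0.7 although only half of the minority class is found, which is the pattern to look for when comparing the reports before and after resampling.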
Train the Classifier on the SMOTE-Augmented Dataset
• We’ll train a decision tree classifier using the
SMOTE-balanced training data so that we can
analyse the difference.
# Train the classifier on the SMOTE-balanced dataset
#clf_smote = RandomForestClassifier(random_state=42)
clf_smote = DecisionTreeClassifier()
clf_smote.fit(X_resampled, y_resampled)

# Predict on the test set


y_pred_smote = clf_smote.predict(X_test)

# Evaluate the model


print("Classification Report (After SMOTE):")
print( classification_report(y_test, y_pred_smote))
Construct and save a model

• Rather than retraining the model every time we want to test it, we can
follow two steps:
• First, train and save the model using the 'joblib' package
• Then load the saved model and use it for prediction
Training step
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# from sklearn.externals import joblib  # only in old scikit-learn versions
import joblib
music_data = pd.read_csv('[Link]')
# separate the features (X) from the target (y)
X = music_data.drop(columns=['genre'])
y = music_data['genre']
# split the data for training & testing using an 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
model = DecisionTreeClassifier() # create model
# train the classifier using the 80% training set
model.fit(X_train,y_train)
# save the model using joblib dump
joblib.dump(model,'[Link]')
Prediction and testing step
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# from sklearn.externals import joblib  # only in old scikit-learn versions
import joblib

# load the saved model using joblib load for testing
model = joblib.load('[Link]')

# test using two samples given in array form [age, gender]; 1 represents M and 0 represents F
predictions = model.predict([ [21, 1],[22, 0] ])
predictions
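The two steps can be checked end to end in one self-contained sketch. The training rows, the genre names and the file name demo_model.joblib are made up for illustration; a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Illustrative rows in the form [age, gender] with made-up genre labels
X = [[20, 1], [23, 1], [21, 0], [25, 0]]
y = ["HipHop", "HipHop", "Dance", "Dance"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.joblib")   # hypothetical file name
    joblib.dump(model, path)                        # step 1: save the trained model
    restored = joblib.load(path)                    # step 2: load it for prediction

# The restored model should predict exactly what the original predicts
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

Note that joblib.load returns the model object, so its result has to be assigned to a variable before predict can be called.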
Machine learning and pretrained models
• A pre-trained model is a machine learning (ML) model that has already been
trained on a large dataset
• It can be fine-tuned for a specific task.
• Pre-trained models are often used as a starting point for developing
ML models
• Sample pre-trained models for image classification: VGG-16,
Inception, ResNet50, EfficientNet, DenseNet, MobileNet, …
• Sample pre-trained models for text: BERT, T5, RoBERTa, GPT, ELMo, …

Image similarity
#!pip install tensorflow
#!pip install opencv-python

import tensorflow as tf
import numpy as np
import cv2
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing import image

#file path
imageName = 'C:/Users/Milli/Desktop/python_code//[Link]'

#load image
img = image.load_img(imageName,target_size = (224,224))
plt.imshow(img)
#img = cv2.imread(imageName)
#img = cv2.resize(img,(224,224))
#img = cv2.cvtColor(img,cv2.COLOR_BGR2RGB)

#Load DL model; if the weights aren't cached locally, they will be downloaded automatically
# assumption: MobileNetV2 is used as the model; other tf.keras.applications models load the same way
model = tf.keras.applications.mobilenet_v2.MobileNetV2()
Image similarity
# before prediction, the raw image needs preprocessing
from tensorflow.keras.preprocessing import image

#convert to array
resized_img = image.img_to_array(img)

#convert 3D dims to 4D (add a batch dimension) to meet DL requirement
final_img = np.expand_dims(resized_img,axis=0)
# assumption: MobileNetV2 preprocessing; use the preprocess_input that matches the loaded model
final_img = tf.keras.applications.mobilenet_v2.preprocess_input(final_img)
final_img.shape

predictions = model.predict(final_img)
#print(predictions)

from tensorflow.keras.applications import imagenet_utils
results = imagenet_utils.decode_predictions(predictions)
print(results)
Plant disease detection

By training a transfer learning model
Plant disease detection
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.utils import image_dataset_from_directory
import os
import zipfile
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

#with zipfile.ZipFile('[Link]', 'r') as zip_ref:
#    zip_ref.extractall('/content/cats_dogs')
base_dir = 'Leaf-Image'
train_dir = os.path.join(base_dir, 'training_set')
validation_dir = os.path.join(base_dir, 'test_set')
Plant disease detection
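# Generates a tf.data.Dataset from image files in a directory.
# Supported image formats: .jpeg, .jpg, .png, .bmp, .gif.
# The training and validation sets are read from their own sub-directories.
train_set = image_dataset_from_directory(train_dir,
                                         shuffle=True,
                                         batch_size=32,
                                         image_size=(150, 150))
val_dataset = image_dataset_from_directory(validation_dir,
                                           shuffle=True,
                                           batch_size=32,
                                           image_size=(150, 150))

data_augmentation = keras.Sequential(
    [   # older TensorFlow versions:
        # layers.experimental.preprocessing.RandomFlip("horizontal"),
        # layers.experimental.preprocessing.RandomRotation(0.1),
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
    ]
)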
Plant disease detection
import numpy as np
import matplotlib.pyplot as plt
for images, labels in train_set.take(1):
    plt.figure(figsize=(12, 12))
    first_image = images[0]
    for i in range(12):
        # subplot(3, 4, i + 1) divides the figure into 3 rows and 4 columns and draws at position i + 1
        ax = plt.subplot(3, 4, i + 1)
        # training=True makes the random layers actually transform the image
        augmented_image = data_augmentation(
            tf.expand_dims(first_image, 0), training=True
        )
        plt.imshow(augmented_image[0].numpy().astype("int32"))
        plt.axis("off")
Plant disease detection
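# assumption: MobileNetV2 is used as the pre-trained base; other tf.keras.applications models are used the same way
base_model = tf.keras.applications.MobileNetV2(
    weights='imagenet',
    input_shape=(150, 150, 3),
    include_top=False) #remove fully connected layer of CNN
base_model.trainable = False
inputs = keras.Input(shape=(150, 150, 3))

x = data_augmentation(inputs)
x = tf.keras.applications.mobilenet_v2.preprocess_input(x)
x = base_model(x, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)
x = keras.layers.Dropout(0.2)(x)
outputs = keras.layers.Dense(1)(x)
model = keras.Model(inputs, outputs)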
Plant disease detection
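#model.compile(optimizer='adam', loss=keras.losses.BinaryCrossentropy(from_logits=True), metrics=keras.metrics.BinaryAccuracy())
# Dense(1) has no activation, so the model outputs logits: use from_logits=True
model.compile(optimizer='adam', loss=keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=[keras.metrics.BinaryAccuracy()]) #test sgd in place of adam
model.fit(train_set, epochs=5, validation_data=val_dataset)
base_model.trainable = True

#model.compile(optimizer=keras.optimizers.Adam(1e-5),
#              loss=keras.losses.BinaryCrossentropy(from_logits=True),
#              metrics=keras.metrics.BinaryAccuracy())

model.compile(optimizer='adam',
              loss=keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.summary()

from tensorflow.keras.callbacks import EarlyStopping, TensorBoard

#rm -rf logs
%load_ext tensorboard
log_folder = 'logs'
callbacks = [
    EarlyStopping(patience = 5),
    TensorBoard(log_dir=log_folder)
]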
Plant disease detection
history = model.fit(train_set, epochs=5, validation_data=val_dataset, callbacks=callbacks)

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

# use the number of epochs actually run, since EarlyStopping may stop before 5
epochs_range = range(len(acc))

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()
Project
• This is a project that helps you to exercise Python for data analysis.
- Present in class the result obtained in your data analysis
- prepare a report (DOC & PDF) and upload it along with PPT, python code, data set, and
reviewed articles

•Requirement:
–Choose a text or image dataset for the experiment using Python
–Use (i) 2 ML and 1 DL algorithms, (ii) DL algorithms only, or (iii) pretrained models for the
experiment using Python
–Compare the performance of the selected algorithms

•Project Report
• Write a report with the following sections:
• Abstract -- ½ page
• Introduce problem and objective of the project -- 2 pages
• Description of algorithms used for the experiment -- 3 pages
• Discussion of experimental result --- 3 pages
• Concluding remarks, with major recommendation --- 1 page
• Reference (use IEEE referencing style)
Clustering
• Machine learning algorithms can be broadly classified into two categories:
supervised learning (e.g. classification) and unsupervised learning (e.g. clustering).
• The difference between them is the presence of a target
variable. In clustering, there is no target variable (class). The dataset only has
input or independent variables which describe the data.
• K-Means clustering is one of the most popular clustering algorithms. It is used when
we have unlabelled data, which is data without defined categories or groups.
• The algorithm follows a simple procedure to partition a given data set into a chosen
number of clusters, K. K-Means works iteratively to assign each data point to one
of K groups based on the features that are provided. Data points are clustered based on
feature similarity.
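The iterative assign-and-update loop described above can be written out in NumPy. This is a teaching sketch under simple assumptions (random initial centres, Euclidean distance, made-up points), not the scikit-learn implementation; the function name kmeans and its parameters are chosen for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain K-Means: alternate between assigning points and moving centres."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centres
    for _ in range(n_iter):
        # assignment step: index of the nearest centre for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centre to the mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):   # stop once the centres no longer move
            break
        centers = new_centers
    return labels, centers

# Two well-separated illustrative groups
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans(X, k=2)
print(labels)
print(centers)
```

The cluster numbers themselves carry no meaning: which group is called 0 and which is called 1 depends only on the initial centres.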
Clustering
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import numpy as np
from sklearn.metrics import silhouette_score

#df = pd.read_csv('[Link]') #, usecols = ['longitude', 'latitude', 'population', 'households', 'median_house_value'])
df = pd.read_csv('[Link]', usecols = ['longitude', 'latitude', 'ocean_proximity'])

df.head()
df.shape

#Convert categorical value to numeric (categories not listed become NaN and are dropped below)
df['ocean_proximity'] = df['ocean_proximity'].map({'NEAR BAY': 0, 'INLAND': 1})

df.isnull().sum() #check for missing values
dfnan = df.dropna()#df
Clustering into 2 clusters
X = dfnan.drop(columns=['ocean_proximity'])
y = dfnan['ocean_proximity']
y

X_norm = preprocessing.normalize(X)
X_norm

kmeans = KMeans(n_clusters = 2) #, random_state = 0, n_init='auto')
kmeans.fit(X_norm) #df_scale[['latitude', 'longitude']])

kmeans.cluster_centers_
#centers = [Link](p.cluster_centers_)
#print(centers)

silhouette_score(X_norm, kmeans.labels_, metric='euclidean')


Cluster evaluation
labels = kmeans.labels_
#labels

# check how many of the samples were correctly labeled
# (cluster ids are arbitrary, so a low count may only mean that ids 0 and 1 are swapped)
correct_labels = sum(y == labels)
print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))

#Check for silhouette score
silhouette_score(X_norm, kmeans.labels_, metric='euclidean')
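Because K-Means numbers its clusters arbitrarily, comparing cluster ids directly with class labels can undercount the agreement. For two clusters, a simple check is to count matches under both possible numberings and keep the larger one. The arrays below are made up for illustration.

```python
import numpy as np

y = np.array([0, 0, 0, 1, 1, 1])        # illustrative true classes
labels = np.array([1, 1, 1, 0, 0, 1])   # cluster ids from a hypothetical run

direct = int((y == labels).sum())          # ids taken literally
flipped = int((y == 1 - labels).sum())     # same clustering with ids 0 and 1 swapped
best = max(direct, flipped)

print("direct: %d, flipped: %d, best: %d out of %d" % (direct, flipped, best, y.size))
```

In this example the clustering separates the classes almost perfectly, yet the direct comparison reports only one match because the ids happen to be swapped.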
Select optimal k cluster by running 2-9 for k

K = range(2, 10)   # 2 to 9 inclusive
#fits = []
score = []

for k in K:
    # train the model for current value of k on training data
    model = KMeans(n_clusters = k)
    model.fit_predict(X_norm)
    print('running', k)

    # Append the silhouette score to scores
    score.append(silhouette_score(X_norm, model.labels_, metric='euclidean'))
print(score)
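Once the scores are collected, the k with the highest silhouette score is the usual choice. The sketch below shows the selection on synthetic data with three well-separated groups generated by make_blobs; the data, the range of k and the fixed random_state are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=300,
                  centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    scores[k] = silhouette_score(X, model.labels_)

# The silhouette score ranges from -1 to 1; higher means better-separated clusters
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

Storing the scores in a dictionary keyed by k avoids having to translate list positions back into cluster counts.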
Plot cluster result

import seaborn as sns

sns.scatterplot(data = X, x = 'longitude', y = 'latitude', hue = kmeans.labels_)
