
Data Science: NoSQL & Machine Learning Basics

The document provides an overview of data science concepts, focusing on NoSQL databases, the ACID principles of relational databases, and types of machine learning, including supervised and unsupervised learning. It discusses the challenges of handling large datasets and offers programming tips for efficient data processing. Additionally, it includes case studies and applications for various machine learning techniques and database types.

Uploaded by

Mamatha

Introduction to Data Science Unit-1

UNIT – III
TOPICS:
NoSQL movement for handling Big Data: Distributing data storage and processing with the Hadoop
framework, case study on risk assessment for loan sanctioning, ACID principle of relational
databases, CAP theorem, BASE principle of NoSQL databases, types of NoSQL databases, case
study on disease diagnosis and profiling

ACID: the core principle of relational databases

Atomicity: The “all or nothing” principle. If a record is put into a database, it’s put in
completely or not at all. If, for instance, a power failure occurs in the middle of a database write
action, you won’t end up with half a record; it won’t be there at all.
Consistency: This important principle maintains the integrity of the data. No entry that makes it
into the database will ever conflict with predefined rules, such as lacking a required field or
a field being numeric instead of text.
Isolation: When something is changed in the database, nothing else can happen to that same
data at exactly the same moment. Instead, the action is carried out serially with other changes.
Isolation is a scale going from low isolation to high isolation. On this scale, traditional databases
are on the “high isolation” end. An example of low isolation would be Google Docs: multiple
people can write to a document at the exact same time and see each other’s changes happening
instantly. A traditional Word document has high isolation; it’s locked for editing by the first user
to open it. The second person opening the document can view its last saved version but is unable
to see unsaved changes or edit the document without first saving it as a copy. So once someone
has it opened, the most up-to-date version is completely isolated from anyone but the editor who
locked the document.
Durability: If data has entered the database, it should survive permanently. Physical damage to
the hard disks will destroy records, but power outages and software crashes should not.

ACID applies to all relational databases and certain NoSQL databases, such as the graph database
Neo4j.
For most other NoSQL databases another principle applies: BASE.
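
Atomicity and consistency can be made concrete with a small sketch. The example below uses Python's built-in sqlite3 module; the accounts table, the names, and the transfer function are hypothetical and chosen only for illustration. A CHECK constraint plays the role of the "predefined rule", and a transfer that would break it is rolled back as a whole.

```python
import sqlite3

def make_accounts_db():
    """Create an in-memory database with a rule: balances may not go negative."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint was violated; the credit to dst is undone too.
        return False

def balances(conn):
    """Return the current balances as a dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

if __name__ == "__main__":
    db = make_accounts_db()
    print(transfer(db, "alice", "bob", 30), balances(db))
    print(transfer(db, "alice", "bob", 500), balances(db))
```

In the second transfer the credit to the receiving account is written first and then undone, so no half-finished transfer is ever visible after the transaction ends.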

Types of NoSQL databases


The four biggest types of NoSQL databases are:
Key-value stores: Essentially a collection of key-value pairs stored in a database. These databases
can be immensely big and are hugely versatile, but the data complexity is low. A well-known
example is Redis.
Wide-column databases: These databases are a bit more complex than key-value stores in that
they use columns, but in a more efficient way than a regular RDBMS would. The columns are
essentially decoupled, allowing you to retrieve data in a single column quickly. A well-known
example is Cassandra.
Document stores: These databases are a little more complex and store data as documents.
Currently the most popular one is MongoDB, but in our case study we use Elasticsearch, which
is both a document store and a search engine.
Graph databases: These databases can store the most complex data structures, as they treat the
entities and relations between entities with equal care. This complexity comes at a cost in lookup
speed. A popular one is Neo4j, but GraphX (the graph-processing component of Apache Spark,
rather than a database in its own right) is gaining ground.
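
To make the difference between the four data models more tangible, the sketch below represents the same two people in each style using plain Python structures. These are analogies only, not the API of Redis, Cassandra, MongoDB, or Neo4j; all names and values are hypothetical.

```python
# The same two people modeled four ways (hypothetical data, plain Python only).

# 1. Key-value store: one opaque value per key.
key_value = {"person:1": "Ada", "person:2": "Grace"}

# 2. Wide-column store: each column is kept separately, keyed by row id.
wide_column = {
    "name": {1: "Ada", 2: "Grace"},
    "city": {1: "London", 2: "New York"},
}

# 3. Document store: one self-contained nested document per entity.
documents = [
    {"_id": 1, "name": "Ada", "address": {"city": "London"}},
    {"_id": 2, "name": "Grace", "address": {"city": "New York"}},
]

# 4. Graph database: entities (nodes) and relations (edges) are both stored.
nodes = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
edges = [(1, "KNOWS", 2)]

def column(name):
    """Read a single column without touching the others."""
    return list(wide_column[name].values())

def neighbors(node_id, relation):
    """Follow outgoing edges of one relation type from a node."""
    return [
        nodes[dst]["name"]
        for src, rel, dst in edges
        if src == node_id and rel == relation
    ]

if __name__ == "__main__":
    print(key_value["person:1"])
    print(column("city"))
    print(documents[0]["address"]["city"])
    print(neighbors(1, "KNOWS"))
```

The structures grow in complexity from top to bottom, which mirrors the ordering of the four types in the text.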

2.4 Types of machine learning


We can divide the different approaches to machine learning by the amount of human effort that’s
required to coordinate them and how they use labeled data. Labeled data is data with a
category or a real-valued number assigned to it that represents the outcome of previous
observations. Some of the main types of machine learning algorithms are as follows:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
2.6.1 Supervised Machine Learning

Supervised learning is the setting in which a model is trained on a labeled dataset.

Labeled datasets contain both input and output parameters. A supervised learning algorithm
learns a mapping from the inputs to the correct outputs. Both the training and the validation
datasets are labeled.

Fig. 2.3 Supervised Learning


Example: Consider a scenario where you have to build an image classifier to differentiate
between cats and dogs. If you feed a dataset of labeled dog and cat images to the algorithm,
the machine learns to distinguish a dog from a cat using these labeled images. When we
input new dog or cat images that it has never seen before, it uses the learned model to
predict whether each one is a dog or a cat. This is how supervised learning works; this
particular task is image classification.
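
The idea of learning from labeled examples and then predicting labels for unseen inputs can be sketched in a few lines. The example below is a toy 1-nearest-neighbor classifier, not a real image classifier: the two numeric features per animal are hypothetical stand-ins for whatever would be extracted from an image.

```python
import math

# Labeled training data: (features, label). The two numbers per example are
# hypothetical image features, e.g. (ear pointiness, snout length).
training_data = [
    ((0.9, 0.2), "cat"),
    ((0.8, 0.3), "cat"),
    ((0.3, 0.8), "dog"),
    ((0.2, 0.9), "dog"),
]

def predict(features, labeled_examples=training_data):
    """Return the label of the closest training example (1-nearest neighbor)."""
    nearest = min(labeled_examples, key=lambda ex: math.dist(ex[0], features))
    return nearest[1]

if __name__ == "__main__":
    # Inputs the model has never seen before.
    print(predict((0.85, 0.25)))
    print(predict((0.25, 0.70)))
```

The labels in the training data are what make this supervised: without them, the function would have nothing to return for a new input.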
Advantages of Supervised Machine Learning
• Supervised learning models can achieve high accuracy because they are trained on labeled data.
• The decision-making process of supervised learning models is often interpretable.
• Pre-trained supervised models can often be reused, which saves time and resources compared
with developing new models from scratch.
Disadvantages of Supervised Machine Learning
• It only learns the patterns present in the training data and may struggle with unseen or
unexpected patterns.
• It can be time-consuming and costly because it relies on labeled data.
• It may generalize poorly to new data.
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
• Image classification: Identify objects, faces, and other features in images.

• Natural language processing: Extract information from text, such as sentiment, entities,
and relationships.
• Speech recognition: Convert spoken language into text.
• Recommendation systems: Make personalized recommendations to users.
• Predictive analytics: Predict outcomes, such as sales, customer churn, and stock prices.
• Medical diagnosis: Detect diseases and other medical conditions.
• Fraud detection: Identify fraudulent transactions.
• Autonomous vehicles: Recognize and respond to objects in the environment.
• Email spam detection: Classify emails as spam or not spam.
• Quality control in manufacturing: Inspect products for defects.
• Credit scoring: Assess the risk of a borrower defaulting on a loan.
• Gaming: Recognize characters, analyze player behavior, and create NPCs.
• Customer support: Automate customer support tasks.
• Weather forecasting: Make predictions for temperature, precipitation, and other
meteorological parameters.
• Sports analytics: Analyze player performance, make game predictions, and optimize
strategies.
2.6.2 Unsupervised Machine Learning
Unsupervised learning is a type of machine learning technique in which an algorithm discovers
patterns and relationships using unlabeled data. Unlike supervised learning, unsupervised
learning doesn’t involve providing the algorithm with labeled target outputs. The primary goal
of unsupervised learning is often to discover hidden patterns, similarities, or clusters within
the data, which can then be used for various purposes, such as data exploration, visualization,
and dimensionality reduction.

Fig. 2.4 Unsupervised Learning


Let’s understand it with the help of an example.
Example: Consider a dataset that contains information about the purchases you
made from a shop. Through clustering, the algorithm can group you with other customers who
show similar purchasing behaviour, which reveals groups of potential customers without predefined labels.
This type of information can help businesses target customers as well as identify outliers.
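
The clustering step in this example can be sketched with a plain k-means loop. The code below is a minimal illustration under stated assumptions: the customer data (visits per month, average basket value) and the starting centroids are made up, and real projects would normally use a library implementation rather than this hand-written version.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate an assignment step and a centroid update step."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids

# Hypothetical customers: (visits per month, average basket value).
customers = [(2, 10), (3, 12), (2, 11), (10, 50), (11, 52), (12, 49)]

if __name__ == "__main__":
    labels, centroids = kmeans(customers, centroids=[(2, 10), (10, 50)])
    print(labels)
    print(centroids)
```

No labels are supplied anywhere; the two groups emerge only from how close the customers are to each other.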
Advantages of Unsupervised Machine Learning
• It helps to discover hidden patterns and relationships in the data.
• It is used for tasks such as customer segmentation, anomaly detection, and data exploration.
• It does not require labeled data, which reduces the effort of data labeling.
• Techniques such as autoencoders and dimensionality reduction can be used to
extract meaningful features from raw data.
Disadvantages of Unsupervised Machine Learning
• Without labels, it may be difficult to evaluate the quality of the model’s output.
• Clusters may be hard to interpret and may not correspond to meaningful groups.
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
• Clustering: Group similar data points into clusters.
• Anomaly detection: Identify outliers or anomalies in data.
• Dimensionality reduction: Reduce the dimensionality of data while preserving its
essential information.
• Recommendation systems: Suggest products, movies, or content to users based on their
historical behaviour or preferences.

• Topic modeling: Discover latent topics within a collection of documents.


• Density estimation: Estimate the probability density function of data.
• Image and video compression: Reduce the amount of storage required for multimedia
content.
• Data preprocessing: Help with data preprocessing tasks such as data cleaning,
imputation of missing values, and data scaling.
• Market basket analysis: Discover associations between products.
• Genomic data analysis: Identify patterns or group genes with similar expression
profiles.
• Image segmentation: Segment images into meaningful regions.
• Community detection in social networks: Identify communities or groups of
individuals with similar interests or connections.
• Customer behaviour analysis: Uncover patterns and insights for better marketing and
product recommendations.
• Content recommendation: Classify and tag content to make it easier to recommend
similar items to users.
• Exploratory data analysis (EDA): Explore data and gain insights before defining
specific tasks.

2.7.1 The problems you face when handling large data


• A large volume of data poses new challenges, such as overloaded memory and algorithms
that never stop running. It forces you to adapt and expand your collection of techniques.
• Even when you can perform the analysis, you should watch for issues such as I/O (input/output)
and CPU starvation, because these can slow processing down.

Fig. 2.6 Overview of problems encountered when working with more data than can fit in
memory
Not enough memory:
• A computer has only a limited amount of RAM. When we try to load more data into this
memory than actually fits, the OS starts swapping memory blocks out to disk, which is
far less efficient than having it all in memory.
• Only a few algorithms are designed to handle large data sets; most of them load the
whole data set into memory at once, which causes an out-of-memory error.
• Other algorithms need to hold multiple copies of the data in memory or store intermediate
results. All of these aggravate the problem.

Processes that never end:

• Even when we cure the memory issues, we may need to deal with another limited resource,
i.e., time.
• Certain algorithms don’t take time into account; they’ll keep running forever.
• Other algorithms can’t finish in a reasonable amount of time even when they need to process
only a few megabytes of data.
Some components form a bottleneck while others remain idle:
• A third issue that arises when dealing with large data sets is that some components of the
computer can start to form a bottleneck while leaving other components idle.
• Although this isn’t as severe as a never-ending algorithm or out-of-memory errors, it still
incurs a serious cost in terms of person-days and computing infrastructure, for example when
the CPU is starved of data.
Not enough speed:
• Certain programs don’t feed data fast enough to the processor because they have to read data
from the hard drive, which is one of the slowest components on a computer.
• We can overcome this problem by using Solid State Drives (SSD), but SSDs are much more
expensive than the slower and more widespread Hard Disk Drive (HDD) technology.

General programming tips for dealing with large data sets


• The tricks that work in a general programming context still apply for data science.
• Several might be worded slightly differently, but the principles are essentially the same for
all programmers. The general tricks are divided into three parts. They are:
1. Don’t reinvent the wheel. Use tools and libraries developed by others.
2. Get the most out of your hardware. Your machine is never used to its full potential; with
simple adaptations you can make it work harder.
3. Reduce the computing need. Slim down your memory and processing needs as much as
possible.
1. Avoid duplicating existing efforts / Don’t reinvent the wheel:
• “Avoid repetition” applies to other people’s work as well as your own, much like “don’t
repeat yourself.”
• Spend your effort where it adds value. Re-solving a problem that has already been
solved is inefficient.
• As a data scientist, there are two fundamental principles that can enhance your productivity
while working with enormous datasets:
➢ Exploit the potential of databases: Most data scientists first choose to create their
analytical base tables within a database when dealing with huge data sets. This strategy is
effective for preparing straightforward features. When the preparation involves more advanced
modeling, check whether user-defined functions and procedures can be used.
➢ Utilize optimized libraries: Developing libraries such as Mahout and Weka, and implementing
other machine learning algorithms, demands effort and expertise. These products are highly
optimized and use best practices and cutting-edge technologies. Focus your attention
on accomplishing tasks rather than duplicating the labor of others, unless it
is for the purpose of understanding how things work.
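
As an illustration of letting the database prepare straightforward features, the sketch below computes per-customer aggregates with a single SQL query instead of looping over raw rows in Python. It uses the built-in sqlite3 module; the purchases table and its contents are hypothetical.

```python
import sqlite3

def make_purchases_db():
    """Create an in-memory table of raw purchases (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO purchases VALUES (?, ?)",
        [(1, 20.0), (1, 30.0), (2, 5.0), (2, 15.0), (2, 10.0)],
    )
    conn.commit()
    return conn

def customer_features(conn):
    """Build a small analytical base table: one row of features per customer."""
    query = """
        SELECT customer_id,
               COUNT(*)    AS n_purchases,
               SUM(amount) AS total_spent,
               AVG(amount) AS average_basket
        FROM purchases
        GROUP BY customer_id
        ORDER BY customer_id
    """
    return conn.execute(query).fetchall()

if __name__ == "__main__":
    for row in customer_features(make_purchases_db()):
        print(row)
```

Only the aggregated rows leave the database, so the amount of data the analysis code has to hold is one row per customer rather than one row per purchase.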
2. Get the most out of your hardware:
• Over-utilization of some resources can slow down programs and cause them to fail.
• Workload can be shifted from overtaxed to underutilized resources using the
following techniques:
1. Feeding the CPU compressed data: Shift more work from the hard disk to the CPU to avoid CPU
starvation.
2. Utilizing the GPU: Switch to the GPU for parallelizable computations because of its higher throughput.
3. Using CUDA packages: Use CUDA packages such as PyCUDA for parallelization.
4. Using multiple threads: Parallelize computations on the CPU using normal Python threads. In
standard CPython, the global interpreter lock limits the gain for CPU-bound work, so threads
help most when the work is I/O-bound or done in libraries that release the lock.
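
The first technique in the list, feeding the CPU compressed data, can be sketched with Python's built-in gzip module. The file name and the data are hypothetical. The file on disk is smaller, so less has to be read from the slow drive, and the CPU spends some otherwise idle time decompressing.

```python
import gzip
import os
import tempfile

def write_compressed(path, values):
    """Write one value per line to a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for value in values:
            f.write(f"{value}\n")

def sum_compressed(path):
    """Stream the compressed file line by line; the CPU decompresses on the fly."""
    total = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            total += int(line)
    return total

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "values.txt.gz")
        write_compressed(path, range(100_000))
        print(os.path.getsize(path), "bytes on disk")
        print(sum_compressed(path))
```

Whether this pays off depends on the machine: it helps when the disk is the bottleneck and the CPU has spare capacity, which is the situation described above.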
3. Reduce the computing need
• “Working smart + hard = achievement.” This also applies to the programs you write.
• The best way to avoid having large data problems is by removing as much of the work as
possible up front and letting the computer work only on the part that can’t be skipped.
The following list contains methods to help us achieve this:
➢ Use a profiler to identify and fix slow parts of the code.
➢ Use compiled code, especially when loops are involved, and functions from packages
optimized for numerical computations.
➢ If a package is not available, compile the code yourself.
➢ Use computational libraries like LAPACK, BLAS, Intel MKL, and ATLAS for high
performance.
➢ Avoid pulling data into memory when working with data that doesn't fit in memory.
➢ Use generators to avoid intermediate data storage by returning data per observation
instead of in batches.
➢ Use as little data as possible if no large-scale algorithm is available.
➢ Use math skills to simplify calculations.
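
Two of the methods above, generators and simplifying the calculation with a little math, combine naturally. The sketch below computes a mean over a stream of observations without ever holding them all in memory; the function names and the input lines are hypothetical.

```python
def read_observations(lines):
    """Yield one parsed observation at a time instead of building a full list."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield float(line)

def running_mean(observations):
    """Incremental mean: update the estimate with each new value.

    Uses mean_n = mean_(n-1) + (x_n - mean_(n-1)) / n, so only the current
    count and mean are stored, never the observations themselves.
    """
    count, mean = 0, 0.0
    for x in observations:
        count += 1
        mean += (x - mean) / count
    return mean

if __name__ == "__main__":
    # Any iterable of lines works here, including an open file object.
    sample_lines = ["1\n", "2\n", "\n", "3\n"]
    print(running_mean(read_observations(sample_lines)))
```

Because a file object is itself an iterable of lines, the same two functions would work unchanged on a file far larger than the available RAM.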

Set1

1. a. Discuss unsupervised Machine Learning.
b. List the problems we face when handling large data sets.
2. Explain the ACID principle of relational databases.
Set2
1. a. Discuss supervised Machine Learning.
b. List general programming tips for handling large data sets.

2. Explain the various types of NoSQL databases.
