
Data Science: NoSQL & Machine Learning Basics

The document provides an overview of data science concepts, focusing on NoSQL databases, the ACID principles of relational databases, and types of machine learning, including supervised and unsupervised learning. It discusses the challenges of handling large datasets and offers programming tips for efficient data processing. Additionally, it includes case studies and applications for various machine learning techniques and database types.

Uploaded by

Mamatha


Introduction to Data Science Unit-1

UNIT – III
TOPICS:
NoSQL movement for handling big data: Distributing data storage and processing with the Hadoop
framework, case study on risk assessment for loan sanctioning, ACID principle of relational
databases, CAP theorem, BASE principle of NoSQL databases, types of NoSQL databases, case
study on disease diagnosis and profiling

ACID: the core principle of relational databases

Atomicity—The “all or nothing” principle. If a record is put into a database, it’s put in
completely or not at all. If, for instance, a power failure occurs in the middle of a database write
action, you wouldn’t end up with half a record; it wouldn’t be there at all.
Consistency—This important principle maintains the integrity of the data. No entry that makes it
into the database will ever be in conflict with predefined rules, such as lacking a required field or
a field being numeric instead of text.
Isolation—When something is changed in the database, nothing else can happen to that exact
same data at exactly the same moment. Instead, the actions happen in serial with other changes.
Isolation is a scale going from low isolation to high isolation. On this scale, traditional databases
are on the “high isolation” end. An example of low isolation would be Google Docs: Multiple
people can write to a document at the exact same time and see each other’s changes happening
instantly. A traditional Word document has high isolation; it’s locked for editing by the first user
to open it. The second person opening the document can view its last saved version but is unable
to see unsaved changes or edit the document without first saving it as a copy. So once someone
has it opened, the most up-to-date version is completely isolated from anyone but the editor who
locked the document.
Durability—If data has entered the database, it should survive permanently. Physical damage to
the hard disks will destroy records, but power outages and software crashes should not.
ACID applies to all relational databases and certain NoSQL databases, such as the graph database
Neo4j. For most other NoSQL databases another principle applies: BASE.
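
Atomicity and consistency can be sketched with Python's built-in sqlite3 module and an in-memory database. The account table, the names alice and bob, and the transfer helper are hypothetical and exist only for this illustration; the CHECK constraint stands in for a "predefined rule". This is a minimal sketch, not how any particular production system is set up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Consistency: the CHECK constraint is a predefined rule no stored row may violate.
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates happen or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Alice only has 100, so the debit breaks the rule and the credit is undone too.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

After the failed transfer both balances are unchanged: the credit to bob that ran first was rolled back together with the rejected debit, which is the "all or nothing" behaviour described above.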


Types of NoSQL databases

The four biggest types of NoSQL databases are:
Key-value stores—Essentially a collection of key-value pairs stored in a database. These databases
can be immensely big and are hugely versatile, but the data complexity is low. A well-known
example is Redis.
Wide-column databases—These databases are a bit more complex than key-value stores in that
they use columns, but in a more efficient way than a regular RDBMS would. The columns are
essentially decoupled, allowing you to retrieve data in a single column quickly. A well-known
example is Cassandra.
Document stores—These databases are a little bit more complex and store data as documents.
Currently the most popular one is MongoDB, but in our case study we use Elasticsearch, which
is both a document store and a search engine.
Graph databases—These databases can store the most complex data structures, as they treat the
entities and relations between entities with equal care. This complexity comes at a cost in lookup
speed. A popular one is Neo4j; GraphX, the graph-processing library of Apache Spark, is also
gaining ground, although it is a computation engine rather than a database.
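
To make the difference in data complexity concrete, the sketch below models the same two hypothetical users in each of the four styles using plain Python dictionaries and lists. It does not use any real database client; the keys, field names and the FOLLOWS relation are assumptions chosen for illustration.

```python
# Key-value store: an opaque value looked up by a single key.
key_value = {"user:42": '{"name": "Ada", "city": "London"}'}

# Wide-column store: each column is kept separately, so one column can be read alone.
wide_column = {
    "name": {42: "Ada", 43: "Grace"},
    "city": {42: "London", 43: "New York"},
}

# Document store: one self-contained, nested document per entity.
document = {"_id": 42, "name": "Ada", "address": {"city": "London"}, "follows": [43]}

# Graph: entities (nodes) and relations (edges) are both first-class data.
nodes = {42: {"name": "Ada"}, 43: {"name": "Grace"}}
edges = [(42, "FOLLOWS", 43)]

# Reading a single column touches only that column's data.
cities = list(wide_column["city"].values())

# Following a relation means walking the edges from a node.
followed = [nodes[dst]["name"] for src, rel, dst in edges if src == 42 and rel == "FOLLOWS"]
```

The key-value form is the simplest to store and look up but says nothing about structure, while the graph form makes relations explicit at the price of a traversal for each lookup.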

2.4 Types of machine learning

We can divide the different approaches to machine learning by the amount of human effort that’s
required to coordinate them and by how they use labeled data. Labeled data is data with a
category or a real-valued number assigned to it that represents the outcome of previous
observations. Some of the main types of machine learning algorithms are as follows:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
2.6.1 Supervised Machine Learning

In supervised learning, a model is trained on a “Labelled Dataset”.

Labelled datasets contain both input and output parameters. A supervised learning algorithm
learns to map inputs to the correct outputs. Both the training and the validation datasets are
labelled.


Fig. 2.3 Supervised Learning

Example: Consider a scenario where you have to build an image classifier to differentiate
between cats and dogs. If you feed labelled images of dogs and cats to the algorithm, the
machine learns to tell a dog from a cat from these labelled images. When we input new dog or
cat images that it has never seen before, it uses the learned model to predict whether each one
shows a dog or a cat. This is how supervised learning works; this particular task is image
classification.
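
The train-then-predict pattern can be sketched without any image data. The example below swaps the image classifier for a 1-nearest-neighbour classifier on two made-up numeric features (weight and ear length); the feature values and labels are hypothetical, and the technique is a stand-in chosen for brevity, not the method the notes prescribe.

```python
from math import dist

# Hypothetical labelled training data: (weight in kg, ear length in cm) -> label.
training = [
    ((4.0, 6.5), "cat"),
    ((3.5, 7.0), "cat"),
    ((5.0, 6.0), "cat"),
    ((20.0, 10.0), "dog"),
    ((30.0, 12.0), "dog"),
    ((12.0, 9.0), "dog"),
]


def predict(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    _, label = min(training, key=lambda example: dist(example[0], features))
    return label


# Inputs the model has never seen before.
prediction_small = predict((4.2, 6.8))
prediction_large = predict((26.0, 11.0))
```

The labels in `training` play the role of the labelled images: the prediction for a new input is driven entirely by the input-output pairs seen during training.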
Advantages of Supervised Machine Learning
• Supervised learning models can reach high accuracy because they are trained on labelled data.
• The decision-making process of supervised learning models is often interpretable.
• Pre-trained supervised models can often be reused, which saves time and resources compared
with developing new models from scratch.
Disadvantages of Supervised Machine Learning
• It learns only the patterns present in the training data and may struggle with unseen or
unexpected patterns.
• It can be time-consuming and costly because it relies entirely on labeled data.
• It may generalize poorly to new data.
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
• Image classification: Identify objects, faces, and other features in images.


• Natural language processing: Extract information from text, such as sentiment, entities,
and relationships.
• Speech recognition: Convert spoken language into text.
• Recommendation systems: Make personalized recommendations to users.
• Predictive analytics: Predict outcomes, such as sales, customer churn, and stock prices.
• Medical diagnosis: Detect diseases and other medical conditions.
• Fraud detection: Identify fraudulent transactions.
• Autonomous vehicles: Recognize and respond to objects in the environment.
• Email spam detection: Classify emails as spam or not spam.
• Quality control in manufacturing: Inspect products for defects.
• Credit scoring: Assess the risk of a borrower defaulting on a loan.
• Gaming: Recognize characters, analyze player behavior, and create NPCs.
• Customer support: Automate customer support tasks.
• Weather forecasting: Make predictions for temperature, precipitation, and other
meteorological parameters.
• Sports analytics: Analyze player performance, make game predictions, and optimize
strategies.
2.6.2 Unsupervised Machine Learning
Unsupervised learning is a type of machine learning in which an algorithm discovers
patterns and relationships using unlabeled data. Unlike supervised learning, unsupervised
learning doesn’t involve providing the algorithm with labeled target outputs. The primary goal
of unsupervised learning is often to discover hidden patterns, similarities, or clusters within
the data, which can then be used for purposes such as data exploration, visualization, and
dimensionality reduction.


Fig. 2.4 Unsupervised Learning

Let’s understand it with the help of an example.
Example: Consider a dataset that contains information about the purchases you made from a
shop. Through clustering, the algorithm can group you with other customers who show similar
purchasing behaviour, which reveals customer segments without predefined labels. This type of
information can help businesses target customers as well as identify outliers.
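
The clustering idea in this example can be sketched with a small k-means implementation in plain Python. The purchase data (visits per month, average spend), the choice of two clusters and the fixed starting centroids are all assumptions made for the illustration; k-means is one common clustering algorithm, not the only option.

```python
from math import dist

# Hypothetical, unlabeled purchase data: (visits per month, average spend).
points = [(1, 10), (2, 12), (1, 8), (9, 80), (10, 95), (11, 90)]


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-centring."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Fixed starting centroids keep the sketch deterministic.
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

No labels are supplied at any point: the two groups (occasional low spenders and frequent high spenders) emerge from the distances between observations alone.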
Advantages of Unsupervised Machine Learning
• It helps to discover hidden patterns and various relationships in the data.
• It is used for tasks such as customer segmentation, anomaly detection, and data exploration.
• It does not require labeled data, which reduces the effort of data labeling.
• Techniques such as autoencoders and dimensionality reduction can be used to extract
meaningful features from raw data.
Disadvantages of Unsupervised Machine Learning
• Without labels, it may be difficult to assess the quality of the model’s output.
• The resulting clusters may be hard to interpret and may not have a meaningful interpretation.
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
• Clustering: Group similar data points into clusters.
• Anomaly detection: Identify outliers or anomalies in data.
• Dimensionality reduction: Reduce the dimensionality of data while preserving its
essential information.
• Recommendation systems: Suggest products, movies, or content to users based on their
historical behaviour or preferences.


• Topic modeling: Discover latent topics within a collection of documents.

• Density estimation: Estimate the probability density function of data.
• Image and video compression: Reduce the amount of storage required for multimedia
content.
• Data preprocessing: Help with data preprocessing tasks such as data cleaning,
imputation of missing values, and data scaling.
• Market basket analysis: Discover associations between products.
• Genomic data analysis: Identify patterns or group genes with similar expression
profiles.
• Image segmentation: Segment images into meaningful regions.
• Community detection in social networks: Identify communities or groups of
individuals with similar interests or connections.
• Customer behaviour analysis: Uncover patterns and insights for better marketing and
product recommendations.
• Content recommendation: Classify and tag content to make it easier to recommend
similar items to users.
• Exploratory data analysis (EDA): Explore data and gain insights before defining
specific tasks.

2.7.1 The problems you face when handling large data

• A large volume of data poses new challenges, such as overloaded memory and algorithms
that never stop running. It forces you to adapt and expand your collection of techniques.
• Even when you can perform the analysis, you should take care of issues such as I/O
(input/output) and CPU starvation, because these can cause speed issues.


Fig. 2.6 Overview of problems encountered when working with more data than can fit in
memory
Not enough memory:
• A computer has only a limited amount of RAM. When we try to load more data into this
memory than actually fits, the OS starts swapping memory blocks out to disk, which is
far less efficient than having it all in memory.
• Only a few algorithms are designed to handle large data sets; most of them load the
whole data set into memory at once, which causes an out-of-memory error.
• Other algorithms need to hold multiple copies of the data in memory or store intermediate
results. All of these aggravate the problem.

Processes that never end:

• Even when we cure the memory issues, we may need to deal with another limited resource,
i.e., time.
• Certain algorithms don’t take time into account; they’ll keep running forever.
• Other algorithms can’t finish in a reasonable amount of time even when they need to process
only a few megabytes of data.
Some components form a bottleneck while others remain idle:
• A third issue that arises when dealing with large data sets is that some components of the
computer can start to form a bottleneck while others sit idle.


• Although this isn’t as severe as a never-ending algorithm or out-of-memory errors, it still
incurs a serious cost in person-days and computing infrastructure, with CPU starvation as a
typical example.
Not enough speed:
• Certain programs don’t feed data fast enough to the processor because they have to read data
from the hard drive, which is one of the slowest components on a computer.
• We can overcome this problem by using Solid State Drives (SSD), but SSDs are much more
expensive than the slower and more widespread Hard Disk Drive (HDD) technology.

General programming tips for dealing with large data sets

• The tricks that work in a general programming context still apply to data science.
• Several might be worded slightly differently, but the principles are essentially the same for
all programmers. The general tricks fall into three groups:
1. Don’t reinvent the wheel. Use tools and libraries developed by others.
2. Get the most out of your hardware. Your machine is never used to its full potential; with
simple adaptations you can make it work harder.
3. Reduce the computing need. Slim down your memory and processing needs as much as
possible.
1. Avoid duplicating existing efforts / Don’t reinvent the wheel:
• "Don't repeat anyone" extends the familiar rule "don't repeat yourself."
• Act in a way that adds value. Re-solving a problem that has already been solved is
inefficient.
• As a data scientist, two fundamental principles can enhance your productivity when working
with enormous datasets:
➢ Exploit the potential of databases: When dealing with huge data sets, most data scientists
first create their analytical base tables inside a database. This strategy works well for
preparing straightforward features. When the preparation involves advanced modeling, check
whether user-defined functions and procedures can be used.
➢ Utilize optimized libraries: Building libraries such as Mahout and Weka, or implementing
other machine learning algorithms, demands effort and expertise. These products are highly
optimized and incorporate best practices and cutting-edge technologies. Focus your attention
on accomplishing tasks rather than duplicating the labor of others, unless it is for the purpose
of understanding how things work.
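
As a minimal sketch of exploiting the database, the example below lets SQLite compute per-customer features so that only one aggregated row per customer reaches Python. It uses Python's built-in sqlite3 module with an in-memory database; the purchase table, its columns and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchase (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?)",
    [("ann", 10.0), ("ann", 30.0), ("bob", 5.0), ("bob", 15.0), ("bob", 40.0)],
)

# The database does the grouping; Python never sees the individual purchases.
features = conn.execute(
    "SELECT customer, COUNT(*), SUM(amount), AVG(amount) "
    "FROM purchase GROUP BY customer ORDER BY customer"
).fetchall()
```

With millions of purchase rows the result would still be one small row per customer, which is what makes this kind of analytical base table practical for data that does not fit in memory.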
2. Get the most out of your hardware:
• Over-utilization of some resources can slow programs down and cause them to fail.
• Workload can be shifted from overtaxed to underutilized resources using the following
techniques:
1. Feeding the CPU compressed data: Store data in compressed form so that less has to be read
from the hard disk; decompression shifts work from the disk to the CPU and helps avoid CPU
starvation.
2. Utilizing the GPU: Switch to the GPU for parallelizable computations, where it offers much
higher throughput than the CPU.
3. Using CUDA packages: Use CUDA packages such as PyCUDA for parallelization.
4. Using multiple threads: Parallelize work on the CPU using normal Python threads. In CPython
the global interpreter lock limits the gain for CPU-bound pure-Python code, so threads help
most with I/O-bound work.
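
The first technique, feeding the CPU compressed data, can be sketched with Python's standard gzip module. The file name numbers.csv.gz and the generated rows are assumptions made for the illustration; how much is actually gained depends on the data and the hardware.

```python
import gzip
import os
import tempfile

# Hypothetical data set: 10,000 rows of "i,i squared".
rows = [f"{i},{i * i}\n" for i in range(10_000)]
raw = "".join(rows).encode("utf-8")

# Store it compressed, so fewer bytes have to travel from disk.
path = os.path.join(tempfile.mkdtemp(), "numbers.csv.gz")
with gzip.open(path, "wb") as f:
    f.write(raw)

compressed_size = os.path.getsize(path)

# Read it back line by line; the CPU decompresses while streaming.
total = 0
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        _, square = line.split(",")
        total += int(square)
```

The file on disk is smaller than the raw bytes, so the slow component (the disk) does less work while the otherwise idle CPU takes on the decompression.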
3. Reduce the computing need
• “Working smart + hard = achievement.” This also applies to the programs you write.
• The best way to avoid large data problems is to remove as much of the work as possible up
front and let the computer work only on the part that can’t be skipped.
The following list contains methods to help us achieve this:
➢ Use a profiler to identify and fix slow parts of the code.
➢ Use compiled code, especially when loops are involved, and functions from packages
optimized for numerical computations.
➢ If no such package is available, compile the code yourself.
➢ Use computational libraries such as LAPACK, BLAS, Intel MKL, and ATLAS for high
performance.
➢ Avoid pulling data into memory when working with data that doesn't fit in memory.
➢ Use generators to avoid intermediate data storage by returning data per observation
instead of in batches.
➢ Use as little data as possible if no large-scale algorithm is available.
➢ Use math skills to simplify calculations.
Set 1

1. a. Discuss unsupervised machine learning.


b. List the problems we face when handling large data sets.
2. Explain the ACID principle of relational databases.
Set 2
1. a. Discuss supervised machine learning.
b. List general programming tips for handling large data sets.

2. Explain the various types of NoSQL databases.


