Introduction to Big Data Concepts
Chapter 1
• Introduction to Big Data
• Hadoop Principles
• Hadoop Ecosystem
• NoSQL databases
Facts
• Every day, we generate 2.5 quintillion bytes (2.5 × 10^18 bytes) of data.
• 90% of the data in the world has been created in the last two years.
• 90% of the data generated is unstructured.
• Sources:
• Sensors used to collect climate information
• Messages on social media
• Digital images and videos published online
• Online purchase transaction records
• GPS signals from mobile phones
• The development of the IoT (Internet of Things) and the widespread use of geolocation and analytics have triggered an explosion in the volume of data collected.
•…
• This data is called Big Data or massive data.
Interests
• Detecting customer sentiments and reactions.
• Detecting critical or life-threatening conditions in hospitals.
• Making risk-based decisions from real-time transactional data.
• Identifying criminals and threats from video, audio, and data streams.
• Studying students' reactions during a class and predicting who will succeed, based on statistics and models gathered over the years.
Challenges
• Gather a large volume of varied data to find
new ideas.
Traditional approaches
• Suitable for:
• Structured data
Big Data Approach vs Traditional Approach
Databases and RDBMS
• A database is a set of information organized into tables so that it can be easily accessed, managed, and updated.
• RDBMS: a relational database management system is software for sharing, managing, and storing information in a database.
DBMS: ACID
• The basic concepts of RDBMS
• Atomicity: a transaction is executed completely or not at all
• Consistency: the content of the database must be consistent at the beginning and at the end of a transaction (but not necessarily during its execution)
• Isolation: the modifications made by a transaction become visible/editable to others only once it is validated (committed)
• Durability: once the transaction is validated, the state of the database is permanent
• Features
• Joins between tables
• Building complex queries
• Solid integrity constraints
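Atomicity and integrity constraints can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The account table, the names alice and bob, and the helper functions are hypothetical and exist only for illustration. A transfer that would violate the constraint is rolled back as a whole, so the credit that was already applied is undone as well.

```python
import sqlite3


def make_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    # Integrity constraint: a balance may never become negative.
    conn.execute(
        "CREATE TABLE account ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint; the credit was undone too.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))


if __name__ == "__main__":
    bank = make_bank()
    print(transfer(bank, "alice", "bob", 500), balances(bank))
    print(transfer(bank, "alice", "bob", 30), balances(bank))
```

The credit is deliberately issued before the debit so that the failing statement arrives after a successful one; without the rollback, bob would keep money that alice never sent.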
RDBMS: limitations
• Relational DBMSs reach their limits with very high-throughput streams of data whose types do not fit the rigid schemas of the relational model.
• Limits in a distributed context: how should the data be distributed/partitioned?
• Links between entities -> same server
• But the more links we have, the more complex the placement of data becomes.
• ACID constraints are very complex to guarantee (distributed locking techniques, for example)
• Incompatible with performance
• Limits in terms of data quantity and throughput:
• Inability to manage very large volumes of data at extreme throughput
• Certain types of data are not suitable
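The partitioning problem can be made concrete with a small sketch. It assumes hash partitioning, which is only one possible placement strategy, and hypothetical order records with order_id and customer_id fields. Placing each order by its own identifier tends to separate it from its customer, whereas placing it by customer_id keeps the link on one server.

```python
import zlib


def shard_for(key, n_servers):
    """Map a key to a server index with a stable hash (CRC32)."""
    return zlib.crc32(str(key).encode("utf-8")) % n_servers


def partition(rows, key_field, n_servers):
    """Distribute rows over n_servers lists according to row[key_field]."""
    shards = [[] for _ in range(n_servers)]
    for row in rows:
        shards[shard_for(row[key_field], n_servers)].append(row)
    return shards


def cross_server_links(orders, placement_field, n_servers):
    """Count orders stored on a different server than their customer.

    Orders are placed by `placement_field`; customers are placed by
    their customer_id.
    """
    return sum(
        1
        for order in orders
        if shard_for(order[placement_field], n_servers)
        != shard_for(order["customer_id"], n_servers)
    )


if __name__ == "__main__":
    # Hypothetical data: 20 orders spread over 5 customers.
    orders = [
        {"order_id": f"o{i}", "customer_id": f"c{i % 5}"} for i in range(20)
    ]
    print("placed by order_id:   ", cross_server_links(orders, "order_id", 4))
    print("placed by customer_id:", cross_server_links(orders, "customer_id", 4))
```

Co-locating linked rows removes cross-server joins for that one link, but a row that is linked to several other entities can only be co-located with one of them, which is why placement becomes harder as links multiply.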
Data warehouses
• A data warehouse is a database that consolidates some or all of a company's functional data. It falls under the category of business intelligence; its goal is to provide a set of data that serves as a single reference for decision-making in the company, through statistics and reports produced by reporting tools.
Data warehouses: limitations
• A data warehouse cannot cope with:
• Volume: warehouses are designed to handle GB or TB of data, while the exponential growth of data is leading us to PB or EB
• Variety (type): several types of data, including semi-structured or unstructured text
• Velocity: data is generated faster and faster and requires real-time processing
ACID vs BASE
• Modern distributed systems follow the BASE model
• Basically Available: the system remains available even under a large quantity of requests
• Soft state: the state of the system can change over time even without new inputs (this is due to the consistency model).
• Eventually consistent: all replicas reach the same state, and the system becomes consistent at some point, if we stop the inputs.
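Soft state and eventual consistency can be sketched with a toy simulation. The Replica class, the caller-supplied version numbers, and the pairwise sync function are assumptions made for illustration; real systems use their own versioning and replication protocols. Replicas disagree right after the writes, then converge once inputs stop and they exchange their entries.

```python
class Replica:
    """A toy replica storing key -> (version, value)."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, key, value, version):
        """Keep the write only if it is newer than what is already stored."""
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries so that both replicas keep the newest version."""
    for key in set(a.store) | set(b.store):
        for src, dst in ((a, b), (b, a)):
            if key in src.store:
                version, value = src.store[key]
                dst.write(key, value, version)


if __name__ == "__main__":
    r1, r2, r3 = Replica("r1"), Replica("r2"), Replica("r3")
    # Two clients write to different replicas; the system stays available.
    r1.write("status", "draft", version=1)
    r2.write("status", "published", version=2)
    print([r.read("status") for r in (r1, r2, r3)])  # replicas disagree

    # Inputs stop; background synchronization makes the replicas converge.
    sync(r1, r2)
    sync(r2, r3)
    print([r.read("status") for r in (r1, r2, r3)])  # replicas agree
```

Between the two print statements the state of r1 and r3 changes although no client wrote anything new, which is the soft-state behavior described above.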
CAP
• Consistency (coherence)
• Availability
• Partition tolerance (distribution)
• CA: standardized, centralized relational databases
• CP: distributed relational databases
• AP: NoSQL databases
Big Data
• The quantitative explosion of digital data has forced researchers to find new ways of seeing and analyzing the world. It is about discovering new orders of magnitude for the capture, search, sharing, storage, analysis, and presentation of data.
• Thus was born 'Big Data'. It is a concept for storing a vast amount of information in digital form.
• Big Data grew out of the evolution of data management technologies.
• Big Data refers to any large amount of structured, semi-structured, and unstructured data that has the potential to be exploited to obtain information. Data becomes Big Data when it is difficult to process using traditional techniques.
• Big Data is the ability to manage a huge volume of data, at the right speed and within the appropriate time frame, to allow real-time analysis and reaction.
Data characteristics
• The volumes to be managed are heterogeneous and complex:
• produced by applications that are sometimes different,
• by different users,
• with explicit links (for example citations, URLs, etc.) or implicit links (to be extracted or learned).
• We need many servers:
• a single server cannot store this amount of information, guarantee access times for a large number of users, perform fast computations, etc.
• Need to distribute the computations and the data
• As we have several servers/clusters, we need algorithms for computing on and distributing data on a large scale.
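One common way to distribute both the data and the computation is to split the records into partitions, compute a partial result on each partition, then merge the partial results. The sketch below assumes a word-count task, and threads stand in for separate servers; the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split(records, n_workers):
    """Deal the records out round-robin into n_workers partitions."""
    return [records[i::n_workers] for i in range(n_workers)]


def count_words(partition):
    """Local step: each worker only sees its own partition."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(records, n_workers=3):
    """Count words per partition in parallel, then merge the partial counts."""
    partitions = split(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, partitions))
    # Merge step: combine the partial results into a global one.
    return reduce(lambda a, b: a + b, partials, Counter())


if __name__ == "__main__":
    lines = ["big data", "data streams", "big big"]
    print(distributed_word_count(lines, n_workers=2))
```

Because each partial count depends only on its own partition, the local step needs no communication between workers; only the small partial results travel to the merge step.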
Data models
Structured data
• Relational data model
• A relation is a table with rows and columns
• Each relation has a schema defining the types of its columns
• The predefined schema is static
Semi-structured data
• Example: log file
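A log file is semi-structured in the sense that each line follows a recognizable pattern without being bound to a database schema. As a minimal sketch, the code below assumes lines in the Apache Common Log Format; the sample line is invented for illustration, and real log files may use other layouts.

```python
import re

# Assumed layout: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Turn one log line into a dictionary, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means that no bytes were returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


if __name__ == "__main__":
    sample = (
        '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] '
        '"GET /index.html HTTP/1.1" 200 2326'
    )
    print(parse_line(sample))
    print(parse_line("free text that follows no pattern"))
```

Lines that do not match are returned as None rather than rejected up front, which reflects how semi-structured sources are usually handled: the structure is extracted when the data is read, not enforced when it is written.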
Unstructured data
• Examples:
• Facebook Post
• Instagram image
• Video
• Blog
• Journal Article
•…
Characteristics of Big Data
The 5Vs of Big Data
• Information extraction and decision-making from data characterized by the 5Vs:
• Volume
• Variety
• Velocity or Speed
• Veracity
• Value
Velocity or Speed
• Speed at which data arrives
• Refers to the dynamic and/or temporal aspect of the data, its updating and its analysis.
• Data is no longer processed and analyzed after the fact, but in real time or near real time.
• Data is produced in continuous streams, on which real-time decisions can be made.
• This is data, particularly from sensors, that requires fast processing for a real-time reaction.
• When such high-velocity data generates very large volumes, it is no longer possible to store it as is, but only to analyze it in streaming or even to summarize it.
• Example
• It is not enough to know which item a customer has purchased or reserved.
• Knowing that a customer spent 5 minutes reading about an item in an online store is enough to send them an email as soon as that item is discounted.
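Summarizing a stream instead of storing it can be sketched with a running summary that is updated one value at a time. The RunningSummary class is hypothetical, and the variance update uses Welford's online algorithm, chosen here as one possible technique. The sensor readings are invented values.

```python
import math


class RunningSummary:
    """Summarize a stream in constant memory (count, mean, variance, min, max)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, x):
        """Fold one new value into the summary; the value itself is not kept."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.count if self.count else 0.0


if __name__ == "__main__":
    summary = RunningSummary()
    for reading in [20.0, 21.0, 19.0, 22.0]:  # hypothetical sensor values
        summary.update(reading)
    print(summary.count, summary.mean, summary.variance)
```

Memory use stays constant however long the stream runs, at the cost of losing the individual readings.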
Value
• All data must be transformed into actionable value. Data without value is useless.
• Achieving strategic objectives of value creation for customers and for the company in all areas of activity.
• Associated with the use that can be made of this big data and of its analysis, notably from an economic point of view.
• Analyzing this data requires expertise both in statistical and data analysis methods and techniques and in the application domain, for interpreting the analyses.
The applications of Big Data
• Blue C.R.U.S.H. (Crime Reduction Utilizing Statistical History) is software that, with the help of cameras and police forces, collects as much data as possible on offenses that occur in a territory.
• The aim is to send the police to the 'hot spots' where the probability of a crime occurring is highest, and thus prevent the crime before it occurs.
Generation -> Storage -> Analysis -> Utilization