
Introduction to Big Data Concepts

This document introduces the concept of Big Data. It describes the characteristics of massive data, including their increasing volume, variety, velocity, and value. The document also presents the challenges posed by Big Data and the traditional and new approaches to manage it.

Big Data

Chapter 1
Introduction to Big Data

Teacher: Nedra Ibrahim


Course Outline

• Introduction to Big Data

• Hadoop Principles

• Hadoop Ecosystem

• Big Data architectures

• NoSQL databases
Facts
• Every day, we generate 2.5 quintillion bytes of data (about 2.5 exabytes)
• 90% of the data in the world has been created in the last two years.
• 90% of the data generated is unstructured.
• Sources:
• Sensors used to collect climate information
• Messages on social media
• Digital images and videos published online
• Online purchase transaction records
• GPS signals from mobile phones
• The development of the IoT (Internet of Things) and the widespread use of
geolocation and analytics have triggered an explosion in the volume of data
collected.
• …
• Such data is called Big Data or Massive Data.
Interests
• Detecting customer sentiments and reactions.
• Detecting critical or potentially fatal conditions in hospitals.
• Making risky decisions based on real-time transactional data.
• Identifying criminals and threats from video, audio, and data streams.
• Studying students' reactions during a class and predicting which ones will
succeed, based on statistics and models gathered over the years.

Challenges
• Gather a large volume of varied data to find new insights.

• Capture data that is created at high speed.

• Store all of this data.

• Process this data and put it to use.

Traditional approaches
• Suitable for:

• Structured data

• Operations and repetitive processes

• Relatively stable sources

• Well-understood, well-defined needs

Big Data Approach vs Traditional Approach

Databases and RDBMS
• A database is a collection of information organized into tables so that it
can be easily accessed, managed, and updated.
• RDBMS: a relational database management system is software for sharing,
managing, and storing information in a database.

DBMS: ACID
• The basic concepts of RDBMS
• Atomicity: a transaction is executed completely or not at all.
• Consistency: the content of the database must be consistent at the beginning and
at the end of a transaction (but not necessarily during its execution).
• Isolation: the modifications made by a transaction become visible to, and
modifiable by, other transactions only once it has been committed.
• Durability: once the transaction is committed, the resulting state of the
database is permanent.
• Features
• Joins between tables
• Building complex queries
• Strong integrity constraints
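
To make atomicity concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `account` table, the `transfer` helper, and the account names are hypothetical and serve only as an illustration. A transfer that would violate the integrity constraint is rolled back as a whole, including the credit that had already been applied.

```python
import sqlite3

# Hypothetical two-account ledger held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # integrity constraint
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint failed, so nothing was applied


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM account"))
```

Calling `transfer(conn, "alice", "bob", 500)` returns False and leaves both balances untouched, even though the credit to `bob` ran before the failing debit.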
RDBMS: limitations
• Relational DBMSs reach their limits with very high-rate streams of data whose
types do not fit the rigid schemas of the relational model.
• Limits in a distributed setting: how should the data be distributed/partitioned?
• Linked entities -> same server
• But the more links there are, the more complex data placement becomes.
• ACID constraints are very complex to guarantee (distributed locking
techniques, for example)
• Incompatible with performance
• Limits in terms of data quantity and throughput:
• inability to manage very large volumes of data at extreme rates
• certain types of data are not suitable
Data warehouses
• A data warehouse is a database that brings together some or all of a
company's functional data. It belongs to the field of business intelligence; its goal
is to provide a set of data that serves as a single reference,
used for decision-making in the company through
statistics and reports produced by reporting tools.

Data warehouses: limitations
• Data warehouses cannot cope with:
• Volume: warehouses are designed to handle gigabytes (GB) or terabytes (TB) of
data, while exponential data growth is leading
to petabytes (PB) or exabytes (EB).
• Type (variety): several types of data, including semi-structured and
unstructured textual data.
• Velocity: data is generated faster and faster
and requires real-time processing.

ACID vs BASE
• Modern distributed systems follow the BASE model
• Basically Available: the system remains available even under a large number of
requests.
• Soft state: the state of the system can change over time even without
new inputs (this is due to the consistency model).
• Eventually consistent: all replicas reach the same state, and the
system becomes consistent at some point, provided the inputs stop.
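
The "eventually consistent" property can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not how any particular NoSQL system works: the `Replica` class, the last-write-wins rule based on unique integer timestamps, and the ring-shaped synchronization are all chosen for illustration.

```python
class Replica:
    """One copy of a key-value store using last-write-wins timestamps."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        # Keep the write only if it is newer than what this replica holds.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def merge_from(self, other):
        # Anti-entropy step: adopt every entry that is newer on the other replica.
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)


def converged(replicas):
    """True when every replica holds exactly the same state."""
    return all(r.data == replicas[0].data for r in replicas)


def sync_until_converged(replicas):
    """Let each replica exchange state with its ring neighbour until all agree."""
    rounds = 0
    while not converged(replicas):
        for i, replica in enumerate(replicas):
            neighbour = replicas[(i + 1) % len(replicas)]
            replica.merge_from(neighbour)
            neighbour.merge_from(replica)
        rounds += 1
    return rounds


replicas = [Replica() for _ in range(4)]
# Each write first reaches a single replica, so the copies disagree (soft state).
replicas[0].write("cart", ["book"], timestamp=1)
replicas[2].write("cart", ["book", "pen"], timestamp=2)
replicas[3].write("user", "alice", timestamp=3)

was_consistent_before = converged(replicas)
rounds = sync_until_converged(replicas)  # no new inputs arrive during the sync
```

Before synchronization the replicas return different answers for the same key; once the inputs stop and the exchanges run, every replica holds the most recent value.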
CAP
• Consistency
• Availability
• Partition tolerance (distribution)
• CA: standardized, centralized relational databases
• CP: distributed relational databases
• AP: NoSQL databases, Big Data

Big Data
• The quantitative explosion of digital data has forced
researchers to find new ways of seeing and analyzing the
world. It is about discovering new orders of magnitude
in the capture, search, sharing, storage,
analysis, and presentation of data.
• Thus 'Big Data' was born. It is a concept that makes it possible to
store an enormous amount of information in digital form.
• Big Data grew out of the evolution of data management technologies.
• Big Data refers to any large amount of structured, semi-structured, and
unstructured data that has the potential
to be mined for information. Data
becomes Big Data when it is difficult to process using
traditional techniques.
• Big Data is the ability to manage a huge volume of data, at the
right speed and within the appropriate time frame, to allow
real-time analysis and reaction.
Data characteristics
• The volumes to be managed are heterogeneous and complex:
• produced by applications that sometimes differ,
• by different users,
• with links that are explicit (for example citations, URLs, etc.) or
implicit (to be extracted or learned).
• Many servers are needed:
• a single server cannot store this amount of information,
guarantee access times for a large number of users, perform
fast computations, etc.
• Hence the need to distribute both the computations and the data.
• With several servers/clusters, we need
algorithms for computing on and distributing data on a large scale.
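
One simple way to distribute data across several servers is hash partitioning. The sketch below is only an illustration: the server names, the `server_for` and `partition` helpers, and the choice of SHA-256 are assumptions, and real systems typically add replication and rebalancing on top of this idea.

```python
import hashlib

SERVERS = ["server-0", "server-1", "server-2"]  # hypothetical node names


def server_for(key, servers=SERVERS):
    """Pick the server responsible for `key` using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def partition(records, servers=SERVERS):
    """Group (key, value) records by the server that should store them."""
    shards = {name: [] for name in servers}
    for key, value in records:
        shards[server_for(key, servers)].append((key, value))
    return shards


def distributed_count(shards):
    """Count records on each shard separately, then combine the partial results."""
    partial_counts = [len(rows) for rows in shards.values()]
    return sum(partial_counts)


records = [(f"user{i}", i) for i in range(100)]
shards = partition(records)
total = distributed_count(shards)
```

Because the hash is stable, any machine can work out where a key lives without consulting a central directory, and the per-shard counts can be computed independently before being combined.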
Data models

Structured data
• Relational data model
• A relation is a table with rows and columns
• Each relation has a schema defining the types of its
columns
• The predefined schema is static

Semi-structured data:
log file
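
A log file is semi-structured because each line follows a loose pattern rather than a fixed schema. The sketch below shows how structured fields might be pulled out of such lines. The log layout, the sample lines, and the `parse_line` helper are assumptions made for illustration.

```python
import re

# Assumed layout: "<date> <time> <LEVEL> <free-text message>"
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<message>.*)$"
)


def parse_line(line):
    """Return the structured fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


sample = [
    "2024-01-15 08:30:01 INFO user=42 action=login",
    "2024-01-15 08:30:07 ERROR payment failed for order 981",
    "this line has no recognizable structure",
]
parsed = [parse_line(line) for line in sample]
```

The date, time, and level become well-defined columns, while the message stays free text. Lines that do not fit the pattern yield `None` instead of being forced into the schema.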

Unstructured data
• Examples:
• Facebook Post
• Instagram image
• Video
• Blog
• Journal Article
•…

Characteristics of Big Data
The 5Vs of Big Data
• Information extraction and decision-making from data
characterized by the 5Vs:

• Volume

• Variety

• Velocity (speed)

• Veracity (validity)

• Value
Volume
• Big Data involves huge volumes of data generated
by sensors and machines, combined with the explosion of the Internet,
social media, e-commerce, GPS devices, etc.
• The price of data storage has decreased significantly over the last 30 years:
• From $100/GB (1980)
• To $0.10/GB (2013)
• Reliable storage solutions such as a SAN (Storage Area Network)
can be expensive.
• How can data be stored reliably and at lower cost?
• How can we navigate through this data and extract
information easily and quickly?
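
As a quick worked example of what the price drop means, the sketch below uses only the two per-gigabyte prices quoted above. The decimal unit conversions and the `storage_cost_usd` helper are assumptions for illustration.

```python
GB_PER_TB = 1_000        # decimal units: 1 TB = 1,000 GB
GB_PER_PB = 1_000_000    # 1 PB = 1,000,000 GB

# Prices in US cents per GB, taken from the figures quoted above.
CENTS_PER_GB = {1980: 10_000, 2013: 10}


def storage_cost_usd(size_gb, year):
    """Cost in US dollars of storing `size_gb` gigabytes at that year's price."""
    return size_gb * CENTS_PER_GB[year] / 100


cost_1tb_1980 = storage_cost_usd(GB_PER_TB, 1980)
cost_1pb_2013 = storage_cost_usd(GB_PER_PB, 2013)
price_ratio = CENTS_PER_GB[1980] // CENTS_PER_GB[2013]
```

With these figures, storing 1 TB in 1980 and storing 1 PB in 2013 come to the same amount ($100,000), which corresponds to a thousandfold drop in the price per gigabyte.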
Variety
• Most existing data is unstructured or semi-structured.
• Some data may seem outdated but is still useful for
certain decisions.
• These data can take complex forms because
they originate from:
• a wide range of sensors (temperature, wind speed,
humidity, revolutions per minute, brightness, ...),
• exchanged messages (emails, social media, exchanges of images,
videos, music, ...),
• texts and online publications (digital libraries, websites,
blogs, ...),
• purchase transaction records, digitized plans,
directories, information from mobile phones, etc.
• New technologies are needed to analyze and cross-reference
unstructured data (emails, photos, conversations, ...),
which represents at least 90% of the collected information.
Velocity or Speed
• Speed of data arrival
• Refers to the dynamic and/or temporal aspect of the data,
its updating and its analysis.
• Data is no longer processed and analyzed after the fact, but in
real time or near real time.
• Data is produced in continuous streams, on which
real-time decisions can be made.
• This is data, particularly from sensors, that requires
fast processing for a real-time reaction.
• When such high-velocity data generates
very large volumes, it is no longer possible to store it as is,
but only to analyze it as a stream, or even just to summarize it.
• Example
• It is not enough to know which item a customer has purchased or reserved.
• It is enough to know that a customer spent 5 minutes reading about an item in
an online store to send them an email as soon as that item
is discounted.
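
The idea of summarizing a stream instead of storing it can be sketched as follows. The `RunningSummary` class and the sample temperature readings are hypothetical; the mean and variance are maintained with Welford's online algorithm, one possible choice among many streaming summaries.

```python
import math


class RunningSummary:
    """Summarize a stream in constant memory (count, min, max, mean, variance)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, value):
        # Welford's online update: no past reading needs to be kept.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def variance(self):
        """Population variance of the readings seen so far."""
        return self._m2 / self.count if self.count else 0.0


def sensor_stream():
    """Hypothetical temperature readings arriving one at a time."""
    yield from [21.0, 21.5, 22.0, 35.0, 21.5]


summary = RunningSummary()
for reading in sensor_stream():
    summary.update(reading)
```

Each reading is discarded as soon as it has been folded into the summary, so memory use stays constant however long the stream runs.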
Value
• All data must be transformed into actionable value.
Data without value is useless.
• The aim is to achieve strategic objectives of value creation for
customers and for the company in all areas of activity.
• Value is tied to the use that can be made of big data and of its
analysis, notably from an economic point of view.
• Analyzing this data requires expertise
both in statistical and data-analysis methods and techniques
and in the application domain, for interpreting the results.

• The terms 'Data Scientist' and 'Data Science' refer to
this sought-after expertise and this emerging discipline.
Veracity or Validity
• This refers to how messy or how reliable the data is.
As quantity increases, quality and precision
are lost.
• To get meaning out of this data, we must
first clean it.
• Big Data solutions must address this by drawing on
the volume of existing data.
• Precision is needed in organizing the collection,
merging, and enrichment of data in order to:
• Remove the uncertainty that comes from the unpredictable nature of data.
• Build trust and ensure the security and integrity
of the data.
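
A cleaning step of the kind mentioned above might look like the following sketch. The record layout, the plausibility range for temperatures, and the `clean` helper are assumptions chosen for illustration; real pipelines apply domain-specific rules.

```python
def clean(records):
    """Return (kept, rejected) after normalizing, validating, and de-duplicating."""
    kept, rejected, seen = [], [], set()
    for record in records:
        sensor = str(record.get("sensor", "")).strip().lower()
        try:
            value = float(record.get("value"))
        except (TypeError, ValueError):
            rejected.append(record)  # missing or non-numeric reading
            continue
        # Assumed plausibility range for a temperature in degrees Celsius.
        if not sensor or not -50.0 <= value <= 60.0:
            rejected.append(record)
            continue
        key = (sensor, record.get("timestamp"))
        if key in seen:  # same sensor and timestamp already kept
            rejected.append(record)
            continue
        seen.add(key)
        kept.append({"sensor": sensor, "timestamp": record.get("timestamp"), "value": value})
    return kept, rejected


raw = [
    {"sensor": " T1 ", "timestamp": 1, "value": "21.5"},
    {"sensor": "t1", "timestamp": 1, "value": 21.5},   # duplicate of the first
    {"sensor": "t2", "timestamp": 1, "value": None},   # missing value
    {"sensor": "t3", "timestamp": 1, "value": 999},    # implausible reading
    {"sensor": "t4", "timestamp": 2, "value": 19},
]
kept, rejected = clean(raw)
```

Keeping the rejected records alongside the cleaned ones makes it possible to audit how much data was discarded and why.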
What companies gain
• Big Data allows organizations to store, manage, and
manipulate large amounts of data quickly and
at the right time to obtain the right information.
• Many companies are experimenting with
techniques that let them collect massive
quantities of data in order to uncover hidden patterns
in that data, which could be an early indication of an
important change.
• Certain data may indicate:
• A change in customer buying habits.
• The emergence of new opportunities for the company.
• Necessary changes in the production process.
Challenges at the company level
• Data growth leads in particular to higher
costs for equipment, software, associated maintenance,
administration, and services.

• Big Data requires a new set of skills within
the company.

• Big Data analysis projects require multidisciplinary
teams, and active collaboration must be
established between the IT department and the data scientists.
The applications of Big Data
• Big Data & predictive marketing: forecasts based on
data and probabilities.
• Real-time processing of a large volume of data:
understanding and defining customer needs and expectations.
• In public administration: extraordinary quantities
of data are accumulated in the course of delivering
public services:
• The management of social assistance and public health services,
• The issuance of passports and driving licenses,
• The management of taxes and revenues, ...

The applications of Big Data
• Blue C.R.U.S.H. (Crime Reduction Utilizing Statistical History) is
software that, with the help of cameras and
police forces, collects as much data as possible on the offenses that
occur in a given area.
• The idea is to send the police to the 'hot spots', where the
probability that a crime will occur is highest, and thus
prevent a crime before it happens.
• Since its launch 7 years ago:
• The number of murders and burglaries has decreased by 36% in Memphis.
• Motor vehicle theft has dropped by 55%.
• Department of Health and Human Services: improve the use of imaging in cancer research.
• Department of Energy: enable the collection of precise observations
of atmospheric phenomena.
Lifecycle of Big Data
• Generation
• Storage
• Analysis
• Utilization
