UNIT I: INTRODUCTION
Evolution of Big Data, Best Practices for Big Data Analytics,
Big Data Characteristics, Validating, The Promotion of the
Value of Big Data, Big Data Use Cases, Characteristics of Big
Data Applications, Perception and Quantification of Value,
Understanding Big Data Storage, A General Overview of High-
Performance Architecture, HDFS, MapReduce and YARN,
MapReduce Programming Model
UNIT II: LISTS
Creating Lists, General List Operations, List Indexing, Adding and
Deleting List Elements, Getting the Size of a List, Extended
Example: Text Concordance, Accessing List Components and
Values, Applying Functions to Lists, Data Frames, Creating Data
Frames, Accessing Data Frames, Other Matrix-Like Operations
UNIT III: FACTORS AND TABLES
Factors and Levels, Common Functions Used with Factors,
Working with Tables, Matrix/Array-Like Operations on Tables,
Extracting a Subtable, Finding the Largest Cells in a Table, Math
Functions, Calculating a Probability, Cumulative Sums and
Products, Minima and Maxima, Calculus, Functions for Statistical
Distributions R PROGRAMMING
UNIT IV: OBJECT-ORIENTED PROGRAMMING
S3 Classes, S3 Generic Functions, Writing S3 Classes, Using Inheritance, S4
Classes, Writing S4 Classes, Implementing a Generic Function on
an S4 Class, Visualization, Simulation, Code Profiling, Statistical
Analysis with R, Data Manipulation
UNIT-1
1. Evolution of Big Data Technologies
The Past: Beginning of Big Data
Traditional Databases
In the early days, data was stored using relational databases.
Worked with structured data
Limited in handling large data
Data Growth
With the rise of the internet in the 1990s, data volumes grew
rapidly.
Data came from websites and, later, social media
Most of it was unstructured and complex
Emergence of Big Data
Around 2005, the term “Big Data” became popular.
Traditional systems could not manage large data
New technologies like Apache Hadoop were introduced
The Present: Modern Big Data Technologies
Today, Big Data technologies are widely used in many
industries. Organizations use different tools to manage and
analyze large amounts of data efficiently.
Apache Hadoop
Apache Hadoop is still a key technology in Big Data. It works by
splitting data into blocks, distributing them across many machines,
and processing them in parallel.
HDFS – stores large data across multiple machines
MapReduce – processes data in parallel
YARN – manages system resources
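The MapReduce idea can be illustrated on a single machine. The sketch below is only a plain-Python imitation of the three phases (map, shuffle, reduce) for a word count; it does not use Hadoop, and the function names are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster, different lines (or blocks) would be mapped on
    # different machines; here everything runs in one process.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


sample_lines = ["big data needs big storage", "data moves fast"]
counts = word_count(sample_lines)
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is what makes the model suitable for large datasets.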
NoSQL Databases
Traditional databases are not suitable for all types of data.
NoSQL databases such as MongoDB, Cassandra, and Redis
are more flexible.
No fixed structure (schema-less)
Easy to scale horizontally
High availability and reliability
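A small sketch can make "schema-less" and "horizontal scaling" more concrete. The class below is a hypothetical in-memory store written for these notes; it is not the API of MongoDB, Cassandra, or Redis. Each shard stands in for one machine, and a hash of the key decides which shard holds a document.

```python
import hashlib


class TinyDocumentStore:
    """Hypothetical in-memory store used only to illustrate the ideas."""

    def __init__(self, num_shards=3):
        # Each dictionary stands in for one machine in a cluster.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns it.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, document):
        self.shards[self._shard_for(key)][key] = document

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)


store = TinyDocumentStore(num_shards=3)
# The two documents have different fields: no fixed schema is enforced.
store.put("user:1", {"name": "Asha", "email": "asha@example.com"})
store.put("user:2", {"name": "Ravi", "tags": ["premium"], "age": 31})
print(store.get("user:2"))
```

Simple modulo hashing like this moves most keys when a shard is added, so real systems typically rely on more elaborate schemes such as consistent hashing.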
Data Lakes
Data lakes can store all types of data—structured, semi-
structured, and unstructured. They are more flexible than
traditional data warehouses.
Examples: Amazon S3, Azure Data Lake Storage
They allow companies to store raw data for future
analysis.
The Future: Trends in Big Data
Big Data technologies will continue to grow and improve.
Several important trends will shape the future.
Artificial Intelligence (AI)
AI and machine learning help analyze large datasets
automatically. They find patterns and provide insights quickly.
Frameworks like TensorFlow and PyTorch are widely used to
build and train machine learning models.
Cloud Computing
Cloud platforms make Big Data more accessible. Companies
can store and process data without owning expensive
hardware.
Popular platforms: Amazon Web Services (AWS), Google
Cloud Platform (GCP), Microsoft Azure
Edge Computing
With the rise of IoT devices, large amounts of data are
generated continuously. Edge computing processes data closer
to where it is created.
Benefits: Faster processing, reduced delay, lower network
usage.
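One way to picture the "lower network usage" benefit: the device summarizes its raw readings locally and sends only the summary. The sketch below assumes a hypothetical temperature sensor and an alert threshold chosen purely for illustration.

```python
from statistics import mean


def summarize_at_edge(readings, alert_threshold):
    """Reduce a batch of raw readings to a small summary on the device."""
    return {
        "count": len(readings),
        "mean": round(mean(readings), 2),
        "max": max(readings),
        # Only unusual readings are forwarded individually.
        "alerts": [r for r in readings if r > alert_threshold],
    }


raw_readings = [21.0, 21.4, 22.1, 35.2, 21.8, 21.5]  # made-up sensor values
summary = summarize_at_edge(raw_readings, alert_threshold=30.0)
print(summary)
```

Instead of sending every reading to a central server, the device sends one small dictionary per batch, and the alert is detected without a round trip to the cloud.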
Explainable AI (XAI)
Explainable AI helps people understand how AI systems make
decisions. This improves trust and transparency.
2. Best Practices for Big Data Analytics
1. Define Clear Goals
Identify Objectives: Decide on the main goal before
starting (e.g., increasing sales or reducing costs).
Stay Focused: Clear goals help in selecting the right data
and tools.
Filter Data: Avoid collecting unnecessary data that does
not serve your goal.
Benefit: Saves significant time, effort, and resources.
2. Use Scalable Technologies
Volume Handling: Big data requires tools that can
manage massive amounts of information.
Key Tools:
o Apache Hadoop: Used for storing and processing large
datasets.
o Apache Spark: Often much faster than Hadoop MapReduce
because it keeps working data in memory.
Benefit: These tools allow systems to grow smoothly as
data increases.
3. Ensure Data Quality
Data Cleaning: Remove errors and duplicate entries.
Complete Records: Handle missing or incomplete
information properly.
Consistency: Maintain a uniform format and structure
across all records.
Benefit: High-quality data leads to accurate results and
better business decisions.
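The three data quality steps above (cleaning, handling missing values, enforcing a consistent format) can be sketched in a few lines. The record fields and the "UNKNOWN" placeholder are assumptions made for this example, not a standard.

```python
def clean_records(records):
    """Deduplicate, normalize formats, and fill missing values."""
    seen_ids = set()
    cleaned = []
    for record in records:
        record_id = record.get("id")
        # Data cleaning: drop records without an id and repeated ids.
        if record_id is None or record_id in seen_ids:
            continue
        seen_ids.add(record_id)
        # Consistency: one format for names and country codes.
        name = (record.get("name") or "").strip().title()
        # Complete records: replace a missing country with a placeholder.
        country = (record.get("country") or "UNKNOWN").strip().upper()
        cleaned.append({"id": record_id, "name": name, "country": country})
    return cleaned


raw = [
    {"id": 1, "name": "  alice smith ", "country": "in"},
    {"id": 1, "name": "Alice Smith", "country": "IN"},  # duplicate id
    {"id": 2, "name": "BOB JONES", "country": None},    # missing country
    {"id": None, "name": "no id", "country": "us"},     # unusable record
]
cleaned = clean_records(raw)
print(cleaned)
```

Whether a missing value should be filled, flagged, or cause the record to be dropped depends on the goal of the analysis; the placeholder here is just one option.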
4. Choose the Right Storage
Data Lakes: Store raw and unstructured data in systems
like Amazon S3.
Data Warehouses: Store structured, analysis-ready data in
systems like Google BigQuery.
Selection: Choose storage based on the specific data
type and how it will be used.
Benefit: Proper storage improves both speed and
operational efficiency.
5. Implement Security and Governance
Access Control: Allow data access only to authorized
users.
Encryption: Protect sensitive data by encrypting it both at
rest and in transit.
Compliance: Follow legal rules and regulations for data
usage.
Benefit: Prevents data loss, misuse, and security
breaches.
6. Automate Workflows
Automation Tools: Use platforms like Apache Airflow to
manage tasks.
Scheduling: Set data processing jobs to run
automatically at specific times.
Consistency: Reduce manual work and the risk of human
error.
Benefit: Ensures smooth, consistent, and reliable
operations.
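Workflow tools generally describe a pipeline as tasks plus the dependencies between them, then run the tasks in a valid order. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9 or later). It is not Apache Airflow's API, and the three task functions are placeholders invented for this example.

```python
from graphlib import TopologicalSorter


def extract(context):
    context["rows"] = [3, 1, 2]  # stand-in for reading from a source


def transform(context):
    context["rows"] = sorted(context["rows"])


def load(context):
    context["loaded"] = list(context["rows"])  # stand-in for writing out


# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
tasks = {"extract": extract, "transform": transform, "load": load}


def run_pipeline(dependencies, tasks):
    """Run every task once, in an order that respects the dependencies."""
    context = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name](context)
    return order, context


order, context = run_pipeline(dependencies, tasks)
print(order, context["loaded"])
```

A scheduler (for example cron, or a workflow platform) would then call something like `run_pipeline` at fixed times, which removes the manual step and the chance of running tasks in the wrong order.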
7. Focus on Insights and Visualization
Visual Aids: Present results using clear charts, graphs,
and dashboards.
Simplicity: Make complex data easy for all users to
understand.
Actionable Data: Help decision-makers take quick,
informed actions.
Benefit: Clear visualization improves how insights are
communicated and used.
3. Big Data Characteristics
Big Data refers to massive, complex datasets that exceed
the processing capabilities of traditional database
systems.
It is defined by the 5 Vs (Volume, Velocity, Variety,
Veracity, and Value), which describe its scale, speed,
diversity, reliability, and usefulness.
1. Volume (Amount of Data)
Volume refers to the sheer amount of data.
Companies like Netflix and YouTube generate huge amounts
of data every day.
This data can reach petabytes (1 petabyte = 1,000 terabytes).
Tools like Apache Hadoop and Apache Spark are used to
handle this data.
2. Veracity (Data Quality)
Veracity means how correct and reliable the data is.
Big data may have errors, missing values, or wrong
information.
For example, medical data must be very accurate. Data is
cleaned and checked to improve quality.
3. Velocity (Speed of Data)
Velocity means the speed at which data is created and
processed.
Data from social media, sensors, and banking comes very
fast.
Tools like Apache Kafka help process data in real time. This
is useful for quick actions like fraud detection.
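To show what "processing data as it arrives" can look like, here is a minimal sketch of a velocity check for fraud detection: a card is flagged when it makes too many transactions inside a short time window. The class, thresholds, and events are hypothetical, and the events come from a plain list rather than from Apache Kafka.

```python
from collections import defaultdict, deque


class VelocityCheck:
    """Flag a card that makes too many transactions in a short window."""

    def __init__(self, max_events=3, window_seconds=60):
        self.max_events = max_events
        self.window_seconds = window_seconds
        # card id -> timestamps of its recent transactions
        self.recent = defaultdict(deque)

    def observe(self, card_id, timestamp):
        """Process one event on arrival; return True if it is suspicious."""
        window = self.recent[card_id]
        window.append(timestamp)
        # Evict timestamps that have fallen out of the time window.
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        return len(window) > self.max_events


# (card id, time in seconds); made-up events in arrival order
events = [
    ("card-A", 0), ("card-A", 10), ("card-B", 12),
    ("card-A", 20), ("card-A", 25), ("card-A", 200),
]
checker = VelocityCheck(max_events=3, window_seconds=60)
flags = [checker.observe(card, ts) for card, ts in events]
print(flags)
```

Each event is judged the moment it arrives, using only a small amount of recent state, which is the key difference from batch processing where all data is collected first.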
4. Variety (Different Types of Data)
Variety means different types of data. Data can be:
Structured (tables, databases)
Semi-structured (JSON, XML)
Unstructured (text, images, videos)
Example: Healthcare data includes reports, images, and notes.
Systems must handle all these formats together.
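The three categories can be seen side by side in a short sketch that reads a CSV row (structured), a JSON document and an XML fragment (semi-structured), and a free-text note (unstructured), then combines them into one record. All sample data and field names below are invented for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

structured = "patient_id,age\n101,54\n"                          # CSV table
semi_json = '{"patient_id": 101, "allergies": ["penicillin"]}'   # JSON
semi_xml = "<report patient_id='101'><finding>normal</finding></report>"
unstructured = "Patient reports mild headache since Monday."     # free text


def build_patient_view():
    """Combine the four sample inputs into a single dictionary."""
    row = next(csv.DictReader(io.StringIO(structured)))
    doc = json.loads(semi_json)
    report = ET.fromstring(semi_xml)
    return {
        "patient_id": int(row["patient_id"]),
        "age": int(row["age"]),
        "allergies": doc["allergies"],
        "finding": report.findtext("finding"),
        # Free text has no fields, so only a simple measure is taken here.
        "note_word_count": len(unstructured.split()),
    }


view = build_patient_view()
print(view)
```

Structured and semi-structured inputs can be parsed directly into fields, while unstructured text usually needs further processing (for example text mining) before it yields usable values.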
5. Value (Usefulness of Data)
Value means how useful the data is.
Data is only helpful if it gives meaningful insights.
Companies use data to understand customers and
improve services. Without value, data is just stored
information that serves no purpose.
Benefits
Improved Decision Making
Helps organizations make better decisions using accurate
insights.
Better Data Quality (Veracity)
Ensures data is reliable and reduces errors.
Real-Time Processing (Velocity)
Enables fast analysis and quick actions.
Handling Different Data Types (Variety)
Supports structured, semi-structured, and unstructured data.
Business Value (Value)
Helps companies improve performance and gain profit.
Unit-2