BDA Unit - 1
Big Data and Its Importance
Introduction to Big Data:
Big Data refers to very large and complex datasets that cannot be handled using traditional data processing tools. This
data comes from many sources such as social media, sensors, mobile devices, machines, and more.
Big Data is usually described using 5 V's (a sixth, Variability, is often added):
1. Volume – The huge amount of data (terabytes, petabytes).
2. Velocity – The speed at which data is generated (real-time data).
3. Variety – Different types of data (structured, semi-structured, unstructured).
4. Veracity – Uncertainty and trustworthiness of the data.
5. Value – Useful insights that can be gained from the data.
Types of Big Data:
1. Structured Data – Stored in tables/databases (e.g., sales data).
2. Unstructured Data – Images, videos, social media posts, etc.
3. Semi-structured Data – JSON, XML files (partially organized).
Sources of Big Data:
Social Media (Facebook, Twitter, Instagram)
Internet of Things (smart devices, sensors)
E-commerce websites
Online transactions
Medical records
Mobile applications
Importance of Big Data:
1. Better Decision Making:
Big Data helps organizations make data-driven decisions by analyzing trends and patterns.
2. Customer Insights:
Companies understand customer behavior, preferences, and feedback, allowing them to improve products and
services.
3. Cost Reduction:
Big Data technologies like Hadoop can store large volumes of data at lower cost using commodity hardware.
4. Real-Time Processing:
Organizations can monitor events in real time (e.g., fraud detection, traffic control).
5. Innovation and Product Development:
Companies can design new products based on customer needs identified from Big Data analytics.
6. Healthcare Advancements:
Big Data is used for disease prediction, treatment planning, and analyzing patient records.
7. Business Growth:
Helps businesses gain competitive advantage by understanding market trends and customer expectations.
8. Risk Management:
Identifies potential risks and prepares solutions using predictive analytics.
6 V’s of Big Data
Introduction:
Big Data is often defined using six key characteristics, known as the 6 V's. These 6 V’s explain what makes data “big”
and why it needs special tools and techniques to handle.
1. Volume (Amount of Data):
Meaning: Volume refers to the huge size of data generated from various sources.
Data is generated in terabytes, petabytes, or even zettabytes.
Traditional systems cannot store or process this much data.
Examples:
Facebook is often reported to generate about 4 petabytes of data per day.
Users upload more than 500 hours of video to YouTube every minute.
Why it matters: To store and manage this huge volume, we need systems like HDFS in Hadoop or cloud storage
platforms.
2. Velocity (Speed of Data Generation):
Meaning: Velocity is the speed at which data is generated and processed.
Some data comes in real-time and needs to be handled immediately.
Examples:
Stock market data updates every second or millisecond.
Social media posts, likes, and comments are generated every second.
Why it matters: Real-time data needs tools like Apache Kafka or Flume to ingest it as it arrives, and stream processing
engines like Apache Spark or Storm to process it with low latency.
3. Variety (Types of Data):
Meaning: Variety means Big Data comes in different formats – not just structured tables.
It includes structured, semi-structured, and unstructured data.
Examples:
Structured: Excel sheets, database tables.
Semi-structured: JSON, XML files.
Unstructured: Videos, images, audio, text messages.
Why it matters: Tools like Hive and NoSQL databases such as HBase help to manage these different data formats.
4. Veracity (Trustworthiness of Data):
Meaning: Veracity refers to the quality, accuracy, and reliability of data.
Some data may be incomplete, duplicated, or wrong.
Examples:
In social media data, fake accounts or spam comments reduce data accuracy.
In IoT sensor data, faulty sensors may send incorrect readings.
Why it matters: We need data cleaning and preprocessing techniques to remove errors and ensure high-quality results
from Big Data.
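The cleaning step described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library routine: the function name clean_readings, the (sensor_id, timestamp, value) record layout, and the valid range of -40 to 85 are all assumptions made for the example.

```python
def clean_readings(readings, low=-40.0, high=85.0):
    """Drop duplicate, missing, and out-of-range sensor readings.

    The record layout and the valid range are assumptions for this sketch.
    """
    seen = set()
    cleaned = []
    for sensor_id, timestamp, value in readings:
        key = (sensor_id, timestamp)
        if key in seen:
            continue  # duplicate of a reading we already kept
        if value is None:
            continue  # missing value
        if not (low <= value <= high):
            continue  # implausible value, e.g. from a faulty sensor
        seen.add(key)
        cleaned.append((sensor_id, timestamp, value))
    return cleaned


raw = [
    ("s1", 1, 21.5),
    ("s1", 1, 21.5),   # duplicate
    ("s2", 1, None),   # missing
    ("s3", 1, 999.0),  # faulty sensor
    ("s2", 2, 22.1),
]
cleaned = clean_readings(raw)
print(cleaned)
```

Only the two valid, unique readings survive. Real pipelines usually apply the same idea at scale with tools such as Spark.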
5. Value (Usefulness of Data):
Meaning: Not all data is useful. Value refers to the actual benefit or insights we get from analyzing Big Data.
Raw data is not helpful unless it is processed to extract meaning.
Examples:
Analyzing customer reviews helps improve products.
Sales data analysis helps in making better business strategies.
Why it matters: Big Data tools like Spark, Hadoop, and machine learning algorithms help turn raw data into valuable
business insights.
6. Variability (Inconsistency in Data):
Meaning: Variability means that data meaning or behavior keeps changing over time or in different situations.
Data is inconsistent and unpredictable.
Examples:
A trending topic on Twitter can change every few minutes.
The same word can mean different things in different contexts (e.g., "Java" = coffee or programming language).
Why it matters: To handle variability, Big Data systems must be flexible and adaptive in real-time.
Challenges of Big Data
Introduction:
Big Data deals with extremely large and complex datasets. While it offers many benefits like better decision-making and
improved services, it also comes with several challenges in terms of storage, processing, security, and analysis.
Here are the main challenges of Big Data:
1. Data Storage and Management:
Big Data involves massive volumes of data that traditional storage systems cannot handle.
Need for distributed file systems like HDFS to store large data.
Example: Social media platforms generate terabytes of data every day.
2. Data Processing Speed:
High velocity of data makes it difficult to process in real-time.
Requires powerful processing tools like Apache Spark or Storm.
Example: Fraud detection in banking requires immediate processing of transactions.
3. Data Variety:
Big Data comes in many forms: structured, unstructured, and semi-structured.
Managing and analyzing videos, images, texts, logs together is complex.
Example: A health monitoring system receives data from sensors (structured), reports (semi-structured), and X-ray
images (unstructured).
4. Data Quality and Veracity:
Big Data often contains incomplete, inconsistent, or inaccurate data.
Ensuring data cleanliness and accuracy is a major challenge.
Example: Misinformation on social media can affect sentiment analysis.
5. Data Security and Privacy:
Sensitive data like personal details or financial records need protection.
Ensuring data encryption, access control, and compliance is crucial.
Example: Healthcare data must follow privacy laws like HIPAA.
6. Lack of Skilled Professionals:
There is a shortage of people skilled in data science, big data tools, and analytics.
Organizations need experts to handle data platforms like Hadoop, Spark, Kafka.
7. Integration of Data from Different Sources:
Data may come from multiple sources in different formats.
Integrating it for unified analysis is difficult.
Example: Combining data from CRM, website logs, and IoT devices for customer behavior analysis.
Drivers for Big Data
Introduction:
Drivers of Big Data are the key factors responsible for the rapid growth, need, and adoption of Big Data technologies in
the world today. These drivers explain why companies, governments, and individuals depend on Big Data to manage
huge, fast, and varied data.
1. The Digitization of Society (Rapid Growth of Data)
In today’s world, almost everything is digital—shopping, banking, education, and healthcare.
This digitization leads to huge amounts of data generation every second.
Sources include mobile apps, websites, online forms, emails, etc.
Traditional systems cannot handle this volume of data effectively.
Example: Online learning platforms generate data from student login times, progress tracking, and video views.
2. The Plummeting of Technology Costs (Cheaper & Faster Computing Power)
Hardware and storage have become more affordable than ever.
Cloud computing platforms (like AWS, Azure) allow businesses to rent powerful computing resources at low cost.
Distributed systems make it possible to process huge datasets across clusters of machines.
Cost-effective computing encourages even small companies to adopt Big Data tools.
Example: Startups use cloud services like Google BigQuery to analyze user data without buying expensive servers.
3. Connectivity Through Cloud Computing (Increased Storage Capabilities)
Cloud computing provides scalable and flexible storage for massive datasets.
Tools like HDFS and Amazon S3 allow storage and retrieval of data across distributed systems.
Cloud also enables anytime, anywhere access to data.
It supports collaborative work and fast data sharing across the globe.
Example: A global company stores customer data in the cloud to be accessed by teams in different countries.
4. Increased Knowledge About Data Science (Advanced Analytics & Machine Learning)
Businesses now understand the value of analyzing data to improve decisions.
Big Data allows predictive analytics, machine learning, and real-time dashboards.
Data scientists use data to detect fraud, optimize marketing, and personalize experiences.
Growing skills in AI and analytics drive Big Data adoption.
Example: Banks use machine learning to predict loan defaulters by analyzing customer transaction patterns.
5. Social Media Applications (Growth of Internet & Social Media)
Social media apps like Facebook, Instagram, Twitter, and YouTube generate billions of posts, likes, comments, and
videos.
This user-generated data is unstructured and massive.
Businesses analyze this data for trends, customer opinions, and brand performance.
This explosion in social media activity is a major driver of Big Data.
Example: A company analyzes Twitter comments about its new product to understand customer sentiment.
6. The Rise of the Internet of Things (IoT)
IoT devices like smart TVs, fitness bands, industrial sensors, and home assistants are everywhere.
These devices generate real-time data continuously.
Big Data systems are needed to store, manage, and analyze this massive streaming data.
IoT and Big Data work together to enable automation, alerts, and efficiency.
Example: A smart factory uses IoT sensors to monitor equipment and detect issues before breakdowns happen.
Importance of Big Data in Business
1. Data-Driven Decision Making:
Big Data helps businesses make decisions based on facts and data instead of guesswork.
It provides real-time and historical insights that guide strategy.
Example: A company can decide where to open a new store based on customer footfall and sales data from different
locations.
2. Customer Insights and Personalization:
Big Data helps companies understand customer behavior, needs, and preferences.
Businesses can then offer personalized recommendations, discounts, or products.
Example: Netflix uses viewing history to suggest shows each user might like.
3. Operational Efficiency and Optimization:
Companies use Big Data to identify delays, reduce waste, and improve processes.
It helps in automating repetitive tasks and increasing productivity.
Example: A factory can use sensor data to monitor machine health and schedule maintenance before breakdowns
happen.
4. Risk Management and Fraud Detection:
Big Data tools can detect unusual patterns or transactions that may signal fraud or errors.
It also helps in predicting and preparing for future risks.
Example: Banks use Big Data to detect fraud in credit card transactions in real-time.
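One simple way to flag an "unusual" transaction is a z-score check against a customer's past amounts. The sketch below is only an illustration of that idea; the function name is_unusual, the threshold of 3 standard deviations, and the sample amounts are assumptions, and production fraud systems use far richer models.

```python
from statistics import mean, pstdev


def is_unusual(amount, history, threshold=3.0):
    """Return True if amount is more than `threshold` standard
    deviations away from the mean of past amounts (assumed rule)."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return amount != mu  # any deviation from a constant history
    return abs(amount - mu) / sigma > threshold


history = [20, 25, 22, 19, 24]   # hypothetical past card payments
print(is_unusual(500, history))  # far outside the usual range
print(is_unusual(23, history))   # close to the usual range
```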
5. Supply Chain Optimization:
Big Data improves supply chain by tracking inventory, shipments, and demand patterns.
It helps reduce costs, delays, and stock-outs.
Example: Amazon uses real-time data to manage its vast supply chain across warehouses and delivery partners.
6. Market Research and Competitor Analysis:
Businesses use Big Data to analyze market trends, customer feedback, and competitor performance.
This helps them stay ahead in the competition.
Example: A smartphone company monitors online reviews and competitors’ pricing strategies to plan its next product
launch.
7. Predictive Analytics for Future Trends:
Big Data can forecast future trends, customer behavior, and demand patterns.
Helps companies to plan resources, campaigns, and strategies ahead of time.
Example: Retailers predict which products will be in high demand during the holiday season using past sales data.
8. Enhanced Customer Service:
Businesses use Big Data to understand common customer issues and respond faster.
Chatbots and customer support systems use data to provide better help.
Example: E-commerce platforms track delivery issues and refund patterns to improve customer satisfaction.
9. Innovation and Product Development:
Companies use insights from Big Data to develop new products or improve existing ones.
Customer feedback, usage data, and complaints guide product design.
Example: Smartphone companies use app usage data and hardware feedback to improve the next model.
10. Compliance and Security:
Big Data helps companies track and ensure compliance with government laws and data regulations.
It also supports data security, access control, and audit trails.
Example: Health apps must follow data privacy laws like HIPAA, and Big Data tools help track and ensure compliance.
Introduction to Big Data Analytics
Introduction:
Big Data Analytics is the process of examining large and complex datasets to uncover hidden patterns, unknown
correlations, trends, and useful insights. It helps businesses, governments, healthcare, and many sectors to make better
decisions using data.
Big Data Analytics is important because:
It helps in understanding customer behavior.
It supports real-time decision making.
It increases efficiency and reduces costs.
Big Data Analytics is divided into four main types (descriptive, diagnostic, predictive, and prescriptive), and there are
also special types like Text Analytics and Spatial Analytics.
1. Descriptive Analytics: "What happened?"
Meaning: Descriptive analytics summarizes past data to show what has happened.
It uses reports, charts, graphs, and dashboards to provide insights.
Example:
A company uses descriptive analytics to find monthly sales totals.
A website uses it to see how many users visited each day.
Tools used: Excel, Tableau, Power BI, Hadoop (with Hive)
Why it's important: It helps organizations understand past performance and behavior.
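The monthly sales example can be sketched as a simple group-and-sum. The record layout (ISO date string, amount) and the sample figures are assumptions for illustration.

```python
from collections import defaultdict


def monthly_totals(records):
    """Sum sale amounts per month; dates are 'YYYY-MM-DD' strings."""
    totals = defaultdict(float)
    for date, amount in records:
        month = date[:7]          # 'YYYY-MM'
        totals[month] += amount
    return dict(totals)


sales = [
    ("2024-01-05", 120.0),
    ("2024-01-20", 80.0),
    ("2024-02-03", 200.0),
]
print(monthly_totals(sales))
```

The same summary is what a Hive GROUP BY query or a dashboard tool would compute over much larger data.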
2. Diagnostic Analytics: "Why did it happen?"
Meaning: Diagnostic analytics looks at past data to find the reason or cause of events or trends.
It goes deeper than descriptive analytics using drill-down and correlation analysis.
Example:
If sales dropped last month, diagnostic analytics will help find why – maybe due to poor marketing or increased
competition.
Tools used: SQL, Python, R, Hadoop (with Pig or Hive)
Why it's important: It helps businesses identify root causes of problems and fix them.
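Correlation analysis, mentioned above, can be illustrated with the Pearson coefficient computed by hand. The sample figures (marketing spend versus sales) are made up for the sketch, and a strong correlation only suggests a cause worth investigating; it does not prove one.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


marketing_spend = [10, 20, 30, 40]   # hypothetical monthly spend
sales = [110, 205, 290, 405]         # hypothetical monthly sales
r = pearson(marketing_spend, sales)
print(round(r, 3))
```

A value of r close to 1 means sales rose and fell together with marketing spend in this sample.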
3. Predictive Analytics: "What will happen?"
Meaning: Predictive analytics uses past data and statistical models to predict future outcomes.
It uses machine learning, data mining, and forecasting techniques.
Example:
E-commerce websites predict what products a customer is likely to buy next.
Banks use it to predict if a customer might default on a loan.
Tools used: Python, R, Apache Spark, TensorFlow, Scikit-learn
Why it's important: It helps in planning ahead and preparing for future events.
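As a minimal sketch of forecasting from past data, the code below fits a straight line by least squares and extends it one month ahead. The sales figures are invented, and a linear trend is an assumption; real predictive models are usually built with libraries such as Scikit-learn.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


months = [1, 2, 3, 4]
sales = [100, 120, 140, 160]          # hypothetical past sales
slope, intercept = fit_line(months, sales)
forecast = slope * 5 + intercept      # prediction for month 5
print(forecast)
```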
4. Prescriptive Analytics: "What should be done?"
Meaning: Prescriptive analytics suggests actions or solutions to take in order to get the desired outcome.
It uses AI, optimization algorithms, and simulations.
Example:
A delivery company uses prescriptive analytics to find the best route for its drivers.
Hospitals use it to recommend treatment plans for patients.
Tools used: IBM Watson, SAS, MATLAB, Apache Mahout
Why it's important: It helps in decision-making by suggesting the best course of action.
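The delivery-route example is an optimization problem. For a handful of stops it can be solved by trying every order, as sketched below. The stop names and distances are hypothetical, and brute force is only practical for very small inputs; real systems use specialized optimization algorithms.

```python
from itertools import permutations

# Hypothetical symmetric distances (km) between a depot and three stops.
distances = {
    ("depot", "A"): 4, ("depot", "B"): 7, ("depot", "C"): 3,
    ("A", "B"): 2, ("A", "C"): 5, ("B", "C"): 6,
}


def dist(a, b):
    """Look up a distance regardless of the order of the two places."""
    return distances[(a, b)] if (a, b) in distances else distances[(b, a)]


def best_route(stops, start="depot"):
    """Try every visiting order and return (length, path) of the shortest
    round trip that starts and ends at `start`."""
    best = None
    for order in permutations(stops):
        path = (start,) + order + (start,)
        length = sum(dist(a, b) for a, b in zip(path, path[1:]))
        if best is None or length < best[0]:
            best = (length, path)
    return best


length, path = best_route(["A", "B", "C"])
print(length, path)
```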
5. Text Analytics (Text Mining):
Meaning: Text analytics is the process of analyzing unstructured text data to extract meaningful insights.
This includes emails, social media posts, reviews, and comments.
Example:
Analyzing customer reviews to find common complaints or satisfaction.
Monitoring tweets to track public opinion on a product or brand.
Tools used: NLP tools, Python (NLTK, spaCy), Apache Lucene
Why it's important: A large amount of Big Data is in text format, and analyzing it helps in understanding customer
sentiment.
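A very small sketch of text analytics is counting which terms appear most often in customer reviews. The reviews and the stopword list below are made up for illustration; libraries such as NLTK or spaCy provide proper tokenization and stopword handling.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {"the", "is", "was", "and", "a", "but", "too", "very", "it"}


def top_terms(reviews, n=3):
    """Return the n most frequent non-stopword terms across all reviews."""
    words = []
    for review in reviews:
        tokens = re.findall(r"[a-z']+", review.lower())
        words.extend(t for t in tokens if t not in STOPWORDS)
    return Counter(words).most_common(n)


reviews = [
    "The battery is weak",
    "Battery drains fast and the screen is great",
    "Weak battery, great screen",
]
print(top_terms(reviews))
```

Here "battery" stands out as the most discussed term, which points an analyst to where the complaints are.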
6. Spatial Analytics (Geospatial Analytics):
Meaning: Spatial analytics is used to analyze data related to location, maps, or geography.
It combines location data with other types of data for analysis.
Example:
Tracking the spread of diseases using patient location data.
Ride-sharing apps like Uber use it to match drivers with passengers based on their locations.
Tools used: GIS tools, ArcGIS, QGIS, Google Maps API
Why it's important: Location-based analysis helps in urban planning, disaster management, and logistics.
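Matching a passenger to the closest driver can be sketched with the haversine formula, which gives the great-circle distance between two latitude/longitude points. The coordinates and driver names are hypothetical, and straight-line distance is a simplification; real ride-sharing systems also consider roads and traffic.

```python
from math import radians, sin, cos, asin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, using a mean Earth radius of 6371 km."""
    r = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (
        sin(dlat / 2) ** 2
        + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    )
    return 2 * r * asin(sqrt(a))


def nearest_driver(passenger, drivers):
    """Return the name of the driver closest to the passenger."""
    return min(
        drivers,
        key=lambda name: haversine_km(*passenger, *drivers[name]),
    )


passenger = (12.9716, 77.5946)                 # hypothetical (lat, lon)
drivers = {"d1": (12.9750, 77.6000), "d2": (13.0500, 77.7000)}
print(nearest_driver(passenger, drivers))
```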
Components of Big Data Analytics
Introduction:
Big Data Analytics involves a series of steps that help in collecting, storing, processing, analyzing, and visualizing large
and complex data to generate valuable insights. These steps are known as the main components of Big Data Analytics.
Let’s understand each component in detail:
1. Data Collection:
This is the first step, where data is gathered from various sources.
Big Data can come from:
Social media platforms
IoT sensors
Web applications
Databases
Mobile devices
Key tools:
Apache Flume (for streaming data)
Sqoop (for data from databases)
Kafka (for real-time messaging)
Example: A retail company collects data from online orders, customer feedback, and in-store purchases.
2. Data Storage:
After collection, data needs to be stored in a way that supports scalability and reliability.
Since Big Data is huge, traditional systems are not enough. So, distributed storage systems are used.
Key technologies:
HDFS (Hadoop Distributed File System)
Amazon S3
NoSQL Databases (like MongoDB, HBase)
Example: A company may use HDFS to store petabytes of customer data across multiple nodes.
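To see how distributed storage spreads data over nodes, the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. The 128 MB block size and replication factor of 3 match HDFS defaults, but the round-robin placement and the node names are simplifying assumptions; HDFS itself uses a rack-aware placement policy.

```python
def place_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Map each block index to the list of nodes holding a replica.

    Placement is plain round-robin, which is a simplification and not
    HDFS's actual policy. Assumes replication <= len(nodes).
    """
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


nodes = ["node1", "node2", "node3", "node4"]   # hypothetical cluster
layout = place_blocks(300, nodes)
for block, replicas in layout.items():
    print(block, replicas)
```

Because every block lives on more than one node, losing a single node does not lose data.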
3. Data Processing:
In this step, the raw data is cleaned, filtered, and transformed to make it usable.
Data may contain duplicates, errors, or missing values which need to be fixed.
Also includes processing real-time or batch data.
Key tools:
Apache Spark (fast, in-memory processing)
MapReduce (batch processing)
Apache Storm (real-time processing)
Example: An e-commerce site processes order logs to detect fraud or failed transactions.
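The batch model behind MapReduce can be imitated in plain Python: a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step combines each group. This is only a single-machine sketch of the idea, not the Hadoop API; the log format "order_id,status" and the function names are assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (status, 1) for one log line of the form 'order_id,status'."""
    _order_id, status = line.split(",")
    return [(status, 1)]


def shuffle(pairs):
    """Group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the values for one key into a single count."""
    return key, sum(values)


logs = ["1001,ok", "1002,failed", "1003,ok", "1004,ok"]
mapped = [pair for line in logs for pair in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(counts)
```

On a real cluster the map and reduce steps run in parallel on many nodes, which is what makes the model scale.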
4. Data Analysis:
This component is about using analytical methods to discover patterns, correlations, and insights.
It includes:
Descriptive Analytics (what happened)
Diagnostic Analytics (why it happened)
Predictive Analytics (what will happen)
Prescriptive Analytics (what to do)
Key tools:
Python, R, Hive, Pig, Mahout, Machine Learning libraries
Example: A telecom company uses analytics to predict which customers might cancel their service.
5. Data Visualization:
Final results of analysis are displayed using charts, graphs, dashboards, or reports.
Helps non-technical users understand the data clearly and make decisions.
Good visualization brings out trends and outliers in the data.
Key tools:
Tableau, Power BI, QlikView, Grafana
Example: A marketing team uses dashboards to see the performance of different ad campaigns across regions.
Applications of Big Data Analytics
1. Business Intelligence & Decision Making
Big Data helps companies make smart, data-driven decisions.
It collects and analyzes data from various departments like sales, HR, and finance.
It helps track performance, customer behavior, and market trends.
Real-time dashboards provide updated reports and KPIs for managers.
Example: A company uses Big Data to decide the best location for opening a new store based on customer footfall and
sales trends.
2. Healthcare Revolution
Hospitals and clinics use Big Data to analyze patient records, treatment history, and medical images.
It helps in predicting diseases early using machine learning.
Big Data supports personalized treatments and remote health monitoring.
It improves healthcare efficiency and reduces readmission rates.
Example: Wearable devices like Fitbit collect patient vitals, and Big Data analyzes them to alert doctors in real time.
3. Financial Sector Optimization
Big Data helps banks detect fraudulent transactions and unusual activity.
It supports risk management and credit scoring.
Used for analyzing spending behavior, loan defaults, and customer segmentation.
Helps in personalized banking and investment recommendations.
Example: A bank uses Big Data to approve loans instantly after analyzing the applicant’s income, credit score, and
spending patterns.
4. E-Commerce & Customer Experience
Big Data tracks user behavior, preferences, and browsing history.
It powers personalized recommendations (e.g., “You may also like…”).
Improves customer engagement through targeted advertisements.
Helps companies understand feedback and product reviews.
Example: Amazon recommends products based on previous purchases and search history using Big Data algorithms.
5. Supply Chain Optimization
Companies use Big Data to monitor inventory levels, supplier data, and logistics.
It helps reduce delays, stock-outs, and overstocking.
Predicts demand trends and automates ordering systems.
Supports real-time tracking of shipments and deliveries.
Example: Walmart uses Big Data to manage its inventory in real time across thousands of stores worldwide.
6. Smart Cities & Urban Planning
Big Data helps in traffic management, energy usage, waste management, and public safety.
Sensors collect data from roads, buildings, and transport systems.
Governments use this data for planning infrastructure and public services.
Improves the quality of life for citizens.
Example: Smart traffic lights in cities adjust timing based on real-time traffic data to reduce congestion.
7. Education & Learning Analytics
Big Data is used to track student performance, attendance, and learning behavior.
It helps identify students who need extra support or are at risk.
Supports personalized learning paths for different types of learners.
Enables institutions to improve teaching strategies and content.
Example: Online learning platforms like Coursera suggest courses based on a student’s past activity and performance.
8. Predictive Maintenance in Industry
Machines and equipment are fitted with sensors to collect operational data.
Big Data predicts when a machine might fail before it actually does.
Reduces downtime, repair costs, and accidents.
Helps in scheduling maintenance at the right time.
Example: An airline company uses predictive analytics to maintain aircraft engines before failure occurs, ensuring safety
and cost savings.
Classification of Digital Data
Introduction:
Digital data is the core of Big Data. Based on how it is organized and stored, digital data is classified into three main types:
👉 Structured data
👉 Unstructured data
👉 Semi-structured data
Each type of data requires different tools and methods for storage, processing, and analysis.
1. Structured Data
This type of data is highly organized and stored in tables (rows and columns).
It follows a fixed format and is easy to enter, query, and analyze using traditional databases (RDBMS).
It is mostly text or numbers stored in databases.
Examples:
Bank transaction records
Employee details in an Excel sheet
Student database with roll number, name, and marks
Sales data in rows and columns
Tools used:
SQL, Oracle, MySQL, Microsoft Excel
✅ Advantages: Easy to store, manage, and analyze
❌ Limitations: Can’t handle complex or multimedia data
2. Unstructured Data
This type of data does not follow a fixed structure or format.
It is hard to store in traditional databases.
Includes text, images, videos, audio, social media posts, etc.
It needs advanced tools like machine learning and natural language processing to extract meaning.
Examples:
YouTube videos
WhatsApp chat messages
Tweets, Facebook posts
Emails with attachments
CCTV footage
Tools used:
Hadoop, Apache Spark, NoSQL databases, NLP tools (like NLTK), AI-based tools
✅ Advantages: Rich in insights, found in real-world applications
❌ Limitations: Harder to process and analyze
3. Semi-Structured Data
It is partially organized – not as strict as structured data but not completely messy like unstructured data.
It has tags or markers to separate data items, but does not follow a fixed schema like a table.
Easy to store in NoSQL databases.
Examples:
XML and JSON files
Emails (subject, to, from – structured; message body – unstructured)
Log files from servers
Sensor data with time and reading
Tools used:
MongoDB, Apache Hive, HBase, XML parsers
✅ Advantages: Flexible format, supports different types of data
❌ Limitations: Requires special tools to manage
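The difference between the three types shows up in how each is read in code. The sketch below handles one small sample of each using only the Python standard library; all sample data is invented for illustration.

```python
import csv
import io
import json
import re

# Structured: fixed columns, so every row can be read the same way.
csv_text = "roll_no,name,marks\n101,Asha,88\n102,Ravi,73\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total_marks = sum(int(row["marks"]) for row in rows)

# Semi-structured: fields are tagged, but records may have different keys.
json_text = '[{"id": 1, "tags": ["sale"]}, {"id": 2, "note": "gift"}]'
records = json.loads(json_text)
keys_per_record = [sorted(record.keys()) for record in records]

# Unstructured: free text has no fields, so meaning must be extracted.
review = "Delivery was late but the phone is great. Battery is great too."
great_count = len(re.findall(r"\bgreat\b", review.lower()))

print(total_marks, keys_per_record, great_count)
```

The structured sample needs no interpretation, the semi-structured one needs a check for which keys exist, and the unstructured one needs pattern matching or NLP.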