
Big Data Analytics: Challenges & Insights

The document discusses Big Data, its challenges, and the importance of Big Data Analytics. It covers various concepts including the CAP theorem, NewSQL, Hadoop ecosystem components, and NoSQL databases. Additionally, it highlights the characteristics of data and the differences between traditional business intelligence and Big Data analytics.

Uploaded by

shahinmulla851

BIG DATA ANALYTICS ( BCS714D )

Module 1
All answers are from the prescribed VTU textbook for the above course.

1 What is Big Data? Explain the challenges of Big Data.

Big Data:
Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective,
innovative forms of information processing for enhanced insight and decision making.

Challenges Of Big Data:

1. Exponential Data Growth:
Data is growing very rapidly, making it difficult to decide which data is useful, how much to analyze,
and how to separate meaningful insights from noise.
2. Cloud & Infrastructure Decisions:
While cloud computing offers scalability and cost efficiency, deciding whether to host big data
solutions inside or outside the organization remains a challenge.
3. Data Retention Period:
Determining how long to store data is difficult, as some data is valuable long-term while other data
becomes irrelevant very quickly.
4. Lack of Skilled Professionals:
There is a shortage of highly skilled data science professionals required to design and manage big
data solutions.
5. Data Management Complexity:
Capturing, storing, processing, securing, and analyzing large, fast-moving, and unstructured data
exceeds the capabilities of traditional databases.
6. Data Visualization Challenges:
Although data visualization is gaining importance, there is a lack of experts who can effectively
present complex big data insights.

2 What is big data analytics? Explain classification of analytics.

Big data analytics is the process of examining large data sets containing a variety of data types to discover
knowledge, identify interesting patterns, and establish relationships. The results help solve problems and
reveal market trends, customer preferences, and other useful information.

Classification Of Analytics:

3 Explain the CAP Theorem.

The CAP theorem is also called Brewer's theorem. It states that in a distributed computing environment
(a collection of interconnected nodes that share data), it is impossible to provide all three of the following
guarantees at the same time. At best you can have two of the three; one must be sacrificed.
Consistency
Availability
Partition tolerance

1. Consistency (C):
Consistency means that every read operation returns the most recent write. In other words, all
nodes see the same data at the same time.
2. Availability (A):
Availability means that every request (read or write) receives a response within a reasonable amount
of time, even if some nodes fail.
3. Partition Tolerance (P):
Partition tolerance means that the system continues to operate even when network failures or
communication breakdowns occur between nodes.
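
The trade-off can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions: two in-memory replicas, a link that can be cut, and a mode flag that picks which guarantee to give up during a partition. The names TinyCluster, Replica, and demo are hypothetical and do not come from any real database.

```python
class Replica:
    """One node holding a single value."""

    def __init__(self, name):
        self.name = name
        self.value = None


class TinyCluster:
    """Two replicas joined by a link that can be cut (a partition)."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False

    def write(self, replica, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned and self.mode == "CP":
            # CP choice: refuse the request rather than let replicas diverge.
            raise RuntimeError("unavailable during partition")
        replica.value = value
        if not self.partitioned:
            # Link is up: copy the write to the other replica immediately.
            other.value = value

    def read(self, replica):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable during partition")
        return replica.value


def demo(mode):
    """Write once, cut the link, then write and read again."""
    cluster = TinyCluster(mode)
    cluster.write(cluster.a, "v1")   # replicated to both nodes
    cluster.partitioned = True       # network failure between a and b
    try:
        cluster.write(cluster.a, "v2")
        return ("answered", cluster.read(cluster.a), cluster.read(cluster.b))
    except RuntimeError:
        return ("refused",)
```

In this model, demo("AP") keeps answering during the partition but replica b returns the stale value "v1" (consistency is sacrificed), while demo("CP") refuses the request (availability is sacrificed). Real systems use quorums and other mechanisms that this sketch leaves out.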

4 What is NewSQL? Explain the Characteristics of NewSQL

NewSQL is a class of modern RDBMS that offers the same scalable performance as NoSQL systems for Online
Transaction Processing (OLTP) while still maintaining the ACID guarantees of a traditional database.
NewSQL databases support the relational data model and use SQL as their primary interface.

5 Explain the Hadoop Ecosystem Components for Data Processing and Data Analysis

Hadoop Ecosystem Components for Data Processing


1. MapReduce
MapReduce is a distributed programming model used to process large datasets in parallel.
• Data is read from HDFS.
• Map phase converts input data into key–value pairs.
• Reduce phase aggregates and processes this data.
• Final output is stored back in HDFS.
It provides fault-tolerant and scalable batch processing.
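
The map and reduce phases described above can be sketched with a word count. This is a single-process illustration in plain Python, not the Hadoop API: there is no HDFS, no parallelism, and no fault tolerance, and the function names map_phase, shuffle, reduce_phase, and word_count are hypothetical. The shuffle step (grouping values by key) is something the framework does between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

For example, word_count(["big data", "Big insights"]) counts "big" twice and the other two words once each. In a real cluster the map calls run on many nodes at once, which is where the scalability comes from.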

2. Spark
Spark is a fast, in-memory data processing framework and an alternative to MapReduce.
• Processes data in memory, which can make it 10–100 times faster than MapReduce for some workloads.
• Can read data from HDFS but uses its own execution engine instead of MapReduce.
• Can run on YARN or in standalone mode.
• Supports Scala, Python, Java, and R.
Spark Libraries:
• Spark SQL – SQL-based data querying
• Spark Streaming – real-time data processing
• MLlib – machine learning support
• GraphX – graph processing

Hadoop Ecosystem Components for Data Analysis


1. Pig
Pig is a high-level scripting platform for analyzing large datasets.
• Uses Pig Latin, a high-level data-flow scripting language.
• Pig scripts are converted into MapReduce jobs.
• Used mainly for ETL, data transformation, filtering, and analysis.
• Preferred by non-SQL developers.
2. Hive
Hive is a data warehousing tool built on Hadoop.
• Uses HiveQL (HQL), a SQL-like query language.
• Converts queries into MapReduce jobs.
• Used for querying, summarization, and analysis of large datasets.
• Suitable for SQL-oriented users.

Extra Important Questions

1 Characteristics Of Data.

1. Composition
The composition of data deals with the structure of data, such as the sources of data, its granularity, types,
and nature—whether the data is static or real-time streaming.

2. Condition
The condition of data refers to the state and quality of data, that is, whether the data can be used directly
for analysis or requires cleansing, enhancement, or enrichment before use.

3. Context
The context of data explains the background of data, including where it was generated, why it was
generated, how sensitive it is, and the events or situations associated with it.

2 Why Big Data?

3 Traditional Business Intelligence (BI) Versus Big Data


4 NoSQL

NoSQL (Not Only SQL) is a type of database that stores and manages data without using traditional
relational tables.
It is designed to handle large amounts of data, provide high performance, and support flexible data
structures, especially in distributed systems.

A few features of NoSQL databases are as follows:


1. They are open source.
2. They are non-relational.
3. They are distributed.
4. They are schema-less.
5. They are cluster friendly.
6. They are born out of 21st-century web applications.

Types of NoSQL Databases:


1. Key-Value Store – Stores data as simple key and value pairs.
2. Document Store – Stores data in document format like JSON or XML.
3. Column-Based Store – Stores data in columns instead of rows.
4. Graph Database – Stores data as nodes and relationships.
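
To make the four types concrete, the sketch below models the same small piece of data in each style using plain Python dictionaries and lists. It is only an illustration of the data shapes, not the storage format of any real product; the sample users, field names, and the followed_by helper are hypothetical.

```python
import json

# 1. Key-value store: the value is an opaque blob looked up by its key.
key_value = {"user:1": json.dumps({"name": "Asha", "city": "Pune"})}

# 2. Document store: a nested, self-describing document (JSON-like).
document = {
    "_id": 1,
    "name": "Asha",
    "address": {"city": "Pune"},
    "follows": [2],
}

# 3. Column-based store: the values of each column are kept together,
#    indexed by row id, instead of keeping each row together.
column_store = {
    "name": {1: "Asha", 2: "Ravi"},
    "city": {1: "Pune", 2: "Mysuru"},
}

# 4. Graph database: nodes plus explicit relationships (edges).
nodes = {1: {"name": "Asha"}, 2: {"name": "Ravi"}}
edges = [(1, "FOLLOWS", 2)]


def followed_by(user_id):
    """Return the names of the users that user_id follows."""
    return [
        nodes[dst]["name"]
        for src, relation, dst in edges
        if src == user_id and relation == "FOLLOWS"
    ]
```

Reading one column across all users (column_store["city"]) touches only that column, while followed_by(1) walks a relationship directly. These access patterns are what the column and graph models are designed around.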

5 Typical Hadoop Environment


6 Why Is Big Data Analytics Important?

1. Reactive – Business Intelligence (BI):
Uses past and historical data to generate reports, dashboards, alerts, and notifications for better
decision-making.
2. Reactive – Big Data Analytics:
Analyzes very large datasets but still works on static, historical data.
3. Proactive – Analytics:
Uses techniques like data mining and predictive modeling to support future decisions, but has limits
in storage and processing.
4. Proactive – Big Data Analytics:
Analyzes massive data volumes to quickly extract useful insights and solve complex problems.
