2 Marks
Unit-1
1. Illustrate various types of analytics that can be applied on Big Data.
Descriptive Analytics: Summarizes historical data to understand what
happened.
Predictive Analytics: Forecasts future trends and outcomes based on historical
data.
Diagnostic Analytics: Examines data to understand why certain events
happened.
Prescriptive Analytics: Provides recommendations on what actions to take
based on analysis.
2. Interpret Predictive Analytics and Diagnostic Analytics in brief
Predictive Analytics: Predicts future outcomes based on historical data and
statistical algorithms.
Diagnostic Analytics: Analyzes data to determine the cause of past events or
trends.
3. Infer the applications of Big Data.
Applications include personalized marketing, predictive maintenance, fraud
detection, healthcare analytics, and supply chain optimization.
4. How is Big Data analytics useful in travel and transportation?
Big data analytics can optimize routes, predict demand, improve scheduling,
enhance safety, and personalize customer experiences in the travel and
transportation industry.
5. Infer the essence of Web Analytics?
Web analytics involves analyzing web data to understand and optimize web
usage, user behavior, and website performance.
6. Illustrate the examples of web analytics
Examples include tracking website traffic, analyzing click-through rates,
monitoring conversion rates, and studying user demographics and preferences.
7. Is predictive analysis helpful in predicting fraud? Justify your
answer with a simple scenario.
Yes, predictive analysis can help predict fraud. For example, a bank can learn
each customer's normal spending pattern from past transactions. If a customer
who usually makes small local purchases suddenly makes several large purchases
abroad within minutes, the pattern is flagged as likely fraud and the bank can
block the card or ask the customer to confirm.
8. How is predictive analysis useful in predicting fraud?
Predictive analysis utilizes historical data and algorithms to identify patterns
and anomalies indicative of fraudulent behavior, enabling businesses to
proactively detect and prevent fraud.
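A minimal sketch of this idea in Python, assuming a hypothetical list of past transaction amounts for one customer. It uses a simple statistical rule (more than three standard deviations from the mean) in place of a trained model, so it only illustrates the principle:

```python
from statistics import mean, stdev

def is_suspicious(history, amount, threshold=3.0):
    """Flag an amount that deviates strongly from the customer's past spending."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return amount != mu
    return abs(amount - mu) / sigma > threshold

# Hypothetical past transaction amounts for one card holder.
past_amounts = [42.0, 38.5, 51.0, 45.25, 40.0, 47.75]

typical = is_suspicious(past_amounts, 44.0)    # close to the usual spending
unusual = is_suspicious(past_amounts, 900.0)   # far outside the usual range
```

Here the 44.0 purchase passes and the 900.0 purchase is flagged for review.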
9. Illustrate the reasons for the growing complexity/abundance of healthcare
data.
Reasons include the digitization of medical records, advances in medical
imaging technology, increased patient monitoring devices, and the
proliferation of wearable health devices.
10. Illustrate the sources of medical data to analyze the data using Big data
analytics.
Sources include electronic health records (EHRs), medical imaging, genomic
data, patient-generated health data (PGHD), wearable devices, and healthcare
sensors.
11. Illustrate the big Data technologies that are useful in working with
unstructured data and semi-structured data.
Technologies include Hadoop, Spark, NoSQL databases (MongoDB,
Cassandra), and distributed file systems (HDFS).
12. Can we handle unstructured data using big data analytics? Justify your
answer
Yes, big data analytics can handle unstructured data using technologies like
natural language processing (NLP), sentiment analysis, and machine learning
algorithms designed for text and image analysis.
13. Illustrate the differences between Cloud and Big Data
Cloud computing is a service delivery model for computing resources, while
big data refers to large volumes of structured, semi-structured, and
unstructured data. Cloud computing can be used to store and process big data,
but they serve different purposes.
14. Are Cloud and Big Data useful in the same context? Justify your answer
Yes, cloud computing and big data are often used together. Cloud platforms
provide scalable storage and processing resources required for big data
analytics, making it feasible to analyze large datasets without significant
upfront investment in infrastructure.
15. Illustrate the 4 types of crowd sourcing.
Crowd Voting, Crowd Wisdom, Crowd Creativity, and Crowd Funding.
16. Illustrate the benefits of Crowd Sourcing.
Benefits include access to a diverse pool of talent, faster problem-solving,
cost-effectiveness, scalability, and increased innovation through collective
intelligence.
Unit-2
• Analyse the use of NoSQL databases in real-world applications
NoSQL databases are extensively used in applications requiring scalability, high
availability, and flexible data models. They are prevalent in social media platforms,
e-commerce websites, real-time analytics, IoT systems, and content management
systems.
• Identify the difference between NoSQL and Relational databases
NoSQL databases offer schema flexibility, horizontal scalability, and better
performance for unstructured data compared to relational databases. Relational
databases, on the other hand, provide strong consistency, ACID transactions, and are
suitable for structured data with predefined schemas.
• Illustrate aggregate stores
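Aggregate stores (aggregate-oriented databases) are NoSQL databases that store data as
aggregates: collections of related data that are read, written, and distributed as a
single unit, such as an order together with its line items. Key-value, document, and
column-family databases are aggregate-oriented. Keeping an aggregate together on one
node makes the data easy to distribute across a cluster, but queries that cut across
aggregates become harder.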
• Illustrate Key-Value databases
Key-Value databases store data in a schema-less fashion, where each data item is stored
as a key-value pair. They provide fast access to data based on keys and are suitable for
simple data retrieval and caching purposes.
• Illustrate the graph databases usage in real-world applications
Graph databases are used in applications requiring complex relationship mapping, such
as social networks, recommendation systems, fraud detection, and network analysis.
• Build an example for materialized view
Example: In an e-commerce platform, a materialized view can store the total sales
revenue for each product category, updating in real-time as new orders are processed.
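The example above can be sketched in Python as a small in-memory structure. The class and method names are hypothetical, and a real database would maintain the view itself. The point is that the total is updated when an order arrives, so a read is only a lookup:

```python
from collections import defaultdict

class CategorySalesView:
    """Precomputed total revenue per category, refreshed on every new order."""

    def __init__(self):
        self.totals = defaultdict(float)

    def apply_order(self, category, amount):
        # Update the stored result incrementally instead of rescanning all orders.
        self.totals[category] += amount

    def revenue(self, category):
        # Reads are a lookup; no aggregation happens at query time.
        return self.totals.get(category, 0.0)

view = CategorySalesView()
for category, amount in [("books", 20.0), ("toys", 15.5), ("books", 30.0)]:
    view.apply_order(category, amount)
```

After these three orders, view.revenue("books") is 50.0 and view.revenue("toys") is 15.5.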
• Illustrate replication in a distributed model
Replication in a distributed model involves maintaining multiple copies of data across
different nodes to ensure data availability, fault tolerance, and load distribution. It
enhances data reliability and reduces latency by serving data from nearby replicas.
• Illustrate the importance of sharding and its support for scaling
Sharding distributes data across multiple database instances based on a shard key,
allowing horizontal scaling by adding more nodes. It improves performance and
accommodates growing datasets by partitioning data into smaller, manageable chunks.
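A minimal Python sketch of hash-based sharding, assuming hypothetical user ids as the shard key. Real systems often use range-based or consistent hashing schemes; the modulo rule here is only the simplest way to show keys being spread over more nodes:

```python
import hashlib

def shard_for(shard_key, num_shards):
    """Map a shard key to a shard index with a stable hash."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def distribute(keys, num_shards):
    """Place every key on exactly one shard."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards

user_ids = [f"user-{n}" for n in range(1000)]  # hypothetical shard keys
three_shards = distribute(user_ids, 3)
six_shards = distribute(user_ids, 6)  # scaling out: the same data over more nodes
```

Each key always maps to the same shard, so reads and writes for that key go to a single node.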
• Infer the details of strong consistency and eventual consistency
Strong consistency ensures that all replicas of data are updated synchronously and
uniformly, guaranteeing that any read operation returns the most recent write. Eventual
consistency allows replicas to diverge temporarily but ensures they eventually converge
to a consistent state over time.
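The difference can be sketched with a toy cluster in Python. The quorum rule used here (a read is guaranteed to see the latest write when R + W > N, for N replicas, W replicas written, and R replicas read) is a common technique and an assumption added for illustration, not something stated in these notes:

```python
def write(replicas, value, version, w):
    """Synchronously update only the first w replicas; the rest lag behind."""
    for replica in replicas[:w]:
        replica["value"], replica["version"] = value, version

def read(replicas, r):
    """Ask the last r replicas and return the value with the highest version."""
    contacted = replicas[-r:]
    return max(contacted, key=lambda rep: rep["version"])["value"]

def fresh_cluster(n):
    return [{"value": "old", "version": 1} for _ in range(n)]

# N = 3, W = 2, R = 2: R + W > N, so the read set overlaps the write set.
strong = fresh_cluster(3)
write(strong, "new", version=2, w=2)
strong_read = read(strong, r=2)

# N = 3, W = 1, R = 1: no overlap is guaranteed, so a stale read is possible
# until the remaining replicas catch up.
eventual = fresh_cluster(3)
write(eventual, "new", version=2, w=1)
stale_read = read(eventual, r=1)
```

The first read returns "new"; the second returns "old" because it reaches a replica that has not yet received the write.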
• Construct the schematic diagram to demonstrate the importance of CAP Theorem
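[Diagram]
[Consistency] [Availability] [Partition Tolerance]
A distributed system cannot guarantee all three at once: during a network partition it
must give up either consistency or availability.
Consistency + Availability: single-site Relational DB
Availability or Consistency + Partition Tolerance: distributed NoSQL DB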
• Illustrate the key features of Cassandra
Key features include decentralized architecture, linear scalability, high availability,
fault tolerance, tunable consistency, support for wide-column data model, and built-in
replication and partitioning.
• Identify key components of Cassandra
Key components include Nodes (seed, coordinator, and replica), Data Center,
KeySpace, Column Family (Table), Column (Field), Partition Key, Replica Placement
Strategy, and Consistency Level.
• Construct Keyspace using Cassandra
Example:
CREATE KEYSPACE my_keyspace
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
• Illustrate replication factor
The replication factor in Cassandra determines the number of replicas for each data
partition. It ensures data redundancy and fault tolerance by maintaining multiple copies
of data across different nodes.
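A simplified Python sketch of replica placement, loosely modelled on walking a ring of nodes clockwise. The node names are hypothetical, and the modulo step stands in for Cassandra's partitioner and token ranges:

```python
def replicas_for(partition_token, ring, replication_factor):
    """Return the nodes holding a partition: the owner plus the next nodes on the ring."""
    start = partition_token % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]

ring = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
placement = replicas_for(partition_token=6, ring=ring, replication_factor=3)
```

With a replication factor of 3, the partition is stored on node-c, node-d, and node-a, so it survives the loss of any two of them.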
• Infer the purpose of SELECT command in Cassandra
The SELECT command in Cassandra is used to retrieve data from a table based on
specified criteria. It allows querying data using primary keys, secondary indexes, or
clustering columns.
• Construct a query with the use of WHERE clause and a second query without
using a WHERE clause in Cassandra. Infer both outputs
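Query with WHERE clause:
SELECT * FROM my_table WHERE id = '123';
Output: Returns all columns of the row whose id is '123'. This assumes id is the
partition key, so only the replicas that own that partition are queried.
Query without WHERE clause:
SELECT * FROM my_table;
Output: Returns all columns of all rows in the table. The query scans every partition
in the cluster, so it is slow on large tables.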
Unit-3b
• Analyse the similarities and differences between Pig and SQL.
Similarities: Both Pig and SQL are query languages used for data manipulation and
analysis. They support operations like filtering, grouping, and joining.
Differences:
Pig is a high-level data flow scripting language designed for processing large datasets.
It provides a procedural approach to data processing, allowing users to define data flows
using Pig Latin scripts.
SQL (Structured Query Language) is a standard language for managing relational
databases. It follows a declarative approach, where users specify the desired result
without specifying the step-by-step procedure.
• Analyse the similarities and differences between Pig Latin and Hive
Similarities: Both Pig Latin and HiveQL are query languages used in the Hadoop
ecosystem for data processing. They support similar operations like loading data,
filtering, joining, and aggregating.
Differences:
Pig Latin is a procedural scripting language that provides a data flow approach to data
processing. Users define data transformations using Pig Latin scripts.
HiveQL is a SQL-like declarative query language that translates SQL queries into
MapReduce or Tez jobs. It allows users to query data stored in HDFS using SQL syntax.
• Identify the use of "explain" operator?
The "explain" operator in Pig Latin is used to show the logical, physical, and
MapReduce execution plans of a Pig Latin script. It helps users understand how the
script will be executed and optimize it for better performance.
• Interpret "explain" and "illustrate" operators.
"explain" operator: Shows the logical, physical, and MapReduce execution plans of a Pig
Latin script without running it.
"illustrate" operator: Runs the script step by step on a small sample of the data and
shows the output of each statement, which helps users understand how data flows
through the script.
• Interpret the output of the following script
A = load 'student' as (name:chararray, age:int, gpa:float);
dump A;
(joe,18,2.5)
(sam,,3.0)
(bob,,3.5)
X = group A by age;
dump X;
The script loads the file 'student' into relation A with the fields (name, age, gpa);
sam and bob have a null age. GROUP collects the tuples of A by age, and all tuples
with a null age are placed in the same group.
Output of dump X (the order of groups and of tuples inside a bag may vary):
(18,{(joe,18,2.5)})
(,{(sam,,3.0),(bob,,3.5)})
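The grouping can be mimicked in Python, using None for the missing ages. This is only an illustration of the grouping logic, not of how Pig runs it:

```python
from collections import defaultdict

# The three tuples shown by `dump A`; None stands for a null age.
A = [("joe", 18, 2.5), ("sam", None, 3.0), ("bob", None, 3.5)]

def group_by_age(rows):
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)  # rows with a null age share the key None
    return dict(groups)

X = group_by_age(A)
```

X has two groups: the key 18 holds joe's tuple, and the key None holds the tuples of sam and bob.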
• Interpret the output of the following script?
Script:
A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
B = LOAD 'data2' AS (b1:int,b2:int);
DUMP B;
(2,4)
(8,9)
(1,3)
X = CROSS A, B;
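DUMP X;
Output: CROSS computes the Cartesian product of A and B, pairing each of the 2 tuples
in A with each of the 3 tuples in B, so X contains 6 tuples (the order may vary):
(1,2,3,2,4)
(1,2,3,8,9)
(1,2,3,1,3)
(4,2,1,2,4)
(4,2,1,8,9)
(4,2,1,1,3)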
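The same Cartesian product can be checked in Python with itertools.product, using the tuples shown for A and B in the script:

```python
from itertools import product

A = [(1, 2, 3), (4, 2, 1)]
B = [(2, 4), (8, 9), (1, 3)]

# Concatenate every tuple of A with every tuple of B.
X = [a + b for a, b in product(A, B)]
```

X holds 2 × 3 = 6 tuples of five fields each, starting with (1, 2, 3, 2, 4).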
• Analyze the differences between Hive and HBase
Hive: Hive is a data warehousing infrastructure built on top of Hadoop for querying
and analyzing large datasets using a SQL-like language (HiveQL). It is best suited for
batch processing and is schema-on-read.
HBase: HBase is a distributed, scalable NoSQL database that runs on top of Hadoop
(HDFS). It suits applications that need random, real-time read/write access to large
datasets. Only the column families are fixed when a table is created; columns within
a family can be added per row at write time.
• Articulate the role of Hive in executing a map-reduce script
Hive translates SQL-like queries into MapReduce jobs, allowing users to query and
analyze data stored in Hadoop's distributed file system (HDFS) using familiar SQL
syntax. It abstracts the complexities of MapReduce programming and provides a high-
level interface for data processing.
• Develop code snippets to drop a database, alter and delete a table in Hive
Snippets:
Drop database: DROP DATABASE IF EXISTS my_database; (append CASCADE if the database
still contains tables)
Alter table: ALTER TABLE my_table ADD COLUMNS (new_column INT);
Delete table: DROP TABLE IF EXISTS my_table;
• Build the HiveQL statements to create a table with the following details and store
the table HDFS in text format
CREATE TABLE student_details (
name STRING,
address STRING,
cgpa FLOAT,
branch STRING
)
STORED AS TEXTFILE;
• Select the best alternative among HBase and RDBMS for Big Data
HBase is a better alternative for Big Data applications requiring real-time, scalable, and
distributed storage and processing of semi-structured or unstructured data. RDBMS
may struggle to handle the volume, velocity, and variety of Big Data efficiently.
• Experiment the put and scan operations of HBase
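The put operation inserts or updates a single cell, identified by table, row key, and
column (family:qualifier). The scan operation reads a range of rows, optionally
restricted by filters. Both are available in the HBase shell and in the client API.
Example in the HBase shell, assuming a hypothetical table 'student' with a column
family 'info':
put 'student', 'row1', 'info:name', 'joe'
scan 'student'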
• Interpret the output of the following query
SELECT DISTINCT eprofileclass, fueltypes FROM geog_all ;
The query selects distinct combinations of 'eprofileclass' and 'fueltypes' from the
'geog_all' table.
• Articulate the output of a Hive Query with and Without DISTINCT Keyword
A Hive query without the DISTINCT keyword returns all rows from the result set,
including duplicates, while a query with the DISTINCT keyword eliminates duplicate
rows and returns only unique rows.
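A small Python illustration of the difference, using hypothetical rows of two columns. Each row is treated as a whole, so a row counts as a duplicate only if every selected column matches:

```python
# Hypothetical (eprofileclass, fueltypes) rows; the first two are identical.
rows = [("E7", "gas"), ("E7", "gas"), ("E10", "electric"), ("E7", "electric")]

# Without DISTINCT: every row is returned, duplicates included.
without_distinct = list(rows)

# With DISTINCT: dict.fromkeys keeps the first occurrence of each row.
with_distinct = list(dict.fromkeys(rows))
```

The first result has four rows and the second has three, because ("E7", "gas") appears only once.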
• Choose the better alternative for data processing from CEP and Streams
The choice depends on the specific requirements of the data processing task. Complex
Event Processing (CEP) is suitable for analyzing and correlating high-velocity, event-
driven data streams in real-time, whereas Streams processing is more generic and can
handle various data processing tasks, including real-time analytics, ETL (Extract,
Transform, Load), and data integration.
• Identify the usage of Streams
Streams are used for processing continuous streams of data in real-time or near real-
time, enabling applications to react quickly to changing data and events. They are
commonly used in various domains such as finance, telecommunications, IoT, and
social media for tasks like real-time analytics, monitoring, and alerting.
Unit-4
• Illustrate the use of In-memory processing in Spark
Spark can keep intermediate data in memory (RAM) across the cluster instead of writing
it to disk between steps, which reduces disk I/O. This lets iterative and interactive
analytics applications run much faster than on traditional disk-based processing
frameworks such as MapReduce.
• Illustrate the advantage of Spark over Hadoop
Spark offers several advantages over Hadoop, including faster processing speed due to
in-memory computation, support for multiple types of workloads (batch, interactive,
iterative, and real-time), rich APIs (RDD, DataFrame, Dataset), and compatibility with
various data sources and programming languages.
• Illustrate the execution differences between the logical plan and the physical plan
The logical plan represents the abstract operations defined by the user's query, whereas
the physical plan represents the execution strategy chosen by the Spark optimizer to
execute the logical plan efficiently. The logical plan is language-independent and
describes the desired result, while the physical plan is optimized for performance and
specifies how to achieve the result using specific execution strategies and data
partitions.
• Demonstrate the essence of RDD
RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark,
representing a distributed collection of immutable objects that can be operated on in
parallel. RDDs are fault-tolerant, distributed across multiple nodes in a cluster, and
support transformations and actions for processing data in parallel.
• Illustrate the RDD transformations
RDD transformations are operations that produce new RDDs from existing RDDs.
Examples include map, filter, flatMap, reduceByKey, join, and sortByKey.
Transformations are lazily evaluated: they only record a lineage of operations that
defines the computation graph, and nothing runs until an action is called.
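Lazy evaluation can be illustrated in plain Python, where map and filter also return lazy iterators. This is an analogy only: it runs on one machine and does not use Spark:

```python
calls = []

def double(x):
    calls.append(x)  # record when the function really runs
    return x * 2

numbers = [1, 2, 3, 4, 5]

# "Transformations": building the pipeline runs nothing yet.
doubled = map(double, numbers)
large = filter(lambda x: x > 4, doubled)
calls_before_action = len(calls)

# "Action": materializing the result forces the whole pipeline to run.
result = list(large)
calls_after_action = len(calls)
```

No call to double happens until list() asks for the result, which is then [6, 8, 10].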
• Illustrate ONE fundamental difference between a Spreadsheet and Spark
DataFrame
A fundamental difference is that a Spreadsheet is a single-machine application for data
analysis, while a Spark DataFrame is a distributed data structure capable of handling
large-scale data processing across a cluster of machines.
• Illustrate a best practice when working with Nulls in Data
A best practice when working with Nulls is to handle them explicitly by either removing
them, replacing them with default values, or imputing them using statistical methods,
depending on the specific requirements of the analysis or modeling task.
• Illustrate how to drop NULLs explicitly
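In Spark, NULL values can be dropped explicitly with the dropna() method, which removes
every row that contains a null. Example:
df.dropna() (PySpark; the Scala equivalent is df.na.drop()), where df is the DataFrame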
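The three ways of handling nulls explicitly (drop, replace with a default, impute) can be sketched in plain Python. The records are hypothetical and None marks a missing value:

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"name": "joe", "gpa": 2.5},
    {"name": "sam", "gpa": None},
    {"name": "bob", "gpa": 3.5},
]

# Option 1: drop rows that contain a null.
dropped = [r for r in rows if r["gpa"] is not None]

# Option 2: replace nulls with an explicit default.
filled = [{**r, "gpa": 0.0 if r["gpa"] is None else r["gpa"]} for r in rows]

# Option 3: impute nulls with the mean of the known values.
known_mean = mean(r["gpa"] for r in rows if r["gpa"] is not None)
imputed = [{**r, "gpa": known_mean if r["gpa"] is None else r["gpa"]} for r in rows]
```

Which option fits depends on the task: dropping loses sam's row, the default 0.0 lowers the average, and imputation assumes sam is typical.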
• Interpret unique support for working with JSON Data
Spark provides native support for reading and writing JSON data, allowing users to
easily parse JSON files into DataFrames and vice versa. This simplifies data ingestion
and integration tasks when dealing with JSON-formatted data.
• Implicit type casting is an easy way to shoot yourself in the foot, especially while
dealing with "N Values". Justify your answer
Implicit type casting can lead to unexpected behavior and data loss, especially when
dealing with null values. It may result in incorrect computations or unexpected errors,
making it important to handle type conversions explicitly to ensure data integrity and
accuracy.
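A Python sketch of the risk, assuming a lenient cast that silently turns unparseable input into a null. The function names and the sample values are hypothetical:

```python
def lenient_cast(value):
    """Mimic a silent cast: unparseable input becomes None instead of failing."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def strict_cast(value):
    """Explicit cast: refuse anything that is not a clean integer string."""
    if value is None or not value.strip().lstrip("-").isdigit():
        raise ValueError(f"cannot cast {value!r} to int")
    return int(value)

raw_ages = ["18", "n/a", None, "21"]

# The bad string and the real null become indistinguishable.
silently_cast = [lenient_cast(v) for v in raw_ages]

# The explicit cast reports every value it cannot convert.
rejected = []
for v in raw_ages:
    try:
        strict_cast(v)
    except ValueError:
        rejected.append(v)
```

After the lenient cast there is no way to tell that "n/a" was bad input rather than a missing value, while the explicit cast surfaces both problem values.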
• Is Broadcast Variable immutable? Justify
Yes, broadcast variables are immutable in Spark. Once broadcasted, their values cannot
be changed or mutated. They are distributed read-only variables that are cached and
shared across all tasks within a Spark job for efficient data distribution.
• Justify the statement “accumulator is mutable”
Accumulators are shared variables used to aggregate values, such as counters or sums,
across tasks in parallel computations. Each task can add to an accumulator while the
job runs, and only the driver can read the result. Because the value is updated in
place during execution, accumulators are mutable.
• Can you use spark logs to monitor the Spark Jobs? Justify
Yes, Spark logs provide detailed information about the execution of Spark jobs,
including task execution, resource utilization, stage progress, and error messages.
Monitoring Spark logs can help diagnose performance issues, optimize resource usage,
and troubleshoot errors in Spark applications.
• Articulate the command to modify the Spark logs
You can modify Spark logs by editing the log4j properties file (log4j2.properties in
recent Spark releases), which is typically located in the conf directory of the Spark
installation. Adjusting the log levels and output destinations in this file controls
the verbosity and format of Spark logs. At runtime, the log level can also be changed
with spark.sparkContext.setLogLevel("WARN").
• Articulate the details of the Jobs tab of Spark UI
The Jobs tab of Spark UI displays information about individual Spark jobs, including
job duration, stages within the job, task execution times, input/output sizes, and shuffle
data. It provides insights into job performance, resource utilization, and task execution
progress.
• Articulate the details of the SQL tab of Spark UI
The SQL tab of Spark UI shows details about SQL queries executed using Spark SQL,
including query execution time, stages involved in query processing, input/output
metrics, and physical and logical plans. It helps users understand how SQL queries are
executed and optimize query performance.
Unit-5
• Demonstrate the challenges of Stream Processing
Challenges of stream processing include handling high-velocity data
streams, ensuring low latency, managing out-of-order data, dealing with
late arriving events, maintaining stateful computations, and ensuring fault
tolerance and scalability.
• Illustrate the aspects that need to be considered for Spark’s
performance
Aspects to consider for Spark's performance include data partitioning,
caching, memory management, shuffle optimization, tuning the number of
executors and resources, reducing data shuffling, minimizing data skew,
and optimizing transformations and actions.
• When is arbitrary Stateful processing useful? Justify
Arbitrary stateful processing is useful when computations require
maintaining arbitrary state across multiple events or time windows. It is
suitable for scenarios like sessionization, fraud detection, anomaly
detection, and real-time aggregations where stateful transformations are
needed.
• When is Event Time processing useful? Justify
Event Time processing is useful when analyzing data based on the time
when events actually occurred rather than when they are processed. It is
crucial for handling out-of-order data, late arriving events, and accurately
calculating time-based aggregations, ensuring correct and meaningful
results.
• Build the code snippet to perform a cube operation using event-time
processing
Code snippet:
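// Assumes a SparkSession named spark and a DataFrame df with the columns
// eventTime (timestamp), category, and value.
import org.apache.spark.sql.functions.{sum, window}
import spark.implicits._
// Sum value per one-hour event-time window and category, then pivot the
// categories into columns to get one row per window (a cross-tabulation).
val cubeDF = df
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "1 hour"), $"category")
  .agg(sum($"value").as("sum"))
  .groupBy("window")
  .pivot("category")
  .sum("sum")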
• Build the code snippet to create a window
Code snippet:
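// Assumes SparkSession spark; df has eventTime (timestamp) and value columns.
import org.apache.spark.sql.functions.{sum, window}
import spark.implicits._
// One-hour tumbling windows on event time, tolerating events up to 10 minutes late.
val windowedDF = df
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "1 hour"))
  .agg(sum($"value").as("sum"))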
• When can we use “withWatermark()” in Apache Spark?
We use withWatermark() in Apache Spark when dealing with event-time
processing to specify a threshold for late arriving events. It helps in
defining the maximum allowable lateness of events in the stream and is
crucial for correctly handling out-of-order data.
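The effect of a watermark on windowed aggregation can be sketched in plain Python. This is a simplified per-event model with hypothetical timestamps in seconds; Spark itself advances the watermark per micro-batch:

```python
WINDOW = 60 * 60      # one-hour tumbling windows, in seconds
LATENESS = 10 * 60    # accept events up to ten minutes late

def aggregate(events):
    """events: (event_time_seconds, value) pairs in arrival order."""
    sums = {}
    max_event_time = 0
    dropped = []
    for event_time, value in events:
        watermark = max_event_time - LATENESS
        window_start = event_time - event_time % WINDOW
        if window_start + WINDOW <= watermark:
            # The window closed before this event arrived: too late.
            dropped.append((event_time, value))
            continue
        sums[window_start] = sums.get(window_start, 0) + value
        max_event_time = max(max_event_time, event_time)
    return sums, dropped

# Arrival order differs from event-time order.
events = [(100, 1), (3500, 2), (3700, 4), (3590, 8), (4300, 16), (200, 32)]
sums, dropped = aggregate(events)
```

The event at 3590 arrives out of order but within the allowed lateness, so it is still counted in the first window. The event at 200 arrives after the watermark has passed the end of its window and is dropped.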
• When can we use “dropDuplicates()” in Apache Spark
dropDuplicates() is used in Apache Spark when removing duplicate
records from a DataFrame or Dataset based on specified columns. It is
useful when working with streaming data to eliminate redundant records
and ensure data consistency.
• Articulate the use of “triggers” in structured streams
Triggers in structured streams control when a streaming query processes newly
arrived data and emits results. By default, a new micro-batch starts as soon
as the previous one finishes. A processing-time trigger starts micro-batches
at a fixed interval (for example, every 10 seconds), and a once trigger
processes the available data in a single batch and then stops.
• Justify the statement: “When we perform a map job on a stream, we
should not use Complete mode of output”
Complete mode writes the entire result table to the sink after every
trigger. A map job has no aggregation, so the result table is the whole
transformed stream, which grows without bound; keeping and rewriting it
would exhaust memory and degrade performance. Structured Streaming
therefore supports Complete mode only for aggregation queries, and
map-only queries should use Append mode (or Update mode) instead.
• Illustrate the use of “awaitTermination()” in structured
streams
awaitTermination() is called on a started streaming query to wait for
its termination. It blocks the current thread until either the query is
stopped or an exception occurs, which keeps the driver process alive
while the stream runs in the background.
• Interpret the output of “groupBy("gt").count()” in
structured Streams
The output of groupBy("gt").count() on a streaming DataFrame is a
DataFrame/Dataset with one row per distinct value of the "gt" column
and the number of records seen so far for that value. The counts are
updated continuously as the stream is processed.
• Illustrate the Drawbacks of Structured Streaming Joins
Drawbacks of structured streaming joins include the complexity of
managing stateful operations, the potential for data skew leading to
performance issues, the need for managing watermarks and event-time
considerations, and the challenges of handling late arriving events and out-
of-order data.
• Articulate the functionality of Structured Streaming Joins with its
challenges
Structured streaming joins allow combining data from multiple streams or
between streams and static data sources using join operations. Challenges
include ensuring data consistency across multiple streams, handling
stateful operations efficiently, managing event-time considerations, and
dealing with late arriving events and out-of-order data.
• Articulate the use of memory sink
Memory sink in structured streaming is used to store the output of
streaming queries in memory for debugging, testing, or quick analysis
purposes. It allows developers to inspect the intermediate results of stream
processing within the same Spark application.
• Articulate the use of console sink
Console sink in structured streaming is used to output the results of
streaming queries to the console for monitoring or debugging purposes. It
provides a simple and convenient way to observe the output of stream
processing in real-time during development or testing.