CCS334 - BIG DATA ANALYTICS
UNIT I UNDERSTANDING BIG DATA
PART A
1. What is big data?
2. Name the four V's of big data.
3. How does unstructured data differ from structured data?
4. Provide an example of unstructured data.
5. Can you list two industries that heavily rely on big data analytics?
6. What is the primary purpose of web analytics?
7. Name a popular framework for distributed data processing in big data.
8. What is Hadoop's role in the big data ecosystem?
9. Give an example of a NoSQL database.
10. How does cloud computing relate to big data?
11. What is mobile business intelligence?
12. How does crowdsourcing analytics work?
13. What is the significance of inter-firewall analytics?
14. What does trans-firewall analytics focus on?
15. What are the key trends that led to the emergence of big data?
16. How is big data different from traditional data analysis?
17. Name a few open-source technologies commonly used in big data.
18. How does big data benefit decision-making in healthcare?
19. What is the purpose of data visualization in big data applications?
20. Why is real-time data processing important in some big data scenarios?
PART B
1. How has the convergence of key trends, such as data growth and technological
advancements, shaped the big data landscape?
2. Can you provide real-world examples of how businesses are leveraging big data to gain a
competitive edge in their industries?
3. What are the primary challenges associated with analyzing unstructured data, and how
can organizations overcome them?
4. How does Hadoop address the challenges of storing and processing massive datasets?
What are its core components?
5. In what ways does open-source technology foster innovation and collaboration in the
development of big data solutions?
6. What are the advantages and potential drawbacks of using cloud computing platforms for
big data storage and processing?
7. How does mobile business intelligence empower decision-makers and improve business
agility in today's data-driven world?
8. What ethical considerations should organizations take into account when collecting and
analyzing data obtained through crowdsourcing analytics?
9. How do inter-firewall and trans-firewall analytics contribute to network security and data
protection in an increasingly interconnected world?
10. What are the emerging trends and future developments expected in the field of big data,
and how might they impact various industries and society as a whole?
UNIT II NOSQL DATA MANAGEMENT
PART A
1. What does NoSQL stand for, and why are NoSQL databases used?
2. What is the primary advantage of aggregate data models in NoSQL databases?
3. Name two common types of NoSQL data models.
4. How do graph databases differ from other NoSQL databases?
5. What does it mean for a database to be schemaless?
6. What are materialized views in the context of NoSQL databases?
7. Explain the concept of horizontal scalability in NoSQL.
8. What is master-slave replication, and how does it work in NoSQL systems?
9. What is eventual consistency in distributed databases?
10. Why is Cassandra known for its high availability and fault tolerance?
11. In Cassandra, what is a column-family data model?
12. Provide an example of a use case where Cassandra is well-suited.
13. What is a primary key in a Cassandra data model?
14. How does Cassandra handle data distribution across nodes?
15. What is the CAP theorem, and how does it relate to NoSQL databases?
16. What are some popular NoSQL databases apart from Cassandra?
17. How do NoSQL databases typically handle ACID transactions?
18. Can NoSQL databases be used alongside traditional relational databases?
19. What is the role of indexes in improving query performance in NoSQL databases?
20. How can developers interact with Cassandra through client libraries?
PART B
1. What are the fundamental differences between NoSQL databases and traditional
relational databases, and in what scenarios is each type more suitable?
2. How do key-value stores and document stores differ in terms of data modeling, and what
are some use cases for each type?
3. What challenges and advantages come with managing data in a schemaless NoSQL
database, and how can organizations effectively deal with schema evolution?
4. In what situations would you choose a graph database over other NoSQL databases, and
what unique capabilities do graph databases offer for data analysis?
5. How does horizontal scalability impact the design and operation of NoSQL databases,
and what strategies can be employed to ensure data consistency in distributed systems?
6. What are the key architectural features of Cassandra that make it a preferred choice for
applications requiring high availability and fault tolerance, and what are its limitations?
7. Can you provide a detailed comparison of the consistency models used in NoSQL
databases, including strong consistency, eventual consistency, and the trade-offs associated with
each?
8. How do NoSQL databases address security and data privacy concerns, especially in the
context of distributed and highly available systems?
9. What role do indexes play in optimizing query performance in NoSQL databases, and
what best practices should developers follow when designing data models?
10. What trends and innovations are emerging in the NoSQL data management space, and
how might they impact the future of data storage and retrieval?
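Study note for the Unit II questions on data distribution and consistency: the sketch below is a simplified illustration, not Cassandra's actual implementation. It assumes one token per node and uses MD5 as a stand-in hash (Cassandra's default partitioner and its virtual nodes work differently). The names HashRing, token, and quorum_overlaps are hypothetical. It shows how a partition key could be mapped to replica nodes on a token ring, and the common rule of thumb that reads observe the latest write when R + W > N.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to an integer token (MD5 is an illustrative choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy token ring: each node owns the key range ending at its token."""

    def __init__(self, nodes):
        # Sort nodes by token so the ring can be searched with bisect.
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key: str, replication_factor: int = 1):
        """Return the nodes that would hold the key, walking clockwise."""
        start = bisect.bisect_left(self._tokens, token(partition_key))
        count = min(replication_factor, len(self._ring))
        # The modulo wraps past the last token back to the first node.
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


def quorum_overlaps(n: int, r: int, w: int) -> bool:
    """True when every read replica set must intersect every write set."""
    return r + w > n
```

For example, HashRing(["node-a", "node-b", "node-c"]).replicas("user:42", 2) returns two distinct nodes, and quorum_overlaps(3, 2, 2) holds while quorum_overlaps(3, 1, 1) does not.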
UNIT III: MAP REDUCE APPLICATIONS
PART A
1. What is the primary purpose of MapReduce in the context of big data processing?
2. What are the key components of a MapReduce workflow?
3. How can MRUnit help in testing MapReduce applications?
4. Why is it important to perform local tests with test data before deploying a MapReduce job?
5. What are the different stages in the anatomy of a MapReduce job run?
6. How does YARN differ from the classic MapReduce framework in Hadoop?
7. What are some common types of failures that can occur in MapReduce and YARN, and how
are they managed?
8. Explain the concept of job scheduling in the context of MapReduce.
9. What is shuffling and sorting in MapReduce, and why is it necessary?
10. What happens during the task execution phase in a MapReduce job?
11. Give an example of a problem type that is well-suited for MapReduce batch processing.
12. What are iterative algorithms, and how can MapReduce be used to implement them?
13. How does real-time data analysis differ from batch processing in MapReduce?
14. What is the purpose of input formats in MapReduce, and can you name a commonly used
input format?
15. What is an OutputFormat in the context of MapReduce, and why is it important?
16. How does MapReduce handle parallelism and distributed processing?
17. What is the role of the JobTracker in classic MapReduce, and how does it relate to YARN's
ResourceManager?
18. What is speculative execution in MapReduce, and why is it used?
19. How does data locality optimization enhance the efficiency of MapReduce jobs?
20. Can you explain the concept of data skew in the context of MapReduce, and how can it be
mitigated?
PART B
1. What are the fundamental principles and design patterns that underlie the MapReduce
programming model, and how do they enable the processing of large-scale data?
2. How does MRUnit facilitate the testing of MapReduce applications, and what are some
best practices for writing effective unit tests for MapReduce code?
3. In the context of MapReduce, why is it essential to perform local tests with test data
before deploying a job to a production cluster, and how can developers simulate cluster-like
conditions locally?
4. Can you describe the critical stages in the anatomy of a MapReduce job run, and how
does the order of these stages affect the overall performance of a job?
5. What motivated the transition from classic MapReduce to YARN in Hadoop, and how
has YARN improved resource management and job execution in Hadoop clusters?
6. How are failures managed in MapReduce and YARN, and what mechanisms ensure the
reliability and fault tolerance of MapReduce jobs in the face of node or task failures?
7. What are the key considerations in job scheduling for MapReduce, and how do fair
scheduling and capacity scheduling algorithms work to optimize resource allocation?
8. What is the role of the shuffling and sorting phase in MapReduce, and how does efficient
data shuffling impact the overall performance of MapReduce jobs?
9. Can you provide insights into the execution of MapReduce tasks, including how
parallelism is achieved, how tasks communicate, and how task-level failures are handled?
10. How does the MapReduce model adapt to different problem types, and what are the
challenges and benefits of using MapReduce for batch processing, iterative algorithms, and
real-time data analysis?
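Study note for the Unit III questions on the MapReduce workflow: the following is a minimal single-process sketch of the map, shuffle-and-sort, and reduce phases, using word count as the example problem. It does not use Hadoop. The function names (map_phase, shuffle_and_sort, reduce_phase, run_job) are hypothetical and chosen only for illustration. A real job would run many map and reduce tasks in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Group values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle-and-sort, then reduce over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle_and_sort(mapped))
```

Calling run_job(["big data", "Big Hadoop data"]) yields a dictionary in which "big" and "data" each map to 2 and "hadoop" maps to 1. The shuffle step is what guarantees that all values for a key reach the same reduce call.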
UNIT IV BASICS OF HADOOP
PART A
1. What is the default block size in HDFS and why is it important?
2. Define Hadoop Streaming and its primary use case.
3. What are the key differences between Hadoop Pipes and Hadoop Streaming?
4. What is data serialization in Hadoop?
5. Mention any two features of Avro.
6. What is the role of the NameNode in HDFS?
7. How does Hadoop ensure data integrity?
8. What do you mean by "scaling out" in Hadoop architecture?
9. Define compression in the context of Hadoop I/O.
10. What are file-based data structures in Hadoop?
11. What is the purpose of the Java interface in HDFS?
12. List any two advantages of integrating Hadoop with Cassandra.
13. What is the use of Checksum in Hadoop I/O?
14. Define data flow in Hadoop MapReduce.
15. Mention any two key components of the Hadoop Distributed File System (HDFS).
16. What is the function of the Avro schema?
PART B
1. Explain the design and working of Hadoop Distributed File System (HDFS). What are its
key concepts and advantages?
2. Describe the data flow in a Hadoop MapReduce job with suitable diagrams.
3. Discuss the concepts of Hadoop I/O. How are compression and serialization handled in
Hadoop?
4. What is Avro? Explain its architecture, schema evolution, and role in Hadoop data
serialization.
5. Describe Hadoop Streaming and Pipes. How do they enable writing MapReduce
programs in non-Java languages?
6. Explain how Hadoop ensures data integrity and reliability during processing.
7. Discuss the integration of Cassandra with Hadoop. How does it enhance distributed data
processing?
8. Compare and contrast different file-based data structures used in Hadoop.
9. What is meant by “scaling out” in Hadoop? How does it differ from “scaling up”? Justify
your answer with examples.
10. Discuss the role and implementation of the Java interface in HDFS file operations.
Include basic Java code examples.
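Study note for the Unit IV questions on block size and checksums: the sketch below is an illustration in Python rather than Hadoop's Java API. It assumes a 128 MB block size and one checksum per 512 bytes; both values are configurable in a real cluster, so treat them as assumptions. Python's zlib.crc32 is used as a stand-in for the checksum algorithm HDFS applies. The helper names are hypothetical.

```python
import zlib

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (configurable in HDFS)
BYTES_PER_CHECKSUM = 512        # assumed checksum chunk size in bytes


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed for a file; the last block may be partial."""
    if file_size <= 0:
        return 0
    return -(-file_size // block_size)  # ceiling division


def chunk_checksums(data: bytes, chunk: int = BYTES_PER_CHECKSUM):
    """Compute one CRC-32 value per fixed-size chunk of the data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def is_intact(data: bytes, expected, chunk: int = BYTES_PER_CHECKSUM) -> bool:
    """Recompute checksums on read and compare with the stored ones."""
    return chunk_checksums(data, chunk) == expected
```

Under these assumptions a 300 MB file occupies three blocks, and changing a single byte of the data makes is_intact return False, which is the signal a reader would use to fetch another replica.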
UNIT V HADOOP RELATED TOOLS
PART A
1. What is HBase and how is it different from Hadoop HDFS?
2. List two key features of the HBase data model.
3. What are the types of HBase clients?
4. Define a column family in HBase.
5. What is Praxis in the context of HBase?
6. Mention any two commands used in the Grunt shell of Pig.
7. What is the role of Pig Latin in data analysis?
8. Differentiate between a tuple and a bag in the Pig data model.
9. Write the syntax to load a dataset in Pig Latin.
10. List any two data types supported by Hive.
11. What is the purpose of HiveQL?
12. Mention two file formats commonly used with Hive tables.
13. How is Hive different from traditional RDBMS?
14. What is the role of the LOAD DATA statement in HiveQL?
15. Define a managed table in Hive.
16. Write a simple HiveQL query to retrieve all records from a table named students.
PART B
1. Explain the data model of HBase with suitable examples. How does it store and retrieve
large-scale structured data?
2. Describe the architecture and implementation of HBase. How do clients interact with
HBase?
3. What is Praxis? Discuss with examples how HBase is used in real-world applications.
4. Explain the Pig data model. Compare its structure with relational models using examples.
5. Write and explain a Pig Latin script to analyze a sales dataset. How is it tested in the
Grunt shell?
6. Discuss the components and execution environment of Apache Pig. How does Pig handle
semi-structured data?
7. Explain Hive’s data types and file formats. Why are file formats important for
performance in Hive?
8. Describe the data definition and data manipulation capabilities of HiveQL with suitable
syntax and examples.
9. Write and explain HiveQL queries for the following:
a) Creating a table for employee data
b) Loading data from a local file
c) Querying employees with salary > 50000
d) Deleting a specific record
10. Compare Hive, Pig, and HBase in terms of data model, usage, and execution models.
Provide a use-case for each.