Big Data Analysis Course Syllabus

Uploaded by

harsh958098
Copyright
© All Rights Reserved

BIG DATA ANALYSIS

Syllabus and Course Outline

Credits: 03
Programs:
[Link]. (CSE) 7th Semester: Paper – 1 (BCS701)
[Link]. (CS) 3rd Semester: Paper – 5: Elective I (MCS651)

Prerequisites:
Students should know one object-oriented programming language (Java, Python, or Scala),
databases, and SQL, and should have basic hands-on experience with the Linux operating system.
A prior course in data warehousing and mining, machine learning, or parallel and distributed
computing will help students learn the material more quickly.

Course Objectives:
1. To introduce the fundamental concepts of big data analytics
2. To learn data analysis with R
3. To understand the design principles of Hadoop
4. To learn programming with MapReduce concepts
5. To learn how to design scalable and distributed applications with Hadoop
6. To introduce the tools of the Hadoop ecosystem
7. To understand document-oriented database systems
8. To study the security and privacy issues of big data

Learning Outcomes:
On completion of this course, students will be able to:

1. Define data science, big data, and associated terminology
2. Understand the fundamentals of big data analytics techniques and their applications in
various sectors
3. Apply the R tool for data analysis
4. Analyze the Hadoop system and the components of its ecosystem
5. Use HDFS and apply MapReduce to develop big data applications
6. Understand NoSQL databases
7. Understand the security issues of big data and how security is implemented with Hadoop

Course Contents:
Unit 1: Introduction: 08 Lectures

Data Science, Big Data and its importance, Prediction vs. Inference, Statistical learning,
Unsupervised and Supervised learning, Drivers for Big data, Big data analytics, Big data
applications, Basic R concepts, Data transformation and data visualization in R.

Unit 2: Hadoop: 08 Lectures

Introduction to Hadoop and Hadoop Architecture, Apache Hadoop & Hadoop Ecosystem,
Moving Data in and out of Hadoop, Understanding inputs and outputs of MapReduce.

Unit 3: Querying in Big Data: 08 Lectures

HDFS Overview, Hive Architecture, Comparison with Traditional Databases, Querying Data with
HiveQL, Sorting and Aggregating, MapReduce Scripts, Joins & Subqueries, HBase Concepts,
Advanced Usage, Schema Design, Advanced Indexing, Pig, ZooKeeper, How HBase Uses ZooKeeper.

Unit 4: Database for the Modern Web: 08 Lectures

Introduction to MongoDB, Key Features, Core Server Tools, MongoDB through the JavaScript
Shell, Creating and Querying through Indexes, Document-Oriented Principles of Schema Design,
Constructing Queries on Databases, Collections and Documents, MongoDB Query Language.

Unit 5: Big Data Security: 08 Lectures

Big Data Privacy, Ethics and Security, Steps to secure big data, Cloud security, Hadoop Security
Design, Hadoop Kerberos Security Implementation & Configuration, Audit logging in Hadoop
cluster, Data security and event logging.

Recommended Readings:
Text and Reference Books:

1. Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich, “Professional Hadoop Solutions”,
John Wiley & Sons, Inc.
2. Chris Eaton, Dirk deRoos et al., “Understanding Big Data”, McGraw Hill
3. Kyle Banker, Peter Bakkum, Shaun Verch, Douglas Garrett, Tim Hawkins, “MongoDB in
Action”, Manning Publications Co.
4. Tom White, “Hadoop: The Definitive Guide”, O'Reilly Media, Inc.
5. Vignesh Prajapati, “Big Data Analytics with R and Hadoop”, Packt Publishing.
6. Bill Chambers & Matei Zaharia, “Spark: The Definitive Guide”, O'Reilly Media, Inc.
7. Luis Torgo, “Data Mining with R: Learning with Case Studies”, Second Edition (Chapman
& Hall/CRC Data Mining and Knowledge Discovery Series)
8. Mark Kerzner and Sujee Maniyam, “Hadoop Illuminated”,
[Link]
9. Jimmy Lin and Chris Dyer, “Data-Intensive Text Processing with MapReduce”,
[Link]
10. Anand Rajaraman, Jeffrey D. Ullman, “Mining of Massive Datasets”,
[Link]

Lecture Notes/Slides

Will be provided during lectures.

Tutorials

Will be provided during lectures.

Research Articles

1. Undefined By Data: A Survey of Big Data Definitions, [Link]
2. Big Data Analytics: A Survey, [Link]
3. A Survey on Platforms for Big Data Analytics, [Link]
6
4. The Google File System,
[Link]
[Link]
5. MapReduce: Simplified Data Processing on Large Clusters,
[Link]
[Link]
6. Spark: Cluster Computing with Working Sets,
[Link]
7. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster
Computing, [Link]

Web Resources:

1. The R Project for Statistical Computing, [Link]
2. Apache Hadoop, [Link]
3. Apache HBase, [Link]
4. Apache Hive, [Link]
5. Apache Pig, [Link]
6. Apache ZooKeeper, [Link]
7. Apache Mahout, [Link]
8. Apache Spark, [Link]
9. MongoDB, [Link]

Pedagogy:
Participative learning, discussions, flowcharts, algorithms, program writing, assignments, quizzes,
PowerPoint presentations

Assessment Scheme:
Quizzes/Tests: 10%
Assignments: 10%
Mid Term Examination: 20%
End Term Examination: 60%

Assignments:
Lab Assignments
Presentation Assignments

Common questions

MapReduce plays a critical role in big data applications by processing very large data sets in parallel across a distributed cluster. Its challenges include the complexity of writing efficient MapReduce programs and the overhead of moving large data sets between nodes, which can hurt performance and often requires tuning.
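
To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps applied to word counting. It is plain Python rather than Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are illustrative assumptions, and a real cluster would run the map and reduce calls on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework would do
    between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = ["big data needs big clusters", "data moves between nodes"]
    print(word_count(sample))
```

The shuffle step is where the data-transfer overhead mentioned above arises in a real cluster, because every value for a given key must reach the same reducer.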

HiveQL is designed for querying and managing large datasets held in distributed storage on Apache Hadoop, whereas traditional SQL databases are typically used for structured, transactional data. Hive compiles HiveQL queries into batch jobs that run across the cluster, so they scale to data volumes that would exceed the capacity of a typical single-node database. This makes HiveQL well suited to large-scale batch analytics.

The course addresses big data security by covering the privacy, ethics, and security issues specific to big data environments. Topics include steps to secure big data, cloud security, Hadoop security design, Kerberos implementation and configuration, and audit logging in Hadoop clusters. These measures help protect data storage and processing against unauthorized access and breaches.

Hadoop is designed to process large data sets across clusters of computers using simple programming models. It scales from a single server to thousands of machines, each offering local computation and storage. This architecture supports reliable and scalable large-scale data processing in distributed applications.

Cloud security for big data relies on cloud-based services for data protection, compliance, and secure access control. The course relates this to Hadoop, which can use cloud infrastructure for scalable storage and processing, together with safeguards such as encrypted data transmission and authenticated access that address vulnerabilities in big data environments.

The course first teaches the fundamentals of data science and big data, then covers basic R concepts, data transformation, and data visualization in R. Students apply R to analyze data, which gives them practical skills for handling data sets.

The course adopts pedagogical strategies such as participative learning, discussions, the use of flowcharts, algorithm creation, program writing, and assignments. These methods encourage active engagement, foster understanding through practical application, and develop the problem-solving skills essential for mastering big data analysis concepts.

Key features of MongoDB include its document-oriented data model, dynamic schemas, and scalability across distributed systems. These features support flexible and efficient data management, which matters for modern web applications that handle large volumes of data with varying structures. Its query language, available interactively through the JavaScript shell, also makes it convenient for developers.
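
The idea of a document-oriented model with dynamic schemas can be sketched in plain Python, with dictionaries standing in for documents and a list standing in for a collection. This is not the MongoDB API: `find_documents`, the `products` collection, and its fields are hypothetical names chosen for illustration, and the function only imitates simple equality matching on fields.

```python
def find_documents(collection, query):
    """Return the documents that contain every field in `query`
    with an equal value (a simple query-by-example match)."""
    return [
        doc
        for doc in collection
        if all(field in doc and doc[field] == value
               for field, value in query.items())
    ]


# Hypothetical collection: each document has its own set of fields,
# which is what "dynamic schema" means in practice.
products = [
    {"_id": 1, "name": "laptop", "tags": ["electronics"], "price": 900},
    {"_id": 2, "name": "novel", "author": "A. Writer", "price": 12},
    {"_id": 3, "name": "phone", "tags": ["electronics"], "warranty_years": 2},
]

if __name__ == "__main__":
    print(find_documents(products, {"name": "novel"}))
    print(find_documents(products, {"tags": ["electronics"]}))
```

Documents that lack a queried field are simply not matched, so adding a new field to some documents does not require changing the others.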

The course builds an understanding of NoSQL databases by discussing their use in big data environments, with a focus on flexibility, schema design, and the handling of diverse data types. It emphasizes hands-on work with tools such as MongoDB and the Hadoop ecosystem, so that students can manage and analyze large, unstructured datasets effectively.

Both supervised and unsupervised learning are important in big data analytics because together they support data exploration and decision-making. Supervised learning is used for predictive modeling and trains algorithms on labeled datasets, whereas unsupervised learning identifies patterns and structure in data without labels, which helps reveal how complex datasets are organized.
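
The contrast can be sketched with two small pure-Python routines on one-dimensional data. The supervised routine is a nearest-centroid classifier that needs labels. The unsupervised routine is k-means with k = 2 and receives no labels at all. Both are deliberately simplified illustrations, and the function names and sample values are assumptions rather than anything prescribed by the course.

```python
def train_nearest_centroid(points, labels):
    """Supervised: use the labels to compute one mean (centroid) per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Predict the label whose centroid lies closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def two_means(points, iterations=10):
    """Unsupervised: split unlabeled points into two clusters
    (1-D k-means with k = 2) and return the two cluster centers."""
    low, high = min(points), max(points)
    for _ in range(iterations):
        left = [x for x in points if abs(x - low) <= abs(x - high)]
        right = [x for x in points if abs(x - low) > abs(x - high)]
        if not right:  # all points identical: only one cluster exists
            break
        low, high = sum(left) / len(left), sum(right) / len(right)
    return low, high


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
    labels = ["small", "small", "small", "large", "large", "large"]

    centroids = train_nearest_centroid(points, labels)
    print(predict(centroids, 1.5))  # uses what was learned from the labels
    print(two_means(points))        # finds the two groups without labels
```

On this sample the unsupervised routine recovers roughly the same two groups that the labels describe, but it cannot name them. Attaching meaning to the clusters is left to the analyst.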