Big Data Course Syllabus Overview

Uploaded by

Dr. Neetu Sharma

BCA-IOP BADA

Name of The Course: Introduction to Big Data Science
Course Code: BCABA1101
L T P C: 3 0 2 4
Theory: IA 20, MTE 15, ETE 30
Practical: PR 15, ETE 20

Prerequisite:
Corequisite:
Antirequisite:

Course Objectives:

The student should be made to:

Course Outcomes

CO1 Describe what Data Science is and the skill sets needed to be a data scientist.
CO2 Explain in basic terms what Statistical Inference means. Identify probability distributions commonly used as foundations for statistical modeling. Fit a model to data.
CO3 Explain the significance of exploratory data analysis (EDA) in data science. Apply basic tools (plots, graphs, summary statistics) to carry out EDA.
CO4 Describe the Data Science Process and how its components interact. Use APIs and other tools to scrape the Web and collect data.
CO5 Identify and explain fundamental mathematical and algorithmic ingredients that constitute a Recommendation Engine (dimensionality reduction, singular value decomposition, principal component analysis). Build their own recommendation system using existing components.
CO6 Describe advances and the latest trends in data science.

Text Book (s)


1. Cathy O'Neil and Rachel Schutt. Doing Data Science: Straight Talk from the Frontline. O'Reilly. 2014.

Reference Book (s)

1. Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge University Press. 2014. (free online)
2. Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020. 2013.
3. Foster Provost and Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-analytic Thinking. ISBN 1449361323. 2013.
4. Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning, Second Edition. ISBN 0387952845. 2009. (free online)
5. Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. (Note: this is a book currently being written by the three authors. The authors have made the first draft of their notes for the book available online. The material is intended for a modern theoretical course in computer science.)
6. Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. 2014.
7. Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and Techniques, Third Edition. ISBN 0123814790. 2011.

Unit-1 Introduction to BI 8 hours

What is Data Science? - Big Data and Data Science hype, and getting past the hype - Why now? - Datafication - Current landscape of perspectives - Skill sets needed.

Statistical Inference - Populations and samples - Statistical modelling, probability distributions, fitting a model - Intro to R.

Unit-2 Exploratory Data Analysis and the Data Science Process 8 hours

Exploratory Data Analysis and the Data Science Process - Basic tools (plots, graphs and summary statistics) of EDA - Philosophy of EDA - The Data Science Process - Case Study: RealDirect (online real estate firm).

Three Basic Machine Learning Algorithms - Linear Regression - k-Nearest Neighbors (k-NN) - k-means.
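
As an illustration of the first of the three algorithms listed in this unit, the following is a minimal sketch of simple linear regression fitted by ordinary least squares. The function name `fit_line` and the toy data are assumptions made for this example; they do not come from the syllabus or the textbook.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical toy data lying exactly on the line y = 1 + 2x
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

With real data the points will not lie exactly on a line, and the fitted coefficients are the ones that minimise the sum of squared residuals.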

Unit-3 Machine Learning Algorithm and Usage in Applications 8 hours

Motivating application: Filtering Spam - Why Linear Regression and k-NN are poor choices for Filtering Spam - Naive Bayes and why it works for Filtering Spam - Data Wrangling: APIs and other tools for scraping the Web.

Feature Generation and Feature Selection (Extracting Meaning From Data) - Motivating application: user (customer) retention - Feature Generation (brainstorming, role of domain expertise, and place for imagination) - Feature Selection algorithms - Filters; Wrappers; Decision Trees; Random Forests.
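
The spam-filtering topic in this unit can be illustrated with a small sketch of a Naive Bayes classifier over word counts with add-one (Laplace) smoothing. The training messages, the labels `spam` and `ham`, and the function names are all hypothetical; this is one common formulation, not necessarily the one used in the course text.

```python
import math
from collections import Counter


def train(messages):
    """messages: list of (text, label) pairs, label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocab


def log_score(text, label, model):
    """Log prior plus sum of smoothed per-word log likelihoods."""
    word_counts, label_counts, vocab = model
    total_words = sum(word_counts[label].values())
    score = math.log(label_counts[label] / sum(label_counts.values()))
    for word in text.lower().split():
        # Add-one smoothing keeps unseen words from zeroing the product
        count = word_counts[label][word] + 1
        score += math.log(count / (total_words + len(vocab)))
    return score


def classify(text, model):
    return max(("spam", "ham"), key=lambda label: log_score(text, label, model))


# Hypothetical training set
training = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]
model = train(training)
print(classify("free money", model))
print(classify("meeting at noon", model))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.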

Unit-4 Building a User-Facing Data Product 8 hours

Algorithmic ingredients of a Recommendation Engine - Dimensionality Reduction - Singular Value Decomposition - Principal Component Analysis - Exercise: build your own recommendation system.

Mining Social-Network Graphs - Social networks as graphs - Clustering of graphs - Direct discovery of communities in graphs - Partitioning of graphs - Neighborhood properties in graphs.
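
To make the link between Singular Value Decomposition and recommendation concrete, here is a rough sketch that finds the leading singular triplet of a small ratings matrix by power iteration and uses it as a rank-1 approximation. The function name `top_singular_triplet`, the all-ones starting vector, and the toy ratings are assumptions for illustration; a real system would use a numerical library and keep more than one factor.

```python
import math


def top_singular_triplet(a, iterations=100):
    """Leading (u, sigma, v) of matrix a via power iteration on A^T A.

    Assumes a is a non-zero matrix with non-negative entries, so the
    all-ones start vector is not orthogonal to the leading direction.
    """
    rows, cols = len(a), len(a[0])
    v = [1.0] * cols
    for _ in range(iterations):
        w = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # A v
        v = [sum(a[i][j] * w[i] for i in range(rows)) for j in range(cols)]  # A^T (A v)
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    w = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in w))
    u = [x / sigma for x in w]
    return u, sigma, v


# Hypothetical ratings matrix (rows: users, columns: items), exactly rank 1
ratings = [[2.0, 4.0],
           [1.0, 2.0]]
u, sigma, v = top_singular_triplet(ratings)
# Rank-1 reconstruction: predicted rating of item j by user i
predicted = [[sigma * u[i] * v[j] for j in range(2)] for i in range(2)]
print(sigma, predicted)
```

Here `u` plays the role of a latent user factor and `v` of a latent item factor; the product `sigma * u[i] * v[j]` is the predicted rating.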

Unit-5 Data Visualization and Ethical Issues 8 hours

Basic principles, ideas and tools for data visualization - Examples of inspiring (industry) projects - Exercise: create your own visualization of a complex dataset.

Discussions on privacy, security, ethics - A look back at Data Science - Next-generation data scientists.

Unit-6 Research 8 hours

Advances and the latest trends in the areas covered in the course, as well as their latest applications.

The latest research conducted in the areas covered in the course.

Discussion of some of the latest papers published in IEEE and ACM transactions, Web of Science and SCOPUS indexed journals, and high-impact-factor conferences and symposiums.

Discussion of some of the latest products available in the market based on the areas covered in the course, and of patents filed in those areas.


BCA-IOP Big Data

Name of The Course: Foundation of Big Data System
Course Code: BCABI1101
L T P C: 3 0 2 4
Theory: IA 20, MTE 15, ETE 30
Practical: PR 15, ETE 20

Prerequisite:
Corequisite:
Antirequisite:

COURSE OBJECTIVES:

Understand the data science process and learn the techniques, tools, statistical methodologies and machine learning algorithms used in that process.

COURSE OUTCOMES:

CO1 Students should know about the design issues of the Hadoop architecture.
CO2 Students should learn various techniques for big data analytics.
CO3 Students are able to identify real-time problems and design solutions using various big data analytics techniques.
CO4 Students can apply supervised and unsupervised learning for prediction.
CO5 Students can use classification and clustering algorithms.
CO6 Students can understand current research trends in big data.

COURSE CONTENT: Hours

UNIT I INTRODUCTION TO BIG DATA: 9

Introduction - distributed file system - Big Data and its importance, Four V's in big data, Drivers for Big Data, Big data analytics, Big data applications. Algorithms using MapReduce, Matrix-Vector Multiplication by MapReduce.
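
The matrix-vector multiplication topic above can be sketched in plain Python by imitating the two MapReduce phases in a single process. This is an illustrative sketch only: it assumes the whole vector fits in memory on every mapper, stores the matrix as hypothetical `(row, column, value)` triples, and does not use any Hadoop API.

```python
from collections import defaultdict


def map_phase(matrix_entries, vector):
    """For each matrix entry (i, j, m_ij), emit the pair (i, m_ij * v_j)."""
    for i, j, m_ij in matrix_entries:
        yield i, m_ij * vector[j]


def reduce_phase(pairs):
    """Sum all values that share the same key (the row index i)."""
    sums = defaultdict(float)
    for key, value in pairs:
        sums[key] += value
    return dict(sums)


# Hypothetical 2 x 3 sparse matrix as (row, column, value) triples
entries = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0)]
v = [4.0, 5.0, 6.0]
result = reduce_phase(map_phase(entries, v))
print(result)
```

In a real cluster the framework groups the emitted pairs by key between the two phases, so each reducer sees all partial products for one row.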

UNIT II INTRODUCTION TO HADOOP: 9

Big Data - Apache Hadoop & Hadoop Ecosystem - Moving Data in and out of Hadoop - Understanding inputs and outputs of MapReduce - Data Serialization.

UNIT III HADOOP ARCHITECTURE: 9

Hadoop Architecture, Hadoop Storage: HDFS, Common Hadoop Shell commands, Anatomy of File Write and Read, NameNode, Secondary NameNode, and DataNode, Hadoop MapReduce paradigm, Map and Reduce tasks, Job and Task Trackers - Cluster Setup - SSH & Hadoop Configuration - HDFS Administration - Monitoring & Maintenance.

UNIT IV HADOOP ECOSYSTEM AND YARN: 9

Hadoop ecosystem components - Schedulers - Fair and Capacity, Hadoop 2.0 New Features - NameNode High Availability, HDFS Federation, MRv2, YARN, Running MRv1 in YARN.

UNIT V HIVE AND HIVEQL, HBASE: 9

Hive Architecture and Installation, Comparison with Traditional Database, HiveQL - Querying Data - Sorting and Aggregating, MapReduce Scripts, Joins & Subqueries, HBase concepts - Advanced Usage, Schema Design, Advanced Indexing - Pig, ZooKeeper - how it helps in monitoring a cluster, how HBase uses ZooKeeper, and how to build applications with ZooKeeper.

Unit VI 5 hours

Advances and the latest trends in the areas covered in the course, as well as their latest applications.

The latest research conducted in the areas covered in the course.

Discussion of some of the latest papers published in IEEE and ACM transactions, Web of Science and SCOPUS indexed journals, and high-impact-factor conferences and symposiums.

Discussion of some of the latest products available in the market based on the areas covered in the course, and of patents filed in those areas.


Reference Books

1. Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich, “Professional Hadoop Solutions”, Wiley, ISBN: 9788126551071, 2015.
2. Chris Eaton, Dirk deRoos et al., “Understanding Big Data”, McGraw Hill, 2012.
3. Tom White, “Hadoop: The Definitive Guide”, O'Reilly, 2012.
4. Vignesh Prajapati, “Big Data Analytics with R and Hadoop”, Packt Publishing, 2013.
5. Tom Plunkett, Brian Macdonald et al., “Oracle Big Data Handbook”, Oracle Press, 2014.
6. Jay Liebowitz, “Big Data and Business Analytics”, CRC Press, 2013.
