INT375 Data Science Toolbox Syllabus

The INT375 course focuses on Python programming for data science, covering fundamentals, data manipulation with NumPy and Pandas, data visualization with Matplotlib and Seaborn, exploratory data analysis, statistical analysis, and the role of machine learning. Students will engage in practical experiments to reinforce their understanding of these concepts. Key textbooks include 'Python for Data Science' and 'Data Science and Machine Learning Using Python'.

Uploaded by

pawankalayan0209

INT375: DATA SCIENCE TOOLBOX: PYTHON PROGRAMMING

L:2 T:0 P:2 Credits:3

Course Outcomes: Through this course students should be able to

CO1 :: understand and apply Python programming fundamentals

CO2 :: utilize NumPy and Pandas for efficient data manipulation, cleaning, and preparation.

CO3 :: apply clear and effective data visualizations using Matplotlib and Seaborn to analyze and
communicate data insights.
CO4 :: execute exploratory data analysis to uncover data insights using Python

CO5 :: perform statistical analysis and hypothesis testing using Python

CO6 :: associate the role of machine learning in data science

Unit I
Introduction to Python for Data Science : Overview of Data Science, Basic Syntax and Data
Types, Control Structures (if statements, loops), Functions and Modules
Unit II
Data Manipulation with NumPy and Pandas : Introduction to NumPy: Arrays, Operations, Data
Manipulation with Pandas: Series and DataFrames, Data Cleaning and Preparation, Handling Missing
Data
Unit III
Data Visualization with Matplotlib and Seaborn : Principles of Data Visualization, Creating Plots
with Matplotlib, Advanced Visualization with Seaborn, Customizing Visualizations
Unit IV
Exploratory Data Analysis (EDA) : Understanding EDA and its Importance, Summary Statistics,
Correlation and Covariance, Outlier Detection
Unit V
Introduction to Statistical Analysis : Descriptive and Inferential Statistics, Hypothesis Testing: Z-test,
t-test, p-test, chi-squared test, variance inflation factor (VIF), Shapiro-Wilk test, Probability
Distributions: Uniform Distribution, Normal Distribution, Binomial Distribution, Poisson Distribution,
Introduction to A/B Testing
Unit VI
Exploring the role of machine learning in data science : Introduction to Machine Learning
Concepts, Supervised vs. Unsupervised Learning, Understand CRISP-DM framework using Linear
Regression model, Introduction to Classification
Recent Trends : Generative AI and Its Applications: GPT-4, DALL-E, Synthetic Data Generation

List of Practicals / Experiments:


• Exploring and understanding Basics of Python Language

• Exploring and understanding the basic concepts of Data Science and components of Python

• Exploring different Control Structures and functions in Python

• Practical on NumPy Package

• Practical to demonstrate working with Data in Python

• Practical to demonstrate working with NumPy Arrays

• Practical on Pandas Package

• Practical on Visualization with Matplotlib

• Practical demonstration on EDA, Summary Statistics

• Practical demonstration on Correlation and Covariance, Outlier Detection

• Practical demonstration on Outlier Detection

Session 2024-25 Page:1/2


• Practical Demonstration on Descriptive and Inferential Statistics, Hypothesis testing

• Practical Demonstration on Hypothesis testing, Probability Distributions

• Practical Demonstration on CRISP-DM framework using Linear Regression model

Text Books:
1. PYTHON FOR DATA SCIENCE by MOHD. ABDUL HAMEED, WILEY

2. DATA SCIENCE AND MACHINE LEARNING USING PYTHON by REEMA THAREJA, MCGRAW HILL
References:
1. FOUNDATIONAL PYTHON FOR DATA SCIENCE, 1ST EDITION by KENNEDY BEHRMAN, PEARSON
2. DATA SCIENCE FROM SCRATCH by JOEL GRUS, O'REILLY


Common questions

NumPy and Pandas support data cleaning and preparation through two core structures, arrays and DataFrames, that store and manipulate data efficiently. NumPy's array operations run mathematical computations over whole arrays at once, while Pandas provides functions for handling missing data, filtering, and grouping, all of which are central to data cleaning. Pandas' Series and DataFrame objects also make it straightforward to chain cleaning steps, apply transformations, and prepare data for further analysis or visualization.
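As a rough illustration of the cleaning steps described above, the following sketch removes a duplicate row, fills a missing value, and groups the result. The column names and values are invented for the example, and filling with the column mean is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one missing temperature and one duplicated row.
raw = pd.DataFrame(
    {
        "city": ["Delhi", "Mumbai", "Pune", "Pune", "Delhi"],
        "temp_c": [31.0, np.nan, 28.0, 28.0, 33.0],
    }
)

# Drop exact duplicate rows and renumber the index.
clean = raw.drop_duplicates().reset_index(drop=True)

# Fill the missing temperature with the mean of the observed values.
clean["temp_c"] = clean["temp_c"].fillna(clean["temp_c"].mean())

# Vectorized NumPy arithmetic: convert every value to Fahrenheit at once.
temp_f = clean["temp_c"].to_numpy() * 9 / 5 + 32

# Group by city and compute the mean temperature per group.
mean_by_city = clean.groupby("city")["temp_c"].mean()
print(mean_by_city)
```

Whether to fill, drop, or flag missing values depends on the dataset; the mean fill here is chosen only because it is short to show.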

Recent trends such as Generative AI and synthetic data generation influence data science by expanding what is possible in data augmentation, privacy, and scalability. Generative AI models such as GPT-4 support natural language processing and creative tasks, while DALL-E generates images from text prompts. Synthetic data generation offers an alternative when real data is scarce or sensitive, since synthetic records can stand in for real ones without exposing them. These developments support new research and applications and make data science tools accessible to more fields.

Supervised and unsupervised learning are the two main frameworks for pattern recognition and predictive modeling in data science. Supervised learning trains models on labeled data to make predictions or classifications, as in fraud detection or customer churn prediction. Unsupervised learning does not use labeled responses, which makes it useful for data exploration and for discovering hidden patterns or groupings, such as customer segments. Together, these paradigms let data scientists extract meaningful insights and support automated decision-making across many domains.
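The difference between the two paradigms can be sketched with NumPy alone. The example below uses a nearest-centroid classifier for the supervised case and a small k-means loop for the unsupervised case; both algorithms and all data values are choices made for this illustration and are not named in the syllabus.

```python
import numpy as np

# Hypothetical 2-D points forming two well-separated groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, used only in the supervised part


def nearest_center(points, centers):
    """Return, for each point, the index of the closest center."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


# Supervised: the labels y decide which points define each class centroid.
class_centroids = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
new_points = np.array([[1.1, 1.0], [5.1, 5.0]])
predicted = nearest_center(new_points, class_centroids)


def kmeans(points, centers, n_iter=10):
    """Very small k-means: alternate assignment and centroid update."""
    for _ in range(n_iter):
        labels = nearest_center(points, centers)
        centers = np.array([points[labels == k].mean(axis=0)
                            for k in range(len(centers))])
    return labels, centers


# Unsupervised: no labels are given; groups are found from X alone.
cluster_labels, cluster_centers = kmeans(X, centers=X[[0, -1]].copy())
print(predicted, cluster_labels)
```

The k-means loop starts from two fixed points so the result is deterministic; it does not handle empty clusters, which a production implementation would need to.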

Mastering Python programming fundamentals matters because it is the foundation for using tools like NumPy and Pandas effectively. A solid understanding of basic syntax, data types, and control structures (e.g., if statements, loops) enables consistent and efficient data handling, which every data science task depends on. Functions and modules allow code to be encapsulated and reused, reducing redundancy and improving readability. This knowledge also makes it easier to adopt the more advanced data manipulation operations offered by specialized data science libraries.
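A minimal sketch of these fundamentals, using only the standard library, might look like the following. The function name `summarize` and the sample values are hypothetical.

```python
import statistics  # a standard-library module, imported for comparison


def summarize(values):
    """Return the count, mean, and number of negative entries in a list."""
    total = 0.0
    negatives = 0
    for v in values:      # loop over the sequence
        total += v
        if v < 0:         # conditional branch
            negatives += 1
    mean = total / len(values) if values else None
    return {"count": len(values), "mean": mean, "negatives": negatives}


result = summarize([4, -2, 7, 1])
library_mean = statistics.mean([4, -2, 7, 1])  # same mean via a module
print(result, library_mean)
```

Writing the loop by hand shows the control structures; in practice the module function is the more idiomatic choice.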

Exploratory data analysis techniques such as outlier detection and correlation analysis improve both data quality and the insights drawn from it in real-world applications. Outlier detection identifies anomalous data points that could skew results, or that point to events of interest such as fraud or sensor failures. Correlation analysis reveals relationships between variables and guides feature selection and model design. These techniques improve decision-making, reduce the risk of misinterpreting data, and make analytical results more robust.
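One common way to carry out both steps is sketched below: the 1.5 × IQR rule for outliers and the Pearson coefficient for correlation. The rule, the threshold, and the sample readings are assumptions made for this example; other outlier criteria (for instance z-scores) are equally valid.

```python
import numpy as np

# Hypothetical sensor readings with one suspicious value.
readings = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0, 11.0, 10.0])

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(readings, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = readings[(readings < lower) | (readings > upper)]

# Pearson correlation between two hypothetical variables.
hours_studied = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
exam_score = np.array([52.0, 55.0, 61.0, 64.0, 70.0])
r = np.corrcoef(hours_studied, exam_score)[0, 1]

print("outliers:", outliers)
print("correlation:", round(r, 3))
```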

Key steps in exploratory data analysis include summarizing the main characteristics of the data with summary statistics, identifying patterns through correlation and covariance, and detecting outliers. EDA builds understanding of the data by uncovering its structure, relationships, and peculiarities, which may reveal new insights or point to further data transformation. It also helps in formulating hypotheses and selecting appropriate statistical methods for deeper analysis, giving the overview needed for informed decisions.
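The first two steps can be sketched with Pandas as follows. The column names and measurements are invented for the example.

```python
import pandas as pd

# Hypothetical measurements for five people.
df = pd.DataFrame(
    {
        "height_cm": [150, 160, 170, 180, 190],
        "weight_kg": [50, 59, 65, 75, 81],
    }
)

summary = df.describe()  # count, mean, std, min, quartiles, max per column
cov = df.cov()           # sample covariance matrix
corr = df.corr()         # Pearson correlation matrix

print(summary)
print(cov)
print(corr)
```

Covariance depends on the units of each column, while correlation is scaled to the range -1 to 1, which is why both are usually inspected together.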

Hypothesis tests such as the t-test and chi-squared test are fundamental to statistical analysis in data science because they let researchers infer population characteristics from sample data. The t-test evaluates whether the means of two groups differ significantly, which supports comparisons in experimental data. The chi-squared test assesses whether categorical variables are independent by comparing observed counts against expected ones, which is useful for survey data. These tests help confirm or refute assumptions and so underpin evidence-based conclusions in research.
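Both tests can be run in a few lines, assuming SciPy is installed; the syllabus names the tests but not a specific library, so the use of `scipy.stats` here is an assumption, as are all the sample values.

```python
from scipy import stats

# Hypothetical measurements from two independent groups.
group_a = [5.1, 4.9, 5.0, 5.2, 5.1, 4.8]
group_b = [5.9, 6.1, 6.0, 5.8, 6.2, 6.0]

# Two-sample t-test (Welch's variant, which does not assume equal variances).
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Hypothetical 2x2 contingency table: rows are groups, columns are answers.
table = [[30, 10],
         [10, 30]]

# Chi-squared test of independence between the row and column variables.
chi2, chi_p, dof, expected = stats.chi2_contingency(table)

alpha = 0.05  # conventional significance level, chosen for illustration
print("t-test rejects equal means:", t_p < alpha)
print("chi-squared rejects independence:", chi_p < alpha)
```

A small p-value indicates that the observed difference would be unlikely if the null hypothesis were true; it does not measure the size of the effect.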

Understanding different probability distributions is critical because they underpin many statistical techniques, including hypothesis testing, estimation, and prediction. The normal distribution is especially important because of its role in the central limit theorem, on which many statistical tests rely. The binomial distribution models the number of successes in a fixed number of binary (success/failure) trials, while the Poisson distribution models the number of events occurring in a fixed interval and is often used for rare events. Knowing these distributions allows analysts to apply methods correctly and make informed decisions based on the characteristics of the data.
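To make the four distributions from the syllabus concrete, the sketch below draws samples from each with NumPy and compares the sample means with the theoretical ones. The parameter values and the seed are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 100_000

samples = {
    "uniform": rng.uniform(0.0, 1.0, size=n),         # theoretical mean 0.5
    "normal": rng.normal(loc=0.0, scale=1.0, size=n),  # theoretical mean 0.0
    "binomial": rng.binomial(n=10, p=0.3, size=n),     # theoretical mean 10 * 0.3 = 3.0
    "poisson": rng.poisson(lam=2.0, size=n),           # theoretical mean 2.0
}

theoretical_mean = {"uniform": 0.5, "normal": 0.0, "binomial": 3.0, "poisson": 2.0}

for name, values in samples.items():
    print(f"{name:9s} sample mean={values.mean():.3f} "
          f"theoretical mean={theoretical_mean[name]}")
```

With a sample this large, each sample mean should land very close to its theoretical value, which is a quick sanity check on the chosen parameters.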

The CRISP-DM (Cross Industry Standard Process for Data Mining) framework structures data analysis as a sequence of phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Linear regression fits into the modeling phase, where relationships between variables are quantified and used to generate predictions. This structured approach keeps the analysis systematic, reduces errors, and improves reproducibility, which makes it well suited to implementing data-driven strategies across industries.
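The phases can be walked through in miniature with a one-variable linear regression. The sketch below uses `np.polyfit` and invented data; the business question, the train/hold-out split, and the error metric are all assumptions made for illustration, and the deployment phase is only noted in a comment.

```python
import numpy as np

# Business understanding (hypothetical): predict sales from advertising spend.

# Data understanding: inspect the raw observations.
spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
sales = np.array([3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0])

# Data preparation: hold out the last two observations for evaluation.
x_train, y_train = spend[:5], sales[:5]
x_test, y_test = spend[5:], sales[5:]

# Modeling: fit a straight line y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Evaluation: mean absolute error on the held-out data.
predictions = slope * x_test + intercept
mae = np.mean(np.abs(predictions - y_test))

# Deployment would expose the fitted slope and intercept to other systems.
print(f"slope={slope:.2f} intercept={intercept:.2f} MAE={mae:.3f}")
```

In a real project the phases are iterative: a poor evaluation result typically sends the analyst back to data preparation or modeling.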

Effective data visualization matters because it turns complex data sets into comprehensible insights, conveying trends, patterns, and anomalies clearly and concisely. Matplotlib is a versatile foundation for creating static, animated, and interactive visualizations in Python, whereas Seaborn provides a high-level interface for drawing attractive and informative statistical graphics. Both offer extensive customization, allowing data scientists to tailor graphics to specific audiences and objectives and so communicate data-driven insights more effectively.
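A minimal Matplotlib sketch of a customized line plot is shown below. The data and labels are invented, the non-interactive `Agg` backend is selected so the script runs without a display, and Seaborn is left out to keep the example dependency-light.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window is needed
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures.
months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 135, 128, 150, 162]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(months, sales, marker="o", color="tab:blue", label="Sales")

# Customization: title, axis labels, grid, and legend.
ax.set_title("Monthly sales")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
ax.grid(True, alpha=0.3)
ax.legend()
fig.tight_layout()

# Save the figure as PNG into an in-memory buffer instead of a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

To write the image to disk instead, a file path can be passed to `fig.savefig` in place of the buffer.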
