Data Science Concepts and Techniques

The document outlines a series of questions related to data science, including topics such as the differences between data structures, machine learning techniques, data collection strategies, and the data science lifecycle. It also includes practical coding tasks involving NumPy arrays and pandas Series and DataFrame operations. Additionally, it covers the roles of data science professionals and various applications of data science across different fields.

Uploaded by prem prasad

Q1 Attempt any four parts (4x5=20)

a) What is a Series, and how is it different from a 1-D array, a list, and a dictionary?
b) Differentiate between supervised and unsupervised learning techniques.
c) Specify any four Python libraries and their applications.
d) Describe any five data collection strategies.
e) Give 4 ways of creating NumPy arrays.
f) Explain four major tasks in data pre-processing.

Q2. (a) Explain the roles and responsibilities of any five Data Science professionals. (5)
(b) What are the various types of data in statistics? Explain with examples. (5)

Q3. (a) What is Data Science Lifecycle? Explain all stages with diagram. (5)
(b) What are missing values? What are the strategies to handle them? Explain four methods of imputation, giving an example of each. (5)

Q4 (a) Explain four methods of creating a DataFrame, using: (5)
i. Multiple lists of different lengths
ii. Multiple Series objects
iii. A nested dictionary
iv. A NumPy array
(b) Explain five applications/use in different fields of Data Science. (5)

Q5 Give 4 ways of creating NumPy arrays. (10)

Give the code or syntax to perform the following operations on two 2D NumPy arrays, array1 and array2, and a 1D array, array3.
a. Add array1 and array2.
b. Find the sum of array1 elements over a given axis.
c. Find the product of array2 elements over a given axis.
d. Change the dimension of array3 to 2D.
e. Transpose the array created in part d.
f. Display 2 rows and the third column of the 2D array array1.
g. Join two 2D arrays along rows.
h. Convert array2 to a 1D array.
i. Split array1 into multiple subarrays.

Q6 Give 4 ways of creating a Series, using a list, an array, a dictionary, and a scalar value. (10)
a) Write Python code to create the following Series:
101 Harsh
102 Arun
103 Ankur
104 Harpal
105 Divya
106 Jeet
b) Show details of the first 3 employees using the head function.
c) Show details of the last 3 employees using the tail function.
d) Show details of the first 3 employees without using the head function.
e) Show details of the last 3 employees without using the tail function.
f) Show the value at index 102.
g) Show the 2nd to 4th records.
h) Show the values at indexes 101, 103, and 105.
i) Show details of “Arun”.

Q7. Create a DataFrame for the data given below. (10)


Write code to perform the following operations on the above DataFrame:
i. Print the batsman name along with runs scored in Test and T20 using column names and dot notation.
ii. Display the batsman name along with runs scored in ODI using loc.
iii. Display the details of batsmen who scored:
    More than 2000 in ODI
    Less than 2500 in Test
    More than 1500 in T20
iv. Display the columns using column index numbers such as 0, 2, 4.
v. Display the alternate rows.
vi. Reindex the DataFrame created above with batsman name and delete the data of Hardik Pandya and Shikhar Dhawan by their index from the original DataFrame.
vii. Insert 2 rows in the DataFrame and delete the rows whose index is 1 and 4.
viii. Delete the column Test, add one more column total at the end (next to the T20 column), and store the total of ODI and T20 runs in that column.
ix. Rename column T20 to “T20I Runs”.
x. Print the DataFrame without headers.

OR
Q8. Create the following DataFrame Sales containing year-wise sales figures for five salespersons in INR. Use the years as column labels, and salesperson names as row labels. (10)

            2014     2015     2016     2017
Madhu      100.5    12000     2000    50000
Kusum      150.8    18000     5000    60000
Kinshuk    200.9    22000    70000    70000
Ankit      30000    30000     1000    80000
Shruti     40000    45000     1250    90000

a. Display the row labels of Sales.
b. Display the column labels of Sales.
c. Display the dimensions, shape, size and values of Sales.
d. Display the last two rows of Sales.
e. Display the first two columns of Sales.
f. Change the DataFrame Sales such that it becomes its transpose.
g. Add data to Sales for salesman Sumeet where the sales made are [196.2, 37800, 52000, 78438] in the years [2014, 2015, 2016, 2017] respectively.
h. Delete the data for the year 2014 from the DataFrame Sales.
i. Update the sale made by Shruti in 2017 to 100000.
j. Write the values of DataFrame Sales to a comma-separated file [Link] on the disk. Do not write the row labels and column labels.
k. Change the name of the salesperson Ankit to Vivaan and Kinshuk to Shailesh.
l. Delete the data for salesman Madhu from the DataFrame Sales.

Common questions

Common Python libraries used in data analysis include: 1) NumPy, which supports large, multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays; 2) Pandas, which provides labeled data structures (Series and DataFrame) and tools for data manipulation and analysis; 3) Matplotlib, for creating static, interactive, and animated visualizations; and 4) Scikit-learn, which provides machine learning tools for tasks such as classification, regression, and clustering.

Handling missing data can be approached with several strategies: 1) Deletion - removing the records or features with missing values; 2) Mean/Median Imputation - replacing missing values with the mean or median of the column; 3) Mode Imputation - using the most frequent value in the column to fill in missing entries; and 4) Prediction Model - using a predictive model to estimate and replace missing values based on other features. Each method has its own merit and depends on the data and context; deletion is straightforward but can lead to substantial data loss, while predictive modeling may give the most accurate estimates but is computationally intensive.
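As a rough illustration of the first three strategies, the sketch below applies deletion, mean/median imputation, and mode imputation with pandas. The column names and values are made up for the example, and the predictive-model approach is left out because it depends on the choice of model.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in a numeric column
# and one gap in a categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "city": ["Delhi", "Pune", None, "Delhi"],
})

# 1) Deletion: drop every row that has at least one missing value.
dropped = df.dropna()

# 2) Mean and median imputation for the numeric column.
age_mean = df["age"].fillna(df["age"].mean())
age_median = df["age"].fillna(df["age"].median())

# 3) Mode imputation for the categorical column.
city_mode = df["city"].fillna(df["city"].mode()[0])
```

Here deletion keeps only two of the four rows, which shows how quickly it can discard data on a small table.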

Creating a DataFrame from multiple lists of different lengths in pandas depends on how the lists are passed. If the lists are passed as rows (a list of lists), pandas pads the shorter rows with missing values so that every row has the same number of columns. If they are passed as columns in a dictionary, pandas raises a ValueError because the lengths do not match; wrapping each list in a Series avoids this, since pandas then aligns the columns on their index and fills NaN where a shorter Series has no entry. The resulting missing values may need imputation before analysis.
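A small sketch of how unequal-length lists behave in practice; the list contents and column names are hypothetical. Whether pandas pads with missing values or raises an error depends on how the lists are handed to the constructor.

```python
import pandas as pd

# Hypothetical lists of unequal length.
names = ["Harsh", "Arun", "Ankur"]
scores = [81, 92]

# Passed as rows: the shorter row is padded with a missing value.
by_rows = pd.DataFrame([names, scores])

# Passed as dict columns: unequal lengths raise ValueError.
try:
    pd.DataFrame({"name": names, "score": scores})
    raised = False
except ValueError:
    raised = True

# Wrapping each list in a Series aligns on the index and fills NaN.
by_cols = pd.DataFrame({"name": pd.Series(names), "score": pd.Series(scores)})
```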

Effective data collection strategies include: 1) Surveys - structured questionnaires which, if designed well, can gather wide-ranging data; 2) Observations - collecting data by monitoring subjects, often used in behavioral studies; 3) Interviews - obtaining detailed data through interactive conversation; 4) Experiments - collecting data under controlled conditions for causal inference; and 5) Transactions - automatic logging of events in systems, ideal for large and high-velocity data. The quality of data collected by these methods depends on design, execution, and the minimization of bias and errors. Good data collection practices result in high-quality, reliable data which is crucial for accurate analysis.

The Data Science Lifecycle consists of a series of iterative stages: 1) Problem Definition - understanding and defining the problem to solve; 2) Data Collection - gathering data relevant to the problem; 3) Data Cleaning and Preparation - processing raw data for analysis; 4) Exploratory Data Analysis - summarizing main characteristics using visual and quantitative methods; 5) Modeling - selecting and applying machine learning algorithms; 6) Evaluation - assessing the model's performance; 7) Deployment - integrating the model into the decision-making process; and 8) Monitoring and Maintenance - ensuring that the model remains relevant and accurate.

A data scientist's role involves extracting insights from data through the application of statistical, analytical, and machine learning techniques; this includes building models, testing hypotheses, and interpreting data. In contrast, a data engineer focuses on the design, construction, and maintenance of systems to collect, store, and analyze data. They ensure that the infrastructure for data generation and processing is robust and efficient. While data scientists create models and derive insights, data engineers build the pipelines that support that work.

Data imputation is the process of replacing missing data with substituted values to maintain dataset integrity. This is crucial in pre-processing as missing data can result in biased estimates and affect data analysis outcomes. Imputation techniques like mean, median, or mode filling, using predictive models, or neighbor-based imputations, help maintain consistency and comprehensiveness of datasets without discarding useful data. Proper imputation aids in preserving statistical power and ensures more accurate and robust analysis results.

Supervised learning techniques involve training a model on a labeled dataset, meaning each training example is paired with an output label. This allows the model to learn the mapping from inputs to outputs, aiding tasks such as classification and regression. In contrast, unsupervised learning methods work with unlabeled data, and the system tries to learn patterns and structures from the data itself, commonly used in clustering and association tasks.
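The contrast can be sketched with NumPy alone. The data below is invented for illustration: the supervised part fits a line to labeled pairs, and the unsupervised part groups unlabeled points with a deliberately tiny two-cluster k-means loop. This is only a toy under those assumptions, not a substitute for a library implementation.

```python
import numpy as np

# Supervised: inputs x come with labels y, and the model learns x -> y.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])          # hypothetical labels, y = 2x
slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line fit
prediction = slope * 5.0 + intercept        # predict for an unseen input

# Unsupervised: only unlabeled points; group them into two clusters.
points = np.array([1.0, 1.2, 0.8, 9.0, 9.5, 8.5])
centers = np.array([points.min(), points.max()])  # simple initial guess
for _ in range(10):
    # Assign each point to its nearest center.
    labels = np.argmin(np.abs(points[:, None] - centers[None, :]), axis=1)
    # Move each center to the mean of its assigned points.
    centers = np.array([points[labels == k].mean() for k in range(2)])
```

In the first half the correct outputs are supplied during training; in the second half the cluster labels are produced by the algorithm itself.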

A Series in pandas is a one-dimensional labeled array capable of holding any data type, similar to a column in a table. Unlike a plain 1-D NumPy array, which is addressed only by integer position, a Series carries an index of labels, and operations between two Series align on those labels. Compared to a list, a Series provides vectorized arithmetic and built-in statistical operations. A dictionary also pairs keys with values, and since Python 3.7 it preserves insertion order, but it supports neither positional slicing nor vectorized operations; a Series can be accessed both by label and by position.
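The sketch below puts the four structures side by side, borrowing a few employee names from Q6 as sample data. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

names = ["Harsh", "Arun", "Ankur"]                   # list: positions only
arr = np.array(names)                                # 1-D array: positions only
lookup = {101: "Harsh", 102: "Arun", 103: "Ankur"}   # dict: keys only

# A Series carries both custom labels and positional order.
s = pd.Series(names, index=[101, 102, 103])

by_label = s.loc[102]      # label-based access, like a dict key
by_position = s.iloc[1]    # position-based access, like a list or array
sliced = s.iloc[0:2]       # slicing keeps the labels attached

# Arithmetic between two Series aligns on labels, not positions.
a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "x"])
total = a + b
```

Using `.loc` and `.iloc` explicitly avoids ambiguity when the labels are themselves integers, as they are here.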

Converting a 2D NumPy array to a 1D array involves flattening the array using methods such as `flatten()` or `ravel()`. This process lays out all the elements, row by row by default, in a single one-dimensional array. The two methods differ in memory behavior: `flatten()` always returns a copy, while `ravel()` returns a view of the original data whenever possible, so it avoids copying but changes to the result can affect the original array. Flattening is useful for operations that expect one-dimensional input.
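A short sketch of the two methods on a small, contiguous array (the values are arbitrary). It also shows `reshape(-1)`, which is an additional option not mentioned above.

```python
import numpy as np

array2 = np.array([[1, 2, 3], [4, 5, 6]])

flat_copy = array2.flatten()    # always a new 1-D copy
flat_view = array2.ravel()      # a view when the memory layout allows it
reshaped = array2.reshape(-1)   # behaves like ravel for a contiguous array

flat_copy[0] = 100   # does not touch array2
flat_view[1] = 200   # writes through to array2, since this ravel is a view
```

If the original array must stay unchanged, `flatten()` is the safer choice; `ravel()` is preferable when avoiding a copy matters.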