Core Concepts in Data Science Explained

Open elective syllabus for ECE

Uploaded by

bandiketharun
Copyright
© All Rights Reserved
UNIT-I: Introduction to Core Concepts and Technologies: Introduction, Terminology, data
science process, data science toolkit, Types of data, Example applications


TERMINOLOGY
Data science terminology refers to the specific vocabulary and concepts used within the field
of data science to describe its techniques, tools, and processes. These terms are crucial for
anyone involved in data science as they provide a common language that facilitates clearer
communication, more effective collaboration, and a deeper understanding of complex
concepts.
Algorithm
An algorithm is a set of instructions or rules to follow in order to complete a specific task.
Algorithms can be particularly useful when you’re working with big data or machine learning.
Data analysts may use algorithms to organize or analyze data, while data scientists may use
algorithms to make predictions or build models.
Artificial intelligence (AI)
Artificial intelligence (AI) uses computer science and data to enable problem solving in
machines. In this case, the intelligence is “artificial” because it’s a computer programmed to
perform tasks commonly associated with human intelligence.
Big data
Big data is a large collection of data characterized by the three V’s: volume, velocity, and
variety. Volume refers to the amount of data—big data deals with high volumes of data; velocity
refers to the rate at which data is collected—big data is collected at a high velocity and often
streams directly into memory; and variety refers to the range of data formats—big data tends
to have a high variety of structured, semi-structured, and unstructured data, as well as a variety
of formats such as numbers, text strings, images, and audio.
Business intelligence (BI)
Business intelligence (BI) is data analytics used to empower organizations to make data-driven
business decisions. Business intelligence analysts analyze business data like revenue, sales, or
customer data, and offer recommendations based on their analysis.
Classification
Classification is a machine learning problem that organizes data into categories. You may use
this to create email spam filters, for example. Some examples of algorithms commonly used to
create classification models are logistic regression, decision trees, K-nearest neighbor (KNN),
and random forest.
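As a rough illustration of classification, the sketch below implements a tiny K-nearest neighbor
(KNN) classifier in plain Python. The two features (number of links and number of exclamation
marks in a message), the training rows, and the function name knn_predict are all invented for
this example; a real spam filter would use far more data and features.

```python
from collections import Counter
import math

# Hypothetical training data: (number of links, number of "!" marks) -> label
training = [
    ((0, 0), "ham"),
    ((1, 0), "ham"),
    ((0, 1), "ham"),
    ((5, 4), "spam"),
    ((6, 3), "spam"),
    ((4, 5), "spam"),
]

def knn_predict(point, data, k=3):
    """Label `point` by majority vote among its k nearest training examples."""
    # Sort training examples by Euclidean distance to the query point
    nearest = sorted(data, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((5, 5), training))  # spam
print(knn_predict((0, 2), training))  # ham
```

The choice k=3 is arbitrary here; in practice k is usually tuned on held-out data.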
Dashboard
A dashboard is a tool for monitoring and displaying live data. It is typically connected to a
database and features visualizations that automatically update to reflect the most current data
in the database.
Data analytics
Data analytics is the collection, transformation, and organization of data in order to draw
conclusions, make predictions, and drive informed decision making. Data analytics
encompasses data analysis (the process of deriving information from data), data science (using
data to theorize and forecast), and data engineering (building data systems). Data analysts, data
scientists, and data engineers are all data analytics professionals.
There are four key types of data analytics:
 Descriptive analytics tells us what happened.
 Diagnostic analytics tells us why something happened.
 Predictive analytics tells us what will likely happen in the future.
 Prescriptive analytics tells us how to act.

Data architecture

Data architecture, also called data design, is the plan for an organization’s data management
system. This can include all touchpoints in the data lifecycle, including how the data is
gathered, organized, utilized, and discarded. Data architects design the blueprints that
organizations use for their data management systems.
Data cleaning
Data cleaning, cleansing, or scrubbing is the process of preparing raw data for analysis. When
cleaning your data, you verify that your data is accurate, complete, consistent, and unbiased.
It’s important to make sure you have clean data prior to analysis because unclean or dirty data
can lead to inaccurate conclusions and misguided business decisions.
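A minimal sketch of what a cleaning step might look like in plain Python is shown below. The
records, the field names, and the 0 to 120 age range are assumptions made up for illustration.
Dropping bad rows is only one possible policy; filling in missing values is another.

```python
# Hypothetical raw records; None marks a missing value
raw = [
    {"id": 1, "city": " Hyderabad ", "age": 34},
    {"id": 2, "city": "hyderabad", "age": None},
    {"id": 1, "city": " Hyderabad ", "age": 34},   # duplicate of the first row
    {"id": 3, "city": "Chennai", "age": -5},        # impossible age
]

def clean(records):
    """Return records with duplicates and invalid rows removed and text normalized."""
    seen, cleaned = set(), []
    for row in records:
        if row["id"] in seen:
            continue                      # consistency: drop repeated ids
        if row["age"] is None or not 0 <= row["age"] <= 120:
            continue                      # completeness and accuracy checks
        seen.add(row["id"])
        # Normalize spacing and capitalization of the city name
        cleaned.append({**row, "city": row["city"].strip().title()})
    return cleaned

print(clean(raw))  # [{'id': 1, 'city': 'Hyderabad', 'age': 34}]
```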
Data enrichment
Data enrichment is the process of adding data to an existing dataset. Typically, a data scientist
would enrich data during the data transformation process as they prepare to begin their analysis
if they realize additional data is needed to answer the business question.
Data governance
Data governance is the formal plan for how an organization manages company data. Data
governance encompasses rules for the way data is accessed and used and can include
accountability and compliance rules.
Data lake
A data lake is a data storage repository designed to capture and store a large amount of
structured, semi-structured, and unstructured raw data. Data scientists use the data in data lakes
for machine learning or AI algorithms and models, or they can process the data and transfer it
to a data warehouse.
Data mart
A data mart is a subset of a data warehouse that houses all processed data relevant to a specific
department. While a data warehouse may contain data pertaining to the finance, marketing,
sales, and human resources teams, a data mart may isolate the finance team data.
Data mining
Data mining is the process of closely examining data to identify patterns and glean insights.
Data mining is a central aspect of data analytics; the insights you find during the mining process
will inform your business recommendations.
Data visualization
Data visualization is the representation of information and data using charts, graphs, maps, and
other visual tools. With strong data visualizations, you can foster storytelling, make your data
accessible to a wider audience, identify patterns and relationships, and explore your data
further.
Data warehouse
A data warehouse is a centralized data repository that stores processed, organized data from
multiple sources. Data warehouses may contain a combination of current and historical data
that has been extracted, transformed, and loaded from internal and external databases.
Data wrangling
Data wrangling, also called data munging or data remediation, is the process of converting raw
data into a usable form. There are four stages of the munging process: discovery, data
transformation, data validation, and publishing. The data transformation stage can be broken
down further into tasks like data structuring, data normalization or denormalization, data
cleaning, and data enrichment.
Deep learning
Deep learning is a machine learning technique that layers algorithms and computing units (or
neurons) into what is called an artificial neural network (ANN). Unlike other machine learning
techniques, deep learning algorithms can improve incorrect outcomes through repetition without
human intervention. These deep neural networks take inspiration from the structure of the human
brain.
Machine learning
Machine learning is a subset of AI in which algorithms mimic human learning while processing
data. With machine learning, algorithms can improve over time, becoming increasingly
accurate when making predictions or classifications. Machine learning engineers build, design,
and maintain AI and machine learning systems.
Regression
Regression is a machine learning problem that uses data to predict a continuous numeric outcome,
such as a future price or sales figure. Some examples of algorithms commonly used to create
regression models are linear regression and ridge regression.
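To make the idea of regression concrete, the sketch below fits a straight line to a handful of
points using the ordinary least squares formulas for a single input variable. The data (hours
studied and exam scores) and the function name fit_line are invented for illustration.

```python
from statistics import mean

# Hypothetical data: hours studied (x) and exam score (y)
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(hours, scores)
print(slope, intercept)        # 2.0 50.0
print(intercept + slope * 6)   # predicted score for 6 hours: 62.0
```

Ridge regression follows the same idea but adds a penalty on large coefficients, which is not
shown here.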
DATA SCIENCE PROCESS
Data science is a multidisciplinary field that uses statistical and computational methods to
extract insights and knowledge from data. It involves a combination of skills and knowledge
from various fields such as statistics, computer science, mathematics, and domain expertise.
The data science process typically involves the following steps:

Setting the research goal


Data science is mostly applied in the context of an organization. When the business asks you
to perform a data science project, you’ll first prepare a project charter. This charter contains
information such as what you’re going to research, how the company benefits from that, what
data and resources you need, a timetable, and deliverables.
Retrieving data
The second step is to collect data. You’ve stated in the project charter which data you need and
where you can find it. In this step you ensure that you can use the data in your program, which
means checking that the data exists, that its quality is adequate, and that you have access to
it. Data can also be delivered by third-party companies and takes many forms, ranging from Excel
spreadsheets to different types of databases.
Data preparation
Data collection is an error-prone process; in this phase you enhance the quality of the data and
prepare it for use in subsequent steps. This phase consists of three subphases: data cleansing
removes false values from a data source and inconsistencies across data sources, data
integration enriches data sources by combining information from multiple data sources, and
data transformation ensures that the data is in a suitable format for use in your models.
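The three subphases can be sketched in a few lines of plain Python. The two data sources, their
field names, and the helper to_float are hypothetical and serve only to show cleansing,
integration, and transformation one after another.

```python
# Hypothetical sources: a customer table and a separate purchases table
customers = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ravi"},
]
purchases = [
    {"customer_id": 1, "amount": "250.50"},
    {"customer_id": 1, "amount": "99.50"},
    {"customer_id": 2, "amount": "n/a"},      # a false value to be cleansed
]

def to_float(text):
    """Cleansing helper: return a float, or None when the text is not a number."""
    try:
        return float(text)
    except ValueError:
        return None

# Data cleansing: discard purchases whose amount cannot be parsed
valid = [(p["customer_id"], to_float(p["amount"])) for p in purchases]
valid = [(cid, amt) for cid, amt in valid if amt is not None]

# Data integration: combine both sources on the shared customer_id key
totals = {c["customer_id"]: 0.0 for c in customers}
for cid, amt in valid:
    totals[cid] += amt

# Data transformation: reshape into numeric feature rows a model could consume
features = [[c["customer_id"], totals[c["customer_id"]]] for c in customers]
print(features)  # [[1, 350.0], [2, 0.0]]
```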
Data exploration
Data exploration is concerned with building a deeper understanding of your data. You try to
understand how variables interact with each other, the distribution of the data, and whether
there are outliers. To achieve this you mainly use descriptive statistics, visual techniques, and
simple modeling. This step often goes by the abbreviation EDA, for Exploratory Data Analysis.
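A small example of the descriptive statistics used in EDA is given below, using Python's
standard statistics module. The order counts are invented, and the 1.5 × IQR rule for flagging
outliers is one common rule of thumb rather than the only option.

```python
import statistics

# Hypothetical daily order counts; 95 looks suspicious
orders = [12, 15, 14, 13, 16, 15, 14, 95]

print("mean:  ", statistics.mean(orders))    # 24.25
print("median:", statistics.median(orders))  # 14.5

# Quartiles split the sorted data into four parts
q1, _, q3 = statistics.quantiles(orders, n=4)
iqr = q3 - q1

# Flag points more than 1.5 * IQR below Q1 or above Q3
outliers = [x for x in orders if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print("outliers:", outliers)                 # [95]
```

The gap between the mean and the median is itself a hint that an extreme value is pulling the
mean upward.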
Data modeling or model building
In this phase you use models, domain knowledge, and insights about the data you found in the
previous steps to answer the research question. You select a technique from the fields of
statistics, machine learning, operations research, and so on. Building a model is an iterative
process that involves selecting the variables for the model, executing the model, and model
diagnostics.
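The loop of selecting variables, executing the model, and running diagnostics might look
roughly like the sketch below. The dataset, the two candidate variables (area and floor), the
train/test split, and the use of mean absolute error as the diagnostic are all assumptions
chosen for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares line for one variable: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar

def mean_abs_error(xs, ys, slope, intercept):
    """Diagnostic: average absolute gap between predictions and actual values."""
    return mean(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys))

# Hypothetical dataset with two candidate variables for predicting price
area = [50, 60, 70, 80, 90, 100]
floor = [3, 1, 4, 1, 5, 2]
price = [100, 120, 140, 160, 180, 200]

# Fit on the first four rows, then check each model on the two held-out rows
train, test = slice(0, 4), slice(4, 6)
errors = {}
for name, var in [("area", area), ("floor", floor)]:
    slope, intercept = fit_line(var[train], price[train])
    errors[name] = mean_abs_error(var[test], price[test], slope, intercept)

best = min(errors, key=errors.get)
print(best)  # area
```

If the diagnostics were poor for every candidate, the next iteration would try other variables
or another technique.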
Presentation and automation
Finally, you present the results to your business. These results can take many forms, ranging
from presentations to research reports. Sometimes you’ll need to automate the execution of the
process because the business will want to use the insights you gained in another project or
enable an operational process to use the outcome from your model.

DATA SCIENCE TOOLKIT


A data science toolkit is the collection of software, programming languages, and platforms that
a data professional uses to perform the full range of tasks in a data science project—from data
collection and cleaning to analysis, modeling, and visualization.
The tools chosen can vary based on the specific project, the scale of the data, and the data
scientist's personal or team preferences. However, a common set of tools forms the foundation
of most modern data science workflows.
Programming Languages
These are the core languages for writing and executing code for data analysis.
 Python: The most popular language for data science, Python is known for its simple,
readable syntax and a vast ecosystem of libraries. It's used for nearly every stage of the
data science lifecycle.
 R: A language developed specifically for statistical computing and graphics. It is a
favorite among statisticians and researchers due to its powerful statistical packages and
robust data visualization capabilities.
 SQL (Structured Query Language): This is the essential language for interacting with
relational databases. It's used for data retrieval, manipulation, and management, making
it a fundamental skill for any data professional.
 Julia: A newer language designed for high-performance numerical and scientific
computing. It is gaining traction in fields that require intensive computations, such as
machine learning and scientific research.
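As a brief illustration of how SQL and Python are often combined, the sketch below runs a query
through Python's built-in sqlite3 module against an in-memory database. The table name sales
and its columns are made up for this example.

```python
import sqlite3

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 30.0)],
)

# Data retrieval: total sales per region, largest first
rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY total DESC"
).fetchall()
print(rows)  # [('North', 150.0), ('South', 80.0)]
conn.close()
```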
Key Libraries and Frameworks
These libraries extend the functionality of the core programming languages, providing
specialized tools for various data science tasks.
 Data Manipulation and Analysis:
o Pandas (Python): The cornerstone of data analysis in Python, it provides a
powerful data structure called the DataFrame for working with tabular data
efficiently.
o NumPy (Python): A fundamental library for numerical computing, it provides
support for multi-dimensional arrays and a wide array of mathematical
functions.
 Machine Learning:
o Scikit-learn (Python): A comprehensive and user-friendly library that offers a
wide range of algorithms for classification, regression, clustering, and more,
along with tools for model evaluation and selection.
o TensorFlow (Python) and PyTorch (Python): These are the leading deep
learning frameworks used for building and training complex neural networks.
They are essential for tasks like computer vision and natural language
processing.
 Data Visualization:
o Matplotlib (Python) and Seaborn (Python): The most widely used libraries
for creating static and statistical visualizations in Python.
o ggplot2 (R): A popular library in R known for its elegant syntax and ability to
create high-quality, professional-looking plots.
o Tableau and Power BI: These are powerful business intelligence (BI) tools that
provide interactive dashboards and visualizations for a less technical audience.
Integrated Development Environments (IDEs) and Notebooks
These platforms provide the environment for writing, running, and managing data science code.
 Jupyter Notebook: A web-based interactive environment that allows data scientists to
create and share documents containing live code, equations, visualizations, and
narrative text. It's a popular choice for exploratory data analysis.
 RStudio: A powerful IDE specifically designed for R, providing a console, editor, and
tools for plotting, history, and debugging.
 Visual Studio Code and PyCharm: A general-purpose code editor and a Python-focused IDE,
respectively. Both are popular among data scientists due to their robust features, including
support for extensions, version control, and debugging.
Cloud Computing Platforms
For large-scale data science projects, cloud platforms provide the necessary computational
power and specialized services.
 Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft
Azure: These platforms offer scalable computing resources and a suite of services for
machine learning, data storage, and data processing. Popular services include Amazon
SageMaker, Google Cloud Vertex AI (the successor to AI Platform), and Azure Machine Learning.
 Databricks: A unified analytics platform built on Apache Spark that is designed for
collaborative data engineering, machine learning, and business analytics.
Version Control
 Git and GitHub: Essential tools for version control, allowing data scientists to track
changes in their code, collaborate with others, and ensure the reproducibility of their
projects.

TYPES OF DATA
Big data and data science work generate very large amounts of data. This data comes in various
types, and the main categories are as follows:
 Structured
 Natural language
 Graph-based
 Streaming
 Unstructured
 Machine-generated
 Audio, video and images
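A few of these categories can be shown side by side in plain Python. The sample values below
are invented, and the snippet covers only four of the categories (structured,
machine-generated, natural language, and graph-based) as a rough sketch.

```python
import csv
import io
import json

# Structured: fixed columns, as in a table or spreadsheet
table = list(csv.DictReader(io.StringIO("name,age\nAsha,34\nRavi,29\n")))
print(table[0]["age"])             # 34 (read as text)

# Machine-generated: a hypothetical JSON log record from a sensor
event = json.loads('{"sensor": "t1", "reading": {"temp_c": 21.5}}')
print(event["reading"]["temp_c"])  # 21.5

# Natural language: free text must be split into words before analysis
words = "Data science turns raw data into insight".lower().split()
print(words.count("data"))         # 2

# Graph-based: relationships stored as an adjacency mapping
follows = {"Asha": ["Ravi"], "Ravi": ["Asha", "Meena"], "Meena": []}
print(len(follows["Ravi"]))        # 2
```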
EXAMPLE APPLICATIONS
Data science is a versatile field with a wide range of real-world applications across various
industries. By leveraging data, organizations can make better decisions, automate processes,
and gain a competitive edge.
Finance and Banking
 Fraud Detection: Banks and financial institutions analyze transaction data in real time
to identify and flag fraudulent activities. Machine learning models are trained on
historical data to recognize patterns of legitimate versus fraudulent transactions,
allowing them to detect anomalies and prevent financial losses.
 Algorithmic Trading: High-frequency trading firms use data science to build complex
algorithms that execute trades automatically at lightning-fast speeds. These algorithms
analyze real-time market data, including stock prices, news headlines, and social media
sentiment, to make rapid buy-or-sell decisions.
 Credit Risk Assessment: Instead of relying solely on traditional credit scores, data
science models can analyze a wider range of data points—including spending habits
and repayment history—to provide a more accurate assessment of a person's
creditworthiness. This helps banks make more informed lending decisions and
personalize loan offers.
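The fraud detection idea of flagging anomalies against historical behavior can be sketched
with a simple statistical rule. This is a stand-in for the trained machine learning models
described above, not an example of one; the transaction amounts, the function name
is_suspicious, and the threshold of 3 standard deviations are assumptions.

```python
import statistics

# Hypothetical past transaction amounts for one customer
history = [42.0, 55.0, 38.0, 61.0, 47.0, 52.0, 44.0, 58.0]
mu = statistics.mean(history)
sigma = statistics.stdev(history)

def is_suspicious(amount, threshold=3.0):
    """Flag amounts more than `threshold` standard deviations from the mean."""
    return abs(amount - mu) / sigma > threshold

print(is_suspicious(50.0))    # False
print(is_suspicious(900.0))   # True
```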
Healthcare
 Medical Imaging Analysis: Data science, particularly deep learning, is used to analyze
medical images like X-rays, CT scans, and MRIs. Algorithms can be trained to detect
subtle signs of diseases like cancer, which can help make diagnoses faster and more
accurate than is possible with the human eye alone.
 Drug Discovery: The drug development process is extremely long and expensive. Data
science accelerates this by analyzing vast datasets of genetic information, molecular
structures, and clinical trial results to identify potential drug candidates and predict their
efficacy, reducing the time and cost of bringing new treatments to market.
 Personalized Medicine: By analyzing a patient's unique genetic information, lifestyle,
and medical history, data science can help tailor treatment plans to individual needs.
This allows doctors to prescribe the most effective medications and therapies while
minimizing adverse side effects.
Retail and E-commerce
 Recommendation Systems: These are among the most familiar applications of data science.
Companies like Amazon and Netflix use data from a user's past purchases, browsing
history, and ratings to recommend products or content they are likely to enjoy. This
can significantly enhance the user experience and boost sales.
 Inventory Management: Data science helps retailers forecast demand for products by
analyzing historical sales data, seasonal trends, and external factors like weather. This
allows them to optimize inventory levels, prevent stockouts, and reduce waste.
 Price Optimization: E-commerce platforms use dynamic pricing models that adjust a
product's price in real time based on demand, competitor prices, and customer behavior
to maximize revenue and profitability.
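To give a feel for how a recommendation system might use ratings, the sketch below finds the
most similar other user by cosine similarity and suggests items that user has rated but the
target user has not. The users, items, ratings, and function names are invented; production
systems are far more elaborate.

```python
import math

# Hypothetical ratings (1-5) per user; items a user has not rated are absent
ratings = {
    "user_a": {"book": 5, "lamp": 1, "pen": 4},
    "user_b": {"book": 4, "lamp": 1, "pen": 5, "desk": 5},
    "user_c": {"book": 1, "lamp": 5, "desk": 2},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user):
    """Suggest items rated by the most similar other user but not by `user`."""
    others = [name for name in ratings if name != user]
    best = max(others, key=lambda name: cosine(ratings[user], ratings[name]))
    return sorted(set(ratings[best]) - set(ratings[user]))

print(recommend("user_a"))  # ['desk']
```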
Other Industries
 Transportation: Logistics companies like UPS use data science to optimize delivery
routes, saving time and fuel. In public transit, data analysis can help model traffic
patterns to improve city planning and reduce congestion.
 Social Media: Platforms like Facebook use data science for content recommendation
(what to show in your feed), sentiment analysis (understanding public opinion about a
brand), and user behavior analysis to improve engagement.
 Telecommunications: Telco companies analyze customer data to predict churn—the
likelihood that a customer will switch to a competitor. This allows them to proactively
offer personalized promotions and improve customer retention.
