UNIT-I: Introduction to Core Concepts and Technologies
Introduction: Terminology, data science process, data science toolkit, types of data, example
applications
TERMINOLOGY
Data science terminology refers to the specific vocabulary and concepts used within the field
of data science to describe its techniques, tools, and processes. These terms are crucial for
anyone involved in data science as they provide a common language that facilitates clearer
communication, more effective collaboration, and a deeper understanding of complex
concepts.
Algorithm
An algorithm is a set of instructions or rules to follow in order to complete a specific task.
Algorithms can be particularly useful when you’re working with big data or machine learning.
Data analysts may use algorithms to organize or analyze data, while data scientists may use
algorithms to make predictions or build models.
Artificial intelligence (AI)
Artificial intelligence (AI) uses computer science and data to enable problem solving in
machines. In this case, the intelligence is “artificial” because it’s a computer programmed to
perform tasks commonly associated with human intelligence.
Big data
Big data is a large collection of data characterized by the three V’s: volume, velocity, and
variety. Volume refers to the amount of data—big data deals with high volumes of data; velocity
refers to the rate at which data is collected—big data is collected at a high velocity and often
streams directly into memory; and variety refers to the range of data formats—big data tends
to have a high variety of structured, semi-structured, and unstructured data, as well as a variety
of formats such as numbers, text strings, images, and audio.
Business intelligence (BI)
Business intelligence (BI) is data analytics used to empower organizations to make data-driven
business decisions. Business intelligence analysts analyze business data like revenue, sales, or
customer data, and offer recommendations based on their analysis.
Classification
Classification is a machine learning problem that organizes data into categories. You may use
this to create email spam filters, for example. Some examples of algorithms commonly used to
create classification models are logistic regression, decision trees, K-nearest neighbor (KNN),
and random forest.
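As a rough illustration of the K-nearest neighbor idea named above, the sketch below labels a new message by majority vote among its closest training examples. The two features (link count and exclamation-mark count), the toy data, and the function name `knn_predict` are assumptions made up for this example, not part of any real spam filter.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort training examples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the k neighbours wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical features: (number of links, number of exclamation marks).
train = [
    ((0, 0), "ham"), ((1, 0), "ham"), ((0, 1), "ham"),
    ((7, 5), "spam"), ((8, 6), "spam"), ((6, 7), "spam"),
]

print(knn_predict(train, (7, 6)))  # spam
print(knn_predict(train, (1, 1)))  # ham
```

In practice a library such as scikit-learn would normally be used instead of hand-written distance code; the sketch only shows the voting logic.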
Dashboard
A dashboard is a tool for monitoring and displaying live data. It is typically connected to a
database and features visualizations that automatically update to reflect the most current data
in the database.
Data analytics
Data analytics is the collection, transformation, and organization of data in order to draw
conclusions, make predictions, and drive informed decision making. Data analytics
encompasses data analysis (the process of deriving information from data), data science (using
data to theorize and forecast), and data engineering (building data systems). Data analysts, data
scientists, and data engineers are all data analytics professionals.
There are four key types of data analytics:
Descriptive analytics tells us what happened.
Diagnostic analytics tells us why something happened.
Predictive analytics tells us what will likely happen in the future.
Prescriptive analytics tells us how to act.
Data architecture
Data architecture, also called data design, is the plan for an organization’s data management
system. This can include all touchpoints in the data lifecycle, including how the data is
gathered, organized, utilized, and discarded. Data architects design the blueprints that
organizations use for their data management systems.
Data cleaning
Data cleaning, cleansing, or scrubbing is the process of preparing raw data for analysis. When
cleaning your data, you verify that your data is accurate, complete, consistent, and unbiased.
It’s important to make sure you have clean data prior to analysis because unclean or dirty data
can lead to inaccurate conclusions and misguided business decisions.
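A minimal sketch of what such checks might look like in code, assuming a small list of hypothetical order records. It covers accuracy (the amount must be a number), completeness (rows with missing values are skipped), and consistency (one spelling style, no duplicates); checking for bias needs domain judgment and is not attempted here.

```python
raw_rows = [
    {"customer": " Alice ", "city": "paris", "amount": "120.50"},
    {"customer": "Bob", "city": "Paris", "amount": ""},          # missing amount
    {"customer": "alice", "city": "PARIS", "amount": "120.50"},  # duplicate of row 1
    {"customer": "Carol", "city": "Lyon", "amount": "abc"},      # not a number
    {"customer": "Dan", "city": "lyon", "amount": "75"},
]

def clean(rows):
    """Return rows that are complete, consistently formatted, and unique."""
    cleaned, seen = [], set()
    for row in rows:
        try:
            amount = float(row["amount"])  # accuracy: amount must be numeric
        except ValueError:
            continue                       # completeness: skip missing or invalid
        # Consistency: trim whitespace and use one capitalisation style.
        record = (row["customer"].strip().title(), row["city"].strip().title(), amount)
        if record not in seen:             # drop exact duplicates
            seen.add(record)
            cleaned.append({"customer": record[0], "city": record[1], "amount": record[2]})
    return cleaned

cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

Only the Alice and Dan records survive: Bob and Carol are dropped for unusable amounts, and the second Alice row is recognized as a duplicate once spacing and capitalisation are normalized.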
Data enrichment
Data enrichment is the process of adding data to an existing dataset. Typically, a data scientist
would enrich data during the data transformation process as they prepare to begin their analysis
if they realize additional data is needed to answer the business question.
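One common form of enrichment is looking up extra attributes from a second source. The sketch below assumes a hypothetical postcode-to-region table and adds a `region` field to each order; the field names and data are illustrative only.

```python
orders = [
    {"order_id": 1, "postcode": "75001", "total": 40.0},
    {"order_id": 2, "postcode": "69001", "total": 15.5},
    {"order_id": 3, "postcode": "99999", "total": 22.0},  # postcode not in lookup
]

# Hypothetical second source mapping postcodes to regions.
region_by_postcode = {"75001": "Ile-de-France", "69001": "Auvergne-Rhone-Alpes"}

def enrich(rows, lookup):
    """Return copies of rows with a 'region' field added from the lookup table."""
    return [{**row, "region": lookup.get(row["postcode"], "unknown")} for row in rows]

enriched = enrich(orders, region_by_postcode)
for row in enriched:
    print(row)
```

The original `orders` list is left untouched, and rows with no match are marked "unknown" rather than dropped, so the analyst can see how much of the dataset could not be enriched.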
Data governance
Data governance is the formal plan for how an organization manages company data. Data
governance encompasses rules for the way data is accessed and used and can include
accountability and compliance rules.
Data lake
A data lake is a data storage repository designed to capture and store a large amount of
structured, semi-structured, and unstructured raw data. Data scientists use the data in data lakes
for machine learning or AI algorithms and models, or they can process the data and transfer it
to a data warehouse.
Data mart
A data mart is a subset of a data warehouse that houses all processed data relevant to a specific
department. While a data warehouse may contain data pertaining to the finance, marketing,
sales, and human resources teams, a data mart may isolate the finance team data.
Data mining
Data mining is the process of closely examining data to identify patterns and glean insights.
Data mining is a central aspect of data analytics; the insights you find during the mining process
will inform your business recommendations.
Data visualization
Data visualization is the representation of information and data using charts, graphs, maps, and
other visual tools. With strong data visualizations, you can foster storytelling, make your data
accessible to a wider audience, identify patterns and relationships, and explore your data
further.
Data warehouse
A data warehouse is a centralized data repository that stores processed, organized data from
multiple sources. Data warehouses may contain a combination of current and historical data
that has been extracted, transformed, and loaded from internal and external databases.
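The extract, transform, and load steps can be sketched on a very small scale as follows. Here an in-memory SQLite database stands in for the warehouse, and the two sources, their formats, and the `sales` table are assumptions invented for the example.

```python
import sqlite3

# Extract: two hypothetical sources with different conventions.
web_sales = [("2024-01-05", "widget", "19.99"), ("2024-01-06", "gadget", "5.00")]
store_sales = [{"day": "05/01/2024", "item": "WIDGET", "price_cents": 1999}]

def transform():
    """Bring both sources to one schema: (ISO date, lowercase item, price as float)."""
    rows = [(day, item.lower(), float(price)) for day, item, price in web_sales]
    for sale in store_sales:
        dd, mm, yyyy = sale["day"].split("/")
        rows.append((f"{yyyy}-{mm}-{dd}", sale["item"].lower(), sale["price_cents"] / 100))
    return rows

# Load: an in-memory SQLite database stands in for the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (day TEXT, item TEXT, price REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", transform())

widget_count = warehouse.execute(
    "SELECT COUNT(*) FROM sales WHERE item = 'widget'"
).fetchone()[0]
print(widget_count)  # 2
```

Because both sources were brought to the same schema before loading, a single query can now count widget sales across the web shop and the physical store.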
Data wrangling
Data wrangling, also called data munging or data remediation, is the process of converting raw
data into a usable form. There are four stages of the munging process: discovery, data
transformation, data validation, and publishing. The data transformation stage can be broken
down further into tasks like data structuring, data normalization or denormalization, data
cleaning, and data enrichment.
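As one example of a transformation task, the sketch below applies min-max normalization, which rescales a numeric column to the range 0 to 1. This reads "normalization" in the feature-scaling sense, which is an assumption; in a database context the same word refers to table design instead.

```python
def min_max_normalize(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40, 60]
print(min_max_normalize(ages))  # [0.0, 0.25, 0.5, 1.0]
```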
Deep learning
Deep learning is a machine learning technique that layers algorithms and computing units, called
neurons, into what is called an artificial neural network (ANN). Unlike traditional machine
learning methods, which usually depend on features chosen by a person, deep learning algorithms
can learn useful features directly from raw data and improve incorrect outcomes through
repetition with little human intervention. These deep neural networks take inspiration from the
structure of the human brain.
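To make the idea of layered neurons concrete, here is a minimal sketch of a two-layer network that computes XOR, a function no single neuron of this kind can compute on its own. The weights are hand-picked for illustration, which is an assumption of this example; a real deep learning system would learn them from data, typically with a framework.

```python
def step(x):
    """Threshold activation: the neuron fires (1) when its input is positive."""
    return 1 if x > 0 else 0

def neuron(inputs, weights, bias):
    """One computing unit: weighted sum of inputs plus bias, then activation."""
    return step(sum(i * w for i, w in zip(inputs, weights)) + bias)

def xor_network(x1, x2):
    # Hidden layer: two neurons that each look at the raw inputs.
    h_or = neuron([x1, x2], [1, 1], -0.5)    # fires if at least one input is 1
    h_and = neuron([x1, x2], [1, 1], -1.5)   # fires only if both inputs are 1
    # Output layer: combines the hidden neurons ("OR but not AND").
    return neuron([h_or, h_and], [1, -1], -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

The output neuron never sees the raw inputs; it works only with what the hidden layer computed, which is the sense in which deeper layers build on simpler ones.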
Machine learning
Machine learning is a subset of AI in which algorithms mimic human learning while processing
data. With machine learning, algorithms can improve over time, becoming increasingly
accurate when making predictions or classifications. Machine learning engineers build, design,
and maintain AI and machine learning systems.
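A small sketch of learning from mistakes, using the classic perceptron rule on a toy task (the logical AND of two inputs). The task, the function names, and the choice of the perceptron are assumptions made for illustration; the point is only that the weights start out uninformed and are corrected after each wrong prediction.

```python
def predict(weights, bias, x):
    """Perceptron decision rule: output 1 when the weighted sum is positive."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def train_perceptron(examples, max_epochs=20):
    """Adjust weights after every mistake; return weights, bias, and mistakes per epoch."""
    weights, bias, history = [0, 0], 0, []
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in examples:
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            if error:
                mistakes += 1
                weights = [w + error * xi for w, xi in zip(weights, x)]
                bias += error
        history.append(mistakes)
        if mistakes == 0:  # every example is classified correctly
            break
    return weights, bias, history

# Toy task: learn the logical AND of two binary inputs.
and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias, history = train_perceptron(and_examples)
print(history)  # mistakes per pass over the data; the last entry is 0
print([predict(weights, bias, x) for x, _ in and_examples])  # [0, 0, 0, 1]
```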
Regression
Regression is a machine learning problem that uses data to predict a continuous numerical
value, such as a price or a future sales figure. Some examples of algorithms commonly used to
create regression models are linear regression and ridge regression.
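A minimal sketch of simple linear regression with one input variable, fitted by ordinary least squares. The advertising data is made up and deliberately follows an exact line so the result is easy to verify by hand; real data would be noisy.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

# Hypothetical data: advertising spend (x) against units sold (y).
spend = [1, 2, 3, 4, 5]
units = [3, 5, 7, 9, 11]   # exactly y = 2x + 1, so the fit is easy to check

slope, intercept = fit_line(spend, units)
print(slope, intercept)        # 2.0 1.0
print(slope * 6 + intercept)   # 13.0
```

The last line uses the fitted line to predict the outcome for an input that was not in the data.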
DATA SCIENCE PROCESS
Data science is a multidisciplinary field that uses statistical and computational methods to
extract insights and knowledge from data. It involves a combination of skills and knowledge
from various fields such as statistics, computer science, mathematics, and domain expertise.
The data science process typically involves the following steps:
Setting the research goal
Data science is mostly applied in the context of an organization. When the business asks you
to perform a data science project, you’ll first prepare a project charter. This charter contains
information such as what you’re going to research, how the company benefits from that, what
data and resources you need, a timetable, and deliverables.
Retrieving data
The second step is to collect data. You’ve stated in the project charter which data you need and
where you can find it. In this step you ensure that you can use the data in your program, which
means checking the existence and quality of the data and your access to it. Data can also be delivered by
third-party companies and takes many forms ranging from Excel spreadsheets to different types
of databases.
Data preparation
Data collection is an error-prone process; in this phase you enhance the quality of the data and
prepare it for use in subsequent steps. This phase consists of three subphases: data cleansing
removes false values from a data source and inconsistencies across data sources, data
integration enriches data sources by combining information from multiple data sources, and
data transformation ensures that the data is in a suitable format for use in your models.
Data exploration
Data exploration is concerned with building a deeper understanding of your data. You try to
understand how variables interact with each other, the distribution of the data, and whether
there are outliers. To achieve this you mainly use descriptive statistics, visual techniques, and
simple modeling. This step often goes by the abbreviation EDA, for Exploratory Data Analysis.
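A first pass at exploration often amounts to a few descriptive statistics and a check of how two variables move together. The sketch below uses Python's standard `statistics` module on a made-up sample; the `pearson` helper is written out by hand for this example.

```python
import statistics

# Hypothetical sample: hours studied and exam score for eight students.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 74, 79, 83]

def pearson(xs, ys):
    """Pearson correlation: +1 is a perfect rising line, 0 is no linear relation."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / spread

print("mean score:", statistics.mean(scores))
print("median score:", statistics.median(scores))
print("std dev:", round(statistics.stdev(scores), 2))
print("correlation:", round(pearson(hours, scores), 3))
```

A mean close to the median suggests the scores are not heavily skewed, and a correlation near 1 suggests that a simple linear model may be a reasonable starting point for the modeling step.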
Data modeling or model building
In this phase you use models, domain knowledge, and insights about the data you found in the
previous steps to answer the research question. You select a technique from the fields of
statistics, machine learning, operations research, and so on. Building a model is an iterative
process that involves selecting the variables for the model, executing the model, and model
diagnostics.
Presentation and automation
Finally, you present the results to your business. These results can take many forms, ranging
from presentations to research reports. Sometimes you’ll need to automate the execution of the
process because the business will want to use the insights you gained in another project or
enable an operational process to use the outcome from your model.
DATA SCIENCE TOOLKIT
A data science toolkit is the collection of software, programming languages, and platforms that
a data professional uses to perform the full range of tasks in a data science project—from data
collection and cleaning to analysis, modeling, and visualization.
The tools chosen can vary based on the specific project, the scale of the data, and the data
scientist's personal or team preferences. However, a common set of tools forms the foundation
of most modern data science workflows.
Programming Languages
These are the core languages for writing and executing code for data analysis.
Python: The most popular language for data science, Python is known for its simple,
readable syntax and a vast ecosystem of libraries. It's used for nearly every stage of the
data science lifecycle.
R: A language developed specifically for statistical computing and graphics. It is a
favorite among statisticians and researchers due to its powerful statistical packages and
robust data visualization capabilities.
SQL (Structured Query Language): This is the essential language for interacting with
relational databases. It's used for data retrieval, manipulation, and management, making
it a fundamental skill for any data professional.
Julia: A newer language designed for high-performance numerical and scientific
computing. It is gaining traction in fields that require intensive computations, such as
machine learning and scientific research.
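To give a feel for how two of these languages combine, the sketch below runs a SQL query from Python using the standard-library `sqlite3` module. The `orders` table and its contents are hypothetical, and the database lives only in memory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 7.5)],
)

# Retrieval and aggregation: total spend per customer, largest first.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()
print(totals)  # [('alice', 50.0), ('bob', 12.5), ('carol', 7.5)]
```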
Key Libraries and Frameworks
These libraries extend the functionality of the core programming languages, providing
specialized tools for various data science tasks.
Data Manipulation and Analysis:
o Pandas (Python): The cornerstone of data analysis in Python, it provides a
powerful data structure called the DataFrame for working with tabular data
efficiently.
o NumPy (Python): A fundamental library for numerical computing, it provides
support for multi-dimensional arrays and a wide array of mathematical
functions.
Machine Learning:
o Scikit-learn (Python): A comprehensive and user-friendly library that offers a
wide range of algorithms for classification, regression, clustering, and more,
along with tools for model evaluation and selection.
o TensorFlow (Python) and PyTorch (Python): These are the leading deep
learning frameworks used for building and training complex neural networks.
They are essential for tasks like computer vision and natural language
processing.
Data Visualization:
o Matplotlib (Python) and Seaborn (Python): The most widely used libraries
for creating static and statistical visualizations in Python.
o ggplot2 (R): A popular library in R known for its elegant syntax and ability to
create high-quality, professional-looking plots.
o Tableau and Power BI: These are powerful business intelligence (BI) tools that
provide interactive dashboards and visualizations for a less technical audience.
Integrated Development Environments (IDEs) and Notebooks
These platforms provide the environment for writing, running, and managing data science code.
Jupyter Notebook: A web-based interactive environment that allows data scientists to
create and share documents containing live code, equations, visualizations, and
narrative text. It's a popular choice for exploratory data analysis.
RStudio: A powerful IDE specifically designed for R, providing a console, editor, and
tools for plotting, history, and debugging.
Visual Studio Code and PyCharm: A general-purpose code editor and a Python-focused IDE,
respectively. Both are popular among data scientists due to their robust features, including
support for extensions, version control, and debugging.
Cloud Computing Platforms
For large-scale data science projects, cloud platforms provide the necessary computational
power and specialized services.
Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft
Azure: These platforms offer scalable computing resources and a suite of services for
machine learning, data storage, and data processing. Popular services include Amazon
SageMaker, Google Cloud Vertex AI (formerly AI Platform), and Azure Machine Learning.
Databricks: A unified analytics platform built on Apache Spark that is designed for
collaborative data engineering, machine learning, and business analytics.
Version Control
Git and GitHub: Essential tools for version control, allowing data scientists to track
changes in their code, collaborate with others, and ensure the reproducibility of their
projects.
TYPES OF DATA
Big data and data science work with very large amounts of data. This data comes in various
types; the main categories are as follows:
Structured
Natural language
Graph-based
Streaming
Unstructured
Machine-generated
Audio, video and images
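The sketch below shows how a few of these categories might look once loaded into a program. All sample values are invented, and binary formats such as audio, video, and images are left out because they need specialised libraries.

```python
import csv
import io
import json

# Structured: fixed columns, fits a table.
table = list(csv.DictReader(io.StringIO("id,name,age\n1,Ana,34\n2,Ben,29\n")))

# Machine-generated: a log event emitted by a (hypothetical) sensor, as JSON.
event = json.loads('{"sensor": "t-17", "reading": {"temp_c": 21.5}, "ts": 1700000000}')

# Natural language (unstructured): free text has no predefined fields.
review = "The delivery was late, but the product itself is great."
words = review.lower().replace(",", "").replace(".", "").split()

# Graph-based: relationships stored as an adjacency list.
follows = {"ana": ["ben", "cy"], "ben": ["cy"], "cy": []}

print(table[0]["name"], event["reading"]["temp_c"], len(words), follows["ana"])
```

Structured data can be addressed by column name straight away, while the free-text review has to be split into words before anything can be counted or compared.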
EXAMPLE APPLICATIONS
Data science is a versatile field with a wide range of real-world applications across various
industries. By leveraging data, organizations can make better decisions, automate processes,
and gain a competitive edge.
Finance and Banking
Fraud Detection: Banks and financial institutions analyze transaction data in real time
to identify and flag fraudulent activities. Machine learning models are trained on
historical data to recognize patterns of legitimate versus fraudulent transactions,
allowing them to detect anomalies and prevent financial losses.
Algorithmic Trading: High-frequency trading firms use data science to build complex
algorithms that execute trades automatically at lightning-fast speeds. These algorithms
analyze real-time market data, including stock prices, news headlines, and social media
sentiment, to make rapid buy-or-sell decisions.
Credit Risk Assessment: Instead of relying solely on traditional credit scores, data
science models can analyze a wider range of data points—including spending habits
and repayment history—to provide a more accurate assessment of a person's
creditworthiness. This helps banks make more informed lending decisions and
personalize loan offers.
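The anomaly-detection idea behind fraud detection can be sketched with a simple statistical rule in place of a trained machine learning model: flag any payment that lies far from a customer's usual amounts. The threshold of three standard deviations and the sample payments are assumptions chosen for illustration.

```python
import statistics

def flag_anomalies(history, new_amounts, threshold=3.0):
    """Flag amounts more than `threshold` standard deviations from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return [amt for amt in new_amounts if abs(amt - mean) / stdev > threshold]

# Hypothetical past card payments for one customer, then three new ones.
past = [42.0, 38.5, 45.0, 40.0, 39.5, 44.0, 41.0, 43.5]
incoming = [41.5, 47.0, 950.0]

print(flag_anomalies(past, incoming))  # [950.0]
```

Production systems typically combine many more signals (merchant, location, time of day) and learn the decision boundary from labeled historical transactions rather than using a fixed threshold.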
Healthcare
Medical Imaging Analysis: Data science, particularly deep learning, is used to analyze
medical images like X-rays, CT scans, and MRIs. Algorithms can be trained to detect
subtle signs of diseases like cancer, which can help make diagnoses faster and more accurate
than is possible with the human eye alone.
Drug Discovery: The drug development process is extremely long and expensive. Data
science accelerates this by analyzing vast datasets of genetic information, molecular
structures, and clinical trial results to identify potential drug candidates and predict their
efficacy, reducing the time and cost of bringing new treatments to market.
Personalized Medicine: By analyzing a patient's unique genetic information, lifestyle,
and medical history, data science can help tailor treatment plans to individual needs.
This allows doctors to prescribe the most effective medications and therapies while
minimizing adverse side effects.
Retail and E-commerce
Recommendation Systems: Perhaps the most common application of data science.
Companies like Amazon and Netflix use data from a user's past purchases, browsing
history, and ratings to recommend products or content they are likely to enjoy. This
significantly enhances the user experience and boosts sales.
Inventory Management: Data science helps retailers forecast demand for products by
analyzing historical sales data, seasonal trends, and external factors like weather. This
allows them to optimize inventory levels, prevent stockouts, and reduce waste.
Price Optimization: E-commerce platforms use dynamic pricing models that adjust a
product's price in real time based on demand, competitor prices, and customer behavior
to maximize revenue and profitability.
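A minimal sketch of one way a recommendation system can work, user-based collaborative filtering: find the user whose ratings are most similar, then suggest titles that user rated and the target user has not seen. The ratings, names, and the choice of cosine similarity are assumptions for this example and do not describe how any particular company's system works.

```python
import math

# Hypothetical ratings (1-5); a missing title means the user has not rated it.
ratings = {
    "ana": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "ben": {"Alien": 5, "Brazil": 5, "Dune": 4},
    "cy":  {"Casablanca": 5, "Emma": 4},
}

def cosine(a, b):
    """Cosine similarity of two sparse rating vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend(user):
    """Suggest unseen titles from the most similar other user, best rated first."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: cosine(ratings[user], ratings[u]))
    unseen = {t: r for t, r in ratings[nearest].items() if t not in ratings[user]}
    return sorted(unseen, key=unseen.get, reverse=True)

print(recommend("ana"))  # ['Dune']
```

Ana and Ben agree on two titles, so Ben is chosen as her nearest neighbor and his remaining title is suggested.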
Other Industries
Transportation: Logistics companies like UPS use data science to optimize delivery
routes, saving time and fuel. In public transit, data analysis can help model traffic
patterns to improve city planning and reduce congestion.
Social Media: Platforms like Facebook use data science for content recommendation
(what to show in your feed), sentiment analysis (understanding public opinion about a
brand), and user behavior analysis to improve engagement.
Telecommunications: Telco companies analyze customer data to predict churn—the
likelihood that a customer will switch to a competitor. This allows them to proactively
offer personalized promotions and improve customer retention.