
Understanding Data Science and Roles

The document provides an overview of data science, including definitions of key roles such as data scientist, data analyst, data engineer, and data architect. It outlines various facets of data, including structured and unstructured data, and discusses the importance of data mining and data warehousing in extracting insights from large datasets. Additionally, it highlights the benefits and applications of data science across different sectors, emphasizing improved decision-making, efficiency, and innovation.

Uploaded by

prithivipt

1. What is Data Science?

• Data science is the domain of study that deals with vast volumes of data using modern tools
and techniques to find unseen patterns, derive meaningful information, and make business
decisions.
• Data science uses complex machine learning algorithms to build predictive models.
• The data used for analysis can come from many different sources and be presented in various
formats.
2. What is a Data Scientist?
A data scientist is someone who uses their skills to mine data, understand it and extract
insights from it. They usually work with a team of engineers and analysts to create models that
can be used for various purposes.
3. Define Data Analyst
A data analyst gathers information from various sources such as offline or online
databases, spreadsheets and surveys. They use analytical tools such as Excel, PowerPoint
and Tableau, but rely mostly on statistical techniques to present their
findings in a readable format.
4. Define Data Engineer
A data engineer builds applications that collect and process data using technologies like
Hadoop, Spark etc. while ensuring its quality so that it can be used by other teams such as
analysts or scientists without any issues later down the line.
5. Define Data Architect
A data architect's role includes designing databases according to specific requirements so that
they function efficiently within an organization's infrastructure.
6. List the different Facets of data
• In data science and big data you’ll come across many different types of data, and each of
them tends to require different tools and techniques. The main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
7. Define Data Mining.
Data mining is the extraction of interesting patterns or knowledge from huge amounts of data.
It extracts previously unknown information from large databases and uses it to make
organisational decisions.
It is concerned with the discovery of hidden knowledge.
It is useful in making critical organizational decisions, particularly those of a strategic nature.
Examples of data sources: relational databases, data warehouses, transactional databases,
advanced databases, spatial and temporal databases, time series data, stream data and
multimedia data.
4. Draw KDD process
8. List the properties of Data Warehouse.
Subject-oriented, Integrated, Non-volatile, and Time-variant. These characteristics define
how data is organized and used within the warehouse for analysis and decision-making

1. Explain the Benefits and uses of Data Science.


Data science offers numerous benefits and applications across various sectors. It enables better
decision-making, improved customer experiences, increased efficiency, and new opportunities
for innovation. By analyzing data, businesses can identify trends, personalize services,
automate tasks, and ultimately drive growth.
Benefits:

• Improved Decision-Making:
Data science provides insights and evidence-based information to support better
business decisions, moving away from guesswork and towards informed strategies.
• Enhanced Customer Experience:
By understanding customer behavior and preferences, businesses can personalize services,
tailor marketing efforts, and improve overall customer satisfaction.
• Increased Efficiency:
Data science can automate repetitive tasks, optimize processes, and streamline operations,
leading to increased efficiency and reduced costs.
• Innovation and Growth:
Analyzing data can uncover new opportunities, facilitate product development, and drive
innovation across various industries.
• Risk Management and Fraud Detection:
Data science can identify patterns and anomalies in data to detect fraud, manage risks, and
prevent potential losses.
Uses:

• Healthcare:
Data science helps in disease prediction, personalized treatment plans, and optimizing
hospital operations.
• Finance:
It is used for fraud detection, risk management, and providing personalized financial advice.
• E-commerce:
Data science powers recommendation systems, optimizes supply chains, and personalizes
online shopping experiences.
• Transportation:
Data science helps optimize routes, manage traffic, and improve predictive maintenance for
vehicles.
• Marketing:
Data science enables targeted advertising, customer segmentation, and personalized marketing
campaigns.
• Education:
It helps in designing personalized learning experiences, tracking student performance, and
improving administrative efficiency.
• Manufacturing:
Data science optimizes production processes, predicts equipment failures, and improves overall
efficiency.

In essence, data science transforms raw data into actionable insights, enabling organizations to
make better decisions, improve efficiency, and drive innovation across various sectors.

2. Explain about Facets of data

Big data and data science generate very large amounts of data. This data comes in various
types, and the main categories are as follows:

a) Structured

b) Natural language

c) Graph-based

d) Streaming

e) Unstructured

f) Machine-generated

g) Audio, video and images

Structured Data

• Structured data is arranged in a row-and-column format, which makes it easy for applications
to retrieve and process. A database management system is used to store structured data.
• The term structured data refers to data that is identifiable because it is organized in a structure.
The most common form of structured data is a database in which specific information
is stored in columns and rows.

• Structured data is also searchable by data type within content. Structured data is understood
by computers and is also efficiently organized for human readers.

• An Excel table is an example of structured data.
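
As a minimal sketch of the row-and-column idea, the Python snippet below stores a few records in an in-memory SQLite table and retrieves them by column. The table name `students`, its columns and the values are hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical table and values, used only to illustrate rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, marks REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Asha", 82.5), (2, "Ravi", 67.0), (3, "Meena", 91.0)],
)

# Because every row has the same typed columns, retrieval is a simple query.
top_scorers = conn.execute(
    "SELECT name FROM students WHERE marks > 80 ORDER BY marks DESC"
).fetchall()
print(top_scorers)  # [('Meena',), ('Asha',)]
conn.close()
```

The fixed schema is what makes the query possible: the database knows that `marks` is numeric in every row, so it can filter and sort without inspecting free text.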

Unstructured Data

• Unstructured data is data that does not follow a specified format. Rows and columns are not
used for unstructured data, so it is difficult to retrieve the required information.
Unstructured data has no identifiable structure.

• Unstructured data can take the form of text (documents, email messages, customer
feedback), audio, video or images. Email is an example of unstructured data.

• Even today, in most organizations more than 80% of the data is in unstructured form.
This data carries a lot of information, but extracting it from such varied sources is a
major challenge.

• Characteristics of unstructured data:

1. There is no structural restriction or binding for the data.

2. Data can be of any type.

3. Unstructured data does not follow any structural rules.

4. There are no predefined formats, restriction or sequence for unstructured data.

5. Since there is no structural binding for unstructured data, it is unpredictable in nature.
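
Because unstructured text has no columns to query, information usually has to be pulled out by pattern matching or more advanced techniques. The sketch below is one simple possibility, not a general solution: it uses regular expressions on an invented feedback message to find an email address and an order number.

```python
import re

# Invented customer feedback; there are no fields or columns to look up.
feedback = (
    "Hi team, my order #48213 arrived late. "
    "Please reply to priya@example.com as soon as possible."
)

# Simple patterns; real-world email matching is considerably more involved.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", feedback)
order_ids = re.findall(r"#(\d+)", feedback)

print(emails)     # ['priya@example.com']
print(order_ids)  # ['48213']
```

The patterns only work when the text happens to contain the expected shapes, which illustrates why unstructured data is described above as unpredictable.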

Natural Language

• Natural language is a special type of unstructured data.


• Natural language processing enables machines to recognize characters, words and sentences,
then apply meaning and understanding to that information. This helps machines to understand
language as humans do.

• Natural language processing is the driving force behind machine intelligence in many modern
real-world applications. The natural language processing community has had success in entity
recognition, topic recognition, summarization, text completion and sentiment analysis.

• For natural language processing to help machines understand human language, it must go
through speech recognition, natural language understanding and machine translation. It is an
iterative process composed of several layers of text analysis.
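
To make the idea of recognizing words and then attaching meaning more concrete, here is a deliberately naive sketch of sentiment analysis in Python. The word lists `POSITIVE` and `NEGATIVE` are tiny hand-made assumptions, and the helper names are hypothetical; practical systems rely on trained models rather than fixed lexicons.

```python
import re
from collections import Counter

# Tiny hand-made lexicon; an assumption for illustration, not a real resource.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment_score(text):
    """Count positive words minus negative words."""
    counts = Counter(tokenize(text))
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    return pos - neg

review = "The camera is great and the screen is excellent, but the battery is poor."
print(tokenize(review)[:4])     # ['the', 'camera', 'is', 'great']
print(sentiment_score(review))  # 1
```

A score above zero suggests a mostly positive text under this crude scheme; it ignores negation, sarcasm and context, which is where the layered analysis described above comes in.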

Machine - Generated Data

• Machine-generated data is information that is created without human interaction as a result
of a computer process or application activity. This means that data entered manually by an
end-user is not considered to be machine-generated.

• Machine data contains a definitive record of all activity and behavior of our customers, users,
transactions, applications, servers, networks, factory machinery and so on.

• It includes configuration data, data from APIs and message queues, change events, the output
of diagnostic commands, call detail records, sensor data from remote equipment and more.

• Examples of machine data are web server logs, call detail records, network event logs and
telemetry.

• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions generate
machine data. Machine data is generated continuously by every processor-based system, as
well as many consumer-oriented systems.

• It can be either structured or unstructured. In recent years, the volume of machine data has
surged. The expansion of mobile devices, virtual servers and desktops, as well as cloud-based
services and RFID technologies, is making IT infrastructures more complex.
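
Web server logs are a typical example of machine data that looks like free text but follows a regular layout. The following sketch parses a few invented log lines with a regular expression and counts the HTTP status codes. The log layout and the pattern `LINE_RE` are assumptions for illustration; real log formats vary by server and configuration.

```python
import re
from collections import Counter

# Invented log lines in a common web-server access-log layout.
log_lines = [
    '10.0.0.1 - - [12/Mar/2024:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.2 - - [12/Mar/2024:10:15:40 +0000] "GET /missing HTTP/1.1" 404 512',
    '10.0.0.1 - - [12/Mar/2024:10:16:01 +0000] "POST /login HTTP/1.1" 200 230',
]

# Named groups turn each line into a small record (a dict of fields).
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

records = [m.groupdict() for line in log_lines if (m := LINE_RE.match(line))]
status_counts = Counter(r["status"] for r in records)
print(status_counts)  # Counter({'200': 2, '404': 1})
```

Once parsed, the records behave like structured data, which is one way machine data can move between the unstructured and structured categories.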

Graph-based or Network Data


• Graphs are data structures that describe relationships and interactions between entities in
complex systems. In general, a graph contains a collection of entities called nodes and another
collection of interactions between pairs of nodes called edges.

• Nodes represent entities, which can be of any object type that is relevant to our problem
domain. By connecting nodes with edges, we will end up with a graph (network) of nodes.

• A graph database stores nodes and relationships instead of tables or documents. Data is stored
just like we might sketch ideas on a whiteboard. Our data is stored without restricting it to a
predefined model, allowing a very flexible way of thinking about and using it.

• Graph databases are used to store graph-based data and are queried with specialized query
languages such as SPARQL.

• Graph databases are capable of sophisticated fraud prevention. With graph databases, we
can use relationships to process financial and purchase transactions in near-real time. With fast
graph queries, we are able to detect that, for example, a potential purchaser is using the same
email address and credit card as included in a known fraud case.

• Graph databases can also help users easily detect relationship patterns, such as multiple
people associated with a personal email address or multiple people sharing the same IP address
but residing at different physical addresses.

• Graph databases are a good choice for recommendation applications. With graph databases,
we can store in a graph relationships between information categories such as customer interests,
friends and purchase history. We can use a highly available graph database to make product
recommendations to a user based on which products are purchased by others who follow the
same sport and have similar purchase history.

• Graph theory was probably the main method of social network analysis in the early history of
the social network concept. The approach is applied to social network analysis in order to
determine important features of the network, such as the nodes and links (for example,
influencers and their followers).
• Influencers on a social network have been identified as users who have an impact on the
activities or opinions of other users, by way of followership or influence on decisions made by
other users on the network, as shown in Fig. 1.2.1.

• Graph theory has proved to be very effective on large-scale datasets such as social network
data. This is because it is capable of by-passing the building of an actual visual representation
of the data to run directly on data matrices.
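
The node-and-edge ideas above can be sketched with plain Python dictionaries. The example below builds an adjacency list from hypothetical "follows" edges, picks the most-followed node as a simple proxy for an influencer, and then groups accounts by a shared email address in the spirit of the fraud example. All names and edges are invented, and a real graph database would offer its own query language instead.

```python
from collections import defaultdict

# Hypothetical "follows" edges: (follower, followed).
follows = [
    ("anu", "kavi"), ("bala", "kavi"), ("chitra", "kavi"),
    ("kavi", "anu"), ("bala", "anu"),
]

# Adjacency list: node -> set of nodes that follow it.
followers = defaultdict(set)
for src, dst in follows:
    followers[dst].add(src)

# The node with the most incoming edges is a simple proxy for an influencer.
influencer = max(followers, key=lambda node: len(followers[node]))
print(influencer, len(followers[influencer]))  # kavi 3

# Hypothetical account -> email edges, used to spot shared attributes.
account_email = [
    ("acct1", "x@example.com"),
    ("acct2", "y@example.com"),
    ("acct3", "x@example.com"),
]
by_email = defaultdict(set)
for account, email in account_email:
    by_email[email].add(account)

# Emails linked to more than one account are candidates for review.
shared = {e: accts for e, accts in by_email.items() if len(accts) > 1}
print(shared)
```

Counting incoming edges is only the simplest centrality measure; it is used here because it maps directly onto the follower relationship described in the text.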

Audio, Image and Video

• Audio, image and video are data types that pose specific challenges to a data scientist. Tasks
that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging
for computers.

• The terms audio and video commonly refer to the time-based media storage formats for
sound/music and moving-picture information. Digital audio and video recordings, whose
formats are also referred to as audio and video codecs, can be uncompressed, losslessly
compressed or lossy compressed, depending on the desired quality and use case.
• It is important to remark that multimedia data is one of the most important sources of
information and knowledge; the integration, transformation and indexing of multimedia data
bring significant challenges in data management and analysis. Many challenges have to be
addressed including big data, multidisciplinary nature of Data Science and heterogeneity.

Streaming Data

• Streaming data is data that is generated continuously by thousands of data sources, which
typically send in the data records simultaneously and in small sizes (on the order of kilobytes).

• Streaming data includes a wide variety of data such as log files generated by customers using
your mobile or web applications, ecommerce purchases, in-game player activity, information
from social networks, financial trading floors or geospatial services and telemetry from
connected devices or instrumentation in data centers.
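
Since streaming records arrive continuously, they are usually processed one at a time rather than loaded as a whole. The sketch below imitates this with a Python generator and a sliding-window average. The function names, the readings and the window size of 3 are assumptions made for the example.

```python
from collections import deque

def sensor_stream():
    """Stand-in for an unbounded source; yields invented temperature readings."""
    for reading in [21.0, 21.5, 22.0, 30.0, 22.5, 22.0]:
        yield reading

def moving_average(stream, window=3):
    """Yield the mean of the most recent `window` records as each one arrives."""
    recent = deque(maxlen=window)
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)

averages = list(moving_average(sensor_stream()))
print(averages[:4])  # [21.0, 21.25, 21.5, 24.5]
```

Only the last few records are kept in memory, so the same code would work on a source that never ends; the list is used here only so the result can be printed.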
3. Explain in detail about Data modelling phase in Data Science process.
1. Understanding the Business Problem and Data Requirements:

• The first step is to clearly define the business problem that the data science project aims
to solve.
• This involves understanding the specific questions the data needs to answer and the
goals the project aims to achieve.
• This stage also involves identifying the data sources and the type of data needed to
address the problem.
2. Conceptual Data Modeling:

• This stage involves creating a high-level, abstract representation of the data, focusing
on the core entities and their relationships.
• It is independent of any specific technology or database.
• For example, in a customer relationship management (CRM) system, entities might
include "Customer," "Order," and "Product," with relationships like "Customer places
Order" and "Order contains Product".
3. Logical Data Modeling:

• This stage refines the conceptual model by adding more detail, including specific data
types, attributes, and constraints.
• It defines how data will be organized within a specific database or data management
system.
• For example, it might specify that the "Customer" entity has attributes like
"CustomerID" (integer), "Name" (string), and "Address" (string).
4. Physical Data Modeling:

• This stage involves translating the logical model into a specific database schema,
including tables, columns, indexes, and relationships.
• It focuses on performance optimization and storage considerations.
• This stage is typically handled by database administrators and developers.
5. Validation and Refinement:

• Once the data model is created, it is crucial to validate it against the business
requirements and data quality standards.
• This involves ensuring that the model accurately represents the data and supports the
intended analysis and decision-making processes.
• The model might be refined based on feedback from stakeholders or during the data
exploration and analysis phase.
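
The move from a logical to a physical model can be illustrated with a short Python sketch built on the CRM example. The dataclasses play the role of the logical model, and the SQLite schema is one possible physical model. The `Order` attributes, the table names and the index name are assumptions, since the text only specifies the attributes of "Customer".

```python
import sqlite3
from dataclasses import dataclass

# Logical model: entities with typed attributes (Customer follows the example
# in the text; the Order fields are assumptions).
@dataclass
class Customer:
    customer_id: int
    name: str
    address: str

@dataclass
class Order:
    order_id: int
    customer_id: int  # represents the "Customer places Order" relationship

# Physical model: one possible SQLite schema with keys and an index.
SCHEMA = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    address     TEXT
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
);
CREATE INDEX idx_order_customer ON customer_order(customer_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

c = Customer(1, "Asha", "Chennai")
o = Order(100, c.customer_id)
conn.execute("INSERT INTO customer VALUES (?, ?, ?)", (c.customer_id, c.name, c.address))
conn.execute("INSERT INTO customer_order VALUES (?, ?)", (o.order_id, o.customer_id))

# Validation step: check that the relationship can actually be queried.
rows = conn.execute(
    "SELECT c.name, o.order_id FROM customer c "
    "JOIN customer_order o ON o.customer_id = c.customer_id"
).fetchall()
print(rows)  # [('Asha', 100)]
conn.close()
```

The index on `customer_id` is an example of the performance considerations mentioned for the physical stage; it has no counterpart in the conceptual or logical models.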
Key Benefits of Data Modeling:

• Data Integrity and Consistency: Ensures data accuracy, reliability, and uniformity
across the system.
• Efficient Querying and Analysis: Facilitates faster and more efficient data retrieval
and analysis.
• Improved Communication: Provides a common language for stakeholders to
understand and discuss data-related concepts.
• Better Decision-Making: Enables informed decision-making based on accurate and
reliable data.
• Compliance and Security: Helps in adhering to data governance policies and security
regulations.
4. Explain in detail about Data Mining and Data warehousing?
Data warehousing is the process of collecting, storing, and managing large volumes of data from
various sources in a central repository, while data mining is the process of analyzing that data to
discover patterns, trends, and insights that can be used for decision-making.
Data Warehousing:

• Purpose:
Data warehousing aims to consolidate data from multiple sources into a single,
consistent, and reliable repository. This allows for efficient querying, reporting, and
analysis of historical data.
• Characteristics:
Data warehouses are typically subject-oriented (organized around specific business areas),
integrated (combining data from different sources), time-variant (containing historical data),
and non-volatile (data is not frequently updated).
• Key Processes:
The core process in data warehousing is ETL (Extract, Transform, Load), which involves
extracting data from various sources, transforming it into a suitable format, and loading it into
the warehouse.
• Benefits:
Data warehousing improves data quality, provides a comprehensive view of business
operations, supports informed decision-making, and enhances system performance by
separating analytical processing from transactional databases.
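
The ETL process named above can be sketched end to end in a few lines of Python. Here an in-memory CSV stands in for a source system and an in-memory SQLite database stands in for the warehouse. The column names, the data and the cleaning rules are all assumptions for illustration.

```python
import csv
import io
import sqlite3

# Extract: an in-memory CSV stands in for a source system (invented data).
source = io.StringIO("order_id,amount,country\n1, 250.50 ,in\n2,99.90,IN\n3,,us\n")
raw_rows = list(csv.DictReader(source))

# Transform: parse numbers, normalise country codes, drop rows with no amount.
clean_rows = [
    (int(r["order_id"]), float(r["amount"]), r["country"].strip().upper())
    for r in raw_rows
    if r["amount"].strip()
]

# Load: write the cleaned rows into a warehouse table (here, SQLite in memory).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (order_id INTEGER, amount REAL, country TEXT)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean_rows)

totals = warehouse.execute(
    "SELECT country, ROUND(SUM(amount), 2) FROM sales GROUP BY country"
).fetchall()
print(totals)  # [('IN', 350.4)]
warehouse.close()
```

The transform step is where the "integrated" property comes from: the two spellings of the country code are merged, so later analysis sees one consistent value.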
Data Mining:

• Purpose:
Data mining utilizes computational techniques to uncover hidden patterns, correlations,
and anomalies within large datasets.
• Key Techniques:
Common data mining techniques include association rule mining, classification, clustering, and
regression analysis.
• Applications:
Data mining is used in various industries, such as marketing (customer segmentation, targeted
advertising), finance (fraud detection, risk management), and healthcare (disease prediction,
personalized medicine).
• Benefits:
Data mining provides actionable insights that can be used to improve business strategies,
enhance customer relationships, optimize operations, and predict future trends.
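
Of the techniques listed, association rule mining is easy to illustrate by hand. The sketch below computes the two standard measures, support and confidence, for a handful of invented shopping baskets. The function names are hypothetical, and a full algorithm such as Apriori would additionally search for all frequent itemsets.

```python
# Invented market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))         # 0.5
print(confidence({"butter"}, {"bread"}))  # 1.0
```

In this toy data every basket containing butter also contains bread, so the rule "butter -> bread" has a confidence of 1.0, although it is supported by only half of the transactions.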

Relationship between Data Warehousing and Data Mining:

• Data Warehousing as a Foundation:
Data warehousing provides the necessary infrastructure and data foundation for
effective data mining. Without a well-structured and organized data warehouse, data
mining efforts would be significantly hampered.
• Data Mining as an Analytical Tool:
Data mining leverages the data stored in the warehouse to extract valuable knowledge and
insights. It is the process of turning raw data into actionable intelligence.
• Complementary Processes:
Data warehousing and data mining work together to enable businesses to make data-driven
decisions and gain a competitive edge.

5. Explain the various basic statistical descriptions of data.


Measures of Central Tendency:

• Mean: The average of all data points, calculated by summing all values and dividing
by the number of values. Sensitive to outliers.
• Median: The middle value in a sorted dataset. More robust to outliers than the mean.
• Mode: The most frequently occurring value in the dataset.
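
As a quick illustration, the three measures can be computed with Python's standard `statistics` module. The sample values are invented and include one outlier (95) to show how it pulls the mean but not the median.

```python
import statistics

# Invented sample with one outlier (95) to show its effect on the mean.
values = [4, 5, 5, 6, 7, 95]

mean = statistics.mean(values)      # sum of values / number of values
median = statistics.median(values)  # average of the two middle values here
mode = statistics.mode(values)      # most frequent value

print(mean, median, mode)
```

The mean comes out at about 20.33, far above most of the data, while the median (5.5) and the mode (5) stay close to the typical values.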


Measures of Variability:

• Range: The difference between the highest and lowest values in a dataset.
• Variance: Measures how spread out the data is from the mean, calculated by averaging
the squared differences between each data point and the mean.
• Standard Deviation: The square root of the variance. Provides a measure of spread in
the same units as the original data, making it more interpretable than variance.
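
The following sketch computes these measures for a small invented dataset. It uses the population versions from the `statistics` module, which divide by the number of values as in the definition above; the sample versions (`statistics.variance` and `statistics.stdev`) divide by one less than the number of values instead.

```python
import statistics

# Invented dataset with mean 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(values) - min(values)
variance = statistics.pvariance(values)  # population variance (divide by N)
std_dev = statistics.pstdev(values)      # square root of the variance

print(value_range, variance, std_dev)
```

The squared differences from the mean sum to 32, so the variance is 32 / 8 = 4 and the standard deviation is 2, expressed in the same units as the data.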


Measures of Distribution:

• Skewness:
Describes the asymmetry of the data distribution. A positive skew indicates a long tail
on the right, and a negative skew indicates a long tail on the left.
• Kurtosis:
Describes how heavy the tails of the distribution are, and is often loosely described as its
"peakedness". High kurtosis indicates a sharp peak and heavy tails, while low kurtosis
indicates a flatter peak and lighter tails.
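
Both measures can be computed from central moments. The sketch below uses the simple moment-based (population) formulas, which is one common convention among several; the helper names and the sample data are assumptions for illustration.

```python
def central_moment(data, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** k for x in data) / len(data)

def skewness(data):
    """Moment-based (population) skewness: m3 / m2 ** 1.5."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric))         # 0.0
print(skewness(right_tailed) > 0)  # True
print(excess_kurtosis(symmetric))
```

The symmetric sample has zero skewness, while the sample with a single large value on the right has positive skewness. The evenly spread sample also has negative excess kurtosis (about -1.3), reflecting its light tails.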
