Data Science
SAKUNTHALA ENGINEERING
COLLEGE
Assistant Professor
COURSE CODE:191AI221
Text Books
1. Blum, Avrim, John Hopcroft, and Ravindran Kannan. Foundations of Data Science.
Cambridge University Press, 2020.
2. Hopcroft, John, and Ravi Kannan. "Foundations of data science." Microsoft (2014).
Reference Books
1. Fan, Jianqing, et al. Statistical Foundations of Data Science. CRC press, 2020.
2. Kubben, Pieter, Michel Dumontier, and Andre Dekker. Fundamentals of clinical data
science. Springer Nature, 2019.
DATA
INFORMATION
DATA MODELS
DATA TYPES
Data are units of information, often numeric, that are collected through observation. In a more
technical sense, data are a set of values of qualitative or quantitative variables about one or
more persons or objects, while a datum (singular of data) is a single value of a single variable.
Data are used in scientific research,
finance, governance (e.g., crime rates, unemployment rates, literacy rates), and
in virtually every other form of human organizational activity (e.g., censuses of the
number of homeless people by non-profit organizations).
Data are measured, collected, reported, and analysed, and are used to produce visualizations
such as graphs, tables or images.
Data, as a general concept, refers to the fact that some existing information or knowledge is
represented or coded in some form suitable for better usage or processing.
Raw data ("unprocessed data") is a collection of numbers or characters before it has been
"cleaned" and corrected by researchers. Raw data needs to be corrected to remove outliers or
obvious instrument or data entry errors (e.g., a thermometer reading from an outdoor Arctic
location recording a tropical temperature).
Data processing commonly occurs by stages, and the "processed data" from one stage may be
considered the "raw data" of the next stage.
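The cleaning step and the staging idea above can be sketched in a few lines of Python. This is only an illustration: the readings, the plausibility bounds and the function names are assumptions made for the example, not part of the course material.

```python
# Hypothetical raw readings (degrees Celsius) from an outdoor Arctic station.
raw_readings = [-31.5, -29.8, 27.4, -30.2, -33.0]

# Assumed plausibility range for this location; the bounds are illustrative.
PLAUSIBLE_MIN, PLAUSIBLE_MAX = -70.0, 15.0


def clean(readings):
    """Stage 1: drop readings that fall outside the plausible range."""
    return [r for r in readings if PLAUSIBLE_MIN <= r <= PLAUSIBLE_MAX]


def daily_mean(readings):
    """Stage 2: its 'raw data' is the processed output of stage 1."""
    return sum(readings) / len(readings)


cleaned = clean(raw_readings)      # the tropical value 27.4 is removed
mean_temp = daily_mean(cleaned)
print(cleaned, mean_temp)
```

A fixed range check is the simplest possible rule; real cleaning usually combines several checks.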
Field data is raw data that is collected in an uncontrolled "in situ" environment. Experimental
data is data that is generated within the context of a scientific investigation by observation and
recording.
Kinds of data documents include:
i. data repository
iv. software
v. data paper
vi. database
In general, data is any set of characters that is gathered and translated for some purpose,
usually analysis. If data is not put into context, it means nothing to a human or a computer.
Data is represented with the help of characters such as letters (A-Z, a-z), digits (0-9) or
special characters (+, -, /, *, <, >, =, etc.).
There are multiple types of data. Some of the more common types of data include the following:
Single character
Text (string)
Picture
Sound
Video
In a computer's storage, digital data is a series of bits (binary digits) that have the value one or
zero. Data is processed by the CPU, which uses logical operations to produce new data (output)
from source data (input).
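As a small illustration of the bit-level view, the Python sketch below shows the bits behind a short text and one logical operation on two bit patterns. The sample text and the bit patterns are arbitrary choices for the example.

```python
text = "Hi"
encoded = text.encode("ascii")     # one byte per ASCII character

# Show each byte as eight binary digits.
bits = " ".join(f"{byte:08b}" for byte in encoded)
print(bits)

# A logical operation producing new data (output) from source data (input):
# bitwise AND keeps only the bits that are set in both inputs.
result = 0b1100 & 0b1010
print(f"{result:04b}")
```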
Primary Data
• Qualitative Data
• Quantitative Data
Secondary Data
• Internal Data
• External Data
Data Processing Cycle:
Input − In this step, the input data is prepared in some convenient form for processing.
The form will depend on the processing machine. For example, when electronic
computers are used, the input data can be recorded on any one of the several types of
input medium, such as magnetic disks, tapes, and so on.
Processing − In this step, the input data is changed to produce data in a more useful
form. For example, pay-checks can be calculated from the time cards, or a summary of
sales for the month can be calculated from the sales orders.
Output − At this stage, the result of the preceding processing step is collected. The
particular form of the output data depends on the use of the data. For example, output
data may be pay-checks for employees.
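The three steps can be sketched as a tiny Python program that follows the pay-check example. The employee names, the hours and the flat hourly rate are hypothetical values chosen for illustration.

```python
# Input: hypothetical time cards as (employee, hours worked) pairs.
time_cards = [("Asha", 40), ("Ravi", 35)]
HOURLY_RATE = 20  # assumed flat rate, for illustration only


def process(cards, rate):
    """Processing: turn hours worked into pay amounts."""
    return {name: hours * rate for name, hours in cards}


def output(pay):
    """Output: format the result for its intended use (a pay slip line)."""
    return [f"{name}: {amount}" for name, amount in pay.items()]


pay = process(time_cards, HOURLY_RATE)
slips = output(pay)
print(slips)
```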
INFORMATION
What is Information?
Information is organized or classified data, which has some meaningful value for the
receiver. Information is the processed data on which decisions and actions are based. For the
decision to be meaningful, the processed data must qualify for the following characteristics −
2. Information can be encoded into various forms for transmission and interpretation (for
example, information may be encoded into a sequence of signs, or transmitted via
a signal). It can also be encrypted for safe storage and communication.
B. From data to information and from information to business intelligence, every business
relies on the data generated. Businesses are taking advantage of this process to create a
difference in their market approach.
C. Business information, like other segments of the information industry, takes several
forms, i.e., News, Credit & Financial Information, Market Research, IT Research, and
Industry Analysis. These can be further categorized into directories, periodicals, statistics,
government information, guides, handbooks, and almanacs.
D. The Internet has made it relatively easier for publishers to deliver business information,
especially with subscription models that deliver content to their user base.
E. Market research does not stem from a single linear source of data; rather, it is an
exhaustive process in which analysts separate out the good data, which is the cornerstone
of any business strategy.
Business information systems are designed to help organizations make important decisions
and attain their objectives. Such a system uses the resources provided by the IT
infrastructure to meet the needs of the various entities inside a business enterprise.
McCreadie and Rice Concept:
Information as part of the communication process: Timing and social factors play a
significant role in the processing and interpretation of information.
Summary as follows:
What is Data?
Data is a raw and unorganized fact that must be processed to make it meaningful. Data
can be simple, and it remains unorganized until it is organized. Generally, data comprises
facts, observations, perceptions, numbers, characters, symbols, images, etc.
Data must always be interpreted, by a human or machine, to derive meaning. So, on its own,
data is meaningless. Data contains numbers, statements, and characters in a raw form.
What is Information?
Information is a set of data which is processed in a meaningful way according to the given
requirement. Information is processed, structured, or presented in a given context to make it
meaningful and useful.
It is processed data which includes data that possess context, relevance, and purpose. It also
involves manipulation of raw data.
Information assigns meaning and improves the reliability of the data. It reduces
uncertainty. So, when data is transformed into information, it should no longer contain
useless details.
Data is a raw and unorganized fact that is required to be processed to make it meaningful
whereas Information is a set of data that is processed in a meaningful way according to
the given requirement.
Data does not have any specific purpose whereas Information carries a meaning that
has been assigned by interpreting data.
Data is measured in bits and bytes, whereas Information is measured in
meaningful units like time, quantity, etc.
Data can be structured as tabular data, a graph, or a data tree, whereas Information is
language, ideas, and thoughts based on the given data.
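The distinction can be made concrete with a short Python sketch. The marks and the pass mark are hypothetical; the point is only that the list of numbers is data, while the sentence derived from it is information a reader can act on.

```python
marks = [72, 45, 88, 39, 91]   # data: raw, unorganized facts
PASS_MARK = 50                 # assumed threshold, for illustration only

# Processing the data in the context of a requirement ("who passed?")
passed = sum(1 for mark in marks if mark >= PASS_MARK)

# Information: processed data that carries meaning for the receiver.
summary = f"{passed} of {len(marks)} students passed"
print(summary)
```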
DATA MODELS
What Is a Data Model?
A data model is a visual representation of data elements and the relationships between them.
Data models help business and technical resources collaborate in the design of information
systems and the databases that power them. They show what data is required and how it needs
to be structured to support various business processes.
A data model is an abstract model that organizes elements of data and standardizes how they
relate to one another and to the properties of real-world entities. For instance, a data model may
specify that the data element representing a car be composed of a number of other elements
which, in turn, represent the color and size of the car and define its owner.
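The car example can be written down as a minimal Python sketch using dataclasses. The class names, field types and sample values are assumptions made for illustration; a real data model would normally be expressed in a modelling notation or a database schema.

```python
from dataclasses import dataclass


@dataclass
class Owner:
    name: str


@dataclass
class Car:
    # The car element is composed of other elements: color, size, owner.
    color: str
    size: str
    owner: Owner


car = Car(color="red", size="compact", owner=Owner(name="A. Kumar"))
print(car)
```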
Data modelling is a critical component of metadata management, data governance and data
intelligence. It provides an integrated view of conceptual, logical and physical data models to
help business and IT stakeholders understand data structures and their meaning.
1) Ensures that all data objects required by the database are accurately represented.
Omission of data will lead to creation of faulty reports and produce incorrect results.
2) A data model helps design the database at the conceptual, physical and logical levels.
3) Data Model structure helps to define the relational tables, primary and foreign keys and
stored procedures.
4) It provides a clear picture of the base data and can be used by database developers to
create a physical database.
6) Though the initial creation of a data model is labour-intensive and time-consuming, in the
long run it makes upgrading and maintaining your IT infrastructure cheaper and faster.
There are mainly three different types of data models: conceptual data models, logical data
models, and physical data models, and each one has a specific purpose. The data models are
used to represent the data and how it is stored in the database and to set the relationship between
data items.
A conceptual data model is a rough draft, containing the relevant concepts or entities and the
relationships between them.
A logical data model, also referred to as information modelling, is the second stage of data
modelling. It is a graphical representation of the information requirements for a given business
area.
A physical data model provides the database-specific context, elaborating on the conceptual
and logical models produced prior. Accordingly, physical data models are often treated as the
blueprint for a proposed database.
A Conceptual Data Model is an organized view of database concepts and their relationships.
The purpose of creating a conceptual data model is to establish entities, their attributes, and
relationships. In this data modelling level, there is hardly any detail available on the actual
database structure. Business stakeholders and data architects typically create a conceptual data
model.
Customer and Product are two entities. Customer number and name are attributes of the
Customer entity.
ii. This type of data model is designed and developed for a business audience.
Conceptual data models, also known as domain models, create a common vocabulary for all
stakeholders by establishing basic concepts and scope.
The Logical Data Model is used to define the structure of data elements and to set
relationships between them. The logical data model adds further information to the conceptual
data model elements. Its advantage is that it provides a foundation for the physical model,
while the modelling structure itself remains generic.
i. Describes data needs for a single project but could integrate with other logical data
models based on the scope of the project.
iii. Data attributes will have data types with exact precision and length.
i. The physical data model describes the data needs of a single project or application, though
it may be integrated with other physical data models based on project scope.
ii. The data model contains relationships between tables that address the cardinality and
nullability of the relationships.
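As a rough sketch of what a physical model adds, the Python example below creates tables for the Customer and Product entities in an in-memory SQLite database, with data types, primary and foreign keys, and nullability spelled out. The `sale` table, the column names and the column sizes are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked

conn.executescript("""
CREATE TABLE customer (
    customer_number INTEGER PRIMARY KEY,
    name            VARCHAR(100) NOT NULL
);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    price      DECIMAL(10, 2) NOT NULL
);
-- Hypothetical relationship table: each sale links one customer to one product.
CREATE TABLE sale (
    sale_id         INTEGER PRIMARY KEY,
    customer_number INTEGER NOT NULL REFERENCES customer(customer_number),
    product_id      INTEGER NOT NULL REFERENCES product(product_id)
);
""")

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# A sale that points at a non-existent customer violates the foreign key.
try:
    conn.execute("INSERT INTO sale (customer_number, product_id) VALUES (99, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(tables, rejected)
```

The conceptual model only names Customer and Product; the exact types, keys and NOT NULL rules appear at the physical level.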
Flat model
This may not strictly qualify as a data model. The flat (or table) model consists of a single, two-
dimensional array of data elements, where all members of a given column are assumed to be
similar values, and all members of a row are assumed to be related to one another.
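A CSV file is a common everyday instance of the flat model. The short Python sketch below reads one from an in-memory string; the column names and values are made up for the example.

```python
import csv
import io

# A hypothetical flat file: one two-dimensional table, nothing else.
flat_file = io.StringIO("name,age,city\nAsha,21,Chennai\nRavi,23,Salem\n")

rows = list(csv.reader(flat_file))
header, records = rows[0], rows[1:]

# All members of a column are similar values (here: ages, read as text).
age_column = [record[header.index("age")] for record in records]

# All members of a row are related to one another (one person).
first_person = dict(zip(header, records[0]))
print(age_column, first_person)
```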
Hierarchical model
In this type of data model, the data is organized into a tree-like structure that has a single root
to which all the data is linked. The hierarchy begins at the root and expands like a tree:
each node can have child nodes, which expand further in the same manner. A child node has
exactly one parent node, but one parent can have multiple child nodes. Because the data is
stored as a tree, retrieving data means traversing the tree from the root node. The hierarchical
data model therefore represents one-to-many relationships between various types of data. The
data is stored in the form of records that are connected through links.
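The tree structure and root-first retrieval can be sketched in Python with nested dictionaries. The node names and the `find_path` helper are hypothetical and chosen only to illustrate the idea.

```python
# Each node has a name and a list of children; there is a single root.
tree = {
    "name": "College",
    "children": [
        {"name": "Engineering", "children": [
            {"name": "AI & DS", "children": []},
            {"name": "Mechanical", "children": []},
        ]},
        {"name": "Administration", "children": []},
    ],
}


def find_path(node, target, path=()):
    """Depth-first search from the root; returns the path to the target."""
    path = path + (node["name"],)
    if node["name"] == target:
        return list(path)
    for child in node["children"]:
        found = find_path(child, target, path)
        if found:
            return found
    return None


path = find_path(tree, "Mechanical")
missing = find_path(tree, "Physics")
print(path, missing)
```

Every lookup starts at the root, which is why access paths in a hierarchical model are fixed by the shape of the tree.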
Network model
This model organizes data using two fundamental constructs, called records and sets. Records
contain fields, and sets define one-to-many relationships between records: one owner, many
members. The network data model is an abstraction of the design concept used in the
implementation of databases. It is a type of database model designed as a flexible approach
to representing objects and the relationships that exist among them. The schema is very
important in the network data model; it can be represented in the form of a graph, where the
nodes represent objects and the edges represent relationships.
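A minimal Python sketch of records and sets follows. The record identifiers, the set names and the `owners_of` helper are assumptions for illustration. Unlike in a tree, a record here can be a member of more than one set, and so can have more than one owner.

```python
# Records contain fields.
records = {
    "S1": {"type": "student", "name": "Asha"},
    "P1": {"type": "professor", "name": "Dr. Rao"},
    "C1": {"type": "course", "name": "Data Science"},
    "C2": {"type": "course", "name": "Databases"},
}

# Sets define one-to-many relationships: one owner, many members.
sets = {
    "enrols_in": {"S1": ["C1", "C2"]},
    "teaches": {"P1": ["C1"]},
}


def owners_of(member_id):
    """Return every owner record linked to the given member, across all sets."""
    return sorted(
        owner
        for members_by_owner in sets.values()
        for owner, members in members_by_owner.items()
        if member_id in members
    )


print(owners_of("C1"), owners_of("C2"))
```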
Relational model
In this data model, tables are used to collect groups of elements into relations. Both the
data and the relationships are represented using interrelated tables. Each table has multiple
rows and columns, in which each column represents an attribute of the entity and each row
represents a record.
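A short Python sketch with the standard `sqlite3` module shows two interrelated tables and a query that follows the relationship between them. The table names, columns and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE enrolment (
    student_id INTEGER REFERENCES student(student_id),
    course     TEXT NOT NULL
);
INSERT INTO student VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO enrolment VALUES (1, 'Data Science'), (1, 'Databases'), (2, 'Data Science');
""")

# The relationship is expressed by matching values in the two tables.
rows = conn.execute("""
    SELECT s.name, e.course
    FROM student AS s
    JOIN enrolment AS e ON e.student_id = s.student_id
    ORDER BY s.name, e.course
""").fetchall()
print(rows)
```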
Object-relational model
Similar to a relational database model, but objects, classes and inheritance are directly
supported in database schemas and in the query language.
Object-role modelling
A method of data modelling that has been defined as "attribute free", and "fact-based". The
result is a verifiably correct system, from which other common artifacts, such as ERD, UML,
and semantic models may be derived. Associations between data objects are described during
the database design procedure, such that normalization is an inevitable result of the process.
Star schema
The simplest style of data warehouse schema. The star schema consists of a few "fact tables"
(possibly only one, justifying the name) referencing any number of "dimension tables". The
star schema is considered an important special case of the snowflake schema.
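A minimal star schema can be sketched with `sqlite3`: one fact table in the middle that references two dimension tables. The table names, columns and figures are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "who/what/when" of each fact.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, month TEXT);

-- The fact table holds the measurements and references the dimensions.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     INTEGER
);

INSERT INTO dim_product VALUES (1, 'Pen'), (2, 'Book');
INSERT INTO dim_date    VALUES (1, 'Jan'), (2, 'Feb');
INSERT INTO fact_sales  VALUES (1, 1, 10), (2, 1, 40), (1, 2, 15);
""")

# A typical warehouse query: aggregate the facts, grouped by a dimension.
totals = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(totals)
```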
A data structure diagram (DSD) is a diagram and data model used to describe conceptual data
models by providing graphical notations which document entities and their relationships, and
the constraints that bind them. The basic graphic elements of DSDs are boxes, representing
entities, and arrows, representing relationships. Data structure diagrams are most useful for
documenting complex data entities. Data structure diagrams are an extension of the entity-
relationship model (ER model).
What is Data Modelling?
Data modelling is the process of producing a diagram (i.e., an ERD) of the relationships between
the various types of information that are to be stored in a database. It helps us to think
systematically about the key data points to be stored and retrieved, and how they should be
grouped and related.
A data model describes information in a systematic way that allows it to be stored and retrieved
efficiently in a Relational Database System. It can be thought of as a way of translating the
logic of accurately describing things in the real world, and the relationships between them, into
rules that can be followed and enforced by computer code. One of the goals of data modelling
is to create the most efficient method of storing information while still providing for complete
access and reporting.
The Entity-Relationship (ER) Model is based on the notion of real-world entities and relationships
among them. When a real-world scenario is formulated as a database model, the ER Model
creates entity sets, relationship sets, general attributes and constraints.
ER Model is based on −
ER modelling is an important technique for any database designer to master and forms the basis
of the methodology.
Entity type: It is a group of objects with the same properties that are identified by the enterprise
as having an independent existence. The basic concept of the ER model is the entity type, which
is used to represent a group of ‘objects’ in the ‘real world’ with the same properties. An entity
type has an independent existence within a database.
Entity − An entity in an ER Model is a real-world entity having properties called attributes.
Every attribute is defined by its set of values, called its domain. For example, in a school
database, a student is considered as an entity. Student has various attributes like name, age,
class, etc.
Attributes are the properties of entities that are represented using ellipse-shaped figures. Every
elliptical figure represents one attribute and is directly connected to its entity (which is
represented as a rectangle).
A relationship type is a set of associations between one or more participating entity types.
Each relationship type is given a name that describes its function. There are four types of
relationships. These are:
One-to-one: When one instance of an entity is associated with at most one instance of
the other entity, it is termed as ‘1:1’.
One-to-many: When one instance of the entity on the left can be associated with more
than one instance of the entity on the right, it is termed as ‘1:N’.
Many-to-one: When more than one instance of the entity on the left can be associated
with one instance of the entity on the right, it is termed as ‘N:1’.
Many-to-many: When more than one instance of an entity on the left and more than one
instance of an entity on the right can be linked with the relationship, it is termed
as an ‘M:N’ relationship.
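The four relationship types can be told apart by counting how many instances appear on each side. The Python sketch below classifies a relationship from a list of (left, right) instance pairs. The `cardinality` helper and the sample pairs are hypothetical, and the function only describes the pairs it is given, whereas a real schema states what is allowed rather than what happens to be present.

```python
from collections import defaultdict


def cardinality(pairs):
    """Classify a relationship given as (left, right) instance pairs."""
    rights_per_left = defaultdict(set)
    lefts_per_right = defaultdict(set)
    for left, right in pairs:
        rights_per_left[left].add(right)
        lefts_per_right[right].add(left)

    # Does some left instance link to several right instances, and vice versa?
    left_has_many = any(len(r) > 1 for r in rights_per_left.values())
    right_has_many = any(len(l) > 1 for l in lefts_per_right.values())

    if left_has_many and right_has_many:
        return "M:N"
    if left_has_many:
        return "1:N"
    if right_has_many:
        return "N:1"
    return "1:1"


print(cardinality([("Asha", "Passport-1"), ("Ravi", "Passport-2")]))
print(cardinality([("CSE", "Asha"), ("CSE", "Ravi")]))
print(cardinality([("Asha", "CSE"), ("Ravi", "CSE")]))
print(cardinality([("Asha", "DS"), ("Asha", "DB"), ("Ravi", "DS")]))
```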
Data modelling is the first step to ensuring mission-critical information is used, understood and
trusted across the enterprise. It has many benefits. Following are the top six benefits of data
modelling organizations can realize:
3. Support regulatory compliance now and into the future by governing data modelling
teams, processes, portfolios and lifecycles.
(4)A good model can adapt to changes in requirements, but not at the expense .
The main goal of designing a data model is to make certain that the data objects offered by
the functional team are represented accurately.
The data model should be detailed enough to be used for building the physical database.
The information in the data model can be used for defining the relationships between
tables, primary and foreign keys, and stored procedures.
A data model helps the business to communicate within and across organizations.
To develop a data model, one should know the characteristics of the physically stored data.
Even a small change made to the structure requires modification of the entire application.
DATA TYPES
A data type refers to the format of data storage that can hold a distinct type or range of
values. When computer programs store data in variables, each variable must be designated a
distinct data type. Some common data types are as follows: integers, characters, strings,
floating point numbers and arrays. More specific data types are as follows: varchar (variable
character) formats, Boolean values, dates and timestamps.
A data type is a type of data. Of course, that is a rather circular definition, and also not very
helpful. A better definition is that a data type is a data storage format that can contain a
specific type or range of values.
Database applications use data types. Database fields require a distinct type of data to be
entered. For example, a school record for a student may use a string data type for the student’s
first and last name. The student’s date of birth would be stored in a date format and the student’s
GPA can be stored as a decimal. By ensuring that the data types are consistent across multiple
records, database applications can easily perform calculations, comparisons, searching and
sorting of fields in different records.
Similarly, a company's record for an employee may use a string data type for the employee's
first and last name, a date format for the date of hire, and an integer for the salary.
Float (floating point): a number with a decimal point, e.g., 3.15, 9.06, 00.13
Integer – is a whole number that can have a positive, negative or zero value. It cannot
be a fraction, nor can it have decimal places. It is commonly used in programming,
especially for incrementing values. Addition, subtraction and multiplication of two
integers result in an integer, but division of two integers may result in an integer or a
decimal. The resulting decimal can be rounded off or truncated to produce an integer.
Character – refers to any number, letter, space or symbol that can be entered in a
computer. In a single-byte encoding such as ASCII, each character occupies one byte of space.
String – is used to represent text. It is composed of a set of characters that can have
spaces and numbers. Strings are enclosed in quotation marks to identify the data as
string and not a variable name nor a number.
Floating Point Number – is a number that contains decimals. Numbers that contain
fractions are also considered as floating point numbers.
Array – contains a group of elements which can be of the same data type like an integer
or string. It is used to organise data for easier sorting and searching of related set of
values.
Varchar – as the name implies, is a variable character type, because its memory storage has
variable length. Each character occupies one byte of space, plus 2 bytes for length information.
Note: Use Character for data entries with fixed length, like phone number. Use
Varchar for data entries with variable length, like address.
Boolean – is used for creating true or false statements. To combine Boolean values, the
following operators are used: AND, OR, XOR, and NOT.
Date, Time and Timestamp – these data types are used to work with data containing
dates and times.
Expression    Result    Condition
x OR y        True      If either x or y, or both x and y are True
x OR y        False     If both x and y are False
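The OR rows above, together with the other Boolean operators, can be checked with a short Python sketch. Python spells the operators `and`, `or` and `not`, and `^` acts as XOR on Boolean values. The dictionary names are arbitrary.

```python
from itertools import product

# Build full truth tables for every combination of x and y.
inputs = list(product([True, False], repeat=2))

and_table = {(x, y): x and y for x, y in inputs}
or_table = {(x, y): x or y for x, y in inputs}
xor_table = {(x, y): x ^ y for x, y in inputs}
not_table = {x: not x for x in (True, False)}

for x, y in inputs:
    print(x, y, and_table[(x, y)], or_table[(x, y)], xor_table[(x, y)])
```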
A data type constrains the values that an expression, such as a variable or a function, might
take. This data type defines the operations that can be done on the data, the meaning of the data,
and the way values of that type can be stored.
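A brief Python sketch illustrates how the type decides which operations are defined and what they mean. The values are arbitrary, and other languages and database systems draw these lines somewhat differently.

```python
from datetime import date

total = 7 + 2        # int + int stays an int
quotient = 7 / 2     # true division of two ints yields a float
floored = 7 // 2     # floor division yields an int again

label = "7"          # a string that merely looks like a number
try:
    label + 2        # str + int is not a defined operation
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Subtracting two dates is defined and means "elapsed time".
hired = date(2020, 1, 15)
days_employed = (date(2020, 2, 1) - hired).days

print(total, quotient, floored, mixed_ok, days_employed)
```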
Data types are the building blocks of databases. A field's data type not only influences other
important characteristics of that field, such as field size, but also how the field is used
throughout the database, such as in objects, calculations, expressions, and so forth. Using the
right data type is a key to success.
A file system is a technique of arranging the files in a storage medium like a hard disk,
pen drive, DVD, etc. It helps you to organize the data and allows easy retrieval of files
when they are required. It mostly consists of different types of files like mp3, mp4, txt,
doc, etc. that are grouped into directories.
A file system determines how data is read from and written to the storage
medium. It is installed on the computer along with the operating system, such as
Windows or Linux.
What is DBMS?
A Database Management System (DBMS) is software for storing and retrieving users' data
while applying appropriate security measures. It consists of a group of programs that
manipulate the database. The DBMS accepts a request for data from an application and
instructs the DBMS engine to provide the specific data. In large systems, a DBMS helps users
and other third-party software to store and retrieve data.
KEY DIFFERENCES:
1. A file system is software that manages and organizes the files in a storage medium,
whereas a DBMS is a software application that is used for accessing, creating, and
managing databases.
2. The file system doesn't have a crash recovery mechanism; on the other hand, a DBMS
provides a crash recovery mechanism.
3. Data inconsistency is higher in the file system. On the contrary, data inconsistency is
low in a database management system.
4. A file system does not provide support for complicated transactions, while in a DBMS
it is easy to implement complicated transactions using SQL.
5. A file system does not offer concurrency, whereas a DBMS provides a concurrency
facility.
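The transaction support mentioned in point 4 can be sketched with Python's built-in `sqlite3` module. The `account` table, the balances and the `transfer` helper are hypothetical. The sketch shows one property of transactions: when one step of a multi-step change fails, the steps already done are rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction: either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


failed = transfer(conn, "A", "B", 500)   # would overdraw A, so it is undone
after_failure = dict(conn.execute("SELECT name, balance FROM account"))

succeeded = transfer(conn, "A", "B", 30)
after_success = dict(conn.execute("SELECT name, balance FROM account"))
print(failed, after_failure, succeeded, after_success)
```

With plain files, the program itself would have to detect the half-finished update and undo it.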
Features of DBMS
Authorization services
Avoids inconsistency across file maintenance, which preserves data integrity and data independence.
A DBMS uses various powerful functions to store and retrieve data efficiently.
The DBMS applies integrity constraints to get a high level of protection against
prohibited access to data.
Reduction of redundancy.
Data independence.
Managing directories.
Application of the DBMS system
DBMS systems are also used by universities, and in telecommunications to keep call records,
monthly bills, account balances, etc.
In finance, a DBMS is used for storing information about stock, sales, and purchases of
financial instruments like stocks and bonds.
Each application has its own data file, so the same data may have to be recorded and stored
many times.
Programs in a file processing system are data-dependent, and the problem is compounded by
incompatible file formats.
Time-consuming.
It allows you to maintain the record of the big firm having a large number of items.
The cost of hardware and software for a DBMS is quite high, which increases the budget
of your organization.
Most database management systems are complex, so users need training
to use the DBMS.
The use of the same program at the same time by many users sometimes leads to the loss of
some data.
Data-sets begins to grow large as it provides a more predictable query response time.
The database can fail because of a power failure or because the whole system stops.