Data Governance: Importance and Benefits

The document outlines key concepts of data governance and data integrity, emphasizing their importance in ensuring accurate, consistent, and secure data management within organizations. It describes the roles and responsibilities involved in data governance, including the chief data officer and data stewards, and highlights the benefits of effective data governance, such as improved decision-making and compliance with regulations. Additionally, it explains the types of data integrity, the risks associated with it, and strategies to mitigate those risks.

Uploaded by

mertmk

INFORMATION TECHNOLOGY
A-LEVEL
Written by: Evren Özbir
Designed by: Nurettin Karayeğen

TABLE OF CONTENTS
01 Data Governance 1-8
02 Data Integrity 9-16
03 Data Validation 17-23
04 Data Dictionaries 24-25
05 Basic SQL 26-31
06 Data Redundancy 32-37
07 Data Cleaning 38-40
08 Big Data 41-49
09 Understanding Big Data: Infrastructure 50-73
10 Big Data Storage: Understand the impact of storing Big Data 74-78
11 Manipulating Big Data 79-98
12 Using Big Data 99-125
13 Enabling Technologies 126-153
14 Distributed Systems 154-181
15 Human Computer Interaction 182-193
16 Data Storage 194-203
17 Role of IS in an Organization 204-212
18 Transaction & Processing Systems 213-236
19 Software Development Methodologies 237-243

DATA GOVERNANCE
Why do organizations need to govern data?
• To avoid inconsistent data silos in different departments and business units.
• To agree on common data definitions for a shared understanding of data.
• To improve data quality through efforts to identify and fix errors in data sets.
• To increase analytics accuracy and give decision-makers reliable information.
• To implement and enforce policies that help prevent data errors and misuse.
• To help ensure compliance with data privacy laws and other regulations.

1
What is data governance?
Data governance is a collection of processes, roles, policies,
standards, and metrics that ensure the effective and efficient
use of information in enabling an organization to achieve its
goals. ... Data governance defines who can take what action,
upon what data, in what situations, using what methods.

Data governance (DG) is the process of managing the
availability, usability, integrity and security of the data in
enterprise systems, based on internal data standards and
policies that also control data usage. Effective data
governance ensures that data is consistent and trustworthy
and doesn't get misused. It's increasingly critical as
organizations face new data privacy regulations and rely
more and more on data analytics to help optimize operations
and drive business decision-making.

Note: data governance is a concept rather than a set
of rules. There is no single, authoritative definition for
the term as different groups and organisations will
have different ideas about what is needed.

“A system of decision rights and accountabilities for information-related
processes, executed according to agreed-upon models which describe who can
take what actions with what information, and when, under what circumstances,
using what methods” – Data Governance Institute (2014)

“The execution and enforcement of authority over the management of data and
data-related assets” - R. Seiner (2014)

2
Important characteristics of data governance definitions

Data Governance Is:
• More about people and behavior than data
• A system that requires and promotes shared agreement
• Formal (i.e. written down)
• Adds value by supporting institutional mission/goals

Data Governance Is Not:
• IT’s responsibility
• Solved by technology
• Equally applied across all data assets

Why data governance matters


Without effective data governance, data inconsistencies in
different systems across an organization might not get resolved. For
example, customer names may be listed differently in sales,
logistics and customer service systems. That could complicate data
integration efforts and create data integrity issues that affect the
accuracy of business intelligence (BI), enterprise reporting and
analytics applications. In addition, data errors might not be
identified and fixed, further affecting BI and analytics accuracy.

3
Data governance goals and benefits
A key goal of data governance is to break down data silos in an organization. A data silo is a repository of fixed data that remains under the control of one department and is isolated from the rest of the organization, much like grain in a farm silo is closed off from outside elements. Data silos can have technical or cultural roots. Such silos commonly build up when individual business units deploy separate transaction processing systems without centralized coordination or an enterprise data architecture. Enterprise data architecture (EDA) refers to a collection of master blueprints designed to align IT programs and information assets with business strategy; it is used to guide integration, quality enhancement and successful data delivery. Data governance aims to harmonize the data in those systems through a collaborative process, with stakeholders from the various business units participating.

Another data governance goal is to ensure that data is used properly, both to avoid introducing data errors into systems and to block potential misuse of personal data about customers and other sensitive information. That can be accomplished by creating uniform policies on the use of data, along with procedures to monitor usage and enforce the policies on an ongoing basis.

4
Benefits of Data Governance Tools
Data changes frequently as it moves through an enterprise. Information can be replicated and fragmented, which slows a business's responsiveness and decision-making. Data governance strategies help protect the integrity of data assets and optimize master data management and product information management. This benefits enterprise organizations in numerous ways:
• Better Decision Making: Accessible, well-governed data is readily available to provide accurate insights for big decisions within the organization.
• Operational Efficiency: Data provides valuable information on all parts of an organization, from inventory levels to customer satisfaction. Applying insights gained from this data can improve operational efficiency in terms of production speed, product quality, and resource usage.
• Improved Data Understanding and Lineage: Data governance provides a comprehensive view of where data is stored and how it is being used, including permissions that manage how and when data can be accessed.
• Data Quality: While data quality and data governance are two distinct capabilities, they work together to improve the end result. Data cleansing may still be required to ensure data quality, but data governance that enforces internal data consistency helps keep data accurate and useful for data-driven decision-making.

5
• Compliance and Regulation: Every business has regulations unique to its industry, and significant fines are often imposed in cases of non-compliance. Data governance helps ensure those regulations are met while also safely managing security and privacy.
• Increased Revenue: With regulations in check, data sorted and analyzed, and customer loyalty strengthened, data governance improves the bottom line by way of an efficient and proactive system.

Organizations that use a data governance tool to automate this process typically see the returns and results of successful data governance sooner.

Who's responsible for data governance?
In most organizations, various people are involved in the data
governance process. That includes business executives, data
management professionals and IT staffers, as well as end users
who are familiar with relevant data domains in an organization's
systems. These are the key participants and their primary
governance responsibilities.

6
Chief data officer
The chief data officer (CDO), if there is one, often is the senior
executive who oversees a data governance program and has
high-level responsibility for its success or failure. The CDO's role
includes securing approval, funding and staffing for the program,
playing a lead role in setting it up, monitoring its progress and
acting as an advocate for it internally. If an organization doesn't
have a CDO, another C-suite executive usually will serve as an
executive sponsor and handle the same functions.
Data governance manager and team
In some cases, the CDO or an equivalent executive -- a director
of enterprise data management, for example -- may also be the
hands-on data governance program manager. In others,
organizations appoint a data governance manager or lead
specifically to run the program. Either way, the program
manager typically heads a data governance team that works on
the program full time. Sometimes more formally known as the
data governance office, it coordinates the process, leads
meetings and training sessions, tracks metrics, manages internal
communications and carries out other management tasks.

Data governance committee
The governance team usually doesn't make policy or standards decisions, though. That's the responsibility of the data governance committee or council, which is primarily made up of business executives and other data owners. The committee approves the foundational data governance policy and associated policies and rules on things like data access and usage, plus the procedures for implementing them. (A data governance policy is a documented set of guidelines for ensuring that an organization's data and information assets are managed consistently and used properly.) It also resolves disputes, such as disagreements between different business units over data definitions and formats.

7
Data stewards
The responsibilities of data stewards include overseeing data sets to keep them in order. They're also in charge of ensuring that the policies and rules approved by the data governance committee are implemented and that end users comply with them. Specifically, data stewards are responsible for defining and implementing policies and procedures for the day-to-day operational and administrative management of systems and data, including the intake, storage, processing, and transmission of data to internal and external systems. Workers with knowledge of particular data assets and domains are generally appointed to handle the data stewardship role. That's a full-time job in some companies and a part-time position in others; there can also be a mix of IT and business data stewards.

Data architects, data modelers and data quality analysts and engineers are also part of the governance process. In addition, business users and analytics teams must be trained on data governance policies and data standards so they can avoid using data in erroneous or improper ways.

8
DATA INTEGRITY
What is data integrity and why is it important?
Data integrity is the overall accuracy, completeness, and consistency of data. Data integrity also refers to the safety of data in regard to security and regulatory compliance, such as compliance with the General Data Protection Regulation (GDPR). It is maintained by a collection of processes, rules, and standards implemented during the design phase. When the integrity of data is secure, the information stored in a database will remain complete, accurate, and reliable no matter how long it’s stored or how often it’s accessed.

The importance of data integrity in protecting yourself from data loss or a data leak cannot be overstated: in order to keep your data safe from outside forces with malicious intent, you must first ensure that internal users are handling data correctly. By implementing the appropriate data validation and error checking, you can ensure that sensitive data is never miscategorized or stored incorrectly, either of which would expose you to potential risk.

9
Types of data integrity
Maintaining data integrity requires an understanding of the two
types of data integrity: physical integrity and logical integrity.
Both are collections of processes and methods that enforce data
integrity in both hierarchical and relational databases.

Physical integrity:
Physical integrity is the protection of the wholeness and accuracy of data as it’s stored and retrieved. When natural disasters strike, power goes out, or hackers disrupt database functions, physical integrity is compromised. Human error, storage erosion, and a host of other issues can also make it impossible for data processing managers, system programmers, applications programmers, and internal auditors to obtain accurate data.

Threats to physical integrity include:
• Hardware design flaws
• Corrosion and material failure
• Electrical problems
• Extreme high or low temperatures
• Radiation
• G-forces
• Natural disasters such as fire and flood
• Man-made disasters such as war or terrorism
• Extreme high or low pressure

10
Logical integrity:
Logical integrity keeps data unchanged as it’s used in different ways in a relational database. Logical integrity protects data from human error and hackers as well, but in a much different way than physical integrity does. There are four types of logical integrity:
1. Entity integrity relies on the creation of primary keys (the unique values that identify pieces of data) to ensure that data isn’t listed more than once and that no primary key field in a table is null. It’s a feature of relational systems, which store data in tables that can be linked and used in a variety of ways.
2. Referential integrity refers to the series of processes that make sure data is stored and used uniformly. Rules embedded into the database’s structure about how foreign keys are used ensure that only appropriate changes, additions, or deletions of data occur. Rules may include constraints that eliminate the entry of duplicate data, guarantee that data entry is accurate, and/or disallow the entry of data that doesn’t apply.
3. Domain integrity is the collection of processes that ensure the accuracy of each piece of data in a domain. In this context, a domain is a set of acceptable values that a column is allowed to contain. It can include constraints and other measures that limit the format, type, and amount of data entered.
4. User-defined integrity involves the rules and constraints created by the user to fit their particular needs. Sometimes entity, referential, and domain integrity aren’t enough to safeguard data. Often, specific business rules must be taken into account and incorporated into data integrity measures.
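
The four types of logical integrity can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module rather than MySQL so that it runs without a server; the `customers` and `orders` tables, their columns, and the `violates` helper are hypothetical names chosen for illustration, and other database systems may word or enforce these constraints differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- entity integrity: one row per key
    email       TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    -- referential integrity: every order must point at an existing customer
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    -- domain integrity: each column only accepts values from its domain
    quantity    INTEGER NOT NULL CHECK (quantity > 0),
    status      TEXT NOT NULL CHECK (status IN ('new', 'shipped', 'cancelled')),
    ship_date   TEXT,
    -- user-defined integrity: a business rule (shipped orders need a date)
    CHECK (status <> 'shipped' OR ship_date IS NOT NULL)
);
""")
conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO orders VALUES (10, 1, 2, 'new', NULL)")


def violates(sql):
    """Return True if the database rejects the statement as a constraint violation."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


results = {
    "entity": violates("INSERT INTO customers VALUES (1, 'b@example.com')"),
    "referential": violates("INSERT INTO orders VALUES (11, 99, 1, 'new', NULL)"),
    "domain": violates("INSERT INTO orders VALUES (12, 1, 0, 'new', NULL)"),
    "user_defined": violates("INSERT INTO orders VALUES (13, 1, 1, 'shipped', NULL)"),
}
print(results)
```

Each of the four statements breaks exactly one kind of rule, so the database should reject all of them while still accepting well-formed rows.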

11
What data integrity isn’t
With so much talk about data integrity, it’s easy for its true
meaning to be muddled. Often data security and data
quality are incorrectly substituted for data integrity, but each
term has a distinct meaning.
Data integrity is not data security

Data security is the collection of measures taken to keep data


from getting corrupted. It incorporates the use of systems,
processes, and procedures that restrict unauthorized access and
keep data inaccessible to others who may use it in harmful or
unintended ways. Breaches in data security may be small and
easy to contain or large and capable of causing significant
damage.

While data integrity is concerned with keeping information


intact and accurate for the entirety of its existence, the goal of
data security is to protect information from outside attacks. Data
security is but one of the many facets of data integrity. Data
security is not broad enough to include the many processes
necessary for keeping data unchanged over time.

12
Data integrity is not data quality
Does the data in your database meet company-defined standards and the needs of your business? Data quality, the measure of how well suited a data set is to serve its specific purpose, answers this question with an assortment of processes that measure your data’s age, relevance, accuracy, completeness, and reliability.

Much like data security, data quality is only a part of data integrity, but a crucial one. Data integrity encompasses every aspect of data quality and goes further by implementing an assortment of rules and processes that govern how data is entered, stored, transferred, and much more.

Data integrity risks
An assortment of factors can affect the integrity of the data stored in a database. A few examples include the following:
• Human error: When individuals enter information incorrectly, duplicate or delete data, don’t follow the appropriate protocols, or make mistakes during the implementation of procedures meant to safeguard information, data integrity is put in jeopardy.
• Transfer errors: When data can’t successfully transfer from one location in a database to another, a transfer error has occurred. Transfer errors happen when a piece of data is present in the destination table, but not in the source table in a relational database.

13
• Bugs and viruses: Spyware, malware, and viruses are pieces of software that can invade a computer and alter, delete, or steal data.
• Compromised hardware: Sudden computer or server crashes, and problems with how a computer or other device functions, are examples of significant failures and may be indications that your hardware is compromised. Compromised hardware may render data incorrectly or incompletely, limit or eliminate access to data, or make information hard to use.

Risks to data integrity can easily be minimized or eliminated by doing the following:
• Limiting access to data and changing permissions to restrict changes to information by unauthorized parties
• Validating data to make sure it’s correct both when it’s gathered and when it’s used
• Backing up data
• Using logs to keep track of when data is added, modified, or deleted
• Conducting regular internal audits
• Using error detection software
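
As one possible illustration of using logs to track when data is added, modified, or deleted, the sketch below records every change to a table in a separate log table by means of database triggers. It uses Python's built-in `sqlite3` module; the `products` and `audit_log` tables and the trigger names are hypothetical, and a production system would normally also record who made each change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL NOT NULL);

-- One row per change, stamped with the time the change was logged.
CREATE TABLE audit_log (
    log_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    action     TEXT NOT NULL,
    product_id INTEGER NOT NULL,
    old_price  REAL,
    new_price  REAL,
    logged_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER products_insert AFTER INSERT ON products BEGIN
    INSERT INTO audit_log (action, product_id, new_price)
    VALUES ('INSERT', NEW.id, NEW.price);
END;

CREATE TRIGGER products_update AFTER UPDATE ON products BEGIN
    INSERT INTO audit_log (action, product_id, old_price, new_price)
    VALUES ('UPDATE', NEW.id, OLD.price, NEW.price);
END;

CREATE TRIGGER products_delete AFTER DELETE ON products BEGIN
    INSERT INTO audit_log (action, product_id, old_price)
    VALUES ('DELETE', OLD.id, OLD.price);
END;
""")

# Ordinary data changes; the triggers write the log entries automatically.
conn.execute("INSERT INTO products VALUES (1, 9.99)")
conn.execute("UPDATE products SET price = 12.50 WHERE id = 1")
conn.execute("DELETE FROM products WHERE id = 1")

trail = conn.execute(
    "SELECT action, product_id, old_price, new_price FROM audit_log ORDER BY log_id"
).fetchall()
for row in trail:
    print(row)
```

Because the log is written by the database itself, application code cannot forget to record a change, although anyone with sufficient permissions could still alter the log table.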

14
8 Ways to Reduce Data Integrity Risk
• Promote a Culture of Integrity: Promoting a culture of integrity reduces data integrity risk in several ways. It helps to keep employees honest about their own work as well as the efforts of others. Workers in a culture based on data integrity are also more likely to report instances in which others take shortcuts or don’t fulfill their responsibilities regarding the many different aspects of data integrity.
• Implement Quality Control Measures: Quality control measures include specific people and processes put in place to verify that employees are working with data in accordance with security and data governance policies. For instance, data stewards can monitor the data lineage of data sources, and IT personnel can monitor security systems for data integrity.
• Create an Audit Trail: An audit trail is a particularly effective mechanism for minimizing data integrity risk. Audit trails are key for learning what happened to data throughout the different stages of its lifecycle, including where it came from and how it has been transformed or used. Understanding these specifics can help ensure regulatory compliance.
• Develop Process Maps for All Critical Data: Developing process maps for critical data is a crucial aspect of governing how data is used, by whom, and where. By mapping these processes, ideally before data is put to use, organizations have greater control over their data assets. These maps are fundamental for implementing proper measures for security and regulatory compliance, as well.

15
• Eliminate Known Security Vulnerabilities: Eliminating known security vulnerabilities is essential to minimizing data integrity risks related to protecting data assets. This method of reducing risk requires subject matter expertise for determining known security vulnerabilities and implementing measures to eliminate them. It also requires technology like security patches to actually carry out this work.
• Follow a Software Development Lifecycle: Following a software development lifecycle is a fundamental way of governing data in its journey throughout the enterprise. These development lifecycles are important for understanding the various governance protocols necessary to manage data according to regulatory and security requirements. This method is an integral step in understanding where data is and how it’s deployed, and then using this knowledge as a foundation to create sustainable practices.
• Validate Your Computer Systems: Planning, mapping, and dictating what’s supposed to happen with data is useless without regularly testing, validating, and revalidating whether IT systems and employees are functioning according to these procedures. For instance, IT teams may be tasked with mapping source fields to target systems according to the metadata of the mapping constructs used previously. The only way to know for certain whether this process is performed is to test and validate the computer systems involved in these procedures to see if the information supports employee action.
• Implement Error Detection Software: Error detection software and anomaly detection services can help monitor and isolate outliers, identify why errors occurred, and illustrate how to avoid them in the future. This entire process is critical for keeping data integrity risk at a manageable level.
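
One simple form of error detection is a checksum: a short fingerprint computed from the data when it is stored and recomputed later to see whether the data has changed. The sketch below uses SHA-256 from Python's standard `hashlib` module; the sample data and the function names `checksum` and `verify` are made up for illustration, and real error detection software typically does much more than this.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected: str) -> bool:
    """Return True if the data still matches the digest recorded earlier."""
    return checksum(data) == expected


# Digest recorded when the data was first stored.
original = b"order_id,amount\n1001,250.00\n"
stored_digest = checksum(original)

# Later, the data is read back: once intact, once with a single digit altered.
intact = b"order_id,amount\n1001,250.00\n"
corrupted = b"order_id,amount\n1001,950.00\n"

print(verify(intact, stored_digest))
print(verify(corrupted, stored_digest))
```

A checksum only shows that something changed, not what changed or why, so it complements rather than replaces audit trails and validation.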

16
DATA VALIDATION
What is data validation?
Data validation refers to the process of ensuring the accuracy and quality of data. It is implemented by building several checks into a system or report to ensure the logical consistency of input and stored data.

Types of data validation
There are many types of data validation. Most data validation procedures will perform one or more of these checks to ensure that the data is correct before storing it in the database. Common types of data validation checks include:
• Data Type Check: A data type check confirms that the data entered has the correct data type. For example, a field might only accept numeric data. If this is the case, then any data containing other characters such as letters or special symbols should be rejected by the system.
• Code Check: A code check ensures that a field is selected from a valid list of values or follows certain formatting rules. For example, it is easier to verify that a postal code is valid by checking it against a list of valid codes. The same concept can be applied to other items such as country codes and NAICS industry codes.

17
• Range Check: A range check will verify whether input data falls within a predefined range. For example, latitude and longitude are commonly used in geographic data. A latitude value should be between -90 and 90, while a longitude value must be between -180 and 180. Any values out of this range are invalid.
• Format Check: Many data types follow a certain predefined format. A common use case is date columns that are stored in a fixed format like “YYYY-MM-DD” or “DD-MM-YYYY.” A data validation procedure that ensures dates are in the proper format helps maintain consistency across data and through time.
• Consistency Check: A consistency check is a type of logical check that confirms the data’s been entered in a logically consistent way. An example is checking if the delivery date is after the shipping date for a parcel.
• Uniqueness Check: Some data like IDs or e-mail addresses are unique by nature. A database should likely have unique entries on these fields. A uniqueness check ensures that an item is not entered multiple times into a database.
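
The six checks above can be combined into a single validation routine. The following is a minimal sketch rather than a definitive implementation: the record layout, the field names, the short list of country codes, and the function name `validate_shipment` are all assumptions made for illustration.

```python
import re
from datetime import date

# Illustrative subset of valid codes, not a complete list.
VALID_COUNTRY_CODES = {"CA", "GB", "TR", "US"}
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # YYYY-MM-DD


def validate_shipment(record, existing_ids):
    """Return a list of validation error messages (empty if the record is valid)."""
    errors = []

    # Data type check: quantity must be a whole number.
    if type(record.get("quantity")) is not int:
        errors.append("quantity must be an integer")

    # Code check: country must come from the list of valid codes.
    if record.get("country") not in VALID_COUNTRY_CODES:
        errors.append("unknown country code")

    # Range check: latitude and longitude must fall inside their fixed ranges.
    for field, limit in (("latitude", 90), ("longitude", 180)):
        value = record.get(field)
        if not (isinstance(value, (int, float)) and -limit <= value <= limit):
            errors.append(f"{field} out of range")

    # Format check: both dates must be written as YYYY-MM-DD.
    dates = {}
    for field in ("shipped", "delivered"):
        value = record.get(field)
        if isinstance(value, str) and DATE_PATTERN.match(value):
            try:
                dates[field] = date.fromisoformat(value)
            except ValueError:
                errors.append(f"{field} is not a real calendar date")
        else:
            errors.append(f"{field} must use YYYY-MM-DD")

    # Consistency check: delivery must not come before shipping.
    # (Same-day delivery is treated as valid here, which is an assumption.)
    if len(dates) == 2 and dates["delivered"] < dates["shipped"]:
        errors.append("delivery date is before shipping date")

    # Uniqueness check: the ID must not already be in the database.
    if record.get("id") in existing_ids:
        errors.append("duplicate id")

    return errors


good = {"id": 7, "quantity": 3, "country": "CA", "latitude": 44.65,
        "longitude": -63.57, "shipped": "2024-03-01", "delivered": "2024-03-04"}
bad = {"id": 7, "quantity": "three", "country": "ZZ", "latitude": 123.0,
       "longitude": -63.57, "shipped": "2024-03-04", "delivered": "01-03-2024"}

print(validate_shipment(good, existing_ids={1, 2}))
print(validate_shipment(bad, existing_ids={7}))
```

Returning a list of messages instead of stopping at the first failure lets the person entering the data correct every problem in one pass.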

18
Data Validation in Excel
The following example is an introduction to data validation in Excel. The Data Validation button provides the user with different types of data validation checks based on the data type in the cell. It also allows the user to define custom validation checks using Excel formulas. The button can be found in the Data Tools section of the Data tab in the Excel ribbon.

Data Entry Task
The example below illustrates a case of data entry, where the province must be entered for every store location. Since stores are only located in certain provinces, any incorrect entry should be caught.

This is accomplished in Excel using a two-fold data validation. First, the relevant provinces are incorporated into a drop-down menu that allows the user to select from a list of valid provinces.

Second, if the user inputs a wrong province by mistake, such as “NY” instead of “NS,” the system warns the user of the incorrect input.

19
Further, if the user ignores the warning, an
analysis can be conducted using the data
validation feature in Excel that identifies
incorrect inputs.

There are various methods that you can use to check your data in a database.
• Type: The use of field types forms a basic type of validation. If you make a particular field numeric (i.e. a number), for example, then it won't let you enter any letters or other non-numeric characters. Be careful when using the numeric types, however - if you use them for things like phone numbers, for example, you won't be able to enter spaces or any other sorts of formatting.
• Presence: This type of validation might go by different names, depending on your database program - sometimes it's called something like Allow Blank or mandatory, for example. This type of validation forces the user to enter the data in that field. If you had an address book, for example, you might know the person's address and not their phone number, or vice-versa, so it wouldn't make sense to make those fields mandatory. On the other hand, it doesn't make sense to have an address book entry with no name, so you should check for the presence of the name.

20
• Uniqueness: Some database programs allow you to check whether the contents of a particular field are unique. This might be useful to prevent users entering the same information twice. For example, if you were creating a car database, you should make the registration number field unique as no two cars should have the same one.
• Range: If you're using a number field, then you might want to limit the range of inputs. For example, you might want to limit prices in a stock database so that they are all positive, or limit the range of a percentage field so that the values entered are between 0 and 100.
• Format: You might have a field in your database that requires an entry in a particular format. A simple example might be a date, or piece of text of a certain length. More complex examples might include things like postcodes, or National Insurance or driving licence numbers. If you're using Access, you can define your own formats using an input mask, which defines the valid characters.
• Multiple Choice: A good way to validate fields is to use multiple choice responses. These might take the form of a listbox, combo box, or radio button. For example, you could create a field that would only allow the user to select from Yes or No, or Male or Female. This can be an especially useful technique in database applications such as Access, which allow you to dynamically generate the choices. For example, if you created a database system to manage bookings, rather than checking the dates and times after they have been entered to make sure there are no double-bookings, you could use a query and a combo box to show only the available times. That would stop you making double-bookings in the first place, and make any subsequent validation much simpler.
21
• Referential Integrity: Finally, if you're using a relational database, then you can enforce referential integrity to validate inputs. This means you can check entries in certain fields against values in other tables. For example, in the merits database, when a new merit is entered, you could check the names of the students and teachers against the student and staff tables, to prevent either spelling errors, or the entry of merits for students that don't exist.

Importance of Data Validation
Data validation improves the accuracy, detail, and clarity of data, which is needed to remove problems from any project. Decisions based on data that has not been validated through an appropriate process carry risk. The structure and content of a dataset determine the results of any process that uses it; validation techniques cleanse the dataset, remove unnecessary data, and give it an appropriate structure for the best results. Data validation is used in data warehousing and in the ETL (Extract, Transform, Load) process, where it helps an analyst gain insight into the scope of data conflicts. It can be performed on any data, including data in a single application such as MS Excel, or simple data from several sources merged into a single data store. Validating data manually or with scripts is highly time-consuming, but a modern ETL tool can speed up the process. You can integrate, transform, and clean the data as it is moved to your data warehouse. As part of assessing your data, you can determine which errors can be fixed at the source, and which errors an ETL tool can repair while the data is in the pipeline.

22
Types of validation
There are a number of validation types that can be used to check the data that is being entered.

Verification
There are two main methods of verification:
1. Double entry: entering the data twice and comparing the two copies. This effectively doubles the workload, and as most people are paid by the hour, it costs more too.
2. Proofreading data: this method involves someone checking the data entered against the original document. This is also time-consuming and costly.
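
Double entry can be sketched as a comparison of two independently keyed copies of the same records. The function name `double_entry_mismatches` and the sample records below are hypothetical; the sketch assumes both passes contain the same number of records in the same order.

```python
def double_entry_mismatches(first_pass, second_pass):
    """Compare two keyings of the same records and list every field that differs.

    Each mismatch is reported as (record index, field name, first value, second value).
    """
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must contain the same number of records")

    mismatches = []
    for index, (first, second) in enumerate(zip(first_pass, second_pass)):
        # Look at every field that appears in either copy of the record.
        for field in sorted(set(first) | set(second)):
            if first.get(field) != second.get(field):
                mismatches.append((index, field, first.get(field), second.get(field)))
    return mismatches


first = [
    {"name": "Ada Lovelace", "dob": "1815-12-10"},
    {"name": "Alan Turing", "dob": "1912-06-23"},
]
second = [
    {"name": "Ada Lovelace", "dob": "1815-12-10"},
    {"name": "Alan Turing", "dob": "1912-06-32"},  # keying slip in the second pass
]

for mismatch in double_entry_mismatches(first, second):
    print(mismatch)
```

The comparison only shows where the two copies disagree; someone still has to consult the original document to decide which value is correct.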

23
DATA DICTIONARY
What is a data dictionary?
A Data Dictionary is a collection of names, definitions, and attributes about data elements that are being used or captured in a database, information system, or part of a research project. It describes the meanings and purposes of data elements within the context of a project, and provides guidance on interpretation, accepted meanings and representation. A Data Dictionary also provides metadata about data elements. The metadata included in a Data Dictionary can assist in defining the scope and characteristics of data elements, as well as the rules for their usage and application.

Why do we use data dictionaries?
• Assist in avoiding data inconsistencies across a project
• Help define conventions that are to be used across a project
• Provide consistency in the collection and use of data across multiple members of a research team
• Make data easier to analyze
• Enforce the use of Data Standards
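
To make the idea concrete, a data dictionary can be written down in a machine-readable form and then used to check records against it. The sketch below is one possible layout, not a standard one: the element names, the attribute keys (`type`, `required`, `min`, `max`, `allowed`, `description`), and the function `check_record` are all assumptions made for illustration.

```python
# A tiny data dictionary: one entry per data element, with its metadata.
DATA_DICTIONARY = {
    "participant_id": {
        "type": int, "required": True,
        "description": "Unique identifier for a participant",
    },
    "age_years": {
        "type": int, "required": True, "min": 0, "max": 120,
        "description": "Age at enrolment, in whole years",
    },
    "country": {
        "type": str, "required": False, "allowed": {"CA", "GB", "TR"},
        "description": "Two-letter country code",
    },
}


def check_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of ways in which the record departs from the data dictionary."""
    problems = []

    # Elements that nobody has defined are reported rather than silently accepted.
    for name in record:
        if name not in dictionary:
            problems.append(f"{name}: not defined in the data dictionary")

    for name, spec in dictionary.items():
        value = record.get(name)
        if value is None:
            if spec["required"]:
                problems.append(f"{name}: required value is missing")
            continue
        if not isinstance(value, spec["type"]):
            problems.append(f"{name}: expected {spec['type'].__name__}")
            continue
        if "min" in spec and value < spec["min"]:
            problems.append(f"{name}: below minimum {spec['min']}")
        if "max" in spec and value > spec["max"]:
            problems.append(f"{name}: above maximum {spec['max']}")
        if "allowed" in spec and value not in spec["allowed"]:
            problems.append(f"{name}: value not in the allowed list")
    return problems


print(check_record({"participant_id": 1, "age_years": 34, "country": "TR"}))
print(check_record({"participant_id": "1", "age_years": 150, "height": 180}))
```

Keeping the definitions in one place means every member of a team checks data against the same conventions.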

24
What Are Data Standards and Why Should I Use Them?
• Data Standards are rules that govern the way data are collected, recorded, and represented. Standards provide a commonly understood reference for the interpretation and use of data sets.
• By using standards, researchers in the same disciplines will know that the way their data are being collected and described will be the same across different projects. Using Data Standards as part of a well-crafted Data Dictionary can help increase the usability of your research data, and will ensure that data will be recognizable and usable beyond the immediate research team.

BASIC SQL COMMANDS

Starting MySQL
On the course server enter the command
mysql

You should then see the MySQL prompt


mysql>

To end your MySQL session use the quit command


mysql>quit;

Creating the database


CREATE DATABASE <database name>;

CREATE DATABASE username;

On the course server you have only been granted
permission to create a database whose name is your
username.

Using a database
USE <database name>;

USE username;

Deleting a database
DROP DATABASE [IF EXISTS] <database name>;

DROP DATABASE username;

This deletes the database and all tables and
contents. Use with caution.

Basic MySQL Data Types

Create Table
Example
CREATE TABLE Parts
(
PartID INT NOT NULL,
PartName VARCHAR(40) NOT NULL,
CatID INT NOT NULL,
PRIMARY KEY (PartID)
);

Note: If you are using PuTTY you can copy & paste the SQL commands
from the PowerPoint slides into MySQL.

Table Parts
Inserting elements
INSERT INTO Parts
(PartID, PartName, CatID)
VALUES
(1001,'Guy wire assembly',503),
(1002,'Magnet',504);

INSERT INTO Parts
VALUES
(1003,'Regulator',505);

SELECT Examples
SELECT * FROM Parts;

SELECT PartID, PartName FROM Parts;

SELECT PartID, PartName FROM Parts
WHERE
CatID = 504;

Create Books Table


CREATE TABLE Books
(
BookID SMALLINT NOT NULL PRIMARY KEY,
BookTitle VARCHAR(60) NOT NULL,
Copyright YEAR NOT NULL
);
Create Example Tables
Books
Authors
AuthorBook
Insert data into Books
INSERT INTO Books
VALUES (12786, 'Letters to a Young Poet', 1934),
(13331, 'Winesburg, Ohio', 1919),
(14356, 'Hell\'s Angels', 1966),
(15729, 'Black Elk Speaks', 1932),
(16284, 'Nonconformity', 1996),
(17695, 'A Confederacy of Dunces', 1980),
(19264, 'Postcards', 1992),
(19354, 'The Shipping News', 1993);

Create Authors Table
CREATE TABLE Authors
(
AuthID SMALLINT NOT NULL PRIMARY KEY,
AuthFN VARCHAR(20),
AuthMN VARCHAR(20),
AuthLN VARCHAR(20)
);
Insert data into Authors
INSERT INTO Authors
VALUES (1006, 'Hunter', 'S.', 'Thompson'),
(1007, 'Joyce', 'Carol', 'Oates'),
(1008, 'Black', NULL, 'Elk'),
(1009, 'Rainer', 'Maria', 'Rilke'),
(1010, 'John', 'Kennedy', 'Toole'),
(1011, 'John', 'G.', 'Neihardt'),
(1012, 'Annie', NULL, 'Proulx'),
(1013, 'Alan', NULL, 'Watts'),
(1014, 'Nelson', NULL, 'Algren');
Create AuthorBook Table
CREATE TABLE AuthorBook
(
AuthID SMALLINT NOT NULL,
BookID SMALLINT NOT NULL,
PRIMARY KEY (AuthID, BookID),
FOREIGN KEY (AuthID) REFERENCES Authors (AuthID),
FOREIGN KEY (BookID) REFERENCES Books (BookID)
);
Insert Data into AuthorBook
INSERT INTO AuthorBook
VALUES (1006, 14356), (1008, 15729),
(1009, 12786), (1010, 17695),
(1011, 15729), (1012, 19264),
(1012, 19354), (1014, 16284);
Basic Join
SELECT BookTitle, Copyright, Authors.AuthID
FROM Books, AuthorBook, Authors
WHERE
Books.BookID=AuthorBook.BookID
AND
AuthorBook.AuthID=Authors.AuthID
ORDER BY Books.BookTitle;

SELECT BookTitle, Copyright, Authors.AuthID
FROM Books, AuthorBook, Authors
ORDER BY BookTitle;

What happens when we leave off the WHERE clause?

SELECT BookTitle, Copyright, AuthID
FROM Books AS b, AuthorBook AS ab
WHERE b.BookID=ab.BookID
ORDER BY BookTitle;
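
To see what the WHERE clause contributes, the sketch below rebuilds a cut-down version of the three tables and compares a joined query with the same query without join conditions. It uses Python's built-in sqlite3 module rather than MySQL, so the column types are simplified, and selecting the author's last name is a choice made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Books (BookID INTEGER PRIMARY KEY, BookTitle TEXT NOT NULL, Copyright INTEGER NOT NULL)")
cur.execute("CREATE TABLE Authors (AuthID INTEGER PRIMARY KEY, AuthFN TEXT, AuthMN TEXT, AuthLN TEXT)")
cur.execute("CREATE TABLE AuthorBook (AuthID INTEGER NOT NULL, BookID INTEGER NOT NULL, PRIMARY KEY (AuthID, BookID))")

# A subset of the sample rows, loaded with parameters so the apostrophe needs no escaping.
books = [(12786, "Letters to a Young Poet", 1934), (14356, "Hell's Angels", 1966), (15729, "Black Elk Speaks", 1932)]
authors = [(1006, "Hunter", "S.", "Thompson"), (1008, "Black", None, "Elk"),
           (1009, "Rainer", "Maria", "Rilke"), (1011, "John", "G.", "Neihardt")]
links = [(1006, 14356), (1008, 15729), (1009, 12786), (1011, 15729)]
cur.executemany("INSERT INTO Books VALUES (?, ?, ?)", books)
cur.executemany("INSERT INTO Authors VALUES (?, ?, ?, ?)", authors)
cur.executemany("INSERT INTO AuthorBook VALUES (?, ?)", links)

# With join conditions: one row per author-book link.
joined = cur.execute(
    "SELECT BookTitle, Copyright, Authors.AuthLN "
    "FROM Books, AuthorBook, Authors "
    "WHERE Books.BookID = AuthorBook.BookID AND AuthorBook.AuthID = Authors.AuthID "
    "ORDER BY BookTitle, Authors.AuthLN"
).fetchall()

# Without a WHERE clause: every combination of rows (a Cartesian product).
cross = cur.execute("SELECT COUNT(*) FROM Books, AuthorBook, Authors").fetchone()[0]

print(len(joined))  # 4
print(cross)        # 48, i.e. 3 * 4 * 4
```

Leaving off the WHERE clause does not cause an error; it simply pairs every book with every link and every author, which is rarely what is wanted.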

DATA REDUNDANCY

What is Data Redundancy?


Data redundancy occurs when the same piece of data is stored
in two or more separate places and is a common occurrence in
many businesses. As more companies are moving away from
siloed data to using a central repository to store information,
they are finding that their database is filled with inconsistent
duplicates of the same entry. Although it can be challenging to
reconcile — or even benefit from — duplicate data entries,
understanding how to reduce and track data redundancy
efficiently can help mitigate long-term inconsistency issues for
your business.

How does data redundancy occur?


Sometimes data redundancy happens by accident while other times
it is intentional. Accidental data redundancy can be the result of a
complex process or inefficient coding while intentional data
redundancy can be used to protect data and ensure consistency —
simply by leveraging the multiple occurrences of data for disaster
recovery and quality checks.

If data redundancy is intentional, it’s important to have a central
field or space for the data. This allows you to easily update all
records of redundant data when necessary. When data redundancy
isn’t purposeful, it can lead to a variety of issues which we’ll discuss
below.
Understanding database versus
file-based data redundancy
Data redundancy can be found in a database, which is an
organized collection of structured data that’s stored by a
computer system or the cloud. A retailer may have a database to
track the products they stock. If the same product gets entered
twice by mistake, data redundancy takes place.

The same retailer may keep customer files in a file storage
system. If a customer purchases from the company more than
once, their name may be entered multiple times. Duplicate
entries of the customer name are considered redundant data.

Regardless of whether data redundancy occurs in a database or
in a file storage system, it can be problematic. Fortunately, data
replication can help keep redundant data under control: it copies
the same data to multiple locations in a managed way and keeps
the copies in step. With data replication, companies can ensure
consistency and receive the information they need at any time.

Top 4 advantages of data redundancy
Although data redundancy sounds like a negative event, there
are many organizations that can benefit from this process when
it’s intentionally built into daily operations.

1. Alternative data backup method


Backing up data involves creating compressed and encrypted
versions of data and storing it in a computer system or the cloud.
Data redundancy offers an extra layer of protection and
reinforces the backup by replicating data to an additional
system. It’s often an advantage when companies incorporate
data redundancy into their disaster recovery plans.

2. Better data security


Data security relates to protecting data, in a database or a file
storage system, from unwanted activities such as cyberattacks or
data breaches. Having the same data stored in two or more
separate places can protect an organization in the event of a
cyberattack or breach — an event which can result in lost time
and money, as well as a damaged reputation.

3. Faster data access and updates


When data is redundant, employees enjoy fast access and quick
updates because the necessary information is available on
multiple systems. This is particularly important for customer
service-based organizations whose customers expect
promptness and efficiency.

4. Improved data reliability


Data that is reliable is complete and accurate. Organizations
can use data redundancy to double check data and confirm it’s
correct and completed in full — a necessity when interacting
with customers, vendors, internal staff, and others.
Watch out for data redundancy
disadvantages
Although there are noteworthy advantages of intentional data
redundancy, there are also several significant drawbacks when
organizations are unaware of its presence.

Possible data inconsistency


Data redundancy occurs when the same piece of data exists in
multiple places, whereas data inconsistency is when the same data
exists in different formats in multiple tables. Unfortunately, data
redundancy can cause data inconsistency, which can provide a
company with unreliable and/or meaningless information.

Increase in data corruption


Data corruption is when data becomes damaged as a result of
errors in writing, reading, storage, or processing. When the same
data fields are repeated in a database or file storage system, there
are more copies that can become damaged, so the risk of data
corruption rises. If a file gets corrupted, for example, and an
employee tries to open it, they may get an error message and not
be able to complete their task.

Increase in database size


Data redundancy may increase the size and complexity of a
database — making it more of a challenge to maintain. A larger
database can also lead to longer load times and a great deal of
headaches and frustrations for employees as they’ll need to spend
more time completing daily tasks.

Increase in cost
When more data is created due to data redundancy, storage costs
suddenly increase. This can be a serious issue for organizations who
are trying to keep costs low in order to increase profits and meet
their goals. In addition, implementing a database system can
become more expensive.

How to reduce data redundancy
Fortunately, it is possible to reduce unintentional cases of data
redundancy that often lead to operational and financial
problems.

Master data
Master data is a single source of common business data that is
shared across several applications or systems. Although master
data does not reduce the occurrences of data redundancy, it
allows companies to work around and accept a certain level of
data redundancy. This is because the use of master data ensures
that in the event a data piece changes, an organization only
needs to update one piece of data. In this case, redundant data
is consistently updated and provides the same information.

Database normalization
Database normalization is the process of efficiently organizing data in a
database so that redundant data is eliminated. This process can ensure
that all of a company's data looks and reads similarly across all records.
By implementing data normalization, an organization standardizes data
fields such as customer names, addresses, and phone numbers.

Normalizing data involves organizing the columns and tables of a
database to make sure their dependencies are enforced correctly. The
“normal form” refers to the set of rules for normalizing data, and a
database is known as “normalized” if it’s free of delete, update, and
insert anomalies.

When it comes to normalizing data, each company has its own unique
set of criteria. Therefore, what one organization believes to be “normal”
may not be “normal” for another organization. For instance, one
company may want to normalize the state or province field to a
two-letter code, while another may prefer the full name. Regardless,
database normalization can be the key to reducing data redundancy
across any company.
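
The update anomaly mentioned above can be shown with a small experiment. The sketch below uses Python's built-in sqlite3 module; the table names, the customer, and the phone numbers are invented for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Unnormalized: the customer's phone number is repeated on every order row.
cur.execute("CREATE TABLE OrdersFlat (OrderID INTEGER PRIMARY KEY, CustomerName TEXT, Phone TEXT)")
cur.executemany("INSERT INTO OrdersFlat VALUES (?, ?, ?)",
                [(1, "Deniz", "555-0101"), (2, "Deniz", "555-0101")])
# Updating only one copy creates an update anomaly.
cur.execute("UPDATE OrdersFlat SET Phone = '555-0199' WHERE OrderID = 1")
flat_phones = cur.execute(
    "SELECT COUNT(DISTINCT Phone) FROM OrdersFlat WHERE CustomerName = 'Deniz'").fetchone()[0]

# Normalized: the phone number is stored once and orders refer to it by key.
cur.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT, Phone TEXT)")
cur.execute("CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, "
            "CustomerID INTEGER REFERENCES Customers (CustomerID))")
cur.execute("INSERT INTO Customers VALUES (1, 'Deniz', '555-0101')")
cur.executemany("INSERT INTO Orders VALUES (?, ?)", [(1, 1), (2, 1)])
cur.execute("UPDATE Customers SET Phone = '555-0199' WHERE CustomerID = 1")
normalized_phones = cur.execute(
    "SELECT COUNT(DISTINCT Phone) FROM Orders JOIN Customers USING (CustomerID)").fetchone()[0]

print(flat_phones)        # 2: the two copies now disagree
print(normalized_phones)  # 1: every order sees the same number
```

In the flat table the same customer ends up with two different phone numbers, which is exactly the kind of inconsistency normalization is meant to prevent.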

Efficient data redundancy use cases
Efficient data redundancy is possible. Many organizations like home
improvement companies, real estate agencies, and companies
focused on customer interactions have customer relationship
management (CRM) systems.

When a CRM system is integrated with other business software, such as
accounting software that combines customer and financial data,
redundant manual data entry is eliminated, leading to more insightful
reports and improved customer service.

Database management systems are also used in a variety of
organizations. They receive direction from a database administrator
(DBA) and allow the system to load, retrieve, or change existing data
from the systems. Database management systems adhere to the rules
of normalization, which reduces data redundancy.

Hospitals, nursing homes, and other healthcare entities use database
management systems to generate reports that provide useful
information for physicians and other employees. When data
redundancy is efficient and does not lead to data inconsistency,
these systems can alert healthcare providers to rises in claim denial
rates, how successful a certain medication is, and other important
pieces of information.
DATA CLEANING

The power of clean data


A decision is only as good as the data that informs it. And with
massive amounts of data streaming in from multiple sources, a
data cleansing tool is more important than ever for ensuring
accuracy of information, process efficiency, and driving your
company’s competitive edge. Some of the primary benefits of
data scrubbing include:

Improved Decision Making: Data quality is critical because it
directly affects your company’s ability to make sound decisions
and calculate effective strategies. No company can afford
to waste time and energy correcting errors brought about by
dirty data.

Consider a business that relies on customer-generated data to
develop each new generation of its online and mobile ordering
systems, such as AnyWare from Domino’s Pizza. Without a data
cleansing program, changes and revisions to the app may not be
based on precise or accurate information. As a result, the new
version of the app may miss its target and fail to meet customer
needs or expectations.

Boosted Efficiency: Utilizing clean data isn’t just beneficial for
your company’s external needs; it can also improve in-house
efficiency and productivity. When information is cleaned
properly, it reveals valuable insights into internal needs and
processes. For example, a company may use data to track
employee productivity or job satisfaction in an effort to predict
and reduce turnover. Cleansing data from performance reviews,
employee feedback, and other related HR documents may help
quickly identify employees who are at a higher risk of attrition.

Competitive Edge: The better a company meets its customers’
needs, the faster it will rise above its competitors. A data
cleansing tool helps provide reliable, complete insights so that
you can identify evolving customer needs and stay on top of
emerging trends. Data cleansing can produce faster response
rates, generate quality leads, and improve the customer
experience.

Data cleansing: step-by-step


A data cleansing tool can automate most aspects of a
company’s overall data cleansing program, but a tool is only one
part of an ongoing, long-term solution to data cleaning. Here’s
an overview of the steps you’ll need to take to make sure your
data is clean and usable:

Step 1 — Identify the Critical Data Fields

Companies have access to more data now than ever before, but
not all of it is equally useful. The first step in data cleansing is to
determine which types of data or data fields are critical for a
given project or process.

Step 2 — Collect the Data

After the relevant data fields are identified, the data they
contain is collected, sorted, and organized.

Step 3 — Discard Duplicate Values

After the data has been collected, the process of resolving
inaccuracies begins. Duplicate values are identified and
removed.

Step 4 — Resolve Empty Values

Data cleansing tools search each field for missing values, and
can then fill in those values to create a complete data set and
avoid gaps in information.

Step 5 — Standardize the Cleansing Process

For a data cleansing process to be effective, it should be
standardized so that it can be easily replicated for consistency.
In order to do so, it’s important to determine which data is used
most often, when it will be needed, and who will be responsible
for maintaining the process. Finally, you’ll need to determine how
often you’ll need to scrub your data. Daily? Weekly? Monthly?

Step 6 — Review, Adapt, Repeat

Set time aside each week or month to review the data cleansing
process. What has been working well? Where is there room for
improvement? Are there any obvious glitches or bugs that seem
to be occurring? Include members of different teams who are
affected by data cleansing in the conversation for a well-
rounded account of your company’s process.
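
Steps 1 to 4 can be sketched as a single repeatable function, which is also one simple way of standardizing the process as Step 5 suggests. The function name, the sample records, and the choice of "unknown" as the fill-in value are all assumptions made for this illustration; a real cleansing tool would offer far more options.

```python
def clean_records(records, critical_fields, defaults):
    """Keep the critical fields, drop duplicate rows, and fill empty values."""
    cleaned = []
    seen = set()
    for record in records:
        # Steps 1 and 2: collect only the fields identified as critical.
        row = {field: record.get(field) for field in critical_fields}
        # Step 3: discard rows whose critical values have already been seen.
        key = tuple(row[field] for field in critical_fields)
        if key in seen:
            continue
        seen.add(key)
        # Step 4: replace empty values with an agreed default.
        for field in critical_fields:
            if row[field] in (None, ""):
                row[field] = defaults[field]
        cleaned.append(row)
    return cleaned


# Hypothetical raw data with one duplicate and one empty value.
raw = [
    {"email": "a@example.com", "city": "Izmir", "note": "first visit"},
    {"email": "a@example.com", "city": "Izmir", "note": "second visit"},
    {"email": "b@example.com", "city": "", "note": "phone order"},
]
print(clean_records(raw, ["email", "city"], {"email": "unknown", "city": "unknown"}))
```

The three raw rows become two clean ones: the repeated customer is kept once, and the empty city is filled with the agreed default.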

BIG DATA

What is Big Data?


The definition of big data is data that contains greater variety,
arriving in increasing volumes and with more velocity. This is also
known as the three Vs.

Put simply, big data is larger, more complex data sets, especially
from new data sources. These data sets are so voluminous that
traditional data processing software just can’t manage them. But
these massive volumes of data can be used to address business
problems you wouldn’t have been able to tackle before.

Big data benefits:


Big data makes it possible for you to gain more complete
answers because you have more information. More complete
answers mean more confidence in the data—which means a
completely different approach to tackling problems.

Five Vs in Big Data
1. Volume
The amount of data matters. With big data, you’ll have to process
high volumes of low-density, unstructured data. This can be data of
unknown value, such as Twitter data feeds, clickstreams on a web
page or a mobile app, or sensor-enabled equipment. For some
organizations, this might be tens of terabytes of data. For others, it
may be hundreds of petabytes.

2. Velocity
Velocity is the fast rate at which data is received and (perhaps)
acted on. Normally, the highest velocity of data streams directly
into memory versus being written to disk. Some internet-enabled
smart products operate in real time or near real time and will
require real-time evaluation and action.

3. Variety
Variety refers to the many types of data that are available.
Traditional data types were structured and fit neatly in a relational
database. With the rise of big data, data comes in new
unstructured data types. Unstructured and semistructured data
types, such as text, audio, and video, require additional
preprocessing to derive meaning and support metadata.

4. Veracity
The fourth V is veracity, which in this context is equivalent to
quality. We have all the data, but could we be missing something?
Are the data “clean” and accurate? Do they really have something
to offer?

5. Value
Finally, the V for value sits at the top of the big data pyramid. This
refers to the ability to transform a tsunami of data into business value.

The history of big data
Although the concept of big data itself is relatively new, the
origins of large data sets go back to the 1960s and ‘70s when
the world of data was just getting started with the first data
centers and the development of the relational database.

Around 2005, people began to realize just how much data users
generated through Facebook, YouTube, and other online
services. Hadoop (an open-source framework created
specifically to store and analyze big data sets) was developed
that same year. NoSQL also began to gain popularity during this
time.

The development of open-source frameworks, such as Hadoop
(and more recently, Spark) was essential for the growth of big
data because they make big data easier to work with and
cheaper to store. In the years since then, the volume of big data
has skyrocketed. Users are still generating huge amounts of data
—but it’s not just humans who are doing it.

With the advent of the Internet of Things (IoT), more objects and
devices are connected to the internet, gathering data on
customer usage patterns and product performance. The
emergence of machine learning has produced still more data.

While big data has come far, its usefulness is only just beginning.
Cloud computing has expanded big data possibilities even
further. The cloud offers truly elastic scalability, where
developers can simply spin up ad hoc clusters to test a subset of
data. And graph databases are becoming increasingly
important as well, with their ability to display massive amounts of
data in a way that makes analytics fast and comprehensive.

Big data use cases
Big data can help you address a range of business activities,
from customer experience to analytics. Here are just a few.

Product development
Companies like Netflix and Procter & Gamble use big data to
anticipate customer demand. They build predictive models for
new products and services by classifying key attributes of past
and current products or services and modeling the relationship
between those attributes and the commercial success of the
offerings. In addition, P&G uses data and analytics from focus
groups, social media, test markets, and early store rollouts to
plan, produce, and launch new products.

Predictive maintenance
Factors that can predict mechanical failures may be deeply
buried in structured data, such as the year, make, and model of
equipment, as well as in unstructured data that covers millions of
log entries, sensor data, error messages, and engine
temperature. By analyzing these indications of potential issues
before the problems happen, organizations can deploy
maintenance more cost effectively and maximize parts and
equipment uptime.

Customer experience
The race for customers is on. A clearer view of customer
experience is more possible now than ever before. Big data
enables you to gather data from social media, web visits, call
logs, and other sources to improve the interaction experience
and maximize the value delivered. Start delivering personalized
offers, reduce customer churn, and handle issues proactively.

Fraud and compliance
When it comes to security, it’s not just a few rogue hackers—
you’re up against entire expert teams. Security landscapes and
compliance requirements are constantly evolving. Big data helps
you identify patterns in data that indicate fraud and aggregate
large volumes of information to make regulatory reporting much
faster.

Machine learning
Machine learning is a hot topic right now. And data—specifically
big data—is one of the reasons why. We are now able to teach
machines instead of program them. The availability of big data
to train machine learning models makes that possible.

Operational efficiency
Operational efficiency may not always make the news, but it’s an
area in which big data is having the most impact. With big data,
you can analyze and assess production, customer feedback and
returns, and other factors to reduce outages and anticipate
future demands. Big data can also be used to improve decision-
making in line with current market demand.

Drive innovation
Big data can help you innovate by studying interdependencies
among humans, institutions, entities, and processes and then
determining new ways to use those insights. Use data insights to
improve decisions about financial and planning considerations.
Examine trends and what customers want to deliver new
products and services. Implement dynamic pricing. There are
endless possibilities.

How big data works
Big data gives you new insights that open up new opportunities
and business models. Getting started involves three key actions:

1. Integrate
Big data brings together data from many disparate sources and
applications. Traditional data integration mechanisms, such as
extract, transform, and load (ETL) generally aren’t up to the task.
It requires new strategies and technologies to analyze big data
sets at terabyte, or even petabyte, scale.

During integration, you need to bring in the data, process it, and
make sure it’s formatted and available in a form that your
business analysts can get started with.
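
At classroom scale, the three parts of integration (bring the data in, process it, and leave it in a form analysts can use) look roughly like the sketch below. The two sample sources and the function names are made up for this illustration, and a real big data pipeline would use distributed tools rather than a single script.

```python
import csv
import io
import json

# Two hypothetical sources holding the same kind of data in different formats.
CSV_SOURCE = "customer,amount\nAda,10.50\nBora,4.00\n"
JSON_SOURCE = '[{"customer": "ada", "amount": "2.25"}]'


def extract():
    """Bring in rows from both sources."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows.extend(json.loads(JSON_SOURCE))
    return rows


def transform(rows):
    """Put names and amounts into one consistent format."""
    return [{"customer": r["customer"].strip().title(), "amount": float(r["amount"])}
            for r in rows]


def load(rows):
    """Total the amounts per customer, standing in for a write to a warehouse table."""
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals


print(load(transform(extract())))  # {'Ada': 12.75, 'Bora': 4.0}
```

Note that "ada" from the JSON source is only merged with "Ada" from the CSV source because the transform step standardizes the name first.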

2. Manage
Big data requires storage. Your storage solution can be in the
cloud, on premises, or both. You can store your data in any form
you want and bring your desired processing requirements and
necessary process engines to those data sets on an on-demand
basis. Many people choose their storage solution according to
where their data is currently residing. The cloud is gradually
gaining popularity because it supports your current compute
requirements and enables you to spin up resources as needed.

3. Analyze
Your investment in big data pays off when you analyze and act
on your data. Get new clarity with a visual analysis of your varied
data sets. Explore the data further to make new discoveries.
Share your findings with others. Build data models with machine
learning and artificial intelligence. Put your data to work.

Big data best practices
To help you on your big data journey, we’ve put together some
key best practices for you to keep in mind. Here are our
guidelines for building a successful big data foundation.

1. Align big data with specific business goals


More extensive data sets enable you to make new discoveries.
To that end, it is important to base new investments in skills,
organization, or infrastructure with a strong business-driven
context to guarantee ongoing project investments and funding.
To determine if you are on the right track, ask how big data
supports and enables your top business and IT priorities.
Examples include understanding how to filter web logs to
understand ecommerce behavior, deriving sentiment from social
media and customer support interactions, and understanding
statistical correlation methods and their relevance for customer,
product, manufacturing, and engineering data.

2. Ease skills shortage with standards and governance


One of the biggest obstacles to benefiting from your investment
in big data is a skills shortage. You can mitigate this risk by
ensuring that big data technologies, considerations, and
decisions are added to your IT governance program.
Standardizing your approach will allow you to manage costs and
leverage resources. Organizations implementing big data
solutions and strategies should assess their skill requirements
early and often and should proactively identify any potential skill
gaps. These can be addressed by training/cross-training existing
resources, hiring new resources, and leveraging consulting firms.

3. Optimize knowledge transfer with a center of excellence
Use a center of excellence approach to share knowledge,
control oversight, and manage project communications. Whether
big data is a new or expanding investment, the soft and hard
costs can be shared across the enterprise. Leveraging this
approach can help increase big data capabilities and overall
information architecture maturity in a more structured and
systematic way.

4. Top payoff is aligning unstructured with structured data


It is certainly valuable to analyze big data on its own. But you
can bring even greater business insights by connecting and
integrating low density big data with the structured data you are
already using today. Whether you are capturing customer,
product, equipment, or environmental big data, the goal is to
add more relevant data points to your core master and
analytical summaries, leading to better conclusions. For example,
there is a difference in distinguishing all customer sentiment from
that of only your best customers. This is why many see big data
as an integral extension of their existing business intelligence
capabilities, data warehousing platform, and information
architecture. Keep in mind that the big data analytical processes
and models can be both human- and machine-based. Big data
analytical capabilities include statistics, spatial analysis,
semantics, interactive discovery, and visualization. Using
analytical models, you can correlate different types and sources
of data to make associations and meaningful discoveries.

5. Plan your discovery lab for performance
Discovering meaning in your data is not always straightforward.
Sometimes we don’t even know what we’re looking for. That’s
expected. Management and IT need to support this “lack of
direction” or “lack of clear requirement.” At the same time, it’s
important for analysts and data scientists to work closely with
the business to understand key business knowledge gaps and
requirements. To accommodate the interactive exploration of
data and the experimentation of statistical algorithms, you need
high-performance work areas. Be sure that sandbox
environments have the support they need—and are properly
governed.

6. Align with the cloud operating model


Big data processes and users require access to a broad array of
resources for both iterative experimentation and running
production jobs. A big data solution includes all data realms
including transactions, master data, reference data, and
summarized data. Analytical sandboxes should be created on
demand. Resource management is critical to ensure control of
the entire data flow including pre- and post-processing,
integration, in-database summarization, and analytical
modeling. A well-planned private and public cloud provisioning
and security strategy plays an integral role in supporting these
changing requirements.

UNDERSTANDING BIG DATA:
INFRASTRUCTURE
Basics of Big Data Infrastructure
Big data is all about high velocity, large volumes, and wide data
variety, so the physical infrastructure will literally “make or break”
the implementation. Most big data implementations need to be
highly available, so the networks, servers, and physical storage
must be resilient and redundant.

Resiliency and redundancy are interrelated. An infrastructure, or
a system, is resilient to failure or changes when sufficient
redundant resources are in place ready to jump into action.
Resiliency helps to eliminate single points of failure in your
infrastructure. For example, if only one network connection exists
between your business and the Internet, you have no network
redundancy, and the infrastructure is not resilient with respect to
a network outage.
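
The effect of adding a second network connection can be estimated with a short calculation. The sketch below assumes that each link fails independently and that each is available 99% of the time; both are assumptions chosen for illustration, not figures from a real network.

```python
def combined_availability(link_availability, link_count):
    """Probability that at least one of several independent links is up."""
    # All links must be down at the same time for the connection to be lost.
    return 1 - (1 - link_availability) ** link_count


# Assumed figure: each link is up 99% of the time.
print(round(combined_availability(0.99, 1), 6))  # 0.99
print(round(combined_availability(0.99, 2), 6))  # 0.9999
```

Under these assumptions a single link is expected to be down about 1% of the time, while two redundant links are both down only about 0.01% of the time. In practice failures are often not fully independent, so the real gain is usually smaller.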

In large data centers with business continuity requirements, most
of the redundancy is in place and can be leveraged to create a
big data environment. In new implementations, the designers
have the responsibility to map the deployment to the needs of
the business based on costs and performance.

Infrastructure is the cornerstone
of Big Data architecture
We’ll be closely examining infrastructural approaches: what they
are, how they work, and what each approach is best used for.

1. HADOOP

To recap, Hadoop is essentially an open-source framework for
processing, storing and analysing data. The fundamental principle
behind Hadoop is that, rather than tackling one monolithic block of
data all in one go, it’s more efficient to break up & distribute data
into many parts, allowing processing and analysing of different parts
concurrently.

When hearing Hadoop discussed, it’s easy to think of Hadoop as one
vast entity; this is a myth. In reality, Hadoop is a whole ecosystem of
different products, largely presided over by the Apache Software
Foundation. Some key components include:
HDFS- The default storage layer

MapReduce- Executes a wide range of analytic functions by
analysing datasets in parallel before ‘reducing’ the results. The “Map”
job distributes a query to different nodes, and the “Reduce” gathers
the results and resolves them into a single value.

YARN- Responsible for cluster management and scheduling user
applications

Spark- Used on top of HDFS, and promises speeds up to 100 times
faster than the two-step MapReduce function in certain applications.
Allows data to be loaded in-memory and queried repeatedly, making it
particularly apt for machine learning algorithms
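
The map and reduce idea can be imitated on a single machine with a word count, the usual teaching example. The sketch below is plain Python and does not use the Hadoop API; the function names and the sample text are assumptions for illustration. On a real cluster each chunk would be mapped on a different node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_phase(pairs):
    """Gather the pairs and sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Each string stands in for the part of the data held by one node.
chunks = ["big data big insight", "data moves fast"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(mapped))
# {'big': 2, 'data': 2, 'insight': 1, 'moves': 1, 'fast': 1}
```

Because each chunk is mapped without looking at any other chunk, the map work can be spread across as many machines as there are chunks; only the reduce step needs to see the combined results.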

The main advantages of Hadoop are its cost- and time-effectiveness.
Cost, because as it’s open source, it’s free and available for anyone
to use, and can run off cheap commodity hardware. Time, because it
processes multiple ‘parts’ of the data set concurrently, making it a
comparatively fast tool for retrospective, in-depth analysis. However,
open source has its drawbacks. The Apache Software Foundation are
constantly updating and developing the Hadoop ecosystem; but if
you hit a snag with open-source technology, there’s no one go-to
source for troubleshooting.

This is where Hadoop-on-Premium packages enter the picture.
Hadoop-on-Premium services such as Cloudera, Hortonworks and
Splice offer the Hadoop framework with greater security and
support, with added system & data management tools and enterprise
capabilities.

2. NOSQL

NoSQL, which stands for Not Only SQL, is a term used to cover a
range of different database technologies. As mentioned in the
previous article, unlike their relational predecessors, NoSQL
databases are adept at processing dynamic, semi-structured data
with low latency, making them better tailored to a Big Data
environment.

The different strengths and uses of Hadoop and NoSQL are often
described as “operational” and “analytical”. NoSQL is better suited
for “operational” tasks; interactive workloads based on selective
criteria where data can be processed in near real-time. Hadoop is
better suited to high-throughput, in-depth analysis in retrospect,
where the majority or all of the data is harnessed. Since they serve
different purposes, Hadoop and NoSQL products are sometimes
marketed concurrently. Some NoSQL databases, such as HBase, were
primarily designed to work on top of Hadoop.

Some big names in the NoSQL field include Apache Cassandra,
MongoDB, and Oracle NoSQL. Many of the most widely used NoSQL
technologies are open source, meaning security and troubleshooting
may be an issue. They also place less focus on atomicity and
consistency than on performance and scalability. Premium packages
of NoSQL databases (such as Datastax for Cassandra) work to
address these issues.

3. MASSIVELY PARALLEL PROCESSING (MPP)

As the name might suggest, MPP technologies process massive
amounts of data in parallel. Hundreds (or potentially even thousands)
of processors, each with their own operating system and memory,
work on different parts of the same programme.
As mentioned in the previous article, MPP usually runs on expensive
data warehouse appliances, whereas Hadoop is most often run on
cheap commodity hardware (allowing for inexpensive horizontal
scale out). MPP uses SQL, and Hadoop uses Java as default (although
the Apache Foundation developed Hive, a language used in Hadoop
similar to SQL, to make using Hadoop slightly easier and less
specialist). As with all technologies in this article, MPP has crossovers
with the other technologies; Teradata, an MPP technology, has an
ongoing partnership with Hortonworks (a Hadoop-on-Premium
service).
Many of the major players in the MPP market have been acquired by
technology vendor behemoths; Netezza, for instance, is owned by
IBM, Vertica is owned by HP and Greenplum is owned by EMC.

CLOUD
Cloud computing refers to a broad set of products that are sold
as a service and delivered over a network. In other infrastructural
approaches, when setting up your big data architecture you need to
buy hardware and software for each person involved with the
processing and analysing of your data. In cloud computing, your
analysts only require access to one application: a web-based
service where all of the necessary resources and programmes are
hosted. Up-front costs are minimal, as you typically only pay for
what you use and scale out from there; Amazon Redshift, for
instance, allows you to get started for as little as 25 cents an
hour. As well as cost, cloud computing also has an advantage in
terms of delivering faster insights.

Of course, having your data hosted by a third party can raise
questions about security; many choose to host their confidential
information in-house, and use the cloud for less private data.

A lot of big names in IT offer cloud computing solutions. Google
has a whole host of cloud computing products, including
BigQuery, specifically designed for the processing and
management of Big Data. Amazon Web Services also has a wide
range, including EMR for Hadoop, RDS for MySQL and DynamoDB
for NoSQL. There are also vendors such as Infochimps and Mortar
specifically dedicated to offering cloud computing solutions.

As you can see, these different technologies are by no means
direct competitors; each has its own particular uses and
capabilities, and complex architectures will make use of
combinations of all of these approaches, and more. In the next
“Understanding Big Data” article, we will be moving beyond processing
data and into the realm of advanced analytics: programmes
specifically designed to help you harness your data and glean
insights from it.
Data Lake vs Data Warehouse
Data lakes and data warehouses are both widely used for storing
big data, but they are not interchangeable terms. A data lake is a
vast pool of raw data, the purpose of which is not yet defined. A
data warehouse is a repository for structured, filtered data that
has already been processed for a specific purpose.
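
The contrast between raw storage and processed storage can be sketched
in Python. This is only an illustration under stated assumptions: the
sample events, the field names (patient, note, temp_c) and the helper
functions are hypothetical, and real lakes and warehouses are storage
systems rather than Python lists.

```python
import json

# Hypothetical raw events as they might land in a data lake:
# stored as-is, with no fixed structure.
raw_events = [
    '{"patient": "A1", "note": "mild fever", "temp_c": 38.1}',
    '{"patient": "B2", "temp_c": "37.0"}',
    '{"patient": "C3", "note": "no reading taken"}',
]


def to_warehouse_row(event):
    """Shape a record for one specific purpose before storing it.

    Returns None when the record does not fit the fixed structure.
    """
    record = json.loads(event)
    if "patient" not in record or "temp_c" not in record:
        return None
    return {"patient": str(record["patient"]),
            "temp_c": float(record["temp_c"])}


def notes_from_lake(lake):
    """Impose structure only at the moment a question is asked."""
    parsed = (json.loads(event) for event in lake)
    return [record["note"] for record in parsed if "note" in record]


# The lake keeps everything; the warehouse keeps only filtered rows.
data_lake = list(raw_events)
warehouse = [row for row in map(to_warehouse_row, raw_events)
             if row is not None]

print(len(data_lake), "raw events kept in the lake")
print(warehouse)
print(notes_from_lake(data_lake))
```

The warehouse rows are immediately usable for the temperature question
they were built for, while the lake still holds the free-text notes
that a later, different question might need.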
The two types of data storage are often confused, but are much
more different than they are alike. In fact, the only real similarity
between them is their high-level purpose of storing data.
The distinction is important because they serve different purposes
and require different sets of eyes to be properly optimized. While
a data lake works for one company, a data warehouse will be a
better fit for another.

Four key differences between a data
lake and a data warehouse
There are several differences between a data lake and a data
warehouse. Data structure, ideal users, processing methods, and
the overall purpose of the data are the key differentiators.

Data lake vs data warehouse:
which is right for me?
Organizations often need both. Data lakes were born out of the
need to harness big data and benefit from the raw, granular
structured and unstructured data for machine learning, but there
is still a need to create data warehouses for analytics use by
business users.

Healthcare: data lakes store
unstructured information
Data warehouses have been used for many years in the
healthcare industry, but they have never been hugely successful.
Because of the unstructured nature of much of the data in
healthcare (physicians’ notes, clinical data, etc.) and the need for
real-time insights, data warehouses are generally not an ideal
model.

Data lakes allow for a combination of structured and
unstructured data, which tends to be a better fit for healthcare
companies.


Education: data lakes offer
flexible solutions
In recent years, the value of big data in education reform has
become enormously apparent. Data about student grades,
attendance, and more can not only help failing students get back
on track, but can actually help predict potential issues before they
occur. Flexible big data solutions have also helped educational
institutions streamline billing, improve fundraising, and more.

Much of this data is vast and very raw, so institutions in the
education sphere often benefit most from the flexibility of data lakes.

Finance: data warehouses appeal to
the masses
In finance, as well as other business settings, a data warehouse is
often the best storage model because it can be structured for
access by the entire company rather than only by a data scientist.

Big data has helped the financial services industry make big
strides, and data warehouses have been a big player in those
strides. The only reason a financial services company may be
swayed away from such a model is that a data lake is more
cost-effective, though not as effective for other purposes.
Transportation: data lakes help make
predictions
Much of the benefit of data lake insight lies in the ability to make
predictions.

In the transportation industry, especially in supply chain management,
the prediction capability that comes from flexible data in a data lake
can have huge benefits, namely cost-cutting benefits realized by
examining data from forms within the transport pipeline.

The importance of choosing a
data lake or data warehouse
The “data lake vs data warehouse” conversation has likely just
begun, but the key differences in structure, process, users, and
overall agility make each model unique. Depending on your
company’s needs, developing the right data lake or data
warehouse will be instrumental in growth.
Cloud data lakes or on-premises?
Data lakes are traditionally implemented on-premises, with
storage on HDFS and processing (YARN) on Hadoop clusters.
Hadoop is scalable, low-cost, and offers good performance with
its inherent advantage of data locality (data and compute reside
together).

However, there are challenges to creating an on-premises
infrastructure:

Space - Bulky servers occupy real estate, which translates to higher
costs.

Setup - Procuring hardware and setting up data centers isn’t
straightforward and can take weeks or months to get going.

Scalability - If there is a need to scale up the storage capacity, it
takes time and effort, due to increased space requirements and
cost approvals from senior execs.

Estimating requirements - Since scaling isn’t easy on-premises, it
becomes important to estimate the hardware requirements correctly at
the beginning of the project. As data grows unsystematically every
day, this is a tough feat to achieve.

Cost - Cost estimates have proven to be higher on-premises than for
the cloud alternatives.

Cloud data lakes, on the other hand, help overcome these
challenges. A data lake in the cloud is:

Easier and quicker to get started. Rather than a big bang
approach, the cloud allows users to get started incrementally.

Cost-effective with a pay-as-you-use model.

Easier to scale up as needs grow, which eliminates the stress of
estimating requirements and getting approvals.
Cloud data lake challenges
There are challenges to using a cloud data lake, of course. Some
organizations prefer not to store confidential and sensitive
information in the cloud due to security risks. While most cloud-
based data lake vendors vouch for security and have increased
their protection layers over the years, the looming uncertainty
over data theft remains.
Another practical challenge is that some organizations already
have an established data warehousing system in place to store
their structured data. They may choose to migrate all that data
to the cloud, or explore a hybrid solution with a common compute
engine accessing structured data from the warehouse and
unstructured data from the cloud.
Data governance is another concern. A data lake should not
become a data swamp that is difficult to wade through.
Data lake architecture: Hadoop, AWS, and Azure
It’s important to remember that there are two components to a
data lake: storage and compute. Both storage and compute can
be located either on-premises or in the cloud. This results in multiple
possible combinations when designing a data lake architecture.
Organizations can choose to stay completely on-premises, move
the whole architecture to the cloud, consider multiple clouds, or
even a hybrid of these options.
There is no single recipe here. Depending on the needs of an
organization, there are several good options.
Data lakes on Hadoop
Many people associate Hadoop with data lakes.

A Hadoop cluster of distributed servers solves the concern of big
data storage. At the core of Hadoop is its storage layer, HDFS
(Hadoop Distributed File System), which stores and replicates
data across multiple servers. YARN (Yet Another Resource
Negotiator) is the resource manager that decides how to
schedule resources on each node. MapReduce is the
programming model used by Hadoop to split data into smaller
subsets and process them in its cluster of servers.
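
The idea of storing and replicating data across multiple servers can
be sketched as follows. This is a simplified illustration, not HDFS
itself: the tiny block size, the node names and the round-robin
placement are assumptions made for readability (real HDFS uses much
larger blocks and its own placement policy).

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, servers: list, replication: int) -> dict:
    """Assign each block to `replication` different servers, round-robin."""
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block: [servers[(block + r) % len(servers)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement: dict, failed: set) -> bool:
    """True if every block still has at least one replica on a live server."""
    return all(
        any(server not in failed for server in replicas)
        for replicas in placement.values()
    )


data = b"0123456789"                      # hypothetical 10-byte "file"
blocks = split_into_blocks(data, block_size=4)
servers = ["node1", "node2", "node3", "node4"]
placement = place_replicas(len(blocks), servers, replication=3)

print(blocks)
print(placement)
print(readable_after_failure(placement, {"node1", "node2"}))
```

With three copies of every block, the file in this sketch stays
readable when any two servers fail, which is the point of replication.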

Other than these three core components, the Hadoop ecosystem
comprises several supplementary tools such as Hive, Pig, Flume,
Sqoop, and Kafka that help with data ingestion, preparation,
and extraction. Hadoop data lakes can be set up on-premises
as well as in the cloud using enterprise platforms such as
Cloudera and Hortonworks. Other cloud data lakes such as
Azure wrap functionalities around the Hadoop architecture.

Strengths:
More familiarity among technologists
Less expensive because it is open-source
Many ETL tools available for integration with Hadoop
Easy to scale
Data locality makes computation faster

Data lakes on AWS
AWS has an exhaustive suite of product offerings for its data lake
solution.
Amazon Simple Storage Service (Amazon S3) is at the center of
the solution, providing the storage function. Kinesis Streams, Kinesis
Firehose, Snowball, and Direct Connect are data ingestion tools
that allow users to transfer massive amounts of data into S3.
There is also a database migration service that helps migrate
existing on-premises data to the cloud.

In addition to S3, there is DynamoDB, a low-latency NoSQL
database, and Elasticsearch, a service that provides a
simplified mechanism to query the data lake. Cognito User Pools
define user authentication and access to the data lake. Services
such as Security Token Service, Key Management Service,
CloudWatch, and CloudTrail ensure data security. For processing
and analytics, there are tools such as Redshift, QuickSight, EMR,
and Machine Learning.

The huge list of product offerings available from AWS comes with
a steep initial learning curve. However, the solution’s
comprehensive functionalities find extensive use in business
intelligence applications.

Strengths:
Exhaustive and feature-rich product suite
Flexibility to pick and choose products based on unique
requirements
Low costs
Strong security and compliance standards
Separation of compute and storage to scale each one as
needed
Collaboration with APN (AWS Partner Network) firms such as
Talend ensures seamless AWS onboarding

Data lakes on Azure
Azure is a data lake offered by Microsoft. It has a storage layer and
an analytics layer; the storage layer is called Azure Data Lake
Store (ADLS), and the analytics layer consists of two components:
Azure Data Lake Analytics and HDInsight.

ADLS is built on the HDFS standard and has unlimited storage
capacity. It can store trillions of files, and a single file can be
larger than one petabyte in size. ADLS allows data to be stored in any
format and is secure and scalable. It supports any application
that uses the HDFS standard. This makes migration of existing
data easier, and also facilitates plug-and-play with other
compute engines.

HDInsight is a cloud-based data lake analytics service. Built on
top of Hadoop YARN, it allows data to be accessed using tools
such as Spark, Hive, Kafka, and Storm. It supports enterprise-
grade security due to integration with Azure Active Directory.
Azure Data Lake Analytics is also an analytics service, but its
approach is different. Rather than using tools such as Hive, it uses
a language called U-SQL, a combination of SQL and C#, to
access data. It is ideal for big data batch processing as it
provides faster speed at lower costs (pay only for the jobs used).

Strengths:
Having both storage and compute in the cloud makes it simple to
manage.
Strong analytical services with powerful functionalities
Easy to migrate from an existing Hadoop cluster
Many big data experts are familiar with Hadoop and its tools,
so it is easy to find skilled manpower.
Integration with Active Directory ensures no separate effort to
manage security

Data Transmission: What Is It?
Data transmission is the transfer of data from one digital
device to another. This transfer occurs via point-to-point
data streams or channels. These channels may previously
have been in the form of copper wires but are now much
more likely to be part of a wireless network.
As we know, data transmission methods can refer to both
analog and digital data, but in this guide we will be focusing
on digital modulation. This modulation technique focuses on
the encoding and decoding of digital signals via two main
methods: parallel and serial transmission.
The effectiveness of data transmission relies heavily on the
amplitude and transmission speed of the carrier channel. The
amount of data transferred within a given time period is the
data transfer rate, which specifies whether or not a network
can be used for tasks that require complex, data-intensive
applications.
Network congestion, latency, server health, and insufficient
infrastructure can bring data transmission rates to a sub-par
level, affecting overall business performance. High-speed
data transfer rates are essential to processing complex tasks
like online streaming and large file transfers.

Importance of Content Delivery
Networks in Data Transmission
High-quality delivery of websites and applications to as many
locations around the world as possible requires the infrastructure
and expertise to achieve delivery with low latency, high
performance reliability, and high-speed data transmission.
Professional content delivery networks offer a variety of benefits,
including seamless and secure distribution of content to end users,
no matter their location.
A higher data rate improves user experience and increases
reliability.
Faster Data Transfer
FTP and HTTP are common methods of file transfer. FTP can
be used to transfer files or access online software archives,
for example. HTTP is the protocol that defines how messages
are formatted and transmitted. It also determines what actions
web browsers and servers take in response to a variety of
commands.

HTTP is a stateless protocol, meaning each request carries no
information about previous requests. ISPs offer finite levels
of bandwidth for both sending and receiving data
communications, which can cause excessive slowdowns a
business just cannot afford.
Transfer Rates
High data transfer rates are essential for any business. The speed
at which data moves from one network location to another is
expressed as the transfer rate, measured in bits per second (bps).
Bandwidth refers to the maximum amount of data that can be
transferred within a given amount of time. One of the most
promising innovations implemented by content network services is
transfer at Tbps (terabits per second), which was unimaginable up
until the early part of the decade, and can lead to almost
real-time communication between devices.
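
As a worked example of the transfer rate idea, the short Python sketch
below estimates how long a file takes to send at a given rate. The
file size and the two rates are assumed values chosen for
illustration, and the calculation ignores latency, congestion and
protocol overhead.

```python
def transfer_time_seconds(size_bytes: int, rate_bps: float) -> float:
    """Ideal transfer time: size in bits divided by rate in bits per second."""
    return size_bytes * 8 / rate_bps


one_gigabyte = 10**9  # bytes (assumed file size)

# The same file over a 100 Mbps link and over a 1 Tbps link.
t_100mbps = transfer_time_seconds(one_gigabyte, 100 * 10**6)
t_1tbps = transfer_time_seconds(one_gigabyte, 10**12)

print(f"100 Mbps: {t_100mbps} s")
print(f"1 Tbps:   {t_1tbps} s")
```

Note the factor of 8: file sizes are usually quoted in bytes, while
transfer rates are quoted in bits per second.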

Transmission Control Protocol (TCP)
The Transmission Control Protocol (TCP) is a transport protocol
that is used on top of IP to ensure reliable transmission of packets.

TCP includes mechanisms to solve many of the problems that arise
from packet-based messaging, such as lost packets, out of order
packets, duplicate packets, and corrupted packets.

Since TCP is the protocol used most commonly on top of IP, the
Internet protocol stack is sometimes referred to as TCP/IP.
Packet format
When sending packets using TCP/IP, the data portion of
each IP packet is formatted as a TCP segment.

Each TCP segment contains a header and data. The TCP header
contains many more fields than the UDP header and can range in
size from 20 to 60 bytes, depending on the size of the
options field.

The TCP header shares some fields with the UDP header: source
port number, destination port number, and checksum.
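
To make the header layout concrete, here is a small Python sketch that
unpacks the fixed 20-byte part of a TCP header with the standard
struct module. The sample header is built by hand with assumed values
(ports, sequence number, window size); it is not captured traffic, and
any options after the first 20 bytes are ignored.

```python
import struct

# Fixed part of the TCP header, in network byte order:
# source port, destination port, sequence number, acknowledgement
# number, data offset byte, flags byte, window, checksum, urgent pointer.
TCP_FIXED_HEADER = struct.Struct("!HHIIBBHHH")  # 20 bytes


def parse_tcp_header(segment: bytes) -> dict:
    (src, dst, seq, ack, offset_byte, flags,
     window, checksum, urgent) = TCP_FIXED_HEADER.unpack(segment[:20])
    return {
        "source_port": src,
        "destination_port": dst,
        "sequence_number": seq,
        "acknowledgement_number": ack,
        # Upper 4 bits count 32-bit words, so 5 -> 20 bytes, 15 -> 60 bytes.
        "header_length": (offset_byte >> 4) * 4,
        "syn": bool(flags & 0x02),
        "ack": bool(flags & 0x10),
        "fin": bool(flags & 0x01),
        "checksum": checksum,
    }


# Hand-built example header: no options, SYN flag set (assumed values).
sample = TCP_FIXED_HEADER.pack(49152, 443, 1000, 0, 5 << 4, 0x02, 65535, 0, 0)
header = parse_tcp_header(sample)
print(header)
```

The header length field explains the 20 to 60 byte range: its smallest
legal value is 5 words (20 bytes) and its largest is 15 words (60
bytes).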

User Datagram Protocol (UDP)
The User Datagram Protocol (UDP) is a lightweight data
transport protocol that works on top of IP.

UDP provides a mechanism to detect corrupt data in packets, but
it does not attempt to solve other problems that arise with
packets, such as lost or out of order packets. That's why UDP is
sometimes known as the Unreliable Data Protocol.

UDP is simple but fast, at least in comparison to other protocols
that work over IP. It's often used for time-sensitive applications
(such as real-time video streaming) where speed is more
important than accuracy.
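
The corruption check mentioned above is based on a 16-bit checksum.
The Python sketch below shows the general idea using a ones'
complement sum over the payload only. It is a simplification: the real
UDP checksum also covers the UDP header and some IP fields, and the
payload and helper names here are assumptions for illustration.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum of the data."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # wrap the carry around
    return ~total & 0xFFFF


def is_intact(data: bytes, checksum: int) -> bool:
    """Receiver side: data plus its checksum must sum to zero.

    This simple check assumes the data has an even number of bytes.
    """
    return internet_checksum(data + checksum.to_bytes(2, "big")) == 0


payload = b"hello!"                      # hypothetical 6-byte payload
checksum = internet_checksum(payload)

print(is_intact(payload, checksum))      # payload arrived unchanged
print(is_intact(b"jello!", checksum))    # first byte corrupted in transit
```

A failed check tells the receiver only that the packet is corrupt; UDP
itself does nothing further, such as asking for it to be sent again.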

FROM START TO FINISH
Let's step through the process of transmitting a packet with
TCP/IP.
Step 1: Establish connection
When two computers want to send data to each other over TCP,
they first need to establish a connection using a three-way
handshake.

The first computer sends a packet with the SYN bit set to 1 (SYN
= "synchronize?"). The second computer sends back a packet
with the ACK bit set to 1 (ACK = "acknowledge!") plus the SYN
bit set to 1. The first computer replies back with an ACK.

The SYN and ACK bits are both part of the TCP header.

In fact, the three packets involved in the three-way handshake do
not typically include any data. Once the computers are done with
the handshake, they're ready to receive packets containing actual
data.
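
The three-way handshake can be traced with a few lines of Python. This
sketch only models which flag bits each packet carries; it does not
open a real connection, and the computer names and helper functions
are assumptions for illustration.

```python
# Flag bit positions in the TCP header's flags byte.
SYN = 0x02
ACK = 0x10


def flag_names(flags: int) -> str:
    """Render the set flag bits as text, e.g. 'SYN+ACK'."""
    names = [name for name, bit in (("SYN", SYN), ("ACK", ACK)) if flags & bit]
    return "+".join(names)


def three_way_handshake(first="first computer", second="second computer"):
    """Return the three packets as (sender, receiver, flags) tuples."""
    return [
        (first, second, SYN),        # "synchronize?"
        (second, first, SYN | ACK),  # "acknowledge!" plus its own SYN
        (first, second, ACK),        # final acknowledgement
    ]


handshake = three_way_handshake()
for sender, receiver, flags in handshake:
    print(f"{sender} -> {receiver}: {flag_names(flags)}")
```

Only after the third packet do both sides know that the other can send
and receive, which is why data is normally held back until then.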

Step 2: Send packets of data
When a packet of data is sent over TCP, the recipient must
always acknowledge what they received.

The first computer sends a packet with data and a sequence
number. The second computer acknowledges it by setting the
ACK bit and increasing the acknowledgement number by the
length of the received data.

The sequence and acknowledgement numbers are part of the TCP
header; each is a 32-bit field.

Those two numbers help the computers to keep track of which data
was successfully received, which data was lost, and which data was
accidentally sent twice.
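
The bookkeeping with sequence and acknowledgement numbers can be shown
with a short worked example in Python. The starting sequence number
and the payloads are assumed values; real connections start from a
number agreed during the handshake.

```python
def acknowledgement_number(sequence_number: int, payload: bytes) -> int:
    """The next byte the recipient expects: sequence number plus length.

    The fields are 32 bits wide, so the value wraps around at 2**32.
    """
    return (sequence_number + len(payload)) % 2**32


sequence_number = 1000  # assumed starting point
exchange = []
for payload in (b"hello", b"world!!"):
    ack = acknowledgement_number(sequence_number, payload)
    exchange.append((sequence_number, len(payload), ack))
    sequence_number = ack  # the sender continues from the acknowledged byte

for seq, length, ack in exchange:
    print(f"sent seq={seq} ({length} bytes) -> received ack={ack}")
```

If the sender sees an acknowledgement number lower than it expects, it
knows exactly which bytes have not yet been confirmed.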

Step 3: Close the connection
Either computer can close the connection when they no longer
want to send or receive data.

A computer initiates closing the connection by sending a packet
with the FIN bit set to 1 (FIN = finish). The other computer replies
with an ACK and another FIN. After one more ACK from the
initiating computer, the connection is closed.

Detecting lost packets
TCP connections can detect lost packets using a timeout.

After sending off a packet, the sender starts a timer and puts the
packet in a retransmission queue. If the timer runs out and the sender
has not yet received an ACK from the recipient, it sends the packet
again.

The retransmission may lead to the recipient receiving duplicate
packets, if a packet was not actually lost but just very slow to arrive
or be acknowledged. If so, the recipient can simply discard
duplicate packets. It's better to have the data twice than not at all!
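
The timeout-and-retransmit behaviour can be simulated without a real
network. In the Python sketch below, the sets lost_first_try and
slow_ack are hypothetical inputs that decide which first transmissions
are lost and which acknowledgements arrive too late; real TCP uses
actual timers and network conditions instead.

```python
def transmit(packets, lost_first_try=(), slow_ack=()):
    """Send each (sequence number, data) pair until it is acknowledged.

    Returns what the recipient kept, how many transmissions were made,
    and how many duplicates the recipient discarded.
    """
    received = {}
    transmissions = 0
    duplicates_discarded = 0
    for seq, data in packets:
        acked = False
        attempt = 0
        while not acked:
            attempt += 1
            transmissions += 1
            first = attempt == 1
            if first and seq in lost_first_try:
                continue  # packet lost: no ACK, timer runs out, send again
            if seq in received:
                duplicates_discarded += 1  # recipient already has this one
            else:
                received[seq] = data
            # The ACK arrives in time unless it is the slow first attempt.
            acked = not (first and seq in slow_ack)
    return received, transmissions, duplicates_discarded


packets = [(1, b"a"), (2, b"b"), (3, b"c")]
received, transmissions, duplicates = transmit(
    packets, lost_first_try={2}, slow_ack={3}
)
print(received, transmissions, duplicates)
```

Packet 2 is sent twice because it was lost, and packet 3 is sent twice
because its acknowledgement was slow; only the second case produces a
duplicate for the recipient to discard.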

Handling out of order packets
TCP connections can detect out of order packets by using the
sequence and acknowledgement numbers.

When the recipient sees a higher sequence number than what
they have acknowledged so far, they know that they are missing
at least one packet in between. For example, the recipient might
see a sequence number of #73 when it expected a sequence number
of #37. The recipient lets the sender know there's something amiss
by sending a packet with an acknowledgement number set to the
expected sequence number.

Detecting lost packets
Sometimes the missing packet is simply taking a slower route through
the Internet and it arrives soon after.
Other times, the missing packet may actually be a lost packet and
the sender must retransmit the packet.

In both situations, the recipient has to deal with out of order packets.
Fortunately, the recipient can use the sequence numbers to
reassemble the packet data in the correct order.
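
Both ideas, spotting a gap and putting the data back in order, can be
sketched in Python. The segments and their arrival order below are
assumed example data, with each sequence number giving the position of
the segment's first byte.

```python
def next_expected(segments, initial_seq=0):
    """Acknowledgement number to send: the first byte not yet received.

    Assumes duplicate segments carry identical data.
    """
    by_seq = dict(segments)
    expected = initial_seq
    while expected in by_seq:
        expected += len(by_seq[expected])
    return expected


def reassemble(segments):
    """Drop duplicates and join the payloads in sequence-number order."""
    unique = {}
    for seq, payload in segments:
        unique.setdefault(seq, payload)
    return b"".join(unique[seq] for seq in sorted(unique))


# Segments in the order they arrived: one early, one duplicated.
arrived = [(0, b"Hel"), (6, b"wor"), (3, b"lo "), (6, b"wor"), (9, b"ld")]

print(next_expected(arrived[:2]))  # gap detected after the first two
print(next_expected(arrived))      # everything up to the end received
print(reassemble(arrived))
```

After the first two segments arrive, the recipient still asks for
byte 3, which is how the sender learns that something in between is
missing or late.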

The advantages of the TCP/IP protocol
suite are
It is an industry-standard model that can be effectively
deployed in practical networking problems.
It is interoperable, i.e., it allows cross-platform
communications among heterogeneous networks.
It is an open protocol suite. It is not owned by any particular
institute and so can be used by any individual or
organization.
It is a scalable, client-server architecture. This allows
networks to be added without disrupting the current services.
It assigns an IP address to each computer on the network,
thus making each device identifiable over the network.
It assigns each site a domain name. It provides name and
address resolution services.
The disadvantages of the TCP/IP
model are
It is not generic in nature. So, it fails to represent any protocol stack other
than the TCP/IP suite. For example, it cannot describe the Bluetooth
connection.
It does not clearly separate the concepts of services, interfaces, and
protocols. So, it is not suitable to describe new technologies in new networks.
It does not distinguish between the data link and the physical layers, which
have very different functionalities. The data link layer should concern itself
with the transmission of frames. On the other hand, the physical layer should
lay down the physical characteristics of transmission. A proper model should
segregate the two layers.
It was originally designed and implemented for wide area networks. It is not
optimized for small networks like LAN (local area network) and PAN (personal
area network).
Among its suite of protocols, TCP and IP were carefully designed and well
implemented. Some of the other protocols were developed ad hoc and so
proved to be unsuitable in the long run. However, due to the popularity of the
model, these protocols are still being used 30-40 years after their
introduction.

Reliable Multi-Destination Transport
(RMDT)
RMDT (Reliable Multi-Destination Transport Protocol) is a
solution for fast data transmission over high-capacity wide area
networks (WANs). It easily handles high packet delays and jitter as
well as significant packet losses on a channel. The protocol is built
upon standard UDP, so it does not require special hardware
or additional proprietary drivers; it just works in legacy IP
infrastructures. The key novelty of the protocol is that it can
handle data delivery to many destinations in one session, so
delivery of big data sets to several destinations consumes far
fewer system resources and much less time than any other solution.

UDP-based Data Transfer Protocol (UDT)
The UDP-based Data Transfer Protocol (UDT) is a high-performance data
transfer protocol. It is designed specifically for the high-volume transfer
of large datasets over high-speed wide area networks (WANs). In this
setting UDT is a much more efficient alternative to TCP, and can transfer
data at a much higher speed.
UDT is an application-level protocol built on top of the User Datagram
Protocol (UDP). It is a connection-oriented, duplex protocol that can
support reliable data streaming and partially reliable messaging.
UDT is the major technology behind most commercial WAN acceleration
products, and can support global data transfers of terabyte-sized
datasets. It also offers a highly configurable framework that can
accommodate a variety of algorithms that control network congestion
and increase reliability of delivery.
The project to develop UDT began in 2001. At this time optical networks
were becoming less expensive and increasing in popularity. This
development led to a wider awareness of TCP efficiency problems over
high-speed wide area networks. The initial version of UDT was developed
to support bulk transfer of scientific data over private networks.
Quick UDP Internet Connections
(QUIC)
QUIC is an experimental protocol developed by Google to speed up
latency-sensitive applications such as Web search. A prime goal
is that connections can be established more quickly than with
TCP ("zero RTT connection establishment"). The protocol is
layered on top of UDP for deployability. According to a
Chromium Blog article from April 2015, "roughly half of all
requests from Chrome to Google servers are served over QUIC".

Wireless Transmission of Big Data:
The growing popularity of big data and Internet of Things (IoT)
applications brings new challenges to the wireless communication
community. Wireless transmission systems should more efficiently
support the large amount of data traffic from diverse types of
information sources.

Supporting big data transmission presents several
technical challenges to wireless system design, including
spectrum efficiency enhancement of the radio access network (RAN),
capacity provision of fronthaul/backhaul links, and network
architecture improvement for traffic scalability. To effectively
support various big data and IoT applications, future wireless
systems need to optimize their transmission strategies for a large
amount of data from diverse sources.

Note: You should understand the limitations of TCP and be aware
of efforts to replace it. You do not need to know technical details
of any proposed replacements but should be able to explain in
general terms why they are an improvement on TCP.

· Collection systems and sources; e.g. purchasing and financial
data from credit/debit cards, purchasing and lifestyle data from
supermarkets and other large stores, lifestyle data from email
inboxes and social media accounts.

Note: Collection, storage and transmission of data is an area
where changes may happen due to both domestic and
international politics and new trade agreements. You should know
about types of laws/regulations rather than specific detail that
may be contained in them.

BIG DATA STORAGE.
UNDERSTAND THE IMPACT OF STORING BIG DATA
INDIVIDUAL OR TEAM WORK
Guided research into issues associated with storing Big Data. You should work
individually or in small groups.
You could look at:

Access; e.g. where to find data stores, who sees an organisation's data, types
of query language used, tools for seeing and understanding the data

Transmission time; e.g. effects of different media (such as copper cable, fibre
optic, wireless), bandwidth, requirements for building/upgrading networks
and systems to cope with increasing data movement, international aspects,
latency affecting processing at different locations (especially real-time
processing), synchronisation problems

Security; e.g. theft of data from online/third party storage, ransomware,
DDoS attacks, insider attacks, data misuse by authorised users, insertion of
fake data, changes at source level (such as changes to data or metadata
formats, which can cause data to go to the wrong place), access
control, security audits, physical security.

Processing time; e.g. balance between security - encrypting all data - and
processing time, problems with having to decrypt-process-encrypt, process
types (e.g. batch, real time, stream), processing architecture (lambda and
kappa), software, problems with data noise and corruption

You should also consider how legal issues affect Big Data storage, especially
in the areas of access and security.

Groups report to whole class so that all students build up a more complete
picture.


Note: You should understand the uses and limitations of different
software, techniques, and architectures. You do not need to know
technical details but should be able to explain them in general
terms.

Version 1 by Hasan Eser
What Are The Impacts?
There can be many positive and negative sides to handling and
storing big data. If implemented efficiently, the benefits that big
data brings to a corporation are much larger than the drawbacks
involved in getting there.
The Main Impacts Of Big Data
Storage

ACCESS

TRANSMISSION TIME

SECURITY

PROCESSING TIME

LEGAL ISSUES

MANIPULATING BIG DATA

Data Manipulation?
Data manipulation is a process of changing data so that it can
be analyzed, aggregated, and visualized.

Nowadays all companies struggle with ever-growing stores of
data. Due to advances in data storage technology, data
sources have become larger and old-fashioned data tools won't
cut it. Big Data is hard to manage, move, report, and analyze.

A user-friendly interface allows for data manipulation on any
level, easy or advanced. The user can also incorporate
visualizations along with visual analysis and published reporting.

DATA MINING
Data mining is the process of looking at large banks of information
to generate new information. Intuitively, you might think that data
“mining” refers to the extraction of new data, but this isn’t the
case; instead, data mining is about extrapolating patterns and
new knowledge from the data you’ve already collected.
Relying on techniques and technologies from the intersection of
database management, statistics, and machine learning,
specialists in data mining have dedicated their careers to better
understanding how to process and draw conclusions from vast
amounts of information. But what are the techniques they use to
make this happen?
Data Mining Techniques
Data mining is highly effective, so long as it draws upon one or
more of these techniques:

1. Tracking patterns. One of the most basic techniques in data
mining is learning to recognize patterns in your data sets. This is
usually a recognition of some aberration in your data happening
at regular intervals, or an ebb and flow of a certain variable
over time. For example, you might see that your sales of a certain
product seem to spike just before the holidays, or notice that
warmer weather drives more people to your website.

2. Classification. Classification is a more complex data mining
technique that forces you to collect various attributes together
into discernable categories, which you can then use to draw
further conclusions, or serve some function. For example, if you’re
evaluating data on individual customers’ financial backgrounds
and purchase histories, you might be able to classify them as
“low,” “medium,” or “high” credit risks. You could then use these
classifications to learn even more about those customers.

3. Association. Association is related to tracking patterns, but is
more specific to dependently linked variables. In this case, you’ll
look for specific events or attributes that are highly correlated
with another event or attribute; for example, you might notice
that when your customers buy a specific item, they also often buy
a second, related item. This is usually what’s used to populate
“people also bought” sections of online stores.

4. Outlier detection. In many cases, simply recognizing the
overarching pattern can’t give you a clear understanding of your
data set. You also need to be able to identify anomalies, or
outliers in your data. For example, if your purchasers are almost
exclusively male, but during one strange week in July, there’s a
huge spike in female purchasers, you’ll want to investigate the
spike and see what drove it, so you can either replicate it or
better understand your audience in the process.

5. Clustering. Clustering is very similar to classification, but the
groups are not defined in advance: it
involves grouping chunks of data together based on their
similarities. For example, you might choose to cluster different
demographics of your audience into different packets based on
how much disposable income they have, or how often they tend
to shop at your store.

6. Regression. Regression, used primarily as a form of planning
and modeling, is used to estimate the likely value of a certain
variable, given the values of other variables. For example, you
could use it to project a certain price, based on other factors
like availability, consumer demand, and competition. More
specifically, regression’s main focus is to help you uncover the
exact relationship between two (or more) variables in a given
data set.

7. Prediction. Prediction is one of the most valuable data mining
techniques, since it’s used to project the types of data you’ll see
in the future. In many cases, just recognizing and understanding
historical trends is enough to chart a somewhat accurate
prediction of what will happen in the future. For example, you
might review consumers’ credit histories and past purchases to
predict whether they’ll be a credit risk in the future.
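
Three of the techniques above (association, outlier detection, and
regression) can be sketched in a few lines of Python using only the
standard library. The function names, the sample data, and the
two-standard-deviation threshold are assumptions chosen for
illustration. Real data mining tools use more robust methods.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def frequent_pairs(baskets, min_count=2):
    """Association: count how often two items appear in the same basket."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


def outliers(values, threshold=2.0):
    """Outlier detection: values far from the mean, in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = sum((x - mx) ** 2 for x in xs)
    slope = numerator / denominator
    return slope, my - slope * mx


if __name__ == "__main__":
    baskets = [["bread", "butter"], ["bread", "butter", "jam"], ["bread", "milk"]]
    print(frequent_pairs(baskets))                      # pairs bought together
    print(outliers([10, 11, 9, 10, 12, 10, 11, 50]))    # the unusual value
    print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))         # slope and intercept
```

The pair counts are the raw material for a "people also bought"
list, the outlier check flags the one week that needs investigating,
and the fitted line can be used to project a value for a new input.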

Data Mining Tools
So do you need the latest and greatest machine learning
technology to be able to apply these techniques? Not
necessarily. In fact, you can probably accomplish some cutting-
edge data mining with relatively modest database systems, and
simple tools that almost any company will have. And if you don’t
have the right tools for the job, you can always create your own.
However you approach it, data mining is the best collection of
techniques you have for making the most out of the data you’ve
already gathered. As long as you apply the correct logic, and
ask the right questions, you can walk away with conclusions that
have the potential to revolutionize your enterprise.

What Is Data Warehousing?


Data warehousing is the secure electronic storage of information
by a business or other organization. The goal of data
warehousing is to create a trove of historical data that can be
retrieved and analyzed to provide useful insight into the
organization's operations.

Data warehousing is a vital component of business intelligence.
That wider term encompasses the information infrastructure that
modern businesses use to track their past successes and failures
and inform their decisions for the future.

A data warehouse is a relational or multidimensional database
that is designed for query and analysis. Data warehouses are not
optimized for transaction processing, which is the domain of
OLTP systems. Data warehouses usually consolidate historical and
analytic data derived from multiple sources. Data warehouses
separate analysis workload from transaction workload and
enable an organization to consolidate data from several
sources.

A data warehouse usually stores many months or years of data to
support historical analysis. The data in a data warehouse is
typically loaded through an extraction, transformation, and
loading (ETL) process from one or more data sources such as
OLTP applications, mainframe applications, or external data
providers.
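
The ETL process described above can be sketched with Python's
built-in sqlite3 module standing in for the warehouse. The two
source lists, the sales table, and the date format of the external
source are all assumptions made for this example. A real warehouse
load would read from live systems and handle far more cases.

```python
import sqlite3

# Hypothetical rows as they might arrive from two source systems.
OLTP_ORDERS = [("2023-01-05", "widget", "19.99"), ("2023-01-06", "gadget", "5.00")]
EXTERNAL_ORDERS = [("06/01/2023", "WIDGET ", "20.01")]  # assumed DD/MM/YYYY dates


def extract():
    """Extract: gather raw rows from each source, tagged with their origin."""
    tagged = [("oltp", row) for row in OLTP_ORDERS]
    tagged += [("external", row) for row in EXTERNAL_ORDERS]
    return tagged


def transform(tagged_rows):
    """Transform: convert every row to one shared format."""
    clean = []
    for source, (date, product, amount) in tagged_rows:
        if source == "external":
            day, month, year = date.split("/")
            date = f"{year}-{month}-{day}"
        clean.append((date, product.strip().lower(), float(amount)))
    return clean


def load(conn, rows):
    """Load: append the cleaned rows to the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (sale_date TEXT, product TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract()))
    return conn


if __name__ == "__main__":
    warehouse = run_etl()
    query = "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
    for row in warehouse.execute(query):
        print(row)
```

Once the rows share one date format and one spelling of each product
name, a single query can total sales across both sources.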

Users of the data warehouse perform data analyses that are
often time-related. Examples include consolidation of last year's
sales figures, inventory analysis, and profit by product and by
customer. More sophisticated analyses include trend analyses
and data mining, which use existing data to forecast trends or
predict future outcomes. The data warehouse typically provides the
foundation for a business intelligence environment.

Data warehousing is the storage of information over time by
a business or other organization.

New data is periodically added by people in various key
departments such as marketing and sales.

The warehouse becomes a library of historical data that can
be retrieved and analyzed in order to inform decision-making
in the business.

The key factors in building an effective data warehouse
include defining the information that is critical to the
organization and identifying the sources of the information.

A database is designed to supply real-time information. A
data warehouse is designed as an archive of historical
information.

How Data Warehousing Works
The need to warehouse data evolved as businesses began
relying on computer systems to create, file, and retrieve
important business documents. The concept of data warehousing
was introduced in 1988 by IBM researchers Barry Devlin and Paul
Murphy.

Data warehousing is designed to enable the analysis of historical
data. Comparing data consolidated from multiple
heterogeneous sources can provide insight into the performance
of a company. A data warehouse is designed to allow its users to
run queries and analyses on historical data derived from
transactional sources.

Data added to the warehouse do not change and cannot be
altered. The warehouse is the source that is used to run analytics
on past events, with a focus on changes over time. Warehoused
data must be stored in a manner that is secure, reliable, easy to
retrieve, and easy to manage.
The Key Characteristics of a Data
Warehouse
The key characteristics of a data warehouse are as follows:
Some data is denormalized for simplification and to improve
performance
Large amounts of historical data are used
Queries often retrieve large amounts of data
Both planned and ad hoc queries are common
The data load is controlled

In general, fast query performance with high data throughput
is the key to a successful data warehouse.

Characteristics and Functions of
a Data Warehouse
A data warehouse is easier to manage when its users share a common
way of describing the trends it presents for each subject. The
major characteristics of a data warehouse are listed below:

Subject-oriented
A data warehouse is always subject-oriented: it delivers
information about a particular theme rather than about the
organization’s current operations. The data warehousing process
is designed to handle a specific, well-defined theme. These
themes can be sales, distribution, marketing, etc.

A data warehouse never puts its emphasis only on current operations.
Instead, it focuses on modelling and analysing data to support
decision-making. It also gives a simple and precise view of a
particular theme by excluding data that is not needed to make
the decisions.

Integrated
Integration is closely related to subject orientation: the data
must be held in a consistent format. Integration means
establishing a shared unit of measure for all similar data from
the different databases. The data must also be stored in the data
warehouse in a common, generally accepted manner.

A data warehouse is built by integrating data from various
sources, such as a mainframe and a relational database. In
addition, it must have consistent naming conventions, formats,
and codes. Integrating the data in this way makes analysis more
effective. Consistency in naming conventions, column scaling,
encoding structure, etc. should be ensured. An integrated data
warehouse can then serve the various subject areas.

Time-Variant
Data is maintained over different intervals of time, such as
weekly, monthly, or annually. The time horizon of a data
warehouse is much wider than that of operational systems such as
online transaction processing (OLTP) systems. The data held in a
data warehouse is identified with a specific interval of time and
delivers information from a historical perspective. It contains
an element of time, either explicitly or implicitly. Another
feature of time-variance is that once data is stored in the data
warehouse, it cannot be modified, altered, or updated.

Non-Volatile
As the name suggests, the data held in a data warehouse is
permanent. This means that data is not erased or deleted
when new data is inserted. The warehouse holds a mammoth quantity
of data, which is loaded without modifying the data already
stored. Analysis is then carried out on the data inside the
warehouse.

The data is read-only and is refreshed at particular intervals.
This is beneficial when analysing historical data and
understanding how the business has functioned. A data warehouse
does not need transaction processing, recovery, or concurrency
control mechanisms. Operations such as delete, update, and insert,
which are performed in an operational application, are not
available in the data warehouse environment. Only two types of
data operation are performed in the data warehouse:

Data Loading
Data Access

Functions of a Data Warehouse:
A data warehouse works as a collection of data, organized so that
different groups of users can retrieve the data they need. It
stores facts drawn from tables that have high transaction levels.
The major functions involved are listed below:

Data consolidation
Data Cleaning
Data Integration

Data consolidation
Data consolidation is the corralling, combining, and storing of
varied data in a single place. It lets users manipulate different
types of data from one point of access and helps turn raw data
into insights that drive better, faster decision-making. The term
sometimes is used interchangeably with data integration.
Data consolidation enables businesses to streamline their data
resources, discover patterns, and look for insights in multiple
types of data.

Data consolidation best practices


Organizations should plan and execute data consolidation projects
carefully. These best practices promote effective data consolidation:

Check to see whether data types in your source and target are
compatible: If they’re not, you’ll have to transform data to address
differences among data types.

Maintain copies of your data: Data lineage allows an
organization to understand exactly what was done to the data,
and how, during the consolidation process. You may need
information to demonstrate regulatory compliance, or for
retracing steps to understand the results of analytics and any
business decisions based on them.

Standardize character set conversions: If you work with an
application that allows you to store single-byte characters (such
as those of Western languages) and double-byte characters (such as
those of some Asian languages) in a database, the application can
convert between these character types. However, when you move
the data, the tools processing the data may be unaware that the
data is stored in a different format. By standardizing character set
conversions, you increase the likelihood that the consolidated
data will be reliable.
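
The character set point can be illustrated with a short Python
sketch. It assumes each source declares its own encoding and that
the consolidated store uses UTF-8. The SOURCES list and the helper
names are hypothetical. The codec names latin-1 and shift_jis are
standard Python codecs.

```python
# Hypothetical exports: the same kind of field from two systems,
# each stored with a different character encoding.
SOURCES = [
    {"encoding": "latin-1", "raw": "Müller".encode("latin-1")},
    {"encoding": "shift_jis", "raw": "東京".encode("shift_jis")},
]


def to_utf8(raw: bytes, declared_encoding: str) -> bytes:
    """Decode with the encoding the source declares, then re-encode as UTF-8."""
    return raw.decode(declared_encoding).encode("utf-8")


def consolidate(sources):
    """Return every value as text after converting it to the shared encoding."""
    return [to_utf8(s["raw"], s["encoding"]).decode("utf-8") for s in sources]


if __name__ == "__main__":
    print(consolidate(SOURCES))
```

If a tool instead assumed that every source was already UTF-8, the
latin-1 bytes in this example would fail to decode, which is the
kind of problem that standardized conversion is meant to prevent.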

Data consolidation challenges
There are challenges in the data consolidation process. The most
common ones include:

Limited resources: Hand-coded consolidation techniques
require data engineers who must write code and manage the
process, and write more code every time a new data source
comes online. The more sources and data types involved, the
longer the process becomes.

Security issues (real and perceived): Security concerns
include guarding data from breaches before and after
consolidation, and developing backup and disaster recovery
capabilities if data is compromised, corrupted, or deleted.
Companies also must guard against “inside jobs” like
exfiltration — the unauthorized copying, transfer, or retrieval
of data from a computer or server.

Data spread across multiple locations: Today’s
decentralized data landscape can make data integration
challenging. Having data in different locations — including in
the cloud, on premises, and at remote locations — adds to
the complexity of data consolidation. For instance, data
stored in legacy systems may be missing times and dates for
activities, which more modern systems commonly include; and
data from external systems may not contain the same level of
detail as internal sources.

DATA CLEANING
Data cleaning is the process of fixing or removing incorrect,
corrupted, incorrectly formatted, duplicate, or incomplete data
within a dataset. When combining multiple data sources, there
are many opportunities for data to be duplicated or mislabeled.
If data is incorrect, outcomes and algorithms are unreliable, even
though they may look correct. There is no one absolute way to
prescribe the exact steps in the data cleaning process because
the processes will vary from dataset to dataset. But it is crucial to
establish a template for your data cleaning process so you know
you are doing it the right way every time.
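
A repeatable cleaning template might look something like the Python
sketch below. The records, the field names, and the rules (trim
whitespace, lower-case emails, drop records with no name, drop
duplicate emails, treat unreadable ages as missing) are assumptions
for illustration. The right rules always depend on the dataset.

```python
# Hypothetical customer records merged from two sources.
RAW = [
    {"name": " Alice Smith ", "email": "ALICE@EXAMPLE.COM", "age": "34"},
    {"name": "Alice Smith", "email": "alice@example.com", "age": "34"},  # duplicate
    {"name": "Bob Jones", "email": "bob@example.com", "age": "n/a"},     # bad age
    {"name": "", "email": "unknown@example.com", "age": "51"},           # no name
]


def clean(records):
    """Apply the same cleaning steps, in the same order, to every record."""
    seen_emails = set()
    cleaned = []
    for rec in records:
        name = rec["name"].strip()
        email = rec["email"].strip().lower()
        if not name:
            continue  # incomplete record: remove
        if email in seen_emails:
            continue  # duplicate record: remove
        seen_emails.add(email)
        # Incorrectly formatted age: keep the record, mark the value missing.
        age = int(rec["age"]) if rec["age"].isdigit() else None
        cleaned.append({"name": name, "email": email, "age": age})
    return cleaned


if __name__ == "__main__":
    for record in clean(RAW):
        print(record)
```

Because the steps live in one function, the same process is applied
every time new data arrives.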
What are the benefits of data
cleaning?
There are many benefits to having clean data:

1. It removes major errors and inconsistencies that are
inevitable when multiple sources of data are being pulled
into one dataset.
2. Using tools to clean up data will make everyone on your team
more efficient as you’ll be able to quickly get what you need
from the data available to you.
3. Fewer errors mean happier customers and fewer frustrated
employees.
4. It allows you to map different data functions, and better
understand what your data is intended to do, and learn
where it is coming from.

DATA INTEGRATION
Data integration is the practice of consolidating data from
disparate sources into a single dataset with the ultimate goal of
providing users with consistent access and delivery of data
across the spectrum of subjects and structure types, and to meet
the information needs of all applications and business processes.
The data integration process is one of the main components in
the overall data management process, employed with increasing
frequency as big data integration and the need to share existing
data continues to grow.
Data integration architects develop data integration software
programs and data integration platforms that facilitate an
automated data integration process for connecting and routing
data from source systems to target systems. This can be achieved
through a variety of data integration techniques, including:

Extract, Transform and Load: copies of datasets from
disparate sources are gathered together, harmonized, and
loaded into a data warehouse or database
Extract, Load and Transform: data is loaded as is into a big
data system and transformed at a later time for particular
analytics uses
Change Data Capture: identifies data changes in databases
in real-time and applies them to a data warehouse or other
repositories
Data Replication: data in one database is replicated to
other databases to keep the information synchronized for
operational uses and for backup
Data Virtualization: data from different systems are virtually
combined to create a unified view rather than loading data
into a new repository
Streaming Data Integration: a real time data integration
method in which different streams of data are continuously
integrated and fed into analytics systems and data stores
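
To make one of these techniques concrete, the sketch below imitates
Change Data Capture in Python by comparing two snapshots of a table
and applying only the differences to a target copy. This is a
simplification made for illustration: production CDC tools usually
read the database's change log instead of comparing snapshots, and
the function names and sample rows here are hypothetical.

```python
def capture_changes(previous, current):
    """Compare two snapshots keyed by primary key and list the changes."""
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif previous[key] != row:
            changes.append(("update", key, row))
    for key in previous:
        if key not in current:
            changes.append(("delete", key, None))
    return changes


def apply_changes(target, changes):
    """Apply captured changes to a target copy, such as a warehouse table."""
    for operation, key, row in changes:
        if operation == "delete":
            target.pop(key, None)
        else:
            target[key] = row
    return target


if __name__ == "__main__":
    before = {1: {"city": "Leeds"}, 2: {"city": "York"}}
    after = {1: {"city": "Bath"}, 3: {"city": "Hull"}}
    events = capture_changes(before, after)
    print(events)
    print(apply_changes(dict(before), events))
```

Only the rows that changed are sent to the target, rather than
the whole table being reloaded each time.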
Data Warehousing vs. Databases
A data warehouse is not the same as a database:
A database is a transactional system that monitors and updates
real-time data in order to have only the most recent data
available.
A data warehouse is programmed to aggregate structured data
over time.
For example, a database might only have the most recent
address of a customer, while a data warehouse might have all
the addresses for the customer for the past 10 years.
Data mining relies on the data warehouse. The data in the
warehouse are sifted for insights into the business over time.
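
The address example can be sketched in Python as follows. The two
classes are hypothetical stand-ins, not real database products: one
overwrites the address on every update, while the other appends each
address with the date it became valid.

```python
from datetime import date


class OperationalDatabase:
    """Keeps only the most recent address for each customer."""

    def __init__(self):
        self.addresses = {}

    def update_address(self, customer_id, address):
        self.addresses[customer_id] = address  # the old value is overwritten


class Warehouse:
    """Appends every address with its start date and never overwrites."""

    def __init__(self):
        self.address_history = []

    def load_address(self, customer_id, address, valid_from):
        self.address_history.append((customer_id, address, valid_from))

    def address_on(self, customer_id, day):
        """Return the address that was valid for the customer on a given day."""
        valid = [
            (start, address)
            for cid, address, start in self.address_history
            if cid == customer_id and start <= day
        ]
        return max(valid)[1] if valid else None


if __name__ == "__main__":
    db, wh = OperationalDatabase(), Warehouse()
    moves = [("1 Old Road", date(2015, 3, 1)), ("2 New Street", date(2021, 9, 1))]
    for address, moved_on in moves:
        db.update_address(7, address)
        wh.load_address(7, address, moved_on)
    print(db.addresses[7])                      # current address only
    print(wh.address_on(7, date(2018, 6, 1)))   # address at a past date
```

The operational store can only answer "where does this customer
live now?", while the warehouse can also answer "where did they live
in 2018?".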

Top 10 Benefits of a Data Warehouse

1. Enables Historical Insight
No business can survive without a large and accurate storehouse of
historical data, from sales and inventory data to personnel and
intellectual property records. If a business executive suddenly needs
to know the sales of a key product 24 months ago, the rich historical
data provided by a data warehouse make this possible.

Also important, a data warehouse can add context to this historical
data by listing all the key performance trends that surround this
retrospective research. This kind of efficiency cannot be matched by
a legacy database.

2. Enhances Conformity And Quality Of Data


Your business generates data in myriad different forms, including
structured and unstructured data, data from social media, and data
from sales campaigns. A data warehouse converts this data into the
consistent formats required by your analytics platforms. Moreover, by
ensuring this conformity, a data warehouse ensures that the data
produced by different business divisions is of the same quality and
standard, allowing a more efficient feed for analytics.

3. Boosts Efficiency
It’s very time consuming for a business user or a data scientist to have
to gather data from multiple sources. It’s far more advantageous for
this data to be gathered in one place, hence the benefit of a data
warehouse.

Additionally, if for instance your data scientist needs data to run a
fast report, they don’t need to get assistance from tech support to
perform this task. A data warehouse makes this data readily available
– in the correct format – improving efficiency of the entire process.

4. Increases The Power And Speed Of Data Analytics
Business intelligence and data analytics are the opposite of instinct
and intuition. BI and analytics require high quality, standardized data
– on time and available for rapid data mining. A data warehouse
enables this power and speed, allowing competitive advantage in
key business sectors, ranging from CRM to HR to sales success to
quarterly reporting.

5. Drives Revenue
A tech pundit opined that “data is the new oil,” referring to the high
dollar value of data in today’s world. Creating more standardized
and better quality data is the key strength of a data warehouse, and
this key strength translates clearly to significant revenue gains. The
data warehouse formula works like this: Better business intelligence
helps with better decisions, and in turn better decisions create a
higher return on investment across any sector of your business.

Most important, these revenue gains build on themselves over time, as
better decisions strengthen the business.

In short, a high quality, fully scalable data warehouse can be seen as
less of a cost and more of an investment – one that adds exponential
value like few other investments that businesses make.

6. Scalability
The top key word in the cloud era is “scalable” and a data warehouse
is a critical component in driving this scale. A topflight data
warehouse is itself scalable, and also enables greater scalability in
the business overall.

That is, today’s sophisticated data warehouses are built to scale,
handling ever more queries as the business grows (though this will
require more supporting hardware). Additionally, the efficiency in
data flow enabled by a data warehouse greatly boosts a business’s
growth – this growth is the core of business scalability.

7. Interoperates With On-Premise And Cloud
Unlike the legacy databases of yesteryear, today’s data warehouses
are built with multicloud and hybrid cloud in mind. Many data
warehouses are now fully cloud-based, and even those that are built
for on-premise typically will interoperate well with the cloud-based
portion of a company’s infrastructure. As an additional important side
point: this cloud-based focus also means that mobile users are better
able to access the data warehouse – this is beneficial for sales reps
in particular.

8. Data Security
A number of key advances in data warehouses have enhanced their
security, which enhances the overall security of company data. Among
these advances are techniques like a “slave read only” set up, which
blocks malicious SQL code, and encrypted columns, which protect
confidential data.

Some businesses set up custom user groups on their data warehouses,
which can include or exclude various data pools, and even give
permission on a row by row basis.

9. Much Higher Query Performance And Insight


The constant business intelligence queries that are part of today’s
business can put a major strain on an analytics infrastructure, from the
legacy databases to the data marts. Having a data warehouse to
more effectively handle queries removes some of the pressure on the
system.

Furthermore, since a data warehouse is specifically geared to handle
massive levels of data and myriad complex queries, it’s the high
functioning core of any business’s data analytics practice.

10. Provides Major Competitive Advantage
This is absolutely the bottom line benefit of a data warehouse: it
allows a business to more effectively strategize and execute against
other vendors in its sector.

With the quality, speed and historical context provided by a data
warehouse, the greater insight in data mining can drive decisions that
create more sales, more targeted products, and faster response
times.

In short, a data warehouse improves business decision making, which
in turn gives any business a key competitive advantage.
What Is Data Analytics?
The term data analytics refers to the process of examining datasets to
draw conclusions about the information they contain. Data analytic
techniques enable you to take raw data and uncover patterns to extract
valuable insights from it.

Today, many data analytics techniques use specialized systems and
software that integrate machine learning algorithms, automation and
other capabilities.

Data scientists and analysts use data analytics techniques in their
research, and businesses also use them to inform their decisions. Data analysis
can help companies better understand their customers, evaluate their ad
campaigns, personalize content, create content strategies and develop
products. Ultimately, businesses can use data analytics to boost business
performance and improve their bottom line.

For businesses, the data they use may include historical data or new
information they collect for a particular initiative. They may also collect it
first-hand from their customers and site visitors or purchase it from other
organizations. Data a company collects about its own customers is called
first-party data, data a company obtains from a known organization that
collected it is called second-party data, and aggregated data a
company buys from a marketplace is called third-party data. The data a
company uses may include information about an audience’s demographics,
their interests, behaviors and more.
4 Ways to Use Data Analytics
1. Improved Decision Making
Companies can use the insights they gain from data analytics to inform
their decisions, leading to better outcomes.
Data analytics eliminates much of the guesswork from planning
marketing campaigns, choosing what content to create, developing
products and more. It gives you a 360-degree view of your customers,
which means you understand them more fully, enabling you to better
meet their needs. Plus, with modern data analytics technology, you can
continuously collect and analyze new data to update your
understanding as conditions change.

2. More Effective Marketing


When you understand your audience better, you can market to them
more effectively. Data analytics also gives you useful insights into how
your campaigns are performing so that you can fine-tune them for
optimal outcomes.

3. Better Customer Service


Data analytics provides you with more insights into your customers,
allowing you to tailor customer service to their needs, provide more
personalization and build stronger relationships with them.
Your data can reveal information about your customers’
communications preferences, their interests, their concerns and more.
Having a central location for this data also ensures that your whole
customer service team, as well as your sales and marketing teams, are
on the same page.

4. More Efficient Operations


Data analytics can help you streamline your processes, save money and
boost your bottom line. When you have an improved understanding of
what your audience wants, you waste less time on creating ads and
content that don’t match your audience’s interests.
This means less money wasted as well as improved results from your
campaigns and content strategies. In addition to reducing your costs,
analytics can also boost your revenue through increased conversions,
ad revenue or subscriptions.

Data Analytics Technology
Some of the technologies that make modern data analytics so powerful
are:

Machine learning: Artificial intelligence (AI) is the field of developing
and using computer systems that can simulate human intelligence to
complete tasks. Machine learning (ML) is a subset of AI that is
significant for data analytics and involves algorithms that can learn on
their own. ML enables applications to take in data and analyze it to
predict outcomes without someone explicitly programming the system
to reach that conclusion. You can train a machine learning algorithm
on a small sample of data, and the system will continue to learn as it
gathers more data, becoming more accurate as time goes on.

Data management: Before you can analyze data, you need to have
procedures in place for managing the flow of data in and out of your
systems and keeping your data organized. You also need to ensure that
your data is high-quality and that you collect it in a central data
management platform (DMP) where it’s available for use when needed.
Establishing a data management program can help ensure that your
organization is on the same page regarding how to organize and
handle data.

Data mining: The term data mining refers to the process of sorting
through large amounts of data to identify patterns and discover
relationships between data points. It enables you to sift through large
datasets and figure out what’s relevant. You can then use this
information to conduct analyses and inform your decisions. Today’s
data mining technologies allow you to complete these tasks
exceptionally quickly.

Predictive analytics: Predictive analytics technology helps you analyze
historical data to predict future outcomes and the likelihood of various
outcomes occurring. These technologies typically use statistical
algorithms and machine learning. More accurate predictions mean
businesses can make better decisions moving forward and position
themselves to succeed. It allows them to anticipate their customers’
needs and concerns, predict future trends and stay ahead of the
competition.

USING BIG DATA

Eight Ways Big Data Affects Your Personal Life
What is big data and what are some examples of big data?

Before you can understand the ways big data is used in your
everyday life, you must have a basic understanding of what big
data is and how it is gathered. Research indicates that 2.5
quintillion bytes of data are created each day as our many
internet-connected devices track, produce, and store
information (source). That number is expected to continue to
increase as internet access and use improves and expands
around the world.
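
To put the quoted figure in perspective, the short calculation below
converts 2.5 quintillion bytes per day into more familiar units. It
assumes decimal units (1 exabyte = 10^18 bytes, 1 terabyte = 10^12
bytes) and spreads the total evenly over the day, which is a
simplification.

```python
BYTES_PER_DAY = 2.5e18          # 2.5 quintillion bytes, the figure quoted above
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

exabytes_per_day = BYTES_PER_DAY / 1e18
bytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY
terabytes_per_second = bytes_per_second / 1e12

if __name__ == "__main__":
    print(f"{exabytes_per_day:.1f} exabytes per day")
    print(f"{terabytes_per_second:.1f} terabytes per second")
```

On these assumptions the figure works out to 2.5 exabytes per day,
or roughly 29 terabytes every second.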

The world has never seen this amount of information collected so
rapidly. In short, data is everywhere. Large groups of information
are assimilated and then analyzed for insights into human
behavior, past, present, and future. Experts of all kinds are
working to apply the knowledge gained from big data in an
ever-growing number of ways. Big data is changing the way
people live their lives as it is applied to fields such as:

Music, Shows, and Movies
Healthcare and Medical Services
Shopping and Marketing
Travel and Transportation
Public Policy and Safety
News and Information
Education and Employment
Artificial Intelligence

How Big Data is Changing the Way People Live Their Lives
The changes in how big data is collected have occurred so
rapidly that big data is more prevalent in daily life than you
might think. Companies and organizations are collecting
information about their targeted audiences. They know what
you’re watching, what you’re reading, and what you’re buying.

This access to key, personalized data then affects your daily
experience in some of the most important and common areas of
life. Consider these ways big data is used in your everyday life:

Music, Shows, and Movies


One of the most apparent and personal ways big data affects
your personal life is through the entertainment and media you
consume. This includes music streaming services as well as
television and film platforms.

Streaming has revolutionized the music industry, and most people
use one or more of the most popular music streaming services.
Companies like Spotify and Pandora rely heavily on big data,
tracking what music you choose and like to redirect your
experience in real-time. Pandora personalizes user experience
through its music genome project, which it describes as “the most
comprehensive analysis of music ever undertaken” (source).
Spotify offers users a weekly, personalized playlist and even
marks it with the user’s photo (source).

Other forms of entertainment are working from the same
playbook, and that includes most movie and television streaming
services. Since many of these companies are now also creating
their own content, they also use the data they collect from you to
determine what kind of content to produce. In 2016, big data led
Netflix to create more original content like its hit Stranger
Things, contributing to a major change in the company’s
direction and in your experience with Netflix (source).

Healthcare and Medical Services


Healthcare is another area where it’s easy to trace the impacts
of big data in your personal life. The collection and application
of mass information has changed many areas of the healthcare
industry, including (source):

Tracking and maintaining personal records and health patterns,
Prediction of disease transmission and epidemics,
Treatment protocols and potential cures,
Tracking and improving quality of life patterns, and
Privacy and security

You’ve probably noticed that more and more of your medical
records are digitized. This use of electronic data can affect your
life in a couple of important ways. First, it enables doctors,
hospitals, and clinics to more efficiently track your history and
provide the treatment you need. However, your data is also
being analyzed along with the health histories of many others to
enable medical professionals to track diseases, determine the
effectiveness of treatments, and much more (source).

You also may be contributing to one of the examples of big data
in healthcare in a more immediate way. Do you track your
exercise or other aspects of your personal health through a
wearable device like a Fitbit or through apps on your phone? If
so, your information helps professionals track important health
trends that affect research and medical progress.

Shopping and Marketing


Most major retailers now rely heavily on big data not only to
shape their front-end business but also to direct their marketing
efforts. If you shop online regularly, the impacts of big data in
your personal life definitely include both a change in the ads you
see and in your actual shopping experience.

Online retailers now collect information from your activity on
your computers, smartphones, and other devices that connect to
the world wide web. Businesses then analyze your data and use it
to evaluate your interests and preferences and make projections
about what you might buy in the future. This affects both their
advertising tactics and what you see when you shop.

Most of us would like to believe we are unaffected by these
techniques, but consider: Have you ever purchased an item for
which you didn’t search but that appeared on a site where you
were shopping? If so, this is another reminder of how the use of
big data can affect your life.

When you make a purchase on a site like Amazon, that
information is then compared to a pool of data about other
consumers who purchased the same item (source). Those
comparisons allow the retailer to make predictions about other
items that might interest you. These items then will appear as
suggestions in various ways as you browse the retailer’s sites.
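
The comparison described above can be illustrated with a minimal Python sketch. It is not Amazon's actual method; it simply counts which items appear together in a small, made-up order history and suggests the most frequent companions. The function names and the sample orders are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import permutations


def build_co_purchase_counts(orders):
    """Count how often each pair of items appears in the same order."""
    counts = {}
    for order in orders:
        # set() ignores duplicate items within a single order
        for item, other in permutations(set(order), 2):
            counts.setdefault(item, Counter())[other] += 1
    return counts


def recommend(counts, item, top_n=3):
    """Return up to top_n items most often bought together with `item`."""
    companions = counts.get(item, {})
    # Sort by descending count, then alphabetically for a stable order
    ranked = sorted(companions.items(), key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _ in ranked[:top_n]]


# Hypothetical order history (illustrative data only)
orders = [
    ["tent", "sleeping bag", "lantern"],
    ["tent", "sleeping bag"],
    ["tent", "lantern"],
    ["sleeping bag", "pillow"],
    ["tent", "sleeping bag", "stove"],
]

counts = build_co_purchase_counts(orders)
suggestions = recommend(counts, "tent")
print(suggestions)
```

A real retailer would work with far larger data sets and more signals (browsing history, ratings, time of purchase), but the underlying idea of comparing one customer's basket with many others is the same.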

Travel and Transportation


The ways we move around our cities and the world have
changed dramatically in recent years. Many of those changes
are driven by examples of big data creating more efficient
solutions to our travel and transportation needs, including:

The development of GPS and intelligent map programs,
Better sequencing of traffic signals,
Advancements in how air travel is managed and sold,
Traffic prediction and planning,
More efficient operation of mass transit systems, and
On-board automobile data collection

From the moment you turn the key in your ignition, you probably
are experiencing several ways big data affects your personal
life. Most automobiles produced in the last decade or so have
smart technology designed to monitor the condition of the car,
track mileage and fuel consumption, and improve your driving
experience in other ways. That data is also collected to help the
auto industry continue to make more efficient and reliable
vehicles.

You likely haven’t opened a physical map in the car in years, if
ever. What was once a normal part of a road trip experience
has now been moved to either built-in GPS systems or smart
maps programs on phones. You no longer have to guess about
how long your trip will take, when you will arrive, or what traffic
conditions you might experience. And, of course, your own travel
data is being collected to help improve the accuracy of these
systems for everyone.

Big data also has revolutionized the airline industry at virtually
all levels. From the moment you begin to search for a ticket, you
begin a journey through multiple examples of big data in use.
Fares are set by automated data collection and analysis, and
schedules are created based on predictions made from the
collection of big data. And, of course, airlines are keeping track
of how frequently you fly, what you prefer to drink, and other
information to customize your experience.

Public Policy and Safety


Public agencies are also utilizing these trends in data collection,
providing yet another reminder that big data is more prevalent
in daily life than you might think. Police and fire departments and
all levels of government turn to big data to help develop and
implement new policies and procedures.

Police departments and law enforcement agencies around the
world increasingly work to be proactive rather than simply
responding to crime after the fact. Networks of computers,
cameras, and mobile devices track incidents in real-time so
police can be dispatched more efficiently and effectively.
Larger law enforcement efforts like campaigns to stop terrorism
also rely on global collection and examination of relevant data
(source).

Fire services and other emergency responders are taking a
similar course. Stockpiles of data gathered from government
sources, surveillance systems, emergency vehicle GPS tracking,
and fire and smoke detectors help fire departments prepare to
respond more quickly and effectively to fire emergencies
(source). They also have become more proactive as they are
able to identify higher risk areas, conduct inspections and safety
checks, and recommend better preventative measures.

These are just two examples of big data use that demonstrate
ways public service agencies and policy-makers are attempting
to improve efficiency and accuracy.

News and Information


No matter what sources you rely on for news coverage, your
experience is impacted by multiple examples of big data at
work. From the earliest reporting and news gathering through
news delivery and on to comments you might make on social
media, big data is everywhere in the news cycle.

More and more, reporters utilize social media in gathering
information that shapes their news reports. Applications such as
Twitter are so common that major news is often reported on
them by everyday users within seconds. Reporters preparing
stories can use search features on these platforms to isolate
posts made in given time periods and geographic areas
(source).

Of course, one of the other relevant impacts of big data in your
personal life is in the way you receive news. Trends in big data
populate your various media timelines with stories determined to
be of higher importance or of particular interest to you.
Automated data processes determine the most talked about
stories and often push them to the front of both news
aggregators and social media sites (source).
Education and Employment
Big data also impacts two of the primary areas that will shape
your future: education and employment processes. College
admissions and hiring practices both take cues from the
collection and use of big data.

Many colleges and universities now rely on statistical programs
designed to identify and attract students who enable the
institutions to meet internal goals. For example, schools benefit
from higher rates of enrollment among admitted students and
from improved graduation rates. Admissions decisions often
include work with data designed to predict success for both
students and educational institutions (source).

Employers are also working to leverage data to improve hiring
processes. They often rely on services that aggregate large
volumes of data to identify candidates most likely to fit and
excel in particular jobs. Some examples of big data in these
processes include (source):

Education
Job History
Language
Public Work Samples
Behavior on Social Media

Artificial Intelligence
You may read the term “artificial intelligence” and tend to
envision a scene from a science fiction movie. However, artificial
intelligence (AI) is no longer just an imagined force in the future.
You likely interact with some form of AI on a regular basis.

Among the most obvious examples of artificial intelligence in
your daily life is the chatbot. Chatbots draw on multiple
examples of big data and automate your experience of locating
very specific information. Because big data and AI merge in
chatbot functioning, these bots are able to “learn” and
constantly improve their ability to customize your experience
(source).

Overall, big data and artificial intelligence have a symbiotic
relationship. Big data is fueling improvements in AI, and in turn, AI
improves the insights we glean from big data in several ways
(source):

AI is creating new methods for analyzing data,
AI helps make data analytics less labor-intensive,
AI requires human guidance, an important reminder in an age
of big data,
AI can be used to alleviate common data problems, and
AI helps data analytics become more predictive and
prescriptive

These impacts of artificial intelligence matter. As long as there
are so many ways big data affects your personal life, you benefit
from the ability of AI to create better uses of that data.

How Big Data is Reshaping Society
It’s impossible for the human brain to fathom the quantity of data
generated today. In 2003, the world had created a total of 1.8
zettabytes of data. In 2011, that same amount was created every
two days. In 2018, “over 2.5 quintillion bytes of data are
created every day, and by 2020, it’s estimated that 1.7
megabytes of data will be created every second for every
person on earth”. With each click, share, like, and swipe, society
is creating big data. Every day, billions of people interacting
with their devices via the internet contribute to a world of
valuable information.

As Meglena Kuneva, European Consumer Commissioner, said,
“Personal data is the new oil of the internet and the new
currency of the digital world”.

The Value of Big Data


The focus of big data in society should be not only on the
extraordinary volume of information but also on the value that
organisations can extract from it. Using machine learning
technology, a field of data science known as predictive
analytics shows the value in large amounts of data. In a nutshell,
predictive analytics learns from data to predict the way
individuals will behave in the future. Machine learning detects
patterns in data sets to consider the probability of certain
outcomes. For example, the predictive model uses everything
known about an individual to determine the likelihood of them
buying a specific product, contracting a certain disease, being
influenced by an economic trend, or any desired outcome. Based
on that insight, organisations can make more informed decisions.
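
To make the idea of a predictive model concrete, here is a minimal sketch of logistic regression written in plain Python. It is only one of many possible techniques and is not taken from any particular product. The features (site visits last month, items left in the basket), the tiny training set, and the function names are all hypothetical.

```python
import math


def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=500, learning_rate=0.1):
    """Fit weights with simple stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(rows, labels):
            score = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(score) - label  # positive if the prediction is too high
            bias -= learning_rate * error
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights, bias


def predict_probability(weights, bias, features):
    """Estimated likelihood that this individual makes a purchase."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


# Hypothetical history: [visits last month, items left in basket] -> bought (1) or not (0)
rows = [[1, 0], [2, 0], [1, 1], [4, 2], [5, 3], [6, 2]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic(rows, labels)
likely_buyer = predict_probability(weights, bias, [6, 3])
unlikely_buyer = predict_probability(weights, bias, [1, 0])
print(round(likely_buyer, 2), round(unlikely_buyer, 2))
```

The model learns the pattern in past behaviour and outputs a probability for a new individual, which an organisation could then use to decide, for example, who should receive a particular offer.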

Sandy Pentland, Director of MIT’s Human Dynamics and Media
Lab, aptly states, “It’s not that it’s big or that it’s fast. The part
that’s important is that it’s about people…. By understanding
these things, we begin to understand society and social
interaction in ways we never could before”.

While some may be sceptical about the implications of big data
on privacy and security, big data stands to transform society for
the better. Discover its potential and application in the following
industries:
Business
By analyzing large-scale social media and browsing behavior,
businesses can create a more complete profile of customers and
sort them into narrow segments of preferences, likes, and
dislikes. With this level of specificity and insight, businesses can
make more informed marketing decisions to promote their
product or service to those more likely to convert.

Netflix is an example of a company using big data to understand
its customers and target them with personalized suggestions
based on their viewing history. As much as 80% of your Netflix
stream is influenced by its recommendation system powered by
algorithms. This behavioral data is used to create a better
experience for the customer, one they are almost guaranteed to
enjoy, and gain better brand loyalty as a result.


How big data helps understand customers
By gathering information through customers’ buying habits, large-
scale surveys, and case studies, organisations can begin to
create new innovative products based on what customers are
looking for. Companies can predict what needs customers
want to fulfil and create products that best meet those
needs. It’s the first opportunity for businesses to create consumer-
responsive products based on data prediction instead of relying
on the lengthy process of customer feedback.

AmazonFresh and Whole Foods Market capitalize on this by
collecting data to better understand how suppliers interact with
grocers and how customers buy groceries. This allows Amazon to
know whenever there is a need for change and improvement in
the process or product.
How big data predictions help reduce cost
For businesses stocking inventory, products, or produce, knowing
when and how much stock is needed at any given time can save
money and prevent waste. Big data analysis makes it possible to
predict when sales will occur, helping organisations to order the
precise amount of stock needed to meet demand without
wasting produce, keeping capital tied up in inventory, or
incurring unnecessary carrying costs.
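
As a rough illustration of how a sales prediction can drive stock orders, the sketch below forecasts daily demand with a simple moving average and works out how many units to order. Real systems use far richer models; the sales figures, the five-day window, and the function names here are assumptions made for the example.

```python
import math


def forecast_daily_demand(daily_sales, window=5):
    """Predict tomorrow's demand as the average of the most recent days."""
    recent = daily_sales[-window:]
    return sum(recent) / len(recent)


def order_quantity(daily_sales, on_hand, days_to_cover=7, window=5):
    """Units to order so that stock covers the forecast demand, never negative."""
    expected_demand = forecast_daily_demand(daily_sales, window) * days_to_cover
    return max(0, math.ceil(expected_demand - on_hand))


# Hypothetical units sold per day over the last ten days
daily_sales = [20, 22, 19, 25, 30, 28, 24, 26, 27, 23]

forecast = forecast_daily_demand(daily_sales)
to_order = order_quantity(daily_sales, on_hand=60)
print(forecast, to_order)
```

Ordering only the shortfall between expected demand and stock on hand is what keeps capital from being tied up in unsold inventory.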

PepsiCo relies on big data to manage its supply chain effectively
to reduce costs and minimise waste. With reports from suppliers
and warehouse inventory, PepsiCo is able to forecast which
retailers need which products, the volumes they need, and the
time they need them, saving them money and effort.

Healthcare
IBM predicted a 20% decrease in patient mortality by 2020. As
more data is collected and analysed, it will be possible to save
more people’s lives. With artificial intelligence (AI) able to
understand questions, read through 200 million pages of data,
and provide an answer in seconds, doctors can consider all
available resources before making a decision regarding a
patient’s condition. This can change the nature of emergency
rooms in the future.

In a consultation context, medical professionals will be able to
analyze past trials, trends, and current data. With this
information and access to data concerning a patient’s lifestyle,
history, and genetics, a holistic picture can aid doctors in
providing the most effective care.

How big data personalises medical care
In much the same way businesses reach customers through
targeted marketing, the goal in healthcare is to have enough
personal data on each person to provide evidence-based
personalised treatment. People would be empowered to play
an active role in their health and wellbeing with information
tailored to their specific needs. To push the envelope even
further, scientists foresee the use of innovative smart devices at
home. We are already surrounded by devices that process large
amounts of data, so why not imagine a medical future where your
toothbrush, toilet, or scale is capable of reporting instantly on
the condition of your health?

Genome sequencing is another exciting possibility in the medical
field. It took 10 years to decode the first human genome; today it
takes a week. As big data technology continues to expand and
larger amounts can be processed, everyday genome sequencing
for regular people may be possible. The unique set of
variations in your DNA sequence affects your appearance, your
behaviour, and from a medical perspective, your susceptibility to
disease. Today, there are only certain parts of the genome that
are well understood and that influence our health care decisions.
However, as more people are sequenced, scientists and doctors
will have access to a larger set of data to learn about genes
they previously couldn’t understand, including certain genes’
relationships to diseases.

How big data reduces medical costs


With US healthcare expenses in excess of $600 billion, fiscal
concerns are among the greatest drivers of demand for big data
applications.

Analytics can help organisations forecast demand for
medication. The Clinton Health Access Initiative (CHAI) uses
analytics to determine the need for HIV/AIDS, malaria, and
tuberculosis medication, which has led to negotiations for lower
prices and wider availability of medical care in countries
needing it most. CHAI has shared information with the United
Nations and the World Health Organisation, aiming to identify
how best to spend limited resources.

Fifty hospitals across Queensland, Australia are using a tool
which predicts the number of patients who will arrive and the
injuries they will have days, weeks, months, or even years in
advance. It’s able to identify patterns in historical admission and
discharge data, which allows staff to know when to expect
patients with specific injuries when preparing and planning for
surgery. This has reduced long waits and cancellations by up to
20%. It can also be used to predict arrivals by the hour to alert
staff of the number of beds needed at all times. Queensland
hospitals are saving as much as $2.5 million USD every year, and
the value of improved patient outcomes in the state could be up
to $77.5 million.

Government
The application of big data analytics in Western democratic
governments is subject to much debate. While there are
foreseeable benefits, democracy’s prioritisation of privacy has
the potential to hinder progress. Today, the big data revolution
seems to play more into the hands of China’s communist
leadership. With uninterrupted access to data through “close
collaboration with a few state-licensed commercial data
conglomerates Alibaba, Tencent or Baidu,” the Chinese
government could become a data powerhouse.

However, while only time will tell how China’s authoritarian
approach to data will accelerate its technological advance, the
rest of the world is by no means ignoring the potential
application of big data in government sectors, despite the often
strict privacy regulations.

How big data transforms cities
Barcelona developed plans to incorporate data into city
planning over 10 years ago. The smart city it’s hoping to achieve
spans across various departments including water, energy,
communication, housing, and mobility, all with the intention of
improving people’s quality of life through technological
innovation. The developments are underpinned with intelligent
data systems collecting information from smart assets and public
organisations with the intention of creating a more open,
transparent government to engage citizens and provide greater
public independence. According to government officials,
Barcelona “is eager to connect global cities through big data
analytics to address shared issues in the future”.

How big data collaborations assist in safety
Collaboration across countries and between public and private
sectors may become more common as big data gains
prominence. The Haiti earthquake in 2010 is an example of
collaboration between businesses, emergency services, the
government, and people in a disaster. To help the government,
people, and emergency services cope, InSTEDD, a technology
company specialising in emergency services, set up systems to
decode data based on time, geolocation, and route of
transmission. The quality of data is paramount in prioritising
information and minimising incorrect usage of limited rescue
resources. Together with various crowdsourcing agencies, they
were able to “accurately geotag information to provide
coordinates to the search and rescue teams working on the
ground,” managing to save and support many more people.

In a similar way to disaster management, Risk Terrain Modelling is
being used across the US and in 30 other countries
to predict crime. Buildings are linked with types of crimes, and
potentially dangerous areas are recorded on a digital map
overlaid with past crime locations. Buildings are then ranked
according to the number and type of crimes that have occurred
close to their location and are flagged with police to alert them
where to focus their efforts and attention. As a result, gun
violence has dropped by 35% in Newark, New Jersey; car
robberies fell by 33% in Colorado Springs; and in Glendale,
Arizona, overall crime was reduced by over 40%.
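
The ranking step described above can be sketched in a few lines of Python. This is a heavy simplification and not the real Risk Terrain Modelling method: it only counts past crimes within a fixed distance of each building and ignores the type of crime. The coordinates, radius, and building names are invented for the example.

```python
import math


def nearby_crime_count(location, crimes, radius):
    """Number of past crimes within `radius` of a building's location."""
    bx, by = location
    return sum(1 for cx, cy in crimes if math.hypot(cx - bx, cy - by) <= radius)


def rank_buildings(buildings, crimes, radius=1.0):
    """Rank buildings from most to fewest nearby past crimes."""
    scored = [(name, nearby_crime_count(location, crimes, radius))
              for name, location in buildings.items()]
    # Highest count first; ties broken alphabetically for a stable order
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical map coordinates in kilometres
buildings = {"bar": (0.0, 0.0), "school": (5.0, 5.0), "depot": (2.0, 0.0)}
past_crimes = [(0.5, 0.0), (0.0, 0.8), (1.5, 0.2), (5.0, 5.5), (9.0, 9.0)]

ranking = rank_buildings(buildings, past_crimes)
print(ranking)
```

The buildings at the top of such a ranking are the ones that would be flagged so that police know where to focus their attention.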

How big data governance empowers you
Smarter government systems mean an easier government
experience. In Estonia, citizens are never asked to fill in
the same information twice. The country’s data exchange
network, X-Road, saves over 240 hours of work every three
minutes by automatically syncing information across government
agencies. X-Road works by allowing approved databases to
request and share data automatically. Whenever an official
needs information, it can be retrieved without a manual request.
Finland is also adopting the technology due to its success.
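
The principle of approved databases requesting data from one another can be shown with a toy Python model. This is not the real X-Road interface; the class, the registry names, and the sample record are hypothetical and exist only to illustrate the idea of automatic, permission-checked sharing.

```python
class DataExchange:
    """Toy registry-to-registry exchange (illustrative only)."""

    def __init__(self):
        self._registries = {}   # registry name -> {citizen_id: record}
        self._approved = set()  # (requester, provider) pairs allowed to exchange

    def add_registry(self, name, records):
        self._registries[name] = dict(records)

    def approve(self, requester, provider):
        """Record that `requester` is allowed to query `provider`."""
        self._approved.add((requester, provider))

    def request(self, requester, provider, citizen_id):
        """Fetch a record automatically, but only over an approved link."""
        if (requester, provider) not in self._approved:
            raise PermissionError(f"{requester} may not query {provider}")
        return self._registries[provider].get(citizen_id)


exchange = DataExchange()
exchange.add_registry("population_registry", {"EE-001": {"address": "Tallinn"}})
exchange.add_registry("health_registry", {"EE-001": {"blood_type": "A+"}})
exchange.approve("tax_office", "population_registry")

# The tax office retrieves the address without asking the citizen again
record = exchange.request("tax_office", "population_registry", "EE-001")
print(record)
```

Because the address is stored once and fetched on demand, the citizen never has to supply it a second time, while agencies without approval cannot see data they have no right to.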

In Taiwan, people are encouraged to participate in government
processes, creating greater citizen engagement. Using a system
called vTaiwan, citizens post on a discussion forum about new
legislation. Through this method of crowdsourcing, people are
invited to post principles that should underpin legislation.
Suggestions are rated by other citizens and then clustered to
show statements that hold true across the nation. After further
deliberation among experts, the government decides whether to
implement legislation based on the people’s response, or
explains why it won’t be doing so.

How an efficient data system saves on cost
Efficiency is everything when it comes to reducing costs in
government. With more efficient processes in place and inter-
organisational compatibility, analysts estimate potential savings
could be as much as $500 billion USD in the United States alone.

Understand how big data is used in the context of:

Healthcare
Infrastructure Planning
Transportation
Fraud Detection

Electronic health records


It is important to note that the National Institutes of Health (NIH)
recently announced the “All of Us” initiative
([Link]), which aims to collect data from one million or more
patients, such as EHRs, medical imaging, and socio-behavioral
and environmental data, over the next few years.
EHRs have introduced many advantages for handling modern
healthcare-related data. Below, we describe some of the
characteristic advantages of using EHRs. The first advantage of
EHRs is that healthcare professionals have improved access
to the entire medical history of a patient. The information
includes medical diagnoses, prescriptions, data related to known
allergies, demographics, clinical narratives, and the results
obtained from various laboratory tests. The recognition and
treatment of medical conditions is thus more time-efficient due to a
reduction in the lag time for previous test results. Over time, we
have observed a significant decrease in redundant and
additional examinations, lost orders, and ambiguities caused by
illegible handwriting, as well as improved care coordination
between multiple healthcare providers.

Big data in biomedical research
A biological system, such as a human cell, exhibits molecular and
physical events of complex interplay. In order to understand
interdependencies of various components and events of such a
complex system, a biomedical or biological experiment usually
gathers data on a smaller and/or simpler component.
Consequently, it requires multiple simplified experiments to
generate a wide map of a given biological phenomenon of
interest. This indicates that the more data we have, the better
we understand the biological processes. With this idea, modern
techniques have evolved at a great pace. For instance, one can
imagine the amount of data generated since the integration of
efficient technologies like next-generation sequencing (NGS)
and genome-wide association studies (GWAS) to decode human
genetics. NGS-based data provides information at depths that
were previously inaccessible and takes the experimental
scenario to a completely new dimension.

Big data and how it can transform infrastructure planning
When passenger volumes bounce back after the pandemic, airport
operators will be looking at many ways to rebuild their shattered
balance sheets. Increasing productivity from the airport infrastructure
will be front and center in many airports’ recovery strategies. Simply
put, this means delivering higher passenger volumes from the same
space, reducing opex per passenger, re-prioritizing capex plans, and
encouraging greater commercial spend, while having a relentless
focus on improving service and experience levels.

To achieve this nirvana requires a collaborative effort across the
many business partners and service providers that deliver airport
operations. However, if there is one key enabler it will be access to
high-quality operational data. The aircraft and bag journey through
the airport is often recorded with many time stamps, but there is a
sporadic understanding of the passenger’s journey from home to
aircraft door (and vice-versa).

Sourcing accurate operational data is the starting point to enable
high-quality planning decisions. Sampling surveys have been the
traditional approach where staff with stopwatches, clipboards and
questionnaires laboriously collect observations. More recently,
beacon technology has been used to measure passenger flow as well
as providing other wayfinding and commercial information to
passengers. Today, we are seeing the roll-out of sensors and cameras
that measure occupancy and queue lengths with real-time monitoring
and response by teams collocated in operational centers. This
progress is encouraging but these will not help achieve a deep
understanding of passenger behavior in an airport or across the
passengers’ whole end-to-end journey.

Imagine if for every flight we were able to map everyone’s journey
that day: where they started, what transportation mode they took to
the airport, how long it took to get them to the front door of the
terminal, their speed through the terminal, how they used the space,
their dwell time and location, their age, gender and so on. And over
time, we could understand how the behaviors change by time of day,
across the week, months and seasons. This is the opportunity that cell
phone data can unlock for airport operators.

O2 Motion provides a powerful alternative to traditional data
capture methods. It uses all the mobile events that O2 UK captures 24
hours a day, 365 days a year. Each device has a unique identifier,
although the data is aggregated and anonymized for reporting. This
information is then extrapolated to represent the national population.
Recent investment to measure activity levels during the pandemic has
improved the latency significantly, and reporting is now close to
near-real-time. Smart cells complement this data by providing additional
granularity, sharpening the focus from 100m all the way down to 5m.
Using this data along with demographic data, O2 can show which
types of people are where and when. The information can be broken
down by age, gender, home locations and affluence bracket, for
example.
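
The aggregate, anonymize, and extrapolate steps described above can be sketched as follows. This is not how O2 Motion is actually implemented. The event format, the minimum group size used to hide small groups, and the scale factor used to extrapolate from devices to the wider population are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_footfall(events, scale_factor, min_group_size=3):
    """Turn per-device events into anonymous, scaled counts per zone and hour."""
    devices_seen = defaultdict(set)
    for device_id, zone, hour in events:
        # A set means each device is counted once per zone and hour
        devices_seen[(zone, hour)].add(device_id)

    report = {}
    for key, device_ids in devices_seen.items():
        # Suppress small groups so that individuals cannot be singled out
        if len(device_ids) >= min_group_size:
            report[key] = round(len(device_ids) * scale_factor)
    return report


# Hypothetical mobile events: (device identifier, airport zone, hour of day)
events = [
    ("a", "check-in", 8), ("b", "check-in", 8), ("c", "check-in", 8),
    ("a", "check-in", 8),                        # repeat sighting of device "a"
    ("d", "security", 8), ("e", "security", 8),  # too few devices to report
]

report = aggregate_footfall(events, scale_factor=4.0)
print(report)
```

Note that the report contains no device identifiers at all, only counts, which is what allows the data to be shared with planners without exposing any individual's movements.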

Combined with other automated airport databases such as boarding
card scans and aircraft gating plans, the potential to understand the
detailed experience for passengers on the day and how they use
airport infrastructure is transformational. And with these deeper
insights, value can be unlocked. For instance, improving the accuracy
of the passenger reporting profile by, say, 5% for labor-intensive
operations such as check-in or security could provide significant
resource management benefits and therefore opex savings. The data
could also identify missed opportunities to convert landside forecourt
activity into car parking growth, or show where wasted time can be taken
out of the airport ecosystem to improve the overall efficiency and
experience for users.

The potential of cell phone data has always been known. The
pandemic accelerated the development of this capability and now
airport managers can take advantage in creating the next
generation of truly smart, efficient and resilient airports.

How is big data used in transportation?
Big data analytics help the public transportation sector to
predict passenger volumes as precisely as possible. In this
context, for example, certain events such as bad weather,
holidays, malfunctions and customer feedback from running
transportation operations can be analyzed and processed in
real time.

Potential uses of big data in public transportation
Example: Lufthansa
Optimizing operations, cutting costs and increasing revenues:
These are the challenges facing almost every company in the
public transportation sector. Companies that want to meet these
challenges and ensure future success should utilize the potential
of new technologies such as big data.

Between cost pressure and customer orientation: There is less and less
financial leeway in the public transportation sector due to the
strained budgetary situations of many municipalities and subsidy cuts
enacted by the German government. However, transportation
companies cannot afford to lose sight of customer needs and must
continue to provide them with high-quality services. For this to
happen, there needs to be innovation. The digital transformation and
the accompanying utilization of big data technologies provide
transportation companies with the opportunity to cut costs while
strengthening their customer bases by analyzing historic data.
How big data analytics help public transportation
1. Optimizing operating procedures and cutting costs:
How many passengers use which routes when? This is one of the most
important questions in the public transportation sector. The more
precisely transportation companies know how their routes are being
used, the more cost efficiently they can deploy staff and trains. Big
data analytics help the public transportation sector to predict
passenger volumes as precisely as possible.

In this context, for example, certain events such as bad weather,
holidays, malfunctions and customer feedback from running
transportation operations can be analyzed and processed in real
time. This knowledge can be used to plan operating procedures more
efficiently and affordably. Moreover, transportation companies can
increase customer retention by doing away with short trains during
times of peak passenger volume and increasing the frequency of
trains in a service-oriented way, for example.

2. Targeted increases in transit pass sales:


A further challenge facing transportation companies is having to
increase revenues – especially in light of growing cost pressures. With
the help of big data analytics, companies can create system-based
sales forecasts. Using historic sales figures – for example figures from
the same months of the previous year – and big data applications,
transportation companies can analyze customer behavior even more
accurately. This knowledge helps public transportation companies to
develop sales strategies. They can use it to further optimize their
range of products, increase revenues and improve customer
satisfaction. This puts companies in a position where they can start
campaigns to win back customers at the right point in time, for
instance, in order to increase sales of season tickets. This information
also allows transportation companies to improve timetables and thus
to act in a more customer-focused way using the season ticket
figures.
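
The idea of forecasting from the same months of the previous year can be sketched as a simple seasonal-naive forecast. This is only an illustration: the sales figures and the growth factor below are made-up values, and the function name is hypothetical.

```python
def seasonal_naive_forecast(last_year_sales, growth=1.0):
    """Forecast each month as last year's same-month figure times a growth factor."""
    return {month: round(units * growth) for month, units in last_year_sales.items()}


# Hypothetical season-ticket sales for three months of the previous year
last_year = {"Jan": 1200, "Feb": 1100, "Mar": 1500}

# Assume sales grow by about 5% compared with the previous year
forecast = seasonal_naive_forecast(last_year, growth=1.05)
print(forecast)
```

A real system would combine many more inputs (weather, holidays, events), but the principle of starting from comparable historic periods is the same.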

3. Data Insight Lab data analyses for public transportation:
If it’s a matter of big data analytics, transportation and logistics experts
from Lufthansa Industry Solutions work together with the in-house Data
Insight Lab. The data scientists and data architects from this
competence center compile, structure and analyze existing data on
the basis of use cases. With the help of this data, they then evaluate
its potential for a company. In addition, the experts assist a company
on its way to becoming a data-driven company 4.0 during the
subsequent implementation of big data projects.

Using Big Data for Financial Fraud Prevention
Although technology has made banking more convenient for
customers, it has also opened up new avenues for fraud.
Financial fraud statistics show that account fraud, credit card
fraud, insurance fraud, scams, and other fraudulent acts cause
millions of dollars in damages to institutions and consumers every
year.

Financial fraud detection is essential for minimizing risk for
institutions. Scammers can easily drain individual accounts or run
up tens of thousands of dollars on credit cards. Worse yet,
organized crime rings can execute elaborate schemes and steal
millions of dollars.

Big data fraud detection is a cutting-edge way to use consumer
trends to detect and prevent suspicious activity. Even subtle
differences in a consumer’s purchases or credit activity can be
automatically analyzed and flagged as potential fraud. Using
data analytics to detect fraud requires expert knowledge and
computer resources, but is easier than ever, due to improvements
in programming languages and server technology.

How Does Data Mining Work?
Data mining is the science of automatically detecting patterns in
a given set of data. It requires significant amounts of computing
power and careful data management using advanced
technology like data lakes and cloud computing. Any data
analysis program requires complex programming languages, but
data mining that is robust enough to support subsequent machine
learning must be carefully coded to prevent errors in pattern
detection.

Data mining for fraud prevention relies on pattern analysis to
find outliers or suspicious trends. In financial services and many
other industries, one of the best sources of data is big data. This
data contains information like customer zip codes, travel
patterns, income levels, age, and other demographic factors that
influence customers’ financial decisions and purchases.
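
As a minimal sketch of finding outliers through pattern analysis, the example below flags purchase amounts that lie far from a customer's usual spending using a z-score. The amounts, the threshold and the function name are illustrative assumptions; real fraud systems use far richer features and models.

```python
from statistics import mean, stdev


def flag_outliers(amounts, threshold=2.0):
    """Return the amounts whose z-score exceeds the threshold."""
    mu = mean(amounts)
    sigma = stdev(amounts)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [a for a in amounts if abs(a - mu) / sigma > threshold]


# Hypothetical purchase history for one customer (in dollars)
purchases = [42, 38, 51, 45, 40, 47, 39, 44, 43, 950]
suspicious = flag_outliers(purchases)
print(suspicious)
```

The single very large purchase stands out against the otherwise steady pattern and would be passed to a human investigator for review.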

All of these datasets and accompanying machine learning
processes are subject to continuous review, testing, and
feedback from humans. When a system flags a false positive, the
person who investigates that false positive can then teach the
system why it was incorrect. Experts in financial fraud detection
apply this new knowledge and understanding to future fraud
data analysis as well.

Accuracy of Fraud Prevention
Suspicious large transactions or blatant fraud can be detected
without data mining. For example, if a customer uses their credit
card at a store, and then an hour later appears to use their
credit card at a store on the other side of the country, the
provider can freeze that account.
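
The rule described above can be sketched as an "impossible travel" check: if two card uses are too far apart for the time between them, the account is flagged. The coordinates, the speed limit of 900 km/h and all names below are assumptions chosen for illustration.

```python
from math import radians, sin, cos, asin, sqrt


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))


def is_impossible_travel(tx1, tx2, max_speed_kmh=900):
    """True if the card would have had to move faster than max_speed_kmh."""
    hours = abs(tx2["hour"] - tx1["hour"])
    km = distance_km(tx1["lat"], tx1["lon"], tx2["lat"], tx2["lon"])
    if hours == 0:
        return km > 0
    return km / hours > max_speed_kmh


# Hypothetical transactions: New York, then Los Angeles one hour later
new_york = {"lat": 40.71, "lon": -74.01, "hour": 10}
los_angeles = {"lat": 34.05, "lon": -118.24, "hour": 11}
print(is_impossible_travel(new_york, los_angeles))
```

A fixed rule like this needs no data mining, which is why such blatant cases are the easiest to catch.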

However, data mining makes it possible to detect other, more
subtle signs of fraud with high levels of accuracy. Customer
information can be analyzed to predict general trends and spot
fraudulent transactions before a customer even knows that their
card or account has been compromised.

Improvements in technology and machine learning processes
have resulted in fewer legitimate transactions being wrongly
flagged as fraudulent.



ENABLING TECHNOLOGIES

VIRTUALISATION
How does Hardware Virtualization Work?

Hardware virtualization has gained popularity in server
platforms. The basic idea behind hardware virtualization is to
consolidate numerous small physical servers into a single large
physical server so that the processor is used effectively. The
operating system that ran on each physical server is converted
into an OS that runs inside a virtual machine.

Hence, hardware virtualization means embedding virtual
machine software into a server’s hardware component. This
software goes by different names, with virtual machine monitor
and hypervisor being the most common. The hypervisor controls the
memory, processor and other components and allows different
operating systems to run on the machine without changes to their
source code.

Advantages of Hardware
Virtualization
Hardware virtualization has many advantages. The main advantage is that it is
much easier to control a virtual machine than a physical server.
Operating systems running on the machine appear to have their
own memory and processor. Hardware virtualization can
increase the scalability of your business while also reducing
expenses at the same time.

It can reduce downtime costs that are otherwise incurred in
terms of money losses and recovery time in times of disaster
affecting a physical server. A virtual machine can be easily
cloned, thus making the environment more resilient. It also
increases your team’s productivity, because less time is spent on
physical hardware monitoring and maintenance.

Types of Hardware Virtualization


There are three kinds of hardware virtualization:

Full Virtualization: The hardware architecture is fully
simulated. No modification is required by guest software for
running applications.

Para-Virtualization: The hardware is not simulated; instead, the
guest software is modified to run in its own isolated domain.

Emulation Virtualization: In emulation virtualization, the virtual
machine is independent. It simulates the hardware and there
is no modification required by the guest operating system.

Enabling Hardware Virtualization
Coming to the main point of the article, let’s look at how you can
enable hardware virtualization on your computer system’s BIOS.
Every PC manufacturer requires different steps for entering the
BIOS and making this change. The following are the steps that
you need to take when enabling hardware assisted virtualization.

1. Check if your system supports hardware virtualization:


By Task Manager

Open Task Manager by pressing Ctrl+Shift+Esc and look at the CPU
details on the Performance tab. If your processor supports hardware
virtualization, you will see Virtualization listed as Enabled or
Disabled alongside the other details. If it does not support
virtualization, you will not see Hyper-V or virtualization mentioned
in Task Manager.

By Command Prompt

Open your command prompt by first using Windows Key + R to
open the run box. Type cmd in it and hit Enter. In your command
prompt, type the command “systeminfo” and press Enter. This
command displays details about your system, including
support for hardware virtualization.

If your processor supports Hardware Virtualization technology,
you will be able to see a section of Hyper-V requirements along
with the status. If virtualization is turned off, you will see “No” in
front of the option “Virtualization Enabled in Firmware”. This
means that virtualization is currently disabled in the BIOS, not
that your system lacks support for it.
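
The check can also be scripted. The sketch below reads the Hyper-V Requirements section out of text captured from the systeminfo command. The sample text is an assumption that mirrors the typical layout of that section, and the function name is hypothetical; on some systems the section looks different (for example when a hypervisor is already running).

```python
def parse_hyperv_requirements(systeminfo_text):
    """Collect the 'name: Yes/No' pairs from the Hyper-V Requirements section."""
    results = {}
    in_section = False
    for line in systeminfo_text.splitlines():
        if line.startswith("Hyper-V Requirements:"):
            in_section = True
            line = line[len("Hyper-V Requirements:"):]
        elif in_section and not line.startswith(" "):
            break  # a new top-level field ends the section
        if in_section and ":" in line:
            name, _, value = line.strip().rpartition(":")
            results[name.strip()] = value.strip().lower() == "yes"
    return results


# Assumed sample of systeminfo output (illustrative, not captured from a real PC)
sample = """OS Name:                   Microsoft Windows 10 Pro
Hyper-V Requirements:      VM Monitor Mode Extensions: Yes
                           Virtualization Enabled In Firmware: No
                           Second Level Address Translation: Yes
                           Data Execution Prevention Available: Yes
"""

status = parse_hyperv_requirements(sample)
print(status["Virtualization Enabled In Firmware"])
```

On a Windows machine the text could be obtained with Python's subprocess module by running systeminfo and passing its standard output to this function.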

2. Reboot your Computer and Press the BIOS Key:

If your system supports hardware virtualization, it’s time for you to
reboot it and open its BIOS. The key for BIOS can vary according
to the manufacturer of the BIOS. Most often, it is one of these
keys: Esc, Del, F1, F2 or F4.

When your screen goes black while rebooting, tap the BIOS key
repeatedly, at least twice every second. This is
important so that you don’t miss the BIOS while rebooting. If the
key you used did not work, try rebooting your computer and repeat
the process with another key.

3. Locate the section for CPU configuration:

Once you have entered the BIOS, you need to find the section for
configuring your CPU. Depending upon your system, you will need
to look for a menu called CPU configuration, processor,
Northbridge or Chipset. You may arrive at this menu by clicking
on an option such as “Advanced” or “Advanced Mode”.

4. Find the Settings for Virtualization:

After finding the CPU configuration section, you need to find the
menu or option where it allows you to enable hardware
virtualization. Hardware virtualization is enabled in the
acceleration section. Depending upon your PC, look for any of
these or similar names such as Hyper-V, Vanderpool, SVM, AMD-V,
Intel Virtualization Technology or VT-x.

5. Select the Option for Enabling Virtualization:

When you reach the hardware virtualization enabling menu, it
might ask you to choose the enabling option from a checklist or
a drop-down menu. In either case, select the “Enabled” option. If
you see the options of AMD IOMMU or Intel VT-d, enable them as
well.

6. Save the Changes You Have Made:

After selecting the enabling virtualization option, look for the
option that allows you to save these changes. Before saving it,
you may have to first exit the menu and then click the save
changes option. Now you have successfully enabled hardware
virtualization on your computer.

7. Exit Your BIOS and Reboot Your Computer:

Once you have saved the settings of virtualization enablement,
you have to exit the BIOS. The computer will now restart
with hardware virtualization enabled.

Virtual Machine
There are numerous YouTube-style videos explaining how to set
up and use VMs.

Reasons for Using Virtualisation
Resource Optimization

Virtualization is becoming a core business strategy from desktop
to data center. It can provide an isolated execution environment
for applications and support the sharing and reuse of hardware
resources. Cost reduction and simpler management are the
prominent advantages. It can also help to deliver high-priority
business services more quickly and enhance energy efficiency.
Optimizing the various resources with virtualization helps to
improve organizational efficiency. This covers the optimization
of desktops, servers, networks and storage.
Optimization is the process of making something more efficient
or optimal. Here, optimization means virtualizing storage and
servers and moving towards on-demand computing.

Desktop optimization enables organizations to centrally define
virtual machines (VMs) and assign these VMs to authenticated
users to run on their PCs. It offers temporary access to a
corporate desktop, and provides higher security for corporate
applications and data.


Virtualization helps you to optimize your hardware utilization.
The advantages it provides include:
Consolidation:
Consolidation is the initial consideration for virtualizing your
environment. The extent of consolidation will depend on the needs of
your business, your applications, and the type of servers you choose.
Machines hosting virtual servers can cover a range of capabilities,
depending on how many physical cores and how much processing power
they have.

Within a virtual environment, one or more virtual central processing
units (vCPU) are assigned to every Virtual Machine (VM). Each vCPU is
seen as a single physical CPU core by the VM’s operating system. The
number of virtual processors available is determined by the number of
cores available on the hardware. Four to eight vCPUs can usually be
allocated to each physical core to accommodate varying workloads.
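
The arithmetic behind this can be sketched as follows. The host size (16 cores) and the VM size (2 vCPUs) are assumed example values; only the four-to-eight ratio comes from the text above.

```python
def vcpu_capacity(physical_cores, vcpus_per_core):
    """Total vCPUs a host can offer at a given allocation ratio."""
    return physical_cores * vcpus_per_core


def max_vms(physical_cores, vcpus_per_core, vcpus_per_vm):
    """How many VMs of a given size fit on the host."""
    return vcpu_capacity(physical_cores, vcpus_per_core) // vcpus_per_vm


# Hypothetical host with 16 physical cores, VMs with 2 vCPUs each
for ratio in (4, 8):
    print(ratio, vcpu_capacity(16, ratio), max_vms(16, ratio, 2))
```

In practice memory and workload intensity also limit how many VMs a host can carry, so these figures are an upper bound rather than a guarantee.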

Server Quantity Reduction


Reducing the number of servers in your organization while
accomplishing the same processing tasks results in a more efficient
utilization of resources. This is accomplished by:

Reducing idle time in machines - Servers are optimized temporally,
utilizing their resources more fully. For example, if one application is
not running, the same host processor can be used to run another
application.

More space available on a server - each server is configured to
optimize its space usage, reducing extra unused server areas. Since
virtualized applications run on their own separate virtual operating
systems, they can work independently of each other. The limiting
factor becomes memory and processing power.

Saving Physical Space
The benefits of reducing IT hardware are apparent with regard to
office space-saving considerations. Fewer servers translate into
increased floor space, which can be re-purposed for functions such
as production space, meeting rooms, or workspaces. If enough
footprint is reduced, downsizing the facility could also be an option.
How much space you open up will depend on the extent of
virtualization and the capacity of the equipment you use.

Re-Allocation or Reduction of IT
Resources
Maintaining IT hardware requires qualified staff. Issues such as
overheating, part replacement and upgrades, firmware and OS
updates, and network security require constant attention to keep the
system reliably up and running. With fewer servers to maintain, less
time is needed for server maintenance from IT staff, freeing them to
attend to other tasks, or perhaps requiring fewer people on staff.

Maximizes Uptime
Agility is all about being able to respond to changing requirements as
quickly and flexibly as possible. Virtualization brings new
opportunities to data center administration, allowing users to enjoy:
•Guaranteed uptime of servers and applications; speedy disaster
recovery if large scale failures do occur.
•Instant deployment of new virtual machines or even aggregated
pools of virtual machines via template images.
•Elasticity, that is, resource provisioning when and where required
instead of keeping the entire data center in an always-on state.
•Reconfiguration of running computing environments without
impacting the users.
What is redundancy in virtualization?
Redundancy, in the sense of running the same application on multiple
servers, is a safety measure: if for any reason a server fails, another
server running the same application takes over, thereby minimizing
the interruption in service.
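
A minimal sketch of this failover idea is shown below. The two server functions are hypothetical stand-ins for replicas of the same application; a real setup would use health checks and a load balancer rather than a simple loop.

```python
def handle_request(request, servers):
    """Try each replica in order and return the first successful response."""
    for server in servers:
        try:
            return server(request)
        except ConnectionError:
            continue  # this replica is down; fail over to the next one
    raise RuntimeError("all replicas failed")


def failed_server(request):
    """Stand-in for a replica that has crashed."""
    raise ConnectionError("server down")


def healthy_server(request):
    """Stand-in for a replica that is running normally."""
    return f"handled {request}"


print(handle_request("ping", [failed_server, healthy_server]))
```

The user still gets a response even though the first server failed, which is the point of running the same application on more than one server.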

More info;
Redundancy in a Virtual Environment White Paper

Virtualization: Flexibility and Scalability
The growing trend of virtualization has provided many benefits
to a wide variety of enterprises and organizations. A primary
advantage of virtualization is that it allows better utilization of
resources.

Sharing servers and other resources is one way that virtualization
optimizes IT resources. It provides flexibility that allows closer
alignment with an organization’s needs for computing, storage
and/or database systems.

Changing needs can be accommodated easily by allocating
resources to applications with heavier or lighter loads. For
example, if one application grows quickly while another is
underutilized, virtualization allows fast and easy scaling to meet
the requirements of the growing application while allocating fewer
resources to the diminishing one. The ability to reallocate
resources enables businesses to grow and meet computing needs
without a large investment in new equipment, licenses, and IT
manpower.

Two Types of Scalability
1. Horizontal Scalability
Horizontal Scalability, also called scaling out, involves adding the
hardware and software required to accommodate the workload in
your network. Horizontal Scalability is necessary if you are adding
new applications or increasing data volume in your environment. In
this case, more hardware and storage space would be needed.

2. Vertical Scalability
Vertical Scalability, also called scaling up, involves growing and
re-allocating resources such as memory,
bandwidth, and CPU cores in your network. In this case, you can scale
up vertically by increasing these and other resources for the existing
applications when they require it. In some cases, vertical scaling will
require adding RAM or other hardware or firmware, and in other
cases a simple reallocation of the existing configuration will suffice.
The addition of resources to a Virtual Machine can be set up through
the hypervisor's management system.
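
The difference between the two types of scalability can be sketched with a toy model. The VM class and both functions are hypothetical and do not correspond to any particular hypervisor's management interface.

```python
from dataclasses import dataclass


@dataclass
class VM:
    vcpus: int
    ram_gb: int


def scale_up(vm, extra_vcpus=0, extra_ram_gb=0):
    """Vertical scaling: give an existing VM more resources."""
    return VM(vm.vcpus + extra_vcpus, vm.ram_gb + extra_ram_gb)


def scale_out(pool, count, template):
    """Horizontal scaling: add more VMs built from the same template."""
    return pool + [VM(template.vcpus, template.ram_gb) for _ in range(count)]


web_vm = VM(vcpus=2, ram_gb=4)
bigger_vm = scale_up(web_vm, extra_vcpus=2, extra_ram_gb=4)  # one larger machine
larger_pool = scale_out([web_vm], count=2, template=web_vm)  # three equal machines
print(bigger_vm, len(larger_pool))
```

Scaling up changes the size of one machine, while scaling out changes the number of machines sharing the workload.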

How does technology help
investment?
Data systems allow investors and developers to have a clear
understanding of the market in minute detail. This advance in
technology will help investors with illiquid assets make the most
of their investment, by providing both clarity and speed to the
decision-making process.
Top 9 enabling technologies that support
emerging business trends
1. Internet of Things (IoT), sensors and
biometrics
The IoT has had an enormous impact on every industry, creating
a “design platform” that enables the development of a variety of
applications.

Key business trends enabled by IoT, sensors and biometrics
include: automation and “Smart Everything”, empowered
consumers, on-demand logistics and services, traceability and
sustainability.

2. Artificial intelligence (A.I.)


A.I. describes advanced, smart computing techniques used to
analyse complex problems and data. It helps define patterns in
data and provide predictive analytics.

A.I. enables the automation/Smart Everything trend, powering
new applications for autonomous robotics, creating new ways of
engaging with today’s empowered consumers, and solving ever
new challenges in real-time.

3. Open, structured and linked data
Almost any useful business-to-business or business-to-consumer
application needs data from multiple sources. Integrating all this
information is extremely difficult, especially if it’s unstructured.

The more data can be open, structured and linked, the stronger
the impact it will have in enabling greater interoperability,
especially in business trends such as empowered consumers,
traceability, and automation/Smart Everything.

4. Autonomous logistics
In the same way that self-driving cars are disrupting personal
transportation, there is a surge of applications that are taking
advantage of autonomous systems for logistics.

Robotics and A.I. are other technologies that are contributing to the
advancement of autonomous logistics, which is a key enabler for
business trends such as on-demand logistics and automation/Smart
Everything.

5. Blockchain and distributed data


The interest in blockchain has expanded across industry as a
way to share data and information across a large number of
participants, while offering the possibility of greater data and
transactional security.

Blockchain offers new capabilities and is helping to re-ignite
interest in other approaches to managing distributed data, such
as edge computing and distributed data warehouses.

This technology is being rapidly evaluated today as a potential
enabler for traceability, especially in food safety applications.
6. Computer vision
While early advances in computer vision focused exclusively on
image recognition, the field has expanded. Vision systems can
now observe environments and make decisions and conclusions
to support a variety of applications, especially to aid in product
quality control in the warehouse.

This technology is enabling many business trends, most notably
automation/Smart Everything—and is creating efficiency and
speed in support of on-demand logistics and services.

7. Voice recognition
Voice recognition and natural language processing have
advanced—and are driving the adoption of personal assistant
devices.

This new “conversational commerce” is emerging as a hot new
trend that is impacting brands, companies, and marketplaces.
These players are increasingly connecting with consumers
through apps and voice to improve product research, answer
questions and simplify purchases.

This technology enabler will have the biggest impact on the
trends: empowered consumer and automation/Smart Everything.

8. Robotics
Robotic systems take on many forms, whether carrying out
actions autonomously or semi-autonomously or acting in concert
with other robots or people for more complex tasks.

A new trend in robotics is “collaborative robots” (also referred to
as cobots or co-robots) in which robots are interacting with
people in warehouses and manufacturing settings.

Robotics is a key enabler in the automation/Smart Everything
trend. It is also assisting in the scaling of mass customisation.

9. Augmented, virtual (AR/VR) and mixed reality
The ability to superimpose digital images and information onto
the real world using mobile phones, displays and wearable
headsets is helping to improve accuracy and efficiency in
industrial and commercial settings.

These systems will have a big impact in driving new advances in
the automation/Smart Everything and empowered consumer
business trends.

What is a legacy system?
The definition of a legacy system is an obsolete computer system,
programming language, software application, process, or
technology that can no longer be easily maintained, updated, or
replaced.

This does not mean that the legacy system is unusable. Many
organizations or companies still find these systems essential to
their daily work. It is up to you whether to upgrade or
replace it.

Here are five common issues to consider:

1. May not run on modern hardware
Some legacy tools can’t work well with more modern computers,
so you are stuck managing an old system that sometimes badly
underperforms. Having to maintain old hardware and
operating systems can be pricey. Yes, virtualization and emulation
solutions do exist, but they come at a cost.

2. Lack of sufficient skillsets to maintain


As time passes, people with knowledge about how to manage or
enhance the application move on. Finding someone who can
work with an old system can be challenging. You need to re-train
your staff on how the legacy system works, which can
increase your company’s operating costs. A lot of time and effort
is required just to keep the system operational, let alone
enhance it.

3. Compatibility and security vulnerability

Obsolete technology often does not work well with other systems, and
it is hard to integrate or keep compatible with modern computers.
Security breaches are also a main concern with older
systems. A lack of security in your legacy hardware and other
systems poses a significant vulnerability that is difficult to resolve.
4. Amount of data and users limitation
The amount of data and users continuously increases as your
company grows each day. Your legacy application may not have
the capacity to store the amount of information you have
captured over time. Furthermore, it also may not be efficient
enough to handle new users that need access.

5. Outdated user interface

How we interact with software has changed drastically. Your
legacy system may not be able to meet current expectations of
user experience and usability. Poor user
experience can reduce productivity, even when the core tools
behind the interface are working correctly.

If any of these problems exist with your company’s legacy software,
it’s probably time to find a way to upgrade or replace the
system.

Here are five reasons why legacy systems are still used in some
companies

1. Works well enough
Many legacy systems are doing their job just fine. The proverb “if
it ain’t broke, don’t fix it” applies even when you need to run your
legacy application on a mainframe instead of cloud computing.

2. Migrating existing legacy data to a new system isn’t easy

The way data is stored and retrieved is changing. An older
system may use a different algorithm, and some are no longer
common. Converting all of the data into an entirely different form
to be stored in a modern computing system is quite
challenging for the internal IT team.

3. Recertifying and revalidating requires considerable effort
and cost
In some industries, software that can impact safety must be
validated and certified for compliance. This software is critical
to the company or customer. Recertifying and validating the new
system may take a long time and come at a high cost.

4. Replacing the tool is too disruptive to the organization

The downtime while transitioning to new software is considered
an escalating cost to pay, because productivity decreases
while the new software is deployed and learned.

5. The cost of replacing is higher than the benefits of a new
system
There is monetary value in replacing something old with
something new. However, the reduction in cost and increase in business
should be significantly greater than the total cost of switching.

Economics
The contribution of new technology to economic growth can only
be realized when and if the new technology is widely diffused
and used. Diffusion itself results from a series of individual
decisions to begin using the new technology, decisions which are
often the result of a comparison of the uncertain benefits of the
new invention with the uncertain costs of adopting it. An
understanding of the factors affecting this choice is essential
both for economists studying the determinants of growth and for
the creators and producers of such technologies.

Why is technology adoption
important?
Adopting new technologies allows businesses to offer what no
one else is offering and to boost revenue streams while providing value to
customers. It also presents them as innovators and risk-takers in
front of their customers and investors, opening new doors to a
wider market and more investment.

How does the adoption of new
technology affect prices?
Technological advances that improve production efficiency will
shift a supply curve to the right. The cost of production goes
down, and consumers will demand more of the product at lower
prices.
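
The effect can be illustrated with a short worked example, assuming straight-line demand and supply curves. The symbols a, b, c, d and Δ are illustrative assumptions and do not come from the text above; P is price and Q is quantity.

```latex
\begin{align*}
Q_d &= a - bP, \qquad Q_s = c + dP, \qquad a > c,\; b, d > 0 \\
Q_d = Q_s \;&\Rightarrow\; P^* = \frac{a - c}{b + d}, \qquad Q^* = \frac{ad + bc}{b + d} \\
\text{Supply shifts right by } \Delta > 0:\quad Q_s' &= c + \Delta + dP \\
P' &= \frac{a - c - \Delta}{b + d} = P^* - \frac{\Delta}{b + d}, \qquad
Q' = \frac{ad + b(c + \Delta)}{b + d} = Q^* + \frac{b\,\Delta}{b + d}
\end{align*}
```

Under these assumptions the new equilibrium price is lower and the quantity sold is higher, which matches the statement above.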

Virtualisation also saves on hardware costs and allows trials before
new systems are adopted.

Scalability and Automation


A significant component in the business-design is the topic
“scaling and automation.”

Scaling means that you can change the size of a photo while
keeping its proportions. It still looks good, even if the picture gets
bigger. Your pic is not distorted.
Everyone will talk about scaling, without really having understood
the actual meaning behind it. In every incubator, the founders
say this sentence about 1000 times a day. “Scaling, scaling,
scaling.”

But the smart founders don’t talk about image processing in
Photoshop. They instead mean the following: Does your business
model still work when you suddenly sell 3000 socks a day instead
of 3? It is just one example.

Some practical examples about
scalability and automation:
1. Publishing Amazon ebooks and paperbacks
through Kindle Direct Publishing
Is this infinitely scalable? Well, yes. If you write an ebook and sell
it in self-publishing via Amazon, it makes no difference to you if you
sell only one book a day or 1000 books. The entire sales process is
scalable and automated (we’ll talk about that in a moment). You
don’t have to worry about anything except writing new books and
marketing. Even when a customer buys a paperback, you have no work at all.
Amazon prints the book on demand and sends it directly to the
customer. You don’t take any financial risk and don’t have to pay in
advance. This process is infinitely scalable.

2. Fulfillment by Amazon (FBA)


It means that you are selling a product through Amazon. You buy your
product from a wholesaler and have it shipped directly to Amazon.
Amazon stores the products in its warehouse and sends them directly
to the customer as soon as they place an order. This value chain is also
infinitely scalable. Of course, your wholesaler must be able to
produce enough products.

3. Affiliate Marketing
With Affiliate Marketing you act as an advertising partner for other
companies and products. For example, you build a blog about a
specific topic. Take for instance the niche “washing machines.” On
your blog, you regularly post product tests about washing machines
and link to the products that are available in various shops. If the
customer now clicks on your advertising banner or a text link, you can
receive a commission if the customer then buys a product via your
link. Ultimately, this business model also scales infinitely. It doesn’t
matter whether ten people click on your link or 1000. The business
model works.

WAYS OF ACHIEVING
VIRTUALISATION
As mentioned in previous slides, three basic
virtualization techniques for embedded systems are considered: full
virtualization, paravirtualization (as instances of hardware-level
virtualization), and containers (as an instance of operating-system-
level virtualization).

How do you achieve virtualization?


Server virtualization works by abstracting or isolating a computer's
hardware from all the software that might run on that hardware. This
abstraction is accomplished by a hypervisor, a specialized software
product. There are numerous hypervisors in the enterprise space,
including Microsoft Hyper-V and VMware vSphere.

Virtual machines in practical use


In hardware virtualization, physical system resources can be
distributed across multiple virtual systems. Each guest system
(including all programs running in it) is separated from the underlying
hardware.

In practice, virtual machines are mostly used to isolate certain
processes and applications for security reasons. Compared to other
virtualization concepts, VMs offer a strong encapsulation, functioning
as a basis for hosting products in which several customer servers are
operated on a common hardware platform. The provision of virtual
machines is the basis of shared hosting and VPS (virtual private
server) setups. Since each guest system runs in an isolated runtime
environment, processes encapsulated in a VM do not affect the host
system or other systems on the same physical machine.

In a business context, virtual machines are used to reduce costs for
operating and maintaining IT infrastructures. Companies often run an
extensive IT infrastructure that is idle most of the day. Virtual
machines can significantly reduce this wastage. Instead of providing
each application area of a business IT department with its own
physical machine, more and more companies are moving to running
mail, database, file, or application servers in isolated virtual
environments on the same powerful hardware platform. This concept
is implemented in the context of server consolidation, as it is usually
cheaper to maintain a large computing platform for different virtual
systems than to operate several small computers. Processors, in
particular, are still expensive to buy. In other words: unused processor
time is an unnecessary cost factor that can be avoided by switching
to virtual systems.

Another field of application for virtual environments is software


development. Programmers who develop applications for different
system architectures often use virtual machines for software testing.
Numerous hypervisor products allow the parallel operation of
different operating systems or system versions. Virtual machines can
be created, cloned, and removed from the physical hard disk space
at the touch of a button without leaving any data behind. In addition,
faulty processes within a virtual machine have no effect on the
underlying system due to encapsulation.

Home users typically use hypervisors with emulation capabilities to run


applications originally written for a different system architecture.
However, it should be noted that hardware virtualization, as well as
emulation, always goes hand in hand with performance losses. For
example, if a user wants to run a Linux program in a VM on their
Windows machine, additional resources must be spent on both the
hypervisor and the guest system. An encapsulated Linux application
like this no longer has the full performance of the underlying
hardware at its disposal. This loss is referred to as overhead.

147
Note: In information technology, IT resources like computing time,
memory, or bandwidth, which are used or lost during the execution of
a process without directly contributing to the result of the process,
are referred to as overhead.

Hardware virtualization reaches its limits with resource-intensive
workloads in particular. If multiple virtual machines are running on the
same host system, the resource requirements of one machine during
performance peaks can also affect the performance of the other
machines on the same host. This can be counteracted by
guaranteeing each virtual machine a fixed quota of hardware
resources. Make sure that the total of all virtual resources used
simultaneously never exceeds the maximum available capacity of the
physical machine.
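
The rule above (the sum of all guaranteed resources must not exceed what
the physical machine offers) can be checked with a few lines of code.
This is only a minimal sketch: the VM class, the fits_on_host function
and all the numbers are made up for illustration and do not come from
any real hypervisor.

```python
from dataclasses import dataclass


@dataclass
class VM:
    """A virtual machine with a guaranteed (fixed) resource quota."""
    name: str
    cpu_cores: int
    memory_gb: int


def fits_on_host(vms, host_cpu_cores, host_memory_gb):
    """Return True if the quotas of all VMs together fit on the host."""
    total_cpu = sum(vm.cpu_cores for vm in vms)
    total_memory = sum(vm.memory_gb for vm in vms)
    return total_cpu <= host_cpu_cores and total_memory <= host_memory_gb


# Hypothetical host with 16 cores and 64 GB of memory.
vms = [VM("mail", 4, 8), VM("database", 8, 32), VM("files", 2, 8)]
ok = fits_on_host(vms, host_cpu_cores=16, host_memory_gb=64)

# Adding a fourth VM pushes the CPU total to 18 cores, which is too much.
too_many = fits_on_host(vms + [VM("app", 4, 8)],
                        host_cpu_cores=16, host_memory_gb=64)
print(ok, too_many)
```

Real hypervisors also allow resources to be overcommitted on purpose;
the sketch only covers the strict case described in the text.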

148
An overview of the advantages and
disadvantages of virtual machines

149
Containerization
Containerization is the packaging of software code with just the
operating system (OS) libraries and dependencies required to run the
code to create a single lightweight executable—called a container
—that runs consistently on any infrastructure. More portable and
resource-efficient than virtual machines (VMs), containers have
become the de facto compute units of modern cloud-native
applications.

Containerization allows developers to create and deploy
applications faster and more securely. With traditional methods,
code is developed in a specific computing environment which, when
transferred to a new location, often results in bugs and errors. This
happens, for example, when a developer transfers code from a desktop
computer to a virtual machine (VM) or from a Linux to a Windows
operating system. Containerization eliminates this problem by bundling the
application code together with the related configuration files,
libraries, and dependencies required for it to run. This single package
of software or “container” is abstracted away from the host
operating system, and hence, it stands alone and becomes portable
—able to run across any platform or cloud, free of issues.

Application containerization
Containers encapsulate an application as a single executable
package of software that bundles application code together with all
of the related configuration files, libraries, and dependencies
required for it to run. Containerized applications are “isolated” in
that they do not bundle in a copy of the operating system. Instead, an
open source runtime engine (such as the Docker runtime engine) is
installed on the host’s operating system and becomes the conduit for
containers to share an operating system with other containers on the
same computing system.

150
Other container layers, like common bins and libraries, can also be
shared among multiple containers. This eliminates the overhead of
running an operating system within each application and makes
containers smaller in capacity and faster to start up, driving higher
server efficiencies. The isolation of applications as containers also
reduces the chance that malicious code present in one container will
impact other containers or invade the host system.

The abstraction from the host operating system makes containerized


applications portable and able to run uniformly and consistently
across any platform or cloud. Containers can be easily transported
from a desktop computer to a virtual machine (VM) or from a Linux to
a Windows operating system, and they will run consistently on
virtualized infrastructures or on traditional “bare metal” servers, either
on-premise or in the cloud. This ensures that software developers can
continue using the tools and processes they are most comfortable
with.

One can see why enterprises are rapidly adopting containerization


as a superior approach to application development and
management. Containerization allows developers to create and
deploy applications faster and more securely, whether the
application is a traditional monolith (a single-tiered software
application) or a modular microservice (a collection of loosely
coupled services). New cloud-based applications can be built from
the ground up as containerized microservices, breaking a complex
application into a series of smaller specialized and manageable
services. Existing applications can be repackaged into containers (or
containerized microservices) that use compute resources more
efficiently.

151
Benefits in detail
Containerization offers significant benefits to developers and
development teams. Among these are the following:

Portability: A container creates an executable package of software


that is abstracted away from (not tied to or dependent upon) the
host operating system, and hence, is portable and able to run
uniformly and consistently across any platform or cloud.

Agility: The open source Docker Engine for running containers started
the industry standard for containers with simple developer tools and
a universal packaging approach that works on both Linux and
Windows operating systems. The container ecosystem has shifted to
engines managed by the Open Container Initiative (OCI). Software
developers can continue using agile or DevOps tools and processes
for rapid application development and enhancement.

Speed: Containers are often referred to as “lightweight,” meaning


they share the machine’s operating system (OS) kernel and are not
bogged down with this extra overhead. Not only does this drive
higher server efficiencies, it also reduces server and licensing costs
while speeding up start-times as there is no operating system to boot.

Fault isolation: Each containerized application is isolated and


operates independently of others. The failure of one container does
not affect the continued operation of any other containers.
Development teams can identify and correct any technical issues
within one container without any downtime in other containers. Also,
the container engine can leverage any OS security isolation
techniques—such as SELinux access control—to isolate faults within
containers.

152
Efficiency: Software running in containerized environments shares the
machine’s OS kernel, and application layers within a container can
be shared across containers. Thus, containers are inherently smaller in
capacity than a VM and require less start-up time, allowing far more
containers to run on the same compute capacity as a single VM. This
drives higher server efficiencies, reducing server and licensing costs.

Ease of management: A container orchestration platform automates


the installation, scaling, and management of containerized workloads
and services. Container orchestration platforms can ease
management tasks such as scaling containerized apps, rolling out
new versions of apps, and providing monitoring, logging and
debugging, among other functions. Kubernetes, perhaps the most
popular container orchestration system available, is an open source
technology (originally open-sourced by Google, based on their
internal project called Borg) that automates the operation of Linux
containers. Kubernetes works with many container engines, such as
Docker, but it also works with any container system that conforms to
the Open Container Initiative (OCI) standards for container image
formats and runtimes.

Security: The isolation of applications as containers inherently
helps prevent malicious code from affecting other
containers or the host system. Additionally, security permissions can
be defined to automatically block unwanted components from
entering containers or limit communications with unnecessary
resources.
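
The efficiency benefit listed above comes down to simple arithmetic:
every VM carries its own guest operating system, while containers share
one kernel. The sketch below compares how many copies of a small
application fit on one host. All figures (64 GB host, 0.5 GB per
application, 1.5 GB per guest OS, 2 GB for the shared host OS and
container engine) are invented for illustration, not measurements.

```python
def instances_per_host(host_memory_gb, app_memory_gb,
                       per_instance_overhead_gb, shared_overhead_gb=0.0):
    """How many application instances fit into the host's memory.

    per_instance_overhead_gb: memory each instance needs on top of the
        application itself (for a VM: its own guest operating system).
    shared_overhead_gb: memory paid only once for all instances
        (for containers: the shared host OS and container engine).
    """
    usable = host_memory_gb - shared_overhead_gb
    return int(usable // (app_memory_gb + per_instance_overhead_gb))


HOST_GB = 64   # assumed host size
APP_GB = 0.5   # assumed size of one application instance

# Each VM boots its own guest OS (assumed 1.5 GB per VM).
vm_count = instances_per_host(HOST_GB, APP_GB, per_instance_overhead_gb=1.5)

# Containers share one kernel (assumed 2 GB, paid once).
container_count = instances_per_host(HOST_GB, APP_GB,
                                     per_instance_overhead_gb=0.0,
                                     shared_overhead_gb=2.0)
print(vm_count, container_count)
```

With these assumed numbers the host holds about four times as many
containers as VMs. The real ratio depends entirely on the workload.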

Resources
There are numerous YouTube type videos explaining how to set up and use containers. e.g.
[Link]

There is scope here for some practical work if suitable hardware/software is available. (e.g.
using Docker, Linux Containers or similar applications).

Resources
Docker, blog, video and download.
[Link]

Linux containers.
[Link]

153
DISTRIBUTED SYSTEMS

What Are Distributed Systems?


A distributed system is a computing environment in which various
components are spread across multiple computers (or other
computing devices) on a network. These devices split up the
work, coordinating their efforts to complete the job more
efficiently than if a single device had been responsible for the
task.

Distributed systems are an important development for IT and


computer science as an increasing number of related jobs are so
massive and complex that it would be impossible for a single
computer to handle them alone. But distributed computing also
offers additional advantages over traditional computing
environments. Distributed systems reduce the risks involved with
having a single point of failure, bolstering reliability and fault
tolerance. Modern distributed systems are generally designed to
be scalable in near real-time; also, you can spin up additional
computing resources on the fly, increasing performance and
further reducing time to completion.

154
How does a distributed system work?

Distributed systems have evolved over time, but today’s most common
implementations are largely designed to operate via the internet
and, more specifically, the cloud. A distributed system begins with a
task, such as rendering a video to create a finished product ready for
release. The web application, or distributed application, managing
this task (like a video editor on a client computer) splits the job
into pieces. In this simple example, the algorithm gives one frame
of the video to each of a dozen different computers (or nodes) to
render. Once a frame is complete, the managing
application gives the node a new frame to work on. This process
continues until the video is finished and all the pieces are put back
together. A system like this doesn’t have to stop at just 12 nodes — the
job may be distributed among hundreds or even thousands of nodes,
turning a task that might have taken days for a single computer to
complete into one that is finished in a matter of minutes.
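
The rendering example above can be sketched in code. Here a thread pool
stands in for the twelve nodes, and render_frame is a placeholder for
the real rendering work. Both names are assumptions made for this
illustration, and a real system would send the frames to other
computers over the network instead of to local threads.

```python
from concurrent.futures import ThreadPoolExecutor


def render_frame(frame_number):
    """Stand-in for the rendering work one node does for one frame."""
    return f"frame-{frame_number:04d}-rendered"


def render_video(frame_count, node_count=12):
    # The pool plays the role of the managing application: as soon as
    # a worker ("node") finishes a frame, it is handed the next one.
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        # map() returns the results in frame order even when frames
        # finish out of order. This is the "put back together" step.
        return list(pool.map(render_frame, range(frame_count)))


video = render_video(frame_count=100)
print(len(video), video[0], video[-1])
```

Raising node_count corresponds to adding more nodes: the job itself
does not change, only the number of workers sharing it.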

There are many models and architectures of distributed systems in use
today. Client-server systems, the most traditional and simple type of
distributed system, involve a multitude of networked computers that
interact with a central server for data storage, processing or another
common goal. Cell phone networks are an advanced type of
distributed system that shares workloads among handsets, switching
systems and internet-based devices. Peer-to-peer networks, in which
workloads are distributed among hundreds or thousands of computers
all running the same software, are another example of a distributed
system architecture. The most common forms of distributed systems in
the enterprise today are those that operate over the web, handing
off workloads to dozens of cloud-based virtual server instances that
are created as needed, then terminated when the task is complete.

155
What are key characteristics of
a distributed system?
Distributed systems are commonly defined by the following key
characteristics and features:

Scalability: The ability to grow as the size of the workload


increases is an essential feature of distributed systems,
accomplished by adding additional processing units or nodes to
the network as needed.

Concurrency: Distributed system components run simultaneously.
They’re also characterized by the lack of a “global clock,” so
tasks can occur out of sequence and at different rates.

Availability/fault tolerance: If one node fails, the remaining


nodes can continue to operate without disrupting the overall
computation.

Transparency: An external programmer or end user sees a


distributed system as a single computational unit rather than as its
underlying parts, allowing users to interact with a single logical
device rather than being concerned with the system’s
architecture.

Heterogeneity: In most distributed systems, the nodes and
components are asynchronous and differ in hardware,
middleware, software and operating systems. This allows the
distributed systems to be extended with the addition of new
components.

Replication: Distributed systems enable shared information and


messaging, ensuring consistency between redundant resources,
such as software or hardware components, improving fault
tolerance, reliability and accessibility.

156
Benefits, Challenges and Risks of
Distributed Systems
Distributed systems offer a number of advantages over monolithic, or
single, systems, including:

Greater flexibility: It is easier to add computing power as the


need for services grows. In most cases today, you can add servers
to a distributed system on the fly.

Reliability: A well-designed distributed system can withstand


failures in one or more of its nodes without severely impacting
performance. In a monolithic system, the entire application goes
down if the server goes down.

Enhanced speed: Heavy traffic can bog down a single server,
impacting performance for everyone. The
scalability of distributed databases and other distributed systems
makes them easier to maintain and also sustain high-performance
levels.

Geo-distribution: Distributed content delivery is both intuitive for


any internet user, and vital for global organizations.

157
What are some challenges of
distributed systems?
Distributed systems are considerably more complex than monolithic
computing environments, and raise a number of challenges around
design, operations and maintenance. These include:

Increased opportunities for failure: The more systems added to a


computing environment, the more opportunity there is for failure. If
a system is not carefully designed and a single node crashes, the
entire system can go down. While distributed systems are designed
to be fault tolerant, that fault tolerance isn’t automatic or foolproof.

Synchronization process challenges: Distributed systems work


without a global clock, requiring careful programming to ensure
that processes are properly synchronized to avoid transmission
delays that result in errors and data corruption. In a complex system
— such as a multiplayer video game — synchronization can be
challenging, especially on a public network that carries data
traffic.

Imperfect scalability: Doubling the number of nodes in a


distributed system doesn’t necessarily double performance.
Architecting an effective distributed system that maximizes
scalability is a complex undertaking that needs to take into
account load balancing, bandwidth management and other issues.

More complex security: Managing a large number of nodes in a


heterogeneous or globally distributed environment creates
numerous security challenges. A single weak link in a file system or
larger distributed system network can expose the entire system to
attack.

Increased complexity: Distributed systems are more complex to


design, manage and understand than traditional computing
environments.
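
The "imperfect scalability" challenge above can be made concrete with
Amdahl's law. The text does not name this model; it is brought in here
as one common way to estimate the effect. It assumes that a fixed
fraction of the job can run in parallel while the rest (coordination,
splitting and merging results) cannot.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Estimated speedup on `nodes` machines according to Amdahl's law.

    parallel_fraction: share of the work (0.0 to 1.0) that can be
        spread across nodes. The remainder always runs on one machine.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


# Assumed example: 90% of the job can be parallelised.
for nodes in (1, 2, 4, 8, 16):
    print(nodes, "nodes ->", round(amdahl_speedup(0.9, nodes), 2), "x")
```

Under this assumption, doubling the nodes from 8 to 16 raises the
speedup only from roughly 4.7x to 6.4x, and no number of nodes can
push it past 10x, because the serial 10% always remains.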

158
What are the risks of distributed
systems?
The challenges of distributed systems as outlined above create a
number of corresponding risks. These include:

Security: Distributed systems are as vulnerable to attack as any


other system, but their distributed nature creates a much larger
attack surface that exposes organizations to threats.

Risk of network failure: Distributed systems are beholden to


public networks in order to transmit and receive data. If one
segment of the internet becomes unavailable or overloaded,
distributed system performance may decline.

Governance and control issues: Distributed systems lack the


governability of monolithic, single-server-based systems, creating
auditing and adherence issues around global privacy laws such as
GDPR. Globally distributed environments can impose barriers to
providing certain levels of assurance and impair visibility into
where data resides.

Cost control: Unlike centralized systems, the scalability of
distributed systems allows administrators to easily add additional
capacity as needed, which can also increase costs. Pricing for
cloud-based distributed computing systems is based on usage
(such as the amount of memory and CPU time
consumed). If demand suddenly spikes, organizations
can face a massive bill.

159
How are distributed systems
used?
Some of the most common examples of distributed systems:

Telecommunications networks (including cellular networks and the


fabric of the internet)

Graphical and video-rendering systems

Scientific computing, such as protein folding and genetic research

Airline and hotel reservation systems

Multiuser video conferencing systems

Cryptocurrency processing systems (e.g. Bitcoin)

Peer-to-peer file-sharing systems (e.g. BitTorrent)

Distributed community compute systems (e.g. Folding@Home)

Multiplayer video games

Global, distributed retailers and supply chain management (e.g.


Amazon)

160
Key Components of a
Distributed System
The three basic components of a distributed system are the
primary system controller, the system data store, and the database.
In a non-clustered environment, optional components consist of user
interfaces and secondary controllers.

161
Main Components of a
Distributed System
1. Primary system controller
The primary system controller is the only controller in a distributed
system and keeps track of everything. It’s also responsible for
controlling the dispatch and management of server requests
throughout the system. The executive and mailbox services are installed
automatically on the primary system controller. In a non-clustered
environment, optional components consist of a user interface and
secondary controllers.

2. Secondary controller
The secondary controller is a process controller or a communications
controller. It’s responsible for regulating the flow of server processing
requests and managing the system’s translation load. It also governs
communication between the system and VANs (value-added networks) or
trading partners.

3. User-interface client
The user interface client is an additional element in the system that
provides users with important system information. This is not a part of
the clustered environment, and it does not operate on the same
machines as the controller. It provides functions that are necessary to
monitor and control the system.

4. System datastore
Each system has only one data store for all shared data. The data
store is usually on the disk vault, whether clustered or not. For non-
clustered systems, this can be on one machine or distributed across
several devices, but all of these computers must have access to this
datastore.

5. Database
In a distributed system, a relational database stores all data. Once the
data store locates the data, it shares it among multiple users.
Relational databases can be found in all data systems and allow
multiple users to use the same information simultaneously.
162
Distributed Database Options –
Contrast and Compare

163
Distributed Database System
To distribute this database system, we’d need to have the
database run on multiple machines at the same time. Users must be
able to talk to whichever machine they choose and should not be able
to tell that they are not talking to a single machine: if a user inserts
a record into node #1, node #3 must be able to return that record.

164
Why distribute a system?
Systems are always distributed by necessity. The truth of the
matter is — managing distributed systems is a complex topic
chock-full of pitfalls and landmines. It is a headache to deploy,
maintain and debug distributed systems, so why go there at all?

What a distributed system enables you to do is scale horizontally.


Going back to our previous example of the single database
server, the only way to handle more traffic would be to upgrade
the hardware the database is running on. This is called scaling
vertically.

Scaling vertically is all well and good while you can, but after a
certain point you will see that even the best hardware is not
sufficient for the traffic, not to mention impractical to host.

Scaling horizontally simply means adding more computers rather


than upgrading the hardware of a single one.

Horizontal scaling
becomes much cheaper
after a certain threshold

165
Vertical scaling can only bump your performance up to the latest
hardware’s capabilities. These capabilities prove to be
insufficient for technology companies with moderate to big
workloads.

The best thing about horizontal scaling is that you have no cap
on how much you can scale — whenever performance degrades
you simply add another machine, up to infinity potentially.

Easy scaling is not the only benefit you get from distributed
systems. Fault tolerance and low latency are equally
important.

Fault Tolerance — a cluster of ten machines across two data


centers is inherently more fault-tolerant than a single machine.
Even if one data center catches on fire, your application would
still work.

Low Latency: The time for a network packet to travel the world
is physically bounded by the speed of light. For example, the
shortest possible round-trip time for a request (that is, going
there and back) over a fiber-optic cable between New York and
Sydney is about 160 ms. Distributed systems allow you to have a node in
both cities, allowing traffic to hit the node that is closest to it.
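
The 160 ms figure can be checked with a rough calculation. The sketch
below assumes a straight-line distance of about 16,000 km between New
York and Sydney and a signal speed in fiber of about two thirds of the
speed of light in a vacuum. Both values are approximations chosen for
this example.

```python
SPEED_OF_LIGHT_KM_S = 299_792                    # in a vacuum
FIBER_SPEED_KM_S = SPEED_OF_LIGHT_KM_S * 2 / 3   # assumption: ~2/3 c in glass
NY_SYDNEY_KM = 16_000                            # assumption: approximate distance


def min_round_trip_ms(distance_km, speed_km_s=FIBER_SPEED_KM_S):
    """Lower bound for the round-trip time: there and back, no delays."""
    return 2 * distance_km / speed_km_s * 1000


rtt_far = min_round_trip_ms(NY_SYDNEY_KM)   # user in New York, node in Sydney
rtt_near = min_round_trip_ms(100)           # user 100 km from a local node
print(round(rtt_far), "ms versus", round(rtt_near, 1), "ms")
```

Real connections are slower than this bound because cables do not run
in a straight line and every router adds delay, which makes a nearby
node even more valuable.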

For a distributed system to work, though, you need the software


running on those machines to be specifically designed for running
on multiple computers at the same time and handling the
problems that come along with it. This turns out to be no easy
feat.

166
Scaling our database
Imagine that our web application got insanely popular. Imagine
also that our database started getting twice as many queries per
second as it can handle. Our application would immediately
start to decline in performance and this would get noticed by
our users.

Let’s work together and make our database scale to meet our
high demands.

In a typical web application you normally read information much
more frequently than you insert new information or modify old
information.

There is a way to increase read performance and that is by the


so-called Primary-Replica Replication strategy. Here, you
create two new database servers which sync up with the main
one. The catch is that you can only read from these new
instances.

Whenever you insert or modify information — you talk to the


primary database. It, in turn, asynchronously informs the replicas
of the change and they save it as well.

Congratulations, you can now execute 3x as many read queries!
Isn’t this great?

167
Pitfall
Gotcha! We immediately lost the C, which stands for Consistency, in
our relational database’s ACID guarantees. (In computer science,
ACID, short for atomicity, consistency, isolation and durability, is
a set of properties of database transactions intended to guarantee
data validity despite errors, power failures, and other mishaps.)

You see, there now exists a possibility in which we insert a new


record into the database, immediately afterwards issue a read
query for it and get nothing back, as if it didn’t exist!

Propagating the new information from the primary to the replica


does not happen instantaneously. There actually exists a time
window in which you can fetch stale information. If this were not
the case, your write performance would suffer, as it would have
to synchronously wait for the data to be propagated.
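
The stale-read window described above can be reproduced with a small
simulation. The class below is a toy model written for this example,
not how any real database implements replication. Writes go to the
primary, reads are served by the replicas in turn, and replicate()
stands in for the asynchronous propagation that normally runs in the
background.

```python
from collections import deque


class PrimaryReplicaDB:
    """Toy model of one primary with read-only replicas."""

    def __init__(self, replica_count=2):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._pending = deque()    # changes not yet sent to the replicas
        self._next_replica = 0

    def write(self, key, value):
        # Writes only touch the primary and return immediately.
        self.primary[key] = value
        self._pending.append((key, value))

    def read(self, key):
        # Reads are spread across the replicas (round robin).
        replica = self.replicas[self._next_replica]
        self._next_replica = (self._next_replica + 1) % len(self.replicas)
        return replica.get(key)

    def replicate(self):
        # Stand-in for the asynchronous propagation step.
        while self._pending:
            key, value = self._pending.popleft()
            for replica in self.replicas:
                replica[key] = value


db = PrimaryReplicaDB()
db.write("user:1", "Ada")
stale = db.read("user:1")    # the replicas have not heard of it yet
db.replicate()
fresh = db.read("user:1")    # now the record is visible
print(stale, fresh)
```

The first read returns nothing although the write has already
succeeded, which is exactly the lost consistency the text warns about.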

Distributed systems come with a handful of trade-offs. This


particular issue is one you will have to live with if you want to
adequately scale.

Continuing to Scale
Using the replica database approach, we can horizontally scale
our read traffic up to some extent. That’s great but we’ve hit a
wall with regard to our write traffic: it’s still all in one server!

We’re not left with many options here. We simply need to split our
write traffic across multiple servers, as one is not able to handle it.

168
One way is to go with a multi-primary replication strategy. There,
instead of replicas that you can only read from, you have multiple
primary nodes which support reads and writes. Unfortunately, this
gets complicated very quickly as you now have the ability to
create conflicts (e.g. inserting two records with the same ID).

Let’s go with another technique called sharding (also called
partitioning). Sharding is a method of splitting and storing a
single logical dataset in multiple databases.

With sharding you split your server into multiple smaller servers,
called shards. These shards all hold different records — you
create a rule as to what kind of records go into which shard. It is
very important to create the rule such that the data gets spread
in a uniform way.

A possible approach to this is to define ranges according to
some information about a record (e.g. users with names A-D).

This sharding key should be chosen very carefully, as the load
is not always equal based on arbitrary columns
(e.g. more people have a name starting with C than with Z).
A single shard that receives more requests than others is called
a hot spot and must be avoided.
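
The effect of the sharding rule can be seen in a short experiment. The
sketch compares the range rule from the text (first letter of the name)
with a hash-based rule, which is a common alternative that the text
does not discuss. The letter ranges, the four shards and the sample
names are all made up for illustration.

```python
import hashlib
from collections import Counter


def range_shard(name):
    """Range rule: pick the shard by the first letter of the name."""
    first = name[0].upper()
    if first <= "D":
        return 0    # A-D
    if first <= "M":
        return 1    # E-M
    if first <= "S":
        return 2    # N-S
    return 3        # T-Z


def hash_shard(name, shard_count=4):
    """Hash rule: pick the shard from a hash of the whole name."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


names = ["Carl", "Cathy", "Chris", "Clara", "Colin", "Zoe", "Anna", "Mia"]
range_load = Counter(range_shard(n) for n in names)
hash_load = Counter(hash_shard(n) for n in names)
print("range rule:", dict(range_load))
print("hash rule: ", dict(hash_load))
```

With the range rule, six of the eight sample names land on shard 0,
making it a hot spot. A hash rule tends to spread names more evenly,
but it gives up the ability to fetch a whole range such as "all names
from A to D" from a single shard.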

169
Distributed System Categories
Distributed Data Stores
Distributed Data Stores are most widely used and recognized as
Distributed Databases. Most distributed databases are NoSQL
non-relational databases, many of them limited to key-value semantics. They
provide incredible performance and scalability at the cost of
consistency or availability.

Known Scale — Apple is known to use 75,000 Apache Cassandra


nodes storing over 10 petabytes of data, back in 2015

We cannot go into discussions of distributed data stores without


first introducing the CAP Theorem.

CAP Theorem
Proven way back in 2002, the CAP theorem states that a
distributed data store cannot simultaneously be consistent,
available and partition tolerant.

170
Some quick definitions:
Consistency — What you read and write sequentially is what is
expected (remember the gotcha with the database replication
a few paragraphs ago?)
Availability — the whole system does not die — every non-failing
node always returns a response.
Partition Tolerant — The system continues to function and uphold
its consistency/availability guarantees in spite of network
partitions

In reality, partition tolerance must be a given for any distributed
data store. As is pointed out in many places, you cannot have
consistency and availability without partition tolerance.

Think about it: if you have two nodes which accept information
and their connection dies — how are they both going to be
available and simultaneously provide you with consistency? They
have no way of knowing what the other node is doing and as
such can either go offline (unavailable) or work with
stale information (inconsistent).

WHAT DO WE DO?

171
In the end you’re left to choose if you want your system to be
strongly consistent or highly available under a network partition.
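
The choice can be shown with two nodes and a broken connection. The
Node class below is a deliberately tiny model invented for this
example. In "CP" mode a node refuses writes while it cannot reach its
peer; in "AP" mode it keeps accepting writes and lets the two copies
drift apart.

```python
class Node:
    """Toy data store node that replicates every write to one peer."""

    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP"
        self.data = {}
        self.peer = None
        self.partitioned = False    # True while the peer is unreachable

    def write(self, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent by giving up availability.
                raise RuntimeError("unavailable: cannot reach peer")
            # Stay available by giving up consistency.
            self.data[key] = value
            return
        self.data[key] = value
        self.peer.data[key] = value

    def read(self, key):
        return self.data.get(key)


def make_pair(mode):
    a, b = Node(mode), Node(mode)
    a.peer, b.peer = b, a
    return a, b


def partition(a, b):
    a.partitioned = b.partitioned = True


# AP pair: the write succeeds, but the other node serves stale data.
a, b = make_pair("AP")
a.write("x", 1)
partition(a, b)
a.write("x", 2)
ap_read = b.read("x")

# CP pair: the write is refused, so both nodes still agree.
c, d = make_pair("CP")
c.write("x", 1)
partition(c, d)
try:
    c.write("x", 2)
    cp_result = "accepted"
except RuntimeError:
    cp_result = "rejected"

print(ap_read, cp_result, d.read("x"))
```

Real systems use more refined schemes, such as requiring a majority of
nodes to agree, but the underlying trade-off is the same.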

Practice shows that most applications value availability more.


You do not necessarily always need strong consistency. Even
then, that trade-off is not necessarily made because you need
the 100% availability guarantee, but rather because network
latency can be an issue when having to synchronize machines to
achieve strong consistency. These and more factors make
applications typically opt for solutions which offer high
availability.

Such databases settle for the weakest consistency model,
eventual consistency. This model guarantees that if no new updates are
made to a given item, eventually all accesses to that item will
return the latest updated value.

Those systems provide BASE properties (as opposed to


traditional databases’ ACID)

Basically Available — The system always returns a response

Soft state — The system could change over time, even during
times of no input (due to eventual consistency)

Eventual consistency — In the absence of input, the data will


spread to every node sooner or later — thus becoming consistent

172
Examples of such available distributed
databases — Cassandra, Riak, Voldemort
Cassandra
Cassandra, as mentioned above, is a distributed NoSQL
database which prefers the AP properties out of CAP, settling
for eventual consistency. I must admit this may be a bit
misleading, as Cassandra is highly configurable — you can make
it provide strong consistency at the expense of availability as
well, but that is not its common use case.

Cassandra uses consistent hashing to determine which nodes out


of your cluster must manage the data you are passing in. You set
a replication factor, which basically states to how many nodes
you want to replicate your data.
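
The idea of consistent hashing with a replication factor can be
sketched as follows. This is a simplified illustration and not
Cassandra's actual implementation: the HashRing class, the use of MD5
and the node names are assumptions made for this example, and real
systems add refinements such as virtual nodes.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a position on the ring (a large integer)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        # Every node gets a fixed position on the ring.
        self._ring = sorted((_hash(node), node) for node in nodes)
        self._positions = [position for position, _ in self._ring]

    def nodes_for(self, key):
        """Return the nodes that should store `key`.

        Walk clockwise from the key's position: the first node found
        owns the key, and the following nodes hold the extra copies.
        """
        start = bisect.bisect_right(self._positions, _hash(key))
        return [self._ring[(start + i) % len(self._ring)][1]
                for i in range(self.replication_factor)]


ring = HashRing(["node-a", "node-b", "node-c", "node-d", "node-e"],
                replication_factor=3)
owners = ring.nodes_for("user:42")
print(owners)
```

Because a key's position depends only on its hash, any machine can work
out where a record lives without asking a central coordinator.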

Regardless, in the distributed systems trade-off which enables


horizontal scaling and incredibly high throughput, Cassandra
does not provide some fundamental features of ACID databases
— namely, transactions.

Distributed Computing
Distributed computing is the key to the influx of Big Data
processing we’ve seen in recent years. It is the technique of
splitting an enormous task (e.g. aggregating 100 billion records),
which no single computer can practically execute on
its own, into many smaller tasks, each of which can fit into a
single commodity machine. You split your huge task into many
smaller ones, have them execute on many machines in parallel,
aggregate the data appropriately and you have solved your
initial problem. This approach again enables you to scale
horizontally — when you have a bigger task, simply include more
nodes in the calculation.

173
Known Scale — Folding@Home had 160k active machines in 2012

An early innovator in this space was Google, which by necessity


of their large amounts of data had to invent a new paradigm for
distributed computation — MapReduce. They published a paper
on it in 2004 and the open source community later created
Apache Hadoop based on it.
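
The MapReduce idea can be shown with the classic word-count example.
The sketch below runs on one computer and uses a thread pool in place
of a cluster. It only illustrates the map, shuffle and reduce steps and
is not how Hadoop itself is programmed. All function names are chosen
for this example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all values that belong to the same key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped independently, so this step could run on
    # as many machines as there are chunks.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


counts = word_count(["the cat sat", "the dog sat", "the end"])
print(counts)
```

A bigger input simply means more chunks and more workers; the map and
reduce functions stay the same.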

Distributed File Systems


Distributed file systems can be thought of as distributed data
stores. They’re the same thing as a concept — storing and
accessing a large amount of data across a cluster of machines
all appearing as one. They typically go hand in hand with
Distributed Computing.

Known Scale: Yahoo is known for running HDFS on over 42,000
nodes for storage of 600 petabytes of data, way back in 2011.

According to Wikipedia, the difference is that distributed file
systems allow files to be accessed using the same interfaces and
semantics as local files, not through a custom API like the
Cassandra Query Language (CQL).

Hadoop Distributed File System
(HDFS)
Hadoop Distributed File System (HDFS) is the distributed file
system used for distributed computing via the Hadoop
framework. Boasting widespread adoption, it is used to store and
replicate large files (GB or TB in size) across many machines.

Its architecture consists mainly of NameNodes and DataNodes.
NameNodes are responsible for keeping metadata about the
cluster, like which node contains which file blocks. They act as
coordinators for the network by figuring out where best to store
and replicate files and by tracking the system’s health.
DataNodes simply store files and execute commands such as
replicating a file or writing a new one.
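
The division of labour between NameNodes and DataNodes can be sketched as a toy metadata table. This is not the HDFS API: the `NameNode` class below, the tiny block size and the round-robin placement are assumptions for illustration (real HDFS uses far larger blocks and a rack-aware placement policy).

```python
class NameNode:
    """Hypothetical coordinator: stores only metadata, never file contents."""

    def __init__(self, datanodes, replication=3, block_size=4):
        self.datanodes = list(datanodes)
        self.replication = min(replication, len(self.datanodes))
        self.block_size = block_size
        self.block_map = {}   # (filename, block index) -> DataNode names
        self._next = 0        # round-robin cursor (an assumed policy)

    def add_file(self, filename, data: bytes):
        """Split a file into blocks and decide where each replica goes."""
        n_blocks = max(1, -(-len(data) // self.block_size))  # ceiling division
        for index in range(n_blocks):
            targets = [self.datanodes[(self._next + r) % len(self.datanodes)]
                       for r in range(self.replication)]
            self._next += 1
            self.block_map[(filename, index)] = targets
        return n_blocks

    def locate(self, filename, index):
        """Answer the question: which nodes hold this block?"""
        return self.block_map[(filename, index)]


namenode = NameNode(["dn1", "dn2", "dn3", "dn4"], replication=3, block_size=4)
blocks = namenode.add_file("log.txt", b"0123456789")
print(blocks, namenode.locate("log.txt", 0))
```

A computation framework can ask `locate` where a block lives and schedule the job on one of those nodes, which is the data locality idea.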

Unsurprisingly, HDFS is best used with Hadoop for computation, as
it provides data awareness to the computation jobs. These jobs
are then run on the nodes storing the data. This leverages data
locality, which optimizes computations and reduces the amount of
traffic over the network.

Interplanetary File System (IPFS)
Interplanetary File System (IPFS) is an exciting new peer-to-peer
protocol/network for a distributed file system. Built on
content addressing and Merkle DAGs (ideas it shares with Git and
blockchains), it boasts a completely decentralized architecture
with no single owner or point of failure.

IPFS offers a naming system (similar to DNS) called IPNS and lets
users easily access information. It stores files with historic
versioning, similar to how Git does. This allows access to all
of a file’s previous states.

It is still undergoing heavy development (v0.4 as of time of
writing) but has already seen projects interested in building on
top of it (Filecoin).
Distributed Messaging
Messaging systems provide a central place for storage and
propagation of messages/events inside your overall system. They
allow you to decouple your application logic from directly
talking with your other systems.

Known Scale: LinkedIn’s Kafka cluster processed 1 trillion
messages a day with peaks of 4.5 million messages a second.

Simply put, a messaging platform works in the following way:

A message is broadcast from the application which creates it
(called a producer), goes into the platform and is read by
potentially multiple applications which are interested in it
(called consumers).

If you need to save a certain event to a few places (e.g., user
creation to a database, a warehouse, an email sending service and
whatever else you can come up with), a messaging platform is the
cleanest way to spread that message.

Consumers can either pull information out of the brokers (pull
model) or have the brokers push information directly into the
consumers (push model).
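
The producer, broker and consumer roles under the pull model can be sketched in a few lines. The `Broker` and `Consumer` classes are hypothetical and in-memory only; no real messaging platform's API is being shown.

```python
class Broker:
    """Holds an append-only log; it does not track who has read what."""

    def __init__(self):
        self.log = []

    def publish(self, message):
        # Called by a producer.
        self.log.append(message)

    def read(self, offset, max_messages=10):
        # Called by consumers, which supply their own position in the log.
        return self.log[offset:offset + max_messages]


class Consumer:
    """Pull model: the consumer asks for new messages and tracks its offset."""

    def __init__(self, broker):
        self.broker = broker
        self.offset = 0

    def poll(self):
        batch = self.broker.read(self.offset)
        self.offset += len(batch)
        return batch


broker = Broker()
database_writer = Consumer(broker)
email_service = Consumer(broker)

broker.publish("user_created:alice")
broker.publish("user_created:bob")

first = database_writer.poll()
second = email_service.poll()
print(first, second)  # both consumers receive both events independently
```

The producer never needs to know which consumers exist, which is the decoupling described above.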

There are a couple of popular top-notch messaging platforms:

RabbitMQ: Message broker which allows you finer-grained
control of message trajectories via routing rules and other easily
configurable settings. Can be called a smart broker, as it has a
lot of logic in it and tightly keeps track of messages that pass
through it. Provides settings for both AP and CP from CAP. Uses a
push model for notifying the consumers.

Kafka: Message broker (and all-out platform) which is a bit
lower level, as in it does not keep track of which messages have
been read and does not allow for complex routing logic. This
helps it achieve amazing performance. In my opinion, this is the
biggest prospect in this space, with active development from the
open-source community and support from the Confluent team.
Kafka arguably has the most widespread use among top tech
companies. I wrote a thorough introduction to this, where I go
into detail about all of its goodness.

Apache ActiveMQ: The oldest of the bunch, dating from 2004.
Uses the JMS API, meaning it is geared towards Java EE
applications. It got rewritten as ActiveMQ Artemis, which
provides outstanding performance on par with Kafka.

Amazon SQS: A messaging service provided by AWS. Lets you
quickly integrate it with existing applications and eliminates the
need to handle your own infrastructure, which might be a big
benefit, as systems like Kafka are notoriously tricky to set up.
Amazon also offers two similar services, SNS and MQ, the latter
of which is basically ActiveMQ but managed by Amazon.

Data Replication in Distributed Systems
Three types of replication
1. Synchronous (aka eager) replication
2. Asynchronous (aka lazy) replication
3. Two-tier replication

Synchronous Replication
Also called eager replication
All updates are applied to all replicas (or to a majority) as part
of a single transaction (need two phase commit)
Main goal: as if there was only one copy
Maintain consistency
Maintain one-copy serializability
I.e., execution of transactions has same effect as an execution
on a non-replicated db
Transactions must acquire global locks

Synchronous Master Replication
One master for each object holds primary copy
The “Master” is also called “Primary”
To update object, transaction must acquire a lock at the
master
Lock at the master is global lock
Master propagates updates to replicas synchronously
Updates propagate as part of the same distributed
transaction
Need to run 2PC at the end
For example, using triggers

Crash Failures
What happens when a secondary crashes?
Nothing happens
When secondary recovers, it catches up

What happens when the master/primary fails?
Blocking would hurt availability
Must choose a new primary: run election
Network Failures
Network failures can cause trouble...
Secondaries think that primary failed
Secondaries elect a new primary
But primary can still be running
Now have two primaries!

Majority Consensus
To avoid this problem, only the majority partition can continue
processing at any time
In general,
Whenever a replica fails or recovers...
A set of communicating replicas must determine...
Whether they have a majority before they can continue
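
The majority rule can be written as a one-line check. This is a minimal sketch; the function name and the example cluster sizes are assumptions for illustration.

```python
def has_majority(reachable: int, cluster_size: int) -> bool:
    """A partition may continue only if it holds more than half the replicas."""
    return reachable > cluster_size // 2


# A 5-replica cluster split by a network failure into groups of 3 and 2:
print(has_majority(3, 5))  # True  -> this side keeps processing
print(has_majority(2, 5))  # False -> this side must stop

# A 4-replica cluster split evenly: neither side has a majority.
print(has_majority(2, 4))  # False
```

Since two disjoint partitions cannot both contain more than half of the replicas, at most one side can elect a primary, which rules out the two-primaries situation.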
Synchronous Replication Properties
Favours consistency over availability
Only majority partition can process requests
There appears to be a single copy of the db
High runtime overhead
Must lock and update at least majority of replicas
Two-phase commit
Runs at pace of slowest replica in quorum
So overall system is now slower
Higher deadlock rate (transactions take longer)

Asynchronous Replication
Also called lazy replication
Also called optimistic replication
Main goals: availability and performance
Approach:
One replica updated by original transaction
Updates propagate asynchronously to other replicas

Asynchronous Master Replication
One master holds primary copy
Transactions update primary copy
Master asynchronously propagates updates to replicas, which
process them in same order (e.g. through log shipping)
Ensures single-copy serializability
What happens when master/primary fails?
Can lose most recent transactions when primary fails!
After electing a new primary, secondaries must agree who is
most up-to-date
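
The risk of losing recent transactions under asynchronous master replication can be sketched with an in-memory log. The `Primary` and `Replica` classes are hypothetical; shipping the log is triggered by hand here to make the lag visible.

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0          # number of log entries applied so far

    def apply(self, entry):
        key, value = entry
        self.data[key] = value
        self.applied += 1


class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.log = []             # ordered log of committed updates
        self.secondaries = secondaries

    def write(self, key, value):
        entry = (key, value)
        self.log.append(entry)
        self.apply(entry)         # commit locally without waiting for replicas

    def ship_log(self):
        """Send each secondary the entries it has not seen, in log order."""
        for secondary in self.secondaries:
            for entry in self.log[secondary.applied:]:
                secondary.apply(entry)


secondary = Replica()
primary = Primary([secondary])

primary.write("x", 1)
primary.ship_log()        # the secondary catches up
primary.write("x", 2)     # committed on the primary only
# If the primary crashes now, the update x = 2 is lost:
print(primary.data, secondary.data)  # {'x': 1} on the secondary
```

Applying entries in log order is what keeps every replica on the same sequence of states, even though secondaries may lag behind.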

Two-Tier Replication
Benefits of lazy master and lazy group:
Each object has a master with primary copy
When disconnected from master
Secondary can only run tentative transactions
When reconnects to master
Master reprocesses all tentative transactions
Checks an acceptance criterion
If passes, we now have final commit order
Secondary undoes tentative and redoes committed
Conclusion
Replication is a very important problem
Fault-tolerance (various forms of replication)
Caching (lazy master)
Warehousing (lazy master)
Mobility (two-tier techniques)
Replication is complex, but basic techniques and trade-offs are
very well known
Synchronous or asynchronous replication
Master or quorum

HUMAN COMPUTER INTERACTION

What is Human-Computer
Interaction (HCI)?
Human-computer interaction (HCI) is a multidisciplinary field of
study focusing on the design of computer technology and, in
particular, the interaction between humans (the users) and
computers. While initially concerned with computers, HCI has
since expanded to cover almost all forms of information
technology design. Human-computer interaction (HCI) is an area
of research and practice that emerged in the early 1980s,
initially as a specialty area in computer science embracing
cognitive science and human factors engineering.

HCI has expanded rapidly and steadily for three decades,
attracting professionals from many other disciplines and
incorporating diverse concepts and approaches. To a
considerable extent, HCI now aggregates a collection of semi-
autonomous fields of research and practice in human-centered
informatics. However, the continuing synthesis of disparate
conceptions and approaches to science and practice in HCI has
produced a dramatic example of how different epistemologies
and paradigms can be reconciled and integrated in a vibrant
and productive intellectual project.

Where HCI came from
Until the late 1970s, the only humans
who interacted with computers were
information technology professionals
and dedicated hobbyists. This changed
disruptively with the emergence of
personal computing in the later 1970s.
Personal computing, including both
personal software (productivity applications, such as text
editors and spreadsheets, and interactive computer games)
and personal computer platforms (operating systems,
programming languages, and hardware), made everyone in
the world a potential computer user, and vividly highlighted
the deficiencies of computers with respect to usability for
those who wanted to use computers as tools.
The Meteoric Rise of HCI
HCI surfaced in the 1980s with the
advent of personal computing, just as
machines such as the Apple Macintosh,
IBM PC 5150 and Commodore 64
started turning up in homes and offices
in society-changing numbers. For the
first time, sophisticated electronic
systems were available to general consumers for uses such as
word processors, games units and accounting aids.
Consequently, as computers were no longer room-sized,
expensive tools exclusively built for experts in specialized
environments, the need to create human-computer interaction
that was also easy and efficient for less experienced users
became increasingly vital. From its origins, HCI would expand
to incorporate multiple disciplines, such as computer science,
cognitive science and human-factors engineering.

The UX Value of HCI and Its Related
Realms
HCI is a broad field which
overlaps with areas such as user-
centered design (UCD), user
interface (UI) design and user
experience (UX) design. In many
ways, HCI was the forerunner to
UX design.

What is User Centered Design?


User-centered design (UCD) is an iterative design process in
which designers focus on the users and their needs in each phase
of the design process. In UCD, design teams involve users
throughout the design process via a variety of research and
design techniques, to create highly usable and accessible
products for them.

What is User Interface Design?


User interface (UI) design is the process designers use to build
interfaces in software or computerized devices, focusing on looks
or style. Designers aim to create interfaces which users find easy
to use and pleasurable. UI design refers to graphical user
interfaces and other forms—e.g., voice-controlled interfaces.

What is User Experience (UX)
Design?
User experience (UX) design is
the process design teams use to
create products that provide
meaningful and relevant
experiences to users. This involves
the design of the entire process
of acquiring and integrating the product, including aspects of
branding, design, usability and function.

Despite that, some differences remain between HCI and UX
design. Practitioners of HCI tend to be more academically
focused. They're involved in scientific research and developing
empirical understandings of users. Conversely, UX designers are
almost invariably industry-focused and involved in building
products or services—e.g., smartphone apps and websites.
Regardless of this divide, the practical considerations for
products that we as UX professionals concern ourselves with have
direct links to the findings of HCI specialists about users’
mindsets. With the broader span of topics that HCI covers, UX
designers have a wealth of resources to draw from, although
much research remains suited to academic audiences. Those of
us who are designers also lack the luxury of time which HCI
specialists typically enjoy. So, we must stretch beyond our
industry-dictated constraints to access these more academic
findings. When you do that well, you can leverage key insights
into achieving the best designs for your users. By “collaborating”
in this way with the HCI world, designers can drive impactful
changes in the market and society.

A Brief History of
Human-Computer
Interaction

The most notable industries that rely
on HCI are:
Virtual and Augmented Reality, and others
Ubiquitous and Context-Sensitive Computing
Healthcare technologies
Education-based technologies
Security and cybersecurity
Voice user interfaces and speech recognition technologies

More companies around the world are implementing HCI
research and principles into their development processes, and it
is already in use by companies like Google and Nintendo.
Researchers show how technologies like the Smartwatch, 3D
printers, Voice Search Apps, and more, all apply HCI design
principles.

Components of HCI
HCI includes three intersecting components: a human, a
computer, and the interactions between them. Humans interact
with the interfaces of computers to perform various tasks. A
computer interface is the medium that enables communication
between any user and a computer. Much of HCI focuses on
interfaces.

In order to build effective interfaces, we need to first understand
the limitations and capabilities of both components. Humans and
computers have different input-output channels.

Humans:
Long-term memory
Short-term memory
Sensory memory
Visual perception
Auditory perception
Tactile perception
Speech and voice

Computers:
Text input devices
Speech recognition
Mouse / touchpad / keyboard
Eye-tracking
Display screens
Auditory displays
Printing abilities

Visual Interaction
In the field of Human-Computer Interaction (HCI), Visual
Interaction refers to the adoption of user interfaces for
interactive systems, which make use of visual elements and visual
interaction strategies with the aim of supporting perceptual
inferences instead of arduous cognitive comparisons and
computations. The design of Visual Interaction focuses on the
definition of interaction mechanisms through which (i) the users
can perform actions on the interactive system by means of visual
elements, and (ii) the system can provide feedback to the user,
by visually representing the results of the computations triggered
by the user actions. The flow of user actions and system feedback
over time then has to be coordinated.

In the field of Databases, Visual Interaction refers to the
adoption of visual interfaces that provide access to the
collection of data stored in databases, by means of visual
formalisms and mechanisms supporting...

Examples of visual interaction:
Flat screens, 3d/virtual reality screens, eye tracking,
head/hand/body tracking, virtual retinal display/retinal
projector, head-up displays

What is auditory interaction?


Auditory Interfaces are bidirectional, communicative connections
between two systems—typically a human user and a technical
product. The side toward the machine involves machine listening,
speech recognition, and dialogue systems. The side towards the
human uses auditory displays.

Examples of audio interaction:
Microphones, speakers, voice commands, natural language,
language translation, audio implants

Haptic Interaction
Haptics, what does it mean? The term “haptics” is used to
designate any form of interaction involving touch (for example,
haptic perception means recognizing objects through touch). It
also includes communicating through touch and technologies
that bring the sense of touch to users.

Simple haptic devices are common in the form of game
controllers, joysticks, and steering wheels. Haptic technology
facilitates investigation of how the human sense of touch works
by allowing the creation of controlled haptic virtual objects.

Ergonomic Aspects of Human-Computer Interaction
The development of effective interfaces to computer systems is
the fundamental objective of research on human-computer
interactions.

An interface can be defined as the sum of the hardware and
software components through which a system is operated and
users informed of its status. The hardware components include
data entry and pointing devices (e.g., keyboards, mice),
information-presentation devices (e.g., screens, loudspeakers),
and user manuals and documentation. The software components
include menu commands, icons, windows, information feedback,
navigation systems and messages and so on. An interface’s
hardware and software components may be so closely linked as
to be inseparable (e.g., function keys on keyboards). The
interface includes everything the user perceives, understands and
manipulates while interacting with the computer (Moran 1981). It
is therefore a crucial determinant of the human-machine
relation.

Research on interfaces aims at improving interface utility,
accessibility, performance and safety, and usability. For these
purposes, utility is defined with reference to the task to be
performed. A useful system contains the necessary functions for
the completion of tasks users are asked to perform (e.g., writing,
drawing, calculations, programming). Accessibility is a measure
of an interface’s ability to allow several categories of users—
particularly individuals with handicaps, and those working in
geographically isolated areas, in constant movement or having
both hands occupied—to use the system to perform their
activities. Performance, considered here from a human rather
than a technical viewpoint, is a measure of the degree to which
a system improves the efficiency with which users perform their
work. This includes the effect of macros, menu short-cuts and
intelligent software agents. The safety of a system is defined by
the extent to which an interface allows users to perform their
work free from the risk of human, equipment, data, or
environmental accidents or losses. Finally, usability is defined as
the ease with which a system is learned and used. By extension, it
also includes system utility and performance, defined above.

HCI - Interface design: Menus, Icons,
Accessibility, Windows, Pointers

DATA STORAGE
UNDERSTAND HOW DATA IS STORED IN THE CLOUD

How Cloud Storage Works


For some computer owners, finding
enough storage space to hold all the
data they've acquired is a real
challenge. Some people invest in larger
hard drives. Others prefer external
storage devices like thumb drives or
compact discs. Desperate computer owners might delete entire
folders worth of old files in order to make space for new
information. But some are choosing to rely on a growing trend:
cloud storage.

Cloud Storage Basics


You're probably familiar with several providers of cloud storage
services, though you might not think of them in that way. Here
are a few well-known companies that offer some form of cloud
storage:

Google Docs allows users to upload documents, spreadsheets
and presentations to Google's data servers. Users can edit files
using a Google application. Users can also publish documents
so that other people can read them or even make edits, which
means Google Docs is also an example of cloud computing.

Web e-mail providers like Gmail, Hotmail and Yahoo! Mail store
e-mail messages on their own servers. Users can access their
e-mail from computers and other devices connected to the
Internet.

Sites like Flickr and Picasa host millions of digital photographs.
Their users create online photo albums by uploading pictures
directly to the services' servers.

YouTube hosts millions of user-uploaded video files.

Web site hosting companies like StartLogic, Hostmonster and
GoDaddy store the files and data for client Web sites.

Social networking sites like Facebook and MySpace allow
members to post pictures and other content. All of that content
is stored on the respective site's servers.

Services like Xdrive, MediaMax and Strongspace offer storage
space for any kind of digital data.

Concerns about Cloud Storage


The two biggest concerns about cloud storage are reliability
and security. Clients aren't likely to entrust their data to another
company without a guarantee that they'll be able to access
their information whenever they want and no one else will be
able to get at it.

To secure data, most systems use a combination of techniques,
including:

Encryption, which means they use a complex algorithm to
encode information. To decode the encrypted files, a user
needs the encryption key. While it's possible to crack
encrypted information, most hackers don't have access to
the amount of computer processing power they would need
to decrypt information.

Authentication processes, which require users to create a user
name and password.

Authorization practices -- the client lists the people who are
authorized to access information stored on the cloud system.
Many corporations have multiple levels of authorization. For
example, a front-line employee might have very limited
access to data stored on a cloud system, while the head of
human resources might have extensive access to files.
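
Multiple levels of authorization, as in the front-line employee and head of human resources example, can be sketched as a role-to-permission table. The role names and permission strings below are hypothetical and not taken from any particular cloud provider.

```python
# Hypothetical mapping of roles to the actions they are allowed to perform.
PERMISSIONS = {
    "front_line": {"read:own_record"},
    "hr_head": {"read:own_record", "read:all_records", "write:all_records"},
}


def is_authorized(role: str, action: str) -> bool:
    """Allow an action only if the role's permission set contains it."""
    return action in PERMISSIONS.get(role, set())


print(is_authorized("front_line", "read:all_records"))  # False
print(is_authorized("hr_head", "read:all_records"))     # True
print(is_authorized("visitor", "read:own_record"))      # False (unknown role)
```

Unknown roles get an empty permission set, so access is denied by default.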

Even with these protective measures in place, many people
worry that data saved on a remote storage system is vulnerable.
There's always the possibility that a hacker will find an
electronic back door and access data. Hackers could also
attempt to steal the physical machines on which data are
stored. A disgruntled employee could alter or destroy data using
his or her authenticated user name and password. Cloud
storage companies invest a lot of money in security measures in
order to limit the possibility of data theft or corruption.

The other big concern, reliability, is just as important as security.
An unstable cloud storage system is a liability. No one wants to
save data to a failure-prone system, nor do they want to trust a
company that isn't financially stable. While most cloud storage
systems try to address this concern through redundancy
techniques, there's still the possibility that an entire system could
crash and leave clients with no way to access their saved data.

Cloud storage companies live and die by their reputations. It's in
each company's best interests to provide the most secure and
reliable service possible. If a company can't meet these basic
client expectations, it doesn't have much of a chance -- there
are too many other options available on the market.

Hardware requirements for Cloud services (IBM)
The hardware requirements for Cloud services are: any standard
x86 64-bit servers or Power® Linux nodes that run supported
Linux distributions.

The minimum size that is required for the /var/MCStore folder is
12 GB.

Note: For better performance, it is recommended to have a
server with at least two CPU sockets of the latest Intel variety
and at least 128 GB of memory.

A high CPU count promotes better cloud tiering throughput
because although object storage can be slow in I/O operations
per thread, object storage can support many threads. Use
sixteen or more CPUs when you select your hardware.

Cloud tiering services demand a large amount of memory, which
is why the minimum recommended memory size is 128 GB.
Memory requirements grow with the number of files: as you add
files to the cloud, you must increase the memory on your system.
For larger deployments, it is recommended that you use 10 to 20
times as much memory as is required so that the Cloud services
can cache their directory database data.

Types of cloud computing
services
Cloud computing is divided into three main service categories:
SaaS, PaaS, and IaaS. Some providers combine these services –
and others offer them independent of each other.

What is SaaS?
With SaaS (software-as-a-service), software is hosted on a
remote server and customers can access it anytime, anywhere,
from a Web browser or a standard web integration. The SaaS
provider takes care of backups, maintenance, and updates.
SaaS solutions include enterprise resource planning (ERP),
customer relationship management (CRM), project
management, and more.

What is PaaS?
Platform-as-a-service (PaaS) is a cloud-based, application
development environment that provides developers with
everything they need to build and deploy apps. With PaaS,
developers can choose the features and cloud services they
want on a subscription or pay-per-use basis.

What is IaaS?
Infrastructure-as-a-service (IaaS) lets companies “rent”
computing resources, such as servers, networks, storage, and
operating systems, on a pay-per-use basis. The infrastructure
scales – and customers don’t have to invest in the hardware.

Cloud security
Is the cloud actually secure? The degree of security in the cloud
depends on how it was deployed and the cloud service
provider’s capabilities. But it has been shown that in most cases,
the cloud provides more security than on-premise installations.
There are several reasons for this:

Location of data: An on-premise deployment will mean your
data is in your facility. It is worth noting that the first step in
someone stealing your data is knowing where it is located. The
large cloud providers have many servers in various locations, so
it is difficult for anyone to identify where data is located.

Security: With an on-premise solution, your staff maintains all
security procedures and software updates. Just recently, a large,
well-known insurance company had a security breach, and it
was found that the IT department had not installed security
updates for many months. With a reputable cloud provider,
companies have full-time professional security experts to keep
the data safe.

Backup: In a traditional legacy application implementation in
your facility, you are responsible for backing up your valuable
information on a regular basis. If your company does this, it is
still necessary to have current copies stored off-site.

What is encryption?
In its most basic form, encryption is the process of encoding
data, making it unintelligible and scrambled. In a lot of cases,
encrypted data is also paired with an encryption key, and only
those that possess the key will be able to open it.

An encryption key is a string of bits, usually generated at
random so that it is effectively unique. An encryption algorithm
uses the key to scramble and unscramble data, essentially
unlocking the information and turning it back into readable
data.

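Usually, the person that is encrypting the data will possess the
key that locks the data and will make 'copies' and pass them on
to relevant people that require access. Sharing one secret key in
this way is called symmetric-key cryptography; public-key
cryptography instead gives each party a pair of keys, one public
and one private.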

Computer, or at least machine, cryptography, of which encryption
is a form, became significant during the Second World War, with
military forces across Europe tasked with breaking Germany's
Enigma code.

How does encryption work?


In practice, when you send a message using an encrypted
messaging service (WhatsApp for example), the service wraps
the message in code, scrambling it and creating an encryption
key. It can then only be unlocked by the recipient of the
message.

Digital encryption is mathematically complex, which is why it is
considered difficult to crack. To bolster that protection, a new
set of encryption keys is generated each time two smartphones
begin communicating with one another.

You might have heard of end-to-end encryption; perhaps you've
received a notification on WhatsApp saying that it now
supports this type of encryption.

End-to-end encryption refers to the process of encoding and
scrambling some information so only the sender and receiver
can see it.

With end-to-end encryption, only the sender and recipient are
able to unlock and read the information. With WhatsApp, the
messages are passed through a server, but the server is not able
to read them.

What about other methods?


There are two main methods of encryption: symmetric and
asymmetric. It is worth noting that within each of these there
are various encryption algorithms that are used to keep messages
private.

Symmetric encryption is the process of using the same key (two
keys which are identical) for both encrypting and decrypting
data.

This will mean two or more parties will have access to the same
key, which for some is a big drawback, even though the
mathematical algorithm to protect the data is pretty much
impossible to crack. People's concerns often land with the
behaviours of those with access to the shared key.
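
The defining property of symmetric encryption, one shared key for both directions, can be shown with a toy XOR cipher. This is a teaching sketch only, not a production cipher: real systems use algorithms such as AES, and the helper name `xor_bytes` is an assumption.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the matching key byte."""
    return bytes(d ^ k for d, k in zip(data, key))


message = b"meet at noon"
# Sender and receiver must both hold this same secret key.
key = secrets.token_bytes(len(message))

ciphertext = xor_bytes(message, key)     # encrypt with the shared key
recovered = xor_bytes(ciphertext, key)   # decrypt with the very same key

print(ciphertext.hex())
print(recovered)  # b'meet at noon'
```

Anyone who obtains `key` can decrypt, which is why the behaviour of everyone holding the shared key matters so much.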

Conversely, asymmetric encryption refers to the method of using
a pair of keys: one for encrypting the data and the other for
decrypting it.

This process is depicted in the above diagram. The first key is
called the public key and the second is called the private key.
The public key is shared with the servers so the message can be
sent, while the private key, which is owned by the possessor of
the public key, is kept a secret, totally private.

Only the person with the private key matching the public one
will be able to access the data and decrypt it, making it
impenetrable to intruders.

Other methods
There are numerous common encryption algorithms and methods
designed to keep information private. You may already be aware
of some of them including RSA, Triple DES and Blowfish.

RSA (Rivest-Shamir-Adleman) is an asymmetric encryption
technique that uses two different keys, a public key and a
private key, to perform the encryption and decryption. With RSA,
you encrypt sensitive information with a public key, and the
matching private key is used to decrypt the encrypted message.
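
The public and private key relationship in RSA can be worked through with deliberately tiny numbers. This is a textbook-style sketch: the primes 61 and 53 and the exponent 17 are chosen for readability, and real RSA uses primes hundreds of digits long plus padding. It assumes Python 3.8 or later, where `pow` accepts a negative exponent with a modulus.

```python
# Key generation with toy primes (far too small for real use).
p, q = 61, 53
n = p * q                    # 3233, part of both keys
phi = (p - 1) * (q - 1)      # 3120
e = 17                       # public exponent, coprime with phi
d = pow(e, -1, phi)          # private exponent: inverse of e modulo phi


def encrypt(m: int) -> int:
    """Anyone may do this using the public key (e, n)."""
    return pow(m, e, n)


def decrypt(c: int) -> int:
    """Only the holder of the private key (d, n) can do this."""
    return pow(c, d, n)


message = 65
ciphertext = encrypt(message)
print(d, ciphertext, decrypt(ciphertext))  # 2753 2790 65
```

Publishing `(e, n)` does not reveal `d` unless `n` can be factored into `p` and `q`, which is what makes large keys hard to break.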

In cryptography, Triple DES (3DES or TDES), officially the Triple
Data Encryption Algorithm (TDEA or Triple DEA), is a symmetric-key
block cipher, which applies the DES cipher algorithm three times
to each data block. The Data Encryption Standard's (DES) 56-bit
key is no longer considered adequate in the face of modern
cryptanalytic techniques and supercomputing power.

Blowfish is a symmetric-key block cipher, designed in 1993 by
Bruce Schneier and included in many cipher suites and encryption
products. Blowfish provides a good encryption rate in software,
and no effective cryptanalysis of it has been found to date.
However, the Advanced Encryption Standard (AES) now receives
more attention, and Schneier recommends Twofish for modern
applications.
Password protection; how passwords are stored and what that
means for their security (e.g. plaintext, hash, salted hash,
reversibly encrypted), logon/authentication passwords v
passwords on data/ files/ compressed files.
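
Storing a password as a salted hash rather than as plaintext can be sketched with Python's standard library. The function names and the iteration count are assumptions for the example; PBKDF2 is one common choice of slow hash, not the only one.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor; higher is slower to brute-force


def hash_password(password: str, salt: bytes = None):
    """Return (salt, digest). Only these are stored, never the password."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt for each user
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash the attempt with the stored salt and compare the digests."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("correct horse")
print(verify_password("correct horse", salt, stored))  # True
print(verify_password("wrong guess", salt, stored))    # False
```

Unlike encryption, hashing is one-way: the stored digest cannot be turned back into the password, and the salt means two users with the same password still get different digests.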
Difference between hashing and encryption

ROLE OF INFORMATION SYSTEMS IN
AN ORGANIZATION
What Is an Information System?
At the most basic level, an information system (IS) is a set of
components that work together to manage data processing and
storage. Its role is to support the key aspects of running an
organization, such as communication, record-keeping, decision
making, data analysis and more. Companies use this information
to improve their business operations, make strategic decisions
and gain a competitive edge.

Information systems typically include a combination of software,
hardware and telecommunication networks. For example, an
organization may use customer relationship management systems
to gain a better understanding of its target audience, acquire
new customers and retain existing clients. This technology allows
companies to gather and analyze sales activity data, define the
exact target group of a marketing campaign and measure
customer satisfaction.

The Benefits of Information
Systems
Modern technology can significantly boost your company's
performance and productivity. Information systems are no
exception. Organizations worldwide rely on them to
research and develop new ways to generate revenue,
engage customers and streamline time-consuming tasks.

With an information system, businesses can save time and
money while making smarter decisions. A company's internal
departments, such as marketing and sales, can communicate
better and share information more easily.

Since this technology is automated and uses complex
algorithms, it reduces human error. Furthermore, employees
can focus on the core aspects of a business rather than
spending hours collecting data, filling out paperwork and
doing manual analysis.

Thanks to modern information systems, team members can
access massive amounts of data from one platform. For
example, they can gather and process information from
different sources, such as vendors, customers, warehouses
and sales agents, with a few mouse clicks.

Role of Information Technology
in an Organization
The role of information technology in various sectors has evolved
quickly since the last decade of the 20th century. Modern
organizations use information technology throughout most, if not
all, departments and across most functions. The obvious example
is email. Email has become ubiquitous in connecting employees
to each other, between departments and between locations or
markets. This is true whether a business is entirely local with a
single point of presence or maintains offices in multiple
locations in multiple countries.

But IT goes far beyond mundane operations. The right IT systems
give companies a competitive edge, enabling them to enter
larger markets and expand products or service lines more
efficiently, as well as keep tabs on competitors. IT has now
become such a pervasive aspect of business operations that
many employees and managers no longer see it as a separate
function. Rather, IT has become an indispensable element of
every corporate department and function, driving innovation
and fostering growth throughout the entire organization.

Information technology in business helps a corporation maintain
a watchful eye on expenses and profits, enabling management
to act more nimbly to trim costs or to change the sales team’s
focus when necessary. A strong IT system also helps all facets of
a company work more productively. By enabling automation
and digital tools, tasks that once took hours can now be
performed in a matter of minutes. Today’s agile businesses
weave IT into everything they do, enabling them to accomplish
more in a shorter amount of time.

Operations support system
Operations support systems (OSS), operational support systems
in British usage, or Operation System (OpS) in NTT, are
computer systems used by telecommunications service providers
to manage their networks (e.g., telephone networks). They
support management functions such as network inventory,
service provisioning, network configuration and fault
management.

Together with business support systems (BSS), they are used to
support various end-to-end telecommunication services. BSS
and OSS have their own data and service responsibilities. The
two systems together are often abbreviated OSS/BSS, BSS/OSS
or simply B/OSS.

The acronym OSS is also used in a singular form to refer to all
the Operations Support Systems viewed as a whole system.

Different subdivisions of OSS have been proposed by the TM
Forum, industrial research labs, or OSS vendors. In general, an
OSS covers at least the following five functions:

Network management systems
Service delivery
Service fulfillment, including the network inventory, activation
and provisioning
Service assurance
Customer care

A lot of the work on OSS has been centered on defining its
architecture. Put simply, there are four key elements of OSS:

Processes:
the sequence of events
Data:
the information that is acted upon
Applications:
the components that implement processes to manage
data
Technology:
how we implement the applications

Collaboration
In an IT context, collaboration is a situation in which multiple
parties converge toward a common goal. The term can be
applied to a technology that allows individuals or groups to
work together. This covers a broad spectrum of technologies,
including social and interactive media and other social
platforms.

In IT, the term may be used in ways that fit the definition of using
technology to achieve a specified goal. Some refer to
collaboration as a recursive process, where multiple steps
produce incremental progress. Others refer to specific features
of most collaboration software, such as chat or instant messaging
(IM), presentation and file sharing.

In general, the term collaboration is used in IT to talk about the
evolution of group work resources over various types of network
structures. These tools help drive more efficient business
practices and enhance global communications in the modern
world.

Knowledge Management
Knowledge management includes the collection, analysis,
dissemination, and general management of all information that
is possessed by an organization. A Knowledge Management
System carries out these functions and follows best practices to
deliver optimal results for the organization using it in an efficient
and effective manner.

By definition, a Knowledge Management System (KMS) is a
system for applying and using knowledge management
principles to typically enable employees and customers to
create, share and find relevant information quickly. A
Knowledge Management System is a valuable tool for any
business operating in our data-driven digital world, particularly
those that sell products and/or provide services.

Functionally, an IT-backed knowledge management system will
collect, store and retrieve knowledge, find sources of
knowledge, and monitor and mine repositories for hidden
information. It helps automate the knowledge management
process and creates efficiencies by providing key players with
more time to spend learning from and applying data insights,
information, and knowledge.

A successful Knowledge Management System implementation will:
Manage and capture knowledge
Search for and retrieve existing knowledge
Disseminate knowledge, data and information to those who need it
Facilitate collaboration within and across teams

Product Development
The three stages of product development

Altitude recommends approaching product development using a
three-stage process. First is the customer needs and discovery
stage. The design strategy team should conduct extensive
research into the deep insights of consumers to distill the right
opportunities.

Second is the design and solution stage, which requires the
design team and client team to collaborate to design and
develop a meaningful solution.

Finally, the build, test and go-to-market stage. This is when the
engineering team steps in to turn the solution into something
manufacturable, reliable and within target cost. The team can then
deliver the solution to contract manufacturers who will build the
product themselves.

Service Delivery
The simplest and clearest definition of Enterprise Service
Management (ESM) is the use of IT Service Management (ITSM)
principles and capabilities in business functions to improve their
performance, service, and outcomes.

ESM improves visibility and access to enterprise services of all
forms, accelerates service delivery and of course supports core
ITSM processes, such as incident, problem, change, request, and
service asset and configuration management.

What does this mean?

Modern technologies and software that deliver instant access
and answers to all aspects of consumer life have become
pervasive. Employees expect a similar experience in their day-
to-day business life, whether they engage with IT or any of the
many other service providers in a company. This includes Human
Resources (HR), Legal, Facilities, Education, Security, Sales,
Marketing, R&D, and Finance departments.

As a consequence, businesses need to rethink their approach to
providing employee workplace services. These back-office
services, such as onboarding a new employee, have been
(and often still are) available only through manual
processes such as phone calls, emails or filling in
spreadsheets. In the digital world, employees expect easy and
instant access to these services through a common service
catalog, with their requests fulfilled automatically and
immediately. Another characteristic of non-IT services is that
enterprise services span multiple business functions and typically
also include IT services.

RESOURCES
The Open University has a short course on IT systems. It is more a
tutor resource than one suitable for students.

[Link]
php?printable=1&id=2846

There is a YouTube talk on IT systems from a business perspective
here:

[Link]

TRANSACTION & PROCESSING
SYSTEMS
Overview of Topic
This chapter examines the characteristics of transaction
processing systems. It investigates specific examples of real time
transaction processing and batch transaction processing. The
information processes of storing and retrieving, collecting, and
analysing for a transaction processing system are examined. The
chapter concludes by outlining the social and ethical issues that
relate to transaction processing systems.

The information provided on this website is a summary of
the important points required by the syllabus provided by the
BOSTES.

YouTube channel providing explanations of each focus point:
[Link]
OjgyLLaQClBHmIA

1. Characteristics of transaction
processing systems
Transaction processing systems (TPS) collect, store, modify
and retrieve the transactions of an organisation
A transaction is an event that generates or modifies data to
be stored in an information system
Examples: point of sale, credit card payments
Designed in conjunction with the organisation's procedures
Main processes are collecting and storing

ACID (Atomicity, Consistency, Isolation, Durability) is a set of
properties of database transactions. In the context of
databases, a single logical operation on the data is called a
transaction.

The four important characteristics of a TPS are:

Rapid response
Fast performance is critical
Turnaround time from the input of a transaction to the production
of the output must be a few seconds or less

Reliability
Breakdowns disturb operations
Failure rates must be low
If failure occurs, recovery must be quick and accurate

Inflexibility
Every transaction must be processed in the same way
Flexibility results in too many opportunities for non-standard
operations, resulting in problems due to different transaction
data

Controlled processing
Must support an organisation's operations
If roles and responsibilities are allocated, the TPS should
maintain these requirements
A TPS reduces costs by reducing the number of times
data must be handled
There are two types of transaction processing: batch and real time

Batch transaction processing

Collects the transaction data as a group and processes it
later

Has a time delay (hours or days)

Low processing cost per transaction

Used for pay cheques and wherever a time delay does not
decrease the usefulness of the results

Disadvantages
Processing must wait until a set time
Errors cannot be corrected during processing
Sorting the data is expensive and time consuming
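
The idea of collecting transactions as a group and processing them later can be sketched as follows. The account names, amounts and function names are hypothetical and only serve to show that the master data does not change until the batch run.

```python
balances = {"ACC1": 100, "ACC2": 50}   # hypothetical master data


def apply_transaction(master, account, amount):
    """Apply one transaction to the master data."""
    master[account] = master[account] + amount


def process_batch(master, pending):
    """Process every collected transaction in one run, then empty the queue."""
    processed = 0
    for account, amount in pending:
        apply_transaction(master, account, amount)
        processed += 1
    pending.clear()
    return processed


# During the day, transactions are only collected.
pending = [("ACC1", -30), ("ACC2", 20), ("ACC1", 5)]
before_run = dict(balances)             # master data is still unchanged here

# At the scheduled time, the whole group is processed together.
count = process_batch(balances, pending)
print(before_run, balances, count)
```

In a real-time system, apply_transaction would instead be called as soon as each transaction is entered.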

Real-time transaction processing
Immediate processing of data

Instant confirmation of a transaction, but requires access
to an online database

Uses a terminal or workstation to enter data and display the
results of the TPS

A network links the terminals to a mainframe computer

A large number of users perform transactions
simultaneously (requests are also simultaneous)

Concerns:
Concurrency: two users cannot change the same data at
the same time
E.g. if an airline agent has reserved the last seat, another
agent cannot tell another passenger that the seat is
available
Atomicity: all steps involved in a transaction must be
completed successfully
If any step fails, no other step should be completed
E.g. transferring money between accounts (the withdrawal
must succeed for the transfer to succeed)

The response time delay must be acceptable for the
application to be considered real time

Main disadvantage: expense, due to the hardware and
software required
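
The atomicity concern can be illustrated with a small sketch using Python's built-in sqlite3 module. The table, the account names and the rule that a balance may not go below zero are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, source, target, amount):
    """Move money between accounts: both updates happen, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok_small = transfer(conn, "alice", "bob", 30)    # both steps succeed
ok_large = transfer(conn, "alice", "bob", 500)   # withdrawal fails, deposit is undone
print(ok_small, ok_large, balances(conn))
```

The deposit is deliberately performed first so that the failed withdrawal shows the earlier step being rolled back rather than left half-done.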

Transaction processing monitor

Software that allows the transaction processing application
programs to run efficiently

Provides a standard interface between input devices,
programs and the DBMS

Ensures transactions do not get lost or corrupted

Differences between real time and batch

In both, each transaction is processed in the same manner

Real time requires the master file to be available more often for
updating and referencing

Real time has fewer errors: in batch, the data is organised
and stored before the master file is updated, and errors can occur
during these steps

Infrequent, yet tolerable, errors can occur in batch

More computers are required in real time processing

It is more difficult to maintain a real time system

Data validation

Used to check the entry of transaction data to ensure the
transactions are correct and have been accurately stored in the
database

Transaction initiation: used to acknowledge that the TP
monitor is ready to receive the transaction data. Used in
real time to eliminate possible errors

Field checking: carried out when transaction data is entered into a
database. The fields are validated using a range check, list check,
type check or check digit

These checks don't eliminate human error (e.g. typing 45 instead of 54)
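
The four field checks can be sketched as small functions. The notes do not say which check digit scheme is meant, so the Luhn algorithm is used here as an assumed example, and the sample values are hypothetical.

```python
def range_check(value, low, high):
    """Range check: the value must lie between two limits."""
    return low <= value <= high


def list_check(value, allowed):
    """List check: the value must be one of a set of permitted values."""
    return value in allowed


def type_check(text):
    """Type check: the field must contain only digits."""
    return text.isdigit()


def check_digit_ok(number):
    """Check digit using the Luhn scheme (assumed; the text names no scheme)."""
    total = 0
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:      # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# A transposition error still passes the range check: 45 and 54 are both valid.
typed_ok = range_check(45, 0, 100)
intended_ok = range_check(54, 0, 100)
```

The last two lines show why validation cannot remove human error: a mistyped value that is still plausible passes every check.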

Historical significance of transaction processing systems

TPS were the first type of information system (1950s)
The first business computers were used to batch process transactions

Manual Transaction systems

Business systems that operate without the use of machines

E.g. manual POS systems have eleven operational steps
performed by a sales assistant to sell a product
These steps allow the system to be easily computerised

Computerisation of a manual TPS provides benefits for
businesses
Increases the rate at which products are sold
Less time is taken for a customer to purchase a product
Customers aren't waiting for the manual procedure to be
completed
Provides information on products in demand
Maximises profits

A TPS must be designed specifically to fit the business's needs

YouTube video for explanation:
[Link]

2. Examples of transaction
processing systems
Components of a transaction processing system

Users
Take the data provided by the TPS and use it in another
information system
E.g. a point of sale system provides stock inventory data used by an
automated manufacturing system. The users of the other
system belong to the same organisation as the TPS. They are
not interacting with the TPS but are using the data provided
by the TPS

Participants
Conduct the information processing (the people who do the work)
Need to know what to do, how to do it and when to do it
Success or failure is dependent on them

People
People from the environment are becoming participants in real time
processing as they directly enter transactions and perform
validation
When you withdraw money from an ATM, you are a
participant of a TPS

Examples of real time transaction processing

Reservation systems
Used in any type of business involved in setting aside a
product or service for a customer (e.g. layby, train tickets)
Require an acceptable response time

Point of sale terminals

Used by retail stores to sell goods and services
Minimises the cost of batch handling by converting the data
to a form that can be easily transmitted through a
communication system
The correct price of the product is retrieved once the product
number is entered

Library loan system
Used to keep track of borrowed items
Barcodes are scanned on the user's card and the item
This is recorded in the database
Similar to reservation systems (involves keeping information
on products, availability, usage and maintenance)
Items are often stored in a warehouse

Examples of batch transaction processing

Cheque clearance
A cheque is a written order asking a bank to transfer an amount
of money to an account
People deposit cheques into their accounts
Involves checking that the person who wrote the cheque has
sufficient funds (takes up to 3 days)
Money is withdrawn when the cheque has been cleared

Bill generation
An invoice is given to a customer for goods or services supplied
Generated at a scheduled time so the user can effectively
manage their time
Done as a group

Credit card sales transactions(Manual)
An impression of the customer's card is taken on a credit slip,
which is filled in by a sales clerk
The slips are sent to the bank as a group
Not processed immediately
Customers may view credit card transactions as real time,
but the updating is batch

YouTube video for explanation:
[Link]

3. Storing and retrieving

TPS (Transaction Processing Systems) require an efficient
method for storage and retrieval of data.

Databases and files

The information processes in any large organisation are
often unique and complex.
The storage and retrieval of data must occur accurately many
times each day.
Database = Organised collection of data.

Types of Database:

Hierarchical Database
Organises data in a series of levels. It uses a top-down structure
consisting of nodes and branches, and each lower-level node
(child) is linked to only one higher-level node (parent).

Network Database
Organises data as a series of nodes linked by branches. Each
node can have many branches, and each lower-level node
(child) may be linked to more than one higher-level node (parent).

Relational Database
Organises data using a series of related tables. Relationships are
built between the tables to provide a flexible way of
manipulating and combining data.

Features for Real-Time Transaction Processing

Good Data Placement
A large number of users are simultaneously performing
transactions to change data, so the database should be designed
around patterns of data use, with frequently accessed data
placed together.

Short transactions
Keeping transactions short makes processing quick, which
improves concurrency and makes user interaction easier.

High Normalisation
Redundant information is kept to a minimum to increase the speed
of updates and transactions.

Archiving of Historical Data
Data that is rarely referenced should be archived into separate
databases or moved out of the heavily updated tables.

Good Hardware Configuration
Hardware needs to be able to handle a large number of
concurrent users and to provide quick response times.

File: a block of data. In a database, a file is divided into a
set of related records that contain specific information such
as customer details. A TPS uses files to store and organise
transaction data. Batch and real time processing require different
methods of storage and retrieval.

Five basic file types
Master file: Contains information about the organisation's
business situation. Transaction data is stored in the master
file
Transaction file: Collection of transaction records, serving as
audit trails and history
Report file: Data that has been formatted for presentation
Work file: Temporary file used during processing
Program file: Contains instructions for processing data,
written in a programming language such as C++ or Visual Basic.

Data warehousing
A data warehouse is a database that collects information from
different data sources, gathered in real time, and provides data
in various formats

Consolidated: Data is organised using consistent naming
conventions, measurements, attributes and semantics. For example,
a true/false value might be recorded as one/zero or on/off
in different sources. Data is stored in one format, which allows
it to be used effectively across the organisation
Subject-oriented: As large amounts of data are stored, some of it
is irrelevant, which makes querying the data difficult. A
warehouse organises the key information from operational
sources so that it is available for analysis
Historical: Real time transaction systems show current values at
any moment but not past values. Data stored in a warehouse
is accurate for any moment, as it stores past information as
well as current information and cannot change. It is stored as a
series of snapshots generated over a period of time
Read-only: Data does not change in a warehouse unless it
was incorrect, thus updates, deletes and inserts are not
applicable. The only operations that occur are loading and
querying data.
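
As a small sketch of consolidation, the function below converts the different true/false encodings mentioned above into one format before the data is loaded. The accepted spellings and the function name are assumptions made for the example.

```python
TRUE_VALUES = {"1", "true", "on", "yes", "y"}     # assumed source encodings
FALSE_VALUES = {"0", "false", "off", "no", "n"}


def consolidate_flag(raw):
    """Convert a true/false value from any source system into a Python bool."""
    text = str(raw).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    raise ValueError(f"unrecognised flag value: {raw!r}")


# Values as they might arrive from three different operational systems.
source_values = [1, "OFF", " on ", "0", "True"]
consolidated = [consolidate_flag(value) for value in source_values]
print(consolidated)
```

Rejecting unknown values, rather than guessing, keeps inconsistent data out of the warehouse.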

Backup Procedures

As organisations are dependent on their TPS, breakdowns may
stop a business operating. To counter this, backup and recovery
procedures are put in place to minimise disruptions. A backup is
a copy of data used to rebuild a system if it goes down. The
success of a backup relies on the implementation of
appropriate procedures.

Recovery process

A TPS can fail for many reasons, such as system failure, human error
or hardware failure. To cope with failures, a set of procedures
is put in place for recovery

Backup: Backups are made regularly, at least once a day,
with the copy being stored in a secure location for
protection

Journal: Maintains an audit trail of transactions and
changes. Transaction logs record all essential data for each
transaction, such as the time of the transaction. Change logs
contain before and after copies of records

Checkpoint: The DBMS periodically suspends all processing to
synchronise all files and journals. All programs are
completed and the journal entries are updated. Once
synchronised, the DBMS writes a special record called a checkpoint,
which contains the information necessary to restart the system.
Checkpoints are taken frequently. When failures occur, checkpoints
are reliable, as usually only a few minutes of work is lost.

Recovery manager: A program that restores the database
and restarts the transaction processing.

The two types of recovery are:
Backward recovery: Used to undo unwanted changes to a
database. For example, a bank transaction involved
transferring $1000 between accounts, but an error occurred
and the funds were taken from the wrong account. Backward
recovery restores the records to their state before the transaction

Forward recovery: Starts with a backup copy of the
database and reprocesses the transactions in the journal that
occurred between the time the backup was made and the
current time. It is more efficient than re-entering the
transactions, as the changes are already recorded in the journal.
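
Both kinds of recovery can be sketched with a change log that holds before and after copies of each record. The log layout, account names and function names below are hypothetical.

```python
backup = {"ACC1": 100, "ACC2": 50}   # state when the backup was made

# Change log written after the backup:
# (sequence number, account, before image, after image)
change_log = [
    (1, "ACC1", 100, 70),
    (2, "ACC2", 50, 80),
    (3, "ACC1", 70, 65),
]


def forward_recover(backup_copy, log):
    """Start from the backup and reapply the after images in order."""
    restored = dict(backup_copy)
    for _seq, account, _before, after in sorted(log):
        restored[account] = after
    return restored


def backward_recover(current, log, undo_from):
    """Undo changes from entry `undo_from` onwards, newest first."""
    restored = dict(current)
    for seq, account, before, _after in sorted(log, reverse=True):
        if seq >= undo_from:
            restored[account] = before
    return restored


current = forward_recover(backup, change_log)
without_last = backward_recover(current, change_log, undo_from=3)
print(current, without_last)
```

Forward recovery rebuilds the latest state after a failure, while backward recovery removes an unwanted change from a database that is otherwise intact.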

Magnetic Tape
Often used for backup as it can store large quantities of data
It uses sequential access to retrieve data, which is slow, but it is
still a suitable medium for backup

Grandfather-father-son
Uses at least three generations of backup master files
The most recent is the son and the oldest is the grandfather
Commonly used with magnetic tape for batch processing
When a failure occurs, the master file is recreated
from one of the three (preferably the son) and the system is
restarted
Keeping several generations means that if one copy is lost or
damaged, the data can still be recreated from another
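
The rotation of generations can be sketched as below. The class name and the stock figures are hypothetical, and each backup is simplified to an in-memory copy of the master file.

```python
from collections import deque


class GenerationBackups:
    """Keeps the three most recent copies of the master file."""

    def __init__(self, generations=3):
        # Newest copy sits at index 0; the oldest is dropped automatically.
        self._copies = deque(maxlen=generations)

    def backup(self, master_file):
        """Store a new son; the previous son becomes the father, and so on."""
        self._copies.appendleft(dict(master_file))

    def restore(self):
        """Recreate the master file from the most recent copy (the son)."""
        return dict(self._copies[0])

    def names(self):
        """Label the stored copies from newest to oldest."""
        labels = ["son", "father", "grandfather"]
        return dict(zip(labels, self._copies))


gfs = GenerationBackups()
for stock in (10, 20, 30, 40):          # four backup runs
    gfs.backup({"stock": stock})

print(gfs.names())
```

After the fourth run, the first copy has been discarded and the three most recent copies remain available for recovery.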

Partial backups
Occur when only parts of the master file are backed up
Transactions completed since the last backup are stored
separately as journal files
The master file can be recreated from the journal files and the
magnetic tape

Updating in Batch

Batch updating is used when transactions are recorded on paper or
stored on magnetic tape

Transactions are collected and held until it is convenient or
economical to process them

The two stages in batch processing are collecting and storing the
transaction data, and processing the data. It is difficult to update
in batch as the update may include additions or deletions. If one
error occurs, the entire batch is rejected

Sequential access: data is accessed in sequence; it is
the only method of accessing data on tape

The steps involve retrieving the transaction data from tape. The
update starts from the beginning of the tape and reads the data
in the order it was stored. It is time consuming to locate a specific
transaction

Batch processing requires secondary storage that stores the data
inexpensively, thus magnetic tape is favourable
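
A sequential batch update can be sketched as a single pass over two sorted files. The record layout (account number, balance or amount) is hypothetical, and the sketch assumes that every transaction refers to an account that exists in the master file.

```python
def sequential_update(old_master, transactions):
    """Merge a sorted transaction file into a sorted master file in one pass."""
    new_master = []
    pending = iter(transactions)
    current = next(pending, None)
    for key, balance in old_master:
        # Apply every transaction for this key before moving to the next record.
        while current is not None and current[0] == key:
            balance += current[1]
            current = next(pending, None)
        new_master.append((key, balance))
    return new_master


old_master = [(101, 500), (102, 300), (103, 0)]       # sorted by account number
collected = [(103, 40), (101, -50), (101, 25)]        # in order of arrival
sorted_transactions = sorted(collected)               # sort before the run

new_master = sequential_update(old_master, sorted_transactions)
print(new_master)
```

Both files are read from start to finish exactly once, which suits tape, but the transactions must be sorted first, and that sorting is one of the costs of batch processing.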

Updating in real time
Real time updating involves a large number of users performing
transactions to change data

The steps in real time involve sending the data to a master file in
an online database

Data is accessed via direct access, which occurs when data
is accessed without accessing previous data items

Direct access uses an algorithm to calculate the location of the data

If the data is not there, it continues to search through
successive locations until it is found

Involves an index, a table containing information about the
location of data

Real time processing requires secondary storage that can hold
large quantities of data and provide quick access, such as magnetic
disk storage

The interface for the software is user-friendly as rapid
response time is critical
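
Direct access as described above can be sketched with a simple hashing scheme: an algorithm calculates a home location from the key, and successive locations are searched if that location is taken. The class name, the number of slots and the remainder-based algorithm are assumptions made for the example.

```python
class DirectAccessFile:
    """A fixed number of record slots addressed by a calculated location."""

    def __init__(self, slots=7):
        self._slots = [None] * slots

    def _home(self, key):
        """The algorithm that calculates where a record should be."""
        return key % len(self._slots)

    def store(self, key, record):
        """Place the record at its home slot or the next free slot."""
        index = self._home(key)
        for _ in range(len(self._slots)):
            if self._slots[index] is None or self._slots[index][0] == key:
                self._slots[index] = (key, record)
                return index
            index = (index + 1) % len(self._slots)
        raise RuntimeError("file is full")

    def fetch(self, key):
        """Start at the home slot and search successive slots for the key."""
        index = self._home(key)
        for _ in range(len(self._slots)):
            entry = self._slots[index]
            if entry is None:
                return None
            if entry[0] == key:
                return entry[1]
            index = (index + 1) % len(self._slots)
        return None


accounts = DirectAccessFile()
slot_a = accounts.store(10, "Account A")   # home slot is 10 % 7
slot_b = accounts.store(17, "Account B")   # same home slot, so the next one is used
slot_c = accounts.store(5, "Account C")
```

No earlier records have to be read to reach a given record, which is what makes this kind of access fast enough for real time updating.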

YouTube video for explanation:
[Link]

4. Other information Processes
Collecting

Collecting data in a TPS involves generating the transaction
data

E.g. people using an ATM generate transaction data by entering
their PIN and typing requests on a keyboard

Hardware
MICR (magnetic ink character recognition)
Used by banks to read account numbers on cheques
Characters are printed using ink containing
magnetised particles
Quickly and accurately reads prerecorded data on cheques
and deposit slips
An example of batch transaction processing

Barcode readers
Used in retail industries to collect product information at the
POS (keeping track of stock movements)
Product information is held on a central computer linked to a
terminal
Data about the item is collected quickly and accurately

Forms
Forms are documents used to collect data from a person
Processed in batch or real time
Paper forms: a person completes a paper form (batch
processing)
On-screen forms: used for computerised data entry; the user
can view, enter and change data in real time. Well designed
forms provide information explaining the required data and
any particular rules.
Web forms: typically used to purchase items over the
internet. They request relevant data such as a delivery address
and payment method. Can be processed in real time or
batch. The entered data is stored in fields of a database

Analysing Data
The results of processing transactions are stored in a database
and are analysed in many ways to meet the information needs
of users.

Decision support systems
Assist people in making decisions by providing information,
models and analysis tools
E.g. a business uses a TPS to process its sales transactions. It
uses a database to summarise all sales by date, region and
product. This summary is stored in a separate database to
be analysed by senior management
Data mining is used to find relationships and patterns in the
stored data
E.g. data mining can be used to analyse supermarket sales
data to determine whether there is a
relationship between tomato sauce and meat pie sales, which is
useful for marketing promotions and making more informed
decisions
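
The supermarket example can be sketched by counting how often two products appear in the same sale. The baskets below are invented sample data, and "support" and "confidence" are standard data mining measures that the notes themselves do not name.

```python
# Each set is one sales transaction recorded by the TPS (invented sample data).
baskets = [
    {"meat pie", "tomato sauce", "milk"},
    {"meat pie", "tomato sauce"},
    {"bread", "milk"},
    {"meat pie", "bread"},
    {"tomato sauce", "meat pie", "bread"},
]


def count(items):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket)


def support(items):
    """Fraction of all baskets that contain the items."""
    return count(items) / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets with the antecedent, the fraction that also have the consequent."""
    return count(antecedent | consequent) / count(antecedent)


pie_and_sauce = support({"meat pie", "tomato sauce"})
sauce_given_pie = confidence({"meat pie"}, {"tomato sauce"})
print(pie_and_sauce, sauce_given_pie)
```

A high confidence value would suggest that customers who buy meat pies often buy tomato sauce as well, which is the kind of relationship a marketing promotion could use.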

Management information systems

Provide information for the organisation's managers
Present basic facts about the performance of the
organisation
Examples: sales reports, stock inventory, payroll, orders and
budgets

Report examples

Scheduled: provided on a regular basis; used by middle to
low level management

Forecasting: used to help make projections about business
trends. They are important in decision making and are
sometimes referred to as planning reports. High level
management uses them.

On demand: generated on request, in response to a
specific need.

Exception: used to alert management to unexpected
situations that require special handling
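
The difference between a scheduled report and an exception report can be sketched with stock data. The stock levels and the reorder level are assumed values.

```python
stock_levels = {"meat pie": 4, "tomato sauce": 40, "bread": 0}   # hypothetical data
REORDER_LEVEL = 5                                                 # assumed threshold


def scheduled_report(levels):
    """List every item, as a regular stock report would."""
    return sorted(levels.items())


def exception_report(levels, reorder_level):
    """List only the items that need special handling."""
    return sorted(item for item, quantity in levels.items() if quantity < reorder_level)


full_listing = scheduled_report(stock_levels)
low_stock = exception_report(stock_levels, REORDER_LEVEL)
print(full_listing)
print(low_stock)
```

The exception report is shorter because it filters out everything that is behaving as expected.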

YouTube video for explanation:
[Link]

5. Issues related to transaction
processing systems
Both positive and negative impacts arise from the use of a TPS

Nature of work

TPS automate business operations, affecting the people who
perform these operations

Automation of jobs
Use of information technology to perform tasks once
performed by people
E.g. POS has replaced many manual tasks such as memorising
the price of products
Allows organisations to be more efficient and offer new
services
Creates changes (new skills required, ongoing
training to complete)
Fewer people are required (loss of jobs in one industry can
result in growth in other industries)

People as participants
People directly add transaction data (e.g. withdrawing from
an ATM)
The internet allows more people to become participants (e.g.
buying products online)
Replaces people who would normally provide the service

Non-computer procedures
When the computer system is unavailable for any reason,
non-computer procedures are needed to deal with
transactions in real time
When the system is available again, users need to enter the
transactions completed by the non-computer procedures

Bias
Bias = data is unfairly skewed or gives too much weight to a
particular result
The process of collecting data needs to be carefully designed
and examined
Data from a TPS can be presented in a biased way using graphs
and charts
Becomes an issue when data is knowingly misrepresented

Importance of data
Organisations rely on a TPS and the data it processes, thus
procedures are important to ensure security, accuracy and
validity

Data Security
Data can be stolen, destroyed or modified
Bigger risk in real time, when data is accessible to multiple users
First line of defence: passwords, personal objects, biometric
devices
Encryption = coding data; decryption = decoding data.
Effective for security during the transmission of data
Firewalls = verify and authenticate all incoming data by
checking the password of people accessing the network.
They are expensive to install

Data Accuracy
Extent to which it is free from errors
Errors can be caused by mistakes in gathering or entering data,
or by out-of-date information
E.g. if a price is entered incorrectly, the customer is charged
the wrong price

Data Validation
Used to check the entry of data
Should check each transaction for detectable errors (missing
data, suspicious data values, wrong format)
Carried out using range checks, list checks, type checks and
check digits

Data integrity

Describes the reliability of the data (ACID test)

Atomicity: occurs when all steps involved are completed. If
any step fails, no other step should be completed or
attempted

Consistency: occurs when a transaction successfully
transforms the system and database from one valid state to
another. This depends on correct application programming,
e.g. debiting and crediting the same amount

Isolation: occurs if a transaction is processed concurrently
with other transactions and still behaves as if it were the only
transaction executing in the system

Durability: occurs if all the changes become permanent
when the transaction is committed

Control in transaction processing
Control starts with collecting and includes the way the TPS
manipulates data and the way errors are corrected

Data preparation and authorisation create the data to be
entered

There is an issue of people creating false data to promote their
careers

TPS results are not always correct (people should not become
dependent on the TPS, and need to maintain control over their
organisation's operations)

YouTube video for explanation:
[Link]

AGILE VS. WATERFALL DIFFERENCES IN
SOFTWARE DEVELOPMENT METHODOLOGIES
At the beginning of any software project, teams and
organizations have to first deal with the question of Agile vs.
Waterfall. Software projects follow a methodology of clearly
defined processes or software development life cycle (SDLC) to
ensure the end product is of high quality. An SDLC identifies
phases and the structured flow from one phase to another phase.
Typically, there are six to seven phases. Agile and waterfall are
two popular, but very different, development processes.

Waterfall project management is a traditional model for
developing engineering systems and is originally based on
manufacturing and construction industry projects. When applied
to software development, specialized tasks completed in one
phase need to be reviewed and verified before moving to the
next phase. It is a linear and sequential approach, where phases
flow downward (waterfalls) to the next.

Agile methodology is a type of incremental approach to
software development based on principles that focus more on
people, results, collaboration, and flexible responses to change.
Instead of planning for the whole project, it breaks down the
development process into small increments completed in
iterations, or short time frames. Each iteration includes all SDLC
phases such that a working product is delivered at the end. After
several iterations, a new or updated product is released.
Differences Between Agile vs. Waterfall
Both methodologies can help developers produce high-quality
software. Knowing the difference between agile and waterfall
can better equip a development team to choose the right
process and methods for the specific project requirements and
deliver a successful software project. Some of the distinct
differences are:
Agile is an incremental and iterative approach; Waterfall is
a linear and sequential approach.
Agile separates a project into sprints; Waterfall divides a
project into phases.
Agile helps complete many small projects; Waterfall helps
complete one single project.
Agile introduces a product mindset with a focus on customer
satisfaction; Waterfall focuses on successful project delivery.
Requirements are prepared every day in Agile, while
requirements are prepared once at the start in Waterfall.
Agile allows requirement changes at any time; Waterfall
avoids scope changes once the project starts.
Testing is performed concurrently with development in Agile;
the testing phase comes only after the build phase in a
Waterfall project.
Test teams in Agile can take part in requirements change;
test teams in Waterfall do not get involved in requirements
change.
Agile enables the project team to operate without a
dedicated project manager; Waterfall requires a project
manager who plays an essential role in every phase.

What Is Agile Development?
Agile development is a team-based approach that emphasizes
rapid deployment of a functional application with a focus on
customer satisfaction. It defines a time-boxed phase called a
sprint with a fixed duration, commonly two weeks.

At the start of each sprint, a list of deliverables is prioritized
based on customer input. At the end of the sprint, the developers
and the customer review and evaluate the work with notes for
future sprints. Because agile is a methodology based on general
principles, more specific process-based methods such as Scrum
and Kanban are called types of agile methodology.

What Are the Benefits of Agile?
Some of the known benefits of an agile project are:
Faster software development life cycle

Predictable schedule in sprints

Customer-focused approach, resulting in increased customer
satisfaction

Flexible in accepting changes

Empowers teams to manage projects

Promotes efficient communications

Ideal for projects with non-fixed funding

What Are the Disadvantages of Agile?
The following are disadvantages of agile:
Agile requires a high degree of customer involvement, which
not all customers are comfortable with or prefer to give.

Agile assumes every project team member is completely
dedicated; without this, the principle of self-management is
weakened.

A time-boxed approach may not be enough to
accommodate all deliverables, which will require changes in
priority and additional sprints that can increase cost.

Agile recommends co-location for efficient communication,
which is not always possible.
What Is the Waterfall Development
Process?
Waterfall project management is a sequential approach that
divides the SDLC into distinct phases such as requirements
gathering, analysis and design, coding and unit testing, system
and user acceptance testing, and deployment. The next phase
can only proceed if the previous phase has been completed. In
between phases, a deliverable is expected or a document is
signed off.

All phases are passed through and completed only once, so as
many requirements as possible are gathered at the start to
provide the information needed to create the plans, schedules,
budget, and resources. It is plan-driven, so any changes after the
project has started would disrupt the original plan and require a
restart.

What Are the Benefits of Waterfall?


The following are the benefits of waterfall methodology:
Straightforward planning and designing due to the
agreement on deliverables at the start of the project.

Better design with whole-system approach

Defined scope of work

Easier costing

Clear measurements of progress

Defined team roles

Dedicated resources can work in parallel for their specific
tasks
What Are the Disadvantages of
Waterfall?
Newer development methodologies were created because of
known disadvantages of waterfall, including:
Structure too rigid to allow necessary changes
No allowance for uncertainty
Limited customer engagement, resulting in poor satisfaction
Sequential approach is not ideal for a large-sized project
where the end result is too far in the future
Testing is done only in the later phases of the project.

A better way to approach a software development project
is to focus first on your business goals. Teams can then
choose, adapt, and even customize methodologies to create
the hybrid method that best fits their needs.

Leading Agile Project Management Software Platforms
Project managers looking for the best software that supports
agile development have many options. These are the best agile
PM tools you should consider.
