DMBI Module 6
6.1 WHAT IS BI ?
Define Business Intelligence with examples
BI (Business Intelligence) is a set of processes, architectures, and technologies that
convert raw data into meaningful information that drives profitable business actions. It
is a suite of software and services to transform data into actionable intelligence and
knowledge.
BI has a direct impact on an organization's strategic, tactical and operational business
decisions. BI supports fact-based decision making using historical data rather than
assumptions and gut feeling. BI tools perform data analysis and create reports,
summaries, dashboards, maps, graphs, and charts to provide users with detailed
intelligence about the nature of the business.
Business intelligence is defined as a set of mathematical models and analysis
methodologies that exploit the available data to generate information and knowledge
useful for complex decision-making processes.
A business intelligence system provides decision makers with information and
knowledge extracted from data.
Why is BI important ?
1. Measurement: creating KPI (Key Performance Indicators) based on historic data.
2. Identify and set benchmarks for varied processes.
3. With BI systems organizations can identify market trends and spot business
problems that need to be addressed.
4. BI helps with data visualization, which makes data easier to interpret and thereby
improves the quality of decision making.
5. BI systems can be used not just by large enterprises but also by SMEs (Small and
Medium Enterprises).
Types of BI users
1. The Professional Data Analyst : The data analyst is a statistician who always needs to
drill deep down into data. A BI system helps them get fresh insights to develop unique
business strategies.
2. The IT users : The IT user also plays a dominant role in maintaining the BI infrastructure.
MODULE 6 (BUSINESS INTELLIGENCE)
3. The head of the company : CEO or CXO can increase the profit of their business by
improving operational efficiency in their business.
4. The Business Users : Business intelligence users can be found from across the
organization.
BI System Disadvantages
1. Cost : Business intelligence can prove costly for small as well as medium-sized
enterprises. Using such a system may be expensive for routine business
transactions.
2. Complexity : Another drawback of BI is the complexity of implementing a data
warehouse. It can be so complex that it makes business processes rigid and hard to
deal with.
3. Limited use : Like many new technologies, BI was first established with the
purchasing power of rich firms in mind. Therefore, BI systems are not yet
affordable for many small and medium-sized companies.
4. Time-consuming implementation : It takes almost one and a half years for a data
warehousing system to be completely implemented. Therefore, it is a time-consuming
process.
Design : The second phase, which is divided into two sub-phases, aims to derive a
tentative plan for the overall architecture, taking into account any future developments
as well as the system's evolution in the mid-term. First and foremost, a review of
existing information infrastructures is required. Furthermore, in order to adequately
evaluate the information requirements, the primary decision-making processes that
will be supported by the business intelligence system should be examined. Later on,
the project plan will be drawn out using traditional project management approaches,
including development phases, priorities, projected execution time frames and costs,
as well as the essential roles and resources.
[Figure: phases in the development of a business intelligence system.
Analysis: identification of business needs.
Design: infrastructure recognition; project macro planning.
Planning: detailed project requirements; definition of the mathematical models needed;
identification of the data; definition of data warehouses and data marts; development
of a prototype.
Implementation and control: development of data warehouses and data marts.]
Planning : A sub-phase of the planning stage is dedicated to defining and describing the
functions of the business intelligence system in greater depth. Following that, existing
data, as well as data that could be collected from outside sources, is evaluated. This
enables the business intelligence architecture's information structures to be created,
which include a central data warehouse and potentially some satellite data marts.
Simultaneously with the recognition of available data, the mathematical models to be
used should be defined, ensuring the availability of the data required to feed each
model and ensuring that the efficiency of the algorithms to be used will be adequate
for the magnitude of the problems that will result. Finally, a system prototype should
be built at a low cost and with limited capabilities to discover any discrepancies
between actual needs and project specifications ahead of time.
Implementation and control : There are five major sub-phases in the last phase.
1. The data warehouse and each individual data mart must first be built. These
represent the information infrastructures that will feed the business intelligence
system.
2. A metadata archive should be developed to explain the meaning of the data in
the data warehouse and the transformations applied in advance to the original data.
3. Furthermore, ETL procedures are designed to extract and transform data from
primary sources before loading it into the data warehouse and data marts.
4. The next step is to create the main business intelligence applications that will
enable the execution of the planned analyses.
5. Finally, the system should be made available for testing and use.
2. Data Integration and cleaning tools : To effectively analyze the data collected for a BI
program, organizations integrate and consolidate different data sets to create unified
views of them. The most widely used data integration technology for BI applications
is extract, transform and load (ETL) software, which pulls data from source systems in
batch processes. A variant of ETL is extract, load and transform (ELT), in which data is
extracted and loaded as is and transformed later for specific BI uses. Other methods
include real-time data integration, such as change data capture and streaming
integration to support real-time analytics applications, and data virtualization, which
combines data from different source systems virtually. A BI architecture typically also
includes data profiling and data cleansing tools that are used to identify and fix data
quality issues. They help BI and data management teams provide clean and consistent
data that's suitable for BI uses.
3. Analytics data stores : This encompasses the various repositories where BI data is
stored and managed. The primary one is a data warehouse, which usually stores
structured data in a relational, columnar or multidimensional database and makes it
available for querying and analysis. An enterprise data warehouse can also be tied to
smaller data marts set up for individual departments and business units with data
and an ODS can be deployed on a single database server or separate systems. A data
lake running on a Hadoop cluster or other big data platform can also be incorporated
into a BI architecture as a repository for raw data of various types. The data can be
analyzed in the data lake itself or filtered and loaded into a data warehouse for
analysis. A well-planned architecture should specify which of the different data stores
is best suited for particular BI uses.
4. BI and data visualization tools : The tools used to analyze data and present
information to business users include a suite of technologies that can be built into a BI
architecture, for example, ad hoc query, data mining and online analytical processing,
or OLAP, software. In addition, the growing adoption of self-service BI tools enables
business analysts and managers to run queries themselves instead of relying on the
members of a BI team to do that for them. BI software also includes data visualization
tools that can be used to create graphical representations of data, in the form of
charts, graphs and other types of visualizations designed to illustrate trends, patterns
and outlier elements in data sets.
5. Dashboards, portals and reports : These information delivery tools give business users
visibility into the results of BI and analytics applications, with built-in data
visualizations and, often, self-service capabilities to do additional data analysis. For
example, BI dashboards and online portals can both be designed to provide real-time
data access with configurable views and the ability to drill down into data. Reports
tend to present data in a more static format.
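The extract, transform and load flow described under data integration above can be sketched in a few lines. This is only an illustrative sketch: the source rows, the `fact_orders` table and the function names are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system.
RAW_ORDERS = [
    {"order_id": "1", "region": " north ", "amount": "120.50"},
    {"order_id": "2", "region": "South", "amount": "80"},
    {"order_id": "3", "region": "north", "amount": "not available"},
]


def extract():
    """Extract: pull the raw rows from the (hypothetical) source."""
    return list(RAW_ORDERS)


def transform(rows):
    """Transform: normalize region names, parse amounts, drop unusable rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # data cleansing: skip rows whose amount cannot be parsed
        clean.append((int(row["order_id"]), row["region"].strip().lower(), amount))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into a warehouse-style table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run the three steps in order against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract()))
    return conn


conn = run_etl()
# A simple BI-style query on the loaded data: revenue per region.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM fact_orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

In an ELT variant, `load` would run on the raw rows first and the cleaning in `transform` would be expressed later as queries inside the data store.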
Types of DSS
1. Communication-driven DSS enables cooperation, supporting more than one
person working on a shared task; examples include integrated tools like Google Docs
or Microsoft Groove.
2. Document-driven DSS manages, retrieves, and manipulates unstructured
information in a variety of electronic formats.
3. Knowledge-driven DSS provides specialized problem-solving expertise stored as facts,
rules, procedures, or in similar structures.
4. Model-driven DSS emphasizes access to and manipulation of a statistical, financial,
optimization, or simulation model. Model-driven DSS use data and parameters
provided by users to assist decision makers in analyzing a situation; they are not
necessarily data intensive.
5. Data-driven DSS (or data-oriented DSS), which we will focus on, emphasizes access to
and manipulation of a time series of internal company data and, sometimes, external
data. Simple file systems accessed
by query and retrieval tools provide the most elementary level of functionality. Data
warehouse systems that allow the manipulation of data by computerized tools
tailored to a specific task and setting or by more general tools and operators provide
Types of Decisions
1. Structured : A structured decision is one in which the phases of the decision-making
process (intelligence, design, and choice) have standardized procedures, clear
objectives, and clearly specified input and output. There exists a procedure for
arriving at the best solution.
3. Semi-structured : A semi-structured decision has some, but not all, structured phases
where standardized procedures may be used in combination with individual
judgment.
2. Tactical : Tactical decisions affect only parts of an enterprise and are usually
restricted to a single department. The time span is limited to a medium-term horizon,
typically up to a year. They are usually made by middle managers.
3. Strategic : Decisions are strategic when they affect the entire organization or at least
a substantial part of it for a long period of time. They strongly influence the general
objectives and policies of an enterprise. They are taken at a higher organizational
level, usually by the company's top management.
(i) Organizational information : You may want to use virtually any information available
in the organization for your Decision Support System. What you use, of course,
depends on what you need and whether it is available. You can design your
Decision Support System to access this information directly from your company's
database and data warehouse.
(ii) External information : Some decisions require input from external sources of
information. Various branches of the federal government and the internet, to mention
just a few, can provide additional information for use with a Decision Support
System.
(iii) Personal information : You can incorporate your own insights and experience (your
personal information) into your Decision Support System. You can design your
Decision Support System so that you enter this personal information only as
needed, or you can keep the information in a personal database that is accessible
by the Decision Support System.
The model management component consists of both the Decision Support System
models and the Decision Support System model management system. A model is a
representation of some event, fact, or situation. As it is not always practical, or wise, to
experiment with reality, people build models and use them for experimentation.
Models can take various forms.
(i) Businesses use models to represent variables and their relationships. For example, you
would use a statistical model called analysis of variance to determine whether
newspaper, TV, and billboard advertising are equally effective in increasing sales.
(ii) Decision Support Systems help in various decision making situations by utilizing
models that allow you to analyze information in many different ways. The models you
use in a Decision Support System depend on the decision you are making and,
consequently, the kind of analysis you require. For example, you would use what-if
analysis to see what effect the change of one or more variables will have on other
variables, or optimization to find the most profitable solution given operating
restrictions and limited resources. Spreadsheet software such as Excel can be used as
a Decision Support System for what-if analysis.
(iii) The model management system stores and maintains the Decision Support System's
models. Its function of managing models is similar to that of a database management
system. The model management component cannot select the best model for a particular
problem, since that requires your expertise, but it can help you create and
manipulate models quickly and easily.
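The what-if analysis mentioned above can be illustrated with a very small model. This is a minimal sketch, not a prescribed DSS design: the profit formula, the variable names and the baseline figures are all assumptions chosen for illustration.

```python
def profit(units, price, unit_cost, fixed_cost):
    """A deliberately simple model relating the input variables to profit."""
    return units * (price - unit_cost) - fixed_cost


# Hypothetical baseline values for the model's variables.
BASELINE = {"units": 1000, "price": 20.0, "unit_cost": 12.0, "fixed_cost": 5000.0}


def what_if(base, **changes):
    """Recompute profit after changing one or more variables of the baseline."""
    scenario = {**base, **changes}
    return profit(**scenario)


# Vary one variable (price) and observe the effect on the output (profit).
for price in (18.0, 20.0, 22.0):
    print(price, what_if(BASELINE, price=price))
```

Changing several variables at once, for example `what_if(BASELINE, price=22.0, units=900)`, works the same way; this mirrors what a spreadsheet user does when editing input cells and watching the result cell.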
(Phase 3) Design : In this phase, the entire architecture of the system is considered.
Factors such as hardware, network structure, software tools, technology, database
and interaction tools are also taken into consideration.
(Phase 4) Implementation : This phase includes the actual implementation of a DSS and
its installation. The DSS is also tested for any errors or bugs. Any changes can be
backtracked using a feedback mechanism and project management tools. We can also
use agile methodology to speed up the implementation process.
(1) Integration : The design and development of a DSS necessitates the collaboration of
a large variety of approaches, tools, models, persons, and organizational processes.
(2) Involvement : During the design and development of a DSS, it is a common mistake
to exclude or isolate from the project team the knowledge workers who will actually
use the system once it is deployed.
(3) Uncertainty : While the cost of implementation is lower, the cost of making more
effective decisions may be higher.
The telecommunications industry has expanded dramatically in the last few years with
the development of affordable mobile phone technology.
There are many different types of telecommunications fraud and these can occur at
various levels. The two most common types of fraud are subscription fraud and
superimposed fraud.
In subscription fraud, fraudsters obtain an account without any intention of paying the
bill. This fraud thus occurs at the level of a phone number: all transactions from this
number will be fraudulent. In such cases abnormal usage occurs throughout the active
period of the account. The account is usually used for call selling or intensive self usage.
In superimposed fraud, fraudsters take over a legitimate account. In such cases the
abnormal usage is superimposed upon the normal usage of the legitimate customers.
There are several ways to carry out superimposed fraud, including mobile phone
cloning and obtaining calling card authorization details. Examples of such cases
include cellular cloning, calling card theft and cellular handset theft. Superimposed
fraud will generally occur at the level of individual calls; the fraudulent calls will be
mixed in with the justified ones.
Other types of telecommunications fraud include ghosting (technology that tricks the
network in order to obtain free calls) and insider fraud, where telecommunication
company employees sell information to criminals that can be exploited for fraudulent
gain.
These methods exist in the areas of Knowledge Discovery in Databases (KDD), Data
Mining, Machine Learning and Statistics. They offer applicable and successful solutions
in different areas of fraud crimes.
At a low level, simple rule-based detection systems use rules such as the apparent use
of the same phone in two very distant geographical locations in quick succession, calls
which appear to overlap in time, and calls of very high value or very long duration.
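The rules just listed can be sketched as simple checks over one phone's call records. This is a hedged illustration only: the `Call` record layout, the two thresholds and the use of the haversine formula for distance are assumptions, and each call is compared only with the call immediately before it.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

# Hypothetical thresholds, chosen only for illustration.
MAX_SPEED_KMH = 1000.0    # implied travel speed above this is implausible
MAX_DURATION_MIN = 180.0  # calls longer than this are flagged


@dataclass
class Call:
    start_min: float     # start time in minutes from some reference point
    duration_min: float
    lat: float           # location of the phone when the call started
    lon: float

    @property
    def end_min(self):
        return self.start_min + self.duration_min


def distance_km(a, b):
    """Great-circle (haversine) distance between the locations of two calls."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def flag_calls(calls):
    """Return (index, reason) pairs; indexes refer to calls sorted by start time."""
    calls = sorted(calls, key=lambda c: c.start_min)
    flags = []
    for i, call in enumerate(calls):
        if call.duration_min > MAX_DURATION_MIN:
            flags.append((i, "very long call"))
        if i == 0:
            continue
        prev = calls[i - 1]
        if call.start_min < prev.end_min:
            flags.append((i, "overlaps previous call"))
        gap_hours = (call.start_min - prev.start_min) / 60.0
        if gap_hours > 0 and distance_km(prev, call) / gap_hours > MAX_SPEED_KMH:
            flags.append((i, "distant locations in quick succession"))
    return flags


# Hypothetical records for a single phone number.
calls = [
    Call(0, 5, 19.07, 72.87),
    Call(3, 4, 19.07, 72.87),      # starts before the previous call has ended
    Call(30, 2, 28.61, 77.21),     # over 1000 km away, 27 minutes later
    Call(500, 240, 28.61, 77.21),  # four-hour call
]
flags = flag_calls(calls)
print(flags)
```

A value-based rule (very expensive calls) would follow the same pattern if the records carried a charge field.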
For example, forensic analytics may be used to review an employee's purchasing card
activity to assess whether any of the purchases were diverted or divertible for personal
use.
Techniques used for fraud detection fall into two primary classes: Statistical techniques
and Artificial intelligence.
5. Clustering and classification to find patterns and association among groups of data.
1. Data mining to classify, cluster, and segment the data and automatically find
associations and rules in the data that may signify interesting patterns, including
those related to fraud.
2. Expert systems to encode expertise for detecting fraud in the form of rules.
Recommendation System
This system works in three phases, namely preprocessing, modelling and obtaining
intelligence.
First, the users are filtered based on their profile and knowledge, such as needs
and preferences, defined in the form of rules. This involves feature selection and data
reduction on the dataset.
Second, in the modelling phase, these filtered users are clustered using the k-means
clustering algorithm.
Third, it identifies the nearest neighbours of the active user and generates
recommendations by finding the most frequent items in the identified cluster of users.
This algorithm can be experimentally tested with an e-commerce application for better
decision making by recommending the top n products to the active users.
2. Choose the consideration columns/features : Once the dataset D has been
identified, the next step is to choose the consideration (filtering)
columns/features. That is, from the whole dataset, the columns/subset of
features to be considered for our work are chosen. This includes the elimination of the
irrelevant columns in the dataset. An irrelevant column/feature is one that
provides little information about the dataset.
3. Filtering objects by defining rules: From the consideration dataset, the objects can
be grouped under stated conditions that are defined in terms of rules. That is, for each
column that is considered, specify the rule to extract the necessary domain from the
original dataset. This rule is considered to be the threshold value T. The domain can be
chosen by identifying the frequent items from the dataset.
4. Identifying frequent items : The frequent items can be identified by analyzing the
repeated value in the consideration column satisfying the support count and the
confidence threshold. This will create a new dataset D'.
5. Cluster objects using k-means clustering : Upon forming the new dataset D', the
objects in D' are clustered based on their similarity using k-means clustering.
k-means clustering is a method of grouping objects into k clusters (where k
is the number of clusters). The clustering is performed by minimizing the sum of
squared distances between the objects and the corresponding centroid. The result
consists of clusters of objects with their labels/classes.
6. Find nearest neighbour of active user : In order to find the nearest neighbours of the
active user, the similarity between the active user and each cluster centroid is
calculated based on a distance measure. Then the cluster with the highest similarity
is selected.
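Steps 5 and 6, together with the final "most frequent items" recommendation, can be sketched without any external library. This is a minimal sketch under stated assumptions: the two numeric user features, the purchase lists, the choice of the first k points as initial centroids and the plain frequency count (no support or confidence thresholds) are all illustrative, not the method's prescribed settings.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as initial centroids (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


def recommend(active_features, active_items, points, baskets, k=2, top_n=2):
    """Recommend the most frequent items in the cluster nearest to the active user."""
    centroids, labels = kmeans(points, k)
    nearest = min(range(k), key=lambda c: sq_dist(active_features, centroids[c]))
    counts = Counter(
        item
        for basket, label in zip(baskets, labels)
        if label == nearest
        for item in basket
        if item not in active_items  # do not recommend what the user already has
    )
    return [item for item, _ in counts.most_common(top_n)]


# Hypothetical filtered users: (age, purchases per month) and what they bought.
points = [(20, 1), (50, 9), (22, 2), (52, 8), (21, 1), (51, 9)]
baskets = [
    ["phone case", "earbuds"],
    ["blender", "kettle"],
    ["earbuds", "charger"],
    ["kettle", "toaster"],
    ["earbuds", "phone case"],
    ["kettle", "blender"],
]
print(recommend((23, 2), {"charger"}, points, baskets))
```

In practice the features would usually be scaled before clustering, since k-means distances are dominated by whichever column has the largest range.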
Clickstream Mining
A clickstream is a record of a user's activity on the internet, including every
website and every page of every website that the user visits, how long the user was
on a page or site, in what order the pages were visited, any newsgroups that the user
participates in and even the email addresses of mail that the user sends and receives.
Both ISPs and individual websites are capable of tracking a user's clickstream.
Clickstream data is becoming increasingly valuable to internet marketers and
advertisers. A clickstream generates a very large amount of data.
The volume of these 'footprints' that visitors leave at a site has grown wildly: large
businesses may gather a terabyte of it every day. But the ability to analyze such data
hasn't kept pace with the ability to capture it.
The next frontier of web data analysis is better integration of clickstream data with
other customer information such as purchase history and even demographic profiles,
to form what's often called a "360-degree view" of a site visitor.
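Two of the quantities mentioned above, how long a user stayed on a page and the order in which pages were visited, can be derived from raw click records. The following is a small sketch; the record layout `(user_id, timestamp_seconds, page)` and the sample data are hypothetical, and the time on the last page of a visit is left out because no later click bounds it.

```python
from collections import defaultdict

# Hypothetical clickstream records: (user_id, timestamp_seconds, page).
CLICKS = [
    ("u1", 0, "/home"),
    ("u1", 40, "/products"),
    ("u1", 100, "/cart"),
    ("u2", 10, "/home"),
    ("u2", 25, "/contact"),
]


def time_on_page(clicks):
    """Total seconds per page, using the gap to the same user's next click."""
    by_user = defaultdict(list)
    for user, ts, page in clicks:
        by_user[user].append((ts, page))
    totals = defaultdict(int)
    for events in by_user.values():
        events.sort()
        for (ts, page), (next_ts, _) in zip(events, events[1:]):
            totals[page] += next_ts - ts
    return dict(totals)


def paths(clicks):
    """Ordered list of pages visited by each user."""
    by_user = defaultdict(list)
    for user, _, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        by_user[user].append(page)
    return dict(by_user)


print(time_on_page(CLICKS))
print(paths(CLICKS))
```

Joining these per-user results with purchase history or profile data is one way to approach the integrated view described above.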
(b) E-business feedback : The e-business analysis cycle is more sophisticated. This
process combines website activity with data from other sources, such as visitor profile
information, sales databases, and campaigns that include links to the website. It
provides higher-level information, more focused answers and information that can be
used to enhance ecommerce activities across the business as well as improving the
website.
Market Segmentation
Market segmentation is a marketing concept which divides the complete market into
smaller subsets of consumers with similar tastes, demands and
preferences. A market segment is a small unit within a large market, made up of
like-minded individuals. One market segment is totally distinct from another segment. A
market segment consists of individuals who think along the same lines and have similar
interests. The individuals from the same segment respond in a similar way to the
fluctuations in the market.
Retail Industry
Retail organizations thrive by providing quality products to customers in a convenient,
timely, and cost effective manner. Understanding emerging customer shopping
patterns can assist retailers in organizing their products, inventory, store layout, and
web presence in order to delight their customers, thereby increasing revenue and
profits. Retailers generate a lot of transaction and logistics data that can be used to
solve problems.
Optimize inventory levels at different locations : Retailers must carefully manage their
inventories. Carrying too much inventory incurs carrying costs, whereas carrying too
little inventory can result in stockouts and missed sales opportunities. Dynamic sales
trend prediction can assist retailers in moving inventory to where it is most in demand.
Online retailers can provide their suppliers with real-time information about their items'
sales, allowing the suppliers to deliver their product to the right locations and reduce
stockouts.
Improve store layout and sales promotions : Using a market basket analysis, you can
create predictive models of which products frequently sell together. This
understanding of product affinities can assist retailers in co-locating those products.
Alternatively, those affinity products could be placed further apart in order to force the
customer to walk the length and breadth of the store, exposing them to other
products. Promotional discounted product bundles can be created to promote a non-
selling item and also a group of products that sell well together.
Optimize logistics for seasonal effects : Seasonal products provide extremely profitable
short-term sales opportunities, but they also pose the risk of unsold inventories at the
end of the season. Understanding which products are in season in which markets can
assist retailers in dynamically managing prices to ensure inventory is sold during the
season. If it is raining in a specific area, inventory of umbrellas and ponchos could be
quickly moved there from non-rainy areas to help increase sales.
Reduce losses due to limited shelf life : Perishable goods present difficulties in disposing
of inventory on time. Tracking sales trends allows perishable products that are at risk
of not selling before their sell-by date to be appropriately discounted and promoted.
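The market basket analysis mentioned under store layout and sales promotions can be sketched as a count of item pairs that appear together in transactions. This is a simplified illustration restricted to pairs; the transactions and the minimum support value are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale transactions (one set of items per basket).
TRANSACTIONS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]


def pair_rules(transactions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        # Confidence of a -> b is the share of baskets with a that also contain b.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return sorted(rules)


for rule in pair_rules(TRANSACTIONS):
    print(rule)
```

Here every basket containing butter also contains bread, which is the kind of product affinity a retailer might use for co-location or bundling.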
Telecommunication Industry
BI in telecom can help with churn management, marketing/customer profiling, network
failure, and fraud detection.
Telecom companies must provide a consistent and data-driven method for predicting
the risk of customer switching and then making an operational decision in real time
while the customer call is in progress. A decision-tree or a neural network based
system can
be used to guide the customer-service call operator to make the right decisions for the
company, in a consistent manner.
(2) Marketing and product creation : In addition to customer data, telecom companies
also store call detail records (CDRs), which precisely describe the calling behavior of
each customer. This unique data can be used to profile customers and then can
be used for creating new products/services bundles for marketing purposes. An
American telecom company, MCI, created a program called Friends & Family that
allowed calls with one's friends and family on that network to be totally free and thus,
effectively locked many people into their network.
(5) Fraud control : There are numerous types of fraud in consumer transactions. When a
customer opens an account with the intent of never paying for the services, this is
referred to as subscription fraud. Superimposed fraud is defined as unauthorised
activity by someone other than the legitimate account holder. Decision rules can be
developed to analyze each CDR in real time to identify chances of fraud and take
effective action.
Banking
Banks make loans and offer credit cards to millions of customers. They are most
concerned with improving loan quality and reducing bad debts. They also want to keep
more of their current customers and sell them more services.
(1) Automate the loan application process : Decision models that predict the likelihood of
a loan's success can be generated from historical data. They can be integrated into
business processes to automate the loan application process.
(2) Detect fraudulent transactions : Every day, billions of financial transactions take place
around the world. Exception-seeking models detect fraudulent transaction patterns.
For example, if money is transferred for the first time to an unrelated account, it could
be a fraudulent transaction.
(3) Increase customer value (cross-selling, upselling) : Selling more products and services
to existing customers is frequently the simplest way to increase revenue. A checking
account customer in good standing may be offered better terms on home, auto, or
educational loans than other customers, thus, increasing the value generated by that
customer.
(4) Optimize cash reserves through forecasting : Banks must maintain a certain level of
liquidity in order to meet the needs of depositors who may wish to withdraw funds.
Banks can forecast how much to keep and invest the rest to earn interest by using
historical data and trend analysis.
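As a rough illustration of forecasting from historical data, the sketch below uses a moving average of recent daily withdrawals plus a safety buffer. The moving average is a stand-in chosen for simplicity, not necessarily what a bank would use, and the figures, window size and buffer are hypothetical.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def reserve_target(history, window=3, buffer=0.25):
    """Cash to hold: the forecast withdrawal plus a proportional safety buffer."""
    return moving_average_forecast(history, window) * (1 + buffer)


# Hypothetical daily withdrawals, in thousands.
withdrawals = [100.0, 120.0, 110.0, 130.0, 120.0]

forecast = moving_average_forecast(withdrawals)
target = reserve_target(withdrawals)
print(forecast, target)
```

Funds above `target` could then be invested; a real model would also account for seasonality, such as month-end or holiday peaks.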
Finance
Stock brokerages make extensive use of Business Intelligence (BI) systems. Access to
accurate and timely information can mean the difference between making or losing a
fortune.
Predict changes in bond and stock prices : Forecasting the price of stocks and bonds is a
favorite pastime of financial experts as well as lay people. Stock transaction data from
the past, along with other variables, can be used to predict future price patterns. This
can help traders develop long-term trading strategies.
Assess the impact of events on market movements : Decision trees can be used to create
decision models that assess the impact of events on changes in market volume and
prices. Monetary policy changes (such as a change in the Federal Reserve interest rate) or
geopolitical changes (such as a war in a particular region of the world) can be factored
into the predictive model to help take action with greater confidence and less risk.
Identify and prevent fraudulent activities in trading : There have unfortunately been many
cases of insider trading, leading to many prominent financial industry stalwarts going
to jail. Fraud detection models can identify and flag fraudulent activity patterns.
(1) Maximize the return on marketing campaigns : Data-driven analysis of customer pain
points can ensure that marketing messages are fine-tuned to better resonate with
customers.
(2) Improve customer retention (churn analysis) : Winning new customers is more difficult
and expensive than retaining existing customers. Scoring each customer based on
their likelihood to quit can assist businesses in developing effective interventions, such
as discounts or free services, to retain profitable customers in a cost-effective manner.
(3) Maximize customer value (cross-selling, upselling) : Every interaction with the
customer should be viewed as an opportunity to assess their current needs. Offering
new products and solutions to customers based on their presumed needs can help
increase revenue per customer. Even a customer complaint can be viewed as a chance
to impress the customer. Using the knowledge of the customer's history and value, the
business can choose to sell a premium service to the customer.
(4) Identify and delight highly valued customers : The best customers can be identified by
segmenting the customers. They can be proactively contacted and delighted with
enhanced attention and service. Loyalty programs can be more effectively managed.
(5) Manage brand image : A company can set up a listening post to monitor social media
conversations about itself. It can then perform sentiment analysis on the text in order
to understand the nature of the comments and respond appropriately to prospects and
customers.
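The sentiment analysis step in brand monitoring can be illustrated with a tiny word-list scorer. This is only a sketch: the two word lists are hypothetical and far too small for real use, and production systems typically rely on larger lexicons or trained models.

```python
import re

# Tiny hypothetical word lists, for illustration only.
POSITIVE = {"great", "love", "good", "fast", "helpful"}
NEGATIVE = {"bad", "slow", "broken", "hate", "rude"}


def sentiment(text):
    """Label a comment by counting positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical social media comments about a company.
comments = [
    "Love the new app, support was fast and helpful",
    "Delivery was slow and the box arrived broken",
    "Order arrived today",
]
for comment in comments:
    print(sentiment(comment), "-", comment)
```

Comments labelled negative could be routed to a support team so that the company can respond appropriately.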
The major objective of watching or reading news has always been to stay informed about
what is happening around us. There are several social media platforms in the current
era, like Facebook, Twitter and so forth, on which millions of users rely to learn
about day-to-day happenings. Then came fake news, which spreads among people
as fast as real news does. Fake news is a piece of fabricated or falsified
information, often aimed at misleading people or damaging the reputation of a person
or an entity.
(1) Data Collection : The process of gathering information from all available sources
relevant to a particular research problem. This information is stored in a file as the
dataset and is later subjected to processes such as testing and evaluation.
(2) Data Cleaning : Identification and removal of any errors in the gathered
information. This process is carried out mainly to improve the dataset's quality, make it
reliable, and support accurate decision making.
(3) Exploratory Data Analysis : Various visualization techniques are applied here to
understand the dataset in terms of its characteristics, namely size, quantity, etc. This
process is essential to better understand the nature of the dataset and get insights
faster.
(4) Data Modelling : The process of training one or more ML algorithms on the dataset
and tuning the resulting model to the business need, so that it can predict or validate
outcomes accordingly.
(5) Data Validation : The model's hyperparameters are tuned on a held-out validation
set before the model is tested. This provides an unbiased evaluation of a model fitted
on the training dataset while its hyperparameters are being tuned.
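The five steps above can be sketched end to end in Python. This is only a minimal
illustration under stated assumptions: the headlines in RAW are a made-up toy dataset,
the keyword model in fit() is a hypothetical stand-in for a real ML algorithm, and
min_count is an assumed hyperparameter chosen on the validation split.

```python
import random
from collections import Counter

# Hypothetical toy dataset: (headline, label) pairs standing in for collected data.
RAW = [
    ("Shocking miracle cure doctors hate", "fake"),
    ("Shocking  miracle cure doctors hate", "fake"),  # duplicate after normalisation
    ("", "real"),                                     # empty record
    ("Council approves new budget", "real"),
    ("Miracle diet melts fat overnight", "fake"),
    ("Central bank holds interest rates", "real"),
    ("Shocking secret celebrities hide", "fake"),
    ("City opens new public library", "real"),
    ("Miracle gadget doubles phone battery", "fake"),
    ("University publishes annual report", "real"),
]

def clean(rows):
    """Data cleaning: normalise case and whitespace, drop empty and duplicate texts."""
    seen, out = set(), []
    for text, label in rows:
        norm = " ".join(text.lower().split())
        if norm and norm not in seen:
            seen.add(norm)
            out.append((norm, label))
    return out

def explore(rows):
    """Exploratory analysis: dataset size and class balance."""
    return {"size": len(rows), "labels": Counter(label for _, label in rows)}

def split(rows, train=0.6, val=0.2, seed=0):
    """Shuffle reproducibly, then cut into train / validation / test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    a = int(len(rows) * train)
    b = a + int(len(rows) * val)
    return rows[:a], rows[a:b], rows[b:]

def fit(train_rows, min_count):
    """Modelling: keep words seen >= min_count times in fake texts, never in real ones."""
    fake = Counter(w for t, y in train_rows if y == "fake" for w in t.split())
    real = {w for t, y in train_rows if y == "real" for w in t.split()}
    return {w for w, c in fake.items() if c >= min_count and w not in real}

def predict(model, text):
    """Label a text as fake if it contains any word the model associates with fake news."""
    return "fake" if any(w in model for w in text.split()) else "real"

def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    if not rows:
        return 0.0
    return sum(predict(model, t) == y for t, y in rows) / len(rows)

def tune(train_rows, val_rows, candidates=(1, 2, 3)):
    """Validation: choose the hyperparameter that scores best on the validation set."""
    return max(candidates, key=lambda k: accuracy(fit(train_rows, k), val_rows))

rows = clean(RAW)
train_rows, val_rows, test_rows = split(rows)
best_k = tune(train_rows, val_rows)
model = fit(train_rows, best_k)
print(explore(rows), best_k, accuracy(model, test_rows))
```

The test split is used only once, after the hyperparameter has been chosen on the
validation split, so the final accuracy is not influenced by the tuning step.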
Cyberbullying
Social networking sites (SNS) have grown rapidly in recent years and provide a
platform for people all over the world to connect and share their interests. However,
social networking sites also provide opportunities for cyberbullying. Cyberbullying is
harassing or insulting a person by sending hurtful or threatening messages through
electronic communication, and it poses a significant threat to the physical and mental
health of the victims. Detecting cyberbullying and then taking preventive measures are
the main courses of action to combat it. The detection method can identify the presence
of cyberbullying terms and classify cyberbullying activities in a social network, such as
Flaming, Harassment, Racism and Terrorism, using fuzzy logic and a genetic algorithm.
The input dataset contains text, images, audio and video collected from social
networks. The input data is first sent to data pre-processing, which improves its
quality. Social network data is mostly noisy and contains a lot of unwanted content, so
pre-processing is applied to improve the accuracy of the input data. This includes
removing stop words and symbols. Stop words are common words such as "a", "as", "have",
"is", "the", "or", etc. Stop words mainly consume memory space and increase the
processing time. After pre-processing, the data is sent to the cyberbully detection
module, which detects cyberbullying content. The cyberbully detection techniques are
explained below:
(a) Image Cyberbully detection : Cyberbullying through images is now widespread and
has a large effect on society. Such images spread very rapidly through social networks.
Anti-social elements can create further tension by spreading communalism through
images. A cyberbullying image can be detected using computer vision algorithms, which
include two methods: image similarity and Optical Character Recognition (OCR).
(b) Video Cyberbully detection : Video cyberbullying also causes emotional and
psychological harm. A cyberbullying video is detected using a shot boundary detection
algorithm. Here, the video is broken into scenes, shots and frames. A shot is a
sequence of frames captured by a single camera in a single continuous action. The
content of the video is then analysed using shot boundary detection algorithms such as
pixel-based, histogram-based and block-based shot boundary detection.
(c) Audio Cyberbully detection : Audio is another area where a large share of
cyberbullying occurs. Here, the audio is converted into text using the CMU Sphinx tool.
Cyberbullying is then detected in the converted text using a classifier trained on a
dataset.
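The histogram-based shot boundary detection mentioned under (b) can be illustrated with
a short Python sketch. It assumes each frame is already available as a flat list of
grayscale pixel values (0-255); the bin count, the threshold of 0.5 and the function
names are illustrative choices, not values taken from a specific system.

```python
def histogram(frame, bins=8):
    """Count pixels per intensity bin for a frame given as a flat list of 0-255 values."""
    counts = [0] * bins
    for p in frame:
        counts[min(p * bins // 256, bins - 1)] += 1
    return counts

def hist_difference(h1, h2):
    """Normalised L1 distance between two histograms, in the range [0, 1]."""
    total = sum(h1) + sum(h2)
    return sum(abs(a - b) for a, b in zip(h1, h2)) / total if total else 0.0

def shot_boundaries(frames, threshold=0.5, bins=8):
    """Return every index i at which frame i starts a new shot."""
    hists = [histogram(f, bins) for f in frames]
    return [i for i in range(1, len(hists))
            if hist_difference(hists[i - 1], hists[i]) > threshold]

# Three dark frames followed by two bright frames: one cut, at index 3.
dark = [10, 20, 15, 12]
bright = [240, 250, 245, 230]
frames = [dark, dark, [12, 18, 14, 11], bright, bright]
print(shot_boundaries(frames))  # [3]
```

Comparing histograms rather than individual pixels makes the check less sensitive to
small camera or object movement within a shot, which is the usual motivation for the
histogram-based variant over the pixel-based one.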
Finally, the cyberbullying content is classified into Physical bullying, Social bullying
and Verbal bullying using a Naïve Bayes classifier. The Naïve Bayes classifier is based
on Bayes' theorem, with the assumption that the predictors are independent of one
another given the class.
(a) Social bullying : Social bullying involves spreading rumours about a person or
purposely embarrassing a person in public with the intent to hurt his or her feelings.
Another form of bullying that falls into this category involves encouraging others to
avoid a certain person or group. Social bullying affects a person and their ability to
relate to their environment as well as other people in a social setting. Not only does it
have a direct impact on a person's mental and emotional state, it can also adversely
affect their reputation in both personal and professional circles.
(b) Verbal bullying : Verbal bullying is one of the most commonly used forms of
bullying. Criticizing and making fun of others are forms of verbal bullying. In verbal
bullying, the main weapon the bully uses is their voice. Verbal bullying is defined as a
negative, defining statement told to the victim or about the victim, which treats the
target as if they did not exist. If the abuser does not promptly apologise and retract
the statement, the relationship is considered verbally abusive within the network. This
can cause psychological disorders that affect the victim into and throughout adulthood.
(c) Physical bullying : Physical bullying is bullying in which a person is physically
hurt or their personal possessions are harmed. Common forms of physical bullying
include stealing, shoving, hitting, pushing, slapping, spitting and destroying property.
Physical bullying is rarely the first form of bullying that a victim will experience.
Often bullying begins in a different form and progresses to physical violence. In
physical bullying, the main weapon the bully uses is their body.
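The Naïve Bayes classification step described above can be sketched in Python as a small
multinomial classifier with add-one (Laplace) smoothing. The training sentences and the
class name NaiveBayes are hypothetical, made up purely for illustration; a real system
would be trained on a much larger labelled dataset.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log P(class) + sum of log P(word | class); words are treated as independent
            score = math.log(n_docs / total_docs)
            for w in doc.lower().split():
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training examples for the three classes named in the text.
docs = [
    "he pushed and hit me",
    "they slapped him and broke his phone",
    "she spread rumours about me",
    "they told everyone to avoid her",
    "you are stupid and ugly",
    "he keeps calling me names",
]
labels = ["physical", "physical", "social", "social", "verbal", "verbal"]

clf = NaiveBayes().fit(docs, labels)
print(clf.predict("they spread rumours"))    # social
print(clf.predict("he hit and pushed him"))  # physical
```

Working with log-probabilities avoids numerical underflow when many small word
probabilities are multiplied together.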
Sentiment Analysis
Preprocessing : We are interested in the features of an object. For this, the input data
is preprocessed using the following steps:
(a) Tokenization : White spaces, special characters and symbols are removed; the
remaining words are called tokens.
(b) Removal of Stop Words : Articles and common words like "a, an, the, this, that, am,
is", etc. are removed.
(c) Stemming : Reduces the tokens or words to their root form.
(d) Case Normalization : It converts the whole document to either lower case or upper
case letters.
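The four preprocessing steps above can be sketched in Python as follows. This is a
minimal illustration: the stop-word list is a small assumed sample, and stem() is a
deliberately crude suffix stripper standing in for a proper stemmer such as the Porter
stemmer. Case normalization is applied before stop-word removal here so that "This" and
"this" are matched alike.

```python
import re

# Assumed sample list; real stop-word lists are much longer.
STOP_WORDS = {"a", "an", "the", "this", "that", "am", "is", "as", "have", "or"}

def tokenize(text):
    """Keep runs of letters and digits; white space and symbols are dropped."""
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize_case(tokens):
    """Convert every token to lower case."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Crude suffix stripping that keeps at least three characters of the word."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, normalize case, remove stop words, then stem what remains."""
    tokens = normalize_case(tokenize(text))
    return [stem(t) for t in remove_stop_words(tokens)]

print(preprocess("This is the BEST phone, the camera works amazingly!!"))
# ['best', 'phone', 'camera', 'work', 'amazingly']
```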
(a) Feature Types : It deals with identifying the types of features used for sentiment
analysis, viz. term frequency, term co-occurrence, sentiment words, negation and
syntactic dependency.
(b) Feature Selection : It deals with finding good features for sentiment
classification using measures viz. information gain, odds ratio, document frequency and
mutual information.
(c) Feature Weighting Mechanism : It calculates weight for ranking the features using
term frequency and inverse document frequency.
(d) Feature Reduction : The dimensionality of features is reduced for better
performance.
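The feature weighting in (c) is commonly computed as tf-idf(t, d) = tf(t, d) x
log(N / df(t)), where tf(t, d) is the relative frequency of term t in document d, N is
the number of documents and df(t) is the number of documents containing t. The Python
sketch below uses this common variant; the function name tf_idf and the sample documents
are assumptions for illustration, and other tf-idf variants exist.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return weights

docs = [
    ["good", "phone", "good", "camera"],
    ["bad", "phone"],
    ["good", "battery"],
]
for weights in tf_idf(docs):
    # Rank each document's terms from highest to lowest weight.
    print(sorted(weights, key=weights.get, reverse=True))
```

In the first document "camera" outranks "good" even though "good" occurs twice, because
"camera" appears in only one document and therefore has a higher inverse document
frequency.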
A posted opinion is classified as a positive, negative or neutral opinion.
The three levels of sentiment analysis are as follows.
(a) Document level : The whole document is considered when classifying the opinion as
positive, negative or neutral. The opinion about an object may be expressed without
using any opinion word. In this case natural language processing plays a vital role in
mining the correct sentiments. The main challenge is to extract subjective text for
inferring the overall sentiment of the whole document.
(b) Sentence level : The documents in collection are divided into sentences and then
the sentences are classified as per positive, negative or neutral polarity. A document is
a combination of subjective and objective sentences. First the subjective sentences
are determined and then the opinion in those subjective sentences will be calculated.
The sentence level polarity identification can be done in either of the two ways: a
grammatical syntactic approach or a semantic approach. The grammatical syntactic
approach takes grammatical structure of the sentence into account by considering
parts of speech tags.
(c) Word or phrase level : When a product feature is considered for sentiment analysis,
it is word or phrase level sentiment analysis. It uses adjectives and adverbs as
features. Word level sentiment can be obtained by the 'Dictionary Based Approach' or the
'Corpus Based Approach'.
(i) Dictionary based approach : Sometimes the opinion is not expressed by a popular
keyword. Jargon may be used to express the sentiments. Here, WordNet, which contains
synonyms and antonyms, is used for finding out the polarity of a word.
(ii) Corpus based approach: In this method, the co-occurrence of a word with another
word whose polarity is known is taken into account. Adjectives joined by 'and' have the
same polarity, and adjectives joined by 'but' have opposite polarity.
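The 'and' / 'but' rule of the corpus based approach can be sketched in Python as follows.
This is a simplified illustration under stated assumptions: the seed lexicon and the
sentences are made up, polarity is encoded as +1 or -1, and any word next to the
conjunction is treated as a candidate, whereas a real system would first use
part-of-speech tags to keep only adjectives.

```python
import re

def propagate_polarity(sentences, seeds):
    """Infer the polarity of unknown words from 'and' / 'but' links to known words."""
    lexicon = dict(seeds)
    pattern = re.compile(r"\b(\w+) (and|but) (\w+)\b")
    changed = True
    while changed:  # repeat until no new word can be labelled
        changed = False
        for sentence in sentences:
            for left, conj, right in pattern.findall(sentence.lower()):
                sign = 1 if conj == "and" else -1  # 'and' keeps polarity, 'but' flips it
                for known, unknown in ((left, right), (right, left)):
                    if known in lexicon and unknown not in lexicon:
                        lexicon[unknown] = sign * lexicon[known]
                        changed = True
    return lexicon

seeds = {"good": 1, "bad": -1}
sentences = [
    "The screen is good and sharp",
    "The speaker is sharp but tinny",
    "Battery life is bad and unreliable",
]
lexicon = propagate_polarity(sentences, seeds)
print(lexicon)
# {'good': 1, 'bad': -1, 'sharp': 1, 'tinny': -1, 'unreliable': -1}
```

Note that "tinny" is labelled only because "sharp" was labelled first, which is why the
function keeps looping until the lexicon stops growing.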
Finally, the sentiments are classified using machine learning approaches like SVM,
Naïve Bayes, Decision Tree, Rule Based Classifier and lexicon based approaches like
dictionary based and corpus based approach.