R PROGRAMMING
Aug 3, 2020 Department of Computer Science and Engineering
Introduction to R
What is R?
R is a programming language which provides an environment for statistical
computing, data science and graphics.
It was inspired by, and is mostly compatible with, the statistical language S
developed at Bell Laboratories (formerly AT&T, now Lucent Technologies).
Although there are some very important differences between R and S, much of the
code written for S runs unaltered on R.
R has become one of the most popular tools for computational statistics,
visualization and data science.
Why R?
R has opened tremendous scope for statistical computing and data analysis.
It provides techniques for various statistical analyses like classical tests and
classification, time series analysis, clustering, linear and non-linear modeling
and graphical operations.
The techniques supported by R are highly extensible.
S is the pioneer of statistical computing; however, it is a proprietary solution
and is not readily available to developers. In contrast, R is freely available under
the GNU General Public License, which helps the developer community in research and development.
Another reason behind the popularity and widespread use of R is its superior
support for graphics.
It can produce well-designed, high-quality plots from data analysis.
The plots can contain mathematical formulae and symbols, if necessary, and users
have full control over the selection and use of symbols in the graphics. Hence, besides
robustness, user experience and user-friendliness are two key aspects of R.
Why Learn R?
The following points describe when learning R is worthwhile:
If you need to run statistical calculations in your application, learn and deploy R. It easily
integrates with programming languages such as Java, C++, Python and Ruby.
If you wish to perform a quick analysis for making sense of data.
If you are working on an optimization problem.
If you need to use re-usable libraries to solve a complex problem, leverage the thousands of free
packages available for R.
If you wish to create compelling charts.
If you aspire to be a Data Scientist.
If you want to have fun with statistics.
R is free. It is available under the terms of the Free Software Foundation’s GNU General Public License
in source code form.
It is available for Windows, Mac and a wide variety of Unix platforms (including FreeBSD, Linux, etc.).
In addition to enabling statistical operations, it is a general programming language so that you can
automate your analyses and create new functions.
R has excellent tools for creating graphics such as bar charts, scatter plots, multipanel lattice charts, etc.
It has an object oriented and functional programming structure along with support from a robust and
vibrant community.
R has a flexible analysis tool kit, which makes it easy to access data in various formats, manipulate it
(transform, merge, aggregate, etc.), and subject it to traditional and modern statistical models (such as
regression, ANOVA, tree models, etc.)
R can be extended easily via packages. It interfaces easily with other programming languages. Existing
software as well as emerging software can be integrated with R packages to make them more productive.
R can easily import data from MS Excel, MS Access, MySQL, SQLite, Oracle, etc. It can easily connect to
databases using ODBC (Open Database Connectivity) and the ROracle package.
Advantages of R Over Other Programming Languages
Python relies on third-party libraries for data visualization and statistical
computing, whereas R provides most of this functionality out of the box.
For example, R has the lm() function for linear regression analysis. Data can be passed
directly to the function, and the function returns an object with detailed information
about the regression.
That object also gives access to the standard errors, coefficients, residual
values and so on.
Python has no built-in function called lm(); comparable regression output is usually obtained
through third-party libraries such as SciPy, NumPy and so on. Hence, R can do the same
thing with a single line of code instead of taking support from third-party libraries.
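As a minimal sketch of the lm() workflow described above, the following fits a regression on the built-in mtcars dataset. The choice of mpg and wt as variables, and the names fit and fit_summary, are assumptions made for illustration only.

```r
# Fit a simple linear regression on the built-in mtcars data:
# miles per gallon (mpg) explained by car weight (wt).
fit <- lm(mpg ~ wt, data = mtcars)

# The fitted object carries coefficients, residuals and more.
coef(fit)                     # intercept and slope
head(residuals(fit), 3)       # first three residuals

# summary() adds standard errors, t-values and R-squared.
fit_summary <- summary(fit)
fit_summary$coefficients
fit_summary$r.squared
```

The model itself is the single lm() line; the remaining calls only inspect the returned object.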
R Studio
RStudio is the most widely used IDE for writing, testing and executing R code.
It is a user-friendly and open source solution. A typical RStudio screen has
the following parts:
Console, where users write a command and see the output
Workspace tab, where users can see active objects from the code written in the console
History tab, which shows a history of commands used in the code
File tab, where folders and files can be seen in the default workspace
Plot tab, which shows graphs
Packages tab, which shows add-ons and packages required for running specific processes
Help tab, which contains information on the IDE, commands, etc.
Handling Packages In R
A package in R is the fundamental unit of shareable code. It is a collection of the following
elements:
Functions
Data sets
Compiled code
Documentation for the package and for the functions inside
Tests: a few tests to check that everything works as it should.
The directory where packages are stored is called a library. R comes with a standard
set of packages.
R is an open source language; thus, new packages are being developed and updated
by developers daily. Some of these packages may not work properly or may have bugs.
Hence, it is not a good idea to use every new and updated package on R development
environment. This can affect the stability of the development environment.
A stable environment requires sandboxing to test a new package, or an update to a
package, before installing it in the development environment. Sandboxing is a security
mechanism often used to execute untested or untrusted programs or code from unverified or
untrusted third parties, users, etc., without damaging the host machine,
operating system or production environment.
Users can change the library path to install a package in a location
other than the default package library.
The command .libPaths() can be used to get or set the path of the package library.
Example
> .libPaths()
Output: C:/R/R-3.1.3/library
This is the default package library location. The following command will change it
to another path:
Example
> .libPaths("~/R/win-library/3.1-mran-2016-07-02")
Output: C:/Users/User1/Documents/R/win-library/3.1-mran-2016-07-02
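One way to try out a separate library location without touching the default one is sketched below. The scratch directory name is hypothetical, and the sketch only changes the search path for the current session.

```r
# Inspect the library search path; the first entry is where
# install.packages() puts new packages by default.
current_paths <- .libPaths()
print(current_paths)

# Add a scratch library in a temporary directory (hypothetical location).
scratch_lib <- file.path(tempdir(), "scratch-library")
dir.create(scratch_lib, showWarnings = FALSE)
.libPaths(c(scratch_lib, current_paths))
print(.libPaths()[1])

# Restore the original search path afterwards.
.libPaths(current_paths)
```

Packages installed while the scratch library is first in the path stay out of the default library, which is one simple form of the sandboxing idea mentioned earlier.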
R can be extended easily with the help of a rich set of packages. There are more than
10,000 packages available for R. These packages are used for different purposes.
Installing an R Package
R comes with some standard packages that are installed when a user first installs R and
additional packages can be installed separately.
Users need to navigate through the package library and install a package in the desired
location.
The following steps and commands are used for navigating through the R package library and
installing an R package.
Once you have started R, you can install an R package (e.g. the “ggplot2” package) by
choosing “Install package(s)” from the “Packages” menu at the top of the R console
(Tools > Packages).
This will ask you for the website that you wish to download the package from.
You can choose “India” (or another country, if you prefer). It will also bring up a list of
available packages that you can install, and you can choose the package that you want to
install from that list (e.g. “ggplot2”).
This will install the “ggplot2” package.
Whenever you want to use the “ggplot2” package after this, you first have to load it,
once R has started, by typing into the R console: library("ggplot2").
You can get help on a package by typing the following at the R prompt:
help(package = "ggplot2")
Few Commands to Get Started
installed.packages()
A user can check for all installed packages on the machine by using the
installed.packages() function.
remove.packages() can be used to uninstall a package.
Ex: remove.packages("ggplot2")
packageDescription(): the “DESCRIPTION” file has the basic information about a
package.
To access the description file inside R, use the function:
Ex: packageDescription("ggplot2")
The same can also be accessed via the documentation of the package by using
help(package = "packagename").
Ex: help(package = "datasets")
The above will provide an overview of all functions and datasets inside the package
“datasets”. One of the datasets available in the “datasets” package is “AirPassengers”.
To access the dataset “AirPassengers” inside the “datasets” package, use the code
given below:
datasets::AirPassengers
If the package will be used frequently, it is worthwhile to load it into memory.
This can be achieved using the library function:
library(datasets)
Note: library() accepts the package name with or without enclosing quotes.
The find.package() and install.packages() commands find and install specific R
package(s).
Example
To install a single package, the commands are:
> find.package("ggplot2")
> install.packages("ggplot2")
Output
The first command checks whether a package named “ggplot2” is
installed on the system: it returns the package path, or raises an error if the package
is not found. Then the install.packages() function installs the
package named “ggplot2”.
Example:
To install more than one package at a time, the install.packages() command
has the following format:
> install.packages(c("ggplot2", "tidyr", "dplyr"))
Output
It will install the packages ggplot2, tidyr and dplyr.
To install a package only when it is missing, wrap the call in an ‘if’ condition.
The check for whether the package “ggplot2” is installed
can be done by using require():
> if (!require("ggplot2")) {install.packages("ggplot2")}
find.package()
Purpose: Locates installed R packages on your system.
require()
Purpose: Loads and attaches a package in your R session (if it is installed).
It attempts to load a package into your R environment. If the package is not installed, it returns FALSE
with a warning (but does not throw an error).
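The install-if-missing pattern can be wrapped in a small helper, sketched below. The function name ensure_package is hypothetical, and the sketch uses requireNamespace(), which checks for a package without attaching it; "stats" is used in the demonstration because it ships with R, so no download is triggered.

```r
# Hypothetical helper: load a package, installing it first only if missing.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)     # needs internet access and a CRAN mirror
  }
  # library() needs character.only = TRUE when the name is in a variable.
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so this call never triggers an installation.
ensure_package("stats")
"package:stats" %in% search()
```

Calling ensure_package("ggplot2") would follow the same path, installing the package only when it is not already present.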
vignette()
Vignettes are a very useful source of help with packages. They are provided by the
package authors to demonstrate and highlight a few functionalities of their package in
detail.
Use the browseVignettes() function to get a list of all vignettes available with your installed
packages.
> browseVignettes()
To view all vignettes for a specific package, e.g., “ggplot2”, use the vignette() function.
> vignette(package = "ggplot2")
Vignettes in package ‘ggplot2’:
ggplot2-specs Aesthetic specifications (source, html)
extending-ggplot2 Extending ggplot2 (source, html)
browseVignettes("ggplot2") shows the same list in a web browser.
Getting Started with R
Data exploration in R is an approach to summarize and visualize important
characteristics of a data set.
An exploratory data analysis focuses on understanding the underlying variables and
data structures to see how they can help in data analysis through various formal statistical
methods.
Working with directory
Before writing a program or code using R, it is important to find out the directory
being used.
This can be done using the getwd() function.
If the current working directory is not as per preference, it can be changed using
the setwd() function.
The dir() or list.files() functions give information about the files and
directories in the current working directory or any other directory.
getwd() Command
The getwd() command returns the absolute file path of the current working
directory. This function has no arguments.
Example
> getwd()
setwd() Command: resets the current working directory to another location as
per the user’s preference.
Example
> setwd("C:/path/to/my_directory")
dir() Function
This is equivalent to the list.files() function. It returns a character vector of
the names of files or directories in the named directory.
Syntax
dir(path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE,
    recursive = FALSE, ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
or
list.files(path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE,
    recursive = FALSE, ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
> dir()
character(0)
> list.files()
character(0)
The above output implies that there are no files or directories in the current
directory.
Example 1:
To display the files and directories in the current directory, use path = "." as an argument
to dir().
> dir(path = ".")
Example 2:
To display the list of all files and directories in a specific path, use the command as
follows:
> dir(path = "C:/Users/R-program")
Example 3:
To display the complete or absolute path of all files and directories in the specified path,
use dir() as follows:
> dir(path = "C:/Users/R-program", full.names = TRUE)
Example 4:
To look for a specific pattern, e.g. file/directory names beginning with a “D”, use
the dir() command with a pattern = "^D" argument.
> dir(path="C:/Users/Seema_acharya", pattern="^D")
[1] "Desktop" "Documents" "Downloads"
Example 5:
To display a recursive list of files or directories in the specified path, use the dir()
command as follows:
> dir(path="d:/data")
[1] "db"
> dir(path="d:/data", recursive=TRUE, include.dirs=TRUE)
[1] "db" "db/Demo.0" "db/[Link]" "db/local.0" "db/[Link]"
"db/[Link]" "db/MyDB.0" "db/[Link]"
The options or arguments used with dir() can also be used with list.files(). Try
it out and observe the output.
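To try these options without depending on a particular machine's folders, the sketch below builds a small throwaway directory under tempdir() and lists it in several ways. All file and directory names here are hypothetical.

```r
# Build a small throwaway directory tree (all names are hypothetical).
demo_dir <- file.path(tempdir(), "dir-demo")
dir.create(file.path(demo_dir, "Data"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, c("Demo.txt", "notes.txt")))
file.create(file.path(demo_dir, "Data", "values.csv"))

dir(demo_dir)                                   # top-level entries only
list.files(demo_dir, pattern = "^D")            # names starting with "D"
list.files(demo_dir, recursive = TRUE)          # files in sub-directories too
list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
```

Note that pattern is a regular expression, so the dot in ".txt" is escaped to match a literal dot.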
Data types in R
The most commonly used data types supported by R are:
Logical
Numeric
Integer
Double
Character
Complex
Raw
The class() function can be used to reveal the data type.
Logical: TRUE / T and FALSE / F are logical values.
Numeric
> 2
[1] 2
> class(2)
[1] "numeric"
Integer
The integer data type is a subclass of the numeric data type. Notice the use of "L" as a
suffix to a numeric value in order for it to be considered an "integer".
> 2L
[1] 2
> class(2L)
[1] "integer"
Functions such as is.numeric() and is.integer() can be used to test the data type.
> is.numeric(2)
[1] TRUE
> is.numeric(2L)
[1] TRUE
> is.integer(2)
[1] FALSE
> is.integer(2L)
[1] TRUE
Note: Integers are numeric but NOT all numbers are integers.
Double (for double precision floating point numbers): By default, numbers are of
“double” type unless an L is suffixed to the number, in which case it is
considered an integer.
> typeof (76.25)
[1] "double"
Complex
> 5 + 5i
[1] 5+5i
> class(5 + 5i)
[1] "complex"
Character
> "Data Science"
[1] "Data Science"
> class("Data Science")
[1] "character"
The is.character() function can be used to ascertain if a value is a character.
> is.character("Data Science")
[1] TRUE
Raw
> charToRaw("Hi") o/p: [1] 48 69
(The bytes are shown in hexadecimal: 48 is the hexadecimal form of 72, the ASCII value of "H",
and 69 is the hexadecimal form of 105, the ASCII value of "i".)
> class(charToRaw("Hi")) o/p: [1] "raw"
The typeof() function can also be used to check the data type (as shown).
> typeof(5 + 5i) o/p: [1] "complex"
> typeof(charToRaw("Hi")) o/p: [1] "raw"
> typeof("DataScience") o/p: [1] "character"
> typeof(2L) o/p: [1] "integer"
> typeof(76.25) o/p: [1] "double"
> typeof(2) o/p: [1] "double"
For the same values, class() gives:
> class(5 + 5i) o/p: [1] "complex"
> class(charToRaw("Hi")) o/p: [1] "raw"
> class("DataScience") o/p: [1] "character"
> class(2L) o/p: [1] "integer"
> class(76.25) o/p: [1] "numeric"
> class(2) o/p: [1] "numeric"
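The difference between class() and typeof() is easiest to see side by side. The sketch below applies both to one sample value of each basic type; the names values and type_table are chosen only for this illustration.

```r
# Compare class() and typeof() for one value of each basic type.
values <- list(TRUE, 2, 2L, 76.25, 5 + 5i, "Data Science", charToRaw("Hi"))

type_table <- data.frame(
  class  = vapply(values, class,  character(1)),
  typeof = vapply(values, typeof, character(1)),
  stringsAsFactors = FALSE
)
print(type_table)
```

The two columns differ only for non-integer numbers, where class() reports "numeric" and typeof() reports the storage type "double".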
Coercion
Coercion helps to convert one data type to another, e.g. the logical value TRUE, when
converted to numeric, yields 1. Likewise, the logical value FALSE yields 0.
> as.numeric(TRUE)
[1] 1
> as.numeric(FALSE)
[1] 0
Numeric 5 can be converted to character "5" using as.character().
> as.character(5)
[1] "5"
Numeric 5.5 can be converted to integer 5 using as.integer(); the fractional part is dropped.
> as.integer(5.5)
[1] 5
On converting the characters "hi" to a numeric data type, as.numeric()
returns NA.
> as.numeric("hi")
[1] NA
Warning message:
NAs introduced by coercion
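A few further coercion cases are sketched below, including what happens when only some elements of a vector can be converted. The vector raw_input is hypothetical sample data, and suppressWarnings() is used only to keep the coercion warning out of the output.

```r
as.numeric(TRUE)            # logical to numeric
as.character(5)             # numeric to character
as.integer(5.9)             # fractional part is dropped, not rounded
as.logical(c("TRUE", "false", "yes"))   # unrecognised strings become NA

# Mixed input: values that cannot be parsed become NA (with a warning).
raw_input <- c("10", "20", "hi")
parsed <- suppressWarnings(as.numeric(raw_input))
parsed
is.na(parsed)

# Combining types in c() coerces to the most general type.
class(c(1L, 2.5))           # integer + double
class(c(1, "a"))            # numeric + character
```

Checking is.na() after a conversion is a simple way to find the entries that could not be coerced.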
Introducing Variables and ls() Function
Use the operator "<-" to assign a value to a variable.
Example:
> RectangleHeight <- 2
> RectangleHeight
[1] 2
Use the ls() function to list all the objects in the working environment.
> ls()
[1] "RectangleArea" "RectangleHeight" "RectangleWidth"
ls() is also useful for cleaning the environment before running code. Execute the rm()
function as shown to remove every object that ls() lists.
> rm(list=ls())
> ls()
character(0)
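Because rm(list = ls()) wipes the whole workspace, it can be safer to experiment in a separate environment first. The sketch below assumes a scratch environment and a RectangleWidth value of 4, both introduced only for illustration.

```r
# Work inside a separate environment so the real workspace is untouched.
scratch <- new.env()
scratch$RectangleHeight <- 2
scratch$RectangleWidth  <- 4
scratch$RectangleArea   <- scratch$RectangleHeight * scratch$RectangleWidth

ls(scratch)                              # all three objects, sorted by name

rm("RectangleArea", envir = scratch)     # remove one object by name
ls(scratch)

rm(list = ls(scratch), envir = scratch)  # remove everything that is left
ls(scratch)
```

The last two lines mirror rm(list = ls()), restricted to the scratch environment.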
Commands for data exploration
Commands such as summary(), str(), head(), tail(), View() and edit() are used to explore a dataset.
Load Internal Dataset:
There are various inbuilt datasets in R, e.g. AirPassengers, mtcars, BOD, etc. A list of datasets is available at
[Link]
Let us load the mtcars dataset from the datasets package with the following steps:
1. Check if the datasets package is already installed.
> installed.packages()
2. If already installed and will be used frequently, load the package.
> library(datasets)
3. Display the observations from the mtcars dataset.
> mtcars
summary() Command: reports summary statistics such as min, max, median, mean, etc., for each
variable present in the given data frame.
Example
> summary(mtcars)
str() Command: displays the internal structure of a data frame. It can be used as an
alternative to the summary function. It is a diagnostic function and roughly displays one
line per basic object.
Example 1:
> str(str)
function (object, ...)
> str(mtcars)
View() Command: displays the given dataset in a spreadsheet-like data viewer.
Example:
> View(mtcars)
Output: The output shows a tabular view of the content of the mtcars dataset.
head() Command: displays the first “n” observations from the given data frame.
The default value for n is 6. However, users can specify the value of “n” as per their
requirement as well.
Example:
>head(mtcars)
tail() Command: displays the last “n” observations from a given data frame. The
default value for n is 6. However, users can specify the value of “n” as per their requirement
as well.
Example
> tail(mtcars, n = 5)
ncol() Command: returns the number of columns in the given dataset.
Example:
> ncol(mtcars)
[1] 11
Output: The output shows the number of columns in the “mtcars” dataset.
nrow() Command: returns the number of rows in the given dataset.
Example:
> nrow(mtcars)
[1] 32
Output: The output shows the number of rows in the “mtcars” dataset.
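The size and preview commands are often used together as a first look at a dataset. A short sketch on mtcars follows; the values of n are arbitrary.

```r
dim(mtcars)                 # rows and columns in one call
nrow(mtcars) * ncol(mtcars) # total number of cells

head(mtcars, n = 3)         # first three rows
tail(mtcars, n = 2)         # last two rows
names(mtcars)               # column names
```

dim() returns the same two numbers as nrow() and ncol(), in that order.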
To read help on any command in R, the user can type “?” followed by the function
name on the console, e.g. ?nrow.
edit() Command:
helps with the dynamic editing or data manipulation of a dataset.
When this command is invoked, a dynamic data editor window opens with a tabular
view of the dataset.
Hereafter, the required changes to the dataset can be made.
Example
>edit(mtcars)
The modified dataset should be stored in a new variable, because edit() does not change the original object.
For example, it is good practice to call the edit() function as mtcars_new <- edit(mtcars).
fix() Command: saves the changes in the dataset itself, so there is no need to assign the
result to a variable.
Example
fix(mtcars)
View(mtcars)
data() Function: called without arguments, it lists the available datasets; called with a
dataset name, it loads that dataset.
Syntax
data()
Example:
> data(trees)
> plot(trees, col = "red", pch = 16, main = "scatter plot b/w variables of tree")
The plot() function creates various types of plots such as scatter plots, line graphs
and more.
plot(x, y, type, main, xlab, ylab, col, pch, ...)
x, y: Vectors or variables to be plotted.
type: Type of plot ("p" for points, "l" for lines, "b" for both, etc.).
main: Title of the plot.
xlab, ylab: Labels for the x and y axes.
col: Color of points/lines.
pch: Plotting symbol (e.g., pch=19 for filled circles).
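The arguments listed above can be combined as in the following sketch, which plots two mtcars columns. The titles, colour and output file are assumptions for illustration; the plot is written to a temporary PDF so that the code also runs where no screen device is available.

```r
# Draw into a temporary PDF so the sketch also runs without a display.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

plot(mtcars$wt, mtcars$mpg,
     type = "p",                       # points only
     main = "Fuel economy vs weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     col  = "blue",
     pch  = 19)                        # filled circles

dev.off()                              # close the device and write the file
file.exists(plot_file)
```

In an interactive session the pdf() and dev.off() lines can be dropped, and the plot appears in the Plot tab instead.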
save.image() Function:
writes an external representation of the R objects in the current workspace to the specified file.
At a later point in time, when it is required to read back the objects, one can use the
load or attach function.
Syntax
save.image(file = ".RData", version = NULL, ascii = FALSE, safe = TRUE)
The file is to be given the extension .RData.
ascii = TRUE will save an ASCII representation of the file. The default is ascii = FALSE,
in which case a binary representation of the file is saved.
version = NULL
Uses the default save format of the running R version.
You can specify version = 2 or version = 3 for compatibility with older/newer R versions.
safe = TRUE
Ensures safe saving by writing to a temporary file first, then renaming it.
Prevents file corruption in case of an interruption.
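The save-then-load round trip can be sketched as follows. To keep the example self-contained it uses save(), which writes only the named objects, rather than save.image(), which writes the whole workspace; the object names and the temporary file are hypothetical.

```r
# Two example objects (hypothetical names) to be saved.
heights <- c(150, 162, 171)
label   <- "sample"

# save() writes the named objects; save.image() does the same for the
# whole workspace.
rdata_file <- tempfile(fileext = ".RData")
save(heights, label, file = rdata_file)

# load() restores objects under their original names; loading into a
# fresh environment avoids overwriting anything in the workspace.
restored <- new.env()
loaded_names <- load(rdata_file, envir = restored)
loaded_names
identical(restored$heights, heights)
```

load() returns the names of the restored objects, which is a convenient way to see what a .RData file contained.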
Loading and Handling Data in R
Challenges of Analytical Data Processing
Expression, Variables and Functions
Missing Values Treatment in R
Using the ‘as’ Operator to Change the Structure of Data
Vectors
Matrices
Factors
List
Aggregating and Group Processing of a Variable
Methods for Reading Data
Challenges of Analytical Data Processing
Data Formats
Data Quality
Project Scope
Output Result via Stakeholder Expectation Management
Data Formats
Selecting a data format is the first challenge in analytical data processing for researchers or
developers.
R is a well-documented programming language that stores data in the form of an object.
It has a very simple syntax that helps in processing any type of data.
R provides many packages and features, such as Open Database Connectivity (ODBC), that
handle different types of data formats.
For example, ODBC supports data sources such as CSV, MS Excel, SQL, etc.
Data Quality
Maintaining data quality is another challenge in analytical data processing.
With the help of R, business analysts can maintain data quality.
Different tools of R help business analysts in removing invalid data, replacing missing values and
removing outliers in data.
Project Scope
Projects based on analytical data processing are costly and time consuming.
Hence, before starting a new project, business analysts should analyse the scope of the project.
They should identify the amount of data required from external sources, time of delivery and other
parameters related to the project.
Output Result via Stakeholder Expectation Management
In analytical data processing, analysts design projects that generate output with different types of values
like the p-value, the degrees of freedom, etc.
However, users or stakeholders prefer to see only the final result.
They do not want to see the constraints used in data processing, assumptions, hypotheses,
p-values, chi-square values or any other intermediate value.
Hence, an analytical project should present its output in a form that fulfils the expectations of the stakeholders.
Expression, variables and functions
Arithmetic Expressions
Logical Values
• Logical values are TRUE and FALSE, or T and F.
• Comparison operators in R are used to compare values and return a logical
(boolean) result: TRUE or FALSE.
Part (i) Display ‘TRUE’ for elements whose values are more than 7, else display ‘FALSE’.
> x>7
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
Part (ii) Display ‘TRUE’ for elements whose values are less than 5, else display ‘FALSE’.
> x<5
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
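The two comparisons above can be reproduced and extended as sketched below. The definition of x is not shown in the text, so x <- 1:10 is an assumption that happens to match the printed results.

```r
x <- 1:10                 # assumed data; the original vector is not shown

x > 7                     # element-wise comparison
x[x > 7]                  # keep only the elements that pass the test
sum(x < 5)                # TRUE counts as 1, so this counts matches
x > 3 & x < 8             # combine conditions with & (and) or | (or)
any(x == 10)              # is at least one element equal to 10?
```

Using a logical vector inside square brackets, as in x[x > 7], is the usual way to filter a vector in R.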
Date Activity
Dates: The default format of a date is YYYY-MM-DD.
Print the system’s date.
> Sys.Date()
[1] "2017-01-13"
Print the system’s time.
> Sys.time()
[1] "2017-01-13 10:54:37 IST"
Print the time zone.
> Sys.timezone()
[1] "Asia/Calcutta"
Print today’s date.
> today <- Sys.Date()
> today
[1] "2017-01-13"
Format the date.
> format(today, format = "%B %d %Y")
[1] "January 13 2017"
Store a date as a text data type.
> CustomDate = "2016-01-13"
> CustomDate
[1] "2016-01-13"
> class(CustomDate)
[1] "character"
Convert the date stored as a text data type into a date data type.
> CustDate = as.Date(CustomDate)
> class(CustDate)
[1] "Date"
> CustDate
[1] "2016-01-13"
Find a Difference between Two Dates
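A minimal sketch of date differences follows. The two dates are taken from the earlier examples, and the variable names start_date, end_date and gap are chosen only for illustration.

```r
start_date <- as.Date("2016-01-13")
end_date   <- as.Date("2017-01-13")

# Subtracting two Date objects gives a "difftime" object.
gap <- end_date - start_date
gap
as.numeric(gap)                                    # plain number of days

# difftime() lets the unit be chosen explicitly.
difftime(end_date, start_date, units = "weeks")

# Dates also support arithmetic with whole days.
start_date + 30
```

The gap here is 366 days rather than 365 because 2016 is a leap year.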
Variables
1. Assign a value of 50 to the variable called ‘Var’.
> Var <- 50 (or Var = 50)
2. Print the value in the variable ‘Var’.
> Var
3. Perform arithmetic operations on the variable ‘Var’.
> Var + 10
4. Reassign a string value to the variable ‘Var’, then print the value in the variable.
> Var <- "R is a Statistical Programming Language"
> Var
[1] "R is a Statistical Programming Language"
5. Reassign a logical value to the variable ‘Var’.
> Var <- TRUE
> Var
[1] TRUE
Functions
sum() function
sum() function returns the sum of all the values in its arguments.
Syntax
sum(..., na.rm = FALSE)
where ... implies numeric, complex or logical vectors.
na.rm accepts a logical value: should missing values (including NaN (Not a Number)) be
removed?
Examples
Sum the values 1, 2 and 3 provided as arguments to sum().
> sum(1, 2, 3)
[1] 6
What will be the output if NA is used for one of the arguments to sum()?
> sum(1, 5, NA, na.rm=FALSE)
[1] NA
If na.rm is FALSE, an NA or NaN value in any of the arguments will cause NA or NaN to be
returned.
What will be the output if NaN is used for one of the arguments to sum()?
> sum(1, 5, NaN, na.rm=FALSE)
[1] NaN
What will be the output if NA and NaN are used as arguments to sum()?
> sum(1, 5, NA, NaN, na.rm=FALSE)
[1] NA
What will be the output if the option na.rm is set to TRUE?
If na.rm is TRUE, an NA or NaN value in any of the arguments will be ignored.
> sum(1, 5, NA, na.rm=TRUE)
[1] 6
> sum(1, 5, NA, NaN, na.rm=TRUE)
[1] 6
> sum(1+2i, 2+3i, NaN, NA, na.rm=TRUE)
[1] 3+5i
> sum(TRUE, FALSE, NaN, NA, na.rm=TRUE)
[1] 1
> sum(TRUE, TRUE, NaN, NA, na.rm=TRUE)
[1] 2
> v <- c(1,2,3)
> v
[1] 1 2 3
> sum(v)
[1] 6
> sum(v, na.rm=FALSE)
[1] 6
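In practice na.rm matters most when the values come from a vector with gaps. The sketch below uses a hypothetical vector named readings with one missing value.

```r
# Hypothetical daily readings with one missing value.
readings <- c(12.5, NA, 14.0, 13.5)

sum(readings)                    # NA: the missing value propagates
sum(readings, na.rm = TRUE)      # missing value is dropped first
mean(readings, na.rm = TRUE)

# Count the missing values before deciding to drop them.
sum(is.na(readings))
```

Counting the NA values first makes it clear how much data is being discarded when na.rm = TRUE is used.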
min() function
min() function returns the minimum of all the values present in its arguments.
Syntax
min(..., na.rm=FALSE)
where ... implies numeric or character arguments and na.rm accepts a logical value:
should missing values (including NaN) be removed?
Example
> min(1, 2, 3)
[1] 1
If na.rm is FALSE, an NA or NaN value in any of the arguments will cause NA or NaN to be
returned.
> min(1, 2, 3, NA, na.rm=FALSE)
[1] NA
> min(1, 2, 3, NaN, na.rm=FALSE)
[1] NaN
> min(1, 2, 3, NA, NaN, na.rm=FALSE)
[1] NA
If na.rm is TRUE, an NA or NaN value in any of the arguments will be
ignored.
> min(1, 2, 3, NA, NaN, na.rm=TRUE)
[1] 1
> v <- c(1,2,3)
> v
[1] 1 2 3
> min(v)
[1] 1
> min(v, na.rm=FALSE)
[1] 1
> min('a', 'b', na.rm=FALSE)
[1] "a"
max() function
max() function returns the maximum of all the values present in its arguments.
Syntax
max(..., na.rm=FALSE)
where ... implies numeric or character arguments and na.rm accepts a logical value:
should missing values (including NaN) be removed?
Example
> max(44, 78, 66)
[1] 78
If na.rm is FALSE, an NA or NaN value in any of the arguments will cause NA or NaN to be
returned.
> max(44, 78, 66, NA, na.rm=FALSE)
[1] NA
> max(44, 78, 66, NaN, na.rm=FALSE)
[1] NaN
> max(44, 78, 66, NA, NaN, na.rm=FALSE)
[1] NA
If na.rm is TRUE, an NA or NaN value in any of the arguments will be
ignored.
> max(44, 78, 66, NA, NaN, na.rm=TRUE)
[1] 78
> v <- c(1,2,3)
> v
[1] 1 2 3
> max(v)
[1] 3
> max(v, na.rm=FALSE)
[1] 3
> max('a', 'b', na.rm=FALSE)
[1] "b"
seq() function
seq() function generates a regular sequence.
Syntax
seq(start from, end at, interval, length.out)
where,
Start from: It is the start value of the sequence.
End at: It is the maximal or end value of the sequence.
Interval: It is the increment of the sequence.
length.out: It is the desired length of the sequence.
Example
> seq(1, 10, 2)
[1] 1 3 5 7 9
> seq(1, 10, length.out=10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(18)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Or
> seq_len(18)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
> seq(1, 6, by=3)
[1] 1 4
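A few more seq() variations are sketched below, together with one reason seq_len() is often preferred over the 1:n shorthand. The variable n is introduced only for this illustration.

```r
seq(0, 1, by = 0.25)            # fractional steps
seq(10, 1, by = -3)             # descending sequence
seq(2, 20, length.out = 4)      # four evenly spaced values

# seq_len() is safer than 1:n when n may be zero.
n <- 0
1:n                             # counts down: 1 0
seq_len(n)                      # empty vector, so a loop body never runs
```

With n equal to zero, a loop over 1:n would run twice, whereas a loop over seq_len(n) would not run at all.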
Manipulating Text in Data
Function | Function Arguments | Description
substr(a, start, stop) | a is a character vector. start and stop arguments contain a numeric value. | The function returns the part of the string beginning at the start argument and ending at the stop argument.
strsplit(a, split, ...) | a is a character vector. split is also a character vector that contains a regular expression for splitting. | The function splits the given text string into substrings.
paste(..., sep = " ", ...) | The dots ‘...’ define R objects. sep argument is a character string for separating objects. | The function concatenates string vectors after converting the objects into strings.
grep(pattern, a) | pattern argument contains a matching pattern. a is a character vector. | The function searches for a text pattern in the given character vector and returns the positions of the matching elements.
toupper(a) | a is a character vector. | The function converts a string into uppercase.
tolower(a) | a is a character vector. | The function converts a string into lowercase.
strsplit(a, split, …)
a is a character vector.
split is also a character vector that contains a regular expression for splitting.
The function splits the given text string into substrings.
> strsplit(c("MYSQL","SQLite"), split=" ")
[[1]]
[1] "MYSQL"
[[2]]
[1] "SQLite"
> strsplit(c("MYSQL","SQLite"), split="")
[[1]]
[1] "M" "Y" "S" "Q" "L"
[[2]]
[1] "S" "Q" "L" "i" "t" "e"
substr(a, start, stop)
a is a character vector.
The start and stop arguments contain numeric values.
The function returns the part of the string beginning at the start argument and ending at the stop argument.
> substr('abc', 1, 2)
[1] "ab"
> names <- c("Alice", "Bob", "Charlie")
> substr(names, 1, 2)
[1] "Al" "Bo" "Ch"
paste(…, sep=" ", …)
The dots ‘…’ define R objects (strings, numbers, etc.) that need to be combined.
The sep argument is a character string for separating objects.
The function concatenates string vectors after converting the objects into strings.
> a <- "R language"
> b <- "is open source language"
> paste(a, b)
[1] "R language is open source language"
(the default separator is a space)
> paste(a, b, sep=",")
[1] "R language,is open source language"
grep(pattern, a)
The pattern argument contains a matching pattern.
a is a character vector.
The function searches for the text pattern in the given strings and returns the positions of the elements that match. If nothing matches, it returns integer(0).
> grep('ness', 'business')
[1] 1
> grep('abc', 'business')
integer(0)
> grep('ness', c('business','businessman','company'))
[1] 1 2
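The text functions above are often used together. The following is a minimal sketch that assumes a made-up input string, 'sentence'; the value and collapse arguments are standard options of grep() and paste() that the slides do not cover.

```r
sentence <- "R is a statistical language"    # hypothetical input

# Split the sentence into words; strsplit() returns a list, so take element 1.
words <- strsplit(sentence, split = " ")[[1]]
print(words)

# value = TRUE makes grep() return the matching strings instead of positions.
matches <- grep("a", words, value = TRUE)
print(matches)    # "a" "statistical" "language"

# collapse joins the elements of one vector into a single string.
joined <- paste(toupper(words), collapse = "-")
print(joined)     # "R-IS-A-STATISTICAL-LANGUAGE"
```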
rep() function
The rep() function repeats a given argument a specified number of times.
Syntax: rep(x, times, each)
In the example below, the string ‘statistics’ is repeated three times.
Example
> rep("statistics", 3)
[1] "statistics" "statistics" "statistics"
> result <- rep(c(1, 2, 3), 2)
> print(result)
[1] 1 2 3 1 2 3
> result <- rep(c(1, 2, 3), each = 2)
> print(result)
[1] 1 1 2 2 3 3
grep() function
In the example below, the function grep() finds the index position at which the string ‘statistical’ is present.
Example
> grep("statistical", c("R","is","a","statistical","language"), fixed=TRUE)
[1] 4
toupper() function
The toupper() function converts a given character vector into upper case.
Syntax
toupper(x)
x → is a character vector
Example
> toupper("statistics")
[1] "STATISTICS"
> toupper(c("hello","hai"))
[1] "HELLO" "HAI"
Or
> casefold("r programming language", upper=TRUE)
[1] "R PROGRAMMING LANGUAGE"
tolower() function
The tolower() function converts the given character vector into lower case.
Syntax
tolower(x)
x → is a character vector
Example
> tolower("STATISTICS")
[1] "statistics"
> tolower(c("HELLO","HAI"))
[1] "hello" "hai"
Or
> casefold("R PROGRAMMING LANGUAGE", upper=FALSE)
[1] "r programming language"
substr() function
The substr() function extracts or replaces substrings in a character vector.
Syntax
substr(x, start, stop)
x → character vector
start → start position of extraction or replacement
stop → stop or end position of extraction or replacement
Example
Extract the string ‘tic’ from ‘statistics’. Begin the extraction at position 7 and continue till position 9.
> substr("statistics", 7, 9)
[1] "tic"
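The slide above shows extraction only. The replacement form of substr() can be sketched as follows; the variables 'word' and 'code' are illustrative names, not taken from the slides.

```r
word <- "statistics"

# Extraction: characters 1 to 4.
first_four <- substr(word, 1, 4)
print(first_four)            # "stat"

# Replacement: assign to substr() to overwrite characters in place.
substr(word, 1, 1) <- "S"
print(word)                  # "Statistics"

# Replacement never changes the string length: only as many characters
# as fit between start and stop are used.
code <- "abcdef"
substr(code, 2, 3) <- "XYZ"
print(code)                  # "aXYdef"
```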
MISSING VALUES TREATMENT IN R
Function | Function Arguments | Description
is.na(x) | x is an R object to be tested. | Checks the object and returns TRUE wherever data is missing.
na.omit(x, …) | x is an R object from which NA needs to be removed. The dots ‘…’ define other optional arguments. | Returns the object after removing missing values from it.
na.exclude(x, …) | x is an R object from which NA needs to be removed. The dots ‘…’ define other optional arguments. | Returns the object after removing missing values from it.
na.fail(x, …) | x is an R object to be checked for missing values. The dots ‘…’ define other optional arguments. | Raises an error if the object contains any missing values and returns the object if it does not contain any missing value.
na.pass(x, …) | x is an R object. The dots ‘…’ define other optional arguments. | Returns the unchanged object.
na.omit()
Removes rows with missing values (NA).
The missing values are dropped from the returned object.
If you use it in a function like lm() (linear model), the missing values are removed completely, and residuals are computed only for the remaining observations.
na.exclude()
Similar to na.omit() but keeps track of the omitted rows. When used in a function like lm(), it removes missing values, but when residuals or fitted values are calculated, the missing values are reintroduced as NA in the correct positions. Useful when you want to maintain the structure of the data.
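The difference between the two functions shows up when they are passed to lm() through its na.action argument. The following is a small sketch on a made-up data frame, 'df', with one missing value; the data is purely illustrative.

```r
# Hypothetical data: the second observation of y is missing.
df <- data.frame(x = c(1, 2, 3, 4, 5),
                 y = c(2.1, NA, 6.2, 8.1, 9.9))

fit_omit <- lm(y ~ x, data = df, na.action = na.omit)
fit_excl <- lm(y ~ x, data = df, na.action = na.exclude)

# na.omit: residuals exist only for the 4 complete rows.
print(length(residuals(fit_omit)))   # 4

# na.exclude: residuals are padded back to 5 values, with NA in position 2.
print(length(residuals(fit_excl)))   # 5
print(residuals(fit_excl))

# is.na() and na.fail() can be used to check the data beforehand.
print(sum(is.na(df$y)))              # 1
```

Both fits use the same four observations, so the estimated coefficients are the same; only the shape of the residuals and fitted values differs.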
USING THE ‘AS’ OPERATOR TO CHANGE THE STRUCTURE OF DATA
Sometimes analytical data processing requires data conversion from one data format into another. Generally, analytical data processing stores data in a table format, but a task may need only some part of the table, or another structure to hold the table’s data. In this case, R can convert the structure of the table into other structures like factor, list, etc.
The ‘as’ family of functions provides the facility to convert the structure of one dataset into another structure in R.
The syntax is
as.objecttype(objectname)
where,
objecttype is the type of object like data.frame, matrix, list, etc. and objectname is the name of the object that needs to be converted into another format.
Also, the as.numeric() and as.character() functions convert values to numbers and to character strings, respectively.
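A few conversions of the form as.objecttype(objectname) are sketched below. The object names (m, df, l, n, s, bad) are illustrative only.

```r
# Matrix to data frame.
m  <- matrix(1:6, nrow = 2)
df <- as.data.frame(m)
print(class(df))        # "data.frame"

# Vector to list: each element becomes one list element.
l <- as.list(1:3)
print(length(l))        # 3

# Character to number, and number to character.
n <- as.numeric("3.14")
s <- as.character(42)
print(n + 1)            # 4.14
print(s)                # "42"

# Text that is not a number converts to NA (R also issues a warning,
# which is suppressed here to keep the output clean).
bad <- suppressWarnings(as.numeric("abc"))
print(is.na(bad))       # TRUE
```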
Vectors
A vector holds an ordered collection of values.
The values can be numbers, strings or logical values.
All the values in a vector must be of the same data type.
A few points to remember about vectors in R are:
• Vectors are stored like arrays in C
• Vector indices begin at 1
• All vector elements must have the same mode, such as integer, numeric (floating point number), character (string), logical (Boolean), complex, etc.
Vector Examples
1. Create a vector of numbers. The c function (c is short for combine) creates a new vector consisting of three values, viz. 4, 7 and 8.
> c(4, 7, 8)
[1] 4 7 8
2. Create a vector of string values.
> c("R", "SAS", "SPSS")
[1] "R" "SAS" "SPSS"
3. Create a vector of logical values.
> c(TRUE, FALSE)
[1] TRUE FALSE
4. A vector cannot hold values of different data types. Consider the example below on placing integer, string and Boolean values together in a vector. All the values are converted to character strings.
> c(4, 8, "R", FALSE)
[1] "4" "8" "R" "FALSE"
5. Declare a vector by the name ‘Project’ of length 3 and store values in it.
> Project <- vector(length = 3)
> Project[1] <- "Finance Project"
> Project[2] <- "Retail Project"
> Project[3] <- "Energy Project"
> Project
[1] "Finance Project" "Retail Project" "Energy Project"
> length(Project)
[1] 3
A sequence vector can be created with a start:end notation.
Sequence Vector Examples
1. Create a sequence of numbers between 1 and 5 (both inclusive).
> 1:5
[1] 1 2 3 4 5
> seq(1:5)
[1] 1 2 3 4 5
2. The default increment with seq is 1. However, it also allows the use of increments other than 1.
> seq(1, 10, 2)
[1] 1 3 5 7 9
> seq(from=1, to=10, by=2)
[1] 1 3 5 7 9
> seq(1, 10, by=2)
[1] 1 3 5 7 9
3. seq can also generate numbers in descending order.
> 10:1
 [1] 10  9  8  7  6  5  4  3  2  1
> seq(10, 1, by=-2)
[1] 10  8  6  4  2
To access more than one value from a vector, pass a vector of positions inside the square brackets. The examples assume that ‘VariableSeq’ holds the five strings shown in item 4.
1. Access the first and the fifth element from the vector ‘VariableSeq’.
> VariableSeq[c(1, 5)]
[1] "R" "language"
2. Access the first to the fourth element from the vector ‘VariableSeq’.
> VariableSeq[1:4]
[1] "R" "is" "a" "good programming"
3. Access the first, fourth and the fifth element from the vector ‘VariableSeq’.
> VariableSeq[c(1, 4:5)]
[1] "R" "good programming" "language"
4. Retrieve all the values from the variable ‘VariableSeq’.
> VariableSeq
[1] "R" "is" "a" "good programming"
[5] "language"
Vector Names: The names() function helps to assign names to the vector elements.
1. Assign names to the values of a vector.
> placeholder <- 1:5
> names(placeholder) <- c("r", "is", "a", "programming", "language")
> placeholder
          r          is           a programming    language
          1           2           3           4           5
> placeholder[3]
a
3
2. Plot a bar graph using the barplot function. The barplot function uses a vector’s values to plot a bar chart.
> BarVector <- c(4, 7, 8)
> barplot(BarVector)
Let us use the names function to assign names to the vector elements. These names will be used as labels in the barplot.
> names(BarVector) <- c("India", "MiddleEast", "US")
> barplot(BarVector)
Vector Math: Let us define a vector ‘x’ with three values and add a scalar value (single value) to it. The value gets added to each vector element.
1. Add a scalar to the vector. The value is added to each element, but x itself is not modified; the vector retains its original elements.
> x <- c(4, 7, 8)
> x + 1
[1] 5 8 9
> x
[1] 4 7 8
2. If the vector needs to be updated with the new values, assign the result back to it.
> x <- x + 1
> x
[1] 5 8 9
3. We can run other arithmetic operations on the vector in the same way.
> x - 1
[1] 4 7 8
> x * 2
[1] 10 16 18
> x / 2
[1] 2.5 4.0 4.5
4. Let us practice these arithmetic operations on two vectors.
> x
[1] 5 8 9
> y <- c(1, 2, 3)
> y
[1] 1 2 3
> x + y
[1]  6 10 12
> x - y
[1] 4 6 6
> x * y
[1]  5 16 27
5. Check if the two vectors are equal. The comparison takes place element by element.
> x == y
[1] FALSE FALSE FALSE
> x < y
[1] FALSE FALSE FALSE
6. Mathematical functions such as sin() are also applied element by element.
> sin(x)
[1] -0.9589243  0.9893582  0.4121185
Vector Recycling: If an operation involves two vectors and requires them to be of the same length, the shorter one is recycled, i.e. repeated until it is long enough to match the longer one.
1. Add two vectors wherein one has length 3 and the other has length 6.
> c(1, 2, 3) + c(4, 5, 6, 7, 8, 9)
[1]  5  7  9  8 10 12
2. Multiply two vectors wherein one has length 3 and the other has length 6.
> c(1, 2, 3) * c(4, 5, 6, 7, 8, 9)
[1]  4 10 18  7 16 27
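Recycling also applies when the longer length is not a multiple of the shorter one, but R then issues a warning. The sketch below uses made-up vectors to show both cases.

```r
# Length 2 recycled against length 4: an exact multiple, so no warning.
even_case <- c(1, 2) + c(10, 20, 30, 40)
print(even_case)     # 11 22 31 42

# Length 3 against length 4: the shorter vector is cut off part-way through
# its second pass. R warns that the longer length is not a multiple of the
# shorter length; the warning is suppressed here only to keep output clean.
uneven_case <- suppressWarnings(c(1, 2, 3) + c(10, 20, 30, 40))
print(uneven_case)   # 11 22 33 41
```

Such a warning usually signals a mistake in the data, so it is worth checking vector lengths with length() before combining them.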
Plot a Scatter Plot
The function to plot a scatter plot is ‘plot’. It uses two vectors, one for the x axis and another for the y axis.
The objective is to understand the relationship between numbers and their sines.
We will use two vectors: x, which holds a sequence of values between 1 and 25 at an interval of 0.1, and y, which stores the sines of all values held in x.
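The scatter plot described above can be sketched as follows. The axis labels and title are illustrative choices, not taken from the slides.

```r
# x: values from 1 to 25 in steps of 0.1; y: the sine of each value.
x <- seq(1, 25, by = 0.1)
y <- sin(x)

# plot() takes the x-axis vector first and the y-axis vector second.
plot(x, y, xlab = "x", ylab = "sin(x)", main = "Sine of x")
```

Both vectors must have the same length; here each has 241 elements.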
Matrices: Matrices are two-dimensional arrays in which elements of the same data type are arranged in rows and columns.
Objective 1
1. Let us create a matrix which is 3 rows by 4 columns and set all its elements to 1.
> matrix(1, 3, 4)
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    1    1    1    1
[3,]    1    1    1    1
2. Use a vector to create an array, 3 rows high and 3 columns wide.
Begin by creating a vector that has elements from 10 to 90 with an interval of 10.
> a <- seq(10, 90, by = 10)
Validate by printing the value of vector a.
> a
[1] 10 20 30 40 50 60 70 80 90
Call the matrix function with vector ‘a’, the number of rows and the number of columns.
> matrix(a, 3, 3)
     [,1] [,2] [,3]
[1,]   10   40   70
[2,]   20   50   80
[3,]   30   60   90
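In the output above the values run down the columns, because matrix() fills column by column by default. A small sketch of the byrow argument, which fills row by row instead (the names by_col and by_row are illustrative):

```r
a <- seq(10, 90, by = 10)

# Default: fill down the columns.
by_col <- matrix(a, nrow = 3, ncol = 3)

# byrow = TRUE: fill across the rows.
by_row <- matrix(a, nrow = 3, ncol = 3, byrow = TRUE)
print(by_row)

# The first row now holds the first three values of a.
print(by_row[1, ])               # 10 20 30

# Filling by row gives the transpose of filling by column.
print(all(by_row == t(by_col)))  # TRUE
```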
Objective 2
1. Re-shape the vector itself into an array using the dim function. Begin by creating a vector that has elements from 10 to 90 with an interval of 10.
> a <- seq(10, 90, by = 10)
2. Validate by printing the value of vector a.
> a
[1] 10 20 30 40 50 60 70 80 90
3. Assign new dimensions to vector a by passing a vector that specifies 3 rows and 3 columns (c(3, 3)).
> dim(a) <- c(3, 3)
Print the values of a. You will notice that the values have shifted to form 3 rows by 3 columns. The vector is no longer one-dimensional. It has been converted into a two-dimensional matrix that is 3 rows high and 3 columns wide.
> a
     [,1] [,2] [,3]
[1,]   10   40   70
[2,]   20   50   80
[3,]   30   60   90
Matrix Access
1. Access the elements of a 3 × 4 matrix. Create a matrix ‘mat’, 3 rows high and 4 columns wide, using a vector.
> x <- 1:12
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
> mat <- matrix(x, 3, 4)
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
2. Access the element present in the second row and third column of the matrix ‘mat’.
> mat[2, 3]
[1] 8
Matrix Access: Rows and Columns
1. Access the third row of an existing matrix. Let us begin by printing the values of the existing matrix ‘mat’.
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
To access the third row of the matrix, simply provide the row number and omit the column number.
> mat[3, ]
[1]  3  6  9 12
To access the second column of the matrix, simply provide the column number and omit the row number.
> mat[, 2]
[1] 4 5 6
2. To access the second and third columns of the matrix, simply provide the column numbers and omit the row number.
> mat[, 2:3]
     [,1] [,2]
[1,]    4    7
[2,]    5    8
[3,]    6    9
Create a contour plot
• Contour plots (sometimes called Level Plots) are a way to show a three-dimensional surface on a two-dimensional plane.
• A contour plot graphs two predictor variables, X and Y, on the x- and y-axes and a response variable Z as contours.
• These contours are sometimes called z-slices or iso-response values.
Create a matrix ‘mat’ which is 9 rows high and 9 columns wide and assign the value ‘1’ to all its elements.
> mat <- matrix(1, 9, 9)
> mat
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,]    1    1    1    1    1    1    1    1    1
 [2,]    1    1    1    1    1    1    1    1    1
 [3,]    1    1    1    1    1    1    1    1    1
 [4,]    1    1    1    1    1    1    1    1    1
 [5,]    1    1    1    1    1    1    1    1    1
 [6,]    1    1    1    1    1    1    1    1    1
 [7,]    1    1    1    1    1    1    1    1    1
 [8,]    1    1    1    1    1    1    1    1    1
 [9,]    1    1    1    1    1    1    1    1    1
R includes some sample data sets. One of these is ‘volcano’, which is a 3D map of a dormant New Zealand volcano. Create a contour map of the volcano dataset (Figure 3.7).
> contour(volcano)
Let us create a 3D perspective map of the sample data set ‘volcano’ (Figure 3.8).
> persp(volcano)
Create a heat map of the sample dataset ‘volcano’ (Figure 3.9).
> image(volcano)
Factors: Factors are used to handle categorical data. Factors store unique category labels and are useful in statistical modeling.
> HouseColor <- c('red', 'green', 'blue', 'yellow', 'red', 'green', 'blue', 'blue')
> types <- factor(HouseColor)
> HouseColor
[1] "red" "green" "blue" "yellow" "red" "green" "blue" "blue"
> print(HouseColor)
[1] "red" "green" "blue" "yellow" "red" "green" "blue" "blue"
> print(types)
[1] red    green  blue   yellow red    green  blue   blue
Levels: blue green red yellow
Levels denote the unique values. The factor above has four distinct values: ‘blue’, ‘green’, ‘red’ and ‘yellow’.
> as.integer(types)
[1] 3 2 1 4 3 2 1 1
The above output is explained as given below.
1 is the number assigned to blue. 2 is the number assigned to green. 3 is the number assigned to red. 4 is the number assigned to yellow. The numbers follow the alphabetical order of the levels.
> levels(types)
[1] "blue" "green" "red" "yellow"
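Two common follow-up steps with factors are counting the observations per level and fixing the order of the levels explicitly. The sketch below reuses the HouseColor data; the variable names counts and ordered_types are illustrative.

```r
HouseColor <- c("red", "green", "blue", "yellow",
                "red", "green", "blue", "blue")
types <- factor(HouseColor)

# table() counts how many times each level occurs.
counts <- table(types)
print(counts)              # blue 3, green 2, red 2, yellow 1

# nlevels() gives the number of distinct categories.
print(nlevels(types))      # 4

# Passing levels= overrides the default alphabetical ordering,
# which changes the integer codes behind the labels.
ordered_types <- factor(HouseColor,
                        levels = c("red", "green", "blue", "yellow"))
print(as.integer(ordered_types))   # 1 2 3 4 1 2 3 3
```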
List: A list is similar to a C struct. A list can contain different types of data (vectors, matrices, data frames, other lists, etc.). Lists are useful when working with heterogeneous data.
1. Create a list in R. To create a list ‘emp’ having three elements, ‘EmpName’, ‘EmpUnit’ and ‘EmpSal’:
> emp <- list(EmpName = "Alex", EmpUnit = "IT", EmpSal = 55000)
> emp
$EmpName
[1] "Alex"
$EmpUnit
[1] "IT"
$EmpSal
[1] 55000
2. Create the same list without element names.
> EmpList <- list("Alex", "IT", 55000)
> EmpList
[[1]]
[1] "Alex"
[[2]]
[1] "IT"
[[3]]
[1] 55000
A list has elements. The elements in a list can have names, which are referred to as tags. Elements can also have values.
1. Retrieve the names of the elements in the list ‘emp’.
> names(emp)
[1] "EmpName" "EmpUnit" "EmpSal"
2. Retrieve the values of the elements in the list ‘emp’.
> unlist(emp)
EmpName EmpUnit  EmpSal
 "Alex"    "IT" "55000"
3. Retrieve the value of the element ‘EmpName’ in the list ‘emp’.
> unlist(emp["EmpName"])
EmpName
 "Alex"
4. Retrieve the value of the element ‘EmpName’ in the list ‘emp’ using double brackets.
> emp[["EmpName"]]
[1] "Alex"
Add/Delete Element to or from a List
1. Before adding an element to the list ‘emp’, let us verify what elements exist in the list.
> emp
$EmpName
[1] "Alex"
$EmpUnit
[1] "IT"
$EmpSal
[1] 55000
2. Add an element with the name ‘EmpDesg’ and value ‘Software Engineer’ to the list ‘emp’.
> emp$EmpDesg = "Software Engineer"
> emp
$EmpName
[1] "Alex"
$EmpUnit
[1] "IT"
$EmpSal
[1] 55000
$EmpDesg
[1] "Software Engineer"
Delete an Element from a List
1. Delete the element with the name ‘EmpUnit’ and value ‘IT’ from the list ‘emp’.
> emp$EmpUnit <- NULL
> emp
$EmpName
[1] "Alex"
$EmpSal
[1] 55000
$EmpDesg
[1] "Software Engineer"
2. Determine the number of elements in the list ‘emp’. The length() function can be used to determine the number of elements present in the list. The list ‘emp’ has three elements as shown:
> length(emp)
[1] 3
Recursive List: A recursive list means a list within a list.
Create a list within a list. Let us begin with two lists, ‘emp’ and ‘emp1’. The elements in both the lists are as shown below.
> emp
$EmpName
[1] "Alex"
$EmpSal
[1] 55000
$EmpDesg
[1] "Software Engineer"
> emp1
$EmpUnit
[1] "IT"
$EmpCity
[1] "Los Angeles"
We would like to combine both the lists into a single list called ‘EmpList’.
> EmpList <- list(emp, emp1)
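Once the two lists are nested, the inner elements are reached by indexing twice. The following sketch rebuilds ‘emp’ and ‘emp1’ with the values shown above; the name ‘flat’ is an illustrative addition.

```r
emp  <- list(EmpName = "Alex", EmpSal = 55000,
             EmpDesg = "Software Engineer")
emp1 <- list(EmpUnit = "IT", EmpCity = "Los Angeles")

# list() nests: EmpList has 2 elements, each of which is itself a list.
EmpList <- list(emp, emp1)
print(length(EmpList))             # 2

# First index picks the inner list, second picks the element inside it.
print(EmpList[[1]]$EmpName)        # "Alex"
print(EmpList[[2]][["EmpCity"]])   # "Los Angeles"

# c() instead joins the two lists into one flat list of 5 elements.
flat <- c(emp, emp1)
print(length(flat))                # 5
```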
Task
Exploring a Dataset
Conditional Manipulation of a Dataset
• Analytical data processing sometimes may require specific rows and columns of a dataset.
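Selecting specific rows and columns can be sketched as follows. The data frame ‘df’ and its column names are made up for illustration and are not the dataset used in the original slides.

```r
# Hypothetical dataset.
df <- data.frame(Name  = c("Apple", "Banana", "Cherry", "Mango"),
                 Price = c(120, 40, 300, 90),
                 stringsAsFactors = FALSE)

# Rows that satisfy a condition, all columns.
cheap <- df[df$Price < 100, ]
print(cheap)

# Rows that satisfy a condition, one column only.
names_only <- df[df$Price >= 100, "Name"]
print(names_only)      # "Apple" "Cherry"

# subset() expresses the same idea with a condition and a column selection.
sub <- subset(df, Price > 50, select = c(Name, Price))
print(sub)
```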
Merging Data
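One common way to merge two data frames in base R is the merge() function. The sketch below is an assumption about what this slide illustrated; the data frames ‘emp’ and ‘dept’ and the key column ‘EmpID’ are hypothetical.

```r
# Two hypothetical tables that share the key column EmpID.
emp  <- data.frame(EmpID = c(1, 2, 3),
                   Name  = c("Alex", "Bob", "Cara"))
dept <- data.frame(EmpID = c(2, 3, 4),
                   Dept  = c("IT", "HR", "Finance"))

# Default: keep only the keys present in both tables (inner join).
inner <- merge(emp, dept, by = "EmpID")
print(inner)          # EmpID 2 and 3

# all.x = TRUE keeps every row of the first table (left join);
# unmatched rows get NA in the columns of the second table.
left <- merge(emp, dept, by = "EmpID", all.x = TRUE)
print(left)

# all = TRUE keeps every key from either table (full join).
full <- merge(emp, dept, by = "EmpID", all = TRUE)
print(nrow(full))     # 4
```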
Aggregating and group processing of a variable
R provides some functions for aggregation operations.
1. aggregate() function: an inbuilt function of R that aggregates data values. The function splits the data into groups and applies the given statistical function to each group. The syntax of the aggregate() function is
aggregate(x, …) or aggregate(x, by, FUN, …)
where x is an object, the by argument is a list of grouping elements for the chosen variable of the dataset, the FUN argument is a statistical function that returns a numeric value for each group and the dots ‘…’ define other optional arguments.
The aggregate() function in R is used to group data based on certain variables and then apply a
statistical function (like sum, mean, count, etc.) to each group.
x: The data you want to aggregate (like a column in a dataset).
by: A list that defines the groups (i.e., which column(s) to group by).
FUN: A function that calculates something (e.g., mean, sum, count).
The following example reads a table, ‘Fruit_data.csv’, into the object ‘S’. The aggregate() function computes the mean price of each type of fruit.
Here, the by argument is list([Link] = S$[Link]), which groups the rows by the [Link] column.
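Since the contents of ‘Fruit_data.csv’ are not shown, the sketch below builds a small stand-in data frame directly. The column names Fruit and Price and all the values are assumptions made for illustration.

```r
# Hypothetical stand-in for the fruit table.
S <- data.frame(Fruit = c("Apple", "Banana", "Apple", "Banana", "Mango"),
                Price = c(100, 40, 120, 60, 90))

# Group the prices by fruit and take the mean of each group.
# The result has one row per group; the aggregated column is named x.
avg <- aggregate(S$Price, by = list(Fruit = S$Fruit), FUN = mean)
print(avg)     # Apple 110, Banana 50, Mango 90

# The formula interface gives the same numbers and keeps the column name.
avg2 <- aggregate(Price ~ Fruit, data = S, FUN = mean)
print(avg2)
```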
2. tapply() function: also an inbuilt function of R that works in a manner similar to aggregate(). The function splits the data values into groups and applies the given statistical function to each group.
The syntax of the tapply() function is
tapply(x, …) or tapply(x, INDEX, FUN, …)
where x is an object that defines the summary variable, the INDEX argument defines the list of group elements (also called the group variable), the FUN argument is a statistical function that returns a numeric value for each group and the dots ‘…’ define other optional arguments.
The tapply() function in R is used to apply a function (like mean, sum, or count) to subsets of data, grouped
by one or more factors (categories).
tapply(x, INDEX, FUN, …)
x → The numeric column that you want to summarize (e.g., Prices, Sales, Marks).
INDEX → The categorical variable (factor) that defines groups (e.g., Fruit Name, Department, Gender).
FUN → The function to apply to each group (e.g., mean, sum, length).
… → Additional arguments (optional).
The following example reads the table ‘Fruit_data.csv’ into the object ‘A’.
The tapply() function computes the sum of the prices of each type of fruit.
Here [Link] is the summary variable and [Link] is the grouping variable.
The FUN function is applied to the summary variable, [Link]
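As the CSV file itself is not shown, the following sketch uses a small made-up data frame in its place. The column names Fruit and Price are assumptions for illustration.

```r
# Hypothetical stand-in for the fruit table.
A <- data.frame(Fruit = c("Apple", "Banana", "Apple", "Banana", "Mango"),
                Price = c(100, 40, 120, 60, 90))

# Summary variable: A$Price. Grouping variable: A$Fruit. Function: sum.
total <- tapply(A$Price, A$Fruit, sum)
print(total)              # Apple 220, Banana 100, Mango 90

# Unlike aggregate(), tapply() returns a named vector, so a single group
# can be looked up by name.
print(total[["Apple"]])   # 220
```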
Methods for reading data
CSV and Spreadsheets: Comma separated value (CSV) files and spreadsheets are used for storing small size data. R has inbuilt functions through which analysts can read both types of files.
Reading CSV Files
A CSV file uses the .csv extension and stores data in a table structure as plain text. The following function reads data from a CSV file:
read.csv(‘filename’)
where,
filename is the name of the CSV file that needs to be imported.
The read.table() function can also read data from CSV files. The syntax of the function is
read.table(‘filename’, header=TRUE, sep=‘,’, …)
where,
the filename argument defines the path of the file to be read, the header argument is a logical value (TRUE or FALSE) that states whether the file has header names on the first line, the sep argument defines the character used for separating the columns of the file and the dots ‘…’ define other optional arguments.
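Both functions can be tried without an existing file by first writing a small CSV to a temporary location. The file contents and the variable names below are made up for illustration.

```r
# Write a tiny CSV file to a temporary path (hypothetical contents).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Price", "Apple,120", "Banana,40"), csv_path)

# read.csv() assumes a header row and a comma separator by default.
from_csv <- read.csv(csv_path)
print(from_csv)

# read.table() needs both stated explicitly.
from_table <- read.table(csv_path, header = TRUE, sep = ",")
print(from_table)

# Clean up the temporary file.
unlink(csv_path)
```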
Reading Spreadsheets: A spreadsheet is a table that stores data in rows and columns. Many applications are available for creating a spreadsheet; Microsoft Excel is the most popular.
An Excel file uses the .xlsx extension and stores data in a spreadsheet. In R, different packages such as gdata, xlsx, etc., provide functions for reading Excel files. A package must be loaded before any of its functions can be used. read.xlsx() is a function of the ‘xlsx’ package for reading Excel files. The syntax of the read.xlsx() function is
read.xlsx(‘filename’, …)
where the filename argument defines the path of the file to be read and the dots ‘…’ define other optional arguments.
Reading Data from Packages
library() function: loads a package into the R workspace. The package must be loaded before reading a dataset that it provides. The syntax of the library() function is:
library(packagename)
where the packagename argument is the name of the package to be loaded.
data() function: lists all the available datasets of the loaded packages. To load a particular dataset into the R workspace, pass its name to the data() function. The syntax of the data() function is:
data(datasetname)
where the datasetname argument is the name of the dataset to be read.
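A minimal sketch of both functions, using the ‘datasets’ package that ships with R and its ‘mtcars’ dataset as an example (the choice of dataset is illustrative):

```r
# Load the package that provides the dataset.
library(datasets)

# Load one dataset by name into the workspace.
data(mtcars)
print(dim(mtcars))        # 32 rows, 11 columns
print(head(mtcars, 3))

# data(package = ...) returns the catalogue of datasets in a package;
# the Item column of its results holds the dataset names.
available <- data(package = "datasets")$results[, "Item"]
print("mtcars" %in% available)   # TRUE
```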
Reading Data from Web/APIs
Nowadays most business organisations use the Internet and cloud services for storing data.
Such online datasets are directly accessible through packages and application programming interfaces (APIs).
Different packages are available in R for reading online datasets.
The following example illustrates web scraping.
Web scraping extracts data from a webpage of a website.
Here the package ‘RCurl’ is used for web scraping. First, the package ‘RCurl’ is loaded into the workspace, and then the getURL() function of ‘RCurl’ fetches the required webpage.
The htmlTreeParse() function (provided by the ‘XML’ package) then parses the content of the webpage.
Reading a JSON (JavaScript Object Notation) Document
Step 1: Install the rjson package.
> install.packages("rjson")
Installing package into ‘C:/Users/seema_acharya/Documents/R/winlibrary/3.2’
(as ‘lib’ is unspecified)
trying URL ‘[Link]’
Content type ‘application/zip’ length 493614 bytes (482 KB)
downloaded 482 KB
package ‘rjson’ successfully unpacked and MD5 sums checked
Step 2: Input data.
Store the data given below in a text file (‘D:/[Link]’). Ensure that the file is saved with an extension of .json. Note that JSON strings must be enclosed in double quotes.
{
"EMPID": ["1001","2001","3001","4001","5001","6001","7001","8001"],
"Name": ["Ricky","Danny","Mitchelle","Ryan","Gerry","Nonita","Simon","Gallop"],
"Dept": ["IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}
A JSON document begins and ends with a curly brace ({}). A JSON document is a set of key:value pairs. The key:value pairs are separated from one another by a comma (‘,’).
Step 3: Read the JSON file, ‘d:/[Link]’.
> library(rjson)
> output <- fromJSON(file = "d:/[Link]")
> output
$EMPID
[1] "1001" "2001" "3001" "4001" "5001" "6001" "7001" "8001"
$Name
[1] "Ricky" "Danny" "Mitchelle" "Ryan" "Gerry" "Nonita"
[7] "Simon" "Gallop"
$Dept
[1] "IT" "Operations" "IT" "HR" "Finance"
[6] "IT" "Operations" "Finance"
Step 4: Convert JSON to a data frame.
> JSONDataFrame <- as.data.frame(output)
Display the content of the data frame ‘JSONDataFrame’.
> JSONDataFrame
  EMPID      Name       Dept
1  1001     Ricky         IT
2  2001     Danny Operations
3  3001 Mitchelle         IT
4  4001      Ryan         HR
5  5001     Gerry    Finance
6  6001    Nonita         IT
7  7001     Simon Operations
8  8001    Gallop    Finance
Reading an XML File
Step 1: Install the XML package.
> install.packages("XML")
Installing package into ‘C:/Users/seema_acharya/Documents/R/winlibrary/3.2’
(as ‘lib’ is unspecified)
trying URL ‘[Link]
[Link]’
Content type ‘application/zip’ length 4299803 bytes (4.1 MB)
downloaded 4.1 MB
package ‘XML’ successfully unpacked and MD5 sums checked
Step 2: Input data.
Store the data below in a text file ([Link] in the D: drive). Ensure that the file is saved with an extension of .xml.
Reading an XML File: The XML file is read in R using the function xmlParse(). It is stored as a list in R.
Step 1: Begin by loading the required packages.
> library("XML")
Warning message:
package ‘XML’ was built under R version 3.2.3
> library("methods")
> output <- xmlParse(file = "d:/[Link]")
> print(output)
Step 2: Extract the root node from the XML file.
> rootnode <- xmlRoot(output)
Find the number of nodes in the root.
> rootsize <- xmlSize(rootnode)
> rootsize
[1] 3
Let us display the details of the first node.
> print(rootnode[1])
$EMPLOYEE
<EMPLOYEE>
<EMPID>1001</EMPID>
<EMPNAME>Merrilyn</EMPNAME>
<SKILLS>MongoDB</SKILLS>
<DEPT>ComputerScience</DEPT>
</EMPLOYEE>
attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
Let us display the details of the first element of the first node.
> print(rootnode[[1]][[1]])
<EMPID>1001</EMPID>
Let us display the details of the third element of the first node.
> print(rootnode[[1]][[3]])
<SKILLS>MongoDB</SKILLS>
Next, display the details of the third element of the second node.
> print(rootnode[[2]][[3]])
<SKILLS>PeopleManagement</SKILLS>
We can also display the value of the second element of the first node.
> output <- xmlValue(rootnode[[1]][[2]])
> output
[1] "Merrilyn"
Step 3: Convert the input XML file to a data frame using the xmlToDataFrame function.
> xmldataframe <- xmlToDataFrame("d:/[Link]")
Display the output of the data frame.
> xmldataframe
  EMPID  EMPNAME           SKILLS            DEPT
1  1001 Merrilyn          MongoDB ComputerScience
2  1002    Ramya PeopleManagement  HumanResources
3  1003   Fedora      Recruitment  HumanResources
Thank You!