
Multivariate Analysis & Time Series Insights

This document introduces multivariate analysis and discusses introducing a third variable to assess relationships between two variables. It discusses how holding a third, causally prior variable constant allows researchers to draw better causal conclusions about the relationships between the other two variables. Experiments that randomly assign groups are better able to do this than non-experimental studies. For causal inferences to be made, certain assumptions must be met: (1) all other variables affecting both variables must be held constant, and (2) the third variable cannot be an effect of the other two variables. Controlling for a third variable can reveal if an original bivariate relationship was spurious or direct.

Uploaded by

Acshaya M N

UNIT – V

MULTIVARIATE AND TIME SERIES ANALYSIS


Introducing a Third Variable - Causal Explanations - Three-Variable Contingency Tables and
Beyond - Longitudinal Data – Fundamentals of TSA – Characteristics of time series data – Data
Cleaning – Time-based indexing – Visualizing – Grouping – Resampling.
************************************************************************************

I INTRODUCING A THIRD VARIABLE


We consider ways of holding a third variable constant while assessing the relationship between two others.
Causal Explanations
• We have now developed some experience of handling batches of data, summarizing features of their
distributions, and investigating relationships between variables.
• We must now change gear somewhat and ask what it would take for such relationships to be treated
as satisfactory explanations. Hume suggested that 'We may define a cause to be an object followed by
another, and where all the objects, similar to the first, are followed by objects similar to the second.
Or, in other words, where, if the first object had not been, the second never had existed'.

Direct and indirect effects


• Causality should not necessarily be understood as a simple process in which one factor or variable has
an impact on another.
• For example, it is likely in many cases that two or more factors will tend to work together to produce
an effect. Moreover, the factors or variables contributing to the effect may themselves be causally
related. For this reason, we have to keep a clear idea in our heads of the relationships between the
variables in the whole causal process.
• In investigating the causes of absenteeism from work, for example, researchers have found different
contributory factors.
• We will consider two possible causal factors: being female and being in a low status job. Let us
construct a causal path diagram depicting one possible set of relationships between these variables.

• The diagram in figure 11.1 represents a simple system of multiple causal paths. There is an arrow
showing that those in low status jobs are more likely to go absent.
• Being female has a causal effect in two ways. There is an arrow straight to absentee behaviour; this
says that women are more likely to be absent from work than men, regardless of the kind of job they
are in.
• This is termed a direct effect of gender on absenteeism. There is also another way in which being
female has an effect; women are more likely to be in the kind of low status, perhaps unpleasant, jobs
where absenteeism is more likely, irrespective of gender.
• We can say that being female therefore also has an indirect effect on absenteeism, through the type of
work performed. Without some empirical evidence we cannot be sure that this 'model' of the
relationships between the variables is correct.

Controlling the world to learn about causes


• It is one thing to declare confidently that causal chains exist in the world out there. However, it is quite
another thing to find out what they are.
• Causal processes are not obvious. They hide in situations of complexity, in which effects may have
been produced by several different causes acting together.
• When investigated, they will reluctantly shed one layer of explanation at a time, but only to reveal
another deeper level of complexity beneath.
• For this reason, something that is accepted as a satisfactory causal explanation at one point in time can
become problematic at another.
• Researchers investigating the causes of psychological depression spent a long time carefully
documenting how severe, traumatizing events that happen to people, such as bereavement or job loss,
can induce it.
• Now that the causal effect of such life events has been established, the research effort is turning to ask
how an event such as unemployment has its effect: is it through the loss of social esteem, through the
decline of self evaluation and self-esteem, through lack of cash or through the sheer effect of inactivity?

Do opinion polls influence people?


• Let us take an example to illustrate the different inferences which can be drawn from experiments and
non-experiments.
• Some people believe that hearing the results of opinion polls before an election sways individuals
towards the winning candidate. Imagine two ways in which empirical evidence could be collected for
this proposition.
• An experiment could be conducted by taking a largish group of electors, splitting them into two at
random, telling half that the polls indicated one candidate would win and telling the other half that
they showed a rival would win.
• As long as there were a substantial number of people in each group, the groups would start the
experiment having the same political preferences on average, since the groups were formed at random.
• If they differed substantially in their subsequent support for the candidates, then we could be almost
certain that the phony poll information they were fed contributed to which candidate they supported.
• Alternatively, the proposition could be researched in a non-experimental way.
• A survey could be conducted to discover what individuals believed recent opinion polls showed, and
to find out which candidates the individuals themselves supported.
• The preferences of those who believed that one candidate was going to win would be compared with
those who believed that the rival was going to win.
• The hypothesis would be that the former would be more sympathetic to the candidate than the latter.
• If the survey did reveal a strong relationship between individuals' perception of the state of public
opinion and their own belief, should this be taken as evidence that opinion polls have a causal effect
on people's voting decisions? Should policy-makers consider banning polls in pre-election periods as
a result?
• Anyone who tried this line of argument would be taken to task by the pollsters, who have a
commercial interest in resisting such reasoning. They would deny that the effect in any way proves
that polls influence opinion; it could, for instance, be that supporters of a right-wing candidate are
of a generally conservative predisposition, and purchase newspapers which only report polls
sympathetic to their candidate.
• In short, comparing individuals in a survey who thought that candidate A would win with those who
believed that candidate B would win, would not be comparing two groups similar in all other possible
respects, unlike the experiment discussed above. An experiment would have a better chance of
persuading people that the publication of opinion polls affected individual views.
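
The claim that randomly formed groups start out alike on average can be checked with a small simulation. This is only a sketch: the electorate of 1,000 people, the 60/40 split of prior leanings and the seed are all invented for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical electorate: 1 = already leans towards candidate A, 0 = leans towards B.
electors = [1] * 600 + [0] * 400

# Random assignment: shuffle, then split into two equal groups.
random.shuffle(electors)
group_1, group_2 = electors[:500], electors[500:]

# Share leaning towards A in each group before any poll result is shown.
share_1 = sum(group_1) / len(group_1)
share_2 = sum(group_2) / len(group_2)
print(f"prior support for A: {share_1:.2f} vs {share_2:.2f}")
```

Because the two shares should differ only by chance, a large gap in support after the groups are told different poll results could reasonably be attributed to the poll information. A survey offers no such guarantee about the starting point.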

Assumptions required to infer cause


• Imagine a common situation. A survey is conducted and an interesting statistical association
between X and Y is discovered.
• There are two basic assumptions that have to be made if we wish to infer from this that X may
cause Y. These involve the relationship between X and Y and other variables which might be
operating.
• They are designed to ensure that when we compare groups which differ on X, we are comparing
like with like. Before giving an exposition of these assumptions, we need a bit more terminology:
other variables can be causally prior to both X and Y, intervene between X and Y, or ensue from
X and Y, as shown in figure 11.2. These terms are only relative to the particular causal model in
hand: in a different model we might want to explain what gave rise to the prior variable.

Let us discuss each of the two core assumptions in turn.


• All other variables which affect both X and Y must be held constant. In an experiment, we can be
sure that there are no third variables which give rise to both X and Y because the only way in which
the randomized control groups are allowed to vary is in terms of X. No such assumption can be
made with non-experimental data.

• This assumption is not required before you can assume that there is a causal link between X and
Y, but it is required if you aim to understand how X is causing Y.

• Let us first consider a hypothetical example drawn from the earlier discussion of the causes of
absenteeism. Suppose previous research had shown a positive bivariate relationship between low
social status jobs and absenteeism.

• The question arises: is there something about such jobs that directly causes the people who do them
to go off sick more than others? Before we can draw such a conclusion, two assumptions have to
be made.

• There are many possible outcomes once the relationship between all three variables is considered
at once, four of which are shown in figure 11.3.
• If the relationship between two variables entirely disappears when a causally prior variable is
brought under control, we say that the original relationship was spurious.

• By this we do not mean that the bivariate effect did not really exist, but rather that any causal
conclusions drawn from it would be incorrect.

• We can now introduce another meaning for that verb 'to explain': in this situation, many researchers
say that the proportion of females in a job 'explains' the relationship between the status of the job
and absenteeism, in the sense that it accounts for it entirely.

• But what of the fourth situation which is actually the most likely outcome? It was the situation
portrayed in figure 11.1.
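
The spurious case can be made concrete with a three-variable table. The counts below are invented purely for illustration and are not taken from any study: absence is more common among women, and women are concentrated in low status jobs, but within each gender job status makes no difference.

```python
# Hypothetical counts: (gender, job_status) -> (number absent, group size).
counts = {
    ("female", "low"):  (24, 80),
    ("female", "high"): (6, 20),
    ("male", "low"):    (2, 20),
    ("male", "high"):   (8, 80),
}

def absence_rate(status, gender=None):
    """Proportion absent in a job-status group, optionally within one gender."""
    absent = total = 0
    for (g, s), (a, n) in counts.items():
        if s == status and (gender is None or g == gender):
            absent += a
            total += n
    return absent / total

# Bivariate comparison: low versus high status jobs, ignoring gender.
overall_gap = absence_rate("low") - absence_rate("high")

# Controlled comparison: the same difference within each gender separately.
gap_by_gender = {
    g: absence_rate("low", g) - absence_rate("high", g)
    for g in ("female", "male")
}

print(f"overall gap: {overall_gap:.2f}")
for g, gap in gap_by_gender.items():
    print(f"gap among {g} workers: {gap:.2f}")
```

In this constructed example the overall gap is positive while both within-gender gaps are zero, which is the pattern described above as a spurious relationship. If the within-gender gaps were merely smaller than the overall gap, the situation would correspond to figure 11.1 instead.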

Simpson's paradox
• In some cases the relationship between two variables is not simply reduced when a third, prior,
variable is taken into account but indeed the direction of the relationship is completely reversed.
• This is often known as Simpson's paradox (named after Edward Simpson who wrote a paper
describing the phenomenon that was published by the Royal Statistical Society in 1951). However,
the insight that a third variable can be vitally important for understanding the relationship between
two other variables is also credited to Karl Pearson in the late nineteenth century. Simpson's
paradox can be succinctly summarized as follows: every statistical relationship between two
variables may be reversed by including additional factors in the analysis.
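
A reversal of this kind is easy to reproduce with a small table. The counts here are illustrative only; "A" and "B" stand for any two groups being compared, and the two strata for the categories of a prior third variable.

```python
# Illustrative counts (successes, trials) for groups A and B within two strata.
data = {
    "stratum 1": {"A": (81, 87), "B": (234, 270)},
    "stratum 2": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    """Proportion of successes."""
    return successes / trials

# Difference A - B within each stratum (third variable held constant).
within = {
    name: rate(*groups["A"]) - rate(*groups["B"])
    for name, groups in data.items()
}

# Difference A - B after pooling the strata (third variable ignored).
pooled = {}
for g in ("A", "B"):
    successes = sum(groups[g][0] for groups in data.values())
    trials = sum(groups[g][1] for groups in data.values())
    pooled[g] = rate(successes, trials)
pooled_gap = pooled["A"] - pooled["B"]

for name, gap in within.items():
    print(f"{name}: A - B = {gap:+.3f}")
print(f"pooled:    A - B = {pooled_gap:+.3f}")
```

A does better than B inside each stratum yet worse once the strata are pooled, because A's cases are concentrated in the stratum where success is harder for everyone.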

II THREE-VARIABLE CONTINGENCY TABLES AND BEYOND

Causal path models for three variables


• The set of paths of causal influence, both direct and indirect, that we want to begin to consider are
represented in figure 12.5. In this causal model we are trying to explain social trust; the base
category of the response variable is therefore the belief that 'You can't be too careful'.
• The base categories selected for the explanatory variables are having lower levels of qualifications
and not being a member of a voluntary organization, to try and avoid negative paths. Each arrow
linking two variables in a causal path diagram represents the direct effect of one variable upon the
other, controlling all other relevant variables.
• The rule for identifying the relevant variables was given in chapter 11: when we are assessing the
direct effect of one variable upon another, any third variable which is likely to be causally
connected to both variables and prior to one of them should be controlled.
• Coefficient b in figure 12.5 shows the direct effect of being in a voluntary association on the belief
that most people can be trusted.
• To find its value, we focus attention on the proportion who say that most people can be trusted,
controlling for level of qualifications.

More complex models: going beyond three variables
• Clearly there are likely to be many other factors or 'variables' that will have an influence, both on
volunteering behaviour and on social trust.
• For example, in the model discussed above we have not considered gender or age, and both of
these may have an impact on all of the variables in our model.
• As can be seen from the discussion above, it becomes quite complicated even to calculate the direct
and indirect causal paths when we have a simple model with three variables.
• We therefore need to go beyond these paper and pencil techniques if we are going to build more
complex models that aim to compare the impact of a number of different explanatory variables on
an outcome variable such as social trust.
• The following section describes the conceptual foundations that underlie models to examine the
factors influencing a simple dichotomous (two-category) variable.

Logistic regression models


• Regression analysis is a method for predicting the values of a continuously distributed dependent
variable from an independent, or explanatory, variable.
• The principles behind logistic regression are very similar and the approach to building models and
interpreting the models is virtually identical. However, whereas regression (more properly termed
Ordinary Least Squares regression, or OLS regression) is used when the dependent variable is
continuous, a binary logistic regression model is used when the dependent variable can only take
two values. In many examples this dependent variable indicates whether an event occurs or not,
and logistic regression is used to model the probability that the event occurs.
• In the example we have been discussing above, therefore, logistic regression would be used to
model the probability that an individual believes that most people can be trusted. When we are just
using a single explanatory variable, such as volunteering, the logistic regression can be written as
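
A standard way of writing this single-predictor model is given here. The symbols are notation assumed for illustration: p is the probability that an individual believes most people can be trusted, x is 1 for volunteers and 0 otherwise, and a and b are the coefficients to be estimated.

```latex
\log\left(\frac{p}{1-p}\right) = a + b\,x
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-(a + b x)}}
```

The left-hand side is the log odds of trusting, so b is the change in the log odds associated with volunteering and a is the log odds for those who do not volunteer.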

III LONGITUDINAL DATA

• It is important to distinguish longitudinal data from time series data. Although time series data
can provide us with a picture of aggregate change, it is only longitudinal data that can provide
evidence of change at the level of the individual.
• Time series data could perhaps be understood as a series of snapshots of society, whereas
longitudinal research entails following the same group of individuals over time and linking
information about those individuals from one time point to another.
• For example, in a study such as the British Household Panel Survey, individuals are interviewed
each year about a range of topics including income, political preferences and voting. This makes
it possible to link data about individuals over time and examine, for example, how an individual's
income may rise (or fall) year on year and how their political preferences may change.
• The first part provides a brief introduction to longitudinal research design and focuses on some of
the issues in collecting longitudinal data and problems of attrition. The second part then provides
a brief conceptual introduction to the analysis of longitudinal data.
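
The difference between aggregate change and individual change can be sketched with a tiny panel. The three people, two years and income figures are hypothetical, chosen so that the two views disagree.

```python
import pandas as pd

# Hypothetical panel: three people interviewed in two successive years.
panel = pd.DataFrame({
    "person": ["a", "b", "c", "a", "b", "c"],
    "year":   [2001, 2001, 2001, 2002, 2002, 2002],
    "income": [20, 30, 40, 30, 30, 30],
})

# Time-series view: one aggregate figure per year.
yearly_mean = panel.groupby("year")["income"].mean()

# Longitudinal view: link each person's records across the two years.
wide = panel.pivot(index="person", columns="year", values="income")
individual_change = wide[2002] - wide[2001]

print(yearly_mean)
print(individual_change)
```

The yearly means are identical, so the aggregate series shows no change at all, yet one person's income rose and another's fell. Only the linked records reveal this.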

Collecting longitudinal data


Prospective and retrospective research designs
• Longitudinal data are frequently collected using a prospective longitudinal research design, i.e. the
participants in a research study are contacted by researchers and asked to provide information
about themselves and their circumstances on a number of different occasions.
• This is often referred to as a panel study. However, it is not necessary to use a longitudinal research
design in order to collect longitudinal data and there is therefore a conceptual distinction between
longitudinal data and longitudinal research.
• Indeed, the retrospective collection of longitudinal data is very common. In particular, it has
become an established method for obtaining basic information about the dates of key life events
such as marriages, separations and divorces and the birth of any children (i.e. event history data).
• This is clearly an efficient way of collecting longitudinal data and obviates the need to re-contact
the same group of individuals over a period of time.
• A potential problem is that people may not remember the past accurately enough to provide good
quality data. While some authors have argued that recall is not a major problem for collecting
information about dates of significant life events, other research suggests that individuals may have
difficulty remembering dates accurately, or may prefer not to remember unfavourable episodes or
events in their lives.
• Large-scale quantitative surveys often combine a number of different data collection strategies so
they do not always fit neatly into the classification of prospective or retrospective designs. In
particular, longitudinal event history data are frequently collected retrospectively as part of an
ongoing prospective longitudinal study.

V FUNDAMENTALS OF TIME SERIES ANALYSIS

Time Series Analysis

• Time series data:
o is a sequence of quantitative observations about a system or process, made at
successive points in time.
o includes timestamps.
o is typically generated while monitoring an industrial process or tracking a business
metric.
• An ordered sequence of timestamp values at equally spaced intervals is referred to as a time
series.
• Analysis of such a time series is used in many applications such as sales forecasting, utility
studies, budget analysis, economic forecasting, inventory studies, and so on.
• There are a plethora of methods that can be used to model and forecast time series.
• Two key points:
o collection of observations – since it is a series
o sequentially in time – since it deals with time

Fundamentals of TSA

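Generate a random dataset drawn from a standard normal distribution:

1. Generation of the dataset using the numpy library:
Code:
import os
import numpy as np
# IPython/Jupyter magic; omit this line when running as a plain Python script
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns

# 50 draws from a normal distribution with mean 0 and standard deviation 1
zero_mean_series = np.random.normal(loc=0.0, scale=1., size=50)
zero_mean_series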
Output:
array([-0.73140395, -2.4944216 , -1.44929237, -0.40077112,
0.23713083, 0.89632516, -0.90228469, -0.96464949, 1.48135275,
0.64530002, -1.70897785, 0.54863901, -1.14941457, -1.49177657,
-2.04298133, 1.40936481, 0.65621356, -0.37571958, -0.04877503,
-0.84619236, -1.46231312, 2.42031845, -0.91949491, 0.80903063,
0.67885337, -0.1082256 , -0.16953567, 0.93628661, 2.57639376,
-0.01489153, 0.9011697 , -0.29900988, 0.04519547, 0.71230853,
-0.00626227, 1.27565662, -0.42432848, 1.44748288, 0.29585819,
0.70547011, -0.6838063 , 1.61502839, -0.04388889, 1.06261716,
0.17708138, 0.3723592 , -0.77185183, -3.3487284 , 0.59464475,
-0.89005505])

2. Plotting the time series data using the seaborn library:

The time series graph is plotted using the sns.lineplot() function, which is provided by the
seaborn library.
Code:
plt.figure(figsize=(16, 8))
g = sns.lineplot(data=zero_mean_series)
g.set_title('Zero mean model')
g.set_xlabel('Time index')
plt.show()
Output:

Taking the cumulative sum of the series and plotting the result as a time series gives a more
interesting picture.
Code:
random_walk = np.cumsum(zero_mean_series)
random_walk
Output:
The call returns an array of cumulative sums:
array([ -0.73140395, -3.22582556, -4.67511792, -5.07588904,
-4.83875821, -3.94243305, -4.84471774, -5.80936723,
-4.32801448, -3.68271446, -5.39169231, -4.8430533 ,
-5.99246787, -7.48424444, -9.52722576, -8.11786095,
-7.46164739, -7.83736697, -7.886142 , -8.73233436,
-10.19464748, -7.77432903, -8.69382394, -7.88479331,
-7.20593994, -7.31416554, -7.4837012 , -6.5474146 ,
-3.97102084, -3.98591237, -3.08474267, -3.38375255,
-3.33855708, -2.62624855, -2.63251082, -1.35685419,
-1.78118268, -0.3336998 , -0.03784161, 0.66762849,
-0.01617781, 1.59885058, 1.55496169, 2.61757885,
2.79466023, 3.16701943, 2.3951676 , -0.9535608 ,
-0.35891606, -1.2489711 ])

Each element of random_walk is the sum of all values of zero_mean_series up to and including
that position.
3. Plotting this array as a time series shows how the values change over time:
Code:
plt.figure(figsize=(16, 8))
g = sns.lineplot(data=random_walk)
g.set_title('Random Walk')
g.set_xlabel('Time index')
plt.show()

Output:

Univariate time series


• When a sequence of observations is captured for the same variable over a particular duration
of time, the series is referred to as univariate time series.
• In general, in a univariate time series, the observations are taken over regular time periods,
such as the change in temperature over time throughout a day.

VI CHARACTERISTICS OF TIME SERIES DATA

• When looking at time series data, it is essential to see if there is any trend. Observing a trend
means that the average measurement values seem either to decrease or increase over time.
• Time series data may contain a notable number of outliers. These outliers can be noted when
plotted on a graph.
• Some data in time series tends to repeat over a certain interval in some patterns. Such repeating
patterns are referred to as seasonality.
• Sometimes, there is an uneven change in time series data. Such uneven changes are referred
to as abrupt changes. Observing abrupt changes in time series is essential as it reveals essential
underlying phenomena.
• Some series have constant variance over time and others do not. Hence, it is essential to check
whether or not the data exhibits constant variance over time.
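
Two of these characteristics, trend and constant variance, can be checked roughly with rolling statistics. The sketch below runs on a synthetic series, so the slope, cycle length, noise level and window sizes are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
dates = pd.date_range("2020-01-01", periods=730, freq="D")
t = np.arange(len(dates))

# Synthetic series: upward trend + yearly cycle + noise (all values invented).
values = 0.05 * t + 5 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 1, len(t))
series = pd.Series(values, index=dates)

# A centred 365-day rolling mean averages out the yearly cycle and exposes the trend.
trend = series.rolling(window=365, center=True).mean().dropna()

# A 30-day rolling standard deviation of the day-to-day changes gives a rough
# indication of whether the variability stays stable over time.
rolling_sd = series.diff().rolling(window=30).std().dropna()

print(trend.iloc[[0, -1]])
print(rolling_sd.describe())
```

A rolling mean that keeps rising points to a trend, while a rolling standard deviation that stays within a narrow band is consistent with constant variance. Neither is a formal test.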

TSA with Open Power System Data


Code:
import pandas as pd

# load time series dataset
df_power = pd.read_csv("[Link] sd_germany_daily.csv")
df_power.columns

Output:
Index(['Date', 'Consumption', 'Wind', 'Solar', 'Wind+Solar'], dtype='object')
The columns of the dataframe are:
Date: The date is in the format yyyy-mm-dd.
Consumption: This indicates electricity consumption in GWh.
Wind: This indicates wind power production in GWh.
Solar: This indicates solar power production in GWh.
Wind+Solar: This represents the sum of solar and wind power production in GWh.
• The Date column holds the timestamps that order the observations in time.
• This dataset can be used to discover how electricity consumption and production vary
over time in Germany.

VII DATA CLEANING

Clean the dataset for outliers:


1. Checking the shape of the dataset:
Code:
df_power.shape
Output:
(4383, 5)
The dataframe contains 4,383 rows and 5 columns.
2. A few entries of the dataframe can also be inspected. The last 10 entries can be examined
by using the following
code:
df_power.tail(10)
Output:

3. The data types of each column are reviewed in the df_power dataframe by:
df_power.dtypes
Output:
Date object
Consumption float64
Wind float64
Solar float64
Wind+Solar float64
dtype: object
• The Date column has a data type of object, which is not appropriate for dates. So, the next step
is to convert the Date column, as shown here:
# convert object to datetime format
df_power['Date'] = pd.to_datetime(df_power['Date'])
• It should convert the Date column to Datetime format. This can be verified using the
following code:
df_power.dtypes
Output:
Date datetime64[ns]
Consumption float64
Wind float64
Solar float64
Wind+Solar float64
dtype: object
The Date column has been changed to the correct data type.
• The index of the dataframe can be changed to the Date column:
df_power = df_power.set_index('Date')
df_power.tail(3)
Output:

• In the preceding screenshot, the Date column has been set as DatetimeIndex.
• This can be simply verified by using the code snippet given here:
df_power.index
Output:
DatetimeIndex(['2006-01-01', '2006-01-02', '2006-01-03',
'2006-01-04', '2006-01-05', '2006-01-06', '2006-01-07',
'2006-01-08', '2006-01-09', '2006-01-10', ... '2017-12-22',
'2017-12-23', '2017-12-24', '2017-12-25', '2017-12-26',
'2017-12-27', '2017-12-28', '2017-12-29', '2017-12-30',
'2017-12-31'],
dtype='datetime64[ns]', name='Date', length=4383, freq=None)
• Since the index is the DatetimeIndex object, it can be used to analyze the dataframe. To make our
lives easier, more columns need to be added to the dataframe.
Adding Year, Month, and Weekday Name:
# Add columns with year, month, and weekday name
df_power['Year'] = df_power.index.year
df_power['Month'] = df_power.index.month
# DatetimeIndex.weekday_name was removed in pandas 1.0; day_name() replaces it
df_power['Weekday Name'] = df_power.index.day_name()
• Let's display five random rows from the dataframe:
# Display a random sampling of 5 rows
df_power.sample(5, random_state=0)
Output:

• Three more columns have been added: Year, Month, and Weekday Name.


• Adding these columns helps to make the analysis of data easier.
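
As an alternative to converting the column after loading, pandas can parse the dates and set the index while reading the file. The sketch below uses a three-row in-memory CSV with invented consumption values in place of the real dataset.

```python
import io
import pandas as pd

# Stand-in for the data file: three rows with invented consumption values.
csv_text = """Date,Consumption
2017-12-29,1295.1
2017-12-30,1215.4
2017-12-31,1107.1
"""

# parse_dates converts the column to datetime64; index_col makes it the index.
demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")

# Derived calendar columns come straight from the DatetimeIndex.
demo["Year"] = demo.index.year
demo["Month"] = demo.index.month
demo["Weekday Name"] = demo.index.day_name()

print(demo.dtypes)
print(demo)
```

The result has the same shape as the cleaned dataframe built step by step above: a DatetimeIndex named Date plus the three helper columns.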

Time-based indexing
• Time-based indexing is a very powerful method of the pandas library when it comes to
time series data.
• Having time-based indexing allows using a formatted string to select data.
Code:
df_power.loc['2015-10-02']
Output:
Consumption 1391.05
Wind 81.229
Solar 160.641
Wind+Solar 241.87
Year 2015
Month 10
Weekday Name Friday
Name: 2015-10-02 00:00:00, dtype: object
• The pandas dataframe loc accessor is used.
• In the preceding example, the date is used as a string to select a row.
• All sorts of techniques can be used to access rows just as we can do with a normal
dataframe index.
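
The main forms of string-based selection can be tried on a small synthetic frame. The dates and values below are arbitrary and stand in for df_power.

```python
import pandas as pd

# Small synthetic daily series (values are arbitrary) indexed by date.
idx = pd.date_range("2015-09-28", "2015-10-04", freq="D")
demo = pd.DataFrame({"value": range(len(idx))}, index=idx)

one_day = demo.loc["2015-10-02"]               # a single row, returned as a Series
october = demo.loc["2015-10"]                  # partial string: every row in October 2015
window = demo.loc["2015-09-29":"2015-10-01"]   # slice: both endpoints are included

print(one_day["value"], len(october), len(window))
```

Unlike ordinary Python slicing, a label slice on a DatetimeIndex includes its end point, which is worth keeping in mind when selecting date ranges.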

Visualizing time series


Consider the df_power dataframe to visualize the time series dataset:
• The first step is to import the seaborn and matplotlib libraries:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={'figure.figsize':(11, 4)})
plt.rcParams['figure.figsize'] = (8,5)
plt.rcParams['figure.dpi'] = 150
• Next, a line plot of the full time series of Germany's daily electricity consumption is
generated:
df_power['Consumption'].plot(linewidth=0.5)

Output:

• In the above screenshot, the y-axis shows the electricity consumption and the x-axis shows the year.
• However, there are too many data points to make out any detail across all the years.
• Plotting the data as dots, with one panel per column, gives a clearer view:
cols_to_plot = ['Consumption', 'Solar', 'Wind']
axes = df_power[cols_to_plot].plot(marker='.', alpha=0.5,
    linestyle='None', figsize=(14, 6), subplots=True)
for ax in axes:
    ax.set_ylabel('Daily Totals (GWh)')

Output:

• The output shows that electricity consumption can be broken down into two distinct
patterns:
o One cluster roughly from 1,400 GWh and above
o Another cluster roughly below 1,400 GWh
• Moreover, solar production is higher in summer and lower in winter.
• Over the years, there seems to have been a strong increasing trend in the output of wind
power.
• A single year can be examined for a closer look:
ax = df_power.loc['2016', 'Consumption'].plot()
ax.set_ylabel('Daily Consumption (GWh)');
Output:

• From the preceding screenshot, the consumption of electricity for 2016 can be seen clearly.
• The graph shows a drastic decrease in the consumption of electricity at the end of the year
(December) and during August.
• The month of December 2016 can be examined with the following code block:
ax = df_power.loc['2016-12', 'Consumption'].plot(marker='o', linestyle='-')
ax.set_ylabel('Daily Consumption (GWh)');

Output:

• As shown in the preceding graph, electricity consumption is higher on weekdays and lowest
at the weekends.
• The consumption can be observed for each day of the month.
• To see how consumption plays out in the last week of December, it can be zoomed in
further.
• In order to indicate a particular week of December, a specific date range can be supplied
as shown here:
ax = df_power.loc['2016-12-23':'2016-12-30', 'Consumption'].plot(marker='o',
    linestyle='-')
ax.set_ylabel('Daily Consumption (GWh)');
• As illustrated in the preceding code, the electricity consumption between 2016-12-23 and
2016-12-30 can be observed.

Output:

• As illustrated in the preceding screenshot, electricity consumption was lowest on the day of
Christmas, probably because people were busy partying.
• After Christmas, consumption increased.

VIII GROUPING TIME SERIES DATA


• The data can be grouped by different time periods and box plots can be presented:
• First, the data can be grouped by months, and then the box plots can be used to visualize
the data:
fig, axes = plt.subplots(3, 1, figsize=(8, 7), sharex=True)
for name, ax in zip(['Consumption', 'Solar', 'Wind'], axes):
    sns.boxplot(data=df_power, x='Month', y=name, ax=ax)
    ax.set_ylabel('GWh')
    ax.set_title(name)
    # keep the x-axis label only on the bottom panel
    if ax != axes[-1]:
        ax.set_xlabel('')

Output:

• The preceding plot illustrates that electricity consumption is generally higher in the winter and
lower in the summer.
• Solar production is higher during the summer, whereas wind production tends to be higher in the winter.
• Moreover, there are many outliers associated with electricity
consumption, wind production, and solar production.
• Next, the consumption of electricity can be grouped by the day of the week, and presented
in a box plot:
sns.boxplot(data=df_power, x='Weekday Name', y='Consumption');
Output:

• The preceding screenshot shows that electricity consumption is higher on weekdays than
on weekends.
• Interestingly, there are more outliers on the weekdays.
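
The groupings that the box plots display can also be summarised numerically with groupby. This sketch uses four weeks of invented consumption figures (1500 on weekdays, 1200 at weekends) rather than the real data.

```python
import numpy as np
import pandas as pd

# Four weeks of synthetic daily data, starting on Monday 2016-01-04.
idx = pd.date_range("2016-01-04", periods=28, freq="D")
is_weekend = idx.dayofweek >= 5  # Monday=0, so 5 and 6 are Saturday and Sunday
demo = pd.DataFrame({"Consumption": np.where(is_weekend, 1200.0, 1500.0)}, index=idx)
demo["Weekday Name"] = demo.index.day_name()

# One median per weekday: the same quantity a box plot draws as its centre line.
weekday_median = demo.groupby("Weekday Name")["Consumption"].median()
print(weekday_median)
```

Replacing median() with describe() would return the quartiles as well, which is the rest of the information a box plot encodes.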

IX RESAMPLING TIME SERIES DATA


• It is often required to resample the dataset at lower or higher frequencies.
• This resampling is done based on aggregation or grouping operations.
• For example, the data can be resampled based on the weekly mean time series as follows:
1. To resample the data, the following code can be used:
columns = ['Consumption', 'Wind', 'Solar', 'Wind+Solar']
power_weekly_mean = df_power[columns].resample('W').mean()
power_weekly_mean

Output:

• The above screenshot shows that the first row, labeled 2006-01-01, contains the mean of all
the data that falls in the weekly bin labeled by that date.
• The daily and weekly time series can be plotted together to compare them over a six-month
period.
2. Consider the first six months of 2016. Let's start by initializing the variables:
start, end = '2016-01', '2016-06'
3. To plot the graph, the following code can be used:
fig, ax = plt.subplots()
ax.plot(df_power.loc[start:end, 'Solar'], marker='.', linestyle='-', linewidth=0.5,
    label='Daily')
ax.plot(power_weekly_mean.loc[start:end, 'Solar'], marker='o',
    markersize=8, linestyle='-', label='Weekly Mean Resample')
ax.set_ylabel('Solar Production in (GWh)')
ax.legend();
Output:

• The preceding screenshot shows that the weekly mean time series is increasing over time and
is much smoother than the daily time series.
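
How the weekly bins are formed and labelled is easiest to see on a tiny example. The two weeks of values below are invented; only the resample('W').mean() call mirrors the code used above.

```python
import pandas as pd

# Two weeks of invented daily values, starting on Monday 2016-01-04.
idx = pd.date_range("2016-01-04", periods=14, freq="D")
daily = pd.Series(range(1, 15), index=idx, dtype="float64")

# Downsample to weekly means. With 'W', each bin runs Monday to Sunday
# and is labelled by the Sunday that ends it.
weekly_mean = daily.resample("W").mean()
print(weekly_mean)
```

Fourteen daily values collapse into two weekly means, labelled 2016-01-10 and 2016-01-17. Averaging within each bin is what makes the weekly series smoother than the daily one.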
