0% found this document useful (0 votes)
7 views30 pages

Python Pandas

Chapter 5 introduces Pandas, a powerful data manipulation library built on Numpy, ideal for time-series and spreadsheet-style data analysis. It covers key features such as Series and DataFrame objects, which facilitate data handling, including indexing, grouping, and basic statistical operations. The chapter also highlights how to create, manipulate, and visualize data using Pandas, making it particularly useful in quantitative finance.

Uploaded by

rohini.r
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views30 pages

Python Pandas

Chapter 5 introduces Pandas, a powerful data manipulation library built on Numpy, ideal for time-series and spreadsheet-style data analysis. It covers key features such as Series and DataFrame objects, which facilitate data handling, including indexing, grouping, and basic statistical operations. The chapter also highlights how to create, manipulate, and visualize data using Pandas, making it particularly useful in quantitative finance.

Uploaded by

rohini.r
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Chapter 5

Pandas

Pandas is a powerful module that is optimized on top of Numpy and provides a set of
data structures particularly suited to time-series and spreadsheet-style data analysis
(think of pivot tables in Excel). If you are familiar with the R statistical package, then
you can think of Pandas as providing a Numpy-powered DataFrame for Python.
Pandas provides a DataFrame object (among others) built on a Numpy platform
to ease data manipulation (especially for time-series) for statistical processing.
Pandas is particularly popular in quantitative finance. Key features of Pandas include
fast data manipulation and alignment, tools for exchanging data between different
formats and between SQL databases, handling missing data, and cleaning up messy
data.

5.1 Using Series

The easiest way to think about Pandas Series objects is as a container for two Numpy
arrays, one for the index and the other for the data. Recall that Numpy arrays already
have integer-indexing just like regular Python lists.
>>> import pandas as pd
>>> x = [Link]([1,2,30,0,15,6])
>>> x
0 1
1 2
2 30
3 0
4 15
5 6
dtype: int64

This object can be indexed in plain Numpy style,


>>> x[1:3] # Numpy slicing
1 2

© The Editor(s) (if applicable) and The Author(s), under exclusive license 127
to Springer Nature Switzerland AG 2021
J. Unpingco, Python Programming for Data Analysis,
[Link]
128 5 Pandas

2 30
dtype: int64

We can also get the Numpy arrays directly,


>>> [Link] # values
array([ 1, 2, 30, 0, 15, 6])
>>> [Link][1:3]
array([ 2, 30])
>>> [Link] # index is Numpy array-like
RangeIndex(start=0, stop=6, step=1)

Unlike Numpy arrays, you can have mixed types,


>>> s = [Link]([1,2,'anything','more stuff'])
>>> s
0 1
1 2
2 anything
3 more stuff
dtype: object
>>> [Link] # Series index
RangeIndex(start=0, stop=4, step=1)
>>> s[0] # The usual Numpy slicing rules apply
1
>>> s[:-1]
0 1
1 2
2 anything
dtype: object
>>> [Link] # object data type
dtype('O')

Beware that mixed types in a single column can lead to downstream inefficiencies
and other problems. The index in the [Link] generalizes beyond integer-
indexing. For example,
>>> s = [Link]([1,2,3],index=['a','b','cat'])
>>> s['a']
1
>>> s['cat']
3

Because of its legacy as a financial data (i.e., stock prices) processing tool, Pandas
is really good at managing time-series
>>> dates = pd.date_range('20210101',periods=12)
>>> s = [Link](range(12),index=dates) # explicitly assign index
>>> s # default is calendar-daily
2021-01-01 0
2021-01-02 1
2021-01-03 2
2021-01-04 3
2021-01-05 4
2021-01-06 5
2021-01-07 6
5.1 Using Series 129

Fig. 5.1 Quick plot of Series object

2021-01-08 7
2021-01-09 8
2021-01-10 9
2021-01-11 10
2021-01-12 11
Freq: D, dtype: int64

You can do some basic descriptive statistics on the data (not the index!) right away
>>> [Link]()
5.5
>>> [Link]()
3.605551275463989

You can also plot (see Fig. 5.1) the Series using its methods:
>>> [Link](kind='bar',alpha=0.3) # can add extra matplotlib
→ keywords

Data can be summarized by the index. For example, to count the individual days of
the week for which we had data:
>>> [Link](by=lambda i:[Link]).count()
0 2
1 2
2 1
3 1
130 5 Pandas

4 2
5 2
6 2
dtype: int64

Note that the convention 0 is Monday, 1 is Tuesday, and so forth. Thus, there was
only one Sunday (day 6) in the dataset. The groupby method partitions the data
into disjoint groups based on the predicate given via the by keyword argument.
Consider the following Series,
>>> x = [Link](range(5),index=[1,2,11,9,10])
>>> x
1 0
2 1
11 2
9 3
10 4
dtype: int64

Let us group it in the following according to whether the elements in the values are
even or odd using the modulus (%) operator,
>>> grp = [Link](lambda i:i%2) # odd or even
>>> grp.get_group(0) # even group
2 1
10 4
dtype: int64
>>> grp.get_group(1) # odd group
1 0
11 2
9 3
dtype: int64

The first line groups the elements of the Series object by whether or not the index
is even or odd. The lambda function returns 0 or 1 depending on whether or not
the corresponding index is even or odd, respectively. The next line shows the 0 (i.e.,
even) group and then the one after shows the 1 (odd) group. Now that we have
separate groups, and we can perform a wide variety of summarizations on each to
reduce it to a single value. For example, in the following, we get the maximum value
of each group:
>>> [Link]() # max in each group
0 4
1 3
dtype: int64

Note that the operation above returns another Series object with an index
corresponding to the [0,1] elements. There will be as many groups as there are
unique outputs of the by function.
5.2 Using DataFrame 131

5.2 Using DataFrame

While the Series object can be thought of as encapsulating two Numpy arrays
(index and values), the Pandas DataFrame is an encapsulation of group of
Series objects that share a single index. We can create a DataFrame with
dictionaries as in the following:
>>> df = [Link]({'col1': [1,3,11,2], 'col2': [9,23,0,2]})
>>> df
col1 col2
0 1 9
1 3 23
2 11 0
3 2 2

Note that the keys in the input dictionary are now the column headings (labels) of the
DataFrame, with each corresponding column matching the list of corresponding
values from the dictionary. Like the Series object, the DataFrame also has an
index, which is the [0,1,2,3] column on the far left. We can extract elements
from each column using the iloc, which ignore the given index and return to
traditional Numpy slicing,
>>> [Link][:2,:2] # get section
col1 col2
0 1 9
1 3 23

or by directly slicing or by using the dot notation as shown below:


>>> df['col1'] # indexing
0 1
1 3
2 11
3 2
Name: col1, dtype: int64
>>> df.col1 # use dot notation
0 1
1 3
2 11
3 2
Name: col1, dtype: int64

Programming Tip: Spaces in Column Names


As long as the names of the columns in the DataFrame do not contain
spaces or other eval-able syntax like hyphens, you can use the dot notation
attribute-style access to the column values. You can report the column names
using [Link].

Subsequent operations on the DataFrame preserve its column-wise structure as in


the following:
132 5 Pandas

>>> [Link]()
col1 17
col2 34
dtype: int64

where each column was totaled. Grouping and aggregating with the DataFrame is
even more powerful than with Series. Let us construct the following DataFrame,
>>> df = [Link]({'col1': [1,1,0,0], 'col2': [1,2,3,4]})
>>> df
col1 col2
0 1 1
1 1 2
2 0 3
3 0 4

In the above DataFrame, note that the col1 column has only two distinct entries.
We can group the data using this column as in the following:
>>> grp=[Link]('col1')
>>> grp.get_group(0)
col1 col2
2 0 3
3 0 4
>>> grp.get_group(1)
col1 col2
0 1 1
1 1 2

Note that each group corresponds to entries for which col1 was either of its two
values. Now that we have grouped on col1, as with the Series object, we can also
functionally summarize each of the groups as in the following:
>>> [Link]()
col2
col1
0 7
1 3

where the sum is applied across each of the dataframes present in each group. Note
that the index of the output above is each of the values in the original col1.
The DataFrame can compute new columns based on the existing columns
using the eval method as shown below:
>>> df['sum_col']=[Link]('col1+col2')
>>> df
col1 col2 sum_col
0 1 1 2
1 1 2 3
2 0 3 3
3 0 4 4

Note that you can assign the output to a new column to the DataFrame as shown.
We can group by multiple columns as shown below:
5.2 Using DataFrame 133

>>> grp = [Link](['sum_col','col1'])

Doing the sum operation on each group gives the following:


>>> res = [Link]()
>>> res
col2
sum_col col1
2 1 1
3 0 3
1 2
4 0 4

This output is much more complicated than anything we have seen so far, so let us
carefully walk through it. Below the headers, the first row 2 1 1 indicates that for
sum_col=2 and for all values of col1 (namely, just the value 1), the value of col2
is 1. For the next row, the same pattern applies except that for sum_col=3, there are
now two values for col1, namely 0 and 1, which each have their corresponding two
values for the sum operation in col2. This layered display is one way to look at the
result. Note that the layers above are not uniform. Alternatively, we can unstack
this result to obtain the following tabular view of the previous result:
>>> [Link]()
col2
col1 0 1
sum_col
2 NaN 1.0
3 3.0 2.0
4 4.0 NaN

The NaN values indicate positions in the table where there is no entry. For
example, for the pair (sum_col=2,col2=0), there is no corresponding value in the
DataFrame, as you may verify by looking at the penultimate code block. There
is also no entry corresponding to the (sum_col=4,col2=1) pair. Thus, this shows
that the original presentation in the penultimate code block is the same as this one,
just without the above-mentioned missing entries indicated by NaN.
Let us continue with indexing dataframes.
>>> import numpy as np
>>> data=[Link](len(dates)*4).reshape(-1,4)
>>> df = [Link](data,index=dates,
... columns=['A','B','C','D' ])
>>> df
A B C D
2021-01-01 0 1 2 3
2021-01-02 4 5 6 7
2021-01-03 8 9 10 11
2021-01-04 12 13 14 15
2021-01-05 16 17 18 19
2021-01-06 20 21 22 23
2021-01-07 24 25 26 27
2021-01-08 28 29 30 31
2021-01-09 32 33 34 35
2021-01-10 36 37 38 39
134 5 Pandas

2021-01-11 40 41 42 43
2021-01-12 44 45 46 47

Now, you can access each of the columns by name, as in the following:
>>> df['A']
2021-01-01 0
2021-01-02 4
2021-01-03 8
2021-01-04 12
2021-01-05 16
2021-01-06 20
2021-01-07 24
2021-01-08 28
2021-01-09 32
2021-01-10 36
2021-01-11 40
2021-01-12 44
Freq: D, Name: A, dtype: int64

Or, using a quicker attribute-style notation


>>> df.A
2021-01-01 0
2021-01-02 4
2021-01-03 8
2021-01-04 12
2021-01-05 16
2021-01-06 20
2021-01-07 24
2021-01-08 28
2021-01-09 32
2021-01-10 36
2021-01-11 40
2021-01-12 44
Freq: D, Name: A, dtype: int64

Now, we can do some basic computing and indexing.


>>> [Link][:dates[3]] # unlike the Python convention, this
→ includes endpoints!
A B C D
2021-01-01 0 1 2 3
2021-01-02 4 5 6 7
2021-01-03 8 9 10 11
2021-01-04 12 13 14 15
>>> [Link][:,'A':'C'] # all rows by slice of column labels
A B C
2021-01-01 0 1 2
2021-01-02 4 5 6
2021-01-03 8 9 10
2021-01-04 12 13 14
2021-01-05 16 17 18
2021-01-06 20 21 22
2021-01-07 24 25 26
2021-01-08 28 29 30
2021-01-09 32 33 34
5.3 Reindexing 135

2021-01-10 36 37 38
2021-01-11 40 41 42
2021-01-12 44 45 46

5.3 Reindexing

A DataFrame or Series has an index that can align data using reindexing,
>>> x = [Link](range(3),index=['a','b','c'])
>>> x
a 0
b 1
c 2
dtype: int64
>>> [Link](['c','b','a','z'])
c 2.0
b 1.0
a 0.0
z NaN
dtype: float64

Note how the newly created Series object has a new index and fills in missing
items with NaN. You can fill in by other values by using the fill_value keyword
argument in reindex. It is also possible to do back-filling and forward-filling
(ffill) of values when working with ordered data as in the following:
>>> x = [Link](['a','b','c'],index=[0,5,10])
>>> x
0 a
5 b
10 c
dtype: object
>>> [Link](range(11),method='ffill')
0 a
1 a
2 a
3 a
4 a
5 b
6 b
7 b
8 b
9 b
10 c
dtype: object

More complicated interpolation methods are possible, but not directly using reindex.
Reindexing also applies to dataframes, but in either or both of the two dimensions.
>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C','D'],
... data = [Link](3*4).reshape(3,4))
136 5 Pandas

>>> df
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11

Now, we can reindex this by the index as in the following:


>>> [Link](['c','b','a','z'])
A B C D
c 8.0 9.0 10.0 11.0
b 4.0 5.0 6.0 7.0
a 0.0 1.0 2.0 3.0
z NaN NaN NaN NaN

Note how the missing z element has been filled in as with the prior Series object.
The same behavior applies to reindexing the columns as in the following:
>>> [Link](columns=['D','A','C','Z','B'])
D A C Z B
a 3 0 2 NaN 1
b 7 4 6 NaN 5
c 11 8 10 NaN 9

Again, we have the same filling behavior, now just applied to the columns. The same
back-filling and forward-filling works with ordered columns/indices as before.

5.4 Deleting Items

Returning to our previous Series object,


>>> x = [Link](range(3),index=['a','b','c'])

To get rid of the data at the ’a’ index, we can use the drop method that will return
a new Series object with the specified data removed,
>>> [Link]('a')
b 1
c 2
dtype: int64

Bear in mind this is a new Series object unless we use the inplace keyword
argument or explicitly using del,
>>> del x['a']

The same pattern applies to DataFrames.


>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C','D'],
... data = [Link](3*4).reshape(3,4))
>>> [Link]('a')
A B C D
b 4 5 6 7
c 8 9 10 11
5.5 Advanced Indexing 137

Or, along the column dimension,


>>> [Link]('A',axis=1)
B C D
a 1 2 3
b 5 6 7
c 9 10 11

Again, the same comments regarding using del and inplace also apply for
DataFrames.

5.5 Advanced Indexing

Pandas provides very powerful and fast slicing,


>>> x = [Link](range(4),index=['a','b','c','d'])
>>> x['a':'c']
a 0
b 1
c 2
dtype: int64

Note that unlike regular Python indexing, both endpoints are included here when
slicing with labels. This can also be used to assign values as in the following:
>>> x['a':'c']=999
>>> x
a 999
b 999
c 999
d 3
dtype: int64

Analogous behavior applies to DataFrames.


>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C','D'],
... data = [Link](3*4).reshape(3,4))
>>> df['a':'b']
A B C D
a 0 1 2 3
b 4 5 6 7

You can pick out individual columns without the colon (:).
>>> df[['A','C']]
A C
a 0 2
b 4 6
c 8 10

Mixing label-based slices with Numpy-like colon slicing is possible using loc,
138 5 Pandas

>>> [Link]['a':'b',['A','C']]
A C
a 0 2
b 4 6

The idea is that the first argument to loc indexes the rows and the second indexes
the columns. Heuristics allow for Numpy-like indexing without worrying about the
labels. You can go back to plain Numpy indexing with iloc.
>>> [Link][0,-2:]
C 2
D 3
Name: a, dtype: int64

5.6 Broadcasting and Data Alignment

The main thing to keep in mind when operating on one or more Series or
DataFrame objects is that the index always aligns the computation.
>>> x = [Link](range(4),index=['a','b','c','d'])
>>> y = [Link](range(3),index=['a','b','c'])

Note that y is missing one of the indices in x, so when we add them,


>>> x+y
a 0.0
b 2.0
c 4.0
d NaN
dtype: float64

Note that because y was missing one of the indices, it was filled in with a NaN. This
behavior also applies to dataframes,
>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C','D'],
... data = [Link](3*4).reshape(3,4))
>>> ef = [Link](index=list('abcd'),
... columns=list('ABCDE'),
... data = [Link](4*5).reshape(4,5))
>>> ef
A B C D E
a 0 1 2 3 4
b 5 6 7 8 9
c 10 11 12 13 14
d 15 16 17 18 19
>>> df
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
>>> df+ef
A B C D E
5.6 Broadcasting and Data Alignment 139

a 0.0 2.0 4.0 6.0 NaN


b 9.0 11.0 13.0 15.0 NaN
c 18.0 20.0 22.0 24.0 NaN
d NaN NaN NaN NaN NaN

Note that the non-overlapping elements are filled in with NaN. For simple opera-
tions, you can specify the fill value using the named operation. For example, in the
last case,
>>> [Link](ef,fill_value=0)
A B C D E
a 0.0 2.0 4.0 6.0 4.0
b 9.0 11.0 13.0 15.0 9.0
c 18.0 20.0 22.0 24.0 14.0
d 15.0 16.0 17.0 18.0 19.0

>>> s = [Link]['a'] # take first row


>>> s
A 0
B 1
C 2
D 3
Name: a, dtype: int64

When we add this Series object with the full DataFrame, we obtain the following:
>>> s + df
A B C D
a 0 2 4 6
b 4 6 8 10
c 8 10 12 14

Compare this to the original DataFrame,


>>> df
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11

This shows that the Series object was broadcast down the rows, aligning with the
columns in the DataFrame. Here is an example of a different Series object that is
missing some of the columns in the DataFrame,
>>> s = [Link]([1,2,3],index=['A','D','E'])
>>> s+df
A B C D E
a 1.0 NaN NaN 5.0 NaN
b 5.0 NaN NaN 9.0 NaN
c 9.0 NaN NaN 13.0 NaN

Note that the broadcasting still happens down the rows, aligning with the columns,
but fills in the missing entries with NaN.
Here is a quick Python test that uses regular expressions to test for relatively
small prime numbers,
140 5 Pandas

>>> import re
>>> pattern = r'^1?$|^(11+?)\1+$'
>>> def isprime(n):
... return ([Link](pattern, '1'*n) is None) #*
...

Now, we can find which column label has the most prime numbers in it
>>> [Link](isprime)
A B C D
a False False True True
b False True False True
c False False False True

The booleans are automatically cast in the sum below:


>>> [Link](isprime).sum()
A 0
B 1
C 1
D 3
dtype: int64

This just scratches the surface of the kinds of fluid data analysis that are almost
automatic using Pandas.

Programming Tip: Pandas Performance


Pandas groupby, apply, and applymap are flexible and powerful, but
Pandas has to evaluate them in the Python interpreter and not in the optimized
Pandas code, which results in significant slowdown. Thus, it is always best to
use the functions that are built into Pandas itself instead defining pure Python
functions to feed into these methods.

5.7 Categorical and Merging

Pandas supports a few relational algebraic operations like table joins.


>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C'],
... data = [Link](3*3).reshape(3,3))
>>> df
A B C
a 0 1 2
b 3 4 5
c 6 7 8
>>> ef = [Link](index=['a','b','c'],
... columns=['A','Y','Z'],
... data = [Link](3*3).reshape(3,3))
>>> ef
5.7 Categorical and Merging 141

A Y Z
a 0 1 2
b 3 4 5
c 6 7 8

The table join is implemented in the merge function.


>>> [Link](df,ef,on='A')
A B C Y Z
0 0 1 2 1 2
1 3 4 5 4 5
2 6 7 8 7 8

The on keyword argument says to merge the two DataFrames where they have
matching corresponding entries in the A column. Note that the index was not
preserved in the merge. To make things more interesting, let us make the ef
DataFrame different by dropping one of the rows,
>>> [Link]('b',inplace=True)
>>> ef
A Y Z
a 0 1 2
c 6 7 8

Now, let us try the merge again.


>>> [Link](df,ef,on='A')
A B C Y Z
0 0 1 2 1 2
1 6 7 8 7 8

Note that only the elements of A that match both DataFrames (i.e., are in the
intersection of both) are preserved. We can alter this by using the how keyword
argument.
>>> [Link](df,ef,on='A',how='left')
A B C Y Z
0 0 1 2 1.0 2.0
1 3 4 5 NaN NaN
2 6 7 8 7.0 8.0

The how=left keyword argument tells the join to keep all the keys on the left
DataFrame (df in this case) and fill in with NaN wherever missing in the right
DataFrame (ef). If ef has elements along A that are absent in df, then these would
disappear,
>>> ef = [Link](index=['a','d','c'],
... columns=['A','Y','Z'],
... data = 10*[Link](3*3).reshape(3,3))

>>> ef
A Y Z
a 0 10 20
d 30 40 50
c 60 70 80
142 5 Pandas

>>> [Link](df,ef,on='A',how='left')
A B C Y Z
0 0 1 2 10.0 20.0
1 3 4 5 NaN NaN
2 6 7 8 NaN NaN

Likewise, we can do how=right to use the right DataFrame keys,


>>> [Link](df,ef,on='A',how='right')
A B C Y Z
0 0 1.0 2.0 10 20
1 30 NaN NaN 40 50
2 60 NaN NaN 70 80

We can use how=outer to get the union of the keys,


>>> [Link](df,ef,on='A',how='outer')
A B C Y Z
0 0 1.0 2.0 10.0 20.0
1 3 4.0 5.0 NaN NaN
2 6 7.0 8.0 NaN NaN
3 30 NaN NaN 40.0 50.0
4 60 NaN NaN 70.0 80.0

Another common task is to split continuous data into discrete bins.


>>> a = [Link](10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> bins = [0,5,10]
>>> cats = [Link](a,bins)
>>> cats
[NaN, (0.0, 5.0], (0.0, 5.0], (0.0, 5.0], (0.0, 5.0], (0.0, 5.0],
(5.0, 10.0], (5.0, 10.0], (5.0, 10.0], (5.0, 10.0]]
Categories (2, interval[int64]): [(0, 5] < (5, 10]]

The [Link] function takes the data in the array and puts them into the categorical
variable cats.
>>> [Link]
IntervalIndex([(0, 5], (5, 10]],
closed='right',
dtype='interval[int64]')

The half-open intervals indicate the bounds of each category. You can change the
parity of the intervals by passing the right=False keyword argument.
>>> [Link]
array([-1, 0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=int8)

The -1 above means that the 0 is not included in either of the two categories
because the interval is open on the left. You can count the number of elements in
each category as shown next,
>>> pd.value_counts(cats)
(0, 5] 5
(5, 10] 4
dtype: int64
5.8 Memory Usage and dtypes 143

Descriptive names for each category can be passed using the labels keyword
argument.
>>> cats = [Link](a,bins,labels=['one','two'])
>>> cats
[NaN, 'one', 'one', 'one', 'one', 'one', 'two', 'two', 'two',
→ 'two']
Categories (2, object): ['one' < 'two']

Note that if you pass an integer argument for bins, it will automatically split into
equal-sized categories. The qcut function is very similar except that it splits on
quartiles.
>>> a = [Link](100) # uniform random variables
>>> cats = [Link](a,4,labels=['q1','q2','q3','q4'])
>>> pd.value_counts(cats)
q4 25
q3 25
q2 25
q1 25
dtype: int64

5.8 Memory Usage and dtypes

You will find yourself processing a lot of data with Pandas. Here are some tips to do
that efficiently. First, we need the Penguins dataset from Seaborn,
>>> import seaborn as sns
>>> df = sns.load_dataset('penguins')
>>> [Link]()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female

This is not a particularly big dataset, but it will suffice. Let us examine the dtypes
of the DataFrame,
>>> [Link]
species object
island object
bill_length_mm float64
bill_depth_mm float64
flipper_length_mm float64
body_mass_g float64
sex object
dtype: object
144 5 Pandas

Notice that some of these are marked object. This is usually means inefficiency
because this generalized dtype may consume an inordinate amount of memory.
Pandas comes with a simple way to assess the memory consumption of your
DataFrame,
>>> df.memory_usage(deep=True)
Index 128
species 21876
island 21704
bill_length_mm 2752
bill_depth_mm 2752
flipper_length_mm 2752
body_mass_g 2752
sex 20995
dtype: int64

Now, we have an idea of our memory consumption for this DataFrame, and we
can improve it by changing the dtypes. The categorical type we discussed previously
can be specified as a new dtype for the sex column,
>>> ef = [Link]({'sex':'category'})
>>> ef.memory_usage(deep=True)
Index 128
species 21876
island 21704
bill_length_mm 2752
bill_depth_mm 2752
flipper_length_mm 2752
body_mass_g 2752
sex 548
dtype: int64

This results in almost a 40 times reduction in memory for this, which may be
significant if the DataFrame had thousands of rows, for example. This works
because there are many more rows than distinct values in the sex column. Let us
continue using category as the dtype for the species and island columns.
>>> ef = [Link]({'sex':'category',
... 'species':'category',
... 'island':'category'})
>>> ef.memory_usage(deep=True)
Index 128
species 616
island 615
bill_length_mm 2752
bill_depth_mm 2752
flipper_length_mm 2752
body_mass_g 2752
sex 548
dtype: int64
5.8 Memory Usage and dtypes 145

To compare, we can put these side-by-side into a new DataFrame,


>>> ([Link]({'df':df.memory_usage(deep=True),
... 'ef':ef.memory_usage(deep=True)})
... .assign(ratio= lambda i:[Link]/[Link]))
df ef ratio
Index 128 128 1.000000
species 21876 616 0.028159
island 21704 615 0.028336
bill_length_mm 2752 2752 1.000000
bill_depth_mm 2752 2752 1.000000
flipper_length_mm 2752 2752 1.000000
body_mass_g 2752 2752 1.000000
sex 20995 548 0.026101

This shows a much smaller memory footprint for the columns that we changed
to categorical dtype. We can also change the numerical types from the default
float64, if we do not need that level of precision. For example, the
flipper_length_mm column is measured in millimeters and there is no fractional
part to any of the measurements. Thus, we can change that column as the following
dtype and save four times the memory,
>>> ef = [Link]({'sex':'category',
... 'species':'category',
... 'island':'category',
... 'flipper_length_mm': np.float16})
>>> ef.memory_usage(deep=True)
Index 128
species 616
island 615
bill_length_mm 2752
bill_depth_mm 2752
flipper_length_mm 688
body_mass_g 2752
sex 548
dtype: int64

Here is the summary again,


>>> ([Link]({'df':df.memory_usage(deep=True),
... 'ef':ef.memory_usage(deep=True)})
... .assign(ratio= lambda i:[Link]/[Link]))
df ef ratio
Index 128 128 1.000000
species 21876 616 0.028159
island 21704 615 0.028336
bill_length_mm 2752 2752 1.000000
bill_depth_mm 2752 2752 1.000000
flipper_length_mm 2752 688 0.250000
body_mass_g 2752 2752 1.000000
sex 20995 548 0.026101

Thus, by changing the default object dtype to other smaller dtypes can result in
significant savings and potentially speed up downstream processing for dataframes.
This is particularly true when scraping data from the web directly into dataframes
146 5 Pandas

using pd.read_html, which, even though the data looks numerical on the webpage,
will typically result in the unnecessarily heavy object dtype.

5.9 Common Operations

A common problem is how to split a string column into components. For example,
>>> df = [Link](dict(name=['Jon Doe','Jane Smith']))
>>> [Link](' ',expand=True)
0 1
0 Jon Doe
1 Jane Smith

The key step is the expand keyword argument that converts the result into
a DataFrame. The result can be assigned into the same DataFrame using the
following,
>>> df[['first','last']]=[Link](' ',expand=True)
>>> df
name first last
0 Jon Doe Jon Doe
1 Jane Smith Jane Smith

Note that failing to use the expand keyword argument results in an output list
instead of a DataFrame
>>> [Link](' ')
0 [Jon, Doe]
1 [Jane, Smith]
Name: name, dtype: object

This can be fixed by using apply on the output to convert from a list into a
Series object as shown,1
>>> [Link](' ').apply([Link])
0 1
0 Jon Doe
1 Jane Smith

The apply method is one of the most powerful and general DataFrame
methods. It operates on the individual columns (i.e., [Link]) objects of
the DataFrame. Unlike the applymap method that leaves the shape of the
DataFrame unchanged, the apply method can return objects of a different
shape. For example, doing something like [Link](lambda i:i[i>3])
on a DataFrame with numeric rows will return a small truncated NaN-filled
DataFrame. Further, the [Link](raw=True) keyword argument speeds

1 There
are many other Python string methods in the .str submodule such as rstrip, upper,
and title.
5.9 Common Operations 147

up the method by operating directly on the underlying Numpy array in the


DataFrame columns. This means that the ’apply’ method processes the Numpy
arrays directly instead of the usual ’[Link]’ objects.
The transform method is closely related to apply but must produce an
output DataFrame of the same dimensions. For example,
>>> df = [Link]({'A': [1,1,2,2], 'B': range(4)})
>>> df
A B
0 1 0
1 1 1
2 2 2
3 2 3

We can group by ’A’ and reduce using the usual aggregation,


>>> [Link]('A').sum()
B
A
1 1
2 5

But by using .transform() we can broadcast the results to their respective


places in the original DataFrame.
>>> [Link]('A').transform('sum')
B
0 1
1 1
2 5
3 5

The describe DataFrame method is useful for summarizing a given


DataFrame, as shown below,
>>> df = [Link](index=['a','b','c'], columns=['A','B','C'],
→ data = [Link](3*3).reshape(3,3))
>>> [Link]()
A B C
count 3.0 3.0 3.0
mean 3.0 4.0 5.0
std 3.0 3.0 3.0
min 0.0 1.0 2.0
25% 1.5 2.5 3.5
50% 3.0 4.0 5.0
75% 4.5 5.5 6.5
max 6.0 7.0 8.0

It is often useful to get rid of accidental duplicated entries.


>>> df = [Link]({'A': [1,1,2,2,2,3], 'B': range(6)})
>>> df.drop_duplicates('A')
A B
0 1 0
2 2 2
5 3 5

The keep keyword argument decides which of the duplicated entries to retain.
148 5 Pandas

5.10 Displaying DataFrames

Pandas has the set_option method to change the visual display of DataFrames
while not altering the corresponding data elements.
>>> pd.set_option('display.float_format','{:.2f}'.format)

Note that the argument is a callable that produces the formatted string. These
custom settings can be undone with reset_option, as in
pd.reset_option('display.float_format')

The chop option is handy for trimming excessive display precision,


>>> pd.set_option('[Link]',1e-5)

Within the Jupyter Notebook, formatting can utilize HTML elements with the
[Link] DataFrame method. For example, as shown in Fig. 5.2,
>>> from pandas_datareader import data
>>> df=[Link]("F", 'yahoo', '20200101',
→ '20200110').reset_index()

>>> ([Link](dict(Date='{:%m/%d/%Y}'))
... .hide_index()
... .highlight_min('Close',color='red')
... .highlight_max('Close',color='lightgreen')
... )
<[Link] object at 0x7f9376a22460>

formats the resulting table with the minimum closing price highlighted in red and
the maximum closing price highlighted in green. Providing these kinds of quick
visual cues is critically important for picking out key data elements. Note that the
parenthesis above is for using the newlines to separate the dot methods, which
is a style inherited from the R DataFrame. The key step is to expose the style
formatting with format() and then use its methods to style the resulting HTML

Fig. 5.2 HTML-highlighted items in DataFrame


5.10 Displaying DataFrames 149

Fig. 5.3 Custom color gradients for HTML DataFrame rendering

Fig. 5.4 Customized background barcharts for DataFrame

table in Jupyter Notebook. The following changes the color-wise visual gradient of
the Volume column as in Fig. 5.3.
>>> ([Link](dict(Date='{:%m/%d/%Y}'))
... .hide_index()
... .background_gradient(subset='Volume',cmap='Blues')
... )
<[Link] object at 0x7f9376a8a0d0>

Background barcharts can also be embedded in the table representation in Jupyter


Notebook, as in the following (see Fig. 5.4),
>>> ([Link](dict(Date='{:%m/%d/%Y}'))
... .hide_index()
... .bar('Volume',color='lightblue',align='zero')
... )
<[Link] object at 0x7f9374711f70>
150 5 Pandas

5.11 Multi-index

We encountered the Pandas MultiIndex when using groupby with multiple


columns, but these can be created separately:
>>> idx = [Link].from_product([['a','b'],[1,2,3]])
>>> data = 10*[Link](6).reshape(6,1)
>>> df = [Link](data=data,index=idx,columns=['A'])
>>> df
A
a 1 0
2 10
3 20
b 1 30
2 40
3 50

which is more compact than the following,


>>> df.reset_index()
level_0 level_1 A
0 a 1 0
1 a 2 10
2 a 3 20
3 b 1 30
4 b 2 40
5 b 3 50

But recall that we did not give the index a name when we created it, which explains
the uniformative headers, level_0 and level_1. We can swap the two levels of the
index in the DataFrame,
>>> [Link]()
A
1 a 0
2 a 10
3 a 20
1 b 30
2 b 40
3 b 50

The [Link] makes it much easier to index the DataFrame using the
loc accessor,
>>> ixs = [Link]
>>> [Link][ixs['a',:],:]
A
a 1 0
2 10
3 20
>>> [Link][ixs[:,2],:]
A
a 2 10
b 2 40
5.11 Multi-index 151

Note that there can be many more levels to a multi-index. These can also go in the
column index,
>>> rx = [Link].from_product([['a','b'],[1,2,3]])
>>> cx = [Link].from_product([['A','B','C'],[2,3]])
>>> data=[[2, 3, 9, 3, 4, 1],
... [9, 5, 9, 7, 2, 1],
... [9, 4, 4, 3, 2, 1],
... [1, 0, 4, 5, 5, 5],
... [5, 8, 1, 6, 1, 7],
... [0, 8, 9, 2, 1, 9]]
>>> df = [Link](index=rx,columns=cx,data=data)
>>> df
A B C
2 3 2 3 2 3
a 1 2 3 9 3 4 1
2 9 5 9 7 2 1
3 9 4 4 3 2 1
b 1 1 0 4 5 5 5
2 5 8 1 6 1 7
3 0 8 9 2 1 9

You can use [Link] for both the columns and rows,
>>> [Link][ixs['a',:],ixs['A',:]]=1
>>> df
A B C
2 3 2 3 2 3
a 1 1 1 9 3 4 1
2 1 1 9 7 2 1
3 1 1 4 3 2 1
b 1 1 0 4 5 5 5
2 5 8 1 6 1 7
3 0 8 9 2 1 9

It is helpful to add names to the levels of the index,


>>> [Link] = [Link].set_names(['X','Y'])
>>> df
A B C
2 3 2 3 2 3
X Y
a 1 1 1 9 3 4 1
2 1 1 9 7 2 1
3 1 1 4 3 2 1
b 1 1 0 4 5 5 5
2 5 8 1 6 1 7
3 0 8 9 2 1 9

Even with these complicated multi-indices on the rows/columns, the groupby


method still works, but with a full specification of the particular column as
(’B’,2),
>>> [Link](('B',2)).sum()
A B C
2 3 3 2 3
152 5 Pandas

(B, 2)
1 5 8 6 1 7
4 2 1 8 7 6
9 2 10 12 7 11

To understand how this works, take the column slice and examine its unique
elements. This explains the values in the resulting row index of the output.
>>> [Link][:,('B',2)].unique()
array([9, 4, 1])

Now, we have to examine the partitions that are created in the DataFrame by each
of these values such as:
>>> [Link](('B',2)).get_group(4)
A B C
2 3 2 3 2 3
X Y
a 3 1 1 4 3 2 1
b 1 1 0 4 5 5 5

and then summing over these groups produces the final output. You can also use
the apply function on the group to calculate non-scalar output. For example, to
subtract the minimum of each element in the group we can do the following,
>>> [Link](('B',2)).apply(lambda i:[Link]())
A B C
2 3 2 3 2 3
X Y
a 1 1 0 0 1 3 0
2 1 0 0 5 1 0
3 0 1 0 0 0 0
b 1 0 0 0 2 3 4
2 0 0 0 0 0 0
3 0 7 0 0 0 8

5.12 Pipes

Pandas implements method-chaining with the pipe function. Even though this
is un-Pythonic, it is easier than composing functions together that manipulate
dataframes from end-to-end.
>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C'],
... data = [Link](3*3).reshape(3,3))
>>> [Link](lambda i:i*10).pipe(lambda i:3*i)
A B C
a 0 30 60
b 90 120 150
c 180 210 240
5.13 Data Files and Databases 153

Suppose we need to find the cases for which the sum of the columns is an odd
number. We can create an intermediate throw-away variable t using assign and
then extract the corresponding section of the DataFrame as in the following,
>>> [Link](t=df.A+df.B+df.C).query('t%2==1').drop('t',axis=1)
A B C
a 0 1 2
c 6 7 8

The assign method takes either a function whose argument is the DataFrame
itself or the named DataFrame. The query method then filters the intermediate
result according to the oddness of t and the final step is to remove the t variable
that we no longer need in the output.

5.13 Data Files and Databases

Pandas has powerful I/O utilities for manipulating Excel and CSV spreadsheets.
>>> df.to_excel('this_excel.[Link]')

You will find that the given spreadsheet has the dates formatted according to Excel’s
internal date representation.
If you have PyTables installed, you can write to HDFStore. You can also
manipulate HDF5 directly from PyTables.
>>> df.to_hdf('filename.h5','keyname')

You can later read this using


>>> dg=pd.read_hdf('filename.h5','keyname')

to get your data back. You can create a SQLite database right away because SQLite
is included with Python itself.
>>> import sqlite3
>>> cnx = [Link](':memory:')
>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C'],
... data = [Link](3*3).reshape(3,3))
>>> df.to_sql('TableName',cnx)

Now, we can reload from the database using the usual relational algebra
>>> from [Link] import sql
>>> p2 = sql.read_sql_query('select * from TableName', cnx)
>>> p2
index A B C
0 a 0 1 2
1 b 3 4 5
2 c 6 7 8
154 5 Pandas

5.14 Customizing Pandas

Since Pandas 0.23, we have extensions.register_dataframe_accessor, which


allows for easy extension of Pandas Dataframes/Series without subclassing.
>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C','D'],
... data = [Link](3*4).reshape(3,4))
>>> df
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11

The following code defines a custom accessor that behaves as if it is a native


DataFrame method.
>>> @[Link].register_dataframe_accessor('custom')
... class CustomAccess:
... def __init__(self,df): # receives DataFrame
... assert 'A' in [Link] # some input validation
... assert 'B' in [Link]
... self._df = df
... @property # custom attribute
... def odds(self):
... 'drop all columns that have all even elements'
... df = self._df
... return df[df % 2==0].dropna(axis=1,how='all')
... def avg_odds(self): # custom method
... 'average only odd terms in each column'
... df = self._df
... return df[df % 2==1].mean(axis=0)
...

Now, with that established, we can use our new method prefixed with the custom
namespace, as in the following,
>>> [Link] # as attribute
A C
a 0 2
b 4 6
c 8 10
>>> [Link].avg_odds() # as method
A nan
B 5.00
C nan
D 7.00
dtype: float64

Importantly, you can use any word you want besides custom. You just have to
specify it in the decorator. The analogous register_series_accessor does the
5.15 Rolling and Filling Operations 155

same thing for Series objects and the register_index_accessor for Index
objects.2

5.15 Rolling and Filling Operations

Due to Pandas’ legacy in quantitative finance, many rolling time-series calculations


are easy to compute. Let us load some stock price data.
>>> from pandas_datareader import data
>>> df=[Link]("F", 'yahoo', '20200101',
→ '20200130').reset_index()
>>> [Link]()
Date High Low Open Close Volume Adj Close
0 2020-01-02 9.42 9.19 9.29 9.42 43425700.00 9.26
1 2020-01-03 9.37 9.15 9.31 9.21 45040800.00 9.06
2 2020-01-06 9.17 9.06 9.10 9.16 43372300.00 9.01
3 2020-01-07 9.25 9.12 9.20 9.25 44984100.00 9.10
4 2020-01-08 9.30 9.17 9.23 9.25 45994900.00 9.10

We can compute the mean over the trailing three elements,


>>> [Link](3).mean().head(5)
High Low Open Close Volume Adj Close
0 nan nan nan nan nan nan
1 nan nan nan nan nan nan
2 9.32 9.13 9.23 9.26 43946266.67 9.11
3 9.26 9.11 9.20 9.21 44465733.33 9.05
4 9.24 9.12 9.18 9.22 44783766.67 9.07

Note that we only get valid outputs for a fully filled trailing window. Whether
or not the endpoints of the window are used in the calculation is determined by
the closed keyword argument. Besides the default rectangular window, other
windows such as Blackman and Hamming are available. The [Link]()
function produces a Rolling object with methods such as apply, aggregate,
and others. Similar to rolling, exponentially weighted windowed calculations
can be computed with the ewm() method,
>>> [Link](3).mean()
High Low Open Close Volume Adj Close
0 9.42 9.19 9.29 9.42 43425700.00 9.26
1 9.39 9.17 9.30 9.30 44348614.29 9.14
2 9.30 9.12 9.21 9.24 43926424.32 9.08
3 9.28 9.12 9.21 9.24 44313231.43 9.09
4 9.29 9.14 9.22 9.25 44864456.98 9.09
5 9.29 9.15 9.24 9.25 46979043.75 9.10
6 9.31 9.18 9.25 9.25 44906738.44 9.10
7 9.30 9.16 9.25 9.25 45919910.43 9.09
8 9.31 9.17 9.24 9.26 45113266.20 9.10

2 The third-party data-cleaning module pyjanitor utilizes this approach extensively.


156 5 Pandas

9 9.30 9.18 9.25 9.24 47977202.99 9.09


10 9.30 9.17 9.24 9.22 47020077.94 9.07
11 9.28 9.16 9.23 9.21 45632324.49 9.05
12 9.27 9.14 9.21 9.21 46637216.86 9.05
13 9.26 9.15 9.21 9.20 44926124.49 9.04
14 9.24 9.09 9.19 9.18 52761475.78 9.03
15 9.21 9.06 9.17 9.14 56635156.17 8.98
16 9.14 8.99 9.10 9.07 57676520.00 8.92
17 9.11 8.96 9.06 9.05 64587200.41 8.90
18 9.07 8.93 9.01 9.00 63198880.10 8.89
19 9.01 8.88 8.96 8.96 58089908.44 8.88

We have barely scratched the surface of what Pandas is capable of and we have
completely ignored its powerful features for managing dates and times. There is
much more to learn and the online documentation and tutorials at the main Pandas
site are great for learning more.

You might also like