0% found this document useful (0 votes)

7 views30 pages

Python Pandas

Chapter 5 introduces Pandas, a powerful data manipulation library built on Numpy, ideal for time-series and spreadsheet-style data analysis. It covers key features such as Series and DataFrame objects, which facilitate data handling, including indexing, grouping, and basic statistical operations. The chapter also highlights how to create, manipulate, and visualize data using Pandas, making it particularly useful in quantitative finance.

Uploaded by

rohini.r

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

7 views30 pages

Python Pandas

Uploaded by

rohini.r

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Chapter 5

Pandas

Pandas is a powerful module that is optimized on top of Numpy and provides a set of
data structures particularly suited to time-series and spreadsheet-style data analysis
(think of pivot tables in Excel). If you are familiar with the R statistical package, then
you can think of Pandas as providing a Numpy-powered DataFrame for Python.
Pandas provides a DataFrame object (among others) built on a Numpy platform
to ease data manipulation (especially for time-series) for statistical processing.
Pandas is particularly popular in quantitative finance. Key features of Pandas include
fast data manipulation and alignment, tools for exchanging data between different
formats and between SQL databases, handling missing data, and cleaning up messy
data.

5.1 Using Series

The easiest way to think about Pandas Series objects is as a container for two Numpy
arrays, one for the index and the other for the data. Recall that Numpy arrays already
have integer-indexing just like regular Python lists.
>>> import pandas as pd
>>> x = [Link]([1,2,30,0,15,6])
>>> x
0 1
1 2
2 30
3 0
4 15
5 6
dtype: int64

This object can be indexed in plain Numpy style,

>>> x[1:3] # Numpy slicing
1 2

© The Editor(s) (if applicable) and The Author(s), under exclusive license 127
to Springer Nature Switzerland AG 2021
J. Unpingco, Python Programming for Data Analysis,
[Link]
128 5 Pandas

2 30
dtype: int64

We can also get the Numpy arrays directly,

>>> [Link] # values
array([ 1, 2, 30, 0, 15, 6])
>>> [Link][1:3]
array([ 2, 30])
>>> [Link] # index is Numpy array-like
RangeIndex(start=0, stop=6, step=1)

Unlike Numpy arrays, you can have mixed types,

>>> s = [Link]([1,2,'anything','more stuff'])
>>> s
0 1
1 2
2 anything
3 more stuff
dtype: object
>>> [Link] # Series index
RangeIndex(start=0, stop=4, step=1)
>>> s[0] # The usual Numpy slicing rules apply
1
>>> s[:-1]
0 1
1 2
2 anything
dtype: object
>>> [Link] # object data type
dtype('O')

Beware that mixed types in a single column can lead to downstream inefficiencies
and other problems. The index in the [Link] generalizes beyond integer-
indexing. For example,
>>> s = [Link]([1,2,3],index=['a','b','cat'])
>>> s['a']
1
>>> s['cat']
3

Because of its legacy as a financial data (i.e., stock prices) processing tool, Pandas
is really good at managing time-series
>>> dates = pd.date_range('20210101',periods=12)
>>> s = [Link](range(12),index=dates) # explicitly assign index
>>> s # default is calendar-daily
2021-01-01 0
2021-01-02 1
2021-01-03 2
2021-01-04 3
2021-01-05 4
2021-01-06 5
2021-01-07 6
5.1 Using Series 129

Fig. 5.1 Quick plot of Series object

2021-01-08 7
2021-01-09 8
2021-01-10 9
2021-01-11 10
2021-01-12 11
Freq: D, dtype: int64

You can do some basic descriptive statistics on the data (not the index!) right away
>>> [Link]()
5.5
>>> [Link]()
3.605551275463989

You can also plot (see Fig. 5.1) the Series using its methods:
>>> [Link](kind='bar',alpha=0.3) # can add extra matplotlib
→ keywords

Data can be summarized by the index. For example, to count the individual days of
the week for which we had data:
>>> [Link](by=lambda i:[Link]).count()
0 2
1 2
2 1
3 1
130 5 Pandas

4 2
5 2
6 2
dtype: int64

Note that the convention 0 is Monday, 1 is Tuesday, and so forth. Thus, there was
only one Sunday (day 6) in the dataset. The groupby method partitions the data
into disjoint groups based on the predicate given via the by keyword argument.
Consider the following Series,
>>> x = [Link](range(5),index=[1,2,11,9,10])
>>> x
1 0
2 1
11 2
9 3
10 4
dtype: int64

Let us group it in the following according to whether the elements in the values are
even or odd using the modulus (%) operator,
>>> grp = [Link](lambda i:i%2) # odd or even
>>> grp.get_group(0) # even group
2 1
10 4
dtype: int64
>>> grp.get_group(1) # odd group
1 0
11 2
9 3
dtype: int64

The first line groups the elements of the Series object by whether or not the index
is even or odd. The lambda function returns 0 or 1 depending on whether or not
the corresponding index is even or odd, respectively. The next line shows the 0 (i.e.,
even) group and then the one after shows the 1 (odd) group. Now that we have
separate groups, and we can perform a wide variety of summarizations on each to
reduce it to a single value. For example, in the following, we get the maximum value
of each group:
>>> [Link]() # max in each group
0 4
1 3
dtype: int64

Note that the operation above returns another Series object with an index
corresponding to the [0,1] elements. There will be as many groups as there are
unique outputs of the by function.
5.2 Using DataFrame 131

5.2 Using DataFrame

While the Series object can be thought of as encapsulating two Numpy arrays
(index and values), the Pandas DataFrame is an encapsulation of group of
Series objects that share a single index. We can create a DataFrame with
dictionaries as in the following:
>>> df = [Link]({'col1': [1,3,11,2], 'col2': [9,23,0,2]})
>>> df
col1 col2
0 1 9
1 3 23
2 11 0
3 2 2

Note that the keys in the input dictionary are now the column headings (labels) of the
DataFrame, with each corresponding column matching the list of corresponding
values from the dictionary. Like the Series object, the DataFrame also has an
index, which is the [0,1,2,3] column on the far left. We can extract elements
from each column using the iloc, which ignore the given index and return to
traditional Numpy slicing,
>>> [Link][:2,:2] # get section
col1 col2
0 1 9
1 3 23

or by directly slicing or by using the dot notation as shown below:

>>> df['col1'] # indexing
0 1
1 3
2 11
3 2
Name: col1, dtype: int64
>>> df.col1 # use dot notation
0 1
1 3
2 11
3 2
Name: col1, dtype: int64

Programming Tip: Spaces in Column Names

As long as the names of the columns in the DataFrame do not contain
spaces or other eval-able syntax like hyphens, you can use the dot notation
attribute-style access to the column values. You can report the column names
using [Link].

Subsequent operations on the DataFrame preserve its column-wise structure as in

the following:
132 5 Pandas

>>> [Link]()
col1 17
col2 34
dtype: int64

where each column was totaled. Grouping and aggregating with the DataFrame is
even more powerful than with Series. Let us construct the following DataFrame,
>>> df = [Link]({'col1': [1,1,0,0], 'col2': [1,2,3,4]})
>>> df
col1 col2
0 1 1
1 1 2
2 0 3
3 0 4

In the above DataFrame, note that the col1 column has only two distinct entries.
We can group the data using this column as in the following:
>>> grp=[Link]('col1')
>>> grp.get_group(0)
col1 col2
2 0 3
3 0 4
>>> grp.get_group(1)
col1 col2
0 1 1
1 1 2

Note that each group corresponds to entries for which col1 was either of its two
values. Now that we have grouped on col1, as with the Series object, we can also
functionally summarize each of the groups as in the following:
>>> [Link]()
col2
col1
0 7
1 3

where the sum is applied across each of the dataframes present in each group. Note
that the index of the output above is each of the values in the original col1.
The DataFrame can compute new columns based on the existing columns
using the eval method as shown below:
>>> df['sum_col']=[Link]('col1+col2')
>>> df
col1 col2 sum_col
0 1 1 2
1 1 2 3
2 0 3 3
3 0 4 4

Note that you can assign the output to a new column to the DataFrame as shown.
We can group by multiple columns as shown below:
5.2 Using DataFrame 133

>>> grp = [Link](['sum_col','col1'])

Doing the sum operation on each group gives the following:

>>> res = [Link]()
>>> res
col2
sum_col col1
2 1 1
3 0 3
1 2
4 0 4

This output is much more complicated than anything we have seen so far, so let us
carefully walk through it. Below the headers, the first row 2 1 1 indicates that for
sum_col=2 and for all values of col1 (namely, just the value 1), the value of col2
is 1. For the next row, the same pattern applies except that for sum_col=3, there are
now two values for col1, namely 0 and 1, which each have their corresponding two
values for the sum operation in col2. This layered display is one way to look at the
result. Note that the layers above are not uniform. Alternatively, we can unstack
this result to obtain the following tabular view of the previous result:
>>> [Link]()
col2
col1 0 1
sum_col
2 NaN 1.0
3 3.0 2.0
4 4.0 NaN

The NaN values indicate positions in the table where there is no entry. For
example, for the pair (sum_col=2,col2=0), there is no corresponding value in the
DataFrame, as you may verify by looking at the penultimate code block. There
is also no entry corresponding to the (sum_col=4,col2=1) pair. Thus, this shows
that the original presentation in the penultimate code block is the same as this one,
just without the above-mentioned missing entries indicated by NaN.
Let us continue with indexing dataframes.
>>> import numpy as np
>>> data=[Link](len(dates)*4).reshape(-1,4)
>>> df = [Link](data,index=dates,
... columns=['A','B','C','D' ])
>>> df
A B C D
2021-01-01 0 1 2 3
2021-01-02 4 5 6 7
2021-01-03 8 9 10 11
2021-01-04 12 13 14 15
2021-01-05 16 17 18 19
2021-01-06 20 21 22 23
2021-01-07 24 25 26 27
2021-01-08 28 29 30 31
2021-01-09 32 33 34 35
2021-01-10 36 37 38 39
134 5 Pandas

2021-01-11 40 41 42 43
2021-01-12 44 45 46 47

Now, you can access each of the columns by name, as in the following:
>>> df['A']
2021-01-01 0
2021-01-02 4
2021-01-03 8
2021-01-04 12
2021-01-05 16
2021-01-06 20
2021-01-07 24
2021-01-08 28
2021-01-09 32
2021-01-10 36
2021-01-11 40
2021-01-12 44
Freq: D, Name: A, dtype: int64

Or, using a quicker attribute-style notation

>>> df.A
2021-01-01 0
2021-01-02 4
2021-01-03 8
2021-01-04 12
2021-01-05 16
2021-01-06 20
2021-01-07 24
2021-01-08 28
2021-01-09 32
2021-01-10 36
2021-01-11 40
2021-01-12 44
Freq: D, Name: A, dtype: int64

Now, we can do some basic computing and indexing.

>>> [Link][:dates[3]] # unlike the Python convention, this
→ includes endpoints!
A B C D
2021-01-01 0 1 2 3
2021-01-02 4 5 6 7
2021-01-03 8 9 10 11
2021-01-04 12 13 14 15
>>> [Link][:,'A':'C'] # all rows by slice of column labels
A B C
2021-01-01 0 1 2
2021-01-02 4 5 6
2021-01-03 8 9 10
2021-01-04 12 13 14
2021-01-05 16 17 18
2021-01-06 20 21 22
2021-01-07 24 25 26
2021-01-08 28 29 30
2021-01-09 32 33 34
5.3 Reindexing 135

2021-01-10 36 37 38
2021-01-11 40 41 42
2021-01-12 44 45 46

5.3 Reindexing

A DataFrame or Series has an index that can align data using reindexing,
>>> x = [Link](range(3),index=['a','b','c'])
>>> x
a 0
b 1
c 2
dtype: int64
>>> [Link](['c','b','a','z'])
c 2.0
b 1.0
a 0.0
z NaN
dtype: float64

Note how the newly created Series object has a new index and fills in missing
items with NaN. You can fill in by other values by using the fill_value keyword
argument in reindex. It is also possible to do back-filling and forward-filling
(ffill) of values when working with ordered data as in the following:
>>> x = [Link](['a','b','c'],index=[0,5,10])
>>> x
0 a
5 b
10 c
dtype: object
>>> [Link](range(11),method='ffill')
0 a
1 a
2 a
3 a
4 a
5 b
6 b
7 b
8 b
9 b
10 c
dtype: object

More complicated interpolation methods are possible, but not directly using reindex.
Reindexing also applies to dataframes, but in either or both of the two dimensions.
>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C','D'],
... data = [Link](3*4).reshape(3,4))
136 5 Pandas

>>> df
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11

Now, we can reindex this by the index as in the following:

>>> [Link](['c','b','a','z'])
A B C D
c 8.0 9.0 10.0 11.0
b 4.0 5.0 6.0 7.0
a 0.0 1.0 2.0 3.0
z NaN NaN NaN NaN

Note how the missing z element has been filled in as with the prior Series object.
The same behavior applies to reindexing the columns as in the following:
>>> [Link](columns=['D','A','C','Z','B'])
D A C Z B
a 3 0 2 NaN 1
b 7 4 6 NaN 5
c 11 8 10 NaN 9

Again, we have the same filling behavior, now just applied to the columns. The same
back-filling and forward-filling works with ordered columns/indices as before.

5.4 Deleting Items

Returning to our previous Series object,

>>> x = [Link](range(3),index=['a','b','c'])

To get rid of the data at the ’a’ index, we can use the drop method that will return
a new Series object with the specified data removed,
>>> [Link]('a')
b 1
c 2
dtype: int64

Bear in mind this is a new Series object unless we use the inplace keyword
argument or explicitly using del,
>>> del x['a']

The same pattern applies to DataFrames.

>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C','D'],
... data = [Link](3*4).reshape(3,4))
>>> [Link]('a')
A B C D
b 4 5 6 7
c 8 9 10 11
5.5 Advanced Indexing 137

Or, along the column dimension,

>>> [Link]('A',axis=1)
B C D
a 1 2 3
b 5 6 7
c 9 10 11

Again, the same comments regarding using del and inplace also apply for
DataFrames.

5.5 Advanced Indexing

Pandas provides very powerful and fast slicing,

>>> x = [Link](range(4),index=['a','b','c','d'])
>>> x['a':'c']
a 0
b 1
c 2
dtype: int64

Note that unlike regular Python indexing, both endpoints are included here when
slicing with labels. This can also be used to assign values as in the following:
>>> x['a':'c']=999
>>> x
a 999
b 999
c 999
d 3
dtype: int64

Analogous behavior applies to DataFrames.

>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C','D'],
... data = [Link](3*4).reshape(3,4))
>>> df['a':'b']
A B C D
a 0 1 2 3
b 4 5 6 7

You can pick out individual columns without the colon (:).
>>> df[['A','C']]
A C
a 0 2
b 4 6
c 8 10

Mixing label-based slices with Numpy-like colon slicing is possible using loc,
138 5 Pandas

>>> [Link]['a':'b',['A','C']]
A C
a 0 2
b 4 6

The idea is that the first argument to loc indexes the rows and the second indexes
the columns. Heuristics allow for Numpy-like indexing without worrying about the
labels. You can go back to plain Numpy indexing with iloc.
>>> [Link][0,-2:]
C 2
D 3
Name: a, dtype: int64

5.6 Broadcasting and Data Alignment

The main thing to keep in mind when operating on one or more Series or
DataFrame objects is that the index always aligns the computation.
>>> x = [Link](range(4),index=['a','b','c','d'])
>>> y = [Link](range(3),index=['a','b','c'])

Note that y is missing one of the indices in x, so when we add them,

>>> x+y
a 0.0
b 2.0
c 4.0
d NaN
dtype: float64

Note that because y was missing one of the indices, it was filled in with a NaN. This
behavior also applies to dataframes,
>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C','D'],
... data = [Link](3*4).reshape(3,4))
>>> ef = [Link](index=list('abcd'),
... columns=list('ABCDE'),
... data = [Link](4*5).reshape(4,5))
>>> ef
A B C D E
a 0 1 2 3 4
b 5 6 7 8 9
c 10 11 12 13 14
d 15 16 17 18 19
>>> df
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
>>> df+ef
A B C D E
5.6 Broadcasting and Data Alignment 139

a 0.0 2.0 4.0 6.0 NaN

b 9.0 11.0 13.0 15.0 NaN
c 18.0 20.0 22.0 24.0 NaN
d NaN NaN NaN NaN NaN

Note that the non-overlapping elements are filled in with NaN. For simple opera-
tions, you can specify the fill value using the named operation. For example, in the
last case,
>>> [Link](ef,fill_value=0)
A B C D E
a 0.0 2.0 4.0 6.0 4.0
b 9.0 11.0 13.0 15.0 9.0
c 18.0 20.0 22.0 24.0 14.0
d 15.0 16.0 17.0 18.0 19.0

>>> s = [Link]['a'] # take first row

>>> s
A 0
B 1
C 2
D 3
Name: a, dtype: int64

When we add this Series object with the full DataFrame, we obtain the following:
>>> s + df
A B C D
a 0 2 4 6
b 4 6 8 10
c 8 10 12 14

Compare this to the original DataFrame,

>>> df
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11

This shows that the Series object was broadcast down the rows, aligning with the
columns in the DataFrame. Here is an example of a different Series object that is
missing some of the columns in the DataFrame,
>>> s = [Link]([1,2,3],index=['A','D','E'])
>>> s+df
A B C D E
a 1.0 NaN NaN 5.0 NaN
b 5.0 NaN NaN 9.0 NaN
c 9.0 NaN NaN 13.0 NaN

Note that the broadcasting still happens down the rows, aligning with the columns,
but fills in the missing entries with NaN.
Here is a quick Python test that uses regular expressions to test for relatively
small prime numbers,
140 5 Pandas

>>> import re
>>> pattern = r'^1?$|^(11+?)\1+$'
>>> def isprime(n):
... return ([Link](pattern, '1'*n) is None) #*
...

Now, we can find which column label has the most prime numbers in it
>>> [Link](isprime)
A B C D
a False False True True
b False True False True
c False False False True

The booleans are automatically cast in the sum below:

>>> [Link](isprime).sum()
A 0
B 1
C 1
D 3
dtype: int64

This just scratches the surface of the kinds of fluid data analysis that are almost
automatic using Pandas.

Programming Tip: Pandas Performance

Pandas groupby, apply, and applymap are flexible and powerful, but
Pandas has to evaluate them in the Python interpreter and not in the optimized
Pandas code, which results in significant slowdown. Thus, it is always best to
use the functions that are built into Pandas itself instead defining pure Python
functions to feed into these methods.

5.7 Categorical and Merging

Pandas supports a few relational algebraic operations like table joins.

>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C'],
... data = [Link](3*3).reshape(3,3))
>>> df
A B C
a 0 1 2
b 3 4 5
c 6 7 8
>>> ef = [Link](index=['a','b','c'],
... columns=['A','Y','Z'],
... data = [Link](3*3).reshape(3,3))
>>> ef
5.7 Categorical and Merging 141

A Y Z
a 0 1 2
b 3 4 5
c 6 7 8

The table join is implemented in the merge function.

>>> [Link](df,ef,on='A')
A B C Y Z
0 0 1 2 1 2
1 3 4 5 4 5
2 6 7 8 7 8

The on keyword argument says to merge the two DataFrames where they have
matching corresponding entries in the A column. Note that the index was not
preserved in the merge. To make things more interesting, let us make the ef
DataFrame different by dropping one of the rows,
>>> [Link]('b',inplace=True)
>>> ef
A Y Z
a 0 1 2
c 6 7 8

Now, let us try the merge again.

>>> [Link](df,ef,on='A')
A B C Y Z
0 0 1 2 1 2
1 6 7 8 7 8

Note that only the elements of A that match both DataFrames (i.e., are in the
intersection of both) are preserved. We can alter this by using the how keyword
argument.
>>> [Link](df,ef,on='A',how='left')
A B C Y Z
0 0 1 2 1.0 2.0
1 3 4 5 NaN NaN
2 6 7 8 7.0 8.0

The how=left keyword argument tells the join to keep all the keys on the left
DataFrame (df in this case) and fill in with NaN wherever missing in the right
DataFrame (ef). If ef has elements along A that are absent in df, then these would
disappear,
>>> ef = [Link](index=['a','d','c'],
... columns=['A','Y','Z'],
... data = 10*[Link](3*3).reshape(3,3))

>>> ef
A Y Z
a 0 10 20
d 30 40 50
c 60 70 80
142 5 Pandas

>>> [Link](df,ef,on='A',how='left')
A B C Y Z
0 0 1 2 10.0 20.0
1 3 4 5 NaN NaN
2 6 7 8 NaN NaN

Likewise, we can do how=right to use the right DataFrame keys,

>>> [Link](df,ef,on='A',how='right')
A B C Y Z
0 0 1.0 2.0 10 20
1 30 NaN NaN 40 50
2 60 NaN NaN 70 80

We can use how=outer to get the union of the keys,

>>> [Link](df,ef,on='A',how='outer')
A B C Y Z
0 0 1.0 2.0 10.0 20.0
1 3 4.0 5.0 NaN NaN
2 6 7.0 8.0 NaN NaN
3 30 NaN NaN 40.0 50.0
4 60 NaN NaN 70.0 80.0

Another common task is to split continuous data into discrete bins.

>>> a = [Link](10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> bins = [0,5,10]
>>> cats = [Link](a,bins)
>>> cats
[NaN, (0.0, 5.0], (0.0, 5.0], (0.0, 5.0], (0.0, 5.0], (0.0, 5.0],
(5.0, 10.0], (5.0, 10.0], (5.0, 10.0], (5.0, 10.0]]
Categories (2, interval[int64]): [(0, 5] < (5, 10]]

The [Link] function takes the data in the array and puts them into the categorical
variable cats.
>>> [Link]
IntervalIndex([(0, 5], (5, 10]],
closed='right',
dtype='interval[int64]')

The half-open intervals indicate the bounds of each category. You can change the
parity of the intervals by passing the right=False keyword argument.
>>> [Link]
array([-1, 0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=int8)

The -1 above means that the 0 is not included in either of the two categories
because the interval is open on the left. You can count the number of elements in
each category as shown next,
>>> pd.value_counts(cats)
(0, 5] 5
(5, 10] 4
dtype: int64
5.8 Memory Usage and dtypes 143

Descriptive names for each category can be passed using the labels keyword
argument.
>>> cats = [Link](a,bins,labels=['one','two'])
>>> cats
[NaN, 'one', 'one', 'one', 'one', 'one', 'two', 'two', 'two',
→ 'two']
Categories (2, object): ['one' < 'two']

Note that if you pass an integer argument for bins, it will automatically split into
equal-sized categories. The qcut function is very similar except that it splits on
quartiles.
>>> a = [Link](100) # uniform random variables
>>> cats = [Link](a,4,labels=['q1','q2','q3','q4'])
>>> pd.value_counts(cats)
q4 25
q3 25
q2 25
q1 25
dtype: int64

5.8 Memory Usage and dtypes

You will find yourself processing a lot of data with Pandas. Here are some tips to do
that efficiently. First, we need the Penguins dataset from Seaborn,
>>> import seaborn as sns
>>> df = sns.load_dataset('penguins')
>>> [Link]()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female

This is not a particularly big dataset, but it will suffice. Let us examine the dtypes
of the DataFrame,
>>> [Link]
species object
island object
bill_length_mm float64
bill_depth_mm float64
flipper_length_mm float64
body_mass_g float64
sex object
dtype: object
144 5 Pandas

Notice that some of these are marked object. This is usually means inefficiency
because this generalized dtype may consume an inordinate amount of memory.
Pandas comes with a simple way to assess the memory consumption of your
DataFrame,
>>> df.memory_usage(deep=True)
Index 128
species 21876
island 21704
bill_length_mm 2752
bill_depth_mm 2752
flipper_length_mm 2752
body_mass_g 2752
sex 20995
dtype: int64

Now, we have an idea of our memory consumption for this DataFrame, and we
can improve it by changing the dtypes. The categorical type we discussed previously
can be specified as a new dtype for the sex column,
>>> ef = [Link]({'sex':'category'})
>>> ef.memory_usage(deep=True)
Index 128
species 21876
island 21704
bill_length_mm 2752
bill_depth_mm 2752
flipper_length_mm 2752
body_mass_g 2752
sex 548
dtype: int64

This results in almost a 40 times reduction in memory for this, which may be
significant if the DataFrame had thousands of rows, for example. This works
because there are many more rows than distinct values in the sex column. Let us
continue using category as the dtype for the species and island columns.
>>> ef = [Link]({'sex':'category',
... 'species':'category',
... 'island':'category'})
>>> ef.memory_usage(deep=True)
Index 128
species 616
island 615
bill_length_mm 2752
bill_depth_mm 2752
flipper_length_mm 2752
body_mass_g 2752
sex 548
dtype: int64
5.8 Memory Usage and dtypes 145

To compare, we can put these side-by-side into a new DataFrame,

>>> ([Link]({'df':df.memory_usage(deep=True),
... 'ef':ef.memory_usage(deep=True)})
... .assign(ratio= lambda i:[Link]/[Link]))
df ef ratio
Index 128 128 1.000000
species 21876 616 0.028159
island 21704 615 0.028336
bill_length_mm 2752 2752 1.000000
bill_depth_mm 2752 2752 1.000000
flipper_length_mm 2752 2752 1.000000
body_mass_g 2752 2752 1.000000
sex 20995 548 0.026101

This shows a much smaller memory footprint for the columns that we changed
to categorical dtype. We can also change the numerical types from the default
float64, if we do not need that level of precision. For example, the
flipper_length_mm column is measured in millimeters and there is no fractional
part to any of the measurements. Thus, we can change that column as the following
dtype and save four times the memory,
>>> ef = [Link]({'sex':'category',
... 'species':'category',
... 'island':'category',
... 'flipper_length_mm': np.float16})
>>> ef.memory_usage(deep=True)
Index 128
species 616
island 615
bill_length_mm 2752
bill_depth_mm 2752
flipper_length_mm 688
body_mass_g 2752
sex 548
dtype: int64

Here is the summary again,

>>> ([Link]({'df':df.memory_usage(deep=True),
... 'ef':ef.memory_usage(deep=True)})
... .assign(ratio= lambda i:[Link]/[Link]))
df ef ratio
Index 128 128 1.000000
species 21876 616 0.028159
island 21704 615 0.028336
bill_length_mm 2752 2752 1.000000
bill_depth_mm 2752 2752 1.000000
flipper_length_mm 2752 688 0.250000
body_mass_g 2752 2752 1.000000
sex 20995 548 0.026101

Thus, by changing the default object dtype to other smaller dtypes can result in
significant savings and potentially speed up downstream processing for dataframes.
This is particularly true when scraping data from the web directly into dataframes
146 5 Pandas

using pd.read_html, which, even though the data looks numerical on the webpage,
will typically result in the unnecessarily heavy object dtype.

5.9 Common Operations

A common problem is how to split a string column into components. For example,
>>> df = [Link](dict(name=['Jon Doe','Jane Smith']))
>>> [Link](' ',expand=True)
0 1
0 Jon Doe
1 Jane Smith

The key step is the expand keyword argument that converts the result into
a DataFrame. The result can be assigned into the same DataFrame using the
following,
>>> df[['first','last']]=[Link](' ',expand=True)
>>> df
name first last
0 Jon Doe Jon Doe
1 Jane Smith Jane Smith

Note that failing to use the expand keyword argument results in an output list
instead of a DataFrame
>>> [Link](' ')
0 [Jon, Doe]
1 [Jane, Smith]
Name: name, dtype: object

This can be fixed by using apply on the output to convert from a list into a
Series object as shown,1
>>> [Link](' ').apply([Link])
0 1
0 Jon Doe
1 Jane Smith

The apply method is one of the most powerful and general DataFrame
methods. It operates on the individual columns (i.e., [Link]) objects of
the DataFrame. Unlike the applymap method that leaves the shape of the
DataFrame unchanged, the apply method can return objects of a different
shape. For example, doing something like [Link](lambda i:i[i>3])
on a DataFrame with numeric rows will return a small truncated NaN-filled
DataFrame. Further, the [Link](raw=True) keyword argument speeds

1 There
are many other Python string methods in the .str submodule such as rstrip, upper,
and title.
5.9 Common Operations 147

up the method by operating directly on the underlying Numpy array in the

DataFrame columns. This means that the ’apply’ method processes the Numpy
arrays directly instead of the usual ’[Link]’ objects.
The transform method is closely related to apply but must produce an
output DataFrame of the same dimensions. For example,
>>> df = [Link]({'A': [1,1,2,2], 'B': range(4)})
>>> df
A B
0 1 0
1 1 1
2 2 2
3 2 3

We can group by ’A’ and reduce using the usual aggregation,

>>> [Link]('A').sum()
B
A
1 1
2 5

But by using .transform() we can broadcast the results to their respective

places in the original DataFrame.
>>> [Link]('A').transform('sum')
B
0 1
1 1
2 5
3 5

The describe DataFrame method is useful for summarizing a given

DataFrame, as shown below,
>>> df = [Link](index=['a','b','c'], columns=['A','B','C'],
→ data = [Link](3*3).reshape(3,3))
>>> [Link]()
A B C
count 3.0 3.0 3.0
mean 3.0 4.0 5.0
std 3.0 3.0 3.0
min 0.0 1.0 2.0
25% 1.5 2.5 3.5
50% 3.0 4.0 5.0
75% 4.5 5.5 6.5
max 6.0 7.0 8.0

It is often useful to get rid of accidental duplicated entries.

>>> df = [Link]({'A': [1,1,2,2,2,3], 'B': range(6)})
>>> df.drop_duplicates('A')
A B
0 1 0
2 2 2
5 3 5

The keep keyword argument decides which of the duplicated entries to retain.
148 5 Pandas

5.10 Displaying DataFrames

Pandas has the set_option method to change the visual display of DataFrames
while not altering the corresponding data elements.
>>> pd.set_option('display.float_format','{:.2f}'.format)

Note that the argument is a callable that produces the formatted string. These
custom settings can be undone with reset_option, as in
pd.reset_option('display.float_format')

The chop option is handy for trimming excessive display precision,

>>> pd.set_option('[Link]',1e-5)

Within the Jupyter Notebook, formatting can utilize HTML elements with the
[Link] DataFrame method. For example, as shown in Fig. 5.2,
>>> from pandas_datareader import data
>>> df=[Link]("F", 'yahoo', '20200101',
→ '20200110').reset_index()

>>> ([Link](dict(Date='{:%m/%d/%Y}'))
... .hide_index()
... .highlight_min('Close',color='red')
... .highlight_max('Close',color='lightgreen')
... )
<[Link] object at 0x7f9376a22460>

formats the resulting table with the minimum closing price highlighted in red and
the maximum closing price highlighted in green. Providing these kinds of quick
visual cues is critically important for picking out key data elements. Note that the
parenthesis above is for using the newlines to separate the dot methods, which
is a style inherited from the R DataFrame. The key step is to expose the style
formatting with format() and then use its methods to style the resulting HTML

Fig. 5.2 HTML-highlighted items in DataFrame

5.10 Displaying DataFrames 149

Fig. 5.3 Custom color gradients for HTML DataFrame rendering

Fig. 5.4 Customized background barcharts for DataFrame

table in Jupyter Notebook. The following changes the color-wise visual gradient of
the Volume column as in Fig. 5.3.
>>> ([Link](dict(Date='{:%m/%d/%Y}'))
... .hide_index()
... .background_gradient(subset='Volume',cmap='Blues')
... )
<[Link] object at 0x7f9376a8a0d0>

Background barcharts can also be embedded in the table representation in Jupyter

Notebook, as in the following (see Fig. 5.4),
>>> ([Link](dict(Date='{:%m/%d/%Y}'))
... .hide_index()
... .bar('Volume',color='lightblue',align='zero')
... )
<[Link] object at 0x7f9374711f70>
150 5 Pandas

5.11 Multi-index

We encountered the Pandas MultiIndex when using groupby with multiple

columns, but these can be created separately:
>>> idx = [Link].from_product([['a','b'],[1,2,3]])
>>> data = 10*[Link](6).reshape(6,1)
>>> df = [Link](data=data,index=idx,columns=['A'])
>>> df
A
a 1 0
2 10
3 20
b 1 30
2 40
3 50

which is more compact than the following,

>>> df.reset_index()
level_0 level_1 A
0 a 1 0
1 a 2 10
2 a 3 20
3 b 1 30
4 b 2 40
5 b 3 50

But recall that we did not give the index a name when we created it, which explains
the uniformative headers, level_0 and level_1. We can swap the two levels of the
index in the DataFrame,
>>> [Link]()
A
1 a 0
2 a 10
3 a 20
1 b 30
2 b 40
3 b 50

The [Link] makes it much easier to index the DataFrame using the
loc accessor,
>>> ixs = [Link]
>>> [Link][ixs['a',:],:]
A
a 1 0
2 10
3 20
>>> [Link][ixs[:,2],:]
A
a 2 10
b 2 40
5.11 Multi-index 151

Note that there can be many more levels to a multi-index. These can also go in the
column index,
>>> rx = [Link].from_product([['a','b'],[1,2,3]])
>>> cx = [Link].from_product([['A','B','C'],[2,3]])
>>> data=[[2, 3, 9, 3, 4, 1],
... [9, 5, 9, 7, 2, 1],
... [9, 4, 4, 3, 2, 1],
... [1, 0, 4, 5, 5, 5],
... [5, 8, 1, 6, 1, 7],
... [0, 8, 9, 2, 1, 9]]
>>> df = [Link](index=rx,columns=cx,data=data)
>>> df
A B C
2 3 2 3 2 3
a 1 2 3 9 3 4 1
2 9 5 9 7 2 1
3 9 4 4 3 2 1
b 1 1 0 4 5 5 5
2 5 8 1 6 1 7
3 0 8 9 2 1 9

You can use [Link] for both the columns and rows,
>>> [Link][ixs['a',:],ixs['A',:]]=1
>>> df
A B C
2 3 2 3 2 3
a 1 1 1 9 3 4 1
2 1 1 9 7 2 1
3 1 1 4 3 2 1
b 1 1 0 4 5 5 5
2 5 8 1 6 1 7
3 0 8 9 2 1 9

It is helpful to add names to the levels of the index,

>>> [Link] = [Link].set_names(['X','Y'])
>>> df
A B C
2 3 2 3 2 3
X Y
a 1 1 1 9 3 4 1
2 1 1 9 7 2 1
3 1 1 4 3 2 1
b 1 1 0 4 5 5 5
2 5 8 1 6 1 7
3 0 8 9 2 1 9

Even with these complicated multi-indices on the rows/columns, the groupby

method still works, but with a full specification of the particular column as
(’B’,2),
>>> [Link](('B',2)).sum()
A B C
2 3 3 2 3
152 5 Pandas

(B, 2)
1 5 8 6 1 7
4 2 1 8 7 6
9 2 10 12 7 11

To understand how this works, take the column slice and examine its unique
elements. This explains the values in the resulting row index of the output.
>>> [Link][:,('B',2)].unique()
array([9, 4, 1])

Now, we have to examine the partitions that are created in the DataFrame by each
of these values such as:
>>> [Link](('B',2)).get_group(4)
A B C
2 3 2 3 2 3
X Y
a 3 1 1 4 3 2 1
b 1 1 0 4 5 5 5

and then summing over these groups produces the final output. You can also use
the apply function on the group to calculate non-scalar output. For example, to
subtract the minimum of each element in the group we can do the following,
>>> [Link](('B',2)).apply(lambda i:[Link]())
A B C
2 3 2 3 2 3
X Y
a 1 1 0 0 1 3 0
2 1 0 0 5 1 0
3 0 1 0 0 0 0
b 1 0 0 0 2 3 4
2 0 0 0 0 0 0
3 0 7 0 0 0 8

5.12 Pipes

Pandas implements method-chaining with the pipe function. Even though this
is un-Pythonic, it is easier than composing functions together that manipulate
dataframes from end-to-end.
>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C'],
... data = [Link](3*3).reshape(3,3))
>>> [Link](lambda i:i*10).pipe(lambda i:3*i)
A B C
a 0 30 60
b 90 120 150
c 180 210 240
5.13 Data Files and Databases 153

Suppose we need to find the cases for which the sum of the columns is an odd
number. We can create an intermediate throw-away variable t using assign and
then extract the corresponding section of the DataFrame as in the following,
>>> [Link](t=df.A+df.B+df.C).query('t%2==1').drop('t',axis=1)
A B C
a 0 1 2
c 6 7 8

The assign method takes either a function whose argument is the DataFrame
itself or the named DataFrame. The query method then filters the intermediate
result according to the oddness of t and the final step is to remove the t variable
that we no longer need in the output.

5.13 Data Files and Databases

Pandas has powerful I/O utilities for manipulating Excel and CSV spreadsheets.
>>> df.to_excel('this_excel.[Link]')

You will find that the given spreadsheet has the dates formatted according to Excel’s
internal date representation.
If you have PyTables installed, you can write to HDFStore. You can also
manipulate HDF5 directly from PyTables.
>>> df.to_hdf('filename.h5','keyname')

You can later read this using

>>> dg=pd.read_hdf('filename.h5','keyname')

to get your data back. You can create a SQLite database right away because SQLite
is included with Python itself.
>>> import sqlite3
>>> cnx = [Link](':memory:')
>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C'],
... data = [Link](3*3).reshape(3,3))
>>> df.to_sql('TableName',cnx)

Now, we can reload from the database using the usual relational algebra
>>> from [Link] import sql
>>> p2 = sql.read_sql_query('select * from TableName', cnx)
>>> p2
index A B C
0 a 0 1 2
1 b 3 4 5
2 c 6 7 8
154 5 Pandas

5.14 Customizing Pandas

Since Pandas 0.23, we have extensions.register_dataframe_accessor, which

allows for easy extension of Pandas Dataframes/Series without subclassing.
>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C','D'],
... data = [Link](3*4).reshape(3,4))
>>> df
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11

The following code defines a custom accessor that behaves as if it is a native

DataFrame method.
>>> @[Link].register_dataframe_accessor('custom')
... class CustomAccess:
... def __init__(self,df): # receives DataFrame
... assert 'A' in [Link] # some input validation
... assert 'B' in [Link]
... self._df = df
... @property # custom attribute
... def odds(self):
... 'drop all columns that have all even elements'
... df = self._df
... return df[df % 2==0].dropna(axis=1,how='all')
... def avg_odds(self): # custom method
... 'average only odd terms in each column'
... df = self._df
... return df[df % 2==1].mean(axis=0)
...

Now, with that established, we can use our new method prefixed with the custom
namespace, as in the following,
>>> [Link] # as attribute
A C
a 0 2
b 4 6
c 8 10
>>> [Link].avg_odds() # as method
A nan
B 5.00
C nan
D 7.00
dtype: float64

Importantly, you can use any word you want besides custom. You just have to
specify it in the decorator. The analogous register_series_accessor does the
5.15 Rolling and Filling Operations 155

same thing for Series objects and the register_index_accessor for Index
objects.2

5.15 Rolling and Filling Operations

Due to Pandas’ legacy in quantitative finance, many rolling time-series calculations

are easy to compute. Let us load some stock price data.
>>> from pandas_datareader import data
>>> df=[Link]("F", 'yahoo', '20200101',
→ '20200130').reset_index()
>>> [Link]()
Date High Low Open Close Volume Adj Close
0 2020-01-02 9.42 9.19 9.29 9.42 43425700.00 9.26
1 2020-01-03 9.37 9.15 9.31 9.21 45040800.00 9.06
2 2020-01-06 9.17 9.06 9.10 9.16 43372300.00 9.01
3 2020-01-07 9.25 9.12 9.20 9.25 44984100.00 9.10
4 2020-01-08 9.30 9.17 9.23 9.25 45994900.00 9.10

We can compute the mean over the trailing three elements,

>>> [Link](3).mean().head(5)
High Low Open Close Volume Adj Close
0 nan nan nan nan nan nan
1 nan nan nan nan nan nan
2 9.32 9.13 9.23 9.26 43946266.67 9.11
3 9.26 9.11 9.20 9.21 44465733.33 9.05
4 9.24 9.12 9.18 9.22 44783766.67 9.07

Note that we only get valid outputs for a fully filled trailing window. Whether
or not the endpoints of the window are used in the calculation is determined by
the closed keyword argument. Besides the default rectangular window, other
windows such as Blackman and Hamming are available. The [Link]()
function produces a Rolling object with methods such as apply, aggregate,
and others. Similar to rolling, exponentially weighted windowed calculations
can be computed with the ewm() method,
>>> [Link](3).mean()
High Low Open Close Volume Adj Close
0 9.42 9.19 9.29 9.42 43425700.00 9.26
1 9.39 9.17 9.30 9.30 44348614.29 9.14
2 9.30 9.12 9.21 9.24 43926424.32 9.08
3 9.28 9.12 9.21 9.24 44313231.43 9.09
4 9.29 9.14 9.22 9.25 44864456.98 9.09
5 9.29 9.15 9.24 9.25 46979043.75 9.10
6 9.31 9.18 9.25 9.25 44906738.44 9.10
7 9.30 9.16 9.25 9.25 45919910.43 9.09
8 9.31 9.17 9.24 9.26 45113266.20 9.10

2 The third-party data-cleaning module pyjanitor utilizes this approach extensively.

156 5 Pandas

9 9.30 9.18 9.25 9.24 47977202.99 9.09

10 9.30 9.17 9.24 9.22 47020077.94 9.07
11 9.28 9.16 9.23 9.21 45632324.49 9.05
12 9.27 9.14 9.21 9.21 46637216.86 9.05
13 9.26 9.15 9.21 9.20 44926124.49 9.04
14 9.24 9.09 9.19 9.18 52761475.78 9.03
15 9.21 9.06 9.17 9.14 56635156.17 8.98
16 9.14 8.99 9.10 9.07 57676520.00 8.92
17 9.11 8.96 9.06 9.05 64587200.41 8.90
18 9.07 8.93 9.01 9.00 63198880.10 8.89
19 9.01 8.88 8.96 8.96 58089908.44 8.88

We have barely scratched the surface of what Pandas is capable of and we have
completely ignored its powerful features for managing dates and times. There is
much more to learn and the online documentation and tutorials at the main Pandas
site are great for learning more.

Pandas DataFrame Cheat Sheet
100% (2)
Pandas DataFrame Cheat Sheet
12 pages
Pandas Library Overview and Setup Guide
No ratings yet
Pandas Library Overview and Setup Guide
9 pages
Pandas Cheatsheets 1.0.6 Web Binder PDF
No ratings yet
Pandas Cheatsheets 1.0.6 Web Binder PDF
8 pages
Pandas DataFrame Cheat Sheet
67% (3)
Pandas DataFrame Cheat Sheet
13 pages
Pandas Summarized Visually in 8
100% (2)
Pandas Summarized Visually in 8
8 pages
Data Wrangling with Python and Pandas
No ratings yet
Data Wrangling with Python and Pandas
7 pages
Pandas Worksheets ALL
100% (1)
Pandas Worksheets ALL
8 pages
Understanding Pandas DataFrames and Series
No ratings yet
Understanding Pandas DataFrames and Series
13 pages
Data Analysis with Pandas: Series & DataFrames
No ratings yet
Data Analysis with Pandas: Series & DataFrames
14 pages
Data Wrangling with Pandas in Python
No ratings yet
Data Wrangling with Pandas in Python
15 pages
Pandas DataFrame Cheat Sheet
100% (1)
Pandas DataFrame Cheat Sheet
10 pages
Pandas DataFrame Cheat Sheet
No ratings yet
Pandas DataFrame Cheat Sheet
12 pages
Pandas DataFrame Cheat Sheet
100% (1)
Pandas DataFrame Cheat Sheet
10 pages
Pandas DataFrame Cheat Sheet
No ratings yet
Pandas DataFrame Cheat Sheet
10 pages
Intro To Pandas - Ipynb Colab
No ratings yet
Intro To Pandas - Ipynb Colab
8 pages
Pandas Lecture
No ratings yet
Pandas Lecture
4 pages
Pandas Data Structures and Visualization
No ratings yet
Pandas Data Structures and Visualization
11 pages
Pandas DataFrame Cheatsheet
No ratings yet
Pandas DataFrame Cheatsheet
12 pages
Introduction to Pandas Data Structures
No ratings yet
Introduction to Pandas Data Structures
27 pages
Data Aggregation with Pandas GroupBy
No ratings yet
Data Aggregation with Pandas GroupBy
34 pages
Advanced Pandas Data Analysis Techniques
No ratings yet
Advanced Pandas Data Analysis Techniques
33 pages
Real PDFF
No ratings yet
Real PDFF
21 pages
Understanding Pandas Series Basics
No ratings yet
Understanding Pandas Series Basics
16 pages
Pandas Study Notes
No ratings yet
Pandas Study Notes
9 pages
Pandas Series and DataFrame Basics
No ratings yet
Pandas Series and DataFrame Basics
49 pages
Pandas DataFrame Cheat Sheet
No ratings yet
Pandas DataFrame Cheat Sheet
11 pages
Introduction to Pandas for Data Analysis
No ratings yet
Introduction to Pandas for Data Analysis
9 pages
Introduction to Pandas for Data Analysis
No ratings yet
Introduction to Pandas for Data Analysis
16 pages
Pandas Cheatsheet
No ratings yet
Pandas Cheatsheet
12 pages
Essential Guide to Pandas in Python
No ratings yet
Essential Guide to Pandas in Python
28 pages
Writing References in Practical Files
No ratings yet
Writing References in Practical Files
98 pages
Pandas Data Analysis: A Comprehensive Guide
No ratings yet
Pandas Data Analysis: A Comprehensive Guide
13 pages
Introduction to Python Libraries
No ratings yet
Introduction to Python Libraries
12 pages
Master Pandas: Data Analysis with Python
No ratings yet
Master Pandas: Data Analysis with Python
19 pages
Pandas Data Analysis Guide
No ratings yet
Pandas Data Analysis Guide
66 pages
Pandas Data Structures and Features Guide
No ratings yet
Pandas Data Structures and Features Guide
32 pages
Introduction to Pandas for Data Analysis
No ratings yet
Introduction to Pandas for Data Analysis
16 pages
Introduction to Python Pandas Library
No ratings yet
Introduction to Python Pandas Library
28 pages
Pandas Data Manipulation Cheat Sheet
No ratings yet
Pandas Data Manipulation Cheat Sheet
9 pages
Pandas DataFrame Basics in Python
100% (1)
Pandas DataFrame Basics in Python
9 pages
Pandas DataFrame Manipulation Guide
No ratings yet
Pandas DataFrame Manipulation Guide
25 pages
Python Data Science Cheat Sheet
97% (33)
Python Data Science Cheat Sheet
11 pages
Pandas Data Manipulation Techniques
No ratings yet
Pandas Data Manipulation Techniques
21 pages
Python Notes Overall
No ratings yet
Python Notes Overall
41 pages
GroupBy Operations in Pandas DataFrame
No ratings yet
GroupBy Operations in Pandas DataFrame
63 pages
Python Libraries for Data Science
No ratings yet
Python Libraries for Data Science
27 pages
1761900500-2. Data Analysis With Python
No ratings yet
1761900500-2. Data Analysis With Python
101 pages
Pandas DataFrame and Series Guide
No ratings yet
Pandas DataFrame and Series Guide
47 pages
02 Pandas Notes
No ratings yet
02 Pandas Notes
16 pages
Introduction to Pandas Basics
No ratings yet
Introduction to Pandas Basics
44 pages
M Ds Unit II
No ratings yet
M Ds Unit II
25 pages
Pandas Notes
No ratings yet
Pandas Notes
16 pages
Notes IP Python Class 12
No ratings yet
Notes IP Python Class 12
14 pages
12th Short Notes Total
No ratings yet
12th Short Notes Total
3 pages
Pandas Assignment: Data Structures Guide
No ratings yet
Pandas Assignment: Data Structures Guide
9 pages
Snapmaker 2.0 A350 Comprehensive Review
No ratings yet
Snapmaker 2.0 A350 Comprehensive Review
10 pages
Union of India Digital Driving License
No ratings yet
Union of India Digital Driving License
1 page
Guide to Using FakeYou TTS
No ratings yet
Guide to Using FakeYou TTS
1 page
Welcome To MacBook Air Getting Started Guide - Apple Support
No ratings yet
Welcome To MacBook Air Getting Started Guide - Apple Support
3 pages
Infosys Internship Opportunities 2025
No ratings yet
Infosys Internship Opportunities 2025
2 pages
Essential SQL Queries for Database Management
No ratings yet
Essential SQL Queries for Database Management
74 pages
8051 Microcontroller Architecture Overview
No ratings yet
8051 Microcontroller Architecture Overview
8 pages
Blood Bank & Donor Management System
No ratings yet
Blood Bank & Donor Management System
32 pages
Python Algorithms: BFS, Water Jug, Punctuation Removal
No ratings yet
Python Algorithms: BFS, Water Jug, Punctuation Removal
6 pages
Construction Schedule Weekly
No ratings yet
Construction Schedule Weekly
5 pages
Information Systems Security Quiz
No ratings yet
Information Systems Security Quiz
20 pages
C++ Face Detection Project Overview
No ratings yet
C++ Face Detection Project Overview
11 pages
Dell Unity XT Hybrid Flash Arrays Overview
No ratings yet
Dell Unity XT Hybrid Flash Arrays Overview
2 pages
Quick Start: Farther - Faster.Clearer
No ratings yet
Quick Start: Farther - Faster.Clearer
8 pages
Wireshark Guide for RTI Data Service
No ratings yet
Wireshark Guide for RTI Data Service
42 pages
Smart Campus & Community Engagement Solutions
No ratings yet
Smart Campus & Community Engagement Solutions
7 pages
ESSS Software: Security Testing Methods
No ratings yet
ESSS Software: Security Testing Methods
30 pages
Generative AI's Impact on OSS Development
No ratings yet
Generative AI's Impact on OSS Development
75 pages
Power BI Basics: Data Modeling & Visualization
No ratings yet
Power BI Basics: Data Modeling & Visualization
6 pages
Microsoft Volume Licensing Keys Summary
No ratings yet
Microsoft Volume Licensing Keys Summary
2 pages
Decomposition in Problem Solving Guide
No ratings yet
Decomposition in Problem Solving Guide
44 pages
Motivational Letter for MSc in Software Engineering
No ratings yet
Motivational Letter for MSc in Software Engineering
2 pages
1ez S4hana2022 BPD en MX
No ratings yet
1ez S4hana2022 BPD en MX
36 pages
Intrusion Detection in Computer Security
No ratings yet
Intrusion Detection in Computer Security
43 pages
Word PG
No ratings yet
Word PG
11 pages
Impact of Technology on Work Methods
No ratings yet
Impact of Technology on Work Methods
8 pages
Ren'Py 7.5.3 Android Log Analysis
No ratings yet
Ren'Py 7.5.3 Android Log Analysis
2 pages
IT Periodic Test I for Std X - 2024-25
No ratings yet
IT Periodic Test I for Std X - 2024-25
2 pages
Troubleshooting and Maintains of PC 08
No ratings yet
Troubleshooting and Maintains of PC 08
9 pages
Overview of Fountain Codes
No ratings yet
Overview of Fountain Codes
7 pages

Python Pandas

Uploaded by

Python Pandas

Uploaded by

Chapter 5

5.1 Using Series

This object can be indexed in plain Numpy style,

We can also get the Numpy arrays directly,

Unlike Numpy arrays, you can have mixed types,

Fig. 5.1 Quick plot of Series object

5.2 Using DataFrame

or by directly slicing or by using the dot notation as shown below:

Programming Tip: Spaces in Column Names

Subsequent operations on the DataFrame preserve its column-wise structure as in

>>> grp = [Link](['sum_col','col1'])

Doing the sum operation on each group gives the following:

Or, using a quicker attribute-style notation

Now, we can do some basic computing and indexing.

Now, we can reindex this by the index as in the following:

5.4 Deleting Items

Returning to our previous Series object,

The same pattern applies to DataFrames.

Or, along the column dimension,

5.5 Advanced Indexing

Pandas provides very powerful and fast slicing,

Analogous behavior applies to DataFrames.

5.6 Broadcasting and Data Alignment

Note that y is missing one of the indices in x, so when we add them,

a 0.0 2.0 4.0 6.0 NaN

>>> s = [Link]['a'] # take first row

Compare this to the original DataFrame,

The booleans are automatically cast in the sum below:

Programming Tip: Pandas Performance

5.7 Categorical and Merging

Pandas supports a few relational algebraic operations like table joins.

The table join is implemented in the merge function.

Now, let us try the merge again.

Likewise, we can do how=right to use the right DataFrame keys,

We can use how=outer to get the union of the keys,

Another common task is to split continuous data into discrete bins.

5.8 Memory Usage and dtypes

To compare, we can put these side-by-side into a new DataFrame,

Here is the summary again,

5.9 Common Operations

up the method by operating directly on the underlying Numpy array in the

We can group by ’A’ and reduce using the usual aggregation,

But by using .transform() we can broadcast the results to their respective

The describe DataFrame method is useful for summarizing a given

It is often useful to get rid of accidental duplicated entries.

5.10 Displaying DataFrames

The chop option is handy for trimming excessive display precision,

Fig. 5.2 HTML-highlighted items in DataFrame

Fig. 5.3 Custom color gradients for HTML DataFrame rendering

Fig. 5.4 Customized background barcharts for DataFrame

Background barcharts can also be embedded in the table representation in Jupyter

We encountered the Pandas MultiIndex when using groupby with multiple

which is more compact than the following,

It is helpful to add names to the levels of the index,

Even with these complicated multi-indices on the rows/columns, the groupby

5.13 Data Files and Databases

You can later read this using

5.14 Customizing Pandas

Since Pandas 0.23, we have extensions.register_dataframe_accessor, which

The following code defines a custom accessor that behaves as if it is a native

5.15 Rolling and Filling Operations

Due to Pandas’ legacy in quantitative finance, many rolling time-series calculations

We can compute the mean over the trailing three elements,

2 The third-party data-cleaning module pyjanitor utilizes this approach extensively.

9 9.30 9.18 9.25 9.24 47977202.99 9.09

You might also like