Python Pandas
Python Pandas
Pandas
Pandas is a powerful module that is optimized on top of Numpy and provides a set of
data structures particularly suited to time-series and spreadsheet-style data analysis
(think of pivot tables in Excel). If you are familiar with the R statistical package, then
you can think of Pandas as providing a Numpy-powered DataFrame for Python.
Pandas provides a DataFrame object (among others) built on a Numpy platform
to ease data manipulation (especially for time-series) for statistical processing.
Pandas is particularly popular in quantitative finance. Key features of Pandas include
fast data manipulation and alignment, tools for exchanging data between different
formats and between SQL databases, handling missing data, and cleaning up messy
data.
The easiest way to think about Pandas Series objects is as a container for two Numpy
arrays, one for the index and the other for the data. Recall that Numpy arrays already
have integer-indexing just like regular Python lists.
>>> import pandas as pd
>>> x = [Link]([1,2,30,0,15,6])
>>> x
0 1
1 2
2 30
3 0
4 15
5 6
dtype: int64
© The Editor(s) (if applicable) and The Author(s), under exclusive license 127
to Springer Nature Switzerland AG 2021
J. Unpingco, Python Programming for Data Analysis,
[Link]
128 5 Pandas
2 30
dtype: int64
Beware that mixed types in a single column can lead to downstream inefficiencies
and other problems. The index in the [Link] generalizes beyond integer-
indexing. For example,
>>> s = [Link]([1,2,3],index=['a','b','cat'])
>>> s['a']
1
>>> s['cat']
3
Because of its legacy as a financial data (i.e., stock prices) processing tool, Pandas
is really good at managing time-series
>>> dates = pd.date_range('20210101',periods=12)
>>> s = [Link](range(12),index=dates) # explicitly assign index
>>> s # default is calendar-daily
2021-01-01 0
2021-01-02 1
2021-01-03 2
2021-01-04 3
2021-01-05 4
2021-01-06 5
2021-01-07 6
5.1 Using Series 129
2021-01-08 7
2021-01-09 8
2021-01-10 9
2021-01-11 10
2021-01-12 11
Freq: D, dtype: int64
You can do some basic descriptive statistics on the data (not the index!) right away
>>> [Link]()
5.5
>>> [Link]()
3.605551275463989
You can also plot (see Fig. 5.1) the Series using its methods:
>>> [Link](kind='bar',alpha=0.3) # can add extra matplotlib
→ keywords
Data can be summarized by the index. For example, to count the individual days of
the week for which we had data:
>>> [Link](by=lambda i:[Link]).count()
0 2
1 2
2 1
3 1
130 5 Pandas
4 2
5 2
6 2
dtype: int64
Note that the convention 0 is Monday, 1 is Tuesday, and so forth. Thus, there was
only one Sunday (day 6) in the dataset. The groupby method partitions the data
into disjoint groups based on the predicate given via the by keyword argument.
Consider the following Series,
>>> x = [Link](range(5),index=[1,2,11,9,10])
>>> x
1 0
2 1
11 2
9 3
10 4
dtype: int64
Let us group it in the following according to whether the elements in the values are
even or odd using the modulus (%) operator,
>>> grp = [Link](lambda i:i%2) # odd or even
>>> grp.get_group(0) # even group
2 1
10 4
dtype: int64
>>> grp.get_group(1) # odd group
1 0
11 2
9 3
dtype: int64
The first line groups the elements of the Series object by whether or not the index
is even or odd. The lambda function returns 0 or 1 depending on whether or not
the corresponding index is even or odd, respectively. The next line shows the 0 (i.e.,
even) group and then the one after shows the 1 (odd) group. Now that we have
separate groups, and we can perform a wide variety of summarizations on each to
reduce it to a single value. For example, in the following, we get the maximum value
of each group:
>>> [Link]() # max in each group
0 4
1 3
dtype: int64
Note that the operation above returns another Series object with an index
corresponding to the [0,1] elements. There will be as many groups as there are
unique outputs of the by function.
5.2 Using DataFrame 131
While the Series object can be thought of as encapsulating two Numpy arrays
(index and values), the Pandas DataFrame is an encapsulation of group of
Series objects that share a single index. We can create a DataFrame with
dictionaries as in the following:
>>> df = [Link]({'col1': [1,3,11,2], 'col2': [9,23,0,2]})
>>> df
col1 col2
0 1 9
1 3 23
2 11 0
3 2 2
Note that the keys in the input dictionary are now the column headings (labels) of the
DataFrame, with each corresponding column matching the list of corresponding
values from the dictionary. Like the Series object, the DataFrame also has an
index, which is the [0,1,2,3] column on the far left. We can extract elements
from each column using the iloc, which ignore the given index and return to
traditional Numpy slicing,
>>> [Link][:2,:2] # get section
col1 col2
0 1 9
1 3 23
>>> [Link]()
col1 17
col2 34
dtype: int64
where each column was totaled. Grouping and aggregating with the DataFrame is
even more powerful than with Series. Let us construct the following DataFrame,
>>> df = [Link]({'col1': [1,1,0,0], 'col2': [1,2,3,4]})
>>> df
col1 col2
0 1 1
1 1 2
2 0 3
3 0 4
In the above DataFrame, note that the col1 column has only two distinct entries.
We can group the data using this column as in the following:
>>> grp=[Link]('col1')
>>> grp.get_group(0)
col1 col2
2 0 3
3 0 4
>>> grp.get_group(1)
col1 col2
0 1 1
1 1 2
Note that each group corresponds to entries for which col1 was either of its two
values. Now that we have grouped on col1, as with the Series object, we can also
functionally summarize each of the groups as in the following:
>>> [Link]()
col2
col1
0 7
1 3
where the sum is applied across each of the dataframes present in each group. Note
that the index of the output above is each of the values in the original col1.
The DataFrame can compute new columns based on the existing columns
using the eval method as shown below:
>>> df['sum_col']=[Link]('col1+col2')
>>> df
col1 col2 sum_col
0 1 1 2
1 1 2 3
2 0 3 3
3 0 4 4
Note that you can assign the output to a new column to the DataFrame as shown.
We can group by multiple columns as shown below:
5.2 Using DataFrame 133
This output is much more complicated than anything we have seen so far, so let us
carefully walk through it. Below the headers, the first row 2 1 1 indicates that for
sum_col=2 and for all values of col1 (namely, just the value 1), the value of col2
is 1. For the next row, the same pattern applies except that for sum_col=3, there are
now two values for col1, namely 0 and 1, which each have their corresponding two
values for the sum operation in col2. This layered display is one way to look at the
result. Note that the layers above are not uniform. Alternatively, we can unstack
this result to obtain the following tabular view of the previous result:
>>> [Link]()
col2
col1 0 1
sum_col
2 NaN 1.0
3 3.0 2.0
4 4.0 NaN
The NaN values indicate positions in the table where there is no entry. For
example, for the pair (sum_col=2,col2=0), there is no corresponding value in the
DataFrame, as you may verify by looking at the penultimate code block. There
is also no entry corresponding to the (sum_col=4,col2=1) pair. Thus, this shows
that the original presentation in the penultimate code block is the same as this one,
just without the above-mentioned missing entries indicated by NaN.
Let us continue with indexing dataframes.
>>> import numpy as np
>>> data=[Link](len(dates)*4).reshape(-1,4)
>>> df = [Link](data,index=dates,
... columns=['A','B','C','D' ])
>>> df
A B C D
2021-01-01 0 1 2 3
2021-01-02 4 5 6 7
2021-01-03 8 9 10 11
2021-01-04 12 13 14 15
2021-01-05 16 17 18 19
2021-01-06 20 21 22 23
2021-01-07 24 25 26 27
2021-01-08 28 29 30 31
2021-01-09 32 33 34 35
2021-01-10 36 37 38 39
134 5 Pandas
2021-01-11 40 41 42 43
2021-01-12 44 45 46 47
Now, you can access each of the columns by name, as in the following:
>>> df['A']
2021-01-01 0
2021-01-02 4
2021-01-03 8
2021-01-04 12
2021-01-05 16
2021-01-06 20
2021-01-07 24
2021-01-08 28
2021-01-09 32
2021-01-10 36
2021-01-11 40
2021-01-12 44
Freq: D, Name: A, dtype: int64
2021-01-10 36 37 38
2021-01-11 40 41 42
2021-01-12 44 45 46
5.3 Reindexing
A DataFrame or Series has an index that can align data using reindexing,
>>> x = [Link](range(3),index=['a','b','c'])
>>> x
a 0
b 1
c 2
dtype: int64
>>> [Link](['c','b','a','z'])
c 2.0
b 1.0
a 0.0
z NaN
dtype: float64
Note how the newly created Series object has a new index and fills in missing
items with NaN. You can fill in by other values by using the fill_value keyword
argument in reindex. It is also possible to do back-filling and forward-filling
(ffill) of values when working with ordered data as in the following:
>>> x = [Link](['a','b','c'],index=[0,5,10])
>>> x
0 a
5 b
10 c
dtype: object
>>> [Link](range(11),method='ffill')
0 a
1 a
2 a
3 a
4 a
5 b
6 b
7 b
8 b
9 b
10 c
dtype: object
More complicated interpolation methods are possible, but not directly using reindex.
Reindexing also applies to dataframes, but in either or both of the two dimensions.
>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C','D'],
... data = [Link](3*4).reshape(3,4))
136 5 Pandas
>>> df
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
Note how the missing z element has been filled in as with the prior Series object.
The same behavior applies to reindexing the columns as in the following:
>>> [Link](columns=['D','A','C','Z','B'])
D A C Z B
a 3 0 2 NaN 1
b 7 4 6 NaN 5
c 11 8 10 NaN 9
Again, we have the same filling behavior, now just applied to the columns. The same
back-filling and forward-filling works with ordered columns/indices as before.
To get rid of the data at the ’a’ index, we can use the drop method that will return
a new Series object with the specified data removed,
>>> [Link]('a')
b 1
c 2
dtype: int64
Bear in mind this is a new Series object unless we use the inplace keyword
argument or explicitly using del,
>>> del x['a']
Again, the same comments regarding using del and inplace also apply for
DataFrames.
Note that unlike regular Python indexing, both endpoints are included here when
slicing with labels. This can also be used to assign values as in the following:
>>> x['a':'c']=999
>>> x
a 999
b 999
c 999
d 3
dtype: int64
You can pick out individual columns without the colon (:).
>>> df[['A','C']]
A C
a 0 2
b 4 6
c 8 10
Mixing label-based slices with Numpy-like colon slicing is possible using loc,
138 5 Pandas
>>> [Link]['a':'b',['A','C']]
A C
a 0 2
b 4 6
The idea is that the first argument to loc indexes the rows and the second indexes
the columns. Heuristics allow for Numpy-like indexing without worrying about the
labels. You can go back to plain Numpy indexing with iloc.
>>> [Link][0,-2:]
C 2
D 3
Name: a, dtype: int64
The main thing to keep in mind when operating on one or more Series or
DataFrame objects is that the index always aligns the computation.
>>> x = [Link](range(4),index=['a','b','c','d'])
>>> y = [Link](range(3),index=['a','b','c'])
Note that because y was missing one of the indices, it was filled in with a NaN. This
behavior also applies to dataframes,
>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C','D'],
... data = [Link](3*4).reshape(3,4))
>>> ef = [Link](index=list('abcd'),
... columns=list('ABCDE'),
... data = [Link](4*5).reshape(4,5))
>>> ef
A B C D E
a 0 1 2 3 4
b 5 6 7 8 9
c 10 11 12 13 14
d 15 16 17 18 19
>>> df
A B C D
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
>>> df+ef
A B C D E
5.6 Broadcasting and Data Alignment 139
Note that the non-overlapping elements are filled in with NaN. For simple opera-
tions, you can specify the fill value using the named operation. For example, in the
last case,
>>> [Link](ef,fill_value=0)
A B C D E
a 0.0 2.0 4.0 6.0 4.0
b 9.0 11.0 13.0 15.0 9.0
c 18.0 20.0 22.0 24.0 14.0
d 15.0 16.0 17.0 18.0 19.0
When we add this Series object with the full DataFrame, we obtain the following:
>>> s + df
A B C D
a 0 2 4 6
b 4 6 8 10
c 8 10 12 14
This shows that the Series object was broadcast down the rows, aligning with the
columns in the DataFrame. Here is an example of a different Series object that is
missing some of the columns in the DataFrame,
>>> s = [Link]([1,2,3],index=['A','D','E'])
>>> s+df
A B C D E
a 1.0 NaN NaN 5.0 NaN
b 5.0 NaN NaN 9.0 NaN
c 9.0 NaN NaN 13.0 NaN
Note that the broadcasting still happens down the rows, aligning with the columns,
but fills in the missing entries with NaN.
Here is a quick Python test that uses regular expressions to test for relatively
small prime numbers,
140 5 Pandas
>>> import re
>>> pattern = r'^1?$|^(11+?)\1+$'
>>> def isprime(n):
... return ([Link](pattern, '1'*n) is None) #*
...
Now, we can find which column label has the most prime numbers in it
>>> [Link](isprime)
A B C D
a False False True True
b False True False True
c False False False True
This just scratches the surface of the kinds of fluid data analysis that are almost
automatic using Pandas.
A Y Z
a 0 1 2
b 3 4 5
c 6 7 8
The on keyword argument says to merge the two DataFrames where they have
matching corresponding entries in the A column. Note that the index was not
preserved in the merge. To make things more interesting, let us make the ef
DataFrame different by dropping one of the rows,
>>> [Link]('b',inplace=True)
>>> ef
A Y Z
a 0 1 2
c 6 7 8
Note that only the elements of A that match both DataFrames (i.e., are in the
intersection of both) are preserved. We can alter this by using the how keyword
argument.
>>> [Link](df,ef,on='A',how='left')
A B C Y Z
0 0 1 2 1.0 2.0
1 3 4 5 NaN NaN
2 6 7 8 7.0 8.0
The how=left keyword argument tells the join to keep all the keys on the left
DataFrame (df in this case) and fill in with NaN wherever missing in the right
DataFrame (ef). If ef has elements along A that are absent in df, then these would
disappear,
>>> ef = [Link](index=['a','d','c'],
... columns=['A','Y','Z'],
... data = 10*[Link](3*3).reshape(3,3))
>>> ef
A Y Z
a 0 10 20
d 30 40 50
c 60 70 80
142 5 Pandas
>>> [Link](df,ef,on='A',how='left')
A B C Y Z
0 0 1 2 10.0 20.0
1 3 4 5 NaN NaN
2 6 7 8 NaN NaN
The [Link] function takes the data in the array and puts them into the categorical
variable cats.
>>> [Link]
IntervalIndex([(0, 5], (5, 10]],
closed='right',
dtype='interval[int64]')
The half-open intervals indicate the bounds of each category. You can change the
parity of the intervals by passing the right=False keyword argument.
>>> [Link]
array([-1, 0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=int8)
The -1 above means that the 0 is not included in either of the two categories
because the interval is open on the left. You can count the number of elements in
each category as shown next,
>>> pd.value_counts(cats)
(0, 5] 5
(5, 10] 4
dtype: int64
5.8 Memory Usage and dtypes 143
Descriptive names for each category can be passed using the labels keyword
argument.
>>> cats = [Link](a,bins,labels=['one','two'])
>>> cats
[NaN, 'one', 'one', 'one', 'one', 'one', 'two', 'two', 'two',
→ 'two']
Categories (2, object): ['one' < 'two']
Note that if you pass an integer argument for bins, it will automatically split into
equal-sized categories. The qcut function is very similar except that it splits on
quartiles.
>>> a = [Link](100) # uniform random variables
>>> cats = [Link](a,4,labels=['q1','q2','q3','q4'])
>>> pd.value_counts(cats)
q4 25
q3 25
q2 25
q1 25
dtype: int64
You will find yourself processing a lot of data with Pandas. Here are some tips to do
that efficiently. First, we need the Penguins dataset from Seaborn,
>>> import seaborn as sns
>>> df = sns.load_dataset('penguins')
>>> [Link]()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
This is not a particularly big dataset, but it will suffice. Let us examine the dtypes
of the DataFrame,
>>> [Link]
species object
island object
bill_length_mm float64
bill_depth_mm float64
flipper_length_mm float64
body_mass_g float64
sex object
dtype: object
144 5 Pandas
Notice that some of these are marked object. This is usually means inefficiency
because this generalized dtype may consume an inordinate amount of memory.
Pandas comes with a simple way to assess the memory consumption of your
DataFrame,
>>> df.memory_usage(deep=True)
Index 128
species 21876
island 21704
bill_length_mm 2752
bill_depth_mm 2752
flipper_length_mm 2752
body_mass_g 2752
sex 20995
dtype: int64
Now, we have an idea of our memory consumption for this DataFrame, and we
can improve it by changing the dtypes. The categorical type we discussed previously
can be specified as a new dtype for the sex column,
>>> ef = [Link]({'sex':'category'})
>>> ef.memory_usage(deep=True)
Index 128
species 21876
island 21704
bill_length_mm 2752
bill_depth_mm 2752
flipper_length_mm 2752
body_mass_g 2752
sex 548
dtype: int64
This results in almost a 40 times reduction in memory for this, which may be
significant if the DataFrame had thousands of rows, for example. This works
because there are many more rows than distinct values in the sex column. Let us
continue using category as the dtype for the species and island columns.
>>> ef = [Link]({'sex':'category',
... 'species':'category',
... 'island':'category'})
>>> ef.memory_usage(deep=True)
Index 128
species 616
island 615
bill_length_mm 2752
bill_depth_mm 2752
flipper_length_mm 2752
body_mass_g 2752
sex 548
dtype: int64
5.8 Memory Usage and dtypes 145
This shows a much smaller memory footprint for the columns that we changed
to categorical dtype. We can also change the numerical types from the default
float64, if we do not need that level of precision. For example, the
flipper_length_mm column is measured in millimeters and there is no fractional
part to any of the measurements. Thus, we can change that column as the following
dtype and save four times the memory,
>>> ef = [Link]({'sex':'category',
... 'species':'category',
... 'island':'category',
... 'flipper_length_mm': np.float16})
>>> ef.memory_usage(deep=True)
Index 128
species 616
island 615
bill_length_mm 2752
bill_depth_mm 2752
flipper_length_mm 688
body_mass_g 2752
sex 548
dtype: int64
Thus, by changing the default object dtype to other smaller dtypes can result in
significant savings and potentially speed up downstream processing for dataframes.
This is particularly true when scraping data from the web directly into dataframes
146 5 Pandas
using pd.read_html, which, even though the data looks numerical on the webpage,
will typically result in the unnecessarily heavy object dtype.
A common problem is how to split a string column into components. For example,
>>> df = [Link](dict(name=['Jon Doe','Jane Smith']))
>>> [Link](' ',expand=True)
0 1
0 Jon Doe
1 Jane Smith
The key step is the expand keyword argument that converts the result into
a DataFrame. The result can be assigned into the same DataFrame using the
following,
>>> df[['first','last']]=[Link](' ',expand=True)
>>> df
name first last
0 Jon Doe Jon Doe
1 Jane Smith Jane Smith
Note that failing to use the expand keyword argument results in an output list
instead of a DataFrame
>>> [Link](' ')
0 [Jon, Doe]
1 [Jane, Smith]
Name: name, dtype: object
This can be fixed by using apply on the output to convert from a list into a
Series object as shown,1
>>> [Link](' ').apply([Link])
0 1
0 Jon Doe
1 Jane Smith
The apply method is one of the most powerful and general DataFrame
methods. It operates on the individual columns (i.e., [Link]) objects of
the DataFrame. Unlike the applymap method that leaves the shape of the
DataFrame unchanged, the apply method can return objects of a different
shape. For example, doing something like [Link](lambda i:i[i>3])
on a DataFrame with numeric rows will return a small truncated NaN-filled
DataFrame. Further, the [Link](raw=True) keyword argument speeds
1 There
are many other Python string methods in the .str submodule such as rstrip, upper,
and title.
5.9 Common Operations 147
The keep keyword argument decides which of the duplicated entries to retain.
148 5 Pandas
Pandas has the set_option method to change the visual display of DataFrames
while not altering the corresponding data elements.
>>> pd.set_option('display.float_format','{:.2f}'.format)
Note that the argument is a callable that produces the formatted string. These
custom settings can be undone with reset_option, as in
pd.reset_option('display.float_format')
Within the Jupyter Notebook, formatting can utilize HTML elements with the
[Link] DataFrame method. For example, as shown in Fig. 5.2,
>>> from pandas_datareader import data
>>> df=[Link]("F", 'yahoo', '20200101',
→ '20200110').reset_index()
>>> ([Link](dict(Date='{:%m/%d/%Y}'))
... .hide_index()
... .highlight_min('Close',color='red')
... .highlight_max('Close',color='lightgreen')
... )
<[Link] object at 0x7f9376a22460>
formats the resulting table with the minimum closing price highlighted in red and
the maximum closing price highlighted in green. Providing these kinds of quick
visual cues is critically important for picking out key data elements. Note that the
parenthesis above is for using the newlines to separate the dot methods, which
is a style inherited from the R DataFrame. The key step is to expose the style
formatting with format() and then use its methods to style the resulting HTML
table in Jupyter Notebook. The following changes the color-wise visual gradient of
the Volume column as in Fig. 5.3.
>>> ([Link](dict(Date='{:%m/%d/%Y}'))
... .hide_index()
... .background_gradient(subset='Volume',cmap='Blues')
... )
<[Link] object at 0x7f9376a8a0d0>
5.11 Multi-index
But recall that we did not give the index a name when we created it, which explains
the uniformative headers, level_0 and level_1. We can swap the two levels of the
index in the DataFrame,
>>> [Link]()
A
1 a 0
2 a 10
3 a 20
1 b 30
2 b 40
3 b 50
The [Link] makes it much easier to index the DataFrame using the
loc accessor,
>>> ixs = [Link]
>>> [Link][ixs['a',:],:]
A
a 1 0
2 10
3 20
>>> [Link][ixs[:,2],:]
A
a 2 10
b 2 40
5.11 Multi-index 151
Note that there can be many more levels to a multi-index. These can also go in the
column index,
>>> rx = [Link].from_product([['a','b'],[1,2,3]])
>>> cx = [Link].from_product([['A','B','C'],[2,3]])
>>> data=[[2, 3, 9, 3, 4, 1],
... [9, 5, 9, 7, 2, 1],
... [9, 4, 4, 3, 2, 1],
... [1, 0, 4, 5, 5, 5],
... [5, 8, 1, 6, 1, 7],
... [0, 8, 9, 2, 1, 9]]
>>> df = [Link](index=rx,columns=cx,data=data)
>>> df
A B C
2 3 2 3 2 3
a 1 2 3 9 3 4 1
2 9 5 9 7 2 1
3 9 4 4 3 2 1
b 1 1 0 4 5 5 5
2 5 8 1 6 1 7
3 0 8 9 2 1 9
You can use [Link] for both the columns and rows,
>>> [Link][ixs['a',:],ixs['A',:]]=1
>>> df
A B C
2 3 2 3 2 3
a 1 1 1 9 3 4 1
2 1 1 9 7 2 1
3 1 1 4 3 2 1
b 1 1 0 4 5 5 5
2 5 8 1 6 1 7
3 0 8 9 2 1 9
(B, 2)
1 5 8 6 1 7
4 2 1 8 7 6
9 2 10 12 7 11
To understand how this works, take the column slice and examine its unique
elements. This explains the values in the resulting row index of the output.
>>> [Link][:,('B',2)].unique()
array([9, 4, 1])
Now, we have to examine the partitions that are created in the DataFrame by each
of these values such as:
>>> [Link](('B',2)).get_group(4)
A B C
2 3 2 3 2 3
X Y
a 3 1 1 4 3 2 1
b 1 1 0 4 5 5 5
and then summing over these groups produces the final output. You can also use
the apply function on the group to calculate non-scalar output. For example, to
subtract the minimum of each element in the group we can do the following,
>>> [Link](('B',2)).apply(lambda i:[Link]())
A B C
2 3 2 3 2 3
X Y
a 1 1 0 0 1 3 0
2 1 0 0 5 1 0
3 0 1 0 0 0 0
b 1 0 0 0 2 3 4
2 0 0 0 0 0 0
3 0 7 0 0 0 8
5.12 Pipes
Pandas implements method-chaining with the pipe function. Even though this
is un-Pythonic, it is easier than composing functions together that manipulate
dataframes from end-to-end.
>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C'],
... data = [Link](3*3).reshape(3,3))
>>> [Link](lambda i:i*10).pipe(lambda i:3*i)
A B C
a 0 30 60
b 90 120 150
c 180 210 240
5.13 Data Files and Databases 153
Suppose we need to find the cases for which the sum of the columns is an odd
number. We can create an intermediate throw-away variable t using assign and
then extract the corresponding section of the DataFrame as in the following,
>>> [Link](t=df.A+df.B+df.C).query('t%2==1').drop('t',axis=1)
A B C
a 0 1 2
c 6 7 8
The assign method takes either a function whose argument is the DataFrame
itself or the named DataFrame. The query method then filters the intermediate
result according to the oddness of t and the final step is to remove the t variable
that we no longer need in the output.
Pandas has powerful I/O utilities for manipulating Excel and CSV spreadsheets.
>>> df.to_excel('this_excel.[Link]')
You will find that the given spreadsheet has the dates formatted according to Excel’s
internal date representation.
If you have PyTables installed, you can write to HDFStore. You can also
manipulate HDF5 directly from PyTables.
>>> df.to_hdf('filename.h5','keyname')
to get your data back. You can create a SQLite database right away because SQLite
is included with Python itself.
>>> import sqlite3
>>> cnx = [Link](':memory:')
>>> df = [Link](index=['a','b','c'],
... columns=['A','B','C'],
... data = [Link](3*3).reshape(3,3))
>>> df.to_sql('TableName',cnx)
Now, we can reload from the database using the usual relational algebra
>>> from [Link] import sql
>>> p2 = sql.read_sql_query('select * from TableName', cnx)
>>> p2
index A B C
0 a 0 1 2
1 b 3 4 5
2 c 6 7 8
154 5 Pandas
Now, with that established, we can use our new method prefixed with the custom
namespace, as in the following,
>>> [Link] # as attribute
A C
a 0 2
b 4 6
c 8 10
>>> [Link].avg_odds() # as method
A nan
B 5.00
C nan
D 7.00
dtype: float64
Importantly, you can use any word you want besides custom. You just have to
specify it in the decorator. The analogous register_series_accessor does the
5.15 Rolling and Filling Operations 155
same thing for Series objects and the register_index_accessor for Index
objects.2
Note that we only get valid outputs for a fully filled trailing window. Whether
or not the endpoints of the window are used in the calculation is determined by
the closed keyword argument. Besides the default rectangular window, other
windows such as Blackman and Hamming are available. The [Link]()
function produces a Rolling object with methods such as apply, aggregate,
and others. Similar to rolling, exponentially weighted windowed calculations
can be computed with the ewm() method,
>>> [Link](3).mean()
High Low Open Close Volume Adj Close
0 9.42 9.19 9.29 9.42 43425700.00 9.26
1 9.39 9.17 9.30 9.30 44348614.29 9.14
2 9.30 9.12 9.21 9.24 43926424.32 9.08
3 9.28 9.12 9.21 9.24 44313231.43 9.09
4 9.29 9.14 9.22 9.25 44864456.98 9.09
5 9.29 9.15 9.24 9.25 46979043.75 9.10
6 9.31 9.18 9.25 9.25 44906738.44 9.10
7 9.30 9.16 9.25 9.25 45919910.43 9.09
8 9.31 9.17 9.24 9.26 45113266.20 9.10
We have barely scratched the surface of what Pandas is capable of and we have
completely ignored its powerful features for managing dates and times. There is
much more to learn and the online documentation and tutorials at the main Pandas
site are great for learning more.