Python for Scientific Computing in Chemistry
Charles J. Weiss
3.4 Multifigure Plots
3.5 3D Scatter Plots
3.6 Surface & Wireframe Plots
3.7 3D Data on a 2D Surface
Further Reading
Exercises
9.1 Deterministic Simulations
9.2 Stochastic Simulations
Further Reading
Exercises
Chapter 12: Nuclear Magnetic Resonance with nmrglue & nmrsim
12.1 NMR Processing with nmrglue
12.2 Simulating NMR with nmrsim
Further Reading
Exercises
Chapter 16: Bioinformatics with Biopython & Nglview
16.1 Working with Sequences
16.2 Structural Information
16.3 Visualization of Molecules
Further Reading
Index
Scientific Computing for Chemists with Python
Scientific computing utilizes computers to aid in scientific tasks such as data processing and digital simulations, among
others. The well-developed field of computational chemistry is part of scientific computing and focuses on utilizing
computing to simulate chemical phenomena and calculate properties. However, there is less focus in the field of chemistry
on the data processing side of computing, so this book strives to fill this void by introducing the reader to tools and
methods for processing, visualizing, and analyzing chemical data. This book serves as an introduction to coding for
chemists. The tools employed in this book are the powerful and popular combination of Jupyter notebooks and the
Python programming language. No background beyond first-year college chemistry and occasionally some very basic
spectroscopy (for advanced chapters) is assumed for most of this book. This book starts with a brief primer on Jupyter
notebooks in chapter 0 and computer programming with Python in chapters 1 and 2. If you already have a background in
these tools, feel free to skip ahead. The rest of the book dives into applications of Python to solving chemical problems.
Python and Jupyter were chosen for a variety of reasons, including that they are:
• Relatively easy to use and learn
• Powerful and well-suited for solving chemical problems
• Free, open-source software
• Cross-platform (e.g., runs on Windows, macOS, and Linux)
• Supplemented with numerous, specialized libraries for handling specific types of data or problems (e.g., machine
learning)
• Supported by a helpful and welcoming community
Learning to use a number of popular Python scientific libraries to solve chemical problems is one of the themes of this
book. A Python library can be thought of as a tool pack with premade functions for performing common tasks in scientific
data processing, analysis, and visualization. For example, the matplotlib library provides a variety of functions for creating
a wide range of plots, while the scikit-learn library contains functions and resources for machine learning.
License
This book is copyright © 2017-2025 Charles J. Weiss and is released under the CC BY-NC-SA 4.0 license. All files with
the book are also copyright and released under the CC BY-NC-SA 4.0 license unless otherwise noted (see [Link]
This book has both a PDF and a web version, each with different advantages. The web version is recommended
because it contains all the interactive features and is updated more regularly. The web version and book files are available
on GitHub.
Organization of Book
This book is organized in order of more fundamental topics first, but not every chapter is a prerequisite for all subsequent
chapters. Chapter 0 provides a quick introduction to Jupyter notebooks, and chapters 1-2 provide background on the
Python programming language. Anyone who already knows Python can skim or skip past these two chapters. Chapter
3 introduces plotting and visualization, and chapter 4 covers the NumPy library. Both chapters 3 and 4 are used heavily
in this book and should not be bypassed. Chapter 5 covers the pandas library, which is used in some, but not all, subsequent
chapters. This library adds functionality and extra ease-of-use to NumPy. Anyone looking to streamline their
schedule could skip this chapter, but be aware that it is heavily utilized in chapters 10, 11, and 13. However, chapters
10 and 13 should be largely readable by someone who is not familiar with pandas or who has at least read sections 5.1-5.2.
Chapters beyond chapter 5 are mostly applications, advanced topics, or cover libraries for very specific applications such
as image processing, machine learning, bioinformatics, or optimization. Chapters 6-17 are designed to be mostly modular,
so after getting through chapters 0-5, these subsequent chapters can be covered in any order depending on the reader’s
needs and interests. This book also has a few appendices that contain interesting topics, such as controlling your code with
widgets or visualizing atomic orbitals, that do not fit well into any of the chapters but are still worth checking out.
Chapter Number    Description
Chapter 0         Short introduction to installing and using Jupyter notebooks
Chapter 1         Core Python programming skills
Chapter 2         Intermediate Python programming skills; this chapter contains many useful topics but may be skipped over and returned to as needed for the impatient reader
Chapter 3         Matplotlib plotting library for visualization of data and results
Chapter 4         NumPy library which is the foundation of much of the scientific Python ecosystem
Chapter 5         Pandas data analysis library
Chapter 6         Basic signal processing in Python including finding peaks, smoothing data, and fitting/interpolation among other topics
Chapter 7         Image processing using the NumPy and scikit-image libraries
Chapter 8         Symbolic math and other more advanced mathematics in Python
Chapter 9         Simulating physical and chemical processes in Python
Chapter 10        Seaborn plotting library
Chapter 11        Interactive plotting with Altair
Chapter 12        NMR processing and simulations with nmrglue and nmrsim
Chapter 13        Machine learning using the scikit-learn library
Chapter 14        Using functions from the [Link] module to perform minimization, curve fitting, and root finding
Chapter 15        Cheminformatics with RDKit
Chapter 16        Bioinformatics with Biopython and nglview
Chapter 17        Writing Python scripts using Spyder and running them from the command line
Appendix 0        IPython widgets for interactive notebooks
Appendix 1        Remote requests for accessing online databases
Appendix 2        Visualizing atomic orbitals
Appendix 3        Uncertainty propagation made easier
Appendix 4        Regular Expressions
One of the goals of this book is to provide a streamlined introduction to Python and its scientific libraries in order to allow
the reader to start applying these new skills to chemistry as quickly as possible. As a result, not all topics covered in a
typical computer science course on Python are included here. Instead, the most relevant topics to chemistry are covered
along with a selection of scientific libraries not likely taught in most Python courses. Another difference between this
book and a typical computer science course on Python is that many computer science courses would have students write
and save code as text files and run them from the command line. In contrast, this book assumes that the reader is running
his or her code in a Jupyter notebook, as described in chapter 0, which is an ideal environment for scientific data analysis.
The Jupyter notebook provides immediate feedback to the user, convenient graphical outputs, is shareable, and is simpler
to use than running Python scripts from the command line. For those students who wish to continue on to run Python
scripts from the command line, chapter 17 provides a brief introduction to this process. In an effort to make this text
usable in a wide range of courses, there is little in-depth analysis of the data. This book instead focuses more on how to
work with the data and leaves the chemical analysis to the individual instructors.
Any data file(s) referred to in the chapters or end-of-chapter exercises can be found in the data folder in the same directory
as the chapter’s Jupyter notebook. Alternatively, you can download a zip file of the data for this chapter from here by
selecting the appropriate chapter file and then clicking the Download button. The latter option is recommended for those
who do not use Git or GitHub.
Exercise Answers
Copies of exercise answer keys are available for instructors upon request. To obtain copies, please email the author using
your school email address.
While great efforts have gone into ensuring that all the code in this book works as prescribed and all text and code are
free of errors, some errors could exist. Additionally, some examples in this book are simplified for pedagogical reasons
and may not be appropriate for research and other applications. It is the responsibility of the reader to check that their
code is free of errors, behaves as required, and that the methods are appropriate for their applications.
The code in this version of the book has been most recently tested with the following software versions unless otherwise
noted but will likely work with other versions.
• Python – 3.12.7
• JupyterLab - 4.4.4
• NumPy – 2.2.6
• SciPy – 1.16.0
• Pandas – 2.3.0
• Matplotlib – 3.10.3
• Seaborn – 0.13.2
• Altair - 5.5.0
• Scikit-image – 0.25.2
• Scikit-learn – 1.7.0
• Sympy - 1.14.0
• nmrglue – 0.11
• nmrsim - 0.6
• Spyder – 5.4.2
• Biopython - 1.85
• Nglview - 3.1.2
• RDKit - 2025.3.3
• Pybaselines - 1.2.0
• Requests - 2.32.4
• IPywidgets - 8.1.7
• Uncertainties - 3.2.3
Acknowledgments
Writing this book took substantial time, as did developing the curriculum behind it. Thank you to
those who supported and encouraged me along the way. Finally, thank you to the following people for proofreading or
reporting errors. Reports of additional errors are welcome on GitHub or in an email. Please do not email files to the
author but rather include your error report in the body of the email.
• Wesley A. Deutscher for helping collect some example data
• M. Roarke Tollar for providing feedback and reporting typos in chapters 0 and 1
• Andrew Klose for providing feedback, reporting typos in chapter 12, and suggesting an idea for an exercise
• Harrison Kuhn for identifying a code error in chapter 11
• nzjakemartin (GitHub handle) for identifying a typo in the weighted average equation
• Paul A. Craig for identifying a typo in the Python code
• Patrick Coppock for providing feedback
• Zachary M. Schulte for providing feedback and reporting typos in chapter 14
• Yuthana Tantirungrotechai for reporting errors in chapter 6
• @Avalanchian for reporting an error in the chapter 10 data/elements_data.csv file
• Matthew A. Kubasik for reporting errors in chapters 6 and 8
• Geoffrey M. Sametz for help with nmrsim
• Ryan Schulte for reporting typos
• Robert Belford for reporting typos
• Filippo Muzzini for reporting a missing data file in chapter 6
• jaredchis (GitHub handle) for reporting a code typo in chapter 3
Part I
CHAPTER 0: PYTHON & JUPYTER NOTEBOOKS
0.1 Python
Python is a popular programming language available on all major computer platforms including macOS, Linux, and
Windows. It is a scripting language which means that the moment the user presses the Return key or Run, the Python
software interprets and runs the code. This is in contrast to a compiled language like C where the code must first be
translated into binary (i.e., machine language) before it can be run. On-the-fly interpretation makes Python quick to use
and often provides the user with rapid results. This is ideal for scientific data analysis where the user is routinely making
changes to the processing and visualization of the data.
Python is free, open-source software and is maintained by the non-profit Python Software Foundation. This is appealing
for two major reasons. The first is that it is widely, freely, and irrevocably available to anyone who wants to use it
regardless of budget. With proprietary software, which is more and more commonly offered under a subscription model,
if a company stops offering or updating a software package, it may simply become unavailable leaving users without the
software they built their work around. Second, it is open source, so anyone can inspect and modify the code. This allows
anyone to review the code to ensure it does what it claims instead of relying on the assertions of the software distributor.
Another reason to use Python over other options, free or otherwise, is the power and the community support available
to Python users. Python is a common and popular programming language that has been applied to a wide variety of
applications including data analysis, visualization, machine learning, robotics, web scraping, 3D graphics, and more. As a
result, there is a large community built around Python that provides valuable support for those who need assistance. If you
are stuck on a problem or have a question, a quick internet search will likely provide the answer. Common internet forums
include [Link] or [Link] among others. If you have a question or need help on something, you
are probably not the first person to ask that question.
Along with Python, this book uses the IPython environment and Jupyter notebooks as a medium for running and shar-
ing Python code. More details are given below on Jupyter notebooks, but for now, know that they provide interactive
environments ideal for scientific computing. In addition, we will use a variety of free, open-source libraries to provide
collections of useful functions for scientific data processing, analysis, and visualization. Think of a library as an add-on
or tool pack for Python, and there are many to choose from.
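As a small illustration of the tool-pack idea, the sketch below imports Python's built-in statistics module and uses two of its premade functions instead of writing the arithmetic by hand. The titration volumes are made-up values used only for demonstration. Third-party libraries such as NumPy are imported and used in the same way once they are installed.

```python
import statistics

# Hypothetical replicate titration volumes in mL (illustrative values only)
volumes_mL = [24.95, 25.05, 25.00]

# Premade functions from the library do the work for us
average = statistics.mean(volumes_mL)
spread = statistics.stdev(volumes_mL)  # sample standard deviation

print(f"mean = {average:.2f} mL, std dev = {spread:.2f} mL")
```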
Warning
Software installation instructions may have changed since these instructions were written and may vary depending on
the operating system.
The first step is to get access to the software, which includes Python, Jupyter notebooks, and all the libraries/packages
used in this book. There are multiple options for accomplishing this, and we will cover two common ones below:
installing the software on your own computer using Anaconda or using Google Colab to run the software
on a Google server. Both are relatively simple to set up and have different advantages. Some of the major advantages
of each are listed below.
You only need to use one of the above options, but you can always switch later on if you want because both use the same
notebook files to store your work. Go ahead and follow the instructions for one of the following.
There are multiple ways to install the software on your computer. Two common options are the Anaconda Distribution
and the Miniconda installers, both provided for free by Anaconda Inc. There are other installation options available, but
the instructions in this book often assume one of these options. Miniconda is currently recommended over the Anaconda
Distribution installer even though Miniconda requires a little more effort. When installing the software, be sure to choose
Python 3 as this is the current version. Python 2 reached its end of life in January 2020 and is considered legacy, even
though some applications still support it. As of the
time of this writing, multiple major projects in the scientific Python ecosystem no longer support Python 2, so it is likely
in your interest to be on Python 3. You are strongly encouraged to install the most recent version of Python.
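Once the software is installed, a quick check from within Python can confirm which version is actually running. The following is a minimal sketch using only the standard library; the function names are hypothetical and chosen for this illustration.

```python
import sys

def python_version_string():
    """Return the running Python version as 'major.minor.micro'."""
    info = sys.version_info
    return f"{info.major}.{info.minor}.{info.micro}"

def is_python3():
    """Return True when the interpreter is Python 3 or newer."""
    return sys.version_info.major >= 3

print("Python version:", python_version_string())
print("Running Python 3:", is_python3())
```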
Anaconda Distribution
The Anaconda installer brings almost everything you need. Any software used in this book that is not installed by default
with Anaconda is addressed in its respective chapter. If you want to install additional libraries, open the Anaconda-
Navigator (green circle icon) and select the Environment tab on the left. Select Not Installed from the pull-down menu
to see all the libraries available to be installed as shown in Figure 1. To install a library, check the box next to it and click
the Apply button that appears on the bottom right. Anaconda will install it and anything else that is required for the new
library to work properly. To update a library, select Upgradable from the pull-down menu, select the package(s) you
want to update, and click Apply.
Warning
The conda command pulls software from online repositories known as channels. By default, conda pulls from the
Anaconda Inc. channel, which is not free for large companies. The user can optionally add -c conda-forge
to the below commands to pull from the conda-forge channel, which is free for everyone. This is why many
packages (e.g., matplotlib) include -c conda-forge in their installation instructions.
From here, you can install various libraries using either of the below commands where <library> is the name of the
library to install.
pip install <library>
or
conda install <library>
Most libraries can be installed using either of the above commands, but a few can only be installed with one. You should
do a quick internet search to see which is the preferred method for a particular library before installing it. The
pip list or conda list command will display a list of all libraries currently installed with version numbers. To perform
an update, the following two commands may be used for many libraries. Again, check to see which is preferred for a
particular library.
pip install --upgrade <library>
or
conda update <library>
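The installed libraries and their version numbers can also be inspected from inside Python itself, which is handy within a notebook. The sketch below uses the standard-library importlib.metadata module and is similar in spirit to pip list; the function name installed_packages is hypothetical.

```python
from importlib import metadata

def installed_packages():
    """Return a dict mapping installed package names to version strings."""
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        if meta is None:
            continue  # skip installations with missing metadata
        name = meta.get("Name")
        version = meta.get("Version")
        if name and version:
            packages[name] = version
    return packages

# Print the packages alphabetically, ignoring capitalization
for name, version in sorted(installed_packages().items(), key=lambda item: item[0].lower()):
    print(f"{name:30s} {version}")
```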
Miniconda (preferred)
Miniconda is a lighter installer that uses less space on your computer and is my preferred installation method. It is not
quite as convenient as the previous method because it installs minimal software, so after installing Miniconda, the user
also needs to install the Python packages. Below are the steps for installing Miniconda and Python packages. These
instructions are written for macOS or Linux. I have not tested these instructions on Windows, but there are Windows
instructions on the web.
1. Download the Miniconda installer and install Miniconda following the prompts.
2. Open your computer’s Terminal and install JupyterLab using either the conda or pip commands (e.g., pip install jupyterlab).
3. Install core Python packages using either pip or conda. Below is a list of packages that should be installed to get
started. It is strongly recommended to stick with either pip or conda and avoid using a mixture of both because this
can lead to issues later on.
• jupyterlab
• numpy
• scipy
• matplotlib
• pandas
• seaborn
• scikit-image
• scikit-learn
• sympy
For example, to install numpy using pip, you would run the following command in your computer’s Terminal or in the
JupyterLab Terminal (Figure 2).
pip install numpy
Tip
If you want a shortcut, create a requirements file (called [Link] below) that lists all the packages, one per line,
and install them all at once using the following command. This should be a plain text (.txt) file. If the
command is not finding the file, try typing the first part through pip install -r and then click and drag the
file into the Terminal window.
pip install -r [Link]
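A requirements file can be typed by hand in any text editor, but it can also be generated with a few lines of Python. The sketch below writes the core packages listed above to a plain text file, one per line. The file name requirements.txt and the function name write_requirements are assumptions made for this illustration.

```python
from pathlib import Path

# Core packages listed in the installation steps above
CORE_PACKAGES = [
    "jupyterlab", "numpy", "scipy", "matplotlib", "pandas",
    "seaborn", "scikit-image", "scikit-learn", "sympy",
]

def write_requirements(path, packages=CORE_PACKAGES):
    """Write one package name per line to a plain text file and return its path."""
    path = Path(path)
    path.write_text("\n".join(packages) + "\n", encoding="utf-8")
    return path

if __name__ == "__main__":
    # "requirements.txt" is an assumed, conventional file name
    created = write_requirements("requirements.txt")
    print(f"Wrote {len(CORE_PACKAGES)} package names to {created}")
```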
To launch JupyterLab and start coding, type jupyter-lab in the Terminal window. It should launch in your browser
(e.g., Chrome or Firefox). JupyterLab is not a website on the internet; it runs on your own computer and simply uses your web browser as its interface.
The topic of environments is technically an optional one. If you are just getting started, you can probably skip over
this for now, but as you establish yourself more in coding and work on more projects, it is a good idea to learn to use
environments. Using environments is considered best practice and allows you to have multiple different versions of
Python and/or Python packages installed on a single computer at the same time. This is helpful when you are working on
multiple projects with different software requirements. There are two common types of environments you will often hear
about: conda and venv. We address using conda environments here. Again, if you are just getting started, this may not
be necessary, but here are instructions for doing this when the time comes.
Tip
The -c conda-forge option can be added to the below commands to use the conda-forge channel when
installing software. Again, the conda-forge channel is free for all users, including large companies.
Note
As of 2024, the conda default channel and the conda-forge channel are not intercompatible. That is, the user should
not install packages from both in the same conda environment.
1. Open the Terminal on your computer or in JupyterLab and type one of the following commands to create a new
conda environment with the name <env_name>. The <env_name> can be anything you want. The python tells
the command to also install Python in that environment. Optionally, you can also list Python packages to install in
the environment at this stage, as is done in the second command example.
conda create --name <env_name> python
or
conda create --name <env_name> python <library>
2. Now the new conda environment has been created. To see a list of all your environments, type the following. You
should always have one called base along with any others you created.
conda env list
3. Next, we need to switch over to the new environment by typing the activate command below. If you again type
conda env list, you will see the * has shifted from base to your new environment indicating that your new
environment is currently active.
conda activate <env_name>
4. If you want to install additional libraries in this environment, you can do this now using conda or pip. Remember
to install JupyterLab if you intend to use it.
5. If you want to use this environment in a Jupyter notebook, you will need to register it with JupyterLab. First
install ipykernel (e.g., conda install ipykernel or pip install ipykernel) and then type the
command below to register your environment with JupyterLab. Now when you start a new Jupyter notebook, your
new environment will be an option. There will also be a pull-down menu on the top right of your notebook where
you can select which environment you want to use.
python -m ipykernel install --user --name <env_name>
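When several environments exist, it is easy to lose track of which one a notebook or script is actually running in. The sketch below reports the interpreter path and environment folder using the standard library. The function name describe_environment is hypothetical. The CONDA_DEFAULT_ENV variable is set by conda when an environment is activated, but inside a notebook it may reflect the environment JupyterLab was launched from, so the interpreter path is the more reliable indicator.

```python
import os
import sys

def describe_environment():
    """Collect basic facts about the Python environment running this code."""
    return {
        "executable": sys.executable,   # path to the running Python interpreter
        "prefix": sys.prefix,           # root folder of the active environment
        # Set by conda on activation; None when conda is not in use
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }

for key, value in describe_environment().items():
    print(f"{key}: {value}")
```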
conda deactivate
or
2. If you registered the environment with Jupyter, unregister it with the following.
The other option we’ll cover is to run the software on a Google server using Google Colab. You don’t need to install any
software for this option, but you will need a free Google account. If you have a Gmail account or your institution’s email
is run by Google, you already have a Google account. While you could just go directly to the Colab Page, we want to
be able to work with data files on your Google Drive, so below are instructions for setting up Google Colab from your
Google Drive.
First, log into your Google account or create an account if you don’t have one already. Next, navigate to Google Drive by
clicking on the Google Apps icon (3 × 3 grid of dots) on the top right and clicking Drive (Figure 3).
Note
If you already have a Jupyter notebook (.ipynb extension) in your Google Drive, opening it by double-clicking it
will sometimes install the Google Colab add-on automatically. This, of course, requires that you already have a
Jupyter notebook from some other source.
Tip
If installing the Colaboratory add-on does not allow you to open Jupyter notebooks, try refreshing your Google
Drive page.
Most of the libraries (see section 0.6) used in this book are already available in Google Colab by default including NumPy,
SciPy, pandas, seaborn, scikit-image, and scikit-learn. If you need any additional libraries (or “packages”), you can usually
install them by adding a code cell at the top of your Jupyter notebook that looks like the following, inserting the library
name for <library>. If you need any additional libraries installed for this book, this will be addressed in the appropriate
chapter.
!pip install <library>
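Before installing anything, it can be useful to check whether a library is already available in the current environment. The following is a minimal sketch using the standard-library importlib.util module; the function name is_installed is hypothetical. Note that the name used to import a library sometimes differs from the name used to install it.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module with this import name can be found."""
    return importlib.util.find_spec(module_name) is not None

# Import names can differ from install names
# (scikit-image imports as skimage, scikit-learn imports as sklearn)
for module_name in ["numpy", "scipy", "pandas", "seaborn", "skimage", "sklearn"]:
    status = "available" if is_installed(module_name) else "not installed"
    print(f"{module_name}: {status}")
```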
The Jupyter notebook (formerly known as the IPython notebook) is an electronic document designed to support interactive
data processing, analysis, and visualization in a shareable format. A Jupyter notebook can contain live code, equations,
explanatory text, and the output of code such as values, text, images, and plots. The code and examples in this book are
intended to be run from a Jupyter notebook but should work fine in many other environments including a basic IPython
terminal. You can work with Jupyter notebooks either by having the Jupyter software installed on your computer or by
running them on Google Colab which is Google’s implementation of Jupyter.
Note
The name changed as a result of support for more programming languages beyond Python. The name Jupyter was
forged from Julia/Python/R, the first three languages supported, and is a nod to Galileo Galilei for his notebooks
where he sketched the planet Jupiter and moons as observed through his telescope. The Jupyter notebook currently
supports dozens of programming languages, but for this book, we will only be addressing Python.
The Jupyter notebook is structured as a series of cells of two main types: code and Markdown. The code cells contain live
Python code that can be run inside the notebook with any output of the code, including values, text, and plots, appearing
directly below the cell (Figure 5). The Markdown cell is the other common cell type and is designed to contain explanatory
information on what is happening in the code cells. Markdown cells can contain text, equations, and images to help the user convey
information, and they support formatting in Markdown, HTML, and LaTeX. These two types of cells provide the
user with the ability to produce documents containing the data analysis, results, and explanations of the data and analysis
along with any conclusions.
Figure 5 An example Jupyter notebook with Markdown cells, code, and outputs of the code when open in Jupyter installed on a computer
(top) and in Google Colab (bottom).
Jupyter notebooks can be opened and edited using Jupyter installed on your own computer or Google Colab. While the
two platforms of Jupyter are similar, there are some minor differences in the location of some controls and other features.
Using installed Jupyter and Google Colab are both addressed below.
If you have Python and Jupyter installed on your computer using Anaconda (section 0.2.1), a Jupyter notebook can be
launched by starting the Navigator application (green circle icon) and then clicking the Launch button under JupyterLab.
Alternatively, Jupyter can be launched from the Terminal or shell by typing jupyter-lab. The Jupyter notebook will
launch in the web browser, but this is not a website on the internet. Jupyter runs as a small server on your own computer,
and the browser simply acts as the interface for viewing and editing the notebook files stored there.
From here, you can either select an already existing Jupyter notebook, denoted by the orange icons and .ipynb extension
(Figure 6, left), to open it or create a new notebook by selecting New from the File menu (Figure 6, right) and selecting
Notebook. If a popup dialogue appears titled Select Kernel, you should select Python 3 (or your environment if you
installed a conda environment).
Figure 6 Launching a Jupyter notebook can be accomplished by opening a preexisting notebook from within JupyterLab
(left) or launching a new Jupyter notebook from the File menu (right).
Both code and Markdown cells can be run by either selecting Run Selected Cells in the Run menu, by clicking the ►
button at the top of the notebook (Figure 7), or by using the Shift + Return shortcut. When a code cell is run, the code
is executed with any output appearing directly below. When a Markdown cell is executed, the text in the cell is rendered
to look nicer, and any HTML or LaTeX code is rendered to generate the equation(s) or desired formatting. Markdown
cells do not execute Python code and treat code like regular text.
Figure 7 Run a selected cell in a Jupyter notebook by clicking the ► button at the top of the notebook or by selecting
Run Selected Cells from the Run menu. The output of a code cell appears directly below the executed code cell.
To add additional cells in Jupyter, click the + above the notebook to produce another cell and then select either Code or
Markdown from the pulldown menu at the top to set the cell type.
Google Colab is Google’s flavor of Jupyter with Python. If you are using Google Colab (section 0.2.2), you can open
a notebook by double-clicking on the Jupyter notebook file (.ipynb extension) in your Google Drive. To create a new
notebook, click the New button on the top left of the Google Drive window and then More → Google Colaboratory
(Figure 8).
Figure 8 Launching a new notebook in Google Colab using New → More → Google Colaboratory.
Once your notebook is open, you can execute code or Markdown cells by either selecting one of the run options (e.g.,
Run all) in the Runtime menu, by clicking the ► button at the left of the cell (Figure 9), or by using the Shift + Return
shortcut.
Figure 9 A cell can be executed by clicking the ► button at the left of a cell in Google Colab among other methods.
Just like Jupyter installed on a computer, once a code cell is run, the code is executed with any output (e.g., numbers, text,
or graphs) appearing directly below the code cell. When a Markdown cell is executed, the text in the cell is rendered to
look nicer, and any HTML or LaTeX code is rendered to generate the equation(s) or desired formatting. If code is written
in a Markdown cell, it is treated like regular text instead of code.
To add additional cells in Google Colab, click either the + Code or + Text above the notebook to produce another code
or Markdown cell, respectively.
The one other major difference between running the software installed on your own computer and Google Colab is that if
you want Colab to be able to interact with data or images files on your Google Drive, you need to include the three extra
lines of code shown below at the top of your notebook. The first two lines grant the notebook access to read/write files
on your Google Drive while the third line (%cd /content/drive/My Drive/project) points your notebook
to where your files are located. The path should reflect the location of the folder containing your notebook and data files.
For example, if your notebook is contained in a folder titled project on Google Drive, the path will be /content/
drive/My Drive/project.
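The three lines described above are not reproduced in this text. A sketch of what they commonly look like is given here, assuming the standard google.colab drive helper and a folder named project as in the example path; it only runs inside a Colab notebook, and the folder name should be changed to match your own Drive.

```python
# Runs only inside Google Colab; "project" is an example folder name
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/project
```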
0.4 Markdown
Markdown is a lightweight markup language that allows users to make bold, italic, or monospaced text and various kinds
of lists and other simple formatting. The table below provides a collection of common Markdown syntax (left) with
the corresponding rendered result (right). These are worth knowing to generate sharp Markdown cells in your Jupyter
notebooks. You will likely find that regular usage will commit them to memory.
Table 1 Markdown Syntax
One difference between writing code in a code cell versus a Markdown cell is that code cells color the text based on the
syntax or the role the text plays in the code, known as syntax highlighting, and Markdown cells do not. It is as if
a word processor colored nouns gray, verbs orange, prepositions blue, and punctuation marks red so that the reader can
see the role each word or symbol plays in a sentence. If you want to include example code in a Markdown cell with syntax
highlighting, place ~~~python in the line above the code and ~~~ in the line below the code.
0.5 Comments
Along with Markdown cells, it is good practice to add comments to your code. Comments are a means of describing what
each section of code does and make it easier for you and others to navigate the code. It may seem clear to you what each
piece of code does as you write it, but after a week, month, or longer, it is unlikely to be as obvious. Someone (attribution
uncertain) once elegantly described the importance of comments by stating that “Your closest collaborator is you six
months ago, but you don’t reply to emails.” Comment your code now so that you are not confused later.
Comments are added directly to code cells using the hash # symbol. Anything in a line after a hash symbol is
not executed. This means that an entire line can be a comment or a comment can be added after code as demonstrated
below with comments colored differently than the rest of the code.
import numpy as np
rng = np.random.default_rng()  # create a random number generator
for frame in t:
    # add random value to locations
    loc += 2 * (rng.random(particles) - 0.5)
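As a self-contained illustration of both comment styles, the sketch below calculates the pressure of an ideal gas. The quantities and variable names are hypothetical values chosen for the example, not taken from the text above.

```python
# Calculate the pressure of an ideal gas using PV = nRT
R = 0.08206   # gas constant in L·atm/(mol·K)
n = 1.00      # amount of gas in mol
T = 298.15    # temperature in K
V = 24.0      # volume in L

pressure = n * R * T / V   # pressure in atm
print(pressure)
```

The first line is a full-line comment describing the purpose of the block, while the remaining comments follow code on the same line to record units.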
The Python programming language allows for add-ons known as libraries or packages to provide extra features. Each
library is a collection of modules, and each module is a collection of functions… or occasionally data. For example, the
SciPy library contains a module called integrate which contains a collection of functions for integrating equations
or sampled data. For scientific applications, there is a series of core libraries collectively known as the SciPy stack along
with many other popular libraries. The table below lists some of the common libraries for scientific applications with an
asterisk by those often considered part of the SciPy stack.
Table 2 Common Python Scientific Libraries
Library         Description
NumPy*          Foundation of the SciPy stack; provides arrays and a large collection of mathematical functions
SciPy*          Tools for common scientific data analysis tasks including signal analysis, Fourier transforms,
                integration, linear algebra, optimization, feature identification, and others
Matplotlib*     Popular and powerful plotting library
Scikit-Image*   Scientific image processing and analysis
Seaborn         Advanced plotting library built on matplotlib
SymPy*          Symbolic mathematics (somewhat analogous to Mathematica)
Pandas*         Advanced data analysis tools
Scikit-Learn    Machine learning tools
TensorFlow      Machine learning tools for neural networks
NMRglue         Nuclear magnetic resonance data processing
Biopython       Computational biology and bioinformatics
Scikit-Bio      Computational biology and bioinformatics
RDKit           General purpose cheminformatics
Further Reading
For further reading and exploration on Jupyter notebooks, the Jupyter Project website below is a good place to see what is
happening. There are also a number of books that include chapters on the Jupyter notebooks and the interactive IPython
environment.
1. Jupyter Project Website. [Link] (free resource)
2. Google Colab (and Jupyter) Cheat Sheet. [Link] (free resource)
3. SciPy Website. [Link] (free resource)
4. IPython Interactive Computing Website. [Link] (free resource)
5. VanderPlas, J. Python Data Science Handbook: Essential Tools for Working with Data, 1st ed.;
O’Reilly: Sebastopol, CA, 2017, chapter 1. Freely available from the author at [Link]
PythonDataScienceHandbook/ (free resource)
CHAPTER 1: BASIC PYTHON
1.1 Numbers
To a degree, Python is an extremely powerful calculator that can perform both basic arithmetic and advanced mathematical
calculations. Doing math in a Python interpreter is similar to using a graphing calculator – the user inputs a mathematical
expression in a line and presses Return (or Shift-Return in the case of a Jupyter notebook cell), and the output appears
directly below. Python includes a few basic mathematical operators shown in the table below.
Table 1 Python Mathematical Operators
Operator Description
+ Addition
- Subtraction
* Multiplication
/ Division (regular)
// Integer division (aka. floor division)
** Exponentiation
% Modulus (aka. remainder)
The addition, subtraction, multiplication, and division (regular) operators work the same way as they do in most math
classes. In addition, Python follows the standard order of operations, so parentheses can be used to change the flow of the
mathematical operations as needed.
8 + 3 * 2
14
(8 + 3) * 2
22
You may have noticed that there are spaces around the mathematical operators in the example calculations above. Python
does not care about spaces within a line, so feel free to add spaces to make your calculation more readable as is done
above. Python does, however, care about spaces at the beginning of a line. This will be further addressed in the sections
on conditions and loops.
Regular division, denoted by a single forward slash (/), is exactly what you probably expect. Three divided by two is
one and a half. Integer division, shown with a double forward slash (//), is a little more surprising. Instead of providing
the exact answer, it rounds the result down to the nearest integer (also known as flooring it). For positive values, this is
the same as simply truncating anything after the decimal point.
3 / 2
1.5
3 // 2
1
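The difference between flooring and truncating only shows up with negative values. The short sketch below compares integer division with truncation using int() and with math.floor(); the variable names are illustrative only.

```python
import math

pos = 3 // 2                   # 1.5 is rounded down to 1
neg = -3 // 2                  # -1.5 is rounded down to -2, not up to -1
truncated = int(-3 / 2)        # int() truncates toward zero, giving -1
floored = math.floor(-3 / 2)   # math.floor() rounds down, giving -2

print(pos, neg, truncated, floored)
```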
Exponentiation is performed with a double asterisk (**). The caret (^) is the bitwise XOR operator in Python, not
exponentiation, so be careful not to accidentally use it.
2 ** 3
8
Occasionally, obtaining the modulus is also useful and is done using the modulo operator (%). This is also sometimes
referred to as the remainder after division as it is whatever is left over when the divisor does not divide evenly into the
dividend. In the example below, 3 goes into 10 three times with 1 left over. The leftover portion is the modulus. This is
often useful in determining whether a number is even, among other things.
10 % 3
1
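As a sketch of the even-number test mentioned above, a value is even when its remainder after division by 2 is zero. The electron count used here is a made-up example, and the == comparison is covered later in the chapter.

```python
n_electrons = 18
remainder = n_electrons % 2    # 0 for even values, 1 for odd values
is_even = remainder == 0
print(remainder, is_even)

odd_remainder = 7 % 2          # an odd value leaves a remainder of 1
print(odd_remainder)
```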
The two main types of numbers in Python are floats and integers. Floats, short for “floating point numbers,” are values with
decimals in them. They may be either whole or non-whole values such as 3.0 or 1.2, but there is always a decimal point.
Integers are whole numbers with no decimal point such as 2 or 53.
Addition, subtraction, and multiplication of only integers generate an integer, while any operation that includes a float
generates a float. Regular division (/) always generates a float, even when both inputs are integers. In the second example
below, a float is generated because one of the inputs is a float. In the third example below, a float is generated despite
only integers in the input because regular division was used.
3 + 8
11
3.0 + 8
11.0
2 / 5
0.4
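One way to check these rules is with the built-in type() function, which reports whether a result is an int or a float. The following is a small sketch with arbitrary example values.

```python
int_sum = 3 + 8          # only integers, so the result is an int
mixed_sum = 3.0 + 8      # one float input makes the result a float
quotient = 4 / 2         # regular division returns a float (2.0)
floor_quotient = 4 // 2  # integer division of two integers returns an int

print(type(int_sum), type(mixed_sum), type(quotient), type(floor_quotient))
```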
Integers and floats can be interconverted using the int() and float() functions.
int(3.0)
3
float(4)
4.0
The distinction between floats and integers is often a minor detail. There are times when a specific application or function
will require a value as an integer or float. However, a majority of the time, you do not need to think much about it as
Python manages most of this for you in the background.
In addition to basic mathematical operators, Python contains a number of functions. As in mathematics, a function has a
name (e.g., 𝑓) and the arguments are placed inside of the parentheses after the name. The argument is any value or piece
of information fed into a function. In the case below, 𝑓 requires a single argument x.
𝑓(𝑥)
There are a number of useful math functions in Python with Table 2 describing a few common ones such as the absolute
value, abs(), and round, round(), functions. Note that the round() function uses banker’s rounding: if a number
is exactly halfway between two integers (e.g., 4.5), it rounds toward the even integer (i.e., 4).
abs(-4)
4
round(4.5)
4
Function Description
abs() Returns the absolute value
float() Converts a value to a float
int() Converts a value to an integer
len() Returns the length of an object
list() Converts an object to a list
max() Returns the maximum value
min() Returns the minimum value
open() Opens a file
print() Displays an output
round() Rounds a value using banker’s rounding
str() Converts an object to a string
sum() Returns the sum of values
tuple() Converts an object to a tuple
type() Returns the object type (e.g., float)
zip() Zips together two lists or tuples
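To see banker's rounding in action, the sketch below rounds a series of values that each sit exactly halfway between two integers. It uses a list and a list comprehension, which are introduced later, purely as a compact way to apply round() to several values.

```python
halves = [0.5, 1.5, 2.5, 3.5, 4.5]
rounded = [round(value) for value in halves]
print(rounded)   # each value moves to the nearest even integer

not_halfway = round(4.6)   # values that are not halfway round as usual
print(not_halfway)
```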
The print() function is one of the most commonly used functions; it tells Python to display some text or values.
While Jupyter notebooks will display the output or contents of a variable by default, the print() function allows for
considerably more control as you will see below in section 1.3.
print(8.3145)
8.3145
In addition to Python’s native collection of functions, Python also contains a math module with more mathematical
functions. Think of a module as an add-on or tool pack for Python. The math module comes with every installation of
Python and is activated by importing it (i.e., loading it into memory) using the import math command. After the
module has been imported, any function in the module is called using math.function() where function is the
name of the function. For example, math contains the function sqrt() for taking the square root of values.
import math
math.sqrt(4)
2.0
Table 3 lists some commonly used functions in the math module, and a few examples are shown below. Interestingly,
some functions simply provide a mathematical constant.
[Link](4.3)
math.pi
3.141592653589793
math.pow(2, 8)
256.0
Function Description
ceil(x) Rounds 𝑥 up to nearest integer
cos(x) Returns 𝑐𝑜𝑠(𝑥)
degrees(x) Converts 𝑥 from radians to degrees
e Returns the value 𝑒
exp(x) Returns 𝑒^𝑥
factorial(x) Takes the factorial (!) of 𝑥
floor(x) Rounds 𝑥 down to the nearest integer
log(x) Takes the natural log (ln) of 𝑥
log10(x) Takes the common log (base 10) of 𝑥
pi Returns the value 𝜋
pow(x, y) Returns 𝑥^𝑦
radians(x) Converts 𝑥 from degrees to radians
sin(x) Returns 𝑠𝑖𝑛(𝑥)
sqrt(x) Returns the square root of 𝑥
tan(x) Returns 𝑡𝑎𝑛(𝑥)
There are more ways to import functions or modules in Python. If you only want to use a single function from the entire
module, you can selectively import it using the from statement. Below is an example of importing only the radians()
function.
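A sketch of such a selective import is shown here. The argument of 4 degrees is an assumption inferred from the printed value that follows.

```python
from math import radians

angle = radians(4)   # convert 4 degrees to radians; no math. prefix needed
print(angle)
```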
0.06981317007977318
One advantage of importing only a single function or variable is that you do not need to use the math. prefix. Some
Python users take this method one step further by using a wild card (*), which imports everything from the module. That
is, they type from math import *. This imports all functions and variables and again allows the user to use them
without the math. prefix. The downside is that you might accidentally overwrite a variable (see following section 1.2 on
variables) in your code this way. Unless you are absolutely certain you know all the functions and variables in a module
and that it will not overwrite any variables in your code, do not use the * import. On second thought, just avoid using the
* import anyway.
1.2 Variables
When performing mathematical operations, it is often desirable to store values in variables for later use instead of manually
typing them back in. This will save effort when writing your code and make any changes automatically propagate through
your calculations.
Attaching a value to a variable is called assignment and is performed using a single equal sign (=). Below, 5.0 and 3 are
assigned to the variables a and b, respectively. Mathematical operations can then be performed with the variables just as
is done with numerical values.
a = 5.0
b = 3
a + b
8.0
Variable names can be almost any string of characters as long as they start with a letter or underscore, do not contain a
space or an operator (see Table 1), and are not one of Python’s reserved words shown in Table 4. Be careful about reusing
a variable name, as assigning a new value to it overwrites the first value. Modules and functions are also attached to
variables, so if you have imported the math module, the module is attached to the variable math.
Table 4 Reserved Words in Python
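If the table of reserved words is not at hand, the list for your installed version of Python can be generated with the standard library keyword module, as in this brief sketch.

```python
import keyword

reserved = keyword.kwlist           # list of reserved words for this Python version
print(reserved)

print(keyword.iskeyword('for'))     # True, so for cannot be a variable name
print(keyword.iskeyword('mass'))    # False, so mass is allowed
```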
It is also in your best interest to create variable names that clearly indicate what they contain unless the code is just a
generic example (like those used in this book) or a quick experiment. This will make writing and reading code significantly easier and is a good
habit to start early. In the examples below, a reader might be able to determine that the first example is calculating energy
using 𝐸 = 𝑚𝑐2 while it is more difficult to determine what the second example is calculating.
# clear variables
mass = 1.6
light_speed = 3.0e8
mass * light_speed**2
1.44e+17
# not-so-great variables
x = 3.2
a = 1.77
a + x
4.970000000000001
A variable can be assigned to another variable as is shown below. When this happens, both variables are assigned to the
same value, which is not particularly surprising.
x = 5
y = x
However, watch what happens if the first variable, x, is then assigned to a new value.
x = 8
y
5
Instead of y updating to the new value, it still contains the first value. This is because instead of y being assigned to x,
the value 5 was assigned directly to y. Behind the scenes, Python handles assignment by making a pointer that connects
a variable name to a value in the computer’s memory. Figure 1 illustrates what happens in the above example.
Figure 1 A representation of memory pointers during variable assignment is shown with the Python code (left) and the
corresponding pointers (right).
The x pointer is directed to a new value but the y pointer is still aimed at 5.
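The pointer behavior can be checked with the is operator (discussed in the Boolean logic material later in this chapter), which tests whether two variable names point to the same object. This is a minimal sketch using the same values as above; the names same_before and same_after are illustrative.

```python
x = 5
y = x
same_before = x is y   # both names point to the same object (5)

x = 8                  # x is redirected to a new value
same_after = x is y    # y still points to 5, so the names now differ

print(same_before, same_after, y)
```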
1.3 Strings
Floats and integers are means of storing numerical data. The other major type of data is text which is stored as a string of
characters known simply as a string. Strings can contain a variety of characters including letters, numbers, and symbols
and are identified by single or double quotes.
'some text'
'some text'
Tip
Triple quotes can also be used to extend a string across multiple lines and are also used for the docstring in a newly
defined function (see section 1.9.5).
The simplest way to create a string is to enclose the text in either single or double quotes, and a string can be assigned to
variables just like floats and integers. To have Python print out the text, use the print() function.
text = 'some text'
print(text)
some text
Strings can also be created by converting a float or integer into a string using the str() function.
str(4)
'4'
Even though a number can be contained in a string, Python will not perform mathematical operations with it because it
sees anything in a string as a series of characters and nothing more. As can be seen below, in attempting to add '4' and
'2', instead of doing mathematical addition, Python concatenates the two strings. Similarly, in attempting to multiply
'4' by 2, Python returns the string twice and concatenates them. These are ways of combining or lengthening strings,
but no actual math is performed.
'4' + '2'
'42'
'4' * 2
'44'
If two strings are multiplied, Python returns an error. This is an issue commonly encountered when importing numerical
data from a text document. The remedy is to convert the string(s) into numbers using either the float() or int()
functions.
int('4') * int('2')
8
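The sketch below illustrates both points with a made-up line of text holding two atomic masses: multiplying two strings raises a TypeError, while converting the pieces with float() allows real arithmetic. The split() method and try/except are used here only for illustration and are covered elsewhere.

```python
line = '12.011 1.008'          # hypothetical line read from a text file
parts = line.split()           # ['12.011', '1.008'], still strings

try:
    parts[0] * parts[1]        # multiplying two strings is not allowed
except TypeError as err:
    message = str(err)
    print('Error:', message)

total = float(parts[0]) + float(parts[1])   # convert, then add
print(total)
```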
If we want to know the length of a string, we can use the len() function as shown below.
len(text)
9
4.0 g
print(4.0, 'g')
4.0 g
Accessing a piece or slice of a string is a common task in scientific computing among other applications. This is often
encountered when importing data into Python from text files and only wanting a section of it. Indexing allows the user to
access a single character in a string. For example, if a string contains the amino acid sequence of a peptide and we want
to know the first amino acid, we can use indexing to extract this character. The key detail about indexing in Python is
that indices start from zero. That means the first character is index zero, the second character is index one, and so on.
If we have a peptide sequence of ‘MSLFKIRMPE’, then the indices are as shown below.
Characters M S L F K I R M P E
Index 0 1 2 3 4 5 6 7 8 9
To access a character, place the index in square brackets after the name of the string.
seq = 'MSLFKIRMPE'
seq[0]
'M'
Interestingly, we do not have to use variables to do this; we could perform the same operation directly on the string.
'MSLFKIRMPE'[1]
'S'
What happens if you want to know the last character of a string? One method is to determine the length of a string and
use that to determine the index of the last character.
len(seq)
10
seq[9]
'E'
The string can also be reverse indexed from the last character to the first using negative indices, with -1 being the last character.
Characters M S L F K I R M P E
Index -10 -9 -8 -7 -6 -5 -4 -3 -2 -1
seq[-1]
'E'
Indexing only provides a single character, but it is common to want a series of characters from a string. Slicing allows us
to grab a section of the string. It uses the same index values as above except requires the start and stop indices separated
by a colon in the square brackets. One important detail is that the character at the starting index is included in the slice
while the character at the final index is excluded from the slice.
seq[0:5]
'MSLFK'
If you look at the index values for each letter, you will notice that the character at index 5 (I) is not included.
What happens if you want to grab the last three characters of a string to determine the file extension (i.e., what type of
file it is)? The fact that the last index is not included in the slice causes a problem as is shown below.
file = '1rxt.pdb'
file[-3:-1]
'pd'
The way around this is to just leave the stop index blank. This tells Python to just go to the end.
file[-3:]
'pdb'
This trick also works for the start index to get the file name without the extension. Notice that the -4 index is the period.
file[:-4]
'1rxt'
Finally, we can also adjust the step size in the slice. That is, we can ask for every other character in a string by setting a
step size of 2. The overall structure is [start : stop : step].
seq[::2]
'MLKRP'
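A few more combinations of start, stop, and step are sketched below using the same peptide sequence. The negative step used to reverse the string is an extra not covered above.

```python
seq = 'MSLFKIRMPE'

middle = seq[2:5]       # indices 2, 3, and 4
last_three = seq[-3:]   # from index -3 to the end
every_other = seq[1::2] # start at index 1 and take every second character
backward = seq[::-1]    # a step of -1 walks through the string in reverse

print(middle, last_three, every_other, backward)
```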
A method is a function that works with a specific type of object. String methods only work on strings, and they do not
work on other objects such as floats. Later on, you will see other objects like lists and NumPy arrays which have their
own methods for performing common tasks with those types of objects. If it makes it any easier, feel free to equate the
term “method” with “function” in your mind, but know that there is a bit more to methods.
One example of a string method is the capitalize() method, which returns a string with the first letter capitalized.
Using a string method is referred to as calling the method… it is computer science lingo for executing a function. The
method is called by appending .capitalize() to the string or a variable representing the string. For example, below
is an Albert Einstein quote that needs to have the first letter capitalized.
quote = 'anyone who has never made a mistake has never tried anything new.'
quote.capitalize()
'Anyone who has never made a mistake has never tried anything new.'
Notice that if we check the original quote, it is unchanged (below). This method does not change the original string but
rather returns a copy with the first letter capitalized. If we want to save the capitalized version, we can assign it to a new
variable or overwrite the original.
quote
'anyone who has never made a mistake has never tried anything new.'
cap_quote = quote.capitalize()
cap_quote
'Anyone who has never made a mistake has never tried anything new.'
As a minor note, string methods can also be called with str.method(string) with method being the name of the
string method and string being the string or string variable. While this works, it is used less often. The first approach
with string.method() is preferred because any string method needs a string to act upon, so many people find it
logical that a string should start the function call. It is also shorter to type, which is certainly a virtue.
[Link](quote)
False
str.capitalize(quote)
'Anyone who has never made a mistake has never tried anything new.'
Below are a few common string methods you may find useful.
Tip
A more powerful, and advanced, approach to searching and modifying strings is regular expressions introduced
in appendix 4.
Method               Description
capitalize()         Capitalizes the first letter in the string
center(width)        Returns the string centered with spaces on both sides to have a requested total width
count(characters)    Returns the number of non-overlapping occurrences of a series of characters
find(characters)     Returns the index of the first occurrence of characters in a string
isalnum()            Determines whether a string is all alphanumeric characters and returns True or False
isalpha()            Determines whether a string is all letters and returns True or False
isdigit()            Determines whether a string is all digits and returns True or False
lstrip(characters)   Returns a string with the leading characters removed; if no characters are given,
                     it removes whitespace
rstrip(characters)   Returns a string with the trailing characters removed; if no characters are given,
                     it removes whitespace
split(sep=None)      Splits a string apart based on a separator; if sep=None, it defaults to white spaces
startswith(prefix)   Determines if the string starts with a prefix and returns True or False
endswith(suffix)     Determines if the string ends with a suffix and returns True or False
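A few of the methods from the table are sketched below. The strings are example values, with the peptide sequence and file name borrowed from earlier in this section.

```python
seq = 'MSLFKIRMPE'
n_met = seq.count('M')          # number of methionine (M) residues
first_k = seq.find('K')         # index of the first lysine (K)
all_letters = seq.isalpha()     # True because the sequence is only letters

padded = '   urea  '
left_clean = padded.lstrip()    # leading whitespace removed
right_clean = padded.rstrip()   # trailing whitespace removed

symbols = 'C H N O'.split()     # split on white space into a list
is_pdb = '1rxt.pdb'.endswith('.pdb')

print(n_met, first_k, all_letters, symbols, is_pdb)
```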
In section 1.3.1, we were able to concatenate two strings by using the + operator as shown below. With this approach, it
is necessary to convert any non-string into a string using the str() function.
MW = 63.21
"Molar mass = " + str(MW) + " g/mol."
'Molar mass = 63.21 g/mol.'
While this approach usually works fine, it can get messy or unwieldy as you are combining more strings. In this section,
we will cover a couple of other methods for merging strings. Which you choose to use is a matter of personal preference,
but it is good to be aware of them as you may see them around.
str.format() Method
The first method we will address is the str.format() method. In this approach, the string (i.e., str) includes
curly brackets {} where you want to insert additional strings, and these additional strings are provided as arguments to
the str.format() method. As an example, below we are generating a sentence providing the name and molecular
weight of a compound. Notice how compound is inserted in the sentence where the first {} is located while MW is
inserted in the location of the second {}.
compound = 'ammonia'
MW = 17.03
If we assign the compound and MW variables to other values, the str.format() method dutifully inserts these new
strings into our sentence. Also notice that the format() method automatically converts non-string objects into strings
for us.
compound = 'urea'
MW = 60.06
A variation of the above approach is to include an index value inside the curly brackets indicating which string provided
to the str.format() method is inserted where in the sentence. In the example below, compound is provided to the
str.format() method first, so it replaces {0} while MW is second, so it replaces {1}. Remember that Python index
values start with zero.
compound = 'urea'
MW = 60.06
Because we are explicitly providing index values, we can insert strings into the sentence in any order. Notice in the
example below that the MW and compound variables are provided to the function in a different order.
We can also insert strings into our sentence multiple times as shown below.
'The compound urea is a molecular compound and urea has a molar mass of 60.06 g/mol.'
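The formatting calls themselves are not reproduced in this text, so the sketch below shows how empty brackets, indexed brackets, reordered indices, and repeated indices typically look. The sentence wording is hypothetical and not necessarily that used in the original examples.

```python
compound = 'ammonia'
MW = 17.03

# empty brackets are filled in the order the arguments are provided
in_order = 'The molar mass of {} is {} g/mol.'.format(compound, MW)

# index values allow the arguments to be inserted in any order
reordered = '{1} g/mol is the molar mass of {0}.'.format(compound, MW)

# the same index can be used more than once
repeated = '{0} is a compound, and {0} has a molar mass of {1} g/mol.'.format(compound, MW)

print(in_order)
print(reordered)
print(repeated)
```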
F-Strings
The next approach to combining strings is using f-strings. In this approach, the string is preceded with f, and any inserted
strings are denoted using {} with the variable name inside the curly brackets as demonstrated below. The appeal of this
approach is that it is simple, versatile, and relatively easy to follow.
We can also modify the strings by placing additional Python code inside the brackets like below where the first letter of
the compound is capitalized.
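Since the f-string examples are not reproduced in this text, here is a minimal sketch of both cases described above. The sentence wording is an assumption chosen for illustration.

```python
compound = 'ammonia'
MW = 17.03

# variable names go directly inside the curly brackets
sentence = f'The molar mass of {compound} is {MW} g/mol.'

# Python code, such as a string method call, can also go inside the brackets
cap_sentence = f'{compound.capitalize()} has a molar mass of {MW} g/mol.'

print(sentence)
print(cap_sentence)
```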
Python supports Boolean logic where all expressions are evaluated as either True or False. These are useful for adding
conditions to scripts. For example, if you are writing code to determine if a sample is a neutral pH, you will want to test
if the pH equals 7. If the pH == 7 evaluates as True, the sample is neutral, and if this statement is False, the sample
is not neutral.
There are a number of Boolean operators available in Python with the most common summarized in Table 6. These
operators are essentially truth tests with Python returning either True or False. Many of them work as one would
expect. For example, if 8 is tested for equality with 3, a False is returned. Note that the operator for equals is a double
equal sign, whereas a single equal sign assigns a value to a variable.
8 == 3
False
Operator Description
== Equal (double equal sign)
!= Not equal
<= Less than or equal
>= Greater than or equal
< Less than
> Greater than
is Identity
is not Negative identity
The is and is not Boolean operators are not as intuitive. These two operators test whether two objects are the same
thing (i.e., identity) or not the same thing, respectively. For example, if we test 8 and 8.0 for equality, the result is True
because they are the same quantity. However, if we test for identity, the result is False because the integer 8 and the
float 8.0 are two different objects.
8 > 3
True
8 == 8.0
True
8 is 8.0
<>:1: SyntaxWarning: "is" with 'int' literal. Did you mean "=="?
False
In the last example, Python generates a warning because the user probably meant to use == instead of is.
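Identity is easier to see with objects stored in variables. In the sketch below, two lists (covered in section 1.6) hold equal values but are separate objects, while a third variable points to the first list. The values are arbitrary.

```python
a = [1.008, 4.003]
b = [1.008, 4.003]   # a second list with the same contents
c = a                # c points to the same list object as a

equal_values = a == b    # True because the contents match
same_object = a is b     # False because they are two separate lists
same_pointer = a is c    # True because both names point to one list

print(equal_values, same_object, same_pointer)
```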
Comparisons can be combined with Boolean logic operators to make compound comparisons. Common
Boolean logic operators are shown in Table 7.
Table 7 Common Boolean Logic Operators
Operator Description
and Tests for both being True
or Tests for either being True
not Tests for False
The and operator requires both input values to be True in order to return True while the or operator requires only
one input value to be True in order to evaluate as True. The not operator is different in that it only takes a single input
value and returns True if and only if the input value is False. It is essentially a test for False.
False
True or False
True
8 > 3 or 8 < 2
True
not 8 > 3
False
Truth tables for the three common Boolean logic operators are shown below. Boolean logic by itself is not immensely
useful, but when paired with conditions (introduced below), it is a powerful tool in programming and data analysis.
Table 8 Truth Table for the and/or Logic Operators
p q p and q p or q
True True True True
True False False True
False True False True
False False False False
p    not p
True False
False True
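The truth tables can be regenerated in Python as a quick check. This sketch uses a for loop and itertools.product from the standard library, both of which go beyond what has been introduced so far.

```python
from itertools import product

rows = []
for p, q in product([True, False], repeat=2):
    rows.append((p, q, p and q, p or q))   # one row of the and/or table
    print(p, q, p and q, p or q)

not_rows = [(p, not p) for p in (True, False)]   # rows of the not table
print(not_rows)
```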
The values 1 and 0 can also be used in place of True and False, respectively, as Python recognizes them as surrogates.
For Python to know that you mean these values as Booleans and not simply integers, Python sometimes requires the
bool() function.
bool(1)
True
bool(0)
False
bool(5)
True
You can perform some of the above Boolean operations from section 1.4.2 with 1 and 0, but Python will return the result
in terms of 1 and 0.
1 or 0
1
1 and 0
0
It is sometimes helpful to test if any or all values test True in a list or tuple (covered in section 1.6). The any() and
all() functions do exactly this. The former will return True if one or more of the values in the object test True while
the latter will evaluate as True only if all values are True.
True
False
True
When fed numbers, both the any() and all() functions will treat them as Booleans as described in section 1.4.3.
any([0, 1, 0])
True
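A short sketch of both functions with Boolean and numerical inputs follows. The lists are arbitrary example values.

```python
checks = [True, False, False]
any_true = any(checks)             # at least one value is True
all_true = all(checks)             # not every value is True
all_pass = all([True, True, True])

any_nonzero = any([0, 0, 0])       # zeros are treated as False
all_nonzero = all([1, 2, 3])       # nonzero numbers are treated as True

print(any_true, all_true, all_pass, any_nonzero, all_nonzero)
```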
Python allows for the testing of inclusion using the in operator. Let us say we want to test if there is nickel in a provided
molecular formula. We can simply test to see if “Ni” is in the formula.
comp1 = 'Co(NH3)6'
comp2 = 'Ni(H2O)6'
'Ni' in comp1
False
'Ni' in comp2
True
The in operator also works for other objects beyond strings including lists and tuples which you will learn about in section
1.6.
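One caveat worth keeping in mind is that with strings, in tests for a matching series of characters rather than a chemical element. The sketch below shows the operator on a list and a potentially misleading match on a formula; the variable names are illustrative.

```python
elements = ['H', 'He', 'Li', 'Be']
has_helium = 'He' in elements    # True because 'He' is an item in the list
has_nickel = 'Ni' in elements    # False because 'Ni' is not in the list

comp2 = 'Ni(H2O)6'
misleading = 'N' in comp2        # True because 'N' matches the N in 'Ni'

print(has_helium, has_nickel, misleading)
```

Here the formula contains no nitrogen, yet the test for 'N' returns True, so simple inclusion tests on formulas should be interpreted with care.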
1.5 Conditions
Conditions allow for the user to specify if and when certain lines or blocks of code are executed. Specifically, when a
condition is true, the block of indented code directly below runs. In the example below, if pH is greater than 7, the code
prints out the statements “The solution is basic” and “Neutralize with acid.”
if pH > 7:
    print('The solution is basic.')
    print('Neutralize with acid.')
1.5.1 if Statements
The if statement is a powerful way to control when a block of code is run. It is structured as shown below with the if
statement ending in a colon and the block of code below indented by four spaces. In the Jupyter notebook, hitting the
Tab key will also generate four spaces.
x = 7
if x > 5:
    y = x ** 2
    print(y)
49
If the Boolean statement is True at the top of the if statement, the code indented below will be run. If the statement is
False, Python skips the indented code as shown below.
x = 3
if x > 5:
    y = x ** 2
    print(y)
There are times when there is an alternative block of code that you will want to be run when the if statement evaluates
as False. This is accomplished using the else statement as shown below.
pH = 9
if pH == 7:
    print('The solution is neutral.')
else:
    print('The solution is not neutral.')
The solution is not neutral.
If pH does not equal 7, then anything indented below the else statement is executed.
There is an additional statement called the elif statement, short for “else if,” which is used to add extra conditions
below the first if statement. The block of code below an elif statement only runs if the if statement is False and
the elif statement is True. In the example below, if pH is equal to 7, the first indented block is run. Otherwise, if pH
is greater than 7, the second block is executed. In the event that the if and all elif statements are False, then the
else block is executed.
if pH == 7:
    print('The solution is neutral.')
elif pH > 7:
    print('The solution is basic.')
else:
    print('The solution is acidic.')
It is worth noting that else statements are not required with every if statement, and the last condition above could have
been elif pH < 7: and have accomplished the same result.
Up to this point, we have only been dealing with single values or strings. It is common to work with a collection of values
such as the average atomic masses of the chemical elements, but it is inconvenient to assign each value to its own variable.
Instead, the values can be placed in a list or tuple. Lists and tuples are both collections of elements, such as numbers or
strings, with the key difference that a list can be modified while a tuple cannot. A tuple is said to be immutable as it cannot
be changed once created. Not surprisingly, lists are often more useful than tuples.
A list is created by placing elements inside square brackets. Below, the list called mass is created containing the atomic
mass of the first six chemical elements.
mass
A single list can contain a variety of different types of objects. Below a list called EN is created to store the Pauling
electronegativity values for the first six elements on the periodic table. The list contains mostly floats, but being that the
value for He is unavailable in this example, an 'NA' string resides where a value would otherwise be.
EN
Indexing is used to access individual elements in a list, and this method is similar to indexing strings as demonstrated below.
The index is the position in the list of a given object, and again, the index numbering starts with zero. Accessing an
element of a list is done by placing the numerical index of the element we want in square brackets behind the list name.
For example, if we want the first element in the electronegativity list (EN), we use EN[0], while EN[1] provides the
second element and so on.
Tip
Variable names can be anything as long as they follow Python rules for variable names, but there are also a few
informal conventions. One convention is to use the lowercase letter i to hold an index value if you ever need to
store indices.
EN[0]
2.1
EN[1]
'NA'
Multiple elements can be retrieved at once by including the start and stop indices separated by a colon. Like in strings,
this process is known as slicing. A convention that occurs throughout Python is that the first index is included but the
second is not, [included : excluded : step].
EN[0:3]
EN[3:5]
[1.5, 2.0]
Just like in strings, if we want everything to the end, provide no stop index.
EN[3:]
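As a minimal sketch of the indexing and slicing described above, the block below assumes an EN list built from rounded Pauling values. The entries for H, He, Be, and B agree with the outputs shown (2.1, 'NA', and [1.5, 2.0]), while the values used for Li and C are assumptions.

```python
# Electronegativities for H through C; Li (1.0) and C (2.5) are assumed values
EN = [2.1, 'NA', 1.0, 1.5, 2.0, 2.5]

first_three = EN[0:3]    # indices 0, 1, 2 (index 3 is excluded)
middle = EN[3:5]         # indices 3 and 4
to_end = EN[3:]          # index 3 through the end of the list
every_other = EN[0:6:2]  # a step of 2 takes indices 0, 2, 4

print(first_three)
print(middle)
print(to_end)
print(every_other)
```

The optional third value in the brackets is the step, matching the [included : excluded : step] convention.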
Similar to strings, list objects also have a collection of methods (i.e., functions) for performing common tasks. Some of
the more common and useful list methods are presented in Table 10. All of these methods modify the original list
except copy(), count(), and index(), which only return information about the list. As is the case with methods, they
only work on the object type they are designed for, so list methods only work on lists.
Table 10 Common List Methods
Method                    Description
append(element)           Adds a single element to the end of the list
clear()                   Removes all elements from the list
copy()                    Creates an independent copy of the list
count(element)            Returns the number of times an element occurs in the list
extend(elements)          Adds multiple elements to the end of the list
index(element)            Returns the index of the first occurrence of element
insert(index, element)    Inserts the given element at the provided index
pop(index)                Removes and returns the element from a given index; if no index is provided, it defaults to the last element
remove(element)           Removes the first occurrence of element in the list
reverse()                 Reverses the order of the entire list
sort()                    Sorts the list in place
Below is a list containing the masses, in g/mol, of the first seven elements on the periodic table. They are clearly not in
order, so they can be sorted using the sort() method. Unlike the sorted() function (Table 2), the sort() method
modifies the original list.
mass.sort()
mass
mass.reverse()
mass
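The difference between the sorted() function and the sort() method can be sketched as follows. The starting order of the seven masses is an assumption made only for illustration.

```python
# Masses (g/mol) of H through N in an arbitrary, assumed order
mass = [6.94, 1.01, 14.01, 4.00, 12.01, 9.01, 10.81]

ascending = sorted(mass)   # returns a new list; mass is unchanged
unchanged = mass.copy()    # snapshot showing that mass was not modified

mass.sort()                # sorts mass in place and returns None
mass.reverse()             # reverses mass in place (now descending)

print(ascending)
print(mass)
```

Because sort() and reverse() return None, writing something like mass = mass.sort() would replace the list with None, which is a common mistake.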
Probably one of the most useful methods in Table 10 is the append() method. This is used for adding a single element
to a list. The extend() method is related but is used to add multiple elements to the list.
mass.append(16.00)
mass
mass.extend([19.00, 20.18])
mass
[14.01, 12.01, 10.81, 9.01, 6.94, 4.0, 1.01, 16.0, 19.0, 20.18]
If multiple elements are added using the append() method, it will result in a nested list… that is, a list inside the list as
demonstrated below.
mass.append([23.00, 24.31])
mass
[14.01, 12.01, 10.81, 9.01, 6.94, 4.0, 1.01, 16.0, 19.0, 20.18, [23.0, 24.31]]
There are times when this might be what we want, but probably not here.
Tip
The append() method is frequently used as a means of storing values in a list as they are generated like the following
calculation of the wavelengths in the Balmer series. The for loop is explained in section 1.7.1.
wavelengths = []
for n in range(3,6):
    wl = 1 / (1.097E-2 * (0.25 - 1/n**2))
    wavelengths.append(wl)
It is common to need a sequential series of values in a specific range. The user can manually type these values into a
list, but computer programming is about making the computer do the hard work for you. Python includes a function
called range() that will generate a series of values in the desired range. The range() function requires at least one
argument to tell it how high the range should be. For example, range(10) generates values up to and excluding 10.
a = range(10)
print(a)
range(0, 10)
The output of a is probably not what you expected. You were likely expecting a list from 0 → 9, which is what used
to happen back in the Python 2 days. Now, Python generates a range object that stands in the place of a list because it
requires less memory. If you want an actual list from it, just convert it using the list() function.
list(a)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
The range() function also takes additional arguments to further customize the range and spacing of values. A start and
stop position may be provided to the range() function as shown below. Consistent with indexing, the range includes
the start value and excludes the stop value.
list(range(3, 12))
Finally, a step size can also be included. The default step size is one, but it can be increased to any integer value including
negative numbers.
list(range(10, 3, -1))
[10, 9, 8, 7, 6, 5, 4]
While range objects may seem intimidating, they can often be used in place of a list. Just pretend the range object is really a
list. For example, you can index it like a list as shown below.
ten_nums = range(10)
ten_nums[2]
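A short sketch of which list-like operations a range object supports is given below. The variable names other than ten_nums are chosen here for illustration.

```python
ten_nums = range(10)

third = ten_nums[2]               # indexing works like a list
count = len(ten_nums)             # so does len()
has_seven = 7 in ten_nums         # and the in operator
stepped = list(range(3, 12, 3))   # start, stop (excluded), step

print(third, count, has_seven, stepped)

# Unlike a list, a range cannot be modified; convert it to a list first.
as_list = list(ten_nums)
as_list.append(10)
print(as_list)
```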
1.6.5 Tuples
Tuples are another object type similar to lists except that they are immutable - that is to say, they cannot be changed
once created. They look similar to a list except that they use parentheses instead of square brackets. So what use is
an unchangeable list-like object? There are times when you might want data inside your code, but you do not want to
accidentally change it. Think of it as something similar to locking a file on your computer to avoid accidentally making
modifications. While this feature is not strictly necessary, it may be a prudent practice in some situations in case you
make a mistake.
Below is a tuple containing the energy in joules of the first five hydrogen atomic orbitals. There is no need to change this
data in your code, so fixing it in a tuple makes sense. Indexing and slicing work exactly the same in tuples as they do in
strings and lists, so we can use this tuple to quickly calculate the energy difference between any pair of atomic orbitals.
nrg[1] - nrg[0]
1.635e-18
nrg[4] - nrg[3]
4.879999999999998e-20
That last output is worth commenting on. You may have noticed that the value returned by Python is not exactly what you
probably expected based on the precision of the values in the nrg tuple. This is because Python stores floats in binary with
finite precision (roughly 15 to 17 significant decimal digits), so this is merely a rounding error.
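As a sketch, the tuple below assumes energies calculated from E = -2.18e-18 J / n**2 and rounded to three significant figures, which is consistent with both differences shown above. It also illustrates formatting a result for display and what happens when code tries to change a tuple.

```python
# Assumed energies (J) for n = 1 to 5 from E = -2.18e-18 / n**2
nrg = (-2.18e-18, -5.45e-19, -2.42e-19, -1.36e-19, -8.72e-20)

diff = nrg[4] - nrg[3]
print(f'{diff:.3e}')      # format to a sensible number of digits for display

# Tuples are immutable, so item assignment raises a TypeError
try:
    nrg[0] = 0.0
except TypeError as error:
    message = str(error)
    print(message)
```

Formatting the value only changes how it is displayed; the stored float is unaffected.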
1.7 Loops
Loops allow programs to rerun the same block of code multiple times. This is important because there are often sections
of code that need to be run numerous times, sometimes extending into the thousands. If we needed to include a separate
copy of the same code for every time it is run, our scripts would be unreasonably large.
The for loop is probably the most common loop you will encounter. It is often used to iterate over a multi-element
object like lists or tuples, and for each element, the block of indented code below is executed. For example:
for value in [4, 6, 2]:
    print(2 * value)
8
12
4
During the for loop, each element in the list is assigned to the variable value and then the code below is run. Essentially,
what is happening is shown below.
value = 4
print(2 * value)
value = 6
print(2 * value)
value = 2
print(2 * value)
This allows us to perform mathematical operations on each element of a list or tuple. If we instead try multiplying the list
by two, we get a list of twice the length.
2 * [4, 6, 2]
[4, 6, 2, 4, 6, 2]
The for loop does not, however, modify the original list. If we want a list containing the squares of the values in a
previous list, we should first create an empty list and append the square values to the list.
squares
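A minimal sketch of this empty-list-and-append pattern is shown below, assuming the same three example values used earlier in this section.

```python
numbers = [4, 6, 2]      # assumed example values
squares = []             # start with an empty list
for value in numbers:
    squares.append(value**2)

print(squares)
print(numbers)           # the original list is unchanged
```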
We can also iterate over range objects and strings using for loops. Remember that range objects do not actually generate
a list, but we can often treat them as if they do. As an example, we can generate the wavelengths ($\lambda$) in the Balmer series
by the following equation where $R_\infty$ is the Rydberg constant ($1.097 \times 10^{-2}$ nm$^{-1}$) and $n_i$ is the initial principal quantum
number.
$$\frac{1}{\lambda} = R_\infty \left( \frac{1}{4} - \frac{1}{n_i^2} \right)$$
The code below generates the first five wavelengths (nm) in the Balmer series.
for n in range(3,8):
    lam = 1 / (1.097e-2 * (0.25 - (1 / n**2)))
    print(lam)
656.3354603463993
486.1744150714068
434.084299170899
410.2096627164995
397.04243897498225
for letter in 'LINUS':
    print(letter)
L
I
N
U
S
Another common use of for loops is to repeat a task a given number of times. It essentially acts as a counter. Imagine
we want to determine how much of a 183.2 g ²³⁵U sample would be left after six half-lives. We can halve the quantity
six times and print the result of each division. To accomplish this, we will have a for loop iterate over an object with a
length of six, executing the division and printing each mass. The easiest way to generate an iterable object of length six
is using the range() function.
U235 = 183.2
for x in range(6):
    U235 = U235 / 2
    print(str(U235) + ' g')
91.6 g
45.8 g
22.9 g
11.45 g
5.725 g
2.8625 g
In the above example, the value x from the range object is not used in the for loop. There is no rule that says it has to
be. Also, you may notice that the variable names in all the above examples keep changing. Just like in the rest of your
code, you are free to choose the name of the for loop variable. Some people like to use x as a generic variable, but it
is often best to give the for loop variable an intuitive name so that it is easy to follow as your code grows more complex.
The other common loop is the while loop. It is used to keep executing the indented block of code below until a stop
condition is satisfied. As an example, the indented block of code below the while statement is run until x is no longer
less than ten. The x < 10 is known as the termination condition, and it is checked each time before the indented code
is executed.
x = 0
while x < 10:
    print(x)
    x = x + 2  # increments by 2
0
2
4
6
8
Essentially, what is going on is shown in the following example, and this continues until x is no longer less than 10.
if x < 10:
    print(x)
    x = x + 2
if x < 10:
    print(x)
    x = x + 2
Tip
The while loop is not as common as the for loop and should be used with caution. This is because it is not difficult to
have what is known as a faulty termination condition resulting in the code executing indefinitely… or until you manually
stop Python or Python crashes because it ran out of memory. This happens because the termination condition is never
met resulting in a runaway process.
Warning
x = 0
while x != 10:
    x = x + 3
print('Done')
In the above code, the value is incremented until it reaches 10 (remember, != means “does not equal”), and then a “Done”
message is printed - at least that is the intention. No message is ever printed and the while loop keeps running. If we
do the math on the values for x, we find that in incrementing by three (0, 3, 6, 9, 12,…), the value for x never equals 10,
so the while loop never stops. For this reason, it is wise to avoid while loops unless you absolutely must use them. If
you do use a while loop, triple check your termination condition and avoid using == or != in your termination condition.
Instead, try to use <, <=, >, or >=. These are less likely to fail.
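As a sketch of this advice, the faulty loop above can be rewritten with an inequality so that it stops once x passes 10. The iteration cap max_steps is a hypothetical extra safeguard and is not part of the original example.

```python
x = 0
steps = 0
max_steps = 1000          # hypothetical safety cap, not in the original example

while x < 10 and steps < max_steps:
    x = x + 3
    steps = steps + 1

print('Done', x)
```

With the inequality, the loop ends after x reaches 12 even though x never equals 10 exactly.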
Other ways to control the flow of code execution are the continue, pass, and break commands. These are not used
heavily, but it is helpful to know about them on the occasions that you need them. Table 11 summarizes each of these
statements below.
Table 11 Loop Interruptions
Statement    Description
break        Breaks out of the immediately enclosing for/while loop
continue     Starts the next iteration of the immediately enclosing for/while loop
pass         No action; code continues on
The break statement breaks out of the most immediate containing loop. This is useful if you want to apply a condition
to completely stop the for or while loop early. For example, we can simulate the titration of 0.9 M NaOH with 1 mL
increments of 1.0 M HCl. In the code below, the initial volumes of NaOH and HCl are 35 mL and 0 mL, respectively.
The for loop successively checks to see if there are at least as many moles of HCl as NaOH (i.e., the equivalence point).
If not, the volume of HCl is incremented by one milliliter.
vol_OH = 35
vol_H = 0
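A minimal sketch of such a titration loop is given below, using the concentrations and starting volumes stated in the text. The variable names conc_OH, conc_H, and addition, the cap of 100 additions, and the printed message are assumptions.

```python
conc_OH = 0.9   # M NaOH
conc_H = 1.0    # M HCl
vol_OH = 35     # mL NaOH
vol_H = 0       # mL HCl added so far

for addition in range(100):        # assumed cap of 100 additions
    mmol_OH = conc_OH * vol_OH     # M x mL = mmol
    mmol_H = conc_H * vol_H
    if mmol_H >= mmol_OH:
        print(f'Endpoint reached at {vol_H} mL of HCl')
        break                      # stop the loop at the equivalence point
    vol_H = vol_H + 1              # add another 1 mL of HCl
```

The break statement ends the loop as soon as the moles of acid match or exceed the moles of base, so the remaining iterations never run.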
If we solve this titration using the C₁V₁ = C₂V₂ equation where C is concentration and V is volume, we expect an endpoint
of 31.5 mL of HCl (0.9 M × 35 mL / 1.0 M), so a simulated endpoint of 32 mL makes sense. The above simulation can also be written as a while
loop. A break statement can often be avoided through other methods, but it is good to be able to use one for instances
where you really need it.
The continue statement is similar to the break except that instead of completely stopping a loop, it stops only the
current iteration of the loop and immediately starts the next cycle. The script below takes the square root of even numbers
only. The check number % 2 == 1 tests whether a number is odd. If this is True, the
continue statement causes the for loop to skip ahead to the next number.
import math

numbers = [1, 2, 3, 4, 5, 6, 7]
for number in numbers:
    if number % 2 == 1:
        continue
    print(math.sqrt(number))
1.4142135623730951
2.0
2.449489742783178
Finally, the pass statement does nothing. Seriously. It is merely a placeholder for code that you have not yet written by
telling the Python interpreter to continue on. No completed code should contain a pass statement. The reason for using
one is to be able to run and test code without errors occurring due to missing parts. If the following code is executed, an
error will occur because there is nothing below the else statement.
pH = 5
if pH > 7:
    print('Basic')
else:
However, if we add a pass statement, no error occurs allowing us to see if the code works, aside from the missing part.
pH = 5
if pH > 7:
    print('Basic')
else:
    pass
Up to this point, we have only been dealing with computer-generated and manually typed values, strings, lists, and tuples.
In research and laboratory environments, we often need to work with data stored in a file. These files may be generated
from an instrument or as the result of humans typing values into a spreadsheet as they take measurements or make
observations. There are two general categories of data files: text and binary files. Text files are those that, when opened
by a text editor, can be read by humans, while binary files cannot. The reading of binary files requires other specialized
software, such as demonstrated in chapter 12, and text files are very common for storing data, so we will focus only on
text files here.
There are a large variety of text files which differ simply by the way in which the information is formatted in the file.
Common examples include comma separated values (CSV), Protein Data Bank (PDB), and xyz coordinates (XYZ). These
files have different extensions (i.e., those 3-4 letters after the period at the end of a file name), but they are all just text
files. You can change the extension to .txt if you like and open them in any text editor or word processor. The .csv, .pdb,
and .xyz are simply tags to help your computer decide which software application can and should open the file.
We will focus on the CSV file format as it is extremely common, and many software applications can export data in
the CSV format. Comma separated value files are a way of encoding information that might otherwise be stored in
a spreadsheet, and spreadsheet applications are able to easily read and write CSV files. Each line of the text file is a
different row, and each item in a row is separated by commas… hence the name. Below are the contents of a CSV file
and how it would look in a spreadsheet. In some files, you may see a \n at the end of each line. This is a line terminator
character telling some software applications where a line ends.
Tip
Next time you collect data in the lab, see what other file formats the software/instrument can save/export the data
as. Odds are good that it can save it as a CSV file.
The first method we will cover for reading text files is the native Python method of reading the lines of the text file one at
a time. This method requires a little more effort than the other methods in this book, but it also offers much more control.
There are three general steps for this approach: open the file, read each line one at a time, and close the file. Opening
the file is performed with the open() function. Be sure to attach the file to a variable to be accessed later. Next, the
lines are retrieved using the readlines() method, which returns a list containing each line of the file as a string. Being
that we need to do the same task for every line, we will use a for loop. Finally, it is good practice to close the file using
the close() method. This process is demonstrated below in opening the data shown above in a file called [Link].
Note
Unless otherwise indicated, Python searches for the file in the same directory (folder) as the Jupyter notebook. If
the file is not in this directory, be sure to provide a path to the file. For more advanced techniques on navigating
your file system, see section 2.4.1.
Note
One major difference between running the software installed on your own computer and Google Colab is that if you
want Colab to be able to interact with data or image files on your Google Drive, you need to include the three extra
lines of code shown below at the top of your notebook. The first two lines grant the notebook access to read/write
files on your Google Drive while the third line (%cd /content/drive/My Drive/project) points your
notebook to where your files are located. The path should reflect the location of the folder containing your notebook
and data files. For example, if your notebook is contained in a folder titled project on Google Drive, the path will
be /content/drive/My Drive/project.
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/project
file = open('data/[Link]')
for line in file.readlines():
    print(line)
file.close()
1,1
2,4
3,9
4,16
5,25
6,36
7,49
8,64
9,81
10,100
It worked! The above code reads each line and prints the contents. Of course, this is not particularly useful in this form.
It would be much more useful in lists. We can fix this by creating a couple of empty lists and appending the values to the
lists as the file is read.
file = open('data/[Link]')
numbers = []
squares = []
file.close()
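One way the loop between opening and closing the file might look is sketched below. To keep the example self-contained, it first writes a small stand-in file to a temporary folder; the file name squares.csv, the variable values, and the use of strip(), split(), and int() are assumptions.

```python
import os
import tempfile

# Write a small stand-in for the data file (hypothetical name squares.csv)
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'squares.csv')
file = open(path, 'w')
file.write('1,1\n2,4\n3,9\n4,16\n5,25\n')
file.close()

file = open(path)
numbers = []
squares = []
for line in file.readlines():
    values = line.strip().split(',')   # '2,4\n' becomes ['2', '4']
    numbers.append(int(values[0]))     # convert each string to an integer
    squares.append(int(values[1]))
file.close()

print(numbers)
print(squares)
```

Each line is read as a string, so it needs to be split at the comma and converted to numbers before it can be used in calculations.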
Now the values are in two separate lists. The first values are in the numbers list and the squares of the numbers are in
the squares list.
numbers
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
squares
While the above methods work fine, it is considered best practice to read a file inside a context so that even if an error
occurs, the file will still be closed properly. This is done as shown below using a with statement. There is no need to
explicitly close the file because it is done automatically.
1,1
2,4
3,9
4,16
5,25
6,36
7,49
8,64
9,81
10,100
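A minimal sketch of reading a file inside a with statement is shown below. It assumes a short stand-in file written to a temporary folder (the name squares.csv is hypothetical) rather than the data file used above.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'squares.csv')   # hypothetical file name

with open(path, 'w') as file:
    file.write('1,1\n2,4\n3,9\n')

lines = []
with open(path) as file:
    for line in file:                 # a file object can be iterated directly
        lines.append(line.strip())    # strip() removes the trailing \n
        print(line.strip())

print(file.closed)   # the file is closed once the with block ends
```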
Python can also write data to a file using the write() method, which takes a string and writes it to a file. Before this
can be done, the file needs to be opened using the open() function, which requires the name of the file to write to. There
is an optional second argument for the open() function that sets the mode in which the file is opened. There are a number
of modes, but common modes include 'w' for write-only mode, 'r' for read-only mode, and 'a' for append mode. In write
mode, a new file is created if one with this name does not already exist, and an existing file is overwritten. Append mode
adds any new text to the end of an already-existing file.
In the example below, a list, angular, containing nested lists of angular quantum numbers and shapes is written to
a new file. Following each nested list (i.e., angular quantum number and shape pair) is a line terminator character \n.
Because the following code opens the file in a context using a with statement, there is no need to explicitly close the file
as this is done automatically.
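A sketch of writing such a nested list to a file is given below. The contents of angular (subshell letters standing in for the shapes), the file name angular.csv, and the temporary folder are all assumptions made for illustration.

```python
import os
import tempfile

# Assumed contents: angular quantum number and a label for the orbital shape
angular = [[0, 's'], [1, 'p'], [2, 'd'], [3, 'f']]

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'angular.csv')   # hypothetical file name

with open(path, 'w') as file:
    for number, shape in angular:
        file.write(f'{number},{shape}\n')    # write() only accepts strings

with open(path) as file:
    contents = file.read()
print(contents)
```

Reading the file back at the end is only a check that each pair landed on its own line.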
The second approach to reading data from files uses a function from the NumPy library called genfromtxt(). NumPy
will not be covered in depth until chapter 4, but we can still use a couple of functions before then. Before using NumPy,
we need to import it using import numpy as np, which can be thought of as activating the library. For reading a delimited
text file, the np.genfromtxt() function is given two arguments: the file name and the delimiter.
np.genfromtxt('file_name', delimiter='')
The delimiter is the symbol that separates values in each row and can be almost any symbol including spaces or tabs. If
you encounter tab-separated data, use delimiter='\t', and for comma separated values (CSV) files, use
delimiter=','.
import numpy as np
The output of this function is something called a NumPy array. It is similar to a list except more powerful. You will learn
to use these in chapter 4, but for now, just treat it as a list. If we want to know the square of 5, we can access that value
using indexing. In the example below, the first index (4) identifies the fifth nested list inside the main list, and the second
index (1) indicates the second value inside that list.
file[4][1]
np.float64(25.0)
Another feature of the np.genfromtxt() function is the skip_header= optional argument. It instructs the function
to disregard a given number of rows at the top of the file. This is helpful because files often include non-data
headers providing details like the instrument, date, time, and other details about the data. A data file may look like this.
July 7, 2017
number, square
1, 1
2, 4
3, 9
4, 16
5, 25
6, 36
7, 49
8, 64
9, 81
10, 100
In this case, we need the function to skip the first two rows as follows.
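A minimal sketch of skipping the two header rows is shown below. To stay self-contained, it assumes the first few lines of the file above are supplied through a StringIO object in place of a file on disk; with a real file, the file name would be given as the first argument instead.

```python
from io import StringIO
import numpy as np

# Stand-in for the file shown above; np.genfromtxt() also accepts a file name
text = 'July 7, 2017\nnumber, square\n1, 1\n2, 4\n3, 9\n4, 16\n5, 25\n'

data = np.genfromtxt(StringIO(text), delimiter=',', skip_header=2)
print(data.shape)     # rows and columns of numerical data
print(data[4][1])     # second value in the fifth row
```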
Note
NumPy has a similar function to np.genfromtxt() called np.loadtxt() that you may see around. Both
functions are similar except that np.genfromtxt() can also read files that have missing data while
np.loadtxt() cannot.
One of the easiest approaches to writing data back to a file is to again use a NumPy function, np.savetxt(), which
requires both a file name as a string and the data. It is also recommended to include a delimiter as a string using the
delimiter= keyword argument. This function can write a file from a list, tuple, or NumPy array (introduced in section
4.1), and if a list or tuple is nested, each inner list/tuple is a row in the written file.
As an example, below is a nested list of temperatures (°C) and the density of water at each temperature (g/mL). These
data are saved to a file water_density.csv with each value separated by a comma.
Note
For more information on loading data with NumPy and handling missing data, see section 4.6. Additional tools
for reading/writing data are also discussed in section 5.2.
# temp(C), density(g/mL)
H2O_dens = [[10, 0.999], [20, 0.998], [30, 0.996],
[40, 0.992], [60, 0.983], [80, 0.972]]
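A sketch of the call that writes these data is given below. Saving to a temporary folder and the fmt='%g' argument (which keeps the written values compact) are choices made for this example, and reading the file back is only a check.

```python
import os
import tempfile
import numpy as np

# temp(C), density(g/mL)
H2O_dens = [[10, 0.999], [20, 0.998], [30, 0.996],
            [40, 0.992], [60, 0.983], [80, 0.972]]

folder = tempfile.mkdtemp()   # temporary folder used only for this sketch
path = os.path.join(folder, 'water_density.csv')

# fmt='%g' is an optional choice here to keep the written values compact
np.savetxt(path, H2O_dens, delimiter=',', fmt='%g')

check = np.genfromtxt(path, delimiter=',')
print(check)
```

Without the fmt= argument, np.savetxt() writes every value in scientific notation with many digits, which is still valid but harder to read.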
After you have been programming for a while, you will likely find yourself repeating the same tasks. For example, let
us say your research has you repeatedly calculating the distance between two atoms based on their xyz coordinates. You
certainly could rewrite or copy-and-paste the same code every time you need to find the distance between two atoms, but
that sounds horrible. You can avoid this by creating your own function that calculates the distance. This way, every time
you need to calculate the distance between a pair of atoms, you can call the function and the same section of code located
in the function is executed. You only have to write the code once and then you can execute it as many times as you need
whenever you need.
To create your own function, you first need a name for the function. The name should be descriptive of what the function
does and make sense to you and anyone else who would use it. If we want to create a function to measure the distance
between two atoms, distance might be a good name for the function.
The first line of a function definition looks like the following: the def statement followed by the name of the function
with whatever information, called arguments, that is fed into the function, and a colon at the end. In this function, we
will feed it the xyz coordinates for both atoms as either a pair of lists or tuples. In the parentheses following the function
name, place variable names you want to use to represent these coordinates. We will use coords1 and coords2 here.
Everything inside a function is indented four spaces directly below the first line. The distance between two points in 3D
space is described by the following equation.
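Written out for two points with coordinates (x₁, y₁, z₁) and (x₂, y₂, z₂), where the subscript labels are a notational assumption, the standard Euclidean distance d is:

```latex
d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}
```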
It is now a matter of coding this into the function. Being that we will take the square root, we also need to import the
math module.
import math
If you run the above code, nothing seems to happen. This is because you defined the function but never actually used it.
Calling our new function is done the same way as any other function in Python.
It works! This function prints out a message stating the distance between the two xyz coordinates, and the better part is
that we can use this over and over again without having to deal with the function code.
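A minimal sketch of what such a printing function might look like is given below. The intermediate names dx, dy, and dz, the wording of the message, and the example coordinates are assumptions.

```python
import math

def distance(coords1, coords2):
    # Differences along each axis
    dx = coords2[0] - coords1[0]
    dy = coords2[1] - coords1[1]
    dz = coords2[2] - coords1[2]
    d = math.sqrt(dx**2 + dy**2 + dz**2)
    print(f'The distance is: {d}')   # assumed message wording

# Calling the function runs the indented code with these coordinates
distance((1, 2, 3), (4, 5, 6))
```

Because this version only prints, calling it gives back None, which is the limitation the return statement addresses.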
The distance() function prints out a value for the distance, but what happens if we want to use this value for a
subsequent calculation? Perhaps we want to calculate the average of the distances between multiple pairs of atoms. We
certainly do not want to retype these values back into Python, so instead we can have the function return the value. You
can think of functions as little machines where the arguments in the parentheses are the input and the return at the end
of the function is what comes out of the machine. Below is a modified version of our distance() function with a
return statement instead of printing the value. By running the following code, it overwrites the original function.
return d
7.483314773547883
Now the function returns a float. We can assign this to a variable or append it to a list for later use.
dist = distance([5, 6, 7], [3, 2, 1])
dist
7.483314773547883
Below is code for iterating over a list of xyz coordinate pairs and calculating the distances between each pair. The values
are appended to a list called dist_list from which the average distance is calculated.
pairs = (((1, 2, 3), (2, 3, 4)),
         ((3, 7, 1), (9, 3, 0)),
         ((0, 0, 1), (5, 2, 7)))
dist_list = []
for pair in pairs:
    dist = distance(pair[0], pair[1])
    dist_list.append(dist)
5.691472815049315
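Putting the pieces together, a self-contained sketch of a returning distance() function and the averaging step is shown below. The body of the function and the use of sum() and len() to average are assumptions about how this could be done.

```python
import math

def distance(coords1, coords2):
    dx = coords2[0] - coords1[0]
    dy = coords2[1] - coords1[1]
    dz = coords2[2] - coords1[2]
    d = math.sqrt(dx**2 + dy**2 + dz**2)
    return d                     # hand the value back to the caller

pairs = (((1, 2, 3), (2, 3, 4)),
         ((3, 7, 1), (9, 3, 0)),
         ((0, 0, 1), (5, 2, 7)))

dist_list = []
for pair in pairs:
    dist_list.append(distance(pair[0], pair[1]))

average = sum(dist_list) / len(dist_list)   # assumed way of averaging
print(average)
```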
Another advantage of using functions is that they maintain variables in a local scope. That is, any variable created inside
a function is not accessible outside the function. If you look back at our distance() function, the variable d is only
used inside the function. If we try to see what is attached to d, we get the following error message.
d
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[149], line 1
----> 1 d
NameError: name 'd' is not defined
This is because the variable d can only be used or accessed inside the distance() function. This is often very
convenient because we do not have to worry about overwriting a variable or using it twice. This means that if a collaborator
sends you a function that they wrote, you do not need to be concerned if a variable in your code is the same as one in
your collaborator’s function. The function is self-contained making everything a lot simpler.
The obvious downside to variables being in a local scope inside a function is that you cannot access them directly. If
you really need to access a variable in a function, place it in the return statement at the end of the function so that the
function outputs the contents. Alternatively, you can also assign the contents of a variable inside a function to a variable
that was created outside the function. For example, a function can append values to a list created outside of the function,
shown below, and the list can be viewed anywhere. This works because anything that is created outside of the function is
visible everywhere and is said to have a global scope.
import math

def roots(numbers):
    for number in numbers:
        value = math.sqrt(number)
        square_roots.append(value)

square_roots = []
roots(range(10))
square_roots
[0.0,
1.0,
1.4142135623730951,
1.7320508075688772,
2.0,
2.23606797749979,
2.449489742783178,
2.6457513110645907,
2.8284271247461903,
3.0]
1.9.4 Arguments
Functions take in data through arguments placed in the parentheses after the function name. Different functions take
different numbers and types of arguments from as few as zero to potentially dozens of arguments. Function arguments
are also sometimes optional. Some functions allow the user to add extra data or change the function’s behavior through
arguments.
The first type of argument is a positional argument. This is an argument that is required to be in a specific position inside
the parentheses. For example, the function below takes in the number of protons and neutrons, respectively, and outputs
the isotope name. This function is only written for the first ten elements on the periodic table.
print(f'{mass}{symbol}')
If we want to know which isotope contains six protons and seven neutrons, we input the values as isotope(6, 7) and get
13C as expected. However, if we switch the arguments to isotope(7,6), we get 13N, which is not correct. Positional
arguments are extremely common, but the user needs to know what information goes where when calling a function.
isotope(6, 7)
13C
isotope(7, 6)
13N
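A sketch of how a positional-argument version of isotope() might be written is given below. Looking up the symbol in a tuple called elements is an assumption; only the final print() line comes from the function shown above.

```python
def isotope(protons, neutrons):
    # Symbols for elements 1 through 10; other inputs are not handled
    elements = ('H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne')
    symbol = elements[protons - 1]   # indices start at zero
    mass = protons + neutrons        # mass number
    print(f'{mass}{symbol}')

isotope(6, 7)   # six protons, seven neutrons
isotope(7, 6)   # arguments swapped: seven protons, six neutrons
```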
The other common type of argument is the keyword argument. These arguments are attached to a variable inside the
parentheses. The advantage of a keyword argument is that the user does not need to be concerned about argument order
as long as the arguments have the proper labels. Below is the same isotope() function redefined using keyword
arguments.
print(f'{mass}{symbol}')
isotope(protons=1, neutrons=2)
3H
isotope(neutrons=2, protons=1)
3H
Another advantage of a keyword argument is that a default value can be easily coded in the function. Look up at the most
recent version of the isotope() function and you will notice that protons was assigned to 1 and neutrons was
assigned to 0 in the function definition. These are the default values. If we call the function without inputting either or
both of these values, the function will assume those values.
isotope()
1H
isotope(neutrons=2)
3H
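As a rough sketch of how the defaults described above could be written, the version below assigns protons=1 and neutrons=0 in the definition. The function body and symbol list are assumptions, and this sketch returns the label instead of printing it so that the result can be reused.

```python
def isotope(protons=1, neutrons=0):
    """Return the isotope name; defaults describe hydrogen-1."""
    # Assumed lookup list covering only the first ten elements
    symbols = ['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne']
    mass = protons + neutrons
    return f'{mass}{symbols[protons - 1]}'

print(isotope())                        # both defaults are used
print(isotope(neutrons=2))              # protons falls back to 1
print(isotope(neutrons=7, protons=6))   # keyword order does not matter
```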
Functions can also take an indeterminate number of positional or keyword arguments, but this is less common and is
covered in section 2.7 as an optional topic for those who are interested.
1.9.5 Docstrings
The final component of a function is the docstring. Strictly speaking, this is not necessary for a function to work and is
sometimes left out for simple functions, but it is a good habit to include them. This is especially true if you are creating
the function for a much larger project or passing it to other people. A docstring is a string placed at the top of a function
definition describing what the function does, what types of data it takes, and what is returned at the end of the function.
Traditionally, docstrings are enclosed in triple quotes. The first line of the docstring describes what type of data goes in
the function and what comes out. In the distance() function above, our function takes in a pair of lists or tuples and
outputs a single value, so the first line may look something like this.
The subsequent lines in the docstring can include other information such as more complete descriptions of what the
function does and even short examples.
return d
Once a docstring is created, it can be accessed by typing the function name, complete with parentheses, and leaving the
cursor in the parentheses. Then hit Shift + Tab to see the docstring. This trick works with any function in this book.
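The docstring example itself is not reproduced above. The sketch below shows one way a distance() function with a docstring might be laid out; the parameter names and the wording of the docstring are assumptions. Outside of Jupyter, the same text can be read with the built-in help() function or the __doc__ attribute.

```python
import math

def distance(coords1, coords2):
    """(list/tuple, list/tuple) -> float

    Return the distance between two points in an xy-plane.

    >>> distance((0, 0), (3, 4))
    5.0
    """
    dx = coords2[0] - coords1[0]
    dy = coords2[1] - coords1[1]
    d = math.sqrt(dx**2 + dy**2)
    return d

print(distance.__doc__)   # the docstring is stored on the function itself
```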
Further Reading
There are a plethora of books and resources, free and otherwise, available on the Python programming language. Below
are multiple examples. The most authoritative and up-to-date resource is the Python Software Foundation’s documentation
page also listed below.
1. Python Documentation Page. [Link] (free resource)
2. Downey, Allen B. Think Python, Green Tea Press, 2012. [Link] (free resource)
3. Reitz, K.; Schlusser, T. The Hitchhiker’s Guide to Python: Best Practices for Development, O’Reilly: Sebastopol, CA, 2016.
4. Das, U.; Lawson, A.; Mayfield, C.; Norouzi, N.; Rajasekhar, Y.; Kanemaru, R. Introduction to Python Programming, OpenStax: Houston, TX, 2024. [Link] (free resource)
Exercises
Complete the following exercises in a Jupyter notebook. Any data file(s) referred to in the problems can be found in the
data folder in the same directory as this chapter’s Jupyter notebook. Alternatively, you can download a zip file of the data
for this chapter from here by selecting the appropriate chapter file and then clicking the Download button.
1. A 1.6285 L (𝑉) flask contains 1.220 moles (𝑛) of ideal gas at 273.0 K (𝑇). Calculate the pressure (𝑃) for the above
system by assigning all values to variables and performing the mathematical operations on the variables. Remember
that 𝑃𝑉 = 𝑛𝑅𝑇 describes the relationship between 𝑉, 𝑛, 𝑃, and 𝑇, where 𝑅 is 0.08206 L·atm/(mol·K).
2. Calculate the distance of point (23, 81) from the origin on an xy-plane first using the [Link]() function
and then by the following distance equation.
$$d = \sqrt{(\Delta x)^2 + (\Delta y)^2}$$
4. Solve the quadratic equation using the quadratic formula below for a = 1, b = 2, and c = 1.
$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$
5. Create the following variable elements = 'NaKBrClNOUP' and slice it to obtain the following strings.
elements = 'NaKBrClNOUP'
a. NaK
b. UP
c. KBr
d. NKrlOP
6. A single bond consists of a sigma bond while a double bond includes a sigma plus a pi bond. The following
strings contain the bond energies (kJ/mol) for a typical C-C single bond and C=C double bond. Perform a mathematical
operation on CC_single and CC_double to estimate how much energy a pi bond contributes to a C=C
double bond.
CC_single = "345"
CC_double = "611"
9. The following are the atomic numbers of lithium, carbon, and sodium. Assign each to a variable and use Python
Boolean logic operators to evaluate each of the following.
Li, C, Na = 3, 6, 11
a) Is Li greater than C?
b) Is Na less than or equal to C?
c) Is either Li or Na greater than C?
d) Are both C and Na greater than Li?
10. Write a Python script that can take in any of the following molecular formulas as a string and print out whether
the compound is an acidic, basic, or neutral compound when dissolved in water. The script should not contain
pre-sorted lists of compounds but rather determine the class of molecule based on the formula. Hint: first look for
patterns in the acid and base formulas in the following collection.
11. Write a Python script that takes in the number of electrons and protons and determines if a compound is cationic,
anionic, or neutral.
12. Create a list of even numbers from 18 → 88 including 88. Using list methods, perform the following transformations
in order on the same list:
a) Reverse the list
b) Remove the last value (i.e., 18)
c) Append 16
13. In a Jupyter notebook:
a) Create a tuple of even numbers from 18 → 320 including 320.
b) Can you reverse, remove, or append values to the tuple?
14. The following code generates a random list of integers from 0 → 20 (section 2.4.3 will cover this in more detail).
Run the code and test to see if 7 is in the list. Hint: section 1.4.5 may be helpful.
import random
nums = [[Link](0,20) for x in range(10)]
import math
with open('[Link]', 'a') as file:
    file.write('time, [A] \n')
21. Using Python’s native open() and readlines() functions, open the [Link] file and print each line.
22. Using [Link](), read the [Link] file and append the time values to one list and the concentration
values to a second list. You will need to skip a line in the file.
23. Write and test a function, complete with docstring, that solves the Ideal Gas Law for pressure when provided with
volume, temperature, and moles of gas (R = 0.08206 L·atm/mol·K) with the following stipulations.
a) Create one version of the function that takes only positional arguments.
b) Create a second copy of the function that takes only keyword arguments. Try testing this function with positional
arguments. Does it still work?
24. Complete the function started below, which calculates the rate of a single-step chemical reaction nA → P using the
differential rate law (Rate = k[A]^n).
25. DNA is composed of two strands of the nucleotides adenine (A), thymine (T), guanine (G), and cytosine (C). The
two strands are lined up with adenine always opposite of thymine and guanine opposite cytosine. For example, if
one strand is ATGGC, then the opposite strand is TACCG. Write a function that takes in a DNA strand as a string
and prints the opposite DNA strand of nucleotides.
CHAPTER 2: INTERMEDIATE PYTHON
This chapter is intended for those who wish to dive deeper into the Python programming language. Many of the topics
herein are not strictly required for most subsequent chapters but will make you more efficient and effective as a Python
programmer. The contents from this chapter are occasionally used in subsequent chapters, but you should still be able to
follow along in most places without having read this chapter. If you are in a rush, you can bypass this chapter and circle
back as needed. The sections and sometimes subsections of this chapter may also be read in any order.
Syntactic sugar is a nickname given to any part of a programming language that does not extend the capabilities of the
language. If any of these features were suddenly removed from the language, the language would still be just as capable,
but the advantage of anything labeled “syntactic sugar” is that it makes the code quicker/shorter to write or easier to read.
Below are a few examples from the Python language that you are likely to come across and find useful.
Augmented assignment is a simple example of syntactic sugar that allows the user to modify the value assigned to a
variable. If we want to increase a value by one, we can reassign the variable to itself plus one as shown below.
x = 5
x = x + 1
x
This is certainly not difficult, but it does involve typing the variable more than once, which becomes less desirable as your
variable names get longer. As an alternative, we can use the augmented assignment shown below to accomplish the
same task. The += operator adds the value on its right to the variable and assigns the result back to that variable.
x += 1
x
Augmented assignment can also be used with addition, subtraction, multiplication, and division as shown in Table 1.
Table 1 Augmented Assignment
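As a brief illustration of the four operators named above, the following sketch applies each one in turn to the same variable. The starting value is arbitrary.

```python
x = 5
x += 1    # same as x = x + 1, so x is 6
x -= 2    # same as x = x - 2, so x is 4
x *= 3    # same as x = x * 3, so x is 12
x /= 4    # same as x = x / 4, so x is 3.0 (division returns a float)
print(x)
```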
At this point, you may have noticed that it is fairly common to generate a list populated with a series of numbers. If the
values are evenly spaced integers, simply use the range() function and convert the result to a list using list(). In all other
scenarios, you will need to create an empty list, use a for loop to calculate the values, and append the values to the list
as they are generated. Below is an example of generating a list of squares of all integers from 0 → 9 using this method.
squares = []
for integer in range(10):
    sqr = integer**2
    squares.append(sqr)
squares
This whole process can be condensed down into a single line using list comprehension demonstrated below.
squares = [integer**2 for integer in range(10)]
squares
To help you visualize where each part comes from, below are both methods again but with common sections in the same
colors.
List comprehension can take a little time to get used to, but it is well worth it. It saves both time and space and makes the
code less cluttered.
Note
In addition to list comprehension, there are the related dictionary comprehension (shown below) and set comprehension
that can be used for dictionary and set objects introduced in the following two sections.
{n: 2*n**2 for n in range(5)}
{0: 0, 1: 2, 2: 8, 3: 18, 4: 32}
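A set comprehension uses the same curly brackets as a dictionary comprehension but without the key: prefix. The short sketch below places the two side by side; the variable names and the choice of expressions are only illustrative.

```python
# Dictionary comprehension: each item is a key: value pair
squares_by_n = {n: n**2 for n in range(5)}

# Set comprehension: a single expression, and duplicates are dropped
last_digits = {n**2 % 10 for n in range(10)}

print(squares_by_n)
print(sorted(last_digits))   # sorted only to give a predictable display order
```

Ten squares are calculated in the set comprehension, but only six distinct last digits survive because a set stores each value once.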
At the beginning of a program or calculations, it is often necessary to populate a series of variables with values. Each
variable may get its own line in the code, and if there are numerous variables, this can clutter your code. An alternative
is to assign multiple variables in the same assignment as shown below with atomic masses of the first three elements.
1.01
Each variable is assigned to the respective value. This is known as tuple unpacking as H, He, Li and 1.01, 4.00, 6.94
are automatically turned into tuples by Python (behind the scenes) as demonstrated below.
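The assignment itself is not reproduced above, so the following is a minimal sketch of what it might look like, using the atomic masses quoted elsewhere in this chapter. It also shows that comma-separated values without parentheses form a tuple, which is what makes the unpacking possible.

```python
# Three variables assigned in a single statement
H, He, Li = 1.01, 4.00, 6.94
print(H)

# Comma-separated values are packed into a tuple automatically
masses = 1.01, 4.00, 6.94
print(type(masses))

# The same mechanism allows two variables to be swapped in one line
a, b = 1, 2
a, b = b, a
print(a, b)
```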
The lambda function is an anonymous function for generating simple Python functions. Their value is that they can be
used to generate functions in fewer lines of code than the standard def statement, and they do not necessarily need to be
assigned to a variable, hence the anonymous part. This is useful in applications that require a Python function but the user
does not want to clutter the namespace by assigning it to a variable or take the time to define a function normally. The
lambda function is defined as shown below with the variable immediately after the lambda statement as the independent
variable in the function. In other words, the variable to the left of the : is the variable that goes in the parentheses in a
normal function definition, and everything to the right of the : is what is indented in a normal function definition.
lambda x: x**2
<function __main__.<lambda>(x)>
Being that it is not attached to a variable, it needs to be used immediately. Alternatively, it can be attached to a variable
as shown below and then operates like any other Python function.
f = lambda x: x**2
f(9)
81
As an example looking ahead to chapter 8, the quad() function from the scipy.integrate module is a general-purpose
method for integrating the area under mathematical functions. Along with the upper and lower limits, the
quad() function requires a mathematical function in the form of a Python function (i.e., not just a mathematical
expression). This would ordinarily require a formally defined Python function, but it is often more convenient to use a
lambda function as a single-use Python function as shown below. In the following example, we use integration to find the
probability of finding a particle in the lowest state between 0 and 0.4 in a box of length 1 by performing the following
integration.
$$p = 2\int_0^{0.4} \sin^2(\pi x)\,dx$$
(0.30645107162113616, 3.402290356348383e-15)
The first value in the returned tuple is the result of the integration, and the second value is the estimated uncertainty.
Therefore, the particle has about a 30.6% probability of being found in the region of 0 → 0.4. Performing this same
calculation by defining the function with def is shown below. This requires more lines of code than a lambda expression.
def particle_box(x):
    return 2 * [Link]([Link] * x)**2
quad(particle_box, 0, 0.4)
(0.30645107162113616, 3.402290356348383e-15)
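To illustrate the general idea of handing a lambda function to another function without relying on SciPy, the sketch below uses a simple midpoint-rule integrator. The midpoint_integral() helper is a hypothetical stand-in written for this example and is not how quad() works internally; with SciPy installed, the same lambda expression could be passed to quad() in place of particle_box. The result is compared with the analytical solution of the integral, 0.4 − sin(0.8π)/(2π).

```python
import math

def midpoint_integral(f, a, b, n=10_000):
    """Approximate the integral of f from a to b using n midpoint rectangles."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# The lambda function is created and used in a single line, never named
p = midpoint_integral(lambda x: 2 * math.sin(math.pi * x)**2, 0, 0.4)

# Analytical result: the antiderivative is x - sin(2*pi*x) / (2*pi)
exact = 0.4 - math.sin(0.8 * math.pi) / (2 * math.pi)

print(p, exact)
```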
2.2 Dictionaries
Python dictionaries are a multi-element Python object type that connects keys and values analogous to the way a real
dictionary connects a word (the key) with a definition (the value). These are also known as associative arrays. Dictionaries
allow the user to access the stored values using a key without knowing anything about the order of items in the dictionary.
One way to think of a dictionary is as an object full of variables and assigned values. For example, if we are looking to
write a script to calculate the molecular weight of a compound based on its molecular formula, we would need access to
the atomic mass of each element based on the elemental symbol. Here the key is the symbol and the value is the atomic
mass. It looks something like a list with curly brackets and each item is a key:value pair separated by a colon. Below
is an example of a dictionary containing the atomic masses of the first ten elements on the periodic table.
AM = {'H':1.01, 'He':4.00, 'Li':6.94, 'Be':9.01,
'B':10.81, 'C':12.01, 'N':14.01, 'O':16.00,
'F':19.00, 'Ne':20.18}
With the dictionary in hand, we can access the mass of any element in it using the atomic symbol as the key.
AM['Li']
6.94
Even though it is traditional to call them key:value pairs, the value does not need to be a numerical value. It can also be
a string or any other object type, and the key can be any immutable (hashable) object such as a string, number, or tuple.
If you ever find yourself with a dictionary and not knowing the keys, you can find out using the keys() dictionary
method.
AM.keys()
dict_keys(['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne'])
We can also get a look at the key:value pairs using the items() method or iterate over the dictionary to get access to
keys, values, or both.
AM.items()
dict_items([('H', 1.01), ('He', 4.0), ('Li', 6.94), ('Be', 9.01), ('B', 10.81), ('C', 12.01), ('N', 14.01), ('O', 16.0), ('F', 19.0), ('Ne', 20.18)])
1.01
4.0
6.94
9.01
10.81
12.01
14.01
16.0
19.0
20.18
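The loop that produced the values above is not shown, so here is a sketch of the three common ways to iterate over a dictionary. A shortened atomic mass dictionary is used to keep the output brief.

```python
AM = {'H': 1.01, 'He': 4.00, 'Li': 6.94}

# Iterating over the dictionary itself yields the keys
for symbol in AM:
    print(symbol)

# values() yields only the values
for mass in AM.values():
    print(mass)

# items() yields (key, value) tuples that can be unpacked in the loop
for symbol, mass in AM.items():
    print(symbol, mass)
```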
Additional key:value pairs can be added to an already existing dictionary by calling the key and assigning it to a value as
demonstrated below. Instead of giving an error, the dictionary inserts that key: value pair.
AM['Na'] = 22.99
AM
{'H': 1.01,
'He': 4.0,
'Li': 6.94,
'Be': 9.01,
'B': 10.81,
'C': 12.01,
'N': 14.01,
'O': 16.0,
'F': 19.0,
'Ne': 20.18,
'Na': 22.99}
Notice that after adding sodium to the atomic mass dictionary, the new pair appears at the end. Since Python 3.7,
dictionaries preserve the order in which pairs were inserted, but unlike a tuple or list, values are accessed by key rather than by position.
Another method for generating a dictionary is the dict() function, which takes in a nested list or tuple of pairs and
generates key:value pairs as follows.
dict([('H',1), ('He',2), ('Li',3)])
Not only can dictionaries be used to store data for calculations, such as atomic masses, they can also be used to store
changing data as we perform calculations or operations. For example, let’s say we want to count how often each base (i.e.,
A, T, C, and G) appears in the following DNA sequence DNA. For this, we create a dictionary dna_bases to hold the
totals for each base and add one to each value as we iterate along the DNA sequence.
DNA = 'GGGCTCCATTGTCTGCCCGGGCCGGGTGTAGTCTAAGGTT'
dna_bases
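The counting loop is not reproduced above. One way it might be written, assuming the dictionary starts with every base set to zero, is sketched below.

```python
DNA = 'GGGCTCCATTGTCTGCCCGGGCCGGGTGTAGTCTAAGGTT'

# Start each tally at zero (the initial dictionary is an assumption)
dna_bases = {'A': 0, 'T': 0, 'C': 0, 'G': 0}

for base in DNA:
    dna_bases[base] += 1   # the base itself is the key

print(dna_bases)
```

Because each base doubles as its own dictionary key, no if/elif chain is needed to decide which total to update.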
2.3 Set
Sets are another Python object type you may encounter and use on occasion. These are multi-element objects similar to
lists with the key difference that each element can appear only once in the set. This may be useful in applications where
code is taking stock of what is present. For example, if we are taking inventory of the chemical stockroom to know which
chemical compounds are on hand for experiments, the names of the compounds can be stored in a set. If more than one
bottle of a compound is present in the stockroom, the set only contains the name once because we are only concerned
with what is available, not how many are available. A set looks like a list except curly brackets are used instead of square
brackets.
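The set definition is not shown here, so the following sketch builds one from a hypothetical bottle-by-bottle inventory list. The bottles list is invented for illustration; the five compound names are the ones that appear in the output further below.

```python
# Hypothetical inventory with one entry per bottle on the shelf
bottles = ['ethanol', 'water', 'acetone', 'ethanol',
           'toluene', 'sodium chloride', 'water']

compounds = set(bottles)   # duplicates collapse to a single entry

print(len(bottles), len(compounds))
print('ethanol' in compounds)   # membership test
```

A set can also be written directly with curly brackets, such as {'ethanol', 'water'}.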
We can add additional items to the set using the add() set method.
Note
The method is called add() and not append() as is used for lists because unlike lists, sets do not preserve the
order of items contained within them.
compounds.add('calcium chloride')
compounds
{'acetone',
'calcium chloride',
'ethanol',
'sodium chloride',
'toluene',
'water'}
compounds.add('ethanol')
compounds
{'acetone',
'calcium chloride',
'ethanol',
'sodium chloride',
'toluene',
'water'}
Notice that when ethanol is added to the set, nothing changes. This is because ethanol is already in the set, and sets do
not store redundant copies of elements.
Multiple sets can be combined or subtracted from each other using the | and - operators, and two sets can be compared
using Boolean operators. Below are two sets containing the atomic orbitals in nitrogen (N) and calcium (Ca) atoms. Even
though there are three 2p orbitals in nitrogen, '2p' appears only once, telling us what types of orbitals are present but not how
many.
N = {'1s','2s','2p'}
Ca = {'1s','2s','2p', '3s', '3p', '4s'}
Operator   Name                   Description
-          Difference             Returns items in the first set minus common items in both sets
^          Symmetric Difference   Merges both sets minus items in common (i.e., “exclusive or”)
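The set operators can be tried on the N and Ca sets defined above. In this sketch, sorted() is used only so that the printed order is predictable, since sets themselves have no defined order.

```python
N = {'1s', '2s', '2p'}
Ca = {'1s', '2s', '2p', '3s', '3p', '4s'}

print(sorted(N | Ca))   # union: orbitals found in either atom
print(sorted(N & Ca))   # intersection: orbitals found in both atoms
print(sorted(Ca - N))   # difference: orbitals in Ca that are not in N
print(sorted(N ^ Ca))   # symmetric difference: orbitals in exactly one atom
print(N == Ca)          # comparison returns a Boolean
print(N <= Ca)          # True when every item in N is also in Ca
```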
Remember from the last chapter that a module is a collection of functions and data with a common theme. You have
already seen the math module in section 1.1.3, but Python also contains a number of other native modules that come with
every installation of Python. Table 3 lists a few common examples, but there are certainly many others worth exploring.
You are encouraged to visit the Python website and explore other modules. This section will introduce a few useful
modules with some examples of their uses.
Note
See [Link] for a more complete listing and descriptions of built-in Python
modules.
Name         Description
os           Provides access to your computer's file system
itertools    Iterator and combinatorics tools
random       Functions for pseudorandom number generation
datetime     Handling of date and time information (see section 2.9)
csv          For writing and reading CSV files
pickle       Preserves Python objects on the file system
timeit       Times the execution of code
audioop      Tools for reading and working with audio files (removed from the standard library in Python 3.13)
statistics   Statistics functions
2.4.1 os Module
The os module provides access to the files and directories (i.e., folders) on your computer. Up to this point, we have
been opening files that are in the same directory as the Jupyter notebook, so Jupyter has no difficulty finding the files.
However, if you ever want to open a file somewhere else on your computer or open multiple files, this module is particularly
useful. Below you will learn to use the os module to open files in non-local directories (i.e., not the directory your Jupyter
notebook is in) and to open an entire folder of files.
Table 4 Select os Module Functions
Function       Description
os.chdir()     Changes the current working directory to the path provided
os.getcwd()    Returns the current working directory path
os.listdir()   Returns a list of all files and directories in the current or indicated directory
Table 4 provides a description of the three functions that we will be using. To open a file not in the directory of your Jupyter
notebook, you will need to change the directory Python is currently looking in, known as the current working directory,
using the chdir() function. It takes a single argument: the path, as a string, to the folder containing the files
of interest. For example, if the files are in a folder called “my_folder” on your computer desktop, you might use something
like the following. The exact format will vary depending on your computer and whether you are using macOS, Windows, or
Linux.
import os
os.chdir('/Users/me/Desktop/my_folder')
If you are not sure which directory is the current working directory, you can use the getcwd() function. It does not
require any arguments.
os.getcwd()
Another useful function from the os module is the listdir() function, which lists all the files and directories in a folder.
It is useful not only for determining the contents of a folder but also for iterating through all the files in a folder. Imagine
you have not just a single CSV file with data but an entire folder of similar CSV files that you need to import into Python.
Instead of handling these files one at a time, you can have Python iterate through the folder and import each CSV file it
finds. Below is a demonstration of importing and printing every CSV file on the computer desktop.
import numpy as np
os.chdir('/Users/me/Desktop') # changes directory
for file in os.listdir():
    if file.endswith('csv'): # only open csv files
        data = [Link](file)
        print(data)
The code above goes through every file on the computer desktop, and if the file name ends in “csv”, Python imports and
prints the contents. Checking the file extension is an important step even if you have a folder that you believe only contains
CSV files. This is because folders on many computers contain invisible files for use by the computer operating system.
The user usually cannot see them, but Python can and will generate an error if it tries to open one of them as a CSV file. Checking
the file extension ensures that Python only tries to open the actual CSV files. See section 13.2.5 for an example of this.
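The following self-contained sketch demonstrates the same folder-iteration pattern using only built-in modules. It creates a temporary folder holding two small CSV files and one hidden non-CSV file, then reads only the CSV files. All file names and values are invented for the demonstration, and the csv module is used here in place of a NumPy import function.

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Build a throwaway folder: two CSV files plus one hidden system-style file
    samples = [('run1.csv', [[0, 1.00], [1, 0.50]]),
               ('run2.csv', [[0, 2.00]])]
    for name, rows in samples:
        with open(os.path.join(folder, name), 'w', newline='') as f:
            csv.writer(f).writerows(rows)
    open(os.path.join(folder, '.DS_Store'), 'w').close()

    data = {}
    for file in sorted(os.listdir(folder)):
        if file.endswith('.csv'):   # skips the hidden file
            with open(os.path.join(folder, file), newline='') as f:
                data[file] = [[float(v) for v in row] for row in csv.reader(f)]

print(data)
```

Passing the folder path to listdir() and joining it to each file name with os.path.join() avoids changing the current working directory, which can be convenient when a script reads from several folders.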
The itertools module contains an assortment of tools for looping over data in an efficient manner. There are a number of
functions that are good to know from this module, but we will focus on the combinatorics functions combinations()
and permutations().
The combinations(collection, n) function generates all n-sized combinations of elements from a collection
such as a list, tuple, or range object. With combinations(), order does not matter, so (1, 2) is equivalent to (2,
1). In the below code, the combinations() function generates all pairs of elements from numbers.
import itertools
numbers = range(5)
itertools.combinations(numbers, 2)
<itertools.combinations at 0x1095fd0d0>
So what just happened? Instead of returning a list, it returned a combinations object. You do not need to know much
about these except that they can be converted into lists or iterated over to extract their elements, and they are single-use.
Once you have iterated over them, they need to be generated again if you need them again.
Note
combinations() returns an iterator, an object that generates values only on demand in an effort to
reduce memory usage. The range() function is similarly lazy, although a range object can be iterated over more than once.
(0, 1)
(0, 2)
(0, 3)
(0, 4)
(1, 2)
(1, 3)
(1, 4)
(2, 3)
(2, 4)
(3, 4)
Each combination is returned in a tuple, and if the combination object is converted to a list, it would be a list of tuples.
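A short sketch of both ways of extracting the pairs, and of the single-use behavior described above, is shown below.

```python
import itertools

numbers = range(5)
combos = itertools.combinations(numbers, 2)

first_pass = list(combos)    # converting to a list consumes the object
second_pass = list(combos)   # nothing is left the second time

print(len(first_pass), len(second_pass))

# Generate the combinations again to iterate over them in a loop
for pair in itertools.combinations(numbers, 2):
    print(pair)
```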
The permutations() function is very similar to combinations(), except with permutations(), order matters.
Therefore, (2, 1) and (1, 2) are non-equivalent. This is especially important in probability and statistics.
Permutations of a group of items can be generated just like in the combinations example above.
(0, 1)
(0, 2)
(0, 3)
(0, 4)
(1, 0)
(1, 2)
(1, 3)
(1, 4)
(2, 0)
(2, 1)
(2, 3)
(2, 4)
(3, 0)
(3, 1)
(3, 2)
(3, 4)
(4, 0)
(4, 1)
(4, 2)
(4, 3)
Notice how (0, 2) and (2, 0) are both present in the permutations while only one is listed in the combinations.
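The difference can also be checked by counting. The sketch below generates both collections and compares their sizes against math.perm() and math.comb(), which calculate the expected number of permutations and combinations directly.

```python
import itertools
import math

numbers = range(5)
perms = list(itertools.permutations(numbers, 2))
combos = list(itertools.combinations(numbers, 2))

print(len(perms), len(combos))
print(math.perm(5, 2), math.comb(5, 2))   # expected counts

print((2, 0) in perms)    # order matters, so (2, 0) is its own entry
print((2, 0) in combos)   # only (0, 2) appears among the combinations
```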
The random module provides a selection of functions for generating random values. Random values can be integers
or floats and can be generated from a variety of ranges and distributions. A selection of common functions from the
random module are shown in Table 5. We will not go into much detail here as random value generation is covered in
significantly more detail at the end of chapter 4. One key limitation of the random module is that the functions typically
only generate a single value at a time. If you want multiple random values, you need to either use a loop or use the random
value functions from NumPy presented in chapter 4.
Table 5 Functions from random Module
Function                 Description
random.random()          Generates a value from [0, 1)
random.uniform(x, y)     Generates a float from the range [x, y) with a uniform probability
random.randrange(x, y)   Generates an integer from the provided range [x, y)
random.choice()          Randomly selects an item from a list, tuple, or other multi-element object
random.shuffle()         Shuffles a multi-element object
One point worth noting is that square brackets mean inclusive while parentheses mean exclusive, so [0, 9) means from 0
→ 9 including 0 but not including 9.
import random
random.random()
0.25677523670981783
[Link](0, 10)
a = [1,2,3,4,5,6]
random.shuffle(a)
a
[6, 1, 3, 4, 5, 2]
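Since these functions return a single value per call, a list comprehension is a compact way to collect several values. In the sketch below, the seed value is an arbitrary choice added so that repeated runs give the same numbers; it is not required.

```python
import random

random.seed(42)   # optional: makes the pseudorandom sequence repeatable

values = [random.random() for _ in range(5)]        # floats from [0, 1)
rolls = [random.randrange(1, 7) for _ in range(5)]  # integers from 1 to 6
pick = random.choice(['H', 'He', 'Li'])             # one item from a list

a = [1, 2, 3, 4, 5, 6]
random.shuffle(a)   # shuffles the list in place and returns None

print(values, rolls, pick, a)
```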
There are times when it is necessary to iterate over two lists simultaneously. For example, let us say we have a list of
the atomic numbers (AN) and a list of approximate atomic masses (mass) of the most abundant isotopes for the first six
elements on the periodic table.
AN = [1, 2, 3, 4, 5, 6]
mass = [1, 4, 7, 9, 11, 12]
If we want to calculate the number of neutrons in each isotope, we need to subtract each atomic number (equal to the
number of protons) from the atomic mass. To accomplish this, it would be helpful to iterate over both lists simultaneously.
Below are a couple of methods of doing this.
2.5.1 Zipping
The simplest way to iterate over two lists simultaneously is to combine both lists into a single, iterable object and iterate
over it once. The zip() function does exactly this by merging two lists or tuples, like a zipper on a jacket, into something
like a nested list of lists. However, instead of returning a list or tuple, the zip() function returns a single-use zip object.
zipped = zip(AN, mass)
0
2
4
5
6
6
As noted above, these are single-use objects, so if we try to use it again, nothing happens.
for pair in zipped:
    print(pair[1] - pair[0])
If the two lists are of different length, zip() stops at the end of the shorter list and returns a zip object with a length of
the shorter list.
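The behaviors described in this section can be gathered into one short sketch: unpacking each zipped pair directly in the loop, the single-use nature of the zip object, and truncation to the shorter input. The final pair of inputs is arbitrary.

```python
AN = [1, 2, 3, 4, 5, 6]
mass = [1, 4, 7, 9, 11, 12]

# Each pair can be unpacked into two variables as it is generated
neutrons = [m - z for z, m in zip(AN, mass)]
print(neutrons)   # [0, 2, 4, 5, 6, 6]

# A zip object is exhausted after one pass
zipped = zip(AN, mass)
first = list(zipped)
second = list(zipped)
print(len(first), len(second))   # 6 0

# Zipping stops at the end of the shorter input
print(list(zip([1, 2, 3], 'ab')))   # [(1, 'a'), (2, 'b')]
```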
2.5.2 Enumeration
A close relative to zip() is the enumerate() function. Instead of zipping two lists or tuples together, it zips a list or
tuple to the index values for that list. Similar to zip(), it returns a one-time use iterable object.
enum = enumerate(mass)
(0, 1)
(1, 4)
(2, 7)
(3, 9)
(4, 11)
(5, 12)
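The loop that prints the pairs is not shown above; a sketch is given below. It also uses the optional start argument of enumerate(), which is convenient here because atomic numbers begin at 1 instead of 0.

```python
mass = [1, 4, 7, 9, 11, 12]

# Index and value can be unpacked directly in the loop
for index, m in enumerate(mass):
    print(index, m)

# Starting the count at 1 pairs each mass with its atomic number
for atomic_number, m in enumerate(mass, start=1):
    print(atomic_number, m - atomic_number)   # atomic number, neutrons
```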
The zip() function can be made to do the same thing by zipping a list with a range object of the same length as shown
below, but enumerate() may be slightly more convenient.
zipped = zip(range(len(mass)), mass)
for item in zipped:
    print(item)
(0, 1)
(1, 4)
(2, 7)
(3, 9)
(4, 11)
(5, 12)
During most of your work in Python, you do not need to think about how and where the values are stored because Python
handles this for you. If you assign a number to a variable, Python will determine how to properly store this information.
However, there are instances where you will need to understand a little about how numbers are encoded such as in grayscale
images (chapter 7).
Numbers on your computer are stored in binary which is a base-two numbering system. That is, instead of using digits
from 0 → 9 to describe a number, only 0 and 1 are used.
Note
Standard numbers used by humans are base-ten because we describe values using combinations of ten digits (0,
1, 2, 3, 4, 5, 6, 7, 8, and 9). Once we get to 9, the digit returns to 0 and a 1 is placed to the left. In a binary
numbering system, we use only 0 and 1 to describe values. Analogously, once we get to 1, the digit returns to 0
and a 1 is placed to the left. Therefore, “10” is two in binary.
When a number is stored in memory, a fixed block of zeros/ones is allocated to store this information, and depending on
the size or precision of the number to be stored, this block may need to be larger or smaller. By convention, the blocks
are typically 8, 16, 32, 64, or 128 bits (i.e., zeros or ones) in size. Table 6 lists a few examples with the terms used by
Python.
Table 6 Python Data Types
Probably the simplest way to encode a number is an unsigned 8-bit integer. The “unsigned” means that it cannot have a
negative sign while the “8-bit” means it can use eight zeros and ones to describe the number. For example, if we want to
encode the number 3, it is 00000011. Even if not all the bits are strictly required, they have been allotted for the storage
of this value, and with 8 bits, we can encode numbers from 0 → 255 (i.e., 00000000 → 11111111). If we want to encode
any larger numbers, a longer block of bits such as 16 or 32 will need to be allotted.
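Python's built-in format() and int() functions can be used to check these values. The sketch below converts between integers and their binary representations as strings; it only displays the bit patterns and does not change how Python stores the numbers.

```python
print(format(3, '08b'))     # 00000011, padded with zeros to eight bits
print(int('11111111', 2))   # 255, the largest unsigned 8-bit value
print(2**8 - 1)             # 255, since eight bits give 2**8 combinations

# 256 needs a ninth bit, so it does not fit in an 8-bit block
print(format(256, '08b'))   # 100000000
```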
To encode negative integers, signed integers are required. The key difference between a signed and unsigned integer is
that an unsigned integer is always positive while a signed integer can describe positive and negative values by using the
first bit to describe the sign. The first bit is 0 for a positive number and 1 for a negative number. Because the first bit is
reserved for sign, a signed integer can describe values of only half the magnitude as an unsigned integer of the same bit
length. For example, an 8-bit signed integer can describe values from -128 → 127. All combinations of zeros/ones that
start with a 0 define positive values from 0 → 127 while all combinations of zeros/ones that start with a 1 define values
from -128 → -1. That is, 10000000 equals -128 while 11111111 describes -1.
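The values quoted above correspond to the two's-complement encoding, which can be checked with the int.from_bytes() method. The as_signed_8bit() helper below is a hypothetical convenience function written for this illustration.

```python
def as_signed_8bit(bits):
    """Interpret a string of eight 0s and 1s as a signed (two's-complement) integer."""
    raw = bytes([int(bits, 2)])   # a single byte holding the bit pattern
    return int.from_bytes(raw, byteorder='big', signed=True)

print(as_signed_8bit('01111111'))   # 127
print(as_signed_8bit('10000000'))   # -128
print(as_signed_8bit('11111111'))   # -1
```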
For non-integer values, we need floats. The number of bits used to store a float dictates the precision of the value, that is,
roughly how many significant digits it can hold. The various types listed above support both positive and negative
values, and the more bits, the more precision they offer.
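The effect of bit length on precision can be seen with the built-in struct module, which can pack a value into a 32-bit or 64-bit float and read it back. This is only a sketch for illustration; NumPy data types are the more usual way of working with specific float sizes.

```python
import struct

x = 0.1   # Python floats are 64-bit

# Round trip through a 32-bit float ('f') and a 64-bit float ('d')
as_32 = struct.unpack('f', struct.pack('f', x))[0]
as_64 = struct.unpack('d', struct.pack('d', x))[0]

print(as_32)        # differs from 0.1 after about seven significant digits
print(as_64)        # unchanged
print(as_32 == x, as_64 == x)
```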
Section 1.9 describes positional arguments and keyword arguments as two methods for providing functions with information
and instructions, but thus far, these methods have only allowed the function to take a predetermined number of
arguments. While some flexibility is offered by the ability to set default keyword arguments that users have the option
of overriding or leaving as the default, there is still a limit on the number of parameters in the function. What do we do
when we need to write a function that takes an unspecified number of arguments? This section provides two approaches
to solving this problem.
As a possible use case, it is common practice in labs to purify a solid compound by recrystallization, and chemists will
often harvest multiple crops of crystals from the same solution to get the highest possible yield. If we want to write
a function that returns the percent yield of a synthesized compound using the theoretical yield and the yields of each
recrystallization crop, we are faced with the challenge of not knowing how many crops to expect. One solution is a
var-positional argument.
The var-positional argument (often written *args) is a positional argument that accepts a variable number of inputs. The arguments
are then stored as a local tuple in the function attached to the args variable. Even though it is extremely common in
examples to see people use args as the variable name, you may use any non-reserved name you like as long as you precede it
with an asterisk in the function definition. For example, a function for calculating the percent yield is shown below with
g_theor as the theoretical yield in grams and g_crops as the var-positional parameter storing the mass of each crop
of crystals in grams.
66.66666666666666
Interestingly, depending on how you write the internals of the function, the var-positional argument does not strictly need
to receive any values for the function to work. In this case, because g_crops is an empty tuple when no crops are provided
and the sum() function returns 0 for an empty tuple, the per_yield() function still works with no error returned.
per_yield(1.32)
0.0
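The listing for per_yield() is not reproduced above, so the following is only a minimal sketch consistent with the description: the percent yield is taken as the summed crop masses divided by the theoretical yield. The crop masses in the example call are hypothetical values chosen for illustration.

```python
def per_yield(g_theor, *g_crops):
    """Return the percent yield from the theoretical yield (g) and any number of crop masses (g)."""
    # g_crops arrives as a tuple; sum() of an empty tuple is 0
    return 100 * sum(g_crops) / g_theor

# two hypothetical crops of crystals
print(per_yield(1.32, 0.50, 0.38))

# a list of crops can be unpacked into the var-positional parameter with *
crops = [0.50, 0.38]
print(per_yield(1.32, *crops))
```

Unpacking an existing list with * is convenient when the number of crops is only known at run time.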
Scientific Computing for Chemists with Python
Similarly, an unspecified number of keyword arguments can also be accepted by a Python function using var-keyword
arguments. In this case, the user not only dictates the number of arguments but also picks the variable names. The user-
defined variables and values are stored in a local dictionary as key:value pairs. As an example, we can write a function that
calculates the molar mass of a compound based on the number and type of elements it contains. It is certainly possible to
write a function with every chemical element as a keyword argument, but this gets absurd with so many chemical elements
to choose from. Instead, we can use a var-keyword parameter as demonstrated below. The var-keyword argument is
indicated with a ** before the variable name. The function below is only designed to work with the first nine elements
for brevity.
def mol_mass(**elements):
    m = {'H':1.008, 'He':4.003, 'Li':6.94, 'Be':9.012,
         'B':10.81, 'C':12.011, 'N':14.007, 'O':15.999,
         'F':18.998}
    masses = [] # mass total from each element
    for key in elements.keys():
        masses.append(elements[key] * m[key])
    return sum(masses)
Let us test this function by calculating the molar mass of caffeine which has a molecular formula of C8H10N4O2.
194.194
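The call that produces this value is not shown above. A plausible form, assuming each element is passed as a keyword with its atom count, is sketched below using a condensed variant of mol_mass(); the generator expression is a stylistic substitution, not the original listing.

```python
def mol_mass(**elements):
    # atomic masses (g/mol) for the first nine elements, as in the function above
    m = {'H': 1.008, 'He': 4.003, 'Li': 6.94, 'Be': 9.012,
         'B': 10.81, 'C': 12.011, 'N': 14.007, 'O': 15.999,
         'F': 18.998}
    # elements is a dictionary such as {'C': 8, 'H': 10, ...}
    return sum(count * m[symbol] for symbol, count in elements.items())

caffeine = mol_mass(C=8, H=10, N=4, O=2)
print(round(caffeine, 3))

# an existing dictionary can be unpacked into the var-keyword parameter with **
formula = {'C': 8, 'H': 10, 'N': 4, 'O': 2}
print(round(mol_mass(**formula), 3))
```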
The user experience would be the same if we wrote the function to accept keyword arguments with default values of
zero, but it is sometimes more convenient for the person writing the code to design the function to accept var-keyword
arguments.
Functions can call other functions. This is probably not surprising as we have already seen functions call [Link]()
and append(), but what may be surprising is that Python allows a function to call itself. This is known as a recursive
function.
If we want to write a function that calculates the remaining mass of radioactive materials after a given number of half-lives,
this can be accomplished using a for or while loop, but it can also be accomplished recursively. We start by having
the function divide the provided mass (mass) in half and then decrement the number of half-lives (hl) by one. This is
the core component of the function. If hl is zero, the function is done and returns the mass. If not, the function calls
itself again with the remaining mass and number of half-lives. This is the recursive part. The second time the function is
run, the mass is again halved and the half-lives decremented by one, and the number of half-lives is again checked.
if hl == 0:
    return mass
half_life(4.00, hl=2)
1.0
half_life(4.00, hl=4)
0.25
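Only the tail of the function listing survives above, so here is a minimal reconstruction that follows the description in the text (halve the mass, decrement hl, return when hl reaches zero, otherwise call itself). The default value hl=1 is an assumption.

```python
def half_life(mass, hl=1):
    """Return the mass remaining after hl half-lives (recursive sketch)."""
    mass /= 2          # one half-life elapses
    hl -= 1
    if hl == 0:        # base case: no half-lives left
        return mass
    # recursive case: repeat with the remaining mass and half-lives
    return half_life(mass, hl=hl)

print(half_life(4.00, hl=2))
print(half_life(4.00, hl=4))
```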
It works! In the second example above, the half_life() function is run four times because the function called itself an
additional three times. What happens if we feed the function 1.5 half-lives? Like a while loop with a faulty termination
condition, this function will keep going because hl never equals zero. Luckily, Python has a safeguard, a recursion limit
of 1000 nested calls by default, that stops runaway recursive functions with a RecursionError, but this is still a problem.
We can protect against this issue by doing a check at the start of the function to ensure an integer is provided using the
isinstance() function, which takes two arguments: the variable and the object type.
isinstance(x, type)
Tip
If you cannot guarantee that the inputs for your code will conform to certain requirements (e.g., be an integer), it
is wise to do checks at the start of your code. This is especially true if the input or data for your code comes from
people other than the author of the code.
mass /= 2
hl -= 1
if hl <= 0:
    return mass
else:
    return half_life(mass, hl=hl)
half_life(4.00, hl=1.5)
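The first lines of this version, including the isinstance() check, are missing from the listing above. The sketch below shows one way the check might look; raising a TypeError and the wording of the message are assumptions, not necessarily what the original code does.

```python
def half_life(mass, hl=1):
    """Return the mass remaining after an integer number of half-lives."""
    # guard clause: refuse non-integer half-lives before doing any work
    if not isinstance(hl, int):
        raise TypeError('hl must be an integer number of half-lives')
    mass /= 2
    hl -= 1
    if hl <= 0:
        return mass
    else:
        return half_life(mass, hl=hl)

try:
    half_life(4.00, hl=1.5)
except TypeError as err:
    print('TypeError:', err)
```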
While getting an error message is not what anyone likes to see, this is a good thing. It is better for the code to generate
an error and not work than to run away uncontrollably or return an incorrect answer.
As a final note on recursive functions, you may have noticed that you could just as easily have accomplished the above
task with a while or for loop. Recursive functions can usually be avoided, but once in a while a recursive function will
substantially simplify your code. It is a good technique to have in your back pocket for the moment you need it, but you
will not likely use it often.
It doesn’t take long to realize that error messages are an inevitable part of computer programming, so it is helpful to know
what the different types of error messages mean and how to deal with them. This section provides a quick overview of
major types of error messages and how to get Python to work past them when appropriate.
Whenever you encounter an error message, it includes the type of error followed by more details. There are numerous
types of errors, but there are a few error types that are more prevalent and worth being familiar with. Below is a short list
of some of these common error types.
Table 7 A Selected List of Python Error Types
NameError
The NameError means the code uses a variable or function name that does not exist because it has not been defined.
This is often the result of mistyping a variable name but can have other causes like running code cells in a Jupyter notebook
without first running necessary earlier code cells. If you just opened a Jupyter notebook, it is often worth selecting Run
→ Run All Cells from the top menu to ensure the latter doesn’t happen.
print(root)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[54], line 1
SyntaxError
A programming language’s syntax is the set of rules that dictate how the code is formatted, the appropriate symbols, valid
values and variables, etc. It’s all the rules that we’ve been learning about in the past couple of chapters. A SyntaxError
indicates that your code violated one of these rules. To be helpful, the error message shows the line of code with the invalid
syntax and points to where in the line the problem seems to be occurring.
In the first example below, the error occurred because <> is not a valid operator in Python.
5 <> 6
The below example generates a SyntaxError because variable names cannot start with a number.
5sdq = 52
TypeError
A TypeError occurs when using the wrong object type for a particular function or application. For example, Python
cannot take the absolute value of a letter, so this generates a TypeError.
abs('a')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[57], line 1
----> 1 abs('a')
A TypeError is encountered below because the > comparison is not supported between a list and an integer - at least not
without a for loop or NumPy (introduced in chapter 4).
[1,2,3] > 5
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[58], line 1
----> 1 [1,2,3] > 5
ValueError
The ValueError is somewhat similar to a TypeError, except in this case it indicates that a numerical value is not
valid or appropriate for a particular function. Some functions require that their arguments be within a certain range, such
as the math.sqrt() function, which does not accept negative numbers. As a result, taking the square root of -1 with this
function generates a ValueError.
import math
math.sqrt(-1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[59], line 2
1 import math
----> 2 math.sqrt(-1)
ZeroDivisionError
The ZeroDivisionError is what the name says - the code attempted to divide by zero.
4 / 0
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
Cell In[60], line 1
----> 1 4 / 0
IndentationError
Python does not care about spaces except those at the start of a line as these spaces or indentations have meaning. In the
example below, the print(x) should be indented below the start of the for loop, so it generates an IndentationError.
for x in range(5):
print(x)
When indexing a composite object like a list, an index value that is outside the range results in an IndexError. In the
list below, the indices run from 0 to 4, so using an index of 5 returns an IndexError.
lst = [1,5,7,4,3]
lst[5]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[62], line 2
1 lst = [1,5,7,4,3]
----> 2 lst[5]
Similarly, if the code tries to look up a value using a key not present in a dictionary, it returns a KeyError as shown
below.
elements['N']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[64], line 1
----> 1 elements['N']
KeyError: 'N'
DeprecationWarning
A DeprecationWarning occurs when code uses a feature that will be removed or changed in a future release of
Python or a third-party library. Unlike the errors above, this warning does not stop your code; it is a friendly heads up
that your code may not work in the future.
Tip
Python error messages indicate the line where the error occurs, but on occasion you may find no error in that line of
code. In these instances, the error is likely in the previous line. This can happen because Python provides means for
continuing a line of code onto subsequent lines, such as opening a left parenthesis, (, on the first line and not closing it
with a right parenthesis, ), until a later line. As an example, the following is executed by Python as if it
were all on the same line.
V = (n * R * T_K
/ P_atm)
While this may seem like a bad idea at first glance, there are times when you may want Python to not come to a grinding
halt in the face of an error. One common situation is when importing a large number of data files from different sources.
Not every data source may have formatted data or files the same, and some files may be malformed or there may be other
unexpected edge cases. To get Python to not stop at an error message, you can use a try/except block.
The general structure of a try/except block is to include the code you originally intend to run under the try statement,
and under the following except statement, include what Python should do in the event of a specific error. The general
structure looks like the following.
try:
    regular code
    regular code
except ErrorType:
    contingency code
As an example, let’s say we are iterating through a list of numbers and appending the square root to a second list. Because
one item in the original list of numbers is the string 'four' rather than a number, this causes a TypeError.
import math
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[66], line 5
2 sqr_root = []
4 for num in sqr_nums:
----> 5 sqr_root.append(math.sqrt(num))
Instead, the for loop has been placed under a try: telling Python to make a best attempt at running the code. The
code under the except TypeError: tells Python to run the following code in the event of a TypeError.
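The try/except listing itself is not reproduced here, so the following is a minimal sketch of the arrangement just described. The contents of sqr_nums and the printed message are hypothetical stand-ins.

```python
import math

sqr_nums = [4, 9, 'four', 16]   # hypothetical data; 'four' is a string
sqr_root = []

try:
    for num in sqr_nums:
        sqr_root.append(math.sqrt(num))
except TypeError:
    # runs only if a TypeError is raised somewhere in the try block
    print('One or more values are not numbers.')

print(sqr_root)
```

Because the whole loop sits under try, the loop ends at the first problem value and later values are never processed. Moving the try/except inside the for loop would let the remaining values be handled.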
In the above example, nothing is done with the string except to inform the user that there was a problem. It is a prudent
practice to not let unsolved errors pass by silently. If you have a good idea of where errors may turn up and have a solution
to them, you can include that code under the except: as well.
Because we know the above error is caused by a string, we can convert it to a float using a dictionary like below.
sqr_root
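One way this dictionary-based fix might look is sketched below. The dictionary name words, its single entry, and the list contents are assumptions made for illustration.

```python
import math

words = {'four': 4.0}           # hypothetical lookup from number words to floats
sqr_nums = [4, 9, 'four', 16]   # hypothetical data
sqr_root = []

for num in sqr_nums:
    try:
        sqr_root.append(math.sqrt(num))
    except TypeError:
        # fall back to the dictionary when the value is not a number
        sqr_root.append(math.sqrt(words[num]))

print(sqr_root)
```

Here the try/except sits inside the loop, so every item is processed even after an exception occurs.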
It is worth noting that try/except blocks can be avoided using if/else blocks like below.
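A sketch of the if/else alternative is shown here, checking the type before attempting the calculation. The list contents and the choice to test against int and float are assumptions.

```python
import math

sqr_nums = [4, 9, 'four', 16]   # hypothetical data
sqr_root = []

for num in sqr_nums:
    # check the type first instead of waiting for an exception
    if isinstance(num, (int, float)):
        sqr_root.append(math.sqrt(num))
    else:
        print(f'Skipping {num!r}: not a number')

print(sqr_root)
```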
So when should you use try/except versus if/else? If you anticipate exceptions to occur frequently, if/else is
likely to be more efficient, but if exceptions are rare, it may be more efficient to use try/except.
One thing worse than code not running is code running and producing incorrect outputs. At least when code fails to run,
the user knows something is wrong whereas code that fails silently can lull the user into false conclusions. It is a prudent
practice in coding to include checks that important conditions are met, and when these conditions are not met, the code
should stop and produce an error known as raising an exception. To include checks in your code, you can use a condition
with a raise statement followed by some form of error from Table 7 and an error message. The more specific you can
be in your error type and message, the better.
As an example, we will write a function below which quantifies the differences between two DNA sequences. The
Hamming distance is one possible metric for determining how different two sequences are and is simply the number
of locations where two sequences of the same length are different. For example, AATGC and AATGT have a Hamming
distance of 1 because they are identical except for the last base position. Because it is critical that the two DNA sequences
be the same length, this should be checked before any further calculations, and if the sequences have different lengths,
the function should not proceed and provide a helpful error message.
if len(seq1) != len(seq2):
    raise ValueError('Sequences must be of equal length')
Because the two sequences have the wrong number of bases, this qualifies as a ValueError (see Table 7). Inside the
parentheses behind ValueError, a more detailed message can and should be provided.
dna1 = 'AACCT'
dna2 = 'ATCCA'
dna3 = 'ATCCTA'
if len(seq1) != len(seq2):
    raise ValueError('Sequences must be of equal length')
return distance
When we compare the first two DNA sequences, which are the same length, the function returns a numerical value. However,
when comparing the second and third sequences, which are not the same length, the error message appears instead of a number.
hamming(dna1, dna2)
hamming(dna2, dna3)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[74], line 1
----> 1 hamming(dna2, dna3)
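Parts of the hamming() listing are missing above. The following is a minimal, complete sketch built around the length check and return statement that do appear; the counting loop using zip() is an assumption about how the distance is accumulated.

```python
def hamming(seq1, seq2):
    """Return the number of positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError('Sequences must be of equal length')
    distance = 0
    # zip() pairs up the bases position by position
    for base1, base2 in zip(seq1, seq2):
        if base1 != base2:
            distance += 1
    return distance

dna1 = 'AACCT'
dna2 = 'ATCCA'
dna3 = 'ATCCTA'

print(hamming(dna1, dna2))

try:
    hamming(dna2, dna3)
except ValueError as err:
    print('ValueError:', err)
```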
It is often necessary to know when data were collected such as in chemical kinetics. This information may be stored in the
file itself or as a timestamp at the end of the file name. Not only is it necessary to extract this date and time information, it
is often also necessary to calculate the times since the start of the experiment or between data points. This section covers
Python’s native datetime module useful for working with date and time information and extracting this information
from files. The four object types covered here are listed in Table 8. The first three tell us when the data were collected
while the fourth, timedelta, tells us the amount of time between two times or dates.
Table 8 Common datetime Objects
Note
We assume here that the data collection occurred in one timezone and not across leap years. If this is not the case,
see the Python datetime documentation for dealing with these added complexities.
We will start with what these objects are and how to work with them followed by how to use datetime to extract date
and time information from data files. First, we need to import the datetime module.
import datetime
The datetime module often stores date and time information in a datetime object. A datetime object can be created
multiple ways such as explicitly indicating a specific date and time using the datetime() function. For example, below
we indicate noon on Pi Day 2025. The datetime() function takes the year, month, and day as required arguments followed
by the hour, minute, second, and microsecond as optional arguments, all accepted positionally in this order.
datetime.datetime(year, month, day, hour, minute, second, microsecond)
The date and time information can also be provided to datetime() using keyword arguments like below.
mario_day = datetime.datetime(year=2025, month=3, day=10, hour=8, minute=10, second=0,
                              microsecond=0)
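Both ways of creating a datetime object can be sketched side by side as follows. The variable name pi_day is a hypothetical choice for the noon-on-Pi-Day example mentioned in the text.

```python
import datetime

# positional form: year, month, day, hour (unspecified fields default to zero)
pi_day = datetime.datetime(2025, 3, 14, 12)

# keyword form, as in the mario_day example
mario_day = datetime.datetime(year=2025, month=3, day=10, hour=8, minute=10)

print(pi_day.isoformat())
print(mario_day.hour, mario_day.minute)
```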
The current date and time can be accessed using the now() function of the datetime object type. This function also
accepts an optional timezone (tz=) argument (not discussed here). If no argument is provided, then tz=None. There
is also a datetime.datetime.today() function that is equivalent to the now() function when no timezone is
provided or tz=None. The now() function is recommended by the Python datetime documentation.
now = datetime.datetime.now()
now
The hours, minutes, seconds, and microseconds can be accessed individually using the hour, minute, second, and
microsecond attributes, respectively.
[Link]
10
A datetime object cannot be modified in place, but the replace() method returns a new datetime object with the specified values changed like below.
[Link](hour=3)
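Because datetime objects are immutable, replace() hands back a new object and leaves the original untouched. A short check, reusing the mario_day values from above with a hypothetical variable name early, illustrates this.

```python
import datetime

mario_day = datetime.datetime(2025, 3, 10, 8, 10)

# replace() returns a copy with the requested field changed
early = mario_day.replace(hour=3)

print(early.hour)       # hour in the new object
print(mario_day.hour)   # original object is unchanged
```

To keep the change, assign the result back to a variable, for example mario_day = mario_day.replace(hour=3).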
The datetime module also has a time object that is similar to the datetime object except that it restricts itself to time
information. The time() function used to create a time object accepts the hour, minute, second, and microsecond as
optional positional arguments.
datetime.time(5, 3, 32)
Like datetime objects, the hours, minutes, seconds, and microseconds can be accessed individually, and replace()
returns a modified copy.
[Link]
32
[Link](second=42)
datetime.time(5, 3, 42)
The differences between two datetime objects can also be calculated by subtracting the two objects. The result is
returned as a timedelta object.
datetime.timedelta(days=4, seconds=13800)
The days or seconds in the timedelta object can be accessed using the days or seconds attributes, respectively.
delta.days
delta.seconds
13800
delta.total_seconds()
359400.0
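The subtraction that produces this timedelta is not shown above. The values are consistent with subtracting mario_day from noon on Pi Day 2025, which is sketched below; the name pi_day is hypothetical.

```python
import datetime

pi_day = datetime.datetime(2025, 3, 14, 12)
mario_day = datetime.datetime(2025, 3, 10, 8, 10)

# subtracting two datetime objects gives a timedelta
delta = pi_day - mario_day

print(delta.days)                  # whole days
print(delta.seconds)               # leftover seconds beyond the whole days
print(delta.total_seconds())       # entire span in seconds
print(delta.total_seconds() / 60)  # entire span in minutes
```

Note that seconds holds only the remainder after the whole days are removed, so total_seconds() is usually the one to use for elapsed time in kinetics data.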
Extracting date and time from a file or file name can be accomplished using the ‘string-parsed time’ strptime()
function and formatting codes shown below. Additional codes can be found on the Python website.
Tip
If you want to convert from a datetime object to a string, use the ‘string from time’ strftime() function.
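As a brief illustration of going in that direction, the sketch below formats a datetime object as a string suitable for a file name. The date and the format string are arbitrary choices for the example.

```python
import datetime

collected = datetime.datetime(2025, 3, 14, 12, 3, 48)

# the same formatting codes used for parsing also work for output
label = collected.strftime('%Y-%m-%d_%H-%M-%S')
print(label)
```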
These codes allow you to parse strings into datetime objects by providing the strptime() function with
both the string from the data file and a description of how the date and time information is organized. For example, below
is a file where the collection time is included in the file name as hours, minutes, and seconds separated by hyphens.
file_name_1 = 'Absorbance_12-[Link]'
timestamp = datetime.datetime.strptime(file_name_1[-12:-4], '%H-%M-%S')
timestamp
Because the date (i.e., year, month, and day) information was not provided, the default date of January 1, 1900 was used
for the datetime object. If you only want the date or time information, you can access them using the date() or
time() functions, respectively.
timestamp.date()
datetime.date(1900, 1, 1)
timestamp.time()
datetime.time(12, 3, 48)
If the values are not formatted as Python assumes, a little extra effort may be required. For example, below the time
is formatted as hours-minutes-seconds-microseconds, but the microseconds are not represented as six digits with zero
padding as Python assumes. To deal with this, the microseconds are sliced out of the file name and added to the datetime
object using the replace() method.
file_name_2 = 'glucose_Absorbance_12-[Link]'
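The remainder of this listing is not reproduced, so here is a minimal sketch of the idea using a hypothetical file name. It separates the pieces with split() rather than fixed slice positions, which is a substitution for the slicing described in the text.

```python
import datetime

# hypothetical file name: hours-minutes-seconds-microseconds, no zero padding
file_name = 'glucose_Absorbance_12-03-48-125.csv'

stem = file_name.rsplit('.', 1)[0]        # drop the extension
time_part = stem.split('_')[-1]           # '12-03-48-125'
hms, micro = time_part.rsplit('-', 1)     # '12-03-48' and '125'

# parse the well-formatted part, then add the microseconds separately
timestamp = datetime.datetime.strptime(hms, '%H-%M-%S')
timestamp = timestamp.replace(microsecond=int(micro))
print(timestamp.time())

# for comparison, %f pads short values on the right, reading '125' as 125000
padded = datetime.datetime.strptime(time_part, '%H-%M-%S-%f')
print(padded.microsecond)
```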
Further Reading
The official Python website is the ultimate authority for documentation on the Python programming language and is well
written. There are also numerous books available on the subject both free and otherwise. Below are a few examples.
There is an abundance of other free resources such as YouTube videos and [Link] boards for people
looking for more information.
1. Python Documentation Page. [Link] (free resource)
2. Reitz, K.; Schlusser, T. The Hitchhiker’s Guide to Python: Best Practices for Development; O’Reilly: Sebastopol, CA,
2016.
3. Downey, A. B. Think Python; Green Tea Press, 2012. [Link] (free resource)
4. Das, U.; Lawson, A.; Mayfield, C.; Norouzi, N.; Rajasekhar, Y.; Kanemaru, R. Introduction to Python Programming;
OpenStax: Houston, TX, 2024. [Link] (free resource)
Exercises
Complete the following exercises in a Jupyter notebook. Any data file(s) referred to in the problems can be found in the
data folder in the same directory as this chapter’s Jupyter notebook. Alternatively, you can download a zip file of the data
for this chapter from here by selecting the appropriate chapter file and then clicking the Download button.
1. Generate a list containing the natural logs of integers from 2 → 23 (including 23) using append and then again
using list comprehension.
2. Write a function, using augmented assignment, that takes in a starting xyz coordinates of an atom along with how
much the atom should translate along each axis and returns the final coordinates. The docstring for this function is
below.
3. Generate a function that returns the square of a number using a lambda function. Assign it to a variable for reuse
and test it.
4. Generate a dictionary called aacid that converts single-letter amino acid abbreviations to the three-letter abbre-
viations. You will need to look up the abbreviations from a textbook or online resource.
5. For the following two sets:
acids1 = {'HCl', 'HNO3', 'HI', 'H2SO4'}
acids2 = {'HI', 'HBr', 'HClO4', 'HNO3'}
a) Generate a new set with all items from acids1 and acids2.
b) Generate a new set with the overlap between acids1 and acids2
c) Add a new item HBrO3 to acids1.
d) Generate a new set with items from either set but not in both
6. Use a for loop and listdir() method to print the name of every file in a folder on your computer. Compare
what Python prints out to what you see when looking in the folder using the file browser. Does Python print any
files that you do not see in the file browser?
7. Use the random module for the following.
a) Generate 10 random integers from 0 → 9 and calculate the mean of these values. What is the theoretical mean
for this dataset?
b) Generate 10,000 random integers from 0 → 9 and calculate the mean of these values. Is this mean closer to or
further from the theoretical mean than the mean from part a? Rationalize your answer. Hint: look up the “law of large
numbers” for help.
8. The following code generates five atoms at random coordinates in 3D space. Write a Python script that calculates
the distance between each pair of atoms and returns the shortest distance. The itertools module might be helpful
here. See section 1.9.1 for help calculating distance.
[out]: 1.4142135623730951
[in]: dist(3, 2, 1)
[out]: 3.7416573867739413
13. Below is a function that calculates the theoretical number of protons (p) and neutrons (n) remaining after x
alpha decays. Convert this function to a recursive function. Hint: start by removing the for loop and replacing it
with an if statement.
# tests
>> alpha_decay(2, 10, 10)
6 protons and 6 neutrons remaining.
>> alpha_decay(1, 6, 6)
4 protons and 4 neutrons remaining.
'''
for decay in range(x):
    p -= 2
    n -= 2
14. DNA strands contain sequences of nucleotide bases, and for DNA, these bases are adenine (A), thymine (T), guanine
(G), and cytosine (C). When comparing two DNA strands of the same length, the Hamming distance is the number
of positions where the two DNA strands contain a different base. For example, the ATTG and ATCG sequences
have a Hamming distance of 1 because they differ only at the third base position. Write a Python function that
calculates the Hamming distance between two DNA sequences by zipping the two sequences. Your function should
first check that the two sequences are of the same length and return an error message if they are not. Test the
function on the following two DNA sequences.
dna1 = 'ATCCTGCATTAGGGAGCTTTTATTGCCCAATAGCTA'
dna2 = 'ATCCTGGATTAGGGAGCATTTATTGCCCAATAGGTA'
15. DNA sequences often do not contain equal quantities of GC versus AT bases, and the percentage of GC
is known as the GC-content.
a) Write a Python function that generates a random DNA sequence a user-defined number of bases long with an
average GC-content of 40%. The [Link]() function may be helpful here. Execute your function
for a 50-base DNA strand. Note: because your function generates a random sequence, the GC-content may not
always be 40%, but the GC-content of the generated sequences should average to near 40% over a very large number of
sequences generated.
b) Write and test a separate Python function from above that calculates the GC-content of a user provided DNA
sequence.
CHAPTER 3: PLOTTING WITH MATPLOTLIB
Data visualization is an important part of scientific computing both in analyzing your data and in supporting your conclu-
sions. There are a variety of plotting libraries available in Python, but the one that stands out from the rest is matplotlib.
Matplotlib is a core scientific Python library because it is powerful and can generate nearly any plot a user may need. The
main drawback is that it is often verbose. That is to say, anything more complex than a very basic plot may require a few
lines of boilerplate code to create. This chapter introduces plotting with matplotlib.
Before the first plot can be created, we must first import matplotlib using the code below. This imports the pyplot
module which does much of the basic plotting in matplotlib. While the plt alias is not required, it is a common convention
in the SciPy community and is highly recommended as it will save you a considerable amount of typing. You may
sometimes also see a %matplotlib inline line. This used to be required to ensure the plots appeared in the
notebook but is now typically not necessary.
import matplotlib.pyplot as plt
In all the examples below, simply calling a plotting function in a Jupyter notebook will automatically make the plot appear
in the notebook below the plotting function. However, if you choose to use matplotlib in some other environment, it is
often necessary to also execute the following plt.show() function to make the plot appear. This can also be done in
Jupyter, but it is not shown in the rest of this chapter as Jupyter does not require it.
plt.show()
Before creating our first plot, we need some data to plot, so we will generate data points from orbital radial wave functions.
The following equation defines the wave function (𝜓) for the 3s atomic orbital of hydrogen with respect to atomic radius
(𝑟) in Bohrs (𝑎0 ).
$$\psi_{3s} = \frac{2}{27}\sqrt{3}\left(\frac{2r^{2}}{9} - 2r + 3\right)e^{-r/3}$$
We will generate points on this curve using a method called list comprehension covered in section 2.1.2. In the examples
below, r is the distance from the nucleus and psi_3s is the wave function. If you choose to plot something else, just
make two lists or tuples of the same length containing the 𝑥- and 𝑦-values.
# create Python function for generating 3s radial wave function
import math
def orbital_3S(r):
    # polynomial term is 2r^2/9 - 2r + 3, matching the equation above
    wf = (2/27)*math.sqrt(3)*(2*r**2/9 - 2*r + 3)*math.exp(-r/3)
    return wf
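The list comprehension step that builds the data is not shown, so a minimal sketch follows. It writes the wave function directly from the equation above, and the range of radii (0 through 35 Bohrs in steps of 1) is an assumption.

```python
import math

def orbital_3S(r):
    # 3s radial wave function following the equation above
    return (2/27) * math.sqrt(3) * (2*r**2/9 - 2*r + 3) * math.exp(-r/3)

# hypothetical radii in Bohrs
r = list(range(36))

# list comprehension: evaluate the wave function at every radius
psi_3s = [orbital_3S(x) for x in r]

print(len(r), len(psi_3s))
print(round(psi_3s[0], 4))
```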
To visualize the 3s wave functions, we will call the plot() function, which is a general-purpose function for plotting.
The r and psi_3s data are fed into it as positional arguments as the 𝑥- and 𝑦-variables, respectively.
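The plotting call described here can be sketched as below. The data are regenerated so the block runs on its own, and the radii are the same hypothetical 0 through 35 Bohr range.

```python
import math
import matplotlib.pyplot as plt

def orbital_3S(r):
    # 3s radial wave function following the equation above
    return (2/27) * math.sqrt(3) * (2*r**2/9 - 2*r + 3) * math.exp(-r/3)

r = list(range(36))                      # hypothetical radii in Bohrs
psi_3s = [orbital_3S(x) for x in r]

# x-values first, y-values second; plot() returns a list of Line2D objects
lines = plt.plot(r, psi_3s)
```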
Tip
You may have noticed a line of text above the plot that looks something like [<matplotlib.lines.Line2D
at 0x7f83318383a0>]. If it bothers you, you can suppress it by either ending the line of code with a
semicolon (;) or adding a line with plt.show().
By default, the plot() function connects the data points with a solid line in blue, the first color in matplotlib's default
color cycle. This can be modified if the default line is not to your taste. If the plot() function is only provided a single
argument, matplotlib assumes the data are the 𝑦-values and plots them against their indices.
To change the color and markers, you can add a few extra arguments: marker, linestyle, and color. All of
these keyword arguments take strings. The marker argument allows the user to choose from a list of markers (Table
1). The linestyle argument (Table 2) determines if a line is solid or the type of dashing that occurs, and the color
argument (Table 3) allows the user to dictate the color of the line/markers. If an empty string is provided to linestyle
or marker, no line or marker, respectively, is included in the plot. See the matplotlib website for a more complete list
of styles.
Table 1 Common Matplotlib Marker Styles
Argument    Description
'o'         circle
'*'         star
'p'         pentagon
'^'         triangle
's'         square
Table 2 Common Matplotlib Line Styles
Argument    Description
'-'         solid
'--'        dashed
'-.'        dash-dot
':'         dotted
Table 3 Common Matplotlib Colors
Argument    Description
'b'         blue
'r'         red
'k'         black (key)
'g'         green
'm'         magenta
'c'         cyan
'y'         yellow
There are numerous other arguments that can be placed in the plot command. A few common, useful ones are shown
below in Table 4.
Table 4 A Few Common plot Keyword Arguments
Argument                  Description
linestyle or ls           line style
marker                    marker style
linewidth or lw           line width
color or c                line color
markeredgecolor or mec    marker edge color
markerfacecolor or mfc    marker face color
markersize or ms          marker size
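To illustrate the keyword arguments in the tables above, the sketch below draws the 3s wave function as a red dashed line with circle markers. The particular style values and the 0 through 35 Bohr radii are arbitrary choices for the example.

```python
import math
import matplotlib.pyplot as plt

def orbital_3S(r):
    # 3s radial wave function following the equation earlier in the chapter
    return (2/27) * math.sqrt(3) * (2*r**2/9 - 2*r + 3) * math.exp(-r/3)

r = list(range(36))                      # hypothetical radii in Bohrs
psi_3s = [orbital_3S(x) for x in r]

plt.figure()
# plot() returns a list; the trailing comma unpacks its single Line2D
line, = plt.plot(r, psi_3s, marker='o', linestyle='--', color='r',
                 markersize=4, linewidth=1)
```

The short aliases from Table 4 work the same way, for example ls='--', ms=4, and lw=1.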
Now that you have seen the keyword argument approach which allows for the fine-tuning of plots, there is also a shortcut
useful for basic plots. The plot function can take a third, positional argument which makes plotting a lot quicker. If
you provide a string containing a color, marker style, and/or line style, you can adjust the color, markers, and line without
the full keyword arguments. This approach does not allow the user as much control as the keyword arguments, but it is
popular because of its brevity.
# ro = red circle
plt.plot(r, psi_3s, 'ro');
# g.- = green solid line with dots along it
plt.plot(r, psi_3s, 'g.-');
3.1.3 Labels
It is often important to label the axes of your plot. This is accomplished using the plt.xlabel() and plt.ylabel()
functions, which are placed on separate lines from the plt.plot() function. Both functions take strings.
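A minimal sketch of labeling the axes is given here. The label text matches the figure that follows, while the data setup repeats the hypothetical 0 through 35 Bohr radii so the block runs on its own.

```python
import math
import matplotlib.pyplot as plt

def orbital_3S(r):
    # 3s radial wave function following the equation earlier in the chapter
    return (2/27) * math.sqrt(3) * (2*r**2/9 - 2*r + 3) * math.exp(-r/3)

r = list(range(36))                      # hypothetical radii in Bohrs
psi_3s = [orbital_3S(x) for x in r]

plt.figure()
plt.plot(r, psi_3s, 'o')
# each label goes on its own line after the plotting call
plt.xlabel('X Values')
plt.ylabel('Y Values')
```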
[Figure: plot of the 3s wave function with the axes labeled "X Values" and "Y Values"]
In the event you want a title at the top of your plots, you can add one using the plt.title() function. Symbols can be
added to the axis labels using LaTeX commands, which are used below, but discussion of LaTeX is beyond the scope
of this chapter.
[Figure: plot of the 3s wave function with the x-axis labeled "Radius, Bohrs" and the y-axis labeled "Wave Function, Ψ"]
Tip
There are times when you may want to reverse the direction of an axis so that the numbering runs from large to
small. Add the extra code lines plt.gca().invert_xaxis() and plt.gca().invert_yaxis() to
reverse the x-axis and y-axis, respectively. Alternatively, you can just specify your axis limits in the reverse order
using plt.xlim(nlarge, nsmall) and plt.ylim(nlarge, nsmall).
If you want to change the size or dimensions of the figure in the Jupyter notebook, this can be accomplished using
plt.figure(figsize=(width, height)), where the width and height are in inches. It is important that this function be
above the actual plotting function and not below for it to modify the figure.
plt.figure(figsize=(8,4))
plt.plot(r, psi_3s, 'go-')
plt.xlabel('Radius, Bohrs')
plt.ylabel('Wave Function, $\\Psi$')
plt.title('3S Radial Wave Function');
[Figure: 8 × 4 inch plot titled "3S Radial Wave Function" with the x-axis labeled "Radius, Bohrs" and the y-axis labeled "Wave Function, Ψ"]
A majority of matplotlib usage is to generate figures in a Jupyter notebook. However, there are times when it is necessary
to save the figures to files for a manuscript, report, or presentation. In these situations, you can save your plot using the
plt.savefig() function, which takes a few arguments. The first and only required argument is the name of the output
file as a string. Following this, the user can also choose the resolution in dots per inch using the dpi keyword argument.
Finally, a number of file formats are supported by the plt.savefig() function, including PNG, TIF, JPG,
PDF, and SVG, among others. The format can be selected using the format argument, which also takes a string, and if no
format is explicitly chosen, matplotlib defaults to PNG.
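As a rough sketch of the arguments described above, the following saves a simple plot as a 300 dpi PNG in the current working directory. The data and the file name example_plot.png are placeholders chosen for illustration.

```python
import os
import matplotlib.pyplot as plt

# placeholder data for illustration
x = [0, 1, 2, 3]
y = [0, 1, 4, 9]
plt.plot(x, y, 'o-')

out_name = 'example_plot.png'    # hypothetical file name
plt.savefig(out_name, dpi=300, format='png')

saved = os.path.isfile(out_name)  # True if the file was written
```

In a notebook, plt.savefig() should be called in the same code cell as the plotting commands so that it acts on the figure just drawn.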
Note
If you do not see your output image file, be sure that you are looking in the current working directory, which is
likely the same folder as your Jupyter notebook. See section 2.4.1 for using the os module to change directories.
Matplotlib supports a wide variety of plotting types including scatter plots, bar plots, histograms, pie charts, stem plots,
and many others. A few of the most common ones are introduced below. For additional plotting types, see the matplotlib
website.
Bar plots, despite looking very different, are quite similar to scatter plots. They both show the same information except
that instead of the vertical position of a marker showing the magnitude of a 𝑦-value, it is represented by the height of a
bar. Bar plots are generated using the plt.bar() function. Similar to the plt.plot() function, plt.bar() takes
𝑥- and 𝑦-values as positional arguments, but unlike plt.plot(), it requires both: the first argument sets the bar
positions and the second sets the bar heights.
The atomic numbers (AN) for the first ten chemical elements are generated below using the list comprehension introduced in
section 2.1.2 and plotted against their molar masses (MW).
AN = [x + 1 for x in range(10)]
MW = [1.01, 4.00, 6.94, 9.01, 10.81, 12.01, 14.01, 16.00, 19.00, 20.18]
plt.bar(AN, MW)
plt.xlabel('Atomic Number')
plt.ylabel('Molar Mass, g/mol');
[Figure: bar plot; x-axis "Atomic Number", y-axis "Molar Mass, g/mol"]
The bar plot characteristics can be adjusted like most other types of plots in matplotlib. The main arguments you will
probably want to adjust are color and width, but some other arguments are provided in Table 5. The color arguments are
consistent with the plt.plot() colors from earlier. The error bar arguments can take either a single value to display
homogeneous error bars on all data points or a multi-element object (e.g., a list or tuple) containing the different
margins of uncertainty for each data point.
Table 5 A Few Common plot Keyword Arguments

Argument     Description
width        bar width
color        bar color
edgecolor    bar edge color
xerr         X error bar
yerr         Y error bar
capsize      caps on error bars
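A minimal sketch of several Table 5 arguments used together is shown below. The sample names, means, and uncertainties are hypothetical values chosen only to illustrate per-bar error bars.

```python
import matplotlib.pyplot as plt

# hypothetical replicate measurements, for illustration only
samples = ['A', 'B', 'C']
means = [2.1, 3.4, 2.8]
errors = [0.2, 0.3, 0.1]     # one uncertainty per bar

bars = plt.bar(samples, means, width=0.6, color='C0', edgecolor='k',
               yerr=errors, capsize=4)
plt.ylabel('Mean Value');
```

Passing a single number to yerr instead of a list would draw the same error bar on every bar.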
We have already generated scatter plots using the plt.plot() function, but they can also be created using the
plt.scatter() function. The latter is partially redundant, but unlike plt.plot(), plt.scatter() allows the size and
color to differ from marker to marker using the s= and c= keyword arguments, while the marker= argument sets the
shape for the whole set of markers.
See section 3.1.2 for a short list of some of the marker shapes and colors available. Links to more complete lists can be
found in the Further Reading section.
In the example below, we are loading the famous wine dataset that describes wine samples through a number of mea-
surements including alcohol content, magnesium levels, color, etc. For convenience, we will load the dataset using the
scikit-learn library introduced in section 13.2.2. We then plot it and include a third attribute to the color c= argument.
[Figure: scatter plot with a color bar; x-axis "Alcohol Content", y-axis "Total Phenols"]
In the example above, the alcohol content is represented on the 𝑥-axis, the total phenols are represented on the 𝑦-axis, and the
proline content is shown using the color of the markers. The spectrum of colors that represent the values is called the
colormap, and this can be changed using an optional cmap= argument. See the matplotlib colormap page for a list of
available colormaps.
Tip
In the above example, the lighter colors represent the higher values while the darker colors represent the lower
values. If you want to reverse the order of the colors, just place _r at the end of the colormap name. For example,
cmap='viridis' becomes cmap='viridis_r'.
The plt.colorbar() function provides a guide to the meaning of the colors, but it would be nice to also have a text label
on the color bar just like the axes. This can be accomplished by assigning the color bar to a variable and then using its
set_label() method to add a label as demonstrated below.
cbar = plt.colorbar()
cbar.set_label('Proline Content');
[Figure: scatter plot with the color bar labeled "Proline Content"; x-axis "Alcohol Content", y-axis "Total Phenols"]
As an additional example, we can generate a plot of the number of neutrons versus the atomic number of each nuclide and
color the markers with the log of the half-life, in years, of each nuclide.
import numpy as np
nuc = np.genfromtxt('data/[Link]', delimiter=',', skip_header=1)
nuc
array([[ 0. , 1. , -4.71070897],
[ 0. , 4. , -29.25458877],
[ 1. , 2. , 1.09089879],
...,
[117. , 176. , -9.35267857],
[117. , 177. , -8.79123643],
[118. , 176. , -10.73537861]], shape=(2960, 3))
[Figure: scatter plot with the color bar labeled "log(half-life, yrs)"; x-axis "Atomic Number", y-axis "Number of Neutrons"]
One of the issues we encounter in the above plot is that the range of half-lives is large with relatively few points at the
extreme ends. We can see this in the histogram plot of these log half-life values shown below (see section 3.2.3).
[Figure: histogram; x-axis "Log Half-Life, yrs", y-axis "Counts"]
To prevent the few values at the extremes from effectively washing out the color and making it difficult to see the
differences, we can use the plt.scatter() arguments vmax= and vmin= to narrow the colormap range as shown
below. By doing this, any values above the vmax= value are drawn in the color at the top of the colormap, and any values
below the vmin= value are drawn in the color at the bottom.
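The effect of vmin= and vmax= can be sketched with synthetic data. The random values below are for illustration only and are not the nuclide dataset; the limits of -10 and 10 are likewise arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# synthetic data for illustration only (not the nuclide dataset)
rng = np.random.default_rng(0)
x = rng.uniform(0, 120, 200)
y = rng.uniform(0, 180, 200)
values = rng.normal(0, 10, 200)   # a few values fall far from zero

# colors are scaled between vmin and vmax; values outside are clipped
sc = plt.scatter(x, y, c=values, s=10, cmap='viridis', vmin=-10, vmax=10)
cbar = plt.colorbar()
cbar.set_label('Value');
```

Here sc.get_clim() reports the color limits actually in use, which can be handy when checking that the range was narrowed as intended.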
[Figure: scatter plot with the color bar limited to -10 to 10 and labeled "log(half-life, yrs)"; x-axis "Atomic Number", y-axis "Number of Neutrons"]
Histograms display bars representing the frequency of values in a particular dataset. Unlike bar plots, the width of the
bars in a histogram plot is meaningful as each bar represents the number of 𝑥-values that fall within a particular range. A
histogram plot can be generated using the plt.hist() function, which does two things. First, the function takes the data
provided and sorts them into equally spaced groups, called bins; and second, it plots the totals in each bin. For example,
we have a list, Cp, of specific heat capacities for various metals in J/g·°C, and we want to visualize the distribution of the
specific heat capacities.
[Figure: histogram; x-axis "Heat Capacity, J/g·°C", y-axis "Number of Metals"]
From the plot above, we can see that a large number of heat capacities reside in the area of 0.1-0.5 J/g·°C and none fall
in the 0.6-0.8 J/g·°C range.
The two main arguments for the plt.hist(data, bins=) function are data and bins. The bins argument
can be either a number of evenly spaced bins into which the data are sorted, like above, or a list of bin edges, like
below. The function automatically determines which you are providing based on your input.
[Figure: histogram drawn with explicit bin edges; x-axis "Heat Capacity, J/g·°C", y-axis "Number of Metals"]
Providing the histogram function with bin edges offers far more control to the user, but writing out a list can be tedious.
The histogram function also accepts bin edges as range() objects, but Python's built-in
range() function only generates values with integer steps. For non-integer steps, you can use list comprehension from
chapter 2 or NumPy's np.arange() function from section 4.1.3.
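A short sketch of generating bin edges with np.arange() follows. The data list is a set of hypothetical values for illustration, not the chapter's Cp list, and the stop value of 1.05 is chosen so that floating-point rounding cannot add an unwanted edge beyond 1.0.

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical values for illustration, not the chapter's Cp list
data = [0.13, 0.24, 0.39, 0.45, 0.91, 0.23, 0.14, 0.38]

# bin edges 0.0, 0.1, ..., 1.0; stop sits halfway past the last edge
edges = np.arange(0, 1.05, 0.1)

counts, bin_edges, patches = plt.hist(data, bins=edges, edgecolor='k')
plt.xlabel('Value')
plt.ylabel('Count');
```

The returned counts array holds the total in each bin, so it can be inspected directly when the plot alone is hard to read.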
There are a variety of other two-dimensional plotting types available in the matplotlib library including stem, step, pie,
polar, box plots, and contour plots. Below is a table of a few worth knowing about along with the code that created them.
See the matplotlib website for further details. Many Python library websites, including matplotlib’s, contain a gallery page
which showcases examples of what can be done with that library. It is recommended to browse these pages when learning
a new library.
import math
x = range(20)
y = [math.sin(num) for num in x]
plt.stem(x, y)
plt.title('Sine Wave');
[Figure: plot titled "Sine Wave"]
AN = range(1, 11)
mass_avg = [1.01, 4.00, 6.94, 9.01,
            10.81, 12.01, 14.01, 16.00, 19.00,
            20.18]
plt.step(AN, mass_avg)
plt.title('Average Atomic Mass')
plt.xlabel('Atomic Number')
plt.ylabel('Average Atomic Mass');
[Figure: average atomic mass plotted against atomic number; x-axis "Atomic Number"]
labels = ['Solids', 'Liquids', 'Gases']
percents = (85.6, 2.2, 12.2)
plt.title('Naturally Occurring Elements')
plt.pie(percents, labels=labels,
        explode=(0, 0.2, 0))
plt.axis('equal');
[Figure: pie chart with slices labeled "Solids", "Gases", and "Liquids"]
import math
import numpy as np
theta = np.arange(0, 2 * math.pi, 0.01)  # angles in radians, one full turn
r = [abs(math.sqrt(5 / (16 * math.pi)) *
         (3 * math.cos(num)**2 - 1)) for num in theta]
plt.polar(theta, r)
plt.title(r'$d_{z^2} \,$' + 'Orbital');
[Figure: polar plot titled "d_z² Orbital"]
It is often necessary to plot more than one set of data on the same axes, and this can be accomplished in two ways with
matplotlib. The first is to call the plotting function twice in the same Jupyter code cell. Matplotlib will automatically
place both plots in the same figure and scale it appropriately to include all data. Below, data for the wave function of the
3p hydrogen orbital are generated in the same way as for the 3s orbital earlier, so the wave functions for both the 3s and
3p orbitals can be plotted on the same set of axes.
Tip
Here we are using more data points to visualize the orbital radial functions because more points give a smoother
plot.
def orbital_3P(r):
    wf = (math.sqrt(6) * r * (4 - (2/3) * r) * math.e**(-r/3))/81
    return wf
plt.plot(r, psi_3s)
plt.plot(r, psi_3p)
plt.xlabel('Radius, Bohrs')
plt.ylabel('Wave Function');
[Figure: 3s and 3p wave functions on the same axes; x-axis "Radius, Bohrs", y-axis "Wave Function"]
The second approach is to include both sets of data in the same plotting command as is shown below. Matplotlib will
assume that each new pair of positional data arguments is a new set of data and that any format string applies to the
data immediately preceding it.
[Figure: 3s and 3p wave functions plotted with different markers; x-axis "Radius, Bohrs", y-axis "Wave Function"]
In the second plot above, r, psi_3s, 'bo' are the data and style for the first set of data while r, psi_3p,'r^' are
the data and plotting style for the second.
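The single-command approach described above can be sketched as follows. The radius spacing is chosen for illustration, and the 3s expression is the standard hydrogen 3s radial function in atomic units, which is an assumption about how psi_3s was generated earlier in the chapter.

```python
import numpy as np
import matplotlib.pyplot as plt

r = np.arange(0, 36, 0.5)   # radii in Bohrs; spacing chosen for illustration

# standard hydrogen 3s radial function (assumed form of the earlier psi_3s)
psi_3s = (2 / 27) * np.sqrt(3) * (2 * r**2 / 9 - 2 * r + 3) * np.exp(-r / 3)
# 3p radial function, same expression as orbital_3P() in the text
psi_3p = (np.sqrt(6) * r * (4 - (2 / 3) * r) * np.exp(-r / 3)) / 81

# each (x, y, format) triplet is treated as one dataset
lines = plt.plot(r, psi_3s, 'bo', r, psi_3p, 'r^')
plt.xlabel('Radius, Bohrs')
plt.ylabel('Wave Function');
```

The call returns one line object per dataset, so lines holds two entries here.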
One issue that quickly arises with multifigure plots is identifying which symbols belong to which data. Matplotlib allows
the user to add a legend to the plot. The user first needs to provide a label for each dataset using the label= keyword
argument. Then, calling plt.legend() causes the labels to be displayed on the plot. The default is for matplotlib to
place the legend where it decides is the optimal location, but this behavior can be overridden by adding a loc= keyword
argument. A complete list of location arguments is available on the matplotlib website.
It would also be helpful to include a horizontal line at zero as a guide to the eye. Matplotlib includes a plt.hlines(y,
xmin, xmax) function for just this purpose, and this function takes similar arguments for color and style.
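A minimal sketch combining label=, plt.legend(), and plt.hlines() is given below. As before, the 3s expression is the standard hydrogen 3s radial function and is an assumption about the earlier psi_3s data; the legend location and line color are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

r = np.arange(0, 36, 0.1)   # radii in Bohrs; spacing chosen for illustration

# standard hydrogen 3s radial function (assumed form of the earlier psi_3s)
psi_3s = (2 / 27) * np.sqrt(3) * (2 * r**2 / 9 - 2 * r + 3) * np.exp(-r / 3)
psi_3p = (np.sqrt(6) * r * (4 - (2 / 3) * r) * np.exp(-r / 3)) / 81

plt.plot(r, psi_3s, label='3s orbital')
plt.plot(r, psi_3p, label='3p orbital')
plt.hlines(0, 0, 35, linestyle='dashed', color='C7')  # guide line at zero
legend = plt.legend(loc='upper right')
plt.xlabel('Radius, Bohrs')
plt.ylabel('Wave Function');
```

Because the horizontal line is given no label, it does not appear in the legend.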
[Figure: 3s and 3p wave functions with a legend ("3s orbital", "3p orbital"); x-axis "Radius, Bohrs", y-axis "Wave Function"]
To generate multiple, independent plots in the same figure, a few more lines of code are required to describe the dimensions
of the figure and which plot goes where. Once you get used to it, it is fairly logical. There are two general methods for
generating multifigure plots outlined below. The first is a little quicker, but the second is certainly more powerful and
gives the user access to extra features. Whichever method you choose to adopt, just be aware that you will likely see the
other method at times as both are common.
In the first method, we first need to generate the figure using the plt.figure() command. For every subplot, we then
need to call plt.subplot(rows, columns, plot_number). The first two values are the number of rows
and columns in the figure, and the third number is which subplot you are referring to. For example, we will generate a
figure with two plots side-by-side. This is a one-by-two figure (i.e., one row and two columns). Therefore, all subplots
will be defined using plt.subplot(1, 2, plot_number). The plot_number indicates the subplot with the
first subplot being 1 and the second subplot being 2. The numbering always runs left-to-right and top-to-bottom.
plt.figure()
[Figure: side-by-side plots titled "3s Orbital" and "3p Orbital"; x-axes "Radius, Bohrs"]
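The one-by-two layout described above can be sketched as follows. The 3s expression is the standard hydrogen 3s radial function and is an assumption about the earlier psi_3s data.

```python
import numpy as np
import matplotlib.pyplot as plt

r = np.arange(0, 36, 0.1)   # radii in Bohrs; spacing chosen for illustration

# standard hydrogen 3s radial function (assumed form of the earlier psi_3s)
psi_3s = (2 / 27) * np.sqrt(3) * (2 * r**2 / 9 - 2 * r + 3) * np.exp(-r / 3)
psi_3p = (np.sqrt(6) * r * (4 - (2 / 3) * r) * np.exp(-r / 3)) / 81

plt.figure()

plt.subplot(1, 2, 1)        # one row, two columns, first subplot
plt.plot(r, psi_3s)
plt.title('3s Orbital')
plt.xlabel('Radius, Bohrs')

plt.subplot(1, 2, 2)        # one row, two columns, second subplot
plt.plot(r, psi_3p)
plt.title('3p Orbital')
plt.xlabel('Radius, Bohrs')

n_axes = len(plt.gcf().axes)   # number of subplots in the current figure
```

Every command after a plt.subplot() call acts on that subplot until the next plt.subplot() call.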
If you don’t like the dimensions of your plot, you can still change them using a figsize=(width, height)
argument in the figure() function like the following.
plt.figure(figsize=(12,4))
[Figure: wider side-by-side plots titled "3s Orbital" and "3p Orbital"; x-axes "Radius, Bohrs"]
The values in the plt.subplot() command may seem redundant. Why are the dimensions for the figure repeatedly
defined instead of just once? The answer is that subplots with different dimensions can be created in the same figure
(Figure 1). In this example, the top subplot dimension is created as if it were the first subplot in a 2 × 1 figure. The
bottom two subplot dimensions are created as if they are the third and fourth subplots in a 2 × 2 figure.
Figure 1 Multifigure plots with subplots of different dimensions (right) describe each subplot dimension as if it were part
of a plot with equally sized subplots (left).
In the following example, dihedral angle data from a hydrogenase enzyme reported in Nat. Chem. Biol. 2016, 12, 46-50
are imported and displayed. The top plot shows the relationship between the psi (𝜓) and phi (𝜙) angles while the bottom
two plots show the distribution of psi and phi angles using histogram plots.
rama = np.genfromtxt('data/hydrogenase_5a4m_phipsi.csv',
                     delimiter=',', skip_header=1)
psi = rama[:,0]
phi = rama[:,1]
plt.figure(figsize=(10,8))
plt.subplot(2,1,1)
plt.plot(phi, psi, '.', markersize=8)
plt.xlim(-180, 180)
plt.ylim(-180, 180)
plt.xlabel('$\\phi, degrees$', fontsize=15)
plt.ylabel('$\\psi, degrees$', fontsize=15)
plt.subplot(2,2,3)
plt.hist(phi[1:], edgecolor='k')
plt.xlabel('$\\phi, degrees$')
plt.ylabel('Count')
plt.title('$\\phi , Angles$')
plt.subplot(2,2,4)
plt.hist(psi[:-1], edgecolor='k')
plt.xlabel('$\\psi, degrees$')
plt.ylabel('Count')
plt.title('$\\psi , Angles$')
plt.tight_layout();
[Figure: "Ramachandran Plot" of 𝜓 versus 𝜙 (degrees) above two histograms of the 𝜙 and 𝜓 angles (Count versus degrees)]
Tip
There are times when the titles and axis labels for multiple subplots will inadvertently overlap. If this happens,
simply add plt.tight_layout() at the very end to fix this.
The second method is somewhat similar to the first except that it more explicitly creates and links subplots, called axes.
To create a figure with subplots, we first need to generate the overall figure using the plt.figure() command again,
and we also need to attach it to a variable so that we can explicitly assign axes to it. To create each subplot, use the
add_subplot(rows, columns, plot_number) command. The arguments in the add_subplot() command
are the same as for plt.subplot() in section 3.4.1. After an axis has been created as part of the figure, call your
plotting function preceded by the axis variable name as demonstrated below.
One noticeable difference in this method is that the functions for customizing the plots are typically preceded with set_
such as set_title(), set_xlim(), or set_ylabel().
fig = plt.figure(figsize=(8,6))
ax1 = fig.add_subplot(2,1,1)
ax1.plot(r, psi_3s)
ax1.hlines(0, 0, 35, linestyle='dashed', color='C1')
ax1.set_title('3s Orbital')
ax1.set_xlabel('Radius, $a_u$')
ax2 = fig.add_subplot(2,1,2)
ax2.plot(r, psi_3p)
ax2.hlines(0, 0, 35, linestyle='dashed', color='C1')
ax2.set_title('3p Orbital')
ax2.set_xlabel('Radius, $a_u$')
plt.tight_layout();
[Figure: stacked plots titled "3s Orbital" and "3p Orbital"; x-axes "Radius, a_u"]
To plot in 3D, we will use the approach outlined in section 3.4.2 with two additions. First, add
from mpl_toolkits.mplot3d import Axes3D as shown below. Second, make the plot 3D by adding projection='3d' to the
add_subplot() command. After that, it is analogous to the two-dimensional plots above except 𝑥, 𝑦, and 𝑧 data are
provided.
In the following example, we will import 𝑥𝑦𝑧-coordinates for a C60 buckyball molecule and plot the carbon atom positions
in 3D.
ax = fig.add_subplot(1,1,1, projection='3d')
ax.plot(x, y, z, 'o')
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis');
[Figure: 3D scatter plot of the C60 carbon atom positions; axes "X axis", "Y axis", "Z axis"]
The above 3D plot is simply a scatter plot in a three-dimensional space, but it is often useful to connect these points to
describe surfaces in 3D space, which can be used for displaying energy surfaces, chemical spectra, or atomic orbital shapes
among other applications. We again will import Axes3D from mpl_toolkits.mplot3d as we did in section 3.5.
The choice of matplotlib function below depends not only on what you want your surface to look like but also on the
format of the data. Specifically, your data may be in a grid or 𝑥𝑦𝑧 format. Both scenarios are addressed below.
If the height data are formatted as a grid, we will need to generate a mesh grid of the 𝑥- and 𝑦-axis locations to create a
surface plot. Mesh grids are simply the 𝑥- and 𝑦-axes values extended into a 2D array. An example is shown below where
the 𝑥- and 𝑦-axes are integers from 0 → 8. In the left grid, the values represent where each point is with respect to the
𝑥-axis, and the right grid is likewise where each point is located with respect to the 𝑦-axis.
We will use NumPy to generate these grids as NumPy arrays. If you have not yet seen NumPy, you can still follow along
in this example without understanding how arrays operate, or you can read chapter 4 and come back to this topic later.
For those who are familiar with NumPy, being that the two grids/arrays are of the same dimension, all math is done on
a position-by-position basis to generate a third array of the same dimensions as the first two. For example, if we were to
take the sum of the squares of the two grids above, we would get the following grid.
$$z = x^2 + y^2$$
Notice that each value on the 𝑧 grid is the sum of the squared values from the equivalent positions on the 𝑥 and 𝑦 grids,
so for example, the bottom left value is 64 because it is the sum of 64 and 0.
To generate mesh grids, we will use the np.meshgrid() function from NumPy. It requires the input of the desired
values from the 𝑥 and 𝑦 axes as a list, range object, or NumPy array. The output of the np.meshgrid() function is
two arrays, the 𝑥-grid and 𝑦-grid, respectively.
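A small sketch makes the grids concrete. It uses a 3 × 3 grid (integers 0 to 2) rather than the larger grid discussed above so the output is easy to read.

```python
import numpy as np

x = np.arange(3)          # 0, 1, 2
y = np.arange(3)
X, Y = np.meshgrid(x, y)  # X varies along each row, Y varies down each column
Z = X**2 + Y**2           # math is done position by position

print(X)
print(Y)
print(Z)
```

Each entry of Z is the sum of the squares of the entries at the same position in X and Y, which is the behavior the surface plotting functions rely on.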
import numpy as np
x = np.arange(-10, 10)
y = np.arange(-10, 10)
X, Y = np.meshgrid(x, y)
Z = 1 - X**2 - Y**2
Now to plot the surface. We will use the plot_surface() function which requires the X, Y, and Z mesh grids as
arguments. As an optional argument, you can designate a color map (cmap). Color maps are a series of colors or shades
of a color that represent values. The default for matplotlib is viridis, but you can change this to anything from a wide
selection of color maps provided by matplotlib. For more information on color maps, see the matplotlib website.
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1, projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis');
As a more chemical example, we can plot the standing waves for a 2D particle in a box by the following equation where
𝑛𝑥 and 𝑛𝑦 are the principal quantum numbers along each axis and 𝐿 is the length of the box.
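The standard normalized wave function for a particle in a two-dimensional square box of side 𝐿, assumed here to match the form used in this example, is

```latex
\psi(x, y) = \frac{2}{L} \sin\left(\frac{n_x \pi x}{L}\right) \sin\left(\frac{n_y \pi y}{L}\right)
```

With 𝐿 = 1 the prefactor is 2, so the wave function spans roughly -2 to 2.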
We will select 𝐿 = 1, 𝑛𝑥 = 2, and 𝑛𝑦 = 1. Again, a meshgrid is generated and a height value is calculated from the 𝑥- and
𝑦-values.
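L = 1
nx = 2
ny = 1
x = np.linspace(0, L, 20)
y = np.linspace(0, L, 20)
X, Y = np.meshgrid(x, y)
# height from the standard 2D particle-in-a-box wave function
Z = (2 / L) * np.sin(nx * np.pi * X / L) * np.sin(ny * np.pi * Y / L)
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis');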
You are encouraged to increase the values for 𝑛𝑥 and 𝑛𝑦 and see how the surface plot changes.
Alternatively, a surface can be represented with a wireframe using the plot_wireframe() method of the 3D axis, which
operates similarly to the plot_surface() method.
fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(111, projection='3d')
ax.plot_wireframe(X, Y, Z, linewidths=1.5, colors='royalblue');
If the data are formatted as three columns containing 𝑥, 𝑦, and 𝑧 values, matplotlib provides a triangulated surface
method, plot_trisurf(), that can work with these data. Because the function cannot guarantee that the data points
are arranged in rectangular grids, the surface mesh is instead composed of triangular faces. The function takes the 𝑥, 𝑦,
and 𝑧 values as the required arguments. As an example, the data from the above standing wave are repacked below as a
series of 𝑥𝑦𝑧 vector coordinates and plotted using plot_trisurf().
wave_xyz = np.array(wave_xyz)
fig = plt.figure(figsize=(14,6))
ax = fig.add_subplot(1,1,1, projection='3d')
ax.plot_trisurf(x, y, z, cmap='viridis')
# adjusts view
ax.view_init(azim=60, elev=30)
# prevents z label from being cut off
ax.set_box_aspect(aspect=(1,1,1))
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis');
[Figure: triangulated 3D surface of the standing wave; axes "X axis", "Y axis", "Z axis"]
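One way to repack gridded data into 𝑥𝑦𝑧 columns is sketched below. The use of np.column_stack() and ravel() is an illustrative choice rather than necessarily the approach used for the figure above, and the wave function is the standard 2D particle-in-a-box form, which is assumed here.

```python
import numpy as np
import matplotlib.pyplot as plt

L, nx, ny = 1, 2, 1
grid = np.linspace(0, L, 20)
X, Y = np.meshgrid(grid, grid)
# standard 2D particle-in-a-box wave function (assumed form)
Z = (2 / L) * np.sin(nx * np.pi * X / L) * np.sin(ny * np.pi * Y / L)

# flatten each grid into one column of an (N, 3) array of xyz points
wave_xyz = np.column_stack((X.ravel(), Y.ravel(), Z.ravel()))
x, y, z = wave_xyz[:, 0], wave_xyz[:, 1], wave_xyz[:, 2]

fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(1, 1, 1, projection='3d')
surf = ax.plot_trisurf(x, y, z, cmap='viridis')
```

The same three columns could equally come from a file of scattered measurements, which is the case plot_trisurf() is designed for.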
3.6.3 3D Surfaces
Matplotlib supports the ability to plot 3D surfaces and wireframes, which is useful for molecular orbitals among other
applications. We will start with a basic sphere and then morph it into the angular component of an atomic orbital. We are
going to again use the plot_surface() and plot_wireframe() methods, so we first need a mesh
grid from the np.meshgrid() function to yield the theta (𝜃) and phi (𝜙) values. There are multiple conventions for
these angles, but here we will follow the SciPy convention, which treats phi as the azimuthal angle (i.e., direction on the
xy-plane) and theta as the polar angle (i.e., angle off the positive z-axis). The values for phi do a full circle, ranging from 0
→ 2𝜋, while theta swings from the north pole to the south pole, ranging from 0 → 𝜋. These angles are then converted
to xyz-coordinates using the trigonometric equations shown below. In this example, we are plotting a unit sphere, so r =
1. Finally, the x, y, and z values are provided to either plot_surface() or plot_wireframe()
to plot a sphere. It is important here to set the aspect ratio to equal using ax.set_aspect('equal') so
that equal changes in value are represented with equal distances along all axes. Otherwise, the z-axis will be compressed,
making the sphere look squished or oblate.
$$x = r \sin(\theta) \sin(\phi)$$
$$y = r \sin(\theta) \cos(\phi)$$
$$z = r \cos(\theta)$$
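The conversion above can be sketched as follows. The number of points along each angle (40 and 80) is an arbitrary choice for illustration.

```python
import numpy as np

theta = np.linspace(0, np.pi, 40)       # polar angle, north to south pole
phi = np.linspace(0, 2 * np.pi, 80)     # azimuthal angle, full circle
THETA, PHI = np.meshgrid(theta, phi)

r = 1                                   # unit sphere
x = r * np.sin(THETA) * np.sin(PHI)
y = r * np.sin(THETA) * np.cos(PHI)
z = r * np.cos(THETA)
```

Every point generated this way sits at a distance of 1 from the origin, which is a quick check that the conversion was done correctly.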
# plotting
fig = plt.figure(figsize = (10, 6))
ax = fig.add_subplot(1, 1, 1, projection='3d')
ax.plot_surface(x, y, z, cmap='viridis')
ax.set_aspect('equal') # sets aspect ratio to equal
To plot orbital angular components, we can modify or warp our sphere by multiplying the xyz-coordinates by the orbital’s
angular wave function. We are essentially changing the radius at different angles in the trigonometric equations above.
For example, below is the angular wave function for the 𝑑𝑧2 orbital.
$$Y_{d_{z^2}} = \left(\frac{5}{16\pi}\right)^{1/2} \left(3\cos^2\theta - 1\right)$$
Warning
A plot of the angular wave function does not include the radial information, so it does not fully describe the shape
of atomic orbitals. Do not interpret the angular plots below as the actual shapes of atomic orbitals.
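A minimal sketch of warping the sphere with the 𝑑𝑧2 angular function is given below. The grid sizes and the variable names Y_dz2 and R are illustrative choices, and the absolute value is taken so the angular function can serve as a non-negative radius.

```python
import numpy as np

theta = np.linspace(0, np.pi, 40)       # polar angle
phi = np.linspace(0, 2 * np.pi, 80)     # azimuthal angle
THETA, PHI = np.meshgrid(theta, phi)

# angular wave function for d_z2; abs() keeps the radius non-negative
Y_dz2 = np.sqrt(5 / (16 * np.pi)) * (3 * np.cos(THETA)**2 - 1)
R = np.abs(Y_dz2)

# same conversion as for the sphere, with R replacing the fixed radius
X = R * np.sin(THETA) * np.sin(PHI)
Y = R * np.sin(THETA) * np.cos(PHI)
Z = R * np.cos(THETA)
```

The resulting X, Y, and Z grids can be passed to plot_surface() or plot_wireframe() in the same way as the sphere coordinates.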
# plotting
fig = plt.figure(figsize = (10, 6))
ax = fig.add_subplot(1, 1, 1, projection='3d')
While the angular wave functions can be coded manually, the SciPy library includes a spherical harmonics
sph_harm_y(l, m, theta, phi) function that will calculate the angular wave function for any combination of
the angular (𝑙) and magnetic (𝑚𝑙 ) quantum numbers. We only want the positive, real results, so we will take the absolute
value of the real component.
Note
For 𝑚𝑙 values other than zero, you can select only the real or only the imaginary component to plot by taking the
absolute value of the .real or .imag attribute of the sph_harm_y() output.
# plotting
fig = plt.figure(figsize = (10, 6))
ax = fig.add_subplot(1, 1, 1, projection='3d')
ax.plot_wireframe(X, Y, Z, colors='royalblue')
ax.set_aspect('equal') # sets aspect ratio to equal
ax.set_axis_off() # turns off axes and background
There are times when it is useful to represent 3D data on a 2D surface, requiring the third dimension to be represented by
color or contour lines. This can be useful for representing an energy surface, 3D fluorescence spectra, where the 𝑥- and
𝑦-axes are absorption and emission wavelengths, or 2D NMR spectra. This section demonstrates a number of plotting
functions in matplotlib to generate 2D histograms and contour plots.
3.7.1 2D Histograms
The first plot we will cover is the 2D histogram. This is similar to the standard histogram except that the bins are 2D and
the quantity in a bin is represented by color instead of a bar height. There are two functions available in matplotlib for
this task listed below. Each of these functions requires the 𝑥- and 𝑦-coordinates as the two required arguments, and like
the previously seen histogram function, these functions total the counts in each bin for the user. For this example, we will
again use the Ramachandran data from section 3.4.1.
plt.hist2d(x, y)
plt.hexbin(x, y)
The plt.hist2d() function, like the regular histogram function, can accept additional arguments such as the num-
ber or position of the bins (bins=) or minimum or maximum values for bins to be displayed (cmin= and cmax=,
respectively). In the example below, there are 50 bins on each axis, and any bin with fewer than 1 count is not displayed.
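The arguments described above can be sketched with synthetic data. The random angles below are for illustration only and are not the hydrogenase dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# synthetic angles for illustration only (not the hydrogenase data)
rng = np.random.default_rng(1)
phi = rng.uniform(-180, 180, 2000)
psi = rng.uniform(-180, 180, 2000)

# 50 bins per axis; bins with fewer than 1 count are left blank
counts, xedges, yedges, image = plt.hist2d(phi, psi, bins=50, cmin=1)
plt.colorbar()
plt.xlabel('Phi, degrees')
plt.ylabel('Psi, degrees');
```

The function returns the 2D array of counts along with the bin edges on each axis, so the binned data can be reused without recomputing them.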
[Figure: 2D histogram with a color bar for counts; x-axis "Phi, degrees", y-axis "Psi, degrees"]
The plt.hexbin() function in its basic form is like the plt.hist2d() function except that the bins are hexagons
instead of rectangles.
plt.hexbin(phi, psi, gridsize=50, vmax=10)
plt.xlabel('Phi, degrees')
plt.ylabel('Psi, degrees');
[Figure: hexagonal-bin plot of the dihedral angles; y-axis "Psi, degrees"]
We will next look at contour plots, which show the 𝑧 values using color or lines. When lines are used, this is similar to
a topographic map where the closer the lines, the steeper the change in 𝑧 values. The lines are also colored to show the
values. Like plotting 3D surfaces in section 3.6, the data may be represented as either three grids or a series of 𝑥𝑦𝑧 values.
For our gridded example, we will again visualize our standing wave function from section 3.6. The plt.contour()
function accepts x, y, and z grids as the required arguments, but it can also accept the number of levels (levels=) and a
colormap (cmap=).
L = 1
nx = 2
ny = 1
x = np.linspace(0, L, 20)
y = np.linspace(0, L, 20)
X, Y = np.meshgrid(x, y)
We can also generate a contour plot where the space between the lines is filled using the plt.contourf() function.
The “f” is for “filled”.
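The line and filled variants can be compared side by side in a short sketch. The wave function is the standard 2D particle-in-a-box form, which is assumed here, and the choice of 10 levels is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

L, nx, ny = 1, 2, 1
x = np.linspace(0, L, 20)
y = np.linspace(0, L, 20)
X, Y = np.meshgrid(x, y)
# standard 2D particle-in-a-box wave function (assumed form)
Z = (2 / L) * np.sin(nx * np.pi * X / L) * np.sin(ny * np.pi * Y / L)

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
lines = plt.contour(X, Y, Z, levels=10, cmap='viridis')    # contour lines

plt.subplot(1, 2, 2)
filled = plt.contourf(X, Y, Z, levels=10, cmap='viridis')  # filled contours
plt.colorbar();
```

When levels= is an integer, matplotlib treats it as a target and picks convenient level values, so the number of contours drawn may differ slightly from the number requested.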
If the data are in 𝑥𝑦𝑧 coordinate format, we will instead use the plt.tricontour() or plt.tricontourf()
functions as demonstrated below with COSY NMR data of quinine.
[Figure: contour plot of the COSY NMR data; both axes in ppm]
plt.tricontourf(x, y, z, levels=200, vmax=0.05, cmap='Blues')
plt.gca().invert_xaxis()
plt.gca().invert_yaxis()
plt.xlabel('ppm')
plt.ylabel('ppm');
[Figure: contour plot of the COSY NMR data with inverted axes; both axes in ppm]
As a final example, it is possible to merge a contour plot with a line plot. This is useful for representing 2D NMR spectra
such as COSY NMR, where the COSY NMR data are represented by the contour plot while the ¹H NMR spectrum is
located on the margins of the contour plot. Below, a function plot_2d_nmr() is defined
to generate such a plot.
[Figure: COSY contour plot with the ¹H NMR spectrum along the margins; both axes in ppm]
Further Reading
The matplotlib website is an excellent place to learn more about plotting in Python. Similar to some other Python library
websites, there is a gallery page that showcases many of the capabilities of the matplotlib library. It is often worth browsing
to get ideas and a sense of what the library can do. The matplotlib website also provides free cheat sheets summarizing
key features and functions.
1. Matplotlib Website. [Link] (free resource)
2. Matplotlib Cheatsheets. [Link] (free resource)
3. VanderPlas, J. Python Data Science Handbook: Essential Tools for Working with Data, 1st ed.;
O’Reilly: Sebastopol, CA, 2017, chapter 4. Freely available from the author at [Link]
PythonDataScienceHandbook/ (free resource)
Exercises
Complete the following exercises in a Jupyter notebook using the matplotlib library and be sure to label axes and include
units when appropriate. Any data file(s) referred to in the problems can be found in the data folder in the same directory
as this chapter’s Jupyter notebook. Alternatively, you can download a zip file of the data for this chapter from here by
selecting the appropriate chapter file and then clicking the Download button.
1. Visualize the relationship between pressure and volume for 1.00 mol of He(g) at 298 K in an expandable vessel as
the volume increases from 1 L → 20 L. R = 0.08206 L·atm/mol·K. This will require you to generate values and perform the
calculation using the equation below.
$$PV = nRT$$
2. Plot the electronegativity versus atomic number for the first five halogens, and make the size or color of the markers
based on the atomic radii of the element. You will need to look up the values which should be available in most
general chemistry textbooks. If you do not have one available, you can also find these values in the free, open
chemistry textbook available on OpenStax among other online resources.
3. The following functions are an example of the sandwich theorem which aids in determining limits of function 𝑔(𝑥)
by knowing its range is between 𝑓(𝑥) and ℎ(𝑥) in the relevant domain. Plot all three functions on the same axes to
show that f(x) ≤ g(x) ≤ h(x) for x of -50 → 50. Be sure to include a legend.
4. Plot the concentration of A with respect to time for the following elementary step if 𝑘 = 0.12 M⁻¹ s⁻¹ using the
appropriate integrated rate law.
2𝐴 → 𝑃
5. Import the gc_trace.csv file containing a gas chromatography (GC) trace and plot the intensity (y-axis) versus time
(x-axis) using a line plot. Be sure to label the axes.
6. Import the mass spectra file ms_bromobenzene.csv and visualize it using a stem plot where m/z is on the x-axis
and intensity is on the y-axis. Hint: the dots on the top of the lines can be removed using markerfmt='None'.
7. Earth’s atmosphere is composed of 78% N2, 21% O2, and 1% other gases. Represent this data with a pie chart,
and make the last 1% slice stick out of the pie like in section 3.2.4.
8. Create a histogram plot to examine the distribution of values generated below.
import random
rdn = [random.random() for value in range(1000)]
9. The ¹H NMR spectrum of caffeine in CDCl3 is composed of four singlets with the following chemical shifts and
relative intensities. Visualize this data using a stem plot. Hint: the dots on the top of the lines can be removed using
markerfmt='None'.
Scientific Computing for Chemists with Python
10. The following table presents the calculated free energies for each step in the binding and splitting of H2(g) by a
nickel phosphine catalyst. Visualize the energies over the course of the reaction using a plotting type other than a
line or scatter plot. Data from Inorg. Chem. 2016, 55, 445−460.
11. Generate two side-by-side plots that show the atomic radii and first ionization energies versus atomic number for
the first ten elements on the periodic table. This data should be available on the internet or any general chemistry
textbook, including OpenStax in the periodic trends chapter. Include titles on both plots along with appropriate
axis labels.
12. Generate a standing wave surface plot (similar to the one at the end of section 3.6) using the following equation
and parameters: 𝐿 = 1, 𝑛_𝑥 = 2, 𝑛_𝑦 = 2.
13. Load the amine_bp.csv file in the data folder which contains the boiling points of primary, secondary, and tertiary
amines and the number of carbons in each amine. Plot the boiling point (𝑥-axis) versus number of carbons (𝑦-axis)
for each degree of amine. Your plot should have three distinct trends, one for each degree, represented both in
different colors and with different markers. Include a legend on your plot indicating which data points represent
which degree of amine.
14. Visualize the angular component of a d-orbital other than 𝑑𝑧2 and identify which d-orbital you visualized the angular
component for. You will need to find a table of the real components of spherical harmonics for this task.
CHAPTER 4: NUMPY
NumPy is a popular library in the Python ecosystem and a critical component of the SciPy stack, so much so that NumPy
is even included in Apple’s default installation of Python and in other Python-powered applications such as Blender. While
it may be tempting to work with NumPy’s objects as lists or to circumvent the NumPy library altogether, the time it
takes to learn NumPy’s powerful features is well worth it! It will often allow you to solve problems with less effort and
time and with shorter and faster-executing code. This is due to:
• NumPy automatically propagating operations to all values in an array instead of requiring for loops
• A massive collection of functions for working with numerical data
• Many of NumPy’s functions are Python-wrapped C code, making them run faster
The NumPy package can be imported by import numpy, but the scientific Python community has developed an
unofficial, but strong, convention of importing NumPy using the np alias. It is a matter of personal preference whether to
use the alias or not, but it is strongly encouraged for consistency with the rest of the community. Instead of
numpy.function(), the function is then called by the shorter np.function(). All of the NumPy code in this and
subsequent chapters assumes the following import.
import numpy as np
One of the main contributions of NumPy is the ndarray (i.e., “n-dimensional array”), NumPy array, or just array for
short. This is an object similar to a list or nested list of lists except that mathematical operations and NumPy functions
automatically propagate to each element instead of requiring a for loop to iterate over it. Because of their power and
convenience, arrays are the default object type for any operation performed with NumPy and many scientific libraries that
are built on NumPy (e.g., SciPy, pandas, scikit-learn, etc.).
The NumPy array looks like a Python list wrapped in array(). It is an iterable object, so you could iterate over it using
a for loop if you really want to. However, because NumPy automatically propagates operations through the array, for
loops are typically unnecessary. For example, let us say you want to multiply a list of numbers by 2. Doing this with a
list would likely look like the following.
nums = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
for value in nums:
    print(2 * value)
0
2
4
6
8
10
12
14
16
18
In contrast, performing this same operation using a NumPy array only requires multiplying the array by 2.
[ 0 2 4 6 8 10 12 14 16 18]
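A minimal sketch of this operation, assuming the values are held in an array created with np.arange() (the names arr and doubled are illustrative):

```python
import numpy as np

# Assumed setup: an array holding the same values as the list above
arr = np.arange(10)

# Multiplying the array by a scalar multiplies every element; no loop is needed
doubled = 2 * arr
print(doubled)
```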
There are three common ways to generate a NumPy array that we will cover in the beginning of this chapter. The first is
simply to convert a list or tuple to an array using the np.array() function.
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] # list
arr = np.array(a)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
The fact that the object is a NumPy array is denoted by the array().
We can also create an array using NumPy sequence-generating functions. There are two common functions in NumPy
for this task: np.arange() and np.linspace(). The np.arange() function behaves similarly to the native
Python range() function with the key difference that it outputs an array. Another minor difference is that while
range() generates a range object, np.arange() generates a sequence of values immediately. The arguments for
np.arange() are similar to those of Python’s range() function where start is inclusive and stop is exclusive, but
unlike range(), the step size for np.arange() does not need to be an integer value.
The np.linspace() function is related to np.arange() except that instead of defining the sequence based on
step size, it generates a sequence based on how many evenly distributed points to generate in the given span of
numbers. Additionally, np.arange() excludes the stop value while np.linspace() includes it by default. The difference
between these two functions is somewhat subtle, and the use of one over the other often comes down to user preference
or convenience.
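The difference between the two functions is easiest to see side by side. The sketch below uses illustrative start, stop, and step values that are not taken from the text:

```python
import numpy as np

# Step-based sequence: start is inclusive, stop is exclusive
by_step = np.arange(0, 1, 0.25)

# Count-based sequence: five evenly spaced points, both endpoints included
by_count = np.linspace(0, 1, 5)

print(by_step)    # stops short of 1
print(by_count)   # ends exactly at 1
```

Here np.arange() yields four values because the stop value of 1 is left out, while np.linspace() yields the requested five values and ends on the stop value.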
Two other useful functions for generating arrays are np.zeros() and np.ones(), which generate arrays populated
with exclusively zeros and ones, respectively. The functions accept the shape argument as a tuple of the array dimensions
in the form (rows, columns) or as a single integer for a one-dimensional array.
np.zeros((2, 4))
np.ones(10)
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
You should commit to remembering np.arange() and np.linspace(), as these are used often. The
np.zeros() and np.ones() functions are not as common but are useful in particular applications. They can also be
used to generate arrays filled with other values. For example, to generate an array of threes, an array of zeros can be
generated and then incremented by 3.
[[3. 3. 3. 3.]
[3. 3. 3. 3.]]
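One way the array of threes above might be produced is sketched below, assuming a (2, 4) shape to match the output shown. The np.full() call is an alternative not mentioned in the text that fills an array with a chosen value directly.

```python
import numpy as np

# An array of zeros incremented by 3 (shape assumed from the output above)
threes = np.zeros((2, 4)) + 3
print(threes)

# Alternative: np.full() creates an array already filled with the given value
also_threes = np.full((2, 4), 3.0)
```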
A third approach is to generate an array from a function using np.fromfunction(), which generates an array of
values using the array indices as inputs. This function requires a function as an argument.
np.fromfunction(function, shape)
Let us make an array of the dimensions (3,3) where each element is the product of the row and column indices.
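A minimal sketch of this example follows. The helper name index_product is hypothetical; np.fromfunction() passes it arrays of row and column indices, which are floats by default.

```python
import numpy as np

def index_product(row, col):
    """Return the product of the row and column indices."""
    return row * col

# Each element is (row index) * (column index)
products = np.fromfunction(index_product, (3, 3))
print(products)
```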
Modifying the dimensions of one or more arrays is a common task in NumPy. This may involve changing the number
of columns and rows or merging multiple arrays into a larger array. The size and shape of an array are the number of
elements and dimensions, respectively. These can be determined using the size and shape attributes of a NumPy array.
array.size
array.shape
(2, 3)
The NumPy convention is to provide the dimensions of a two-dimensional array as (rows, columns).
The dimensions of arrays can be modified using the np.reshape() function. This function maintains the number of
elements and order of elements in the array but repacks them into a different number of rows and columns. Because the
number of elements is maintained, the new array size needs to be able to contain the same number of elements as the
original.
np.reshape(array, dimensions)
In this function, array is the NumPy array being reshaped and dimensions is a tuple containing the desired number
of rows and columns in that order. The original array must fit exactly into the new dimensions or else NumPy will raise
an error. This function does not change the original array in place but rather returns a reshaped array. This is a good
time to note that because this and many other NumPy functions are also methods of NumPy arrays, they can also be called by
listing the array up front like list and string methods presented in chapter 1. For example, the reshape() function can
be called with array.reshape(dimensions).
As an alternative and preferred way to reshape an array, the reshape() function can be used as an array method. Start
with the original array and follow it with .reshape((rows, columns)) like below. This format will be used often
herein.
array_1D.reshape((4, 5))
If you need to reshape an array with only one new dimension known, place a -1 in the other. This signals to NumPy that
it should choose the second dimension to make the data fit.
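A short sketch of the -1 placeholder is shown below. It assumes array_1D holds twenty evenly spaced values from 0.0 to 9.5, which is consistent with the indexing results later in this chapter but is not stated explicitly.

```python
import numpy as np

# Assumed contents of array_1D: 0.0, 0.5, ..., 9.5 (20 elements)
array_1D = np.arange(0, 10, 0.5)

# Only the number of rows is given; NumPy infers 5 columns
four_rows = array_1D.reshape((4, -1))

# Only the number of columns is given; NumPy infers 10 rows
two_cols = array_1D.reshape((-1, 2))

print(four_rows.shape, two_cols.shape)
```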
Flattening an array takes a higher-dimensional array and squishes it into a one-dimensional array. To flatten out an array,
the array.flatten() method is often the most convenient way.
array_2D.flatten()
The format of the output makes it look like it is still a 2D array, but notice that there is a comma instead of a square
bracket at the end of the first row. This is a one-dimensional array containing 20 elements.
Transposing an array rotates the array around the diagonal (Figure 1).
Figure 1 The array.transpose() or array.T method transposes the NumPy array, effectively flipping the rows and
columns.
The array.transpose() method flips the rows and columns. NumPy also provides an alias/shortcut of array.T to
accomplish the same outcome. The latter is far more common, so it is the method used here.
array_2D
array_2D.T
Merging arrays can be done in multiple ways. NumPy provides convenient functions for merging arrays using
np.vstack(), np.hstack(), and np.dstack(), which merge arrays vertically, horizontally, and depth-wise,
respectively (Figure 2).
Figure 2 NumPy arrays can be stacked vertically (top left), as columns (top center), depth-wise (top right), or
horizontally (bottom) using the np.vstack(), np.column_stack(), np.dstack(), and np.hstack() functions,
respectively.
a = np.arange(0, 5)
b = np.arange(5, 10)
np.vstack((a, b))
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
np.hstack((a, b))
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.dstack((a, b))
array([[[0, 5],
        [1, 6],
        [2, 7],
        [3, 8],
        [4, 9]]])
A related function is np.column_stack(), which stacks one-dimensional arrays as the columns of a two-dimensional
array.
np.column_stack((a, b))
array([[0, 5],
[1, 6],
[2, 7],
[3, 8],
[4, 9]])
The outcome of the np.column_stack() function can also be accomplished by transposing the output of the
np.vstack() function.
np.vstack((a, b)).T
array([[0, 5],
[1, 6],
[2, 7],
[3, 8],
[4, 9]])
Similar to lists, it is often useful to be able to index and slice NumPy arrays. Because arrays are often higher dimensional,
there are some differences in indexing that provide extra convenience.
Indexing one-dimensional arrays is done in an identical fashion to lists. Simply include the index value(s) or range in
square brackets behind the array name.
array_1D
array_1D[5]
np.float64(2.5)
Two-dimensional arrays can also be indexed in a similar fashion to nested lists, but because arrays are often multidimen-
sional, there is also a shortcut to make working with arrays more convenient. To access the entire second row of an array,
provide the row index in square brackets behind the array name just like indexing in lists.
array_2D
array_2D[1]
To access the first element in the second row, it is perfectly valid to use two adjacent square brackets just as one would
use in a nested list of lists. However, to make work more convenient, these square brackets are often combined with the
row and column indices separated by commas.
array_name[rows, columns]
array_2D[1][0]
np.float64(2.5)
array_2D[1, 0]
np.float64(2.5)
Ranges of values can also be accessed in arrays by using slicing. The following array input generates a slice of the second
row of the array.
array_2D[1, 1:]
As seen above, if you want to access an entire row, it is not necessary to indicate the columns. It is implicitly understood
that all columns are requested. However, if you want to access the first column, something needs to be placed before the
column. The easiest solution is to use a colon to explicitly indicate all rows.
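The row and column access patterns described above can be sketched as follows. The construction of array_2D is an assumption chosen to match the values shown in this section.

```python
import numpy as np

# Assumed contents of array_2D: 0.0 to 9.5 in steps of 0.5, as 4 rows x 5 columns
array_2D = np.arange(0, 10, 0.5).reshape((4, 5))

second_row = array_2D[1]      # entire second row; columns are implied
first_col = array_2D[:, 0]    # colon selects all rows, 0 selects the first column
row_slice = array_2D[1, 1:]   # second row, from the second column onward

print(first_col)
```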
Tip
As you index higher-dimensional arrays, you may see code that looks like arr[...,0] where arr is the array
name. The three dots mean to include everything, so arr[...,0] has the same effect as arr[:,:,0] for a
three-dimensional array, for example.
In the event you have a multidimensional array, you can access elements in the array using multiple collections of values
(i.e., NumPy arrays, lists, or tuples) where each collection indicates the location along a different dimension. This is an
instance of fancy indexing. For example, if we want to select the following bolded elements (5.0, 5.5, and 1.5) from
array_2D, we can create two lists - the first list contains the row indices for each element and the second list likewise
contains the column indices.
\[
\begin{bmatrix}
0.0 & 0.5 & 1.0 & \mathbf{1.5} & 2.0 \\
2.5 & 3.0 & 3.5 & 4.0 & 4.5 \\
\mathbf{5.0} & \mathbf{5.5} & 6.0 & 6.5 & 7.0 \\
7.5 & 8.0 & 8.5 & 9.0 & 9.5
\end{bmatrix}
\]
row = [2, 2, 0]
col = [0, 1, 3]
array_2D[row, col]
Another feature of indexing NumPy arrays is that the returned array will have the same dimensions as the array containing
the indices. In the following example, we have two index arrays where i_flat is a 1 × 4 array while i_square is a
2 × 2 array resulting in 1 × 4 and 2 × 2 arrays, respectively.
threes[i_flat]
threes[i_square]
array([[ 3, 12],
[ 6, 18]])
The latter result can also be accomplished by indexing using a flat (i.e., one-dimensional) array followed by reshaping it
to the desired dimensions, as demonstrated below.
i = np.array([0, 3, 1, 5])
threes[i].reshape((2, 2))
array([[ 3, 12],
[ 6, 18]])
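The fancy indexing examples in this section can be sketched end to end as below. The definitions of array_2D, threes, i_flat, and i_square are assumptions chosen to be consistent with the outputs shown.

```python
import numpy as np

# Assumed arrays, chosen to be consistent with the outputs in this section
array_2D = np.arange(0, 10, 0.5).reshape((4, 5))
threes = np.arange(3, 31, 3)             # multiples of three: 3, 6, ..., 30

# One index list per dimension selects individual elements
row = [2, 2, 0]
col = [0, 1, 3]
picked = array_2D[row, col]              # elements at (2,0), (2,1), and (0,3)

# The result takes on the shape of the index array
i_flat = np.array([0, 3, 1, 5])
i_square = np.array([[0, 3], [1, 5]])
flat_result = threes[i_flat]             # one-dimensional, four elements
square_result = threes[i_square]         # 2 x 2
```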
4.3.4 Masking
Elements in a NumPy array can also be selected using a boolean array through a process known as masking. The masking
array is a boolean array filled with either 1 and 0 or True and False and has the same dimensions as the original array.
Any element in the original array that has a 1 or True in the corresponding position of the masking array is returned.
For example,
orig_array[mask]
array([7, 3, 4, 2, 0, 8])
It’s important to note that if you use 1 and 0 in the masking array, it is required that you include dtype=bool or else
NumPy will treat the 1 and 0 as indices instead of booleans and attempt indexing.
orig_array[mask]
array([[[5, 7, 1],
        [3, 4, 2],
        [5, 7, 1]],

       [[3, 4, 2],
        [3, 4, 2],
        [3, 4, 2]],

       [[3, 4, 2],
        [5, 7, 1],
        [3, 4, 2]]])
The true power of masking is when the masking array is generated through boolean logic such as >, <=, or ==. This
enables the user to select elements of an array through conditions as demonstrated below where we select all elements of
the orig_array that are greater than 5.
Note
If a masking array is generated by a boolean condition, the resulting masking array will automatically be a boolean
array suitable for masking.
orig_array[cond]
array([7, 9, 8])
We can also include the condition directly in the square brackets to save a step, as shown below.
orig_array[orig_array > 5]
array([7, 9, 8])
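The masking steps above can be sketched as follows. The contents of orig_array and mask are inferred from the outputs shown in this section and should be treated as an assumption.

```python
import numpy as np

# Values inferred from the outputs in this section (an assumption)
orig_array = np.array([[5, 7, 1],
                       [3, 4, 2],
                       [0, 9, 8]])

# A mask written with 1 and 0 needs dtype=bool to act as a mask
mask = np.array([[0, 1, 0],
                 [1, 1, 1],
                 [1, 0, 1]], dtype=bool)
selected = orig_array[mask]          # elements where the mask is True

# A condition generates a boolean mask automatically
cond = orig_array > 5
large = orig_array[cond]             # same as orig_array[orig_array > 5]
```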
One of the major advantages of NumPy arrays over lists is that operations automatically vectorize across the arrays. That
is, mathematical operations propagate through the array(s) instead of requiring for loops. This both speeds up the
calculations and makes the code easier to read and write.
Let’s take the square root of numbers using NumPy’s np.sqrt() function. The square root is taken of each element
automatically.
Performing this operation requires NumPy’s sqrt() function. If this is attempted with the math module’s sqrt()
function, an error is returned because this function cannot take the square root of a multi-element object without loops.
import math
math.sqrt(squares)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[51], line 2
1 import math
----> 2 math.sqrt(squares)
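The contrast between the two sqrt() functions can be sketched as below. The values in squares are hypothetical, and the try/except block is only there so the example runs to completion.

```python
import math
import numpy as np

squares = np.array([1, 4, 9, 16, 25])   # hypothetical values

# NumPy's sqrt() is applied to every element automatically
roots = np.sqrt(squares)
print(roots)

# The math module's sqrt() expects a single number and fails on this array
try:
    math.sqrt(squares)
    math_failed = False
except TypeError:
    math_failed = True
```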
When performing mathematical operations between a scalar and an array, the same operation is performed across each
element of the array, returning an array of the same dimension as the starting array. Below, an array is multiplied by the
scalar 3, which results in every element in the array being multiplied by this value.
\[
3 \times \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 15 & 18 \\ 21 & 24 \end{bmatrix}
\]
array([[15, 18],
[21, 24]])
The same outcome arises when performing a similar operation between a 1 × 1 array and a larger array.
array([12, 22])
If a mathematical operation is performed between two arrays of the same dimensions, then the mathematical operation
is performed between corresponding elements in the two arrays. For example, if a pair of 2 × 2 arrays are added to
one another, then the corresponding elements are added to one another. This means that the top-left elements are added
together and so on.
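\[
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 10 & 12 \end{bmatrix}
\]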
array([[ 6, 8],
[10, 12]])
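Both cases, scalar with array and array with array, can be sketched using the matrices from the equations above. The variable names are illustrative.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

scaled = 3 * b      # every element of b is multiplied by 3
summed = a + b      # corresponding elements are added
product = a * b     # element-wise multiplication, not matrix multiplication

print(scaled)
print(summed)
```

Note that the * operator multiplies corresponding elements; it does not perform matrix multiplication.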
Broadcasting is another form of vectorization governed by a series of rules for dealing with mathematical operations
between two arrays of different dimensions. For two arrays to broadcast, their shapes are compared one dimension at a
time starting from the last, and each pair of dimensions must either be identical or one of them must be one; otherwise,
NumPy raises an error. To deal with the different dimensions, NumPy clones the smaller array out so that it has the same
dimensions as the other array. It should be noted that NumPy does not really clone out the array in the background; it
only behaves as if it does. Cloning is simply a convenient way of thinking about the behavior and results. For example,
below is the addition between a 2×2 and a 1×2 array.
\[
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} 2 & 5 \end{bmatrix} = \; ?
\]
To make the two arrays the same size, the smaller array is cloned along the smaller dimension until the two arrays are the
same size, as shown below. We are then left with simple corresponding element-by-corresponding-element mathematical
operations described in section 4.4.3.
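\[
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} 2 & 5 \\ 2 & 5 \end{bmatrix} = \begin{bmatrix} 3 & 7 \\ 5 & 9 \end{bmatrix}
\]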
a = np.array([[1, 2], [3, 4]])
b = np.array([2, 5])
a + b
array([[3, 7],
[5, 9]])
What happens if a mathematical operation is performed between an array of higher dimensions with a scalar or a 1×1
array as shown below? You probably already know the answer from section 4.4.2, but here is how to rationalize the
behavior. In this case, no dimensions are the same, but because one of the arrays has a length of one along every
dimension where the two arrays differ, the arrays still broadcast.
\[
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \times \begin{bmatrix} 2 \end{bmatrix} = \; ?
\]
Again, the smaller array is cloned until the two arrays are the same size.
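\[
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \times \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix} = \begin{bmatrix} 2 & 4 \\ 6 & 8 \end{bmatrix}
\]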
a = np.array([[1, 2], [3, 4]])
b = np.array([2])
a * b
array([[2, 4],
[6, 8]])
Finally, if we attempt to perform a mathematical operation between two arrays with different dimensions and none of the
arrays have a dimension of one where the two arrays are different, an error is raised, and no operation is performed.
\[
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} 1 & 1 & 1 \\ 2 & 2 & 2 \\ 3 & 3 & 3 \end{bmatrix} = \; ?
\]
a = np.array([[1, 2], [3, 4]])
b = np.array([[1, 1, 1],
              [2, 2, 2],
              [3, 3, 3]])
a + b
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[57], line 5
1 a = np.array([[1, 2], [3, 4]])
2 b = np.array([[1, 1, 1],
3               [2, 2, 2],
4               [3, 3, 3]])
----> 5 a + b
ValueError: operands could not be broadcast together with shapes (2,2) (3,3)
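Whether two shapes will broadcast can be checked before doing any arithmetic. The sketch below uses np.broadcast_shapes(), which is available in NumPy 1.20 and later; the arrays col and row are illustrative and not from the text.

```python
import numpy as np

col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3, 4])        # shape (4,)

# Dimensions of length one are stretched to match, giving a (3, 4) result
grid = col + row
print(grid)

# Ask NumPy what shape the result would have without computing it
result_shape = np.broadcast_shapes(col.shape, row.shape)

# Incompatible shapes such as (2, 2) and (3, 3) raise a ValueError
try:
    np.broadcast_shapes((2, 2), (3, 3))
    compatible = True
except ValueError:
    compatible = False
```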
Standard Python functions are often designed to perform a calculation a single time and output Python objects and not
NumPy arrays. As an example, the following function calculates the rate of a first-order reaction given the rate constant
(k) and concentration of reactant (conc).
rate(1.2, 0.80)
0.96
What happens if we attempt the above calculation using a list of concentration values?
rate(1.2, concs)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[61], line 1
----> 1 rate(1.2, concs)
We get an error because Python cannot multiply a list by a float the way NumPy can multiply an array. However, the above
function can be converted to a NumPy function using np.vectorize(), which will allow the function to perform the
calculation on a series of values and return a NumPy array.
vrate = np.vectorize(rate)
vrate(1.2, concs)
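A self-contained sketch of this workflow follows. The body of rate() is a plausible reconstruction of a first-order rate law (rate = k[A]), and the values in concs are hypothetical.

```python
import numpy as np

def rate(k, conc):
    """Rate of a first-order reaction: rate = k * [A] (assumed form)."""
    return k * conc

concs = [0.80, 0.60, 0.40, 0.20]   # hypothetical concentrations in M

# Wrap the scalar function so it accepts a list or array of concentrations
vrate = np.vectorize(rate)
rates = vrate(1.2, concs)
print(rates)

# For simple arithmetic like this, converting the list to an array also works
direct = rate(1.2, np.array(concs))
```

According to the NumPy documentation, np.vectorize() is provided mainly for convenience and is essentially a loop internally, so it should not be expected to run as fast as native array arithmetic.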
Technically, NumPy array methods have already been employed in this chapter. The functions above are NumPy methods
specifically for working with NumPy arrays. If an array is fed to many non-NumPy functions, an error will result because
they cannot handle multi-element objects or arrays specifically. Interestingly, if a float or integer is fed into a NumPy
method, it will still work. As an example, the integer 4 can be fed into the np.sqrt() function as well as an array of
values.
np.sqrt(4)
np.float64(2.0)
np.sqrt(np.array([1, 4, 9]))
NumPy contains an extensive listing of methods for working with arrays… so much so that it would be impractical to
list them all here. However, below are tables of some common and useful methods. It is worth browsing and being
aware of them; many are worth committing to memory. If you ever find yourself needing to manipulate an array in some
complex fashion, it is worth doing a quick internet search and including “NumPy” in your search. You will likely either
find additional NumPy methods that will help or advice on how others solved a similar problem.
Table 1 Common Methods for Generating Arrays
Method                      Description
np.array()                  Generates an array from another object
np.arange()                 Creates an array from [start, stop) with a given step size
np.linspace()               Creates an array from [start, stop] with a given number of points
np.empty()                  Creates an “empty” array (actually filled with garbage)
np.zeros()                  Generates an array of the given dimensions filled with zeros
np.ones()                   Generates an array of the given dimensions filled with ones
np.fromfunction()           Generates an array using a Python function
np.genfromtxt()             Loads text file data into an array
np.loadtxt()                Loads text file data into an array (cannot handle missing data)
Method                      Description
np.shape(array)             Returns the dimensions of an array
np.ndim(array)              Returns the number of dimensions (e.g., a 2D array is 2)
np.size(array)              Returns the number of elements in an array
Method                      Description
array.flatten()             Returns a flattened copy of the array
np.ravel()                  Returns a flattened view of the array (when possible) without changing the array
np.reshape()                Returns a reshaped array without modifying the original
np.resize()                 Returns a resized copy of an array without modifying the original
np.transpose()              Returns a view of the transposed array
np.vstack()                 Vertically stacks arrays into a new array
np.hstack()                 Horizontally stacks arrays into a new array
np.dstack()                 Depth-wise stacks arrays into a new array
np.vsplit()                 Splits an array vertically
np.hsplit()                 Splits an array horizontally
np.dsplit()                 Splits an array depth-wise
np.meshgrid()               Creates a meshgrid (see chapter 3 for an example)
np.sort()                   Sorts elements in array; defaults along last axis
np.argsort()                Returns index values of sorted array
array.fill(x)               Sets all values in an array to x
np.roll()                   Rolls the array along the given axis; elements that fall off one end of the array appear at the other
np.floor()                  Returns the floor (i.e., rounds down) of all elements in an array
np.round(x, decimals=y)     Rounds every number in an array x to y decimal places by Banker’s rounding
Warning
Like the native Python round() function, np.round() performs Banker’s or half to even rounding.
Method                      Description
np.min()                    Returns the minimum value in the array
np.max()                    Returns the maximum value in the array
np.argmin()                 Returns argument (i.e., index) of min
np.argmax()                 Returns argument (i.e., index) of max
scipy.signal.argrelmax()    Returns argument (i.e., index) of the local max (a SciPy function)
np.minimum()                Returns the element-by-element min between two arrays of the same size
np.maximum()                Returns the element-by-element max between two arrays of the same size
np.percentile()             Returns the specified percentile
np.mean()                   Returns the mean (commonly known as the average)
np.median()                 Returns the median
np.std()                    Returns the standard deviation; include ddof=1 for the sample standard deviation
np.histogram()              Returns counts and bins for a histogram
np.cumprod()                Returns the cumulative product
np.cumsum()                 Returns the cumulative sum
np.sum()                    Returns the sum of all elements
np.prod()                   Returns the product of all elements
np.ptp()                    Returns the peak-to-peak separation of max and min values
np.unique()                 Returns an array of unique elements in an array; set return_counts=True to get frequency
np.unique_counts()          Returns an array of unique elements in an array and a second array with the frequency of these elements
Note
The standard deviation equation includes a degrees of freedom correction (ddof). The default value for NumPy is zero,
but the default value for Excel, and some other software, is one. If you want your standard deviations to match Excel,
include the ddof=1 argument to the np.std() standard deviation function.
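The effect of the ddof argument can be sketched with a small, hypothetical dataset. With ddof=0 the sum of squared deviations is divided by N, and with ddof=1 it is divided by N - 1.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # hypothetical values

pop_std = np.std(data)              # default ddof=0, divides by N
sample_std = np.std(data, ddof=1)   # divides by N - 1, matching Excel's default

print(pop_std, sample_std)
```

The sample standard deviation is always the larger of the two, and the gap shrinks as the number of data points grows.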
Real datasets frequently contain gaps or missing values, so it is important to be able to deal with missing data.
When importing data into NumPy, there are two commonly employed functions, np.genfromtxt() and
np.loadtxt(). Though these are largely analogous functions in terms of capabilities, there is a key difference in that
np.genfromtxt() can handle missing data while np.loadtxt() cannot. This means if your dataset may contain
gaps, you should use np.genfromtxt().
In the event the data file contains a gap, the np.genfromtxt() function will place a nan in that location by default.
The nan stands for “not a number” and is simply a placeholder. For example, the file dHf_ROH.csv contains the number
of carbons in linear alcohols and the gas-phase heat of formation in kJ/mol of each alcohol. The value for 1-undecanol
(eleven carbons) is missing, so np.genfromtxt() places a nan in its place.
np.genfromtxt('data/dHf_ROH.csv', delimiter=',')
Some data files use placeholder values instead of no value at all. These placeholders are often -1, 0, 999, or some
physically meaningless or improbable value. If you have alternative values you want in the missing data location, you can
specify this using the filling_values= argument. As an example below, the missing value is replaced with a 999.
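The behavior with and without filling_values= can be sketched without a data file by reading from an in-memory string. The CSV text below is hypothetical and only stands in for a file with a gap in its second column.

```python
import io
import numpy as np

# Hypothetical CSV content; the second row has no value in the second column
csv_text = "1,10.5\n2,\n3,30.5\n"

# By default, the gap is filled with nan
with_nan = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

# filling_values= substitutes a chosen placeholder instead
with_fill = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                          filling_values=999)

print(with_nan)
print(with_fill)
```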
In the event you have data with missing values, the nan placeholders can pose an issue when running statistics on the
data. Below, we use the np.mean() method to try to calculate the mean enthalpy of formation but get a nan instead
because the np.mean() function cannot handle the placeholder.
np.mean(dHf[:,1])
np.float64(nan)
Alternatively, NumPy has a number of versions of functions (Table 5) that are specifically designed to handle data with
missing values.
Table 5 Statistics Methods Dealing with NaNs
Function                    Description
np.nanstd()                 Standard deviation
np.nanmean()                Mean
np.nanvar()                 Variance
np.nanmedian()              Median
np.nanpercentile()          Qth percentile
np.nanquantile()            Qth quantile
np.nanmean(dHf[:, 1])
np.float64(-317.3636363636364)
np.float64(-298.0)
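The difference between the standard and nan-aware functions can be sketched with a small, hypothetical array containing one gap:

```python
import numpy as np

values = np.array([10.0, np.nan, 30.0, 50.0])   # hypothetical data with a gap

plain_mean = np.mean(values)        # nan, because the gap propagates
nan_mean = np.nanmean(values)       # mean of the three real values
nan_median = np.nanmedian(values)   # median of the three real values

print(plain_mean, nan_mean, nan_median)
```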
Stochastic simulations, addressed in chapter 9, are a common tool in the sciences and rely on a series of random numbers,
so it is worth addressing their generation using NumPy. Depending upon the requirements of the simulation, random
numbers may be a series of floats or integers, and they may be generated from various ranges of values. The numbers
may also be generated as a uniform distribution where all values are equally likely or a biased distribution where some
values are more probable than others. Below are random number functions from the NumPy random module useful in
generating random number distributions to suit the needs of your simulations.
Note
Software-generated random numbers are really pseudorandom numbers. However, they are close enough to ran-
dom for most chemical simulations and will be referred to as “random numbers” herein.
The simplest distribution is the uniform distribution of random numbers where every number in the range has an equal
probability of occurring. The distribution may not always appear as even with small sample sizes due to the random nature
of the number generation, but as a larger population of samples is generated, the relative distribution will appear more
even. The histograms below (Figure 3) are of a hundred (left) and a hundred thousand (right) randomly generated floats
from the [0,1) range in an even distribution. While the plot on the right appears more even, this is mostly an effect of the
different scales.
Figure 3 Histograms of a hundred (left) and a hundred thousand (right) randomly generated floats from the [0,1) range
in an even distribution using the random() method (x-axes: Values).
Starting in version 1.18, NumPy’s preferred method for producing random values is through a Generator created using
the rng = np.random.default_rng() function. Once a generator has been created, it can be used to generate
the necessary random values. NumPy has multiple methods available for generating evenly-distributed random numbers
including the following two functions where n is the number of random values to be generated. The rng.random(n)
function generates n random floats from the range [0,1). The rng.integers(low, high=, size=n) function
generates random integers in the range [low, high) and can generate multiple values using the size argument.
rng = np.random.default_rng()
rng.random(n)
Warning
Prior to version 1.18 of NumPy, random numbers were generated using function calls that look like
np.random.randint(). While these should still work, they are considered legacy, so it is uncertain how long they will
continue to be supported.
rng = np.random.default_rng()
rng.random(5)
array([6, 3, 5, 6, 1, 5, 3, 5, 6, 8])
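Both uniform generators can be sketched together as below. The seed and the integer range are assumptions added for illustration; a seed makes the pseudorandom sequence repeatable between runs.

```python
import numpy as np

# The seed value is arbitrary; omit it to get different values on each run
rng = np.random.default_rng(seed=42)

# Five random floats from the range [0, 1)
floats = rng.random(5)

# Ten random integers from the range [0, 10); the high value is excluded
ints = rng.integers(0, high=10, size=10)

print(floats)
print(ints)
```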
A binomial distribution results when values are generated from two possible outcomes. This is useful for applications
such as deciding if a simulated molecule reacts or whether a polymer chain terminates or propagates. The two outcomes
are represented by a 0 or 1 with the probability, p, of a 1 being generated. Binomial distributions are generated by the
NumPy random module using the rng.binomial() function call.
rng = np.random.default_rng()
rng.binomial(t, p, size=n)
The t argument is the number of trials, while the size= argument is the number of generated values. For example, if
t = 2, two trials are performed for each generated value, and the number of successes is returned, which may be 0, 1,
or 2. With p = 0.5, basic probability predicts that these sums will occur in a 1:2:1 ratio, respectively. If t is
increased to 10, a shape more closely representing a bell
curve is obtained. A Bernoulli distribution is the specific instance of a binomial distribution where t = 1. The histograms
below (Figure 4) are of a hundred randomly generated numbers in a binomial distribution with p = 0.5 and where t
= 1 (left), t = 2 (center), and t = 10 (right).
Figure 4 Histograms of a hundred randomly generated numbers in a binomial distribution with p = 0.5 and t = 1
(left), t = 2 (center), and t = 10 (right) (x-axes: Values).
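A short sketch of drawing binomial values follows. The seed and the sample size of 1000 are assumptions for illustration, so the counts will differ from those in the figure.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed for repeatability

# Bernoulli case (t = 1): every value is either 0 or 1
bernoulli = rng.binomial(1, 0.5, size=1000)

# t = 2: every value is the number of successes in two trials (0, 1, or 2)
two_trials = rng.binomial(2, 0.5, size=1000)

# Tally how often each outcome occurred; expected to be roughly 1:2:1
values, counts = np.unique(two_trials, return_counts=True)
print(values, counts)
```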
A Poisson distribution is a probability distribution of how likely it is for independent events to occur in a given interval
(time or space) with a known average frequency (𝜆). Each sample in a Poisson distribution is a count of how many events
have occurred in the time interval, so they are always integers. NumPy can generate integers in a Poisson distribution
using the rng.poisson() function, which accepts two arguments.
rng.poisson(lam=1.0, size=n)
The first argument, 𝜆 (lam), is the statistical mean for the values generated, and the second argument, size, is the
requested number of values. For example, a Geiger counter can be simulated detecting background radiation in a location
that is known to have an average of 3.6 radiation counts per second with the following function call.
rng.poisson(lam=3.6, size=30)
array([2, 2, 4, 2, 3, 2, 4, 3, 2, 6, 3, 3, 4, 3, 4, 4, 3, 0, 5, 5, 4, 4,
8, 2, 2, 3, 7, 4, 4, 5])
The returned array contains the total radiation detections for each second over thirty seconds, and the mean value is
3.8 counts. While not precisely the target of 3.6 counts, it is close, and larger sample sizes are statistically more likely
to generate results closer to the target value. A histogram of these values is shown below (Figure 5, left). When this
simulation is repeated with thirty thousand samples (Figure 5, right), a mean of 3.61 counts is obtained. In addition,
the larger number of values results in a classic Poisson distribution curve, which resembles a bell curve with
a longer tail on the high end.
[Figure 5: two histogram panels titled Thirty Values (left) and Thirty Thousand Values (right); x-axes labeled Values, y-axes labeled Counts.]
Figure 5 Histograms of thirty (left) and thirty thousand (right) randomly generated integers in a Poisson distribution with
a target mean (𝜆) of 3.6 (dashed red line).
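The effect of sample size on the mean can be reproduced with a short script. This is a sketch only; the seed is an arbitrary choice made here, so the means it prints will differ from the 3.8 and 3.61 values quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(seed=1)   # arbitrary seed for repeatability

# Simulated counts per second for thirty and thirty thousand seconds
small = rng.poisson(lam=3.6, size=30)
large = rng.poisson(lam=3.6, size=30_000)

print(small.mean())   # may sit noticeably above or below 3.6
print(large.mean())   # expected to be much closer to 3.6
```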
Alternative distributions of random numbers can be generated by manipulating the output of the above functions. For
example, random numbers in a [-1, 1) distribution, which is useful in a 2D diffusion simulation, can be generated by
subtracting 0.5 from values in the range [0, 1) and multiplying by two.
rand_float = 2 * (rng.random(10) - 0.5)
rand_float
The random module in NumPy also includes a large variety of other random number and sequence generators. This
includes rng.normal(), which generates values centered around zero in a normal distribution. The rng.choice()
function selects a random value from a provided array of values, while the [Link]() function randomizes the
order of values for a given array. Other random distribution functions can be found on the SciPy website (see Further
Reading). A summary of common NumPy random functions is in Table 6.
Table 6 Summary of Common NumPy np.random Functions
Function          Description
rng.random()      Generates random floats in the range [0, 1) in an even distribution
rng.integers()    Generates random integers from a given range in an even distribution
rng.normal()      Generates random floats in a normal distribution centered around zero
rng.binomial()    Generates random integers in a binomial distribution; takes a probability, p, and size arguments
rng.poisson()     Generates random integers in a Poisson distribution; takes a target mean argument (lam)
rng.choice()      Selects random values taken from a 1-D array or range
[Link]()          Randomizes the order of an array
rng.random(1)
array([0.28404729])
rng.integers(0, high=100)
np.int64(29)
array([1, 1, 2])
rng.poisson(lam=2.0, size=5)
array([2, 2, 1, 5, 5])
[Link](20, size=3)
array([0, 1, 2, 3, 4])
Further Reading
The NumPy documentation is well written and a good resource. Because NumPy is the foundation of the SciPy ecosystem,
if you find a Python book on scientific computing, odds are that it will discuss or use NumPy at some level.
1. NumPy Website. [Link] (free resource)
2. NumPy User Guide. [Link] (free resource)
3. VanderPlas, J. Python Data Science Handbook: Essential Tools for Working with Data, 1st ed.;
O’Reilly: Sebastopol, CA, 2017, chapter 2. Freely available from the author at [Link]
PythonDataScienceHandbook/ (free resource)
Exercises
Complete the following exercises in a Jupyter notebook using NumPy and NumPy arrays. Avoid using for loops when-
ever possible. Any data file(s) referred to in the problems can be found in the data folder in the same directory as this
chapter’s Jupyter notebook. Alternatively, you can download a zip file of the data for this chapter from here by selecting
the appropriate chapter file and then clicking the Download button.
1. Generate an array containing the atomic numbers for the first 26 elements.
2. The following equation defines the relationship between the energy (J) of a photon and its wavelength (m), where h is
Planck’s constant ($6.626 \times 10^{-34}$ J·s) and c is the speed of light in a vacuum ($2.998 \times 10^{8}$ m/s).
$$E = \frac{hc}{\lambda}$$
a. Generate an array containing the wavelengths of visible light ($4.00 \times 10^{-7}$ m → $8.00 \times 10^{-7}$ m) in
$5 \times 10^{-8}$ m increments.
b. Generate a second array containing the energy of each wavelength of light from part a.
3. Generate an array containing 101.325 a hundred times.
4. The following array contains temperatures in Fahrenheit. Convert these values to ∘ C without using a for loop.
𝑦 = 𝑠𝑖𝑛(𝑥)
𝑦 = 𝑠𝑖𝑛(1.1𝑥 + 0.5)
a. Plot these two sine waves on the same plot.
b. Add the two sine functions together and plot the result.
c. Explain why the signal in part b is smaller in one area and larger in another. Hint: look at your plot for part a to
see how the two original sine waves relate to each other.
6. The numerical relationship between $\Delta G^\circ$ and K (equilibrium constant) is shown below. Plot $\Delta G^\circ$ versus K at
standard temperature and pressure for K values of 0.001 → 1000. Use NumPy arrays and do not use any for loops.
$$\Delta G^\circ = -RT \ln(K)$$
7. The numerical relationship between k (rate constant) and $E_a$ is shown below. Plot k versus $E_a$ at standard
temperature and pressure for activation energies of 1 → 20 kJ/mol. Use NumPy arrays, do not use any for loops, and use
A = 1. Watch your energy units carefully.
$$k = Ae^{-E_a/RT}$$
$$\begin{bmatrix} 1 & 1 \\ 2 & 2 \end{bmatrix} + \begin{bmatrix} 1 \end{bmatrix} = ?$$
13. Predict the outcome of the following operation between two NumPy arrays. Test your prediction.
1 8 9
⎡8 1 1
⎢ 1 9⎤⎥ + [1 ]= ?
1
⎣1 8 1⎦
14. Predict the outcome of the following operation between two NumPy arrays. Test your prediction.
$$\begin{bmatrix} 1 & 8 \\ 3 & 2 \end{bmatrix} + \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = ?$$
18. Hydrogen nuclei can have a spin of +1/2 and -1/2 and occur in approximately a 1:1 ratio. Simulate the number of
+1/2 hydrogen nuclei in a molecule of six hydrogen atoms and plot the distribution. Hint: because there are two
possible outcomes, this can be simulated using a binomial distribution. See section 4.7.2.
19. Using NumPy’s random module, generate a random DNA sequence (i.e., series of ‘A’, ‘T’, ‘C’, and ‘G’ bases) 40
bases long stored in an array.
CHAPTER 5: PANDAS
While NumPy is the foundation of much of the SciPy ecosystem and provides very capable ndarray objects, it has a few
missing features. The first is that NumPy arrays cannot hold different types of objects in a single array. For example, if
we attempt to convert the following list containing integers, floats, and strings into an array, NumPy converts all elements
into strings as a way of making the object types uniform.
import numpy as np
np.array(nums)
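As a minimal illustration of this type coercion, the sketch below uses a hypothetical three-item list named nums (the specific values are an assumption made for this example) and inspects the data type of the resulting array.

```python
import numpy as np

# Hypothetical list mixing an integer, a float, and a string
nums = [1, 2.5, 'three']

arr = np.array(nums)
print(arr)         # every element is now a string
print(arr.dtype)   # a Unicode string dtype
```

Because every element is stored as a string, numerical operations on such an array would first require converting the values back to numbers.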
The second shortcoming is that NumPy arrays do not have strong support for labels in the data. That is, you might want
to label rows and columns describing what they represent, like you might see in a well-constructed spreadsheet. While
there is some support for this in NumPy, it is not as strong as pandas’ support. Finally, while NumPy contains a wealth
of basic tools for working with data, there are still many operations that it does not support, like grouping data based on
the value of a particular column or the ability to merge two datasets with automatic alignment of related data.
To fill in these missing features, the pandas library provides a wealth of additional tools on top of NumPy for working
with data, and possibly the most endearing feature, the ability to call data based on labels. That is, data columns and rows
can contain human-readable labels that are used to access the data. Pandas still supports accessing data using indices if
the user wishes to go that route, but the user can now access data without knowing which column it is in as long as the
user knows the column label.
By popular convention, the pandas library is imported with the pd alias, which is used here. This chapter assumes the
following imports.
import pandas as pd
import matplotlib.pyplot as plt
To support the wealth of features, pandas uses its own objects to hold data called a Series and a DataFrame, which are built
on NumPy arrays. Because they are built on NumPy, many of the NumPy functions (e.g., [Link]()) work on pandas
objects. The key difference between a Series and DataFrame is that a Series is one-dimensional while a DataFrame is
two-dimensional. Unlike a NumPy array, pandas objects have fixed dimensionality. Older versions of pandas also included a
three-dimensional object called a Panel, but it has been removed from the library and is not covered here.
5.1.1 Series
While the pandas Series is restricted to being a single dimension, it can be as long as necessary to hold the data. A Series
containing the atomic masses of the first five elements on the periodic table is generated below using the pd.Series()
function. This function is always capitalized.
mass = pd.Series([1.01, 4.00, 6.94, 9.01, 10.81])
mass
0 1.01
1 4.00
2 6.94
3 9.01
4 10.81
dtype: float64
The right column is the actual data in the Series, while the values on the left are the assigned indices for each value in the
Series. The index column is not part of the dimensionality of the Series; it is metadata (i.e., data about the data). Think
of the numbers as the row labels you would see in a traditional spreadsheet software application.
Consistent with lists, tuples, and ndarrays, values in a Series can be accessed using indexing with square brackets as
demonstrated below.
mass[2]
np.float64(6.94)
Unlike most other multi-element objects seen so far, data in a Series can be accessed using indices different from the
default (i.e., 0, 1, 2, etc.) values. That is, custom row labels can be assigned using the index= argument shown below.
H 1.01
He 4.00
Li 6.94
Be 9.01
B 10.81
dtype: float64
The custom indices can now be used to access an element in a Series. This makes a Series behave something like a
dictionary (section 2.2).
mass2['He']
np.float64(4.0)
The indices can be accessed by using [Link]. Series indices can also be modified after a Series has been created
by using .index and assignment as demonstrated below.
[Link]
H 1.01
He 4.00
Li 6.94
Be 9.01
B 10.81
dtype: float64
Even if we create or modify a Series to have custom indices, we can still access the elements using the traditional numerical
indices using the iloc[] method. This method allows the user to access elements the same way as in a NumPy array
regardless of custom index values.
[Link][2]
np.float64(6.94)
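The three ways of working with Series labels described in this section can be sketched together as follows. The mass2 name and the element symbols come from the output shown above; replacing the symbols with full element names is a hypothetical relabeling chosen here only to show assignment to .index.

```python
import pandas as pd

# Series with custom row labels supplied through the index= argument
symbols = ['H', 'He', 'Li', 'Be', 'B']
mass2 = pd.Series([1.01, 4.00, 6.94, 9.01, 10.81], index=symbols)

by_label = mass2['He']        # label-based lookup
by_position = mass2.iloc[2]   # position-based lookup, ignores the labels

# Replace the labels after creation by assigning to .index
# (these full names are an illustrative assumption)
mass2.index = ['hydrogen', 'helium', 'lithium', 'beryllium', 'boron']
relabeled = mass2['lithium']

print(by_label, by_position, relabeled)
```

Note that iloc[] returns the same value before and after the relabeling because it depends only on position.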
5.1.2 DataFrame
Most data you will find yourself working with will be best placed in a two-dimensional pandas object called a DataFrame,
which is always written with two capital letters. The DataFrame is similar to a Series except that now there are also
columns with names. The columns can be accessed by column names, and rows can be accessed by indices. You might
think of a DataFrame as a collection of Series objects. Below, a DataFrame is constructed to hold the names, atomic
numbers, masses, and ionization energies of the first five elements.
H He Li Be B
name hydrogen helium lithium beryllium boron
AN 1 2 3 4 5
mass 1.01 4.0 6.94 9.01 10.81
IE 13.6 24.6 5.4 9.3 8.3
elements['Li']
name lithium
AN 3
mass 6.94
IE 5.4
Name: Li, dtype: object
Essentially, what we get out of a column is a Series with the indices shown on the left side.
To access a row, instead use the loc[] method. We again get a Series with indices derived from the column names in
the source DataFrame. This Series can be placed in a variable and indexed just like in section 5.1.1.
elements.loc['IE']
H 13.6
He 24.6
Li 5.4
Be 9.3
B 8.3
Name: IE, dtype: object
atomic_number = elements.loc['AN']
atomic_number['B']
Alternatively, we can use the DataFrame directly and index it with the loc[] method as [row, column].
elements.loc['IE', 'Li']
5.4
Numerical index values can also be used with the iloc[] method. This reduces indexing to how NumPy arrays are
indexed.
elements.iloc[2:, 2]
mass 6.94
IE 5.4
Name: Li, dtype: object
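The indexing operations in this section can be tried on a DataFrame rebuilt from the values printed above. The construction below (a dictionary of columns plus an index= list) is one possible way to build it and is an assumption of this sketch, not necessarily how the original elements DataFrame was created.

```python
import pandas as pd

# One possible construction: each dictionary key becomes a column label
elements = pd.DataFrame(
    {'H':  ['hydrogen', 1, 1.01, 13.6],
     'He': ['helium', 2, 4.00, 24.6],
     'Li': ['lithium', 3, 6.94, 5.4],
     'Be': ['beryllium', 4, 9.01, 9.3],
     'B':  ['boron', 5, 10.81, 8.3]},
    index=['name', 'AN', 'mass', 'IE'])

column = elements['Li']             # column by label, returns a Series
row = elements.loc['IE']            # row by label, returns a Series
single = elements.loc['IE', 'Li']   # [row, column] by label
subset = elements.iloc[2:, 2]       # rows 2 onward of column 2, by position

print(single)
print(subset)
```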
A summary of the methods of indexing pandas Series and DataFrames is presented below in Table 1.
Table 1 Summary of Pandas Indexing
Similar to NumPy, pandas contains multiple, convenient functions for reading/writing data directly to and from its own
object types, and each function is suited to a specific file format. This includes CSV, HTML, JSON, SQL, Excel, and
HDF5 files, among others.
Table 2 Import/Export Functions in Pandas
Function                              Description
read_csv() and to_csv()               Imports/exports data from/to a CSV file
read_table()                          General-purpose importer for delimited text files
read_hdf() and to_hdf()               Imports/exports data from/to an HDF5 file
read_clipboard() and to_clipboard()   Transfers data from/to the clipboard
read_excel() and to_excel()           Reads/writes an Excel file
Note
The \s+ syntax is from regular expressions, which are covered in more detail in Appendix 4.
Before we start with more well-defined file formats, pandas provides a general-purpose file reader pd.read_table().
This function imports text files where lines represent rows, and the data in each row is separated by characters or spaces.
The user can designate what character(s) separate the data by using the delimiter or sep arguments (they do the
same thing). To use any run of whitespace as the delimiter, use sep='\s+'. The function also includes a series of other
arguments listed below in Table 3.
Warning
The delim_whitespace= argument has been deprecated and will be removed from pandas at some point.
Use sep='\s+' instead.
Argument          Description
delimiter         Data separator; default is tab
sep               Data separator; default is tab
skiprows          Number of rows at the top of the file to skip before reading data
skipfooter        Number of rows at the bottom of the file to skip
skip_blank_lines  If True, skips blank lines in file; default is True
header            Row number to use for a data header; also accepts None if no header is provided in the file
skipinitialspace  If True, skips white space after delimiter
As an example, we can use this function to read a calculated PDB file of benzene and extract the 𝑥𝑦𝑧 coordinates for
each atom. This particular file type, shown below, is strictly formatted based on the position in a line, but because all
the data columns here have spaces between them, we can use whitespace delimitation by setting sep='\s+'. Because the data
do not start until the third line and we do not need the last thirteen lines of the file, we should exclude these rows with
the skiprows and skipfooter arguments. We set header=None because we do not want the function to treat the first line of
data as a header or data label.
HEADER
REMARK
HETATM 1 H UNK 0001 0.000 0.000 -0.020
HETATM 2 C UNK 0001 0.000 0.000 1.067
HETATM 3 C UNK 0001 0.000 0.000 3.857
HETATM 4 C UNK 0001 0.000 -1.208 1.764
HETATM 5 C UNK 0001 0.000 1.208 1.764
HETATM 6 C UNK 0001 0.000 1.208 3.159
HETATM 7 C UNK 0001 0.000 -1.208 3.159
HETATM 8 H UNK 0001 0.000 -2.149 1.221
HETATM 9 H UNK 0001 0.000 2.149 1.221
HETATM 10 H UNK 0001 0.000 2.149 3.703
HETATM 11 H UNK 0001 0.000 -2.149 3.703
HETATM 12 H UNK 0001 0.000 0.000 4.943
CONECT 1 2
CONECT 2 1 5 4
CONECT 3 6 7 12
CONECT 4 7 2 8
CONECT 5 2 6 9
CONECT 6 5 3 10
CONECT 7 3 4 11
CONECT 8 4
CONECT 9 5
CONECT 10 6
CONECT 11 7
CONECT 12 3
END
0 1 2 3 4 5 6 7
0 HETATM 1 H UNK 1 0.0 0.000 -0.020
1 HETATM 2 C UNK 1 0.0 0.000 1.067
2 HETATM 3 C UNK 1 0.0 0.000 3.857
3 HETATM 4 C UNK 1 0.0 -1.208 1.764
The 𝑥, 𝑦, and 𝑧 data are in columns 5, 6, and 7, respectively, and can be extracted by indexing as discussed in section
5.1.2.
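The import and extraction steps described above can be sketched as follows. To keep the example self-contained, it reads an abbreviated copy of the file contents from a string through io.StringIO instead of a file on disk, so skipfooter is 3 here instead of 13. The engine='python' argument and the raw string r'\s+' are choices made for this sketch: the default parser does not support skipfooter, and the raw string avoids an invalid escape sequence warning.

```python
import io
import pandas as pd

# Abbreviated stand-in for the PDB file shown above (not the full file)
pdb_text = """HEADER
REMARK
HETATM    1  H   UNK  0001       0.000   0.000  -0.020
HETATM    2  C   UNK  0001       0.000   0.000   1.067
HETATM    3  C   UNK  0001       0.000   0.000   3.857
CONECT    1    2
CONECT    2    1
END
"""

atoms = pd.read_table(io.StringIO(pdb_text),
                      sep=r'\s+',       # any run of whitespace separates fields
                      skiprows=2,       # skip the HEADER and REMARK lines
                      skipfooter=3,     # skip the CONECT and END lines
                      header=None,      # the first data row is not a header
                      engine='python')  # needed for skipfooter

# Columns 5, 6, and 7 hold the x, y, and z coordinates
xyz = atoms[[5, 6, 7]]
print(xyz)
```

Passing a list of column labels inside the square brackets returns a DataFrame containing only those columns.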
Pandas provides a collection of more format-specific functions for reading/writing files. The most popular is possibly the
CSV file because it is simple, and many scientific instruments support exporting data in this format. To import a CSV
file, we will use the read_csv() function. This function is very similar to the read_table() function except that
a default value for the separator/delimiter is set to a comma. To create a CSV file, use the to_csv() method, which at
a minimum requires the file name and a pandas object with the data.
We can write the above chemical element data assembled in section 5.1 as shown below. Because we are starting from a
pandas object and are using a pandas method, the df.to_csv() format is used where df is a DataFrame.
elements.to_csv('[Link]')
If we check the directory containing the Jupyter notebook, the data folder contains a file titled [Link] that looks like
the following. Each row in the DataFrame is a different line in the file, and every column is separated by a comma.
,H,He,Li,Be,B
name,hydrogen,helium,lithium,beryllium,boron
AN,1,2,3,4,5
mass,1.01,4.0,6.94,9.01,10.81
IE,13.6,24.6,5.4,9.3,8.3
To read the data back in from the file, use pd.read_csv(). Because we are not starting with a pandas object, the
function is called using the [Link]() format.
pd.read_csv('data/[Link]')
Unnamed: 0 H He Li Be B
0 name hydrogen helium lithium beryllium boron
1 AN 1 2 3 4 5
2 mass 1.01 4.0 6.94 9.01 10.81
3 IE 13.6 24.6 5.4 9.3 8.3
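Notice that the row labels come back as an ordinary column named Unnamed: 0. One way to restore them as the index is the index_col argument of pd.read_csv(). The sketch below demonstrates this with a small two-row DataFrame written to a temporary directory; the file name elements_demo.csv and the temporary path are assumptions made for this example.

```python
import os
import tempfile
import pandas as pd

# Small DataFrame with labeled rows (a subset of the element data)
df = pd.DataFrame({'H': [1.0, 1.01], 'He': [2.0, 4.00]},
                  index=['AN', 'mass'])

# Write to a throwaway location (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), 'elements_demo.csv')
df.to_csv(path)

default_read = pd.read_csv(path)               # labels land in 'Unnamed: 0'
indexed_read = pd.read_csv(path, index_col=0)  # first column becomes the index

print(default_read)
print(indexed_read)
```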
Pandas provides another useful function that imports Excel workbook files (i.e., .xls or .xlsx). Excel files are a specialized
file type that requires the support of additional libraries, known as dependencies, that pandas does not install by default. A
list of these dependencies is provided on the pandas website. You can either install each dependency yourself, or pandas
provides a shortcut (for pandas version 2.0.0 and later) of pip install "pandas[excel]" that is run in the
Terminal window (see section 0.2 for Terminal instructions). However, please check the pandas website for the full and
most current instructions as things may have changed. Because Excel files can contain multiple sheets, this function is a
little more complicated to use. The simplest way to import an Excel file is to use pd.read_excel() and provide it
with the Excel file name.
Note
The pip install "pandas[excel]" command only works for pandas version 2.0.0 and later. If this
command does not work, you may need to upgrade your version of pandas.
pd.read_excel('data/[Link]')
x y
0 1 1
1 2 4
2 3 9
3 4 16
4 5 25
5 6 36
6 7 49
In the above example, pandas loads the first sheet in the file, which is the default behavior. If you want to access a different
sheet in the file, you can specify this by using the sheet_name keyword argument. If you do not know the sheet name,
the sheet_name argument also accepts integer index values (i.e., 0 for the first sheet and so on).
data = pd.read_excel('data/[Link]', sheet_name='Sheet2')
data
b.1
0 NaN
1 NaN
2 NaN
3 NaN
Alternatively, if you want to extract the sheet names, you can use the sheet_names attribute of the ExcelFile
class as demonstrated below.
xl = pd.ExcelFile('data/[Link]')
xl.sheet_names
['Sheet1', 'Sheet2']
Writing to an Excel file requires two steps – generate an ExcelWriter engine and then write each sheet. The Excel writer
offers more power in generating Excel files including embedding charts, conditional formatting, coloring cells, and other
tasks; but we will stick to the basics here.
Pandas will also accept data from the computer’s copy and paste clipboard. Start by highlighting some data from a
webpage or a spreadsheet, then select copy. This is typically located under the Edit menu of most software applications.
Alternatively, you can type Command + C on macOS or Control + C on Windows and Linux. Finally, use the
pd.read_clipboard() function to convert it to a pandas DataFrame.
pd.read_clipboard()
Loading data from the clipboard is not a robust and efficient way to do much of your automated data analysis, but it is a
very convenient method to experiment with data or to quickly grab some data off a website to experiment with.
Once you load data into pandas, you will likely want to get an idea of what the data look like before you proceed to calcu-
lations and in-depth analyses. This section covers a few methods provided in pandas to gain a preliminary understanding
of your data.
Pandas provides a few simple functions to view and describe new data. The first two are head() and tail() which
allow you to see the top and bottom of the DataFrame, respectively. These are particularly useful when dealing with very
large DataFrames. Below, a DataFrame containing random values in an even, normal, and Poisson distribution (𝜆 = 3.0)
demonstrates these functions.
rng = np.random.default_rng()
random.head()
random.tail()
Pandas also contains a describe() function that returns a variety of statistics on each column. For example, the means
are provided, which are approximately 0.5, 0.0, and 3.0 for the even, normal, and Poisson distributions, respectively. This is
not surprising because the even distribution is centered around 0.5, the normal around 0.0, and the Poisson distribution
is generated for an average of 3.0. The user is also provided with the minimum, maximum, standard deviation, and the
quartile boundaries.
random.describe()
Another useful function is the value_counts() method, which returns the number of times each unique value occurs in a
Series (or DataFrame column or row). Below, it is demonstrated on the poisson column because the other two columns will
have a relatively large number of unique values.
counts = random['poisson'].value_counts()
counts
poisson
3 235
2 224
4 188
1 155
5 74
6 52
0 43
7 18
8 5
10 3
9 3
Name: count, dtype: int64
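The exploration steps in this section can be run end to end with the sketch below. The DataFrame name random and the poisson column come from the text; the even and normal column names, the seed, and the choice of 1000 rows (which matches the total of the counts shown above) are assumptions made here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)   # arbitrary seed for repeatability
n = 1000

# One column per distribution (the first two column names are assumed)
random = pd.DataFrame({'even': rng.random(n),
                       'normal': rng.normal(size=n),
                       'poisson': rng.poisson(lam=3.0, size=n)})

top = random.head()          # first five rows
bottom = random.tail()       # last five rows
summary = random.describe()  # per-column statistics
counts = random['poisson'].value_counts()  # occurrences of each unique value

print(summary.loc['mean'])
print(counts)
```

By default, value_counts() sorts the result from the most frequent value to the least frequent, which is why the index in the output above is not in numerical order.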
Data in DataFrames can be plotted by calling the desired columns of data and feeding them into plotting functions
like plt.plot(). The data can also be visualized by using the df.plot(kind=) format where df is the
DataFrame and kind is the plot type (e.g., 'bar', 'hist', 'scatter', 'line', 'pie', etc.). However, this is
just matplotlib doing the plotting and is largely redundant with other methods already covered. Below is a quick example
of the counts data generated above.
counts.plot(kind='bar');
[Figure: bar plot of the counts data; x-axis labeled poisson with bars for the values 3, 2, 4, 1, 5, 6, 0, 7, 8, 10, and 9.]
Because pandas is built upon NumPy arrays, mathematical operations are propagated through Series and DataFrames.
The user is able to use NumPy methods on pandas objects, and there are a number of other mathematical operations to
choose from such as those listed below.
Table 4 Broadcasted Pandas Methods
Function Description
abs() Absolute value
count() Counts items
cumsum() Cumulative sum
cumprod() Cumulative product
mad() Mean absolute deviation (removed in pandas 2.0)
max() Maximum
min() Minimum
mean() Mean
median() Median
mode() Mode
std() Standard deviation
Note
The default delta degrees of freedom (ddof) of the std() function in pandas equals one, unlike NumPy (see section 4.5)
where the default is zero. This behavior can be modified with the ddof= argument (e.g., ddof=0).
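The difference between the two defaults is easy to see on a small dataset. The four values below are arbitrary numbers chosen for this sketch.

```python
import numpy as np
import pandas as pd

data = [1.0, 2.0, 3.0, 4.0]   # arbitrary illustrative values
series = pd.Series(data)

pandas_default = series.std()             # ddof=1, sample standard deviation
numpy_default = np.std(np.array(data))    # ddof=0, population standard deviation
pandas_population = series.std(ddof=0)    # matches the NumPy default

print(pandas_default, numpy_default, pandas_population)
```

For these values, the pandas default divides the sum of squared deviations by n - 1 = 3, while the NumPy default divides by n = 4, so the pandas result is the larger of the two.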
Now that you are able to generate DataFrames, it is useful to be able to modify them as you clean your data or per-
form calculations. This can be done through methods such as assignment, dropping rows and columns, and combining
DataFrames or Series.
Possibly the easiest method of adding a new column is through assignment. If a nonexistent column is called and assigned
values, instead of returning an error, pandas creates a new column with the given name and populates it with the data. For
example, the elements DataFrame below does not contain a carbon column, so the column is added when assigned to
a Series with the data.
elements
H He Li Be B
name hydrogen helium lithium beryllium boron
AN 1 2 3 4 5
H He Li Be B C
name hydrogen helium lithium beryllium boron carbon
AN 1 2 3 4 5 6
mass 1.01 4.0 6.94 9.01 10.81 12.01
IE 13.6 24.6 5.4 9.3 8.3 11.3
Another important feature of pandas is the ability to automatically align data based on labels. In the above example,
carbon is added to the DataFrame with the name, atomic number, atomic mass, and ionization energy in the same order
as in the DataFrame. What happens if the new data is not in the correct order? If we are using NumPy, this would require
additional effort on the part of the user to reorder the data. However, if each value is labeled, pandas will see to it that
they are placed in the correct location.
AN 7
mass 14.01
name nitrogen
IE 14.5
dtype: object
Data for nitrogen is placed in a Series above. Notice that the values are out of order with respect to the data in elements.
There are index labels (i.e., row labels) that tell pandas what each piece of data is, and pandas will use them to determine
where to place the new information.
elements['N'] = nitrogen
elements
H He Li Be B C N
name hydrogen helium lithium beryllium boron carbon nitrogen
AN 1 2 3 4 5 6 7
mass 1.01 4.0 6.94 9.01 10.81 12.01 14.01
IE 13.6 24.6 5.4 9.3 8.3 11.3 14.5
The new column of nitrogen data has been added to elements with all pieces of data residing in the correct row.
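The alignment behavior described above can be sketched with a reduced version of the table. The two-column elements DataFrame below is an abbreviated stand-in built for this example; the nitrogen values and their out-of-order labels follow the Series printed above.

```python
import pandas as pd

rows = ['name', 'AN', 'mass', 'IE']

# Abbreviated stand-in for the elements DataFrame
elements = pd.DataFrame({'B': ['boron', 5, 10.81, 8.3],
                         'C': ['carbon', 6, 12.01, 11.3]},
                        index=rows)

# Same labels as the DataFrame rows, but deliberately in a different order
nitrogen = pd.Series([7, 14.01, 'nitrogen', 14.5],
                     index=['AN', 'mass', 'name', 'IE'])

# Assigning to a nonexistent column creates it; values are aligned by label
elements['N'] = nitrogen
print(elements)
```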
When cleaning up data, you may wish to drop a column or row. Pandas provides the drop() method for this purpose.
It requires the name of the column or row to be dropped, and by default, it assumes a row, axis=0, is to be dropped. If
you want to drop a column, change the axis using the axis=1 argument. Below, the hydrogen column is dropped from
the elements DataFrame.
elements.drop('H', axis=1)
He Li Be B C N
name helium lithium beryllium boron carbon nitrogen
AN 2 3 4 5 6 7
mass 4.0 6.94 9.01 10.81 12.01 14.01
IE 24.6 5.4 9.3 8.3 11.3 14.5
elements.drop('IE', axis=0)
H He Li Be B C N
name hydrogen helium lithium beryllium boron carbon nitrogen
AN 1 2 3 4 5 6 7
mass 1.01 4.0 6.94 9.01 10.81 12.01 14.01
In the second example above, the hydrogen is back despite being previously dropped. This is because the drop() method
does not by default modify the original DataFrame. To make the changes permanent, either assign the new DataFrame
to a new variable or add the inplace=True keyword argument to the above drop() function.
There is a similar function dropna() that drops columns or rows from a DataFrame that contain NaN values. This
is commonly used to remove incomplete data from a dataset. The dropna() function behaves very similarly to the
drop() function including the inplace= and axis= arguments.
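The fact that drop() and dropna() return new objects by default can be confirmed with a short sketch. The small solvent table below is illustrative only, and one boiling point has been deliberately replaced with NaN for the purposes of this example.

```python
import numpy as np
import pandas as pd

# Illustrative table with one deliberately missing value
solvents = pd.DataFrame({'bp': [80.0, 56.0, np.nan],
                         'mp': [6.0, -95.0, -95.0]},
                        index=['benzene', 'acetone', 'toluene'])

no_mp = solvents.drop('mp', axis=1)   # new DataFrame without the mp column
complete = solvents.dropna()          # new DataFrame without rows containing NaN

print(solvents.shape)   # the original is unchanged by either call
print(no_mp.shape)
print(complete.shape)
```

Assigning the result back to the same variable (e.g., solvents = solvents.dropna()) is an alternative to inplace=True for making the change permanent.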
5.4.4 Merge
To merge multiple DataFrames, pandas provides a merge() method. Similar to above, the merge() function will
properly align data, but because DataFrames have multiple columns and index values to choose from, the merge()
function can align data based on any of these values. The default behavior for merge() is to check for common columns
between the two DataFrames and align the data based on those columns. As an example, below are two DataFrames
containing data from various chemical compounds.
chmdf1
chmdf2
Both DataFrames above have a property column, so the merge() function uses this common column to align all the
data into a new DataFrame.
chmdf1.merge(chmdf2)
If there are multiple columns with the same name, the user can specify which to use with the on keyword argument (e.g.,
on='property'). Alternatively, if the two DataFrames contain columns with different names that the user wants used
for alignment, the user can specify which columns to use with the left_on and right_on keyword arguments.
In the two DataFrames generated above, each contains data on cobalt, iron, chromium, and nickel; but the first DataFrame
labels metals as element while the second labels the metals as metal. The following merges the two DataFrames based
on values in these two columns.
[Link](comps2, left_on='element',right_on='metal')
Notice that the values in the element and metal columns were aligned in the resulting DataFrame. To get rid of one
of the redundant columns, just use the drop() method described in section 5.4.3.
element protons IE
0 Co 27 7.88
1 Fe 26 7.90
2 Cr 24 6.79
3 Ni 28 7.64
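A merge on differently named columns, followed by removal of the redundant column, can be sketched as below. The values are taken from the table above, but the name comps1, the split of the data between the two DataFrames, and the row order of comps2 are assumptions made for this example.

```python
import pandas as pd

# First DataFrame labels the metals as 'element' (name comps1 is assumed)
comps1 = pd.DataFrame({'element': ['Co', 'Fe', 'Cr', 'Ni'],
                       'protons': [27, 26, 24, 28]})

# Second DataFrame labels the metals as 'metal', in a different row order
comps2 = pd.DataFrame({'metal': ['Fe', 'Co', 'Ni', 'Cr'],
                       'IE': [7.90, 7.88, 7.64, 6.79]})

# Align rows where comps1['element'] equals comps2['metal']
merged = comps1.merge(comps2, left_on='element', right_on='metal')

# Both key columns survive the merge, so drop the redundant one
merged = merged.drop('metal', axis=1)
print(merged)
```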
5.4.5 Concatenation
Concatenation is the process of splicing two DataFrames along a given axis. This is different from the merge() method
above in that merge() merges and aligns common data between the two DataFrames while pd.concat() blindly
appends one DataFrame to another. As an example, imagine two lab groups measure the densities of magnesium, aluminum,
titanium, and iron and load their results into DataFrames below.
group1 = pd.DataFrame({'metal': ['Mg', 'Al', 'Ti', 'Fe'],
                       'density': [1.77, 2.73, 4.55, 7.88]})
group2 = pd.DataFrame({'metal': ['Al', 'Mg', 'Ti', 'Fe'],
                       'density': [2.90, 1.54, 4.12, 8.10]})
group1
metal density
0 Mg 1.77
1 Al 2.73
2 Ti 4.55
3 Fe 7.88
metal density
0 Mg 1.77
1 Al 2.73
2 Ti 4.55
3 Fe 7.88
0 Al 2.90
1 Mg 1.54
2 Ti 4.12
3 Fe 8.10
Notice how the two DataFrames are appended with no consideration for common values in the metal column. The
default behavior is to concatenate along the first axis (axis=0), but this behavior can be modified with the axis=
keyword argument. Again, the metals are not all aligned below because they were not in the same order in the original
DataFrames.
pd.concat((group1, group2), axis=1)
For comparison, if the two DataFrames are merged instead of concatenating them, pandas will align the data based on the
metal as demonstrated below. Because density appears twice as a column header, pandas deals with this by adding
a suffix to differentiate between the two datasets.
pd.merge(group1, group2, on='metal')
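The three operations in this section can be compared side by side with the sketch below, which reuses the group1 and group2 data from above. The variable names stacked, side_by_side, and merged are chosen here for illustration.

```python
import pandas as pd

group1 = pd.DataFrame({'metal': ['Mg', 'Al', 'Ti', 'Fe'],
                       'density': [1.77, 2.73, 4.55, 7.88]})
group2 = pd.DataFrame({'metal': ['Al', 'Mg', 'Ti', 'Fe'],
                       'density': [2.90, 1.54, 4.12, 8.10]})

stacked = pd.concat((group1, group2))               # appended along rows
side_by_side = pd.concat((group1, group2), axis=1)  # appended along columns
merged = pd.merge(group1, group2, on='metal')       # aligned on the metal column

print(stacked.shape, side_by_side.shape)
print(merged)
```

In the merged result, the two density columns are distinguished by the default suffixes _x and _y, with density_x coming from group1 and density_y from group2.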
Further Reading
For further resources on the pandas library, see the following. The value of the pandas website cannot be emphasized
enough, as it contains a large quantity of high-quality documentation and illustrative examples on using pandas for data
analysis and processing.
1. Pandas Website. [Link] (free resource)
2. VanderPlas, J. Python Data Science Handbook: Essential Tools for Working with Data, 1st ed.;
O’Reilly: Sebastopol, CA, 2017, chapter 3. Freely available from the author at [Link]
PythonDataScienceHandbook/ (free resource)
3. McKinney, W. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd ed.; O’Reilly:
Sebastopol, CA, 2018.
Exercises
Complete the following exercises in a Jupyter notebook using the pandas library. Avoid using for loops unless absolutely
necessary. Any data file(s) referred to in the problems can be found in the data folder in the same directory as this
chapter’s Jupyter notebook. Alternatively, you can download a zip file of the data for this chapter from here by selecting
the appropriate chapter file and then clicking the Download button.
1. Below is a table containing the melting points and boiling points of multiple common chemical solvents.
Solvent bp mp
benzene 80 6
acetone 56 -95
toluene 111 -95
pentane 36 -130
ether 35 -116
ethanol 78 -114
methanol 65 -98
a) Create a Series containing the boiling points of the above solvents with the solvent names as the indices. Call
the Series to look up the boiling point of ethanol.
b) Create a DataFrame that contains both the boiling points and melting points with the solvent names as the indices.
Call the DataFrame to look up the melting point of benzene.
c) Access the boiling point of pentane in the DataFrame from part b using numerical indices.
2. Import the attached file [Link] containing the absorption spectrum of Blue 1 food dye using pandas.
a) Set the wavelengths as the index values.
b) Plot the absorption versus wavelength.
c) Determine the absorbance of Blue 1 at 620 nm.
3. Chemical Kinetics: Import the file [Link] containing time series data for the conversion of A → Product
using pandas IO tools. Generate new columns for $\ln[A]$, $[A]^{-1}$, and $[A]^{0.5}$ and determine the order of the reaction.
4. Import the ROH_data.csv file containing data on various simple alcohols to a DataFrame. Notice that this data is
missing densities for some of the compounds.
a) Use pandas to remove any rows with incomplete information in the density column using the [Link]()
function. Check the DataFrame to see if it has changed.
b) Again using the [Link]() function, drop incomplete rows with the parameter inplace=True. Check
to see if the DataFrame has changed.
5. Import the following four files containing UV-vis spectra of four food dyes with the first column listing the wavelengths
(nm) and the second column containing the absorbances. Each file contains data from 400-850 nm in 1
nm increments.
a) Concatenate the files into a single DataFrame with the first column as the wavelength (nm) and the other four
columns as the absorbances for each dye.
b) Replace the column headers with meaningful labels.
6. Import the two files [Link] and [Link] containing the boiling points of the two classes of organic compounds
with respect to the number of carbons in each compound.
a) Drop the columns containing the names of the compounds.
b) Merge the two DataFrames allowing pandas to align the two DataFrames based on carbon number.
Part II
CHAPTER 6: SIGNAL & NOISE
When collecting data from a scientific instrument, a measurement is returned as a value or series of values, and these values
are composed of both signal and noise. The signal is the component of interest, while the noise is random instrument
response resulting from a variety of sources that can include the instrument itself, the sample holder, and even the tiny
vibrations of the building. For the most interpretable data, you want the largest signal-to-noise ratio possible in order to
reliably identify the features in the data.
This chapter introduces the processing of signal data, including detecting features, removing noise from the data, and
fitting the data to mathematical models. We will be using the NumPy library in this chapter and also start to use modules
from the SciPy library. SciPy, short for “scientific Python,” is one of the core libraries in the scientific Python ecosystem.
This library includes a variety of modules for dealing with signal data, performing Fourier transforms, and integrating
sampled data, among other common tasks in scientific data analysis. Table 1 summarizes some of the key modules in the
SciPy library.
Table 1 Common SciPy Modules
Unlike NumPy, many of the functions in SciPy are stored in modules, so each module from SciPy needs to be imported
individually or listed when calling the function. It is common to see specific SciPy modules imported as shown below.
Because NumPy and plotting are used heavily in signal processing, the examples in this chapter assume the following
NumPy and matplotlib imports.
import numpy as np
import matplotlib.pyplot as plt
When analyzing experimental data, there are typically key features in the signal that you are most interested in. Often,
they are peaks or a series of peaks, but they can also be negative peaks (i.e., low points), the slopes, or inflection points.
This section covers extracting feature information from signal data.
The simplest and probably most commonly sought-after features in signal data are peaks and negative peaks. These are
known as the maxima and minima, respectively, or collectively known as the extrema. In the simplest data, there may be
only one peak or negative peak, so finding it is a matter of finding the maximum or minimum value in the data. For this,
we can use NumPy’s np.amax() and np.amin() functions, and these functions can also be called using the
shorter np.max() and np.min() function calls, respectively.
To demonstrate peak finding, we will use both a ¹³C{¹H} Nuclear Magnetic Resonance (NMR) spectrum and an infrared
(IR) spectrum. These data are imported below using NumPy.
[Figure: ¹³C{¹H} NMR spectrum; x-axis: Chemical Shift, ppm]
ir = np.genfromtxt('data/IR_acetone.csv', delimiter=',')
plt.plot(ir[:,0], ir[:,1])
plt.xlabel('Wavenumbers, cm$^{-1}$')
plt.ylabel('Transmittance, %')
plt.xlim(4000, 600);
[Figure: IR spectrum of acetone; x-axis: Wavenumbers, cm⁻¹; y-axis: Transmittance, %]
np.max(nmr[:,1])
np.float64(11.7279863357544)
np.min(ir[:,1])
np.float64(66.80017)
These functions output the max and min values of the dependent variable (𝑦-axis). If we want to know the location on
the 𝑥-axis, we need to use the NumPy functions np.argmax() and np.argmin(), which return the indices of the
max or min values instead of the actual values (“arg” is short for argument).
imax = np.argmax(nmr[:,1])
imax
np.int64(5395)
imin = np.argmin(ir[:,1])
imin
np.int64(2302)
With the indices, we can extract the desired information using indexing of the 𝑥-axes. Below, the largest peak in the NMR
spectrum is at 18.3 ppm while the smallest transmittance (i.e., largest absorbance) is at 1710 cm⁻¹ in the IR spectrum.
nmr[imax, 0]
np.float64(18.312606267778)
ir[imin, 0]
np.float64(1710.068)
Below, these values are plotted on the spectra as orange dots to validate that they are indeed the largest features in the
spectra.
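[Figure: NMR spectrum with the largest peak marked by an orange dot; x-axis: Chemical Shift, ppm]
plt.plot(ir[:,0], ir[:,1])
plt.plot(ir[imin, 0], ir[imin, 1], 'o')
plt.xlim(4000, 600)
plt.xlabel('Wavenumbers, cm$^{-1}$')
plt.ylabel('Transmittance, %');
[Figure: IR spectrum with the lowest transmittance marked by an orange dot; x-axis: Wavenumbers, cm⁻¹; y-axis: Transmittance, %]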
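As a minimal sketch of the index lookup described above, the example below builds a synthetic single-peak spectrum (an assumption for illustration, not the chapter's NMR or IR data) and uses np.argmax() to find where the tallest point sits on the x-axis.

```python
import numpy as np

# Synthetic two-column "spectrum" (assumed data): one Gaussian peak at x = 18.3
x = np.linspace(0, 70, 7001)
y = 12 * np.exp(-((x - 18.3) ** 2) / (2 * 0.05 ** 2))
spectrum = np.column_stack((x, y))    # column 0: x-axis, column 1: intensity

i_peak = np.argmax(spectrum[:, 1])    # index of the largest intensity
x_peak = spectrum[i_peak, 0]          # x-value stored at that same index
print(i_peak, x_peak)
```

The same pattern with np.argmin() locates the lowest point, such as the strongest band in a transmittance spectrum.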
A considerable amount of data in science contains numerous peaks and negative peaks, which are called local extrema.
To locate the multiple max and min values, we will use SciPy’s relative max/min functions argrelmax() and argrelmin()
from the scipy.signal module. These functions determine if a point is a max/min by checking a range of data points on both sides to
see if the point is the largest/smallest. The range of data points examined is known as the window, and the window can
be modified using the order argument. Instead of the actual max/min values, these functions return the indices as the
“arg” part of the name suggests.
(array([1219, 5395]),)
The argrelmax() function returned two indices as an array wrapped in a tuple. If we plot the maxima marked with
dots, we see that the function correctly identified both peaks.
[Figure: NMR spectrum with both identified maxima marked by dots; x-axis: Chemical Shift, ppm]
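The behavior of argrelmax() can be sketched with synthetic data. The two-peak signal and the order value below are assumptions chosen for illustration rather than the chapter's NMR spectrum.

```python
import numpy as np
from scipy.signal import argrelmax

# Synthetic signal (assumed data): a tall peak at x = 3 and a shorter one at x = 7
x = np.linspace(0, 10, 1001)
y = np.exp(-(x - 3) ** 2 / 0.1) + 0.5 * np.exp(-(x - 7) ** 2 / 0.1)

# order=50 compares each point against 50 neighbors on each side
i_max = argrelmax(y, order=50)[0]     # [0] unwraps the array from the tuple
print(i_max, x[i_max])
```

Indexing with [0] is a convenient way to unwrap the tuple so the indices can be used directly for plotting or further filtering.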
The argrelmax() function may at times identify an edge or a point in a flat region as a local maximum because there
is nothing larger near it. There are multiple ways to mitigate these erroneous peaks. First, we can increase the window
for which the function checks to see if a point is the largest value in its neighborhood. Unfortunately, making the window
too large can also prevent the identification of multiple extrema near each other. The second mitigation is to change the
function’s mode from the default 'clip' to 'wrap'. This makes the function treat the data as wrapped around on
itself instead of stopping at the edge. That is, both edges of the data are treated as being connected. This makes it more
likely that an extremum is in the neighborhood. Finally, the user can filter for values that correspond to peaks above
a certain height value. Below is an example of filtering values based on a height. The window below is intentionally
narrowed so that the argrelmax() function returns too many values for demonstration purposes.
Next, we will create a boolean mask (see section 4.3.4), which is a series of True and False values indicating whether each
data point is above a height value. In this example, we are using 1 as the height, but another height value may be
more appropriate for different data. This is accomplished below by using the > comparison operator. The nmr[imax, 1]
indexes the identified peaks from above and, because of the column index 1, returns only the height values. If the 1 were not included,
we would get a collection of [ppm, height] pairs.
Finally, we treat the mask of True/False values as if they are indices to get only the values for legitimate peaks.
imax[mask]
array([1219, 5395])
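The height-filtering steps above can be sketched end to end as follows. The synthetic noisy signal, the seed, and the threshold of 1 are assumptions for illustration only.

```python
import numpy as np
from scipy.signal import argrelmax

rng = np.random.default_rng(seed=0)
x = np.linspace(0, 10, 1001)
# Synthetic data (assumed): two sharp peaks plus a little random noise
peaks = 5 * np.exp(-(x - 3) ** 2 / 0.01) + 8 * np.exp(-(x - 7) ** 2 / 0.01)
y = peaks + 0.01 * rng.random(1001)

i_all = argrelmax(y, order=5)[0]   # narrow window: noise maxima are also returned
mask = y[i_all] > 1                # True only for maxima taller than the threshold
i_real = i_all[mask]               # boolean indexing keeps the legitimate peaks
print(len(i_all), i_real)
```

The narrow window returns many small noise maxima, and the boolean mask reduces them to the two real peaks.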
The scipy.signal module includes a convenient find_peaks() function that facilitates the finding of multiple
peaks in a spectrum based on parameters such as the height of the peaks or prominence. This function requires a one-dimensional
array as a positional argument and accepts a number of optional keyword arguments (Table 2).
Table 2 Select Keyword Arguments for the scipy.signal.find_peaks() Function
Parameter      Description
height=        Vertical height of the peak apex.
threshold=     Vertical distance between a data point and the adjacent data points.
distance=      Horizontal distance between a peak and its nearest neighbor. If two peaks are near each other, the smaller one is discarded.
prominence=    Distance between a peak apex and the base of the peak.
width=         Peak width measured in number of data points.
Each of these parameters can take a single number treated as a minimum value. Alternatively, most of these parameters
can also take two numbers in an array, list, or tuple in which case the first value is a minimum while the second value is
a maximum.
find_peaks(data, height=min)
find_peaks(data, height=(min, max))
This function only identifies positive peaks (i.e., pointing upwards). Our IR spectrum is currently represented as percent
transmittance (%T), so we can convert it to absorbance (A) using the following equation.
$$A = 2 - \log_{10}(\%T)$$
absorb = 2 - np.log10(ir[:,1])
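The conversion follows from the definition of absorbance, A = -log10(T), with T = %T/100. A short check with a few assumed transmittance values illustrates the relationship.

```python
import numpy as np

# Assumed example values of percent transmittance
percent_T = np.array([100.0, 50.0, 10.0, 1.0])

# A = -log10(%T / 100) = 2 - log10(%T)
absorbance = 2 - np.log10(percent_T)
print(absorbance)
```

A fully transmitting sample (100 %T) has an absorbance of zero, and each tenfold drop in transmittance raises the absorbance by one unit.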
Now that the peaks are pointed upward, we feed the data into the find_peaks() function and decide how to best
identify the peaks we are interested in. This will depend on the type of spectrum and other conditions. One straightforward
method is the height= parameter where any peaks above this level are identified at their apex.
find_peaks(absorb)
(array([ 2, 16, 31, 46, 62, 84, 102, 116, 130, 140, 161,
176, 199, 215, 220, 235, 249, 262, 270, 279, 291, 307,
382, 442, 455, 471, 486, 497, 524, 625, 749, 791, 812,
837, 864, 1020, 1114, 1285, 1573, 1698, 1731, 1858, 1879, 1908,
1919, 1934, 1948, 1988, 2009, 2020, 2064, 2091, 2108, 2148, 2172,
2184, 2302, 2381, 2468, 2539, 2565, 2579, 2607, 2628, 2661, 2673,
2693, 2716, 2733, 2745, 2756, 2772, 2789, 2811, 2829, 2846, 2853,
The function returns a tuple containing an array and a dictionary in this order. The array contains the indices of identified
peaks, while the dictionary may either be empty or include information about the identified peaks depending on what
keyword arguments are used in the function.
Below, we can plot the results of the function with the horizontal dotted line representing the chosen height, and the orange
dots represent identified peaks.
height = 0.01
i_peaks = find_peaks(absorb, height=height)[0]
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
ax.hlines(height, 4000, 600, 'r', linestyles='dotted', label='Height')
ax.plot(ir[:,0], absorb)
ax.plot(ir[i_peaks,0], absorb[i_peaks], 'o', label='Identified peaks')
ax.set_xlim(4000, 600)
ax.set_xlabel('Wavenumbers, cm$^{-1}$')
ax.set_ylabel('Absorbance')
ax.legend()
<matplotlib.legend.Legend at 0x10a2b9550>
[Figure: IR absorbance spectrum with the height threshold (dotted line) and identified peaks (dots); x-axis: Wavenumbers, cm⁻¹; y-axis: Absorbance]
When using keyword arguments, the find_peaks() function returns the values used by the keyword arguments in the
dictionary. For example, because we used the height= argument, the heights are returned.
find_peaks(absorb, height=height)
(array([ 2, 16, 31, 46, 62, 84, 102, 116, 130, 140, 161,
176, 199, 215, 220, 235, 249, 262, 270, 279, 291, 307,
382, 442, 455, 471, 486, 497, 524, 625, 749, 791, 812,
837, 864, 1020, 1114, 1285, 1573, 1698, 1731, 1858, 1879, 1908,
1919, 1934, 1948, 1988, 2009, 2020, 2108, 2148, 2172, 2184, 2302,
2381, 4984]),
{'peak_heights': array([0.01713256, 0.01951099, 0.01586994, 0.0166401 , 0.01443504,
This approach struggles with identifying short peaks without mislabeling non-peaks, so we need another condition to limit
what is marked as a peak. The peak prominence (prominence=) is how far the apex of a peak is above the base of the
peak. The base of the peak may or may not be the baseline of the spectrum itself. By adding this condition, now only
peaks that satisfy both the height and prominence condition will be identified.
height = 0.01
i_peaks = find_peaks(absorb, height=height, prominence=0.002)[0]
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
ax.hlines(height, 4000, 600, 'r', linestyles='dotted', label='Height')
ax.plot(ir[:,0], absorb)
ax.plot(ir[i_peaks,0], absorb[i_peaks], 'o', label='Identified peaks')
ax.set_xlim(4000, 600)
ax.set_xlabel('Wavenumbers, cm$^{-1}$')
ax.set_ylabel('Absorbance')
ax.legend()
<matplotlib.legend.Legend at 0x107d70cb0>
[Figure: IR absorbance spectrum with peaks identified using both height and prominence; x-axis: Wavenumbers, cm⁻¹; y-axis: Absorbance]
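To see how the height and prominence conditions interact, the sketch below applies find_peaks() to synthetic data. The signal, ripple, and threshold values are assumptions for illustration and are unrelated to the IR spectrum above.

```python
import numpy as np
from scipy.signal import find_peaks

x = np.linspace(0, 10, 1001)
# Synthetic data (assumed): two real peaks, a small ripple, and a 0.02 offset
y = (np.exp(-(x - 3) ** 2 / 0.05) + 0.6 * np.exp(-(x - 7) ** 2 / 0.05)
     + 0.01 * np.sin(40 * x) + 0.02)

# Height alone also labels every ripple crest as a peak
i_height, props_h = find_peaks(y, height=0.01)

# Adding a prominence requirement keeps only the two real peaks
i_both, props_b = find_peaks(y, height=0.01, prominence=0.1)

# A (min, max) tuple selects only peaks with heights between 0.5 and 0.8
i_mid, _ = find_peaks(y, height=(0.5, 0.8))

print(len(i_height), x[i_both], x[i_mid])
print(list(props_b.keys()))
```

The returned dictionary holds an entry for each condition used, so the heights and prominences of the surviving peaks are available without recalculating them.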
The slope is a useful feature as it can be used to identify inflection points, edges, and make subtle features in a curve more
obvious. Unfortunately, noisy data can make it challenging to examine the slope as the noise causes the slope to fluctuate
so much that it sometimes dwarfs the overall signal. It is sometimes recommended that the noise be first removed by
signal smoothing, covered in section 6.2, before trying to identify signal features. To demonstrate the challenges of noisy
data, we will generate both noise-free and noisy synthetic data below and calculate the slopes for both.
rng = np.random.default_rng()
plt.plot(x, y_smooth);
[Figure: noise-free synthetic data plotted versus x]
We will use NumPy to calculate the slope using the np.diff() function, which calculates the discrete difference of a
user-defined order (n) between adjacent values. Because the slope is the dy/dx between every pair of adjacent points, the resulting slope data is one
data point shorter than the original data. This is important when plotting the data because the length of the x and y values
must be the same.
When examining the slope, it is important to use smooth data. In the example below, the slope from the noise in the noisy
data dwarfs that of the main signal. Therefore, we will use the slope of the smooth data to find the inflection point below.
dx = 2*np.pi/(1000 - 1)
dy_smooth = np.diff(y_smooth, n=1)
dy_noisy = np.diff(y_noisy, n=1)
x2 = (x[:-1] + x[1:]) / 2 # x values one shorter
[Figure: slopes of the noisy and smooth data; y-axis: Slope, dy/dx; legend: Noisy Data, Smooth Data]
Because the inflection point in the center of the data has a negative slope, we will need to find the minimum slope. This
may not always be the case with other data.
[Figure: plot of the synthetic data]
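The slope-based search for an inflection point can be sketched as follows. A noise-free sine wave is assumed here as a stand-in for the smooth synthetic data.

```python
import numpy as np

# Assumed smooth data: one period of a sine wave
x = np.linspace(0, 2 * np.pi, 1000)
y = np.sin(x)

slope = np.diff(y) / np.diff(x)     # dy/dx between each pair of adjacent points
x_mid = (x[:-1] + x[1:]) / 2        # midpoints, so x and slope have equal length

i_infl = np.argmin(slope)           # the steepest negative slope marks the inflection
print(x_mid[i_infl], slope[i_infl])
```

For this curve the steepest descent occurs at x = π, where the slope is close to -1. For data whose inflection point has a positive slope, np.argmax() would be used instead.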
It is not uncommon to collect signal data that has a considerable amount of noise in it. Smoothing the data can help in
the processing and analysis of the data, such as making it easier to identify peaks or preventing the noise from hiding the
extremes in the derivative of the data. Smoothing alters the actual data, so it is important to be transparent to others that
the data were smoothed and how they were smoothed.
There are a variety of ways to smooth data, including moving averages, band filters, and the Savitzky-Golay filter. We
will focus on moving averages and Savitzky-Golay here. For this section, we will work with a noisy cyclic voltammogram
(CV) stored in the file CV_noisy.csv.
CV = np.genfromtxt('data/CV_noisy.csv', delimiter=',')
potent = CV[:,0]
curr = CV[:,1]
plt.plot(potent, curr)
plt.xlabel('Potential, V')
plt.ylabel('Current, A');
[Figure: noisy cyclic voltammogram; x-axis: Potential, V; y-axis: Current, A]
The first and simplest way to smooth data is to take the moving average of each data point with its immediate neighbors.
This is an unweighted sliding average smooth or a rectangular boxcar smooth. From noisy data point $D_j$, we get smoothed
data point $S_j$ by the following equation where $D_{j-1}$ and $D_{j+1}$ are the points immediately preceding and following
data point $D_j$, respectively.
$$S_j = \frac{D_{j-1} + D_j + D_{j+1}}{3}$$
One thing to note about this smoothing method is that it is valid for every point except the first and last because
these two points lack a neighbor on one side to average with. As a result, the smoothed data is two data points
shorter. There are approximations that can be used to maintain the length of the data, but for simplicity, we will allow
the data to shorten.
plt.plot(potent[1:-1], rect_smooth)
plt.xlabel('Potential, V')
plt.ylabel('Current, A');
[Figure: cyclic voltammogram after rectangular smoothing; y-axis: Current, A]
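One possible way to implement the three-point rectangular smooth is with array slicing, as sketched below. The noisy sine wave and variable names are assumptions for illustration, not the voltammogram data.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
x = np.linspace(0, 2 * np.pi, 200)
noisy = np.sin(x) + 0.2 * (rng.random(200) - 0.5)   # assumed noisy data

# S_j = (D_{j-1} + D_j + D_{j+1}) / 3 using three offset slices
rect = (noisy[:-2] + noisy[1:-1] + noisy[2:]) / 3
print(len(noisy), len(rect))
```

Because the result is two points shorter, it is plotted against the trimmed x-values, x[1:-1].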
The above method treats each point equally and only takes the average with the immediately adjacent data points. The
triangular smooth approach averages extra data points with the points closer to the original point weighted more heavily
than those further away. For example, if we take the average using five data points, this is described by the following
equation.
$$S_j = \frac{D_{j-2} + 2D_{j-1} + 3D_j + 2D_{j+1} + D_{j+2}}{9}$$
The resulting data is shortened by four points as the end points have insufficient neighbors to be averaged.
plt.plot(potent[2:-2], tri_smooth)
plt.xlabel('Potential, V')
plt.ylabel('Current, A');
[Figure: cyclic voltammogram after triangular smoothing; y-axis: Current, A]
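A five-point triangular smooth can be written compactly as a weighted moving average with np.convolve(). This is one possible implementation on assumed synthetic data and may differ from the approach used for the figure above.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
x = np.linspace(0, 2 * np.pi, 200)
noisy = np.sin(x) + 0.2 * (rng.random(200) - 0.5)   # assumed noisy data

# Triangular weights 1, 2, 3, 2, 1 normalized by their sum of 9
weights = np.array([1, 2, 3, 2, 1]) / 9

# mode='valid' keeps only positions where all five points exist
tri = np.convolve(noisy, weights, mode='valid')
print(len(noisy), len(tri))
```

The smoothed array is four points shorter than the input, so it pairs with x[2:-2] when plotting.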
While the above filters take some form of the mean of the surrounding data points, a median filter takes the median. This
filter is sometimes applied to images because it reduces noise while maintaining sharp edges.
plt.plot(potent[1:-1], median_smooth)
plt.xlabel('Potential, V')
plt.ylabel('Current, A');
[Figure: cyclic voltammogram after median filtering; y-axis: Current, A]
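A three-point median filter can be sketched by stacking offset slices and taking the median down each column. The short array below, with one deliberate spike, is assumed data for illustration.

```python
import numpy as np

# Assumed data with a single noise spike (9.0)
data = np.array([1.0, 1.2, 9.0, 1.1, 0.9, 1.0, 1.3])

# Each column holds a point with its left and right neighbors
windows = np.stack((data[:-2], data[1:-1], data[2:]))
med = np.median(windows, axis=0)
print(med)
```

The spike is removed entirely rather than being spread over its neighbors as a mean-based filter would do. SciPy's scipy.signal.medfilt() function offers a ready-made alternative that keeps the original length by zero-padding the ends.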
6.2.4 Savitzky–Golay
Another approach is the Savitzky–Golay filter, which incrementally moves along the noisy data and fits sections (i.e.,
windows) of data points to a polynomial using least-squares minimization. While this approach had been previously
described in the mathematical literature, Abraham Savitzky and M. J. E. Golay are known for applying it to spectroscopy
([Link]). Conveniently, SciPy contains a built-in function for this called savgol_filter()
from the scipy.signal module shown below.
This function requires three arguments: the original data as a NumPy array; window_length, which is the width
of the moving window that the algorithm fits to a polynomial; and polyorder, which is the order of the polynomial
used for the moving data fit. You are encouraged to experiment with the window_length and polyorder arguments to see
what works best for your application. However, polyorder must be less than the window size, and the window must
be an odd integer.
plt.plot(potent, sg_smooth)
plt.xlabel('Potential, V')
plt.ylabel('Current, A');
[Figure: cyclic voltammogram after Savitzky–Golay filtering; x-axis: Potential, V; y-axis: Current, A]
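A minimal sketch of calling savgol_filter() is shown below on synthetic data. The noisy sine wave, window length of 51, and polynomial order of 3 are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(seed=2)
x = np.linspace(0, 2 * np.pi, 500)
clean = np.sin(x)
noisy = clean + 0.2 * (rng.random(500) - 0.5)   # assumed noisy data

# Fit a cubic polynomial within a sliding 51-point window
sg = savgol_filter(noisy, window_length=51, polyorder=3)

# Compare the scatter about the true curve before and after filtering
print(np.std(noisy - clean), np.std(sg - clean))
```

Unlike the moving averages above, the output has the same length as the input, so it can be plotted against the original x-values.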
The Savitzky–Golay filter appears to have done a decent job removing the noise. Despite there being some remaining
noise and other artifacts in the CV, the denoised CV makes it significantly easier to locate the maxima and minima in this
example.
Another approach to filtering noise is to filter based on frequency. Many times, random noise in data occurs at a different
frequency than the data itself, and the noise can be reduced by filtering noise frequency ranges while maintaining signal
frequencies. If the noise is higher frequency than the signal, it can be filtered out with what is known as a low-pass filter.
Alternatively, filtering out low-frequency noise is known as a high-pass filter, and filtering out noise both above and below
the signal frequency is known as a band-pass filter. Frequency filtering is somewhat involved because it requires
window functions, which are covered in the Think DSP book by Allen Downey listed at the end of this chapter. Instead,
we will just look at the distribution of signal and noise frequencies in synthetic data. This is useful for analyzing the noise
in data and also is used routinely in nuclear magnetic resonance (NMR) spectroscopy and Fourier Transform infrared
spectroscopy (FTIR).
To convert the data from the time domain to the frequency domain, we will use the fast Fourier transform (FFT) algorithm.
This algorithm is only for data that is periodic. Below, synthetic data is generated oscillating at 62.0 Hz with some random
noise to make it more interesting.
t = np.linspace(0,1,1000)
freq = 62.0 # Hz
signal = np.sin(freq*2*np.pi*t)
noise = rng.random(1000)
data = signal + 0.5 * noise
plt.plot(t, data)
plt.xlabel('Time, s');
[Figure: noisy synthetic 62.0 Hz signal; x-axis: Time, s]
SciPy contains an entire module called fft dedicated to Fourier transforms and inverse Fourier transforms. We will use
the basic fft() function for our synthetic data, which returns an array of complex values with real and imaginary
components. For plotting, we will simply look at the real component of the result using .real.
Note
You may see code around that performs Fourier transforms using the scipy.fftpack module. The fftpack
module is legacy code and should no longer be used.
[Figure: real component of the Fourier transform of the synthetic data]
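As a hedged sketch of locating the dominant frequency, the example below pairs fft() with fftfreq() from scipy.fft. It departs from the text in two assumed ways: the time axis uses endpoint=False so that 62 Hz falls exactly on a frequency bin, and the magnitude (np.abs) is inspected instead of the real component.

```python
import numpy as np
from scipy.fft import fft, fftfreq

rng = np.random.default_rng(seed=3)
n = 1000
t = np.linspace(0, 1, n, endpoint=False)        # 1 s sampled at 1000 Hz
data = np.sin(62.0 * 2 * np.pi * t) + 0.5 * rng.random(n)

spectrum = fft(data)                            # complex-valued result
freqs = fftfreq(n, d=1 / n)                     # frequency (Hz) of each FFT bin

half = slice(1, n // 2)                         # skip the 0 Hz bin and negative frequencies
i_top = np.argmax(np.abs(spectrum[half])) + 1   # +1 restores the full-array index
print(freqs[i_top])
```

The 0 Hz bin is skipped because noise that is not centered on zero produces a large constant offset there.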
Signal data or information taken from signal data often conforms to linear, polynomial, or other mathematical trends, and
fitting data is important because it allows scientists to determine the equation describing the physical or chemical behavior
of the data. In data fitting, the user provides the data and the general class of equation expected, and the software returns
the coefficients for the equation. Interpolation is the method of predicting values in regions among known data points.
The calculation of values where no data was collected can be accomplished by either using the coefficients derived from a
curve fit or using a special interpolation function that generates a callable function to calculate the new data points. Both
approaches are demonstrated below.
Before we can do our fitting, we need some new, noisy data to examine. A linear set of data with added noise is generated
below along with a second-order curve with the noise.
x = np.linspace(0,10,100)
noise = rng.random(100)
y_noisy = 2.2 * x + 3 * noise
y2_noisy = 3.4 * x**2 + 4 * x + 7 + 3 * noise
plt.scatter(x, y_noisy);
[Figure: noisy linear data]
plt.scatter(x, y2_noisy);
[Figure: noisy second-order data]
Now we can fit the noisy linear data with a line using the NumPy np.polyfit(x, y, degree) function. The
function takes the x and y data along with the degree of the polynomial.
A line is a first-degree polynomial, and the function returns an array containing the coefficients for the fit with the highest
order coefficients first. This is effectively a linear regression.
a, b = np.polyfit(x, y_noisy, 1)
print((a, b))
(np.float64(2.240683698467965), np.float64(1.2799581449656965))
For a linear equation of the form $y = ax + b$, we get an array of the form array([a, b]), so the fitted equation
above is $y = 2.24x + 1.28$. The positive shift of the $y$-intercept above zero is not surprising because the random
noise we added is not centered around zero; the average of our rng.random() noise is about 0.5, so after scaling by 3
it shifts the data upward by about 1.5 on average. This could be remedied either by subtracting 0.5 from the noise or by
drawing the noise from a distribution that is centered around zero, such as the normal distribution (e.g., randn()).
We can view our linear regression by plotting a line on top of our data points using the coefficients found above.
y_reg = a*x + b
[Figure: noisy linear data with the fitted regression line; legend: linear data, linear regression]
We can also obtain the statistics for our fit using the linregress() function from the SciPy stats module. Note
that this does not return the 𝑟2 value but instead the 𝑟-value which can be squared to generate the 𝑟2 value.
LinregressResult(slope=np.float64(2.2406836984679646), intercept=np.float64(1.
↪2799581449656952), rvalue=np.float64(0.9929311301807056), pvalue=np.float64(1.
↪stderr=np.float64(0.15660399723300497))
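The call that produces a result like the one above can be sketched as follows. The synthetic data and seed are assumptions, so the numbers will differ from the output shown.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(seed=4)
x = np.linspace(0, 10, 100)
y = 2.2 * x + 3 * (rng.random(100) - 0.5)   # assumed data with zero-centered noise

result = linregress(x, y)
r_squared = result.rvalue ** 2              # square the r-value to obtain r^2
print(result.slope, result.intercept, r_squared)
```

The individual statistics are accessed as attributes of the returned result, which avoids having to remember the order in which they are reported.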
Note
We are starting to see examples of functions that return multiple values which can be assigned to multiple variables
using tuple unpacking like below.
x, y = func(z)
There may be times when you don’t need all of the returned values from a function. In these instances, it is common
to use __ (double underscore) as a junk variable which is broadly understood to store information that will never be
used in the code. You may also see a _ (single underscore) used for this purpose, but this is discouraged as a single
underscore is also used by the Python interpreter to store the last output.
Fitting to a polynomial of a higher order works the same way except that the order is above one. You will need to already
know the order of the polynomial, or you can make a guess and see how well a particular order fits the data. Below, the
np.polyfit() function determines the second-order data can be fit by the equation $y = 3.40x^2 + 3.95x + 8.70$. We
can again plot this fit equation over our data points to see how well the data agree with our equation.
a, b, c = np.polyfit(x, y2_noisy, 2)
print((a, b, c))
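Once the coefficients are known, the fitted curve can be evaluated with np.polyval() rather than typing out the polynomial by hand. The sketch below uses assumed synthetic data with zero-centered noise, so its coefficients will not match those quoted above.

```python
import numpy as np

rng = np.random.default_rng(seed=5)
x = np.linspace(0, 10, 100)
y = 3.4 * x**2 + 4 * x + 7 + 3 * (rng.random(100) - 0.5)   # assumed data

coeffs = np.polyfit(x, y, 2)      # array([a, b, c]), highest order first
y_fit = np.polyval(coeffs, x)     # evaluates a*x**2 + b*x + c at each x
print(coeffs)
```

Keeping the coefficients in a single array also makes it easy to change the polynomial order without rewriting the plotting code.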
Note
See section 14.2 for instructions on fitting data to equations other than linear or polynomial using scipy.optimize
functions.
Multivariable linear regression (aka multiple linear regression) is similar to the linear regression seen in section 6.4.1
except that there are multiple independent variables. This takes the form below where $y$ is the dependent variable, $x_i$ are
the independent variables (plural) with coefficients $a_i$, and $b$ is the bias term. There are $k$ independent variables below.
$$y = a_0x_0 + a_1x_1 + \cdots + a_{k-1}x_{k-1} + b$$
The goal is to solve for the $a$ coefficients and the value for $b$ given a series of $x$-values with their corresponding $y$-values.
Essentially, this is taking regular linear regression to three dimensions or higher. There are multiple methods available in
Python to solve this type of problem including, but not limited to, the following.
Note
Some of these options are essentially the same thing just implemented with different libraries or functions.
The first option will often involve the fewest lines of code but does require knowledge of the scikit-learn library.
Because options 1, 2, and 4 either require other libraries or specialized knowledge not yet addressed, we will
solve a multivariable linear regression problem using the np.linalg.lstsq() function.
In the example below, we have an array y which contains our dependent variable values and array X which contains our
independent variable values. For this approach, we want these two arrays to be related by the following equation where
a is an array containing the coefficients and bias term. The issue is that array X has one column per independent variable
but no column for the bias term, so there is nothing for $b$ to multiply.
$$y = X \cdot a$$
We need to add a column of ones to array X as is done below. Now when performing the above multiplication, $b$ is always
multiplied by 1 to return $b$.
X = np.column_stack((X, np.ones(6)))
X
To solve for array a, we will use the np.linalg.lstsq() function which takes the arrays containing independent
and dependent variables in this order.
a = np.linalg.lstsq(X, y)
a
The output includes the coefficients plus other information about the fit, such as the sum of the squared residuals from the
fit. If you just want the coefficients, use indexing like below.
a[0]
This result is interpreted as the equation $y = 5.22x_0 + 9.07x_1 + 1.22x_2 + 8.66x_3 + 9.78$.
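The whole procedure can be sketched on a small synthetic problem where the true coefficients are known in advance. The data, coefficient values, and variable names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=6)
X = rng.random((6, 2))                   # 6 observations, 2 independent variables
true_a, true_b = np.array([1.5, -2.0]), 4.0
y = X @ true_a + true_b                  # noise-free so the answer is exact

X1 = np.column_stack((X, np.ones(6)))    # the column of ones multiplies the bias term
coeffs, residuals, rank, sv = np.linalg.lstsq(X1, y, rcond=None)
print(coeffs)
```

The last element of the returned coefficients corresponds to the column of ones and is therefore the bias term. Passing rcond=None is an optional choice that avoids a warning in some NumPy versions.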
6.4.4 Interpolation
The practical difference between the np.polyfit() function and the interpolation functions in SciPy is that the former
returns coefficients for the equation, while the interpolation functions return a Python function that can be used to calculate
values. There are times when one is more desirable than the other, depending upon your application. Below we will use
the interpolation function to interpolate a one-dimensional function.
Below is a damped sine wave that we will interpolate from ten data points.
x = np.linspace(1,20, 10)
y = np.sin(x)/x
plt.plot(x,y, 'o');
[Figure: ten sampled data points from the damped sine wave]
To interpolate this one-dimensional function, we will use the interp1d() function from the scipy.interpolate module. Along
with the x and y values, interp1d() accepts a mode of interpolation through the kind keyword, which can include the items listed in
Table 3.
Table 3 Modes for the interp1d() Function
Kind           Description
'linear'       Linear interpolation between data points
'zero'         Constant value until the next data point
'nearest'      Predicts values equaling the closest data point
'quadratic'    Interpolates with a second-order spline
'cubic'        Interpolates with a third-order spline
Below is a demonstration of both linear and cubic interpolation. The two functions f() and f2() are generated and can
be used like any other Python function to calculate values.
xnew = np.linspace(1,20,100)
[Figure: linear and cubic interpolation of the sampled data; legend: Linear Interpolation, Cubic Interpolation, Sampled Data]
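A minimal sketch of generating and calling the two interpolation functions is shown below. The names f_lin and f_cub are assumptions for illustration and may differ from the names used for the figure above.

```python
import numpy as np
from scipy.interpolate import interp1d

# Ten sampled points from the damped sine wave
x = np.linspace(1, 20, 10)
y = np.sin(x) / x

f_lin = interp1d(x, y, kind='linear')   # straight lines between points
f_cub = interp1d(x, y, kind='cubic')    # third-order spline through the points

xnew = np.linspace(1, 20, 100)          # must stay within the sampled x range
y_lin = f_lin(xnew)
y_cub = f_cub(xnew)
print(y_lin.shape, y_cub.shape)
```

Both functions pass exactly through the sampled points and differ only in how they fill the gaps between them. By default, requesting values outside the sampled x range raises an error.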
The baseline of chemical spectra may sometimes slope or undulate, requiring baseline correction. This is a two-step
process - first, the baseline needs to be identified, and then the baseline is subtracted from the original spectrum. Predicting
a baseline for a spectrum is not a trivial task. Fortunately, the pybaselines library provides Python implementations of
various algorithms for determining the baseline. Because pybaselines is not a standard library with Anaconda or Colab,
it needs to be installed using either pip or conda.
For our example below, we will correct the baseline of an IR spectrum of 2-pentanone. We can see that the baseline
curves upward at frequencies below 2000 cm⁻¹.
[Figure: IR spectrum of 2-pentanone with a sloping baseline; x-axis: Wavenumbers (cm⁻¹); y-axis: Absorbance]
To identify a baseline, we will use the Baseline class of pybaselines imported below. The first step is to create a
baseline fitter object, which accepts the x-axis data from your spectrum.
Next, the fitter object is used to predict the baseline using various baseline algorithms. There are numerous algorithms
available, and many algorithms have multiple parameters that fine-tune how the baseline is identified. Finding the ideal
algorithm and parameters to use can come down to trial-and-error, so feel free to try a few and see what works best.
We will try the modified polynomial, asymmetric least squares, and morphological-based algorithms below, but there are
many others to choose from. The output of each prediction is a background and the parameters from the baseline fit.
Tip
If you are interested in learning more about the algorithms, see the pybaselines algorithms pages.
[Figure: IR spectrum overlaid with the predicted baselines; legend: IR Spectrum, Modified Polynomial, Asymmetric Least Squares, Morphological; x-axis: Wavenumbers (cm⁻¹); y-axis: Absorbance]
Finally, we will subtract the baseline from the original data. The good news is that the baseline fitter generated the baseline
as a NumPy array with the same size as the original data, so subtraction is a matter of subtracting one array from another.
plt.plot(wavenums, absorb_corrected)
plt.xlabel('Wavenumbers (cm$^{-1}$)')
plt.ylabel('Absorbance')
plt.gca().invert_xaxis()
[Figure: baseline-corrected IR spectrum; x-axis: Wavenumbers (cm⁻¹); y-axis: Absorbance]
Further Reading
The ultimate authority on NumPy and SciPy is the NumPy & SciPy Documentation page listed below. As changes and
improvements occur in these libraries, this is one of the best places to find information. For information on digital signal
processing (DSP), there are numerous sources such as Allen Downey’s Think DSP book or articles such as those listed
below.
1. NumPy and SciPy Documentation. [Link] (free resource)
2. Downey, Allen B. Think DSP, Green Tea Press, 2016. [Link] (free resource)
3. O’Haver, T. C. An Introduction to Signal Processing in Chemical Measurement. J. Chem. Educ. 1991, 68 (6),
A147-A150. [Link]
4. Savitzky, A.; Golay, M.J.E. Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Anal.
Chem. 1964, 36 (8), 1627–1639. [Link]
Exercises
Complete the following exercises in a Jupyter notebook. Any data file(s) referred to in the problems can be found in the
data folder in the same directory as this chapter’s Jupyter notebook. Alternatively, you can download a zip file of the data
for this chapter from here by selecting the appropriate chapter file and then clicking the Download button.
1. Import the file CV_K3Fe(CN)[Link] which contains a cyclic voltammogram for potassium cyanoferrate. Plot the
data with green dots on the highest point(s) and red triangles on the lowest point(s).
2. Import the file titled CV_K3Fe(CN)[Link] and determine the inflection points. Plot the data with a marker on
both inflection points. Hint: There are two inflection points in these data with one running in the reverse direction
making it have a negative slope.
3. Generate noisy synthetic data from the following code.
a) Smooth the data using moving averages and plot the smoothed signal. Feel free to use the moving averages code
from this chapter.
b) Smooth the same data using a Savitzky–Golay filter. Plot the smoothed signal.
4. Import the ³¹P NMR file titled fid_31P.csv and determine the number of major frequencies in this wave. Keep in
mind that there will be a second echo for each peak.
5. The wavelength of emitted light (𝜆) from hydrogen is related to the electron energy level transitions by the following
equation, where R∞ is the Rydberg constant, n𝑖 is the initial principal quantum number of the electron, and n𝑓 is
the final principal quantum number of the electron.
$$\frac{1}{\lambda} = R_\infty \left( \frac{1}{n_f^2} - \frac{1}{n_i^2} \right)$$
The following is experimental data of the wavelengths for five different transitions from the Balmer series (i.e., n𝑓 = 2).
n_i = [3, 4, 5, 6, 7]
wl = [656.1, 485.2, 433.2, 409.1, 396.4]
Calculate a value for the Rydberg constant (R∞ ) using a linear fit of the above data. The data will need to be first
linearized.
6. The following data is for the initial rate of a chemical reaction for different concentrations of starting material (A).
Calculate a rate constant (k) for this reaction using a nonlinear fit.
7. A colorimeter exhibits the following absorbances for known concentrations of Red 40 food dye. Generate a
calibration curve using the data below and then calculate the concentration of Red 40 dye in a pink soft drink with an
absorbance of 0.481.
8. The following are points on the 2s radial wave function (Ψ) for a hydrogen atom with respect to the radial distance
from the nucleus in Bohrs (𝑎0 ). Visualize the radial wave function as a smooth curve by interpolating the following
data points.
Radius (𝑎0 ) Ψ
1.0 0.21
5.0 -0.087
9.0 -0.027
13.0 -0.0058
17.0 -0.00108
9. The file Cp2Fe_Mossbauer.txt contains Mossbauer data for a ferrocene complex where the left data column is
velocity in millimeters per second and the right column is relative transmission. Using Python, determine the
velocities of the six negative peaks. Plot the spectrum with dots on the lowest point of each negative peak, and be
sure to label your axes complete with units.
10. Load the file XRF_Cu.csv, which contains X-ray fluorescence (XRF) data for elemental copper, and use Python
to determine the energy in eV of the two peaks. Notice that the x-axis is not in eV yet (see row 17 of data). You
are advised to load the data using a pandas function, and setting a threshold will likely be necessary.
CHAPTER 7: IMAGE PROCESSING & ANALYSIS
Images are a major data format in chemistry and other sciences. They can be electron microscope images of a surface,
photos of a reaction, or images from fluorescence microscopy. Image processing and analysis can be performed using
software like Photoshop or GIMP, but this can be tedious and subjective when done manually. A better alternative is to
have software automate the entire process to provide consistent, precise, and objective processing of images and
measurement of their features.
Among the more popular Python libraries for performing scientific image analysis is scikit-image. This is a library
specifically designed for scientific image analysis and includes a wide variety of tools for processing images and extracting
information from them. Examples of tools in scikit-image include functions for boundary detection, object counting,
entropy quantification, color space conversion, image comparison, and many others. Even though there are other Python
libraries for working with images, such as pillow, scikit-image is designed for scientific image analysis while pillow is
intended for more fundamental operations such as image rotation and cropping.
Like SciPy, scikit-image stores most of its functions in modules, so it is common to import modules individually. For
example, if the user wants to import the color module, it is imported using the following code.
Multiple modules can also be imported in a single import such as below. A list of modules and their descriptions is shown
in Table 1, and additional information can be found on the project website at [Link]
We can also import a single function from a module using the following code structure.
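The import statements referred to above are not shown in this copy of the text. A plausible sketch of the three import styles, using module and function names that appear elsewhere in this chapter, is the following.

```python
# Import a single module from scikit-image
from skimage import color

# Import several modules in a single statement
from skimage import data, io, filters

# Import a single function from a module
from skimage.color import rgb2gray
```

With the last form, the function is called as rgb2gray() rather than color.rgb2gray().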
Table 1 Scikit-Image Modules
Module Description
color Converts images between color spaces
data Provides sample images
draw Generates coordinates of geometric shapes
exposure Examines and modifies image exposure levels
external.tifffile Handles reading, writing, and visualizing TIFF files
feature Feature detection and calculation
filters Contains various image filters and functions for calculating threshold values
filters.rank Returns localized measurements in the image
graph Finds optimized paths across the image
io Supports reading and writing images
measure Performs a variety of measurements and calculations on or between two images
morphology Generates objects of a specified morphology
novice Provides simple image functions for beginners
restoration Includes image restoration tools
segmentation Identifies boundaries in an image
transform Performs image transformations including scaling and image rotation
util Converts images into different encodings (e.g., floats to integers) and other modifications such as inverting the image values and adding random noise to an image
viewer Image viewer tools
This chapter assumes the following imports. Because we will be doing some plotting, this includes the matplotlib
import, and it is assumed that inline plotting is enabled. In addition, there are functions inside scikit-image that are
not in a module, so we also need to import skimage itself.
import numpy as np
import matplotlib.pyplot as plt
import skimage
from skimage import data, io, color
Despite the power and utility of the scikit-image library, there is a significant amount of image processing and analysis
that can be performed using NumPy functionality. This is especially true being that scikit-image imports/stores images
as NumPy arrays.
Most images are raster images, which are essentially a grid of pixels where each location on the grid is a number describing
that pixel. If the image is a grayscale image, these values represent how light or dark each pixel is; and if it is a color
image, the value(s) at each location describe the color. Figure 1 shows a grayscale photo of a flask containing crystals,
with a 10 × 10 pixel excerpt showing the brightness values from the photo. While there is another major class of images
known as vector images, we will restrict ourselves to dealing with raster images in this chapter as primary scientific data
tend to be raster images.
Figure 1 An excerpt of values from a grayscale image showing values representing the brightness of each pixel.
The scikit-image library includes a data module containing a series of images for the user to experiment with. To
display images in the notebook, use the matplotlib plt.imshow() function. Each image in the data module has
a function for fetching the image, and you can find a complete list of images/functions in the data module by typing
help(data). We will open and view the image of a grayscale lunar surface using the data.moon() function.
Warning
The scikit-image io.imshow() function is being deprecated and will be removed in version 0.27. If you have used
this function in the past, consider using matplotlib’s plt.imshow() instead.
moon = data.moon()
plt.imshow(moon);
The image does not look like a grayscale image because matplotlib is treating the image as data (see section 7.1.4). Use
cmap='gray', vmin=0, vmax=255 to make the image look like a grayscale image.
Tip
If the image turns out black, you probably need to change vmax to 1. See section 7.2.2 for a discussion on encoding.
moon = data.moon()
plt.imshow(moon, cmap='gray', vmin=0, vmax=255);
If we take a closer look at the data contained inside the lunar surface image, we find a two-dimensional NumPy array
filled with integers ranging from 0 → 255.
moon
Each of these values represents a lightness value where 0 is black, 255 is white, and all other values are various shades
of gray. To manipulate the image, we can use NumPy methods because scikit-image stores images as ndarrays. For
example, the image can be darkened by dividing all the values by two. Because this array is designated to contain integers
(dtype = uint8), integer division (//) is used to avoid floats.
moon_dark = moon // 2
plt.imshow(moon_dark, cmap='gray', vmin=0, vmax=255);
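To see why integer division matters here, the following sketch compares the two division operators on a small, made-up array standing in for an 8-bit grayscale image.

```python
import numpy as np

# Tiny stand-in for a grayscale image (assumption: 8-bit unsigned integers)
img = np.array([[0, 100, 200],
                [50, 150, 255]], dtype=np.uint8)

dark_int = img // 2      # floor division keeps the uint8 dtype
dark_float = img / 2     # true division promotes the array to floats

print(dark_int.dtype, dark_float.dtype)
```

Keeping the uint8 data type means the darkened image can still be displayed with vmin=0 and vmax=255.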
Color images are slightly more complicated to represent because all necessary colors cannot be represented by single
integers from 0 → 255. Probably the most popular way to digitally encode colors is RGB, which describes every color as a
combination of red, green, and blue (Figure 2). These are also known as color channels, and this is typically how computer
monitors display colors. If you look closely enough at the screen, which may require a magnifying glass for high-resolution
displays, you can see that every pixel is really made up of three lights: a red, a green, and a blue. The perceived color of
a pixel is a mixture or blend of its red, green, and blue values. Because every pixel now has three values to describe it, a
NumPy array that defines a color image is three-dimensional. The first two dimensions are the height and width of the
image, and the third dimension contains values from each of the three color channels.
[row, column, channel]
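As a small illustration of this layout, the sketch below builds a made-up 2 × 2 color image by hand. The pixel values are chosen only for demonstration.

```python
import numpy as np

# A 2 x 2 color image (made-up values): shape is (rows, columns, channels)
rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[0, 0] = [255, 0, 0]      # top-left pixel is pure red
rgb[0, 1] = [0, 255, 0]      # top-right pixel is pure green
rgb[1, 0] = [0, 0, 255]      # bottom-left pixel is pure blue
rgb[1, 1] = [255, 255, 255]  # bottom-right pixel is white

print(rgb.shape)    # (2, 2, 3)
print(rgb[1, 1])    # all three channel values of one pixel
```

Indexing with only a row and a column returns the three channel values for that pixel.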
Figure 2 An excerpt of the red, green, and blue color channels for a small portion of a color image. The values in each
channel represent the brightness of that color in each pixel.
By scikit-image convention, colors are encoded in the order red, green, and then blue, so the 0 channel is red,
for example.
We can look at an example of a color photo by loading an image from the Hubble Space Telescope. This image is included
with the scikit-image library for users to experiment with.
hubble = data.hubble_deep_field()
plt.imshow(hubble);
array([[[15, 7, 4],
[15, 9, 9],
[ 9, 4, 8],
...,
[18, 11, 5],
[16, 19, 10],
[15, 10, 6]],
[[ 2, 7, 0],
[ 5, 11, 7],
[13, 19, 17],
...,
[11, 10, 5],
[13, 18, 11],
[ 9, 11, 6]],
...,
Looking at the array, you will notice that it is indeed three-dimensional with values residing in triplets. You may also
notice that the numbers are rather small because most pixels in this particular image are near black. If we want to look
at just the red values of the image, this can be accomplished by slicing the array. Red is the first layer in the third
dimension, so we should slice it as hubble[:, :, 0]. The brighter a group of pixels in the red channel image, the
more red color is present in that region.
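The slicing described above can be tried on any three-dimensional array. The sketch below uses a small array of random values as a stand-in for the Hubble photo, so the numbers themselves are meaningless.

```python
import numpy as np

# Made-up 2 x 3 RGB image standing in for the Hubble photo
rng = np.random.default_rng(seed=0)
img = rng.integers(0, 256, size=(2, 3, 3), dtype=np.uint8)

red = img[:, :, 0]     # every row, every column, channel 0
green = img[:, :, 1]
blue = img[:, :, 2]

print(img.shape, red.shape)
```

Each channel is a two-dimensional array with the same height and width as the original image, so it can be displayed like a grayscale image.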
Alternatively, images can be loaded from an external source using the io.imread() function provided by scikit-image.
This function requires one argument to tell scikit-image which image the user wants to load. If your Jupyter notebook is
in the same directory as the image you want to load, you can simply input the full file name, including the extension, as a
string. Otherwise, you will need to include the full path to the file in addition to the name. Below, an image showing a
flask full of [Ni(CH3CN)6][BF4]2 crystals is read into Python.
flask = io.imread('data/[Link]')
plt.imshow(flask);
flask
...,
7.1.4 Colormaps
When matplotlib deals with a NumPy array, it treats it as generic data, not an image. The human mind does not effectively
handle data on this scale, so to make it easier for humans to interpret, matplotlib maps the values to colors according to
the colormap shown on the right of the figure. This is known as false color because the colors in the image are not the real
image colors. By default, the colormap viridis is used, but there are many other colormaps available to choose from in
matplotlib. Below, the red color channel from the Hubble image is displayed using the [Link]() function.
[Figure: red channel of the Hubble image in the default viridis colormap, with a colorbar on the right running from 0 to 250.]
To change colormaps, input the name of a different colormap as a string in the optional cmap argument (e.g.,
plt.imshow(hubble[:,:,0], cmap='magma')). See [Link] for a list of available colormaps. It is strongly
encouraged to use one of the perceptually uniform colormaps because they are more accurately interpreted by humans and
also show up as a smooth, interpretable gradient when printed on a grayscale printer. Below is the display of the Hubble
image red channel using the Reds colormap.
Tip
To reverse the direction of any matplotlib colormap, include an _r after the colormap name. For example, in
the above example with the Reds colormap, the larger the value, the more red the pixel representing the value
becomes. If you use Reds_r, the larger the value, the less red the pixel representing the value becomes.
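The effect of the _r suffix can be checked with a short sketch. The gradient array below is made up, and the Agg backend line is only there so the sketch also runs outside a notebook; it is not needed when inline plotting is enabled.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend (assumption: not in a notebook)
import matplotlib.pyplot as plt
import numpy as np

values = np.linspace(0, 255, 64).reshape(8, 8)   # made-up gradient "image"

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(values, cmap="Reds")      # large values are the most red
ax2.imshow(values, cmap="Reds_r")    # large values are the least red
ax1.set_title("Reds")
ax2.set_title("Reds_r")

# The reversed colormap holds the same colors in the opposite order
reds = plt.get_cmap("Reds")
reds_r = plt.get_cmap("Reds_r")
same = np.allclose(reds(0.0), reds_r(1.0))
print(same)
```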
After processing an image, it is sometimes helpful to save the image to disk for records, reports, and presentations.
The plt.savefig() function works just fine if executed in the same Jupyter cell as the plt.imshow() function.
Alternatively, scikit-image provides an image saving function io.imsave(file_name, array) that operates
similarly to plt.savefig() except with a couple of image-specific arguments. One key difference is that
plt.savefig() does not take an array argument but instead assumes you want the recently displayed image saved, while
io.imsave(file_name, array) takes an array and can save an image even if it has not been displayed in the Jupyter
notebook. After running the code below, check the directory containing the Jupyter notebook, and there should be a new
file titled new_img.png.
io.imsave('new_img.png', hubble)
The scikit-image library, along with NumPy, also provides a variety of basic image manipulation functions such as adjusting
the color, managing how the data is numerically represented, and establishing threshold cutoff values.
7.2.1 Colors
There are numerous ways to represent colors in digital data. The RGB color space is undoubtedly one of the most popular
color spaces, but there are others that you may encounter, such as HSV (hue, saturation, value) or XYZ. Scikit-image
provides functions in the color module for easily converting between these color spaces, and Table 2 lists some common
functions. See the scikit-image website for a more complete list.
Table 2 Common Functions from the color Module
Function Description
color.rgb2gray() Converts from RGB to grayscale
color.gray2rgb() Converts grayscale to RGB by replicating the gray values into three color channels
color.hsv2rgb() HSV to RGB conversion
color.xyz2rgb() XYZ to RGB conversion
hubble_gray = color.rgb2gray(hubble)
hubble_gray
You will notice that scikit-image takes a three-dimensional data structure, the third dimension being the color channels,
and converts it to a two-dimensional, grayscale structure as expected. One detail that may strike you as different is that
the values are decimals. Up to this point, grayscale images were represented as two-dimensional arrays of integers from
0 → 255. There is no rule that says lightness and darkness values need to be represented as integers. Above, they are
presented as floats from 0 → 1. This brings us to the next topic of encoding values.
7.2.2 Encoding
Encoding is how the values are presented in the image array. The two most common are integers from 0 → 255 or floats
from 0 → 1. However, there are other ranges outlined in Table 3. The difference between signed integers (int) and
unsigned integers (uint) is that unsigned integers are only non-negative integers starting with zero, while signed integers are
both positive and negative and centered approximately around zero. The approximate part is because there are equal numbers
of negative and non-negative integers, and because zero counts among the non-negative values, zero is not the exact center.
To determine what the range of values is for an image, scikit-image provides the function [Link]().
Scikit-image also provides some convenient functions for converting to various value ranges described in Table 3. These
functions are not contained in a module, so you will need to just do an import skimage to get access, which was done
at the start of this chapter. The one format that probably needs commenting on is the Boolean format. In this encoding,
every pixel is a True or False value, which is equivalent to saying 1 or 0. This is for black-and-white images where
each pixel is one of two possible values.
Table 3 Scikit-Image Functions for Converting Data Types
Functions Description
skimage.img_as_ubyte() Converts to integers from 0 → 255
skimage.img_as_uint() Converts to integers from 0 → 65535
skimage.img_as_int() Converts to integers from -32768 → 32767
skimage.img_as_bool() Converts to Boolean (i.e., True or False) format
skimage.img_as_float32() Converts to floats from 0 → 1 with 32-bit precision
skimage.img_as_float64() or img_as_float Converts to floats from 0 → 1 with 64-bit precision
skimage.dtype_limits(hubble_gray)
(-1, 1)
hubble_gray_uint8 = skimage.img_as_ubyte(hubble_gray)
skimage.dtype_limits(hubble_gray_uint8)
(0, 255)
If a grayscale image is encoded with floats from 0 → 1, then it is necessary to set vmin=0 and vmax=1. These are the
min and max possible values for the image encoding and are used to ensure that the range of possible values extends
completely across the colormap. If these two parameters are excluded, matplotlib will automatically adjust how the values
map to colors to use the full range of the colormap in the displayed image.
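To make the two most common encodings concrete, the sketch below converts a small, made-up float image to unsigned 8-bit integers and back using only NumPy. This is conceptually similar to what img_as_ubyte() and img_as_float() do for this case, although the library functions also handle other data types and out-of-range values, which this sketch ignores.

```python
import numpy as np

# Made-up grayscale values encoded as floats from 0 to 1
gray_float = np.array([[0.0, 0.25],
                       [0.5, 1.0]])

# Rescale to 0-255 and store as unsigned 8-bit integers
gray_uint8 = np.round(gray_float * 255).astype(np.uint8)

# And back again to floats from 0 to 1
gray_back = gray_uint8 / 255

print(gray_uint8)
print(np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)
```

The round trip is not exact because only 256 distinct values are available in the integer encoding.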
Before trying to extract certain types of information or identify features in an image, it is sometimes helpful to first
increase the contrast of an image. There are a number of ways of doing this, including thresholding and modification
of the image histogram. Some approaches can be performed using NumPy array manipulation, but scikit-image also
provides convenient functions designed for these tasks.
Thresholding can be used to generate a black-and-white image (i.e., not grayscale) by converting gray values at or below a
brightness threshold to black and above the threshold to white. The threshold can be set manually or by an algorithm that
chooses an optimal value customized to each image. We will start with manually setting a threshold. The grayscale image
generated from rgb2gray() is encoded with floats from 0 → 1, so a threshold of 0.65 is chosen by experimentation.
A black-and-white image is then generated as a Boolean. The resulting black-and-white image is shown below.
chem = [Link]()
chem_gray = color.rgb2gray(chem)
plt.imshow(chem_gray, cmap='gray', vmin=0, vmax=1);
chem_bw = skimage.img_as_ubyte(chem_gray > 0.65)
# the comparison generates a Boolean array; img_as_ubyte() converts it to 0 and 255
plt.imshow(chem_bw, cmap='gray', vmin=0, vmax=255);
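The comparison at the heart of manual thresholding is ordinary NumPy. A minimal sketch with made-up gray values and the same 0.65 threshold is shown below.

```python
import numpy as np

# Made-up grayscale image encoded as floats from 0 to 1
gray = np.array([[0.10, 0.70, 0.65],
                 [0.90, 0.30, 0.66]])

threshold = 0.65
bw = gray > threshold          # Boolean array: True is white, False is black

print(bw)
print(bw.sum(), "pixels are above the threshold")
```

Note that a value exactly equal to the threshold is treated as black, consistent with the description above.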
The appropriate threshold may vary from image to image, so manually setting a value is not always practical. Scikit-image
provides a number of functions, shown below in Table 4, from the filters module for automatically choosing a threshold.
If you are not sure which of the functions below to use, there is a try_all_threshold() function in the filters module
that will try seven of them and plot the results for easy comparison.
Table 4 Threshold Functions from the filters Module
Functions Description
filters.threshold_isodata() Threshold value from ISODATA method
filters.threshold_li() Threshold value from Li’s minimum cross entropy method
filters.threshold_local() Threshold mask (array) from local neighborhoods
filters.threshold_mean() Threshold value from mean grayscale value
filters.threshold_minimum() Threshold value from minimum method
filters.threshold_niblack() Threshold mask (array) from the Niblack method
filters.threshold_otsu() Threshold value from Otsu’s method
filters.threshold_sauvola() Threshold mask (array) from Sauvola method
filters.threshold_triangle() Threshold value from triangle method
filters.threshold_yen() Threshold value from Yen method
Note
Threshold value functions provide a single threshold value while threshold masks provide arrays of values the size of
the image. They are used in the same fashion except that the latter provides a per-pixel threshold.
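The difference between a threshold value and a threshold mask can be seen in the following sketch. The image is synthetic (a bright square on a dark, noisy background), and the block_size of 35 is an arbitrary choice for illustration.

```python
import numpy as np
from skimage import filters

# Synthetic bimodal image (assumption): dark background with a bright square
rng = np.random.default_rng(seed=1)
img = rng.normal(0.2, 0.03, size=(64, 64))
img[16:48, 16:48] = rng.normal(0.8, 0.03, size=(32, 32))

# A threshold *value* function returns a single number for the whole image
t_otsu = filters.threshold_otsu(img)
bw_global = img > t_otsu

# A threshold *mask* function returns an array with one threshold per pixel
t_local = filters.threshold_local(img, block_size=35)
bw_local = img > t_local

print(t_otsu, t_local.shape)
```

In both cases the black-and-white image is produced by the same comparison; NumPy handles a single number and a same-sized array equally well.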
Another method for increasing contrast is by modifying the image histogram. If the values from an image are plotted in
a histogram, you will see something that looks like the following.
[Figure: histogram of the image pixel values; counts versus values (0 to 255).]
This is a plot of how many of each type of brightness value is present in the image. There are practically no pixels in
the image that are black (value 0) or completely white (value 255), but there are two main collections of gray values.
The contrast of this image can be increased by performing histogram equalization, which spreads these values out more
evenly. The exposure module provides an equalize_hist() function for this task.
chem_eq = exposure.equalize_hist(chem_gray)
plt.imshow(chem_eq, cmap='gray', vmin=0, vmax=1);
Histogram equalization does not produce a black-and-white image, but it does make the dark values darker and the light
values lighter. If we look at the histogram for this image, it will be more even as shown below.
hist = exposure.histogram(chem_eq)
plt.plot(hist[0])
plt.xlabel('Values')
plt.ylabel('Counts');
[Figure: histogram of the equalized image; counts versus values (0 to 255).]
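For readers curious about what histogram equalization does under the hood, the idea can be sketched in a few lines of NumPy: each pixel is replaced by the value of the cumulative distribution function (CDF) at its brightness. The low-contrast image below is synthetic, and this sketch is only conceptually similar to equalize_hist(), not a replacement for it.

```python
import numpy as np

# Low-contrast synthetic image (assumption): floats squeezed between 0.4 and 0.6
rng = np.random.default_rng(seed=2)
img = rng.uniform(0.4, 0.6, size=(50, 50))

# Histogram and cumulative distribution function (CDF) of the pixel values
counts, bin_edges = np.histogram(img, bins=256)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
cdf = counts.cumsum() / counts.sum()

# Map each pixel to its CDF value, which spreads the values across 0 to 1
img_eq = np.interp(img, bin_centers, cdf)

print(img.min(), img.max())
print(img_eq.min(), img_eq.max())
```

The equalized values span nearly the full 0 → 1 range even though the original values occupied only a narrow band.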
The scikit-image library contains numerous functions for performing various scientific analyses - so many that they cannot
be comprehensively covered here. Below is a selection of some interesting examples that are relevant to science, including
counting objects in images, entropy analysis, and measuring eccentricity of objects. The examples below use mostly
synthetic data to represent various data you might encounter in the lab. Real data can be easily extracted from publications
but are not used here for copyright reasons.
A classic problem that translates across many scientific fields is to count spots in a photograph. A biologist may need to
quantify the number of bacteria colonies in a petri dish over the course of an experiment, while an astronomer may
want to count the number of stars in a large cluster. In chemistry, this problem may occur as a need to quantify the number
of nanoparticles in a photograph or to use the locations to calculate the average distances between the particles.
The good news is that the scikit-image library provides three functions that will take a photograph and return an array
with one row per blob indicating where the blobs are located in the image. If all you care about is the number of blobs,
simply find the length of the returned array. The three functions, listed below, implement the Laplacian of Gaussian
(LoG), Difference of Gaussian (DoG), and Determinant of Hessian (DoH) algorithms. The LoG algorithm is the most accurate
but the slowest, while the DoH algorithm is the fastest. These functions only accept two-dimensional images, so if it is a
color image, you will need to either convert it to grayscale or select a single color channel to work with.
skimage.feature.blob_log(image, threshold=)
skimage.feature.blob_dog(image, threshold=)
skimage.feature.blob_doh(image, threshold=)
dots = io.imread('data/[Link]')
plt.imshow(dots);
An image of black dots on a white background is imported above, but the blob detection algorithms work best with light
colors on a dark background. We will invert the image below by subtracting the values from the maximum value or by using
the util.invert() function.
or
dots_inverted = util.invert(dots)
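The first inversion approach, subtracting every value from the maximum possible value, can be sketched with NumPy as follows. The small array is a made-up stand-in for the dots image and assumes an 8-bit unsigned integer encoding.

```python
import numpy as np

# Made-up 8-bit grayscale image: dark dots (0) on a white background (255)
dots = np.full((4, 4), 255, dtype=np.uint8)
dots[1, 1] = 0
dots[2, 3] = 0

# Subtract every value from the maximum possible value for this encoding
max_value = np.iinfo(dots.dtype).max     # 255 for uint8
dots_inverted = max_value - dots

print(dots_inverted)
```

For an image encoded with floats from 0 → 1, the maximum value would be 1 instead.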
To detect the blobs, we will use the blob_dog() function as demonstrated below. The function allows for a threshold
argument to be set to adjust the sensitivity of the algorithm in finding blobs. A lower threshold results in smaller or
less intense blobs being included in the returned array.
The returned array includes three columns corresponding to the y position, x position, and sigma of each spot,
respectively, where sigma is the standard deviation of the Gaussian kernel that detected the spot and therefore reflects
its size. The x and y coordinates for an image start at the top left corner while typical plots start at the bottom left. Keep
this in mind when comparing the coordinates to the image. To confirm that scikit-image found all the blobs, we can plot
the coordinates on top of the image to see that they all line up. This is demonstrated below.
To find the number of spots, determine the length of the array using the Python len() function or by looking at the shape
of the array.
len(blobs)
12
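Because the detection code itself is not shown in this copy of the text, the following is a hedged sketch of the workflow on a synthetic image rather than on the dots file. The spot positions, spot width, and the min_sigma, max_sigma, and threshold values are all assumptions chosen for illustration, and suitable values will differ for real images.

```python
import numpy as np
from skimage import feature

# Synthetic image (assumption): three bright Gaussian spots on a dark background
centers = [(20, 20), (50, 70), (80, 30)]          # (row, column) positions
rows, cols = np.mgrid[0:100, 0:100]
image = np.zeros((100, 100))
for r, c in centers:
    image += np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * 3.0 ** 2))

# Each returned row holds the row (y) position, column (x) position, and sigma
blobs = feature.blob_dog(image, min_sigma=1, max_sigma=10, threshold=0.1)

print(len(blobs), "blobs found")
print(blobs)
```

To overlay the results on the image, the column values go on the x-axis and the row values on the y-axis, for example plt.plot(blobs[:, 1], blobs[:, 0], 'rx') after plt.imshow(image).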
The term entropy outside of the physical sciences is used to represent a quantification of disorder or irregularity. In image
analysis, this disorder is the amount of pixel (brightness or color) variation within a region of the image. As you will see
below, entropy is the highest near the boundaries and in noisy areas of a photograph. This makes an entropy analysis
useful for edge detection, checking for image quality, and detecting alterations to an image.
The filters.rank module contains the entropy function shown below. It works by going through the image pixel-by-
pixel and calculating the entropy in the neighborhood, which is the area around each pixel. An entropy value is recorded
in the new array at each location and can be plotted to generate an entropy map. The entropy function takes two required
arguments: the image (img) and a description of the neighborhood called a structuring element (selem).
filters.rank.entropy(img, selem)
array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]], dtype=uint8)
The neighborhood is defined as an array of ones and zeros. In this case, it is a disk of radius 5. The user can adjust this
value to the needs of the analysis.
[Figure: entropy map of the image with a colorbar running from 0 to 6.]
Note
The step above for converting the image representation from floats to integers is not strictly required, but the
entropy() function will generate a loss of precision warning if you do not.
Examination of the image shows that there is an increase in entropy near the edges of the features in the image as expected.
There are two regions (blue) that contain unusually low entropy. If you look back at the original image, these regions are
comparatively homogeneous in color.
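The entropy workflow can be sketched on a synthetic image as shown below. The image is an assumption (flat gray on the left half and random noise on the right half), and the neighborhood is passed positionally because the keyword for it is named selem in older scikit-image releases and footprint in newer ones.

```python
import numpy as np
from skimage import morphology
from skimage.filters.rank import entropy

# Synthetic 8-bit image (assumption): flat gray on the left, random noise on the right
rng = np.random.default_rng(seed=3)
img = np.full((60, 60), 128, dtype=np.uint8)
img[:, 30:] = rng.integers(0, 256, size=(60, 30), dtype=np.uint8)

# Neighborhood: a disk of radius 5 (an 11 x 11 array of ones and zeros)
neighborhood = morphology.disk(5)

# Entropy of the neighborhood around every pixel
entropy_map = entropy(img, neighborhood)

print(entropy_map[30, 10])   # deep inside the flat region
print(entropy_map[30, 50])   # deep inside the noisy region
```

The flat region has zero entropy because every pixel in the neighborhood has the same value, while the noisy region has high entropy.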
7.3.3 Eccentricity
Eccentricity is the measurement of how non-circular an object is. It runs from 0 → 1 with zero being a perfect circle
and larger values representing more eccentric objects. This can be useful for quantifying the shape of nanoparticles or
droplets of liquid. The measure module from scikit-image provides an easy method of measuring eccentricity. First,
let us import an image of ovals for an example. Alternatively, you are welcome to use the coins image from the
data module, but this will require some preprocessing such as increasing the contrast.
ovals = io.imread('data/[Link]')
plt.imshow(ovals);
props[0].eccentricity
0.9469273936534165
props[1].eccentricity
0.39666071911272044
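The steps that produce the props list are not shown in this copy of the text. One common approach, sketched below under that assumption, is to label the connected objects with measure.label() and then measure them with measure.regionprops(). The image here is synthetic (one circle and one stretched ellipse drawn with the draw module) rather than the ovals file.

```python
import numpy as np
from skimage import draw, measure

# Synthetic binary image (assumption): one circle and one elongated ellipse
img = np.zeros((200, 200), dtype=bool)
rr, cc = draw.ellipse(50, 50, 30, 30, shape=img.shape)     # circle: equal radii
img[rr, cc] = True
rr, cc = draw.ellipse(140, 100, 15, 60, shape=img.shape)   # stretched ellipse
img[rr, cc] = True

# Give each connected object its own integer label, then measure each one
labels = measure.label(img)
props = measure.regionprops(labels)

for region in props:
    print(region.label, round(region.eccentricity, 3))
```

The circle should give an eccentricity near zero, while the stretched ellipse should give a value close to one.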
Further Reading
The scikit-image library and NumPy are likely all you will need for a vast majority of your scientific image processing,
and the scikit-image project webpage is an excellent source of information and examples. The gallery page is particularly
worth checking out as it provides a large number of examples highlighting the library’s capabilities. In the event there is
an edge case that scikit-image cannot handle, the pillow library may be of some use. Pillow provides more fundamental image
processing functionality such as extracting metadata from the original file.
1. Scikit-image Website. [Link] (free resource)
Exercises
Complete the following exercises in a Jupyter notebook. Any data file(s) referred to in the problems can be found in the
data folder in the same directory as this chapter’s Jupyter notebook. Alternatively, you can download a zip file of the data
for this chapter from here by selecting the appropriate chapter file and then clicking the Download button.
1. Import the image titled NaK_THF.jpg using scikit-image.
a) Convert the image to grayscale using a scikit-image function.
b) Save the grayscale image using the io module.
2. Load the chelsea image from the scikit-image data module and convert it to grayscale. Display the image using
the scikit-image plotting function and display it a second time using a matplotlib plotting function. Why do they
look different?
3. Generate a 100 × 100 pixel image containing random noise generated by a method from the np.random module
such as random() or integers() (see section 4.7). Display the image in a Jupyter notebook along with a
histogram of the pixel values. Hint: you will need to flatten the array before generating the histogram plot.
4. Write your own Python function for converting a color image to grayscale. Then find the source code for the
scikit-image rgb2gray() function available on the scikit-image website and compare it to your own function. Are
there any major differences between your function and the scikit-image function?
5. Import an image of your choice either from the data module or of your own and convert it to a grayscale image.
a) Invert the grayscale image using NumPy by subtracting all values from the maximum possible value
b) Invert the original grayscale image using the invert() function in the scikit-image util module
6. Import a color image of your choice either from the data module or of your own and calculate the sum of all
pixels from each of the three color channels (RGB). Which color (red, green, or blue) is most prevalent in your
image?
7. The folder titled glow_stick contains a series of images taken of a glow stick over the course of approximately
thirteen hours along with a CSV file containing the times at which each image was taken in numerical order.
Quantify the brightness of each image and generate a plot of brightness versus time.
8. The JPG image file format commonly used for photographs degrades images during the saving process due to the
lossy compression algorithm while the PNG image file format does not degrade images with its lossless compression
algorithm.
a) To view how JPG distorts images, import the [Link] and [Link] images of the same NMR spectrum.
Subtract the two images from each other and visualize this difference to see the image distortions caused by JPG
compression.
b) Which of the above file formats is better for image-based data in terms of data integrity?
9. Import the image [Link] and determine the number of spots in the image using scikit-image. Plot the
coordinates of the spots you find with red x’s over the image to confirm your results. If your script missed any spots,
speculate as to why those spots were missed.
10. The image test_tube_altered.png has been altered using photo editing software. Generate and plot an entropy
map of the image to identify the altered regions.
11. Steganography is the practice of hiding information in an image or digital file to avoid detection. The file
hidden_img.png was created by combining an image with pseudorandom noise to mask the original image. Perform
an entropy analysis on the image to reveal the original image. You may need to adjust the size of the selection
element (selem) to detect the hidden image.
CHAPTER 8: MATHEMATICS
We have already been doing math throughout this book as Python is fundamentally performing mathematical operations
through arithmetic, calculus, algebra, and Boolean logic among others, but this chapter will dive deeper into symbolic
mathematics, matrix operations, and integration. Some of this chapter will rely on SciPy and NumPy, but for the symbolic
mathematics, we will use the popular SymPy library.
SymPy is the main library in the SciPy ecosystem for performing symbolic mathematics, and it is suitable for a wide
audience from high school students to scientific researchers. It is something like a free, open-source Mathematica
substitute that is built on Python and is arguably more accessible in terms of cost and ease of acquisition. All of the following
SymPy code relies on the following import which makes all of the SymPy modules available.
import sympy
SymPy differentiates itself from the rest of the Python and SciPy stack in that it returns exact or symbolic results whereas
Python, SciPy, and NumPy will generate numerical answers which may not be exact. That is to say, not only does SymPy
perform symbolic mathematical operations, but even if the result of an operation has a numerical answer, SymPy will
return the value in exact form. For example, if we take the square root of 2 using the math module, we get a numerical
value.
import math
math.sqrt(2)
1.4142135623730951
The value returned is a rounded approximation of the true answer. In contrast, if the same operation is performed using
SymPy, we get a different result.
sympy.sqrt(2)
$\sqrt{2}$
Because the square root of two is an irrational number, it cannot be represented exactly by a decimal, so SymPy leaves
it in the exact form of $\sqrt{2}$. If we absolutely need a numerical value, SymPy can be instructed to evaluate an approximate
numerical value using the evalf() method.
sympy.sqrt(2).evalf()
1.4142135623731
One of the advantages of evalf() is that it also accepts a significant figures argument.
sympy.sqrt(2).evalf(30)
1.41421356237309504880168872421
sympy.pi.evalf(40)
3.141592653589793238462643383279502884197
8.1.1 Symbols
Before SymPy will accept a variable as a symbol, the variable must first be defined as a SymPy symbol using the
symbols() function. It takes one or more symbols at a time and attaches them to variables.
x, c, m = sympy.symbols('x c m')
There is no value attached to x as it is a symbol, so now it can be used to generate symbolic mathematical expressions.
E = m * c**2
$c^2 m$
E**2
$c^4 m^2$
SymPy can also be used to solve expressions for numerical values, and there are times when only certain ranges or types
of numerical values make physical sense. For example, concentrations should only be nonnegative and real values. To
constrain solutions of an expression to real and nonnegative values, additional arguments known as predicates can be
added to the sympy.symbols() function such as nonnegative=True or real=True.
To see what constraints were placed on a variable, use the assumptions() function demonstrated below.
{'commutative': True,
'complex': True,
'extended_negative': False,
'extended_nonnegative': True,
'extended_real': True,
'finite': True,
'hermitian': True,
'imaginary': False,
'infinite': False,
'negative': False,
'nonnegative': True,
'real': True}
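The constraints and output above can be reproduced with a short sketch like the one below. The symbol name conc is a hypothetical choice for illustration, and the sketch assumes the assumptions0 attribute, which SymPy exposes on symbols as a dictionary of known and inferred assumptions. It also shows that solve() discards roots that contradict the constraints.

```python
import sympy

# Hypothetical symbol: a concentration constrained to real, nonnegative values
conc = sympy.symbols('conc', real=True, nonnegative=True)

# assumptions0 is a dictionary of what SymPy knows or infers about the symbol
constraints = conc.assumptions0
print(constraints['real'], constraints['nonnegative'])

# solve() drops roots that contradict the assumptions (here, the root at -2)
roots = sympy.solve(conc**2 - 4, conc)
print(roots)
```

Without the nonnegative=True predicate, the same call would return both roots.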
A selection of predicate arguments for the sympy.symbols() function is listed below in Table 1. This is not an
exhaustive list, but a complete list can be found on the SymPy website under “Predicates.”
Table 1 Predicates for sympy.symbols()
Depending upon settings and version of SymPy, the output may look like Python equations which are not always the
easiest to read. If so, you can turn on pretty printing, shown below, which will instruct SymPy to render the expressions
in more traditional mathematical representations that you might see in a math textbook. More recent versions of SymPy
make this unnecessary, however, as it generates more traditional mathematical representations by default.
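As a minimal sketch of turning on pretty printing, the snippet below calls sympy.init_printing(), which selects the best renderer available in the current environment. The expression is a hypothetical example, and sympy.pretty() is used here only to obtain the text rendering as a string when working outside of a notebook.

```python
import sympy

# Ask SymPy to render results in mathematical notation where the environment allows
sympy.init_printing()

x = sympy.symbols('x')
expr = sympy.sqrt(x) / 2   # hypothetical expression for illustration

# pretty() returns the text rendering as a string
rendered = sympy.pretty(expr)
print(rendered)
```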
Similar to the math Python module, SymPy contains an assortment of standard mathematical operators such as square
root and trigonometric functions. A table of common functions is below. Some of the functions start with a capital letter
such as Abs(). This is important so that they do not collide with native Python functions if SymPy is imported into the
global namespace.
Table 2 Common SymPy Functions
It is important to note that any mathematical function operating on a symbol needs to be from the SymPy library. For
example, applying a function from Python's math module to a symbol will result in an error.
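A brief sketch of this difference is shown below, using the cosine function as an arbitrary example. The math version attempts to convert the symbol to a float and raises a TypeError, while the SymPy version returns a symbolic expression.

```python
import math
import sympy

x = sympy.symbols('x')

# The SymPy version accepts a symbol and returns a symbolic expression
symbolic = sympy.cos(x)

# The math version tries to convert the symbol to a float and fails
try:
    math.cos(x)
    failed = False
except TypeError as err:
    failed = True
    print('math.cos() raised:', err)
```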
SymPy is quite capable of algebraic operations and is knowledgeable of common identities such as $\sin^2(x) + \cos^2(x) = 1$,
but before we proceed with doing algebra in SymPy, we need to cover some basic algebraic methods. These are provided
in Table 3 which includes polynomial expansion and factoring, expression simplification, and solving equations. The
subsequent sections demonstrate each of these.
Table 3 Common Algebraic Methods
Method              Description
sympy.expand()      Expands polynomials
sympy.factor()      Factors polynomials
sympy.simplify()    Simplifies the expression
sympy.solve()       Equates the expression to zero and solves for the requested variable
expr.subs()         Substitutes a value, expression, or another variable for a variable
When dealing with polynomials, expansion and factoring are common operations that can be tedious and time-consuming
by hand. SymPy makes these quick and easy. For example, we can expand the expression (𝑥−1)(3𝑥+2) as demonstrated
below.
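expr = (x - 1) * (3 * x + 2)
sympy.expand(expr)
$3x^2 - x - 2$
Going the other direction, the expanded polynomial can be factored back into its original form.
sympy.factor(3 * x**2 - x - 2)
$(x - 1)(3x + 2)$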
8.2.2 Simplification
SymPy may not always return a mathematical expression in the simplest form. Below is an expression with a simpler
form, and if we feed this into SymPy, it is not automatically simplified.
(3 * x**2 - 4 * x - 15) / (x - 3)
$\frac{3x^2 - 4x - 15}{x - 3}$
However, if we instruct SymPy to simplify the expression using the simplify() method, it will make a best attempt
at finding a simpler form.
3𝑥 + 5
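The simplification can be sketched as follows. The expression is written here with the numerator in parentheses so that it factors as $(3x + 5)(x - 3)$, which is what allows the cancellation to $3x + 5$.

```python
import sympy

x = sympy.symbols('x')

# Rational expression whose numerator factors as (3x + 5)(x - 3)
expr = (3 * x**2 - 4 * x - 15) / (x - 3)

simple = sympy.simplify(expr)
print(simple)
```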
SymPy can also solve equations for an unknown variable using the solve() function. The function requires a single
expression that is equal to zero. For example, the following solves for $x$ in $x^2 + 1.4x - 5.76 = 0$.
[-3.20000000000000, 1.80000000000000]
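A minimal sketch of the call that produces this kind of result is shown below. The left side of the equation is passed as the expression, and the variable to solve for is given as the second argument.

```python
import sympy

x = sympy.symbols('x')

# Solve x**2 + 1.4x - 5.76 = 0 for x
solutions = sympy.solve(x**2 + 1.4 * x - 5.76, x)
print(solutions)
```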
A common chemical application of the above algebraic operations is solving equilibrium problems using the ICE (Initial,
Change, and Equilibrium) method. As a penultimate step, the mathematical expressions are inserted into the equilibrium
expression and often result in a polynomial equation. Below is an example problem with completed ICE table and
equilibrium expression.
To expand the right portion of the equation, we can use the expand() method. Notice that the variable x has been
constrained below to real (real=True) and nonnegative (nonnegative=True) values here. This is because in this
example, x is one of the equilibrium concentrations, so imaginary and negative values would make no physical sense.
These constraints may not be appropriate for other examples.
sympy.expand(expr)
This is probably not what you were expecting or hoping for. The polynomial has been expanded, but the result is still a
fraction. We can instruct SymPy to simplify the results.
sympy.simplify(sympy.expand(expr))
This is much better. Ultimately, we want to solve for $x$, but the solve() function requires an expression that equals
zero. We can achieve this by subtracting 3.44.
sympy.solve(expr - 3.44)
[0.170006841512893]
If the variable x was not constrained to real and nonnegative values, a fourth-order polynomial would return four solutions,
with only one making physical sense. Because we did constrain x, the solve() function conveniently only returns 0.17.
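The effect of constraining a symbol can be illustrated with a separate, hypothetical equilibrium problem: a weak acid with an assumed $K_a$ of $1.8 \times 10^{-5}$ and an initial concentration of 0.10 M. All values and variable names below are assumptions for illustration and are not taken from the example above.

```python
import sympy

# Hypothetical weak acid: Ka = 1.8e-5 with a 0.10 M initial concentration
Ka = 1.8e-5
C0 = 0.10

x_free = sympy.symbols('x_free')
x_con = sympy.symbols('x_con', real=True, nonnegative=True)

# Equilibrium expression rearranged to equal zero: x^2 / (C0 - x) - Ka = 0
all_roots = sympy.solve(x_free**2 / (C0 - x_free) - Ka, x_free)
physical = sympy.solve(x_con**2 / (C0 - x_con) - Ka, x_con)
print(all_roots, physical)
```

The unconstrained symbol yields both a negative and a positive root, while the constrained symbol yields only the physically meaningful positive root.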
8.2.5 Substitutions
Another common algebraic operation is the substitution of a variable in an expression with another variable, expression,
or value. This is accomplished in SymPy using the subs() method, which requires two pieces of information:
the variable being replaced (x_old) and the new variable, expression, or value (x_new).
expr.subs(x_old, x_new)
As an example, let’s determine the composition of a mixture of two enantiomers based on the net optical rotation of this
mixture. The net rotation of a mixture, $[\alpha]_{mix}$, of two enantiomers $d$ and $l$ is described below as the linear combination
of rotations of each enantiomer where $d$ and $l$ are the mole fractions and $[\alpha]_d$ and $[\alpha]_l$ are the specific rotations of each
enantiomer.
$$[\alpha]_{mix} = d[\alpha]_d + l[\alpha]_l$$
If we have a mixture where the net rotation is +8.3° and the $d$ and $l$ enantiomers have specific rotations of +32.4° and
-32.4°, respectively, we can insert these values into the above equation to get the below result.
$$8.3 = 32.4d - 32.4l$$
We now have one equation with two unknowns, 𝑑 and 𝑙. To solve this, we need a second equation which we can generate
by recognizing that the sum of the fractions equals 1 just as the sum of percentages total to 100%.
𝑑+𝑙=1
We rearrange the above equation to 𝑑 = 1 − 𝑙 and now need to substitute this expression for d in the first equation. We
can let SymPy perform this substitution.
d, l = sympy.symbols('d, l')
24.1 − 64.8𝑙
We can then solve this expression for $l$ using the sympy.solve() function because the expression equals zero in its current form.
sympy.solve(net_new)
[0.371913580246914]
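The subs() method can also substitute variables for numerical values. To evaluate the expression when $l = 0.6$ and
$d = 0.4$, we can run the following. Because the observed rotation of 8.3° has already been subtracted, the result is the
difference between the calculated and observed rotations rather than the net rotation itself.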
−14.78
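The enantiomer example above can be sketched end to end as follows. The variable name net and the choice to write the equation as an expression equal to zero are assumptions made to match the results shown, and the dictionary form of subs() is one of several ways to perform multiple substitutions at once.

```python
import sympy

d, l = sympy.symbols('d, l')

# Net rotation equation rearranged to equal zero: 32.4*d - 32.4*l - 8.3 = 0
net = 32.4 * d - 32.4 * l - 8.3

# Replace d with (1 - l) so only one unknown remains
net_new = net.subs(d, 1 - l)

# Mole fraction of the l enantiomer, then d from d = 1 - l
l_frac = sympy.solve(net_new, l)[0]
d_frac = 1 - l_frac

# Several substitutions at once using a dictionary
value = net.subs({l: 0.6, d: 0.4})
print(l_frac, d_frac, value)
```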
8.3 Matrices
Matrices are an efficient way of working with larger amounts of data. Matrix calculations done by hand, as is the case in many
classroom environments, are slow and tedious. The real power of matrices emerges when they are used with
computers because they simplify bulk calculations. SymPy, SciPy, and NumPy all support matrix operations. If you
need to do symbolic math, SymPy should be your go-to, but for the numerical calculations that we will do here, we will
use NumPy’s linalg module.
SciPy and NumPy both offer a matrix object, but the SciPy official documentation discourages their use as they offer
little advantage over a standard NumPy array. We will stick with NumPy arrays here, but below demonstrates creating
a matrix object if you feel that you absolutely must use them. See the NumPy documentation page for further details on
attributes and methods for this class of object.
import numpy as np
mat = np.matrix([[1, 8], [3, 2]])
mat
matrix([[1, 8],
        [3, 2]])
Being that we are using NumPy arrays, the standard mathematical operations use the +, -, *, /, and ** operators as
demonstrated in chapter 4. There are a few other operations and methods, however, that are important for matrices such
as calculating the inverse, determinant, transpose, and dot product. For these operations, we have the following methods
provided by NumPy’s linalg module, Table 4, which are demonstrated in the following sections.
Table 4 Common NumPy Methods for Linear Algebra
Method               Description
np.dot()             Calculates the dot product
np.linalg.inv()      Returns the inverse of an array (if it exists)
np.linalg.pinv()     Returns the Moore-Penrose pseudoinverse of an array
np.linalg.det()      Returns the determinant of an array
np.linalg.solve()    Solves a system of linear equations
np.linalg.lstsq()    Returns approximate solution to a system of linear equations
In addition, it is worth reiterating that there is a general NumPy array method transpose() that will transpose or
rotate the array around the diagonal. There is a convenient array.T shortcut that is often used. See section 4.2.3 for
details.
Solving systems of equations can be a tedious process by hand, but solving them using matrices can save time and effort.
Let us say we want to solve the following system of equations for 𝑥, 𝑦, and 𝑧.
6𝑥 + 10𝑦 + −5𝑧 = 21
2𝑥 + 7𝑦 + 𝑧 = 13
−10𝑥 + −11𝑦 + 11𝑧 = −21
These equations can be rewritten in matrix or array form as follows with the left matrix holding the coefficients.
$$\begin{bmatrix} 6 & 10 & -5 \\ 2 & 7 & 1 \\ -10 & -11 & 11 \end{bmatrix} \cdot \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 21 \\ 13 \\ -21 \end{bmatrix}$$
We will call the first array M, the second X, and the third y, so we get
$$M \cdot X = y$$
We can solve for X by multiplying (dot product) both sides by the inverse of M, $M^{-1}$. Anything multiplied by its inverse
is the identity, so $M^{-1} \cdot M$ is the identity matrix and can be ignored.
$$M^{-1} \cdot M \cdot X = M^{-1} \cdot y$$
$$X = M^{-1} \cdot y$$
To get the inverse of a matrix or array, we can use the np.linalg.inv() function provided by NumPy’s linear algebra
module and use the dot() method to take the dot product.
np.linalg.inv(M).dot(y)
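The system of equations above can be solved with a sketch like the following, where the arrays M and y are built directly from the coefficients and constants. The comparison against np.linalg.solve() is included as an assumption-free cross-check since both approaches should give the same values of $x$, $y$, and $z$.

```python
import numpy as np

# Coefficient matrix and constants from the three equations
M = np.array([[6, 10, -5],
              [2, 7, 1],
              [-10, -11, 11]])
y = np.array([21, 13, -21])

# Inverse followed by a dot product
X_inv = np.linalg.inv(M).dot(y)

# Direct solver, which avoids forming the inverse explicitly
X_solve = np.linalg.solve(M, y)
print(X_inv)
```

For larger systems, np.linalg.solve() is generally preferred because it is faster and numerically more stable than calculating the inverse.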
𝐴 = 𝜖𝑏𝐶
For a path length of 1.0 cm, which is quite common, the equation simplifies down to:
𝐴 = 𝜖𝐶
When there are three analytes, 𝑥, 𝑦, and 𝑧, the absorption of light at a given wavelength equals the sum of the individual
absorptions.
$$A = \epsilon_x C_x + \epsilon_y C_y + \epsilon_z C_z$$
If we measure the absorbance of a three-analyte solution at three different wavelengths (𝜆), we get the following three
equations.
If you perform either of the above calculations on other data and receive a LinAlgError: Singular matrix error,
this means that the coefficient matrix does not have an inverse and the system cannot be solved by these methods. A related
LinAlgError is raised when the coefficient matrix is not square, which is a requirement for obtaining an inverse. Here are
two possible workarounds for these issues.
1. Replace the np.linalg.inv() function with the Moore-Penrose pseudoinverse function np.linalg.pinv().
This versatile function can work with non-square matrices.
2. Use the np.linalg.lstsq() function in place of the np.linalg.solve() function. The former can find approximate
solutions when exact solutions do not exist or when the coefficient matrix is not square. This is not uncommon
when dealing with linear fitting because not all data points may fall perfectly on the line of best fit or the number
of data points does not equal the number of independent variables.
As an example inspired by J. Chem. Educ. 2000, 77, 185-187, let’s calculate the enthalpy of the following reaction
$$S_8(s) + 8\,O_2(g) \rightarrow 8\,SO_2(g) \qquad \Delta H_{net} = ?$$
knowing the enthalpy of the following two subreactions.
$$S_8(s) + 12\,O_2(g) \rightarrow 8\,SO_3(g) \qquad \Delta H_1 = -3160\ kJ$$
$$R = A^{-1} \cdot Y$$
The problem we face is that matrix A is not square and thus the matrix inverse cannot be calculated. Instead, we can use
the Moore-Penrose pseudoinverse in place of the regular inverse as demonstrated below.
A = np.array([[0, -2],
              [8, 2],
              [-12, -1],
              [-1, 0]])
Y = np.array([8, 0, -8, -1])
R = np.linalg.pinv(A).dot(Y)
R
This means we need to multiply the first subreaction by 1 and the second subreaction by -4 (i.e., reverse it and quadruple
everything).
Alternatively, we can use the np.linalg.lstsq() function similarly to how we use the np.linalg.solve() function.
Set the keyword argument rcond=None to avoid a warning in older versions of NumPy.
np.linalg.lstsq(A, Y, rcond=None)
For the final step of our calculation, we need to multiply the values r1 and r2 by the enthalpy values of the subreactions
and add them together.
dH = R.dot(dH_sub)
dH
np.float64(-2375.9999999999995)
This means that the enthalpy of the overall net reaction is -2376 kJ.
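The full calculation can be sketched as shown below. The row order (SO2, SO3, O2, S8) is inferred from matching array Y to the net reaction, and the second enthalpy value of -196 kJ is not stated in the text above; it is inferred from the reported total and the multipliers of 1 and -4, so treat it as an assumption.

```python
import numpy as np

# Rows are SO2, SO3, O2, and S8 (inferred); columns are subreactions 1 and 2
A = np.array([[0, -2],
              [8, 2],
              [-12, -1],
              [-1, 0]])
# Net reaction in the same species order
Y = np.array([8, 0, -8, -1])

# Two routes to the subreaction multipliers
R_pinv = np.linalg.pinv(A).dot(Y)
R_lstsq = np.linalg.lstsq(A, Y, rcond=None)[0]

# Second enthalpy (-196 kJ) is inferred from the reported total, not stated in the text
dH_sub = np.array([-3160, -196])
dH = R_pinv.dot(dH_sub)
print(R_pinv, dH)
```

Note that np.linalg.lstsq() returns a tuple, so the solution is the first element.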
Note
The normal equation used in this section is really just the Moore-Penrose pseudoinverse and is used here as a
demonstration of performing matrix calculations.
Finding the line of best fit through data points can be accomplished by least-squares minimization. What we are essentially
looking for is an equation of the form $y = mx + b$ that is as close as possible to the data points, and the mean square error
determines what qualifies as “close.” If we rewrite this problem in matrix or array form, it will look like the following for
a series of four points $(x_n, y_n)$ on a two-dimensional plane. The first array contains a column of ones to multiply with $b$.
$$\begin{bmatrix} x_0 & 1 \\ x_1 & 1 \\ x_2 & 1 \\ x_3 & 1 \end{bmatrix} \cdot \begin{bmatrix} m \\ b \end{bmatrix} = \begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix}$$
We will call the leftmost matrix $X$, the center matrix $\theta$, and the rightmost matrix $y$.
$$X \cdot \theta = y$$
Ultimately, we are looking for the values of $m$ and $b$, so we need to solve for matrix $\theta$. This can be accomplished through
optimization algorithms (section 14.2), or in the case of linear regression, there is a direct solution known as the normal
equation shown below where $X^T$ is the transpose of $X$.
$$(X^T \cdot X)^{-1} \cdot X^T \cdot y = \theta$$
As an example, below is a table of synthetic data for copper cuprizone absorbances at various concentrations at 591 nm.
We can use a linear fit to create a calibration curve from this data.
Table 5 Beer-Lambert Law Data for Copper Cuprizone
y = A
X = np.vstack((C, np.ones(6))).T
X
array([[1.0e-06, 1.0e+00],
[3.0e-06, 1.0e+00],
[6.0e-06, 1.0e+00],
[1.5e-05, 1.0e+00],
[2.5e-05, 1.0e+00],
[3.5e-05, 1.0e+00]])
For the sake of readability, the calculation using the normal equation has been split in half as shown below.
$$u = (X^T \cdot X)^{-1}$$
$$v = X^T \cdot y$$
$$u \cdot v = \theta$$
u = np.linalg.inv(X.T.dot(X))
v = X.T.dot(y)
theta = u.dot(v)
theta
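The normal equation calculation can be sketched in a self-contained form as shown below. The concentrations are taken from the array printed above, but the absorbance values are hypothetical stand-ins because the contents of Table 5 are not reproduced here. The np.polyfit() call is included only as an independent check on the result.

```python
import numpy as np

# Concentrations from the array above; absorbances are hypothetical stand-ins for Table 5
C = np.array([1.0e-6, 3.0e-6, 6.0e-6, 1.5e-5, 2.5e-5, 3.5e-5])
A = np.array([0.016, 0.046, 0.093, 0.232, 0.388, 0.542])

# Column of concentrations next to a column of ones
X = np.vstack((C, np.ones(C.size))).T
y = A

# Normal equation: theta = (X^T X)^-1 X^T y
theta = np.linalg.inv(X.T.dot(X)).dot(X.T.dot(y))
m, b = theta

# Independent check with NumPy's built-in polynomial fit
m_check, b_check = np.polyfit(C, A, 1)
print(m, b)
```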
A plot of the linear regression and the data points is shown below, and the linear regression returned a molar absorptivity
of $1.55 \times 10^4$ cm$^{-1}$ M$^{-1}$. The regression also returned a $y$-intercept value of $-5.45 \times 10^{-6}$, which is below the detection
limits making it practically zero. This makes sense because the $y$-intercept should always be approximately zero if the
background is subtracted.
plt.legend()
plt.ticklabel_format(style='sci', axis='x', scilimits=(0, 0))
plt.xlabel('Concentration, M')
plt.ylabel('Absorbance, au');
[Figure: absorbance (au) versus concentration (M) showing the data points and the linear regression line]
Matrices can also be used to balance chemical equations as shown below, where 𝑥1 through 𝑥4 are the coefficients for the
balanced chemical equation.
$$x_1\,C_3H_8 + x_2\,O_2 \rightarrow x_3\,CO_2 + x_4\,H_2O$$
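We can then describe the number of carbon, hydrogen, and oxygen atoms in each compound using 3 × 1 matrices of the form
$\begin{bmatrix} C \\ H \\ O \end{bmatrix}$ as shown below.
$$x_1 \begin{bmatrix} 3 \\ 8 \\ 0 \end{bmatrix} + x_2 \begin{bmatrix} 0 \\ 0 \\ 2 \end{bmatrix} \rightarrow x_3 \begin{bmatrix} 1 \\ 0 \\ 2 \end{bmatrix} + x_4 \begin{bmatrix} 0 \\ 2 \\ 1 \end{bmatrix}$$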
Because the number of carbons, hydrogens, and oxygens should be the same on both sides of the balanced chemical
equation, if we subtract the products from the reactants, we should get zero.
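$$x_1 \begin{bmatrix} 3 \\ 8 \\ 0 \end{bmatrix} + x_2 \begin{bmatrix} 0 \\ 0 \\ 2 \end{bmatrix} - x_3 \begin{bmatrix} 1 \\ 0 \\ 2 \end{bmatrix} - x_4 \begin{bmatrix} 0 \\ 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$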
One potential issue with this set of linear equations is that making all the 𝑥 variables zero is a valid solution, so to avoid this
solution, we will set one of the 𝑥 variables to one. Remember that a balanced chemical equation is about the appropriate
ratio between the reactants and products, so setting a single coefficient to one can still generate a balanced equation. The
one issue is that the coefficients generated by the software may not be integers, but this can be fixed by multiplying the
fractions to get whole numbers as a final step demonstrated below.
Here we have set 𝑥4 = 1.
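$$x_1 \begin{bmatrix} 3 \\ 8 \\ 0 \end{bmatrix} + x_2 \begin{bmatrix} 0 \\ 0 \\ 2 \end{bmatrix} - x_3 \begin{bmatrix} 1 \\ 0 \\ 2 \end{bmatrix} - (1) \begin{bmatrix} 0 \\ 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$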
Now we move the last term to the right side.
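$$x_1 \begin{bmatrix} 3 \\ 8 \\ 0 \end{bmatrix} + x_2 \begin{bmatrix} 0 \\ 0 \\ 2 \end{bmatrix} - x_3 \begin{bmatrix} 1 \\ 0 \\ 2 \end{bmatrix} = \begin{bmatrix} 0 \\ 2 \\ 1 \end{bmatrix}$$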
These matrices can now be merged into one larger matrix. The left matrix below will be called M, and the right matrix
below is called b.
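$$\begin{bmatrix} 3 & 0 & -1 \\ 8 & 0 & 0 \\ 0 & 2 & -2 \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 2 \\ 1 \end{bmatrix}$$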
We can then solve for the $x$ values to get our coefficients using the np.linalg.solve() function as demonstrated
below.
sol = np.linalg.solve(M, b)
sol
This means that $x_1 = 0.25$, $x_2 = 1.25$, and $x_3 = 0.75$. We can append $x_4$ below and then multiply all the values by the same
number to generate all integers.
sol = np.append(sol, 1)
sol * 4
This means that the integer coefficients for the balanced chemical equation are $x_1 = 1$, $x_2 = 5$, $x_3 = 3$, and $x_4 = 4$.
$$C_3H_8 + 5\,O_2 \rightarrow 3\,CO_2 + 4\,H_2O$$
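The balancing procedure above can be sketched in one self-contained block. The arrays M and b are built from the merged matrix equation, and the multiplier of 4 is chosen by inspection for this particular reaction rather than determined automatically.

```python
import numpy as np

# Rows are C, H, and O; columns are C3H8, O2, and CO2 (negated because it is a product)
M = np.array([[3, 0, -1],
              [8, 0, 0],
              [0, 2, -2]])
# Atoms contributed by H2O with its coefficient fixed at one
b = np.array([0, 2, 1])

sol = np.linalg.solve(M, b)

# Append the coefficient that was fixed at one, then scale to whole numbers
sol = np.append(sol, 1)
coeffs = sol * 4
print(coeffs)
```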
This section covers using np.linalg to calculate eigenvalues and eigenvectors, which is useful in quantum mechanics
among other applications. This topic will not be utilized later in this book, so feel free to skip over this section if you
have no interest in this topic.
For a square matrix 𝐴, there can exist a scalar 𝜆 and vector 𝑉 that satisfy the following equation.
𝐴𝑉 = 𝜆𝑉
The vector and scalar are known as the eigenvector and eigenvalue, respectively, and there may be more than one solution
for any given matrix 𝐴.
The np.linalg module includes a function np.linalg.eig() that returns the eigenvalue(s) and eigenvector(s) for
a given square matrix, in that order.
np.linalg.eig(matrix)
As an example, we can determine the eigenvalues and eigenvectors for the following matrix.
$$A = \begin{bmatrix} 3 & 1 \\ 4 & 3 \end{bmatrix}$$
[ 0.89442719, 0.89442719]]))
The first array contains the two eigenvalues, while the second array contains the two eigenvectors as its columns.
Not every real matrix has real eigenvalues or eigenvectors. In the case of the following 90° rotation matrix, the solution
generated includes $j$ values, which is Python’s notation for imaginary and complex numbers.
$$A = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$$
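Both eigenvalue examples can be sketched as follows. The variable names are arbitrary, and the check variable simply confirms that each eigenvalue and eigenvector pair satisfies $AV = \lambda V$.

```python
import numpy as np

A = np.array([[3, 1],
              [4, 3]])
vals, vecs = np.linalg.eig(A)

# Each column of vecs pairs with the eigenvalue at the same index,
# so A.dot(vecs) should equal the columns of vecs scaled by vals
check = np.allclose(A.dot(vecs), vecs * vals)

# 90 degree rotation matrix, which has no real eigenvalues
rot = np.array([[0, -1],
                [1, 0]])
rot_vals, rot_vecs = np.linalg.eig(rot)
print(vals, rot_vals)
```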
8.4 Calculus
SymPy and SciPy both contain functionality for performing calculus operations. We will start with SymPy for the symbolic
math and switch over to SciPy for the strictly numerical work in section 8.4.3. In this section, we will be working
with the radial density functions ($\psi$) for hydrogen atomic orbitals. The squares of these functions ($\psi^2$) provide the
probability of finding an electron with respect to distance from the nucleus. While these equations are available in various
textbooks, SymPy provides a physics module with a R_nl() function for generating these equations based on the
principal (n) quantum number, angular (l) quantum number, and the atomic number (Z). For example, to generate the
function for the 2p orbital of hydrogen, n = 2, l = 1, and Z = 1.
r = sympy.symbols('r')
R_21 = R_nl(2, 1, r, Z=1)
R_21
$\frac{\sqrt{6}\, r e^{-\frac{r}{2}}}{12}$
This provides the wavefunction equation with respect to the radius, r. We can also convert it to a Python function using
the sympy.lambdify() function.
f(0.5)
np.float64(0.07948602207520471)
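A minimal sketch of generating the radial function and converting it to a Python function is shown below. It assumes the R_nl() function is imported from the sympy.physics.hydrogen module, and the 'numpy' argument to lambdify() is one common choice of numerical backend.

```python
import sympy
from sympy.physics.hydrogen import R_nl

r = sympy.symbols('r')
R_21 = R_nl(2, 1, r, Z=1)

# Convert the symbolic expression into a NumPy-aware Python function of r
f = sympy.lambdify(r, R_21, 'numpy')
print(f(0.5))
```

The resulting function also accepts NumPy arrays, which is convenient for plotting.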
8.4.1 Differentiation
SymPy can take the derivative of mathematical expressions using the sympy.diff() function. This function requires
a mathematical expression, the variable with respect to which the derivative is taken, and the degree. The default behavior
is to take the first derivative if a degree is not specified.
sympy.diff(expr, r, deg)
As an example problem, the radius of maximum density can be found by taking the first derivative of the radial equation
and solving for zero slope.
dR_21 = sympy.diff(R_21, r, 1)
dR_21
$-\frac{\sqrt{6}\, r e^{-\frac{r}{2}}}{24} + \frac{\sqrt{6}\, e^{-\frac{r}{2}}}{12}$
mx = float(sympy.solve(dR_21)[0])
The solve() function returns a list, so we need to index it to get the single value out. We can plot the radial density
and the maximum density point to see if it worked.
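The differentiation and root-finding steps above can be sketched together as shown below, again assuming R_nl() is imported from sympy.physics.hydrogen.

```python
import sympy
from sympy.physics.hydrogen import R_nl

r = sympy.symbols('r')
R_21 = R_nl(2, 1, r, Z=1)

# First derivative with respect to r
dR_21 = sympy.diff(R_21, r, 1)

# The maximum occurs where the slope is zero
mx = float(sympy.solve(dR_21, r)[0])
print(mx)
```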
[Figure: probability density versus radius ($a_0$)]
The radius is in Bohrs ($a_0$), where one Bohr is equal to approximately 0.53 angstroms.
SymPy can also integrate expressions using the sympy.integrate() function which takes the mathematical expression
and the variable plus integration range in the form of a tuple. If the integration range is omitted, then SymPy will
return a symbolic expression.
The normalized (i.e., totals to one) density function is the squared wave function times $r^2$ (i.e., $\psi^2 r^2$). We can use this
to determine the probability of finding an electron in a particular range of distances from the nucleus. Below, we integrate
from the nucleus to the radius of maximum density.
0.0526530173437111
There is a 5.27% probability of finding an electron between the nucleus and the radius of maximum probability. This
is probably a bit surprising, but examination of the radial density plot reveals that the radius of maximum probability is
quite close to the nucleus with a significant amount of density beyond the maximum radius. Let’s see the probability of
finding an electron between 0 and 10 Bohrs from the nucleus.
0.970747311923039
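The two integrations above can be sketched as follows. The variable names are arbitrary, the upper limit of 2 is the radius of maximum density found earlier, and the integral to infinity is included as a check that the density function is normalized.

```python
import sympy
from sympy.physics.hydrogen import R_nl

r = sympy.symbols('r')
R_21 = R_nl(2, 1, r, Z=1)
density = R_21**2 * r**2

# Omitting the limits returns the symbolic antiderivative
antiderivative = sympy.integrate(density, r)

# Supplying (variable, lower, upper) evaluates the definite integral
p_max = sympy.integrate(density, (r, 0, 2)).evalf()
p_10 = sympy.integrate(density, (r, 0, 10)).evalf()
p_all = sympy.integrate(density, (r, 0, sympy.oo))
print(p_max, p_10, p_all)
```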
The above integration assumes a mathematical function is known. There are times when there is no known function to
describe the data, such as spectra. This is common in NMR spectroscopy and gas chromatography (GC) among many
other applications where integration of peak areas is used to quantify different components of a spectrum.
In the following example, we will use a section of a $^1$H NMR spectrum where we want to determine the ratio of the three
triplet peaks via integration. NMR spectra are typically stored in binary files that require a special library to read, which
is covered in chapter 12. For simplicity in this example, the data for a section of the NMR spectrum has been converted
to a CSV file titled Ar_NMR.csv.
array([[0.00000000e+00, 3.42490660e-03],
[1.00000000e+00, 4.52560300e-03],
[2.00000000e+00, 6.67372160e-03],
[3.00000000e+00, 8.58410100e-03],
[4.00000000e+00, 1.23892580e-02],
[5.00000000e+00, 2.12517060e-02],
[6.00000000e+00, 5.18062560e-02],
[7.00000000e+00, 1.23403220e-01],
The imported data are stored in an array where the first column contains the index values and the second column contains
the amplitudes.
[Figure: plot of the NMR peaks versus index values]
Above is a plot of the peaks with respect to the index values (not ppm). To integrate under each of the triplet peaks, first
we need the index values for the edges of each peak. Below is a list, i, that provides reasonable boundaries, and a plot is
below with these edges marked in orange squares.
[Figure: plot of the NMR peaks with the integration boundaries marked by orange squares]
Integration under sampled data does not include the values between data points, so these regions are estimated based on
assumptions. The trapezoid() function assumes that any data point between known points lies directly between the
known data points (i.e., linear interpolation) as shown below by the blue lines.
[Figure: Trapezoidal Integration]
Alternatively, the simpson() function uses the Simpson’s rule which estimates the data between known points using
quadratic interpolation shown below.
[Figure: Simpson's Integration]
Note
As of SciPy 1.14, the former trapz() and simps() functions have been removed in favor of trapezoid() and
simpson(), respectively.
Below, both the trapezoidal and Simpson’s methods are demonstrated. Both functions take the $y$ data as the first
argument. The trapezoid(y, x) function accepts the $x$ values as an optional second positional argument, while
simpson(y, x=x) accepts the $x$ data only as a keyword argument.
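# trapezoid method
for peak in i:
    x = nmr[peak[0]:peak[1], 0]
    y = nmr[peak[0]:peak[1], 1]
    print(trapezoid(y, x))
1.0401881535
1.529880057
0.5834871775
# Simpson's method
for peak in i:
    x = nmr[peak[0]:peak[1], 0]
    y = nmr[peak[0]:peak[1], 1]
    print(simpson(y, x=x))
1.0405229256666666
1.5661107306666666
0.5839565783333334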
The three peaks have areas of approximately a 2:3:1 ratio. Using Simpson’s rule here gives approximately the same result.
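Because the NMR data file is not reproduced here, the sketch below compares the two methods on a hypothetical sampled peak instead: a Gaussian curve whose exact area is known to be $\sqrt{2\pi}$. The sampling range and number of points are arbitrary assumptions.

```python
import numpy as np
from scipy.integrate import trapezoid, simpson

# Hypothetical sampled peak: a Gaussian with a known area of sqrt(2*pi)
x = np.linspace(-5, 5, 51)
y = np.exp(-x**2 / 2)

area_trap = trapezoid(y, x)     # linear interpolation between points
area_simp = simpson(y, x=x)     # quadratic interpolation between points
exact = np.sqrt(2 * np.pi)
print(area_trap, area_simp, exact)
```

For smooth, well-sampled peaks like this one, both methods land very close to the exact area.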
Ordinary differential equations (ODEs) mathematically describe the change of one or more dependent variables with
respect to an independent variable. Common chemical applications include chemical kinetics, diffusion, and electric current,
among others. The SciPy integrate module provides an ODE integrator called odeint() which can integrate
ordinary differential equations. This is useful for, among other things, integrating under kinetic differential equations to
determine the concentration of reactants and products over the course of a chemical reaction.
For example, the following is a first-order chemical reaction with starting material, A, and product, P.
𝐴→𝑃
The decay of a radioactive isotope is an example of a first-order reaction because the rate of decay is proportional to the
amount of A. First-order reaction rates are described by
$$Rate = \frac{d[A]}{dt} = -k[A]$$
where [A] is the concentration (M) of A, $k$ is the rate constant (1/s), and rate is the change in [A] versus time (M/s). The
odeint() function below takes a differential equation in the form of a Python function, func, the initial value for A,
A0, and a list or array of the times, t, at which to calculate [A].
scipy.integrate.odeint(func, A0, t)
The Python function can be defined by a def statement or a lambda expression. The former is used below.
The function should take the dependent variable(s) as the first positional argument and the independent variable as the
second positional argument. In this example, A is the dependent variable and time, t, is the independent variable. If
there are multiple dependent variables, they need to be provided inside a composite object like a list or tuple which can be
unpacked through indexing or tuple unpacking once inside the function. You may also notice that t is an unused argument
in our Python function. It is included and required to signal to odeint() that the independent variable is t. The function
is integrated below at times defined by t, and the initial concentration of A and rate constant are A0 and k, respectively.
The concentration of product (P_t) is calculated through the difference between the initial concentration of starting
material and the current concentration. That is, we assume that whatever starting material was consumed has become
product. The results of the simulation have been visualized below.
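The first-order simulation described above can be sketched as follows. The function name rate and the values of k and A0 are hypothetical choices for illustration, with the time range chosen to span 50 seconds.

```python
import numpy as np
from scipy.integrate import odeint

def rate(A, t):
    # d[A]/dt = -k[A]; t is unused but required by odeint's calling convention
    return -k * A

k = 0.1     # hypothetical rate constant, 1/s
A0 = 1.0    # hypothetical initial concentration, M
t = np.linspace(0, 50, 101)

# odeint returns a two-dimensional array, so flatten it to one dimension
A_t = odeint(rate, A0, t).flatten()

# Whatever starting material was consumed is assumed to have become product
P_t = A0 - A_t
```

For a first-order reaction, the result can be compared against the analytical solution $[A] = [A]_0 e^{-kt}$ as a sanity check.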
[Figure: concentrations of A and P, [X] (M), versus time (s)]
This approach to kinetic simulations can be adapted to even more complex reactions which are demonstrated in section
9.1.4.
Between SymPy, NumPy, SciPy, and Python’s built-in functionality, there is often more than one way to carry out
calculations in Python. For example, finding roots and derivatives of polynomials can be, along with the approaches
demonstrated in this chapter, calculated by creating a NumPy Polynomial object and using NumPy’s roots() and
deriv() methods, respectively. How you carry out a calculation can often come down to a matter of personal
preference, though there are differences in terms of speed and the output format. Find what works for you and do not necessarily
worry if others are doing the same calculations through a different library or set of functions.
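As a brief sketch of the NumPy alternative mentioned above, the polynomial $3x^2 - x - 2$ from earlier in this chapter can be handled with a Polynomial object. Note that the coefficients are listed from the lowest to the highest power.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Coefficients run from lowest to highest power: -2 - x + 3x^2
poly = Polynomial([-2, -1, 3])

roots = poly.roots()    # numerical roots of the polynomial
slope = poly.deriv()    # first derivative as a new Polynomial object
print(roots, slope.coef)
```

Unlike SymPy, the roots are returned as floating-point approximations rather than exact values.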
Further Reading
Exercises
Complete the following exercises in a Jupyter notebook using the SymPy and SciPy libraries. Any data file(s) referred to
in the problems can be found in the data folder in the same directory as this chapter’s Jupyter notebook. Alternatively, you
can download a zip file of the data for this chapter from here by selecting the appropriate chapter file and then clicking
the Download button.
1. Factor the following polynomial using SymPy: $x^2 + x - 6$
2. Simplify the following mathematical expression using SymPy: $z = 3x + x^2 + 2xy$
3. Expand the following expression using SymPy: $(x - 2)(x + 5)(x)$
4. A 53.2 g block of lead ($C_p$ = 0.128 J/g·°C) at 128 °C is dropped into 238.1 g of water ($C_p$ = 4.18 J/g·°C) at 25.0
°C. What is the final temperature of both the lead and water? Hint: Assume this is an isolated system, so
$q_{lead} + q_{water} = 0$. We also know that $q = mC_p\Delta T$.
5. The following equation relates ΔG to the equilibrium constant K.
$$ \Delta G = \Delta G^o + RT\ln(K) $$
If $\Delta G^o$ = -1.22 kJ/mol for a chemical reaction, what is the value of K for this reaction at 298 K? Use the
sympy.solve() function to solve this problem. Remember that equilibrium is when ΔG = 0 kJ/mol, and watch your
energy units. (R = 8.314 J/mol·K)
6. A matrix or array of x,y coordinates can be rotated on a two-dimensional plane around the origin by multiplying by
the following rotation matrix ($M_R$). The angle ($\theta$) is in radians, and the coordinates are rotated clockwise around
the origin.
$$ M_R = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix} $$
Below is an example using three generic points on the x,y plane.
$$ \begin{bmatrix} x_0 & y_0 \\ x_1 & y_1 \\ x_2 & y_2 \end{bmatrix} \cdot \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix} = \begin{bmatrix} x'_0 & y'_0 \\ x'_1 & y'_1 \\ x'_2 & y'_2 \end{bmatrix} $$
a) Given the following coordinates for the four atoms in carbonate (CO$_3^{2-}$) measured in angstroms, rotate them 90°
clockwise. Plot the initial and rotated points in different colors to show that it worked.
b) Package the above code into a function that takes an array of points and an angle and performs the above rotation.
7. Using the rotation matrix described in the above problem, write a function that rotates the carbonate anion around
its own center of mass. The suggested steps to complete this task are listed below.
a) Calculate the center of mass
b) Subtract the center of mass from all points to shift the cluster to the origin.
c) Rotate the cluster of points.
d) Add the center of mass back to the cluster to shift the points back to the starting location.
8. The following is the equation for the work performed by a reversible, isothermal (i.e., constant T) expansion of a
piston by a fixed quantity of gas.
$$ w = \int_{V_i}^{V_f} -nRT \frac{1}{V} \, dV $$
a) Using SymPy, integrate this expression symbolically for $V_i$ → $V_f$. Try having SymPy simplify the answer to
see if there is a simpler form.
b) Integrate the same expression above for the expansion of 2.44 mol of He gas from 0.552 L to 1.32 L at 298 K.
Feel free to use either SymPy or SciPy.
9. Using odeint(), simulate the concentration of starting material for the second-order reaction below and overlay
it with the second-order integrated rate law to show that they agree.
2𝐴 → 𝑃
10. Below are the transformation matrices for an $S_4$ and $C_2$ operation used in group theory. Show that two $S_4$ operations
equal one $C_2$ operation by multiplying two $S_4$ operations together. That is, show that $S_4 S_4 = C_2$.
$$ S_4 = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & -1 \end{bmatrix} \qquad C_2 = \begin{bmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{bmatrix} $$
11. Using dot product math, write your own linear regression function that accepts the x and y coordinates of data
points as separate arrays and returns the slope and intercept of a line of best fit.
CHAPTER 9: SIMULATIONS
Simulations are a major component of modern chemical research, either in conjunction with experimental work or on
their own. A digital chemical simulation is a representation or mimic of a physical or chemical process using a computer with
enough detail that the results provide meaningful and useful insights into the real process. Simulations do not need to
represent every aspect of the real world as long as the omitted details do not reduce the accuracy or precision to a level
that the simulation is no longer useful.
Modern chemical simulations are often quite complex and are performed with a range of free or commercial software that
regrettably can obfuscate the underlying methods. This chapter aims to introduce simulations with simple methodologies
that can be easily coded in Python, NumPy, and SciPy. These simulations are not designed for use in a research setting
due to the low level of sophistication and do not represent the current state-of-the-art in the field of chemical simulations.
Some of these simulations are also not as computationally efficient as they could be because efficiency is sometimes
sacrificed here for simplicity and accessibility.
The simulations in this chapter assume the following imports from NumPy, SciPy, and matplotlib.
import numpy as np
import scipy.integrate
import matplotlib.pyplot as plt
Simulations with no random variables have fixed outcomes dictated by the code and input parameters. If these simulations
are run multiple times using the same parameters, the outcomes of the simulations will be exactly identical. This
is a category of simulations known as deterministic simulations. Even though many physical and chemical processes are
driven by randomness, such as the random movements and collisions of molecules, they can often still be simulated
deterministically because a large number of molecules can make the randomness conform to predictable statistical behavior.
This is the case with Nuclear Magnetic Resonance (NMR) splitting patterns and chemical kinetics, among many others.
The splitting patterns observed in $^1$H NMR spectra are typically generated by neighboring protons possessing spins of +1/2
or –1/2, which alter the magnetic field around the observed proton. Even though the spins of the neighboring protons are
random, the sample contains such a large number of molecules that the ratio should be quite close to the theoretical value
of approximately 1:1. As a result, we can simulate the splitting patterns generated in $^1$H NMR spectra deterministically
by splitting all peaks into 1:1 doublets for every neighboring proton.
A recursive function is defined below that generates the splitting pattern produced by equivalent protons. The function
takes in the chemical shift of the peak(s) (peaks), the number of equivalent neighboring protons (n), the coupling
constant (J) in Hz, and the frequency of observation (freq) in MHz, and it returns a list of the split peaks in ppm. Each
time the function is called, it splits the existing peak(s) into doublets, and the function is then called again if more splits
are necessary due to multiple equivalent neighboring protons. The function below also includes validity checks to ensure
the user-provided parameters are what the function expects.
n =n - 1
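A minimal sketch of such a recursive function is given here, assuming the name split() used later in the text and the argument order (peaks, n, J, freq) described above. The specific validity checks and the example call are illustrative assumptions.

```python
def split(peaks, n, J, freq):
    """Recursively split NMR peaks into 1:1 doublets.

    peaks: list of chemical shifts in ppm
    n:     number of equivalent neighboring protons
    J:     coupling constant in Hz
    freq:  frequency of observation in MHz
    """
    # validity checks (assumed form)
    if not isinstance(n, int) or n < 0:
        raise ValueError('n must be a non-negative integer')
    if freq <= 0:
        raise ValueError('freq must be a positive number')

    # no neighbors left to account for, so stop recursing
    if n == 0:
        return list(peaks)

    J_ppm = J / freq   # Hz divided by MHz gives ppm
    new_peaks = []
    for peak in peaks:
        # each peak becomes a doublet centered on the original shift
        new_peaks.append(peak - J_ppm / 2)
        new_peaks.append(peak + J_ppm / 2)

    # split again for the remaining neighboring protons
    return split(new_peaks, n - 1, J, freq)

triplet = split([1.00], 2, 3.4, 400)
```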
In the above example, a peak at 1.00 ppm has two neighboring protons that couple with it at 3.4 Hz, and the sample is
observed at 400 MHz. There are four resulting peaks in the output list, but two peaks are at the same chemical shift of
1.00 ppm. This results in three peaks with the peak at 1.00 ppm being twice the magnitude of the other two. We can
visualize this by binning the peaks and generating a line plot.
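One way to tally coincident peaks is sketched below. It uses np.unique() on rounded chemical shifts rather than a histogram, which is a substitution made here for brevity, and the hard-coded peak list is the four-peak output described above.

```python
import numpy as np

# four peaks (ppm) from splitting a 1.00 ppm signal twice with J = 3.4 Hz at 400 MHz
peaks = [0.9915, 1.0000, 1.0000, 1.0085]

# round to suppress floating-point noise, then count the peaks at each unique shift
shifts, counts = np.unique(np.round(peaks, 5), return_counts=True)

for shift, count in zip(shifts, counts):
    print(f'{shift:.4f} ppm, relative intensity {count}')
```

The counts give the familiar 1:2:1 intensity ratio of a triplet.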
Note
The simulated NMR spectrum can also be plotted using the [Link]() function.
[Figure: simulated triplet centered at 1.00 ppm; the x-axis spans 0.985 to 1.015 ppm.]
If there are multiple nonequivalent groups of neighboring protons, this often results in more complex splitting patterns due
to additional protons and additional coupling constants. This can be simulated by nesting the split() function and
providing the different coupling constants. Below, we simulate a splitting pattern for a proton coupled with two protons
with J = 9.8 Hz and another proton with J = 10.8 Hz. This generates a doublet of triplets.
[Figure: simulated doublet of triplets; x-axis: Chemical Shift, ppm.]
Another phenomenon that can be simulated deterministically is the progress of a chemical reaction with respect to time.
Many chemical reactions slow over the course of the reaction as a result of diminishing reactant concentrations. This
occurs when reaction rates are dependent on the concentration of at least one reactant, and as the reaction progresses,
starting material is consumed, slowing the reaction.
One method for simulating this phenomenon is to incrementally calculate the rate of the chemical reaction at various points
in the reaction based on the current concentrations. That is, at each small time step of the reaction, use the concentration(s)
to calculate the current reaction rate and then increase/decrease the reaction concentrations by the amount calculated.
For example, we can simulate the following single-step chemical reaction of A → P. Because this is an elementary step,
the rate law is derivable from the stoichiometry, where the rate is in M/s, $k_{rxn}$ is the rate constant, and [A] is the
concentration of A in molarity (M).
$$ Rate = k_{rxn}[A] $$
To keep the math simple, we will make each step in the reaction one second. That way, if the rate is 0.1 M/s, we can
simply subtract 0.1 M for one second of reaction. Let us choose k = 0.05 s$^{-1}$ and an initial [A] = 1.00 M. Therefore,
the rate = (0.05 s$^{-1}$)(1.00 M) = 0.05 M/s, so the concentration of A should decrease by 0.05 M in the first second, giving
us 0.95 M. Now the rate of reaction is (0.05 s$^{-1}$)(0.95 M) = 0.0475 M/s, so we now subtract 0.0475 M from [A] for
the next second of reaction to get 0.903 M. This continues for the entire duration of the simulation. Code for executing
this process is shown below. A for loop runs the above process for each second of the simulation and records the new
concentrations of A and P in NumPy arrays via assignment.
A, P = 1.00, 0.00 # molarity, M
k = 0.05 # 1/s for a first-order reaction
length = 100 # length of simulation, s
time = np.arange(length) # one-second steps
A_conc = np.zeros(length) # arrays for recording concentrations
P_conc = np.zeros(length)
# simulation
for sec in time:
    # record concentration
    A_conc[sec] = A
    P_conc[sec] = P
    # recalculate rate
    rate = k * A
    # recalculate new concentration
    A -= rate
    P += rate
You may be wondering why the first lines of code in the for loop record the concentrations instead of first decreasing
them. This is because we need to record the initial concentrations before recalculating them. The next iteration will
record the new concentrations before again recalculating rates and concentrations. Below is a plot of the simulation results.
s = 5 # step size
plt.plot(time, A_conc, label='A Simulated')
plt.plot(time, P_conc, label='P Simulated')
plt.xlabel('Time, s')
plt.ylabel('Concentration, M')
plt.legend();
[Figure: simulated concentrations of A and P (Concentration, M) versus Time, s.]
We can overlay this plot with the theoretical values using the integrated first-order rate law below.
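t = np.linspace(0, 100, 10)
A_theor = 1.0 * np.exp(-k * t)
P_theor = np.ones(10) - A_theor
plt.plot(time, A_conc, label='A Simulated')
plt.plot(time, P_conc, label='P Simulated')
plt.plot(t, A_theor, 'o', label='A Theoretical')
plt.plot(t, P_theor, 'o', label='P Theoretical')
plt.legend();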
[Figure: simulated and theoretical concentrations of A and P (Concentration, M) versus Time, s.]
The theoretical equation and simulation results are in good agreement. A closer inspection shows a slight
discrepancy between the two, which is most noticeable earlier in the simulation. This is because the simulation only
adjusts the rate every second, while the theoretical equation can be thought of as recalculating the rate for infinitely small
increments. A more accurate method of performing kinetic simulations is presented in section 9.1.4.
If we have a well-established theoretical equation for the above reaction of A → P, why do we need the simulation? With
this methodology, we can simulate more complicated reaction mechanisms, such as the multistep reaction below, even if
we do not have the theoretical rate law in hand.
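$$ A \underset{k_{r1}}{\overset{k_1}{\rightleftharpoons}} I $$
$$ I + B \underset{k_{r2}}{\overset{k_2}{\rightleftharpoons}} P $$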
In this reaction, starting material A converts to intermediate I in the first step, followed by starting material B combining
with I to form the product P. Both of these steps are reversible, so there are four rate constants. The code and output of
the simulation are below. Unlike the previous simulation, the simulation below appends values to lists (e.g., A_conc).
# the simulation
for sec in range(length):
    A_conc.append(A)
    I_conc.append(I)
    B_conc.append(B)
    P_conc.append(P)
    # recalculate rates
    rate_1 = k1 * A
    rate_r1 = kr1 * I
    rate_2 = k2 * B * I
    rate_r2 = kr2 * P
    # recalculate concentrations after next time increment
    A = A - rate_1 + rate_r1
    I = I + rate_1 - rate_2 - rate_r1 + rate_r2
    B = B - rate_2 + rate_r2
    P = P + rate_2 - rate_r2
[Figure: simulated concentrations of A, I, B, and P (Concentration, M) versus Time, s over 200 s.]
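For readers who want to run the multistep simulation end to end, a self-contained sketch follows. The initial concentrations and the four rate constants are assumed values chosen only for illustration, so the resulting curves will not necessarily match the figure.

```python
import matplotlib.pyplot as plt

# initial concentrations in M (assumed values)
A, I, B, P = 1.00, 0.00, 0.80, 0.00
# rate constants (assumed values)
k1, kr1 = 0.05, 0.02   # 1/s
k2, kr2 = 0.10, 0.01   # 1/(M s) and 1/s
length = 200           # length of simulation, s

A_conc, I_conc, B_conc, P_conc = [], [], [], []

for sec in range(length):
    # record the current concentrations
    A_conc.append(A)
    I_conc.append(I)
    B_conc.append(B)
    P_conc.append(P)
    # rates of the four elementary processes, M/s
    rate_1 = k1 * A
    rate_r1 = kr1 * I
    rate_2 = k2 * B * I
    rate_r2 = kr2 * P
    # concentrations after one more second
    A = A - rate_1 + rate_r1
    I = I + rate_1 - rate_2 - rate_r1 + rate_r2
    B = B - rate_2 + rate_r2
    P = P + rate_2 - rate_r2

time = range(length)
for conc, label in zip((A_conc, I_conc, B_conc, P_conc), 'AIBP'):
    plt.plot(time, conc, label=label)
plt.xlabel('Time, s')
plt.ylabel('Concentration, M')
plt.legend()
```

A useful sanity check is mass balance: [A] + [I] + [P] and [B] + [P] should each stay constant throughout the simulation.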
A word of caution regarding the above simulations: if the rate constants are increased enough, oscillating behavior and
negative concentrations will be observed, the latter of which is clearly unphysical. This is because the rates are not
recalculated frequently enough relative to how quickly the concentrations change, but this can be remedied by decreasing the step size.
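The effect of step size can be seen in a short sketch using the simple A → P reaction. The function name simulate(), the deliberately large rate constant, and the two step sizes are assumptions made for this illustration.

```python
import numpy as np

def simulate(k, dt, total_time=10.0, A0=1.00):
    """Step-wise simulation of A -> P that returns [A] at each step."""
    steps = int(round(total_time / dt))
    A = A0
    A_conc = np.empty(steps)
    for i in range(steps):
        A_conc[i] = A
        rate = k * A      # M/s
        A -= rate * dt    # change in [A] over one step of length dt
    return A_conc

k = 2.5   # 1/s, deliberately large (assumed value)

coarse = simulate(k, dt=1.0)   # one-second steps
fine = simulate(k, dt=0.1)     # tenth-of-a-second steps

print(coarse[:4])
print(fine.min() >= 0)
```

With one-second steps, [A] is multiplied by (1 - 2.5) each step, so it swings through 1.00, -1.50, 2.25, and -3.375 M. With 0.1 s steps, the multiplier is 0.75 and the concentration decays smoothly without changing sign.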
Another approach to performing the above kinetic simulations is to integrate the differential equations. For an introduction
to integrating differential equations, see section 8.4.4. Below we will simulate a two-step reaction where the first
step is reversible. Because the following are the elementary steps, the rate equations can be inferred from the reaction
stoichiometry.
$$ A \underset{k_{r1}}{\overset{k_1}{\rightleftharpoons}} B \overset{k_2}{\rightarrow} P $$
The three differential equations tracking the concentrations of A, B, and P are shown below where $k_1$ and $k_{r1}$ are the
forward and reverse rate constants, respectively, for the first step and $k_2$ is the rate constant for the second step.
$$ \frac{d[A]}{dt} = -k_1[A] + k_{r1}[B] $$
$$ \frac{d[B]}{dt} = k_1[A] - k_2[B] - k_{r1}[B] $$
$$ \frac{d[P]}{dt} = k_2[B] $$
As is done in section 8.4.4, a Python function is created containing the differential equations, but in contrast to chapter 8,
the differential equation for d[P]/dt is also included in the Python function instead of calculating [P] after the integration.
Because the odeint() function only takes the initial concentrations (A0, B0, and P0) as a single argument, they need
to be placed in a tuple.
[Figure: concentrations of A, B, and P ([X], M) versus Time, s from the integrated differential equations.]
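The integration described above can be sketched as follows. The function name rates(), the rate constants, and the initial concentrations are assumptions made for illustration, so the output will not necessarily reproduce the figure.

```python
import numpy as np
from scipy.integrate import odeint

def rates(conc, t, k1, kr1, k2):
    """Differential equations for A <=> B -> P; t is unused but required."""
    A, B, P = conc   # unpack the dependent variables
    dAdt = -k1 * A + kr1 * B
    dBdt = k1 * A - k2 * B - kr1 * B
    dPdt = k2 * B
    return dAdt, dBdt, dPdt

# rate constants in 1/s and initial concentrations in M (assumed values)
k1, kr1, k2 = 0.20, 0.05, 0.10
A0, B0, P0 = 1.00, 0.00, 0.00

t = np.linspace(0, 50, 200)   # times at which to evaluate, s
conc = odeint(rates, (A0, B0, P0), t, args=(k1, kr1, k2))

# the result has one row per time point and one column per species
A_t, B_t, P_t = conc.T
```

Because matter is only interconverted between A, B, and P, the three columns should sum to the total initial concentration at every time point, which is a quick way to check the differential equations for sign errors.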
Unlike the deterministic simulations above, if the same code for a stochastic simulation is run multiple times, the results
will vary at least slightly, though the overall patterns should be similar. This is because the outcome of stochastic
simulations is determined by (pseudo)random number generators. It is as if the results of the simulation are dictated by the
flip of a coin or roll of a die. This analogy is so good that rolling dice repeatedly can simulate radioactive decay kinetics
among other things. Rolling a die thousands of times is tedious, so we will use NumPy’s random module to generate
random values for the simulations.
Note
There is a random component to some of the following code, so exact results may vary.
Radioactive decay is a random process, so logically it can be simulated as such. Every radioactive atom has a fixed
probability of decaying each second, just like a die has a fixed probability of rolling a one. In the simulation below, a for
loop is used for each second or step of the simulation, and a random number generator is used in each step to decide how
many atoms decay. The binomial() method is used here to generate a series of zeros and ones with a set probability
of generating a one. In this simulation, a one signifies a decaying atom. These decayed atoms are tallied and subtracted
from the current number of remaining atoms, and this value is recorded in the atoms_remaining variable.
rng = np.random.default_rng()
starting_atoms = 1000
length = 10000 # length of simulation
num_atoms = starting_atoms
atoms_remaining = []
for x in range(length):
    atoms_remaining.append(num_atoms)
    # "rolls" dice and tallies up number of ones (decays)
    decays = rng.binomial(1, p=0.001, size=num_atoms)
    decayed_count = np.sum(decays)
    # deduct decayed nuclei from the total
    num_atoms -= decayed_count
The simulation results stored in the atoms_remaining list can be plotted along with the first-order integrated rate
law to see how the two compare. Because there is a 1/1000 probability in the above simulation of each atom generating
a one (decay) in each step, the rate constant ($k$) is 0.001 s$^{-1}$. For ease of viewing, only twenty data points from the
simulation are plotted below.
# plot of simulation
step = np.arange(0, length, 500)
plt.plot(step, atoms_remaining[::500], 'o', label='Simulation Results')
# plot of theoretical rate law
t = np.arange(length)
plt.plot(t, starting_atoms * np.exp(-0.001 * t), label='Theoretical Rate Law')
plt.xlabel('Time, s')
plt.ylabel('Atoms Remaining')
plt.legend();
[Figure: simulated and theoretical Atoms Remaining versus Time, s over 10000 s.]
The simulation and theoretical model are in good but not perfect agreement. The deviation is a result of the simulation
using random numbers and only simulating a relatively small number of molecules. If this simulation were run with
increasingly larger numbers of molecules, the results are expected to converge on the theoretical prediction.
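The convergence with larger samples can be checked with a short sketch. To keep it fast, it draws the number of decays in each step directly with a single binomial(num_atoms, p) call instead of generating one zero or one per atom, which is a substitution made here. The function name, seed, and sample sizes are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)   # seeded so reruns give the same numbers
p = 0.001       # probability of decay per atom per second
length = 1000   # length of simulation, s

def simulate_decay(starting_atoms):
    """Return the fraction of atoms remaining after `length` seconds."""
    num_atoms = starting_atoms
    for _ in range(length):
        # one binomial draw gives the number of decays among num_atoms atoms
        num_atoms -= rng.binomial(num_atoms, p)
    return num_atoms / starting_atoms

theory = np.exp(-p * length)   # first-order integrated rate law
for n in (1_000, 100_000, 10_000_000):
    frac = simulate_decay(n)
    print(f'{n:>10} atoms: simulated {frac:.4f}, theory {theory:.4f}')
```

The scatter around the theoretical fraction is expected to shrink roughly with the square root of the number of atoms.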
Uncertainty is a part of all data, and uncertainty around a repeatedly measured and calculated value is sometimes
represented in the form of a 95% confidence interval (CI). This is the interval around the mean that has a 95% chance of
containing the true value. Another way of describing a 95% CI is that if we were to repeatedly collect a dataset and
calculate the 95% CI, the true value should be, statistically speaking, inside the confidence interval 95% of the time.
Performing these experiments would be tedious, but this can be simulated in Python relatively easily.
The equation for calculating the 95% CI is shown below where $\bar{x}$ is the average value in a set of repeated measurements,
$s$ is the standard deviation (corrected), $t$ is the statistical $t$ value from a table, and $N$ is the number of samples in
the set. For 20 samples per set (19 degrees of freedom), $t$ = 2.09 and $N$ = 20.
$$ 95\% \; CI = \bar{x} \pm \frac{ts}{\sqrt{N}} $$
We can simulate the data collection by picking a true value and generating twenty samples by adding random error to
twenty copies of the true value. Using the simulated dataset, the 95% CI can be calculated, and we can test whether or
not the true value is inside the CI. If we repeat this procedure numerous times, recording the success or failure of the
true value being inside the CI, we can calculate the success rate as demonstrated below.
trials = 100000
N = 20
t = 2.09
true = 6.2 # true value
# number of times true value is inside 95% CI
in_interval = 0
94.85
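The loop that generates and tests the datasets can be sketched as follows. Normally distributed error with a standard deviation of one is an assumption, as is the reduced number of trials used to keep the run time short.

```python
import numpy as np

rng = np.random.default_rng()

trials = 10000   # fewer trials than above to keep the sketch quick
N = 20           # samples per dataset
t = 2.09         # t value for 19 degrees of freedom
true = 6.2       # true value

in_interval = 0
for trial in range(trials):
    # simulated measurements: true value plus normally distributed error (assumed)
    data = true + rng.normal(loc=0, scale=1, size=N)
    mean = np.mean(data)
    s = np.std(data, ddof=1)          # corrected standard deviation
    half_width = t * s / np.sqrt(N)   # half-width of the 95% CI
    # test whether the true value lies inside the interval
    if mean - half_width <= true <= mean + half_width:
        in_interval += 1

percent = 100 * in_interval / trials
print(percent)
```

The ddof=1 argument is what makes np.std() return the corrected (sample) standard deviation rather than the population value.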
The above simulation finds that almost 95% of the time the true value is inside the 95% CI, which is pretty close to
what we expected. If this simulation is repeated, you will likely observe that the values are very often slightly below the
expected 95%. This is the result of smaller datasets and should be closer to the theoretical value with increasing dataset
size.
Polymers are long chains of repeating units called monomers. These chains can easily extend for thousands of monomers
and wind around in 3D space in seemingly random fashions. A single polymer chain can be made of a single type of
monomer or multiple types and can be of varying lengths, but for the following polymer simulation, we will work with
polymers of a fixed number of monomers and ignore the monomer types.
One model for polymer conformation is a random flight polymer which assumes that the conformation of the polymer is
entirely random. We can simulate a random flight polymer through a random walk by making each subsequent segment of
polymer extend in a random direction and distance. For simplicity, we will simulate the polymer in only two dimensions,
but this simulation can be expanded to a third dimension. The random element of the simulation is provided by a NumPy
random number generator which generates a random length and direction for each new segment.
The general procedure for the following simulation is to start the polymer chain at coordinate (0, 0), and for each new
segment, add a random value to the x-coordinate of the previous polymer end and another random value to the y-coordinate.
Each new coordinate is then appended to a list of coordinates (coords) for analysis and visualization. This simulation
is coded below. The random values are floats from [-1, 1). The random() method generates floats in the [0, 1) range,
so we shift it to the [-1, 1) range by subtracting 0.5 and multiplying by 2.
segments = 3000
coords = [[0, 0]]
coords = np.array(coords)
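A complete, runnable sketch of the random walk is given below. The loop body is reconstructed from the procedure described above, so the variable names inside the loop are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()

segments = 3000
coords = [[0, 0]]   # the polymer starts at the origin

for seg in range(segments):
    # random x and y displacements from [-1, 1)
    dx, dy = (rng.random(2) - 0.5) * 2
    x_prev, y_prev = coords[-1]
    coords.append([x_prev + dx, y_prev + dy])

coords = np.array(coords)

plt.plot(coords[:, 0], coords[:, 1])
plt.xlabel('Position(x), au')
plt.ylabel('Position(y), au')
```

The generator's uniform() method, called as rng.uniform(-1, 1, size=2), should give equivalent displacements without the manual shift and scale.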
[Figure: simulated random flight polymer; axes: Position(x), au and Position(y), au.]
The results of the simulation show a polymer strand winding around in a seemingly random fashion. If we rerun the above
simulation, a different-looking polymer conformation will be generated.
Further Reading
4. Kneusel, R. T. The Art of Randomness: Randomized Algorithms in the Real World; No Starch Press: San Francisco,
CA, 2024.
Exercises
Complete the following exercises in a Jupyter notebook. Any data file(s) referred to in the problems can be found in the
data folder in the same directory as this chapter’s Jupyter notebook. Alternatively, you can download a zip file of the data
for this chapter from here by selecting the appropriate chapter file and then clicking the Download button.
1. Using scipy.integrate.odeint() and a differential equation, plot the concentration of starting material
A with respect to time for a third-order reaction.
2. Create a simulation of the following single-step reaction and overlay it with the appropriate integrated rate law. The
rate constant is 0.28 M$^{-1}$s$^{-1}$. Feel free to start with code from this chapter and modify it as needed.
2𝐴 → 𝑃
3. Plot the concentrations of A, B, C, and P with respect to time for the following three-step, non-reversible
mechanism. The initial concentrations and rate constants are in the table below.
𝐴→𝐵→𝐶→𝑃
4. Simulate the following chemical equilibrium where the forward rate is described by Rate$_f$ = (1.3 × 10$^{-2}$
M$^{-1}$s$^{-1}$)[A]$^2$ and the reverse rate is described by Rate$_r$ = (6.2 × 10$^{-3}$ s$^{-1}$)[B].
$$ 2A \underset{k_r}{\overset{k_f}{\rightleftharpoons}} B $$
Use a for loop to simulate each second of reaction by calculating the rates and increasing/decreasing each
concentration appropriately. Record the concentrations in lists and plot the results. Start with 2.20 M of A and 1.72
M of B and run the simulation for at least 200 seconds. Notice that the rates are in M/s.
5. In section 9.1.3, a two-step, reversible reaction is simulated. If the rate constant $k_{r1}$ is decreased to 0.001 s$^{-1}$, what
effect on the reaction do you anticipate? Simulate this to see if your prediction is correct.
6. Simulate two competing, first-order reactions of starting material A forming products P$_1$ and P$_2$, and plot the resulting
concentrations of both products versus time.
$$ P_1 \overset{k_1}{\leftarrow} A \overset{k_2}{\rightarrow} P_2 $$
Use $k_1$ = 0.02 s$^{-1}$ and $k_2$ = 0.04 s$^{-1}$ and start with 2.00 M of A. What do you predict the plot of concentration
versus time to look like and the ratio of products to be? Does your simulation agree?
7. Polymers that consist of two or more different monomers are known as copolymers. Simulate an addition copolymer
consisting of two monomers: ethylene (28.06 g/mol) and styrene (104.16 g/mol) with a fixed length of a thousand
units. Given the molecular weights of the two monomers above, calculate the weights for a thousand simulated
polymer strands and generate a histogram of the frequency versus weight. Hint: try using the binomial()
method with p=0.5 and treat a zero as one monomer and a one as the other.
8. Block copolymers are polymers where multiple monomer types are clustered along the polymer chain instead of
being randomly dispersed. These clusters are called blocks, which may be of random lengths as the polymer
switches between monomer types. An example is shown below.
-A-A-A-A-A-A-A-B-B-B-B-B-B-A-A-A-A-B-B-B-A-A-A-A-A-
Simulate a block copolymer consisting of two monomers with a total length of a hundred monomer units.
Hint: Append monomers (0 or 1) to a list inside a for loop, and use a method such as binomial() to decide
when to toggle between monomer types. Use mono = 1 - mono to make the switch.
9. The random flight polymer simulation presented in section 9.2.3 uses a for loop. As discussed in chapter 4, one of
the virtues of NumPy is that it often avoids the computationally inefficient for loops. Below is the same simulation
written in a single line of code leveraging the power of NumPy arrays. Briefly explain what it is doing and why it
works.
rng = np.random.default_rng()
loc = np.cumsum(rng.integers(-1, high=2, size=(3000,2)), axis=0)
10. Proteins are natural polymers consisting of twenty common monomers called amino acids. Simulate a random
protein strand of a thousand units long using the integers() method and a Python dictionary or list containing
the single-letter amino acid codes.
11. Confidence intervals
a) Convert the code for calculating a 95% confidence interval in section 9.2.2 to a Python function that accepts
the number of samples as the one argument and returns the percentage of the time the true value is inside the
confidence interval. You will need to look up t values and generate a dictionary that converts degrees of freedom
(N) to t values.
b) Using a for loop, calculate the percentage of the time the true value is in the 95% confidence interval for each
of the sample sizes in the above dictionary and plot the results. Describe the trend.
12. Simulate the diffusion of molecules along a single axis. Start all molecules at zero, and for each step of the
simulation, add a random number, positive or negative, to each value in the array. Plot the results in a histogram.
13. Using the function from section 9.1.1, simulate the splitting pattern for the tertiary proton in isopropyl alcohol
((CH$_3$)$_2$CHOH). In CDCl$_3$, this proton is observed at 3.82 ppm with a coupling constant of 6 Hz. Assume no
coupling with the hydroxyl proton is observed.
14. The law of large numbers indicates that as the number of trials increases, the observed average should overall
converge on the statistical average. For example, when rolling a six-sided die, all numbers are equally probable to
land up, so if we roll a number of dice, the average of all the numbers is expected to be around 3.5 (i.e., (1 + 2 +
3 + 4 + 5 + 6)/6 = 3.5). Using the integers() method, simulate the rolling of between two and five thousand,
six-sided dice and plot the resulting average number versus the number of dice rolled. Include at least a hundred
data points in your plot and label your axes.
CHAPTER 10: PLOTTING WITH SEABORN
There are a number of plotting libraries available for Python, including Bokeh, Plotly, and MayaVi; but the most prevalent
library is still probably matplotlib. It is often the first plotting library a Python user will learn, and for good reason. It
is stable, well supported, and there are few plots that matplotlib cannot generate. Despite its popularity, there are some
drawbacks… namely, it can be quite verbose. That is, you may be able to generate nearly any plot, but it will take at least
a few lines of code, if not dozens, to create and customize your figure.
One attractive alternative is the seaborn plotting library. While seaborn cannot generate the same variety of plots as
matplotlib, it is good at generating a few common plots that people use regularly, and here is the key detail… it often does
what would take matplotlib 10+ lines of code in only one or two lines. To make things even better, seaborn is built on top
of matplotlib. This means that if you are not completely happy with what seaborn creates, you can fine-tune it with the
same matplotlib commands you already know! In addition, seaborn is designed to work closely with the pandas library.
For example, think of all the lines of code you have typed to simply add labels to your x- and y-axes. Instead, seaborn
often pulls the labels from the DataFrame column headers. Again, if you do not like this default behavior, you can still
override it with plt.xlabel() and other commands that you already know.
By convention, seaborn is imported with the sns alias, but being that this is a relatively young library, it is unclear how
strong this convention is. The official seaborn website uses it, so we will as well. All code in this chapter assumes the
following import.
import seaborn as sns
The seaborn plotting library is organized mainly around the different types of plots that it can generate. Below is a table
of the main categories. The rest of this chapter is a more in-depth survey of select plotting functions, and it is certainly
not a complete list.
Table 1 Seaborn Plotting Type Categories Covered Herein
Category       Description
Regression     Draws a regression line through the data
Categorical    Plots frequency versus a category
Distribution   Plots frequency versus a continuous value
Matrix         Displays the data as a colored grid
Relational     Visualizes the relationship between two continuous variables
One distinction between some of the plotting categories above is whether they display continuous versus
discrete/categorical information. When data are continuous, they can be nearly any value in a range like the density of
a metal. This is in contrast to discrete or categorical data that places data in a limited number of groups or bins such as
the element(s) present in a metal sample.
Generating a regression line through data is a common task in science, and seaborn includes multiple plotting types that
perform this task. All of the plots discussed below use a least-squares best fit and include a confidence interval for the
regression line as a shaded region. Remember that there is uncertainty in both the slope and y-intercept for a regression
line. If we were to plot all the possible variations of the regression line within the slope and intercept uncertainties, we
get the regression confidence interval. By default, seaborn displays the 95% confidence interval, but this can be changed.
10.2.1 regplot
The regplot generates a single scatter plot of data with a linear regression through the data points complete with a 95%
confidence interval. The sns.regplot() function can take x and y positional arguments just like plt.plot(),
but it also can take the x and y column names from a pandas DataFrame. Both approaches are demonstrated below.
rng = np.random.default_rng()
x = np.arange(10)
y = 2 * x + rng.normal(size=10)
sns.regplot(x=x, y=y);
[Figure: regplot of y versus x with a linear regression line and shaded confidence interval.]
If the data is in a DataFrame, the x and y values can be provided as the column names, and seaborn will automatically
add the column names as x and y labels. Below is a series of boiling points and molecular weights for various organic
compounds.
bp = pd.read_csv('data/org_bp.csv')
bp
bp MW type
0 65 32.04 alcohol
1 78 46.07 alcohol
2 98 60.10 alcohol
3 118 74.12 alcohol
4 139 88.15 alcohol
5 157 102.18 alcohol
6 176 116.20 alcohol
7 195 130.23 alcohol
8 212 144.25 alcohol
9 232 158.28 alcohol
10 36 72.15 alkane
11 69 86.18 alkane
12 98 100.21 alkane
13 126 114.23 alkane
14 151 128.26 alkane
15 174 142.29 alkane
16 196 156.31 alkane
17 216 170.34 alkane
18 63 86.18 alkane
19 117 114.23 alkane
20 28 72.15 alkane
21 80 100.21 alkane
22 108 74.12 alcohol
23 83 74.12 alcohol
24 131 88.15 alcohol
25 135 102.18 alcohol
26 140 116.20 alcohol
27 182 94.11 alcohol
28 202 108.14 alcohol
29 220 136.19 alcohol
If you choose to provide column names from a pandas DataFrame, you must also provide the name of the DataFrame
using the data keyword argument.
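A short sketch of this call is shown below. To keep it self-contained, the DataFrame is rebuilt from a handful of rows copied from the table above rather than read from the CSV file.

```python
import pandas as pd
import seaborn as sns

# a few rows copied from the boiling point table above
bp = pd.DataFrame({
    'bp': [65, 78, 98, 118, 36, 69, 98, 126],
    'MW': [32.04, 46.07, 60.10, 74.12, 72.15, 86.18, 100.21, 114.23],
    'type': ['alcohol'] * 4 + ['alkane'] * 4,
})

# x and y are column names, so the DataFrame is passed through data=
ax = sns.regplot(x='MW', y='bp', data=bp)
```

The function returns a matplotlib Axes object, so the plot can be customized further through ax or through the usual plt commands.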
[Figure: regplot of bp versus MW with axis labels taken from the DataFrame column names.]
While the DataFrame column names provide accurate axis labels, the units are missing. We can use matplotlib commands
from chapter 3 to modify the axis labels.
[Figure: regplot of bp, °C versus MW, g/mol with customized axis labels.]
10.2.2 lmplot
The lmplot() function is very similar to the regplot() function except that lmplot() also allows for multiple regressions
based on additional pieces of information about each data point. For example, the org_bp.csv file above contains the boiling
points of various alcohols and alkanes along with their molecular weights. Chemical intuition might lead one to expect
two independent boiling point trends for the alcohols and alkanes, so we need two independent regression lines for
the two classes of organic molecules. The lmplot() function can do exactly this.
The lmplot() function takes the x and y variables and the DataFrame name as keyword arguments. Older seaborn releases
also accepted the x-values, y-values, and DataFrame name as the first three positional arguments in this order, but recent
releases expect x and y to be passed by keyword, so the keyword form is the safer choice.
The hue= argument is the column name that dictates the color of the markers, so in this example, it will be the type of
organic molecule.
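A minimal sketch of such a call is shown below. It uses four alcohol and four alkane rows copied from the table above rather than the full file, and it passes everything by keyword. The axis label text is taken from the figure.

```python
import pandas as pd
import seaborn as sns

# Four alcohols and four alkanes taken from the table above
bp = pd.DataFrame({
    'bp': [65, 78, 98, 118, 36, 69, 98, 126],
    'MW': [32.04, 46.07, 60.10, 74.12, 72.15, 86.18, 100.21, 114.23],
    'type': ['alcohol'] * 4 + ['alkane'] * 4,
})

# hue='type' fits and colors one regression line per class of molecule
grid = sns.lmplot(data=bp, x='MW', y='bp', hue='type')

# lmplot() returns a FacetGrid, which has its own labeling method
grid.set_axis_labels('MW, g/mol', 'bp, °C')
```

Because lmplot() returns a FacetGrid rather than a single axes object, labels are set here through the grid instead of through plt.xlabel().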
[Figures: lmplot of bp, °C versus MW with separate regression lines for alcohols and alkanes, colored by type]
Categorical plots contain one axis of continuous values and one axis of discrete or categorical values. For example, if the
density of three metals were measured repeatedly in the lab, we would want to plot measured density (continuous) with
respect to metal identity (categorical). Below are a few fictitious laboratory measurements for the densities of copper,
iron, and zinc.
Table 2 Density (g/mL) Measurements for Different Metals
Cu Fe Zn
8.51 7.95 6.79
9.49 7.53 7.06
8.48 8.09 7.96
9.40 7.44 7.06
8.83 8.38 6.69
9.45 7.83 7.21
8.73 6.88 7.35
9.00 7.90 6.65
8.84 8.51 7.41
9.32 7.89 7.89
If we want to compare these values, the density can be plotted on the y-axis and metal on the x-axis. First, we need to
load the values into a DataFrame.
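The densities and labels variables used in the next cell are not defined in the text shown here. A minimal sketch, assuming they hold the Table 2 values as a 2D NumPy array and a list of column names, is:

```python
import numpy as np

# Column labels and density values (g/mL) copied from Table 2
labels = ['Cu', 'Fe', 'Zn']
densities = np.array([
    [8.51, 7.95, 6.79],
    [9.49, 7.53, 7.06],
    [8.48, 8.09, 7.96],
    [9.40, 7.44, 7.06],
    [8.83, 8.38, 6.69],
    [9.45, 7.83, 7.21],
    [8.73, 6.88, 7.35],
    [9.00, 7.90, 6.65],
    [8.84, 8.51, 7.41],
    [9.32, 7.89, 7.89],
])
```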
df = pd.DataFrame(densities, columns=labels)
df.head()
Cu Fe Zn
0 8.51 7.95 6.79
1 9.49 7.53 7.06
2 8.48 8.09 7.96
3 9.40 7.44 7.06
4 8.83 8.38 6.69
The simplest categorical plot function is stripplot() which generates a scatter plot with the x-axis as the categorical
dimension and the y-axis as the continuous value dimension. By providing the function with the DataFrame, it will assume
the columns are the categories.
sns.stripplot(data=df);
[Figure: strip plot of the measured densities with Cu, Fe, and Zn on the x-axis and no y-axis label]
By default, the x-axis contains the column labels from the DataFrame, but the y-axis is without any label. Again, one of
the conveniences of the seaborn library is that it is built on top of matplotlib, so any plot created by seaborn can be further
modified by matplotlib commands as shown below.
sns.stripplot(data=df)
plt.ylabel('Density, g/mL')
plt.xlabel('Metals');
[Figure: strip plot of Density, g/mL versus Metals (Cu, Fe, Zn)]
While the plots above are elegantly simple, they can make it difficult to accurately interpret the data when multiple data
points are overlapping as can happen with larger numbers of data points. This obscures the quantity of points in various
regions. One plot that alleviates this issue is the swarm plot which is almost identical to the strip plot except that points
are not permitted to overlap to make the quantity more apparent.
sns.swarmplot(data=df)
plt.ylabel('Density, g/mL')
plt.xlabel('Metals');
[Figure: swarm plot of Density, g/mL versus Metals (Cu, Fe, Zn)]
An additional option for understanding the density of points is the violin plot. By default, this plot renders a blob with the
width representing the density of points at various regions. Inside the blob are miniature box plots (discussed in the next
section) that provide more information about the distribution of data points.
sns.violinplot(data=df)
plt.ylabel('Density, g/mL')
plt.xlabel('Metals');
[Figure: violin plot of Density, g/mL versus Metals (Cu, Fe, Zn)]
The box plot is a classic plot in statistics for representing the distribution of data and can be easily generated in seaborn
using the boxplot() function, which works much the same way as the above categorical plots. There are three main
components to a box plot. The center box contains lines marking the 25th, 50th, and 75th percentiles. For example,
the 75th percentile line is the value that 75% of the data points fall below. The 50th percentile is also known as the median. The
length of the box (i.e., from the 25th percentile to the 75th percentile) is known as the interquartile range (IQR). Beyond
the box are the bars known as whiskers, which mark the range of the rest of the data points up to 1.5 × the IQR beyond the box. If a data
point is more than 1.5 × the IQR beyond the box, it is an outlier and is explicitly represented with a spot (Figure 1).
Figure 1 A box plot is composed of a box with lines at the 25th, 50th, and 75th percentiles and whiskers that extend out
to the rest of the non-outlier data points. If a data point is more than 1.5 × the interquartile range below the 25th percentile or
above the 75th percentile, it is an outlier represented by a dot.
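To make the pieces of a box plot concrete, the sketch below computes them by hand for the copper column of Table 2. It assumes NumPy's default linear-interpolation percentiles, and the variable names are only illustrative.

```python
import numpy as np

# Copper densities (g/mL) from Table 2
cu = np.array([8.51, 9.49, 8.48, 9.40, 8.83, 9.45, 8.73, 9.00, 8.84, 9.32])

# The box: 25th, 50th (median), and 75th percentiles
q1, median, q3 = np.percentile(cu, [25, 50, 75])
iqr = q3 - q1

# Fences: points outside these limits are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
whisker_low = cu[cu >= lower_fence].min()
whisker_high = cu[cu <= upper_fence].max()
outliers = cu[(cu < lower_fence) | (cu > upper_fence)]

print(f'Q1 = {q1:.3f}, median = {median:.3f}, Q3 = {q3:.3f}, IQR = {iqr:.3f}')
print(f'whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}')
```

For these ten measurements, every point falls inside the fences, so the whiskers simply reach the smallest and largest values.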
sns.boxplot(data=df)
plt.ylabel('Density, g/mL')
plt.xlabel('Metals');
[Figure: box plot of Density, g/mL versus Metals (Cu, Fe, Zn)]
The count plot represents the frequency of values for different categories. This is similar to a histogram plot except
that a histogram’s x-axis is a continuous set of values while a count plot’s x-axis is made up of discrete categories. The
countplot() function accepts a raw collection of responses, tallies them up, and plots them as a labeled bar plot. For
example, if we have a dataset of all the chemical elements up to rutherfordium (Rf) and their physical state under standard
conditions, the function accepts the list of their physical states, counts them, and generates the plot.
elem = pd.read_csv('data/elements_data.csv')
elem.head()
sns.countplot(x='state', data=elem);
[Figure: count plot of count versus state, with bars in the order gas, solid, liquid]
Like many plotting types in seaborn, the count plot can be further customized through keyword arguments and using other
available data. One shortcoming of the above plot is that the states are listed in the order they first appear in the dataset
rather than in order of increasing or decreasing disorder. We can impose a different order by providing the order argument
with a list of the states in the sequence they should appear.
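A minimal sketch of the order argument is shown below. Because the elements_data.csv file is not reproduced here, the cell uses a small stand-in DataFrame of eight elements; the symbol column is a hypothetical name, while state follows the column used above.

```python
import pandas as pd
import seaborn as sns

# Small stand-in for elements_data.csv (not the real file contents)
elem = pd.DataFrame({
    'symbol': ['H', 'He', 'Li', 'C', 'N', 'Br', 'Fe', 'Hg'],
    'state': ['gas', 'gas', 'solid', 'solid', 'gas', 'liquid', 'solid', 'liquid'],
})

# order= fixes the left-to-right sequence of the categories
ax = sns.countplot(x='state', data=elem, order=['gas', 'liquid', 'solid'])
```

Without order, the bars in this stand-in dataset would appear as gas, solid, liquid, which is the order of first appearance.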
[Figure: count plot of count versus state, with bars in the order gas, liquid, solid]
We can also set the color of each bar based on the valence orbital block by providing the hue argument with the name
of the column.
[Figure: count plot of count versus state with bars colored by block (s, p, d, f)]
Seaborn provides a set of plotting types that represent the distribution of data. These are essentially extensions of the
histogram plot but with extra features like additional dimensions, kernel density estimates, and generating grids of histogram
plots.
10.4.1 histplot
The histplot() function is one of the most basic distribution plotting functions in seaborn. This function is similar to
the matplotlib plt.hist() function except that seaborn brings a few extra options like setting the color (hue=) based
on a particular column of data.
To demonstrate this, we will use the results of a one-dimensional stochastic diffusion simulation. During the individual
steps of this simulation, each of a thousand simulated molecules is either moved to the right one unit, to the left one unit,
or not moved at all. A random number generator dictates this movement as demonstrated below.
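The simulation code itself is not reproduced here, so the cell below is only a sketch of how such a simulation might be written. The number of steps and the equal probability of the three moves are assumptions, while the thousand molecules and the variable name loc follow the text and the plotting code.

```python
import numpy as np

rng = np.random.default_rng()

n_molecules = 1000   # stated in the text
n_steps = 1000       # assumption: the number of steps is not given here

# Each step moves each molecule by -1, 0, or +1 (assumed equally likely)
steps = rng.integers(-1, 2, size=(n_steps, n_molecules))

# The final location of each molecule is the sum of its steps
loc = steps.sum(axis=0)
```

Note that the upper bound in rng.integers() is exclusive, so rng.integers(-1, 2) draws from -1, 0, and 1.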
sns.histplot(loc)
plt.xlabel('Location')
plt.ylabel('Number of Molecules');
[Figure: histogram of Number of Molecules versus Location]
The kdeplot() function is very similar to the histplot() function except that it fits the histogram with a kernel
density estimate (kde) curve. This curve is basically just a smoothed curve over the data to help visualize the overall trend.
sns.kdeplot(loc)
plt.xlabel('Location')
plt.ylabel('Fraction of Molecules')
plt.tight_layout()
[Figure: kernel density estimate of Fraction of Molecules versus Location]
A joint plot can be described as a scatter plot with histograms on the sides providing additional information or clarification
on the density of the data points. To demonstrate this, below is a two-dimensional stochastic diffusion simulation and the
results. The principles are the same as above except applied to two dimensions.
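As with the one-dimensional case, the simulation code is not shown here. A possible sketch, assuming independent steps of -1, 0, or +1 along each axis and an assumed thousand steps for a thousand molecules, is:

```python
import numpy as np

rng = np.random.default_rng()

n_molecules = 1000   # assumption, carried over from the 1D example
n_steps = 1000       # assumption: the number of steps is not given here

# Independent -1, 0, or +1 moves along each axis at every step
x = rng.integers(-1, 2, size=(n_steps, n_molecules)).sum(axis=0)
y = rng.integers(-1, 2, size=(n_steps, n_molecules)).sum(axis=0)
```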
plt.plot(x, y, '.')
plt.axis('equal');
[Figures: scatter plot of the simulated molecule positions, followed by joint plots of the same data with the y-axis labeled Y Distance, au; the notebook output line reads seaborn.axisgrid.JointGrid]
The pair plot belongs to the category of distribution plots, but it is different enough to be worth addressing separately. A
pair plot is designed to show the relationship among multiple variables by generating a grid of plots in a single figure. Each
plot in the grid is a scatter plot showing the relationship between two of the variables on either axis with the exception of the
plots in the diagonals. Because the diagonal plots are the intersection between a variable and itself, these are histograms
showing the distributions of values for that variable. Pair plots are particularly useful for looking at new data to see if
there are any trends worth investigating because this entire grid can be easily generated with a single sns.pairplot()
function call.
To demonstrate a pair plot, the file periodic_trends.csv contains physical data on non-noble gas elements in the first three
rows of the periodic table. To quickly see how each of the columns of data relates to each other, we will generate a pair
plot.
per = pd.read_csv('data/periodic_trends.csv')
per.head()
sns.pairplot(per);
[Figure: pair plot grid of EN, row, IE_kJ, and radius_pm with histograms on the diagonal]
The color can also be set based on any piece of information. Below, the row is used to dictate the color of each data point.
[Figure: pair plot grid of EN, IE_kJ, and radius_pm with points colored by row (1, 2, 3)]
Heat maps are color representations of 2D grids of numerical data and are ideal for making large tables of values easily
interpretable. As an example, we can import a table of bond dissociation energies (in kJ/mol) and visualize these data as
a heat map. In the following pandas function call, index_col=0 tells pandas to use the first column of the file as the row
labels (index), so the rows are labeled with element symbols just like the columns.
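The file name is not shown here, so the sketch below reads the same table from an in-memory string instead of a file on disk. The values are copied from the table in this section, and the only point being illustrated is the effect of index_col=0.

```python
import io
import pandas as pd

# Same grid of bond dissociation energies (kJ/mol) as CSV text
csv_text = """,H,C,N,O,F
H,436,415,390,464,569
C,415,345,290,350,439
N,390,290,160,200,270
O,464,350,200,140,160
F,569,439,270,160,160
"""

# index_col=0 turns the first column into the row labels
bde = pd.read_csv(io.StringIO(csv_text), index_col=0)
bde
```

With the row labels in place, a value can be looked up by element symbols, for example bde.loc['H', 'F'] for the H-F bond.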
H C N O F
H 436 415 390 464 569
C 415 345 290 350 439
N 390 290 160 200 270
O 464 350 200 140 160
F 569 439 270 160 160
This grid of numerical values is difficult to quickly interpret, and if it were a larger table of data, it could become
almost impossible to interpret in this form. We can plot the heat map using the heatmap() function and feeding it the
DataFrame. The function also accepts NumPy arrays, but without the index and column labels of a DataFrame, the axes
will not be automatically labeled.
sns.heatmap(bde);
[Figure: heat map of the bond dissociation energies with H, C, N, O, and F on both axes and a colorbar on the right]
Now we have a color grid where the colors represent numerical values defined in a colorbar automatically displayed on
the right side. This default color map can easily be customized through various arguments in the heatmap() function.
One nice addition is to display the numerical values on the heat map by setting annot=True. If you choose to annotate
the rectangles, you may need to use the fmt= parameter to dictate the format of the annotation labels. Some common
formats are d for integers, f for floating point, and .2f for two places after the decimal point in a floating point number.
If you want a different color map, this can be set using the cmap argument and any matplotlib colormap you want. Below,
the annotation is turned on with the perceptually uniform viridis colormap. To further customize the colorbar, use the
cbar_kws= argument that takes a dictionary of parameters found on the matplotlib website. For example, to add a
label, use the label key and the label text is the dictionary value as shown below.
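Putting these options together might look like the sketch below. The DataFrame is rebuilt from the table above so that the cell is self-contained, and the colorbar label text is an assumption chosen for illustration.

```python
import pandas as pd
import seaborn as sns

# Bond dissociation energies (kJ/mol) from the table above
atoms = ['H', 'C', 'N', 'O', 'F']
bde = pd.DataFrame(
    [[436, 415, 390, 464, 569],
     [415, 345, 290, 350, 439],
     [390, 290, 160, 200, 270],
     [464, 350, 200, 140, 160],
     [569, 439, 270, 160, 160]],
    index=atoms, columns=atoms)

# annot=True writes each value in its cell; fmt='d' formats them as integers
ax = sns.heatmap(bde, annot=True, fmt='d', cmap='viridis',
                 cbar_kws={'label': 'Bond Dissociation Energy, kJ/mol'})
```

The fmt='d' format works here because the values are integers; for floating point data, a format such as .1f would be needed instead.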
[Figure: annotated heat map of the bond dissociation energies using the viridis colormap with a labeled colorbar]
Relational plots are a new addition to the seaborn library as of version 0.9 and include seaborn’s functions for scatter and
line plots. Of course, matplotlib does a nice job making scatter and line plots reasonably easy, but seaborn offers a few
extra ease-of-use improvements upon matplotlib that may be worth something to you depending upon your needs.
One difference between seaborn and matplotlib in generating scatter and line plots is that seaborn allows the user to
change the color, size, and marker styles of individual markers based on numerical values or text data. Matplotlib can
also change the color and size of the markers but only based on numerical values, and to change the marker style, the
plt.scatter() function needs to be called a second time. Seaborn allows this whole process in a single function call.
Below, we are using the periodic trends data (per) imported in section 10.5. We can start with plotting the
electronegativity (EN) versus the atomic radius (radius_pm) using the sns.scatterplot() function, which takes many of
the same basic arguments as plots we have seen so far with seaborn.
[Figure: scatter plot of EN versus radius_pm]
To modify the color, size, and marker style of the data points based on the data, use the hue, size, and style arguments. This
allows additional information to be infused into a single plot. Note that the legend automatically appears on the plot. In
addition, the colormap for the plot can be modified using the palette keyword argument and the name of any matplotlib
colormap.
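The sketch below shows one way these arguments might be combined. The periodic_trends.csv file is not reproduced here, so the cell uses a stand-in DataFrame with approximate literature values for Li through F and Na through Cl (Pauling electronegativities, first ionization energies, and covalent radii); these numbers are illustrative and are not the contents of the file.

```python
import pandas as pd
import seaborn as sns

# Stand-in data with approximate literature values (not the real CSV file)
per = pd.DataFrame({
    'EN': [0.98, 1.57, 2.04, 2.55, 3.04, 3.44, 3.98,
           0.93, 1.31, 1.61, 1.90, 2.19, 2.58, 3.16],
    'row': [2] * 7 + [3] * 7,
    'IE_kJ': [520, 900, 801, 1086, 1402, 1314, 1681,
              496, 738, 578, 787, 1012, 1000, 1251],
    'radius_pm': [128, 96, 84, 76, 71, 66, 57,
                  166, 141, 121, 111, 107, 105, 102],
})

# Color and size follow ionization energy; marker style follows the row
ax = sns.scatterplot(data=per, x='radius_pm', y='EN',
                     hue='IE_kJ', size='IE_kJ', style='row',
                     palette='viridis')
```

Mapping both hue and size to the same column is a design choice that reinforces one variable; they could just as well be mapped to two different columns.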
[Figure: scatter plot of EN versus radius_pm with a legend for IE_kJ (600 to 1600) and row (1, 2, 3)]
The lineplot() function in seaborn is somewhat similar to the plt.plot() function in matplotlib except it also
includes a number of extra features similar to those seen in other seaborn plotting functions. This includes the ability to
change the plotting color and style based on additional information, easy visualization of confidence intervals, automatic
generation of a legend, and others. To demonstrate the lineplot() function, we will import simulated kinetic data
for a first-order chemical reaction run seven times (i.e., runs 0 → 6).
kinetics = pd.read_csv('data/kinetic_runs.csv')
kinetics.head()
[Figure: line plot of [A], M versus Time, s with one colored line per run (0 to 6)]
The [A] was plotted versus Time, and the hue of each line was set to the Run number. The result is that each kinetic
run is shown in a separate color. If the user is not concerned so much with seeing the individual runs but instead wants to
see an average of each of the runs with some indication of the variation, the lineplot() function provides a default
95% confidence interval as is shown below.
[Figures: line plots of the averaged [A], M versus time with shaded confidence intervals]
Similar to a number of other Python libraries, seaborn brings with it datasets for users to experiment with. These
are loaded using the sns.load_dataset() function with the name of the dataset as the argument. Below
is a table describing a few of the available seaborn datasets. This list may change, so you can use the
sns.get_dataset_names() function to see the most current list.
Table 3 A Few Datasets Available in Seaborn
Name          Description
anscombe      Anscombe’s quartet data with four artificial datasets that exhibit the same mean, standard deviation, and linear regression among other statistical descriptors
car_crashes   Data on car crashes including mph above the speed limit among other information
exercise      Diet and exercise data
flights       Aircraft flight information including year, month, and number of passengers
iris          Ronald Fisher’s famous iris dataset used frequently in machine learning classification examples
planets       Information on discovered planets
tips          Restaurant information including bill total, tip, and information about the client
titanic       Titanic survivor dataset
Further Reading
Exercises
Complete the following exercises in a Jupyter notebook using the seaborn library. Any data file(s) referred to in the problems
can be found in the data folder in the same directory as this chapter’s Jupyter notebook. Alternatively, you can download
a zip file of the data for this chapter from here by selecting the appropriate chapter file and then clicking the Download
button.
1. Import the file linear_data.csv and visualize it using a regression plot.
2. Import the file titled ir_carbonyl.csv and visualize the carbonyl stretching frequencies using a seaborn categorical
plot. Represent the different molecules with different colors.
3. Import the file titled ir_carbonyl.csv containing carbonyl stretches of ketones and aldehydes.
a) Separate the ketones and aldehydes values into individual Series.
b) Visualize the distribution of both ketone and aldehyde carbonyl stretches using a kde plot.
4. Import the elements_data.csv file and generate a count plot showing the number of elements in each block of the
periodic table (i.e., s, p, d, f).
5. The following equation is Planck’s law, which describes the relationship between the radiation intensity ($M$) with
respect to wavelength ($\lambda$) and temperature ($T$).
$$M = \frac{2\pi h c^2}{\lambda^5 \left(e^{hc/\lambda k T} - 1\right)}$$
Import the data called [Link] containing intensities at various temperatures and wavelengths based on
Planck’s law. Generate a plot of intensity versus wavelength using the lineplot() function, and display the
different temperatures as different colors.
6. Import the file ionization_energies.csv showing the first four ionization energies for a number of elements. Plot
this grid of data as a heat map. Include labels in each cell using the annot= argument.
7. Import the file ROH_data_small.csv and visualize how boiling point (bp), molecular weight (MW), degree,
and whether a compound is aliphatic are correlated using a pairplot.
8. The following code generates the radial probability plot for hydrogen atomic orbitals for n = 1-4 (see section 3.1)
and determines the radius of maximum probability (see section 6.1.1). These values are combined into a pandas
DataFrame called max_prob where the rows are the principal quantum numbers and columns are the angular
quantum numbers. Display the DataFrame using a heatmap. Your heatmap should include numerical labels on
each colored block on the heatmap, and you should select a non-default, perceptually uniform colormap for your
colormap.
import numpy as np
import pandas as pd
import sympy
from sympy.physics.hydrogen import R_nl

R = sympy.symbols('R')
r = np.arange(0, 60, 0.01)
max_radii = []

for n in range(1, 5):
    shell_max_radii = []
    for l in range(0, n):
        psi = R_nl(n, l, R)
        f = sympy.lambdify(R, psi, 'numpy')
        # index of the maximum of the radial probability, psi^2 * r^2
        max_idx = np.argmax(f(r)**2 * r**2)
        # r has a step size of 0.01, so index / 100 is the radius
        shell_max_radii.append(max_idx / 100)
    max_radii.append(shell_max_radii)
CHAPTER 11: PLOTTING WITH ALTAIR
Matplotlib can create nearly any plot you may need, but it often requires numerous lines of code to generate the desired
result. Seaborn strives to remedy this by offering functions to create a series of common statistical plots in only a few lines
of code with excellent default colors and styles. Altair strives to be a middle ground by having the power of matplotlib
while requiring shorter code than matplotlib. In addition, Altair includes the ability to interact with the plots such as
panning, getting stats on highlighted data points, and informative dialogue boxes when hovering the cursor over a data
point. While Altair has other virtues, it is the interactive capabilities that will be given special attention in this chapter
along with teaching the basics of Altair plotting.
If you have Python installed on your machine, you can install Altair using pip, and if you are using Colab, Altair is already
installed. Altair is imported using the below command with the alt alias. Altair is designed to work with pandas, so
pandas needs to also be imported.
Altair has a number of renderers for displaying your plots with the default behavior using a JavaScript front end that
requires an internet connection. If you are working offline or do not want Altair to reach out to the internet to assist in
your plotting, the below command will make it work offline. There are other rendering options, but I find this works well
while still maintaining the interactivity of Altair plots.
alt.renderers.enable('jupyter', offline=True)
Note
Some graphs in this chapter are interactive in the web version of this book but are static in the PDF version.
In the following example, we will visualize ligand cone angle data from J. Am. Chem. Soc. 1975, 97, 7, 1955–1956 and
Chem. Rev. 1977, 77, 3, 313–348, so the data need to be loaded into a pandas DataFrame.
To generate a plot, we first need to create a Chart object using the Chart() function like below, which accepts a pandas
DataFrame. Most other customizing beyond this is done by chaining a series of methods onto the Chart object. The
Chart object then needs to be instructed how to represent data points using one of the mark methods. The table below
provides common options, but there are additional options on the Altair website.
Table 1 Common Altair Marker Methods
The marks are customizable by providing the mark method extra keyword parameters like those listed in Table 2.
Table 2 Select Mark Method Arguments
Below is a function call to make a scatter plot using the mark_point() method.
alt.Chart(ligands).mark_point()
Altair only returns a dot because no instructions were provided on how to represent the information. This final piece of
information is known as the encoding or encoding channel and is assigned using the encode() method. In the example
below, the cone angle is encoded or represented by the location on the x-axis using the x= parameter and carbonyl (i.e.,
M-C≡O) stretching frequency is encoded by the position on the y-axis using the y= parameter. Because the Chart
object already has the DataFrame, the x= and y= arguments only need the DataFrame column names.
alt.Chart(ligands).mark_point().encode(
    x='cone_angle',
    y='CO_freq')
By default, Altair includes zero on the axes, so it is necessary in this example to adjust the ranges for both axes. To adjust
the ranges, first replace the x= and y= shorthand notation with alt.X() and alt.Y() which gives more control.
alt.Chart(ligands).mark_point().encode(
    alt.X('cone_angle'),
    alt.Y('CO_freq')
)
Then add the scale() method with the domain= parameter to restrict the plotting domains.
alt.Chart(ligands).mark_point().encode(
    alt.X('cone_angle').scale(domain=[100, 200]),
    alt.Y('CO_freq').scale(domain=[2050, 2100])
)
Along with the x- and y-axis positions, information can be encoded using other visual indicators such as color, size, shape,
etc. Below is a table of some key encodings with others listed on the Altair website.
Table 3 Common Encoding Channels in Altair
Encoding                    Description
x or alt.X()                Position on x-axis
y or alt.Y()                Position on y-axis
color or alt.Color()        Marker color
shape or alt.Shape()        Marker shape
size or alt.Size()          Marker size
opacity or alt.Opacity()    Opacity of the marker
column or alt.Column()      Separates plots along x-axis
row or alt.Row()            Separates plots along y-axis
tooltip                     Dialogue box with information
For example, the chart below represents the ΔH values using the color and the type of ligand with the marker shape.
alt.Chart(ligands).mark_point().encode(
    alt.X('cone_angle').scale(domain=[100, 200]),
    alt.Y('CO_freq').scale(domain=[2050, 2100]),
    alt.Color('dH'),
    alt.Shape('type')
)
Another way to provide access to information is through a dialogue box using the tooltip= encoding parameter. Just
include a list or tuple of DataFrame column names to be included in the tooltip box. Below, the user will see a small
popup box with the ligand name, enthalpy, and carbonyl frequencies whenever they hover their cursor over the marker on
the plot.
alt.Chart(ligands).mark_point().encode(
    alt.X('cone_angle').scale(domain=[100, 200]),
    alt.Y('CO_freq').scale(domain=[2050, 2100]),
    alt.Color('dH'),
    alt.Shape('type'),
    tooltip=['ligand', 'dH', 'CO_freq']
)
We now have a fairly reasonable plot, but further customization is often necessary. For example, better axis labels with
units would be ideal and can be added using the title() method on each encoding channel. If you don’t like the
colormap, this can be set with the scheme= argument in the color encoding channel.
alt.Chart(ligands).mark_point().encode(
    alt.X('cone_angle').scale(domain=[100, 200]).title('Cone Angle (Degrees)'),
    alt.Y('CO_freq').scale(domain=[2050, 2100]).title('Carbonyl Frequency (1/cm)'),
    alt.Color('dH').scale(scheme='viridis').title('dH (kcal/mol)'),
    alt.Shape('type'),
    tooltip=['ligand', 'dH', 'CO_freq']
)
Tip
If you get an error while trying to save your plot, you may be missing an optional dependency. See Altair website for
installation instructions.
As a final step for our first Altair plot, we can save it using either the (…) menu on the top right or by using the save()
method. Like matplotlib, if no format is specified, Altair grabs this information from the extension (e.g., png, pdf, or svg)
in the file name.
c = alt.Chart(ligands).mark_point().encode(
    alt.X('cone_angle').scale(domain=[100, 200]).title('Cone Angle (Degrees)'),
    alt.Y('CO_freq').scale(domain=[2050, 2100]).title('Carbonyl Frequency (Wavenumbers)'),
    alt.Color('dH').scale(scheme='viridis').title('dH (kcal/mol)'),
    alt.Shape('type')
)
c.save('first_altair_plot.pdf', format='pdf')
One of the major advantages of Altair over seaborn and matplotlib is the ability to interact with plots in Altair. This can
take many forms, the most basic of which is the ability to pan and zoom. Enable panning and zooming by adding the
interactive() method to a Chart object. Now by dragging and scrolling, the user can pan and zoom the plot,
respectively. Double-click on the plot to reset it.
alt.Chart(ligands).mark_point().encode(
    alt.X('cone_angle').scale(domain=[100, 200]).title('Cone Angle (Degrees)'),
    alt.Y('CO_freq').scale(domain=[2050, 2100]).title('Carbonyl Frequency (1/cm)'),
    alt.Color('dH').scale(scheme='viridis').title('dH (kcal/mol)'),
    alt.Shape('type'),
    tooltip=['ligand', 'dH', 'CO_freq']
).interactive()
As an additional example, we will plot the IR spectrum of trans-cinnamaldehyde. Because this is a spectrum, a line plot
is the most appropriate.
Tip
There are often larger numbers of data points in spectral data. If you get an error due to too many data points,
you can add alt.data_transformers.disable_max_rows() to override this if need be. You may
also run alt.data_transformers.enable("vegafusion") which does some pre-calculations.
alt.data_transformers.enable("vegafusion")
alt.Chart(tcinn).mark_line().encode(
    x='Wavenumbers',
    y='Absorbance'
)
We now see our plot, but it’s a bit tiny for an IR spectrum. The plot size can be adjusted using the properties()
method and with the width= and height= arguments.
alt.Chart(tcinn).mark_line().encode(
    alt.X('Wavenumbers').scale(domain=(4000, 400)),
    y='Absorbance'
).properties(width=800, height=400)
The x-axis can be reversed by either listing the domain from high to low, as in domain=(4000, 400) above, or by setting
reverse=True in the scale() method. The plot below is also made interactive by again appending .interactive() and adding a tooltip.
The nice thing about making a spectrum interactive is the ability to pan, zoom, and identify the frequencies of various
absorbances.
Note
One disadvantage of Altair is that the axis labels do not currently support LaTeX formatting. For now, paste in
Unicode symbols whenever you need them.
alt.Chart(tcinn).mark_line().encode(
    alt.X('Wavenumbers').scale(domain=(4000, 400)).title('Wavenumbers (1/cm)'),
    y='Absorbance',
    tooltip=['Wavenumbers', 'Absorbance']
).properties(width=800, height=400).interactive()
Altair will make a best effort to guess the data type (Table 4) and plot the data appropriately. For example, if the data are
numerical values, Altair treats the values as continuous quantitative features, so it plots the data along a continuous axis
with markings anywhere along the axis. Alternatively, if the data are strings, Altair treats them as nominal data which is
categorical in no particular order.
Table 4 Altair Data Types
For example, when plotting the molecular weight (MW) versus hydrocarbon type (hydrocarbon), Altair automatically
treats the molecular weight as quantitative data and the hydrocarbon type as nominal data like below.
HC = pd.read_csv('data/[Link]')
HC.head()
bp MW EOU hydrocarbon
0 574.0 238.46 1 alkene
1 356.0 82.15 2 alkene
2 565.0 226.45 0 alkane
3 330.0 82.15 2 alkyne
4 457.0 156.31 0 alkane
alt.Chart(HC).mark_point().encode(
    alt.X('hydrocarbon'),
    alt.Y('MW').title('MW (g/mol)')
).properties(width=300)
The default data types can be overridden by either appending the data type abbreviation (Table 4) to the column name
string or by setting type= to one of the types. One common situation where this is necessary is when categories are
designated by numbers like in machine learning datasets. For example, the hydrocarbon data includes the elements of
unsaturation (EOU) for various hydrocarbons. Altair’s default behavior is to treat the EOU values as continuous and quantitative,
which leads to the following result.
alt.Chart(low_EOU).mark_point().encode(
    alt.X('EOU'),
    alt.Y('bp'))
This is not really what we want because there are non-integer markings and the 4 is up against the edge of the plot. If
we append :O to the EOU, this tells Altair to treat elements of unsaturation as ordinal values, which are ordered but not
continuous.
alt.Chart(low_EOU).mark_point().encode(
    alt.X('EOU:O').title('Elements of Unsaturation'),
    alt.Y('bp').title('bp (°C)')
).properties(width=300)
alt.Chart(low_EOU).mark_point().encode(
    alt.X('EOU', type='ordinal').title('Elements of Unsaturation'),
    alt.Y('bp').title('bp (°C)')
).properties(width=300)
Now we only get integer markings while the values are still in order.
The chart can be further customized, like changing the angle of the axis labels, colors, shape of markers, and making the
chart interactive.
alt.Chart(low_EOU).mark_point().encode(
    alt.X('EOU:O', axis=alt.Axis(labelAngle=0)).title('Elements of Unsaturation'),
    alt.Y('bp').title('bp (°C)'),
    alt.Color('MW').scale(scheme='viridis'),
    alt.Shape('hydrocarbon')
).properties(width=300).interactive()
Altair supports the display of faceted figures. While the figures could be created using separate code cells, there are
advantages to displaying them together (see section 11.5). In this example, we will display the density of degassed Coke
and Diet Coke using different types of glassware for measuring the volume.
soda = pd.read_csv('data/[Link]')
soda.head()
In the first example below, we use the Column encoding to represent the data for different glassware types. This results
in what looks like three different figures that share the same y-axis label. If we were to instead use Row encoding, the
three sections would instead be rows and share the same x-axis label.
alt.Chart(soda).mark_point().encode(
    alt.Y('Density').scale(domain=(0.9, 1.1)),
    x='Soda',
    column='Glassware',
    color='Glassware')
This figure is a bit narrow, so we can adjust the dimensions again using .properties(width=100), which sets the
width of each section of the graph.
alt.Chart(soda).mark_point().encode(
    alt.Y('Density').title('Density (g/mL)').scale(domain=(0.9, 1.1)),
    x='Soda',
    column='Glassware',
    color='Glassware').properties(width=100)
Another way to generate two or more figures or plots together is to concatenate or overlay them. This is accomplished
by assigning two different charts to variables and using either &, |, or + (Table 5). Alternatively, the functions in Table 5
can be used by providing them with the Chart objects.
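Table 5 Layered and Multifigured Plots
Operator Function Description
& alt.vconcat() Vertical concatenation
| alt.hconcat() Horizontal concatenation
+ alt.layer() Overlay of charts on the same axes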
Below, two scatter plots are created with density on the y-axis and different categories on the x-axes. The figures are then
horizontally concatenated.
chart1 = alt.Chart(soda).mark_point().encode(
    alt.Y('Density').title('Density (g/mL)').scale(domain=(0.9, 1.1)),
    x='Soda',
    color='Glassware').properties(width=250)
chart2 = alt.Chart(soda).mark_point().encode(
    alt.Y('Density').title('Density (g/mL)').scale(domain=(0.9, 1.1)),
    x='Glassware',
    color='Glassware').properties(width=250)
chart1 | chart2
We could instead perform vertical concatenation like below. This is more useful when one plot is narrow like a small bar
graph.
chart1 = alt.Chart(soda).mark_point().encode(
    alt.Y('Density').title('Density (g/mL)').scale(domain=(0.9, 1.1)),
    x='Soda',
    color='Glassware').properties(width=250)
chart2 = alt.Chart(soda).mark_point().encode(
    alt.Y('Density').title('Density (g/mL)').scale(domain=(0.9, 1.1)),
    x='Glassware',
    color='Glassware').properties(width=250)
# vertical concatenation
chart1 & chart2
The overlay option (+) is useful for plotting more than one type of plot on the same axes, like a line and scatter plot, as
we have often done in Chapter 3. An example of this is in the following section.
Another form of interactivity supported by Altair is to allow the user to select portions of a graph and see information
about the selection, such as averages, sums, and distributions. For this section, we will start by looking at a dataset with
alcohol machine learning features.
Below, the boiling point, molecular weight, degree, and whether the alcohol is cyclic (1) or non-cyclic (0) are visualized.
alt.Chart(ROH).mark_point().encode(
    alt.Y('bp').scale(domain=[300, 600]).title('bp (°C)'),
    alt.X('MW').scale(domain=[0, 200]).title('MW (g/mol)'),
    alt.Color('degree:O').scale(scheme='viridis'),
    alt.Shape('cyclic:N'))
Altair allows the user to box select data points by adding an interval selection parameter created with the
alt.selection_interval() function. This selection parameter is added to the Chart through the .add_params()
method. By default, this is a box selection, which allows the user to select a rectangle anywhere on the plot. If the
encodings=['x'] or encodings=['y'] argument is passed to the selection_interval() function, the
selection is restricted along the x- or y-axis, respectively.
# box selection
selection = alt.selection_interval()
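points = alt.Chart(ROH).mark_point().encode(
    alt.Y('bp').scale(domain=[300, 600]).title('bp (°C)'),
    alt.X('MW').scale(domain=[0,200]).title('MW (g/mol)'),
    alt.Color('degree:O').scale(scheme='viridis'),
    alt.Shape('cyclic:N')
).add_params(selection)
points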
# X selection object
selection = alt.selection_interval(encodings=['x'])
points = alt.Chart(ROH).mark_point().encode(
    alt.Y('bp').scale(domain=[300, 600]).title('bp (°C)'),
    alt.X('MW').scale(domain=[0,200]).title('MW (g/mol)'),
    alt.Color('degree:O').scale(scheme='viridis'),
    alt.Shape('cyclic:N')
).add_params(selection)
points
# Y selection object
selection = alt.selection_interval(encodings=['y'])
points = alt.Chart(ROH).mark_point().encode(
    alt.Y('bp').scale(domain=[300, 600]).title('bp (°C)'),
    alt.X('MW').scale(domain=[0,200]).title('MW (g/mol)'),
    alt.Color('degree:O').scale(scheme='viridis'),
    alt.Shape('cyclic:N')
).add_params(selection)
points
The user is now able to select regions of the Chart, which is stored in the selection variable. This does not really do
anything except make a gray box until this information is passed to another function. In the plot below, two Chart objects
are created - one scatter plot and one bar plot. These Charts are vertically concatenated using the & operator (last line).
The selection object is added to the scatter plot using add_params() while the selection object is provided to the bar
plot through the transform_filter() function. This setup makes it so the scatter plot is where the user selects
regions and the bar plot is the recipient of this selection information. Finally, notice that the bar plot x-variable contains
a count() function instead of a DataFrame column header. This processes the selection information and uses it for the
bar graph. Specifically, the bar graph here shows the total number of primary, secondary, and tertiary alcohols selected
in the scatter plot.
# scatter plot
scatter = alt.Chart(ROH).mark_point().encode(
    alt.Y('bp').scale(domain=[300, 600]).title('bp (°C)'),
    alt.X('MW').scale(domain=[0,200]).title('MW (g/mol)'),
    alt.Color('degree:O').scale(scheme='viridis'),
    alt.Shape('cyclic:N')
).add_params(selection)
# bar plot
bar = alt.Chart(ROH).mark_bar().encode(
    x='count()',
    y='degree:O',
    color='degree:O'
).transform_filter(selection)
# vertically concatenate the two charts
scatter & bar
Another example below is a bar graph of the radial probability of the hydrogen 3p atomic orbital. Like above, there are
two Chart objects - one bar graph and a rule or line that spans the entire Chart. Instead of stacking these Charts, they
are overlaid using the + operator. The bar graph is provided with the selection object through the add_params()
method, allowing the user to select regions in this Chart. The rule Chart accepts the selection through the
transform_filter() method, making it the recipient of the selection information. Similar to the above example, the
y-axis is given a function, mean(), which takes the average of the selected probabilities and sets the horizontal rule to
this value. The end result is a bar plot where the user can select a region and see a horizontal line marking the average
probability of the selected region.
prob = pd.read_csv('data/prob_3p_normalized.csv')
selection = alt.selection_interval(encodings=['x'])
bar = alt.Chart(prob).mark_bar().encode(
    x=alt.X('Radius').title('Radius (Bohrs)'),
    y=alt.Y('Probability').title('Probability'),
).add_params(selection)
rule = alt.Chart(prob).mark_rule(color='firebrick').encode(
    y='mean(Probability)',
    size=alt.value(3)
).transform_filter(selection)
bar + rule
Below is a modified version of the previous graphic where instead of taking the mean of the selected region, the sum
is calculated. This effectively allows the user to graphically integrate different regions of the graph. For example, by
selecting the region just below the node, it can be seen that this region constitutes a little over 10% of the probability.
The two Charts are overlaid using the alt.layer() function instead of the + operator to allow more control. This
allows a second y-axis to be added, which shows the sum of the selected probabilities. The colors of the two y-axis labels
are also set to match the two elements in the plot.
Finally, the above plot can be converted from a bar graph to a line plot by changing mark_bar() to mark_line().
selection = alt.selection_interval(encodings=['x'])
bar = alt.Chart(prob).mark_line().encode(
    x=alt.X('Radius').title('Radius (Bohrs)'),
    y=alt.Y('Probability').title('Probability'),
).add_params(selection)
rule = alt.Chart(prob).mark_rule(color='firebrick').encode(
    y=alt.Y('sum(Probability)').scale(domain=(0, 1)),
    size=alt.value(3)
).transform_filter(
    selection
)
# layer the two charts and give each its own y-axis
alt.layer(bar, rule).resolve_scale(y='independent')
Further Reading
The best source of up-to-date information on Altair is the Altair website. Because Altair is newer than matplotlib and
seaborn, there are fewer resources currently available.
1. Altair website. [Link] (free resource)
Official Altair website and documentation page.
CHAPTER 12: NUCLEAR MAGNETIC RESONANCE WITH NMRGLUE & NMRSIM
Nuclear magnetic resonance (NMR) spectroscopy is one of the most common and powerful analytical methods used in
modern chemistry. Up to this point, we have been primarily dealing with text-based data files - that is, files that can be
opened with a text editor and still contain human-comprehensible information. If you open most files that come out of
an NMR instrument in a text editor, they will look more like gibberish than anything a human should be able to read. This
is because they are binary files - the data are stored as raw bytes rather than as encoded text characters.
We need a specialized module to be able to import and read these data, and luckily, a Python library called nmrglue does
exactly this. The library contains modules for dealing with data from each of the major NMR spectroscopy file types,
which include Bruker, Pipe, Sparky, and Varian. It does not read JEOL files, but as of this writing, JEOL spectrometers
support exporting data into at least one of the above file types supported by nmrglue, and direct support for JEOL files is
under development.
In addition, it is also sometimes helpful to be able to simulate NMR spectra to confirm spectral parameters (e.g., coupling
constants), visualize hypothetical spectra of splitting patterns, or fit the line shapes or splitting patterns of experimental
data. The library nmrsim provides the ability to simulate NMR spectra, including dynamic NMR, and is introduced in
section 12.2.
Currently, nmrglue is not included with the default installation of Anaconda or Miniconda, so you will need to install it
separately. Instructions are included on the nmrglue documentation page, or you can use pip to install it. If Jupyterlab is
installed on your computer, you should be able to install it through the terminal using pip install nmrglue, and
if you are using Google Colab, you should include !pip install nmrglue in the first code cell of the notebook
(see section 0.2). nmrglue requires you to have NumPy and SciPy installed, and matplotlib should also be installed for
visualization.
All use of code below assumes the following imports with aliases. nmrglue is not a major library in the SciPy ecosystem, so
the ng alias is not a strong convention but is used here for convenience and to be consistent with the online documentation.
import nmrglue as ng
import numpy as np
import matplotlib.pyplot as plt
The general procedure for collecting NMR data is to excite a given type of NMR-active nuclei with a radio-frequency pulse
and allow them to relax. As they precess, their rotation leads to a voltage oscillation in the instrument at characteristic
frequencies, and the spectrometer records these oscillations as a free induction decay (FID) depicted below (Figure 1, left).
It is the frequency of these oscillations that we are interested in because they are informative to a trained chemist as to the
chemical environment of the nuclei. One challenge is that all the different signals from each of the nuclei are stacked on
top of each other, making it difficult to distinguish one from the other or to determine the wave frequency. This is similar
to the problem of a computer discerning a single instrument in an entire orchestra playing at once. Fortunately, there is a
mathematical operation called the Fourier transform that converts the FID into a graph showing all of the different
frequencies (Figure 1, right). This is what is known as converting the time domain to the frequency domain.
Figure 1 Raw NMR spectroscopy data is converted from the time domain (left) to the frequency domain (right) using a
Fourier transform.
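To make the time-to-frequency conversion concrete, here is a minimal sketch that builds a synthetic FID and Fourier transforms it with NumPy. The acquisition parameters, signal frequencies, and decay constant are made-up values for illustration only and do not correspond to the data used in this chapter.

```python
import numpy as np

# Hypothetical acquisition parameters chosen only for this illustration
n_points = 4096            # number of complex points in the FID
sw = 2000.0                # sweep (spectral) width in Hz
dwell = 1.0 / sw           # time between points in seconds
t = np.arange(n_points) * dwell

# Two made-up resonances as (frequency in Hz, relative amplitude)
signals = [(250.0, 1.0), (-600.0, 0.5)]
decay = np.exp(-t / 0.2)   # exponential decay with a 0.2 s time constant

# The FID is the sum of all decaying oscillations stacked on top of each other
fid = sum(a * np.exp(2j * np.pi * f * t) for f, a in signals) * decay

# Fourier transform to the frequency domain and build the matching frequency axis
spectrum = np.fft.fftshift(np.fft.fft(fid))
freq = np.fft.fftshift(np.fft.fftfreq(n_points, d=dwell))

# The tallest peak in the real part sits at the strongest input frequency
strongest = freq[np.argmax(spectrum.real)]
print(strongest)
```

Although the two oscillations are indistinguishable by eye in the FID, each appears as its own peak in the transformed spectrum.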
The general steps for dealing with NMR spectroscopic data in Python are outlined below.
1. Load the FID data into a NumPy array using nmrglue.
2. Fourier transform the data to the frequency domain.
3. Phase the spectrum.
4. Reference the spectrum.
5. Measure the chemical shifts and integrals of the peaks.
The importing of data using nmrglue is performed by the read function from one of the submodules shown in Table 1.
Additional modules can be found in the nmrglue documentation. The choice of module is dictated by the data file type.
Table 1 Examples of nmrglue Modules
Module Description
bruker Bruker data as a single file
pipe Pipe data as a single file with an .fid extension
sparky Sparky NMR file format with .ucsf extension
varian Varian/Agilent data as a folder of data with an .fid extension
jcampdx JCAMP-DX files with .dx or .jdx extensions
The read() function loads the NMR file and returns a tuple containing a dictionary of metadata and data in a NumPy
array. The dictionary includes information required to complete the processing of the NMR data. Looking at the NMR data
shown below, you may notice each point includes both real and imaginary components (i.e., the mathematical
terms with j). Both are necessary for phasing the spectrum later on, so don’t discard any of the data.
array([-0.00194889-0.00471539j, -0.00192186-0.00472489j,
-0.00191337-0.00473085j, ..., -0.00189737+0.00591656j,
-0.00191882+0.005872j , -0.00191135+0.00587132j],
shape=(13107,), dtype=complex64)
Note
The data used in this demo was already Fourier transformed on the spectrometer, so the following cell reverses
this process for demo purposes. Some spectrometers automatically Fourier transform the data while others do
not.
# Reversed the Fourier transform for demo purposes being as this data
# was collected on a spectrometer that already Fourier transformed the data.
The dictionary, dic, above contains a very long list of values, and the dictionary keys can be different among different
file formats. To maintain a shorter, more useful, and more consistent dictionary of metadata, nmrglue provides the
guess_udic() function for generating a universal dictionary among all file formats.
{'ndim': 1,
0: {'sw': 5994.65478515625,
'complex': True,
'obs': 399.7821960449219,
'car': 1998.9109802246094,
'size': 13107,
'label': 'Proton',
'encoding': 'direct',
'time': False,
'freq': True}}
Note
In NMR spectroscopy, “1D NMR” is actually two-dimensional while “2D NMR” is actually three-dimensional.
The universal dictionary is a nested dictionary. The first key is ndim which provides the number of dimensions in the
NMR spectrum. Most NMR spectra are one-dimensional, but two-dimensional is also fairly common. Subsequent key(s)
are for each dimension in the NMR spectrum with the value as a nested dictionary of metadata. Because the data for
the above spectrum is one-dimensional, there is only one nested dictionary. Table 2 below provides a description of each
piece of metadata contained in the universal dictionary.
* That is, it is assumed that we are looking at single dimensions from the NMR data, so for example, we are looking at
udic[0].
** Being that the data must be in either the frequency or time domain, the freq and time keywords effectively provide
the same information.
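As a small illustration of navigating this nested structure, the sketch below copies the universal dictionary printed above into a plain Python dictionary and loops over its dimensions. The spectral width in ppm is computed here only as an example of combining two entries; it is not a value stored in the dictionary.

```python
# Universal dictionary copied from the output shown above
udic = {'ndim': 1,
        0: {'sw': 5994.65478515625,
            'complex': True,
            'obs': 399.7821960449219,
            'car': 1998.9109802246094,
            'size': 13107,
            'label': 'Proton',
            'encoding': 'direct',
            'time': False,
            'freq': True}}

# One nested dictionary exists per dimension, keyed by the integers 0, 1, ...
for dim in range(udic['ndim']):
    meta = udic[dim]
    domain = 'time' if meta['time'] else 'frequency'
    width_ppm = meta['sw'] / meta['obs']   # Hz divided by MHz gives ppm
    print(f"dimension {dim}: {meta['label']}, {meta['size']} points, "
          f"{domain} domain, {width_ppm:.2f} ppm wide")
```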
When the data is first imported, it is often in the time domain. You can confirm this by checking that the time value in
the udic is set to True like below.
udic[0]['time']
To convert the data to the frequency domain, we will use the fast Fourier transform function (fft) from the SciPy fft
module (scipy.fft). nmrglue also contains Fourier transform functions, but we will use SciPy here. The plot below inverts the x-axis
with plt.gca().invert_xaxis() to conform to NMR plotting conventions.
When you plot the Fourier transformed data, you may get a ComplexWarning message because the Fourier
transform returns complex values (i.e., values with real and imaginary components). To only work with the real
components, use the .real attribute as is done above. The plot now looks more like an NMR spectrum, but most of the
resonances are out of phase. The next step is to phase the spectrum.
Phasing is the post-processing procedure for making all peaks point upward as shown in Figure 2. There is more to it
than taking the absolute value as that would not always generate a single peak, so nmrglue contains a series of functions
for phasing spectra.
Figure 2 Phasing an NMR spectrum results in all the signals pointing in the positive direction.
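The idea behind phasing can be sketched with NumPy alone. The example below assumes the common linear phase model, where a zero-order angle (p0) is applied to every point and a first-order angle (p1) grows linearly across the spectrum. The function apply_phase() and the synthetic peak are hypothetical and written for illustration; this is not nmrglue's own code.

```python
import numpy as np

def apply_phase(spectrum, p0=0.0, p1=0.0):
    """Rotate a complex spectrum by zero-order (p0) and first-order (p1) phases in degrees."""
    n = spectrum.shape[-1]
    angle = np.deg2rad(p0 + p1 * np.arange(n) / n)
    return spectrum * np.exp(1j * angle)

# Ideal complex Lorentzian: the real part is the upright (absorptive) peak
freq = np.linspace(-100, 100, 2001)   # Hz, made-up axis
hwhm = 2.0                            # half width at half maximum in Hz
ideal = hwhm / (hwhm + 1j * freq)

# Simulate a peak that is 60 degrees out of phase, then correct it
misphased = apply_phase(ideal, p0=-60)
corrected = apply_phase(misphased, p0=60)

# Compare the width at half height (in points) of the phased peak
# with that of the magnitude (absolute value) spectrum
width_real = np.count_nonzero(corrected.real > 0.5)
width_abs = np.count_nonzero(np.abs(misphased) > 0.5)
print(width_real, width_abs)
```

The out-of-phase peak dips below zero on one side, and simply taking the absolute value yields a noticeably broader line than the properly phased real component, which is one reason a phase correction is preferred.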
[Link] Autophasing
The simplest method to phase your NMR spectrum is to allow the autophasing function to handle it for you. Below is the
function which takes the data and the phasing algorithm as the arguments.
ng.process.proc_autophase.autops(data, algorithm)
The permitted phasing algorithms can be either acme or peak_minima. It is important to feed the autops() function
the data array with both the real and imaginary components.
You should try both algorithms to see which works best for you. The above spectrum is the result of the acme autophasing
algorithm, which is close but still slightly off. If neither of the provided autophasing algorithms works for you, you will
need to instead manually phase the NMR spectrum as discussed below.
Manually phasing the NMR spectrum is a two-step process. First, you need to call the manual_ps() phasing function
and adjust the p0 and p1 sliders until the spectrum appears phased.
After closing the window, the function will return values for p0 and p1 that you found to properly phase the spectrum.
Second, input those p0 and p1 values into the ps() phasing function to actually phase the spectrum.
fig3 = plt.figure(figsize=(16,6))
ax3 = fig3.add_subplot(1,1,1)
ax3.plot(phased_data.real)
plt.gca().invert_xaxis()
You can then plot the phased_data to get your NMR spectrum with all the peaks pointing upward.
Even though the NMR spectrum is now phased, it is unlikely to be properly referenced. That is, the peaks are not
currently located at the correct chemical shift. Referencing is often performed by knowing the accepted chemical shifts of
the solvent resonances or an internal standard (e.g., tetramethylsilane, TMS) and adjusting the spectrum by a correction
factor. Currently, we are plotting our data against the index of each data point, so first we need to create a frequency
scaled x-axis as an array followed by adjusting the location of the spectrum so that it is properly referenced.
The x-axis is the frequency scale, so this axis is sometimes presented in hertz (Hz). However, because the frequency
of NMR resonances depends upon the instrument field strength, the same sample will exhibit different frequencies in
different instruments. To make the frequency axis independent of the spectrometer field strength, NMR spectra are often
presented on a ppm scale, which is the frequency difference (Hz) between the observed resonance and a reference standard
divided by the spectrometer frequency (MHz) at which that particular nucleus is observed.
This makes the locations of the peaks consistent from spectrometer to spectrometer no matter the strength of the magnet.
This is where the udic from section 12.1.1 is important because we can obtain the observed frequency (MHz), the sweep
width (Hz) of the spectrum, and the size of the data. The latter is how many data points are in the spectrum, which is important
so that we avoid a plotting error (we all know the one: ValueError: x and y must have same first
dimension,...). If any of the values from the udic are 999.99, this means the spectrometer did not record this
piece of information and you will need to find it elsewhere.
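One way to build such an axis is sketched below using the sw, obs, and size values from the universal dictionary shown earlier. The sketch assumes the axis simply starts at zero and spans the sweep width, so the result is an unreferenced ppm scale; the variable names are illustrative and this is not necessarily the exact code behind the plot that follows.

```python
import numpy as np

# Values taken from the universal dictionary shown earlier in this chapter
sw = 5994.65478515625      # sweep width in Hz
obs = 399.7821960449219    # observed (spectrometer) frequency in MHz
size = 13107               # number of data points in the spectrum

# Frequency axis in Hz with exactly one value per data point,
# which avoids the x/y length mismatch error when plotting
hz = np.linspace(0, sw, size)

# Dividing Hz by MHz gives parts per million
ppm = hz / obs
print(ppm[0], ppm[-1])
```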
fig4 = plt.figure(figsize=(16,6))
ax4 = fig4.add_subplot(1,1,1)
ax4.plot(ppm, phased_data.real)
ax4.set_xlabel('Chemical Shift, ppm')
plt.gca().invert_xaxis()
[Figure: NMR spectrum, x-axis: Chemical Shift, ppm]
Alternatively, nmrglue contains an object called a unit conversion object that can be created and used to convert between
ppm, Hz, and point index values for any position in an NMR spectrum. To create a unit conversion object, use the
make_uc() function which takes two arguments – the dictionary, dic, and the original data array, data, generated
from reading the NMR file in section 12.1.1.
Note
If you are using a different NMR file format than pipe, change pipe to the appropriate format from Table 1.
unit_conv = ng.pipe.make_uc(dic, data)
ppm = unit_conv.ppm_scale()
The last line of the above code generates an array of ppm values required for the x-axis to plot the NMR data.
uc = ng.pipe.make_uc(dic, data)
ppm_scale = uc.ppm_scale()
phased_data_rev = phased_data.real[::-1]
[Figure: NMR spectrum, x-axis: Chemical Shift, ppm]
The following example uses the ppm scale generated by the unit conversion object.
In the above spectrum, the small resonance at 0.08 ppm is the internal TMS (tetramethylsilane) standard, which should be
located at 0.00 ppm. The temptation is to subtract 0.08 ppm from the x-axis, but the spectrum is not simply moved over
but instead is rolled. That is, as the spectrum is moved, some of it disappears off one end and reappears on the other
(Figure 3).
Figure 3 Referencing an NMR spectrum is performed by rolling it until the peaks reside at the correct shifts. As a signal
falls off one end of the spectrum, it reappears at the other end.
Conveniently for us, NumPy has a function np.roll() that does exactly this to array data, and nmrglue contains its
own ng.process.proc_base.roll() function for this task which calls the NumPy function. Feel free to use
either one.
np.roll(array, shift)
The np.roll() function takes two required arguments. The first is the array containing the data and the second is the
amount to shift or roll the data. The shift is not in ppm but rather positions in the data array. If you know your referencing
correction in ppm (Δppm), use the following equation which describes the relationship between the correction in ppm
(Δppm) and the correction in number of data points (Δpoints). The size is the number of points in a spectrum, obs
is the observed carrier frequency in MHz, and sw is the sweep width in Hz. These values are all available from the universal
dictionary.
$$\Delta_{\text{points}} = \frac{\Delta_{\text{ppm}} \times \text{obs} \times \text{size}}{\text{sw}}$$
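As a worked example of this conversion, the sketch below turns the 0.08 ppm TMS offset into a whole number of data points using the relationship Δpoints = Δppm × obs × size / sw (multiplying by obs converts ppm to Hz, and size / sw is the number of points per Hz). The values come from the universal dictionary shown earlier. The sign of the shift depends on how your data array is ordered, so treat the direction here as an assumption to verify against your own plot.

```python
import numpy as np

# Values from the universal dictionary shown earlier in this chapter
sw = 5994.65478515625      # sweep width in Hz
obs = 399.7821960449219    # observed frequency in MHz
size = 13107               # number of data points

delta_ppm = 0.08                          # referencing correction in ppm
delta_hz = delta_ppm * obs                # ppm times MHz gives Hz
delta_points = int(round(delta_hz * size / sw))
print(delta_points)

# np.roll() wraps values that fall off one end around to the other end
demo = np.arange(6)
rolled = np.roll(demo, 2)
print(rolled)   # [4 5 0 1 2 3]
```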
[Figure: NMR spectrum, x-axis: Chemical Shift, ppm]
If you want to narrow the plot to where the resonances are located, you can use the plt.xlim(8,0) function. Notice
that 8 is first to indicate that the plot is from 8 ppm → 0 ppm. The use of plt.xlim(8,0) removes the need to use
plt.gca().invert_xaxis() to flip the x-axis.
12.1.5 Integration
Integration of the area under the peaks can be performed using either integration functions from the scipy.integrate
module or through nmrglue’s integration function(s). Because the integration function in nmrglue supports
limit values in the ppm scale, it is probably the most convenient and is demonstrated below.
The integration is performed using the integrate() function below where data is your NMR data as a NumPy
array, the conv_obj is an nmrglue unit conversion object (see section [Link]), and limits is a list or array of limits for
integration.
ng.analysis.integration.integrate(data, conv_obj, limits)
uc = ng.pipe.make_uc(dic, phased_data_rev)
[Figure: NMR spectrum with integration limits, x-axis: Chemical Shift, ppm]
The limits are in ppm, so take a look at the spectrum above and decide where you want to put the integration limits. An
NMR spectrum is shown above with the chosen integration limits represented as vertical red lines.
Now to integrate our NMR spectrum.
area = ng.analysis.integration.integrate(data_ref.real, uc, limits)
area
These values are probably not what you expected, but if we divide all of them by the smallest value, it is easier to see the
relative ratio of areas.
ratio = area / np.min(area)
ratio
array([2.47065636, 1.55163197, 1. ])
The spectrum above is the 1H NMR spectrum of ethylbenzene in CDCl3, which has five aromatic protons, and the other two
resonances should have three and two protons. If we do some math to make the integrations total to ten protons and round
to the nearest integer, we get 5:3:2. There is a small amount of error likely due to the solvent resonance (CHCl3, 7.27
ppm) being included in the integration of the aromatic protons among other things.
10 / np.sum(ratio) * ratio
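The same integrate-then-normalize workflow can be sketched with NumPy alone on a synthetic spectrum. The peak positions, widths, and integration limits below are made up for illustration, and the trapezoid-rule helper integrate_region() is a hypothetical stand-in rather than nmrglue's integration function.

```python
import numpy as np

def lorentzian(x, center, area, hwhm=0.005):
    """Lorentzian line whose total area equals the requested value."""
    return area * (hwhm / np.pi) / ((x - center) ** 2 + hwhm ** 2)

def integrate_region(x, y, low, high):
    """Trapezoid-rule area under y between the x limits low and high."""
    mask = (x >= low) & (x <= high)
    xs, ys = x[mask], y[mask]
    return np.sum((ys[1:] + ys[:-1]) / 2 * np.diff(xs))

# Synthetic spectrum with three signals in a 5:2:3 area ratio
ppm = np.linspace(0, 10, 20001)
spectrum = (lorentzian(ppm, 7.2, 5)
            + lorentzian(ppm, 2.6, 2)
            + lorentzian(ppm, 1.2, 3))

# Integration limits in ppm, one (low, high) pair per signal
limits = [(7.0, 7.4), (2.4, 2.8), (1.0, 1.4)]
areas = np.array([integrate_region(ppm, spectrum, lo, hi) for lo, hi in limits])

# Scale the areas so they total ten protons
protons = 10 / np.sum(areas) * areas
print(np.round(protons, 2))
```

Because Lorentzian tails extend beyond any finite limits, the raw areas fall slightly short of the true values, but the normalized ratios still round to whole numbers of protons.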
Another piece of information that is commonly extracted from NMR spectra is the chemical shift of the resonances.
Similar to integration, SciPy contains functions such as [Link]() or [Link].
find_peaks() that can find peaks in spectra, but again, nmrglue contains a function, below, designed for the task
of locating peaks in NMR spectra.
ng.peakpick.pick(data, pthres=)
There are numerous optional arguments for the peak picking function, but the two mandatory pieces of information
required are the data array and a positive threshold (pthres=) above which any peak will be identified. Glancing
at the spectrum below, all peaks are above 0.1 (green dotted line) and the baseline is below 0.1, so this seems like a
reasonable threshold.
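To illustrate what a positive threshold does, here is a deliberately simplified peak picker written with NumPy. It flags local maxima that rise above the threshold, which is far less sophisticated than nmrglue's algorithm; the function pick_peaks() and the synthetic spectrum are hypothetical and exist only for this sketch.

```python
import numpy as np

def pick_peaks(y, threshold):
    """Return indices of local maxima in y that are above the threshold."""
    interior = y[1:-1]
    is_peak = (interior > y[:-2]) & (interior >= y[2:]) & (interior > threshold)
    return np.flatnonzero(is_peak) + 1   # +1 because the first point was skipped

def lorentzian(x, center, height, hwhm=0.01):
    """Lorentzian line with the given peak height."""
    return height * hwhm ** 2 / ((x - center) ** 2 + hwhm ** 2)

# Synthetic spectrum: three real signals plus one small bump below the threshold
ppm = np.linspace(0, 10, 10001)
spectrum = (lorentzian(ppm, 7.2, 1.0)
            + lorentzian(ppm, 2.6, 0.4)
            + lorentzian(ppm, 1.2, 0.6)
            + lorentzian(ppm, 5.0, 0.05))

idx = pick_peaks(spectrum, threshold=0.1)
shifts = ppm[idx]     # index the ppm axis to convert indices to chemical shifts
print(shifts)
```

The small bump at 5.0 ppm is a local maximum but is ignored because it never rises above the threshold, which is the same role pthres plays when excluding baseline noise.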
[Figure: NMR spectrum with the 0.1 threshold marked as a green dotted line, x-axis: Chemical Shift, ppm]
Note
The ng.peakpick.pick() function does not work with NumPy versions 1.24 and later if you are using
a version of nmrglue before 0.10. Consider upgrading your version of nmrglue if ng.peakpick.pick()
raises an error.
The output of this function is an array of tuples with each tuple containing information about an identified peak. From
this, we can already tell there are four peaks identified. Each tuple contains an index for the peak, a peak number, a line
width of the peak, and an estimate of the areas of each peak. We can use the index values to index the ppm array for the
chemical shifts.
peak_loc = []
for x in peaks:
    # the first element of each tuple is the index of the peak
    peak_loc.append(ppm_scale[int(x[0])])
print(peak_loc)
[Figure: NMR spectrum with picked peaks marked by vertical dotted lines, x-axis: Chemical Shift, ppm]
We can plot the NMR spectrum with these chemical shifts marked with vertical dotted lines shown above. Looks like
it did a pretty good job locating the resonances! If nmrglue fails to properly identify the peaks, there are a number of
parameters described in the nmrglue documentation that can be adjusted.
nmrsim is a Python package for simulating NMR spectra based on information such as the chemical shifts, coupling
constants, and number of coupling nuclei. The package is capable of simulating individual first-order and second-order
splitting patterns or entire NMR spectra. It can also simulate dynamic NMR caused by nuclei rapidly exchanging. nmrsim
is installable using pip. The package has a few key functions listed below (Table 3) for simulating first-order multiplets,
spin systems, and spectra. The Multiplet() function is used to simulate a single, first-order resonance such as a 1:2:1
triplet or a doublet-of-doublets while the SpinSystem() function simulates two resonance signals belonging to pairs
of coupled nuclei. The Spectrum() function can generate entire spectra by merging the resonances generated by other
functions.
Note
nmrsim is still in beta, so significant changes to the library may occur in future updates.
Function Description
Multiplet() Simulates a single, first-order multiplet
SpinSystem() Simulates sets of first- or second-order multiplets generated by coupled nuclei
Spectrum() Simulates first-order spectra
As an example, we can simulate the signal of methylene (i.e., -CH2-) protons in CH3-CH2-CH-. Let us assume that the
methyl/methylene protons have a coupling constant of J = 7.8 Hz, and the methine/methylene protons have a coupling
constant of J = 6.1 Hz. First, we need to import the Multiplet() function along with the mplplot() plotting
function. The Multiplet() function takes the resonance frequency in Hz (v) as the first positional argument followed
by the intensity (I) of the resonance signal. This can simply be the number of nuclei the signal represents and is only
really important when generating entire spectra with multiple signals so that signals that represent more nuclei have a
larger area. Finally, the coupling constant (J)/number of nuclei (n_nuc) pairs are provided as a list of tuples, list of lists, or
2D array.
The Multiplet() function generates a Multiplet object which can produce a peak list using the peaklist()
method. The peak list is simply a list of tuples with (v, I) pairs for each peak in the multiplet.
<nmrsim._classes.Multiplet at 0x10d581730>
mult_peaks = [Link]()
mult_peaks
[(485.25000000000006, 0.125),
(491.3500000000001, 0.125),
(493.05, 0.375),
(499.15000000000003, 0.375),
(500.84999999999997, 0.375),
(506.95, 0.375),
(508.6499999999999, 0.125),
(514.7499999999999, 0.125)]
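The peak list above can be reproduced from first principles, which helps show what a first-order multiplet means. The sketch below splits every line into a doublet once per coupled spin-1/2 nucleus and then merges lines that coincide. It assumes a center frequency of 500 Hz and an intensity of 2 (two methylene protons), both inferred from the printed peak list; first_order_multiplet() is a hypothetical helper and not nmrsim's implementation.

```python
def first_order_multiplet(v, I, couplings):
    """Return sorted (frequency, intensity) pairs for a first-order multiplet.

    couplings is a list of (J, n_nuc) pairs, where n_nuc equivalent
    spin-1/2 nuclei each couple to the signal with coupling constant J in Hz.
    """
    peaks = [(v, I)]
    for J, n_nuc in couplings:
        for _ in range(n_nuc):
            # each coupled nucleus splits every existing line into a doublet
            split = []
            for freq, inten in peaks:
                split.append((freq - J / 2, inten / 2))
                split.append((freq + J / 2, inten / 2))
            peaks = split
    # merge lines that land on the same frequency (e.g., the 3s in 1:3:3:1)
    merged = {}
    for freq, inten in peaks:
        key = round(freq, 6)
        merged[key] = merged.get(key, 0.0) + inten
    return sorted(merged.items())

# methylene signal coupled to three methyl protons and one methine proton
peaks = first_order_multiplet(500.0, 2, [(7.8, 3), (6.1, 1)])
for freq, inten in peaks:
    print(f"{freq:.2f} Hz  {inten:.3f}")
```

The result is a quartet of doublets with eight lines whose intensities sum to the intensity supplied for the signal.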
Next, we need to visualize this data. For this, nmrsim provides multiple plotting functions built off of matplotlib. We will
focus on the mplplot() function, which accepts the peaklist and generates the line shapes for the actual peaks.
There are a number of optional, keyword arguments such as line width (w), y-axis limits (y_min and y_max), x-axis
limits (limits), and the number of points in the multiplet (points). The mplplot() function will return the x- and
y-coordinates for the plot. To suppress this, either end the line with a ; or give it a pair of variables to store these data.
Tip
If the splitting pattern does not look quite right, consider increasing the number of points because undersampling
can lead to anomalous-looking signals.
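The effect of undersampling can be seen with a small NumPy sketch that turns the peak list from above into a line shape by summing one Lorentzian per peak. The function lorentzian_sum() is a hypothetical helper written for illustration; it treats w as the full width at half height in Hz, which is an assumption here and may not match how nmrsim builds its line shapes internally.

```python
import numpy as np

def lorentzian_sum(x, peaklist, w=0.5):
    """Sum one Lorentzian per (frequency, intensity) pair.

    w is the full width at half height in Hz, and each line's height
    equals the intensity listed for that peak.
    """
    hwhm = w / 2
    y = np.zeros_like(x)
    for v, inten in peaklist:
        y += inten * hwhm ** 2 / ((x - v) ** 2 + hwhm ** 2)
    return y

# Peak list from the multiplet above, rounded for readability
peaklist = [(485.25, 0.125), (491.35, 0.125), (493.05, 0.375), (499.15, 0.375),
            (500.85, 0.375), (506.95, 0.375), (508.65, 0.125), (514.75, 0.125)]

# The same multiplet sampled with too few points and with plenty of points
x_coarse = np.linspace(440, 560, 100)
x_fine = np.linspace(440, 560, 5000)
y_coarse = lorentzian_sum(x_coarse, peaklist)
y_fine = lorentzian_sum(x_fine, peaklist)

print(y_coarse.max(), y_fine.max())
```

With only 100 points, the grid spacing is wider than the peaks themselves, so the sampled points miss the tops of the lines and the multiplet appears with the wrong heights.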
Below is the same splitting pattern with the line width tripled.
As another option, we can overlay the multiplet with lines showing the exact chemical shift and intensity ratio of each
peak. This can be done either using your plotting library of choice or using the mplplot_stick() function in nmrsim.
Below, the intensity of the stem plot is reduced to one fifth to keep the lines inside the blue splitting pattern.
peaks = np.array(mult_peaks)
plt.plot(freq, intens)
plt.stem(peaks[:,0], peaks[:,1]/5, linefmt='C1', basefmt=' ', markerfmt=' ')
Entire NMR spectra can be simulated from the component resonance signals - either Multiplet or SpinSystem objects.
Below, we simulate the signals for the methyl, methylene (stored in the ethyl variable), and -OH protons of ethanol with
J = 7.3 Hz. Because the -OH peak is broader due to exchange, the width of the resonance is increased by setting w=3. The three resonances are
then combined into a single spectrum using the Spectrum() function which accepts the resonances in a list and also
optionally accepts minimum (vmin=) and maximum (vmax=) frequency limits for the spectrum in Hz.
Tip
A spectrum can also be created by adding the resonance signals together with the + operator like below.
spec = methyl + ethyl + OH
from nmrsim import Multiplet, Spectrum

# create resonances
methyl = Multiplet(492, 3, [(7.3, 2)])
ethyl = Multiplet(1480, 2, [(7.3, 3)])
OH = Multiplet(1020, 1, [], w=3)
# build spectrum
spec = Spectrum([methyl, ethyl, OH], vmin=0, vmax=1600)
v_spec, I_spec = spec.lineshape(points=4000)
# convert Hz to ppm (assumes a 400 MHz spectrometer, consistent with the 0-4 ppm axis of the plot)
v_spec_ppm = np.asarray(v_spec) / 400
plt.figure(figsize=(12, 5))
plt.plot(v_spec_ppm, I_spec, linewidth=0.8)
plt.xlabel('Chemical Shift, ppm')
plt.gca().invert_xaxis()
[Figure: simulated NMR spectrum, x-axis: Chemical Shift, ppm]
The simulation even exhibits the second-order roofing effect where coupled resonances ‘lean’ towards each other.
Tip
The SpinSystem() function simulates second-order signals by default (second_order=True); a first-order
simulation can be requested by setting second_order=False.
nmrsim is capable of simulating second-order splitting patterns using the following functions (Table 4). The name of
each function is based on the Pople notation where letters adjacent to each other in the alphabet represent resonances that
are near each other in a spectrum (e.g., A and B), letters far apart in the alphabet represent resonances further apart in
the spectrum (e.g., A and X), the same letter is used to represent chemically equivalent nuclei, and primes are used to
differentiate chemically equivalent nuclei that are magnetically nonequivalent (e.g., A and A’).
Table 4 Second-Order Simulation Functions
Function Description
AB() Simulates an AB system
AB2() Simulates an AB2 system
ABX() Simulates an ABX system
ABX3() Simulates an ABX3 system
AAXX() Simulates an AA’XX’ system
AABB() Simulates an AA’BB’ system
These functions typically accept the coupling constants (e.g., Jab=), the distance between the two nuclei (e.g., Vab=),
and the chemical shift of the signal in Hz (Vcentr=). As a demonstration, below we will simulate an AB spin system
where the two nuclei are coupled with J=10.0 Hz and separated by 9.0 Hz.
from nmrsim.discrete import AB
[Figure: simulated AB pattern for J=10.0 Hz and a 9.0 Hz separation; x-axis runs from 1980 to 1860 Hz.]
If we increase the distance between the two nuclei to 30.0 Hz, not only do the two signals become further apart, but
the second-order character, unevenness in this case, decreases. It is important to note that when measuring the distance
between two second-order signals like this, the chemical shift of a doublet with uneven heights is not the midpoint of the doublet
but rather approximately the intensity-weighted average frequency of the two peaks. This means the chemical shift of a doublet
is closer to the larger of the two peaks in the doublet.
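The trend described above can be checked numerically. The following is a minimal sketch, not nmrsim's own code, that uses the standard closed-form expressions for an AB quartet; the function name ab_quartet() and its argument names are hypothetical and chosen only for this illustration.

```python
import math

def ab_quartet(J, dv, v_center):
    """Return [(frequency, relative intensity), ...] for an AB quartet.

    J: coupling constant (Hz); dv: chemical shift separation (Hz);
    v_center: midpoint of the whole pattern (Hz).
    """
    c = math.sqrt(dv**2 + J**2)   # separation between the midpoints of the two doublets
    outer = 1 - J / c             # relative intensity of the two outer lines
    inner = 1 + J / c             # relative intensity of the two inner lines
    return [
        (v_center - (c + J) / 2, outer),
        (v_center - (c - J) / 2, inner),
        (v_center + (c - J) / 2, inner),
        (v_center + (c + J) / 2, outer),
    ]

# compare the two separations used in this section
for dv in (9.0, 30.0):
    peaks = ab_quartet(J=10.0, dv=dv, v_center=1918.0)
    ratio = peaks[1][1] / peaks[0][1]   # inner / outer intensity
    print(f"separation = {dv:4.1f} Hz, inner/outer intensity ratio = {ratio:.2f}")
```

The inner-to-outer intensity ratio is larger for the 9.0 Hz separation than for the 30.0 Hz separation, which is the decrease in roofing seen in the two simulated spectra.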
res = AB(10, 30, 1918)
mplplot(res);
[Figure: simulated AB pattern for J=10.0 Hz and a 30.0 Hz separation; x-axis runs from 1980 to 1860 Hz.]
Nuclei in some molecules can exchange with each other at observable rates. At lower temperatures, the exchange is
relatively slow, leading to two distinct and reasonably sharp signals representing the two environments of the exchanging
nuclei. As the temperature is increased, the exchange becomes more rapid, causing the two signals to broaden and
become closer until they merge into a single peak and ultimately sharpen. There are two dynamic NMR functions in
the nmrsim.dnmr module: the dnmr_two_singlets() function simulates two exchanging nuclei (or groups
of chemically equivalent nuclei) that are not coupled to each other, while the dnmr_AB() function simulates two
exchanging nuclei that are coupled to each other. Below, we will simulate two non-coupled singlet signals exchanging with
each other. The required arguments are the chemical shift frequencies of the two nuclei during slow exchange (va and
vb), the exchange rate constant in Hz (k), the half-height width of the peaks at slow exchange (wa and wb), and the
fraction of the nuclei in position a (pa). Optionally, you can specify the frequency limits for the generated line shape
(limits=) and number of data points (points=).
[Figure: simulated line shape for two exchanging singlets; x-axis runs from 360 to 500 Hz.]
Further Reading
Exercises
Complete the following exercises in a Jupyter notebook using the NMRglue library. Any data file(s) referred to in the problems
can be found in the data folder in the same directory as this chapter’s Jupyter notebook. Alternatively, you can download
a zip file of the data for this chapter from here by selecting the appropriate chapter file and then clicking the Download
button.
1. Open the ¹H NMR spectrum of ethanol, EtOH_1H_NMR.fid, taken in CDCl₃ with TMS using NMRglue. Use
the pipe module.
a) Plot the resulting spectrum and be sure to properly reference it if not done already.
b) Integrate the methyl (-CH₃) versus the methylene (-CH₂-) resonances and calculate the ratio.
2. Open the ¹H and ¹³C NMR spectra of 2-ethyl-1-hexanol, 2-ethyl-1-hexanol_1H_NMR_CDCl3.fid and
2-ethyl-1-hexanol_13C_NMR_CDCl3.fid, in CDCl₃ with TMS and plot them on a ppm scale. Be sure to properly phase
and reference the spectra if not done already. Use the pipe module.
3. Simulate a first-order doublet of triplets with J=5.6 Hz and J=9.2 Hz, respectively.
4. Select an article from the Journal of Organic Chemistry or some other journal and simulate an NMR spectrum with
coupling (e.g., not ¹³C{¹H}) based on data listed in the experimental section. Note: some articles are free to access
even if you do not have a subscription. Just access the most recent issue, and the free articles are marked “Open
Access” in ACS journals.
5. Simulate a second-order AA’BB’ spin system with J_AA′ = 15.0 Hz, J_BB′ = 15.0 Hz, J_AB = 7.0 Hz, J_AB′ = 7.0 Hz,
and a separation of 27.0 Hz. Compare your simulation to what is shown in Hans Reich’s figure (first set of NMR
spectra on the page).
CHAPTER 13: MACHINE LEARNING USING SCIKIT-LEARN
Machine learning is a hot topic with popular applications in driverless cars, internet search engines, and data analysis
among many others. Numerous fields are utilizing machine learning, and chemistry is certainly no exception, with papers
using machine learning methods being published regularly. There is a considerable amount of hype around the topic along
with debate about whether the field will live up to this hype. However, there is little doubt that machine learning is making
a significant impact and is a powerful tool when used properly.
Machine learning occurs when a program exhibits behavior that is not explicitly programmed but rather is “learned” from
data. This definition may seem somewhat unsatisfying because it is so broad that it is vague and only mildly informative.
Perhaps a better way of explaining machine learning is through an example. In section 13.1, we are faced with the challenge
of writing a program that can accurately predict the boiling point of simple alcohols when provided with information about
the alcohols, such as the molecular weight, number of carbon atoms, degree, etc. These pieces of information about each
alcohol are known as features, while the answer we aim to predict (i.e., boiling point) is the target. How can each feature
be used to predict the target? To generate a program for predicting boiling points, we would need to pore over the data
to see how each feature affects the boiling point. Next, we would need to write a script that somehow uses these trends to
calculate the boiling points of alcohols we have never seen. This probably appears like a daunting task. Instead, we can use
machine learning to solve this task by allowing the machine learning algorithms to figure out how to use the data and make
predictions. Simply provide the machine learning algorithm with the features and targets on a number of alcohols and
allow the machine learning algorithm to quantify the trends and develop a function to predict the boiling point of alcohols.
In simple situations, this entire task can be completed in just a few minutes! The sections in this chapter are broken
down by types of machine learning. There are three major branches of machine learning: supervised, unsupervised, and
reinforcement learning. This chapter will focus on the first two, which are the most applicable to chemistry and data
science, while the latter relates more to robotics and is not as commonly employed in chemistry.
There are multiple machine learning libraries for Python, but one of the most common, general-purpose machine learning
libraries is scikit-learn. This library is simple to use, offers a wide array of common machine learning algorithms, and
is installed by default with Anaconda. As you advance in machine learning, you may find it necessary to branch out to
other libraries, but you will probably find that scikit-learn does almost everything you need it to do during your first year
or two of using machine learning. In addition, scikit-learn includes functions for preprocessing data and evaluating the
effectiveness of models.
The scikit-learn library is abbreviated sklearn during imports. Each module needs to be imported individually, so you
will see them imported throughout this chapter. We will be working with data and visualizing our results, so we will also
be utilizing pandas, NumPy, and matplotlib. This chapter assumes the following imports.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Supervised learning is where the machine learning algorithms are provided with both feature and target information with
the goal of developing a model to predict targets based on the features. When the supervised machine learning predictions
are looking to categorize an item like a photo or type of metal complex, it is known as classification; and when the
predictions are seeking a numerical value from a continuous range, it is a regression problem. Some machine learning
algorithms are designed for only classification or only regression while others can do either.
There are numerous algorithms for supervised learning; below are simple examples employing some well-known and
common algorithms. For a more in-depth coverage of the different machine learning algorithms and scikit-learn, see the
Further Reading section at the end of this chapter.
The file titled ROH_data.csv contains information on over seventy simple alcohols (i.e., a single -OH with no other
non-hydrocarbon functional groups) including their boiling points. Our goal is to generate a function or algorithm to predict the
boiling points of the alcohols based on the information on the alcohols, so here the target is the boiling point and features
are the other information about the alcohols.
The dataset includes the boiling point (K), molecular weight (g/mol), number of carbon atoms, whether or not it is
aliphatic, degree, whether it is cyclic, and the average position of any aryl substituents. Scikit-learn requires that all
features be represented numerically, so for the True/False features (aliphatic and cyclic) 1 represents True and 0 represents False.
Not every feature will be equally helpful in predicting the boiling points. Chemical intuition may lead someone to propose
that the molecular weight will have a relatively large impact on the boiling points, and the scatter plot below supports this
prediction with boiling points increasing with molecular weight. However, the molecular weight alone is not enough to
obtain a good boiling point prediction as there is as much as a one-hundred-degree variation in boiling points at around
the same molecular weight. The color of the markers indicates the degree of the alcohol, and it is pretty clear that tertiary
alcohols tend to have lower boiling points than primary and secondary alcohols, which means there is a small amount of
information in the degree that can be used to improve a boiling point prediction. If all the small amounts of information
from each feature are combined, there is potential to produce a better boiling point prediction, and machine learning
algorithms do exactly this.
[Figure: scatter plot of boiling point (bp, K) versus molecular weight (MW, g/mol) with markers colored by the degree of the alcohol (1 to 3).]
Whenever training a machine learning model to make predictions, it is important to evaluate the accuracy of the predictions.
It is unfair to test an algorithm on data it has already seen, so before training a model, first split the dataset into a
training subset and a testing subset. It is also important to shuffle the dataset before splitting it as many datasets are at
least partially ordered. The alcohol dataset is roughly in order of molecular weight, so if an algorithm is trained on the
first three-quarters of the dataset and then tested on the last quarter, training occurs on smaller alcohols and testing on
larger alcohols. This could result in poorer predictions as the machine learning algorithm is not familiar with the trends
of larger alcohols. The good news is that scikit-learn provides a built-in function for shuffling and splitting the dataset
known as train_test_split(). The arguments are the features, target, and the fraction of the dataset to be used
for testing. Below, a quarter of the dataset is allotted for testing (test_size=0.25).
Tip
The train_test_split() function randomly shuffles the dataset before splitting it, so the split differs
each time the function is called. The random_state= argument can be used to produce fixed results
for example or demo purposes.
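from sklearn.model_selection import train_test_split
target = ROH['bp']
features = ROH[['MW', 'carbons', 'degree', 'aliphatic',
                'avg_aryl_position', 'cyclic']]
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25)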
The output includes four values containing the training/testing features and targets. By convention, X contains the features
and y are the target values because they are the independent and dependent variables, respectively; and the features variable
is capitalized because it contains multiple values per alcohol.
Tip
Another variable name convention is to capitalize variables that contain a collection and use lowercase letters for
single values. For example, a single 𝑥-value in a plot would be x while a list containing multiple 𝑥-values would
be X.
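As a brief illustration of the split and of the random_state= argument, the sketch below uses a small synthetic array rather than the alcohol dataset; the array contents and the seed value 42 are arbitrary choices made only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in data: 20 samples with 3 features each (not the alcohol dataset)
X = np.arange(60).reshape(20, 3)
y = np.arange(20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)   # 15 training rows and 5 testing rows

# repeating the call with the same random_state reproduces the same split
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(np.array_equal(y_test, y_test2))
```

Leaving out random_state= would give a different selection of testing rows on each call, which is usually what we want outside of demonstrations.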
Now for some machine learning using a very simple linear regression model. This model treats the target value as a linear
combination or weighted sum of the features where 𝑥 are the features and 𝑤 are the weights.
𝑡𝑎𝑟𝑔𝑒𝑡 = 𝑤₀𝑥₀ + 𝑤₁𝑥₁ + 𝑤₂𝑥₂ + 𝑤₃𝑥₃ + 𝑤₄𝑥₄ + 𝑤₅𝑥₅ + ...
The general procedure for supervised machine learning, regardless of model, usually includes three steps.
1. Create a model and attach it to a variable.
2. Train the model with the training data.
3. Evaluate the model using the testing data or use it to make predictions.
To implement these steps, the linear model from the linear_model module is first created with the
LinearRegression() function and assigned to the variable reg. Next, it is trained using the fit() method and the training
data from above.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)
LinearRegression()
Finally, the trained model can make predictions using the predict() method.
prediction = reg.predict(X_test)
prediction
Remember that the algorithm has been only provided the features for the testing subset; it has never seen the y_test
target data. The performance can be assessed by plotting the predictions against the true values.
[Figure: true boiling point (True bp, K) plotted against predicted boiling point (Predicted bp, K) for the testing subset.]
This is a substantial improvement from using only the molecular weight to make predictions! If the above code is run
again, the results will likely vary because the train_test_split() function randomly splits the dataset, so each
time the above code is run, the algorithm is trained and tested on different portions of the original dataset.
It is important to evaluate the effectiveness of trained machine learning models before rolling them out for widespread
use, and scikit-learn provides multiple built-in functions to help in this task. The first is the score() method. Instead
of making predictions using the testing features and then plotting the predictions against the known values, the score()
method takes in the testing features and target values and returns the 𝑟². The closer the 𝑟² value is to 1, the better the
predictions are.
reg.score(X_test, y_test)
0.9738116533899365
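To make the meaning of this score concrete, below is a minimal sketch of the coefficient of determination calculated by hand. The function name r_squared() and the boiling point values are made up for illustration and are not taken from the alcohol dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

true_bp = [351.0, 370.0, 390.0, 411.0]      # made-up boiling points, K
good_pred = [352.0, 369.0, 391.0, 410.0]    # each prediction is off by 1 K
mean_pred = [380.5, 380.5, 380.5, 380.5]    # always predicts the mean of true_bp

print(r_squared(true_bp, good_pred))   # close to 1
print(r_squared(true_bp, mean_pred))   # 0: no better than guessing the mean
```

A model that always predicts the average target value scores zero, so an 𝑟² near 1 indicates that the features explain most of the variation in the boiling points.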
Another tool for evaluating the efficacy of a machine learning algorithm is k-fold cross-validation. The prediction results
will vary depending on how the dataset is randomly split into training and testing data. K-fold cross-validation compensates
for this randomness by splitting the entire dataset into k (k being some number) chunks called folds. It then reserves one
fold as the testing fold and trains the algorithm on the rest. The algorithm is tested using the testing fold, and the process
is repeated with a different fold reserved for testing (Figure 1). Each iteration trains a fresh algorithm, so it does not
remember anything from the previous train/test iteration. The results for each iteration are provided at the end of this
process.
Figure 1 In each iteration of k-fold cross-validation, different folds of data are used for training and testing the algorithm.
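The fold bookkeeping in Figure 1 can be sketched in plain Python. This is only an illustration of the idea, with a hypothetical helper k_fold_indices(); in practice, scikit-learn's model_selection module provides splitters such as KFold for this purpose.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # shuffle once before splitting
    folds = [indices[i::k] for i in range(k)]     # k folds of nearly equal size
    for i, test in enumerate(folds):
        # every fold except the i-th is used for training
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_indices(n_samples=10, k=5):
    print("train:", sorted(train), "test:", sorted(test))
```

Every sample lands in the testing fold exactly once. Note that ShuffleSplit() draws an independent random split on each iteration instead, so its testing subsets may overlap between iterations.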
A demonstration of k-fold cross-validation is shown below. First, a cross-validation generator is created using the
ShuffleSplit() function. This function shuffles the data to avoid having all similar alcohols in any particular fold. The
linear model is then provided to the cross_val_score() function along with the feature and target data and the
cross-validation generator.
from sklearn.model_selection import ShuffleSplit, cross_val_score
splitter = ShuffleSplit(n_splits=5)
reg = linear_model.LinearRegression()
scores = cross_val_score(reg, features, target, cv=splitter)
The scores are the 𝑟² values for each iteration. The average 𝑟² is a pretty reasonable assessment of the efficacy of the
model and can be found through the mean() method.
scores.mean()
np.float64(0.9572255422323902)
Recall that the linear model calculates the boiling point based on a weighted sum of the features, so it can be informative to
know the weights to see which features are the most influential in making the predictions. The trained LinearRegression
model has the attribute coef_, which provides these coefficients in a NumPy array.
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)
reg.coef_
These coefficients correspond to molecular weight, number of carbons, degree, whether or not it is aliphatic, average
aryl position, and whether or not it is cyclic, respectively. While some coefficients are larger than others, we cannot yet
distinguish which features are more important than the others because the values for each feature occur in different ranges.
This is because the coefficients are not only proportional to the predictive value of a feature but also inversely proportional
to the magnitude of feature values. For example, while the molecular mass has greater predictive value than the degree,
the degree has a larger coefficient because it occurs in a smaller range (1 → 3) than the molecular weights (32.04 →
186.33 g/mol).
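The dependence of a coefficient on the scale of its feature can be seen in a small least-squares sketch. The data below are synthetic and noise-free, and the weights (10 and 0.5) are arbitrary values chosen for this illustration; NumPy's lstsq() is used here as a simple stand-in for the linear model.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(1, 3, size=50)       # feature with a small range
x2 = rng.uniform(30, 190, size=50)    # feature with a large range
y = 10.0 * x1 + 0.5 * x2              # noise-free synthetic target

A = np.column_stack([x1, x2])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)            # close to [10, 0.5]

# express the second feature in units that are 100 times larger
A_rescaled = np.column_stack([x1, x2 / 100])
w_rescaled, *_ = np.linalg.lstsq(A_rescaled, y, rcond=None)
print(w_rescaled)   # close to [10, 50]
```

Shrinking the values of the second feature by a factor of 100 inflates its coefficient by the same factor even though its predictive value is unchanged, which is why raw coefficients cannot be compared across features with different ranges.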
To address this issue, the scikit-learn sklearn.preprocessing module provides a selection of functions for scaling the
features to the same range. Three common feature scaling functions are described in Table 1, but others are detailed on
the scikit-learn website.
Table 1 Preprocessing Data Scaling Functions
Scaler Description
MinMaxScaler Scales the features to a designated range; defaults to [0, 1]
StandardScaler Centers the features around zero and scales them to a variance of one
RobustScaler Centers the features around zero using the median and sets the range using the quartiles; similar to StandardScaler except less affected by outliers
For this data, we will use the MinMaxScaler() with the default scaling of values from 0 → 1. This process parallels
the fit/predict procedure above except that instead of predicting the target, the algorithm transforms the features. That is, first the
algorithm learns about the data using the fit() method followed by scaling the data using the transform() method.
Once the scaling model is trained, it can be used to scale any new data by the same amount as the original data.
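With the default [0, 1] range, min-max scaling maps each value to (x − min)/(max − min) within its own feature. The sketch below applies this formula by hand; the helper name min_max_scale() and the sample values are illustrative only.

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1] (assumes the values are not all equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

degree = [1, 2, 3, 1, 3]                       # small-range feature
mw = [32.04, 74.12, 102.17, 130.23, 186.33]    # large-range feature, g/mol

print(min_max_scale(degree))
print(min_max_scale(mw))
```

After scaling, both features span the same 0 to 1 range, so neither dominates simply because of its units.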
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(features)
scaled_features = scaler.transform(features)
With the features now scaled, we can proceed through training the linear regression model as we have done previously
and examine the coefficients.
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)
LinearRegression()
reg.coef_
It is quite clear from the coefficients that the molecular weight and number of carbons are by far the most important
features for predicting the boiling points of alcohols. This makes chemical sense, being that larger molecules have greater
London dispersion forces, thus increasing the boiling points.
Classification involves sorting items into discrete categories such as sorting alcohols, aldehydes/ketones, and amines by
type based on features. Scikit-learn provides a number of algorithms designed for this type of task. One method is known
as a decision tree (Figure 2, left), which sorts items into categories based on a series of conditions. For example, it might
first sort chemicals based on which have degrees of unsaturation greater than zero because these are most likely to be the
aldehydes and ketones. It will then take the samples with zero degrees of unsaturation, which are the alcohols and amines,
and separate them through another condition based on other information about the chemical compounds. Decision trees
are relatively simple and easily interpreted, but they tend not to perform particularly well in practice. An extension of
the decision tree is the random forest (Figure 2, right), which trains a larger number of decision trees using different
subsets of the training data, resulting in large numbers of different decision trees. Each decision tree is used to predict
the category, and the final prediction is based on the majority prediction of all the trees. Random forests tend to be more
accurate than a single decision tree because even if every tree is only slightly better than random at making an accurate
prediction, large numbers of decision trees have a much higher probability of making a correct prediction because of the
law of large numbers.
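The benefit of majority voting can be estimated with the binomial distribution. The sketch below makes two simplifying assumptions that do not strictly hold for a real random forest: there are only two categories, and every tree votes independently with the same accuracy p. The function name majority_correct() is hypothetical.

```python
from math import comb

def majority_correct(n_trees, p):
    """Probability that more than half of n_trees independent voters are correct."""
    need = n_trees // 2 + 1   # smallest number of correct votes that forms a majority
    return sum(comb(n_trees, k) * p**k * (1 - p)**(n_trees - k)
               for k in range(need, n_trees + 1))

# each tree is only slightly better than random (55% accurate)
for n in (1, 11, 101, 501):
    print(n, round(majority_correct(n, 0.55), 3))
```

Under these assumptions, the probability of a correct majority vote climbs steadily toward 1 as trees are added, even though each individual tree remains a weak predictor.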
Figure 2 An illustration of a single decision tree (left) and a random forest (right) composed of numerous decision trees
generated with different subsections of data.
To demonstrate classification, we will use a small dataset containing 122 monofunctional organic compounds from three
different categories: alcohols (category 0), ketones/aldehydes (category 1), and amines (category 2). The features provided
are the molecular weight, number of carbons, boiling point, whether it is cyclic, whether it is aromatic, and the unsaturation
number. All the data is represented numerically, so the data is ready to be used.
data = pd.read_csv('data/org_comp.csv')
[Link]
0 0 455 94.11 6 1 1 3
1 0 475 108.14 7 1 1 3
2 0 475 108.14 7 1 1 3
3 0 464 108.14 7 1 1 3
4 0 474 122.17 8 1 1 3
.. ... ... ... .. ... ... ...
117 2 498 135.21 9 1 1 3
118 2 407 99.17 6 1 0 1
119 2 381 85.15 5 1 0 1
120 2 327 113.20 7 1 0 1
121 2 463 127.23 8 1 0 1
target = data['class']
features = data.drop('class', axis=1)
Now that we have our data, the classification process is similar to the regression example above: first perform a train/test
split, initiate the model, train the model, and then test it.
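from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf.predict(X_test)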
array([1, 0, 0, 0, 0, 2, 1, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0,
1, 2, 0, 1, 0, 2, 2, 0, 2])
We now have predictions for our testing data, but it would be helpful to know how accurate these predictions are. Again,
there is the score() method that can calculate the fraction of accurately predicted functional groups.
rf.score(X_test, y_test)
0.7419354838709677
The above score shows that the predictions are about 74% accurate. However, with three possible categories, this number
does not tell the whole story because it does not inform us as to where the errors are occurring. For this, we will use a
confusion matrix which is a grid of predicted categories versus true categories.
array([[11, 0, 1],
[ 1, 4, 0],
[ 6, 0, 8]])
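The layout of a confusion matrix is easy to check on a tiny example. When scikit-learn's confusion_matrix() function is called as confusion_matrix(y_true, y_pred), each row corresponds to a true category and each column to a predicted category. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 2, 2, 2]   # made-up true categories
y_pred = [0, 0, 2, 1, 1, 0, 0, 2]   # made-up predictions

cm = confusion_matrix(y_true, y_pred)
print(cm)

# row 2, column 0: items that truly belong to category 2 but were predicted as category 0
print(cm[2, 0])
```

Here two of the three category 2 items were mistaken for category 0, so the value in the bottom left corner is 2, and the diagonal holds the five correct predictions.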
Each row is the true category and each column is the predicted category, but it is difficult to interpret the confusion matrix
without labels. We can use seaborn’s heatmap() function (see section 10.6) to produce a clearer representation.
[Figure: heatmap of the confusion matrix above with categories 0, 1, and 2 labeled along both axes.]
Every value in the diagonal has the same predicted category as the true value, making them correct predictions, whereas
anything off diagonal is an incorrect prediction. For example, the bottom left corner shows that six instances were predicted
as category 0 but really belong to category 2. Examination of the confusion matrix shows that the most common erroneous
prediction is a category 0. This could be due to, for example, the fact that alcohols and amines both tend to have degrees
Another major class of machine learning is unsupervised learning where no target value is provided to the machine learn-
ing algorithm. Unsupervised learning seeks to find patterns in the data instead of making predictions. One form of
unsupervised problem is dimensionality reduction where the number of features is condensed down to typically two or
three features while maintaining as much information as possible. Another unsupervised learning task is clustering where
the algorithm attempts to group similar items in a dataset. Because no target label is available, the algorithm does not
know what each group contains; it only knows that the data fall into a pattern of cohesive groups. Blind signal separation
(BSS) is a third unsupervised task introduced below where the algorithm attempts to pull apart mixed signals into their
components without knowledge of the components. One application of BSS is extracting the spectra of pure compounds
from spectra containing a mixture of chemical compounds.
We will first address dimensionality reduction, which typically condenses features down to two or three dimensions be-
cause it is often used in the visualization of high-dimensional data. To demonstrate this task, we will use scikit-learn’s
datasets module, which contains datasets along with data-generating functions. We will use the wine classification
dataset that includes 178 samples of three different types of wines, which we will classify based on features such as
alcohol content, hue, malic acid, etc.
To load the wine dataset, we first need to import the load_wine() function and then call the function.
The data is now stored as a dictionary-style object in the variable wine, with the features stored under the key data and
targets stored under target.
wine.data
wine.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2])
Notice again that every data point, including the category, is a number because scikit-learn requires that all data be
numerically encoded. We can get a full listing of the keys using the keys() method shown below. Most keys are
self-explanatory except for the DESCR, which provides a description of the dataset for those who are interested.
wine.keys()
We will store the features and target values in variables for use in the next section.
features = wine.data
target = wine.target
Below is a list of thirteen features in the wine dataset, which is too many to represent in a single plot, so it needs to be
pared down to two or three.
wine.feature_names
['alcohol',
'malic_acid',
'ash',
'alcalinity_of_ash',
'magnesium',
'total_phenols',
'flavanoids',
'nonflavanoid_phenols',
'proanthocyanins',
'color_intensity',
'hue',
'od280/od315_of_diluted_wines',
'proline']
Inevitably, some information will be lost by representing high-dimensionality data in lower dimensions, but the algo-
rithms in scikit-learn are designed to preserve as much information as possible. Among the most common algorithms
is principal component analysis (PCA), which determines the axes of greatest variation in the dataset known as principal
components. The first principal component is the axis of greatest variation, the second principal component is the axis
of the second greatest variation, and so on. Every subsequent principal component is also orthogonal to the previous
principal components.
As a simplified example, below is a dataset containing only two features. The axis of greatest variation slopes down and to
the right, shown with a longer solid line, making this the first principal component. The second principal component is the
axis of second greatest variation perpendicular to the first axis shown as a dotted line. If the data had a third dimension,
the third principal component would come directly out of the page orthogonal to the first two principal components. Each
data point is then represented by its relationship to the principal component axes. That is, the principal components are
the new Cartesian axes. This may seem trivial with only two features, but it allows high-dimensional data to be reasonably
represented in only two or three dimensions while preserving as much information as possible.
Figure 3 Principal components are axes of greatest variation of a dataset in feature space. The first principal component
(solid line) is the axis of greatest variation while the second principal component (dotted line) is the axis of second greatest
variation orthogonal to the first.
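For a two-feature dataset like the one in the figure, the principal components can be found as the eigenvectors of the covariance matrix. The sketch below uses synthetic data generated with an arbitrary seed and a negative slope; it is meant only to illustrate the geometry and is not how scikit-learn's PCA is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
# second feature decreases as the first increases, plus a little noise
data = np.column_stack([x, -0.5 * x + 0.1 * rng.normal(size=200)])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder from greatest to least variation
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1, pc2 = eigvecs[:, 0], eigvecs[:, 1]
print("variance along each component:", eigvals)
print("dot product of the components:", np.dot(pc1, pc2))
```

The first component carries most of the variance and slopes down and to the right, while the dot product of the two components is zero to within floating-point error, confirming that they are orthogonal.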
The PCA algorithm is provided in the decomposition module of scikit-learn. Unsupervised learning procedures are
similar to those of supervised learning except that there is no reason to split the data into training and testing sets, and
instead of making predictions, the trained algorithm is used to transform the data. The general process is outlined below.
1. Create a model attached to a variable.
2. Train the model with the fit() method using all of the data.
3. Modify the data using the transform() method.
Principal component analysis is sensitive to the scale of features, so before we proceed, we will scale the features using
the StandardScaler() function introduced in section 13.1.5.
from sklearn.preprocessing import StandardScaler
SS = StandardScaler()
features_ss = SS.fit_transform(features)
When creating the PCA model, a number of arguments can be provided. Most are beyond the scope of this chapter, but the
one you should focus on is n_components=, where the user provides the number of principal components desired. In
this case, we will obtain two principal components because it is the easiest to visualize.
from sklearn.decomposition import PCA
(178, 2)
The result is a two-dimensional array where each column represents a principal component. We can plot these components
against each other and color the markers based on the class.
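The full sequence of loading, scaling, and transforming the wine data can be sketched as follows. The variable names pca and components are assumptions made for this example; the explained_variance_ratio_ attribute is included as an optional check on how much variation each component retains.

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

wine = load_wine()
features_ss = StandardScaler().fit_transform(wine.data)

pca = PCA(n_components=2)
components = pca.fit_transform(features_ss)   # fit and transform in a single step

print(components.shape)                  # (178, 2)
print(pca.explained_variance_ratio_)     # fraction of the variance kept by each component
```

The first column of components can then be plotted against the second, with wine.target supplying the marker colors.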
[Figure: scatter plot of the first principal component versus the second, with markers colored by wine class.]
We can see that the three categories of wine all form cohesive clusters with classes 0 and 2 being well resolved and class 1
exhibiting slight overlap with the other two classes of wine. This suggests that we should have better luck distinguishing
between classes 0 and 2 than between either of these classes and class 1.
13.2.4 Clustering
Clustering involves grouping similar items in a dataset, and this can be performed with a number of algorithms including
k-means, agglomerative clustering, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) among
others. This process is somewhat similar to classification except that no labels are provided, so the algorithm does not
know anything about the groups and must rely on the similarity of samples. Here we will use the DBSCAN clustering
algorithm. This algorithm works by assigning items in a dataset as core data points if they are within a set distance
(eps) of a minimum number of other samples in a dataset (min_samples). Clusters are built around these core data
points, and any data point not within eps distance from a core data point is designated as noise, which means it is not
assigned to any cluster. The larger the eps distance and the smaller the minimum number of samples, the fewer clusters
that are likely to be predicted by DBSCAN. One notable attribute of this algorithm versus some of the others mentioned
398
Scientific Computing for Chemists with Python
above is that DBSCAN does not require the user to provide a requested number of clusters; it determines the number of
clusters based on the other parameters mentioned above.
To demonstrate clustering, we will generate a random, synthetic dataset using the make_blobs() function from
the sklearn.datasets module. This function takes a number of arguments, including the number of samples
(n_samples), number of features (n_features), number of clusters (centers), and the standard deviation of the
clusters (cluster_std). We will only generate two features to make this example easy to visualize. The output of
make_blobs() is a NumPy array containing the features (X) and a second NumPy array containing the labels (y).
We can see three distinct clusters, with the cluster on the bottom being more distinct than the two at the top. Also,
notice that the scales of the two features are different by roughly a factor of two. Before we can use this data, we will
need to normalize the scale of both features as clustering algorithms are sensitive to scale. For this task, we will use the
StandardScaler() function introduced in section 13.1.5.
SS = StandardScaler()
X_ss = SS.fit_transform(X)
Now that the data is scaled, we will initiate our model, train it using the fit() method, and examine the predictions
using the labels_ attribute.
DBSCAN(eps=0.4)
DB.labels_
array([ 0, 0, 1, 1, 2, 0, 2, 1, 0, 0, 0, 0, 2, 2, 2, 2, 1,
0, 2, 0, 2, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1,
2, 2, 1, 0, 0, 0, 1, 0, 0, 1, 1, 2, 0, 2, -1, 1, 0,
1, 1, 1, 0, 0, 1, 2, 1, 2, 0, 2, 2, 0, 1, 0, 2, 2,
2, 0, 2, 1, 1, 0, 2, 1, 0, 2, 0, 1, 0, 2, 0, 2, 0,
2, 0, 2, 1, 1, 2, 1, 0, 1, 0, 0, 1, 1, 2, 0, 2, 1,
2, 2, 1, 2, 0, 1, 2, 2, 0, 2, 2, 2, 1, 1, 0, 0, 1,
0, 2, 2, 1, 1, 1, 2, 2, 1, 0, 0, 1, 1, 2, 2, 0, 2,
0, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1,
2, 2, 1, 0, 1, 1, 2, 2, 2, 1, 2, 0, 0, 0, 2, -1, 2,
2, 2, 1, 2, 0, 0, 2, 1, 0, 1, 1, 2, 0, 2, 1, 1, 2,
2, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 2, 2])
The DBSCAN algorithm has designated which cluster each data point belongs to by assigning them an integer label.
Notice in the plot below that the labels assigned to each cluster are not the same as those in the previous plot. Clustering
labels are not classes but rather are merely to indicate which data points belong to the same cluster. The values themselves
do not matter. Two data points have been assigned values of -1, which means these data points are noise. The k-means
and agglomerative clustering algorithms would have assigned all data points, including outliers, to a cluster; but DBSCAN
is willing to label outliers as noise.
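The clustering steps above can be sketched end to end as follows. This is only an illustration under assumed settings: the sample count, cluster_std, and random_state values are hypothetical and are not necessarily those used to produce the output shown above.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Hypothetical settings: 200 samples, two features, three clusters
X, y = make_blobs(n_samples=200, n_features=2, centers=3,
                  cluster_std=1.0, random_state=0)

# Clustering is sensitive to scale, so standardize both features
X_ss = StandardScaler().fit_transform(X)

# eps is the neighborhood distance; min_samples keeps its default value
DB = DBSCAN(eps=0.4)
DB.fit(X_ss)

labels = DB.labels_
n_clusters = len(set(labels) - {-1})    # the label -1 marks noise, not a cluster
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)
```

Counting the unique labels other than -1 is a quick way to see how many clusters DBSCAN settled on for a given eps and min_samples.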
Blind signal (or source) separation (BSS) is the process of separating independent component signals from a mixed signal.
One application is in chemical spectroscopy where a spectrum may include signals from multiple chemical compounds in
a mixture. If we provide the BSS algorithm multiple spectra of chemical mixtures where each mixture contains varying
amounts of each chemical, the BSS algorithm should be able to separate the signals for each chemical component.
To demonstrate this process, we will use infrared (IR) spectroscopy data containing mixtures of acetone, cyclohexane,
toluene, and methanol in random ratios. Below are plots of four mixtures. We can see that, for example, the bands
at ~3400 cm−1 and ~1000 cm−1 increase together suggesting that they originate from the same compound; this type
of information can be used to discriminate which band belongs to which compound. However, instead of doing this
manually, we can allow the machine learning algorithms to pick apart the spectra, and even better yet, yield complete
spectra of each component.
[Figure: Four Mixed Signals. IR spectra of four mixtures; y-axis: Transmittance, %]
For this task, we will use the independent component analysis (ICA) function called FastICA() available in scikit-learn.
The process parallels the other unsupervised learning processes above: first train the algorithm using the fit()
method, then transform the data using the transform() method. First we will load the data from the files
and stack them into an array called S_mix where each column contains the data from a spectrum. For comparison
purposes, we will also load IR spectra of each pure component into an array called S_pure. Normally we would not
have spectra of pure components, hence the “blind” in blind signal separation, but this is just an example.
The code below also grabs a copy of the wavenumbers (wn) for plotting purposes later on. The last 300 data points of
the spectra in this example are also clipped off because they are a low-signal, high-noise region of the spectra, which
reduces the effectiveness of the separation.
import os
data_pure = []
data_mix = []
data_array_pure = [Link](data_pure).T
data_array_mix = [Link](data_mix).T
[Link]([Link]([Link]()))
The next step is to train and transform the data. When generating the FastICA model, it requires the number of components
(n_components), which is four in this case. One minor drawback of this algorithm is that the user must first know the
number of components in the mixed signal.
Note
The example below sets random_state=42. This is set to keep the outputs of this Jupyter Book consistent
over time but is not necessary for regular use of the FastICA() function.
S_fit.shape
(6961, 4)
You may have noticed that instead of doing the fit() and transform() in two steps, we used a
fit_transform() method. This method is present in many unsupervised algorithms, allowing the user to perform
both steps in a single function call. The resulting array S_fit contains the four extracted components, where each
column of the array is a component. We can plot each component next to IR spectra of pure compounds collected separately
to see how it performed. Remember that the BSS algorithm does not know anything about what these components are,
so interpreting them or matching them to real chemical compounds is left to the user.
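To make the shape conventions concrete, here is a small self-contained sketch of the same fit_transform() call on synthetic data. The Gaussian "spectra", the peak() helper, and the mixing ratios are all hypothetical stand-ins for the IR data files, which are not reproduced here.

```python
import numpy as np
from sklearn.decomposition import FastICA

def peak(x, center, width):
    """Hypothetical helper: a single Gaussian band."""
    return np.exp(-((x - center) / width) ** 2)

x = np.linspace(0, 100, 1000)

# Three made-up pure-component signals, one per column
S_true = np.column_stack([
    peak(x, 20, 2) + 0.5 * peak(x, 70, 3),
    peak(x, 40, 4),
    peak(x, 55, 1.5) + 0.8 * peak(x, 85, 2),
])

# Four mixtures with random ratios of the three components, one per column
rng = np.random.default_rng(seed=0)
ratios = rng.random((3, 4)) + 0.1
S_mix = S_true @ ratios

# Train and transform in a single call; each column of S_fit is a component
ica = FastICA(n_components=3, random_state=42)
S_fit = ica.fit_transform(S_mix)
print(S_fit.shape)
```

As with the IR example, the order, sign, and scale of the columns in S_fit are arbitrary, so matching them to the original signals is left to the user.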
fig1 = plt.figure(figsize=(12,6))
ax1 = fig1.add_subplot(1,2,1)
plt.plot(wn, S_fit[:,2])
plt.xlabel('Wavenumbers, cm$^{-1}$')
plt.title('Extracted Acetone Spectrum')
plt.gca().invert_xaxis()
ax2 = fig1.add_subplot(1,2,2)
[Figure: extracted acetone component (left) and pure acetone spectrum (right; Transmittance, %) versus Wavenumbers, cm$^{-1}$]
fig2 = plt.figure(figsize=(12,6))
ax1 = fig2.add_subplot(1,2,1)
plt.plot(wn, S_fit[:,0])
plt.xlabel('Wavenumbers, cm$^{-1}$')
plt.title('Extracted Toluene Spectrum')
plt.gca().invert_xaxis()
ax2 = fig2.add_subplot(1,2,2)
plt.plot(wn, S_pure[:,1])
plt.xlabel('Wavenumbers, cm$^{-1}$')
plt.ylabel('Transmittance, %')
plt.title('Pure Toluene Spectrum')
plt.gca().invert_xaxis()
[Figure: extracted toluene component (left) and pure toluene spectrum (right; Transmittance, %) versus Wavenumbers, cm$^{-1}$]
fig3 = plt.figure(figsize=(12,6))
ax1 = fig3.add_subplot(1,2,1)
plt.plot(wn, S_fit[:,1])
plt.xlabel('Wavenumbers, cm$^{-1}$')
plt.title('Extracted Cyclohexane Spectrum')
plt.gca().invert_xaxis()
ax2 = fig3.add_subplot(1,2,2)
plt.plot(wn, S_pure[:,0])
plt.xlabel('Wavenumbers, cm$^{-1}$')
plt.ylabel('Transmittance, %')
plt.title('Pure Cyclohexane Spectrum')
plt.gca().invert_xaxis()
[Figure: extracted cyclohexane component (left) and pure cyclohexane spectrum (right; Transmittance, %) versus Wavenumbers, cm$^{-1}$]
fig4 = plt.figure(figsize=(12,6))
ax1 = fig4.add_subplot(1,2,1)
plt.plot(wn, S_fit[:,3])
plt.xlabel('Wavenumbers, cm$^{-1}$')
plt.title('Extracted Methanol Spectrum')
plt.gca().invert_xaxis()
ax2 = fig4.add_subplot(1,2,2)
plt.plot(wn, S_pure[:,3])
plt.xlabel('Wavenumbers, cm$^{-1}$')
plt.ylabel('Transmittance, %')
plt.title('Pure Methanol Spectrum')
plt.gca().invert_xaxis()
[Figure: extracted methanol component (left) and pure methanol spectrum (right; Transmittance, %) versus Wavenumbers, cm$^{-1}$]
Overall, the FastICA algorithm did a decent job, sometimes even an impressive job of picking out small features, but
there are some discrepancies between the extracted and pure IR spectra. The first is that there are peaks that extend
above the baseline in the extracted spectra. A transmittance over 100% is not possible, but the algorithm does not know
this. The y-axis scales of the extracted IR spectra also do not match the percent transmittance. While it is not shown
here, sometimes the extracted components are also upside down. This is because the mixtures are assumed to be weighted
sums of the components, and a component can be negative. If this bothers you, there is a related BSS algorithm called
non-negative matrix factorization (NMF) supported in scikit-learn which requires each component to be non-negative.
Finally, you may notice that there is a broad feature at around 3400 cm$^{-1}$ in the acetone extracted component that
is not in the pure compound. This is an O-H stretch from the methanol IR spectrum showing up in the acetone spectrum.
This may be the result of hydrogen bonding between methanol and acetone shifting the O-H stretch, breaking down the
assumption that the spectra of mixtures are purely additive.
There is a saying that there is no task so simple it cannot be done wrong, and machine learning is no exception. Machine
learning, like any tool, can be used incorrectly, leading to erroneous or error-prone results. One particular source of error
in machine learning is making predictions outside the scope of the training dataset. That is, if we train an algorithm
to predict the boiling points using aliphatic alcohols, there is no reason to expect that the algorithm should be able to
accurately predict the boiling points of aromatic alcohols. Another risk in machine learning is overtraining an algorithm.
Some algorithms provide numerous parameters which customize the behavior, and these parameters are often used to
optimize the accuracy of the predictions. The parameters can be over-optimized for the training data so that the algorithm
then performs worse in predictions for non-training data. This is known as overtraining the algorithm. In all of the
excitement about how powerful and useful machine learning is, we should always keep the sources of error in mind and
always remember that just because a machine learning algorithm makes a prediction does not make it true.
Further Reading
Exercises
Complete the following exercises in a Jupyter notebook using the scikit-learn library. Any data file(s) referred to in
the problems can be found in the data folder in the same directory as this chapter’s Jupyter notebook. Alternatively,
you can download a zip file of the data for this chapter from here by selecting the appropriate chapter file and then
clicking the Download button.
1. Import the data file ROH_data.csv containing data on simple alcohols and train a random forest algorithm to
predict whether or not an alcohol is aliphatic. Remember to split the dataset using train_test_split() and
evaluate the quality of the predictions.
2. Open the file titled NMR_mixed_problem.csv which contains three $^1$H NMR spectra. Each spectrum (column)
is a mixture of three chemical compounds in different ratios (artificially generated). Use FastICA to separate out
the pure $^1$H NMR spectrum of each of the three components. Compare your separated spectra to the pure NMR spectra in
NMR_pure_problem.csv.
3. Import the file titled [Link] containing unlabeled data with two features.
a) Use the DBSCAN algorithm to predict clusters for each datapoint in the set. Plot the data points using color to
represent each cluster.
b) Use the k-means algorithm (sklearn.cluster.KMeans) to predict clusters for each datapoint in the set.
This may require you to visit the Scikit-Learn website to view the documentation for this algorithm and function.
Plot the data points using color to represent each cluster. You will need to provide this algorithm the number of
clusters you feel is most appropriate.
4. Load the handwritten digits dataset using the sklearn.datasets.load_digits() function.
a) Reduce the dimensionality of the dataset to two principal components and visualize it. Color the markers based
on the category, and use plt.cm.get_cmap('turbo',10) to generate a colormap with ten colors. You will
need to import PCA from sklearn.decomposition.
b) Train the Gaussian Naive Bayes algorithm to classify the digits. Be sure to evaluate the effectiveness using a
testing dataset. Import GaussianNB from sklearn.naive_bayes.
CHAPTER 14: OPTIMIZATION & ROOT FINDING
Optimization is the process of improving something to the extent that it cannot be reasonably improved any further. This
often involves maximizing desirable attributes and/or minimizing those that are undesirable, so finding the maximum and
minimum are common optimization goals. While you may or may not have previously worked directly with optimization,
you almost certainly have used it as part of a larger application or task such as energy minimization of a molecule,
regression analysis, or a number of machine learning algorithms.
In optimization tasks, we often find ourselves searching for the maximum or minimum of a given mathematical function.
If we, for example, seek to minimize a function 𝑓(𝑎, 𝑏), our goal is to find values for input variables 𝑎 and 𝑏 to generate
the smallest possible output from the function 𝑓. One approach is to manually try different input values until you get
the smallest possible output, but this kind of tedious and time-consuming task is best left to computers. The
scipy.optimize module contains a number of tools for performing optimizations of mathematical functions. The goal of this
chapter is to introduce the scipy.optimize module and apply it to chemical applications. This chapter does not go
into the deeper theory behind optimization, such as specific algorithms. For those interested in some of the deeper theory
of optimization, see the Further Reading section.
Before we begin, we first need to address how we measure what is “best.” For this, we use a cost function, also known
as an objective function or criterion, which is a mathematical function that takes in features and returns a value that is
a measure of “goodness.” If we were a company that is trying to maximize our profits, the objective function would
likely be some mathematical equation that calculates our net profit. Optimization of a molecule’s conformation involves
minimizing the energy, so the objective function here is the function that calculates the energy of the molecule based on
attributes like bond angles and lengths. In the examples below, each of the scipy.optimize functions takes as
its first argument an objective function in the form of a Python function.
[Link](obj_func)
The examples in this chapter assume the following imports from NumPy, SciPy, pandas, and matplotlib.
import numpy as np
import pandas as pd
from scipy import optimize
import matplotlib.pyplot as plt
14.1 Minimization
The first task we will look at is minimization, and for this, scipy.optimize has two related functions,
scipy.optimize.minimize() and scipy.optimize.minimize_scalar(). Both functions minimize the provided
function, but the difference is in the number of independent variables that the objective function takes. A function
with only one independent variable, 𝑓(𝑎), is known as univariate while a function that takes multiple independent
variables, 𝑓(𝑎, 𝑏, ...), is known as multivariate. The minimize() function can minimize either multivariate or univariate
functions while minimize_scalar() can only accept univariate objective functions.
If we are trying to minimize a function with a single independent variable, the
scipy.optimize.minimize_scalar() function is likely a good choice. As a simple example, we will find the radius of
minimal energy for two xenon atoms using the Lennard-Jones equation below, which describes the potential energy with
respect to the distance,
𝑟, between the two atoms. In this example, the code uses 𝜎 = 1.77 angstroms and 𝜖 = 4.10 kJ/mol.
$$PE = 4\epsilon \left[ \left( \frac{\sigma}{r} \right)^{12} - \left( \frac{\sigma}{r} \right)^{6} \right]$$
Because the energy described by the Lennard-Jones equation is what we are trying to minimize, this is our objective
function. We first need to define this equation as a Python function.
def PE_LJ(r):
epsilon = 4.10 #kJ/mol
sigma = 1.77 #angstroms
PE = 4 * epsilon * ( (sigma/r)**12 - (sigma/r)**6)
return PE
Next, we will feed our objective function into the scipy.optimize.minimize_scalar() function along with
some constraints. This is known as constrained optimization and is accomplished by setting method='bounded'
and setting bounds= to the range of values the function will operate in. In this case, we are constraining the values
of 𝑟 to a specific range.
Creating bounds is typically optional, but if you know roughly where the minimum will be or where it cannot be, this is
helpful information. In this example, it is important to provide constraints on 𝑟 to ensure the minimize_scalar()
function does not try r = 0 and generate a ZeroDivisionError.
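A bounded minimization along these lines might look like the sketch below. The search range of 1.5 to 6 angstroms is an assumed choice made for illustration; any range that brackets the minimum and excludes r = 0 should behave similarly.

```python
from scipy import optimize

def PE_LJ(r):
    epsilon = 4.10  # kJ/mol, value used in the code above
    sigma = 1.77    # angstroms, value used in the code above
    return 4 * epsilon * ((sigma / r)**12 - (sigma / r)**6)

# Assumed search range of 1.5 to 6 angstroms, which keeps r away from zero
opt = optimize.minimize_scalar(PE_LJ, bounds=(1.5, 6), method='bounded')
print(opt.success, opt.x)
```

For the Lennard-Jones equation the minimum lies at $2^{1/6}\sigma$, which gives a convenient way to check the optimized value.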
Note
Because we imported the optimize module explicitly in this chapter, calling any function from inside the
scipy.optimize module does not need to include scipy.
Alternatively, we can use the bracket=(a, b) argument where f(b) < f(a). This argument is different from the
bounds= argument in that instead of telling the function a region to search, it tells the minimize_scalar() function
the direction to search for the minimum. The minimum does not need to be between a and b, but it simply tells the function
that if it moves in the direction of a → b, it will be moving toward the minimum.
Note
The bracket= argument can also accept three values, bracket=(a, b, c), where 𝑓(a) > 𝑓(b) < 𝑓(c). This
is even more helpful to the minimization function but also requires more foreknowledge from the user about the
function being minimized.
message:
Optimization terminated successfully;
The returned value satisfies the termination criteria
(using xtol = 1.48e-08 )
success: True
fun: -4.099999999999997
x: 1.9867578344041286
nit: 23
nfev: 26
After running our optimization function, an OptimizeResult object is returned. This object has a series of attributes
listed above, but the two most important are success and x. The success attribute tells us if the optimization function
was successful at converging on a solution, while the x attribute is the optimized solution. We can access the solution
using opt.x to learn that the distance of minimum energy according to the Lennard-Jones equation is 1.99 angstroms.
opt.x
np.float64(1.9867578344041286)
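The sketch below shows one way to combine the bracket= argument with the attributes of the returned OptimizeResult object. The bracket values of 1.8 and 1.9 angstroms are an assumption chosen so that f(b) < f(a); they are not necessarily the values used to produce the output above.

```python
from scipy import optimize

def PE_LJ(r):
    epsilon = 4.10  # kJ/mol, value used in the code above
    sigma = 1.77    # angstroms, value used in the code above
    return 4 * epsilon * ((sigma / r)**12 - (sigma / r)**6)

# Assumed bracket: PE_LJ(1.9) < PE_LJ(1.8), so the search heads toward larger r
opt = optimize.minimize_scalar(PE_LJ, bracket=(1.8, 1.9))

# Check for convergence before trusting the solution
if opt.success:
    print(f'r_min = {opt.x:.2f} angstroms, PE = {opt.fun:.2f} kJ/mol')
```

Checking opt.success before using opt.x is a cheap safeguard, and opt.fun gives the value of the objective function at the solution.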
Because our energy function is univariate, we can easily visualize the function and our minimized solution (orange
dot) as done below.
r = np.arange(1.7, 6, 0.01)
PE = PE_LJ(r)
plt.plot(r, PE, label='energy function')
plt.plot(opt.x, PE_LJ(opt.x), 'o', label='minimum')
plt.xlabel('Distance, Angstroms')
[Figure: Potential Energy, kJ/mol versus Distance, Angstroms, showing the energy function and the minimum]
Note
Optimization functions can use algorithms with random components, so if they are run multiple times, variations
in the results may be observed. The results typically vary only slightly, but sometimes more significant variations
may be observed, such as if there are multiple minima or maxima in the objective function.
How it works…
The goal of optimization is to minimize the objective function, which can be accomplished through a number of
algorithms. Knowledge of these algorithms is not required to use optimization, but if you are curious, here is the view
from 10,000 feet. Despite the wide variety of algorithms available, they generally operate by an almost trial-and-error
approach. They start with initial input values for the objective function and then try slightly different input values. If
the new input values decrease the objective function, they are accepted, and if they increase the objective function,
they are rejected. This continues on for a number of iterations, finding values that progressively decrease the objective
function until the algorithm can no longer minimize the objective function any further. The final input values are then
returned by the optimization function as the optimized values. Optimization algorithms can differ by, for example,
how they decide which input values to try next or how different the subsequent input values to try should be. See
Further Reading for more information on optimization algorithms.
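The trial-and-error idea described above can be sketched in a few lines of Python. This is a toy illustration only; simple_minimize() is a hypothetical function written for this example and is not how the SciPy algorithms are implemented.

```python
def simple_minimize(func, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Toy one-variable minimizer: accept trial inputs that lower func."""
    x, fx = x0, func(x0)
    for _ in range(max_iter):
        improved = False
        # Try slightly different input values on either side of x
        for trial in (x + step, x - step):
            f_trial = func(trial)
            if f_trial < fx:          # accept inputs that decrease the function
                x, fx = trial, f_trial
                improved = True
                break
        if not improved:              # both trials rejected, so try smaller steps
            step /= 2
            if step < tol:
                break
    return x, fx

# Example objective function with its minimum at x = 3
x_min, f_min = simple_minimize(lambda x: (x - 3)**2 + 1, x0=0.0)
print(x_min, f_min)
```

Real optimization algorithms are far more efficient at choosing the next input values to try, but the accept-or-reject loop is the same basic idea.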
The SciPy library does not contain any maximization functions, but maximization functions are not really necessary as
minimizing the negative of a function provides the maximum. For example, below we have the radial probability function
for the hydrogen 3s orbital. For convenience, the SymPy library’s sympy.physics.hydrogen module is used to generate the
3s radial function (𝜓, psi) as a Python function. For this maximization example, let’s find the radius of maximum
probability for the electron. The normalized probability can be calculated by $\psi^2 r^2$ where 𝑟 is the distance from
the nucleus.
import sympy
from sympy.physics.hydrogen import R_nl
R = sympy.symbols('R')
r = np.arange(0,40,0.1)
plt.plot(r, psi(r)**2 * r**2) # r is in bohrs (~0.529 angstroms)
plt.xlim(0,30)
plt.ylim(0,0.11)
plt.margins(x=0, y=0)
plt.xlabel('Radius, $a_0$')
plt.ylabel('Probability Density');
[Figure: Probability Density versus Radius, $a_0$, for the hydrogen 3s orbital]
There are multiple ways to make the function negative, like including a negative sign in the Python function definition. Our
Python function has already been created, so below we will make the radial probability density negative using a lambda
function (see section 2.1.4 for review on lambda functions).
message:
Optimization terminated successfully;
The returned value satisfies the termination criteria
(using xtol = 1.48e-08 )
success: True
fun: -0.014833612579485785
x: 0.7400370693225894
nit: 13
nfev: 16
The value returned is the first local maximum but not the global maximum we are seeking. To ensure we get the global
maximum, we need to add a constraint for the range of radii used by the optimization function.
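One way to carry out the bounded maximization is sketched below. The use of sympy.lambdify() to build psi and the search range of 10 to 20 bohrs are assumptions made for illustration; the range is simply read off the outermost peak in the plot above.

```python
import sympy
from sympy.physics.hydrogen import R_nl
from scipy import optimize

# Assumption: build the 3s radial function (n=3, l=0) as a NumPy-ready function
R = sympy.symbols('R')
psi = sympy.lambdify(R, R_nl(3, 0, R), 'numpy')

# Minimizing the negative of the probability finds the maximum
neg_prob = lambda r: -(psi(r)**2 * r**2)

# Assumed bounds of 10 to 20 bohrs isolate the outermost (global) maximum
opt = optimize.minimize_scalar(neg_prob, bounds=(10, 20), method='bounded')
print(opt.x, -opt.fun)
```

Because the objective function was negated, the maximum probability is -opt.fun rather than opt.fun.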
[Figure: Probability Density versus Radius, $a_0$, showing the probability function and the maximum]
One of the key minimization functions in the scipy.optimize module is the minimize() function, which is
capable of minimizing multiple variables simultaneously. This function requires at least two arguments: the objective
function and initial guesses for each value as a list or tuple.
optimize.minimize(obj_func, (guess))
As an example, we will calculate the equilibrium concentrations for a tandem equilibrium shown below between three
different isomers, assuming we place an initial 122 mmol of the isomer A into solution and allow it to equilibrate at
25 °C. The two equilibrium constants for this equilibrium are $K_1 = 5.0$ and $K_2 = 0.80$.
$$A \overset{K_1}{\rightleftharpoons} B \overset{K_2}{\rightleftharpoons} C$$
To solve this problem, we need to adjust the three isomer concentrations, our variables, such that they get as close as
possible to the equilibrium ratios set by the equilibrium constants.
The first step is to write an objective function as a Python function, obj_func(), that quantifies how poor the solution
is. It is the value from this function that we are minimizing to generate the optimal solution to our problem. Because
our goal is to bring the isomer quantities as close to the equilibrium ratios as possible, a reasonable objective function
will calculate how far our isomer ratios are from equilibrium. The quality of our solution will be calculated from the
squares of the differences between a proposed solution and the target equilibrium constants, so the penalty grows
quadratically the further the proposed solution is from the target equilibrium values.
return quality
Next, we provide the minimize() function both our objective function and an initial guess for the quantities A, B,
and C. The initial guess needs to be a single collection such as a tuple, array, or list. The output of the minimize()
function is again an OptimizeResult object with the x attribute accessing the minimized quantities for A, B, and C,
respectively.
To access the minimized values, use equ.x in this example. We can then verify the results by calculating the equilibrium
values based on the calculated equilibrium quantities.
equ.x[1]/equ.x[0]
np.float64(5.000000300516061)
equ.x[2]/equ.x[1]
np.float64(0.7999999371517403)
Both values are in excellent agreement with 𝐾1 and 𝐾2 listed above. One step still remains to solve our problem. In the
above problem, it is stated that we started with 122 mmol of isomer A, so if we take the sum of the quantities of A, B,
and C, they need to equal 122.
np.sum(equ.x)
np.float64(1.9166790953545365)
They do not total to 122 mmol, so we need to scale the quantities up to a total of 122 mmol. Keep in mind that scaling
up our values for A, B, and C will not change the ratios.
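The whole equilibrium calculation can be sketched as follows. The body of obj_func() and the initial guess of (1, 1, 1) are assumptions based on the description above, so the unscaled quantities may differ from those shown earlier even though the ratios should agree.

```python
import numpy as np
from scipy import optimize

K1, K2 = 5.0, 0.80

def obj_func(quantities):
    """Sum of squared differences between the isomer ratios and K1, K2."""
    A, B, C = quantities
    quality = (B / A - K1)**2 + (C / B - K2)**2
    return quality

# Assumed initial guess for the quantities of A, B, and C
equ = optimize.minimize(obj_func, (1.0, 1.0, 1.0))

# Scale the quantities so they total 122 mmol; the ratios are unchanged
scaled = 122 * equ.x / np.sum(equ.x)
print(np.round(scaled, 1))
```

Dividing by the sum and multiplying by 122 rescales all three quantities by the same factor, which is why the equilibrium ratios survive the scaling.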
The final equilibrium quantities for A, B, and C are 12.2, 61.0, and 48.8 mmol, respectively.
Warning
It is important to recognize that just because an optimization function generates an answer does not mean that it
is indeed the correct answer for your problem. The generated answer is the optimization algorithm’s best effort in
producing the optimal result, which may be, for example, a local minimum instead of the global minimum. If there
is a way to verify the answer, such as is done in the equilibrium example above, this is a prudent last step before using
this information.
A common application of optimization is fitting an equation to a series of data points, such as a linear regression. While
linear regression also happens to have an analytical solution demonstrated in section 8.3.3, we will solve it here using
optimization. In the figure below, a regression line (solid orange) runs through the data points. The residuals are the
difference between the regression line and the data points (green vertical dotted lines). The goal of linear regression is to
generate a regression line that minimizes these residuals.
Figure 1 An example of a line of best fit (solid orange) running through data points (blue) with residuals (green dashed)
shown as the difference on the 𝑦-axis between the data point and linear regression.
One of the major questions in regression is how to measure the quality of the fit. We could in principle use the
sum of the absolute values of the residuals, known as the least absolute deviation cost or objective function, but the
commonly accepted objective function for fitting equations to data is the mean square error (MSE) function. This is the
average of the square of the difference between the equation’s predictions and the actual data points; in other words,
MSE is the average squared residual of the fit line. The MSE equation is shown below where $f_i$ is the y-value from the
regression line, $y_i$ is the data point y-value, and 𝑁 is the number of data points.
$$MSE = \frac{1}{N} \sum_{i=1}^{N} (f_i - y_i)^2$$
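As a small worked example of the MSE equation, the sketch below evaluates it with NumPy. The mse() function and the three data points are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def mse(f, y):
    """Mean square error between predictions f and data points y."""
    f = np.asarray(f, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((f - y)**2)

# Hypothetical data points and regression-line values
y = np.array([1.0, 3.0, 5.0])
f = np.array([1.5, 3.0, 4.0])

# Residuals are 0.5, 0.0, and -1.0, so MSE = (0.25 + 0 + 1) / 3
print(mse(f, y))
```

Squaring the residuals keeps positive and negative deviations from canceling and penalizes large misses more heavily than small ones.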
There are two general types of regression: linear regression and nonlinear regression. The key difference is that the former
fits data to a linear equation (or plane or hyperplane for higher dimensions) while the latter fits data to nonlinear equations.
There are numerous examples of linear equations in chemistry, and often when equations are nonlinear, they can be
rearranged into a linear form. One classic example of a linear trend is the absorption of light being passed through a
solution of colored analyte (i.e., material being quantified) with respect to the concentration of the analyte. This is related
by Beer’s law shown below where 𝐴 is absorbance, 𝜖 is the molar absorptivity constant for a particular analyte, 𝑏 is path
length of the sample, and 𝐶 is the concentration of analyte.
𝐴 = 𝜖𝑏𝐶
Being that the path length for our instrument is 1 cm, which is quite common, this equation simplifies to the following.
𝐴 = 𝜖𝐶
By measuring the absorbance of multiple samples of analyte at known concentrations, the absorbance can be plotted with
respect to concentration, and the slope of the linear trend is the molar absorptivity, 𝜖.
As our sample data, let’s again use the copper cuprizone data we saw in chapter 8.
Table 1 Beer-Lambert Law Data for Copper Cuprizone
The function we will use to fit this data is the optimize.curve_fit() function which performs a least-squares
minimization that fits an equation to the data provided. Despite this function being often described for fitting an equation
to nonlinear data, this function is highly versatile and can fit both linear and nonlinear data. This function requires
the theoretical equation, func, in the form of a Python function, the independent variable, xdata, and the dependent
variable, ydata. The curve_fit() function also allows the user to optionally provide an initial guess for the equation
variables/constants, p0. This can help speed up the process for more challenging problems and helps ensure the algorithm
converges on a reasonable solution.
Below we have defined a Python function describing our equation that will be used to fit the data. The Python function
used with optimize.curve_fit() requires that the first argument of the Python function must be the independent
variable(s), and all the rest of the arguments are the parameters used to fit the equation to the data. In this case, these are
the slope, 𝑚, and the y-intercept, 𝑏.
The objective function is then provided to the optimize.curve_fit() function along with the data to fit. The
curve_fit() function returns two arrays: the optimized parameters and the estimated covariance of the optimized
parameters. We are only concerned with the optimized parameters right now, so we use the __ junk variable to hold the
covariance array.
const, __ = optimize.curve_fit(lin_func, C, A)
const
According to the curve_fit() function, the slope is $1.55 \times 10^{4}$ cm$^{-1}$ M$^{-1}$ while the y-intercept is $-5.45 \times 10^{-6}$.
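For readers who want to try the linear fit without the data file, a self-contained sketch is shown below. The concentrations and absorbances are hypothetical values invented for illustration, not the copper cuprizone data, and the p0 initial guess is optional.

```python
import numpy as np
from scipy import optimize

def lin_func(x, m, b):
    """Linear equation: the first argument is the independent variable."""
    return m * x + b

# Hypothetical concentrations (M) and absorbances, NOT the copper cuprizone data
C = np.array([1.0e-6, 5.0e-6, 1.0e-5, 2.0e-5, 4.0e-5])
A = np.array([0.016, 0.077, 0.156, 0.309, 0.621])

# p0 gives an initial guess for the slope and y-intercept
const, __ = optimize.curve_fit(lin_func, C, A, p0=(1e4, 0))
m, b = const
print(m, b)
```

With a path length of 1 cm, the fitted slope m is the estimate of the molar absorptivity.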
Optimization can also be used to find the best fit for nonlinear data based off of a theoretical equation. One application of
nonlinear fitting is to fit data to a theoretical rate law as a means of determining one or more rate constants in the equation.
For this, we will again use the curve_fit() function from the scipy.optimize module.
To demonstrate this process, let’s consider the two-step reaction of A + B → P catalyzed by a metal catalyst M.
$$M + A \underset{k_{r1}}{\overset{k_1}{\rightleftharpoons}} MA$$
$$MA + B \overset{k_2}{\rightarrow} P + M$$
The theoretical rate law for this two-step reaction is shown below.
$$Rate = \frac{k_2 k_1 [M][A][B]}{k_{r1} + k_2 [B]}$$
We need to again define the theoretical equation in the form of a Python function. Our function calculates the rate of the
chemical reaction versus the concentration of B, but it would also work using data for rate versus the concentration of A
depending upon what data you happen to have.
For our example, we will generate some simulated data with random noise mixed in. The values of our rate constants
will be $k_1 = 1.2$, $k_{r1} = 0.48$, and $k_2 = 4.29$, and we will set [A] = 0.50 M and [M] = $1.2 \times 10^{-3}$ M.
The concentrations of [A] and [M] are unchanged during the course of the rate measurement (e.g., using the method of
initial rates).
M, A = 1.2e-3, 0.50
points = 20
conc = [Link](0.1, 8, points)
rng = [Link].default_rng(seed=18)
rate = frate(conc, k1, kr1, k2) + [Link](points)/40000
[Figure: scatter plot of the simulated data; x-axis: [B], M; y-axis: Rate, M/s]
Now that we have our data, we can fit it to the theoretical equation to extract the rate constants.
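The fitting step can be sketched as follows. This is a minimal, self-contained version rather than the book's own cell: it repeats the rate-law function so it runs on its own, the Gaussian noise level is an assumption, and the starting guesses in p0 are hypothetical. Because the rate law can be rewritten as k1[M][A][B]/(kr1/k2 + [B]), data of rate versus [B] alone constrain k1 and the ratio kr1/k2, so those are the quantities compared against the known values here.

```python
import numpy as np
from scipy import optimize

M, A = 1.2e-3, 0.50   # fixed [M] and [A] in mol/L, values from the text

def frate(B, k1, kr1, k2):
    """Rate law for the two-step mechanism as a function of [B]."""
    return (k2 * k1 * M * A * B) / (kr1 + k2 * B)

# Simulated data; the noise model is an assumption made for this sketch
rng = np.random.default_rng(seed=18)
conc = np.linspace(0.1, 8, 20)
rate = frate(conc, 1.2, 0.48, 4.29) + rng.normal(0, 2e-6, conc.size)

# p0 holds rough (hypothetical) starting guesses for k1, kr1, and k2
popt, pcov = optimize.curve_fit(frate, conc, rate, p0=(1.0, 0.5, 4.0))
k1_fit, kr1_fit, k2_fit = popt

print('k1 =', k1_fit)
print('kr1/k2 =', kr1_fit / k2_fit)
```

If kr1 and k2 are needed individually, additional data (for example, rate versus [A]) or an independently known constant would be required.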
These rate constants are in good agreement with those used to generate the data. We can also plot the simulated data
versus the rate equation generated by our curve fitting below.
plt.xlabel('[B], M')
plt.ylabel('Rate, M/s')
plt.legend(loc=7);
[Figure: simulated data (Data) overlaid with the fitted curve (Calculated Regression); x-axis: [B], M; y-axis: Rate, M/s]
Note
If you are optimizing a function with multiple parameters, bounds are formatted with two lists or tuples. The first
contains the lower bounds while the second contains the upper bounds as demonstrated below.
bounds = ((a_low, b_low, c_low), (a_high, b_high, c_high))
optimize.curve_fit(func, xdata, ydata, bounds=bounds)
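As an illustration of the bounds format, the sketch below fits a hypothetical first-order decay (not an example from this chapter) while restricting both parameters to physically sensible ranges. The function name, data, and bound values are all assumptions chosen for the demonstration.

```python
import numpy as np
from scipy import optimize

def decay(t, A0, k):
    """First-order decay: concentration versus time."""
    return A0 * np.exp(-k * t)

# Noise-free synthetic data generated with A0 = 2.0 and k = 0.30
t = np.linspace(0, 10, 25)
conc = decay(t, 2.0, 0.30)

# Lower bounds first, then upper bounds, one entry per fit parameter
bounds = ((0, 0), (5, 1))
popt, pcov = optimize.curve_fit(decay, t, conc, bounds=bounds)
A0_fit, k_fit = popt
print(A0_fit, k_fit)
```

When bounds are supplied, curve_fit() switches from its default Levenberg-Marquardt method to a bounded solver, so results can differ slightly from an unbounded fit.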
Another feature of the optimize.curve_fit() function is that it also accepts the uncertainty or errors in each data
point. All regression examples seen so far in this book assume that each data point has the same level of uncertainty, but it
is not uncommon for data to have different uncertainties. If your uncertainty varies, you can provide the curve_fit()
function with the uncertainties as standard deviations to the sigma= argument as an array-like object (e.g., list, tuple, or
NumPy array). When uncertainties are provided, data points with more uncertainty have less influence on the resulting
regression than data points with less uncertainty. See the scipy.optimize.curve_fit() documentation for more
information and options.
In the example below, we will again fit concentration versus kinetic rate data from the above two-step chemical reaction.
This time, we also have an array, uncertainty, that provides degrees of uncertainty for the rates.
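A weighted fit of this kind might look like the sketch below. The uncertainty values are hypothetical (they simply grow with [B]), the noise is simulated from them, and the rate-law function is repeated so the block runs on its own.

```python
import numpy as np
from scipy import optimize

M, A = 1.2e-3, 0.50   # fixed [M] and [A] in mol/L, values from the text

def frate(B, k1, kr1, k2):
    """Rate law for the two-step mechanism as a function of [B]."""
    return (k2 * k1 * M * A * B) / (kr1 + k2 * B)

rng = np.random.default_rng(seed=18)
conc = np.linspace(0.1, 8, 20)

# Hypothetical standard deviations that increase with [B]
uncertainty = np.linspace(1e-6, 5e-6, conc.size)
rate = frate(conc, 1.2, 0.48, 4.29) + rng.normal(0, uncertainty)

p0 = (1.0, 0.5, 4.0)   # rough starting guesses (assumed)
unweighted, _ = optimize.curve_fit(frate, conc, rate, p0=p0)
weighted, _ = optimize.curve_fit(frate, conc, rate, p0=p0, sigma=uncertainty)

print('unweighted:', unweighted)
print('weighted:  ', weighted)
```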
Comparing these constants to those calculated with the assumption of constant uncertainty, the values are similar but have
a noticeable difference. The general rule is that the greater the variation in the uncertainties, the more the constants will
differ from those derived with the assumption of constant uncertainty.
Note
Fitting data to a mathematical function can also be accomplished using the optimize.least_squares() function.
The key difference between using curve_fit() and least_squares() is that the former accepts the
theoretical equation and data directly while the latter requires a Python function that calculates the residuals.
Interestingly, the source code for the curve_fit() function calls the least_squares() function for its 'trf' and
'dogbox' methods. We use the curve_fit() function here as it is more intuitive and convenient.
There is another related function, scipy.optimize.leastsq(), that performs a similar operation but only uses
the Levenberg-Marquardt algorithm and is described as legacy on the SciPy website. The optimize.least_squares()
function is more versatile and is likely the better choice of the two.
Below is an additional example where we use optimization to determine the concentrations of three different dyes mixed
together and analyzed by UV-Vis spectroscopy. This example was inspired by a Journal of Chemical Education article by
Jesse Maccione, Joseph Welch, and Emily C. Heider. By Beer’s law, the absorbance (A) of an analyte is the product of
the molar absorptivity constant (𝜖) for that analyte, the path length in cm (𝑏), and concentration (𝐶).
𝐴 = 𝜖𝑏𝐶
If there are multiple analytes in solution, the total absorbance (A𝑡𝑜𝑡 ) is equal to the sum of the absorbances for the
individual analytes. In our example, we will be dealing with a mixture of red, blue, and yellow dyes.
We ultimately want concentrations of the dyes, so we can substitute in Beer’s law for the three dye absorbances.
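Written out, the two relationships described above take roughly the following form for the three-dye mixture. The subscript labels are chosen here for illustration, and the path length 𝑏 is shared by all three dyes.

$$ A_{tot} = A_{red} + A_{blue} + A_{yellow} $$
$$ A_{tot} = \epsilon_{red} b C_{red} + \epsilon_{blue} b C_{blue} + \epsilon_{yellow} b C_{yellow} $$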
The path length is a constant that depends upon the instrument, and the molar absorptivity constants (𝜖) are constants that
depend upon the analytes and the wavelength we are measuring absorbances at. This means that for a particular set of
dyes and instrument, the total absorbance (𝐴𝑡𝑜𝑡 ) depends upon the unknown concentrations of individual dyes. Because
we have three unknowns, we need three equations to solve for the unknowns. This can be accomplished by measuring
the absorbance and molar absorptivity at a minimum of three different wavelengths as demonstrated in section 8.3.2. In
this chapter, we will instead measure absorbances at every nanometer from 400 nm to 850 nm and allow the optimization
function to fit the total absorbances by adjusting the individual dye concentrations.
Warning
While including more data points from the spectra can often lead to better results, using too many points can
sometimes have the opposite effect due to overfitting noise. It is often best to select regions where there is the
largest signal-to-noise ratio to avoid fitting too much noise.
First, we will import the absorbance data from the food_coloring.csv file using pandas and plot it to see what the
data look like. In the CSV file, there are UV-Vis spectra for pure red, pure blue, pure yellow, and a mixture of the three.
data = pd.read_csv('data/food_coloring.csv')
data.index = data['nm']
data.drop('nm', axis=1, inplace=True)
A_red = data['red_40']
A_yellow = data['yellow_6']
A_blue = data['blue_1']
A_mix = data['mix_1']
[Figure: absorbance spectra of blue 1, yellow 6, red 40, and the mixture; x-axis: Wavelength, nm; y-axis: Absorbance]
Next, we will use the absorbances for each pure dye sample to find the molar absorptivities using Beer’s law. The path
length, 𝑏, in this instrument is 1 cm, and the molarities are known from the experimental setup. That is, below we are
solving for molar absorptivity (𝜖) by the following.
$$ \epsilon = \frac{A}{C} $$
eps_red = A_red / 4.09e-5
eps_blue = A_blue / 5.00e-6
eps_yellow = A_yellow / 2.92e-5
Finally, we will write a Python function that calculates the total absorbance from the individual concentrations and molar
absorptivities, and we will provide this function to the optimize.curve_fit() function. The fitting parameters are
the calculated concentrations of the individual dyes.
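One way to set this up is sketched below. Since the CSV file is not available here, the molar absorptivity spectra are hypothetical Gaussian bands and the mixture spectrum is synthetic; with the real data, the eps array would be built from eps_red, eps_blue, and eps_yellow, and A_mix would come from the file. Passing the stacked absorptivities as the independent variable is a design choice of this sketch, not necessarily the approach used in the original notebook.

```python
import numpy as np
from scipy import optimize

nm = np.arange(400, 851)   # wavelengths, one point per nanometer

def band(center, height, width=35):
    """Hypothetical Gaussian absorption band in M^-1 cm^-1."""
    return height * np.exp(-((nm - center) / width) ** 2)

# Rows: red, blue, yellow (stand-ins for the measured absorptivities)
eps = np.vstack([band(520, 2.5e4), band(630, 1.3e5), band(430, 2.0e4)])

def A_total(eps, C_red, C_blue, C_yellow):
    """Total absorbance for a 1 cm path length, assuming additivity."""
    return C_red * eps[0] + C_blue * eps[1] + C_yellow * eps[2]

# Synthetic mixture spectrum with a small amount of noise
rng = np.random.default_rng(seed=1)
true_C = (1.5e-5, 3.0e-6, 1.2e-5)
A_mix = A_total(eps, *true_C) + rng.normal(0, 1e-4, nm.size)

# Starting guesses near the expected order of magnitude
C_fit, _ = optimize.curve_fit(A_total, eps, A_mix, p0=(1e-5, 1e-5, 1e-5))
print(C_fit)
```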
The end result is that the red, blue, and yellow dyes have concentrations of 1.45 × 10⁻⁵ M, 2.85 × 10⁻⁶ M, and
1.27 × 10⁻⁵ M.
Below is a quick demonstration on how to also solve this problem using the optimize.least_squares() function.
As mentioned earlier, both the curve_fit() and least_squares() functions can be used to solve the same
problems. The least_squares() function requires a Python function that calculates the residuals (i.e., the difference
between the calculated and measured absorbances) instead of the theoretical equation. This function also requires an initial
guess for the fit parameters. Even if you don’t know the concentrations, just give some reasonable value. In this case, we
guessed 1 × 10⁻³ M for each dye.
def residuals(X):
    C_red, C_blue, C_yellow = X
    A_calc = C_red * eps_red + C_blue * eps_blue + C_yellow * eps_yellow
    return A_mix - A_calc
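The call to least_squares() might then look like the following self-contained sketch. The spectra here are hypothetical noise-free Gaussian stand-ins for eps_red, eps_blue, eps_yellow, and A_mix, so the recovered concentrations are those used to build the synthetic mixture rather than the values from the real data.

```python
import numpy as np
from scipy import optimize

# Hypothetical stand-ins for the measured spectra
nm = np.arange(400, 851)
eps_red = 2.5e4 * np.exp(-((nm - 520) / 35) ** 2)
eps_blue = 1.3e5 * np.exp(-((nm - 630) / 35) ** 2)
eps_yellow = 2.0e4 * np.exp(-((nm - 430) / 35) ** 2)
A_mix = 1.5e-5 * eps_red + 3.0e-6 * eps_blue + 1.2e-5 * eps_yellow

def residuals(X):
    """Difference between measured and calculated absorbances."""
    C_red, C_blue, C_yellow = X
    A_calc = C_red * eps_red + C_blue * eps_blue + C_yellow * eps_yellow
    return A_mix - A_calc

guess = (1e-3, 1e-3, 1e-3)   # initial guess of 1e-3 M for each dye
fit = optimize.least_squares(residuals, guess)
print(fit.x)   # fitted concentrations: red, blue, yellow
```

The returned object also carries diagnostic attributes such as fit.success and fit.cost, which are worth checking before trusting the concentrations.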
The resulting concentrations for the three dyes appear identical (or nearly so) to those calculated by the curve_fit()
function.
Note
The above approach assumes that the contribution of each dye is purely additive, so the contribution of each dye
to the total absorbance is only a function of its own concentration. This means, for example, that the interaction
of different dyes with each other in solution is assumed to be negligible.
Root finding is the process of determining where a function equals zero, 𝑓(𝑎, 𝑏, ...) = 0. Because any equation can be
rearranged to equal zero, this is a versatile way of solving an equation. If the function is univariate, 𝑓(𝑎) = 0, this task
may sometimes seem trivial even without optimization algorithms, but as the complexity of the equation or number of
variables increases, using optimization algorithms can be beneficial.
Like the minimization functions above, there are two related versions of the root finding functions: scipy.optimize.root()
and scipy.optimize.root_scalar(). The key difference is that the root() function can solve
both univariate and multivariate functions while root_scalar() can only solve univariate functions. Both
functions require a function, func, to find the root of, and the root() function also requires an initial guess, x0. The
root_scalar() function also allows an optional range of values that bracket the root, bracket=, to be provided
by the user.
scipy.optimize.root(func, x0)
scipy.optimize.root_scalar(func, bracket=(start, stop))
As a root finding example, we can locate the nodes in a radial wave function for the hydrogen 3s orbital. Because there is
only one variable, 𝑟, we can use the scipy.optimize.root_scalar() function. Below, we first define our radial
wave function as a Python function, orbital_3s.
def orbital_3s(r):
    wf = (2/27)*np.sqrt(3)*(2*r**2/9 - 2*r + 3)*np.exp(-r/3)
    return wf
Before we find the roots, let’s visualize the function to see what we are dealing with. The horizontal dotted line at y = 0 is
provided as a visual guide. The roots are located where the solid line of the wave function intersects with the dotted line.
[Figure: 3s radial wave function with a dotted zero line; x-axis: Radius, a0]
The function has two nodes, so our bracket= values will determine which we will end up solving for.
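A sketch of the two root searches is given below. The bracket values are assumptions read off the plot (any pair of radii on either side of a single node would work), and the wave function is repeated so the block runs on its own.

```python
import numpy as np
from scipy import optimize

def orbital_3s(r):
    """Hydrogen 3s radial wave function (r in units of a0)."""
    return (2/27)*np.sqrt(3)*(2*r**2/9 - 2*r + 3)*np.exp(-r/3)

# Each bracket must contain exactly one sign change of the function
node1 = optimize.root_scalar(orbital_3s, bracket=(1, 3))
node2 = optimize.root_scalar(orbital_3s, bracket=(5, 10))

print(node1.root, node2.root)
```

Printing node1 or node2 directly shows the full result object, including whether the search converged and which method was used.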
converged: True
flag: converged
function_calls: 11
iterations: 10
root: 1.901923788646684
method: brentq
converged: True
flag: converged
function_calls: 9
iterations: 8
root: 7.098076211353316
method: brentq
[Figure: 3s radial wave function with a dotted zero line and markers at Node 1 and Node 2; x-axis: Radius, a0]
The two dots above show the location of the two roots for this function which clearly are located on the nodes of the wave
function.
Further Reading
Exercises
Solve the following problems using Python in a Jupyter notebook and functions from the scipy.optimize module.
Any data file(s) referred to in the problems can be found in the data folder in the same directory as this chapter’s Jupyter
notebook. Alternatively, you can download a zip file of the data for this chapter from here by selecting the appropriate
chapter file and then clicking the Download button.
1. A warm or hot object emits radiation in a range of wavelengths described by Planck’s law shown below where B is
radiance, 𝜆 is wavelength of radiation, 𝑐 is the speed of light, ℎ is Planck’s constant, 𝑘 is Boltzmann’s constant, and
𝑇 is temperature of the object in K.
$$ B(\lambda) = \frac{2hc^2}{\lambda^5} \frac{1}{e^{hc/(\lambda k T)} - 1} $$
Determine the wavelength of greatest radiance for an object at 5000 K using a minimization function. Hint: be
sure to include an extra negative sign in the Python function that you define, and you will want to use either bounds
or brackets to prevent the minimization function from trying zero and generating a ZeroDivisionError.
2. The three isomers of ethyltoluene (i.e., ortho-, meta-, and para-) interchange under Friedel-Crafts conditions
facilitated by aluminum chloride. An investigation into this isomer equilibrium by Allen, R. H. et al. experimentally
determined the rate constants for the interconversion of these isomers. Using the rate constant data, the following
equilibrium constants were calculated: 𝐾𝑜𝑚 = 7.2, 𝐾𝑝𝑚 = 2.47, and 𝐾𝑜𝑝 = 2.9 where each equilibrium constant is
defined below.
$$ K_{om} = \frac{[meta]}{[ortho]}, \quad K_{pm} = \frac{[meta]}{[para]}, \quad K_{op} = \frac{[para]}{[ortho]} $$
Using this information, calculate the percentages of each isomer at equilibrium. Compare your percentages to those
provided in the above paper (in the abstract).
3. A sealed piston contains 0.32 moles of helium gas at 298 K. Determine the value of 𝑅 by performing a nonlinear
fit on the data below with the optimize.curve_fit() function and the ideal gas law.
$$ P = \frac{nRT}{V} $$
4. Below is the theoretical kinetic rate law for a chemical reaction of A → P catalyzed by 0.001 M of a metal catalyst
C. The table includes kinetic data for the rate, concentration of A, and the uncertainty in rate. Use the
optimize.curve_fit() function to determine values for 𝑘1 and 𝐾𝑒𝑞. Plot the data below with an overlay of calculated
values using the constants that you determined to show that they are reasonable values.
$$ Rate = \frac{k_1 K_{eq} [A][C]}{1 + K_{eq} [A]} $$
5. One method of solving acid-base equilibrium concentrations is through polynomials as demonstrated by F. Bamdad.
Below is a third-degree polynomial from the equilibria resulting from placing hydrocyanic acid (HCN) in water
where 𝑥 is the concentration of hydronium, K𝑎 is the acid equilibrium constant, K𝑤 is the equilibrium constant for the
autoionization of water, and [HCN]0 is the initial concentration of hydrocyanic acid. Solve for the concentration
of hydronium using a root finding algorithm in the scipy.optimize module assuming [HCN]0 = 6.8 × 10⁻⁶
M and K𝑎 = 6.2 × 10⁻¹⁰.
$$ x^3 + K_a x^2 - (K_w + [HCN]_0 K_a) x - K_w K_a = 0 $$
6. The van der Waals equation is a modified form of the ideal gas law but includes two correction factors that account
for intermolecular forces and the volume of gas molecules. These correction factors include constants 𝑎 and 𝑏 which
are gas-dependent, and the values of 𝑎 and 𝑏 can be calculated by fitting the van der Waals equation to pressure
versus volume data.
$$ \left(P + a\frac{n^2}{V^2}\right)(V - nb) = nRT $$
Load the file PV_CO.csv containing pressure and volume data for one mole of carbon monoxide at 298 K acquired
from the NIST Chemistry WebBook. Fit the van der Waals equation to this dataset to determine 𝑎 and 𝑏 values for
carbon monoxide.
CHAPTER 15: CHEMINFORMATICS WITH RDKIT
Cheminformatics can be thought of as the intersection of data science, computer science, and chemistry as a means of better
understanding and solving chemical problems. This chapter introduces a popular and versatile Python cheminformatics
library known as RDKit, which is useful for tasks such as:
• Visualizing molecules
• Reading SMILES or InChI molecular representations
• Quantifying structural features in molecules such as the number of rings or hydrogen bond donors
• Generating all possible stereoisomers of a molecular structure
• Filtering molecules based on structural features
This is a popular library for those in chemical computing research, with examples of its use being relatively easy to find
in the chemical literature. As of this writing, RDKit can be installed with either conda or pip (see section 0.2.1 and link
below). If you are using Google Colab, you will need to install RDKit at the top of your notebook (see section 0.2.2) as
it is not installed by default in Colab.
Installing RDKit
This chapter assumes the following imports from RDKit.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
RDKit is composed of a number of modules, including, but not limited to, the following.
Table 1 Key Modules and Submodules in the RDKit Library
Module/Submodule    Description
Chem                General purpose tools for chemistry. The RDKit website describes it as “A module for molecules and stuff”.
Chem.AllChem        Submodule containing more specialized or less often used features; needs to be imported separately from Chem
Chem.Descriptors    Submodule for quantifying molecular features
Chem.Draw           Submodule for visualizing molecules
ML                  Machine learning tools
The Chem and ML modules are the major modules in RDKit, but for this chapter, we will only be focusing on the Chem
module, which has already been imported above.
There are many ways to depict molecular structures on paper, such as Lewis structures, line-angle structural formulas,
and condensed notation. When representing molecules for a computer, machine-readable methods such as Simplified
Molecular-Input Line-Entry System (SMILES), the International Chemical Identifier (InChI), or mol files are preferred.
For example, the SMILES and InChI representations for benzene are listed below.
SMILES: c1ccccc1
InChI: 1S/C6H6/c1-2-4-6-5-3-1/h1-6H
These are not the most human-readable formats, but computer software such as RDKit is quite good at dealing with
them. We will not get into the structure and rules for interpreting these representations here because it is not really
necessary; reading and writing them is RDKit’s job. You can obtain these representations of a molecular structure from a
variety of sources, such as generating them from chemical drawing software (e.g., ChemDraw or ChemDoodle), searching
NIST Chemistry WebBook or NIH PubChem, and many other sources. In this chapter, we will mainly focus on SMILES
representations, but working with the InChI and MOL file formats is analogous and may be used from time to time herein.
The functions below can read and write molecular structures from a variety of formats, including SMILES, InChI, and
MOL files. When reading these molecular structures, a Molecule object (RDKit-specific class of object) is generated.
Table 2 Functions for Loading Molecular Structures
Function                 Description
Chem.MolFromSmiles()     Generates a Molecule object from SMILES representation
Chem.MolToSmiles()       Generates SMILES representation from a Molecule object
Chem.MolFromInchi()      Generates a Molecule object from InChI representation
Chem.MolToInchi()        Generates InChI representation from a Molecule object
Chem.MolFromMolFile()    Generates a Molecule object from a MOL file
As an example, we will load the structure of aspirin (acetylsalicylic acid) using the Chem.MolFromSmiles() function
from the Chem module.
aspirin = Chem.MolFromSmiles('O=C(C)Oc1ccccc1C(=O)O')
aspirin
If we check the object type, we find that it is a Molecule (rdkit.Chem.rdchem.Mol) RDKit object.
type(aspirin)
rdkit.Chem.rdchem.Mol
RDKit can generate other molecular representations such as InChI from the Molecule object as demonstrated below.
Chem.MolToInchi(aspirin)
'InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)'
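The conversion functions can be chained to move between representations. The sketch below is a small illustration, assuming an RDKit build with InChI support (the usual case for conda and pip installs): it converts aspirin to both text formats and then rebuilds a Molecule object from the InChI string.

```python
from rdkit import Chem

aspirin = Chem.MolFromSmiles('O=C(C)Oc1ccccc1C(=O)O')

# Molecule object -> text representations
smiles = Chem.MolToSmiles(aspirin)   # RDKit's canonical SMILES
inchi = Chem.MolToInchi(aspirin)

# Text representation -> Molecule object
from_inchi = Chem.MolFromInchi(inchi)

print(smiles)
print(inchi)
print(from_inchi.GetNumAtoms())      # heavy atoms only by default
```

Note that the SMILES written by MolToSmiles() is canonicalized, so it may not match the string originally supplied even though it describes the same molecule.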
In the above examples, RDKit provided an image of the molecule when Jupyter displayed the Molecule object. By
default, this generates a rather small and low-resolution image. The sharper images shown above come from changing
the notebook settings at the top of the notebook to produce SVG (Scalable Vector Graphics) images, which
are a vector graphic format.
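A configuration along the following lines switches notebook rendering to SVG. This is a sketch that assumes a Jupyter session in which RDKit's IPythonConsole module can be imported; it is a settings fragment rather than a standalone script.

```python
from rdkit.Chem.Draw import IPythonConsole

# Render Molecule objects as SVG instead of the default raster image
IPythonConsole.ipython_useSVG = True
```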
However, simply running the Molecule object does not provide easy control over the image. In this section, we will
generate images that can be saved along with visualizing grids of molecules and other visual representations.
To view the molecule, we can use the Draw.MolToImage(Mol) function, which takes one required positional
argument of the Molecule object (Mol). Optional keyword arguments can be used to set other parameters such as the
image size (size=) in pixels.
Draw.MolToImage(aspirin, size=(400,400))
If we want to save the image to a file, this is accomplished using the Draw.MolToFile() function, which
requires two pieces of information: the Molecule object and the name of the new file as a string.
Other optional parameters include size=, a tuple of the width and height, respectively, in pixels, and
imageType=, a string that designates the file format (‘png’ or ‘svg’).
Note
The PNG file format is a great general-purpose raster file format. Unless you know you need a different file
format, this is often a good choice. The SVG file format is a vector format which makes it easily editable in
software applications such as Inkscape.
Warning
It is important that the extension (e.g., “.png”) matches the imageType= argument or else your computer may
have difficulties opening the file.
Draw.MolToFile(aspirin, '[Link]',
               size=(500,500),
               imageType='svg')
Molecules can also be displayed in plots created by matplotlib. Below is an example of the trans-cinnamic acid structure
being displayed on top of the IR spectrum of the compound.
Tip
If you get a SyntaxWarning from a \, this is because Python interprets the backslash as an escape character. To get rid of
this warning, use a raw string, which is formatted like r'mytext'.
cinn_acid = Chem.MolFromSmiles(r'O=C(O)\C=C\c1ccccc1')
image = Draw.MolToImage(cinn_acid)
IR = [Link]('data/cinnamic_acid.CSV', delimiter=',')
plt.figure(figsize=(8,4))
plt.plot(IR[:-1,0], IR[:-1,1])
plt.gca().invert_xaxis()
plt.xlabel('Wavenumbers, cm$^{-1}$')
plt.ylabel('%, Transmittance')
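[Figure: IR spectrum of trans-cinnamic acid with the molecular structure overlaid; x-axis: Wavenumbers, cm⁻¹; y-axis: %, Transmittance]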
Whenever we are dealing with collections of molecules, it may be helpful to generate an image that includes multiple
molecular structures known as a grid. As an example, we will load the SMILES strings of the twenty common amino acids
from a text file using pandas and then load the Molecule objects for each structure into a single list called AminoAcids.
df = pd.read_csv('data/amino_acid_SMILES.txt', skiprows=2)
df
name SMILES
0 alanine C[C@@H](C(=O)[O-])[NH3+]
1 arginine [NH3+][C@@H](CCCNC(=[NH2+])N)C(=O)[O-]
2 asparagine O=C(N)C[C@H]([NH3+])C(=O)[O-]
3 aspartate C([C@@H](C(=O)[O-])[NH3+])C(=O)[O-]
4 cysteine C([C@@H](C(=O)[O-])[NH3+])S
5 glutamine [NH3+][C@@H](CCC(=O)N)C([O-])=O
6 glutamate C(CC(=O)[O-])[C@@H](C(=O)[O-])[NH3+]
7 glycine C(C(=O)[O-])[NH3+]
8 histidine O=C([C@H](CC1=CNC=N1)[NH3+])[O-]
9 isoleucine CC[C@H](C)[C@@H](C(=O)[O-])[NH3+]
10 leucine CC(C)C[C@@H](C(=O)[O-])[NH3+]
11 lysine C(CC[NH3+])C[C@@H](C(=O)[O-])[NH3+]
12 methionine CSCC[C@H]([NH3+])C(=O)[O-]
13 phenylalanine [NH3+][C@@H](CC1=CC=CC=C1)C([O-])=O
14 proline [O-]C(=O)[C@H](CCC2)[NH2+]2
15 serine C([C@@H](C(=O)[O-])[NH3+])O
16 threonine C[C@H]([C@@H](C(=O)[O-])[NH3+])O
17 tryptophan c1[nH]c2ccccc2c1C[C@H]([NH3+])C(=O)[O-]
18 tyrosine [NH3+][C@@H](Cc1ccc(O)cc1)C([O-])=O
19 valine CC(C)[C@@H](C(=O)[O-])[NH3+]
[<[Link] at 0x11c5e38b0>,
<[Link] at 0x11c5e3e60>,
<[Link] at 0x11c5e3b50>,
<[Link] at 0x11c5e2ff0>,
<[Link] at 0x11c5e3bc0>,
<[Link] at 0x11c5e30d0>,
<[Link] at 0x11c5e31b0>,
<[Link] at 0x11c5e3290>,
<[Link] at 0x11c5e3370>,
<[Link] at 0x11c5e3450>,
<[Link] at 0x11c5e3530>,
<[Link] at 0x11c5e3610>,
<[Link] at 0x11c5e3680>,
<[Link] at 0x11c5e3760>,
<[Link] at 0x11c5e3840>,
<[Link] at 0x11c5e3920>,
<[Link] at 0x11c5e3ae0>,
<[Link] at 0x11c5e3c30>,
<[Link] at 0x11c5e3d10>,
<[Link] at 0x11c5e3df0>]
To generate the grid, we will use the MolsToGridImage() function from the Chem.Draw submodule. This function
requires one positional argument of an array-like object (e.g., list, tuple, ndarray, etc.) containing the Molecule objects.
Other optional keyword arguments include the number of molecules per row (molsPerRow=), the pixel dimensions
of each molecule (subImgSize=), labels below each molecule (legends=), and the ability to make images in SVG
format (useSVG=). The image dimensions only matter if using a raster image format and require a tuple with the width
and height in that order. The legends= argument requires an array-like object with the labels in the same order as the
object containing the Molecule objects.
Draw.MolsToGridImage(AminoAcids,
                     molsPerRow=4,
                     subImgSize=(200,200),
                     legends=list(df['name']),
                     useSVG=True)
<IPython.core.display.SVG object>
Note
If you’re wondering what is up with the AllChem submodule, it stores lesser-used features separately from the
more mainstay features. By storing these features separately, it speeds up importing the main features. However,
the extra cost in time is not substantial, and this submodule contains some cool features such as generating all
possible stereoisomers and filtering molecules based on structural features.
RDKit also supports visualizing molecules inside pandas DataFrames using the AddMoleculeColumnToFrame()
function from the PandasTools submodule (rdkit.Chem.PandasTools). This function accepts a DataFrame
with a column of SMILES (smilesCol=) and adds a new column of Molecule objects. The molCol= parameter will
be the header for the new column.
ligands = pd.read_csv('data/[Link]')
ligands
ligand smiles
0 dppe c1ccc(P(CCP(c2ccccc2)c2ccccc2)c2ccccc2)cc1
1 acac CC(=O)CC(C)=O
2 acetonitrile CC#N
3 dcpe C1CCC(P(CCP(C2CCCCC2)C2CCCCC2)C2CCCCC2)CC1
4 HMDS C[Si](C)(C)N[Si](C)(C)C
5 PPh3 c1ccc(P(c2ccccc2)c2ccccc2)cc1
PandasTools.AddMoleculeColumnToFrame(ligands,
                                     smilesCol='smiles',
                                     molCol='molecules')
ligands
ligand smiles \
0 dppe c1ccc(P(CCP(c2ccccc2)c2ccccc2)c2ccccc2)cc1
1 acac CC(=O)CC(C)=O
2 acetonitrile CC#N
3 dcpe C1CCC(P(CCP(C2CCCCC2)C2CCCCC2)C2CCCCC2)CC1
4 HMDS C[Si](C)(C)N[Si](C)(C)C
5 PPh3 c1ccc(P(c2ccccc2)c2ccccc2)cc1
molecules
0 <rdkit.Chem.rdchem.Mol object at 0x11e619850>
1 <rdkit.Chem.rdchem.Mol object at 0x11e619af0>
2 <rdkit.Chem.rdchem.Mol object at 0x11e61a260>
3 <rdkit.Chem.rdchem.Mol object at 0x11e61a2d0>
4 <rdkit.Chem.rdchem.Mol object at 0x11e61a1f0>
5 <rdkit.Chem.rdchem.Mol object at 0x11e61a3b0>
The DataFrame can also be exported as an Excel spreadsheet complete with images using the SaveXlsxFromFrame()
function. This function accepts the DataFrame name, output file name, name of the Molecule object column, and
the image size as arguments.
15.3 Stereochemistry
RDKit can assign the stereochemistry of stereocenters, including chiral centers (R vs. S) and alkene stereocenters (E vs.
Z), determine the number of isomers possible, and even generate all possible isomers. Whether or not any stereochemistry
is designated in the SMILES representation or Molecule object is an important detail in carrying out the above tasks.
Even though a molecule may contain a chiral center or an alkene carbon, the stereochemistry around that atom may be
ambiguous.
The SMILES representation shows stereochemistry around a tetrahedral carbon with either @ or @@ and around an
alkene with \ and / symbols. If the SMILES representation does not include these symbols, the stereochemistry is not
indicated.
Table 3 SMILES Stereochemical Designations
The first task is to assign the absolute stereochemistry of a molecule. As an example, below we have a single isomer of
pent-3-en-2-ol which has a single chiral center and an alkene that could potentially be either E or Z. Let’s have RDKit
tell us the absolute configuration (i.e., R or S) of the tetrahedral chiral center and if the alkene is E or Z. First, we will
load the SMILES representation of this compound, O[C@@H](C)/C=C/C, which contains both @ and / symbols, so
we know the stereochemistry is assigned in this representation. When we visualize it below, we can see a wedge for the
methyl on the chiral center instead of a regular line, for example.
pentenol = Chem.MolFromSmiles('O[C@@H](C)/C=C/C')
pentenol
To obtain the absolute configuration (i.e., R or S), we can use the Chem.FindMolChiralCenters() function which
returns the absolute configuration and an index indicating which atom has that configuration.
Chem.FindMolChiralCenters(pentenol)
[(1, 'S')]
Our pent-3-en-2-ol isomer above has an S stereocenter. Being that pent-3-en-2-ol has only one chiral center, it is not
difficult to determine which atom has the stereochemistry, but if there are multiple chiral centers, it can get confusing.
The atom indices and stereochemistry labels can be displayed on the molecule (or hidden using False) with
the following code.
Warning
The index values are assigned by RDKit and are not the same thing as the numbers from chemical nomenclature.
IPythonConsole.drawOptions.addAtomIndices = True
IPythonConsole.drawOptions.addStereoAnnotation = True
pentenol
To obtain the stereochemistry of double bonds, we can iterate through the bonds and obtain the stereochemistry using the
GetStereo() bond method as shown below. There are three possible outputs listed below.
Output        Description
STEREONONE    No stereochemistry (often not a double bond)
STEREOE       E stereochemistry
STEREOZ       Z stereochemistry
Note
STEREONONE indicates there is no bond stereochemistry, which could be the result of a single or triple bond,
or it could be the result of the alkene bond having multiple equivalent substituents on the same carbon (e.g.,
2-methylpent-2-ene).
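The iteration over bonds described above can be sketched as follows. This is a minimal version that reloads the pent-3-en-2-ol isomer so it runs on its own; GetBonds() yields the bonds in RDKit's internal order, and GetStereo() is called on each one.

```python
from rdkit import Chem

pentenol = Chem.MolFromSmiles('O[C@@H](C)/C=C/C')

# Print the stereochemical label of every bond in the molecule
for bond in pentenol.GetBonds():
    print(bond.GetStereo())
```

Adding bond.GetBeginAtomIdx() and bond.GetEndAtomIdx() to the print call is one way to see which atoms each bond connects.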
STEREONONE
STEREONONE
STEREONONE
STEREOE
STEREONONE
In the above example, there are four bonds with no stereochemistry due to being single bonds, and there is one E bond
corresponding to the alkene. If there are multiple double bonds, it can be difficult to determine which bond has which
stereochemistry. In this case, either use the image as shown above or use additional bond methods (see section 15.6) to
obtain more information about the bonds.
As another example, below we will look at the bonds in 9-cis-retinoic acid, where we can see examples of all three possible
bond stereochemical assignments.
retinoic = Chem.MolFromSmiles(r'O=C(O)\C=C(\C=C\C=C(/C=C/C1=C(/CCCC1(C)C)C)C)C')
STEREONONE
STEREONONE
STEREONONE
STEREOE
STEREONONE
STEREOE
STEREONONE
STEREOZ
STEREONONE
STEREOE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
STEREONONE
retinoic
Another interesting feature of RDKit is the ability to determine the number of stereoisomers possible for a given structure
and to generate the different isomers. In both these applications, RDKit treats any explicitly assigned stereocenter as fixed
and will not allow it to be changed. For example, below we will again look at (2S, 3E)-pent-3-en-2-ol. Because the struc-
ture already designates this as the (2S, 3E) isomer, the stereochemistry of the chiral center and alkene cannot be changed.
As a result, when using the GetStereoisomerCount() function from the EnumerateStereoisomers module,
it returns a 1, indicating that there is only one stereoisomer possible with these constraints.
EnumerateStereoisomers.GetStereoisomerCount(pentenol)
In contrast, if we provide the GetStereoisomerCount() function hexan-2-ol without any stereochemistry desig-
nated (see above), it returns 2 as the number of stereoisomers. This is because (S)-hexan-2-ol and (R)-hexan-2-ol are
both possible isomers.
hexanol = Chem.MolFromSmiles('OC(C)CCCC')
EnumerateStereoisomers.GetStereoisomerCount(hexanol)
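The effect of fixing only some of the stereochemistry can be checked directly. The sketch below writes pent-3-en-2-ol three ways (the two partially specified SMILES strings are constructed here for illustration) and counts the stereoisomers RDKit considers possible for each. It imports the function by name, which is one of several equivalent import styles.

```python
from rdkit import Chem
from rdkit.Chem.EnumerateStereoisomers import GetStereoisomerCount

# pent-3-en-2-ol with different amounts of stereochemical detail
variants = {
    'fully assigned': 'O[C@@H](C)/C=C/C',   # chiral center and alkene fixed
    'alkene only': 'OC(C)/C=C/C',           # chiral center left open
    'unassigned': 'OC(C)C=CC',              # nothing specified
}

counts = {}
for label, smiles in variants.items():
    mol = Chem.MolFromSmiles(smiles)
    counts[label] = GetStereoisomerCount(mol)
    print(label, counts[label])
```

Each unassigned stereocenter or double bond doubles the count, so the value is best read as an upper limit on the number of distinct stereoisomers.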
The EnumerateStereoisomers module can also generate the different possible isomers, and again, it will only
generate isomers by changing stereochemical features that do not already have assigned configurations. If we again look
at hexan-2-ol, it generates two Molecule objects which are the two isomers.
isomers = list(EnumerateStereoisomers.EnumerateStereoisomers(hexanol))
isomers
[<rdkit.Chem.rdchem.Mol at 0x11e63ea70>,
 <rdkit.Chem.rdchem.Mol at 0x11e63ec00>]
IPythonConsole.drawOptions.addAtomIndices = False
IPythonConsole.drawOptions.addStereoAnnotation = True
Draw.MolsToGridImage(isomers)
<[Link] object>
As a more challenging example, arabinose has three chiral centers allowing for up to eight possible stereoisomers. Because
there is a lack of symmetry between the top and bottom (i.e., -CHO and -CH2OH are different), no meso compound can
exist, so it will have the full eight stereoisomers. The real challenge lies in drawing out all eight… unless we make RDKit
do the work for us like below.
arabinos = Chem.MolFromSmiles('O=CC(O)C(O)C(O)CO')
isomers = list(EnumerateStereoisomers.EnumerateStereoisomers(arabinos))
Draw.MolsToGridImage(isomers, useSVG=True)
<IPython.core.display.SVG object>
While the examples above mainly focus on stereoisomers from tetrahedral chiral centers, this also works with E/Z
stereoisomers. One limitation with RDKit is that it currently struggles to recognize non-alkene cis/trans stereoisomers
when there are stereocenters that are not chiral centers involved such as rings (see GitHub issue 5597). For example, with
1,2,3-trimethylcyclopropane, it believes there are eight stereoisomers when in fact there are only two.
Note
A chiral center is a specific example of a stereogenic center that is an sp3 atom with four different substituents
whereas a stereogenic center is any atom where exchanging any two substituents/ligands produces a different
stereoisomer. For example, 1,4-dimethylcyclohexane has two stereogenic centers (yields cis versus trans) but no
chiral centers.
In contrast, it has no difficulty identifying the three isomers for 1,2-dimethylcyclopropane because both methylated car-
bons are chiral centers.
DiCProp = Chem.MolFromSmiles('CC1CC1C')
CPropisomers = list(EnumerateStereoisomers.EnumerateStereoisomers(DiCProp))
Draw.MolsToGridImage(CPropisomers)
<[Link] object>
RDKit can be used to determine a number of key physical properties of molecules known as descriptors using the
Chem.Descriptors module. These can be useful for generating features for a large number of molecules for machine learning
or understanding structural trends in a body of chemical compounds.
There are numerous descriptor functions available which are callable using Descriptors.method() where
method() is the name of a descriptor function that accepts an RDKit Molecule object and returns a numerical value.
Below are a few examples of descriptor functions, with a more complete list available on the RDKit website.
Table 5 Examples of Molecular Descriptors
Function Description
MolWt() Molecular weight, assumes natural isotopic distribution
HeavyAtomCount() Number of non-hydrogen atoms
NOCount() Number of N and O atoms
NumAliphaticRings() Number of aliphatic rings
NumAromaticRings() Number of aromatic rings
NumSaturatedRings() Number of saturated rings
NumHAcceptors() Number of hydrogen bond acceptors
NumHDonors() Number of hydrogen bond donors
NumRadicalElectrons() Number of radical electrons
NumValenceElectrons() Number of valence electrons
NumRotatableBonds() Number of rotatable bonds
RingCount() Number of rings
Below we will look at a few of these descriptor functions demonstrated on the compound paclitaxel. Specifically, we will
generate the molecular weight, number of rings, number of aromatic rings, number of valence electrons, and number of
rotatable bonds.
b Tip
ptx = [Link]('CC1=C2[C@@]([C@]([C@H]([C@@H]3[C@]4([C@H](OC4)C[C@@H]'\
'([C@]3(C(=O)[C@@H]2OC(=O)C)C)O)OC(=O)C)OC(=O)c5ccccc5)'\
'(C[C@@H]1OC(=O)[C@H](O)[C@@H](NC(=O)c6ccccc6)c7ccccc7)O)(C)C')
# molecular weight
[Link](ptx)
853.9180000000003
# number of rings
[Link](ptx)
328
10
Among the descriptor methods is a long list of functions that look like fr_group() where group is the name or
abbreviation of a chemical functional group. These functions return an integer quantification of that functional group
present in the molecule. A table with a few examples is provided below, but there are over 80 of these functions available
in RDKit.
Table 6 Examples of Methods to Quantify Functional Groups
Tip
To see a complete list of functional groups, type [Link].fr_ into a code cell, press Tab for auto-
complete, and see the long list of options. If the functional group is not obvious from the name, place the computer
cursor inside the function’s parentheses and press Shift + Tab to see the Docstring description of what functional
group it quantifies.
We will again look at paclitaxel to see how many benzene rings, aliphatic alcohols, aromatic carboxyls, and esters are
present in the structure.
# number of esters
[Link].fr_ester(ptx)
Molecules can be searched for key structural features using the HasSubstructMatch() method, which returns
True or False depending on whether a structural pattern exists in a molecule. This method requires two RDKit
Molecule objects: one Molecule object (molecule) is checked for the presence of the other Molecule object's structure
(substructure) as shown below. There are optional keyword parameters such as useChirality=, which allows
chirality to be factored into whether there is a match. The default setting is useChirality=False.
molecule.HasSubstructMatch(substructure, useChirality=False)
As an example, we will look for the presence of a carbonyl (i.e., C=O bond) in acetone and pent-3-en-2-ol below, so the
substructure that we will search for is a C=O.
acetone = [Link]('CC(=O)C')
acetone
substructure = [Link]('C=O')
[Link](substructure)
True
[Link](substructure)
False
Not very surprisingly, the HasSubstructMatch() function returns True for acetone and False for the alcohol
because the latter has a single CO bond, not a double. If we change our substructure to CO, we are now searching for a
carbon-oxygen single bond (see Table 7), so acetone returns False while pent-3-en-2-ol returns True.
Table 7 SMILES Bond Order Notation
substructure = [Link]('CO')
substructure
[Link](substructure)
False
[Link](substructure)
True
For a more interesting set of examples, we can search our collection of 20 common amino acids (see section 15.2.2) for
key substructures. We will start by using glycine, the simplest of the common amino acids, as the substructure which
should return all 20 amino acids. As an extra step below, we will also orient all the amino acids in the same way with
respect to the substructure. That is, the substructural element that we are searching for in each amino acid will be oriented
the same way for all 20 amino acids.
# searches for substructure
substructure= [Link]('C(C(=O)[O-])[NH3+]')
matching_amino_acids = [AA for AA in AminoAcids if [Link](substructure)]
<[Link] object>
Indeed, it did return all 20 amino acids, and notice how the core structures of all amino acids are oriented in the same
direction. Now let us try something a little more interesting by searching for all amino acids with a benzene ring in them.
The substructural bonding pattern in this case is benzene itself, and the three aromatic amino acids are returned.
substructure = [Link]('c1ccccc1')
AA_with_pattern = [AA for AA in AminoAcids if [Link](substructure)]
[Link](AA_with_pattern)
<[Link] object>
It might be nice to still have the name labels for our three matches, so the above search is repeated but instead on a zip
object comprised of the names of the amino acids and the Molecule objects.
substructure = [Link]('c1ccccc1')
with_pattern = [AA for AA in AA_zipped if AA[1].HasSubstructMatch(substructure)]
name, mol_obj = zip(*with_pattern)  # split the matches into names and Molecule objects
[Link](mol_obj, legends=name)
<[Link] object>
RDKit allows access to information on specific atoms and bonds through the GetAtoms() and GetBonds() methods,
respectively. These methods return a sequence-like object that can be iterated through using a for loop to access
individual atoms or bonds. Using the following methods, the user can access or even modify various pieces of information
about the atoms or bonds. Table 9 and Table 10 below contain some key methods for working with atoms and bonds.
Table 9 Select Atom Methods
Function             Description
GetDegree()          Returns number of atoms bonded directly to it; includes hydrogens only if they are explicitly defined
GetAtomicNum()       Returns atomic number
GetChiralTag()       Determines if the atom is a chiral center and CW or CCW designation
GetFormalCharge()    Returns formal charge of atom
GetHybridization()   Returns hybridization of atom
GetIsAromatic()      Returns bool as to whether atom is aromatic
GetIsotope()         Returns isotope number if designated, otherwise returns 0
GetNeighbors()       Returns tuple of directly bonded atoms
GetSymbol()          Returns atomic symbol as a string
GetTotalNumHs()      Returns number of hydrogens bonded to the atom
IsInRing()           Returns bool designating if the atom is in a ring
SetAtomicNum()       Sets the atomic number to user-defined value
SetFormalCharge()    Sets formal charge to user-defined value
SetIsotope()         Sets isotope to user-defined integer value
If we generate a list populated with the degrees of atoms (i.e., number of other atoms bonded directly to it), you may
notice that there are no 4 values even though the methyl (i.e., -CH3) carbon should have four atoms attached to it. This is
because the hydrogen atoms are not explicitly designated in the structure (i.e., they are implicit), so they are not counted.
[[Link]() for atom in [Link]()]
[1, 3, 1, 2, 3, 2, 2, 2, 2, 3, 3, 1, 1]
We can count the number of implicit hydrogens using the GetNumImplicitHs() method, and the third value is a 3
making it the methyl carbon.
[[Link]() for atom in [Link]()]
[0, 0, 3, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1]
Tip
If you want to make all hydrogens explicitly defined, this is accomplished using the [Link](mol) function.
An example is in section 16.3.1.
We can also use these atom methods to change values and attributes of various atoms. For example, we can set the
isotopes of the carbonyl carbons (i.e., C=O) to 13C. This is accomplished with the following code that iterates through all
the atoms and finds the carbonyl carbons by testing for atoms that have an atomic number of 6, are not aromatic, and have
no hydrogens, and then setting the isotope value to 13. The molecular weight is calculated before and after the isotopes
are changed for comparison.
[Link](aspirin)
180.15899999999996
[Link](13)
print([Link](aspirin))
aspirin
182.14370968
The molar mass has increased due to two of the carbon atoms being isotopically labeled, and we can see in the image
which two carbons were isotopically labeled. It is worth noting that the molecular weight before isotopic
labeling assumes a natural distribution of isotopes, which for carbon is 98.9% 12C and 1.1% 13C. In the isotopically
labeled structure, the two carbonyl carbons are 100% 13C.
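The size of this increase can be checked with simple arithmetic. The sketch below is not an RDKit call; it assumes a standard atomic weight of 12.011 for natural-abundance carbon and an isotopic mass of roughly 13.00335 for 13C, and the variable names are only illustrative.

```python
# Assumed constants (not taken from the text above)
C_AVERAGE = 12.011        # standard atomic weight of carbon at natural abundance
C13_MASS = 13.00335484    # isotopic mass of carbon-13

mw_unlabeled = 180.159    # molecular weight of aspirin reported before labeling
n_labeled = 2             # two carbonyl carbons are set to carbon-13

# swap the natural-abundance carbon mass for the carbon-13 mass at each labeled site
mw_labeled = mw_unlabeled + n_labeled * (C13_MASS - C_AVERAGE)
print(round(mw_labeled, 4))
```

Under these assumptions, the result agrees with the labeled molecular weight reported above to within rounding.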
Using bond methods, we can perform analogous types of operations except that bonds have different attributes than atoms.
A table of selected bond methods is provided below.
Table 10 Select Bond Methods
Function            Description
GetBeginAtom()      Returns first atom in bond
GetEndAtom()        Returns second atom in bond
GetBondType()       Returns type of bond (e.g., SINGLE, DOUBLE, AROMATIC)
GetIsAromatic()     Returns bool as to whether bond is aromatic
GetIsConjugated()   Returns bool as to whether bond is conjugated
IsInRing()          Returns bool as to whether bond is in ring
SetBondType()       Sets bond type
SetIsAromatic()     Sets bool designating if a bond is aromatic
As a demonstration, we will examine the bonds in the structure of acetone and change the carbonyl double bond to a
single bond. This is done by searching for a double bond, setting it to a single bond, and then changing the formal charges
of the atoms attached to that bond.
acetone
Further Reading
Exercises
Complete the following exercises in a Jupyter notebook using RDKit. You are encouraged to also use data libraries such
as NumPy or pandas to support your solutions. Any data file(s) referred to in the problems can be found in the data folder
in the same directory as this chapter’s Jupyter notebook. Alternatively, you can download a zip file of the data for this
chapter from here by selecting the appropriate chapter file and then clicking the Download button.
1. Load the structure for morphine into RDKit using either a SMILES or InChI representation. You will need to
either generate one of these representations using chemical drawing software or find one online from a free resource.
a) Visualize the structure of morphine and save it as an SVG image file.
b) Use RDKit to determine the number of chiral centers in the structure. Your code should output an integer value,
not just a list of chiral centers.
c) Use RDKit to determine the number of hydrogen bond acceptors in the structure.
d) Use RDKit to determine the number of rings in the structure.
2. Load the amino_acid_SMILES.txt file and use RDKit for the following.
a) Determine the absolute configuration (i.e., R vs. S) of the 𝛼-carbon for all the chiral amino acids. Most are the
same, but one is an exception. Which is it?
b) How many amino acids have two chiral centers?
3. Load the organic_molecules.txt dataset containing SMILES representations of a range of organic molecules.
a) Using descriptors, generate a list containing the SMILES representations of only primary and secondary aliphatic
alcohols.
b) Using pattern matching, generate an image containing only primary alcohols from the file. To help
you along, here is the SMARTS representation of a primary alcohol for the pattern matching:
Chem.MolFromSmarts('[CH2][OH]')
c) Calculate the percentage of heavy-element (i.e., not with hydrogens) bonds that are C-O bonds.
d) Calculate the percentage of carbon atoms in a ring.
4. Use RDKit to generate an image showing all isomers of 1,2-dimethylcyclohexane. You will need to look up the
SMILES or other representation first.
CHAPTER 16: BIOINFORMATICS WITH BIOPYTHON & NGLVIEW
Bioinformatics is the field of working with biological or biochemical data using computing resources, and while the
underlying techniques for working with biological data are fundamentally the same as what has been seen so far, this field is large
and significant enough to warrant its own chapter. More importantly, bioinformatics contains a multitude of specialized
file formats, making this a significant hurdle in working with these data. The good news is that biological/biochemical file
formats are usually text files like those seen in the previous chapters, and there are Python libraries available to facilitate
the parsing and working with these file formats and data. This chapter focuses on a few common file formats, parsing
them with both our own Python code and using the Biopython library to perform the heavy lifting.
The Biopython library is among the well-known bioinformatics Python libraries handy for working with biological and
biochemical data. It will need to be installed in Jupyter or Google Colab because it is not a default library. As of this
writing, Biopython can be installed using pip by pip install biopython, and a Conda option is also available.
Once installed, it is imported as Bio. This chapter assumes the following imports.
import Bio
from Bio import PDB, SeqIO, SeqUtils, Align
Among the most fundamental data in bioinformatics are sequences, which simply provide the order of monomers in
a sequence of nucleotides or amino acids. For protein sequences, these monomers are mainly the 20 common amino
acids, with other less frequent amino acids and other species possible, and for nucleic acid sequences, the monomers are
nucleotides. In this section, we will work with sequences inside Biopython to perform various operations such as sequence
alignment and translating mRNA sequences into peptide sequences.
Inside Biopython, sequences are often stored as a Sequence object, which looks like a string wrapped in
Seq() such as below. This object contains many of the same methods as a Python string plus some extras, so you can still
iterate through Sequence objects with a for loop as well as index, slice, reverse, and change the case of them like a string.
Seq('GCCGGCAGTCACACGCACAGGC')
There are numerous file formats that can store sequence data, but for the examples in this section, we will focus on the
FASTA file format, which only holds the sequence data and a small amount of metadata (i.e., data about the data). FASTA
files are text files that look like the following when opened in a text editor. A FASTA file can contain a single or multiple
sequence entries with the first line of each entry beginning with a >. The rest of this line includes helpful information
about the sequence, such as the organism and what specific molecule it relates to. The rest of the text block is sequence
information. There is no strict rule on how many letters can be contained in each line, but 70 is a common length.
Note
While the FASTA lines may not look the same length as shown here, they will be the same width when opened
in a monospaced font.
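To make the FASTA layout described above concrete, here is a minimal pure-Python sketch that splits FASTA text into header and sequence pairs. It is not how Biopython parses files; the function name parse_fasta and the two example entries are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    entries = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith('>'):
            if header is not None:        # store the previous entry
                entries.append((header, ''.join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # a sequence may span several lines
    if header is not None:                # store the final entry
        entries.append((header, ''.join(chunks)))
    return entries

# hypothetical contents of a two-entry FASTA file
example = """>seq1 hypothetical entry A
GGAGAGTGAC
GCCGGCAGTC
>seq2 hypothetical entry B
ATCTTTCG
"""
records = parse_fasta(example)
print(records)
```

Each line starting with > begins a new entry, and the sequence lines that follow are joined into a single string.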
SeqIO.read('file_name', 'file_type')
SeqIO.parse('file_name', 'file_type')
The SeqIO.read() function returns a Sequence Record object, which has a few attributes shown in the table below.
The most important attribute is the sequence itself, which is stored as a Sequence object.
Table 1 Sequence Record Attributes
Attribute Description
id Returns the sequence ID from the file’s first line
description Returns a description from the file’s first line
seq Returns the sequence as a Sequence object
name Returns the sequence name from the file’s first line (may be same as ID)
[Link]
Seq('GGGUGCUCAGUACGAGAGGAACCGCACCC')
In the event we have a file containing multiple entries, the SeqIO.parse() function is required. The function works
the same way as the SeqIO.read() version except that a one-time use iterator object is returned that contains each
entry from the FASTA file. To extract this information, we need to iterate over it using a for loop. Data from each entry
can be accessed using the same attributes as with the SeqIO.read() function. This is demonstrated below using a FASTA
file for a protein structure of Norwegian rat hemoglobin.
fasta_data = SeqIO.parse('data/rcsb_pdb_3DHT.fasta', 'fasta')
seq_list = []
for entry in fasta_data:
    seq_list.append(entry.seq)
seq_list
[Seq('VLSADDKTNIKNCWGKIGGHGGEYGEEALQRMFAAFPTTKTYFSHIDVSPGSAQ...KYR'),
Seq('VHLTDAEKAAVNGLWGKVNPDDVGGEALGRLLVVYPWTQRYFDSFGDLSSASAI...KYH')]
Because the iterator is a one-time use object, attempting to iterate over it again, like below, fails to return any data, so be
sure to attach any data to a variable or append it to a list.
for entry in fasta_data:
    print(entry.seq)
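This one-time behavior is a general property of Python iterators rather than something specific to Biopython. A minimal sketch, using a hypothetical generator function as a stand-in for the file parser:

```python
def entries():
    """Hypothetical stand-in for a parser that yields one entry at a time."""
    yield 'VLSADDK'
    yield 'VHLTDAE'

data = entries()
first_pass = [seq for seq in data]    # consumes the iterator
second_pass = [seq for seq in data]   # nothing is left to yield

print(first_pass)    # ['VLSADDK', 'VHLTDAE']
print(second_pass)   # []
```

Storing the results of the first pass in a list, as done with seq_list above, keeps the data available for reuse.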
One piece of information we can extract from a nucleotide sequence is the GC content. In DNA, for example, there
are two complementary strands hydrogen bonded together which contain the base pairs adenine(A)/thymine(T) and
guanine(G)/cytosine(C), so the number of adenines equals the number of thymines and the number of guanines equals
the number of cytosines. However, the number of A/T pairs does not necessarily equal the number of G/C pairs. The
GC content of DNA is the fraction of total bases that are G/C, which can be calculated using the number (n) of G and C
bases divided by the total number of all bases in the sequence.
$$\text{GC content} = \frac{\text{GC bases}}{\text{sequence length}} = \frac{n_G + n_C}{n_G + n_C + n_A + n_T}$$
Below, we will calculate the GC content of a DNA sequence in a FASTA file using Biopython’s gc_fraction(seq)
function, which accepts a Biopython sequence and returns the GC content in fraction form.
DNA = SeqIO.parse('data/DNA_sequence_drago.fasta', 'fasta')
rat_seq = [x.seq for x in DNA]
SeqUtils.gc_fraction(*rat_seq)
0.5296912114014252
Sometimes there are characters in a DNA sequence other than A, T, C, and G due to ambiguities among other reasons.
An N means that the base is unidentifiable while S means it is either C or G and W means it is either A or T. The
gc_fraction() function provides an ambiguous= parameter that can be used to decide how to deal with ambiguous
characters. Below are the three string options for the ambiguous= parameter where remove is the default setting.
Table 2 Settings for gc_fraction() ambiguous= Parameter
Options Description
'remove' Default setting; only uses ‘ATCGSW’ characters and ignores the rest
'ignore' Uses ‘GCS’ characters for GC count and rest of characters for sequence length
'weighted' Applies weights to various characters effectively forming a weighted average
Our sequence contains some N characters, so if we set it to ignore, the GC content value is expected to decrease due
to a larger denominator in the equation above versus the default remove option.
Note
The rat_seq is embedded in a list. To remove it from the list, it is “unpacked” using *rat_seq. Using
rat_seq[0] would also accomplish the same thing.
SeqUtils.gc_fraction(*rat_seq, ambiguous='ignore')
0.5247058823529411
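The effect of the denominator can be seen with a small pure-Python sketch that follows the 'remove' and 'ignore' descriptions in Table 2. It is an approximation written for illustration, not Biopython's implementation; the function name gc_fraction_sketch and the demo sequence are hypothetical.

```python
def gc_fraction_sketch(seq, ambiguous='remove'):
    """Approximate the 'remove' and 'ignore' behaviors described in Table 2."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in 'GCS')
    if ambiguous == 'remove':
        # denominator counts only the characters listed for 'remove'
        length = sum(seq.count(base) for base in 'ATCGSW')
    elif ambiguous == 'ignore':
        # denominator is the full sequence length, ambiguous characters included
        length = len(seq)
    else:
        raise ValueError("this sketch only handles 'remove' and 'ignore'")
    return gc / length if length else 0.0

demo = 'GGCCATNN'   # hypothetical sequence with two ambiguous N characters
print(gc_fraction_sketch(demo, 'remove'))   # 4 GC bases out of 6 counted characters
print(gc_fraction_sketch(demo, 'ignore'))   # 4 GC bases out of 8 total characters
```

Because the N characters enlarge the denominator only under 'ignore', that setting gives the smaller value, which is the same direction of change seen in the Biopython results above.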
In protein synthesis, the coding (or informational) strand of DNA is transcribed to mRNA, which is then translated to
a protein sequence. DNA can also replicate by unwinding and using additional complementary nucleotides to bond the
coding and template strands. Biopython makes performing digital analogues of these operations relatively simple using
the following functions.
Table 3 Methods for Performing Transcription, Translation, and Replication
Function                   Description
transcribe()               Transcribes coding DNA strand to mRNA (maintains 5’ → 3’ direction)
translate()                Translates mRNA sequence (5’ → 3’) to a peptide sequence (N → C)
complement()               Converts 5’ → 3’ nucleotide sequence to the 3’ → 5’ complementary sequence
reverse_complement()       Converts 5’ → 3’ DNA strand to 5’ → 3’ complementary sequence
reverse_complement_rna()   Converts 5’ → 3’ RNA strand to 5’ → 3’ complementary sequence
complement_rna()           Converts 5’ → 3’ RNA strand to 3’ → 5’ complementary sequence
replace(old, new)          Replaces old items in sequence with new (can also be used to replace spaces)
While some functions in Biopython accept strings or Sequence objects, the functions above work exclusively with Sequence
objects. The good news is that if you have a string, it is easy to convert to a Sequence object using the Seq() function
like below.
coding_DNA = [Link]('GGAGAGTGACGCCGGCAGTCACACGCACAGGCTGCAGCAACGAAAGAT')
coding_DNA
Seq('GGAGAGTGACGCCGGCAGTCACACGCACAGGCTGCAGCAACGAAAGAT')
We can perform transcription using the transcribe() method, which operates on a DNA strand and assumes that
the DNA strand is the coding (or informational) strand. It also assumes that the sequence is in the 5’ → 3’ direction and
returns the mRNA sequence also in the 5’ → 3’ direction.
mRNA = coding_DNA.transcribe()
mRNA
Seq('GGAGAGUGACGCCGGCAGUCACACGCACAGGCUGCAGCAACGAAAGAU')
If you find yourself with the template strand, this can be converted to the coding strand using the
reverse_complement() function, like below, which takes a DNA strand in the 5’ → 3’ direction and returns the
complementary strand also in the 5’ → 3’ direction. This coding strand can then be transcribed to mRNA.
template_DNA = [Link]('ATCTTTCGTTGCTGCAGCCTGTGCGTGTGACTGCCGGCGTCACTCTCC')
coding_DNA = template_DNA.reverse_complement()
coding_DNA.transcribe()
Seq('GGAGAGUGACGCCGGCAGUCACACGCACAGGCUGCAGCAACGAAAGAU')
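To show what these two operations do to the letters themselves, here is a pure-Python sketch that works on plain strings. It is an illustration rather than Biopython's implementation, it handles only the uppercase characters A, T, C, and G, and the helper names are hypothetical.

```python
# lookup table mapping each DNA base to its complement
COMPLEMENT = str.maketrans('ATCG', 'TAGC')

def reverse_complement(dna):
    """Return the complementary strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def transcribe(coding_dna):
    """Return the mRNA for a coding strand by replacing T with U."""
    return coding_dna.replace('T', 'U')

template = 'ATCTTTCGTTGCTGCAGCCTGTGCGTGTGACTGCCGGCGTCACTCTCC'
coding = reverse_complement(template)
print(coding)
print(transcribe(coding))
```

The complement is taken base by base, and reversing the result restores the 5’ → 3’ reading direction, which is why the coding strand can be transcribed by a simple T to U substitution.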
Once we have our mRNA sequence, we can translate it to a peptide sequence using the translate() method, which
is performed using the standard codon table.
® Note
mRNA.translate()
Seq('GE*RRQSHAQAAATKD')
By default, this function will translate the entire mRNA sequence, disregarding any stop codons. To heed the stop codons,
set the to_stop= parameter to True.
mRNA.translate(to_stop=True)
Seq('GE')
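The logic behind translation and the to_stop= option can be sketched in pure Python with the standard codon table. This is an illustration only, not Biopython's implementation; the names CODON_TABLE and translate are hypothetical, and the sketch assumes an uppercase mRNA string containing only U, C, A, and G.

```python
from itertools import product

BASES = 'UCAG'
AMINO_ACIDS = ('FFLLSSSSYY**CC*W' 'LLLLPPPPHHQQRRRR'
               'IIIMTTTTNNKKSSRR' 'VVVVAAAADDEEGGGG')
# standard codon table; product() yields codons in the order UUU, UUC, UUA, UUG, UCU, ...
CODON_TABLE = {''.join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(mrna, to_stop=False):
    """Translate mRNA (5' to 3') into one-letter amino acid codes."""
    peptide = []
    for i in range(0, len(mrna) - len(mrna) % 3, 3):   # whole codons only
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == '*' and to_stop:
            break                      # end at the first stop codon
        peptide.append(aa)
    return ''.join(peptide)

mrna = 'GGAGAGUGACGCCGGCAGUCACACGCACAGGCUGCAGCAACGAAAGAU'
print(translate(mrna))                 # GE*RRQSHAQAAATKD
print(translate(mrna, to_stop=True))   # GE
```

The third codon, UGA, is a stop codon, so it appears as * in the full translation and ends the peptide when to_stop is True.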
Biopython can perform both global and local pairwise alignments of sequences, including nucleic acids and proteins.
The difference between these types of alignments is that global pairwise alignment attempts to align the entirety of two
sequences of at least somewhat similar length, while local pairwise alignment attempts to align subsequences of the two
sequences. Local alignment essentially attempts to find common regions between the two sequences. The alignment
process generates a score based on user-defined rules and attempts to maximize this score to generate the “best” alignment.
For example, aligned bases in two DNA sequences might be awarded +1, while misaligned bases are penalized -1.
Pairwise sequence alignment in Biopython starts with creating a PairwiseAligner object, which requires the type of align-
ment ('global' or 'local'). Optionally, you can set the scoring parameters, which dictate how a match, mismatch,
starting a gap, extending a gap, and ending a gap affect the score. By default, +1 is awarded for every match, and mis-
matches and gaps are all 0. Below, the PairwiseAligner is set to 'global', and scoring parameters are adjusted as
shown.
aligner = Align.PairwiseAligner(mode='global',
                                match_score=1,
                                mismatch_score=-1,
                                open_gap_score=-1,
                                extend_gap_score=-0.5)
Once we have created the PairwiseAligner object, we can use the align() method to return the optimal alignment
between the two sequences based on the scoring parameters. It is important to note that there can be multiple optimal
sequence alignments (i.e., tied for best score) based on our scoring parameters, so the align() method can return
multiple alignments.
Below, the aligned sequences are stored in the variable alignments. When we check the length of this object, we find
it contains 15 alignments, which can be viewed by indexing or iteration.
seq1 = 'GGAGAGTGACGCCGGCAGTCACACGCACAGGCTGCAGCAACGAAAAGTT'
seq2 = 'GGAGAGTGACGCCGGGCAGTCACACGCTCAGGCTGCAGCAACGAAAAAGTTA'
alignments = aligner.align(seq1, seq2)
len(alignments)
15
print(alignments[0])
target 0 GGAGAGTGACGCC-GGCAGTCACACGCACAGGCTGCAGCAACG-AAAAGTT- 49
0 |||||||||||||-|||||||||||||.|||||||||||||||-|||||||- 52
query 0 GGAGAGTGACGCCGGGCAGTCACACGCTCAGGCTGCAGCAACGAAAAAGTTA 52
The score from the optimal alignments can be viewed using the score attribute. Keep in mind the score is affected by
not only the quality of the alignment based on the alignment parameters but also sequence length, so it is not necessarily
useful for comparing alignments between different pairs of sequences.
44.0
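The score can be reproduced by hand from the printed alignment and the scoring parameters set earlier. The following is a minimal sketch that scores two already-aligned strings column by column; it is not the algorithm Biopython uses to find alignments, and the function name score_alignment is hypothetical.

```python
def score_alignment(target, query, match=1, mismatch=-1,
                    open_gap=-1, extend_gap=-0.5):
    """Score two equal-length gapped strings column by column."""
    score = 0.0
    gap_in = None                      # which row the current gap run is in
    for t, q in zip(target, query):
        if t == '-' or q == '-':
            row = 'target' if t == '-' else 'query'
            # the first column of a gap run opens it; later columns extend it
            score += extend_gap if gap_in == row else open_gap
            gap_in = row
        else:
            score += match if t == q else mismatch
            gap_in = None
    return score

# the two rows of the first alignment printed above
target = 'GGAGAGTGACGCC-GGCAGTCACACGCACAGGCTGCAGCAACG-AAAAGTT-'
query = 'GGAGAGTGACGCCGGGCAGTCACACGCTCAGGCTGCAGCAACGAAAAAGTTA'
print(score_alignment(target, query))
```

The alignment has 48 matching columns, one mismatch, and three single-position gaps, so the total is 48 - 1 - 3 = 44.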
In this section, we will work with two common file formats for storing biochemical data: PDB and mmCIF. Both of these
file formats are text files, so information can always be extracted using pure Python code you wrote yourself. However,
there are also preexisting tools that can make this process substantially easier, such as Biopython or scikit-bio (see Further
Reading). Below, you will see demonstrations of both pure Python and Biopython approaches with an emphasis on using
Biopython.
Protein Data Bank (PDB) and Macromolecular Crystallographic Information File (mmCIF) files are designed to hold
protein sequence and structural information, while the FASTA file format only holds sequence data for proteins and nucleic
acids. The FASTA file format is simpler than the PDB and mmCIF file formats, but there is a significant amount of
structural data, addressed below, contained in the latter formats that goes beyond the sequence.
The PDB file format is a classic file format for holding protein sequence and structural information, including the
information listed below. While the PDB format is slowly being replaced by mmCIF (see section 16.2.2), it is
still quite common and worth looking at.
• Amino acid sequence of each strand
• Location and identity of non-amino acid species
• xyz coordinates of atoms in the crystal structure, including trapped solvents
• Connectivity information
• Metadata about the protein (e.g., source organism, resolution, etc.)
• Secondary structural information
First, we need a PDB file of a protein structure, which can be downloaded for free from the RCSB Protein Data Bank.
The Download Files menu on the top right provides a number of file format options, including PDB Format. In the
example below, we will look at the Vanadium nitrogenase VFe protein structure in the [Link] file.
The PDB file is organized where each line holds a different type of information, and a label in all caps on the far left of
each line indicates what type of information is stored in that line. Below are some key labels (i.e., record type), but this is
far from a comprehensive list. Data within a line is identifiable based on the character position in a line. This is in contrast
to many other file types where data in a single line are distinguished by separators such as commas or spaces. For more
information on the PDB file format, see the Protein Data Bank website. If you are using JupyterLab, you can double-click
the PDB file to open it and view the contents.
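Because fields are located by character position, reading a value comes down to slicing the line. Column positions in a format description are normally counted from 1 and are inclusive, whereas Python slices count from 0 and exclude the end index. The sketch below uses a hypothetical helper, get_field, and an artificial 80-character line built only for illustration.

```python
def get_field(line, start, end):
    """Return the text in 1-indexed, inclusive columns start through end."""
    return line[start - 1:end].strip()

# build a hypothetical 80-character record with values placed by column
record = [' '] * 80
record[0:5] = 'SHEET'      # columns 1-5: the record type label
record[38:40] = '-1'       # columns 39-40: a two-character value
line = ''.join(record)

print(get_field(line, 1, 6))           # SHEET
print(int(get_field(line, 39, 40)))    # -1
```

Subtracting one from the starting column and leaving the ending column unchanged converts the 1-indexed positions into the equivalent Python slice.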
Table 4 Selected PDB File Record Types
Before we rely on Biopython to extract information from data files, we will use pure Python. As a short demonstration,
the code below opens the PDB file and appends each line to a list called data. We can examine a few of the lines using
slicing to see information about the structure of the protein. The lines shown below provide information about the helices
and sheets in the protein structure.
file = 'data/[Link]'
data = []
with open(file, 'r') as f:
    for line in f:
        data.append(line)
data[1190:1200]
'SHEET 3 AA1 6 GLN A 354 SER A 360 1 N MET A 358 O TYR A 380 \n',
'SHEET 4 AA1 6 LYS A 330 THR A 335 1 N MET A 331 O GLN A 354 \n',
'SHEET 5 AA1 6 VAL A 401 THR A 404 1 O PHE A 403 N ALA A 332 \n',
'SHEET 6 AA1 6 TYR A 419 ASN A 421 1 O VAL A 420 N ILE A 402 \n']
As an exercise, we can extract information about the 𝛽-sheets in the protein. Specifically, we will look at the relative
directions (sense) of adjacent strands, which can run in the same direction (parallel) or in opposite directions
(antiparallel) relative to the previous strand. This is indicated by the integer in positions 39-40 of a SHEET line, which is 0
for the first strand of a 𝛽-sheet, 1 for a strand parallel with the previous strand, or -1 for a strand antiparallel with the
previous strand. The function below extracts this information by opening the PDB file and moving through each line of
the file. If a line begins with SHEET, the function appends the relative direction to a list, and it returns the populated list.
def get_sheet_direction(file):
    '''Accepts a PDB file path (string) and returns a list
    of values indicating if a strand starts a beta sheet (0),
    is parallel to the previous strand (1), or is
    antiparallel to the previous strand (-1).
    '''
    structure_list = []
    with open(file, 'r') as f:
        for line in f:
            # SHEET lines store the sense in positions 39-40
            if line.startswith('SHEET'):
                structure_list.append(int(line[38:40]))
    return structure_list
sheet_sense = get_sheet_direction('data/[Link]')
print(sheet_sense)
↪1, 0, 1, 1, 1]
[Link]('Sheet Sense')
[Link]('Count')
[Figure: bar plot of Count versus Sheet Sense (-1, 0, 1) for this protein structure]
According to the graph above, parallel 𝛽-sheet strands are significantly more prevalent in this protein structure than
antiparallel strands. This might be different for other proteins, so we will expand this analysis to a folder full of protein
structures.
current_directory = [Link]()
data_folder = [Link](current_directory, 'data/proteins')
sheet_sense = []
[Figure: bar plot of Count versus Sheet Sense (-1, 0, 1) for the folder of protein structures]
The trend over a larger sample of proteins is that antiparallel is significantly more common than parallel, so it seems that
the 7aiz protein structure is an exception to the typical trend. However, this is only a little over a dozen structures, so it
would require a much larger dataset to be certain of this trend.
Next, we will use the Biopython library to read data from PDB and other structural files. One of the appeals of using
Biopython is that the user does not need to understand the structure of the file format; Biopython parses the files allowing
you to focus on higher-level concerns.
First, we need to import the PDB module of the Biopython library with the import Bio.PDB command if you have
not done so already (see start of this chapter). Biopython, like SciPy, requires that individual modules be imported one at
a time instead of the entire library (i.e., import Bio is not enough). You are welcome to import functions individually
(e.g., from Bio.PDB import PDBParser), but herein we will only import the module using import Bio.PDB
so that the code more clearly shows the source of every function. The PDB module provides tools for dealing with
the 3D structural data of macromolecules such as proteins and DNA. To parse a PDB file, we first create a parser object
using the PDB.PDBParser() function.
parser = PDB.PDBParser()
We will then use the get_structure() function to read in data from a file. This function requires two positional
arguments - a name for the structure and the name of the file. Both arguments are strings, and the structure name can be
anything you like.
structure = parser.get_structure('7aiz', 'data/[Link]')
Despite the name, the PDB module contains tools for dealing with other file formats such as mmCIF, PQR, and MMTF.
The mmCIF file format is the successor to the PDB format, making it an increasingly common file format. The good
news is that parsing different structural files is almost identical, as Biopython deals with most of the file format details
behind the scenes. The only difference in dealing with mmCIF files versus PDB in Biopython is that we use the
PDB.MMCIFParser() function to read the mmCIF file instead of PDB.PDBParser(), so mmCIF code would look like
the following.
parser = PDB.MMCIFParser()
structure = parser.get_structure('7aiz', 'data/[Link]')
Biopython is also capable of writing structures to new PDB or mmCIF files, but by default, it will not include much of the
metadata (e.g., resolution, name of structure, authors, etc.) and information about secondary structures in the new files.
Note
Additional information can be included in the written file, but the process is a little involved. See the official
Biopython documentation for more information.
The general methodology is to first create a writing object using either Bio.PDB.PDBIO() or Bio.PDB.MMCIFIO() for creating
a new PDB or mmCIF file, respectively. Next, use the set_structure() method on the writing object to load the
data from an individual structure. Finally, write the file using the save() method, providing it with the name of the
new file as a string.
# write a new PDB
io = Bio.PDB.PDBIO()
io.set_structure(structure)
io.save('new_structure.pdb')  # the output file name here is only a placeholder
The structural data extracted from the PDB or mmCIF by Biopython is organized in the hierarchical order of structure
→ model → chain → residue → atom. This means that models are contained within the structure, chains are contained
within each model, residues are contained within each chain, and atoms are contained within each residue. The structure
is the protein, the model is a particular 3D model of the protein, the chain is a single peptide chain in the protein, the
residue is a single amino acid residue in the chain, and the atom is each atom within a given amino acid residue (Table 6).
Table 6 Levels of Structure from PDB Data
Level      Description
Structure  Protein structure; may contain multiple models
Model      Particular 3D model of the protein (usually only one)
Chain      Peptide chain
Residue    Amino acid residue in a given chain
Atom       Atoms in a particular amino acid residue
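To make the chain → residue → atom nesting concrete, the sketch below builds a toy version of it from a few fixed-width ATOM records using only the standard library. It illustrates the idea and is not how Biopython is implemented. The helper names (atom_line, parse_atoms) and the three atoms are made up for this example, and the structure and model levels are left out for brevity.

```python
from collections import defaultdict


def atom_line(serial, name, resname, chain, resseq, x, y, z):
    """Format one simplified, fixed-width PDB ATOM record (hypothetical helper)."""
    return (f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atoms(lines):
    """Nest ATOM records as chain ID -> residue number -> atom name -> (x, y, z)."""
    model = defaultdict(lambda: defaultdict(dict))
    for line in lines:
        if not line.startswith("ATOM"):
            continue  # skip any non-atom records
        name = line[12:16].strip()
        chain = line[21]
        resseq = int(line[22:26])
        coord = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        model[chain][resseq][name] = coord
    return model


# made-up coordinates for two residues in two chains
records = [
    atom_line(1, "N", "LYS", "A", 10, 11.104, 6.134, -6.504),
    atom_line(2, "CA", "LYS", "A", 10, 11.639, 6.071, -5.147),
    atom_line(3, "N", "GLY", "B", 1, 2.100, 1.300, 0.700),
]
model = parse_atoms(records)

# walk the hierarchy from the top down
for chain_id, residues in model.items():
    for resseq, atoms in residues.items():
        for atom_name in atoms:
            print(chain_id, resseq, atom_name)

# or jump straight to one atom with keys
ca_coord = model["A"][10]["CA"]
```

The nested loops and the key-based lookup in the last line mirror the two ways of reaching an atom in a hierarchy of this kind.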
Note
If the file contains a crystal structure, there is likely only one model, but if the structure came from NMR spectroscopy,
there are often multiple models.
While PDB files can contain multiple models of a protein, most only contain one. Even though there is only one model
in our data, we will need to access the first (and only) model using indexing. For the first protein model, use
structure[0], and if there were a second, it would be structure[1].
protein_model = structure[0]
Because of the hierarchical structure, each level of structure can be accessed by iterating through the level above it. For
example, the following code will append all atoms in every residue in every chain in the protein model to a list called
atoms.
atoms = []
for chain in protein_model:
    for residue in chain:
        for atom in residue:
            atoms.append(atom)
atoms[:10]
[<Atom N>,
<Atom CA>,
<Atom C>,
<Atom O>,
<Atom CB>,
<Atom CG>,
<Atom CD>,
<Atom N>,
<Atom CA>,
<Atom C>]
This can add up to a large number of for loops in your code. Alternatively, you can get more direct access to the different
levels of structure using the following methods that yield a generator.
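Note
A generator function contains yield in place of return and only produces each item upon request, which saves memory.
Python's range() behaves similarly in that it computes values on demand, although strictly speaking it is a lazy
sequence rather than a generator.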
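As a minimal illustration of the yield behavior described in the note, the sketch below defines a small generator function over a made-up dictionary of chains and residue numbers. The function name and the data are hypothetical and are not part of Biopython.

```python
def residue_numbers(chains):
    """Yield (chain ID, residue number) pairs one at a time."""
    for chain_id, residues in chains.items():
        for number in residues:
            yield chain_id, number


# hypothetical chain contents used only for this demonstration
chains = {"A": [1, 2, 3], "B": [1, 2]}

gen = residue_numbers(chains)
first = next(gen)        # nothing is computed until an item is requested
remaining = list(gen)    # consume whatever is left
print(first, remaining)
```

Once a generator has been consumed it is empty, which is why the examples in this section copy the items into a list when they are needed more than once.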
For example, the following appends all residues in the protein model to a list and displays the first ten residues.
res_list = []
for residue in protein_model.get_residues():
    res_list.append(residue)
res_list[:10]
Parts of the protein structure can also be accessed using keys (i.e., the IDs) of the various levels of structure. This
does require more knowledge of the structure beforehand, though. To first get access to the IDs, you can iterate
through a structure and use the get_id() method to see all of the substructure IDs. Alternatively, you can use the
get_unpacked_list() method to get a list of all substructures of an object. For example, below we
iterate through the protein model to get the strand IDs. The same can be done by iterating through strands to obtain the
residue IDs or through residues to obtain the atom IDs. The strand and atom IDs are letters (strings), while standard
residues are indexed by their integer sequence number (get_id() on a residue returns a tuple that contains this integer).
A
B
C
D
E
F
strand_A = protein_model['A']
strand_A
<Chain id=A>
residue_10 = strand_A[10]
residue_10
As a demonstration of both the get_id() and get_unpacked_list() approaches, below we can see the atoms
present in a lysine residue.
Note
The CA is the 𝛼-carbon in the peptide backbone while C is the carbonyl carbon. Additional carbons may be
present depending upon the identity of the amino acid.
residue_10.get_unpacked_list()
[<Atom N>,
<Atom CA>,
<Atom C>,
<Atom O>,
<Atom CB>,
<Atom CG>,
<Atom CD>,
<Atom CE>,
<Atom NZ>]
N
CA
C
O
CB
CG
CD
CE
NZ
residue_10['CA']
<Atom CA>
Once we can access the atoms, residues, and strands, information such as the identity, 3D coordinates,
bond angles, and more can be extracted. For example, below is a table of selected atom attributes/functions.
Table 8 Selected Atom Attributes/Functions
Attribute/Function Description
get_name() Returns the name of the atom as a string
get_coord() Returns the xyz coordinates of the atom as an array
get_vector() Returns the xyz coordinates of the atom as a vector object
transform() Rotates or translates the atomic coordinates along the xyz axes
The following code is used to obtain the 3D coordinates as arrays for all atoms in the protein model.
atom_coords = []
for atom in protein_model.get_atoms():
    atom_coords.append(atom.get_coord())
atom_coords[:5]
Attribute/Function Description
get_resname() Returns the name of the residue as a three-letter code string
get_segid() Returns the segment ID if available
get_atoms() Returns the atoms in the residue as a generator
get_unpacked_list() Returns atoms in the residue as a list
res_list = []
for residue in protein_model.get_residues():
    res_list.append(residue.get_resname())
res_list[:5]
There are a lot of interesting data obtainable from the strands, but getting access to these data is a little more involved.
We first need to instantiate (i.e., create) a polypeptide builder object using Bio.PDB.PPBuilder() and then build the
Polypeptide objects using the build_peptides() method. The build_peptides() method accepts the structure
as the one required argument and by default only returns standard amino acids in the peptide chains unless the
aa_only=False argument is included. The peptide information in the example below is stored in the variable peptides,
which shows the six peptide chains in this particular protein structure along with sequence identifier integers that
indicate the position of the amino acid along the peptide chain.
ppb = Bio.PDB.PPBuilder()
peptides = ppb.build_peptides(structure[0])
peptides
We can iterate through the list of Polypeptide objects (peptides) to get the individual peptide chains. With the peptide chains,
we can obtain information about the peptide chain, such as the names of amino acids, phi (𝜙) and psi (𝜓) angles, etc.,
using the various methods tabulated below.
Table 10 Selected Strand Attributes/Functions
Attribute/Function Description
get_sequence() Returns the sequence of each strand using single-letter amino acid codes
get_phi_psi_list() Returns a list of phi and psi dihedral angles in radians
get_ca_list() Returns list of alpha carbons
get_theta_list() Returns a list of theta angles in radians
get_tau_list() Returns list of tau torsional angles in radians
In the example below, we iterate through the peptide strands in peptides and print the theta angles in radians.
↪5825173083130313)]
↪6121663625568363)]
As an example application, we can generate a Ramachandran plot, which visualizes the trends of the psi (𝜓) versus phi
(𝜙) dihedral angles along peptide chains. While the omega (𝜔) dihedral angles tend to stay near 180° because the
peptide bond is planar, the psi (𝜓) and phi (𝜙) dihedral angles tend to fall in distinct ranges.
current_directory = os.getcwd()
data_folder = os.path.join(current_directory, 'data/proteins')
parser = Bio.PDB.PDBParser()
ppb = Bio.PDB.PPBuilder()
phi[:10]
[np.float64(-1.3150748393961473),
np.float64(-2.7159390905663523),
np.float64(-2.909570150157909),
np.float64(-1.9350566725748244),
np.float64(-2.3853630088972273),
np.float64(1.3306807975618),
np.float64(-1.6592559311514123),
np.float64(-1.788129930665399),
np.float64(-1.3609620204292667),
np.float64(-1.014135279691033)]
[Figure: Ramachandran plot of psi versus phi; both axes span roughly -3 to 3, i.e., the angles are in radians]
You may notice that the first and last dihedral angles were sliced off the list of phi and psi angles (last two lines of code).
This is because there are no phi (𝜙) values for the first amino acid and no psi (𝜓) values in the last amino acid of a
strand. Dihedral angle measurements require four atoms, and the terminal amino acids are missing one of the required
four atoms. For example, phi (𝜙) dihedral angles are measured along the N-C𝛼 bond of a C(O)-N-C𝛼-C(O) chain of
atoms, but the first amino acid only has N-C𝛼-C(O).
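To see why four atoms are needed, the sketch below computes a dihedral angle directly from four points using only the math module. This is a generic textbook formula written out for illustration, not Biopython's implementation, and the coordinates are made up so that the answers are easy to check by eye.

```python
import math


def sub(a, b):
    return tuple(i - j for i, j in zip(a, b))


def dot(a, b):
    return sum(i * j for i, j in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in radians about the p1-p2 bond for four points."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)   # normals of the two planes
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.atan2(y, x)


# made-up coordinates: the first three points are fixed, the fourth is moved
p0, p1, p2 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
cis = dihedral(p0, p1, p2, (1.0, 0.0, 1.0))
perpendicular = dihedral(p0, p1, p2, (0.0, 1.0, 1.0))
trans = dihedral(p0, p1, p2, (-1.0, 0.0, 1.0))
print(cis, perpendicular, trans)
```

Removing any one of the four points leaves only a single plane, so no angle between planes can be defined, which is the situation at the ends of a peptide chain.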
The Ramachandran plot above is in radians, which can be converted to degrees (1 radian = 180/𝜋 degrees) as is done below.
import math
psi_deg = [rad * (180 / math.pi) for rad in psi]
phi_deg = [rad * (180 / math.pi) for rad in phi]
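As a quick cross-check, the standard library's math.degrees() function performs the same conversion. The short sketch below applies both approaches to the first three phi values printed earlier in this section.

```python
import math

# first three phi values (in radians) from the output shown earlier
phi_rad = [-1.3150748393961473, -2.7159390905663523, -2.909570150157909]

phi_manual = [rad * (180 / math.pi) for rad in phi_rad]
phi_builtin = [math.degrees(rad) for rad in phi_rad]

for manual, builtin in zip(phi_manual, phi_builtin):
    print(round(manual, 2), round(builtin, 2))
```

Either form is fine. The explicit multiplication makes the conversion factor visible, while math.degrees() is harder to mistype.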
[Figure: Ramachandran plot with the angles converted to degrees; the psi (degrees) axis spans roughly -150 to 150]
There are many pieces of software for viewing molecular structures directly from your desktop, but there are currently few
for viewing structures within a Jupyter notebook. This section provides a brief introduction to nglview for interactively
viewing molecular structures. Additional information on nglview can be found on the nglview documentation page.
Tip
Nglview often requires a restart after installation before working. As of this writing, I am having good luck with the
most recent version, 3.1.2, working in JupyterLab for my students and me.
Nglview is not a standard library for Anaconda or Colab, so it needs to be installed, and as of this writing, nglview can
be installed using either pip or conda. Below, it will be imported with the nv alias. A restart may be required after
installation.
import nglview as nv
Molecular structures can be loaded using a number of different sources, including directly from files, from RDKit Molecule
objects, from Biopython structure objects, and from psi4 molecules, among others. Below is a table of some key functions
for loading molecular structures.
Table 11 A Selection of Nglview Functions for Loading Structural Data
Function Description
nv.show_file() Loads from a file (e.g., PDB or mmCIF) on your computer
nv.show_pdbid() Fetches data from RCSB database when provided a PDB ID (e.g., ‘7aiz’)
nv.show_rdkit() Loads structure from a 3D RDKit Molecule object
nv.show_biopython() Loads data from a Biopython structure object
As our first example, we will load a file using the show_file() function, which accepts a protein data file such as
PDB. The structure is displayed in an interactive window where clicking and dragging rotates the molecule, and scrolling
zooms in and out. The size of this window can be expanded or contracted using the little gray arrow control(s) on the
right corners of the display window.
Note
The following examples are no longer interactive. If you run this code in your own notebook, you will be able to
interact with the structures.
prot = nv.show_file('data/[Link]')
prot
The next example accepts the four-letter ID for a protein crystal structure and fetches the data from an online database.
prot = nv.show_pdbid('3hpb')
prot
We can also view a molecule loaded from a Biopython structure object (see section 16.2.2) using the
show_biopython() function.
RDKit Molecule objects can also be viewed in nglview using the show_rdkit() function, but first, a 3D representation
of the molecule needs to be generated using the AllChem.EmbedMolecule(mol_object) function. Many
SMILES representations do not include all of the hydrogens, so implicit hydrogens need to be added using the
Chem.AddHs() function. The visualization of glucose 6-phosphate from a SMILES representation is shown below.
from rdkit import Chem
from rdkit.Chem import AllChem
mol = Chem.MolFromSmiles('O[C@H]1[C@H](O)[C@@H](COP(O)(O)=O)OC(O)[C@@H]1O')
mol = Chem.AddHs(mol) # add H's
AllChem.EmbedMolecule(mol) # generate 3D structure
G6P = nv.show_rdkit(mol)
G6P
The way molecules are represented by nglview can be modified using add_representation(rep), which takes
a variety of string parameters indicating the representation. A few examples of representations are listed below, with a
more complete list provided on the nglview documentation page.
The default representation is cartoon, which shows the peptide backbone as strands and ribbons (for secondary structures).
It is important to clear the default representation using the clear_representations() method before adding
a new representation. Otherwise, both representations will be drawn on top of each other, which is only useful if that is
the effect you want.
Table 12 Selected Molecular Representations
Representation Description
cartoon Cartoon with strands and ribbons; sidechains not shown
ball+stick Atomic spheres and stick bonds; sidechains shown
licorice Balls and sticks where atoms and bonds have the same radii; sidechains shown
rope Backbone is shown as a tube; sidechains not shown
spacefill Spacefilling model with atoms showing atomic size; sidechains shown
surface Shows the surface of the molecule; other surface parameters available
Note
The selection='protein' argument indicates to only show the protein and not surrounding waters and other
non-peptides. See below for more about the selection= argument.
prot_3hpb = nv.show_file('data/[Link]')
prot_3hpb.clear_representations()
prot_3hpb.add_representation('ball+stick', selection='protein')
prot_3hpb
prot_3hpb = nv.show_file('data/[Link]')
prot_3hpb.clear_representations()
prot_3hpb.add_representation('licorice', selection='protein')
prot_3hpb
Different sections of a protein can be represented differently using the selection= parameter in the
add_representation() method. This includes using residue numbers from the structure file or using a variety
of string arguments that select different types of structures. A short list of options is included below with a more
complete list on the nglview documentation page.
Table 13 Selected Options for the selection= Parameter
As an example, we can show a protein structure with the backbone as the default cartoon and the side chains using a
licorice structure as shown below.
prot_1rpy = nv.show_file('data/[Link]')
prot_1rpy.add_representation('licorice', selection='sidechains')
prot_1rpy
The colors can be customized using the color= parameter. This parameter accepts either a color name as a string (e.g.,
'blue') or the name of a color scheme that color codes the molecule based on features such as hydrophobicity or chain.
Table 14 Selected Options for the color= Parameter
Option Description
chainid Each chain is colored differently
chainname Each chain is colored differently
element Uses standard element color coding (for licorice or ball+stick representations)
hydrophobicity Sections colored by peptide hydrophobicity
moleculetype Each molecule colored by type (e.g., peptide chain versus sulfate)
residueindex Color changes gradually down the peptide chain
resname Each peptide side chain is assigned a color
sstruc Colors based on secondary structure
Not all of the above options work for every representation, and some only work on the peptide side chains.
prot_1rpy = nv.show_file('data/[Link]')
prot_1rpy.clear_representations()
prot_1rpy.add_representation('cartoon', color='hydrophobicity')
prot_1rpy
prot_1rpy = nv.show_file('data/[Link]')
prot_1rpy.add_representation('licorice', color='hydrophobicity')
prot_1rpy
prot_1rpy = nv.show_file('data/[Link]')
prot_1rpy.clear_representations()
prot_1rpy.add_representation('cartoon', color='sstruc')
prot_1rpy
prot_1rpy = nv.show_file('data/[Link]')
To view the molecule with a surface, use the add_surface() method, which takes a number of optional parameters.
Possibly the most important is opacity=, which accepts a float from 0 to 1 indicating how opaque the surface is, with
1 being fully opaque and 0 being completely transparent.
full_surface = nv.show_biopython(structure)
full_surface.add_surface(opacity=0.3)
full_surface
Another useful parameter is selection=, which operates as described in section 16.3.2 so that only the
selected components have a surface around them. In the example below, only acidic amino acids are wrapped in a surface.
acidic = nv.show_file('data/[Link]')
acidic.clear_representations()
acidic.add_representation('licorice')
acidic.add_surface(selection='acidic',
                   opacity=0.4,
                   color='pink')
acidic
We can also use and and or to produce more complex selections. In the example below, a surface is only wrapped around
the backbones of residues that are acidic or basic and that are located in strand B.
acidbase = nv.show_biopython(structure)
acidbase.clear_representations()
acidbase.add_representation('licorice')
acidbase.add_surface(selection=':B and backbone and (basic or acidic)',
                     opacity=0.3,
                     color='lightblue')
acidbase
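Longer selection strings can become hard to read, so one option is to assemble them from parts. The sketch below is plain string handling with two hypothetical helper functions (all_of and any_of are not part of nglview); it simply rebuilds the selection string used above.

```python
def all_of(*terms):
    """Join selection terms so that every one of them must match."""
    return " and ".join(terms)


def any_of(*terms):
    """Join selection terms so that any one may match; parenthesized for grouping."""
    return "(" + " or ".join(terms) + ")"


selection = all_of(":B", "backbone", any_of("basic", "acidic"))
print(selection)
```

The resulting string can then be passed to the selection= parameter in the same way as a string typed by hand.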
Nglview also supports an interactive graphical user interface (GUI) within Jupyter notebooks. From the panel on the
right, the user can add representations and change selections using the same selection keywords as above (see Table 12).
From the File menu on the top left, new files can be opened or proteins can be fetched using the protein ID.
prot = nv.show_pdbid('1rpy')
prot.clear_representations()
prot.add_representation('cartoon')
prot.add_representation('licorice', selection='ring')
prot.gui_style = 'ngl'
prot
Further Reading
CHAPTER 17: COMMAND LINE & SPYDER
Up to this point, we have been running all of our Python scripts through the IPython environment from either a Jupyter
notebook or a Python interpreter. A third way to run Python code is to save it as text files and run the code from the
computer’s or Jupyter’s terminal. The advantage of this approach is that it is more practical for larger scripts and more
convenient for doing repetitive tasks like reformatting instrument data. You will need access to the terminal to run your
Python script, which is discussed below.
The terminal is the command line interface used in macOS and Unix-like systems such as the Linux and BSD families
and allows users to perform a wide array of tasks from installing and running software to file management. If you are
using Linux or Mac, launch the terminal from your applications, and if you are on Windows, you will likely first need to
activate the Bash command line before proceeding. Alternatively, if you are using the JupyterLab version of Jupyter, you
can launch a terminal window from the Launcher menu (see section 0.2, Figure 2). In section 17.2, you will learn to run
Python scripts from the terminal, but before you can run a script, you need to be able to navigate your file system and find
your Python scripts. This section is a brief primer on navigating the file system through the Terminal.
When you open the terminal, you are greeted with a line that looks something like the following, where Comp is your
computer name and Me is your account user name. After the $ sign is where you type your commands.
Comp:~Me$
From here, you can navigate your file system. The first thing you will want to know is where on the file system you are
currently looking. This is known as the current working directory, which can be determined with the command pwd
(print working directory).
$ pwd
/Users/Me
This means that we are currently in the home directory for the user Me. To view the contents of the directory, we can list
its contents using the ls command.
$ ls
You may see files listed in the terminal that you cannot see when manually looking in a folder. This is normal. Computers
often contain invisible files for items such as icons, and it is often best not to alter or delete these invisible files.
To change the current working directory, use the cd command. This can be used either incrementally by stepping one
directory at a time or by providing the full path name such as /Users/Me/Documents/Scripts/.
$ cd Desktop
This only allows the user to navigate into folders. To back out of a folder, cd .. (space with two periods) is used.
$ cd ..
There is certainly much more that can be done in the terminal, but this is enough of a foundation for you to find and run
scripts as we will do below.
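The same navigation can also be carried out from inside Python with the os module, which may be useful once scripts need to locate their own data files. The sketch below mirrors pwd, ls, and cd inside a temporary folder so that it does not touch any real files; the folder name Scripts is only an example.

```python
import os
import tempfile

start = os.getcwd()                      # like pwd
print(start)

with tempfile.TemporaryDirectory() as scratch:
    os.mkdir(os.path.join(scratch, "Scripts"))   # throwaway folder for the demo
    os.chdir(scratch)                    # like cd with a full path
    contents = os.listdir()              # like ls
    os.chdir("Scripts")                  # like cd Scripts
    inside = os.path.basename(os.getcwd())
    os.chdir("..")                       # like cd ..
    back = os.path.samefile(os.getcwd(), scratch)
    os.chdir(start)                      # leave before the folder is deleted

print(contents, inside, back)
```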
Now that you know the basics of the terminal command line, we can run our first script. Open a text editor of
your choice. Be careful if you write Python code in a regular word processor (e.g., Word, LibreOffice, Pages, etc.) as it
may save extra formatting in any text file generated. A better option is to either use Spyder introduced in section 17.5 or
(easiest) select Python File from the JupyterLab launcher. Write some Python code in a new file and save it as a text file
titled first_script.py. The .py extension does not do anything to the file; it just indicates to other software that
this text file is a Python script. For this demonstration, I’ll include the following code in my text file.
import numpy as np
rng = np.random.default_rng()
rdn = rng.integers(0, 100)
print(rdn)
Next, open the terminal and navigate to the directory (i.e., folder) containing the above script file and type the following
into the terminal.
$ python first_script.py
66
You just ran your first script from the command line! The output only includes what you print in the Python script.
One key difference between a script run in the command line and Python code run in a Jupyter notebook is that when
running from the command line, if you want something displayed, you need to explicitly instruct this action using the
print() function. In contrast, the Jupyter notebook automatically displays the value of the last expression in a code cell
when it is not assigned to a variable.
An alternative way to run the above file without having to navigate to the folder is to provide the full (absolute)
path to the file as shown below.
$ python /Users/Me/Desktop/first_script.py
98
This might seem like a lot of typing. One handy shortcut is to type python followed by a space and then drag-and-drop
the file into the terminal window. This will result in the file path and name being automatically pasted into the terminal
window.
$ python /Users/Me/Desktop/first_script.py
65
There are often times when running a script from the command line that you want to be able to include additional inputs or
information to the Python script. This may come in the form of a user input or extra files. Below are ways to accomplish
this, making your script more interactive.
In the event you want the user to be able to input values, Python includes an input() function that prompts the user
to provide information. For example, if we want to write a script to calculate molecular weights of simple hydrocarbon
molecules based on the number of hydrogen and carbon atoms, it would be helpful to allow the user to input the number
of hydrogen and carbon atoms instead of altering the script itself. The argument inside the input() function is what
is displayed in front of the user to prompt an input. It is important to note that the input() function provides the user
input as a string. Because we are expecting integers, we need to convert these strings to integers before calculating the
molecular weight of the molecule, as has been done below.
H = int(input('H = '))
C = int(input('C = '))
# atomic masses of 12.01 (C) and 1.01 (H) reproduce the outputs shown in this section
print(round(C * 12.01 + H * 1.01, 2))
Save the above script in a text file named [Link] and run it. You are prompted to provide the number of hydrogens and
carbons before a molecular weight is calculated and printed.
$ python [Link]
H = 4
C = 1
16.05
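Because input() always returns text, a script of this kind usually benefits from a small amount of checking. The sketch below separates the conversion and the calculation into two hypothetical functions (read_count and hydrocarbon_mw). The atomic masses of 12.01 and 1.01 are assumptions chosen because they reproduce the 16.05 shown above, and the strings are passed in directly so the example runs without prompting.

```python
def read_count(text):
    """Convert the text typed by a user into a non-negative integer."""
    value = int(text)          # raises ValueError for input such as 'four'
    if value < 0:
        raise ValueError("atom counts cannot be negative")
    return value


def hydrocarbon_mw(n_h, n_c):
    """Molecular weight from atom counts using rounded atomic masses."""
    return round(n_c * 12.01 + n_h * 1.01, 2)


# In a script, the strings would come from input(), for example:
# n_h = read_count(input('H = '))
print(hydrocarbon_mw(read_count("4"), read_count("1")))
```

Keeping the calculation in its own function also makes it possible to reuse it regardless of how the numbers reach the script.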
17.3.2 sys.argv
Another approach to allowing the user to provide additional information is to provide all the required information in the
same line as calling the script. For example, when running the above hydrocarbon molecular weight script, you might
expect it to look like the following.
$ python [Link] 4 1
16.05
We can instruct Python to grab the information behind the script file name using the argv list from the sys
module. This list holds everything typed after python as strings, which can be accessed using indexing. The above
input generates the following list from sys.argv.
['[Link]', '4', '1']
Now it is just a matter of indexing and converting strings to integers as is done below.
import sys
H = int(sys.argv[1])
C = int(sys.argv[2])
print(round(C * 12.01 + H * 1.01, 2))
$ python [Link] 8 3
44.11
The above method is ideal for accepting file names and extensions as they can be dragged into the terminal more easily
than typed. The downside to this approach is that the user needs to be aware of what information to provide the script and
in what order. This is analogous to the difference between a keyword argument and positional argument in a function.
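If the ordering problem becomes a nuisance, the standard library's argparse module allows command line values to be named, which is the command line counterpart of keyword arguments. The sketch below is one possible arrangement; the option names --hydrogens and --carbons and the atomic masses are assumptions made for this example.

```python
import argparse


def parse_counts(argv):
    """Read atom counts from named command line options."""
    parser = argparse.ArgumentParser(description="Hydrocarbon molecular weight")
    parser.add_argument("--hydrogens", type=int, required=True)
    parser.add_argument("--carbons", type=int, required=True)
    return parser.parse_args(argv)


# In a script this would be parse_counts(sys.argv[1:]); a list is passed
# directly here so the example runs anywhere.
args = parse_counts(["--carbons", "3", "--hydrogens", "8"])
print(round(args.carbons * 12.01 + args.hydrogens * 1.01, 2))
```

With named options, the values can be given in either order, and argparse converts them to integers and reports a readable error if one is missing.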
As a way to combine Python scripts in external .py files and Jupyter notebooks, it is possible to run these Python scripts
from the Jupyter notebook using the %run magic command. As an example, let's say we have the following code in a
file called dist.py.
pt1 = (1, 5, 9)
pt2 = (9, 0, 3)
def distance(a, b):  # Euclidean distance between two points
    return sum((i - j)**2 for i, j in zip(a, b))**0.5
We can run this code from a Jupyter notebook using the following command. Like we’ve seen previously, Jupyter assumes
the referenced file is in the same directory as the Jupyter notebook unless otherwise indicated.
%run dist.py
pt1
(1, 5, 9)
distance(pt1, pt2)
11.180339887498949
Now that the dist.py file has been executed, the variables and function are available in the Jupyter notebook as if this
code had been run in a Jupyter code cell.
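The %run magic only exists inside IPython and Jupyter. In a plain Python script, a roughly comparable effect can be had by loading the file with importlib, as sketched below. The file written here is a stand-in whose contents, including the body of the distance function, are assumed for the purpose of the example, and the names live on a module object rather than being copied into the current namespace.

```python
import importlib.util
import os
import tempfile

source = (
    "pt1 = (1, 5, 9)\n"
    "pt2 = (9, 0, 3)\n"
    "def distance(a, b):\n"
    "    return sum((i - j) ** 2 for i, j in zip(a, b)) ** 0.5\n"
)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "dist_demo.py")   # hypothetical file name
    with open(path, "w") as handle:
        handle.write(source)
    spec = importlib.util.spec_from_file_location("dist_demo", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)   # runs the file from top to bottom

result = module.distance(module.pt1, module.pt2)
print(module.pt1, result)
```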
17.5 Spyder
While using a text editor to write your scripts works just fine, you may long for some of the features of Jupyter notebooks,
like how it automatically color codes text based on syntax and provides easy access to function docstrings. To get some of
these features back, you can use an Integrated Development Environment (IDE). There are many to choose from, but here
we will address Spyder (Scientific Python Development Environment) as it is specifically tailored to scientific applications
and comes with the Anaconda installation of Python.
There are two methods of launching Spyder. The first is to type spyder in the terminal.
$ spyder
The second method is to press the launch button for Spyder in Anaconda Navigator (Figure 1). The latter method is often
slower because it requires that Navigator be first launched.
Once Spyder has launched, you will be greeted by an interface divided into three windows (Figure 2). The left window
is a text editor where code is written. Like the Jupyter notebook, it color codes your Python code based on syntax and
provides docstrings and helpful notices. To run the code written here, you can either save it as a text file and run it as
described above, or you can press the run button (►) at the top of the window. The latter approach is particularly handy
during the development phase of a script as it allows you to quickly test and modify your script without having to jump
between Spyder and the terminal. The smaller window on the bottom right is a Python terminal where you can test out
code and see the output of your code if you run your code inside Spyder. The top right window is useful as a file navigator
and as a variable explorer depending upon the tabs you choose. In Figure 2, it is a variable explorer which shows each
variable in memory and what it contains. This is a powerful tool when debugging code as it allows you to quickly see what
the code is doing and where things are not working.
Figure 2 The Spyder interface with the text editor (left), variable explorer (top right), and interpreter (bottom right).
So when should you use a Jupyter notebook and when should you use Spyder? The decision is often a matter of preference,
but if you are doing interactive data analysis, Jupyter notebooks are typically the better choice. This is particularly true if
you need to share your analysis and results with others. If you are writing large blocks of code, Spyder is likely a better
choice of environment. As an example, if you wish to perform complex mining of information from an external dataset
and then analyze the resulting information, you might want to write the data mining code in Spyder and then run the data
analysis in a Jupyter notebook.
Further Reading
Exercises
Complete the following problems by writing Python scripts either in a text editor or Spyder and run them from the
terminal. JupyterLab, the newest version of Jupyter, includes a text editor if you wish to use it, but do not use a Jupyter
notebook for any of these problems! Any data file(s) referred to in the problems can be found in the data folder in the
same directory as this chapter’s Jupyter notebook. Alternatively, you can download a zip file of the data for this chapter
from here by selecting the appropriate chapter file and then clicking the Download button.
1. When an electron in a hydrogen atom relaxes from a higher to a lower energy orbital, a photon is released with the
wavelength in nm described by the equation below. Write and run a Python script that prompts the user to input
the initial and final principal quantum numbers (n) and prints the wavelength (λ) of light emitted with units.
$$\frac{1}{\lambda_{nm}} = 1.097 \times 10^{-2}\ \mathrm{nm}^{-1} \left( \frac{1}{n_f^2} - \frac{1}{n_i^2} \right)$$
2. In the folder titled data, you will find synthetic data for the conversion of A → P. Both datasets are for first-order
reactions.
a) Write a Python script that accepts the name of a single data file like below and outputs the rate constant (k) for
the data. Test it on both datasets. For the script to find the file, it needs to either be in the same directory as the
data file or be provided the absolute path to the file.
or
b) Modify the above script to print out the rate constant for all datasets in the folder. This script will accept the
folder name instead of the file name. Remember to use the os module described in section 2.4.1.
Part III
Back Matter
APPENDIX 0: IPYTHON WIDGETS
You can create widgets such as sliders or check boxes in Jupyter notebooks to make it easier to rapidly modify input values
in your code. This can be useful for rapid experimentation with different parameters in your code or as part of a demo.
For this, we will use Jupyter Widgets. In the following examples, we will simulate an NMR free induction decay (FID)
signal and NMR splitting pattern to see how changing various parameters affects the end result. This section assumes
knowledge of chapters 0-4, but you probably can (mostly) follow along if you are through chapter 1.
Note
While the widgets in this appendix are movable, the graphs do not change because this is a static book with no kernel
running in the back. If you download this notebook and run it yourself, the values and graphs will automatically
update as you interact with the widgets. The widgets do not show up in the PDF version of the book.
This notebook requires that you have ipywidgets installed either using pip or conda. There is a good chance you already
have it installed, though. The last example also assumes you have nmrsim installed from section 12.2. This appendix
assumes the following imports.
Basic Widgets
To create a widget that affects your code, you must first package the code in a single Python function. Below we will
simulate an NMR free induction decay by the following equation where 𝑡 is time in seconds, 𝜈 is frequency in Hz, and T2
is the relaxation constant.
$$\mathrm{signal}(t) = \cos(2\pi\nu t)\, e^{-t/T_2}$$
We will see how the frequency (𝜈) and T2 affect the appearance of the FID. To do this, we will write a function,
plot_fid(nu, T2), that accepts these two parameters as arguments and generates a plot of signal versus time.
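Before plotting, it can help to evaluate the equation at a few time points by hand. The sketch below does this with only the math module; the function name fid_signal is made up for this example, and T2 is taken to be in seconds so that t/T2 is unitless.

```python
import math


def fid_signal(t, nu, T2):
    """Signal at time t (s) for frequency nu (Hz) and relaxation constant T2 (s)."""
    return math.cos(2 * math.pi * nu * t) * math.exp(-t / T2)


# sample one full oscillation of a 2 Hz signal with T2 = 5 s
for t in (0.0, 0.25, 0.5):
    print(t, round(fid_signal(t, nu=2, T2=5), 4))
```

The cosine term sets how quickly the signal oscillates, while the exponential term shrinks each successive peak, so a smaller T2 gives a faster decay.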
def plot_fid(nu, T2):
    t = np.linspace(0, 10, 1000)
    wave = np.cos(2*np.pi*nu*t)
    decay_func = np.exp(-t/T2)
    plt.plot(t, wave*decay_func)
    plt.xlabel('Time, s')
    plt.ylabel('Signal Amplitude')
plot_fid(2, 5)
[Figure: simulated FID with Signal Amplitude (-1.00 to 1.00) plotted against Time, s (0 to 10)]
To make this function interactive, we will use the interact() function from ipywidgets, which takes our function
above as a required, positional argument. We also need to provide initial values for our two parameters as keyword
arguments, as demonstrated below. When we run our code, two sliders appear above our graph. As noted above, the
sliders do not affect the plot in this static book but would automatically change the graph if you run the code in your own
Jupyter notebook.
Note
If you wrote your function with keyword arguments (i.e., parameters with default values) instead of positional
arguments, interact() does not require initial values.
The interact() function makes a guess at the ranges of values you might need for your parameters, but you can also
explicitly define these by providing a tuple with minimum, maximum, and step size values in this order ((min, max,
step)).
At this point, you may be wondering why you get sliders versus any other type of widget. The interact() function
automatically generates sliders for function arguments with numerical values. If the argument in interact() contains
a list, a dropdown menu appears, a bool generates a check box, and a text argument produces a text box.
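The mapping from argument type to widget type described above can be summarized in a small lookup function. This is only a toy illustration of the rules stated in the text, not the ipywidgets implementation; the name widget_kind() is hypothetical. Note that bool has to be tested before the numeric types because bool is a subclass of int in Python.

```python
def widget_kind(value):
    """Return the kind of widget the text says interact() generates for a value."""
    # bool must be tested first because isinstance(True, int) is True
    if isinstance(value, bool):
        return "check box"
    if isinstance(value, (int, float)):
        return "slider"
    if isinstance(value, tuple):
        return "slider"  # a (min, max, step) tuple also gives a slider
    if isinstance(value, list):
        return "dropdown"
    if isinstance(value, str):
        return "text box"
    return "unsupported"


for example in (2, 0.5, (1, 5, 0.5), [1, 2, 3], True, "FID"):
    print(repr(example), "->", widget_kind(example))
```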
If you want a value to be unchangeable by the widgets, wrap the desired value in the fixed() function as demonstrated
below.
Another way to create ipywidgets is to employ the interact() function as a decorator for your function. Instead
of calling the interact() function after you define your function, you place @interact() just above your
own function definition and skip feeding your function into the interact() function. The code below generates an
outcome equivalent to what we saw just above.
@interact(nu=[1,2,3,4,5,6], T2=2)
def plot_fid(nu, T2):
    t = np.linspace(0,10,1000)
    wave = np.cos(2*np.pi*nu*t)
    decay_func = np.exp(-t/T2)
    plt.plot(t, wave*decay_func)
    plt.xlabel('Time, s')
    plt.ylabel('Signal Amplitude')
Customized Widgets
You can customize your widgets with more widget types listed in the Ipywidgets documentation page. For example, if
we want our frequency to be controlled by buttons, we can create a button widget with ipywidgets’ RadioButtons()
function and assign that to the frequency variable in the interact() function. Each customized widget can have
different arguments, so it is a good idea to view the documentation on the Ipywidgets documentation page.
button_widget = RadioButtons(options=[1,2,3,4,5,6])
interact(plot_fid, nu=button_widget, T2=(1,5,0.5));
As a second example of a custom widget, we will create a slider with upper and lower limits using either
FloatRangeSlider() or IntRangeSlider(). As you might guess, one is for float values and the other is for integers. It is
important to note that these two widgets return two values in a tuple, so your function must be written to accept a two-valued
tuple as an argument.
    plt.plot(t, wave*decay_func)
    plt.xlabel('Time, s')
    plt.ylabel('Signal Amplitude')
    plt.xlim(limits)
Slow Functions
If your function is slow to run, you may not want it to execute every time a slider moves. There are two solutions to this.
The first is to use the interact_manual() function, which is a cousin of the interact() function except that
your function only runs when you click the Run Interact button.
The second option is to create a custom slider widget and set the parameter continuous_update=False. This will
result in your function only running once you let go of the slider with your mouse. A basic float slider can be created with
the FloatSlider() function, as shown below.
As an additional example, we will simulate NMR splitting patterns below using the nmrsim library introduced in section
12.2. For this, we will use the Multiplet() function, which takes the resonance frequency in Hz (v) as the first
positional argument followed by the intensity (I) of the resonance signal. The parameters that we are most interested in
here are the coupling constants with each type of neighboring nucleus and the number of those neighbors, which are
provided as a list of tuples, each pairing a coupling constant (J) with a number of nuclei (n_nuc).
The function below assumes our signal is being split by two types of neighboring nuclei - n_nuc1 of the first type of
neighbors with a J1 coupling constant and n_nuc2 of the second type of neighbors with a J2 coupling constant. This
resonance will be visualized using the mplplot() function from nmrsim.
plot_nmr();
[Figure: simulated NMR multiplet produced by plot_nmr()]
We can again feed our function into interact() which produces sliders because our parameters are all numbers.
APPENDIX 1: REMOTE REQUESTS
There are a number of freely available online chemical databases that can be used to build datasets, such as the
Chemical Abstracts Service (CAS), ChEMBL, ChemSpider, RCSB Protein Data Bank, PubChem, and PubMed, among
others. While some databases, such as the Spectral Database for Organic Compounds (SDBS), principally support access
through a web browser, many databases support programmatic access, which enables the user to automate the
downloading or searching of data.
Note
In the absence of an API for automated access, the user could also scrape the website using tools such as
beautifulsoup4, but this is potentially a bit more involved.
This requires the database to have what is known as an Application Programming Interface (API) that allows Python
to communicate with the database software. The APIs often have idiosyncratic formatting rules that must be carefully
followed to ensure no errors arise. It is also important to follow the database usage rules, such as how much data may be
downloaded, what the data may be used for, or if users are required to register with the database. The latter is often free
for academic or nonprofit use. In this example, you will learn to access the PubChem databases and build a small dataset
of organic chemicals with the chemical features to describe them. PubChem does not require any registration to use it,
but there is a rate limit to accessing the data, which will be addressed below.
To access the database, we will use the Python requests library, which allows the user to use Python to access data from
remote web servers. This package is installed by default with Anaconda or can be installed using pip. It is also prudent to
keep this library updated just as you would with a web browser because it makes remote requests.
PubChem requests use a URL, like your web browser, with the following five components:
• prolog_URL - [Link]
• data_input - compound/smiles
• identifier - OC(C=1C=CN=C2C=CC(OC)=CC21)C3N4CCC(C3)C(C=C)C4
• operation - property/Volume3D
• output - txt
The prolog is the base URL which allows requests to find the remote database server, the data_input indicates what
information will be provided to look up a chemical compound, the identifier is the chemical identifier, the operation is
what information you want out, and the output is the format of the returned information. The latter will be text in our case,
but you can have PubChem return other formats such as PNG or CSV if desired. The five pieces above are concatenated
with / separating them using the join() string method and are provided as an overall URL to the requests library. You
could also concatenate the above strings using the + operator as long as you ensure there are / separating each component.
full_url = '/'.join([prolog_URL, data_input, identifier, operation, output])
This URL is then fed into the requests.get() function like below which makes the request to the remote server to
fetch the information.
requests.get(full_url)
import requests
prolog_URL = "[Link]
data_input = "compound/smiles"
identifier = 'OC(C=1C=CN=C2C=CC(OC)=CC21)C3N4CCC(C3)C(C=C)C4'
operation = "property/Volume3D"
output = "txt"
full_url = '/'.join([prolog_URL, data_input, identifier, operation, output])
res = requests.get(full_url)
res
<Response [200]>
Once you have the result, use the .text attribute to get the response as regular text. The text ends with a newline
character, which you will need to remove.
res.text
'252.2000000000\n'
res.text[:-1]
'252.2000000000'
If you want to access a larger number of molecules, you will need to use a for loop with a list of molecular identifiers
that can be swapped out in each request. It is important to note that PubChem limits requests to no more than 5 per
second, so you will need to limit your request rate. This is relatively easy to accomplish using the time.sleep(n)
function from the native Python time module where n is the number of seconds to pause your code. For example, every
time time.sleep(1) is run, the function waits 1 second before the next line of code is executed. Placing this in
our for loop ensures that the maximum rate of requests will not be exceeded.
As an example, below we request the volume of four alcohols from PubChem and store them in a list.
import time
volumes = []
volumes
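The rate-limited loop described above might be sketched as follows. The helper fetch_all(), the stand-in fake_fetch(), and the four SMILES strings are hypothetical and only illustrate the pattern; the placeholder numbers it prints are not PubChem data. In real use, the fetch argument would be a function that builds the URL for each identifier and returns requests.get(full_url).text.

```python
import time


def fetch_all(identifiers, fetch, pause=0.25):
    """Call fetch() on each identifier, pausing between calls to cap the request rate."""
    results = []
    for ident in identifiers:
        results.append(fetch(ident))
        time.sleep(pause)  # 0.25 s pause keeps the rate under 5 requests per second
    return results


def fake_fetch(smiles):
    """Stand-in for a remote request; returns placeholder text ending in a newline."""
    return f"{10.0 * len(smiles):.1f}\n"


# hypothetical identifiers for four alcohols
alcohols = ["CO", "CCO", "CCCO", "CCCCO"]
raw = fetch_all(alcohols, fake_fetch, pause=0.01)

# strip the trailing newline and convert each response to a float
volumes = [float(text.strip()) for text in raw]
print(volumes)
```

Passing the fetch function in as an argument keeps the pacing logic separate from the network call, so the loop can be tried out without contacting the server.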
APPENDIX 2: VISUALIZING ATOMIC ORBITALS
Note
This appendix assumes a future version of SymPy for the Z_lm() function. This function has been temporarily
defined in a code cell below to provide this feature until the next SymPy release.
The visualization of atomic orbitals and orbital information is an important enough topic in chemistry to warrant specific
attention. This appendix focuses on different methods of visualizing various aspects of atomic orbitals and tools to assist
in this task. This content is not included in the chapter on plotting with matplotlib because this appendix heavily utilizes
libraries and tools such as SymPy, interact(), and NumPy that have not yet been introduced by Chapter 3. While this
appendix is written to be standalone as much as possible, knowledge of matplotlib, including surface plots, will be helpful
along with NumPy and SymPy basics.
Atomic orbitals are described by a wavefunction, Ψ(𝑛, 𝑙, 𝑚), which is the product of the radial wavefunction, 𝑅(𝑛, 𝑙),
and the angular wavefunction, 𝑌 (𝑙, 𝑚). Each atomic orbital has a different wavefunction Ψ, but they sometimes share
common radial wavefunctions.
The radial wavefunction depends upon the principal (n) and angular (𝑙) quantum numbers and provides information about
the wavefunction or electron probability at various distances from the nucleus. The radial wavefunction is independent of
the direction. The angular wavefunction describes the direction of the orbital with respect to the spherical coordinate
angles and depends upon the angular (𝑙) and magnetic (𝑚 or 𝑚𝑙 ) quantum numbers. We will first visualize the radial and
angular components individually before combining them into a more complete picture of atomic orbitals.
We will use NumPy and matplotlib heavily in this appendix, along with the convenient wavefunction functions in the
hydrogen module of SymPy. These are all imported below.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import sympy
from sympy.physics.hydrogen import R_nl, Psi_nlm #, Z_lm
# delete this cell and replace with actual Z_lm after next SymPy release
from sympy.functions.special.spherical_harmonics import Znm
def Z_lm(l, m, phi, theta):
    return Znm(l, m, theta, phi).expand(func=True)
Note
The SymPy library is introduced in Chapter 8 and provides mainly tools for symbolic mathematics along with
other tools for wavefunctions, harmonic oscillators, biomechanics, etc.
Radial Wavefunctions
Because the radial wavefunctions are independent of direction, they can be represented effectively on a simple 2D plot.
The toughest part is coding the equations for every combination of n and 𝑙. The good news is that the SymPy library
includes a function, R_nl(), in the Hydrogen Wavefunction (sympy.physics.hydrogen) module that provides
this functionality. This function takes the principal quantum number (n), angular quantum number (𝑙), radius in Bohrs
(r), and atomic number (Z). A Bohr equals about 52.9 pm.
R_nl(n, l, r, Z=1)
We can evaluate the function for any hydrogen-like atomic orbitals such as the 3p orbitals (n = 3 and 𝑙 = 1) at 4.0 Bohrs.
0.0173561901639985*sqrt(6)
SymPy prefers to return results in exact form, so it includes √6 in this particular result. To get a float answer, use the
evalf() method.
Note
The evalf() method can take an optional argument for the precision number such as evalf(5) for 5 digits
of precision.
0.0425138097805085
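As a cross-check that does not depend on SymPy, the 3p radial wavefunction can also be written out by hand from the standard hydrogen-like expression, $R_{31}(r) = \frac{8}{27\sqrt{6}} Z^{3/2} \rho \left(1 - \frac{\rho}{6}\right) e^{-\rho/3}$ with $\rho = Zr$ and $r$ in Bohrs. The sketch below is a hand-coded stand-in for R_nl(3, 1, r); the function name R_3p_by_hand() is hypothetical.

```python
import math


def R_3p_by_hand(r, Z=1):
    """Hydrogen-like 3p radial wavefunction with r in Bohrs (atomic units)."""
    rho = Z * r
    prefactor = 8 / (27 * math.sqrt(6)) * Z**1.5
    return prefactor * rho * (1 - rho / 6) * math.exp(-rho / 3)


value = R_3p_by_hand(4.0)
print(value)                 # float value at 4.0 Bohrs
print(value / math.sqrt(6))  # coefficient in front of sqrt(6) in the exact form
```

Both printed numbers can be compared with the SymPy results above.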
It might now be interesting to evaluate this radial function at a range of distances and plot them. This function does not
support taking multiple radii, so you have two options below.
1) Iterate through a list or array of radii and evaluate this function one radius at a time.
2) Convert the R_nl() function to a function that can accept an array using the lambdify() method.
plt.plot(radii, R_eval)
plt.hlines(0, 0, 30, colors='r', linestyles='dashed')
plt.xlabel('Distance from Nucleus (Bohrs)')
plt.ylabel('Evaluated Radial Wavefunction');
[Figure: evaluated radial wavefunction versus distance from nucleus (Bohrs)]
# second approach - lambdify
r = sympy.symbols('r') # create SymPy symbol
R_3p = sympy.lambdify(r, R_nl(3, 1, r), modules='numpy') # NumPy-aware function
plt.plot(radii, R_3p(radii))
plt.hlines(0, 0, 30, colors='r', linestyles='dashed')
plt.xlabel('Distance from Nucleus (Bohrs)')
plt.ylabel('Evaluated Radial Wavefunction');
[Figure: evaluated radial wavefunction versus distance from nucleus (Bohrs), lambdify approach]
The electron probability density can be found by calculating $R^2$ where $R$ is the radial wavefunction, and the radial
probability is $R^2 r^2$ where $r$ is the distance from the nucleus.
[Figure: radial probability, $R^2 r^2$, versus distance from nucleus (Bohrs)]
The reason we multiply the probability density by the square of the radius, $r^2$, is to account for the greater
surface area of a sphere ($A_{sphere} = 4\pi r^2$) at larger radii. We are effectively carrying out the calculation depicted
below. We divide the sphere surface area by $4\pi$ to normalize the integration, making the probability over all space total
to one.
[Figure: three panels plotted against radius (Bohrs): 1s probability density × sphere surface area over $4\pi$ (Bohrs$^2$) = 1s radial probability]
One of the uses of these radial plots is to compare the radial probability of multiple different orbitals on the same axes,
like below, for the fourth row of the periodic table. This can be used, for example, to discuss the valence electron
configurations of Cr and Cu.
[Figure: radial probability ($R^2 r^2$) versus distance from nucleus (Bohrs) for the 4s, 4p, and 3d orbitals]
The probability $R^2 r^2$ can be integrated using the sympy.integrate() function, which accepts the function or
mathematical expression to be integrated and a tuple that contains the variable, the min, and the max values.
For example, we can integrate the R_nl() for the 2s orbital from 0 to 3.0 Bohrs, like below.
0.473330547984585
Let’s test that the radial probability is normalized by integrating from zero to infinity.
1.0
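The same two integrals can be approximated numerically as a sanity check. The sketch below uses hand-coded hydrogen radial functions (Z = 1, r in Bohrs) and a simple trapezoid rule in place of sympy.integrate(); the helper names and the upper limit of 40 Bohrs standing in for infinity are assumptions made for illustration.

```python
import math


def R_1s(r):
    """Hydrogen 1s radial wavefunction, r in Bohrs."""
    return 2 * math.exp(-r)


def R_2s(r):
    """Hydrogen 2s radial wavefunction, r in Bohrs."""
    return (2 - r) * math.exp(-r / 2) / (2 * math.sqrt(2))


def trapezoid(func, a, b, n=20000):
    """Approximate the integral of func from a to b with n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for i in range(1, n):
        total += func(a + i * h)
    return total * h


def radial_probability_1s(r):
    """Radial probability R^2 r^2 for the 1s orbital."""
    return R_1s(r)**2 * r**2


# normalization check: 40 Bohrs is far enough out that the tail is negligible
total_probability = trapezoid(radial_probability_1s, 0.0, 40.0)

# integral of the 2s radial function itself from 0 to 3.0 Bohrs, as in the text
integral_2s = trapezoid(R_2s, 0.0, 3.0)

print(round(total_probability, 6))
print(round(integral_2s, 6))
```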
Angular Wavefunctions
The other component of Ψ is the angular wavefunction, which provides directional information about an orbital. The
angular equations can be coded by hand, or we can use the Y_lm() or Z_lm() spherical harmonics functions
from sympy.physics.hydrogen to assist us. The difference between these two functions is that Y_lm() may
return a complex expression, whereas Z_lm() will return the real-valued angular wavefunction. Because our goal is to
visualize the wavefunctions, we will restrict ourselves to the latter here. The angular wavefunction provides information
in all directions, so we will plot this information in 3D.
Warning
The plot of angular wavefunction does not include the radial information, so it does not fully describe the shape
of atomic orbitals. Do not interpret the angular plots below as the actual shape of atomic orbitals, even though
they resemble them.
There are multiple conventions for spherical coordinates. We will use the SciPy/SymPy convention of using theta (𝜃) for
the azimuthal (i.e., direction on xy-plane) and phi (𝜙) as the polar angle (i.e., angle from the positive z-axis) for plotting
the angular wavefunctions. Below, we plot the $d_{z^2}$ orbital by coding the angular wavefunction expression by hand.
[Figure: 3D surface plot of the $d_{z^2}$ angular wavefunction on x-, y-, and z-axes]
Alternatively, we can use the Z_lm() function from sympy.physics.hydrogen to generate the angular
wavefunction based on the angular and magnetic quantum numbers.
SymPy functions cannot calculate wavefunctions for an array of angles like NumPy functions can, but fortunately SymPy
functions can be converted to NumPy functions using the lambdify() function. Just provide lambdify()
with a collection of argument variables for the wavefunction as SymPy symbols, the wavefunction, and
modules='numpy', and it returns a new function.
x = np.sin(theta) * np.cos(phi)
y = np.sin(theta) * np.sin(phi)
z = np.cos(theta)
We can also visualize the angular component of wavefunctions in 2D using a polar plot, but we can only visualize one
angle at a time. Below we will visualize theta and leave phi fixed. Because we are only visualizing in 2D and not sweeping
around the phi angles, we need to make theta go from 0 → 2𝜋.
l, m = 2, 0
azmuth, polar = sympy.symbols('azmuth polar')
f = sympy.lambdify((polar, azmuth), Z_lm(l, m, polar, azmuth), modules='numpy')
[Figure: polar plot of the angular wavefunction for l = 2, m = 0]
l, m = 2, 1
azmuth, polar = sympy.symbols('azmuth polar')
f = sympy.lambdify((polar, azmuth), Z_lm(l, m, polar, azmuth), modules='numpy')
[Figure: polar plot of the angular wavefunction for l = 2, m = 1]
l, m = 2, 2
azmuth, polar = sympy.symbols('azmuth polar')
f = sympy.lambdify((polar, azmuth), Z_lm(l, m, polar, azmuth), modules='numpy')
[Figure: polar plot of the angular wavefunction for l = 2, m = 2]
The last orbital image is a d-orbital viewed from the side.
Complete Wavefunction
Now we will visualize both the angular and radial components together (Ψ), which is again the product of the radial, 𝑅(𝑛, 𝑙),
and angular, 𝑌 (𝑙, 𝑚), wavefunctions.
To obtain the entire wavefunction, Ψ, we can either multiply the radial and angular wavefunctions from the previous
sections or use the SymPy Psi_nlm() function, which makes this task a little more convenient. Orbitals have no edge,
so there are multiple ways of representing orbitals, including contour plots, isosurfaces, 90% surface plots, scatter plots,
and translucent 3D plots. The scatter and contour plot methods are demonstrated below. We will need the probability
density, P, of the atomic orbital, which is proportional to the product of a wavefunction, Ψ, and its complex conjugate,
Ψ* or the square of the absolute value of a wavefunction.
$P = \Psi^{*}\Psi = |\Psi|^{2}$
First, let’s take a look at the Psi_nlm() function, which operates similarly to the other SymPy wavefunctions above.
Below, we integrate it over all space, returning 1, which tells us that this function is normalized when we include $r^2 \sin(\theta)$.
Note
The $|\Psi|^2$ approach is favored below, but if you want to use $\Psi^{*}\Psi$, you can wrap your wavefunction in
sympy.conjugate().
Now let’s visualize an orbital using a scatter plot. We will use a strategy previously reported in J. Chem. Educ., 1990, 67,
42-44, which includes the following steps.
1. Use a random number generator to produce a series of 𝑟, 𝜃, and 𝜙 values or just 𝑟 and 𝜃 values depending upon
dimensions
2. Use the values above to calculate the xyz or yz values
3. Use the above radius and angles to calculate probabilities using the wavefunction
4. Normalize the probabilities by dividing by the maximum probability value across all the data points
5. If each normalized probability is above a random value from 0 → 1, it gets included in the scatter plot
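The five steps above can be sketched in plain Python for a 2D slice of a 2p orbital. This is a minimal illustration under stated assumptions: the unnormalized hand-coded wavefunction psi_2pz(), the number of trial points, the maximum radius of 10 Bohrs, and the fixed random seed are all choices made here for the example, and the standard library random module stands in for a NumPy random number generator.

```python
import math
import random

random.seed(0)   # fixed seed so the sketch is repeatable
N = 5000         # number of trial points (arbitrary choice)
r_max = 10.0     # largest radius sampled, in Bohrs (arbitrary choice)


def psi_2pz(r, polar):
    """Unnormalized hydrogen 2p_z wavefunction; r in Bohrs, polar angle in radians."""
    return r * math.exp(-r / 2) * math.cos(polar)


# step 1: random radius and polar-angle values (2D, so no azimuthal angle)
r_vals = [random.uniform(0, r_max) for _ in range(N)]
polar_vals = [random.uniform(0, 2 * math.pi) for _ in range(N)]

# step 2: convert to yz coordinates
y = [r * math.sin(p) for r, p in zip(r_vals, polar_vals)]
z = [r * math.cos(p) for r, p in zip(r_vals, polar_vals)]

# step 3: probability at each point from the square of the wavefunction
prob = [psi_2pz(r, p)**2 for r, p in zip(r_vals, polar_vals)]

# step 4: normalize by the largest probability in the data set
max_prob = max(prob)
prob_norm = [p / max_prob for p in prob]

# step 5: keep a point only if its normalized probability beats a random number
mask = [p > random.random() for p in prob_norm]
kept = [(yi, zi) for yi, zi, keep in zip(y, z, mask) if keep]

print(len(kept), "of", N, "points kept")
```

The kept (y, z) pairs are what would be passed to the plotting function; points along the z-axis survive more often because the 2p_z probability is largest there.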
Tip
If plotting a very large number of data points, consider using plt.plot() instead of plt.scatter() because
the latter is slower and uses more memory due to its ability to individualize each marker in the plot.
# 2p orbital - 3D simulation
x = r * np.sin(polar) * np.sin(azmuth)
y = r * np.sin(polar) * np.cos(azmuth)
z = r * np.cos(polar)
[Figure: 3D scatter simulation of the 2p orbital; axes show distance from nucleus (Bohrs)]
the orbital being thicker there. If we instead reduce the simulation to 2D (i.e., only the yz plane), like below, the orbital
lobes appear rounder because we are visualizing a slice through the middle of the orbital.
# 2p orbital - 2D simulation
x = r * np.sin(polar) * np.sin(0) # azimuth fixed at 0, so x = 0 (yz plane)
y = r * np.sin(polar) * np.cos(0)
z = r * np.cos(polar)
[Figure: 2D scatter simulation of the 2p orbital in the yz plane; both axes show distance from nucleus (Bohrs)]
We can visualize larger orbitals to see more nodes such as in the 3p and 3s orbitals below. We can also color the points
based on the sign of the wavefunction before calculating the probability. In the examples below, the color only represents
the sign of the wavefunction and not the magnitude of the value.
# 3p orbital
x = r * np.sin(polar) * np.sin(0) # azimuth fixed at 0, so x = 0 (yz plane)
y = r * np.sin(polar) * np.cos(0)
z = r * np.cos(polar)
#[Link](x[mask], y[mask],
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(1, 1, 1)
is_pos = wf(r, polar)[mask] > 0 # test if wavefunc is positive
ax.scatter(y[mask], z[mask], s=0.5, c=is_pos, cmap='coolwarm')
ax.set_xlabel('Distance from Nucleus (Bohrs)')
ax.set_ylabel('Distance from Nucleus (Bohrs)');
[Figure: 2D scatter simulation of the 3p orbital colored by the sign of the wavefunction; both axes show distance from nucleus (Bohrs)]
# 3s orbital
x = r * np.sin(polar) * np.sin(0) # azimuth fixed at 0, so x = 0 (yz plane)
y = r * np.sin(polar) * np.cos(0)
z = r * np.cos(polar)
[Figure: 2D scatter simulation of the 3s orbital; both axes show distance from nucleus (Bohrs)]
A second way to visualize orbitals is through a contour plot. Here we calculate the probability in a mesh of locations and
provide the plt.contour() function with the locations and probabilities.
# calculate probability
prob = np.abs(f(r, polar))**2
plt.contour(Z, Y, prob, levels=[1e-9, 3e-9, 5e-9, 1e-8, 5e-8, 1e-7, 3e-7, 9e-7])
plt.colorbar()
plt.xlabel('Distance from Nucleus (Bohrs)')
plt.ylabel('Distance from Nucleus (Bohrs)');
[Figure: contour plot of the orbital probability with colorbar; both axes show distance from nucleus (Bohrs)]
Y, Z = np.meshgrid(np.linspace(-20, 20, 200),
                   np.linspace(-20, 20, 200))
# calculate probability
prob = np.abs(f(r, polar))**2
plt.contour(Z, Y, prob, levels=[1e-9, 3e-9, 5e-9, 1e-8, 5e-8, 1e-7, 3e-7, 5e-7])
plt.colorbar()
plt.xlabel('Distance from Nucleus (Bohrs)')
plt.ylabel('Distance from Nucleus (Bohrs)');
[Figure: contour plot of the orbital probability with colorbar; both axes show distance from nucleus (Bohrs)]
APPENDIX 3: UNCERTAINTY PROPAGATION
Uncertainty occurs in any scientific measurement and is often represented as the standard deviations, 𝜎, of measurements
or the 95% confidence interval, 95% CI. When performing calculations containing values with uncertainty, the uncertainty
needs to be propagated through the calculations, which is a tedious and error-prone task when done by hand. This appendix
demonstrates how to use the Python uncertainties package to remove most of the pain from uncertainty propagation
along with simulating uncertainty using a random number generator.
Uncertainties Package
As of this writing, the uncertainties package can be installed using pip. We will then import a couple key functions,
ufloat() and ufloat_fromstr(), along with the umath module which brings a range of math functions (e.g.,
log and sin). We will also import NumPy and matplotlib to use in the simulation section.
from uncertainties import ufloat, ufloat_fromstr, umath
import numpy as np
import matplotlib.pyplot as plt
Uncertainties Variable
Basic mathematical operations with the uncertainties package center around the uncertainties variable object. This
is created using the ufloat() function which accepts two important values - the first is the nominal value and the
second is the standard deviation.
ufloat(nominal_value, std_dev)
For example, let’s say we have a value of 18.32 with a standard deviation of 0.03.
We can access the nominal value or the standard deviation by themselves using the nominal_value or std_dev
attributes, respectively.
val.nominal_value
18.32
val.std_dev
0.03
If you are calculating uncertainties taken from a text problem, the uncertainties package provides a convenience
function ufloat_fromstr() that allows you to copy-and-paste in values and their uncertainties all together. Below
are acceptable formats.
ufloat_fromstr('0.011 ± 0.002')
0.011+/-0.002
ufloat_fromstr('0.172807(0.000008)')
0.172807+/-8e-06
ufloat_fromstr('0.172')
0.172+/-0.001
The last one did not include an uncertainty, so the uncertainty was interpreted to be ±1 of the least significant decimal
place.
Simple Calculations
Beyond this, we just need to carry out our mathematical operations. For example, let’s say we want to calculate the molar
absorptivity constant using Beer’s law, 𝐴 = 𝜖𝑏𝐶, where A is absorbance, 𝜖 is the molar absorptivity constant, 𝑏 is the path
length in cm, and 𝐶 is concentration in molarity. If A = 0.3822 ± 0.0003, 𝑏 = 1.00±0.01 cm, and 𝐶=0.0017±0.0001
M, we can calculate the molar absorptivity constant like below.
A = ufloat(0.3822, 0.0003)
b = ufloat(1.00, 0.01)
C = ufloat(0.0017, 0.0001)
E = A / (b * C)
E
224.8235294117647+/-13.415813085736838
3 * b
3.0+/-0.03
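For comparison, the uncertainty in the molar absorptivity can be propagated by hand. For a pure product or quotient of uncorrelated values, the first-order rule is that relative uncertainties add in quadrature. The sketch below applies that rule with only the math module; the variable names (sA, sb, sC, sE) are chosen here for illustration.

```python
import math

# nominal values and standard deviations from the Beer's law example
A, sA = 0.3822, 0.0003
b, sb = 1.00, 0.01
C, sC = 0.0017, 0.0001

E = A / (b * C)

# relative uncertainties add in quadrature for products and quotients
rel_E = math.sqrt((sA / A)**2 + (sb / b)**2 + (sC / C)**2)
sE = E * rel_E

print(E, sE)
```

The concentration term dominates because its relative uncertainty (about 6%) is far larger than the other two.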
The umath module provides special mathematical functions like square root or sine. For example, if we want to calculate
the pH of a solution with an $[\mathrm{H_3O^+}] = 6.33\times10^{-6} \pm 3\times10^{-7}$ M, or $(6.33 \pm 0.3)\times10^{-6}$ M, we can carry out this
calculation below which gives us a pH = 5.199±0.021.
Note
Like in many Python libraries, log() is the natural log and log10() is the common log.
5.198596289982645+/-0.02058267686745269
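The pH result can be checked with the first-order propagation formula for a single variable, $\sigma_f = |df/dx|\,\sigma_x$. For $\mathrm{pH} = -\log_{10}[\mathrm{H_3O^+}]$, the derivative is $-1/([\mathrm{H_3O^+}]\ln 10)$. The sketch below is a hand calculation with the math module, not the umath code used above.

```python
import math

H, sH = 6.33e-6, 3e-7   # hydronium concentration and its standard deviation (M)

pH = -math.log10(H)

# |d(pH)/d[H3O+]| = 1 / ([H3O+] * ln 10)
s_pH = sH / (H * math.log(10))

print(pH, s_pH)
```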
Correlated Values
The above calculations assume that all the values in the calculation have no correlation with each other, which is not
always the case. When correlation occurs, this adds an extra layer of complexity to the error propagation calculations.
The uncertainties package recognizes some correlation automatically and handles it for you such as below when
subtracting a value from itself.
b - b
0.0+/-0
If a new value is calculated using uncertainties, the package automatically recognizes and factors the correlation
into future calculations. We can get a sense of the correlation using the covariance_matrix() or
correlation_matrix() functions. For example, we can input variables from the above Beer’s law problem to see the
covariance and correlation matrices.
covariance_matrix([b, C, E])
correlation_matrix([b, C, E])
array([[ 1. , 0. , -0.16758099],
[ 0. , 1. , -0.98577055],
[-0.16758099, -0.98577055, 1. ]])
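The off-diagonal entries of the correlation matrix can be reproduced by hand. Because E = A/(bC), a first-order expansion gives a covariance between E and b of $-(E/b)\sigma_b^2$, so the correlation coefficient reduces to minus the ratio of the relative uncertainty of b to that of E, and likewise for C. The sketch below is a hand calculation under that first-order assumption; the variable names are chosen here for illustration.

```python
import math

# nominal values and standard deviations from the Beer's law example
A, sA = 0.3822, 0.0003
b, sb = 1.00, 0.01
C, sC = 0.0017, 0.0001

rel_A, rel_b, rel_C = sA / A, sb / b, sC / C
rel_E = math.sqrt(rel_A**2 + rel_b**2 + rel_C**2)

# first-order correlation of E with each input it is divided by
corr_E_b = -rel_b / rel_E
corr_E_C = -rel_C / rel_E

print(corr_E_b, corr_E_C)
```

The strong negative correlation with C reflects that nearly all of the uncertainty in E comes from the concentration.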
When correlated values are derived outside of uncertainties such as in linear regressions, the user needs to provide
correlation information when creating uncertainties variable objects. This is done with the correlated_values()
function which requires the nominal values and a covariance matrix as the two required positional arguments. Alterna-
tively, you can use the related correlated_values_norm() function which instead accepts the nominal values
and the correlation matrix.
correlated_values(nominal_values, covariance_matrix)
correlated_values_norm(nominal_values, correlation_matrix)
The good news is that NumPy and SciPy functions can also return the covariance matrix along with the best fit parameters.
For example, scipy.optimize.curve_fit() automatically returns pcov which is the “estimated approximate”
covariance matrix and np.polyfit() returns the scaled covariance matrix as a second returned item
when cov=True.
Below, we will demonstrate this using a calibration curve for absorbance and concentration data using the
np.polyfit() function introduced in section 6.4.1.
array([7.93846154e+04, 3.89230769e-02])
cov
The fit returns the slope and y-intercept values along with the covariance matrix. We can then create our uncertainties
variable in uncertainties by providing both to the correlated_values() function.
from uncertainties import correlated_values
m, b = correlated_values(fit, cov)
m
79384.61538461538+/-2620.258467760363
b
0.03892307692307681+/-0.017732781881431937
If we then decide to calculate the concentration for an absorbance of 0.501, for example, uncertainties will factor
in uncertainty and correlation automatically like below.
(0.501 - b) / m
5.8207364341085285e-06+/-1.3535749157410873e-07
If we were to carry out the above calculation without factoring in correlation, it would look like below. While the value
itself does not change, the uncertainty is overestimated.
m_uncorr = ufloat_fromstr('79384.6153846154+/-2620.258467760346')
b_uncorr = ufloat_fromstr('0.038923076923077046+/-0.01773278188143182')
(0.501 - b_uncorr) / m_uncorr
5.820736434108524e-06+/-2.9463551943621185e-07
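The role of the covariance can be seen in the first-order propagation formula for $f = (A - b)/m$: $\sigma_f^2 = (\partial f/\partial b)^2\sigma_b^2 + (\partial f/\partial m)^2\sigma_m^2 + 2(\partial f/\partial b)(\partial f/\partial m)\,\mathrm{cov}(m, b)$. The sketch below implements this formula by hand; the function conc_uncertainty() and its cov_mb parameter are hypothetical names. With the covariance set to zero, it gives the uncorrelated result; a negative slope-intercept covariance makes the last term negative and shrinks the uncertainty.

```python
import math


def conc_uncertainty(A, m, s_m, b, s_b, cov_mb=0.0):
    """First-order standard deviation of (A - b)/m for an exact absorbance A."""
    conc = (A - b) / m
    d_b = -1 / m        # partial derivative with respect to the intercept
    d_m = -conc / m     # partial derivative with respect to the slope
    var = (d_b * s_b)**2 + (d_m * s_m)**2 + 2 * d_b * d_m * cov_mb
    return conc, math.sqrt(var)


# slope and intercept values from the uncorrelated example above
conc, s_conc = conc_uncertainty(0.501,
                                79384.6153846154, 2620.258467760346,
                                0.038923076923077046, 0.01773278188143182)
print(conc, s_conc)
```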
Simulating Uncertainties
We can also simulate uncertainties using Monte Carlo simulations as demonstrated below. Let’s say we want to
calculate the molar absorptivity constant using the same nominal and standard deviation values as above. Using
a random number generator, we can generate values for A, 𝑙 (the path length, called 𝑏 above), and C with the given
standard deviations using the normal(nominal_value, std_dev) function from the np.random module. We then
carry out the calculation with all of these values. The molar absorptivity is the average of these values with an
uncertainty calculated from the standard deviation of these calculated values.
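import numpy as np
import matplotlib.pyplot as plt
N = int(1e7)
rng = np.random.default_rng(seed=21)
# nominal values and standard deviations from the Beer's law example above
A_nom, A_sig = 0.3822, 0.0003
l_nom, l_sig = 1.00, 0.01
C_nom, C_sig = 0.0017, 0.0001
A = rng.normal(loc=A_nom, scale=A_sig, size=N)
l = rng.normal(loc=l_nom, scale=l_sig, size=N)
C = rng.normal(loc=C_nom, scale=C_sig, size=N)
E = A / (l * C)
plt.hist(E, bins=40)
plt.xlim(160, 300)
plt.xlabel('Molar Absorptivity Constant (cm$^{-1}$M$^{-1}$)')
plt.ylabel('Count');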
[Figure: histogram of the simulated molar absorptivity constant (cm$^{-1}$ M$^{-1}$) versus count]
print(np.mean(E))
print(np.std(E, ddof=1))
225.63035520507208
13.600103576801123
This results in a value of 226±14 cm$^{-1}$ M$^{-1}$, which is close to what we calculated using the uncertainties library.
APPENDIX 4: REGULAR EXPRESSIONS
There is a saying that synthetic chemists spend 10% of their time running reactions and 90% of their time purifying
compounds. Similarly, it could be said that working with chemical data is 10% performing the intended calculations
or analyses and 90% cleaning and organizing the data. While both are hyperbole, they
underline the large amount of effort required for cleanup. This appendix is dedicated to a powerful method known
as regular expressions, or regex for short, for cleaning and filtering text data, especially in situations requiring complex
pattern matching. Python string methods and indexing offer basic search and filtering functionality, but they tend to only
allow for identifying simple and consistent patterns. For example, if you want a file name without the file extension
(e.g., titration instead of [Link]), this can be solved using indexing and the string split() method because file
extensions always follow the last period in the full file name. Likewise, data from a PDB file can be parsed with
only a string search and slicing because PDB files follow very strict formatting rules based on labels and positions in rows.
These two examples are not terribly complex to parse because they are consistent and were designed to be
machine readable. However, not all data follow well-defined formatting rules, or there could be more variation that needs
to be accounted for. Regular expressions are not strictly a Python feature but rather a syntax supported by Python through
the re module imported below. This module is built into Python, so it comes with every installation of Python.
import re
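For comparison, the file-extension case described above can be handled with string methods alone. The following is a minimal sketch; the file name titration.csv is a hypothetical example chosen only for illustration.

```python
import os

# Hypothetical file name used only for illustration
file_name = 'titration.csv'

# rsplit() with maxsplit=1 splits on the last period only
stem = file_name.rsplit('.', 1)[0]
print(stem)

# os.path.splitext() from the standard library does the same job
stem_alt, extension = os.path.splitext(file_name)
print(stem_alt, extension)
```

Both approaches work here because the rule (everything after the last period) never varies, which is exactly the kind of consistency that messier text data lacks.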
Below we will first cover some key functions from the re module, then build more complex patterns, and
finally end with a couple of chemical database and literature examples.
re Functions
The re module provides a series of functions including those listed in Table 1 that allow the user to search for, split on,
or substitute for patterns within a string. Additional functions can be found on the Python regular expressions page.
Table 1 Select re Functions
The way these functions work is that the user provides a pattern to search for, which in the most basic scenarios can be
a simple string, along with a string in which the function will search for the pattern. In the example below, we search a
string of amine names for an aniline derivative by using 'aniline' as a pattern.
pattern = 'aniline'
re.findall(pattern, amines)
['aniline', 'aniline']
This is not terribly informative because all it tells us is that 'aniline' appears twice. The re.finditer()
function can be used instead to return an iterator of Match objects, which give the location of each match when consumed
with either a for loop or the list() function. We can see below that there are two matches along with the indices of those
matches and the string that matches the pattern.
list(re.finditer('aniline', amines))
To access the matched strings, use the group() method on the Match objects like below.
aniline
aniline
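Because the amines string is not shown above, the following minimal sketch uses a hypothetical string of amine names to illustrate re.finditer() together with the span() and group() methods of the Match objects it yields.

```python
import re

# Hypothetical string of amine names (an assumption for illustration)
amines = 'aniline N-methylaniline 4-methylpyridine o-methylbenzylamine'

for match in re.finditer('aniline', amines):
    # span() gives the (start, end) indices; group() gives the matched text
    print(match.span(), match.group())

# The same information collected into lists
spans = [m.span() for m in re.finditer('aniline', amines)]
texts = [m.group() for m in re.finditer('aniline', amines)]
```

The end index returned by span() is exclusive, so amines[start:end] recovers the matched text.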
The re module can also be used to find and replace patterns such as replacing 'aniline' with 'anilinium' like
below.
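A minimal sketch of this replacement using re.sub() is shown here; the amines string is again a hypothetical stand-in for the one used in the text.

```python
import re

# Hypothetical string of amine names (an assumption for illustration)
amines = 'aniline N-methylaniline 4-methylpyridine'

# re.sub(pattern, replacement, string) returns a new string
ammonium = re.sub('aniline', 'anilinium', amines)
print(ammonium)
```

The original string is left unchanged because Python strings are immutable.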
We could probably still have done the above tasks with string methods and indexing. The real power of regular expressions
is their ability to describe more complex and flexible patterns, which is what we address below.
Let’s try something a little more complicated by searching for any instance of a methyl not located on a nitrogen. This
means that the name should have a 'methyl' string with a hyphenated number before it. The re module provides
syntax (Table 2) for indicating specific types of characters and delimiters. For example, \d indicates a digit. Many of
these character designators also have a negative version using the capital letter, so \D, for example, signifies any character
except a digit.
Table 2 Regex Character Designators
Because we need any digit before the methyl, the pattern is \d-methyl. Now that we have patterns that use a
backslash, you may see a warning about an invalid escape sequence because the backslash is also a Python escape character.
To avoid this, either precede the backslash with another backslash, \\d-methyl, or make your string a raw string by
preceding it with an r as is done below.
Tip
The . * ? + ^ $ \ | { } [ ] ( ) symbols are also part of regular expression syntax, so if you need to
match one of them literally, escape it by preceding it with a backslash, as in r'spiro\[\d\.\d\]octane'.
A raw string alone does not do this; it only stops Python from treating the backslash as its own escape character.
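When a search term contains several of these symbols, the re.escape() function from the standard library adds the backslashes automatically. The sketch below uses a hypothetical string of spiro compound names.

```python
import re

# Hypothetical text containing square brackets and periods
names = 'spiro[3.4]octane spiro[4.5]decane'

# Escaping by hand: \[ \. \] are literal symbols; \d is still any digit
by_hand = re.findall(r'spiro\[\d\.\d\]octane', names)

# re.escape() escapes every special character in a literal string
escaped = re.escape('spiro[4.5]decane')
automatic = re.findall(escaped, names)

print(by_hand, automatic)
```

Escaping by hand is useful when part of the pattern should stay flexible, while re.escape() suits cases where the whole search term is literal.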
The \D designator can be used to locate methyls that are not on an aliphatic carbon chain because they do not have
numbers before them (at least in this example), as is done below. Now that our patterns are broader, a listing of the
matches like the one below is more informative because we can see that both N-methyl and o-methyl fit our pattern.
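The two designators can be compared side by side in a short sketch. The amines string here is a hypothetical stand-in chosen so that it contains one numbered methyl, one N-methyl, and one o-methyl.

```python
import re

# Hypothetical string of amine names (an assumption for illustration)
amines = 'aniline N-methylaniline 4-methylpyridine o-methylbenzylamine'

# \d-methyl: a digit, a hyphen, then 'methyl'
numbered = re.findall(r'\d-methyl', amines)

# \D-methyl: any non-digit character, a hyphen, then 'methyl'
not_numbered = re.findall(r'\D-methyl', amines)

print(numbered)
print(not_numbered)
```

Both patterns are written as raw strings so that Python passes the backslash through to the re module unchanged.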
As another example, below is a string that lists chemical identifiers including chemical names, CAS numbers, and a
PubChem CID. The first thing we might want to do is split this up into a list where each item represents a different
chemical.
Using a string method to split based on spaces demonstrated below will not work well because some chemicals (ethyl
benzoate, ethanoic acid, and acetic anhydride) have a space in their name. There is also a complication where there are
multiple spaces after '2-methylphenol'. This problem will be solved below using additional tools from regular
expressions.
re.split(r'\s', chemicals)
['2-methylphenol',
'',
'',
'methanol',
'N,N-diethylamine',
'pentanol',
'281-23-2',
'ethyl',
'benzoate',
'glycerol',
'93-89-0',
'5793',
'ethanoic',
'acid',
'acetic',
'anhydride']
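The chemicals string itself is not shown above. The sketch below uses a reconstruction inferred from the printed output (three spaces after '2-methylphenol' and single spaces elsewhere), which should be treated as an assumption, to compare the split() string method with re.split().

```python
import re

# Reconstructed from the output above; treat as an assumption
chemicals = ('2-methylphenol   methanol N,N-diethylamine pentanol '
             '281-23-2 ethyl benzoate glycerol 93-89-0 5793 '
             'ethanoic acid acetic anhydride')

# Splitting on a single space leaves empty strings for the extra spaces
by_method = chemicals.split(' ')

# \s matches exactly one whitespace character, so the result is the same
by_regex = re.split(r'\s', chemicals)

print(by_method == by_regex)
print(by_regex[:4])
```

Neither version handles the repeated spaces or the multi-word names, which is what the quantifiers and lookbehinds in the following sections address.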
Quantifiers
Let’s first deal with the multiple spaces using the quantifiers in Table 3. These quantifiers allow the user to specify how many
of something will be in the pattern. For example, a+ will search for one or more a’s, while \s{1,3} looks for
one to three whitespace characters.
Tip
Because the ? quantifier searches for 0 or 1 of something, it is helpful to think of this as looking for optional
items. For example, -?\d is a pattern for a number that could be positive or negative because a negative sign
may or may not be present.
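The quantifiers mentioned above can be tried on a short string. This is a minimal sketch; the text is a hypothetical example made up to contain repeated letters, signed numbers, and a run of spaces.

```python
import re

# Hypothetical text used only to illustrate quantifiers
text = 'shifts: 7 -1 12   baaad'

# + means one or more of the preceding item
one_or_more_a = re.findall(r'a+', text)

# ? makes the minus sign optional; \d+ allows multi-digit numbers
signed = re.findall(r'-?\d+', text)

# {2,3} means two or three whitespace characters in a row
gaps = re.findall(r'\s{2,3}', text)

print(one_or_more_a, signed, gaps)
```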
Below, we use \s+ to split our string of chemicals based on one or more spaces.
re.split(r'\s+', chemicals)
['2-methylphenol',
'methanol',
'N,N-diethylamine',
'pentanol',
'281-23-2',
'ethyl',
'benzoate',
'glycerol',
'93-89-0',
'5793',
'ethanoic',
'acid',
'acetic',
'anhydride']
Now let’s address the issue of spaces inside the name. IUPAC nomenclature for esters follows a pattern where the first
word always ends in -yl, and carboxylic acids and anhydrides have -ic at the end of the first word (i.e., the carboxyl part).
These trends can be used to identify spaces where the string should not be split, and we will carry this out using something
known as a lookahead or lookbehind shown in Table 4. These look for the presence or absence of something before or
after our main pattern. We specifically want spaces that do not have a yl or ic preceding them. We will add these one at
a time. Below, (?<!yl) is added in front of \s+ to avoid splitting after yl.
Table 4 Lookahead and Lookbehind Syntax
pattern = r'(?<!yl)\s+'
re.split(pattern, chemicals)
['2-methylphenol',
'methanol',
'N,N-diethylamine',
'pentanol',
'281-23-2',
'ethyl benzoate',
'glycerol',
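'93-89-0',
'5793',
'ethanoic',
'acid',
'acetic',
'anhydride']
Adding (?<!ic) as a second lookbehind also keeps the acid and anhydride names together.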
pattern = r'(?<!yl)(?<!ic)\s+'
re.split(pattern, chemicals)
['2-methylphenol',
'methanol',
'N,N-diethylamine',
'pentanol',
'281-23-2',
'ethyl benzoate',
'glycerol',
'93-89-0',
'5793',
'ethanoic acid',
'acetic anhydride']
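The examples above use only the negative lookbehind. As a complementary sketch, the lookahead forms (?=...) and (?!...) check what follows the main pattern without including it in the match. The string below is hypothetical.

```python
import re

# Hypothetical text mixing frequencies and a temperature
text = '400 MHz, 100 MHz, 298 K'

# Positive lookahead: numbers that are followed by ' MHz'
frequencies = re.findall(r'\d+(?= MHz)', text)

# Negative lookahead: whole numbers that are not followed by ' MHz'
others = re.findall(r'\b\d+\b(?! MHz)', text)

print(frequencies, others)
```

The \b word boundaries in the second pattern keep the regex engine from matching only part of a number, such as the 40 in 400.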
Character Sets
What happens if there are multiple symbols that need to be matched? When symbols or characters are placed
in square brackets, [], any one of them counts as a match. For example, it is not uncommon to see numbers separated
by either a period or dash (e.g., phone numbers), so [-.] can be used to indicate that either symbol is a fit. Regular
expressions also allow for ranges of letters and numbers such as [a-e] for any of the first five lowercase letters. Because
the dash marks a range between two characters, it is a good idea to place a literal dash first so that it is not interpreted as a range.
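A short sketch of both uses of square brackets follows. The phone numbers are hypothetical and serve only to show the two separators.

```python
import re

# Hypothetical phone numbers with different separators
numbers = '555-867-5309 555.867.5309'

# [-.] matches either a dash or a period; the dash is listed first
phones = re.findall(r'\d{3}[-.]\d{3}[-.]\d{4}', numbers)

# [a-e] is a range covering the letters a, b, c, d, and e
letters = re.findall(r'[a-e]', 'benzene')

print(phones)
print(letters)
```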
Below there is a string of toluene derivatives. If we want to filter for only para-substituted toluene derivatives, the name
(at least in this example) should start with either p- or 4-. Both symbols can be enclosed in square brackets like [4p].
The next challenge is figuring out how to deal with the rest of the name. We could try .+ to indicate one or more
additional characters, but the period also matches white space, so the match runs on through the rest of the string.
re.findall(r'[4p]-.+', toluene)
To solve this, we can again use a character set that includes any letter, number, or dash like below. The + behind
the square brackets means one or more of these symbols.
re.findall(r'[4p][-\d\w]+', toluene)
['4-methyltoluene', 'p-bromotoluene']
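One caveat is that this pattern can also begin at a p or 4 located inside a name. The sketch below uses a hypothetical toluene string (the original is not shown) and combines a character set with a negative lookbehind as one possible way to tighten the search; it is an illustration rather than the only approach.

```python
import re

# Hypothetical string of toluene derivatives (an assumption)
toluene = ('4-methyltoluene 2-propyltoluene p-bromotoluene '
           'o-chlorotoluene 2,4-dinitrotoluene')

# The pattern from the text can also start at a p or 4 inside a name
loose = re.findall(r'[4p][-\d\w]+', toluene)

# Requiring a hyphen after the locant, and no letter, digit, comma, or
# hyphen before it, keeps only names that begin with p- or 4-
strict = re.findall(r'(?<![\w,-])[4p]-[-\w]+', toluene)

print(loose)
print(strict)
```

The \w designator already includes digits, so [-\w] covers the same characters as [-\d\w].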
Groups
Regular expressions in Python also support the extraction of information from specific segments in a string. In section
1.3.4, string formatting is introduced where the user can create a template string and insert different strings in various
locations. Below are examples where the compound and molecular weight can be swapped out using either the format()
method or f-string formatting.
compound = 'ammonia'
MW = 17.03
compound = 'urea'
MW = 60.06
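A minimal sketch of the two formatting approaches is shown here. The wording of the template sentence is an assumption made for illustration.

```python
compound = 'urea'
MW = 60.06

# str.format() fills the curly-bracket placeholders in order
with_format = 'The molecular weight of {} is {} g/mol.'.format(compound, MW)

# f-strings place the variable names directly inside the brackets
with_fstring = f'The molecular weight of {compound} is {MW} g/mol.'

print(with_format)
print(with_fstring)
```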
Groups in regular expressions are essentially the opposite of the above: instead of inserting information into a string,
they extract information from it. Groups are helpful for pulling data out of a larger pattern. Below are the beginnings of
a couple of NMR data listings as they would appear in the chemical literature. If we are interested in the carrier frequency,
we simply write out the regular expression as normal but then wrap the part we want to extract in parentheses.
re.findall(carrier, H_NMR)
['400']
re.findall(carrier, C_NMR)
['100']
Multiple groups can be extracted by wrapping multiple sections in parentheses. The pattern below extracts both the
solvent and the carrier frequency.
[('CDCl3', '400')]
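Because the strings and patterns for these examples are not shown, the sketch below fills them in with hypothetical values that are consistent with the printed outputs. The names H_NMR, C_NMR, and carrier follow the text; the string contents, the two patterns, and the name both are assumptions.

```python
import re

# Hypothetical beginnings of NMR listings (assumptions for illustration)
H_NMR = '1H NMR (CDCl3, 400 MHz): δ 7.26 (s, 1H)'
C_NMR = '13C NMR (CDCl3, 100 MHz): δ 77.2'

# One group: only the digits before ' MHz' are returned
carrier = r'(\d+) MHz'
h_freq = re.findall(carrier, H_NMR)
c_freq = re.findall(carrier, C_NMR)

# Two groups: each match is returned as a tuple (solvent, frequency)
both = r'\((\w+), (\d+) MHz\)'
solvent_freq = re.findall(both, H_NMR)

print(h_freq, c_freq, solvent_freq)
```

With one group, re.findall() returns a list of strings; with two or more groups, it returns a list of tuples.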
Let’s now do some extra examples. When downloading data files from PubChem, the CAS number is mixed in with other
names and numerical identifiers. There are two challenges here. The first is that CAS numbers vary in length. They are
always three segments of numbers separated by hyphens, such as 58-08-2 or 2501-94-2, where the second segment is
always two digits and the third is always a single digit. However, the first segment varies from 2-7 digits. The second
major issue is that the CAS numbers are mixed in with other chemical identifiers such as CID numbers, common names,
and IUPAC names. These other identifiers can include hyphens and numbers, so indexing and string searches cannot
easily filter for CAS numbers without a long series of boolean conditions.
This is a relatively simple task for regular expressions. We indicate digits with the \d and use curly brackets to indicate
the number of digits as demonstrated below.
re.findall(r'\d{2,7}-\d{2}-\d', chemicals)
['281-23-2', '93-89-0']
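One possible refinement, sketched below under the assumption that identifiers are separated by spaces or punctuation, is to add \b word boundaries so that a CAS-like fragment inside a longer run of digits is not picked up. The test string is hypothetical.

```python
import re

# Hypothetical identifiers; the second has too many leading digits
text = 'caffeine 58-08-2 12345678-12-3 2501-94-2'

# Without boundaries, part of the over-long number is matched
loose = re.findall(r'\d{2,7}-\d{2}-\d', text)

# \b requires the match to start and end at a word boundary
bounded = re.findall(r'\b\d{2,7}-\d{2}-\d\b', text)

print(loose)
print(bounded)
```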
As a demonstration, PubChem allows for the free download of datasets which include a Synonym column. This column
includes identifiers such as common and IUPAC names, CAS numbers, and PubChem CID numbers. The following code
extracts the CAS numbers from one of these files. Two additional challenges arise from multiple CAS numbers being
listed for a given compound or no CAS number being listed at all. When there are multiple CAS numbers, the most
common one is stored, and if no CAS number is present, a NaN is stored in its place.
Note
This data file is not included with the book, but you can freely download these files from the above URL.
import numpy as np
import pandas as pd

solv = pd.read_csv('data/[Link]')
names = solv['Synonyms']
cas_pattern = r'\d{2,7}-\d{2}-\d'
cas = []
for row in names:
    cas_in_row = re.findall(cas_pattern, row)
    try:
        # get the most common CAS number in this row
        most_common_cas = max(set(cas_in_row), key=cas_in_row.count)
        cas.append(most_common_cas)
    except ValueError:
        # max() raises ValueError on an empty set, so append NaN
        # when no CAS number is found
        cas.append(np.nan)
cas[:10]
['107-06-2',
'120-82-1',
'67-64-1',
'71-43-2',
'71-36-3',
'111-65-9',
'67-68-5',
'64-17-5',
'75-12-7',
'67-56-1']
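The same logic can be tried without the data file. The following self-contained sketch uses collections.Counter in place of max() with a key, returns None rather than NaN when nothing is found, and runs on two hypothetical synonym strings; the function name most_common_cas and the strings are assumptions for illustration.

```python
import re
from collections import Counter

cas_pattern = r'\d{2,7}-\d{2}-\d'

def most_common_cas(synonyms):
    """Return the most frequent CAS-like string, or None if there is none."""
    found = re.findall(cas_pattern, synonyms)
    if not found:
        return None
    # most_common(1) returns a list holding one (item, count) tuple
    return Counter(found).most_common(1)[0][0]

# Hypothetical synonym entries (assumptions for illustration)
rows = ['acetone|67-64-1|2-propanone|67-64-1|180',
        'some unnamed solvent|12345']

results = [most_common_cas(row) for row in rows]
print(results)
```

Checking for an empty list explicitly makes the no-match case visible in the code instead of relying on an exception.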
When data on an NMR spectrum is reported in the literature, it follows relatively strict formatting rules, but these rules
are designed to be read by humans, not machines. To make things more complicated, there are numerous commas and
spaces in the data, making it difficult to use these as delimiters, so regular expressions are ideal for parsing this kind of
data. Below is the 1H NMR data for butanamide in DMSO-d6 at 22 °C following American Chemical Society guidelines.
1H NMR ((CD)3SO, 400 MHz): δ 7.23 (br, 1H), 6.70 (br, 1H), 2.00 (t, 2H, J = 7.3 Hz), 1.48 (tq, 2H, J =
7.3, 7.3 Hz), 0.84 (t, 3H, J = 7.3 Hz).
As an example, we will extract the entries for each signal in the NMR spectrum. Each entry looks like 7.23 (br,
1H) or 0.84 (t, 3H, J = 7.3 Hz) where the decimal is the chemical shift and additional information on the
signal is provided in the parentheses behind the chemical shift.
proton = ('1H NMR ((CD)3SO, 400 MHz): δ 7.23 (br, 1H), 6.70 (br, 1H), '
          '2.00 (t, 2H, J = 7.3 Hz), 1.48 (tq, 2H, J = 7.3, 7.3 Hz), '
          '0.84 (t, 3H, J = 7.3 Hz).')
Each signal starts with a number to two decimal places, but there may be one or two digits before the decimal place. Even
though our example always has one digit before the decimal, we want our code to be robust and versatile. Because an
unescaped period matches any character, the decimal point is escaped with a backslash, giving '\d{1,2}\.\d{2}' as the
regular expression for this number.
nmr_pattern = r'\d{1,2}\.\d{2}'
re.findall(nmr_pattern, proton)
Next, the information about the signal is stored in parentheses separated from the chemical shift by a space. We will use
'\s+' just in case someone accidentally used multiple spaces. Because parentheses are part of regular expression syntax,
the opening parenthesis needs to be preceded by a backslash to indicate that we mean a literal parenthesis character.
nmr_pattern = r'\d+\.\d{2}\s+\('
re.findall(nmr_pattern, proton)
['7.23 (', '6.70 (', '2.00 (', '1.48 (', '0.84 (']
nmr_pattern = r'\d+\.\d{2}\s+\(\w+,\s+\d+H,\s+J\s+=\s+\d+\.\d\s+Hz\)'
re.findall(nmr_pattern, proton)
['2.00 (t, 2H, J = 7.3 Hz)', '0.84 (t, 3H, J = 7.3 Hz)']
The current pattern misses the signals that do not include coupling information or that have multiple coupling constants.
This is where quantifiers are helpful. Placing the characters that can appear in , J = 7.3 in square
brackets followed by an asterisk, like below, allows zero or more of these characters. Because square brackets define a
character set rather than a fixed sequence, this accepts any run of the listed characters in any order, which is permissive
but sufficient for this data.
[,\s+J\s?=\s?\d+.\d]*
nmr_pattern = r'\d+\.\d{2}\s\(\w+,\s+\d+H[,\s+J\s?=\s?\d+.\d]*\sHz\)'
re.findall(nmr_pattern, proton)
Now the regular expression finds all signals that have coupling constants but is still missing the two without coupling
constants. This is because the pattern still requires a ' Hz'. To make the unit optional, its characters are also
enclosed in square brackets and followed by an * like below.
[\sHz]*
nmr_pattern = r'\d+\.\d{2}\s\(\w+,\s+\d+H[,\s+J\s?=\s?\d+.\d]*[\sHz]*\)'
re.findall(nmr_pattern, proton)
It looks like the code finds all the signals. One more addition that would make the code more robust is
allowing for a negative chemical shift. While proton chemical shifts are typically positive, negative values do
show up in situations such as silanes with Si-H bonds or metal hydrides. To allow for this possibility, a -? is placed at
the front, indicating that the negative sign may or may not be there. To test this, an extra negative resonance is added
to the data below.
proton = ('1H NMR ((CD)3SO, 400 MHz): δ 7.23 (br, 1H), 6.70 (br, 1H), '
          '2.00 (t, 2H, J = 7.3 Hz), 1.48 (tq, 2H, J = 7.3, 7.3 Hz), '
          '0.84 (t, 3H, J = 7.3 Hz), -0.54 (s, 1H).')
nmr_pattern = r'-?\d+\.\d{2}\s\(\w+,\s+\d+H[,\s+J\s?=\s?\d+.\d]*[\sHz]*\)'
re.findall(nmr_pattern, proton)
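Square brackets accept their characters in any order, so the pattern above would also accept malformed entries. As an alternative sketch, not the approach used in the text, the optional coupling information can be wrapped in a non-capturing group, (?:...), followed by a ?. Non-capturing groups behave like parentheses for quantifiers but do not change what re.findall() returns. The variable names shift, coupling, and strict_pattern are assumptions for illustration.

```python
import re

proton = ('1H NMR ((CD)3SO, 400 MHz): δ 7.23 (br, 1H), 6.70 (br, 1H), '
          '2.00 (t, 2H, J = 7.3 Hz), 1.48 (tq, 2H, J = 7.3, 7.3 Hz), '
          '0.84 (t, 3H, J = 7.3 Hz), -0.54 (s, 1H).')

# Chemical shift: optional minus sign and a literal decimal point
shift = r'-?\d+\.\d{2}'

# Optional block: ', J = 7.3' plus any further ', 7.3' values, then ' Hz'
coupling = r'(?:,\s+J\s*=\s*\d+\.\d(?:,\s*\d+\.\d)*\s+Hz)?'

strict_pattern = shift + r'\s+\(\w+,\s+\d+H' + coupling + r'\)'
signals = re.findall(strict_pattern, proton)

for signal in signals:
    print(signal)
```

Building the pattern from named pieces also makes each part easier to test on its own.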
If someone wanted to extract values from the NMR signals, additional regular expressions could be written to iterate
through the list and extract the desired information.
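As one way this could look, the sketch below uses named groups, (?P<name>...), to pull the chemical shift, multiplicity, and number of protons out of each signal, and a second search to collect the coupling constants. The short signals list, the group names, and the dictionary keys are assumptions for illustration.

```python
import re

# A few signals in the format found by the search above
signals = ['7.23 (br, 1H)',
           '1.48 (tq, 2H, J = 7.3, 7.3 Hz)',
           '-0.54 (s, 1H)']

# Named groups label each piece of a signal
signal_pattern = r'(?P<shift>-?\d+\.\d{2})\s+\((?P<mult>\w+),\s+(?P<nH>\d+)H'

parsed = []
for signal in signals:
    m = re.match(signal_pattern, signal)
    # Coupling constants, if any, follow 'J =' and have one decimal place
    j_text = signal.split('J =')[1] if 'J =' in signal else ''
    parsed.append({'shift': float(m.group('shift')),
                   'multiplicity': m.group('mult'),
                   'protons': int(m.group('nH')),
                   'J': [float(j) for j in re.findall(r'\d+\.\d', j_text)]})

for entry in parsed:
    print(entry)
```

A list of dictionaries like this can be passed directly to pandas.DataFrame() if a table is preferred.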
Further Reading
INDEX
A
alias, 97
altair, 334
Anaconda software installer, 10
anonymous function, 69
arguments, 27
argv, 485
arrays, 147
augmented assignment, 67
average, 162
B
balancing chemical equations, 269
Balmer series, 49
baseline correction, 220
Beer's law
    using optimization, 422
    with matrices, 267
bioinformatics, 453
blackbody radiation, 428
boolean logic, 38
broadcasting, 158
C
cheminformatics, 429
command line, 482
comments, 22
compound assignment, 69
comprehensions, 68
conditions, 42
confidence intervals, 295
confusion matrix, 393
constants, 193
constrained optimization, 414
curve fitting, 419
D
DataFrame
    concatenation, 187
    create, 175
    drop columns, 185
    insert columns, 184
    merge, 186
    statistics, 181
datetime data, 89
descriptors, 443
dictionaries, 70
E
eigenvalues, 271
eigenvectors, 271
encoding numbers, 79
enumeration, 77, 78
equilibrium
    ICE table, 261
    kinetic simulation, 289
    solving double equilibria, 415
    solving with polynomials, 429
error handling, 83
except, 87
F
factorial, 28
fancy indexing, 155
features, 443
file input/output
    Excel files, 179
    FASTA, 455
    mmCIF, 461
    multiple files, 75
    PDB, 461
    reading NMR data, 362
    with NumPy, 56
    with pandas, 176
    with Python, 52
fitting data, 213
floats, 26
Fourier transform
    basics, 211
    NMR, 364
functions
    arguments, 27
    calling functions, 27
    defining functions, 58
    docstring, 62
    recursive, 81
    scope, 60
    variable arguments, 80
    vectorization, 160
G
gas chromatography, 144
gas law, 428
GC content, 457
generator, 75
H
Hamming distance, 88
Hess's law, 266
I
images
    blob detection, 247
    color, 232
    contrast, 242
    eccentricity, 252
    encoding, 241
    entropy, 250
    false color, 238
    grayscale, 228
    loading, 229
    saving, 240
immutable, 47
InChI, 432
indexing, 33
inflection points, 203
integer division, 25
integers, 26
interactivity
    pan and zoom, 341
    rotate molecules, 473
    selection, 352
    widgets, 493
interpolation, 218
island of stability, 109
isomers, 438
isotopic decay kinetics, 280
J
Jupyter notebooks, 16
K
k-fold cross-validation, 390
kinetics
    determine rate constants, 419
    simulations, 288
    stochastic simulations, 294
L
label plotting axes, 101
lambda function, 69
least squares, 417
legend, 120
linear equation solving
    with optimization, 418
    with SymPy, 264
lists, 43
local min/max, 198
loops
    break, 51
    continue, 51
    for, 48
    pass, 51
    while, 49
M
machine learning, 385
    blind signal (or source) separation, 400
    classification, 392
    clustering, 398
    dimensionality reduction, 395
    k-means, 398
    random forest, 392
    supervised, 385
    unsupervised, 395
masking, 156, 199
math
    algebra, 258
    calculus, 272
    differentiation, 272
    factoring polynomials, 260
    integration, 273
    linear algebra, 263
    matrices, 263
    ordinary differential equations (ODEs), 280
    simplification, 260
    solve equations, 264
    symbolic, 257
matrix
    determinant, 264
    dot product, 264
    eigenvalues, 271
    eigenvectors, 271
    inverse, 264
    pseudoinverse, 266
    singular matrix, 266
maximization, 413
maximum, 194
median, 162
meshgrid, 127, 128
T
title on plot, 102
transpose, 151
try, 87
tuples, 47
U
user input, 485
V
van der Waals equation, 429
variable naming rules, 29
variable scope, 60
variables, 29
vectorization, 157
Python provides a higher-level interface for reading and writing CSV files through the built-in csv module, which abstracts many file I/O tasks such as handling line terminations and delimiters that might vary between different CSV files. This contrasts with manual file handling, where you open a file with the open() function, read it with the readlines() method, and split each line yourself. The csv module manages complexities like quoting characters and escape sequences, allowing developers to work with CSV data more efficiently and with fewer errors.
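The difference shows up as soon as a field contains a comma. The sketch below is a minimal illustration using made-up CSV text, with io.StringIO standing in for an opened file.

```python
import csv
import io

# Hypothetical CSV text; io.StringIO stands in for an opened file
text = 'compound,MW\n"acetic acid, glacial",60.05\nurea,60.06\n'

# csv.reader respects the quotes around the field containing a comma
rows = list(csv.reader(io.StringIO(text)))
print(rows)

# Splitting the same line by hand breaks the quoted field in two
naive = text.splitlines()[1].split(',')
print(naive)
```

Note that csv.reader returns every field as a string, so numerical columns still need to be converted.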
Jupyter is favored for its ability to seamlessly integrate code with visualizations and narrative in notebooks, making it ideal for exploratory data analysis and presentation. Projects that benefit from combining code execution with descriptive content, such as machine learning model iteration and data storytelling, are particularly suited to Jupyter. These features facilitate interactive learning and documentation, which are less emphasized in traditional IDEs like Spyder.
To plot multiple datasets efficiently in matplotlib, one can issue a plot command for each dataset with a different style and a label to differentiate them. Adding a legend with plt.legend() then identifies each dataset by its label. Colors, markers, line styles, and the loc argument for legend placement all help ensure that datasets are clearly distinguished within a chart.
The CSV (Comma Separated Values) file format is primarily used for encoding tabular data, making it a popular choice for data exchange between different applications, particularly because it is simple and widely supported. It allows easy storage of spreadsheet-like data where each line represents a data row and each data field within a row is separated by a comma. This simplicity makes CSV files easily readable by humans and machines alike, facilitating data import and export across a wide range of software applications.
Scikit-image is specifically designed for scientific image analysis, offering advanced features such as boundary detection, object counting, and image transformations suitable for scientific applications. Meanwhile, Pillow serves more general image processing tasks like rotation and cropping. Scikit-image’s focus on scientific needs makes it well suited to consistent, objective processing and measurement of image features, which is crucial in scientific research.
Keyword arguments in matplotlib allow for fine-grained control over various plot attributes such as line style, color, marker style, and labels. They enable customization of plots beyond the default settings to achieve precise visual output. Although positional arguments can be used for quick plots, keyword arguments provide the ability to adjust numerous aspects of a plot systematically, resulting in more refined and visually clear graphical representations.
Flattening an array reduces it to a one-dimensional form, which is useful for operations that expect a single sequence of values, such as some statistical analyses or data visualizations. Transposing, which swaps rows and columns, is essential in linear algebra and data manipulation where matrix orientation affects outcomes. Practical applications include preparing data for machine learning algorithms and adjusting formats for compatibility between computational operations.
Resources for learning and improving Python skills in scientific computing include a variety of free and paid books, online courses, and documentation. Some notable resources are 'The Hitchhiker’s Guide to Python,' 'Think Python,' and 'Introduction to Python Programming' by OpenStax. Additionally, platforms like Stack Overflow, YouTube tutorials, and the official Python documentation at python.org provide extensive support.
NumPy provides a variety of methods for modifying and reshaping arrays, such as np.reshape() for changing the dimensions of an array while maintaining the original data order and count. Additionally, the flatten() array method and the T array attribute allow for flattening and transposing arrays, respectively. Functions like np.vstack(), np.hstack(), and np.dstack() enable merging along different axes. These tools provide flexibility when adapting data for specific computational needs or analyses.
In Google Colab, interacting with CSV files requires additional steps for files stored on Google Drive. You must grant access using specific functions to read and write files, unlike in a local Python environment where reading a CSV involves simply specifying the path if the file is outside the notebook's directory. Google Colab requires mounting Google Drive and navigating directories programmatically, which adds initial complexity but benefits from seamless cloud storage integration.