Mastering Python 3 I/O (Version 2)

Mastering Python 3 I/O
(version 2.0)
David Beazley
http://www.dabeaz.com

Presented at PyCon'2011
Atlanta, Georgia

Copyright (C) 2011, David Beazley, http://www.dabeaz.com

This Tutorial

• Details about a very speciﬁc aspect of Python 3
• Maybe the most important part of Python 3
• Namely, the reimplemented I/O system


Why I/O?
• Real programs interact with the world
• They read and write ﬁles
• They send and receive messages
• I/O is at the heart of almost everything that
Python is about (scripting, data processing,
gluing, frameworks, C extensions, etc.)
• Most tricky porting issues are I/O related

The I/O Issue

• Python 3 re-implements the entire I/O stack
• Python 3 introduces new programming idioms
• I/O handling issues can't be ﬁxed by automatic
code conversion tools (2to3)


The Plan
• We're going to take a detailed top-to-bottom
tour of the Python 3 I/O system
• Text handling, formatting, etc.
• Binary data handling
• The new I/O stack
• System interfaces
• Library design issues

Prerequisites
• I assume that you are already somewhat
familiar with how I/O works in Python 2
• str vs. unicode
• print statement
• open() and ﬁle methods
• Standard library modules
• General awareness of I/O issues
• Prior experience with Python 3 not assumed

Performance Disclosure
• There are some performance tests
• Execution environment for tests:
• 2.66 GHz 4-Core MacPro, 3GB memory
• OS-X 10.6.4 (Snow Leopard)
• All Python interpreters compiled from
source using same conﬁg/compiler
• Tutorial is not meant to be a detailed
performance study so all results should be
viewed as rough estimates

Resources

• I have made a few support ﬁles:
http://www.dabeaz.com/python3io/index.html

• You can try some of the examples as we go
• However, it is ﬁne to just watch/listen and try
things on your own later


Part 1
Introducing Python 3


Syntax Changes
• As you know, Python 3 changes some syntax
• print is now a function print()
print("Hello World")

• Exception handling syntax changes slightly
try:
    ...
except IOError as e:      # 'as' added
    ...

• Yes, your old code will break

Many New Features
• Python 3 introduces many new features
• Composite string formatting
"{:10s} {:10d} {:10.2f}".format(name, shares, price)

• Dictionary comprehensions
a = {key.upper():value for key,value in d.items()}

• Function annotations
def square(x:int) -> int:
    return x*x

• Much more... but that's a different tutorial

Changed Built-ins
• Many of the core built-in operations change
• Examples : range(), zip(), etc.
>>> a = [1,2,3]
>>> b = [4,5,6]
>>> c = zip(a,b)
>>> c
<zip object at 0x100452950>
>>>

• Python 3 prefers iterators/generators


Library Reorganization
• The standard library has been cleaned up
• Example : Python 2
from urllib2 import urlopen
u = urlopen("http://www.python.org")

• Example : Python 3
from urllib.request import urlopen
u = urlopen("http://www.python.org")


2to3 Tool
• There is a tool (2to3) that can be used to
identify (and optionally ﬁx) Python 2 code
that must be changed to work with Python 3
• It's a command-line tool:
bash % 2to3 myprog.py
...

• 2to3 helps, but it's not foolproof (in fact, most
of the time it doesn't quite work)


2to3 Example
• Consider this Python 2 program
# printlinks.py
import urllib
import sys
from HTMLParser import HTMLParser

class LinkPrinter(HTMLParser):
    def handle_starttag(self,tag,attrs):
        if tag == 'a':
            for name,value in attrs:
                if name == 'href': print value

data = urllib.urlopen(sys.argv[1]).read()
LinkPrinter().feed(data)

• It prints all <a href="..."> links on a web page

2to3 Example
• Here's what happens if you run 2to3 on it
bash % 2to3 printlinks.py
...
--- printlinks.py (original)
+++ printlinks.py (refactored)
@@ -1,12 +1,12 @@
-import urllib
+import urllib.request, urllib.parse, urllib.error
 import sys
-from HTMLParser import HTMLParser
+from html.parser import HTMLParser

         if tag == 'a':
-                if name == 'href': print value
+                if name == 'href': print(value)
...

It identifies lines that must be changed

Fixed Code
• Here's an example of a ﬁxed code (after 2to3)
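import urllib.request, urllib.parse, urllib.error
import sys
from html.parser import HTMLParser

class LinkPrinter(HTMLParser):
    def handle_starttag(self,tag,attrs):
        if tag == 'a':
            for name,value in attrs:
                if name == 'href': print(value)

data = urllib.request.urlopen(sys.argv[1]).read()
LinkPrinter().feed(data)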

• This is syntactically correct Python 3
• But, it still doesn't work. Do you see why?

Broken Code
• Run it
bash % python3 printlinks.py http://www.python.org
Traceback (most recent call last):
  File "printlinks.py", line 12, in <module>
  File "/Users/beazley/Software/lib/python3.1/html/parser.py", line 107, in feed
    self.rawdata = self.rawdata + data
TypeError: Can't convert 'bytes' object to str implicitly
bash %

Ah ha! Look at that!

• That is an I/O handling problem
• Important lesson : 2to3 didn't ﬁnd it

Actually Fixed Code
• This version "works"
import urllib.request, urllib.parse, urllib.error
import sys
from html.parser import HTMLParser

class LinkPrinter(HTMLParser):
    def handle_starttag(self,tag,attrs):
        if tag == 'a':
            for name,value in attrs:
                if name == 'href': print(value)

data = urllib.request.urlopen(sys.argv[1]).read()
LinkPrinter().feed(data.decode('utf-8'))

I added this one tiny bit (by hand)
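
The same failure can be reproduced without a network connection. The sketch below is only an illustration: the hard-coded `page` bytes and the `LinkCollector` class are made up for this example. It shows that `feed()` wants decoded text, not the bytes that `read()` hands back.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values instead of printing them (hypothetical helper)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# Stand-in for urlopen(...).read(), which returns bytes in Python 3
page = b'<html><body><a href="http://www.python.org">Python</a></body></html>'

parser = LinkCollector()
try:
    parser.feed(page)                 # bytes: rejected with TypeError
    fed_bytes_ok = True
except TypeError:
    fed_bytes_ok = False

parser = LinkCollector()
parser.feed(page.decode('utf-8'))     # decode first, then feed text
print(fed_bytes_ok, parser.links)
```

Hard-coding 'utf-8' is itself an assumption about the page; a real program would want to check what encoding the server reports.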


Important Lessons

• A lot of things change in Python 3
• 2to3 only ﬁxes really "obvious" things
• It does not ﬁx I/O problems
• Why you should care : Real programs do I/O


Part 2
Working with Text


Making Peace with Unicode

• In Python 3, all text is Unicode
• All strings are Unicode
• All text-based I/O is Unicode
• You can't ignore it or live in denial
• However, you don't have to be a Unicode guru


Text Representation
• Old-school programmers know about ASCII

• Each character has its own integer byte code
• Text strings are sequences of character codes

Unicode Characters
• Unicode is the same idea only extended
• It deﬁnes a standard integer code for every
character used in all languages (except for
ﬁctional ones such as Klingon, Elvish, etc.)
• The numeric value is known as a "code point"
• Denoted U+HHHH in polite conversation
ñ = U+00F1
ε = U+03B5
ઇ = U+0A87
㌄ = U+3304


Unicode Charts
• An issue : There are a lot of code points
• Largest code point : U+10FFFF
• Code points are organized into charts
http://www.unicode.org/charts

• Go there and you will ﬁnd charts organized by
language or topic (e.g., greek, math, music, etc.)


Using Unicode Charts
• Consult to get code points for use in literals

t = "That's a spicy Jalape\u00f1o!"

• In practice : It doesn't come up that often

Unicode Escapes
• There are three Unicode escapes in literals
• \xhh : Code points U+00 - U+FF
• \uhhhh : Code points U+0100 - U+FFFF
• \Uhhhhhhhh : Code points U+10000 and above
• Examples:
a = "\xf1" # a = 'ñ'
b = "\u210f" # b = 'ℏ'
c = "\U0001d122" # c = '𝄢'


A repr() Caution
• Python 3 source code is now Unicode
• Output of repr() is Unicode and doesn't use the
escape codes (characters will be rendered)
>>> a = "Jalape\xf1o"
>>> a
'Jalapeño'

• Use ascii() to see the escape codes
>>> print(ascii(a))
'Jalape\xf1o'
>>>


Commentary

• Don't overthink Unicode
• Unicode strings are mostly like ASCII strings
except that there is a greater range of codes
• Everything that you normally do with strings
(stripping, ﬁnding, splitting, etc.) works ﬁne,
but is simply expanded


A Caution
• Unicode is just like ASCII except when it's not
>>> s = "Jalape\xf1o"
>>> t = "Jalapen\u0303o"      # 'ñ' = 'n' + '˜' (combining ˜)
>>> s
'Jalapeño'
>>> t
'Jalapeño'
>>> s == t
False
>>> len(s), len(t)
(8, 9)
>>>

• Many hairy bits
• However, that's also a different tutorial
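
One of the hairy bits can at least be tamed. As a minimal sketch, the standard `unicodedata` module (not covered in these slides) can bring both spellings into the same normal form before comparing them:

```python
import unicodedata

s = "Jalape\xf1o"        # precomposed ñ (U+00F1)
t = "Jalapen\u0303o"     # 'n' followed by combining tilde (U+0303)

# NFC composes character sequences; NFD decomposes them
s_nfc = unicodedata.normalize('NFC', s)
t_nfc = unicodedata.normalize('NFC', t)

print(s == t)               # False: different code point sequences
print(s_nfc == t_nfc)       # True once both are in the same normal form
print(len(t), len(t_nfc))   # 9 8
```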

Unicode Representation
• Internally, Unicode character codes are just
stored as arrays of C integers (16 or 32 bits)
t = "Jalapeño"

004a 0061 006c 0061 0070 0065 00f1 006f (UCS-2,16-bits)
0000004a 00000061 0000006c 00000061 ... (UCS-4,32-bits)

• You can ﬁnd out which using the sys module
>>> sys.maxunicode
65535 # 16-bits

>>> sys.maxunicode
1114111 # 32-bits


Memory Use
• Yes, text strings in Python 3 require either 2x
or 4x as much memory to store as Python 2
• For example: Read a 10MB ASCII text ﬁle
data = open("bigfile.txt").read()

>>> sys.getsizeof(data) # Python 2.6
10485784

>>> sys.getsizeof(data) # Python 3.1 (UCS-2)
20971578

>>> sys.getsizeof(data) # Python 3.1 (UCS-4)
41943100

• See PEP 393 (possible change in future)

Performance Impact
• Increased memory use does impact the
performance of string operations that
involve bulk memory copies
• Slices, joins, split, replace, strip, etc.
• Example:
timeit("text[:-1]","text='x'*100000")

Python 2.7.1 (bytes) : 11.5 s
Python 3.2 (UCS-2) : 24.2 s
Python 3.2 (UCS-4) : 47.5 s

• Slower because more bytes are moving

Performance Impact
• Operations that process strings character by
character often run at comparable speed
• lower, upper, find, regexes, etc.
• Example:
timeit("text.upper()","text='x'*1000")

Python 2.7.1 (bytes) : 37.9s (???)
Python 3.2 (UCS-2) : 6.9s
Python 3.2 (UCS-4) : 7.0s

• The same number of iterations regardless of
the size of each character

Commentary

• Yes, unicode strings come at a cost
• Must study it if text-processing is a major
component of your application
• Keep in mind--most programs do more than
just string operations (overall performance
impact might be far less than you think)


Issue : Text Encoding
• The internal representation of characters is not
the same as how characters are stored in ﬁles
Text File               Hello World

File content            48 65 6c 6c 6f 20 57 6f 72 6c 64 0a
(ASCII bytes)
                            read() / write()
Representation          00000048 00000065 0000006c 0000006c
inside the interpreter  0000006f 00000020 00000057 0000006f
(UCS-4, 32-bit ints)    00000072 0000006c 00000064 0000000a


• There are also many possible char encodings
for text (especially for non-ASCII chars)
"Jalapeño"

latin-1 4a 61 6c 61 70 65 f1 6f

cp437 4a 61 6c 61 70 65 a4 6f

utf-8 4a 61 6c 61 70 65 c3 b1 6f

utf-16 ff fe 4a 00 61 00 6c 00 61 00
70 00 65 00 f1 00 6f 00

• Emphasize : This is only related to how text
is stored in ﬁles, not stored in memory

• Emphasize: text is always stored exactly the
same way inside the Python interpreter
Python Interpreter      "Jalapeño"
                        4a 00 61 00 6c 00 61 00
                        70 00 65 00 f1 00 6f 00

Files       latin-1                     utf-8
            4a 61 6c 61 70 65 f1 6f     4a 61 6c 61 70 65 c3 b1 6f

• It's only the encoding in ﬁles that varies
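
You can check the byte sequences for yourself. This is a small sketch; `hexdump` is a throwaway helper written for this example, and 'utf-16-le' is used instead of 'utf-16' so that the output has no byte-order mark and doesn't depend on the machine's byte order.

```python
text = "Jalape\xf1o"

def hexdump(data):
    """Return the bytes as space-separated hex pairs."""
    return " ".join("{:02x}".format(b) for b in data)

for name in ("latin-1", "cp437", "utf-8", "utf-16-le"):
    encoded = text.encode(name)
    # Decoding with the same codec always recovers the original string
    assert encoded.decode(name) == text
    print("{:10s} {}".format(name, hexdump(encoded)))
```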

I/O Encoding
• All text is now encoded and decoded
• If reading text, it must be decoded from its
source format into Python strings
• If writing text, it must be encoded into some
kind of well-known output format
• This is a major difference between Python 2
and Python 3. In Python 2, you could write
programs that just ignored encoding and
read text as bytes (ASCII).


Reading/Writing Text
• Built-in open() function now has an optional
encoding parameter
f = open("somefile.txt","rt",encoding="latin-1")

• If you omit the encoding, a platform-dependent
default is used (locale.getpreferredencoding()),
which is often UTF-8
>>> f = open("somefile.txt","rt")
>>> f.encoding
'UTF-8'
>>>

• Also, in case you're wondering, text ﬁle modes
should be speciﬁed as "rt","wt","at", etc.
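
Here is a round trip you can try, as a rough sketch. The scratch file path is made up for the example, and `newline=""` is only there to switch off line-ending translation so the bytes on disk are the same on every platform.

```python
import locale
import os
import tempfile

text = "Spicy Jalape\xf1o\n"

# Hypothetical scratch file; any writable path would do
path = os.path.join(tempfile.mkdtemp(), "somefile.txt")

with open(path, "wt", encoding="latin-1", newline="") as f:
    f.write(text)

with open(path, "rb") as f:          # binary mode: see the raw encoded bytes
    raw = f.read()

with open(path, "rt", encoding="latin-1", newline="") as f:
    decoded = f.read()

print(raw)
print(decoded == text)
print(locale.getpreferredencoding(False))   # what open() falls back to by default
```

Passing the encoding explicitly makes the result independent of whatever default the platform happens to pick.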


Encoding/Decoding Bytes
• Use encode() and decode() for byte strings
>>> s = "Jalapeño"
>>> data = s.encode('utf-8')
>>> data
b'Jalape\xc3\xb1o'

>>> data.decode('utf-8')
'Jalapeño'
>>>

• You'll need this for transmitting strings on
network connections, passing to external
systems, etc.


Important Encodings
• If you're not doing anything with Unicode
(e.g., just processing ASCII ﬁles), there are
still three encodings you must know
• ASCII
• Latin-1
• UTF-8
• Will brieﬂy describe each one

ASCII Encoding
• Text that is restricted to 7-bit ASCII (0-127)
• Any characters outside of that range
produce an encoding error
>>> f = open("output.txt","wt",encoding="ascii")
>>> f.write("Hello World\n")
12
>>> f.write("Spicy Jalapeño\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xf1' in position 12: ordinal not in range(128)
>>>


Latin-1 Encoding
• Text that is restricted to 8-bit bytes (0-255)
• Byte values are left "as-is"
>>> f = open("output.txt","wt",encoding="latin-1")
>>> f.write("Spicy Jalapeño\n")
15
>>>

• Most closely emulates Python 2 behavior
• Also known as "iso-8859-1" encoding
• Pro tip: This is the fastest encoding for pure
8-bit text (ASCII ﬁles, etc.)

UTF-8 Encoding
• A multibyte variable-length encoding that can
represent all Unicode characters
Encoding                                Description
0nnnnnnn                                ASCII (0-127)
110nnnnn 10nnnnnn                       U+0080-U+07FF
1110nnnn 10nnnnnn 10nnnnnn              U+0800-U+FFFF
11110nnn 10nnnnnn 10nnnnnn 10nnnnnn     U+10000-U+10FFFF

• Example:
ñ = 0xf1 = 11110001

= 11000011 10110001 = 0xc3 0xb1 (UTF-8)
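
To make the bit patterns concrete, here is a hand-rolled encoder for a single code point. It is only a sketch for illustration (the name `utf8_encode` is made up, and it does no validation of surrogates or out-of-range values); in real code you would just call `encode('utf-8')`.

```python
def utf8_encode(codepoint):
    """Encode one code point by hand, following the bit patterns in the table."""
    if codepoint < 0x80:                         # 0nnnnnnn
        return bytes([codepoint])
    if codepoint < 0x800:                        # 110nnnnn 10nnnnnn
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:                      # 1110nnnn 10nnnnnn 10nnnnnn
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    return bytes([0xF0 | (codepoint >> 18),      # 11110nnn + three 10nnnnnn
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])

for ch in ("A", "\xf1", "\u210f", "\U0001d122"):
    # The built-in codec agrees with the hand-rolled version
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print("U+{:04X} -> {}".format(ord(ch), utf8_encode(ord(ch)).hex()))
```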


UTF-8 Encoding
• Main feature of UTF-8 is that ASCII is
embedded within it
• If you're not working with international
characters, UTF-8 will work transparently
• Usually a safe default to use when you're not
sure (e.g., passing Unicode strings to
operating system functions, interfacing with
foreign software, etc.)


Interlude
• If migrating from Python 2, keep in mind
• Python 3 strings use multibyte integers
• Python 3 always encodes/decodes I/O
• If you don't say anything about encoding,
Python 3 uses a platform default (often UTF-8)
• Everything that you did before should work
just ﬁne in Python 3 (probably)


Encoding Errors
• When working with Unicode, you might
encounter encoding/decoding errors
>>> f = open('foo',encoding='ascii')
>>> data = f.read()
  File "/usr/local/lib/python3.2/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
>>>

• This is almost always bad--must be ﬁxed

Fixing Encoding Errors
• Solution: Use the right encoding
>>> f = open('foo',encoding='utf-8')
>>> data = f.read()
>>>

• Bad Solution : Change the error handling
>>> f = open('foo',encoding='ascii',errors='ignore')
>>> data = f.read()
>>> data
'Jalapeo'
>>>

• My advice : Never use the errors argument
without a really good reason. Do it right.
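
To see what the error handlers actually do to your data, here is a small sketch that works on bytes in memory rather than a file. Note how both 'ignore' and 'replace' quietly lose the ñ.

```python
data = "Jalape\xf1o".encode("utf-8")     # b'Jalape\xc3\xb1o'

try:
    data.decode("ascii")                 # default is errors='strict'
    strict_result = "decoded"
except UnicodeDecodeError as e:
    strict_result = "failed at byte offset {}".format(e.start)

ignored = data.decode("ascii", errors="ignore")     # bad bytes silently dropped
replaced = data.decode("ascii", errors="replace")   # each bad byte becomes U+FFFD
correct = data.decode("utf-8")                      # the actual fix

print(strict_result)
print(ascii(ignored), ascii(replaced), ascii(correct))
```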


Part 3
Printing and Formatting


New Printing

• In Python 3, print() is used for text output
• Here is a mini porting guide
Python 2              Python 3

print x,y,z           print(x,y,z)
print x,y,z,          print(x,y,z,end=' ')
print >>f,x,y,z       print(x,y,z,file=f)

• print() has a few new tricks

Printing Enhancements
• Picking a different item separator
>>> print(1,2,3,sep=':')
1:2:3
>>> print("Hello","World",sep='')
HelloWorld
>>>

• Picking a different line ending
>>> print("What?",end="!?!\n")
What?!?!
>>>

• Relatively minor, but these features were often
requested (e.g., "how do I get rid of the space?")


Discussion : New Idioms
• In Python 2, you might have code like this
print ','.join([name,shares,price])

• Which of these is better in Python 3?
print(",".join([name,shares,price]))

- or -
print(name, shares, price, sep=',')

• Overall, I think I like the second one (even
though it runs a tad bit slower)
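
One practical difference between the two, shown as a quick sketch (the values for name, shares and price are made up): join() insists on strings, while print() converts each argument for you.

```python
import io

name, shares, price = "ACME", 50, 91.1

out = io.StringIO()
# print() converts each argument with str(), so mixed types are fine
print(name, shares, price, sep=',', file=out)

try:
    line = ",".join([name, shares, price])
except TypeError:
    # join() only accepts strings; convert explicitly
    line = ",".join(str(x) for x in (name, shares, price))

print(out.getvalue(), end='')
print(line)
```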


Object Formatting

• Here is Python 2 (%)
s = "%10.2f" % price

• Here is Python 3 (format)
s = format(price,"10.2f")

• This is part of a whole new formatting system


Some History
• String formatting is one of the few features
of Python 2 that can't be easily customized
• Classes can deﬁne __str__() and __repr__()
• However, they can't customize % processing
• Python 2.6/3.0 adds a __format__() special
method that addresses this in conjunction
with some new string formatting machinery


String Conversions
• Objects now have three string conversions
>>> x = 1/3
>>> x.__str__()
'0.333333333333'
>>> x.__repr__()
'0.3333333333333333'
>>> x.__format__("0.2f")
'0.33'
>>> x.__format__("20.2f")
'                0.33'
>>>

• You will notice that __format__() takes a
code similar to those used by the % operator


format() function
• format(obj, fmt) calls __format__
>>> x = 1/3
>>> format(x,"0.2f")
'0.33'
>>> format(x,"20.2f")
'                0.33'
>>>

• This is analogous to str() and repr()
>>> str(x)
'0.333333333333'
>>> repr(x)
'0.3333333333333333'
>>>


Format Codes (Builtins)
• For builtins, there are standard format codes
Old Format    New Format    Description
"%d"          "d"           Decimal Integer
"%f"          "f"           Floating point
"%s"          "s"           String
"%e"          "e"           Scientific notation
"%x"          "x"           Hexadecimal
"%o"          "o"           Octal

• Plus there are some brand new codes
              "b"           Binary
              "%"           Percent


Format Examples
• Examples of simple formatting
>>> x = 42
>>> format(x,"x")
'2a'
>>> format(x,"b")
'101010'

>>> y = 2.71828
>>> format(y,"f")
'2.718280'
>>> format(y,"e")
'2.718280e+00'
>>> format(y,"%")
'271.828000%'


Format Modiﬁers
• Field width and precision modiﬁers
[width][.precision]code

• Examples:
>>> y = 2.71828
>>> format(y,"0.2f")
'2.72'
>>> format(y,"10.4f")
'    2.7183'
>>>

• This is exactly the same convention as with
the legacy % string formatting


Alignment Modiﬁers
• Alignment Modiﬁers
[<|>|^][width][.precision]code

< left align
> right align
^ center align

• Examples:
>>> y = 2.71828
>>> format(y,"<20.2f")
'2.72                '
>>> format(y,"^20.2f")
'        2.72        '
>>> format(y,">20.2f")
'                2.72'
>>>


Fill Character
• Fill Character
[fill][<|>|^][width][.precision]code

• Examples:
>>> x = 42
>>> format(x,"08d")
'00000042'
>>> format(x,"032b")
'00000000000000000000000000101010'
>>> format(x,"=^32d")
'===============42==============='
>>>


Thousands Separator
• Insert a ',' before the precision speciﬁer
[fill][<|>|^][width][,][.precision]code

• Examples:
>>> x = 123456789
>>> format(x,",d")
'123,456,789'
>>> format(x,"10,.2f")
'123,456,789.00'
>>>

• Alas, the use of the ',' isn't localized

Discussion

• As you can see, there's a lot of ﬂexibility in
the new format method (there are other
features not shown here)
• User-deﬁned objects can also completely
customize their formatting if they implement
__format__(self,fmt)
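
As a minimal sketch of what that customization might look like, here is a made-up `Point` class that applies the format code to each coordinate. The fallback to "g" when no code is given is just a choice made for this example.

```python
class Point:
    """Hypothetical 2-D point that applies the format code to each coordinate."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __format__(self, fmt):
        # An empty fmt means "no code given": fall back to a default
        fmt = fmt or "g"
        return "({}, {})".format(format(self.x, fmt), format(self.y, fmt))

p = Point(1.5, 2.25)
print(format(p, "0.2f"))            # (1.50, 2.25)
print("p = {:0.3f}".format(p))      # p = (1.500, 2.250)
print("{}".format(p))               # (1.5, 2.25)
```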


Composite Formatting
• String .format() method formats multiple
values all at once (replacement for %)
• Some examples:
>>> "{name} has {n} messages".format(name="Dave",n=37)
'Dave has 37 messages'

>>> "{:10s} {:10d} {:10.2f}".format('ACME',50,91.1)
'ACME               50      91.10'

>>> "<{0}>{1}</{0}>".format('para','Hey there')
'<para>Hey there</para>'
>>>


Composite Formatting

• format() method scans the string for
formatting speciﬁers enclosed in {} and
expands each one
• Each {} speciﬁes what is being formatted as
well as how it should be formatted
• Tricky bit : There are two aspects to it


What to Format?
• You must specify arguments to .format()
• Positional:
"{0} has {1} messages".format("Dave",37)

• Keyword:
"{name} has {n} messages".format(name="Dave",n=37)

• In order:
"{} has {} messages".format("Dave",37)


String Templates
• Template Strings
from string import Template

msg = Template("$name has $n messages")
print(msg.substitute(name="Dave",n=37))

• New String Formatting
msg = "{name} has {n} messages"
print(msg.format(name="Dave",n=37))

• Very similar


Indexing/Attributes
• Cool thing :You can perform index lookups
record = {
    'name' : 'Dave',
    'n' : 37
}

'{r[name]} has {r[n]} messages'.format(r=record)

• Or attribute lookups with instances
record = Record('Dave',37)

'{r.name} has {r.n} messages'.format(r=record)

• Restriction: Can't have arbitrary expressions
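
Here are the lookups side by side as a runnable sketch. `Record` is defined here as a namedtuple purely as a stand-in for whatever class you have; anything that needs an expression has to be computed before calling format().

```python
from collections import namedtuple

# Hypothetical stand-in for a Record class
Record = namedtuple('Record', ['name', 'n'])

as_dict = {'name': 'Dave', 'n': 37}
as_obj = Record('Dave', 37)

a = '{r[name]} has {r[n]} messages'.format(r=as_dict)   # key lookup, no quotes
b = '{r.name} has {r.n} messages'.format(r=as_obj)      # attribute lookup
c = '{r[0]} has {r[1]} messages'.format(r=as_obj)       # integer index

# Arithmetic such as {r.n + 1} is not allowed; compute it first
d = '{} will have {} messages'.format(as_obj.name, as_obj.n + 1)
print(a, b, c, d, sep='\n')
```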

Specifying the Format

• Recall: There are three string format functions
str(s)
repr(s)
format(s,fmt)

• Each {item} can pick which it wants to use
{item} # Replaced by str(item)
{item!r} # Replaced by repr(item)
{item:fmt} # Replaced by format(item, fmt)


Format Examples

• More Examples:
>>> "{name:10s} {price:10.2f}".format(name='ACME',price=91.1)
'ACME            91.10'

>>> "{s.name:10s} {s.price:10.2f}".format(s=stock)
'ACME            91.10'

>>> "{name!r},{price}".format(name="ACME",price=91.1)
"'ACME',91.1"
>>>

note repr() output here


Other Formatting Details

• { and } must be escaped if part of formatting
• Use '{{' for '{'
• Use '}}' for '}'
• Example:
>>> "The value is {{{0}}}".format(42)
'The value is {42}'
>>>


Nested Format Expansion
• .format() allows one level of nested lookups in
the format part of each {}
>>> s = ('ACME',50,91.10)
>>> "{0:{width}s} {2:{width}.2f}".format(*s,width=12)
'ACME                91.10'
>>>

• Probably best not to get too carried away in
the interest of code readability though


Formatting a Mapping
• Variation : s.format_map(d)
>>> record = {
...     'name' : 'Dave',
...     'n' : 37
... }
>>> "{name} has {n} messages".format_map(record)
'Dave has 37 messages'
>>>

• This is a convenience function--allows names
to come from a mapping without using **


Commentary

• The new string formatting is very powerful
• The % operator will likely stay, but the new
formatting adds more ﬂexibility


Part 4
Binary Data Handling and Bytes


Bytes and Byte Arrays

• Python 3 has support for "byte-strings"
• Two new types : bytes and bytearray
• They are quite different than Python 2 strings


Defining Bytes
• Here's how to define byte "strings"
a = b"ACME 50 91.10" # Byte string literal
b = bytes([1,2,3,4,5]) # From a list of integers
c = bytes(10) # An array of 10 zero-bytes
d = bytes("Jalapeño","utf-8") # Encoded from string

• Can also create from a string of hex digits
e = bytes.fromhex("48656c6c6f")

• All of these define an object of type "bytes"
>>> type(a)
<class 'bytes'>
>>>

• However, this new bytes object is odd

Bytes as Strings
• Bytes have standard "string" operations
>>> s = b"ACME 50 91.10"
>>> s.split()
[b'ACME', b'50', b'91.10']
>>> s.lower()
b'acme 50 91.10'
>>> s[5:7]
b'50'

• And bytes are immutable like strings
>>> s[0] = b'a'
TypeError: 'bytes' object does not support item assignment


Bytes as Integers
• Unlike Python 2, bytes are arrays of integers
>>> s = b"ACME 50 91.10"
>>> s[0]
65
>>> s[1]
67
>>>

• Same for iteration
>>> for c in s: print(c,end=' ')
65 67 77 69 32 53 48 32 57 49 46 49 48
>>>

• Hmmmm. Curious.

Porting Note
• I have encountered a lot of minor problems
with bytes in porting libraries
# Python 2
data = s.recv(1024)
if data[0] == '+':
    ...

# Python 3
data = s.recv(1024)
if data[0] == b'+':       # ERROR! (int compared to bytes, always False)
    ...

data = s.recv(1024)
if data[0] == 0x2b:       # CORRECT
    ...
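
Here are the alternatives you can try at the prompt, as a sketch with a made-up message standing in for the result of recv(). Slicing and startswith() both keep you in bytes, which avoids the magic number.

```python
data = b"+OK ready"           # stand-in for s.recv(1024)

print(data[0])                # 43: indexing yields an int
print(data[0] == b'+')        # False: an int never equals a bytes object
print(data[0] == 0x2b)        # True
print(data[0] == ord('+'))    # True, and arguably more readable
print(data[:1] == b'+')       # True: slicing yields bytes
print(data.startswith(b'+'))  # True
```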


Porting Note
• Be careful with ord() (not needed)
# Python 2                  # Python 3
data = s.recv(1024)         data = s.recv(1024)
x = ord(data[0])            x = data[0]

• Conversion of objects into bytes
>>> x = 7
>>> bytes(x)
b'\x00\x00\x00\x00\x00\x00\x00'

>>> str(x).encode('ascii')
b'7'
>>>


bytearray objects
• A bytearray is a mutable bytes object
>>> s = bytearray(b"ACME 50 91.10")
>>> s[:4] = b"PYTHON"
>>> s
bytearray(b'PYTHON 50 91.10')
>>> s[0] = 0x70 # Must assign integers
>>> s
bytearray(b'pYTHON 50 91.10')
>>>

• It also gives you various list operations
>>> s = bytearray(b"ACME 50 91.10")
>>> s.append(23)
>>> s.append(45)
>>> s.extend([1,2,3,4])
>>> s
bytearray(b'ACME 50 91.10\x17-\x01\x02\x03\x04')
>>>


An Observation
• bytes and bytearray are not really meant to
mimic Python 2 string objects
• They're closer to array.array('B',...) objects
>>> import array
>>> s = array.array('B',[10,20,30,40,50])
>>> s[1]
20
>>> s[1] = 200
>>> s.append(100)
>>> s.extend([65,66,67])
>>> s
array('B', [10, 200, 30, 40, 50, 100, 65, 66, 67])
>>>


Bytes and Strings
• Bytes are not meant for text processing
• In fact, if you try to use them for text, you will
run into weird problems
• Python 3 strictly separates text (unicode) and
bytes everywhere
• This is probably the most major difference
between Python 2 and 3.


Mixing Bytes and Strings

• Mixed operations fail miserably
>>> s = b"ACME 50 91.10"
>>> 'ACME' in s
TypeError: Type str doesn't support the buffer API
>>>

• Huh?!?? Buffer API?
• We'll mention that later...


Printing Bytes
• Printing and text-based I/O operations do not
work in a useful way with bytes
>>> s = b"ACME 50 91.10"
>>> print(s)
b'ACME 50 91.10'
>>>

Notice the leading b' and trailing
quote in the output.

• There's no way to ﬁx this. print() should only
be used for outputting text (unicode)


Formatting Bytes
• Bytes do not support operations related to
formatted output (%, .format)
>>> s = b"%0.2f" % 3.14159
TypeError: unsupported operand type(s) for %: 'bytes' and
'float'
>>>

• So, just forget about using bytes for any kind of
useful text output, printing, etc.
• No, seriously.

Passing Bytes as Strings
• Many library functions that work with "text"
do not accept byte objects at all
>>> time.strptime(b"2010-02-17","%Y-%m-%d")
  File "/Users/beazley/Software/lib/python3.1/_strptime.py", line 461, in _strptime_time
    return _strptime(data_string, format)[0]
  File "/Users/beazley/Software/lib/python3.1/_strptime.py", line 301, in _strptime
    raise TypeError(msg.format(index, type(arg)))
TypeError: strptime() argument 0 must be str, not <class 'bytes'>
>>>


Commentary
• Why am I focusing on this "bytes as text" issue?
• If you are writing scripts that do simple ASCII
text processing, you might be inclined to use
bytes as a way to avoid the overhead of Unicode
• You might think that bytes are exactly the same
as the familiar Python 2 string object
• This is wrong. Bytes are not text. Using bytes as
text will lead to convoluted non-idiomatic code


How to Use Bytes
• Bytes are better suited for low-level I/O
handling (message passing, distributed
computing, embedded systems, etc.)
• I will show some examples that illustrate
• A complaint: documentation (online and
books) is somewhat thin on explaining
practical uses of bytes and bytearray objects


Example : Reassembly
• In Python 2, you may know that string
concatenation leads to bad performance
msg = b""
while True:
    chunk = s.recv(BUFSIZE)
    if not chunk:
        break
    msg += chunk

• Here's the common workaround (hacky)
chunks = []
while True:
    chunk = s.recv(BUFSIZE)
    if not chunk:
        break
    chunks.append(chunk)
msg = b"".join(chunks)


Example : Reassembly
• Here's a new approach in Python 3
msg = bytearray()
while True:
    chunk = s.recv(BUFSIZE)
    if not chunk:
        break
    msg.extend(chunk)

• You treat the bytearray as a list and just
append/extend new data at the end as you go
• I like it. It's clean and intuitive.


Example: Reassembly

• The performance is good too
• Concat 1024 32-byte chunks together (10000x)
Concatenation : 18.49s
Joining : 1.55s
Extending a bytearray : 1.78s
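
If you want to repeat the measurement, something along these lines should do. This is a rough sketch, not the original benchmark script: the chunks come from a list instead of a socket, and the repetition count is cut down to keep the run short. Expect the numbers to differ by machine and Python version.

```python
from timeit import timeit

CHUNKS = [b"x" * 32] * 1024     # 1024 chunks of 32 bytes each

def concat():
    msg = b""
    for chunk in CHUNKS:
        msg += chunk
    return msg

def join():
    parts = []
    for chunk in CHUNKS:
        parts.append(chunk)
    return b"".join(parts)

def extend():
    msg = bytearray()
    for chunk in CHUNKS:
        msg.extend(chunk)
    return msg

for func in (concat, join, extend):
    # 100 repetitions keeps the run short; the figures above used 10000
    print("{:8s} {:.4f}s".format(func.__name__, timeit(func, number=100)))
```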


Example: Record Packing
• Suppose you wanted to use the struct module
to incrementally pack a large binary message
objs = [ ... ] # List of tuples to pack
msg = bytearray() # Empty message

# First pack the number of objects
msg.extend(struct.pack("<I",len(objs)))

# Incrementally pack each object
for x in objs:
    msg.extend(struct.pack(fmt, *x))

# Do something with the message
f.write(msg)

• I like this as well.
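
Here is the same idea filled in so that it runs end to end. The record layout and the sample data are assumptions made up for this sketch (a 4-byte name, an unsigned int, and a double, all little-endian), and the message is unpacked again instead of being written to a file.

```python
import struct

# Hypothetical record layout: 4-byte name, unsigned int, double (little-endian)
fmt = "<4sId"
objs = [(b"ACME", 50, 91.1), (b"IBM", 100, 121.5)]

msg = bytearray()
msg.extend(struct.pack("<I", len(objs)))     # object count first
for x in objs:
    msg.extend(struct.pack(fmt, *x))

# Unpack again to check the layout
(count,) = struct.unpack_from("<I", msg, 0)
size = struct.calcsize(fmt)
decoded = [struct.unpack_from(fmt, msg, 4 + i * size) for i in range(count)]
print(len(msg), decoded)
```

Note that the "4s" field pads short names with zero bytes, so b"IBM" comes back as b"IBM\x00".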

Example : Calculations
• Run a byte array through an XOR-cipher
>>> s = b"Hello World"
>>> t = bytes(x^42 for x in s)
>>> t
b'bOFFE\n}EXFN'
>>> bytes(x^42 for x in t)
b'Hello World'
>>>

• Compute and append a LRC checksum to a msg
# Compute the checksum and append at the end
chk = 0
for n in msg:
    chk ^= n
msg.append(chk)
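
A nice property of an XOR checksum is that the receiver can verify it by XOR-ing everything, checksum included, and expecting zero. Here is a sketch; the function names `append_lrc` and `lrc_ok` are made up for this example.

```python
def append_lrc(msg):
    """Append an XOR checksum byte to a bytearray in place."""
    chk = 0
    for n in msg:
        chk ^= n
    msg.append(chk)
    return msg

def lrc_ok(msg):
    """XOR over payload plus checksum is zero for an intact message."""
    chk = 0
    for n in msg:
        chk ^= n
    return chk == 0

msg = append_lrc(bytearray(b"Hello World"))
intact = lrc_ok(msg)

msg[0] ^= 0x01            # flip one bit to simulate corruption
corrupted = lrc_ok(msg)
print(intact, corrupted)
```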


Commentary

• I like the new bytearray object
• Many potential uses in building low-level
infrastructure for networking, distributed
computing, messaging, embedded systems, etc.
• May make much of that code cleaner, faster, and
more memory efﬁcient


Related : Buffers
• bytearray() is an example of a "buffer"
• buffer : A contiguous region of memory (e.g.,
allocated like a C/C++ array)
• There are many other examples:
a = array.array("i", [1,2,3,4,5])
b = numpy.array([1,2,3,4,5])
c = ctypes.ARRAY(ctypes.c_int,5)(1,2,3,4,5)

• Under the covers, they're all similar and often
interchangeable with bytes (especially for I/O)


Advanced : Memory Views
• memoryview()
>>> a = bytearray(b'Hello World')
>>> b = memoryview(a)
>>> b
<memory at 0x1007014d0>
>>> b[-5:] = b'There'
>>> a
bytearray(b'Hello There')
>>>

• It's essentially an overlay over a buffer
• It's very low-level and its use seems tricky
• I would probably avoid it
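
If you are curious what the overlay buys you, here is a small sketch with made-up data: slicing a memoryview gives a window onto the same memory, whereas slicing the bytearray itself makes a copy.

```python
data = bytearray(b"HEADERpayload-bytes")

view = memoryview(data)
body = view[6:]             # a window onto data, not a copy
copy = bytes(data[6:])      # slicing the bytearray itself does copy

data[6] = ord("P")          # in-place change to the underlying buffer
print(bytes(body[:7]))      # the view sees the change
print(copy[:7])             # the copy does not
```

One thing to be careful about: while a view exists, operations that would resize the bytearray (append, extend) are refused, which is part of why its use feels tricky.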
