Mastering Python 3 I/O
                                                           (version 2.0)
                                                       David Beazley
                                                  http://www.dabeaz.com

                                               Presented at PyCon'2011
                                                   Atlanta, Georgia


Copyright (C) 2011, David Beazley, http://www.dabeaz.com                   1
This Tutorial


                  • Details about a very specific aspect of Python 3
                  • Maybe the most important part of Python 3
                  • Namely, the reimplemented I/O system


Why I/O?
                • Real programs interact with the world
                   • They read and write files
                   • They send and receive messages
                • I/O is at the heart of almost everything that
                        Python is about (scripting, data processing,
                        gluing, frameworks, C extensions, etc.)
                • Most tricky porting issues are I/O related
The I/O Issue

            • Python 3 re-implements the entire I/O stack
            • Python 3 introduces new programming idioms
            • I/O handling issues can't be fixed by automatic
                   code conversion tools (2to3)




The Plan
              • We're going to take a detailed top-to-bottom
                     tour of the Python 3 I/O system
                        • Text handling, formatting, etc.
                        • Binary data handling
                        • The new I/O stack
                        • System interfaces
                        • Library design issues
Prerequisites
              • I assume that you are already somewhat
                     familiar with how I/O works in Python 2
                              • str vs. unicode
                              • print statement
                              • open() and file methods
                              • Standard library modules
                              • General awareness of I/O issues
              • Prior experience with Python 3 not assumed
Performance Disclosure
               • There are some performance tests
               • Execution environment for tests:
                  • 2.66 GHz 4-core Mac Pro, 3 GB memory
                  • Mac OS X 10.6.4 (Snow Leopard)
                  • All Python interpreters compiled from
                                source using same config/compiler
               • Tutorial is not meant to be a detailed
                      performance study so all results should be
                      viewed as rough estimates
Resources

              • I have made a few support files:
               http://www.dabeaz.com/python3io/index.html

              • You can try some of the examples as we go
              • However, it is fine to just watch/listen and try
                     things on your own later



Part 1
                                                      Introducing Python 3




Syntax Changes
              • As you know, Python 3 changes some syntax
              • print is now a function print()
                               print("Hello World")


              • Exception handling syntax changes slightly
                               try:
                                   ...
                               except IOError as e:        # "as" added
                                   ...


              • Yes, your old code will break
Many New Features
            • Python 3 introduces many new features
            • Composite string formatting
                      "{:10s} {:10d} {:10.2f}".format(name, shares, price)

            • Dictionary comprehensions
                      a = {key.upper():value for key,value in d.items()}


            • Function annotations
                      def square(x:int) -> int:
                          return x*x


            • Much more... but that's a different tutorial
Changed Built-ins
             • Many of the core built-in operations change
             • Examples : range(), zip(), etc.
                      >>> a = [1,2,3]
                      >>> b = [4,5,6]
                      >>> c = zip(a,b)
                      >>> c
                      <zip object at 0x100452950>
                      >>>


              • Python 3 prefers iterators/generators
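
              • A minimal sketch of what this means in practice: zip()
                     returns a one-shot iterator, so wrap it in list() when
                     an actual list is needed (variable names are illustrative)

```python
a = [1, 2, 3]
b = [4, 5, 6]

c = zip(a, b)            # lazy iterator, nothing is computed yet
pairs = list(c)          # consuming it produces a real list
leftover = list(c)       # the iterator is now exhausted

print(pairs)             # [(1, 4), (2, 5), (3, 6)]
print(leftover)          # []
```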

Library Reorganization
              • The standard library has been cleaned up
              • Example : Python 2
                          from urllib2 import urlopen
                          u = urlopen("http://www.python.org")


              • Example : Python 3
                         from urllib.request import urlopen
                         u = urlopen("http://www.python.org")




2to3 Tool
              • There is a tool (2to3) that can be used to
                     identify (and optionally fix) Python 2 code
                     that must be changed to work with Python 3
              • It's a command-line tool:
                          bash % 2to3 myprog.py
                          ...


              • 2to3 helps, but it's not foolproof (in fact, most
                     of the time it doesn't quite work)


2to3 Example
              • Consider this Python 2 program
                       # printlinks.py
                       import urllib
                       import sys
                       from HTMLParser import HTMLParser

                       class LinkPrinter(HTMLParser):
                           def handle_starttag(self,tag,attrs):
                               if tag == 'a':
                                  for name,value in attrs:
                                      if name == 'href': print value

                       data = urllib.urlopen(sys.argv[1]).read()
                       LinkPrinter().feed(data)

              • It prints all <a href="..."> links on a web page
2to3 Example
           • Here's what happens if you run 2to3 on it
                                   bash % 2to3 printlinks.py
                                   ...
                                   --- printlinks.py (original)
                                   +++ printlinks.py (refactored)
                                   @@ -1,12 +1,12 @@
                                   -import urllib
                                   +import urllib.request, urllib.parse, urllib.error
                                    import sys
                                   -from HTMLParser import HTMLParser
                                   +from html.parser import HTMLParser

                                    class LinkPrinter(HTMLParser):
                                        def handle_starttag(self,tag,attrs):
                                            if tag == 'a':
                                               for name,value in attrs:
                                   -               if name == 'href': print value
                                   +               if name == 'href': print(value)
                                   ...

           • It identifies lines that must be changed
Fixed Code
           • Here's the code after applying the 2to3 fixes
                       import urllib.request, urllib.parse, urllib.error
                       import sys
                       from html.parser import HTMLParser

                       class LinkPrinter(HTMLParser):
                           def handle_starttag(self,tag,attrs):
                               if tag == 'a':
                                  for name,value in attrs:
                                      if name == 'href': print(value)

                       data = urllib.request.urlopen(sys.argv[1]).read()
                       LinkPrinter().feed(data)


           • This is syntactically correct Python 3
           • But, it still doesn't work. Do you see why?
Broken Code
           • Run it
                   bash % python3 printlinks.py http://www.python.org
                   Traceback (most recent call last):
                     File "printlinks.py", line 12, in <module>
                       LinkPrinter().feed(data)
                     File "/Users/beazley/Software/lib/python3.1/html/parser.py",
                   line 107, in feed
                       self.rawdata = self.rawdata + data
                   TypeError: Can't convert 'bytes' object to str implicitly
                   bash %


                                           Ah ha! Look at that!

              • That is an I/O handling problem
              • Important lesson : 2to3 didn't find it
Actually Fixed Code
           • This version "works"
                       import urllib.request, urllib.parse, urllib.error
                       import sys
                       from html.parser import HTMLParser

                       class LinkPrinter(HTMLParser):
                           def handle_starttag(self,tag,attrs):
                               if tag == 'a':
                                  for name,value in attrs:
                                      if name == 'href': print(value)

                       data = urllib.request.urlopen(sys.argv[1]).read()
                       LinkPrinter().feed(data.decode('utf-8'))



                                             I added this one tiny bit (by hand)
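
           • A minimal sketch of the failure without any network access.
                  The bytes literal stands in for what urlopen(...).read()
                  returns, and the 'utf-8' encoding is an assumption about
                  the page, as in the code above

```python
# Simulated payload: urlopen(...).read() returns bytes in Python 3
data = b'<a href="http://www.python.org">Python</a>'

rawdata = ""                      # HTMLParser keeps its buffer as str
try:
    rawdata = rawdata + data      # str + bytes is not allowed
    mixed_ok = True
except TypeError:
    mixed_ok = False

text = data.decode("utf-8")       # explicit decode (assumed encoding)
rawdata = rawdata + text          # now both operands are str

print(mixed_ok)                   # False
print(rawdata)
```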

Important Lessons

              • A lot of things change in Python 3
              • 2to3 only fixes really "obvious" things
              • It does not fix I/O problems
              • Why you should care : Real programs do I/O

Part 2
                                                           Working with Text




Making Peace with Unicode

              • In Python 3, all text is Unicode
              • All strings are Unicode
              • All text-based I/O is Unicode
              • You can't ignore it or live in denial
              • However, you don't have to be a Unicode guru

Text Representation
                 • Old-school programmers know about ASCII




                 • Each character has its own integer byte code
                 • Text strings are sequences of character codes
Unicode Characters
                • Unicode is the same idea only extended
                • It defines a standard integer code for every
                       character used in all languages (except for
                       fictional ones such as Klingon, Elvish, etc.)
                • The numeric value is known as a "code point"
                • Denoted U+HHHH in polite conversation
                                                           ñ   =   U+00F1
                                                           ε   =   U+03B5
                                                           ઇ   =   U+0A87
                                                           ㌄   =   U+3304

Unicode Charts
               • An issue : There are a lot of code points
               • Largest code point : U+10FFFF
               • Code points are organized into charts
                                     http://www.unicode.org/charts

                • Go there and you will find charts organized by
                       language or topic (e.g., greek, math, music, etc.)


Using Unicode Charts
            • Consult to get code points for use in literals




                                   t = "That's a spicy Jalape\u00f1o!"


            • In practice : It doesn't come up that often
Unicode Escapes
             • There are three Unicode escapes in literals
                • \xhh : Code points U+00 - U+FF
                • \uhhhh : Code points U+0100 - U+FFFF
                • \Uhhhhhhhh : Code points U+10000 and above
             • Examples:
                       a = "\xf1"                         # a = 'ñ'
                       b = "\u210f"                       # b = 'ℏ'
                       c = "\U0001d122"                   # c = '𝄢'
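
             • A quick check of the three escape forms using ord().
                    Note the assumption in the comment: on a 16-bit (UCS-2)
                    build of Python 3.1/3.2 the last character is stored
                    as two code units, so its len() is 2 there

```python
a = "\xf1"          # 2 hex digits
b = "\u210f"        # 4 hex digits
c = "\U0001d122"    # 8 hex digits

for ch in (a, b, c):
    # ord() gives the code point; assumes Python 3.3+ (or a UCS-4 build),
    # where every character is a single item in the string
    print("U+{:04X}".format(ord(ch)), len(ch))
```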

A repr() Caution
              • Python 3 source code is now Unicode
              • Output of repr() is Unicode and doesn't use the
                      escape codes (characters will be rendered)
                       >>> a = "Jalape\xf1o"
                       >>> a
                       'Jalapeño'

              • Use ascii() to see the escape codes
                       >>> print(ascii(a))
                       'Jalape\xf1o'
                       >>>

Commentary

                  • Don't overthink Unicode
                  • Unicode strings are mostly like ASCII strings
                         except that there is a greater range of codes
                  • Everything that you normally do with strings
                         (stripping, finding, splitting, etc.) works fine,
                         but is simply expanded



A Caution
           • Unicode is just like ASCII except when it's not
                     >>> s = "Jalape\xf1o"
                     >>> t = "Jalapen\u0303o"        # 'ñ' = 'n' + '˜' (combining ˜)
                     >>> s
                     'Jalapeño'
                     >>> t
                     'Jalapeño'
                     >>> s == t
                     False
                     >>> len(s), len(t)
                     (8, 9)
                     >>>


           • Many hairy bits
           • However, that's also a different tutorial
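
           • One common way to cope with this particular hairy bit is
                  Unicode normalization. A minimal sketch using the standard
                  unicodedata module (the variable names s_nfd and t_nfc are
                  illustrative)

```python
import unicodedata

s = "Jalape\xf1o"        # precomposed ñ (U+00F1)
t = "Jalapen\u0303o"     # 'n' followed by combining tilde (U+0303)

# NFC composes characters, NFD decomposes them
s_nfd = unicodedata.normalize("NFD", s)
t_nfc = unicodedata.normalize("NFC", t)

print(s == t)            # False
print(s == t_nfc)        # True
print(s_nfd == t)        # True
print(len(s), len(t))    # 8 9
```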
Unicode Representation
                 • Internally, Unicode character codes are just
                         stored as arrays of C integers (16 or 32 bits)
                          t = "Jalapeño"

                          004a 0061 006c 0061 0070 0065 00f1 006f      (UCS-2,16-bits)
                          0000004a 00000061 0000006c 00000061 ...      (UCS-4,32-bits)

                 • You can find out which using the sys module
                          >>> sys.maxunicode
                          65535                            # 16-bits

                          >>> sys.maxunicode
                          1114111                          # 32-bits



Memory Use
             • Yes, text strings in Python 3 require either 2x
                    or 4x as much memory to store as Python 2
             • For example: Read a 10MB ASCII text file
                     data = open("bigfile.txt").read()

                     >>> sys.getsizeof(data)               # Python 2.6
                     10485784

                     >>> sys.getsizeof(data)               # Python 3.1 (UCS-2)
                     20971578

                     >>> sys.getsizeof(data)               # Python 3.1 (UCS-4)
                     41943100


             • See PEP 393 (possible change in future)
Performance Impact
                  • Increased memory use does impact the
                         performance of string operations that
                         involve bulk memory copies
                                 • Slices, joins, split, replace, strip, etc.
                  • Example:
                                  timeit("text[:-1]","text='x'*100000")

                                  Python 2.7.1 (bytes)      : 11.5 s
                                  Python 3.2 (UCS-2)       : 24.2 s
                                  Python 3.2 (UCS-4)       : 47.5 s

                  • Slower because more bytes are moving
Performance Impact
                 • Operations that process strings character by
                        character often run at comparable speed
                                 • lower, upper, find, regexes, etc.
                 • Example:
                          timeit("text.upper()","text='x'*1000")

                           Python 2.7.1 (bytes)             : 37.9s (???)
                           Python 3.2 (UCS-2)              : 6.9s
                           Python 3.2 (UCS-4)              : 7.0s

                 • The same number of iterations regardless of
                         the size of each character
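
                 • A sketch for repeating both measurements on your own
                        interpreter. The iteration count (number=10000) is
                        an arbitrary choice made here to keep the run short,
                        so the times will not match the figures above

```python
from timeit import timeit

# Bulk memory copy: slicing copies almost the whole string
slice_time = timeit("text[:-1]", "text='x'*100000", number=10000)

# Character-by-character processing
upper_time = timeit("text.upper()", "text='x'*1000", number=10000)

print("slice : {:.4f} s".format(slice_time))
print("upper : {:.4f} s".format(upper_time))
```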
Commentary

                  • Yes, unicode strings come at a cost
                  • Must study it if text-processing is a major
                         component of your application
                  • Keep in mind--most programs do more than
                         just string operations (overall performance
                         impact might be far less than you think)



Issue : Text Encoding
                 • The internal representation of characters is not
                         the same as how characters are stored in files
                  Text file                                Hello World

             File content                                  48 65 6c 6c 6f 20 57 6f 72 6c 64 0a
            (ASCII bytes)

                                                            read() ↓          ↑ write()

         Representation                                    00000048 00000065 0000006c 0000006c
     inside the interpreter                                0000006f 00000020 00000057 0000006f
      (UCS-4, 32-bit ints)                                 00000072 0000006c 00000064 0000000a

Issue : Text Encoding
                  • There are also many possible char encodings
                         for text (especially for non-ASCII chars)
                                                                  "Jalapeño"

                            latin-1                        4a 61 6c 61 70 65 f1 6f

                             cp437                         4a 61 6c 61 70 65 a4 6f

                              utf-8                        4a 61 6c 61 70 65 c3 b1 6f


                             utf-16                        ff fe 4a 00 61 00 6c 00 61 00
                                                           70 00 65 00 f1 00 6f 00


                   • Emphasize : This only concerns how text is
                           stored in files, not how it is stored in memory
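
                   • A sketch that reproduces the byte values above. The
                          hexdump() helper is hypothetical, and 'utf-16-le' is
                          used instead of 'utf-16' so that the output does not
                          start with the ff fe byte-order mark

```python
s = "Jalape\xf1o"

def hexdump(data):
    # Format bytes as space-separated two-digit hex values
    return " ".join("{:02x}".format(byte) for byte in data)

for enc in ("latin-1", "cp437", "utf-8", "utf-16-le"):
    print("{:10s} {}".format(enc, hexdump(s.encode(enc))))
```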
Issue : Text Encoding
                  • Emphasize: text is always stored exactly the
                         same way inside the Python interpreter
                  Python interpreter      "Jalapeño"      4a 00 61 00 6c 00 61 00
                                                          70 00 65 00 f1 00 6f 00

                  Files      latin-1      4a 61 6c 61 70 65 f1 6f
                             utf-8        4a 61 6c 61 70 65 c3 b1 6f


                    • It's only the encoding in files that varies
I/O Encoding
                  • All text is now encoded and decoded
                  • If reading text, it must be decoded from its
                         source format into Python strings
                  • If writing text, it must be encoded into some
                         kind of well-known output format
                  • This is a major difference between Python 2
                         and Python 3. In Python 2, you could write
                         programs that just ignored encoding and
                         read text as bytes (ASCII).

Reading/Writing Text
                 • Built-in open() function now has an optional
                         encoding parameter
                            f = open("somefile.txt","rt",encoding="latin-1")


                 • If you omit the encoding, a platform-dependent
                         default is used (see locale.getpreferredencoding());
                         it is UTF-8 on the system shown here
                            >>> f = open("somefile.txt","rt")
                            >>> f.encoding
                            'UTF-8'
                            >>>


                  • Also, in case you're wondering, text file modes
                          should be specified as "rt","wt","at", etc.
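
                  • A round-trip sketch with an explicit encoding. The
                          scratch file in a temporary directory is an
                          assumption made so the example is self-contained

```python
import locale
import os
import tempfile

# Encoding that open() falls back on when none is given (platform dependent)
print(locale.getpreferredencoding(False))

# Hypothetical scratch file
path = os.path.join(tempfile.mkdtemp(), "somefile.txt")

with open(path, "wt", encoding="latin-1") as f:
    f.write("Jalape\xf1o\n")

with open(path, "rt", encoding="latin-1") as f:
    text = f.read()          # decoded into a str

with open(path, "rb") as f:
    raw = f.read()           # the encoded bytes as stored on disk

print(repr(text))
print(raw)
```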

Encoding/Decoding Bytes
                 • Use encode() and decode() for byte strings
                            >>> s = "Jalapeño"
                            >>> data = s.encode('utf-8')
                            >>> data
                            b'Jalape\xc3\xb1o'

                            >>> data.decode('utf-8')
                            'Jalapeño'
                            >>>


               • You'll need this for transmitting strings on
                       network connections, passing to external
                       systems, etc.

Important Encodings
                   • If you're not doing anything with Unicode
                          (e.g., just processing ASCII files), there are
                          still three encodings you must know
                                  • ASCII
                                  • Latin-1
                                  • UTF-8
                   • Will briefly describe each one
ASCII Encoding
                 • Text that is restricted to 7-bit ASCII (0-127)
                 • Any characters outside of that range
                        produce an encoding error
                          >>> f = open("output.txt","wt",encoding="ascii")
                          >>> f.write("Hello World\n")
                          12
                          >>> f.write("Spicy Jalapeño\n")
                          Traceback (most recent call last):
                             File "<stdin>", line 1, in <module>
                          UnicodeEncodeError: 'ascii' codec can't encode
                          character '\xf1' in position 12: ordinal not in
                          range(128)
                          >>>



Latin-1 Encoding
                   • Text that is restricted to 8-bit bytes (0-255)
                   • Byte values are left "as-is"
                            >>> f = open("output.txt","wt",encoding="latin-1")
                            >>> f.write("Spicy Jalapeño\n")
                            15
                            >>>


                   • Most closely emulates Python 2 behavior
                   • Also known as "iso-8859-1" encoding
                   • Pro tip: This is the fastest encoding for pure
                          8-bit text (ASCII files, etc.)
UTF-8 Encoding
                   • A multibyte variable-length encoding that can
                          represent all Unicode characters
                            Encoding                                        Description
                            0nnnnnnn                                        ASCII (0-127)
                            110nnnnn 10nnnnnn                               U+0080-U+07FF
                            1110nnnn 10nnnnnn 10nnnnnn                      U+0800-U+FFFF
                            11110nnn 10nnnnnn 10nnnnnn 10nnnnnn             U+10000-U+10FFFF


                   • Example:
                            ñ = 0xf1 = 11110001


                                                 = 11000011 10110001 = 0xc3 0xb1   (UTF-8)
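
                   • The bit packing in this example can be sketched by hand.
                          utf8_two_byte() is a hypothetical helper that covers
                          only the two-byte row of the table (U+0080-U+07FF);
                          real code should simply call str.encode('utf-8')

```python
def utf8_two_byte(codepoint):
    # Hypothetical helper: handles only the 110nnnnn 10nnnnnn form
    if not 0x80 <= codepoint <= 0x7FF:
        raise ValueError("code point outside the two-byte range")
    first = 0b11000000 | (codepoint >> 6)            # upper 5 bits
    second = 0b10000000 | (codepoint & 0b00111111)   # lower 6 bits
    return bytes([first, second])

encoded = utf8_two_byte(0xF1)
print(encoded)                              # b'\xc3\xb1'
print(encoded == "\xf1".encode("utf-8"))    # True
```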



UTF-8 Encoding
                   • Main feature of UTF-8 is that ASCII is
                          embedded within it
                   • If you're not working with international
                          characters, UTF-8 will work transparently
                   • Usually a safe default to use when you're not
                          sure (e.g., passing Unicode strings to
                          operating system functions, interfacing with
                          foreign software, etc.)

Interlude
                   • If migrating from Python 2, keep in mind
                      • Python 3 strings use multibyte integers
                      • Python 3 always encodes/decodes I/O
                      • If you don't say anything about encoding,
                                    Python 3 uses the platform default
                                    (often UTF-8)
                   • Everything that you did before should work
                          just fine in Python 3 (probably)


Encoding Errors
                   • When working with Unicode, you might
                          encounter encoding/decoding errors
                            >>> f = open('foo',encoding='ascii')
                            >>> data = f.read()
                            Traceback (most recent call last):
                              File "<stdin>", line 1, in <module>
                              File "/usr/local/lib/python3.2/encodings/
                            ascii.py", line 26, in decode
                                return codecs.ascii_decode(input, self.errors)
                            [0]
                            UnicodeDecodeError: 'ascii' codec can't decode byte
                            0xc3 in position 6: ordinal not in range(128)
                            >>>

                   • This is almost always bad--must be fixed
Fixing Encoding Errors
                   • Solution: Use the right encoding
                             >>> f = open('foo',encoding='utf-8')
                             >>> data = f.read()
                             >>>

                   • Bad Solution : Change the error handling
                             >>> f = open('foo',encoding='ascii',errors='ignore')
                             >>> data = f.read()
                             >>> data
                             'Jalapeo'
                             >>>

                   • My advice : Never use the errors argument
                          without a really good reason. Do it right.
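
                   • A sketch comparing the error handlers on in-memory bytes
                          instead of a file (the byte string is made up to
                          match the 'foo' example above)

```python
data = "Jalape\xf1o".encode("utf-8")      # b'Jalape\xc3\xb1o'

try:
    data.decode("ascii")                  # default is errors='strict'
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = data.decode("ascii", errors="ignore")     # silently drops bad bytes
replaced = data.decode("ascii", errors="replace")   # U+FFFD for each bad byte
correct = data.decode("utf-8")                      # the right encoding

print(strict_failed)      # True
print(ignored)            # Jalapeo
print(ascii(replaced))    # 'Jalape\ufffd\ufffdo'
print(correct)            # Jalapeño
```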

Part 3
                                                  Printing and Formatting




New Printing

               • In Python 3, print() is used for text output
               • Here is a mini porting guide
                          Python 2                         Python 3

                          print x,y,z                      print(x,y,z)
                          print x,y,z,                     print(x,y,z,end=' ')
                          print >>f,x,y,z                  print(x,y,z,file=f)


               • print() has a few new tricks
Printing Enhancements
                 • Picking a different item separator
                         >>> print(1,2,3,sep=':')
                         1:2:3
                         >>> print("Hello","World",sep='')
                         HelloWorld
                         >>>

                 • Picking a different line ending
                         >>> print("What?",end="!?!\n")
                         What?!?!
                         >>>


                 • Relatively minor, but these features were often
                         requested (e.g., "how do I get rid of the space?")
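
                 • The sep, end, and file arguments can be combined. A sketch
                         that captures the output in an io.StringIO buffer
                         (the buffer is an assumption made here so the result
                         can be inspected)

```python
import io

buf = io.StringIO()                          # in-memory text file
print(1, 2, 3, sep=":", file=buf)
print("Hello", "World", sep="", file=buf)
print("What?", end="!?!\n", file=buf)

output = buf.getvalue()
print(output, end="")
```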

Discussion : New Idioms
               • In Python 2, you might have code like this
                               print ','.join([name,shares,price])


               • Which of these is better in Python 3?
                                print(",".join([name,shares,price]))

                                                           - or -
                               print(name, shares, price, sep=',')


               • Overall, I think I like the second one (even
                      though it runs a tad bit slower)
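
               • One practical difference, sketched with made-up sample
                      values: join() only accepts strings, while print()
                      converts each argument for you

```python
import io

# Hypothetical sample values; shares and price are not strings
name, shares, price = "ACME", 100, 490.1

# join() needs an explicit str() conversion for non-string items
joined = ",".join(str(value) for value in (name, shares, price))

# print() with sep converts every argument itself
buf = io.StringIO()
print(name, shares, price, sep=",", file=buf)
printed = buf.getvalue()

print(joined)
print(printed, end="")
```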

Object Formatting

                 • Here is Python 2 (%)
                            s = "%10.2f" % price


                 • Here is Python 3 (format)
                           s = format(price,"10.2f")


                 • This is part of a whole new formatting system

Some History
               • String formatting is one of the few features
                       of Python 2 that can't be easily customized
               • Classes can define __str__() and __repr__()
               • However, they can't customize % processing
               • Python 2.6/3.0 adds a __format__() special
                       method that addresses this in conjunction
                       with some new string formatting machinery


String Conversions
               • Objects now have three string conversions
                        >>> x = 1/3
                        >>> x.__str__()
                        '0.333333333333'
                        >>> x.__repr__()
                        '0.3333333333333333'
                        >>> x.__format__("0.2f")
                        '0.33'
                        >>> x.__format__("20.2f")
                        '                0.33'
                        >>>

              • You will notice that __format__() takes a
                      code similar to those used by the % operator


format() function
               • format(obj, fmt) calls __format__
                        >>> x = 1/3
                        >>> format(x,"0.2f")
                        '0.33'
                        >>> format(x,"20.2f")
                        '                0.33'
                        >>>

               • This is analogous to str() and repr()
                        >>> str(x)
                        '0.333333333333'
                        >>> repr(x)
                        '0.3333333333333333'
                        >>>



Format Codes (Builtins)
               • For builtins, there are standard format codes
                        Old Format                         New Format   Description
                        "%d"                               "d"          Decimal Integer
                        "%f"                               "f"          Floating point
                        "%s"                               "s"          String
                        "%e"                               "e"          Scientific notation
                        "%x"                               "x"          Hexadecimal


                 • Plus there are some brand new codes
                                                           "o"          Octal
                                                           "b"          Binary
                                                           "%"          Percent




Format Examples
                  • Examples of simple formatting
                           >>> x = 42
                           >>> format(x,"x")
                           '2a'
                           >>> format(x,"b")
                           '101010'

                           >>> y = 2.71828
                           >>> format(y,"f")
                           '2.718280'
                           >>> format(y,"e")
                           '2.718280e+00'
                           >>> format(y,"%")
                           '271.828000%'




Format Modifiers
               • Field width and precision modifiers
                          [width][.precision]code


               • Examples:
                          >>> y = 2.71828
                          >>> format(y,"0.2f")
                          '2.72'
                          >>> format(y,"10.4f")
                          '    2.7183'
                          >>>


               • This is exactly the same convention as with
                       the legacy % string formatting

Alignment Modifiers
                • Alignment Modifiers
                          [<|>|^][width][.precision]code

                          <         left align
                          >         right align
                          ^         center align

               • Examples:
                          >>> y = 2.71828
                          >>> format(y,"<20.2f")
                          '2.72                '
                          >>> format(y,"^20.2f")
                          '        2.72        '
                          >>> format(y,">20.2f")
                          '                2.72'
                          >>>


Fill Character
              • Fill Character
                         [fill][<|>|^][width][.precision]code


              • Examples:
                         >>> x = 42
                         >>> format(x,"08d")
                         '00000042'
                         >>> format(x,"032b")
                         '00000000000000000000000000101010'
                         >>> format(x,"=^32d")
                         '===============42==============='
                         >>>




Thousands Separator
              • Insert a ',' before the precision specifier
                         [fill][<|>|^][width][,][.precision]code


              • Examples:
                         >>> x = 123456789
                         >>> format(x,",d")
                         '123,456,789'
                         >>> format(x,"10,.2f")
                         '123,456,789.00'
                         >>>


              • Alas, the use of the ',' isn't localized
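The alignment, width, thousands-separator, and precision modifiers combine in a single format code. A small sketch that lines up a table; the rows are made-up sample data.

```python
rows = [("ACME", 50, 91.1), ("IBM", 1200, 123456.789)]    # hypothetical data

lines = []
for name, shares, price in rows:
    lines.append(
        format(name, "<8s")          # left-aligned, 8 wide
        + format(shares, ">8,d")     # right-aligned, 8 wide, thousands separator
        + format(price, ">14,.2f")   # right-aligned, 14 wide, separator, 2 decimals
    )

for line in lines:
    print(line)
```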
Discussion

               • As you can see, there's a lot of flexibility in
                       the new format method (there are other
                       features not shown here)
               • User-defined objects can also completely
                       customize their formatting if they implement
                       __format__(self,fmt)
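A minimal sketch of a user-defined `__format__()`. The `Point` class is a hypothetical example, and passing the code through to each coordinate is just one possible design.

```python
class Point:
    """Hypothetical example class; not part of the original slides."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __format__(self, fmt):
        # With no format code, fall back to a plain conversion.
        if not fmt:
            return "({}, {})".format(self.x, self.y)
        # Otherwise apply the same code to each coordinate.
        return "({}, {})".format(format(self.x, fmt), format(self.y, fmt))

p = Point(1.5, 2.25)
a = format(p, "0.2f")          # format() calls p.__format__("0.2f")
b = "{:8.3f}".format(p)        # so does the string .format() method
c = format(p, "")              # empty format code
```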



Composite Formatting
               • String .format() method formats multiple
                       values all at once (replacement for %)
               • Some examples:
                       >>> "{name} has {n} messages".format(name="Dave",n=37)
                       'Dave has 37 messages'

                       >>> "{:10s} {:10d} {:10.2f}".format('ACME',50,91.1)
                       'ACME               50      91.10'

                       >>> "<{0}>{1}</{0}>".format('para','Hey there')
                       '<para>Hey there</para>'
                       >>>



Composite Formatting

                • format() method scans the string for
                        formatting specifiers enclosed in {} and
                        expands each one
                • Each {} specifies what is being formatted as
                        well as how it should be formatted
                • Tricky bit : There are two aspects to it

What to Format?
                • You must specify arguments to .format()
                • Positional:
                         "{0} has {1} messages".format("Dave",37)


                • Keyword:
                         "{name} has {n} messages".format(name="Dave",n=37)


                • In order:
                         "{} has {} messages".format("Dave",37)




String Templates
                • Template Strings
                         from string import Template

                         msg = Template("$name has $n messages")
                         print(msg.substitute(name="Dave",n=37))


                • New String Formatting
                         msg = "{name} has {n} messages"
                         print(msg.format(name="Dave",n=37))


                • Very similar

Indexing/Attributes
                • Cool thing: You can perform index lookups
                         record = {
                            'name' : 'Dave',
                            'n' : 37
                         }

                         '{r[name]} has {r[n]} messages'.format(r=record)


                • Or attribute lookups with instances
                         record = Record('Dave',37)

                         '{r.name} has {r.n} messages'.format(r=record)


                • Restriction: Can't have arbitrary expressions
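A runnable version of the lookups above. The slide does not show how `Record` is defined, so this sketch assumes a namedtuple as a stand-in.

```python
from collections import namedtuple

# Hypothetical stand-in for the Record class used on the slide
Record = namedtuple("Record", ["name", "n"])

as_dict = {"name": "Dave", "n": 37}
as_obj = Record("Dave", 37)

by_key = "{r[name]} has {r[n]} messages".format(r=as_dict)    # no quotes around the key
by_attr = "{r.name} has {r.n} messages".format(r=as_obj)
by_index = "{r[0]} has {r[1]} messages".format(r=as_obj)      # integer index

# Expressions are not evaluated: "n + 1" is treated as an attribute name
try:
    "{r.n + 1}".format(r=as_obj)
    expr_failed = False
except (AttributeError, KeyError, ValueError):
    expr_failed = True
```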
Specifying the Format

                • Recall: There are three string format functions
                          str(s)
                          repr(s)
                          format(s,fmt)


                • Each {item} can pick which it wants to use
                         {item}                            # Replaced by str(item)
                         {item!r}                          # Replaced by repr(item)
                         {item:fmt}                        # Replaced by format(item, fmt)
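A short sketch of the three choices applied to one object. A `datetime.date` is used here only because its str(), repr(), and format() results all differ. Strictly speaking, a bare {item} calls format(item, ""), which gives the same result as str(item) for most objects.

```python
from datetime import date

d = date(2011, 3, 10)

plain = "{0}".format(d)               # same result as str(d) here
as_repr = "{0!r}".format(d)           # repr(d)
custom = "{0:%Y/%m/%d}".format(d)     # format(d, "%Y/%m/%d"); date defines its own codes
```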




Format Examples

                 • More Examples:
                   >>> "{name:10s} {price:10.2f}".format(name='ACME',price=91.1)
                   'ACME            91.10'

                   >>> "{s.name:10s} {s.price:10.2f}".format(s=stock)
                   'ACME            91.10'

                   >>> "{name!r},{price}".format(name="ACME",price=91.1)
                   "'ACME',91.1"
                   >>>


                            note repr() output here



Other Formatting Details

                    • { and } must be escaped if part of formatting
                    • Use '{{' for '{'
                    • Use '}}' for '}'
                    • Example:
                             >>> "The value is {{{0}}}".format(42)
                             'The value is {42}'
                             >>>




Nested Format Expansion
                • .format() allows one level of nested lookups in
                        the format part of each {}
                          >>> s = ('ACME',50,91.10)
                          >>> "{0:{width}s} {2:{width}.2f}".format(*s,width=12)
                          'ACME                91.10'
                          >>>



                  • Probably best not to get too carried away in
                         the interest of code readability though


Formatting a Mapping
               • Variation : s.format_map(d)
                           >>> record = {
                               'name' : 'Dave',
                               'n' : 37
                           }
                           >>> "{name} has {n} messages".format_map(record)
                           'Dave has 37 messages'
                           >>>


                • This is a convenience function--allows names
                       to come from a mapping without using **
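One case where format_map() does more than save typing, sketched with a hypothetical dict subclass: the mapping is used directly, so a `__missing__()` method still works. With `.format(**d)` the contents are copied into an ordinary dict and that behavior is lost.

```python
class Default(dict):
    """Hypothetical dict subclass that leaves unknown fields in place."""

    def __missing__(self, key):
        return "{" + key + "}"

record = {"name": "Dave", "n": 37}

same_a = "{name} has {n} messages".format_map(record)
same_b = "{name} has {n} messages".format(**record)

# Only 'name' is supplied; the missing 'n' field is handled by __missing__()
partial = "{name} has {n} messages".format_map(Default(name="Dave"))
```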


Commentary


              • The new string formatting is very powerful
              • The % operator will likely stay, but the new
                     formatting adds more flexibility




Part 4
                                          Binary Data Handling and Bytes




Bytes and Byte Arrays


                • Python 3 has support for "byte-strings"
                • Two new types : bytes and bytearray
                • They are quite different from Python 2 strings


Defining Bytes
              • Here's how to define byte "strings"
                      a = b"ACME 50 91.10"            # Byte string literal
                      b = bytes([1,2,3,4,5])          # From a list of integers
                      c = bytes(10)                   # An array of 10 zero-bytes
                      d = bytes("Jalapeño","utf-8")   # Encoded from string

              • Can also create from a string of hex digits
                        e = bytes.fromhex("48656c6c6f")


              • All of these define an object of type "bytes"
                          >>> type(a)
                          <class 'bytes'>
                          >>>


              • However, this new bytes object is odd
Bytes as Strings
             • Bytes have standard "string" operations
                       >>> s = b"ACME 50 91.10"
                       >>> s.split()
                       [b'ACME', b'50', b'91.10']
                       >>> s.lower()
                       b'acme 50 91.10'
                       >>> s[5:7]
                       b'50'

             • And bytes are immutable like strings
                       >>> s[0] = b'a'
                       Traceback (most recent call last):
                         File "<stdin>", line 1, in <module>
                       TypeError: 'bytes' object does not support item assignment




Bytes as Integers
            • Unlike Python 2, bytes are arrays of integers
                     >>> s = b"ACME 50 91.10"
                     >>> s[0]
                     65
                     >>> s[1]
                     67
                     >>>


            • Same for iteration
                     >>> for c in s: print(c,end=' ')
                     65 67 77 69 32 53 48 32 57 49 46 49 48
                     >>>

            • Hmmmm. Curious.
Porting Note
            • I have encountered a lot of minor problems
                   with bytes in porting libraries
                     data = s.recv(1024)
                     if data[0] == '+':                    # Python 2
                         ...


                     data = s.recv(1024)
                     if data[0] == b'+':                   # ERROR! (int == bytes is always False)
                         ...


                     data = s.recv(1024)
                     if data[0] == 0x2b:                   # CORRECT
                         ...
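Several equivalent ways to write the first-byte test in Python 3. This is a sketch; the `data` value is a made-up stand-in for the result of s.recv(1024).

```python
data = b"+OK 12 messages"    # hypothetical stand-in for s.recv(1024)

by_int = data[0] == 0x2b             # indexing yields an int
by_ord = data[0] == ord("+")         # same test, written with the character
by_slice = data[:1] == b"+"          # a one-byte slice is still a bytes object
by_prefix = data.startswith(b"+")    # often the most readable form
wrong = data[0] == b"+"              # int vs bytes: False, and no exception is raised
```

The slice and startswith() forms also behave sensibly when `data` is empty, where `data[0]` would raise IndexError.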



Porting Note
            • Be careful with ord() (not needed)
                     # Python 2                            # Python 3
                     data = s.recv(1024)                   data = s.recv(1024)
                     x = ord(data[0])                      x = data[0]



             • Conversion of objects into bytes
                     >>> x = 7
                     >>> bytes(x)
                     b'x00x00x00x00x00x00x00'

                     >>> str(x).encode('ascii')
                     b'7'
                     >>>
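Which conversion is right depends on what the receiving side expects. A sketch of the common choices; the 4-byte widths and byte orders are arbitrary picks for illustration.

```python
import struct

x = 7

zeros = bytes(x)                  # seven zero bytes, not the number 7
text = str(x).encode("ascii")     # the decimal digits as ASCII text
single = bytes([x])               # one byte holding the value 7
packed = struct.pack("<I", x)     # 4-byte little-endian unsigned integer
wide = x.to_bytes(4, "big")       # int.to_bytes(), available in Python 3.2+
back = int.from_bytes(wide, "big")
```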




bytearray objects
                   • A bytearray is a mutable bytes object
                            >>> s = bytearray(b"ACME 50 91.10")
                            >>> s[:4] = b"PYTHON"
                            >>> s
                            bytearray(b'PYTHON 50 91.10')
                            >>> s[0] = 0x70           # Must assign integers
                            >>> s
                            bytearray(b'pYTHON 50 91.10')
                            >>>

                   • It also gives you various list operations
                            >>> s = bytearray(b"ACME 50 91.10")
                            >>> s.append(23)
                            >>> s.append(45)
                            >>> s.extend([1,2,3,4])
                            >>> s
                            bytearray(b'ACME 50 91.10\x17-\x01\x02\x03\x04')
                            >>>

An Observation
                • bytes and bytearray are not really meant to
                        mimic Python 2 string objects
                • They're closer to array.array('B',...) objects
                          >>> import array
                          >>> s = array.array('B',[10,20,30,40,50])
                          >>> s[1]
                          20
                          >>> s[1] = 200
                          >>> s.append(100)
                          >>> s.extend([65,66,67])
                          >>> s
                          array('B', [10, 200, 30, 40, 50, 100, 65, 66, 67])
                          >>>




Bytes and Strings
               • Bytes are not meant for text processing
               • In fact, if you try to use them for text, you will
                       run into weird problems
               • Python 3 strictly separates text (unicode) and
                       bytes everywhere
               • This is probably the biggest difference
                       between Python 2 and 3.

Mixing Bytes and Strings

            • Mixed operations fail miserably
                        >>> s = b"ACME 50 91.10"
                        >>> 'ACME' in s
                        Traceback (most recent call last):
                          File "<stdin>", line 1, in <module>
                        TypeError: Type str doesn't support the buffer API
                        >>>

             • Huh?!?? Buffer API?
             • We'll mention that later...
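The usual fix is to make both operands the same type, either by using a bytes literal or by decoding first. A sketch, assuming the data is ASCII:

```python
s = b"ACME 50 91.10"

found_bytes = b"ACME" in s                   # compare bytes with bytes
found_text = "ACME" in s.decode("ascii")     # or decode, then compare text

try:
    "ACME" in s                              # str in bytes
    mixed_ok = True
except TypeError:
    mixed_ok = False                         # the mixed test is rejected
```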

Printing Bytes
             • Printing and text-based I/O operations do not
                    work in a useful way with bytes
                         >>> s = b"ACME 50 91.10"
                         >>> print(s)
                         b'ACME 50 91.10'
                         >>>



                     Notice the leading b' and trailing
                          quote in the output.

             • There's no way to fix this. print() should only
                    be used for outputting text (unicode)
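If the bytes really do contain text, the workaround is to decode them before printing. A sketch assuming ASCII data; the StringIO object is used only to capture the output.

```python
import io

s = b"ACME 50 91.10"
out = io.StringIO()

print(s, file=out)                    # writes the repr, with b'...' around it
print(s.decode("ascii"), file=out)    # writes the text itself

lines = out.getvalue().splitlines()
```

For raw binary output, the underlying binary stream (for example `sys.stdout.buffer`) accepts bytes directly through its write() method.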

Formatting Bytes
             • Bytes do not support operations related to
                    formatted output (%, .format)
                       >>> s = b"%0.2f" % 3.14159
                       Traceback (most recent call last):
                         File "<stdin>", line 1, in <module>
                       TypeError: unsupported operand type(s) for %: 'bytes' and 'float'
                       >>>


             • So, just forget about using bytes for any kind of
                    useful text output, printing, etc.
             • No, seriously.
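A version-independent pattern is to format as text and encode at the very end. The header layout below is a made-up example, not a real protocol. Later releases (Python 3.5, via PEP 461) added % formatting for bytes, so the traceback above is specific to earlier versions.

```python
value = 3.14159

# Format as text, then encode once the text is complete
payload = "{:0.2f}".format(value).encode("ascii")

# Hypothetical header, used only to show mixing text formatting with binary data
header = "Content-Length: {}\r\n".format(len(payload)).encode("ascii")
message = header + payload
```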
Passing Bytes as Strings
             • Many library functions that work with "text"
                    do not accept byte objects at all
                      >>> time.strptime(b"2010-02-17","%Y-%m-%d")
                      Traceback (most recent call last):
                        File "<stdin>", line 1, in <module>
                        File "/Users/beazley/Software/lib/python3.1/_strptime.py", line 461, in _strptime_time
                          return _strptime(data_string, format)[0]
                        File "/Users/beazley/Software/lib/python3.1/_strptime.py", line 301, in _strptime
                          raise TypeError(msg.format(index, type(arg)))
                      TypeError: strptime() argument 0 must be str, not <class 'bytes'>
                      >>>



Commentary
             • Why am I focusing on this "bytes as text" issue?
             • If you are writing scripts that do simple ASCII
                    text processing, you might be inclined to use
                    bytes as a way to avoid the overhead of Unicode
             • You might think that bytes are exactly the same
                    as the familiar Python 2 string object
             • This is wrong. Bytes are not text. Using bytes as
                    text will lead to convoluted non-idiomatic code

How to Use Bytes
               • Bytes are better suited for low-level I/O
                       handling (message passing, distributed
                       computing, embedded systems, etc.)
               • I will show some examples that illustrate
               • A complaint: documentation (online and
                       books) is somewhat thin on explaining
                       practical uses of bytes and bytearray objects


Example : Reassembly
               • In Python 2, you may know that string
                       concatenation leads to bad performance
                                msg = b""
                                while True:
                                   chunk = s.recv(BUFSIZE)
                                   if not chunk:
                                       break
                                   msg += chunk

                 • Here's the common workaround (hacky)
                                chunks = []
                                while True:
                                   chunk = s.recv(BUFSIZE)
                                   if not chunk:
                                       break
                                   chunks.append(chunk)
                                msg = b"".join(chunks)

Example : Reassembly
               • Here's a new approach in Python 3
                               msg = bytearray()
                               while True:
                                  chunk = s.recv(BUFSIZE)
                                  if not chunk:
                                      break
                                  msg.extend(chunk)


               • You treat the bytearray as a list and just
                       append/extend new data at the end as you go
               • I like it. It's clean and intuitive.
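The same loop in a form that runs without a network connection. This is a sketch: an in-memory `io.BytesIO` stream stands in for the socket, with read() playing the role of recv(), and BUFSIZE is set small so several chunks are needed.

```python
import io

BUFSIZE = 4                                # tiny on purpose
stream = io.BytesIO(b"ACME 50 91.10")      # stand-in for a connected socket

msg = bytearray()
nchunks = 0
while True:
    chunk = stream.read(BUFSIZE)           # a socket would use s.recv(BUFSIZE)
    if not chunk:
        break
    msg.extend(chunk)                      # grows the bytearray in place
    nchunks += 1

frozen = bytes(msg)                        # immutable copy, if one is needed
```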

Example: Reassembly

                • The performance is good too
                • Concat 1024 32-byte chunks together (10000x)
                              Concatenation         : 18.49s
                              Joining               : 1.55s
                              Extending a bytearray : 1.78s




Example: Record Packing
                • Suppose you wanted to use the struct module
                       to incrementally pack a large binary message
                           objs = [ ... ]                  # List of tuples to pack
                           msg = bytearray()               # Empty message

                           # First pack the number of objects
                           msg.extend(struct.pack("<I",len(objs)))

                           # Incrementally pack each object
                           for x in objs:
                               msg.extend(struct.pack(fmt, *x))

                           # Do something with the message
                           f.write(msg)


                  • I like this as well.
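A self-contained sketch of the packing loop with an unpacking step to check the round trip. The record layout in `fmt` and the sample tuples are assumptions; the slide leaves both unspecified.

```python
import struct

# Hypothetical layout: 4-byte name, unsigned int shares, double price
fmt = "<4sId"
objs = [(b"ACME", 50, 91.1), (b"IBM ", 100, 123.45)]

msg = bytearray()
msg.extend(struct.pack("<I", len(objs)))       # record count first
for x in objs:
    msg.extend(struct.pack(fmt, *x))           # then each record

# Unpack directly from the bytearray
(count,) = struct.unpack_from("<I", msg, 0)
size = struct.calcsize(fmt)
decoded = [struct.unpack_from(fmt, msg, 4 + i * size) for i in range(count)]
```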
Example : Calculations
                • Run a byte array through an XOR-cipher
                             >>> s = b"Hello World"
                             >>> t = bytes(x^42 for x in s)
                             >>> t
                             b'bOFFE\n}EXFN'
                             >>> bytes(x^42 for x in t)
                             b'Hello World'
                             >>>

                 • Compute and append a LRC checksum to a msg
                           # Compute the checksum and append at the end
                           chk = 0
                           for n in msg:
                               chk ^= n
                           msg.append(chk)
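The checksum loop wrapped in functions, together with the matching check on the receiving side. Both function names are hypothetical. The check relies on the fact that XOR-ing every byte of an intact message, checksum included, gives zero.

```python
def append_lrc(payload):
    """Return payload plus a one-byte XOR checksum (hypothetical helper)."""
    msg = bytearray(payload)
    chk = 0
    for n in msg:
        chk ^= n
    msg.append(chk)
    return msg

def lrc_ok(msg):
    """Return True if the XOR of all bytes, checksum included, is zero."""
    chk = 0
    for n in msg:
        chk ^= n
    return chk == 0

framed = append_lrc(b"Hello World")
damaged = bytearray(framed)
damaged[0] ^= 0x01        # flip one bit to simulate corruption
```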



Commentary

                • I like the new bytearray object
                • Many potential uses in building low-level
                       infrastructure for networking, distributed
                       computing, messaging, embedded systems, etc.
                • May make much of that code cleaner, faster, and
                       more memory efficient



Related : Buffers
                • bytearray() is an example of a "buffer"
                • buffer : A contiguous region of memory (e.g.,
                        allocated like a C/C++ array)
                • There are many other examples:
                            a = array.array("i", [1,2,3,4,5])
                            b = numpy.array([1,2,3,4,5])
                            c = ctypes.ARRAY(ctypes.c_int,5)(1,2,3,4,5)


                • Under the covers, they're all similar and often
                        interchangeable with bytes (especially for I/O)
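A small sketch of what "interchangeable for I/O" means in practice, using only the standard library. The `io.BytesIO` object stands in for a file opened in binary mode.

```python
import array
import io

a = array.array("i", [1, 2, 3, 4, 5])

f = io.BytesIO()               # stand-in for open(..., "wb")
f.write(a)                     # write() accepts any buffer, not only bytes
raw = f.getvalue()

b = array.array("i", [0] * 5)  # preallocated destination
f.seek(0)
nread = f.readinto(b)          # fills the existing buffer in place
```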

Advanced : Memory Views
                • memoryview()
                          >>> a = bytearray(b'Hello World')
                          >>> b = memoryview(a)
                          >>> b
                          <memory at 0x1007014d0>
                          >>> b[-5:] = b'There'
                          >>> a
                          bytearray(b'Hello There')
                          >>>


                • It's essentially an overlay over a buffer
                • It's very low-level and its use seems tricky
                • I would probably avoid it
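For reference, a sketch of the one property that sets a memoryview apart: slicing it does not copy, so writes through the slice reach the original buffer. Slicing the bytearray itself produces an independent copy.

```python
data = bytearray(b"Hello World")
view = memoryview(data)

tail = view[6:]            # a view onto the last five bytes, no copy made
tail[:] = b"There"         # same-length assignment writes through to data

copied = data[6:]          # slicing the bytearray copies the bytes
copied[:] = b"XXXXX"       # so this change does not affect data
```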