
Python Pickle Module: Serialization Guide


The Python Pickle Module

Pickle in Python is a powerful module for serializing and deserializing Python object
structures, transforming them into a byte stream for storage or transmission. This process,
known as pickling, enables efficient data exchange and persistence, essential for applications
involving complex data manipulation.

Python Pickle - Python object serialization


In Python, everything is an object, and the pickle module lets us save the internal state of
these objects to a file or database for future use through a process called serialization.
Serialization is also known as pickling, marshalling, or flattening. The resulting byte
stream contains all the information required to reconstruct the object structure in another
Python script.

The reverse process, unpickling, is also known as deserialization or unmarshalling.


One of the most common use cases of pickling in the Data Science domain is saving the
internal state or weights of a trained model, so that it can be used later for making
predictions without having to train the model all over again.
To use the pickle module in a Python program, import it with the following statement:
import pickle
Python Pickle Example
In this example, we demonstrate how to use the Pickle module to serialize and deserialize a
Python dictionary, showcasing the simplicity and power of Pickle for object persistence.
Serialization (Pickling):
import pickle

# Python dictionary
data = {'id': 1, 'name': 'John Doe', 'age': 22, 'city': 'New York'}

# Serialize the dictionary ('data.pickle' is a placeholder file name)
with open('data.pickle', 'wb') as file:
    pickle.dump(data, file)
print("Data serialized successfully.")

Deserialization (Unpickling):
import pickle

# Deserialize the previously serialized dictionary
with open('data.pickle', 'rb') as file:
    loaded_data = pickle.load(file)
print("Data deserialized successfully.")
print(loaded_data)
Output:
Data serialized successfully.
Data deserialized successfully.
{'id': 1, 'name': 'John Doe', 'age': 22, 'city': 'New York'}
Pickle module in python
Constants
1. pickle.HIGHEST_PROTOCOL: This constant represents the highest protocol version
available. Using the highest protocol can improve efficiency in terms of serialization
speed and the size of the resulting serialized object. However, it may not be
compatible with older Python versions.
2. pickle.DEFAULT_PROTOCOL: This constant is set to the default protocol used by
Pickle if no protocol is specified. It strikes a balance between compatibility and
efficiency. As of Python 3.8, the default protocol is 4.
3. Protocol Versions (0 to 5):
   - 0: The original human-readable (ASCII) protocol, backward compatible with earlier
     versions of Python.
   - 1: An old binary format which is also compatible with earlier versions of Python.
   - 2: Introduced in Python 2.3, provides more efficient pickling of new-style classes.
   - 3: Introduced in Python 3.0, designed for Python 3.x, making it incompatible with
     Python 2.x.
   - 4: Introduced in Python 3.4, adds support for very large objects, pickling more
     kinds of objects, and some data format optimizations.
   - 5: Introduced in Python 3.8, adds support for out-of-band data and a speedup for
     in-band data, which mainly benefits large buffers such as numpy arrays.
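As a small illustration of these constants, the sketch below prints the protocol numbers available in the running interpreter and pickles the same object with two explicit protocols. The sample value and variable names are made up for the example, and the printed numbers depend on the Python version in use.

```python
import pickle

# Both constants are plain integers; their values depend on the Python version.
print("Highest protocol:", pickle.HIGHEST_PROTOCOL)
print("Default protocol:", pickle.DEFAULT_PROTOCOL)

sample = {'id': 1, 'tags': ['a', 'b']}  # illustrative data

# Protocol 0 is text-based; for this sample the output contains only ASCII bytes.
as_text = pickle.dumps(sample, protocol=0)

# Protocol 2 and later begin with the PROTO opcode (0x80) followed by the version.
as_binary = pickle.dumps(sample, protocol=pickle.HIGHEST_PROTOCOL)

print(as_text)
print(as_binary[:2])
```

Whatever protocol was used for writing, pickle.loads() detects it automatically, so no protocol argument is needed when reading.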
Functions
1. pickle.dump(obj, file, protocol=None, *, fix_imports=True,
buffer_callback=None)
This function writes the pickled representation of the object obj to the open file
object file.
protocol is an optional integer argument that tells the pickler which protocol to use.
If it is omitted, DEFAULT_PROTOCOL is used.
If the fix_imports argument is true and the protocol is less than 3, pickle will try to map
the new Python 3 names to the old module names used in Python 2, so that the pickle data
stream is readable in Python 2.
import pickle

dic = {
    'name': 'Steve',
    'age': 21,
    'course': 'DSA'
}

# pickling the created dictionary ('student.pickle' is a placeholder file name)
with open('student.pickle', 'wb') as f:
    pickle.dump(dic, f)
Output (the bytes written to the file, shown here for protocol 3)
b'\x80\x03}q\x00(X\x04\x00\x00\x00nameq\x01X\x05\x00\x00\x00Steveq\x02X\x03\x00\x00\x00ageq\x03K\x15X\x06\x00\x00\x00courseq\x04X\x03\x00\x00\x00DSAq\x05u.'
In the above example, we created a dictionary dic and used pickle.dump() to serialize the
dictionary and store it in a file for later use.
2. pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)
This function returns the pickled representation of the object obj as a bytes object
instead of writing it to a file.
import pickle

l = [1, 'Scaler', True, 'Academy']

# pickling the created list
data = pickle.dumps(l)

# printing the pickled byte stream
print(data)
Output (shown here for protocol 3)
b'\x80\x03]q\x00(K\x01X\x06\x00\x00\x00Scalerq\x01\x88X\x07\x00\x00\x00Academyq\x02e.'
Pickled data written to disk is conventionally given a .pickle (or .pkl) extension,
although pickle itself does not require any particular file name.
3. pickle.load(file, *, fix_imports=True, encoding='ASCII', errors='strict', buffers=None)
This function reads the pickled object representation from the open file object file and
returns the reconstituted object.
The encoding and errors arguments tell pickle how to decode 8-bit string instances pickled
by Python 2; they default to 'ASCII' and 'strict', respectively.
import pickle

dic = {
    'name': 'Steve',
    'age': 21,
    'course': 'DSA'
}

# pickling the created dictionary ('student.pickle' is a placeholder file name)
with open('student.pickle', 'wb') as f:
    pickle.dump(dic, f)

# reading the pickled object
with open('student.pickle', 'rb') as f:
    unpickled_data = pickle.load(f)

print(unpickled_data)
Output
{'name': 'Steve', 'age': 21, 'course': 'DSA'}
4. pickle.loads(data, /, *, fix_imports=True, encoding='ASCII', errors='strict', buffers=None)
This function reads the pickled object representation from the bytes object data and
returns the reconstituted object.
import pickle

l = [1, 'Scaler', True, 'Academy']

# pickling the created list
data = pickle.dumps(l)

# printing the unpickled data
unpickled_data = pickle.loads(data)
print(unpickled_data)
Output
[1, 'Scaler', True, 'Academy']
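The main difference between dump() and dumps() is that dump() writes the pickled data to a
file object, whereas dumps() returns it as a bytes object. The s at the end of the function
name stands for string. The same distinction applies to load() and loads().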
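To make the file-versus-bytes distinction concrete, here is a minimal sketch that runs both pairs of functions on the same list. It assumes an in-memory io.BytesIO buffer as a stand-in for a real file opened in binary mode; the variable names are illustrative.

```python
import io
import pickle

record = [1, 'Scaler', True, 'Academy']

# dumps() hands back the pickled bytes directly.
as_bytes = pickle.dumps(record)

# dump() writes the pickled bytes to a binary file-like object instead.
# io.BytesIO stands in for a real file opened with mode 'wb'.
buffer = io.BytesIO()
pickle.dump(record, buffer)

# loads() reads from bytes, load() reads from a file-like object.
from_bytes = pickle.loads(as_bytes)
buffer.seek(0)  # rewind before reading what was just written
from_file = pickle.load(buffer)

print(from_bytes == from_file == record)
```

The bytes form is handy when the data goes somewhere other than a file, for example into a database column or a network message.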
Exceptions
1. exception pickle.PickleError This exception is the base class for all other exceptions
raised by the pickle module.
2. exception pickle.PicklingError This exception is raised when the pickler encounters an
object that cannot be pickled.
3. exception pickle.UnpicklingError This exception is raised when there is a problem
unpickling an object, such as data corruption or a security violation.
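The following sketch shows one way these exceptions might be handled. The helper names try_pickle and try_unpickle are hypothetical, not part of the pickle module. Note that unpickling can also raise other exceptions (for example AttributeError, EOFError, or ImportError), which this sketch does not catch.

```python
import pickle

def try_pickle(obj):
    """Return the pickled bytes, or None if the object cannot be pickled."""
    try:
        return pickle.dumps(obj)
    except pickle.PicklingError as exc:
        print("Pickling failed:", exc)
        return None

def try_unpickle(data):
    """Return the reconstructed object, or None if the data is not a valid pickle."""
    try:
        return pickle.loads(data)
    except pickle.UnpicklingError as exc:
        print("Unpickling failed:", exc)
        return None

# A lambda cannot be pickled, so this goes through the PicklingError branch.
lambda_result = try_pickle(lambda x: x)

# Bytes that do not start with a valid pickle opcode trigger UnpicklingError.
corrupt_result = try_unpickle(b'\x00 not a pickle')

# Well-formed data round-trips normally.
good_result = try_unpickle(pickle.dumps([1, 2, 3]))

print(lambda_result, corrupt_result, good_result)
```

Since both exceptions inherit from pickle.PickleError, a single except pickle.PickleError clause would cover them when the distinction does not matter.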
Classes exported by the pickle module
Class instances can be pickled and unpickled without any additional code. By default,
pickle retrieves the class and attributes of an instance via introspection.
This default behavior can be altered by defining one or more of the special methods
explained below:
1. object.__getnewargs_ex__()
This method dictates the values passed to the __new__() method upon unpickling. It must
return a pair (args, kwargs), where args is a tuple of positional arguments and kwargs is a
dictionary of named arguments for constructing the object.
2. object.__getnewargs__()
This method is similar to object.__getnewargs_ex__() but supports only positional
arguments. It must return a tuple of arguments args, which is then passed to the
__new__() method upon unpickling.
3. object.__getstate__()
If a class defines this method, it is called and the returned object is pickled as the
contents of the instance, instead of the contents of the instance's dictionary.
4. object.__setstate__(state)
Upon unpickling, if the class defines __setstate__(), it is called with the unpickled state,
and the state object does not need to be a dictionary. Otherwise, the pickled state must be
a dictionary, and its items are assigned to the new instance's dictionary.
5. object.__reduce__()
This method takes no arguments and returns either a string or, preferably, a tuple that
tells pickle how to reconstruct the object.
6. object.__reduce_ex__(protocol)
This method is similar to __reduce__() but takes a single integer argument, the protocol
version. When it is defined, pickle prefers it over __reduce__(). Its main use is to provide
backward-compatible reduce values for older Python releases.
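A short sketch of how some of these hooks might be used is given below. The classes Counter and Celsius are hypothetical and exist only for this illustration. Counter uses __getstate__() and __setstate__() to leave out an attribute that cannot be pickled (a thread lock), and Celsius uses __reduce__() to say how it should be rebuilt.

```python
import pickle
import threading

class Counter:
    """Hypothetical class holding a lock, which pickle cannot serialize."""

    def __init__(self, start=0):
        self.value = start
        self.lock = threading.Lock()

    def __getstate__(self):
        # Copy the instance dictionary and drop the unpicklable lock.
        state = self.__dict__.copy()
        del state['lock']
        return state

    def __setstate__(self, state):
        # Restore the pickled attributes, then create a fresh lock.
        self.__dict__.update(state)
        self.lock = threading.Lock()

class Celsius:
    """Hypothetical value class rebuilt by calling Celsius(degrees)."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __reduce__(self):
        # (callable, args): unpickling calls Celsius(self.degrees).
        return (Celsius, (self.degrees,))

restored_counter = pickle.loads(pickle.dumps(Counter(5)))
restored_temp = pickle.loads(pickle.dumps(Celsius(21.5)))

print(restored_counter.value)
print(restored_temp.degrees)
```

Without the two state methods, pickling a Counter would fail with a TypeError because of the lock attribute.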
Protocol Formats of the Python pickle Module
There are six different protocols that the Python Pickle module uses.
1. Protocol version 0 - The original human-readable protocol, backward compatible with
earlier versions of Python.
2. Protocol version 1 - An old binary format that is also compatible with earlier versions
of Python.
3. Protocol version 2 - It was introduced in Python 2.3 and provides much more efficient
pickling of new-style classes.
4. Protocol version 3 - It was introduced in Python 3.0. It cannot be unpickled by the
Python 2.x version. This was the default protocol in Python 3.0–3.7.
5. Protocol version 4 - It was introduced in Python 3.4. This version includes support for
a wide range of object sizes and types and is the default protocol starting with
Python 3.8.
6. Protocol version 5 - It was introduced in Python 3.8. It adds support for out-of-band
data and improves speed for in-band data.
If we pickle a pandas data frame using different protocol versions, we can see the
difference in pickled file size.
import pickle
import pandas as pd
import os

df = pd.DataFrame({
    'name': ['Ash', 'Gary', 'Tracy'],
    'age': [21, 25, 30]
})

# printing the dataframe
print(df)

# pickling the dataframe with different protocol versions
# (the file names are placeholders)
with open('df_protocol4.pickle', 'wb') as f:
    pickle.dump(df, f, protocol=4)
with open('df_protocol3.pickle', 'wb') as f:
    pickle.dump(df, f, protocol=3)
with open('df_protocol2.pickle', 'wb') as f:
    pickle.dump(df, f, protocol=2)
with open('df_protocol1.pickle', 'wb') as f:
    pickle.dump(df, f, protocol=1)

# comparing the sizes of the pickled files
print('Protocol4:', os.path.getsize('df_protocol4.pickle'))
print('Protocol3:', os.path.getsize('df_protocol3.pickle'))
print('Protocol2:', os.path.getsize('df_protocol2.pickle'))
print('Protocol1:', os.path.getsize('df_protocol1.pickle'))
Output
    name  age
0    Ash   21
1   Gary   25
2  Tracy   30

Protocol4: 844
Protocol3: 998
Protocol2: 1054
Protocol1: 1153
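In this run, each higher protocol version produced a smaller file; the exact sizes depend on
the pandas and Python versions. Higher versions are generally better than lower ones in
terms of
• The size of the pickled objects
• The performance of unpickling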
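The same comparison can be made without pandas and without writing files, by measuring the length of the bytes returned by pickle.dumps() for every protocol the interpreter supports. This is only a sketch with made-up sample data; the numbers it prints vary with the data and the Python version, so none are shown here.

```python
import pickle

# Illustrative payload: a list of small dictionaries.
rows = [{'name': f'user{i}', 'age': 20 + i % 30} for i in range(100)]

# Map each supported protocol number to the size of its pickled output.
sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    sizes[protocol] = len(pickle.dumps(rows, protocol=protocol))

for protocol, size in sizes.items():
    print(f'Protocol {protocol}: {size} bytes')
```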
What Can Be Pickled And Unpickled?
The following Python object types can be pickled:
• None, Boolean values (True, False)
• int, float, and complex numbers
• strings, bytes, and byte arrays
• lists, sets, tuples, and dictionaries containing only picklable objects
In addition, functions (both user-defined and built-in) and classes can be pickled, but
only if they are defined at the top level of a module.
Some objects cannot be pickled, including generators, module objects (such as the datetime
module itself), and lambda functions. Pickling a lambda function requires an additional
package such as dill. A defaultdict whose default factory is a lambda cannot be pickled
either, but it can be if the factory is a module-level function. Live connection objects,
such as a database or network connection, cannot be pickled because the underlying
connection cannot be restored from a byte stream.
Let us see an example showing that the datetime module object itself cannot be pickled.
import datetime
import pickle

with open('module.pickle', 'wb') as f:  # placeholder file name
    pickle.dump(datetime, f)
Output (the exact wording of the message varies between Python versions)
Traceback (most recent call last):
  File "[Link]", line 5, in <module>
    pickle.dump(datetime, f)
TypeError: can't pickle module objects
Another example, this time attempting to pickle a lambda function, is shown below:
import pickle

multiply = lambda a, b: a * b
print(multiply(3, 4))

# attempting to pickle the lambda ('lambda.pickle' is a placeholder file name)
with open('lambda.pickle', 'wb') as f:
    pickle.dump(multiply, f)
Output
12
Traceback (most recent call last):
  File "C:\Users\NAMANJEET SINGH\Documents\[Link]", line 8, in <module>
    pickle.dump(multiply, f)
_pickle.PicklingError: Can't pickle <function <lambda> at 0x03C348E8>: attribute lookup <lambda> on __main__ failed
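The defaultdict and generator cases mentioned earlier can be sketched in the same way. The example below assumes a built-in (int) as the picklable default factory; the variable names are illustrative, and the exception types are recorded rather than printed in full because the messages differ between Python versions.

```python
import pickle
from collections import defaultdict

# A defaultdict whose factory is a built-in (or any module-level callable)
# pickles without trouble, and the factory survives the round trip.
counts = defaultdict(int)
counts['apples'] += 3
restored = pickle.loads(pickle.dumps(counts))
print(restored['apples'], restored['pears'])

# With a lambda as the factory, pickling fails because the lambda
# cannot be looked up by name when the data is loaded again.
lambda_based = defaultdict(lambda: 0)
try:
    pickle.dumps(lambda_based)
    lambda_error = None
except (pickle.PicklingError, AttributeError) as exc:
    lambda_error = type(exc).__name__
print(lambda_error)

# Generator objects carry execution state that pickle cannot capture.
squares = (n * n for n in range(3))
try:
    pickle.dumps(squares)
    generator_error = None
except TypeError as exc:
    generator_error = type(exc).__name__
print(generator_error)
```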
Compression of Pickled Objects
Compressing pickled objects can significantly reduce storage space and improve the
efficiency of data transmission. This process involves serializing the Python object with Pickle
and then compressing the serialized byte stream using a compression library such
as gzip or bz2. Compression is particularly useful when dealing with large data structures or
when serialized objects need to be stored or transmitted over a network.
Example of Compressing a Pickled Object with gzip:
import pickle
import gzip

# Python object to be serialized and compressed
data = {'id': 100, 'name': 'Alice', 'features': [0.25, 0.75, 0.5]}

# Serialize and compress the object ('data.pickle.gz' is a placeholder file name)
with gzip.open('data.pickle.gz', 'wb') as file:
    pickle.dump(data, file)

print("Object serialized and compressed successfully.")

Decompression and Deserialization:
import pickle
import gzip

# Decompress and deserialize the previously stored object
with gzip.open('data.pickle.gz', 'rb') as file:
    loaded_data = pickle.load(file)

print("Object decompressed and deserialized successfully.")
print(loaded_data)
Benefits:
• Compression can substantially reduce the size of serialized files, especially for large or
repetitive data, making storage and data transfer more efficient.
• Compressed formats such as gzip store a checksum, so decompression also acts as an
additional data integrity check: corrupted files are usually detected when they are read.
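To see the size reduction directly, the sketch below compresses pickled bytes in memory with gzip.compress() and compares the lengths. The payload is deliberately repetitive so that it compresses well; very small objects may not shrink at all because of the fixed gzip header overhead.

```python
import gzip
import pickle

# Illustrative payload with a lot of repetition, which compresses well.
data = {'id': 100, 'name': 'Alice', 'features': [0.25, 0.75, 0.5] * 2000}

raw = pickle.dumps(data)
compressed = gzip.compress(raw)

print('Pickled size:   ', len(raw))
print('Compressed size:', len(compressed))

# Decompress first, then unpickle, to get the original object back.
restored = pickle.loads(gzip.decompress(compressed))
print(restored == data)
```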
Security Concerns
Apart from the pros, a developer must be aware of some drawbacks of the pickle module. The
major drawback is that it is possible to craft malicious pickle data that executes arbitrary
code during unpickling. Therefore, we should never unpickle data that comes from an
untrusted source or that was transmitted over an insecure network. The risk can be reduced
by signing the data with a library like hmac, so that tampering is detected before the data
is unpickled.
Apart from security concerns, other drawbacks of pickle are that pickle files are not
human-readable and that the format is specific to Python, so other languages may have
trouble reading pickled files.
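The signing idea can be sketched as follows, assuming a secret key shared between the writer and the reader. The helper names sign_and_pickle and verify_and_unpickle and the key value are hypothetical. Signing only proves that the data was produced by someone who holds the key; it does not make data from an untrusted producer safe to unpickle.

```python
import hashlib
import hmac
import pickle

# Hypothetical shared secret; in practice it must be kept private and not hard-coded.
SECRET_KEY = b'replace-with-a-real-secret'

def sign_and_pickle(obj):
    """Pickle obj and prepend an HMAC-SHA256 digest of the pickled bytes."""
    payload = pickle.dumps(obj)
    digest = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return digest + payload

def verify_and_unpickle(blob):
    """Unpickle only if the digest matches; raise ValueError otherwise."""
    digest_size = hashlib.sha256().digest_size
    digest, payload = blob[:digest_size], blob[digest_size:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking timing information during the comparison.
    if not hmac.compare_digest(digest, expected):
        raise ValueError('signature mismatch: refusing to unpickle')
    return pickle.loads(payload)

blob = sign_and_pickle({'user': 'alice', 'role': 'admin'})
print(verify_and_unpickle(blob))

# Flipping a single bit in the payload makes verification fail.
tampered = blob[:-1] + bytes([blob[-1] ^ 0x01])
try:
    verify_and_unpickle(tampered)
except ValueError as exc:
    print(exc)
```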
Advantages and disadvantages of using pickle in python
Advantages of Using Pickle in Python
1. Pickle's API is straightforward, making it simple to serialize and deserialize Python
objects with minimal code.
2. It can serialize a wide range of Python objects, including complex data structures like
lists, dictionaries, custom classes, and more.
3. Pickle is a part of Python's standard library, ensuring good integration with Python's
ecosystem and no additional installation requirements.
4. Pickle maintains the object's state and all its data attributes, allowing for an exact
recreation of the original object upon deserialization.
Disadvantages of Using Pickle in Python
1. Deserializing data from an untrusted source can execute arbitrary code, leading to
significant security vulnerabilities.
2. Pickle is tightly coupled with Python, making the serialized data not easily readable
or usable from other programming languages.
3. Pickle files may not be compatible across different Python versions, leading to
potential issues when unpickling data with a different Python version than the one
used for pickling.
4. For large data sets, Pickle may be less efficient than the specialized storage formats
offered by libraries such as numpy (for arrays) or pandas (for data frames).
Conclusion
1. Serialization (pickling) converts Python objects into a byte stream for storage or
network transmission, while deserialization (unpickling) reverses this process.
2. Despite its utility, the pickle module carries security risks and produces files that are
not human-readable, which limits its use to trusted, Python-only environments.
3. Pickling is often favoured over CSV for data frames because it is faster and preserves
data types, although CSV files are human-readable.
4. The pandas library's built-in methods for pickling and unpickling (DataFrame.to_pickle()
and pandas.read_pickle()) make it convenient to persist data frames.
