LiteRT supports converting weights to 16-bit floating point values during model conversion from TensorFlow to LiteRT's flat buffer format. This results in a 2x reduction in model size. Some hardware, like GPUs, can compute natively in this reduced-precision arithmetic, realizing a speedup over traditional floating point execution. The LiteRT GPU delegate can be configured to run in this way. However, a model converted to float16 weights can still run on the CPU without additional modification: the float16 weights are upsampled to float32 prior to the first inference. This permits a significant reduction in model size in exchange for a minimal impact on latency and accuracy.
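To make the size and precision trade-off concrete, here is a minimal sketch that uses only Python's standard `struct` module, not LiteRT itself. The weight values are made up for illustration. It stores the same values as float32 and as IEEE 754 half precision, then reads the half-precision bytes back as full-precision floats, which roughly mirrors the upcast that happens on the CPU path.

```python
import struct

# Hypothetical weight values, chosen only for illustration.
weights = [0.1234567, -1.5, 0.00048828125, 3.14159]

# Pack as 32-bit floats ("f") and as IEEE 754 half-precision floats ("e").
as_float32 = struct.pack(f"<{len(weights)}f", *weights)
as_float16 = struct.pack(f"<{len(weights)}e", *weights)

# Half precision needs half as many bytes per weight.
print(len(as_float32), len(as_float16))  # prints: 16 8

# Read the float16 bytes back as Python floats and measure the rounding error
# introduced by storing the weights in reduced precision.
restored = struct.unpack(f"<{len(weights)}e", as_float16)
max_rel_error = max(abs(r - w) / abs(w) for r, w in zip(restored, weights))
print(max_rel_error)
```

The relative rounding error stays below one part in a thousand for values in float16's normal range, which is why the accuracy impact is usually small.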
In this tutorial, you train an MNIST model from scratch, check its accuracy in TensorFlow, and then convert the model into a LiteRT flatbuffer with float16 quantization. Finally, you check the accuracy of the converted model and compare it to the original float32 model.
```python
import tensorflow as tf
from tensorflow import keras

# Load MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Normalize the input image so that each pixel value is between 0 and 1.
train_images = train_images / 255.0
test_images = test_images / 255.0

# Define the model architecture
model = keras.Sequential([
    keras.layers.InputLayer(input_shape=(28, 28)),
    keras.layers.Reshape(target_shape=(28, 28, 1)),
    keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation=tf.nn.relu),
    keras.layers.MaxPooling2D(pool_size=(2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(10)
])

# Train the digit classification model
model.compile(optimizer='adam',
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(
    train_images,
    train_labels,
    epochs=1,
    validation_data=(test_images, test_labels)
)
```
This example trains the model for just a single epoch, so it reaches only about 96% accuracy.
Convert to a LiteRT model
Using the LiteRT Converter, you can now convert the trained model into a LiteRT model.
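A minimal sketch of this baseline conversion, assuming the `tf.lite.TFLiteConverter` API that ships with TensorFlow. The helper name `convert_to_litert` is hypothetical and is not part of the library.

```python
def convert_to_litert(model):
    """Convert a trained Keras model to a LiteRT flatbuffer with float32 weights."""
    # Imported inside the function so the helper can be defined even when
    # TensorFlow is not installed in the current environment.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    # convert() returns the serialized flatbuffer as a bytes object.
    return converter.convert()
```

The returned bytes can be written to disk, for example with `pathlib.Path("mnist_model.tflite").write_bytes(convert_to_litert(model))`, where the file name is only an illustration.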
To instead quantize the model to float16 on export, first set the optimizations flag to use default optimizations. Then specify that float16 is the supported type on the target platform:
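The two settings described here can be sketched as follows, again assuming the `tf.lite.TFLiteConverter` API. The helper name `convert_to_litert_float16` is hypothetical.

```python
def convert_to_litert_float16(model):
    """Convert a trained Keras model to a LiteRT flatbuffer with float16 weights."""
    # Imported inside the function so the helper can be defined even when
    # TensorFlow is not installed in the current environment.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    # Step 1: enable the default optimizations.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Step 2: declare float16 as the supported type on the target platform.
    converter.target_spec.supported_types = [tf.float16]
    # convert() returns the serialized flatbuffer as a bytes object.
    return converter.convert()
```

Comparing `len(convert_to_litert_float16(model))` with the size of the unquantized flatbuffer is a quick way to check the size reduction on your own model.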
In this example, you have quantized a model to float16 with no difference in accuracy.
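One way to run such an accuracy comparison is sketched below. The helper name `evaluate_accuracy` is hypothetical. It assumes an interpreter object that exposes the standard interpreter methods (`allocate_tensors`, `get_input_details`, `get_output_details`, `set_tensor`, `invoke`, `get_tensor`), and it assumes `images` and `labels` are NumPy arrays such as `test_images` and `test_labels`.

```python
def evaluate_accuracy(interpreter, images, labels):
    """Return the fraction of images whose predicted digit matches its label."""
    interpreter.allocate_tensors()
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]

    correct = 0
    for image, label in zip(images, labels):
        # Add a batch dimension and cast to float32 to match the model input.
        batch = image[None, ...].astype("float32")
        interpreter.set_tensor(input_index, batch)
        interpreter.invoke()

        # The model outputs one logit per digit; the largest logit wins.
        logits = interpreter.get_tensor(output_index)[0]
        if int(logits.argmax()) == int(label):
            correct += 1

    return correct / len(labels)
```

Calling it once with an interpreter built from the float32 flatbuffer and once with one built from the float16 flatbuffer, for instance via `tf.lite.Interpreter(model_content=...)`, gives two numbers that can be compared directly.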
It's also possible to evaluate the fp16 quantized model on the GPU. To perform all arithmetic with the reduced precision values, be sure to create the TfLiteGPUDelegateOptions struct in your app and set precision_loss_allowed to 1.
Last updated 2026-05-28 UTC.