
Fixed vs Floating Point Representation

Digital signal processing uses either fixed-point or floating-point number representations. Fixed-point represents numbers with a fixed number of bits before and after the decimal place, limiting the range of representable values. Floating-point represents numbers in a format similar to scientific notation, using a mantissa and exponent to support a much wider range. The key difference is that fixed-point has uniform spacing between values while floating-point spacing varies, with smaller gaps between smaller numbers and larger gaps between larger numbers.

Uploaded by

Anil Agarwal


Fixed versus Floating Point

Digital Signal Processing can be divided into two categories, fixed point and floating point. These refer to the
format used to store and manipulate numbers within the devices. Fixed point DSPs usually represent each
number with a minimum of 16 bits, although a different length can be used. For instance, Motorola
manufactures a family of fixed point DSPs that use 24 bits. There are four common ways that the 2^16 = 65,536
possible bit patterns of a 16-bit word can represent a number. In unsigned integer, the stored number can take
on any integer value from 0 to 65,535. Similarly, signed integer uses two's complement to make the range
include negative numbers, from -32,768 to 32,767. With unsigned fraction notation, the 65,536 levels are spread
uniformly between 0 and 1. Lastly, the signed fraction format allows negative numbers, equally spaced between
-1 and 1.
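
The four interpretations of a 16-bit pattern described above can be sketched in Python. This is only an illustration: the function name interpret_16bit is hypothetical, and the scale factors for the two fraction formats (2^16 and 2^15) are an assumption, since the text only says the levels are spread uniformly.

```python
def interpret_16bit(pattern: int) -> dict:
    """Read one 16-bit pattern under the four conventions described above."""
    if not 0 <= pattern <= 0xFFFF:
        raise ValueError("pattern must fit in 16 bits")
    # Two's complement: patterns with the top bit set stand for negatives.
    signed = pattern - 0x10000 if pattern & 0x8000 else pattern
    return {
        "unsigned_integer": pattern,           # 0 .. 65,535
        "signed_integer": signed,              # -32,768 .. 32,767
        "unsigned_fraction": pattern / 65536,  # assumed scale: steps of 2^-16
        "signed_fraction": signed / 32768,     # assumed scale: steps of 2^-15
    }


views = interpret_16bit(0xC000)
for name, value in views.items():
    print(f"{name:18s} {value}")
```

The same stored bits give four different values, which is why a fixed-point program has to keep track of the convention in use.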

Fixed versus Floating Point

Digital signal processing can be separated into two categories: fixed point and floating point. These
designations refer to the format used to store and manipulate numeric representations of data. Fixed-point DSPs
are designed to represent and manipulate integers (positive and negative whole numbers) via a minimum of
16 bits, yielding up to 65,536 possible bit patterns (2^16). Floating-point DSPs represent and manipulate rational
numbers via a minimum of 32 bits in a manner similar to scientific notation, where a number is represented with
a mantissa and an exponent (e.g., A × 2^B, where 'A' is the mantissa and 'B' is the exponent), yielding up to
4,294,967,296 possible bit patterns (2^32).

The term ‘fixed point’ refers to the corresponding manner in which numbers are represented, with a fixed
number of digits after, and sometimes before, the decimal point. With floating-point representation, the
placement of the decimal point can ‘float’ relative to the significant digits of the number. For example, a fixed-
point representation with a uniform decimal point placement convention can represent the numbers 123.45,
1234.56, 12345.67, etc, whereas a floating-point representation could in addition represent 1.234567, 123456.7,
0.00001234567, 1234567000000000, etc. As such, floating point can support a much wider range of values than
fixed point, with the ability to represent very small numbers and very large numbers.

With fixed-point notation, the gaps between adjacent numbers are always the same size (equal to one when the
values are integers), whereas in floating-point notation the gaps between adjacent numbers are not uniform. In
the 32-bit ANSI/IEEE Std. 754 format, the gap between any two adjacent numbers is approximately ten million
times smaller than the value of the numbers, so there are large gaps between large numbers and small gaps
between small numbers.
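
The non-uniform spacing can be observed by stepping a 32-bit pattern to its neighbour. The sketch below assumes IEEE 754 single precision as packed by Python's struct module and only handles positive finite inputs; float32_gap is a hypothetical helper name.

```python
import struct


def float32_gap(x: float) -> float:
    """Distance from x (rounded to single precision) to the next larger float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    here = struct.unpack("<f", struct.pack("<I", bits))[0]
    # For positive finite values, adding 1 to the bit pattern gives the next float.
    above = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return above - here


for value in (1.0, 1000.0, 1.0e6):
    gap = float32_gap(value)
    print(f"value={value:<10} gap={gap:<12} value/gap={value / gap:,.0f}")
```

The ratio of value to gap stays between 2^23 and 2^24 (roughly 8 to 17 million), which matches the "ten million times smaller" rule of thumb.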

Fixed Point and Floating Point Number Representations

Digital computers use the binary number system to represent all types of information. Numbers and
alphanumeric characters alike are stored as binary digits (bits, i.e., 0 and 1). Digital representations are easier
to design and to store, and they offer greater accuracy and precision.

There are various number systems that can be used for digital number representation, for example the binary,
octal, decimal, and hexadecimal number systems. The binary number system is the most relevant and popular
for representing numbers in a digital computer system.

Storing Real Number:


There are two major approaches to storing real numbers (i.e., numbers with a fractional component) in modern
computing. These are (i) Fixed Point Notation and (ii) Floating Point Notation. In fixed point notation, there is
a fixed number of digits after the decimal point, whereas floating point notation allows for a varying number of
digits after the decimal point.

Fixed-Point Representation:

This representation has a fixed number of bits for the integer part and for the fractional part. For example, if the
fixed-point format is IIII.FFFF (four digits for the integer part and four for the fractional part), then the smallest
nonzero value you can store is 0000.0001 and the largest is 9999.9999. There are three parts of a fixed-point
number representation: the sign field, the integer field, and the fractional field.

We can represent these numbers using:

• Sign-magnitude representation: range from -(2^(k-1) - 1) to (2^(k-1) - 1), for k bits.
• 1's complement representation: range from -(2^(k-1) - 1) to (2^(k-1) - 1), for k bits.
• 2's complement representation: range from -2^(k-1) to (2^(k-1) - 1), for k bits.

2's complement representation is preferred in computer systems because it has a single, unambiguous
representation of zero and makes arithmetic operations easier.
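
The three ranges can be tabulated for any word length k. The following is a small sketch with hypothetical helper names (signed_ranges, twos_complement_bits), not part of the original notes.

```python
def signed_ranges(k: int) -> dict:
    """Smallest and largest value of each signed representation for k bits."""
    half = 2 ** (k - 1)
    return {
        "sign_magnitude": (-(half - 1), half - 1),
        "ones_complement": (-(half - 1), half - 1),
        "twos_complement": (-half, half - 1),
    }


def twos_complement_bits(value: int, k: int) -> str:
    """Bit string of value in k-bit two's complement."""
    low, high = signed_ranges(k)["twos_complement"]
    if not low <= value <= high:
        raise ValueError(f"{value} does not fit in {k} bits")
    # Masking with 2^k - 1 maps negative values onto their two's complement pattern.
    return format(value & (2 ** k - 1), f"0{k}b")


print(signed_ranges(8))
print(twos_complement_bits(-3, 8))
```

Two's complement gains one extra negative value because it does not spend a bit pattern on a second zero.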

Example: Assume a number uses a 32-bit format that reserves 1 bit for the sign, 15 bits for the integer part
and 16 bits for the fractional part.

Then, -43.625 is represented as follows:

1 | 000000000101011 | 1010000000000000

Here 0 is used to represent + and 1 is used to represent -, 000000000101011 is the 15-bit binary value for
decimal 43, and 1010000000000000 is the 16-bit binary value for the fraction 0.625.
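
The three fields of this example can be reproduced with a short script. It is a minimal sketch that follows the sign / integer / fraction layout of the example; encode_fixed is a hypothetical name, and rejecting values that need more than 16 fractional bits (instead of rounding them) is a choice made here for simplicity.

```python
from fractions import Fraction


def encode_fixed(value, int_bits: int = 15, frac_bits: int = 16):
    """Return (sign, integer field, fraction field) as bit strings."""
    exact = Fraction(value)
    sign = "1" if exact < 0 else "0"
    # Scaling by 2^frac_bits turns the magnitude into a plain integer.
    scaled = abs(exact) * 2 ** frac_bits
    if scaled.denominator != 1:
        raise ValueError("value needs more fractional bits than available")
    magnitude = scaled.numerator
    if magnitude >= 2 ** (int_bits + frac_bits):
        raise ValueError("value is out of range for this format")
    bits = format(magnitude, f"0{int_bits + frac_bits}b")
    return sign, bits[:int_bits], bits[int_bits:]


print(" | ".join(encode_fixed(-43.625)))
```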

The advantage of using a fixed-point representation is performance; the disadvantage is the relatively limited
range of values that it can represent. It is therefore usually inadequate for numerical analysis, as it does not
provide enough range and accuracy. A number whose representation exceeds 32 bits would have to be stored
inexactly.

For the 32-bit format given above, the smallest positive number that can be stored is 2^-16 ≈ 0.000015, the
largest positive number is (2^15 - 1) + (1 - 2^-16) = 2^15 - 2^-16 ≈ 32768, and the gap between adjacent
numbers is 2^-16.
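
The limits of the 1 + 15 + 16 bit format can be checked with exact rational arithmetic. This is a sketch; the variable names are illustrative only.

```python
from fractions import Fraction

INT_BITS, FRAC_BITS = 15, 16

step = Fraction(1, 2 ** FRAC_BITS)                 # gap between adjacent values
smallest_positive = step                           # only the last fraction bit set
largest_positive = (2 ** INT_BITS - 1) + (1 - step)  # all 31 magnitude bits set

print(float(smallest_positive))
print(float(largest_positive))
```

The largest value equals (2^31 - 1) steps of 2^-16, which is just below 32768.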

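The position of the radix point is only a convention: moving it left or right trades integer range for fractional
precision, while the stored bits are still processed as an ordinary integer.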

Floating-Point Representation:

This representation does not reserve a specific number of bits for the integer part or the fractional part. Instead it
reserves a certain number of bits for the number (called the mantissa or significand) and a certain number of bits
to say where within that number the decimal place sits (called the exponent).

The floating-point representation of a number has two parts. The first part represents a signed fixed-point
number called the mantissa. The second part designates the position of the decimal (or binary) point and is called
the exponent. The fixed-point mantissa may be a fraction or an integer. A floating-point number is always
interpreted to represent a number in the following form: M × r^e.

Only the mantissa M and the exponent e are physically represented in the register (including their signs); the
radix r is implied. A floating-point binary number is represented in a similar manner except that it uses base 2
for the exponent. A floating-point number is said to be normalized if the most significant digit of the mantissa
is nonzero (in binary, 1).

So, the actual number is (-1)^s × (1 + m) × 2^(e - Bias), where s is the sign bit, m is the mantissa, e is the
exponent value, and Bias is the bias number.

Note that signed integers and exponents can be represented in sign-magnitude representation, one's complement
representation, or two's complement representation.

The floating point representation is more flexible. Any non-zero number x can be represented in the normalized
form ±(1.b1b2b3...)_2 × 2^n.

Example: Suppose a number uses a 32-bit format: 1 sign bit, 8 bits for the signed exponent, and 23 bits for
the fractional part. The leading bit 1 is not stored (as it is always 1 for a normalized number) and is referred to
as a "hidden bit".

Then -53.5 is normalized as -53.5 = (-110101.1)_2 = (-1.101011)_2 × 2^5, which is represented as follows:

1 | 00000101 | 10101100000000000000000

Here 1 is the sign bit, 00000101 is the 8-bit binary value of the exponent +5, and 10101100000000000000000
is the 23-bit fractional part (101011 padded with zeros).
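
The normalization step can be sketched as follows. Note the assumptions: the exponent is stored as a signed (two's complement) 8-bit value, as in this example, not in the biased form used by IEEE 754, and normalize is a hypothetical helper name.

```python
import math

FRAC_BITS, EXP_BITS = 23, 8


def normalize(value: float):
    """Split a finite nonzero number into (sign, exponent field, fraction field)."""
    if value == 0 or math.isinf(value) or math.isnan(value):
        raise ValueError("only finite, nonzero values can be normalized")
    sign = "1" if value < 0 else "0"
    m, e = math.frexp(abs(value))      # abs(value) == m * 2**e with 0.5 <= m < 1
    exponent = e - 1                   # rescale so the significand is 1.xxx
    fraction = round((2 * m - 1) * 2 ** FRAC_BITS)  # bits after the hidden 1
    if fraction == 2 ** FRAC_BITS:     # rounding carried into the hidden bit
        fraction, exponent = 0, exponent + 1
    if not -126 <= exponent <= 127:
        raise ValueError("exponent outside the range used in this example")
    exponent_field = format(exponent & (2 ** EXP_BITS - 1), f"0{EXP_BITS}b")
    fraction_field = format(fraction, f"0{FRAC_BITS}b")
    return sign, exponent_field, fraction_field


print(" | ".join(normalize(-53.5)))
```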

Note that the 8-bit exponent field is used to store integer exponents -126 ≤ n ≤ 127.

The smallest normalized positive number that fits into 32 bits is (1.00000000000000000000000)_2 × 2^-126 =
2^-126 ≈ 1.18 × 10^-38, and the largest normalized positive number that fits into 32 bits is
(1.11111111111111111111111)_2 × 2^127 = (2^24 - 1) × 2^104 ≈ 3.40 × 10^38.

The precision of a floating-point format is the number of positions reserved for binary digits plus one (for the
hidden bit). In the examples considered here the precision is 23 + 1 = 24.

The gap between 1 and the next normalized floating-point number is known as machine epsilon. The gap is
(1 + 2^-23) - 1 = 2^-23 for the above example, but this is not the same as the smallest positive floating-point
number, because the spacing is non-uniform, unlike in the fixed-point scenario.
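
These quantities can be read directly from bit patterns. The sketch below assumes the IEEE 754 single-precision layout as implemented by Python's struct module; float32_from_bits is a hypothetical helper, and the hexadecimal patterns are the standard encodings of 1.0, its successor, and the smallest and largest normalized values.

```python
import struct


def float32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit integer pattern as a single-precision number."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


one = float32_from_bits(0x3F800000)             # 1.0
next_after_one = float32_from_bits(0x3F800001)  # 1.0 with the last fraction bit set
machine_epsilon = next_after_one - one          # 2^-23

smallest_normal = float32_from_bits(0x00800000)  # 2^-126
largest_normal = float32_from_bits(0x7F7FFFFF)   # (2^24 - 1) * 2^104

print(machine_epsilon, smallest_normal, largest_normal)
```

Machine epsilon (about 1.19 × 10^-7) is many orders of magnitude larger than the smallest normalized number (about 1.18 × 10^-38).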

Note that non-terminating binary numbers cannot be represented exactly in floating-point representation; e.g.,
1/3 = (0.010101...)_2 cannot be a floating-point number because its binary representation is non-terminating.
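
A quick check of this point, as a sketch: Python floats are double precision (an assumption about the platform that holds for standard CPython), and Fraction recovers the exact value that was actually stored.

```python
from fractions import Fraction

stored = Fraction(1 / 3)             # exact value of the float nearest to 1/3
error = abs(stored - Fraction(1, 3))

print(stored)                        # a ratio whose denominator is a power of two
print(float(error))                  # small, but not zero
print(Fraction(0.5) == Fraction(1, 2))  # a terminating binary fraction is exact
```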

IEEE Floating point Number Representation:

IEEE (Institute of Electrical and Electronics Engineers) has standardized floating-point representation as a
sign bit, followed by an exponent field, followed by a mantissa field.

So, the actual number is (-1)^s × (1 + m) × 2^(e - Bias), where s is the sign bit, m is the mantissa, e is the
exponent value, and Bias is the bias number. The sign bit is 0 for a positive number and 1 for a negative number.
The exponent is stored in biased form: the stored value e equals the true exponent plus Bias (127 for single
precision).

According to the IEEE 754 standard, floating-point numbers are represented in the following formats:

• Half Precision (16 bit): 1 sign bit, 5 bit exponent, and 10 bit mantissa
• Single Precision (32 bit): 1 sign bit, 8 bit exponent, and 23 bit mantissa
• Double Precision (64 bit): 1 sign bit, 11 bit exponent, and 52 bit mantissa
• Quadruple Precision (128 bit): 1 sign bit, 15 bit exponent, and 112 bit mantissa
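
The field widths listed above can be used to pull a number apart. This sketch relies on Python's struct module, which supports half, single, and double precision (format codes e, f, d) but has no code for quadruple precision, so that format is omitted; decode and FORMATS are hypothetical names.

```python
import struct

# precision name -> (float code, same-size unsigned code, exponent bits, mantissa bits)
FORMATS = {
    "half": ("<e", "<H", 5, 10),
    "single": ("<f", "<I", 8, 23),
    "double": ("<d", "<Q", 11, 52),
}


def decode(value: float, precision: str = "single"):
    """Return (sign, stored exponent, mantissa field, bias) for value."""
    float_code, int_code, exp_bits, man_bits = FORMATS[precision]
    raw = struct.unpack(int_code, struct.pack(float_code, value))[0]
    sign = raw >> (exp_bits + man_bits)
    exponent = (raw >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = raw & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1   # 15, 127, 1023
    return sign, exponent, mantissa, bias


s, e, m, bias = decode(-53.5, "single")
print(s, e, m, bias)
# Rebuild the value with (-1)^s * (1 + m / 2^23) * 2^(e - Bias).
print((-1) ** s * (1 + m / 2 ** 23) * 2.0 ** (e - bias))
```

For -53.5 the stored exponent is 132, i.e. the true exponent 5 plus the bias 127.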

Special Value Representation:

Some special values are defined by particular combinations of exponent and mantissa bits in the IEEE 754
standard.

• All exponent bits 0 with all mantissa bits 0 represents 0. If the sign bit is 0, then +0, else -0.
• All exponent bits 1 with all mantissa bits 0 represents infinity. If the sign bit is 0, then +∞, else -∞.
• All exponent bits 0 with non-zero mantissa bits represents a denormalized number.
• All exponent bits 1 with non-zero mantissa bits represents an error value, NaN (Not a Number).
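
The four rules above translate directly into a classifier for single-precision bit patterns. This is a sketch; classify_float32 is a hypothetical name and the returned labels are chosen here for illustration.

```python
def classify_float32(bits: int) -> str:
    """Name the kind of value encoded by a 32-bit single-precision pattern."""
    sign = "-" if bits >> 31 else "+"
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0:
        return sign + "zero" if mantissa == 0 else sign + "denormalized"
    if exponent == 0xFF:
        return sign + "infinity" if mantissa == 0 else "NaN"
    return sign + "normalized"


for pattern in (0x00000000, 0x80000000, 0x7F800000, 0x00000001, 0x7FC00000):
    print(f"{pattern:#010x} {classify_float32(pattern)}")
```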

Common questions

How do the floating-point and fixed-point representations manage overflow and underflow during numerical computations?

Floating-point representation manages overflow and underflow by using the exponent to expand the range of representable numbers, which reduces the likelihood of these events compared to fixed-point. When overflow occurs, floating-point representation typically returns infinity, allowing systems to handle the exception gracefully. Underflow, on the other hand, produces zero or a subnormal number. Fixed-point, with its fixed bit allocation, cannot adjust to overflow or underflow, so data is discarded or becomes incorrect because there is no dynamic scaling mechanism like the exponent. This limitation makes fixed-point susceptible to overflow or underflow unless the program handles it explicitly.
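
The contrast can be demonstrated in a few lines. This is a sketch under stated assumptions: the fixed-point side models a 16-bit two's complement word that wraps around (real DSPs may saturate instead), the floating-point side uses Python's double-precision floats, and wrap_int16 is a hypothetical helper.

```python
import math
import sys


def wrap_int16(value: int) -> int:
    """Keep the low 16 bits and reinterpret them as two's complement."""
    low = value & 0xFFFF
    return low - 0x10000 if low & 0x8000 else low


fixed_overflow = wrap_int16(32767 + 1)       # wraps around to the most negative value
float_overflow = sys.float_info.max * 2      # too large: becomes infinity
float_underflow = sys.float_info.min / 2     # below the normal range: subnormal
float_vanish = 5e-324 / 2                    # below the smallest subnormal: zero

print(fixed_overflow, float_overflow, float_underflow, float_vanish)
```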

What are the considerations and trade-offs involved in choosing between single precision and double precision in floating-point representation?

Choosing between single and double precision involves trade-offs between memory usage, precision, and computation speed. Single precision uses a 32-bit representation and is beneficial for conserving memory and increasing processing speed, making it suitable for applications where performance is a higher priority than precision, such as graphics processing. However, it provides limited precision and range, which can lead to rounding errors in high-precision applications. Double precision, with its 64 bits, offers a broader range and higher precision, which is critical for scientific simulations and financial calculations requiring accurate results over a wide dynamic range. The trade-off is increased computational overhead and memory usage, which can affect system performance and cost.
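
The size and precision trade-off can be made concrete by rounding a value through single precision. This is a sketch that uses struct to emulate 32-bit storage; to_single is a hypothetical helper, and Python's own floats are assumed to be double precision.

```python
import struct
from fractions import Fraction


def to_single(x: float) -> float:
    """Round a double to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


tenth = Fraction(1, 10)
single_error = abs(Fraction(to_single(0.1)) - tenth)
double_error = abs(Fraction(0.1) - tenth)

print(struct.calcsize("<f"), struct.calcsize("<d"))  # bytes per stored value
print(float(single_error), float(double_error))
print(to_single(16777217.0))  # 2^24 + 1 does not fit in a 24-bit significand
```

Single precision halves the storage per value, at the cost of a much larger rounding error for the same input.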

What are the advantages and disadvantages of fixed-point versus floating-point representations in digital signal processing?

Fixed-point representations are advantageous because of their speed. They work well when numerical range and precision are not limiting factors and offer predictable execution timing, which is crucial for real-time processing. However, they have a relatively limited range of values and precision, which can be inadequate for complex numerical analysis. Floating-point representations, on the other hand, offer a much wider range of values, being able to represent very large and very small numbers. This makes them suitable for a broader array of applications but comes at the cost of increased computational complexity and power consumption.

Explain how the representation gap between numbers differs in fixed-point versus floating-point notations.

In fixed-point notation, the gap between any two adjacent numbers remains constant, equal to one unit of the least significant bit. This uniform spacing is due to the fixed allocation of bits to the integer and fractional parts. Conversely, in floating-point notation, the gap between numbers is not uniform; it varies with the magnitude of the numbers. Larger numbers have larger gaps and smaller numbers have smaller gaps, so that the gap relative to the value is set by the precision of the format. This non-uniform spacing allows floating point to represent a wider range of values.

In what scenarios would fixed-point representation be preferred over floating-point representation?

Fixed-point representation is preferred in scenarios where execution speed and deterministic performance are crucial, such as embedded systems and real-time digital signal processing, where hardware and power resources are constrained. Its simplicity of implementation results in lower power consumption and faster arithmetic operations compared to floating-point. Applications that require high precision but a limited range, such as integer-based calculations or simple financial computations, often benefit from fixed-point due to its efficient memory usage and predictable execution time. However, in scenarios requiring complex numerical computations with large dynamic ranges, floating-point is more suitable.

Why is the precision in floating-point representation considered more flexible than in fixed-point representation?

Precision in floating-point representation is more flexible than in fixed-point representation because floating-point allows the decimal point to 'float', which places the available significant digits where they are needed. In fixed-point, precision is limited by the fixed allocation of bits to the integer and fractional parts, which restricts the range and granularity achievable. Floating-point, by contrast, can represent very large or very small numbers with the same number of significant digits by changing the exponent, which determines where those digits are placed. This flexibility is crucial in scientific computations that demand high precision across a vast range of values.

How does the IEEE 754 standard define floating-point representation, and why is this standard significant?

The IEEE 754 standard defines a floating-point format that consists of one sign bit, an exponent field, and a mantissa field. For example, single precision uses 1 sign bit, 8 exponent bits, and 23 mantissa bits. The standardization is significant because it ensures consistent and reliable floating-point arithmetic across different computing systems, which facilitates portability and makes rounding behavior predictable. It also defines special representations for zero, infinity, and NaN (Not a Number), which are crucial for robust error handling in computations.

Discuss how binary systems are implemented in digital computers to represent various types of data beyond just numerical values.

Digital computers use binary to represent a wide range of data types, including numerical values (using both fixed-point and floating-point representations), alphanumeric characters (using character encoding systems like ASCII), and more complex data structures (such as arrays and matrices). Alphanumeric characters are encoded using binary codes, where specific bit patterns correspond to different characters. This uniform binary encoding facilitates data manipulation, transfer, and storage within digital systems, as it simplifies hardware design and enhances processing efficiency. Moreover, binary values support logical operations and the control structures of programming languages.

What is the role of the mantissa and the exponent in floating-point representation, and how do these components interact with each other?

In floating-point representation, the mantissa (or significand) holds the significant digits of the number, while the exponent determines the scale by indicating the position of the decimal point. Together, they represent a number in scientific notation, M × r^e, where M is the mantissa, r is the radix, and e is the exponent. The exponent shifts the decimal point, which enables the floating-point format to represent both very large and very small numbers by adjusting this scale dynamically.

In what ways can special values represented in the IEEE 754 standard influence computational outcomes?

Special values in the IEEE 754 standard, such as zero, infinity, and 'Not a Number' (NaN), influence computational outcomes by providing mechanisms for handling exceptional cases gracefully. For instance, division by zero can return infinity, and invalid arithmetic operations can return NaN, which helps in error detection and debugging. These representations prevent crashes and undefined behavior in programs by standardizing the result of unconventional operations and allowing computation to continue with proper error handling. By embedding these special values in the floating-point arithmetic standard, systems can robustly manage edge cases and rare events during numerical computations.
