Low-Level Software
Low-level software is a general name for the infrastructural aspects
of the software world.
Because the low-level aspects of software are often the only ones
visible to us as reverse engineers,
we must develop a firm understanding of these layers that together
make up the realm of low-level software.
Let’s review some basic software development concepts as they are viewed
from the perspective of conventional software engineers.
We will take a quick look at fundamental software engineering concepts
such as:
◦ program structure (procedures, objects, and the like),
◦ data management concepts (such as typical data structures, the role of
variables, and so on), and
◦ basic control flow constructs.
Program Structure
Program structure is the thing that makes software, an inherently large and
complex thing, manageable by humans.
We break the monster into small chunks where each chunk represents a “unit”
in the program in order to conveniently create a mental image of the program in
our minds.
The same process takes place during reverse engineering.
Reversers must try and reconstruct this map of the various components that
together make up a program.
Unfortunately, that is not always easy.
The problem is that machines don’t really need program structure as much as
we do.
We humans can’t deal with the concept of working on and understanding one
big complicated thing—objects or concepts need to be broken up into
manageable chunks.
These chunks are good for dividing the work among various people and also for
creating a mental division of the work within one’s mind.
This is really a generic concept about human thinking: when faced with large
tasks,
we’re naturally inclined to try to break them down into a bunch of smaller
tasks that together make up the whole.
Machines, on the other hand, often have a conflicting need to eliminate some
of these structural elements.
For example, think of how the process of compiling and linking a program
eliminates program structure:
◦ individual source files and libraries are all linked into a single executable, and
◦ many function boundaries are eliminated through inlining, where a function’s
body is simply pasted into the code that calls it.
The machine is eliminating redundant structural details that are not
needed for efficiently running the code.
All of these transformations affect the reversing process and make it
somewhat more challenging.
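To make the inlining point concrete, here is a minimal C sketch. The names `square` and `sum_of_squares` are invented for illustration. With optimization enabled, a compiler will often paste the body of `square` straight into the loop, so the binary may contain no separate `square` function at all; whether that actually happens depends on the compiler and its settings.

```c
#include <assert.h>
#include <stddef.h>

/* Small helper: a typical candidate for inlining. */
static inline int square(int x)
{
    return x * x;
}

/* In the source this calls square(); in an optimized binary the
   multiplication usually appears directly inside the loop. */
int sum_of_squares(const int *values, size_t count)
{
    int total = 0;
    assert(values != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        total += square(values[i]);
    }
    return total;
}
```

A reverser looking at the optimized output would typically see one function where the source had two.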
How do software developers break down software into manageable chunks?
The general idea is to view the program as a set of separate black boxes that are responsible for
very specific and (hopefully) accurately defined tasks.
The idea is that someone designs and implements a black box, tests it and confirms that it
works, and then integrates it with other components in the system.
A program can therefore be seen as a large collection of black boxes
that interact with one another.
Different programming languages and development platforms
approach these concepts differently, but the general idea is almost
always the same.
Likewise, when an application is being designed it is usually broken down into
mental black boxes that are each responsible for a chunk of the application.
For instance, in a word processor you could view the text-editing component as
one box and the spell checker component as another box.
This process is called encapsulation because each component box encapsulates
certain functionality and simply makes it available to whoever needs it, without
exposing unnecessary details about the internal implementation of the
component.
Component boxes are frequently developed by different people or even by
different groups, but they still must be able to interact.
Boxes vary in size:
- Some boxes implement entire application features (like the earlier spell
checker example),
- while others represent far smaller and more primitive functionality such as
sorting functions and other low-level data management functions.
These smaller boxes are usually made to be generic, meaning that they can be
used anywhere in the program where the specific functionality they provide is
required.
Developing a robust and reliable product rests primarily on two factors:
◦ that each component box is well implemented and reliably performs its
duties, and
◦ that each box has a well defined interface for communicating with the outside
world.
In most reversing scenarios:
◦ the first step is to determine the component structure of the application and
◦ the exact responsibilities of each component.
From there, one usually picks a component of interest and investigates the
details of its implementation.
Modules
The largest building block for a program is the module.
Modules are simply binary files that contain isolated areas of a program.
There are two basic types of modules that can be combined together to make a
program:
◦ static libraries and
◦ dynamic libraries.
Static libraries
Static libraries make up a group of source-code files that are built together and
represent a certain component of a program.
Logically, static libraries usually represent a feature or an area of functionality in
the program.
Frequently, a static library is not an integral part of the product that is being
developed but rather an external, third-party library that adds certain
functionality to it.
Static libraries are added to a program while it is being built, and
they become an integral part of the program’s binaries.
They are difficult to make out and isolate when we look at the
program from a low-level perspective while reversing.
Dynamic libraries
Dynamic libraries (called Dynamic Link Libraries, or DLLs, in Windows) are similar
to static libraries, except
that they are not embedded into the program; they remain in a separate
file, even when the program is shipped to the end user.
A dynamic library allows individual components of a program to be upgraded
without updating the entire program:
a library can be replaced seamlessly, without upgrading any other
components in the program.
An upgraded library would usually contain improved code, or even
entirely different functionality through the same interface.
Dynamic libraries are very easy to detect while reversing, and
the interfaces between them often simplify the reversing process
because they provide helpful hints regarding the program’s architecture.
Common Code Constructs
There are two basic code-level constructs that are considered the most
fundamental building blocks for a program.
These are procedures and objects.
In terms of code structure, the procedure is the most fundamental unit in
software.
A procedure is a piece of code, usually with a well-defined purpose, that can be
invoked by other areas in the program.
Procedures can optionally receive input data from the caller and return data to
the caller.
Procedures are the most commonly used form of encapsulation in any
programming language.
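As a small illustration of a procedure that receives input from its caller and returns a result, consider the following C sketch. The function names `clamp` and `percent_from_raw` are hypothetical and exist only for this example.

```c
#include <assert.h>

/* A procedure with a well-defined purpose: restrict value to the
   range [low, high]. It receives three inputs and returns one result. */
int clamp(int value, int low, int high)
{
    assert(low <= high);   /* precondition the caller must respect */
    if (value < low)  return low;
    if (value > high) return high;
    return value;
}

/* Another area of the program invoking the procedure. */
int percent_from_raw(int raw)
{
    return clamp(raw, 0, 100);
}
```

The caller of `percent_from_raw` does not need to know how the clamping is done, which is the encapsulation the text describes.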
The next logical leap that supersedes procedures is to divide a program into
objects.
Designing a program using objects is an entirely different process than the
process of designing a regular procedure-based program.
This process is called object-oriented design (OOD).
It is widely considered to be one of the most popular and effective approaches to software
design.
OOD methodology
OOD methodology defines an object as a program component that has both data and code
associated with it.
The code can be a set of procedures that is related to the object and can manipulate its data.
The data is part of the object and is usually private, meaning that it can only be accessed by
object code, but not from the outside world.
This simplifies the design process, because developers are forced to treat objects as
completely isolated entities that can only be accessed through their well-defined interfaces.
Those interfaces usually consist of a set of procedures that are associated with the object.
Those procedures can be defined as publicly accessible procedures, and are invoked primarily
by clients of the object.
Clients are other components in the program that require the services of the object but are not
interested in any of its implementation details. In most programs, clients are themselves objects
that simply require the other objects’ services.
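The idea of an object as data plus the procedures that manage it can be sketched even in plain C. The `Counter` type and its functions below are invented for illustration; C has no `private` keyword, so the privacy here is only a convention (in a real project the struct definition could be hidden in a separate source file).

```c
#include <assert.h>
#include <stddef.h>

/* The object's data. Clients are expected not to touch these fields. */
typedef struct {
    int count;
    int limit;
} Counter;

/* Public interface: the only procedures clients are meant to call. */
void counter_init(Counter *c, int limit)
{
    assert(c != NULL && limit >= 0);
    c->count = 0;
    c->limit = limit;
}

/* Returns 1 if the counter advanced, 0 if it is already at its limit. */
int counter_increment(Counter *c)
{
    assert(c != NULL);
    if (c->count >= c->limit) {
        return 0;
    }
    c->count++;
    return 1;
}

int counter_value(const Counter *c)
{
    assert(c != NULL);
    return c->count;
}
```

A client that only calls `counter_init`, `counter_increment`, and `counter_value` keeps working even if the fields inside `Counter` are later rearranged.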
Beyond the simple division of a program into objects, most
object-oriented programming languages provide an additional feature called
inheritance.
Inheritance allows designers to establish a generic object type and implement
many specific implementations of that type that offer somewhat different
functionality.
The idea is that the interface stays the same, so the client using the object
doesn’t have to know anything about the specific object type it is dealing with—
it only has to know the base type from which that object is derived.
This concept is implemented by declaring a base object, which includes a
declaration of a generic interface to be used by every object that inherits from
that base object.
Base objects are usually empty declarations that offer little or no actual
functionality.
In order to add an actual implementation of the object type, another object is
declared, which inherits from the base object and
contains the actual implementations of the interface procedures.
The beauty of this system is that
for a single base object there can be multiple descendant objects
that implement entirely different functionality but
export the same interface.
Clients can use these objects without knowing the specific object type they
are dealing with; they are only aware of the base object’s type.
This concept is called polymorphism.
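The following C sketch imitates this arrangement with a table of function pointers. It is only an approximation of what object-oriented languages provide natively, and all names (`Shape`, `ShapeInterface`, `make_rect`, and so on) are hypothetical. It is still worth studying, because dispatch through a pointer table is close to what polymorphic calls tend to look like in compiled code.

```c
#include <assert.h>
#include <stddef.h>

/* Base "object": only an interface, no behavior of its own. */
typedef struct Shape Shape;

typedef struct {
    int (*area)(const Shape *self);
} ShapeInterface;

struct Shape {
    const ShapeInterface *iface;  /* selects the concrete implementation */
    int width;
    int height;
};

/* Two descendants: different implementations, same interface. */
static int rect_area(const Shape *s)     { return s->width * s->height; }
static int triangle_area(const Shape *s) { return s->width * s->height / 2; }

static const ShapeInterface RECT_IFACE     = { rect_area };
static const ShapeInterface TRIANGLE_IFACE = { triangle_area };

Shape make_rect(int w, int h)
{
    Shape s = { &RECT_IFACE, w, h };
    return s;
}

Shape make_triangle(int w, int h)
{
    Shape s = { &TRIANGLE_IFACE, w, h };
    return s;
}

/* Client code: knows only the base type, never the concrete one. */
int shape_area(const Shape *s)
{
    assert(s != NULL && s->iface != NULL);
    return s->iface->area(s);
}
```

`shape_area` contains a single indirect call; which implementation runs is decided by the object it is handed.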
Data Management
A program deals with data.
Any operation always requires:
◦ input data,
◦ room for intermediate data, and
◦ a way to send back results.
To view a program from below and understand what is happening,
one must understand how data is managed in the program.
This requires two perspectives:
◦ the high-level perspective as viewed by software developers and
◦ the low-level perspective that is viewed by reversers.
High-level languages tend to isolate software developers from the details
surrounding data management at the system level.
Developers are usually only made aware of the simplified data flow described by
the high-level language.
Naturally, most reversers are interested in obtaining a view of the program that matches
that simplified high-level view as closely as possible.
That’s because the high-level perspective is usually far more human-friendly
than the machine’s perspective.
In order to recover some or all of that high-level data flow
information from a program binary,
one must understand how programs view and treat data from both
the programmer’s high-level perspective and the perspective of the low-level
machine-generated code.
Variables
For a software developer, the key to managing and storing data is usually
named variables.
All high-level languages provide developers with the means to declare variables
at various scopes and use them to store information.
Programming languages provide several abstractions for these variables.
The level at which a variable is defined determines which parts of the
program will be able to access it, and also where it will be physically stored.
The names of named variables are usually relevant only during compilation.
Many compilers completely strip the names of variables from a program’s
binaries and identify them using their addresses in memory.
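A short C sketch of variables declared at different scopes may help here. The names `g_total`, `last_id`, and `doubled` are made up for the example, and the storage locations in the comments describe what compilers typically do rather than a guarantee.

```c
#include <assert.h>

/* Global: accessible from anywhere, stored in the program's static
   data area at a fixed address. */
int g_total = 0;

int next_id(void)
{
    /* Static local: accessible only inside next_id, but it also lives
       in the static data area and keeps its value between calls. */
    static int last_id = 0;
    last_id++;
    return last_id;
}

int add_to_total(int amount)
{
    /* Local: typically kept in a register or on the stack, and gone
       once the function returns. */
    int doubled;
    assert(amount >= 0);
    doubled = amount * 2;
    g_total += doubled;
    return g_total;
}
```

In a stripped binary none of these three names survives; what remains is a pair of fixed addresses and a stack slot or register.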
User-Defined Data Structures
User-defined data structures are simple constructs that represent a group of
data fields, each with its own type.
The idea is that these fields are all somehow related, which is why the program
stores and handles them as a single unit.
The data types of the specific fields inside a data structure can either be simple
data types such as integers or pointers or they can be other data structures.
While reversing, you will encounter a variety of user-defined data
structures.
Proper identification of such data structures and deciphering their contents is
critical for achieving program comprehension.
The key to doing this successfully is to gradually record every tiny detail
discovered about them until you have a sufficient understanding of the
individual fields.
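To show what recording field details amounts to, here is a minimal C sketch built around a hypothetical `Contact` record. The offsets in the comments are what common compilers produce for this layout; they are an assumption of the example, not something the language guarantees in general.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record, used only for illustration. */
typedef struct {
    uint32_t id;        /* typically at offset 0 */
    uint32_t flags;     /* typically at offset 4 */
    uint8_t  name[16];  /* typically at offset 8 */
} Contact;

/* What the source code calls contact->flags ... */
uint32_t contact_flags(const Contact *contact)
{
    assert(contact != NULL);
    return contact->flags;
}

/* ... is, at the machine level, a read from base address + offset. */
uint32_t contact_flags_raw(const void *base)
{
    uint32_t value;
    assert(base != NULL);
    memcpy(&value, (const uint8_t *)base + offsetof(Contact, flags),
           sizeof value);
    return value;
}
```

When reversing, one usually starts from the second form (a base pointer plus a constant) and works back toward the first by noting which offsets are read and how their values are used.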
Other than user-defined data structures,
programs routinely use a variety of generic data structures for organizing their
data.
Most of these generic data structures represent lists of items (where each item
can be of any type, from a simple integer to a complex user-defined data
structure).
A list is simply a group of data items that share the same data type and that the
program views as belonging to the same group.
In most cases, individual list entries contain unique information while sharing a
common data layout.
Examples include a list of contacts in an organizer program or a list of
e-mail messages in an e-mail program.
Those are the user-visible lists, but most programs will also maintain a variety of
user-invisible lists
that manage such things as areas in memory currently active, files currently
open for access, and the like.
The way in which lists are laid out in memory is a significant design
decision for software engineers and usually depends on the contents
of the items and what kinds of operations are performed on the list.
The expected number of items is also a deciding factor in choosing
the list’s format.
For example, lists that are expected to have thousands or millions
of items might be laid out differently than lists that can only grow to
a couple of dozen items.
Also, in some lists the order of the items is critical, and new items are
constantly added and removed from specific locations in the middle
of the list.
Other lists aren’t sensitive to the specific position of each item.
Another criterion is the ability to efficiently search for items and
quickly access them.
Arrays
Arrays are the most basic and intuitive list layout: items are placed
sequentially in memory one after the other.
Items are referenced by the code using their index number, which is just the
number of items from the beginning of the list to the item in question.
There are also multidimensional arrays, which can be visualized as multilevel
arrays.
For example, a two-dimensional array can be visualized as a simple table with
rows and columns, where each reference to the table requires the use of two
position indicators: row and column.
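The two position indicators map onto memory in a simple way. The C sketch below (with invented names `ROWS`, `COLS`, `table_get`, and `table_get_flat`) shows the same lookup written once with two indexes and once as the flat offset calculation that C's row-by-row layout implies.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 3, COLS = 4 };

/* Source-level view: two position indicators. */
int table_get(int table[ROWS][COLS], size_t row, size_t col)
{
    assert(row < ROWS && col < COLS);
    return table[row][col];
}

/* Memory view: rows are stored one after another, so the same item
   sits at flat index row * COLS + col. */
int table_get_flat(const int *flat, size_t row, size_t col)
{
    assert(row < ROWS && col < COLS);
    return flat[row * COLS + col];
}
```

A multiply by the row width followed by an add is a common sign of two-dimensional array access in compiled code.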
The most significant downside of arrays is the difficulty of adding and
removing items in the middle of the list.
Doing that requires that the tail of the array (any items that
come after the item we’re adding or removing) be copied, either to make
room for the new item or
to eliminate the empty slot previously occupied by an item.
With very large lists, this can be an extremely inefficient operation.
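The copying cost is easy to see in code. This is a minimal sketch, assuming a plain `int` array whose capacity is managed by the caller; `array_insert` and `array_remove` are names invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Inserts value at index in an array currently holding *count items.
   The caller must guarantee room for at least *count + 1 items. */
void array_insert(int *items, size_t *count, size_t index, int value)
{
    assert(items != NULL && count != NULL && index <= *count);
    /* Shift everything from index onward up by one slot. */
    memmove(&items[index + 1], &items[index],
            (*count - index) * sizeof items[0]);
    items[index] = value;
    (*count)++;
}

/* Removes the item at index and closes the gap it leaves behind. */
void array_remove(int *items, size_t *count, size_t index)
{
    assert(items != NULL && count != NULL && index < *count);
    /* Shift the tail down by one slot. */
    memmove(&items[index], &items[index + 1],
            (*count - index - 1) * sizeof items[0]);
    (*count)--;
}
```

Both functions move every item after `index`, so the work grows with the size of the tail.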
Linked lists
In a linked list, each item is given its own memory space and can be
placed anywhere in memory.
Each item stores the memory address of the next item (a link), and sometimes
also a link to the previous item.
This arrangement has the added flexibility of supporting the quick addition or
removal of an item because no memory needs to be copied.
To add or remove items in a linked list,
the links in the items that surround the item being added or removed must be
changed to reflect the new order of items.
Linked lists address the weakness of arrays with regard to inefficiencies when
adding and removing items by not placing items sequentially in memory.
Of course, linked lists also have their weaknesses.
Because items are randomly scattered throughout memory, there can be no
quick access to individual items based on their index.
Also, linked lists are less efficient than arrays with regard to memory utilization,
because each list item must have one or two link pointers, which use up
precious memory.
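A minimal singly linked list in C illustrates both sides of this tradeoff. The `Node` type and the three functions are hypothetical names chosen for the sketch; the nodes are supplied by the caller, so no particular allocation scheme is assumed.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node {
    int value;
    struct Node *next;   /* the link: address of the next item, or NULL */
} Node;

/* Insert new_node right after prev: two pointer writes, no copying. */
void list_insert_after(Node *prev, Node *new_node)
{
    assert(prev != NULL && new_node != NULL);
    new_node->next = prev->next;
    prev->next = new_node;
}

/* Unlink the item that follows prev (if any) and return it. */
Node *list_remove_after(Node *prev)
{
    Node *removed;
    assert(prev != NULL);
    removed = prev->next;
    if (removed != NULL) {
        prev->next = removed->next;
        removed->next = NULL;
    }
    return removed;
}

/* Reaching item number index means walking the links one by one. */
Node *list_at(Node *head, size_t index)
{
    while (head != NULL && index > 0) {
        head = head->next;
        index--;
    }
    return head;
}
```

Insertion and removal touch a constant number of pointers, while `list_at` has to follow `index` links, which is the missing indexed access the text mentions.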
Trees
A tree is similar to a linked list in that memory is allocated separately for each
item in the list.
The difference is in the logical arrangement of the items.
In a tree structure,
items are arranged hierarchically, which greatly simplifies the process of
searching for an item.
The root item represents a median point in the list
and contains links to the two halves of the tree (these are essentially branches):
one branch links to lower-valued items,
while the other branch links to higher-valued items.
This layout greatly simplifies the process of binary searching,
where each iteration eliminates the half of the list in
which the item is known not to be present.
With a binary search, the number of iterations required is very low
because, in a reasonably balanced tree,
each iteration makes the remaining list about 50 percent shorter.
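The search described above can be sketched in C as follows. This is a plain, unbalanced binary search tree with hypothetical names (`TreeNode`, `tree_insert`, `tree_find`); real programs often use self-balancing variants, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>

typedef struct TreeNode {
    int value;
    struct TreeNode *lower;   /* branch holding lower-valued items */
    struct TreeNode *higher;  /* branch holding higher-valued items */
} TreeNode;

/* Links a caller-provided node into the tree rooted at *root. */
void tree_insert(TreeNode **root, TreeNode *node)
{
    assert(root != NULL && node != NULL);
    node->lower = NULL;
    node->higher = NULL;
    while (*root != NULL) {
        root = (node->value < (*root)->value) ? &(*root)->lower
                                              : &(*root)->higher;
    }
    *root = node;
}

/* Returns the node holding target, or NULL if it is absent. If steps
   is not NULL, it receives the number of nodes visited. */
const TreeNode *tree_find(const TreeNode *root, int target, int *steps)
{
    int visited = 0;
    const TreeNode *node = root;
    while (node != NULL) {
        visited++;
        if (target == node->value) {
            break;
        }
        /* Discard the branch that cannot contain the target. */
        node = (target < node->value) ? node->lower : node->higher;
    }
    if (steps != NULL) {
        *steps = visited;
    }
    return node;
}
```

With seven items inserted in the order 40, 20, 60, 10, 30, 50, 70, any lookup visits at most three nodes.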
Control Flow
In order to truly understand a program while reversing,
you will almost always have to decipher control flow statements and try to
reconstruct the logic behind those statements.
Control flow statements are statements that affect the flow of the program
based on certain values and conditions.
In high-level languages,
control flow statements come in the form of basic conditional blocks and loops,
which are translated into low-level control flow statements by the compiler.
Conditional Blocks
Conditional code blocks are implemented in most programming languages using
the if statement.
They allow for specifying one or more conditions that control whether a block of
code is executed or not.
Switch blocks
Switch blocks (also known as n-way conditionals) usually take an input value
and define multiple code blocks that can get executed for different input values.
One or more values are assigned to each code block, and the program jumps to
the correct code block at runtime based on the incoming input value.
The compiler implements this feature by
generating code that takes the input value and
searches for the correct code block to execute,
usually by consulting a lookup table that has pointers to all the different code
blocks.
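The sketch below puts a C `switch` next to a hand-written approximation of the lookup-table approach. The second function is only an illustration of the idea: actual compiler output varies with the compiler, the optimization level, and how dense the case values are. All names are hypothetical.

```c
#include <stddef.h>

/* Source-level view: one input value, several code blocks. */
const char *command_name(int code)
{
    switch (code) {
    case 0:  return "open";
    case 1:  return "close";
    case 2:  return "read";
    case 3:  return "write";
    default: return "unknown";
    }
}

/* One small code block per case value. */
static const char *h_open(void)  { return "open"; }
static const char *h_close(void) { return "close"; }
static const char *h_read(void)  { return "read"; }
static const char *h_write(void) { return "write"; }

typedef const char *(*Handler)(void);

/* The lookup table: pointers to all the different code blocks. */
static const Handler HANDLERS[] = { h_open, h_close, h_read, h_write };

/* Approximation of table-based dispatch: a range check followed by
   an indexed, indirect transfer of control. */
const char *command_name_table(int code)
{
    if (code < 0 || (size_t)code >= sizeof HANDLERS / sizeof HANDLERS[0]) {
        return "unknown";          /* the default block */
    }
    return HANDLERS[code]();
}
```

A range check immediately followed by an indexed indirect jump is the pattern to look for when trying to recognize a switch block in disassembly.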
Loops
Loops allow programs to repeatedly execute the same code block
any number of times.
A loop typically manages a counter that determines the number of
iterations already performed or the number of iterations that remain.
All loops include some kind of conditional statement that determines
when the loop is interrupted.
Another way to look at a loop is
as a conditional statement that is identical to a conditional block,
with the difference that the conditional block is executed
repeatedly.
The process is interrupted when the condition is no longer satisfied.
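The view of a loop as a repeated conditional block can be written out directly in C. In this sketch (function names invented for the example), the second version spells out the test-and-jump shape that compiled loops often reduce to; it is meant as an illustration, not as recommended style.

```c
/* A counted loop as a programmer would normally write it. */
int sum_to(int n)
{
    int total = 0;
    for (int i = 1; i <= n; i++) {
        total += i;
    }
    return total;
}

/* The same logic as a conditional block plus a jump back to the top. */
int sum_to_lowered(int n)
{
    int total = 0;
    int i = 1;
top:
    if (i <= n) {       /* the loop condition */
        total += i;
        i++;            /* the counter */
        goto top;       /* repeat the conditional block */
    }
    return total;
}
```

Both functions return the same result for every input; the loop ends as soon as the condition `i <= n` is no longer satisfied.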
High-Level Languages
High-level languages were made
to allow programmers to create software without having to worry about the
specific hardware platform on which their program would run and
without having to worry about all kinds of frustrating low-level details that just
aren’t relevant for most programmers.
Assembly language has its advantages, but
it is virtually impossible to create large and complex software in assembly
language alone.
High-level languages were made to isolate programmers from
the machine and its tiny details as much as possible.
The problem with high-level languages is
that there are different demands from different people and different fields in
the industry.
The primary tradeoff is between simplicity and flexibility.
Simplicity means that you can write a relatively short program that does exactly
what you need it to,
without having to deal with a variety of unrelated machine-level details.
Flexibility means that there isn’t anything that you can’t do with the language.
High-level languages are usually aimed at finding the right balance that suits
most of their users.
On one hand, there are certain things that happen at the machine-level that
programmers just don’t need to know about.
On the other, hiding certain aspects of the system means that you lose the
ability to do certain things.
From a reversing standpoint, the most important thing about a high-level
programming language is how strongly it hides or abstracts the underlying
machine.
Some languages such as C provide a fairly low-level view of the machine and
produce code that runs directly on the target processor.
Other languages such as Java provide a high degree of separation between the
programmer and the underlying processor.
C programming language
The C programming language is a relatively low-level language.
C provides direct support for memory pointers and lets you
manipulate them as you please.
Arrays can be defined in C, but there is no bounds checking
whatsoever, so you can access any address in memory that you
please.
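A short sketch of what this means in practice: array indexing in C is address arithmetic, and any bounds check has to be written by the programmer. The function names are hypothetical. The sketch deliberately never reads out of bounds, since doing so is undefined behavior.

```c
#include <stddef.h>

/* a[i] is defined as *(a + i): address arithmetic with no checks.
   Passing an index past the end of the array is the caller's bug. */
int get_unchecked(const int *a, size_t i)
{
    return *(a + i);
}

/* Any bounds check has to be added by hand. */
int get_checked(const int *a, size_t length, size_t i, int fallback)
{
    return (i < length) ? a[i] : fallback;
}
```

In compiled code the unchecked version typically comes down to one scaled memory read, which is part of why C maps so closely onto the machine.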
On the other hand,
C provides support for the common high-level features found in other,
higher-level languages.
This includes:
- support for arrays and data structures,
- the ability to easily implement control flow code such as conditional code
and loops,
- and others.
C is a compiled language, meaning that to run the program you must run the
source code through a compiler that generates platform-specific program
binaries.
These binaries contain machine code in the target processor’s own native
language.
C also provides limited cross-platform support.
To run a program on more than one platform you must recompile it with a
compiler that supports the specific target platform.
Many factors have contributed to C’s success, but perhaps most important is the
fact that the language was specifically developed for the purpose of writing the
Unix operating system.
Modern Unix-like systems such as the Linux kernel are still written largely in C.
Significant portions of the Microsoft Windows operating system were also
written in C (with the rest of the components written in C++).
Another feature of C that greatly affected its commercial success has been its
high performance.
Because C brings you so close to the machine, the code written by programmers
is almost directly translated into machine code by compilers, with very little
added overhead.
This means that programs written in C tend to have very high runtime
performance.
C code is relatively easy to reverse because it is fairly similar to the machine
code.
When reversing, one tries to read the machine code and reconstruct the original
source code as closely as possible.
Because the C compiler alters so little about the program, relatively speaking, it
is fairly easy to reconstruct a good approximation of the C source code from a
program’s binaries.
C++
The C++ programming language is an extension of C, and shares C’s basic syntax.
C++ takes C to the next level in terms of flexibility and sophistication by
introducing support for object-oriented programming.
The important thing is that C++ doesn’t impose any new limits on programmers.
With a few minor exceptions, any program that can be compiled under a C
compiler will compile under a C++ compiler.
The core feature introduced in C++ is the class.
A class is essentially a data structure that can have code members and
attributes.
These code members usually manage the data stored within the class.
This allows for a greater degree of encapsulation, whereby data structures are
unified with the code that manages them.
C++ also supports inheritance, which is the ability to define a hierarchy of
classes that enhance each other’s functionality.
Inheritance allows for the creation of base classes that unify a group of
functionally related classes.
It is then possible to define multiple derived classes that extend the base class’s
functionality.
The real beauty of C++ (and other object-oriented languages) is polymorphism.
Polymorphism allows for derived classes to override members declared in the
base class.
This means that the program can use an object without knowing its exact data
type—it must only be familiar with the base class.
This way, when a member function is invoked, the specific derived object’s
implementation is called, even though the caller is only aware of the base class.
Reversing code written in C++ is very similar to working with C code,
except that emphasis must be placed on
◦ deciphering the program’s class hierarchy and
◦ properly identifying class method calls, constructor calls, and so on.
Java
Java is an object-oriented, high-level language that differs from
languages
such as C and C++ because
it is not compiled into any processor’s native machine code
but into Java bytecode.
Briefly, the Java instruction set and bytecode are like a Java assembly language
of sorts, with the difference that this language is not usually interpreted
directly by the hardware,
but is instead interpreted by software (the Java Virtual Machine).
Java’s primary strength is the ability to allow a program’s binary to run on any
platform for which the Java Virtual Machine (JVM) is available.
Because Java programs run on a virtual machine (VM), the process of reversing a
Java program is completely different from reversing programs written in
compiler-based languages such as C and C++.
Java executables don’t use the operating system’s standard executable format
(because they are not executed directly on the system’s CPU).
Instead they use .class files, which are loaded directly by the virtual machine.
C#
C# was developed by Microsoft as a Java-like object-oriented language that aims
to overcome many of the problems inherent in C++.
C# was introduced as part of Microsoft’s .NET development platform, and is
based on the concept of using a virtual machine for executing programs.
C# programs are compiled into an intermediate bytecode format (similar to the
Java bytecode) called the Microsoft Intermediate Language (MSIL).
C# has quite a few advanced features such as garbage collection and type safety.
C# also has a special unmanaged mode that enables direct pointer
manipulation.
Reversing C# programs sometimes requires that you learn MSIL, the native
language of the Common Language Runtime (CLR).
On the other hand, in many cases manually reading MSIL code will
be unnecessary because MSIL code contains highly detailed
information regarding
the program and the data types it deals with,
which makes it possible to produce a reasonably accurate high-level
language representation of the program through decompilation.