Binary Analysis Fundamentals
Table of contents
- What is a binary file?
- What are binary file formats?
- What is binary analysis?
- How does binary analysis work?
- When is binary analysis performed?
- What is the difference between binary analysis and source code analysis?
- What are some basic binary analysis tools?
- What are the challenges of binary code analysis?
- How to get started with binary analysis?
What is a binary file?
In contrast to plain text files, which store data as characters organized into zero or more lines, binary files store data more efficiently as a sequence of bytes, usually interpreted as something other than textual characters. Executable files are typically referred to as binaries, while, in fact, binary files can have any type of content.
Some binary files contain metadata that tells programs how to interpret the rest of the file. For example, file signatures, also known as magic numbers, identify the file format, much like the labels on products in a grocery store. Executable files contain binary code that causes the computer to perform specific tasks according to the instructions in the code.
Consider a plain text file, bugprove.txt, containing the message “Hello from BugProve”, and an executable binary file, bugprove, that writes the same message to the console. The following screenshot shows the raw content of both files in hexadecimal using a hexdump utility called xxd, the content of the plain text file printed by cat, one of the most frequently used commands in Linux, and the console output of the program.
The first five bytes, 48 65 6c 6c 6f, represent the characters H e l l o in ASCII, a standard character encoding that assigns numeric values to characters. Note that the content of the executable binary, shown on the right side of the above screenshot, is not in human-readable form, even though the program prints the same message. You may also notice the magic number 7F 45 4c 46 in the first four bytes, indicating that this is an ELF binary. A JPG image file, by contrast, begins with FF D8. See Gary Kessler's continuously updated list of file signatures for more examples.
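The byte values discussed above can be reproduced with a short Python sketch. It is only an illustration: the SIGNATURES table and the helper names hex_bytes and identify are hypothetical, and the table covers just the two signatures mentioned in this article rather than a complete list.

```python
# Hypothetical, deliberately tiny signature table: only the two
# magic numbers mentioned in the text, not an exhaustive list.
SIGNATURES = {
    b"\x7fELF": "ELF binary",   # 7F 45 4C 46
    b"\xff\xd8": "JPEG image",  # FF D8
}


def hex_bytes(data: bytes, count: int = 8) -> str:
    """Format the first `count` bytes as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in data[:count])


def identify(data: bytes) -> str:
    """Return the label of the first matching signature, or 'unknown'."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return "unknown"


text = "Hello from BugProve".encode("ascii")
print(hex_bytes(text, 5))                       # 48 65 6c 6c 6f
print(identify(text))                           # unknown
print(identify(b"\x7fELF\x02\x01\x01\x00"))     # ELF binary
```

To try it on a real file, you could pass the result of open(path, "rb").read(8) to both helpers. The file utility does the same kind of matching against a far larger signature database.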
What are binary file formats?
In general, file formats are data structures that tell other programs how to handle a file. For executable binaries, one such program is the operating system's loader, which relies on the format to load and manage the executable code stored in the file.
Developers build executable binaries through compilation and linking, which convert human-written source code in a language such as C or C++ into machine code that the computer can execute. The compiler only translates the source code into machine language instructions and produces one or more object files, which you cannot run yet. The linker creates an executable binary from the object files, which the operating system's loader can place into memory and prepare for execution.
Executable files begin with a header that includes information about the code, the type of application, required library functions, and space requirements. ELF (Executable and Linkable Format) is the common standard binary file format for executable files, object code, shared libraries and core dumps on Linux systems, while the PE (Portable Executable) file format is used by executables, object code and DLLs on Windows systems. Most binary formats share many similarities and a few essential differences. An obvious distinction is the file extension: for example, the filenames of PE executables typically end with .exe, while ELF executables usually have no extension.
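As a rough illustration of what such a header holds, the following Python sketch decodes a few fields from the start of an ELF header. The function name parse_elf_header and the two lookup tables are hypothetical and cover only a handful of common values. The sample bytes are synthetic, built by hand to resemble the start of a 64-bit little-endian header, not taken from a real file.

```python
import struct

# Partial lookup tables (a few common values only).
ELF_TYPES = {1: "relocatable object", 2: "executable",
             3: "shared object", 4: "core dump"}
ELF_MACHINES = {3: "x86", 8: "MIPS", 40: "ARM", 62: "x86-64", 183: "AArch64"}


def parse_elf_header(data: bytes) -> dict:
    """Decode class, byte order, file type and machine from an ELF header."""
    if len(data) < 20 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class, ei_data = data[4], data[5]
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        raise ValueError("unsupported ELF class or data encoding")
    # Byte 5 says how multi-byte fields are encoded: 1 = little, 2 = big.
    endian = "<" if ei_data == 1 else ">"
    # e_type and e_machine are two 16-bit fields right after the
    # 16-byte identification block.
    e_type, e_machine = struct.unpack_from(endian + "HH", data, 16)
    return {
        "bits": 32 if ei_class == 1 else 64,
        "endianness": "little" if ei_data == 1 else "big",
        "type": ELF_TYPES.get(e_type, f"unknown ({e_type})"),
        "machine": ELF_MACHINES.get(e_machine, f"unknown ({e_machine})"),
    }


# Synthetic header fragment: magic, 64-bit, little-endian, version 1,
# padding, then type 3 (shared object) and machine 62 (x86-64).
sample = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8) + struct.pack("<HH", 3, 62)
print(parse_elf_header(sample))
```

On a Linux system you could feed it the first bytes of a real binary instead of the synthetic sample and compare the result with the output of readelf -h. Note that position-independent executables are also marked with type 3, so "shared object" does not necessarily mean a library.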
What is binary analysis?
Binary analysis is the process of examining the properties of binary files, including the instructions and data they contain, to learn more about the file's contents or the program's purpose.
Although reverse engineering is one common application, binary analysis is a significantly broader field than most people realize. It covers some basic but important activities as well as more advanced techniques such as binary instrumentation, which lets you observe or even alter a program's behavior, and dynamic taint analysis, which lets you track data flows. There are many tools available, but two of the most widely known free-to-use dynamic binary instrumentation tools are Intel's Pin and the open-source DynamoRIO.
How does binary analysis work?
We can categorize binary code analysis into two fundamental groups according to whether the program is executed during the analysis, and into another two groups according to the complexity of the methods we use.
Static analysis involves examining the executable binary without running it. Static analysis tools may identify bugs or security vulnerabilities by only reading a program.
- Basic static analysis consists of examining the executable file without viewing the actual instructions. It is straightforward and can be quick, but it is largely ineffective against large binaries and can miss important behaviors.
- Advanced static analysis consists of reverse engineering the binary's internals. You can load the executable into a disassembler, which transforms the binary code into human-readable assembly so you can inspect the program's instructions, or into a decompiler, which converts assembly-level idioms into high-level abstractions. The resulting pseudocode is much more concise and typically omits some details to make the code easier for people to understand. IDA Pro is one of the most powerful and widely used disassemblers and decompilers.
Dynamic analysis involves observing the program as it executes in a real or nearly identical environment, using full-system or user-mode emulation with QEMU if necessary. Dynamic analysis tools often instrument the program with analysis code that records information about its execution as metadata.
- Basic dynamic analysis techniques involve running the binary and observing its behavior on the system, including the executed system and library calls and their arguments.
- Advanced dynamic analysis involves using a debugger to examine the internal state of a running executable, including the values of variables and the outcomes of conditional branches. On Linux systems, gdb is the most popular debugger and a remarkably versatile tool suitable for numerous purposes.
Note that these approaches are complementary: static analysis is typically only the first step, and further analysis is usually necessary. Static analysis generally requires less time and can consider all execution paths in a program, whereas dynamic analysis is much more resource intensive and only covers the execution paths that are actually exercised during a run. However, dynamic analysis is typically more precise because it works with genuine values. Quite often, combining the advantages of static and dynamic analysis in a hybrid approach is the optimal solution.
When is binary analysis performed?
You can use binary analysis to examine and document the behavior of executable files for which you do not have the source code, as is often the case for third-party libraries, drivers, and other system components that might only be available as binaries.
Usually, software developers can create drivers for hardware devices based on thorough documentation, but many third-party manufacturers only provide a binary-only executable, referred to as a binary blob. The precise operation and security quality of binary blobs are unknown, which means that users have to trust third-party vendors with the security of their system. Malicious software is also distributed without source code, so analyzing it requires comprehensive malware analysis and reverse engineering. However, even if you have source code available, binary analysis can also be used for advanced debugging and for finding issues, including security weaknesses that manifest only at the binary level.
Third-party security labs often use binary analysis to assess the security of applications, system software code or the firmware of interconnected devices. The automotive and medical industries already have strict requirements for cybersecurity, and many consumer electronics devices also undergo third-party or in-house penetration testing, where the testing methodology often tries to “simulate” a real-world attacker. These evaluations usually follow a black-box or gray-box approach, implying that the source code is not available for the security researchers. In such cases, binary analysis is the primary tool to discover bugs, security vulnerabilities or non-compliances with cybersecurity requirements for those projects.
What is the difference between binary analysis and source code analysis?
Source code analysis involves analyzing programs at the level of source code. Binary analysis involves analyzing programs at the level of machine code, stored either as object code in intermediate portions of the final program or executable code in the complete program. It is important to point out that, as is the case with static analysis and dynamic analysis, the two approaches are complementary. Creating a binary is more involved than simply translating source code into machine code. The binary is typically not used as is but is configured extensively for deployment in a production environment, including, but not limited to, using the security hardening capabilities of the compiler, packaging strong cryptographic materials, and disabling debugging capabilities.
What are some basic binary analysis tools?
You can achieve many seemingly difficult tasks by correctly combining a few simple tools, many of which are installed on most Linux systems by default. Below is a summary of some fundamental tools you can use to begin your binary analysis journey on Linux systems:
- The file utility, for instance, was created to identify the type of a file by looking for telltale patterns in the files, like magic numbers.
- Binwalk is a similar tool created to search a given binary for embedded files and executable code. It is compatible with file signatures recognized by the file utility but also comes with a custom signature file for items that are frequently included in firmware images.
- You can inspect the details of a binary's header using readelf and use the ldd program to find out which shared objects a binary depends on and which library versions the binary expects.
- You can use a hex-dumping program called xxd to display the bytes of a file in hexadecimal representation.
- You can use nm to list both static and dynamic symbols in a given binary, object file or shared object and demangle symbol names.
- Searching through character sequences in the binary using the strings utility can be a simple way to get hints about the functionality of a program. Legitimate programs almost always include many strings, like messages, URLs or file locations. Malware that is packed or obfuscated often contains very few readable strings.
- Specific functions used by an executable provide hints to the functionality of the program, hence inspecting the system and library calls executed by a binary using strace and ltrace can often give you a good high-level idea of what a program is doing.
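To make the idea behind the strings utility concrete, here is a minimal Python sketch that scans raw bytes for runs of printable ASCII characters. It is an approximation, not a reimplementation of the real tool: the function name extract_strings is hypothetical, the sample blob is synthetic, and the minimum length of four characters is an assumption chosen for this sketch.

```python
import re
from typing import List


def extract_strings(data: bytes, min_length: int = 4) -> List[str]:
    """Return runs of at least `min_length` printable ASCII characters."""
    # 0x20-0x7e covers the space character through '~'.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Synthetic blob mixing non-printable bytes with readable text.
blob = (b"\x7fELF\x02\x01\x00\x00"
        b"Hello from BugProve\x00\x01\x02"
        b"/lib/ld.so\x00ab\x00")
print(extract_strings(blob))  # ['Hello from BugProve', '/lib/ld.so']
```

The runs "ELF" and "ab" are dropped because they are shorter than the minimum length, which is how short accidental matches in machine code get filtered out. Raising min_length reduces noise further at the cost of missing short but meaningful strings.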
What are the challenges of binary code analysis?
There are several reasons why binary code analysis is more cumbersome than source code analysis. To note a few examples, binary files typically appear as large blobs, which makes restoring the high-level code structure a complex task. They contain a mixture of code and data that is easy to misinterpret. They are often stripped of symbols, making it considerably harder to understand the code. Variable types are never explicitly stated, making the purpose and structure of data hard to recognize. Furthermore, any minor modification of the code or data could easily break the binary. Although it is possible to some extent to translate binary code into a human-readable format, we cannot fully recover the source code of the program.
How to get started with binary analysis?
We collected many helpful learning materials related to firmware security and binary analysis in our resource directory for IoT security, including books to read, podcasts to listen to, video channels to watch, and subreddits to follow. Binary analysis is a broad topic with many available resources and countless existing tools, so be prepared for sustained, long-term study. We recommend starting your firmware security and binary analysis journey with the DVRF project and moving on to more complex CTFs involving some binary exploitation. We also encourage you to try our firmware analysis platform via its Free Plan.