Analyzing targets for fuzzing

A fuzz target is a function that takes data as input and processes it using the API under test. In other words, it is what we need to fuzz.

This step consists of carefully analyzing each fuzzing target from the attack surface. Here’s what needs to be learned:

The function arguments through which the data is passed for processing. We need the data buffer itself and its length, if it is possible to determine it.

The type of data being passed, for example an HTML document, a PNG image, or a ZIP archive. How the input data will be generated and mutated depends on it.

The resources (memory, objects, global variables) that must be initialized before calling the target function.

If we fuzz internal functions of components rather than APIs, we will need to make a list of the constraints that earlier code imposes on the data. Sometimes data validation takes place in several stages, and we should take this into account as well.

This stage is the most painstaking, because there can be a lot of fuzzing targets: hundreds or even thousands! This is where the term “source reversal” comes from: analyzing the code can take about as much time and effort as reversing a decent binary file.

Selecting input data

Before you start fuzzing, you need to select a set of input data that will serve as a starting point for the fuzzer: the seeds. In essence, the seeds are a folder of files whose contents must be valid from the point of view of the target program or function. Seeds undergo numerous mutations during fuzzing, which leads to an increase in code coverage.

Each function we fuzz must have its own seeds. We often borrow them from the project’s tests. But if the tests do not provide enough samples, you can always find something on the Internet.

When creating a set of seeds, you should take into account that:

Each element should add to the coverage of the program code. The higher the coverage, the fewer unexplored places are left in the program.

Its elements must not be large, otherwise the fuzzing speed will suffer. The longer the input data, the longer the function takes to process it, and the fewer program launches the fuzzer can make per unit of time.

The seeds should be functionally different. Very similar data will slow down the fuzzing process, because the fuzzer will often hit places that it has already explored. It is better to minimize the data and remove everything unnecessary before launching the fuzzer.
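The three criteria above can be combined into a simple greedy selection. The sketch below assumes that the coverage of each seed (a set of branch IDs) and its size have already been measured by some instrumentation; the function name `minimize_seeds`, the file names, and the numbers are made up for illustration. Tools such as `afl-cmin` or libFuzzer's `-merge=1` option do a similar job with real coverage data.

```python
def minimize_seeds(
    coverage: dict[str, set[int]],
    sizes: dict[str, int],
    max_size: int,
) -> list[str]:
    """Greedy selection: keep a seed only if it covers something new."""
    selected: list[str] = []
    covered: set[int] = set()
    # Drop oversized seeds, then try small ones first so that between
    # seeds with the same coverage the smaller one wins.
    candidates = sorted(
        (name for name in coverage if sizes[name] <= max_size),
        key=lambda name: (sizes[name], name),
    )
    for name in candidates:
        if coverage[name] - covered:  # reaches at least one new branch
            selected.append(name)
            covered |= coverage[name]
    return selected


# Hypothetical measurements for four sample files.
coverage = {
    "a.png": {1, 2},
    "b.png": {1, 2},        # same coverage as a.png, so redundant
    "c.png": {2, 3},
    "huge.png": {1, 2, 3, 4},
}
sizes = {"a.png": 100, "b.png": 120, "c.png": 80, "huge.png": 5_000_000}

print(minimize_seeds(coverage, sizes, max_size=1_000_000))
# prints ['c.png', 'a.png']
```

A greedy pass like this does not guarantee the smallest possible set, but it is cheap and removes the obvious duplicates.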

During fuzzing, the seeds are transformed into a corpus. A corpus is a set of test cases that led to the growth of code coverage during fuzzing of a target program or function. In other words, these are the most interesting inputs that can potentially lead to a program crash. The corpus first contains the seeds, then their mutations, then mutations of mutations, and so on. Here we see that feedback-driven fuzzing is a cyclic process, where each new iteration gives us a better chance of generating inputs that will allow us to find vulnerabilities in the program.
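The cycle described above can be sketched as a toy loop. Everything here is an assumption made for illustration: the `target` function reports the branches it reached directly, whereas a real fuzzer gets this feedback from compiler instrumentation, and the single-byte `mutate` is far simpler than real mutation strategies. At least one seed is assumed.

```python
import random


def target(data: bytes) -> set[int]:
    """Toy target: returns the IDs of the branches that the input reached."""
    hit = {0}
    if len(data) > 0 and data[0] == ord("F"):
        hit.add(1)
        if len(data) > 1 and data[1] == ord("U"):
            hit.add(2)
    return hit


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Replace one randomly chosen byte with a random value."""
    if not data:
        return bytes([rng.randrange(256)])
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def fuzz(
    seeds: list[bytes], iterations: int, rng: random.Random
) -> tuple[list[bytes], set[int]]:
    """Feedback-driven loop: keep a mutation only if it adds coverage."""
    corpus = list(seeds)  # the corpus starts out as the seeds
    covered: set[int] = set()
    for seed in corpus:
        covered |= target(seed)
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        new = target(candidate) - covered
        if new:  # coverage grew, so the input is interesting
            corpus.append(candidate)
            covered |= new
    return corpus, covered


corpus, covered = fuzz([b"AA"], iterations=2000, rng=random.Random(0))
print(len(corpus), sorted(covered))
```

Because later mutations are drawn from the growing corpus, an input that reached branch 1 becomes the starting point for reaching branch 2, which is the "mutations of mutations" effect.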