REveal is framework to evaluate 1-to-N similarities in binaries using sector hashes. The process of taking many binaries, storing their sector hashes in a database, and searching that database with sector hashes on a file-of-interest is called Match Set Analysis.
To see Match Set Analysis used with Data Flow Slices, see the related project Flowslicer.
REveal uses Python 3.10 or newer, and dependencies can be installed with:
python -m pip install -r requirements.txt
To ingest a list of files into a database:
py reveal.py ingest database.db \path\to\binaries\
Note: The sample command was run from a Powershell prompt on a Windows box. On Linux, substitute py with python3 and
change the path separator to /
.
The above command will recursively enumerate all files in any specified folders, not just executables. To filter
only executable files (PE and ELF), use the --exe
flag.
To use globbing rules, add the --glob
flag. Then the paths may contain glob rules. For example, the path
\path\**\*.dll
will recursively find all files with the .dll
extension.
Sample output:
4.84 92 files, 137947 hashes, 28517 hash/s 0.71 File: 46439 hashes \path\to\binaries\cmake
5.88 102 files, 187396 hashes, 31868 hash/s 0.70 File: 47620 hashes \path\to\binaries\cpack
7.00 109 files, 240438 hashes, 34361 hash/s 0.81 File: 55190 hashes \path\to\binaries\ctest
11.61 158 files, 334313 hashes, 28784 hash/s 2.43 File: 312436 hashes \path\to\binaries\emacs-gtk
14.62 216 files, 679111 hashes, 46444 hash/s 0.84 File: 65940 hashes \path\to\binaries\gdb
22.14 362 files, 938023 hashes, 42364 hash/s 0.51 File: 6730 hashes \path\to\binaries\lshw
29.77 504 files, 1166263 hashes, 39174 hash/s 0.62 File: 42895 hashes \path\to\binaries\python3.8
31.79 509 files, 1260824 hashes, 39657 hash/s 1.29 File: 126849 hashes \path\to\binaries\qemu-system-i386
33.14 510 files, 1387673 hashes, 41869 hash/s 1.35 File: 127150 hashes \path\to\binaries\qemu-system-x86_64
40.37 651 files, 1604279 hashes, 39740 hash/s 1.81 File: 191997 hashes \path\to\binaries\snap
52.99 900 files, 2165876 hashes, 40876 hash/s
6.65 Finalizing ingest by sorting into final table
This output shows any operations that took over 0.5 seconds to process.
A search compares a binary file against a hash database. The search is performed in a search database, and all results are stored in that search database. That search database can then be queried and visualized.
The following command searches the ls
binary against the database.db
hash database, storing results in the
search_ls.db
search database.
py reveal.py search database.db search_ls.db \path\to\bin\ls
The search command starts by determining which sections of the file have sufficient entropy. Then it performs a rolling
hash. A rolling hash is a hash at every starting byte in the file for the length of the block size. The search
command inserts these hashes into a table in the search database, then finds matches by attaching to the hash database
and performing a JOIN operation across tables. The match results are then stored in another table in the search
database for quick loading.
To launch the GUI on a search database, run the show
subcommand:
py reveal.py show search_ls.db
REveal now has a graphical interface to display matches. With the ls
binary compared to 900 Linux binaries, we can
see how REveal can modularize parts of the binary based on match sets:
The GUI is broken up into three sections: top, bottom left, and bottom right.
The top section shows a visual representation of the bytes of the file
- The x axis is the byte offset within the file.
- The white line shows where the mouse is hovering, selecting a byte offset in the file to fill details in the lower left pane.
- The ruler-like intervals show the location of various structures in the file
- PE/ELF headers, segments, sections, resources, etc
- The colored graph describes matches
- The y axis is the count of matching files on a logarithmic scale.
- The colors describe Match Families
- A Match Set is a list of files that match at a particular offset
- A Match Family is a group of similar match sets
- The display shows different parts of the input file
ls
matching different sets of files.
The bottom left section displays information about the selected byte offset
- The selected file offset, based on the mouse hovering on the top section
- Various entropy measures
- All based on Shannon Entropy, normalized to a 0-to-1 scale
- They all measure a 512-byte window starting at the cursor offset.
- The Byte, Word, Dword, and Qword entropies measure the entropy of 1, 2, 4, and 8 byte values.
- NibLo and NibHi measure the entropy of the low and high 4 bits of each byte respectively.
- Hex dump of the bytes.
- Underlined bytes are affected by "zeroizing", where the bytes are zeroed before hashing.
- Matches by count.
- This relates to a table that simply stores match count instead of storing the list of every file.
- "Files by count" measures how many files in the database matched at that offset
- "Hashes by count" measures how many times the hash was seen, including if seen multiple times in the same file.
- These counts can be higher due to self-similar overlaps.
- For example, if 0x10-0x90 matches 3 files and 0x20-0xa0 matches 5 more, there could be between 5 and 8 distinct files between the two sets, yet this the count will report 8.
- Matching files
- A separate table stores the list of files by hash.
- The name of each file is listed, up to a limit.
- Structure detail
- Detailed information about the intervals and items at the specified offset
The bottom right section lists the Match Set Families *
- They are sorted by descending size
- There are 3 file counts:
- The number of items in the first match set of this family. The first match set has the largest number of bytes.
- The union describes the list of all files seen in the family, regardless of how many bytes they match
- The intersect describes how many files were seen in every match set in the family
- A short list of file names describes the list of files
- A second list of files that were not in every match set describes the percent of bytes they were included in.
- A list of ranges represented by the family. The ranges are in
<length>+<size>
format. - Hovering the mouse over the match set families lists the full path of the files.
The REveal GUI can be used to analyze sections of malware. To demonstrate this, we downloaded 1358 files from Malware Bazaar. We then ingested those files into a sector hash database and searched a sample of AsyncRAT.
The GUI shows several match set families. Clearly, different parts of the binary have different matching power. By hovering over the different sections, the hexdump will show some additional information:
- The pink sections appear to be a list of function names for imports
- The red sections match strings that are commands injected into the system, and an XML document (near the end of the file)
- The blue, orange, and yellow sections appear to be unusual, obfuscated code sequences.
- Strings analysis (using Language-Aware Strings)
- Fancy Tables
- Cells should be able to summarize multiple lines of detail, and expand when hovered
- For example, if files are indexed by hash, then they will have multiple names and paths from where they have been found
- Disassembly, decompilation
- Use Ghidra or Binary Ninja to extract function ranges
- Show disassembly and decompilation when hovering
- Flowslicer Integration
- Have a separate analysis to extract data flow slices into a database
- Display data flow slice information as a separate graph alongside the sector hash matches.
- Debug Symbol Information
- If files in the match set have debug symbols, extract names and source code
- Summarize names and source in the bottom view
This feature is experimental. It adds flowslicer's data-flow slicing for an indexing/search option. Output is still text-based.
Flowslicer requires a licenced version of BinaryNinja with headless capability. Flowslicer currently works with version 3.2.3814, so you may need to downgrade by going to Edit > Preferences > Update Channel, select the channel "Latest Binary Ninja release" and select version 3.2.3814. Also, install the Binary Ninja API to your python interpreter.
Run the following commands in the REveal folder to checkout flowslicer:
git submodule init
git submodule update
Then run the ingest with the --slice
option:
py reveal.py ingest database.db \path\to\binaries\ --slice
This process can take a long time, as it serially loads each binary into Binary Ninja to process.
Then you can run a search against the database with the --slice option:
py reveal.py search database.db search_ls.db \path\to\bin\ls --slice
A detailed text output will be displayed, including:
- Raw Data flow slices, and the list of addresses backing that slice
- A formatted tree of match set families, match sets, and slices.