Brian Carrier
Issue #7
August 15, 2003
The seventh issue of The Sleuth Kit Informer completes the two part series on hash databases. This issue examines the 'hfind' tool in The Sleuth Kit and shows how it works and what is in the index file that it creates. Also in this issue is an update on developments with The Sleuth Kit and Autopsy, a random tip on making timelines, and a correction on what the NIST NSRL actually contains.
On July 22, Pepijn Vissers of Fox-IT released a new beta version of a patch for Autopsy to integrate Jesse Kornblum's 'foremost' tool.
http://sourceforge.net/mailarchive/forum.php?thread_id=2813215&forum_id=10358
On August 2, version 1.63 of The Sleuth Kit was released and on August 3, version 1.64 was released. 1.64 fixed a compile error on some Linux systems. 1.63 added media management tools to view the partition layout of DOS partitions, BSD partitions, Mac partitions, and Sun VTOCs. A bug in 'sorter' was also fixed.
http://www.sleuthkit.org/sleuthkit/index.php
On August 8, Paul Bakker of Fox-IT released a new beta version of his indexed search for Autopsy.
http://sourceforge.net/mailarchive/forum.php?thread_id=2922896&forum_id=10358
On August 12, a new mailing list was created. sleuthkit-developers is for the discussion of new features and development of The Sleuth Kit and Autopsy.
http://lists.sourceforge.net/lists/listinfo/sleuthkit-developers
From the command line, it is easy to reduce the amount of data in a timeline. After 'fls' and 'ils' have created the 'body' file, grep can be used to keep only specific files. For example, if you only want entries for the '/dev/' directory:
# grep '|\/dev\/' body > body.dev
The '|' is added to the beginning of the search term to force the '/dev/' to be at the beginning of the path and not in the middle, such as in '/usr/dev/'. To get all entries except those that match the search pattern, supply the '-v' argument:
# grep -v '|\/dev\/' body > body.notdev
Then, run 'mactime' as normal on the output file.
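The same filtering can be done in a script when grep is not convenient. The following is a minimal Python sketch, not part of The Sleuth Kit; the function name 'filter_body' and the shortened sample lines are hypothetical and only illustrate the idea of matching '|' followed by the path prefix.

```python
def filter_body(lines, path_prefix, invert=False):
    """Keep body-file lines whose path starts with path_prefix.

    Mirrors grep '|/dev/' (or grep -v when invert=True): the '|' in front
    of the prefix forces the match to the beginning of a path.
    """
    needle = "|" + path_prefix
    return [line for line in lines if (needle in line) != invert]


# Hypothetical, shortened body lines for illustration only.
sample = [
    "0|/dev/ttyS0|0|100|8624|crw-rw----|1|0|14",
    "0|/usr/dev/notes.txt|0|200|33188|-rw-r--r--|1|0|0",
    "0|/etc/passwd|0|300|33188|-rw-r--r--|1|0|0",
]
dev_only = filter_body(sample, "/dev/")
not_dev = filter_body(sample, "/dev/", invert=True)
```

Here 'dev_only' keeps only the '/dev/ttyS0' line, and 'not_dev' keeps the other two, including '/usr/dev/notes.txt'.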
At the Digital Forensic Research Workshop (DFRWS), I learned that my (and many others') understanding of what the NIST NSRL contains was incorrect. Not all files in the NSRL are known-good. The NSRL could contain rootkits and other "hacker" tools. The files are broken up into categories and one category is called "Hacker Tool". They have a brief mention at:
I do not know how to solve this problem with 'sorter' because it currently assumes that all entries are known to be good. In the future, rootkits and such could be included in the database. I do not want to have to maintain a list of what categories are "bad" and which are "good". Therefore, the next version of 'sorter' will likely have the NSRL option removed until a better solution can be developed.
Issue #6 of The Sleuth Kit Informer gave an overview of hash databases and how they are used in Autopsy for data reduction to identify known-good and known-bad files. An overview of the binary search algorithm was also given. This article goes deeper into the world of hashes: it outlines how the 'hfind' tool in The Sleuth Kit works and gives the details of the index file that it uses. The goal of this article is to educate the user about the underlying process and open the design up for discussion and debate.
The 'hfind' tool from The Sleuth Kit allows one to look up hashes from the NIST NSRL, Hashkeeper, and md5sum hash databases. Before any lookups can be performed, the database must be indexed. The index is created by running 'hfind' with the '-i' flag and the database type:
For example, to index the NSRL by the MD5 values, the following would be used:
# hfind -i nsrl-md5 /usr/local/nsrl/NSRLFile.txt
Once the database has been indexed, any value can be looked up. Values can be supplied on the command line, from STDIN, or from a file. To use the command line, supply the hash database and one or more hashes:
# hfind /usr/local/nsrl/NSRLFile.txt 917c4a96fc6bd589fe522765391077e8
917c4a96fc6bd589fe522765391077e8 Hash Not Found
Lookups can also be done with standard input:
# md5sum unknown.dat | hfind /usr/local/nsrl/NSRLFile.txt
F99AC9C11F6D745310F53A25D46BE551 MTRUSHMO.WMF
If you have a list of hashes to look up, place them in a file and use the '-f' flag to specify the file location. In this example, we have a file (hashes.txt) with the two above hashes in it, one per line. It is specified with '-f':
# hfind -f hashes.txt /usr/local/nsrl/NSRLFile.txt
917c4a96fc6bd589fe522765391077e8 Hash Not Found
F99AC9C11F6D745310F53A25D46BE551 MTRUSHMO.WMF
Note that the database type does not need to be specified after the indexing has been done. Therefore, the only difference with using a Hashkeeper or an md5sum database is in the indexing process.
There are a few other flags that could be useful. The quick output flag, '-q', can be given to only print a 1 if the hash is found and a 0 if it is not found. The '-e' flag produces extended output so that other information besides just the file name is given.
The purpose of the index file is to allow 'hfind' to quickly find hashes using the binary search algorithm, which needs sorted entries and is most efficient when every entry has the same size, so that the byte offset where any entry begins is easy to calculate. Hash database entries are generally not the same length because each entry also contains the file name: files with long names have longer database entries and files with shorter names have shorter entries. Furthermore, many of the databases do not come sorted.
The index file solves these problems by making each line the same size and sorting the entries by hash value. The index file only contains the hash value and the location in the original hash database where the entry details can be found. The offset value is padded with 0's until it has 16 digits.
The first database entry has an offset of 0 bytes, and if it is 202 bytes long then the second entry has an offset of 202. The hash and offset values are separated by '|'. For example:
50b9a98a606b77ab2bfdd7814d842e76|0000000000000000
356c2b641189145b1180bfdba0245f16|0000000000000202
An index file for MD5 values will have entries that are 50 bytes each: 32 for the hash, 1 for the '|', 16 for the offset, and 1 for the newline. An index file for SHA-1 values will have entries that are 58 bytes each because the hash is 40 characters instead of 32.
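The entry layout can be checked with a few lines of Python. This is only a sketch of the format described above; the names 'index_entry' and 'entry_length' are hypothetical and are not taken from the 'hfind' source.

```python
OFFSET_DIGITS = 16  # the offset is zero-padded to 16 digits


def index_entry(hash_hex, offset):
    """Format one index line: hash, '|', zero-padded offset, newline."""
    return f"{hash_hex}|{offset:0{OFFSET_DIGITS}d}\n"


def entry_length(hash_hex_len):
    """Bytes per index line for a hash with the given number of hex digits."""
    return hash_hex_len + 1 + OFFSET_DIGITS + 1


md5_line = index_entry("356c2b641189145b1180bfdba0245f16", 202)
md5_len = entry_length(32)   # MD5 is 32 hex digits
sha1_len = entry_length(40)  # SHA-1 is 40 hex digits
```

With these definitions, 'md5_line' is the second example entry shown earlier, 'md5_len' is 50, and 'sha1_len' is 58.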
The indexing process reads each entry in the database and writes the hash and offset to a temporary file. That file is then sorted using the UNIX 'sort' tool. The index file will have the same base name as the database and it adds '-md5.idx' or '-sha1.idx' to the end. The difference is based on what hash type was used to index the database. As the NSRL provides both hash values in the database, both types may exist. The index file must be in the same directory as the database file.
If two consecutive hash database entries have the same hash, then only the first is added to the index file. The lookup process will find the additional ones in the database.
The first entry in the index file is the header. It has the same general format as other entries, but its hash value is longer than a real hash and is all 0's, which ensures that it will always be at the top when the temporary file is sorted. Instead of an offset value, the entry contains the type of database that the index is for. For example:
00000000000000000000000000000000000000000|nsrl
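The indexing steps can be sketched in Python as follows. This is not the 'hfind' implementation: it assumes an md5sum-style database (one "hash  name" pair per line), sorts in memory instead of running the UNIX 'sort' tool on a temporary file, lowercases the hashes as a choice of the sketch, and takes the all-zero header string as a parameter because the exact header width is not derived here. The function name 'build_index' and the sample file names are hypothetical.

```python
import io


def build_index(db, db_type, header_hash):
    """Return sorted index lines for an md5sum-style database.

    db          -- binary file object; each line is b"<hash>  <name>\n"
    db_type     -- string stored in the header line, e.g. "md5sum"
    header_hash -- all-zero string that is longer than a real hash
                   (assumed value, supplied by the caller)
    """
    lines = [f"{header_hash}|{db_type}\n"]
    prev_hash = None
    offset = 0
    for line in db:
        if line.strip():
            # Lowercasing is this sketch's choice, to keep the sort consistent.
            hash_hex = line.split()[0].decode("ascii").lower()
            # Only the first of consecutive entries with the same hash is indexed.
            if hash_hex != prev_hash:
                lines.append(f"{hash_hex}|{offset:016d}\n")
            prev_hash = hash_hex
        offset += len(line)
    # sorted() stands in for the 'sort' pass; the longer all-zero
    # header value sorts ahead of every real hash.
    return sorted(lines)


db = io.BytesIO(
    b"50b9a98a606b77ab2bfdd7814d842e76  first.bin\n"
    b"50b9a98a606b77ab2bfdd7814d842e76  copy-of-first.bin\n"
    b"356c2b641189145b1180bfdba0245f16  second.bin\n"
)
index = build_index(db, "md5sum", "0" * 33)
```

The resulting index has three lines: the header, the '356c...' entry, and a single '50b9...' entry that points at offset 0. The duplicate entry that follows it in the database is left for the lookup process to find.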
When a hash value is looked up, a generic method (tm_lookup()) is used. Using the database name, the index file is located based on the hash type that was given. For example, if 'linux.dat' is given as the database name, then 'linux.dat-md5.idx' is used as the index name. The header line is read from the index file to determine what type of hash database will be used. A binary search is then performed on the index file. To find the middle entry for the first round of the algorithm, the size of the file is determined, the size of the header line is subtracted, and the size is divided by 2. The size of the file minus the size of the header should be a multiple of 50 or 58, depending on which hash was used for the index.
A binary search is performed on the entries until the hash is found or until the algorithm is complete. If the hash is found, the offset is passed to a database-specific method, which parses the original database to identify the name of the file. The original database is also examined for additional entries that follow the first one. In addition, once the binary search has found the hash in the index file, the previous and next entries in the index file are examined, because other entries for the same hash will exist in the index file if the original database was not sorted.
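The lookup over the fixed-width index can be sketched in Python as below. It is an illustration under stated assumptions, not the code behind tm_lookup(): the function name 'lookup' is hypothetical, the index is assumed to be sorted in plain byte order with lowercase hashes, and the sample index (including its 33-digit header) is made up for the example. The sketch returns the database offsets and leaves the database-specific parsing out.

```python
import io


def lookup(index, hash_hex):
    """Binary-search a fixed-width index; return every offset for hash_hex.

    index -- seekable binary file object: one header line followed by
             equal-length b"<hash>|<16-digit offset>\n" entries
    """
    target = hash_hex.lower().encode("ascii")
    entry_len = len(target) + 1 + 16 + 1  # 50 for MD5, 58 for SHA-1
    index.seek(0)
    header_len = len(index.readline())    # header size is subtracted below
    size = index.seek(0, io.SEEK_END)
    count = (size - header_len) // entry_len

    def entry(i):
        """Read entry i and return (hash bytes, database offset)."""
        index.seek(header_len + i * entry_len)
        h, _, off = index.read(entry_len).rstrip(b"\n").partition(b"|")
        return h, int(off)

    lo, hi = 0, count - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        h, off = entry(mid)
        if h == target:
            offsets = [off]
            # Check the previous and next entries for the same hash.
            i = mid - 1
            while i >= 0 and entry(i)[0] == target:
                offsets.insert(0, entry(i)[1])
                i -= 1
            i = mid + 1
            while i < count and entry(i)[0] == target:
                offsets.append(entry(i)[1])
                i += 1
            return offsets
        if h < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return []


sample_index = io.BytesIO(
    b"0" * 33 + b"|md5sum\n"
    b"356c2b641189145b1180bfdba0245f16|0000000000000096\n"
    b"50b9a98a606b77ab2bfdd7814d842e76|0000000000000000\n"
    b"50b9a98a606b77ab2bfdd7814d842e76|0000000000000310\n"
)
found = lookup(sample_index, "50b9a98a606b77ab2bfdd7814d842e76")
single = lookup(sample_index, "356c2b641189145b1180bfdba0245f16")
missing = lookup(sample_index, "917c4a96fc6bd589fe522765391077e8")
```

In this sample, 'found' holds both offsets for the repeated hash, 'single' holds one offset, and 'missing' is an empty list, which corresponds to the "Hash Not Found" case.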
This article has given an overview of how the 'hfind' tool works and the inner details of how the index file is created, what it contains, and how it is used. Hashes are useful for quickly identifying files, and as databases become larger, quick lookup methods are needed. The index file allows fast lookups to be performed even in very large databases.
The format of the index file may change in the coming months. I have learned that the NSRL may be broken up into smaller files instead of a single, huge file. Therefore, the index file will likely be updated to have a field for the file that the hash came from.
HashKeeper
http://www.hashkeeper.org
NIST NSRL
http://www.nsrl.nist.gov
The Sleuth Kit
http://www.sleuthkit.org/sleuthkit/index.php
The Sleuth Kit Informer Issue #6
http://www.sleuthkit.org/informer/sleuthkit-informer-6.html