Fast fixed-memory hash table.
|_| _ |_|
| |(/_ |
He4
Branch | Status |
---|---|
master | |
dev |
Branch Policy:
- The "master" branch is intended to be stable at all times.
- The "dev" branch is the one where all contributions must be merged before
being promoted to master.
- If you plan to propose a patch, please commit into the "dev" branch or a separate feature branch. Direct commit to "master" will not be accepted.
The idea of this library is to create a fast in-memory hash table implementation in C99. The following are the design criteria.
- This is a fixed-size hash table; new memory is never allocated by this library once the initial table is created, but you can force a rehash.
- Because the table is fixed-size, it can fill up. If it fills up, then there are two possible outcomes: new entries are dropped, or older entries are dropped.
- A default hash algorithm is supplied; you can use your own. The default is the xxHash algorithm.
Why? Well, embedded systems with fixed memory may need hash tables, too.
This library uses semantic versioning. The version number is set
in the CMakeLists.txt
file and can be obtained at runtime with the
he4_version
function.
You may contribute to this project by submitting issues or feature requests, forking the project and submitting pull requests, and by contributing to the meagre documentation. All help is welcome!
Please report vulnerabilities on the issue tracker with highest priority.
No specific style guide is adopted for this project. The closest is this guide.
This project is built with CMake and the API documentation is built with Doxygen.
"Out of source" builds are supported by CMake, and to build the static and dynamic libraries for your platform, along with the API documentation, you can do the following from the project root folder.
mkdir build
cd build
cmake ..
make
This populates doc/api/html
with the API documentation, with all other
build products in the build
folder.
To build the examples that come with this library, do the following from the
build
folder.
make examples
The count
example counts the occurrences of lines in a file and write the
result. It provides an example of a table that maps char *
keys to concrete
values (that is, values that are not pointers). An example input file
wordlist.txt
is provided.
To instead create an Eclipse project, a Ninja file, or something else, use
a generator. The command cmake -h
will list the generators supported by your
version of CMake.
These instructions should work out of the box for Linux and OS X.
You can get a Windows standalone SDK (for Windows 10 you can find it here).
Be sure to open Visual Studio and create at least one minimal C project so that any "on demand" components are installed. Then do the following from the command prompt.
mkdir build
cd build
cmake ..
At this point you should have a .sln
file you can open in Visual Studio.
This builds just fine with Windows 10 and the Community Edition of Visual Studio.
This table implementation uses open addressing with linear probing (because linear probing is simply faster to compute, even though it can lead to clustering). As such, you should allocate enough space that your fixed-size table will never have a load factor above 0.7.
This implementation uses lazy deletion, which makes deletion fast, but can increase search time. To improve on this, any successful search that has passed a lazily deleted entry will relocate the found entry to the first open slot. Thus found items "move to the front of the line" to improve search time.
Be aware! If the table becomes full your performance is going to be horrible. Again, a full table will have horrible performance, since it cannot automagically rehash. Paradoxically this is worse as the table becomes larger, because every search may have to go through the entire table to find an entry.
To use the library #include <he4.h>
and then make use of the functions
found therein. If you are happy with your platform's malloc
and are not
constrained by library restrictions, then there is little else you need to
think about (but do read the following two sections on handling full tables).
A sample program that counts the entries in a file is in the examples
folder.
Create a table with he4_new
and dispose of it when done with he4_delete
.
The table methods manage the deallocation of the keys and values.
// Make a new table with the given size and the defaults. The table
// will map widgets to things.
#define HE4_KEY_TYPE widget *
#define HE4_ENTRY_TYPE thing *
#include <he4.h>
// ...
HE4 * table = he4_new(size, NULL, NULL, NULL, NULL);
// Use the table...
he4_delete(table);
Insert items with he4_insert
and remove them with he4_remove
(to get the
value back) or he4_discard
(to let the library free it).
Find an entry in the table with he4_get
. Alternately use he4_find
, which
returns a pointer to the item found in the table, as opposed to the item
itself. This allows modification of the item in place in the table, and can
improve performance significantly by avoiding multiple searches. See the
example basics.c
.
Of course you can force a rehash. Code like the following is recommended,
if you have memory to spare. Be aware that the rehash function returns NULL
to signal that it cannot get memory for the new table.
if (he4_load(table) > 0.7) {
HE4 * newtable = he4_rehash(table, 0);
table = newtable == NULL ? table : newtable;
}
The second argument (0
) uses the default new capacity, which is double the
old capacity.
The original table is freed, but only if the new table is successfully constructed. See the next section for how you might use this strategy.
By default the library adds a field to each entry called the touch index. The table has a stored maximum value. Whenever an entry is found or inserted, its touch index is set to the table maximum and the maximum is then incremented. The item with the lowest touch index at any time is the least-recently-used item.
You can use this to clean up the table. Set a threshold and trim the table by
discarding everything with touch index below the threshold. See
he4_trim
. The new table's touch indices are debited by the threshold (so
least-recently-used information is preserved).
You can also use this when rehashing. Set a threshold and rehash to the same
size table. The items with touch index below the threshold will be removed.
See he4_trim_and_rehash
.
Maintaining the touch index comes with a slight cost in time and space. If you
do not want this, then uncomment the appropriate line in CMakeLists.txt
.
// Delete old entries until the table is less than half full.
while (he4_load(table) > 0.5) {
HE4 * newtable = he4_trim(table, table->max_touch / 4);
table = newtable == NULL ? table : newtable;
}
The library includes Doug Lea's malloc
implementation. If you
want to use it, then uncomment the correct line in CMakeLists.txt
.
Alternately you can just #define
the macros HE4MALLOC
and HE4FREE
as
indicated in he4.h
. In all these cases, stdlib.h
is no longer included and
no longer required.
The debugging facility uses stdio.h
. You can #define NODEBUG
to eliminate
this dependency.
Both stdint.h
and stdbool.h
are required, but are also easily replaced.
Building requires string.h
, and the string library is needed to run. It
should not be hard to replace it if you need to.
If you want debugging but can't tolerate stdio.h
, have a look at
this library, or at this implementation.
Working on this. For now you're stuck. Right now you can uncomment
set(NO_STD_LIB 1)
in the CMakeLists.txt
and compile, but you will have
a few undefined symbols. On my Mac I get a few undefined symbols.
___error
___memcpy_chk
_abort
_memcmp
_memcpy
_memset
_mmap
_munmap
_sysconf
This is licensed under the two-clause "simplified" BSD license. See the LICENSE file in the distribution.
xxHash is BSD licensed. See the LICENSE-xxHash file in the distribution. The branch policy was taken from xxHash.
Doug Lea's malloc is in the public domain.
The Helium creep image used in the generated documentation is from Wikipedia. They determine our reality, and the image is licensed under the Creative Commons Attribution Share Alike 3.0 License.