LMDB freelist illustrated guide

How to regenerate illustrations

Illustrations used in this guide are auto-generated from actual LMDB databases created by test code. They therefore reflect the behaviour of the LMDB code currently embedded into turbo-geth via the lmdb-go repository: https://github.com/ledgerwatch/lmdb-go. To identify each illustration, this guide gives its name and shows the test code that generates the database for it. The illustrations can be reproduced by running:

make hack
./build/bin/hack --action defrag

This creates the .dot and .png files in the current directory.

Empty database

Test code does nothing:

func nothing(kv ethdb.KV, _ ethdb.Tx) (bool, error) {
	return true, nil
}

Here is what we see:

vis1_0

Each LMDB database has a meta page (actually it has two meta pages, but any reader or writer always chooses the one that was updated last). The meta page contains lots of interesting information, but the most important pieces for this guide are two "page pointers": FREE_DBI and MAIN_DBI. By page pointer here we mean a 64-bit page ID, which also happens to be the offset of that page in the data.mdb file, divided by the page size (4096 bytes). This means that the page with ID 0 has offset 0 * 4096 = 0 from the beginning of the file (this happens to be the first meta page), the page with ID 1 has offset 4096 (this is the second meta page), the page with ID 2 has offset 2 * 4096 = 8192, and so on.
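
This relationship is pure arithmetic; here is a minimal sketch (not part of the LMDB API, just an illustration of the ID-to-offset mapping):

const pageSize = 4096 // LMDB default page size in bytes

// pageOffset returns the byte offset of a page within data.mdb, given its page ID.
func pageOffset(pageID uint64) uint64 {
	return pageID * pageSize
}

// pageOffset(0) == 0 (first meta page), pageOffset(1) == 4096 (second
// meta page), pageOffset(2) == 8192, and so on.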

Q: How do readers and writers choose which of the two meta pages to use? One of the fields within a meta page is the ID of the last transaction that has updated that meta page. Transactions with even IDs update the meta page 0, and transactions with odd IDs update the meta page 1. So readers and writers look at both meta pages and simply choose the one that has the higher transaction ID.
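
A sketch of that selection rule (the struct and field names are illustrative, not the actual LMDB C structures):

// metaPage is a simplified model of an LMDB meta page.
type metaPage struct {
	txnID uint64 // ID of the last transaction that updated this meta page
}

// chooseMeta picks the meta page that was updated last, i.e. the one
// carrying the higher transaction ID.
func chooseMeta(meta0, meta1 metaPage) metaPage {
	if meta0.txnID > meta1.txnID {
		return meta0
	}
	return meta1
}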

One table, two records in it

Test code that generates it:

func generate2(tx ethdb.Tx, entries int) error {
	c := tx.Cursor("t")
	defer c.Close()
	for i := 0; i < entries; i++ {
		k := fmt.Sprintf("%05d", i)
		if err := c.Append([]byte(k), []byte("very_short_value")); err != nil {
			return err
		}
	}
	return nil
}

This function takes a parameter entries, which in this example is 2, creating two entries. And here is the result:

vis2_0

The test code snippet above gives the impression that there is only one DB transaction happening. Note, however, that the code does not create the table t, it starts using it (by making a cursor for it), assuming that it has already been created. This table was in fact created by some "convenience" wrappers, in the transaction with ID 1000001. The transaction shown in the snippet is the one with ID 1000002.

Q: Why do transaction IDs in an empty database start with 1000001 and not with 1? This is a recent change that we have made to our version of LMDB. We have noticed that the behaviour of LMDB regarding free space depends (surprisingly) on the numerical value of the transaction ID, and the lower the ID, the more problematic the behaviour becomes for large amounts of free space (this will be explained later, because it is actually one of the main reasons this guide is being written in the first place). So a simple fix is to start with a higher transaction ID; the value 1 million is round, so people will notice straight away that it is not random, and it is high enough that we should not see the problem we were seeing for amounts of free space up to 2TB.

First of all, let's list all the pages that we can see (or not see) on the picture. We know that pages 0 and 1 are meta pages, and are always present. Page 2 is mentioned in a freelist record, inside the blue box. To make the illustrations clearer, we use blue boxes to show pages belonging to FREE_DBI, which is a sub-database storing the collection of so-called freelists. On the picture above, the FREE_DBI sub-database resides on one page, page 5 (this is the value of the "page pointer" FREE_DBI). It contains one freelist, currently associated with the transaction ID 1000002. And this freelist has only one element, which is page 2. The fact that page 2 is on the freelist means that in subsequent transactions, page 2 can be recycled, i.e. filled up with some other data and/or structures.
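
Conceptually, each FREE_DBI record maps a transaction ID to a list of freed page IDs; a simplified model (types and names are ours, for illustration only):

// freelistRecord is a simplified model of one FREE_DBI entry: the key
// is a transaction ID, the value is the list of freed page IDs.
type freelistRecord struct {
	txnID uint64   // transaction this freelist is associated with
	pages []uint64 // page IDs that later transactions may recycle
}

// For the picture above: one record, transaction 1000002, page 2 freed.
var record = freelistRecord{txnID: 1000002, pages: []uint64{2}}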

Q: If a freelist is associated with some transaction ID, does it mean that that transaction has created this freelist (meaning that it has freed the pages mentioned in it)? In many cases yes, but not always. It will be shown later that freelists are sometimes associated with transaction IDs of transactions that did not create them, or indeed with transactions that never happened (for example, a freelist can be associated with transaction ID 999997, which never occurs, because we start with 1000001). This will be explained in more detail later.

We continue listing all the pages. Page 3 is the "root page" of the sub-database for the table t. This can be seen from the yellow box. To make the illustrations clearer, we use yellow boxes to show pages belonging to MAIN_DBI, which is a sub-database storing a collection of further sub-databases we call "tables". Pages for data and structure belonging to the tables are shown in brown (for branches) and green (for leaves). So page 3 is actually that green box with two entries in it. It belongs to the table t. And finally, page 4 is the yellow box containing the only entry inside MAIN_DBI. You can see that the yellow box is page 4 by looking at the "page pointer" MAIN_DBI coming from the meta page.

In this small example, neither MAIN_DBI nor the table t is large enough to require any so-called "branch pages". All information fits into single pages of 4096 bytes (the LMDB default). In the leaf page of the table t, one can see keys and values separated by colons.

61 empty tables

Test code snippet:

func generate3(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
	for i := 0; i < 61; i++ {
		k := fmt.Sprintf("table_%05d", i)
		if err := tx.(ethdb.BucketMigrator).CreateBucket(k); err != nil {
			return false, err
		}
	}
	return true, nil
}

This code creates 61 empty tables within a single writeable transaction. Why 61? Because this is just enough not to fit on a single page, so that we can see how MAIN_DBI begins to have branch pages. Here is how it looks:

vis3_0

To keep the illustrations small, we use some "folding" of repeated data, shown here as ...53 more records here.... This allows us to create examples that are large enough to have certain features (like branch pages), without overburdening the illustrations.

Branch pages of the MAIN_DBI sub-database are shown in purple. We can see such a branch page with ID 4 (it is on the MAIN_DBI "page pointer"). One may notice that there is no FREE_DBI sub-database here, because there are no freelists. There was just one single transaction (we did not need the table t pre-created like before), and it could not have freed anything up, because it started with an empty database.

One table with 200 entries

It is the same code as we saw previously:

func generate2(tx ethdb.Tx, entries int) error {
	c := tx.Cursor("t")
	defer c.Close()
	for i := 0; i < entries; i++ {
		k := fmt.Sprintf("%05d", i)
		if err := c.Append([]byte(k), []byte("very_short_value")); err != nil {
			return err
		}
	}
	return nil
}

But this time, we pass the value 200 as entries. And here is the result:

vis4_0

As in the example with one table and two records, we see one freelist, exactly the same as before, and for the same reason (the table t used to be empty, so the page containing the pointer to it had to be replaced with a new page).

Since we now have 200 entries in the table t, and they no longer fit into a single page, we observe a branch page (brown box), as well as two leaf pages (green boxes). Nothing else of much interest here.

One table, one key with a short value, one key with a very long value

Here is the test code that generates it:

func generate4(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
	c := tx.Cursor("t")
	defer c.Close()
	if err := c.Append([]byte("k1"), []byte("very_short_value")); err != nil {
		return false, err
	}
	if err := c.Append([]byte("k2"), []byte(strings.Repeat("long_value_", 1000))); err != nil {
		return false, err
	}
	return true, nil
}

The value for the second key ends up being 11 * 1000 = 11000 bytes long, which clearly does not fit on a single page, but would fit on three pages. And this is indeed how it looks:

vis5_0

When confronted with such large values, LMDB stores a 64-bit number in their place, which is interpreted as the ID of the first page where the actual value resides. The pages where the actual value resides are called "overflow pages". LMDB insists that overflow pages belonging to the same value have consecutive page IDs, i.e. are stored together in the data.mdb file. We suspect that this is done to avoid "jumping around" the database file looking for the pieces of a value. In this example, the first overflow page ID is 4 (looking at OVERFLOW(4)), which means that pages with IDs 4, 5, and 6 contain that very long value. This matches the rest of the picture, because page 7 is the yellow box, page 8 is the blue box, page 2 is free (it is on the freelist), and page 3 is the first green box.
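
The number of overflow pages needed can be estimated with simple ceiling division (a sketch; the real LMDB formula also reserves room for the page header on the first page, which we ignore here):

const pageSize = 4096

// overflowPages estimates how many consecutive overflow pages a value
// of the given size occupies, ignoring page header overhead.
func overflowPages(valueSize int) int {
	return (valueSize + pageSize - 1) / pageSize // ceiling division
}

// overflowPages(11000) == 3, matching pages 4, 5 and 6 on the picture.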

We are now coming closer to the main motivation for writing this guide. Overflow pages present serious challenges when combined with the specific way the freelists are handled in LMDB.

Q: If overflow pages present a serious challenge, would it make sense to try to avoid them? Yes, this is what we will definitely attempt to do. We wish we had known all this much earlier.

One DupSort table, and keys with three, two and one value

Here is the test code:

func generate5(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
	c := tx.CursorDupSort("t")
	defer c.Close()
	if err := c.AppendDup([]byte("key1"), []byte("value11")); err != nil {
		return false, err
	}
	if err := c.AppendDup([]byte("key1"), []byte("value12")); err != nil {
		return false, err
	}
	if err := c.AppendDup([]byte("key1"), []byte("value13")); err != nil {
		return false, err
	}
	if err := c.AppendDup([]byte("key2"), []byte("value21")); err != nil {
		return false, err
	}
	if err := c.AppendDup([]byte("key2"), []byte("value22")); err != nil {
		return false, err
	}
	if err := c.AppendDup([]byte("key3"), []byte("value31")); err != nil {
		return false, err
	}
	return true, nil
}

vis6_0

It appears that the DupSort feature of LMDB has been inherited from its predecessor, BerkeleyDB, in a somewhat more restricted form. This feature allows having multiple values per key (therefore Dup). In BerkeleyDB, values for such "duplicated" keys could be stored either unsorted or sorted, but in LMDB they are always stored sorted. We suspect that the main motivation for bringing this feature to LMDB is simpler implementation of secondary indices. If you use a table for mapping Key => (Attr1, Attr2, Attr3), and then you want to have a secondary index Attr2 => Key, it is very likely that for many values of Attr2 there will be more than one matching Key. So you can use the DupSort feature to store such an index, as sketched below. LMDB will not automatically keep secondary indices consistent with the "primary" table, though.
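
Here is a sketch of the secondary-index idea, in the style of the test code above. The table name idx_attr2 and the keys and values are made up for illustration; as before, AppendDup requires the entries to arrive in sorted order:

func buildIndex(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
	// Hypothetical DupSort table idx_attr2 maps Attr2 => Key, so
	// several Keys can share the same Attr2 value.
	idx := tx.CursorDupSort("idx_attr2")
	defer idx.Close()
	// Two different keys whose records both have Attr2 == "blue":
	if err := idx.AppendDup([]byte("blue"), []byte("key1")); err != nil {
		return false, err
	}
	if err := idx.AppendDup([]byte("blue"), []byte("key7")); err != nil {
		return false, err
	}
	return true, nil
}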

The picture simply demonstrates a key with a single value (key3), a duplicated key with two values (key2), and a duplicated key with three values (key1). All three reside on the same page.

There is, however, one restriction on DupSort tables that makes them relevant to our challenges with freelists and overflow pages. The size of any value associated with a key in such a table may not exceed 512 bytes. This means that for DupSort tables, overflow pages never occur. This might actually be the reason for introducing such a restriction - overflow pages might have been too complicated to combine with a large number of values for the same duplicated key, as shown in the next example.

One DupSort table, with a large number of values for one of the keys

Test code:

func generate6(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
	c := tx.CursorDupSort("t")
	defer c.Close()
	for i := 0; i < 1000; i++ {
		v := fmt.Sprintf("dupval_%05d", i)
		if err := c.AppendDup([]byte("key1"), []byte(v)); err != nil {
			return false, err
		}
	}
	if err := c.AppendDup([]byte("key2"), []byte("value21")); err != nil {
		return false, err
	}
	if err := c.AppendDup([]byte("key2"), []byte("value22")); err != nil {
		return false, err
	}
	if err := c.AppendDup([]byte("key3"), []byte("value31")); err != nil {
		return false, err
	}
	return true, nil
}

The difference between this illustration and the previous one is that key1 has 1000 values instead of three. This number of values does not fit in a single page, so this demonstrates what LMDB does in such cases:

vis7_0

It creates a sub-database for every such key. In this example, the sub-database for key1 has one branch page and a few leaf pages. It is important to note that such sub-databases are only created when the values for one key do not fit in a single page. In this sense, a sub-database is similar to overflow pages. However, the crucial difference is that, since every individual value cannot be more than 512 bytes long, LMDB does not require the pages of such sub-databases to be laid out consecutively. This makes DupSort sub-databases "exempt" from the challenges with the freelists that very long values have (due to overflow pages).

One table with 1000 records, then cleared out

The test code is a sequence of two operations, the first to generate 1000 records, using the same code as before, but called with entries = 1000:

func generate2(tx ethdb.Tx, entries int) error {
	c := tx.Cursor("t")
	defer c.Close()
	for i := 0; i < entries; i++ {
		k := fmt.Sprintf("%05d", i)
		if err := c.Append([]byte(k), []byte("very_short_value")); err != nil {
			return err
		}
	}
	return nil
}

and the second - to clear the table:

func dropT(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
	if err := tx.(ethdb.BucketMigrator).ClearBucket("t"); err != nil {
		return false, err
	}
	return true, nil
}

So in this example we have two pictures, one before the clearing out:

vis8_0

and another - after the clearing out:

vis8_1

The first picture looks similar to what we have seen before (but this time with 1000 records instead of 200), while the second one displays a larger freelist than we have seen before. The syntax is similar to how you would specify pages when you print a document: there are intervals and individual pages, separated by commas. In this example, 15,13-2(12) means that there are 13 free pages in this list, with IDs 15, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2. The first thing to notice is that the pages are listed in reverse order. This is how LMDB sorts them (we don't know why). The second thing to notice is that after each interval, like 13-2, there is the size of that interval in parentheses, like (12). The number in parentheses becomes important when the freelist gets used for recycling pages. If, for example, we required a consecutive run of 13 or 14 pages (for overflowing values), we would not be able to use this freelist; but if we only required 5 consecutive pages, then we could take pages 13, 12, 11, 10, 9, and reduce the remaining freelist to 15,8-2(7).
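
A sketch of searching for such a consecutive run in a freelist kept in descending order; this mirrors the idea, not the actual LMDB code:

// findRun returns the index where a run of n consecutive page IDs
// starts in a freelist sorted in descending order, or -1 if no such
// run exists.
func findRun(pages []uint64, n int) int {
	if len(pages) == 0 {
		return -1
	}
	if n <= 1 {
		return 0
	}
	runLen := 1
	for i := 1; i < len(pages); i++ {
		if pages[i-1] == pages[i]+1 {
			runLen++
		} else {
			runLen = 1
		}
		if runLen == n {
			return i - n + 1
		}
	}
	return -1
}

// For 15,13-2(12): findRun(pages, 5) finds pages 13..9, leaving
// 15,8-2(7); findRun(pages, 13) returns -1, because the longest run
// (13-2) is only 12 pages.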

Where did page 14 go? It is the one represented by the yellow box, now containing the info about the empty table t.

Two tables with 1000 records each, then one of the tables cleared out

As before, the test code is the sequence of two operations:

func generate7(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
	c1 := tx.Cursor("t1")
	defer c1.Close()
	c2 := tx.Cursor("t2")
	defer c2.Close()
	for i := 0; i < 1000; i++ {
		k := fmt.Sprintf("%05d", i)
		if err := c1.Append([]byte(k), []byte("very_short_value")); err != nil {
			return false, err
		}
		if err := c2.Append([]byte(k), []byte("very_short_value")); err != nil {
			return false, err
		}
	}
	return true, nil
}

func dropT1(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
	if err := tx.(ethdb.BucketMigrator).ClearBucket("t1"); err != nil {
		return false, err
	}
	return true, nil
}

And here are the two pictures, one before the clearing, and another - after the clearing:

vis10_0

vis10_1

If we look closely at the test code that inserts 1000 records into two tables, we notice that instead of inserting 1000 records into one table and then 1000 into the other, we insert each key/value pair into both tables, and then move on to the next pair. This is deliberate. We know that the two tables will reside in different sub-databases, so they will be allocated their own pages. When there is nothing in the freelist, new pages are allocated by extending the database, and the allocated page IDs increase monotonically: 4, 5, 6, 7, .... If we were to insert 1000 records into one table first, LMDB would likely allocate a consecutive run of pages to it, for example 4-13 or so. If we instead insert records into both tables, one pair at a time, it is more likely that one table will mostly get odd page IDs, and the other one - even page IDs. This is what we want - because we would like to create a so-called "fragmented" freelist, where there aren't many long consecutive runs of page IDs.

Now if one looks at the second picture, after the table t1 is cleared, it is obvious that we almost succeeded. Most pages previously occupied by the table t1 are now in the freelist, and the freelist looks fragmented, containing mostly odd page IDs (with some exceptions, like 22 and 6).

Two tables with 1000 records each, then both tables cleared out one after another

This example is based on the previous one, but the test code is the sequence of three operations instead of two. Here is the test code of the third operation:

func dropT2(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
	if err := tx.(ethdb.BucketMigrator).ClearBucket("t2"); err != nil {
		return false, err
	}
	return true, nil
}

The first two pictures in this example are identical to the ones in the previous example, therefore here is the third picture, showing the database after the table t2 is also cleared out:

vis11_2

Here we can see that because clearing out tables t1 and t2 happened in two separate transactions, we have two freelists. And, as designed, these freelists are fragmented.

100 tables, with 1000 records in each, then half of them are removed

Now we are starting to work with larger-scale examples. The test code is a sequence of three operations. The first operation creates 100 tables:

func generate8(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
	for i := 0; i < 100; i++ {
		k := fmt.Sprintf("table_%05d", i)
		if err := tx.(ethdb.BucketMigrator).CreateBucket(k); err != nil {
			return false, err
		}
	}
	return false, nil
}

The second operation populates them with 1000 records each (note that we use the same approach as before to make sure the tables do not occupy consecutive runs of pages):

func generate9(tx ethdb.Tx, entries int) error {
	var cs []ethdb.Cursor
	for i := 0; i < 100; i++ {
		k := fmt.Sprintf("table_%05d", i)
		c := tx.Cursor(k)
		defer c.Close()
		cs = append(cs, c)
	}
	for i := 0; i < entries; i++ {
		k := fmt.Sprintf("%08d", i)
		for _, c := range cs {
			if err := c.Append([]byte(k), []byte("very_short_value")); err != nil {
				return err
			}
		}
	}
	return nil
}

The third operation removes half of the tables, all of them with even numbers in their names. The removal is not performed in one single transaction; instead, each table is removed in a separate transaction:

func dropGradually(kv ethdb.KV, tx ethdb.Tx) (bool, error) {
	tx.Rollback() // release the outer transaction so that kv.Update can run its own transactions
	for i := 0; i < 100; i += 2 {
		k := fmt.Sprintf("table_%05d", i)
		if err := kv.Update(context.Background(), func(tx1 ethdb.Tx) error {
			return tx1.(ethdb.BucketMigrator).DropBucket(k)
		}); err != nil {
			return false, err
		}
	}
	return true, nil
}

The following picture shows the state of the database after the creation and population of the 100 tables with 1000 records each:

vis12_1

The small freelist with 3 free pages on it appeared because the second operation (populating the tables with records) replaced the 3 pages in the MAIN_DBI that used to contain the records for the empty tables.

The next picture shows what it looks like after half of the tables were removed, each table in its own transaction:

vis12_2