Refactor hntrie to avoid the need for boundary cells
Whereas before the string segment was encoded as:

LL OOOOOOOOOOOO

where L are the upper 8 bits, used to encode the length
of the segment, and O are the lower 24 bits, used to
encode the offset of the string data in the character
buffer, the new code encodes it as follows:

OOOOOOOOOOOO LL

Furthermore, the most significant bit of the length
LL is now used to mark whether the current string segment
is a label boundary.
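
A minimal sketch of the new layout, assuming a 32-bit cell value with the offset in the upper 24 bits, the boundary flag in bit 7, and the length in bits 0-6. The names `packSegment` and `unpackSegment` are hypothetical and are not taken from the hntrie source:

```js
// Assumed layout (from the description above, not from the actual code):
//   bits 31..8  offset of the string data in the character buffer
//   bit  7      1 when the segment is a label boundary
//   bits 6..0   length of the segment (0..127)
const BOUNDARY_BIT = 0x80;
const LENGTH_MASK = 0x7F;
const MAX_OFFSET = 0xFFFFFF;

// Hypothetical helper: pack the three fields into one unsigned 32-bit value.
function packSegment(offset, length, isBoundary) {
    if ( length < 0 || length > LENGTH_MASK ) {
        throw new RangeError('segment length must fit in 7 bits');
    }
    if ( offset < 0 || offset > MAX_OFFSET ) {
        throw new RangeError('offset must fit in 24 bits');
    }
    const flag = isBoundary ? BOUNDARY_BIT : 0;
    // `>>> 0` keeps the result unsigned when bit 31 is set.
    return ((offset << 8) | flag | length) >>> 0;
}

// Hypothetical helper: recover the three fields from a packed value.
function unpackSegment(info) {
    return {
        offset: info >>> 8,
        length: info & LENGTH_MASK,
        isBoundary: (info & BOUNDARY_BIT) !== 0,
    };
}

console.log(unpackSegment(packSegment(1024, 11, true)));
```

Keeping the flag and the length in the same low byte means a single read of the cell yields everything needed to compare the segment and to know whether a label ends there.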

This means a cell can't reference a segment longer than
127 characters. To work around this limitation when a
segment is longer than 127 characters (a rare occurrence),
the algorithm simply splits the segment into multiple
adjacent cells.
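
The splitting described above can be sketched as follows. This is an illustration only: `splitSegment` is a hypothetical helper, and carrying the boundary flag on the last chunk is an assumption, not something stated in this commit message.

```js
// Largest length that fits in the 7 bits left once the top bit of the
// length byte is used as the boundary flag.
const MAX_SEGMENT_LENGTH = 127;

// Hypothetical helper: describe one logical segment as a list of chunks,
// each short enough to be referenced by a single cell.
function splitSegment(offset, length, isBoundary) {
    const chunks = [];
    let chunkOffset = offset;
    let remaining = length;
    while ( remaining > 0 ) {
        const chunkLength = Math.min(remaining, MAX_SEGMENT_LENGTH);
        remaining -= chunkLength;
        chunks.push({
            offset: chunkOffset,
            length: chunkLength,
            // Assumption: only the final chunk marks the label boundary.
            isBoundary: isBoundary && remaining === 0,
        });
        chunkOffset += chunkLength;
    }
    return chunks;
}

console.log(splitSegment(10, 300, true));
```

A segment of 127 characters or fewer yields a single chunk, so the common case stays a single cell.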

As a result, there is no longer a need to encode
"boundariness" into special cells, which simplifies
both the storing and matching algorithms.

Additionally, added minimal documentation for the NPM
package on how to import and use HNTrieContainer as a
standalone API.
gorhill committed Aug 10, 2021
1 parent a3f430e commit c6fb70b
Showing 8 changed files with 370 additions and 249 deletions.
62 changes: 62 additions & 0 deletions platform/nodejs/README.md
@@ -94,3 +94,65 @@ It is possible to pre-parse filter lists and save the intermediate results for
later use -- useful to speed up the loading of filter lists. This will be
documented eventually, but if you feel adventurous, you can look at the code
and use this capability now if you figure out the details.

---

## Extras

You can directly use specific APIs exposed by this package. Here are some of
them, which are used internally by uBO's SNFE.

### `HNTrieContainer`

A well-optimised [compressed trie](https://en.wikipedia.org/wiki/Trie#Compressing_tries)
container specialized to store and look up hostnames.

The matching algorithm is designed for hostnames, i.e. the hostname labels
making up a hostname are matched from right to left, such that `www.example.org`
will be a match if `example.org` is stored in the trie, while
`anotherexample.org` won't be a match.
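
To make the label-wise semantics concrete, here is a naive reference sketch that uses a `Set` instead of a trie. It is not how `HNTrieContainer` is implemented, and the function name `naiveMatches` is hypothetical; it only illustrates that a match must start at a label boundary. It scans candidate suffixes from the left for simplicity, whereas the trie walks labels from right to left.

```js
// Naive illustration only: checks whether the hostname, or any suffix of it
// that starts right after a '.', is present in the set of stored hostnames.
// Returns the position at which the match starts, or -1 when nothing matches.
function naiveMatches(storedHostnames, hostname) {
    let pos = 0;
    for (;;) {
        if ( storedHostnames.has(hostname.slice(pos)) ) { return pos; }
        const dot = hostname.indexOf('.', pos);
        if ( dot === -1 ) { return -1; }
        // Only positions following a dot are valid label boundaries.
        pos = dot + 1;
    }
}

const stored = new Set([ 'example.org' ]);
console.log(naiveMatches(stored, 'www.example.org'));
console.log(naiveMatches(stored, 'anotherexample.org'));
```

Because candidate positions only ever follow a dot, `anotherexample.org` is never compared against `example.org` in the middle of a label.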

`HNTrieContainer` is designed to store a large number of hostnames with CPU and
memory efficiency as a main concern -- and is a key component of uBO.

To create and use a standalone `HNTrieContainer` object:

```js
import HNTrieContainer from '@gorhill/ubo-core/js/hntrie.js';

const trieContainer = new HNTrieContainer();

const aTrie = trieContainer.createOne();
aTrie.add('example.org');
aTrie.add('example.com');

const anotherTrie = trieContainer.createOne();
anotherTrie.add('foo.invalid');
anotherTrie.add('bar.invalid');

// matches() returns the position at which the match starts, or -1 when
// there is no match.

// Matches: returns 4
console.log("aTrie.matches('www.example.org')", aTrie.matches('www.example.org'));

// Does not match: returns -1
console.log("aTrie.matches('www.foo.invalid')", aTrie.matches('www.foo.invalid'));

// Does not match: returns -1
console.log("anotherTrie.matches('www.example.org')", anotherTrie.matches('www.example.org'));

// Matches: returns 0
console.log("anotherTrie.matches('foo.invalid')", anotherTrie.matches('foo.invalid'));
```

The `reset()` method must be used to remove all the tries from a trie container;
you can't remove a single trie from the container.

```js
trieContainer.reset();
```

After you reset a trie container, references to tries created before the reset
are no longer valid, i.e. `aTrie` and `anotherTrie` shouldn't be used following
a reset.
5 changes: 3 additions & 2 deletions platform/nodejs/package.json
@@ -1,6 +1,6 @@
{
  "name": "@gorhill/ubo-core",
  "version": "0.1.7",
  "version": "0.1.8",
  "description": "To create a working instance of uBlock Origin's static network filtering engine",
  "type": "module",
  "main": "index.js",
@@ -15,7 +15,8 @@
  "keywords": [
    "uBlock",
    "uBO",
    "adblock"
    "adblock",
    "trie"
  ],
  "author": "Raymond Hill",
  "license": "GPL-3.0-or-later",
62 changes: 48 additions & 14 deletions platform/nodejs/test.js
@@ -33,6 +33,8 @@ import {
    StaticNetFilteringEngine,
} from './index.js';

import HNTrieContainer from './js/hntrie.js';

/******************************************************************************/

function fetch(listName) {
@@ -42,7 +44,7 @@ function fetch(listName) {
    });
}

function runTests(engine) {
function testSNFE(engine) {
    let result = 0;

    // Tests
@@ -77,33 +79,65 @@ function runTests(engine) {
    }
}

async function main() {
    try {
        const result = await enableWASM();
        if ( result !== true ) {
            console.log('Failed to enable all WASM code paths');
        }
    } catch(ex) {
        console.log(ex);
    }

async function doSNFE() {
    const engine = await StaticNetFilteringEngine.create();

    await engine.useLists([
        fetch('easylist').then(raw => ({ name: 'easylist', raw })),
        fetch('easyprivacy').then(raw => ({ name: 'easyprivacy', raw })),
    ]);

    runTests(engine);
    testSNFE(engine);

    const serialized = await engine.serialize();
    engine.useLists([]);

    runTests(engine);
    testSNFE(engine);

    await engine.deserialize(serialized);

    runTests(engine);
    testSNFE(engine);
}

async function doHNTrie() {
    const trieContainer = new HNTrieContainer();

    const aTrie = trieContainer.createOne();
    aTrie.add('example.org');
    aTrie.add('example.com');

    const anotherTrie = trieContainer.createOne();
    anotherTrie.add('foo.invalid');
    anotherTrie.add('bar.invalid');

    // matches() returns the position at which the match starts, or -1 when
    // there is no match.

    // Matches: returns 4
    console.log("aTrie.matches('www.example.org')", aTrie.matches('www.example.org'));

    // Does not match: returns -1
    console.log("aTrie.matches('www.foo.invalid')", aTrie.matches('www.foo.invalid'));

    // Does not match: returns -1
    console.log("anotherTrie.matches('www.example.org')", anotherTrie.matches('www.example.org'));

    // Matches: returns 0
    console.log("anotherTrie.matches('foo.invalid')", anotherTrie.matches('foo.invalid'));
}

async function main() {
    try {
        const result = await enableWASM();
        if ( result !== true ) {
            console.log('Failed to enable all WASM code paths');
        }
    } catch(ex) {
        console.log(ex);
    }

    await doSNFE();
    await doHNTrie();

    process.exit();
}
2 changes: 1 addition & 1 deletion src/js/background.js
@@ -155,7 +155,7 @@ const µBlock = { // jshint ignore:line
    // Read-only
    systemSettings: {
        compiledMagic: 37, // Increase when compiled format changes
        selfieMagic: 37, // Increase when selfie format changes
        selfieMagic: 38, // Increase when selfie format changes
    },

    // https://github.com/uBlockOrigin/uBlock-issues/issues/759#issuecomment-546654501