Preserve user's privacy with k-anonymity #25
Technically, no new endpoint would be required, if the input has a [...]. Storing hashes as binary blobs instead of strings is likely more efficient too; see for example https://stackoverflow.com/a/24011877 |
Smart, I don't know how well this will work with how few videoIDs are in this DB, but it definitely could work. |
On Fri, Aug 23, 2019 at 11:53:31AM -0700, Ajay Ramachandran wrote:
I don't know how well this will work with how few videoIDs are in
this DB
It doesn't matter when we're only returning a single result. That
doesn't compromise the user's privacy, because maybe they have requested
an ID we don't know about.
I think as a first step, it'd be great to modify the existing database to
add a `sha1sum` field. It would be good if its type were a binary blob
(so we can search on it fast), but I don't know if that's easy to handle
with JavaScript. The Easy Way™ would be to make it a plain old string,
and use a `LIKE %` query, but that is a relatively expensive operation.
What do you think?
|
Ah, true. I think a blob could work, since it could be converted from a string to a blob (base64 or something) |
One thing to note about this is it could be quite expensive, since it doesn't just send all sponsors for one video; it uses the weighted random generator. |
On Fri, Aug 23, 2019 at 12:31:17PM -0700, Ajay Ramachandran wrote:
it uses the weighted random generator.
Oh, I didn't notice that! Good catch! That function does look involved,
to say the least! ;-)
It's probably not a bad idea if we'd do some preliminary evaluation.
Once we have the hashes in the database, we can have a look at how many
bits of the hash return how many results (median). My uninformed guess
is that even sending just 20 bits (i.e. 5 hexadecimal chars, 4 bits each) will
return just a single result.
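(For concreteness: once the `sha1sum` column exists, that evaluation is a
single aggregate. A sketch, assuming hashes are stored as the hex text the
sha1() extension produces:)
SELECT substr(sha1sum, 1, 5) AS prefix, COUNT(*) AS matches
FROM sponsorTimes
GROUP BY prefix
ORDER BY matches DESC;
-- one row per 20-bit prefix; the top rows are the worst-case result counts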
Then, I'd go for the tried-and-true method of "just going with the flow"
and make the webextension send those 20 bits of the hash. If it turns out
that this increases CPU time too much, we can either reevaluate the
complete idea, or increase the number of bits required (and return a 400
when the client tries to use too few bits).
If you want, I could prepare an sqlite query for adding a new column to
the database and populating it with the sha1sums.
|
Go for it! I think setting it up to start storing sha1sums is a good start; then we can add the rest after. Is there a reason to use sha1 over sha256? |
On Fri, Aug 23, 2019 at 01:06:19PM -0700, Ajay Ramachandran wrote:
Go for it! I think setting it up to start storing sha1sums is a
good start,
to be clear: I'm no javascript developer; i'd have to leave implementing
that to you. I can work on the sqlite stuff, though.
Is there a reason to use sha1 over sha256?
it uses less space. we don't require a secure hash for this at all.
technically, i think, one could implement something similar with for
example CRC32 (you'd lose some precision in truncating).
we do need to mangle the youtube-ids somehow, since they're sparse to
begin with.
so if the user asked for the prefix {video_id:"abcd*"}, and we have one
video id (and very likely only 1) with that prefix, we could conclude
that the user wanted video abcdefgh, since not many video ids begin with
abcd. however, if we hash the ids first, then (by design of a secure
hashing algorithm) we cannot conclude anything about the input.
I'm having a hard time coming up with a satisfying explanation, so feel
free to ask for more details!
|
That makes sense. So really, the only disadvantage is more data throughput and more calculations server side. |
On Fri, Aug 23, 2019 at 01:24:41PM -0700, Ajay Ramachandran wrote:
That makes sense. So really, the only disadvantage is more data
throughput and more calculations server side.
exactly. and both of those disadvantages can be controlled (by
increasing the number of required prefix-bits (and decreasing the
privacy of the user)), if it gets out of hand (which I don't think it
will, unless we're adding millions of videoIDs).
I'll have a look at the sql stuff over the weekend!
|
As promised, here's a quick and reliable way to add the sha1sum column.
It turns out that sqlite has a sha1 function, but only as an extension.
You'll need a C compiler (e.g. gcc or clang) and sqlite-devel (RPM based
systems, sqlite-dev on Debian systems I presume).
1. Download sha1.c
wget 'https://www.sqlite.org/src/raw/ext/misc/sha1.c?name=df0a667211baa2c0612d8486acbf6331b9f8633fd4d605c17c7cccd26d59c6bd' -O sha1.c
2. Install sqlite-devel (Fedora) or libsqlite3-dev (Debian/Ubuntu)
sudo dnf install sqlite-devel
# OR!
sudo apt-get install libsqlite3-dev
3. Compile it using
gcc -g -fPIC -shared sha1.c -o sha1.so -std=gnu99
Finally, launch sqlite3 database.db from the directory of your compiled
sha1.so and issue:
SELECT load_extension("./sha1.so");
BEGIN TRANSACTION;
ALTER TABLE sponsorTimes ADD COLUMN sha1sum BLOB;
UPDATE sponsorTimes SET sha1sum = sha1(videoID) WHERE sha1sum ISNULL;
COMMIT;
One can then query for hashes using (note the zeroed out missing digits)
SELECT * from sponsorTimes WHERE sha1sum BETWEEN
'89b4800000000000000000000000000000000000' AND
'89b4900000000000000000000000000000000000';
This translates into simple >= and <= operations and is therefore pretty
efficient.
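For the range scan to stay fast as the table grows, the new column should
also get an index; a minimal sketch (the index name is illustrative):
CREATE INDEX IF NOT EXISTS sponsorTimes_sha1sum_idx ON sponsorTimes(sha1sum);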
let me know if that works for you!
|
I was going to file an issue suggesting exactly this, but I have one issue: I don't think there are likely enough videos submitted to make this sufficiently anonymous. It seems to me that, unless and until the database gets huge, the best thing to do would be to just have each client download the whole database and maintain it locally. It's currently only ~3MB, and I expect it will stay at a manageable size for the foreseeable future. To update (probably done every time a video is loaded), you wouldn't need to download the whole database again; just add an API endpoint for getting all entries since `timestamp`. It's not 100% a fair comparison, but this is what other ad-blocking software does: you don't send every URL you visit to a remote server, you download and sync a list that tells you what to block. |
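(For illustration, such a delta endpoint could be backed by a single query;
`timeSubmitted` is an assumed column name here, not necessarily the project's
actual schema:)
SELECT * FROM sponsorTimes
WHERE timeSubmitted > ?; -- bind the client's last successful sync time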
On Fri, Aug 30, 2019 at 04:38:17PM -0700, afontenot wrote:
I don't think there are likely enough videos submitted to make this
sufficiently anonymous.
that does not matter for k-anonymity: when only 1 hash matches, the
probability that the user requested a video that is not in our database
is high, so even then we can't infer much of anything.
To update (probably done every time a video is loaded), you wouldn't
need to download the whole database again, just add an API endpoint for
getting all entries since `timestamp`.
partial database exports probably won't scale when/if the project really
takes off, both in terms of transfer and storage size. i'm all for the
full sqlite dumps, as they bring resiliency to the project.
|
Is there an easy way to have a database in the browser? I was thinking of setting up a variant of the server that does what you are describing, and then making it so the extension can pull from that local server instead. |
i'd suggest opening a new bug for that, but in the meantime:
i don't think you can have the sqlite db directly in the addon. two
options come to mind, though i think neither is great for that use case:
1. use the localStorage apis:
https://developer.mozilla.org/en-US/docs/Web/API/Web_Storage_API/Local_storage
i'm not familiar with them, but it looks like they only support a
key-value lookup and are meant more for config values. there probably
is a relatively low limit on how much space (i.e. table rows) can be
stored.
the get-partial-db api would then need to return e.g. json, and the
extension would insert the values into the store.
2. use a companion application through nativeMessaging:
https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/Native_messaging
this way the calls to the sql db could be outsourced to an out-of-browser
program, with the database somewhere on disk.
this app would have to be programmed and maintained (extra work), and
would likely be somewhat os-specific. afaik this can't be used on
android, but don't quote me on that.
also, installing the companion app must be done by the user manually,
and requires dicking around in ~/.mozilla. it can't be done via the
extension for (obvious) security reasons.
|
Yea, the other issue is that localStorage does have a very low max storage usage, so it would not scale well. |
I think an easier way would be to use an all-lowercase search of the video IDs (not just the first letters, but a random subset of letters). They seem random enough, and when you use 3 characters, you usually get a few results. |
On Wed, Sep 04, 2019 at 10:48:55AM -0700, Ajay Ramachandran wrote:
I think an easier way would be to use an all-lowercase search of the
video IDs. They seem random enough, and when you use 3 characters, you
usually get a few results.
hard no!
it's not about getting "a few results". this approach gains you nothing
and renders the privacy aspect moot. video ids are sparse, but small and
partially known. when a user requests a partial id abc?????, the
probability that we can associate it with the real id is far higher
than when the user requests a partial hash
abcde???????????????????????????????????.
but it's not about good practices alone: using (properly
implemented) proven-in-the-wild solutions gives users confidence in said
solution; one cooked up by a gut-feeling of "probably good enough", used
nowhere else, requires users to trust you that you have carefully
studied the trade-offs.
are you worried about performance, implementation work, or something
else entirely? I'm sure we can find a solution to it. :-)
|
Isn't that the same probability, just with more precision when more characters are used? If someone requests "sh3", their video could be anything that contains "Sh3", "sH3", etc. Maybe I am just misunderstanding. |
please, let's all agree to implement the feature as it has been
proven in the wild (e.g. in the haveibeenpwned-api). diverging from
known-working algorithms costs time and gives us nothing at best,
compromises users' privacy at worst.
none of us have studied this field; we can't possibly know where the
pitfalls of a custom approach are.
|
Currently, there are 25k entries in the DB. This means that on average log2(25k) ≈ 15 bits are needed to uniquely identify one video. If we want every request to pull e.g. 16 videos for privacy, then we need to query by a prefix of log2(25k) - log2(16) ≈ 11 bits. YouTube video IDs are base64 strings, so they carry 6 bits per character.
Hashing the ID doesn't matter regarding privacy, I think, but even though YouTube video IDs are pretty obviously random, it's probably still better to hash them to be safe about the distribution. It also allows adding other websites later (e.g. hashing the whole normalized URL, which is probably a better idea anyway) and covers the case where the YouTube ID algorithm changes.
I wouldn't use BLOB in sqlite for this, since it's harder to filter by; better to use hex strings. Hex strings carry log2(16) = 4 bits per character. So right now it would be pretty easy to e.g. add a videoID_sha256 column, then do queries by the first three characters of the hash (12 bits), which would return 8 videos on average. sqlite3 is able to use indices for prefix GLOB queries.
For each doubling of the database size, we in theory want to increase the length of the prefix we are looking for by one bit. If we go by hex-character prefixes this isn't really possible, since we can only increase the prefix every 4 bits, i.e. every 16x-ing of db size. |
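(A sketch of such a prefix query, using the `videoID_sha256` column name
proposed above; the three-character prefix is only an example:)
SELECT * FROM sponsorTimes
WHERE videoID_sha256 GLOB '3dd*';
-- with an index on videoID_sha256, sqlite can rewrite this prefix GLOB
-- into a range scan, much like the BETWEEN query shown earlier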
Alternatives to the sqlite3 GLOB approach: (a) just storing the sha-hash prefix as binary strings (pretty easy, since we'll never need more than 32 prefix-bit characters, which would require 32 bytes of storage per entry in this case); or (b) doing range queries on the hex strings: convert the sha256 to a bigint, set all bits apart from the first or last 11 to zero (= lower bound) and then to one (= upper bound), e.g. with [...].
Probably neither of these is worth it, and just filtering by sha256 prefix is fine, especially since many videos we're looking for aren't even in the result set. Just using the first 12 bits is fine until we switch to 16 bits once the db size has increased to something like 128k entries. |
I think 16 is too much. Maybe there can be an option to do that, but I think the default should be lower. |
Note that this wouldn't be able to fix the call to invidio.us: |
Invidious API usage has been removed. |
I came upon SponsorBlock today for the first time, and a browse-around led me to this issue. I personally like to worry about privacy and would not want to send every single video ID to a SponsorBlockServer instance (unless maybe I was operating one within my own network). So I considered looking into this and maybe doing a PR. But first I would like to share some thoughts.

To start I would like to point at an implementation of k-anonymity for data queries that pre-dates Pwned Passwords: the “Safe Browsing” (Chrome) and “Block dangerous and deceptive content” (Firefox) systems. Information about it is a bit spread around, but some good resources are: the API documentation (look for “Update API”), more links collected on the Mozilla Wiki, and a study by Gerbet et al. I think this is a good implementation to keep in the back of our heads when continuing the discussion, because of the many parallels.

Safe Browsing is set up to facilitate a client (web browser) wanting to check a centralised repository (Google’s collection of malicious links) for something about a resource (website) without telling the repository about every resource the client uses. Now make the client a video player, the repository SponsorBlockServer, and the resource a YouTube video, and we are in the exact same situation. Safe Browsing also has a number of points that address arguments raised in this discussion. You do not need to study it before reading on, just do not be surprised if I reference it again (and again, and again) 😄

I will consider the threat vector as: the SponsorBlockServer will get to know every video the user watches. Leave other discussions for later. (If we do not trust the network, we may need obfuscation techniques like padding. If we do not trust the client … well, that is a separate can of worms.)

**Hashing**

I would say hashing is absolutely required. It is the only way to ensure no actual information about the video ID is shared with SponsorBlockServer. Even part of a video ID is still information. Furthermore, are we sure YouTube video IDs are randomly distributed throughout the possible ID space? Is a [...]?

It does not really matter what hashing algorithm is chosen. As long as it upholds the default assumptions for a hash, we can use it to map both value spaces to each other: the amount of IDs known to SponsorBlockServer to the amount of IDs that can possibly exist on YouTube.

Pwned Passwords uses SHA-1. But Troy has also explained that he is not really hashing to seriously protect the data, only to make sure “any personal info” like email addresses “in the source data is obfuscated”. People were also “cracking the passwords for fun” and succeeding at it. (Multiple discussions have happened around this; quotes taken from a February 2018 blog post.) I am honestly not sure it matters much for prefixes in our specific case, but Safe Browsing uses SHA-256 and, as I said, might be closer to our flow. I also do not expect SponsorBlock clients to exist on many platforms that would not have access to SHA-256 primitives.

So let us say we hash every video identifier with a SHA-256 function. Let us also assume that storage and querying the database is trivial. (Because any debate about `TEXT` vs `BLOB` and optimisation of `WHERE` statements can be had when actual benchmarking can be done.)

**What to hash**

This is something that was only very shortly touched on in the discussion here by @phiresky:
> hash whole normalized URL which is probably a better idea anyways

This is a step that once again exists in the Safe Browsing API: they do URL canonicalisation. However, my recommendation would be to not do that. In my experience anything close to trying to normalise a URL leads to differences between clients. As a quick note: RFC 3986 includes steps for “Normalization”, the WHATWG URL standard includes steps for “equivalence”, and Safe Browsing has steps for “Canonicalization”. If you are unlucky, the language you are writing your client in will have access to anywhere from 0 to 100 implementations of any of these. Not only that, you would also end up having to specify additional steps per platform. YouTube has lots of possible URLs pointing at the same video, sometimes from entirely different domains.

Instead, stick to clear identifiers supplied by the platforms. The YouTube video ID is much less ambiguous than any URL will be. It is also likely that a client already has logic to extract the video ID from a bunch of different sources; letting them work with that is better. If we really value compatibility with other platforms, simply generating a URI of our own suffices. Instruct clients to hash `yt:` followed by the YouTube video ID. That would allow simple future expansion without [...].

**Hash prefix length**

Setting the length of the prefix is a balancing act: the longer the prefix, the less privacy is given. This goes the full range from 0 to 256 bits (in the case of SHA-256). If the client asks for a list of possibilities with a 0-bit prefix, it is really just asking for the entire database. This gives complete privacy to the user, as the server gets no clues about what video is being accessed. But of course it is a bit annoying for all parties involved, as the client would constantly request lots and lots of unnecessary data. If the client asks for a list of possibilities with a 256-bit prefix, it is really just sending the full hash to the server. This gives almost¹ no privacy to the user, as the server could match this hash 1:1 with a video ID to know exactly what was being watched. So this is no win for the user. The number we are aiming for is somewhere in the middle.

I am however not convinced by @phiresky’s calculation of the prefix length. The length there seems to be argued from the data perspective and not from the user perspective. I do not even think there is any privacy difference depending on how many results the server returns. There is no relation I can see between the number of results and the threat vector of not wanting to tell the server about what video I am watching.

From the server perspective I do not think we care. There is already an endpoint that allows the client to submit the exact video ID, so obviously we are OK with the 256-bit variant. The database is not secret and in fact readily downloadable, so we are also OK with the 0-bit variant. The only argument from the server side I can think of is to protect against a number of DoS problems. When the database grows, it may not always be OK to support continuous downloads of the entire thing. Server instances may want to scale up the minimum hash length from 0 to a more manageable number in accordance with total database growth, maybe in such a way that response sizes stay under a certain number of bytes.

From the client perspective I think we want to send as little information as possible. This is where the user is and where we care about the privacy. But there are also a number of platform limitations that need to be considered. Like @ajayyy mentioned before about web browsers, some platforms may put limitations on storage. Other platforms may be on limited bandwidth.
So for a first implementation I was thinking: why not leave it completely up to the client what prefix length it sends, and only pick a bottom limit for the server? We have 133417 sponsorTimes in the latest DB (just downloaded from the mirror). If we aim to never have an API response include more than 1000 of them, we need a minimum length that cuts our hash space into approximately 134 blocks. That is just the first 8 bits of the hash (or the first 2 hexadecimal characters).² If the client feels this is problematic because of storage, memory, or bandwidth concerns, it can opt to send a longer prefix whilst sacrificing some of its user’s privacy.

**Future of prefixes**

The Safe Browsing API once again has an interesting system: the client can download a list of prefixes straight from the server. This is interesting for a number of reasons: [...]
Now this of course assumes the client has a way to store this list of prefixes and implements logic to keep it up to date. Therefore I am not entirely sure it makes sense to make this a must from day one.

**Pull request**

Like I said at the start, I am considering looking into this and getting started on a pull request. In the meantime I would love to hear everyone’s input on my thoughts, whether you think I am completely missing the mark, right on it, or anywhere in between. (Shall we rate it 0 to 256 bits? 😉) You will also find me in the SponsorBlock Discord as Zegnat if you would like to discuss any of the above with me outside of this issue.

¹: almost, because if the video ID was not already known to the list, the hash needs to be reversed [...] |
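(Zegnat's 8-bit figure is easy to sanity-check against a database dump;
a sketch, again using the hypothetical `videoID_sha256` column:)
SELECT substr(videoID_sha256, 1, 2) AS prefix, COUNT(*) AS n
FROM sponsorTimes
GROUP BY prefix
ORDER BY n DESC
LIMIT 3; -- the three fullest 8-bit buckets, i.e. the largest possible responses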
you're echoing a lot of my gut feelings about this topic, and they're
clearly laid out. Thanks! I've replied to a few things inline.
tl/dr: full ack! would love to see you open that PR.
On Fri, May 22, 2020 at 01:16:02PM -0700, Martijn van der Ven wrote:
To start I would like to point at an implementation of k-anonymity for
data queries that pre-dates Pwned Passwords: the “Safe Browsing”
(Chrome) and “Block dangerous and deceptive content” (Firefox) systems.
true! i can't believe i didn't think of it.
I would say hashing is absolutely required. It is the only way to
ensure no actual information about the video ID is shared with
SponsorBlockServer. Even part of a video ID is still information.
+1 to requiring hashing.
So let us say we hash every video identifier with a SHA-256 function.
Let us also assume that storage and querying the database is trivial.
(Because any debate about `TEXT` vs `BLOB` and optimisation of `WHERE`
statements can be had when actual benchmarking can be done.)
+1 to stop bikeshedding.
**What to hash**
[...]
Instead stick to clear identifiers supplied by the platforms. The
YouTube video ID is much less ambiguous than any URL will be. It is
[...]
If we really value compatibility with other platforms simply generating
a URI of our own suffices. Instruct clients to hash `yt:` followed by
the YouTube video ID. That would allow simple future expansion without
seems like a solid idea.
**Hash prefix length**
[...]
So for a first implementation I was thinking: why not leave it
completely up to the client what prefix length it sends and only pick a
bottom limit for the server?
we should define at least some bounds: retrieving the whole database
this way is way more expensive than downloading it, potentially opening
the server up to DoS, and personally I'd also let misconfigured clients
(who send a too specific prefix) hard-fail, so it gets fixed quickly.
as a starting point, I'd suggest >=8bits, but <= 50% of the hash.
**Pull request**
Like I said at the start, I am considering looking into this and
getting started on a pull request. In the meantime I would love to
hear everyone’s input on my thoughts, whether you think I am completely
missing the mark, right on it, or anywhere in between. (Shall we rate
it 0 to 256 bits? 😉)
you're getting all 256 from me :)
i'll definitely review your PR once it's ready (given my limited
javascript skills it'll be more on the concepts than the implementation
details).
|
Sounds great! I'm going to repeat some things I've already said on discord to have them here more publicly.
I think hashing just the video id is fine for now. I think the service should be added as a separate column (when needed). The service that they are getting info from isn't a large privacy concern and this would more easily allow it to be split to a different database or server in the future depending on growth.
Like @girst, I think there should be minimums, but allowing it to be changed client-side isn't a bad idea.
I want to implement some form of caching in the future, but ideally this will hold the actual data instead of just hashes. This would only apply to videos older than some set time period. The issue with setting up caching is that fetching the upload time actually takes time. This means that sponsor fetching would be slower for every video, as it would fetch the upload time, wait for a response, then fetch times. This is why the current implementation of whitelisting sometimes misses zero-second sponsors (though in the latest beta I have added an option to wait before fetching to resolve this issue in exchange for slower sponsor fetching). I feel that caching can be dealt with separately from this. |
I love that you are tackling privacy issues! I would, however, like to raise one concern regarding the effectiveness of this specific approach. I am not very active/experienced in the field of privacy-preserving technology, so take this with a grain of salt, but I am fairly certain the basic idea holds merit. Some assumptions I'll just make explicit:
Obviously hashing doesn't prevent the server from getting the set of videos that were requested. The hash is only a way of specifying a random set of videos, including the one of interest. |
There are many scenarios in which it is easy to identify which sequence of videos is the correct one. For example, if part 1 and part 2 of a series are in columns 1 & 2, it is very unlikely that those were chosen by random chance. Similarly, if a request for a new video is made after 14 minutes when only one video in the previous column was 14 minutes long, there is an increased likelihood of that video being the one that is watched. From a practical perspective, it could be argued that this is still an improvement in privacy, even if an attack like this is sometimes possible (after all, it still requires a lot of effort). I would however also like to at least mention the counterpoint that implementing such a scheme could give false confidence in the privacy afforded by this service. Keep in mind that human intuition on probability can be very misleading, so maybe this attack quickly becomes infeasible. Some actual calculations might be required to assess the feasibility of this approach. |
Still have to test performance before enabling globally, so it will be just an option for now. I might try enabling it for a certain percentage of all requests (50%) by default and see how it goes. |
On Fri, Sep 04, 2020 at 09:05:29AM -0700, Ajay Ramachandran wrote:
Still have to test performance before enabling globally, so it will be
just an option for now. I might try enabling it for a certain
percentage of all requests (50%) by default and see how it goes.
I guess this statement is regarding the addon, no?
I'd like to use this endpoint with my own client as soon as it's
available (right now, it still 404s). Any idea when it's going live,
ajayyy?
|
@girst Yes, it's about the addon. I have noticed that the public database size doubles with this change, so I am going to look into splitting this into a separate database file. It might not be live for a bit. |
@girst The API is now live on the main server |
On Tue, Sep 15, 2020 at 09:56:31AM -0700, Ajay Ramachandran wrote:
The API is now live on the main server
what am i doing wrong?
% curl https://sponsor.ajay.app/api/skipSegments/`printf v-ajyq_DeDA|sha256sum|head -c8`
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot GET /api/skipSegments/14e4b9a7</pre>
</body>
</html>
if i try the test-server(?), i get a different result:
% curl https://sponsor.ajay.app/test/api/skipSegments/`printf v-ajyq_DeDA|sha256sum|head -c8`
Not Found
|
Sorry about that, should be good now |
On Tue, Sep 15, 2020 at 11:16:52AM -0700, Ajay Ramachandran wrote:
Sorry about that, should be good now
It is! Thanks again!
|
As I understand it, currently the user submits a video ID (e.g. dQw4w9WgXcQ) and gets back a single JSON object.
Here I propose to add a new endpoint, e.g. /api/anonymousGetVideoSponsorTimes, that takes the first n (e.g. 5) characters of a hash (sha1 is fine) and returns a list of possible results, like in the example below.
This approach is used in Troy Hunt's Have I Been Pwned API; see https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/ and https://blog.cloudflare.com/validating-leaked-passwords-with-k-anonymity/ for example.
input: { hash_prefix: <sha1sum("dQw4w9WgXcQ").substr(0,5)> } (e.g. { hash_prefix: '3dd08' })
output: [...]
Since YouTube IDs are sparse, and furthermore SponsorBlock only has a small part of all IDs indexed, each query will return only a small number of results, if any. If result lengths were to get out of hand in the future, it would be easy to increase the number of input characters required.
For performance reasons, the database should grow a new column, sha1sum.
A pseudo-SQL query for such a request might look like this: [...]
Which hashing algorithm is used is not very important, as the user will only send a fraction of the hash to the server. SHA1 has a reasonable length, I'd say. (Let's avoid MD5, though ;-) )
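(Presumably the elided example resembled the LIKE/BETWEEN queries from
earlier in the thread; a sketch:)
SELECT * FROM sponsorTimes
WHERE sha1sum LIKE '3dd08%'; -- or the equivalent indexed BETWEEN range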