
Preserve user's privacy with k-anonymity #25

Closed
girst opened this issue Aug 22, 2019 · 38 comments


@girst

girst commented Aug 22, 2019

As I understand it, currently the user submits a video ID (e.g. dQw4w9WgXcQ) and gets back a single JSON object.

Here I propose to add a new endpoint, e.g. /api/anonymousGetVideoSponsorTimes that takes the first n (e.g. 5) characters of a hash (sha1 is fine) and returns a list of possible results, like in the example below.

This approach is used in Troy Hunt's Have I Been Pwned API; see https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/ and https://blog.cloudflare.com/validating-leaked-passwords-with-k-anonymity/ for example.

input: { hash_prefix: <sha1sum("dQw4w9WgXcQ").substr(0,5)> }
(e.g. { hash_prefix: '3dd08' })

output:

[
 {
  videoID: 'dQw4w9WgXcQ',
  sponsorTimes: array [float],
  UUIDs: array [string] //The ID for this sponsor time, used to submit votes
 },
 {
  videoID: 'ah20943fdhj7',
  sponsorTimes: array [float],
  UUIDs: array [string] //The ID for this sponsor time, used to submit votes
 },
// ...
]

Since YouTube IDs are sparse, and furthermore SponsorBlock only has a small fraction of IDs indexed, each query will return only a small number of results, if any. If result lengths were to get out of hand in the future, it would be easy to increase the number of input characters required.

For performance reasons, the database should grow a new column, sha1sum. A pseudo-SQL query for such a request might look like this:

SELECT * FROM sponsorTimes WHERE sha1sum LIKE '3dd08%'

Which hashing algorithm is used is not very important, as the user will only send a fraction of the hash to the server. SHA1 has a reasonable length, I'd say. (let's avoid MD5, though ;-) )
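To make the flow concrete, here is a minimal client-side sketch (my assumptions: a runtime with Node's crypto and fetch, a placeholder server URL, and the endpoint name proposed above, which doesn't exist yet):

```ts
import { createHash } from "crypto";

// Shape of one entry in the proposed response (from the example above).
interface SponsorEntry {
  videoID: string;
  sponsorTimes: number[];
  UUIDs: string[];
}

async function getSponsorTimesAnonymously(videoID: string): Promise<SponsorEntry | undefined> {
  // Send only the first 5 hex characters (20 bits) of the hash.
  const prefix = createHash("sha1").update(videoID).digest("hex").substring(0, 5);
  const res = await fetch(
    `https://sponsor.example/api/anonymousGetVideoSponsorTimes?hash_prefix=${prefix}`
  );
  const candidates: SponsorEntry[] = await res.json();
  // The server never learns which candidate we actually wanted; filter locally.
  return candidates.find((entry) => entry.videoID === videoID);
}
```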

@girst
Author

girst commented Aug 22, 2019

Technically, no new endpoint would be required: if the input has a hash_prefix key, query by sha1 prefix; if it has a videoID key, query by video ID.

Storing hashes as binary blobs instead of strings is likely more efficient too; see for example https://stackoverflow.com/a/24011877
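To illustrate the dispatch idea, a sketch (assuming an Express-style handler and a better-sqlite3-style API; this is illustrative, not the actual server code):

```ts
import express from "express";
import Database from "better-sqlite3";

const app = express();
const db = new Database("sponsorTimes.db");

app.get("/api/getVideoSponsorTimes", (req, res) => {
  if (typeof req.query.hash_prefix === "string") {
    // k-anonymous path: return every row whose hash starts with the prefix.
    const rows = db
      .prepare("SELECT * FROM sponsorTimes WHERE sha1sum LIKE ?")
      .all(req.query.hash_prefix + "%");
    res.json(rows);
  } else if (typeof req.query.videoID === "string") {
    // Existing behavior: exact lookup by video ID.
    const rows = db
      .prepare("SELECT * FROM sponsorTimes WHERE videoID = ?")
      .all(req.query.videoID);
    res.json(rows);
  } else {
    res.status(400).send("hash_prefix or videoID required");
  }
});
```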

@ajayyy
Owner

ajayyy commented Aug 23, 2019

Smart. I don't know how well this will work with how few videoIDs are in this DB, but it definitely could work.

@girst
Author

girst commented Aug 23, 2019 via email

@ajayyy
Owner

ajayyy commented Aug 23, 2019

Ah, true. I think a blob could work, since it could be converted from a string to a blob (base64 or something).

@ajayyy
Owner

ajayyy commented Aug 23, 2019

One thing to note about this: it could be quite expensive, since it doesn't just send all sponsors for one video; it uses the weighted random generator.

@girst
Author

girst commented Aug 23, 2019 via email

@ajayyy
Owner

ajayyy commented Aug 23, 2019

Go for it! I think setting it up to start storing sha1sums is a good start; then we can add the rest after. Is there a reason to use sha1 over sha256?

@girst
Author

girst commented Aug 23, 2019 via email

@ajayyy
Owner

ajayyy commented Aug 23, 2019

That makes sense. So really, the only disadvantage is more data throughput and more calculations server-side.

@girst
Author

girst commented Aug 23, 2019 via email

@girst
Author

girst commented Aug 24, 2019 via email

@afontenot

afontenot commented Aug 30, 2019

I was going to file an issue suggesting exactly this, but I have one concern: I don't think there are likely to be enough videos submitted to make this sufficiently anonymous. It seems to me that, unless and until the database gets huge, the best thing to do would be to just have each client download the whole database and maintain it locally. It's currently only ~3MB, and I expect it will stay at a manageable size for the foreseeable future.

To update (probably done every time a video is loaded), you wouldn't need to download the whole database again; just add an API endpoint for getting all entries since a given timestamp.

It's not a 100% fair comparison, but this is what other ad-blocking software does: you don't send every URL you visit to a remote server; you download and sync a list that tells you what to block.
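A sketch of what that incremental sync could look like (the endpoint name, fields, and server URL here are hypothetical):

```ts
// Hypothetical incremental-sync client: keep a local copy of the database
// and only fetch entries added or updated since the last sync, instead of
// querying the server per video.
const BASE_URL = "https://sponsor.example"; // placeholder server

interface DbEntry {
  videoID: string;
  sponsorTimes: number[];
  UUIDs: string[];
}

const localDb = new Map<string, DbEntry>();

async function syncSince(lastSync: number): Promise<number> {
  const res = await fetch(`${BASE_URL}/api/getEntriesSince?timestamp=${lastSync}`);
  const entries: DbEntry[] = await res.json();
  for (const entry of entries) {
    localDb.set(entry.videoID, entry); // lookups now never leave the device
  }
  return Date.now(); // persist as the next lastSync value
}
```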

@girst
Author

girst commented Aug 31, 2019 via email

@ajayyy
Owner

ajayyy commented Aug 31, 2019

Is there an easy way to have a database in the browser? I was thinking of setting up a variant of the server that does what you are describing, and then making it so the extension can pull from that local server instead.

@girst
Author

girst commented Aug 31, 2019 via email

@ajayyy
Owner

ajayyy commented Aug 31, 2019

Yea, the other issue is that localStorage has a very low maximum storage limit, so it would not scale well.

@ajayyy
Owner

ajayyy commented Sep 4, 2019

I think an easier way would be to use an all-lowercase search of the video IDs (not just the first letters, but a random subset of letters). They seem random enough, and when you use 3 characters, you usually get a few results.

@girst
Author

girst commented Sep 4, 2019 via email

@ajayyy
Owner

ajayyy commented Sep 4, 2019

Isn't that the same probability, just with more precision the more characters you use? If someone requests "sh3", their video could be anything that contains "Sh3", "sH3", etc.

Maybe I am just misunderstanding.

@girst
Author

girst commented Sep 4, 2019 via email

@phiresky

phiresky commented Dec 9, 2019

Currently, there are 25k entries in the DB. This means that on average log2(25k) ≈ 15 bits are needed to uniquely identify one video. If we want every request to pull e.g. 16 videos for privacy, then we need to query by a prefix of log2(25k) - log2(16) ≈ 11 bits.

YouTube video IDs are base64 strings, so they carry 6 bits per character. Hashing the ID doesn't matter much for privacy, I think; even though YouTube video IDs are pretty obviously random, it's probably still better to hash them to be safe about the distribution. It also allows for adding other websites later (e.g. hashing the whole normalized URL, which is probably a better idea anyway) and covers the case where the YouTube ID algorithm changes.

I wouldn't use BLOB in sqlite for this, since it's harder to filter by; better to use hex strings.

Hex strings carry log2(16) = 4 bits per character.

So right now it would be pretty easy to e.g. add a videoID_sha256 column, then do queries by the first three characters of the hash (12 bits), which would return about 8 videos on average.

sqlite3 is able to use indices for prefix GLOB queries (select * from sponsorTimes where videoID_sha256 GLOB 'a0e*'), so the performance impact on the server would be minimal.

For each doubling of the database size, we in theory want to increase the length of the prefix we are looking for by one bit. If we go by hex-character prefixes this isn't really possible, since we can only increase the prefix every 4 bits, i.e. for every 16× growth of the DB.
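The rule of thumb above as a tiny helper (a sketch, assuming uniformly distributed hashes; note it floors where the comment above rounds log2(25k) up to 15):

```ts
// How many prefix bits to query by, given the DB size and a desired
// anonymity set of roughly k videos per response.
function prefixBits(dbSize: number, k: number): number {
  return Math.max(0, Math.floor(Math.log2(dbSize) - Math.log2(k)));
}

prefixBits(25_000, 16); // 10 (11 if you round log2(25k) up to 15, as above)
prefixBits(25_000, 8);  // 11, i.e. roughly three hex characters
```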

@phiresky

phiresky commented Dec 9, 2019

Alternatively to using sqlite3 GLOB, it would be possible to filter by an exact number of prefix bits by

(a) just storing the sha hash prefix as a binary string (pretty easy, since we'll never need more than 32 prefix bits, which stored as a string of '0'/'1' characters would take 32 bytes per entry), or

(b) doing range queries on the hex strings: convert the sha256 to a bigint, zero out all bits apart from the first 11 (= lower bound), then add one at the 11-bit boundary (= upper bound), e.g. with shift = 256 - 11, lowerbound = (shaint >> shift) << shift and upperbound = lowerbound + (1 << shift), and then doing the query as
select * from sponsorTimes where videoID_sha256 >= lowerbound and videoID_sha256 < upperbound (abusing alphabetic text ordering here, after converting the bounds back to fixed-width hex)

Probably neither of these is worth it, and just filtering by sha256 prefix is fine, especially since many videos we're looking for aren't even in the result set; using the first 12 bits is fine until we switch to 16 bits once the DB has grown to something like 128k entries.
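For the record, a sketch of option (b) with BigInt (my assumptions: SHA-256 hashes stored as 64-character lowercase hex; it ignores the wrap-around when the prefix is all ones):

```ts
// Compute [lower, upper) hex-string bounds covering every hash sharing the
// first prefixBits bits, so a plain ordered index can answer the query.
function prefixRange(hashHex: string, prefixBits: number): [string, string] {
  const shift = BigInt(hashHex.length * 4 - prefixBits);
  const lower = (BigInt("0x" + hashHex) >> shift) << shift; // zero the tail bits
  const upper = lower + (1n << shift); // smallest value with the next prefix
  return [
    lower.toString(16).padStart(hashHex.length, "0"),
    upper.toString(16).padStart(hashHex.length, "0"),
  ];
}

// const [lo, hi] = prefixRange(videoIdSha256, 11);
// SELECT * FROM sponsorTimes WHERE videoID_sha256 >= :lo AND videoID_sha256 < :hi
```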

@ajayyy
Owner

ajayyy commented Dec 9, 2019

I think 16 is too much. Maybe there can be an option to do that, but I think the default should be lower.

@devnoname120

Note that this wouldn't be able to fix the call to invidio.us:
https://github.com/ajayyy/SponsorBlock/blob/1abc1b9b28d59d2020d26810156cbdfbe3e8f7fa/content.js#L441

@ajayyy
Owner

ajayyy commented Jan 5, 2020

@devnoname120

ajayyy/SponsorBlock#189 (comment)

@ajayyy
Owner

ajayyy commented Feb 11, 2020

Invidious API usage has been removed.

@Zegnat
Contributor

Zegnat commented May 22, 2020

I came upon SponsorBlock today for the first time, and a browse around led me to this issue. I personally like to worry about privacy and would not want to send every single video ID to a SponsorBlockServer instance (unless maybe I was operating one within my own network). So I considered looking into this and maybe doing a PR. But first I would like to share some thoughts.

To start, I would like to point at an implementation of k-anonymity for data queries that pre-dates Pwned Passwords: the “Safe Browsing” (Chrome) and “Block dangerous and deceptive content” (Firefox) systems. Information about it is a bit spread around, but some good resources are: the API documentation (look for “Update API”), more links collected on the Mozilla Wiki, and a study by Gerbet et al.

I think this is a good implementation to keep in the back of our heads when continuing the discussion, because of the many parallels. Safe Browsing is set up to facilitate a client (web browser) wanting to check a centralised repository (Google’s collection of malicious links) for something about a resource (website) without telling the repository about every resource the client uses. Now make the client a video player, the repository SponsorBlockServer, and the resource a YouTube video, and we are in the exact same situation.

Safe Browsing also has a number of points that address some arguments raised by this discussion. You do not need to study it before reading on, just do not be surprised if I reference it again (and again, and again) 😄

I will consider the threat vector as: the SponsorBlockServer will get to know every video the user watches. Leave other discussions for later. (If we do not trust the network, we may need obfuscation techniques like padding. If we do not trust the client … well that is a separate can of worms.)

Hashing

I would say hashing is absolutely required. It is the only way to ensure no actual information about the video ID is shared with SponsorBlockServer; even part of a video ID is still information. Furthermore, are we sure YouTube video IDs are randomly distributed throughout the possible ID space? Is a - just as likely to be part of an ID as an a? I do not have a big enough dataset to run this analysis, and I do not think we need to do so at all. Hashing eliminates that problem.

It does not really matter much which hashing algorithm is chosen. As long as it upholds the default assumptions for a hash, we can use it to map the two value spaces to each other: the set of IDs known to SponsorBlockServer and the set of IDs that can possibly exist on YouTube.

Pwned Passwords uses SHA-1, but Troy has also explained that he is not really hashing to seriously protect the data, only to make sure “any personal info” like email addresses “in the source data is obfuscated”. People were also “cracking the passwords for fun”, and succeeding. (Multiple discussions have happened around this; quotes taken from a February 2018 blog post.)

I am honestly not sure it matters much for prefixes in our specific case, but Safe Browsing uses SHA-256 and, as I said, might be closer to our flow. I also do not expect SponsorBlock clients to exist on many platforms that lack access to SHA-256 primitives.

So let us say we hash every video identifier with a SHA-256 function. Let us also assume that storage and querying the database is trivial. (Because any debate about TEXT vs BLOB and optimisation of WHERE statements can be had when actual benchmarking can be done.)

What to hash

This is something that was only very briefly touched on in the discussion here, by @phiresky:

[…] it allows for adding other websites later (e.g. hash whole normalized URL which is probably a better idea anyways) and if youtube id algo changes.

This is a step that once again exists in the Safe Browsing API: they do URL canonicalisation. However, my recommendation would be not to do that. In my experience, anything close to trying to normalise a URL leads to differences between clients. As a quick note: RFC 3986 includes steps for “Normalization”, the WHATWG URL standard includes steps for “equivalence”, and Safe Browsing has steps for “Canonicalization”. If you are unlucky, the language you are writing your client in will have access to anything from 0 to 100 implementations of any of these.

Not only that, you would also end up having to specify additional steps per platform. YouTube has lots of possible URLs pointing at the same video, sometimes from entirely different domains (youtu.be, anyone?).

Instead stick to clear identifiers supplied by the platforms. The YouTube video ID is much less ambiguous than any URL will be. It is also likely that a client already has logic to extract the video ID from a bunch of different sources. Letting them work with that is better.

If we really value compatibility with other platforms, simply generating a URI of our own suffices. Instruct clients to hash yt: followed by the YouTube video ID. That would allow simple future expansion without messing with URLs.
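Something like this (a sketch using Node's crypto; the yt: scheme is just the convention proposed here, and the function name is made up):

```ts
import { createHash } from "crypto";

// Hash a tiny URI of our own ("yt:" + video ID) instead of a normalized URL.
function videoIdentifierHash(service: string, id: string): string {
  return createHash("sha256").update(`${service}:${id}`).digest("hex");
}

videoIdentifierHash("yt", "dQw4w9WgXcQ"); // 64 hex characters, service-scoped
```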

Hash prefix length

Setting the length of the prefix is a balancing act. The longer the prefix, the less privacy is given. This goes the full range from 0 to 256 bits (in the case of SHA-256).

If the client asks for a list of possibilities with a 0 bit long prefix, it is really just asking for the entire database. This gives complete privacy to the user, as the server gets no clues about what video was being accessed. But of course it is a bit annoying for all parties involved, as the client would constantly request lots and lots of unnecessary data.

If the client asks for a list of possibilities with a 256 bit long prefix, it is really just sending the full hash to the server. This gives almost¹ no privacy to the user, as the server could match this hash 1:1 with a video ID to know exactly what was being watched. So this is no win for the user.

The number we are aiming for is somewhere in the middle. I am, however, not convinced by @phiresky’s calculation of the prefix length; it seems to be argued from the data perspective and not from the user perspective. I do not even think there is any privacy difference depending on how many results the server returns: there is no relation I can see between the number of results and the threat vector of not wanting to tell the server what video I am watching.

From the server perspective, I do not think we care. There is already an endpoint that allows the client to submit the exact video ID, so obviously we are OK with the 256 bit variant. The database is not secret and in fact readily downloadable, so we are also OK with the 0 bit variant.

The only argument from the server side I can think of is to protect against a number of DoS problems. When the database grows, it may not always be OK to support continuous downloads of the entire thing. Server instances may want to scale up the minimum hash length from 0 to a more manageable number in accordance with total database growth, maybe in such a way that response sizes stay under a certain number of bytes.

From the client perspective, I think we want to send as little information as possible. This is where the user is and where we care about the privacy. But there are also a number of platform limitations that need to be considered. Like @ajayyy mentioned before about web browsers, some platforms may put limitations on storage. Other platforms may be on limited bandwidth.

So for a first implementation I was thinking: why not leave it completely up to the client what prefix length it sends and only pick a bottom limit for the server?

We have 133417 sponsorTimes in the latest DB (just downloaded from the mirror). If we aim to never have an API response include more than 1000, we need a minimum length that cuts our hash space into approximately 134 blocks. That is just the first 8 bits of the hash (or the first 2 hexadecimal characters).²
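Spelled out, the back-of-the-envelope arithmetic (assuming uniformly distributed hashes):

```ts
// 133417 entries and at most ~1000 per response means cutting the hash
// space into at least 134 blocks: ceil(log2(133417 / 1000)) = ceil(7.06) = 8.
const minPrefixBits = Math.ceil(Math.log2(133417 / 1000)); // 8 bits = 2 hex chars
```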

If the client feels this is problematic because of storage, memory, or bandwidth concerns, it can opt to send a longer prefix whilst sacrificing some of its user’s privacy.

Future of prefixes

The Safe Browsing API once again has an interesting system. The client can download a list of prefixes straight from the server. This is interesting for a number of reasons:

  1. If the client encounters a resource where the hash prefix is not in the list at all, it does not need to make any requests. 100% privacy!
  2. The server can make clear to the client exactly what prefixes it will respond to. If a certain 2 bit prefix only includes 10 videos, it does not need to split this up further. If another 8 bit prefix includes 10000 videos maybe it helps to only have 10 bit prefixes defined there.

Now this of course assumes the client has a way to store this list of prefixes and implements logic to keep it up to date. Therefore I am not entirely sure it makes sense to make this a must from day one.

Pull request

Like I said at the start, I am considering looking into this and getting started on a pull request. In the meantime, I would love to hear everyone’s input on my thoughts, whether you think I am completely missing the mark, right on it, or anywhere in between. (Shall we rate it 0 to 256 bits? 😉)

You will also find me in the SponsorBlock Discord as Zegnat if you would like to discuss any of the above with me outside of this issue.


¹: “almost”, because if the video ID was not already known from the list, the hash would still need to be reversed
²: this is just math. In reality we may have some uneven distributions through sheer randomness of the data. As our full dataset is still relatively small, it would be simple to run the numbers and pick a good cut-off point.

@girst
Author

girst commented May 22, 2020 via email

@ajayyy
Owner

ajayyy commented May 22, 2020

Sounds great!

I'm going to repeat some things I've already said on discord to have them here more publicly.

What to hash

I think hashing just the video ID is fine for now. I think the service should be added as a separate column (when needed). The service they are getting info from isn't a large privacy concern, and this would more easily allow it to be split into a different database or server in the future depending on growth.

Prefixing

Like @girst, I think there should be minimums, but allowing it to be changed client-side isn't a bad idea.

Caching

I want to implement some form of caching in the future but ideally this will hold the actual data instead of just hashes. This would only apply to videos older than some set time period.

The issue with setting up caching is that fetching the upload time actually takes time. This means that sponsor fetching would be slower for every video, as it would fetch the upload time, wait for a response, then fetch the times. This is why the current implementation of whitelisting sometimes misses zero-second sponsors (though in the latest beta I have added an option to wait before fetching, to resolve this issue in exchange for slower sponsor fetching).

I feel that caching can be dealt with separately from this.

@hatzel

hatzel commented Jun 24, 2020

I love that you are tackling privacy issues! I would, however, like to raise one concern regarding the effectiveness of this specific approach. I am not very active/experienced in the field of privacy-preserving technology, so take this with a grain of salt, but I am fairly certain the base idea holds merit.

Some assumptions I'll just make explicit:

  • Attack scenario: the server wants to know which videos a user (as identified by an IP) is watching
  • For simplicity: a user has a fixed/static and unique IP address

Obviously, hashing doesn't prevent the server from learning the set of videos that were requested. The hash is only a way of specifying a random set of videos, including the one of interest.
What the server will get from an extended video-watching session is something like this (assuming k=3):

| Access time | 9:08 | 9:21 | 9:35 |
| --- | --- | --- | --- |
| Video 1 id | `<video_id>` | `<video_id>` | `<video_id>` |
| Video 2 id | `<video_id>` | `<video_id>` | `<video_id>` |
| Video 3 id | `<video_id>` | `<video_id>` | `<video_id>` |

There are many scenarios in which it is easy to identify which sequence of videos is the correct one. For example, if part 1 and part 2 of a series are in columns 1 and 2, it is very unlikely that those were chosen by random chance. Similarly, if a request for a new video is made after 14 minutes, when only one video in the previous column was 14 minutes long, there is an increased likelihood of that video being the one that was watched.
More formally, this can be modelled as a sequence of conditional probabilities, i.e. what's the probability of 0we7kcmgDOw following yNEWkY9D2k4? Maximizing over those probability sequences yields the most likely sequence of videos that the user has watched.
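One way to write that down (my sketch, assuming a first-order Markov model of viewing behaviour): with candidate sets $S_1, \dots, S_T$ returned for the requests at times $1, \dots, T$, the attacker solves

$$\hat{v}_{1:T} = \operatorname*{arg\,max}_{v_1 \in S_1, \dots, v_T \in S_T} \; P(v_1) \prod_{t=2}^{T} P(v_t \mid v_{t-1})$$

where the transition probabilities could plausibly be estimated from public signals such as YouTube's related-video suggestions.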

From a practical perspective, it could be argued that this is still an improvement in privacy, even if an attack like this is sometimes possible (after all, it still requires a lot of effort). I would, however, also like to at least mention the counterpoint that implementing such a scheme could give false confidence in the privacy afforded by this service.

Keep in mind that human intuition on probability can be very misleading so maybe this attack quickly becomes infeasible. Some actual calculations might be required to assess the feasibility of this approach.

@ajayyy
Owner

ajayyy commented Sep 4, 2020

#127

@ajayyy ajayyy closed this as completed Sep 4, 2020
@ajayyy
Owner

ajayyy commented Sep 4, 2020

Still have to test performance before enabling it globally, so it will be just an option for now. I might try enabling it for a certain percentage of all requests (50%) by default and see how it goes.

@girst
Author

girst commented Sep 4, 2020 via email

@ajayyy
Owner

ajayyy commented Sep 4, 2020

@girst Yes, it's about the addon. I have noticed that the public database size doubles with this change, so I am going to look into splitting this into a separate database file. It might not be live for a bit.

@ajayyy
Owner

ajayyy commented Sep 15, 2020

@girst The API is now live on the main server

@girst
Author

girst commented Sep 15, 2020 via email

@ajayyy
Owner

ajayyy commented Sep 15, 2020

Sorry about that, should be good now

@girst
Author

girst commented Sep 15, 2020 via email
