
Preserve user's privacy with k-anonymity #25

Closed
girst opened this issue Aug 22, 2019 · 38 comments


@girst

girst commented Aug 22, 2019

As I understand it, currently the user submits a video ID (e.g. dQw4w9WgXcQ) and gets back a single JSON object.

Here I propose to add a new endpoint, e.g. /api/anonymousGetVideoSponsorTimes that takes the first n (e.g. 5) characters of a hash (sha1 is fine) and returns a list of possible results, like in the example below.

This approach is used in Troy Hunt's Have I Been Pwned API; see https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/ and https://blog.cloudflare.com/validating-leaked-passwords-with-k-anonymity/ for example.

input: { hash_prefix: <sha1sum("dQw4w9WgXcQ").substr(0,5)> }
(e.g. { hash_prefix: '3dd08' })

output:

[
 {
  videoID: 'dQw4w9WgXcQ',
  sponsorTimes: array [float],
  UUIDs: array [string] //The ID for this sponsor time, used to submit votes
 },
 {
  videoID: 'ah20943fdhj7',
  sponsorTimes: array [float],
  UUIDs: array [string] //The ID for this sponsor time, used to submit votes
 },
// ...
]

Since YouTube IDs are sparse, and furthermore SponsorBlock only has a small fraction of IDs indexed, each query will return only a small number of results, if any. If result lengths were to get out of hand in the future, it would be easy to increase the number of input characters required.

For performance reasons, the database should grow a new column, sha1sum. A pseudo-SQL query for such a request might look like this:

SELECT * FROM sponsorTimes WHERE sha1sum LIKE '3dd08%'

Which hashing algorithm is used is not very important, as the user will only send a fraction of the hash to the server. SHA1 has a reasonable length, I'd say. (let's avoid MD5, though ;-) )
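To make the flow concrete, here is a minimal client-side sketch (my assumptions: a runtime with Node's crypto and fetch, a placeholder server URL, and the endpoint name proposed above, which doesn't exist yet):

```ts
import { createHash } from "crypto";

// Shape of one entry in the proposed response (from the example above).
interface SponsorEntry {
  videoID: string;
  sponsorTimes: number[];
  UUIDs: string[];
}

async function getSponsorTimesAnonymously(videoID: string): Promise<SponsorEntry | undefined> {
  // Send only the first 5 hex characters (20 bits) of the hash.
  const prefix = createHash("sha1").update(videoID).digest("hex").substring(0, 5);
  const res = await fetch(
    `https://sponsor.example/api/anonymousGetVideoSponsorTimes?hash_prefix=${prefix}`
  );
  const candidates: SponsorEntry[] = await res.json();
  // The server never learns which candidate we actually wanted; filter locally.
  return candidates.find((entry) => entry.videoID === videoID);
}
```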

@girst
Author

girst commented Aug 22, 2019

Technically, no new endpoint would be required: if the input has a hash_prefix key, query by sha1 prefix; if it has a videoID key, query by video ID.

Storing hashes as binary blobs instead of strings is likely more efficient too; see for example https://stackoverflow.com/a/24011877
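To illustrate the dispatch idea, a sketch (assuming an Express-style handler and a better-sqlite3-style API; this is illustrative, not the actual server code):

```ts
import express from "express";
import Database from "better-sqlite3";

const app = express();
const db = new Database("sponsorTimes.db");

app.get("/api/getVideoSponsorTimes", (req, res) => {
  if (typeof req.query.hash_prefix === "string") {
    // k-anonymous path: return every row whose hash starts with the prefix.
    const rows = db
      .prepare("SELECT * FROM sponsorTimes WHERE sha1sum LIKE ?")
      .all(req.query.hash_prefix + "%");
    res.json(rows);
  } else if (typeof req.query.videoID === "string") {
    // Existing behavior: exact lookup by video ID.
    const rows = db
      .prepare("SELECT * FROM sponsorTimes WHERE videoID = ?")
      .all(req.query.videoID);
    res.json(rows);
  } else {
    res.status(400).send("hash_prefix or videoID required");
  }
});
```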

@ajayyy
Owner

ajayyy commented Aug 23, 2019

Smart. I don't know how well this will work with how few videoIDs are in this DB, but it definitely could work.

@girst
Author

girst commented Aug 23, 2019 via email

@ajayyy
Owner

ajayyy commented Aug 23, 2019

Ah, true. I think a blob could work, since it could be converted from a string to a blob (base64 or something).

@ajayyy
Owner

ajayyy commented Aug 23, 2019

One thing to note about this: it could be quite expensive, since it doesn't just send all sponsors for one video; it uses the weighted random generator.

@girst
Author

girst commented Aug 23, 2019 via email

@ajayyy
Owner

ajayyy commented Aug 23, 2019

Go for it! I think setting it up to start storing sha1sums is a good start; then we can add the rest after. Is there a reason to use sha1 over sha256?

@girst
Author

girst commented Aug 23, 2019 via email

@ajayyy
Owner

ajayyy commented Aug 23, 2019

That makes sense. So really, the only disadvantage is more data throughput and more calculations server-side.

@girst
Author

girst commented Aug 23, 2019 via email

@girst
Author

girst commented Aug 24, 2019 via email

@afontenot

afontenot commented Aug 30, 2019

I was going to file an issue suggesting exactly this, but I have one concern: I don't think there are likely to be enough videos submitted to make this sufficiently anonymous. It seems to me that, unless and until the database gets huge, the best thing to do would be to just have each client download the whole database and maintain it locally. It's currently only ~3MB, and I expect it will stay at a manageable size for the foreseeable future.

To update (probably done every time a video is loaded), you wouldn't need to download the whole database again; just add an API endpoint for getting all entries since a given timestamp.

It's not a 100% fair comparison, but this is what other ad-blocking software does: you don't send every URL you visit to a remote server; you download and sync a list that tells you what to block.
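A sketch of what that incremental sync could look like (the endpoint name, fields, and server URL here are hypothetical):

```ts
// Hypothetical incremental-sync client: keep a local copy of the database
// and only fetch entries added or updated since the last sync, instead of
// querying the server per video.
const BASE_URL = "https://sponsor.example"; // placeholder server

interface DbEntry {
  videoID: string;
  sponsorTimes: number[];
  UUIDs: string[];
}

const localDb = new Map<string, DbEntry>();

async function syncSince(lastSync: number): Promise<number> {
  const res = await fetch(`${BASE_URL}/api/getEntriesSince?timestamp=${lastSync}`);
  const entries: DbEntry[] = await res.json();
  for (const entry of entries) {
    localDb.set(entry.videoID, entry); // lookups now never leave the device
  }
  return Date.now(); // persist as the next lastSync value
}
```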

@girst
Author

girst commented Aug 31, 2019 via email

@ajayyy
Owner

ajayyy commented Aug 31, 2019

Is there an easy way to have a database in the browser? I was thinking of setting up a variant of the server that does what you are describing, and then making it so the extension can pull from that local server instead.

@girst
Author

girst commented Aug 31, 2019 via email

@ajayyy
Owner

ajayyy commented Aug 31, 2019

Yea, the other issue is that localStorage has a very low maximum storage limit, so it would not scale well.

@ajayyy
Owner

ajayyy commented Sep 4, 2019

I think an easier way would be to use an all-lowercase search of the video IDs (not just the first letters, but a random subset of letters). They seem random enough, and when you use 3 characters, you usually get a few results.

@girst
Author

girst commented Sep 4, 2019 via email

@ajayyy
Owner

ajayyy commented Sep 4, 2019

Isn't that the same probability, just with more precision the more characters you use? If someone requests "sh3", their video could be anything that contains "Sh3", "sH3", etc.

Maybe I am just misunderstanding.

@girst
Author

girst commented Sep 4, 2019 via email

@phiresky

phiresky commented Dec 9, 2019

Currently, there are 25k entries in the DB. This means that on average log2(25k) ≈ 15 bits are needed to uniquely identify one video. If we want every request to pull e.g. 16 videos for privacy, then we need to query by a prefix of log2(25k) - log2(16) ≈ 11 bits.

YouTube video IDs are base64 strings, so they carry 6 bits per character. Hashing the ID doesn't matter much for privacy, I think; even though YouTube video IDs are pretty obviously random, it's probably still better to hash them to be safe about the distribution. It also allows for adding other websites later (e.g. hashing the whole normalized URL, which is probably a better idea anyway) and covers the case where the YouTube ID algorithm changes.

I wouldn't use BLOB in sqlite for this, since it's harder to filter by; better to use hex strings.

Hex strings carry log2(16) = 4 bits per character.

So right now it would be pretty easy to e.g. add a videoID_sha256 column, then do queries by the first three characters of the hash (12 bits), which would return about 8 videos on average.

sqlite3 is able to use indices for prefix GLOB queries (select * from sponsorTimes where videoID_sha256 GLOB 'a0e*'), so the performance impact on the server would be minimal.

For each doubling of the database size, we in theory want to increase the length of the prefix we are looking for by one bit. If we go by hex-character prefixes this isn't really possible, since we can only increase the prefix every 4 bits, i.e. for every 16× growth of the DB.
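The rule of thumb above as a tiny helper (a sketch, assuming uniformly distributed hashes; note it floors where the comment above rounds log2(25k) up to 15):

```ts
// How many prefix bits to query by, given the DB size and a desired
// anonymity set of roughly k videos per response.
function prefixBits(dbSize: number, k: number): number {
  return Math.max(0, Math.floor(Math.log2(dbSize) - Math.log2(k)));
}

prefixBits(25_000, 16); // 10 (11 if you round log2(25k) up to 15, as above)
prefixBits(25_000, 8);  // 11, i.e. roughly three hex characters
```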

@phiresky

phiresky commented Dec 9, 2019

Alternatively to using sqlite3 GLOB, it would be possible to filter by an exact number of prefix bits by

(a) just storing the sha hash prefix as a binary string (pretty easy, since we'll never need more than 32 prefix bits, which stored as a string of '0'/'1' characters would take 32 bytes per entry), or

(b) doing range queries on the hex strings: convert the sha256 to a bigint, zero out all bits apart from the first 11 (= lower bound), then add one at the 11-bit boundary (= upper bound), e.g. with shift = 256 - 11, lowerbound = (shaint >> shift) << shift and upperbound = lowerbound + (1 << shift), and then doing the query as
select * from sponsorTimes where videoID_sha256 >= lowerbound and videoID_sha256 < upperbound (abusing alphabetic text ordering here, after converting the bounds back to fixed-width hex)

Probably neither of these is worth it, and just filtering by sha256 prefix is fine, especially since many videos we're looking for aren't even in the result set; using the first 12 bits is fine until we switch to 16 bits once the DB has grown to something like 128k entries.
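For the record, a sketch of option (b) with BigInt (my assumptions: SHA-256 hashes stored as 64-character lowercase hex; it ignores the wrap-around when the prefix is all ones):

```ts
// Compute [lower, upper) hex-string bounds covering every hash sharing the
// first prefixBits bits, so a plain ordered index can answer the query.
function prefixRange(hashHex: string, prefixBits: number): [string, string] {
  const shift = BigInt(hashHex.length * 4 - prefixBits);
  const lower = (BigInt("0x" + hashHex) >> shift) << shift; // zero the tail bits
  const upper = lower + (1n << shift); // smallest value with the next prefix
  return [
    lower.toString(16).padStart(hashHex.length, "0"),
    upper.toString(16).padStart(hashHex.length, "0"),
  ];
}

// const [lo, hi] = prefixRange(videoIdSha256, 11);
// SELECT * FROM sponsorTimes WHERE videoID_sha256 >= :lo AND videoID_sha256 < :hi
```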

@ajayyy
Owner

ajayyy commented Dec 9, 2019

I think 16 is too much. Maybe there can be an option to do that, but I think the default should be lower.

@devnoname120

Note that this wouldn't be able to fix the call to invidio.us:
https://github.com/ajayyy/SponsorBlock/blob/1abc1b9b28d59d2020d26810156cbdfbe3e8f7fa/content.js#L441

@ajayyy
Owner

ajayyy commented Jan 5, 2020

@devnoname120

ajayyy/SponsorBlock#189 (comment)

@ajayyy
Owner

ajayyy commented Feb 11, 2020

Invidious API usage has been removed.

@Zegnat
Contributor

Zegnat commented May 22, 2020

I came upon SponsorBlock today for the first time, and a browse around led me to this issue. I personally like to worry about privacy and would not want to send every single video ID to a SponsorBlockServer instance (unless maybe I was operating one within my own network). So I considered looking into this and maybe doing a PR. But first I would like to share some thoughts.

To start, I would like to point at an implementation of k-anonymity for data queries that pre-dates Pwned Passwords: the “Safe Browsing” (Chrome) and “Block dangerous and deceptive content” (Firefox) systems. Information about it is a bit spread around, but some good resources are: the API documentation (look for “Update API”), more links collected on the Mozilla Wiki, and a study by Gerbet et al.

I think this is a good implementation to keep in the back of our heads when continuing the discussion, because of the many parallels. Safe Browsing is set up to facilitate a client (web browser) wanting to check a centralised repository (Google’s collection of malicious links) for something about a resource (website) without telling the repository about every resource the client uses. Now make the client a video player, the repository SponsorBlockServer, and the resource a YouTube video, and we are in the exact same situation.

Safe Browsing also has a number of points that address some arguments raised by this discussion. You do not need to study it before reading on, just do not be surprised if I reference it again (and again, and again) 😄

I will consider the threat vector as: the SponsorBlockServer will get to know every video the user watches. Leave other discussions for later. (If we do not trust the network, we may need obfuscation techniques like padding. If we do not trust the client … well that is a separate can of worms.)

Hashing

I would say hashing is absolutely required. It is the only way to ensure no actual information about the video ID is shared with SponsorBlockServer; even part of a video ID is still information. Furthermore, are we sure YouTube video IDs are randomly distributed throughout the possible ID space? Is a - just as likely to be part of an ID as an a? I do not have a big enough dataset to run this analysis, and I do not think we need to do so at all. Hashing eliminates that problem.

It does not really matter much which hashing algorithm is chosen. As long as it upholds the default assumptions for a hash, we can use it to map the two value spaces to each other: the set of IDs known to SponsorBlockServer and the set of IDs that can possibly exist on YouTube.

Pwned Passwords uses SHA-1, but Troy has also explained that he is not really hashing to seriously protect the data, only to make sure “any personal info” like email addresses “in the source data is obfuscated”. People were also “cracking the passwords for fun”, and succeeding. (Multiple discussions have happened around this; quotes taken from a February 2018 blog post.)

I am honestly not sure it matters much for prefixes in our specific case, but Safe Browsing uses SHA-256 and, as I said, might be closer to our flow. I also do not expect SponsorBlock clients to exist on many platforms that lack access to SHA-256 primitives.

So let us say we hash every video identifier with a SHA-256 function. Let us also assume that storage and querying the database is trivial. (Because any debate about TEXT vs BLOB and optimisation of WHERE statements can be had when actual benchmarking can be done.)

What to hash

This is something that was only very briefly touched on in the discussion here, by @phiresky:

[…] it allows for adding other websites later (e.g. hash whole normalized URL which is probably a better idea anyways) and if youtube id algo changes.

This is a step that once again exists in the Safe Browsing API: they do URL canonicalisation. However, my recommendation would be not to do that. In my experience, anything close to trying to normalise a URL leads to differences between clients. As a quick note: RFC 3986 includes steps for “Normalization”, the WHATWG URL standard includes steps for “equivalence”, and Safe Browsing has steps for “Canonicalization”. If you are unlucky, the language you are writing your client in will have access to anything from 0 to 100 implementations of any of these.

Not only that, you would also end up having to specify additional steps per platform. YouTube has lots of possible URLs pointing at the same video, sometimes from entirely different domains (youtu.be, anyone?).

Instead stick to clear identifiers supplied by the platforms. The YouTube video ID is much less ambiguous than any URL will be. It is also likely that a client already has logic to extract the video ID from a bunch of different sources. Letting them work with that is better.

If we really value compatibility with other platforms, simply generating a URI of our own suffices. Instruct clients to hash yt: followed by the YouTube video ID. That would allow simple future expansion without messing with URLs.
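Something like this (a sketch using Node's crypto; the yt: scheme is just the convention proposed here, and the function name is made up):

```ts
import { createHash } from "crypto";

// Hash a tiny URI of our own ("yt:" + video ID) instead of a normalized URL.
function videoIdentifierHash(service: string, id: string): string {
  return createHash("sha256").update(`${service}:${id}`).digest("hex");
}

videoIdentifierHash("yt", "dQw4w9WgXcQ"); // 64 hex characters, service-scoped
```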

Hash prefix length

Setting the length of the prefix is a balancing act. The longer the prefix, the less privacy is given. This goes the full range from 0 to 256 bits (in the case of SHA-256).

If the client asks for a list of possibilities with a 0 bit long prefix, it is really just asking for the entire database. This gives complete privacy to the user, as the server gets no clues about what video was being accessed. But of course it is a bit annoying for all parties involved, as the client would constantly request lots and lots of unnecessary data.

If the client asks for a list of possibilities with a 256 bit long prefix, it is really just sending the full hash to the server. This gives almost¹ no privacy to the user, as the server could match this hash 1:1 with a video ID to know exactly what was being watched. So this is no win for the user.

The number we are aiming for is somewhere in the middle. I am, however, not convinced by @phiresky’s calculation of the prefix length; it seems to be argued from the data perspective and not from the user perspective. I do not even think there is any privacy difference depending on how many results the server returns: there is no relation I can see between the number of results and the threat vector of not wanting to tell the server what video I am watching.

From the server perspective, I do not think we care. There is already an endpoint that allows the client to submit the exact video ID, so obviously we are OK with the 256 bit variant. The database is not secret and in fact readily downloadable, so we are also OK with the 0 bit variant.

The only argument from the server side I can think of is to protect against a number of DoS problems. When the database grows, it may not always be OK to support continuous downloads of the entire thing. Server instances may want to scale up the minimum hash length from 0 to a more manageable number in accordance with total database growth, maybe in such a way that response sizes stay under a certain number of bytes.

From the client perspective, I think we want to send as little information as possible. This is where the user is and where we care about the privacy. But there are also a number of platform limitations that need to be considered. Like @ajayyy mentioned before about web browsers, some platforms may put limitations on storage. Other platforms may be on limited bandwidth.

So for a first implementation I was thinking: why not leave it completely up to the client what prefix length it sends and only pick a bottom limit for the server?

We have 133417 sponsorTimes in the latest DB (just downloaded from the mirror). If we aim to never have an API response include more than 1000, we need a minimum length that cuts our hash space into approximately 134 blocks. That is just the first 8 bits of the hash (or the first 2 hexadecimal characters).²
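Spelled out, the back-of-the-envelope arithmetic (assuming uniformly distributed hashes):

```ts
// 133417 entries and at most ~1000 per response means cutting the hash
// space into at least 134 blocks: ceil(log2(133417 / 1000)) = ceil(7.06) = 8.
const minPrefixBits = Math.ceil(Math.log2(133417 / 1000)); // 8 bits = 2 hex chars
```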

If the client feels this is problematic because of storage, memory, or bandwidth concerns, it can opt to send a longer prefix whilst sacrificing some of its user’s privacy.

Future of prefixes

The Safe Browsing API once again has an interesting system. The client can download a list of prefixes straight from the server. This is interesting for a number of reasons:

  1. If the client encounters a resource where the hash prefix is not in the list at all, it does not need to make any requests. 100% privacy!
  2. The server can make clear to the client exactly what prefixes it will respond to. If a certain 2 bit prefix only includes 10 videos, it does not need to split this up further. If another 8 bit prefix includes 10000 videos maybe it helps to only have 10 bit prefixes defined there.

Now this of course assumes the client has a way to store this list of prefixes and implements logic to keep it up to date. Therefore I am not entirely sure it makes sense to make this a must from day one.

Pull request

Like I said at the start, I am considering looking into this and getting started on a pull request. In the meantime, I would love to hear everyone’s input on my thoughts, whether you think I am completely missing the mark, right on it, or anywhere in between. (Shall we rate it 0 to 256 bits? 😉)

You will also find me in the SponsorBlock Discord as Zegnat if you would like to discuss any of the above with me outside of this issue.


¹: “almost”, because if the video ID was not already known from the list, the hash would still need to be reversed
²: this is just math. In reality we may have some uneven distributions through sheer randomness of the data. As our full dataset is still relatively small, it would be simple to run the numbers and pick a good cut-off point.

@girst
Author

girst commented May 22, 2020 via email

@ajayyy
Owner

ajayyy commented May 22, 2020

Sounds great!

I'm going to repeat some things I've already said on discord to have them here more publicly.

What to hash

I think hashing just the video ID is fine for now. I think the service should be added as a separate column (when needed). The service they are getting info from isn't a large privacy concern, and this would more easily allow it to be split into a different database or server in the future depending on growth.

Prefixing

Like @girst, I think there should be minimums, but allowing it to be changed client-side isn't a bad idea.

Caching

I want to implement some form of caching in the future but ideally this will hold the actual data instead of just hashes. This would only apply to videos older than some set time period.

The issue with setting up caching is that fetching the upload time actually takes time. This means that sponsor fetching would be slower for every video, as it would fetch the upload time, wait for a response, then fetch the times. This is why the current implementation of whitelisting sometimes misses zero-second sponsors (though in the latest beta I have added an option to wait before fetching, to resolve this issue in exchange for slower sponsor fetching).

I feel that caching can be dealt with separately from this.

@hatzel

hatzel commented Jun 24, 2020

I love that you are tackling privacy issues! I would, however, like to raise one concern regarding the effectiveness of this specific approach. I am not very active/experienced in the field of privacy-preserving technology, so take this with a grain of salt, but I am fairly certain the base idea holds merit.

Some assumptions I'll just make explicit:

  • Attack scenario: the server wants to know which videos a user (as identified by an IP) is watching
  • For simplicity: a user has a fixed/static and unique IP address

Obviously, hashing doesn't prevent the server from learning the set of videos that were requested. The hash is only a way of specifying a random set of videos, including the one of interest.
What the server will get from an extended video-watching session is something like this (assuming k=3):

| Access time | 9:08 | 9:21 | 9:35 |
| --- | --- | --- | --- |
| Video 1 id | `<video_id>` | `<video_id>` | `<video_id>` |
| Video 2 id | `<video_id>` | `<video_id>` | `<video_id>` |
| Video 3 id | `<video_id>` | `<video_id>` | `<video_id>` |

There are many scenarios in which it is easy to identify which sequence of videos is the correct one. For example, if part 1 and part 2 of a series are in columns 1 and 2, it is very unlikely that those were chosen by random chance. Similarly, if a request for a new video is made after 14 minutes, when only one video in the previous column was 14 minutes long, there is an increased likelihood of that video being the one that was watched.
More formally, this can be modelled as a sequence of conditional probabilities, i.e. what's the probability of 0we7kcmgDOw following yNEWkY9D2k4? Maximizing over those probability sequences yields the most likely sequence of videos that the user has watched.
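One way to write that down (my sketch, assuming a first-order Markov model of viewing behaviour): with candidate sets $S_1, \dots, S_T$ returned for the requests at times $1, \dots, T$, the attacker solves

$$\hat{v}_{1:T} = \operatorname*{arg\,max}_{v_1 \in S_1, \dots, v_T \in S_T} \; P(v_1) \prod_{t=2}^{T} P(v_t \mid v_{t-1})$$

where the transition probabilities could plausibly be estimated from public signals such as YouTube's related-video suggestions.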

From a practical perspective, it could be argued that this is still an improvement in privacy, even if an attack like this is sometimes possible (after all, it still requires a lot of effort). I would, however, also like to at least mention the counterpoint that implementing such a scheme could give false confidence in the privacy afforded by this service.

Keep in mind that human intuition on probability can be very misleading so maybe this attack quickly becomes infeasible. Some actual calculations might be required to assess the feasibility of this approach.

@ajayyy
Owner

ajayyy commented Sep 4, 2020

#127

@ajayyy ajayyy closed this as completed Sep 4, 2020
@ajayyy
Owner

ajayyy commented Sep 4, 2020

Still have to test performance before enabling it globally, so it will be just an option for now. I might try enabling it for a certain percentage of all requests (50%) by default and see how it goes.

@girst
Author

girst commented Sep 4, 2020 via email

@ajayyy
Owner

ajayyy commented Sep 4, 2020

@girst Yes, it's about the addon. I have noticed that the public database size doubles with this change, so I am going to look into splitting this into a separate database file. It might not be live for a bit.

@ajayyy
Owner

ajayyy commented Sep 15, 2020

@girst The API is now live on the main server

@girst
Author

girst commented Sep 15, 2020 via email

@ajayyy
Owner

ajayyy commented Sep 15, 2020

Sorry about that, should be good now

@girst
Author

girst commented Sep 15, 2020 via email
