Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check revoked certificates with OCSP (#2) #3

Merged
merged 25 commits into from
Apr 25, 2020

Conversation

FabienChaynes
Copy link
Collaborator

(#2)

This PR adds a test for revoked certificates using OCSP.
It does not use CRL.

The main inspiration was from the ruby doc.

I used https://www.digicert.com/digicert-root-certificates.htm to test it on several certificates.

Copy link
Owner

@jarthod jarthod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great, thanks !

I've ran this locally against all updown.io URLs (using the rake crosstest:ssl I made) to see how many do we have and verify if they are correct. I didn't get saw any revoked cert which is sad and good at the same time ;)

What I noticed is that hundreds of them failed with "SSL certificate test failed: Net::ReadTimeout" (or "end of file reached" or "Connection reset by peer" or "execution expired") so I guessed this is either because the OCSP endpoint is not working, or because I'm being throttled. By testing some of them again alone it worked so I think most of them are throttling. That's probably ok because in production the tests wouldn't run this fast and from 8 IPs.

One thing though is that it would be good to know that it's the OCSP test which failed and not the website which did a timeout, so maybe change the message a bit so we can know (with a grep for example) if it's an revocation test fail or a website fail.

And also there's the question of what to do when the OCSP endpoint doesn't work (not throttling). Because currently the certificate would always be considered untested (nil) and return this test error. I guess we'll have to see if that's a common case or not. And either tell the issuer to fix their OCSP, fix our code, or maybe hardcode that it's not working to ignore the OCSP test for them ☹. We'll check that once we have production numbers to back it up.

Nice work 👏 and welcome to the dirty/legacy world of PKI (Private Key Infrastructures)

@FabienChaynes
Copy link
Collaborator Author

One thing though is that it would be good to know that it's the OCSP test which failed and not the website which did a timeout, so maybe change the message a bit so we can know (with a grep for example) if it's an revocation test fail or a website fail.

I see. Maybe we should just not fail at all if the OCSP test isn't conclusive and keep going. It feels kinda lame to not be able to detect other problems when we can just because we can't see whether or not the cert was revoked.

But as you said, we can check first if it's a common case or not.
Even if it's not though, slow queries and downtimes happen, and it doesn't feel right to not account for those cases.

lib/ssl-test.rb Outdated Show resolved Hide resolved
@FabienChaynes FabienChaynes changed the title [WIP] Check revoked certificates with OCSP (#2) Check revoked certificates with OCSP (#2) Jun 2, 2019
Copy link
Owner

@jarthod jarthod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes if it happens too much it's ok to skip OCSP when it fails, but it's lame so we'll try the optimistic way first ;)

But as you said, we can check first if it's a common case or not.
Even if it's not though, slow queries and downtimes happen, and it doesn't feel right to not account for those cases.

Indeed, but if it's temporary it's ok, updown handles going from state valid to unknown to valid without notifications. If it's permanent that's another story.

Also one risk (for updown) if we do ignore OCSP result when down, is that if the cert is revoked (updown sends alert) and then the OCSP endpoint die, we'll say the certificate is valid and send another alert, which means false positive and a lot of spam if the OCSP endpoints flaps.

lib/ssl-test.rb Outdated Show resolved Hide resolved
lib/ssl-test.rb Outdated Show resolved Hide resolved
lib/ssl-test.rb Show resolved Hide resolved
@jarthod
Copy link
Owner

jarthod commented Jun 3, 2019

Oh and by the way if we could support OCSP stapling that would be great (and reduce the number of requests). OCSP stapling basically attach the OCSP response server side during the TLS handshake so you don't have to query the OCSP server, you can just validate the embedded OCSP signature. It's enabled on https://updown.io for example. It's a TLS extension field so we may be able to find it somewhere in the store_context maybe? I didn't found any ruby examples of course..

lib/ssl-test.rb Outdated Show resolved Hide resolved
@FabienChaynes
Copy link
Collaborator Author

And also there's the question of what to do when the OCSP endpoint doesn't work (not throttling).

In the current version of the code, we'll only explicitly state that the OCSP test failed if there was a connection error.
If something else happens (the OCSP response was invalid, we couldn't verify that it had a valid signature, the nonce is present but different from the request, the response/certificate serial are different), we'll just keep going without checking the certificate revocation. Is it what you had in mind or you also want to fail for these cases?

@FabienChaynes
Copy link
Collaborator Author

Oh and by the way if we could support OCSP stapling that would be great

I couldn't find a way to support it. This seems to say that it doesn't.

@jarthod
Copy link
Owner

jarthod commented Apr 20, 2020

I couldn't find a way to support it. This seems to say that it doesn't.

Ok fine

In the current version of the code, we'll only explicitly state that the OCSP test failed if there was a connection error. If something else happens (the OCSP response was invalid, we couldn't verify that it had a valid signature, the nonce is present but different from the request, the response/certificate serial are different), we'll just keep going without checking the certificate revocation. Is it what you had in mind or you also want to fail for these cases?

Well that's good to start with but I still want to be able to monitor this. Here is an idea: maybe we can return the validation as ok but still return a message with "OCSP error..." (should say where/why it failed). So when valid = true the message can be considered a notice or warning by the user.

In my case updown could instrument this message in kibana first to measure the number of occurrences of each error (and on which domain) and maybe later display this message to users as a warning without considering the certificate invalid.

WDYT?

@FabienChaynes
Copy link
Collaborator Author

FabienChaynes commented Apr 20, 2020

Sounds reasonable. As long as we're not hard-failing just when we can't check the OCSP, I'm ok with it.

So in the current state:

  • When we have an unhandled exception (which will happen if we can't reach the OCSP responder), we will return [nil, "OCSP test failed: #{message}"] = "we don't know how the SSL test went for this site, here's why the OCSP test failed"
  • When the certificate was revoked, we will return [false, "SSL certificate revoked: #{message} (revocation date: #{revocation_date})", cert] = "The SSL test failed for this site because of a cert revocation"
  • When we received the OCSP response but something isn't going well, we will return [true, "OCSP test couldn't be performed: #{message}", cert] = "The SSL test for this site went well, we couldn't perform the OCSP test, here's why"

Google and Updown are already websites in the last category. Here's a comparison with also a revoked cert and a test without any problem:

irb(main):001:0> SSLTest.test("https://google.com/")
=> [true, "OCSP test couldn't be performed: OCSP response failed: Your request is unauthorized", #<OpenSSL::X509::Certificate: subject=#<OpenSSL::X509::Name CN=*.google.com,O=Google LLC,L=Mountain View,ST=California,C=US>, issuer=#<OpenSSL::X509::Name CN=GTS CA 1O1,O=Google Trust Services,C=US>, serial=#<OpenSSL::BN:0x0000000007714920>, not_before="2020-04-07T09:35:04.000Z", not_after="2020-06-30T09:35:04.000Z">]

irb(main):002:0> SSLTest.test("https://updown.io/")
=> [true, "OCSP test couldn't be performed: OCSP response failed: Your request is unauthorized", #<OpenSSL::X509::Certificate: subject=#<OpenSSL::X509::Name CN=updown.io>, issuer=#<OpenSSL::X509::Name CN=Let's Encrypt Authority X3,O=Let's Encrypt,C=US>, serial=#<OpenSSL::BN:0x0000000007752018>, not_before="2020-03-29T20:39:27.000Z", not_after="2020-06-27T20:39:27.000Z">]

irb(main):003:0> SSLTest.test("https://github.com/")
=> [true, nil, #<OpenSSL::X509::Certificate: subject=#<OpenSSL::X509::Name CN=github.com,O=GitHub\, Inc.,L=San Francisco,ST=California,C=US,serialNumber=5157550,jurisdictionST=Delaware,jurisdictionC=US,businessCategory=Private Organization>, issuer=#<OpenSSL::X509::Name CN=DigiCert SHA2 Extended Validation Server CA,OU=www.digicert.com,O=DigiCert Inc,C=US>, serial=#<OpenSSL::BN:0x000000000776f208>, not_before="2018-05-08T00:00:00.000Z", not_after="2020-06-03T12:00:00.000Z">]

irb(main):004:0> SSLTest.test("https://revoked.badssl.com/")
=> [false, "SSL certificate revoked: The certificate was revoked for an unknown reason (revocation date: 2019-10-07 20:30:39 UTC)", #<OpenSSL::X509::Certificate: subject=#<OpenSSL::X509::Name CN=revoked.badssl.com,O=Lucas Garron Torres,L=Walnut Creek,ST=California,C=US>, issuer=#<OpenSSL::X509::Name CN=DigiCert SHA2 Secure Server CA,O=DigiCert Inc,C=US>, serial=#<OpenSSL::BN:0x000000000778b020>, not_before="2019-10-04T00:00:00.000Z", not_after="2021-10-08T12:00:00.000Z">]

I think it's coming from the fact that the chain has 3 certs:

  • updown.io issued by Let's Encrypt Authority X3
  • Let's Encrypt Authority X3 issued by DST Root CA X3
  • DST Root CA X3 issued by DST Root CA X3 (root cert)

And currently, I'm asking the root OCSP responder to check if updown's certificate was revoked. But they don't know it. I probably need to ask Let's Encrypt OCSP responder to check if updown's certificate was revoked and then the root OCSP responder to check if Let's Encrypt's certificate was revoked (handled in 1511899).

@jarthod
Copy link
Owner

jarthod commented Apr 21, 2020

Thanks, for the error case ([nil, "SSL certificate test failed: #{e.message}"]) now that we can have OCSP errors I think we should include the cert in the output (because now it can be present), if it's nil so be it. Just so in case of throttling failure for example we still have the cert:

> SSLTest.test("https://shard1-us1.lab.digital.ringcentral.com/connect")
=> [nil, "OCSP test failed: execution expired"]

(reproduced that by spamming a bit ^^)

I've ran my rake crosstest:ssl again using all production domains and the last version of your code, got a handful of execution expired, read timeout, connection reset by peer) mostly because of throttling (I tried some of them again after and it worked well) but no weird errors \o/

So looking pretty good so far. One thing I don't like is that's we're gonna spam the endpoints even more with intermediate certificates, which will always be the same, that's really a case where caching would help :/

OCSP is made to be cached heavily, and that's why they include the values you called _this_update and _next_update. Here is an example from google.com cert:

[#<OpenSSL::OCSP::CertificateId:0x00007f7791e86fd8>, 0, 21996, nil, "2020-04-20T04:43:55.000Z", "2020-04-27T04:43:55.000Z", []]

We can see they say they cache for 7 days, which means that we should be able to cache the result until 2020-04-27T04:43:55.000Z without loosing any speed.

IMO we should do that at least for intermediary certs (to keep the cache small) but why no for first certs later. Something really simple in-memory with a Ruby hash is enough IMO, the key would be a hash
or identifier of the cert (something not too big if possible) and the value would need to contain status, reason, revocation_time and next_updated I would say? if you check the cache and it's empty or next_update is in the past, we fetch again. What do you think?

@FabienChaynes
Copy link
Collaborator Author

FabienChaynes commented Apr 21, 2020

Ok for returning the cert when the check fails (66f6e59).

The cache is handled in b91d3a6 and is also there for the first cert. Unless you really think it's better to only do it for the intermediate certs to preserve the memory? I'm not 100% convinced but it'd probably be quick to remove anyway so up to you (just made me realize that if we keep it like that, I'll have to remove cert from the arguments because it's not used).

- Avoid performing useless operation when request cached
lib/ssl-test.rb Outdated Show resolved Hide resolved
lib/ssl-test.rb Outdated
chain[0..-2].each_with_index do |current_checked_cert, i|
# https://tools.ietf.org/html/rfc5280#section-4.1.2.2
# The serial number [...] MUST be unique for each certificate issued by a given CA (i.e., the issuer name and serial number identify a unique certificate)
unicity_key = [current_checked_cert.issuer.to_s, current_checked_cert.serial.to_s]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure but had the feeling that constructing an array here as hash key would be less efficient than concatenating the strings so I checked and it's the case:

require 'benchmark'

str1 = "yolo"
str2 = "yata"

N = 1_000_000

Benchmark.bm do |x|
  x.report("array") {
    h = {}
    N.times {
      key = [str1, str2]
      h[key] = {}
    }
  }
  x.report("string") {
    h = {}
    N.times {
      key = "#{str1}/#{str2}"
      h[key] = {}
    }
  }
end

# $ ruby bench.rb
#         user     system      total        real
#  array  0.714008   0.000000   0.714008 (  0.714285)
# string  0.180115   0.000000   0.180115 (  0.180154)

Not a big deal but better using a single string here

@jarthod
Copy link
Owner

jarthod commented Apr 22, 2020

Ok for returning the cert when the check fails (66f6e59).

Thanks!

The cache is handled in b91d3a6 and is also there for the first cert. Unless you really think it's better to only do it for the intermediate certs to preserve the memory? I'm not 100% convinced but it'd probably be quick to remove anyway so up to you.

I think it's fine, I measured it using crosstest with this between each test:

GC.start
cache = SSLTest.instance_variable_get('@ocsp_response_cache')
pid, size = `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i)
puts "Cache size: #{cache&.size} / Process: #{size} kB"

I confirmed with the version before caching that there's no leak, the process quickly rose to ~141MB and stayed there. Now testing the version with caching with 1000 SSL tests (some of them using the same domain) gave a process starting at ~149MB (not sure why, sometimes it's more sometimes less, ruby 🤷), but what matters is the increase and after 1000 sites tested, cache size of 366, the memory used by the process was still 149MB 🤔 so weird but it means it's totally OK this way (I saw it rising some times but decreasing after so I'm guessing the memory used by mongoid largely outrank this little hash) 👍

@jarthod
Copy link
Owner

jarthod commented Apr 22, 2020

I think after that little key fix, if you could find a couple tests with other revokation cases (ideally one with the intermediary revoked instead of the first cert) that would be great and we can try this in production :)

@FabienChaynes
Copy link
Collaborator Author

I wasn't sure but had the feeling that constructing an array here as hash key would be less efficient than concatenating the strings

I had the same feeling but I didn't benchmark and went for the array option because it felt cleaner. Also it was before knowing that the cert serial had to be an integer. If it was free, it might have caused collisions because we couldn't know where the name ended and where the cert serial started. But in the end it's not an issue.
I wasn't expecting such a big ratio.
Anyway, I'm convinced now, so done in 89e9047.


I made some internal methods private. Just kept test and test_ocsp_revocation public because I thought the latter could also be useful.


if you could find a couple tests with other revokation cases (ideally one with the intermediary revoked instead of the first cert) that would be great

Unfortunately I couldn't 🙁 All the ones I found were "revoked for an unknown reason" (never at the intermediate cert level) or were even expired so the revocation didn't have the time to show up.

I wanted to fake the different cases, but it looks like minitest needs another gem to do this...

The one about "returns no error on valid SNI website" is now flapping on OCSP and I fixed another one because the error changed slightly.

"returns error on untrusted root" is actually returning "error code 19: self signed certificate in certificate chain" now too, but I didn't change this one.

@jarthod
Copy link
Owner

jarthod commented Apr 22, 2020

All right ! I prepared a branch on updown side (ssl-revocation) and ran one last big crosstest, I found two cases with ruby exception that we should probably fix while it's hot, it's from government sites from France and Polynésie Française mostly ^^:

  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://covid19.reserve-civique.gouv.fr/ (kx1n)
  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://www.demarches-simplifiees.fr (edes)
  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://histovec.interieur.gouv.fr (gqat)
  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://www.i-mata.gov.pf/ (cbgi)
  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://doc.projet.gov.pf/ (nnoe)
  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://www.paraoa.gov.pf/ (yz7l)
  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://www.hotufina.gov.pf/ (7f91)
  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://www.service-public.pf/ (vspg)
  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://www.otia.gov.pf/ (qfnt)
  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://www.service-public.pf/sipf/wp-admin/ (qgsf)
  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://gestion.projet.gov.pf (9osu)
  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://www.tiarama.gov.pf/ (4njq)
  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://www.turama.gov.pf/ (vh8s)
  nil (OCSP test failed: undefined method `value' for nil:NilClass) instead of true (): https://www.anonymisation.gov.pf/code-anonymat/avec-dn (fwwy)
  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://auth.gov.pf/auth/ (golz)
  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://www.mes-demarches.gov.pf/ (qetl)
  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://www.oini.gov.pf (obhz)
  nil (OCSP test failed: undefined method `[]' for nil:NilClass) instead of true (): https://www.tefenua.gov.pf/ (shcv)

They do not seem to flap so you should be able to reproduce them easily with the URL.

@FabienChaynes
Copy link
Collaborator Author

FabienChaynes commented Apr 22, 2020

that we should probably fix while it's hot

Definitely!

https://covid19.reserve-civique.gouv.fr/

Come on, a Covid-19 site as the first one 😂 You can't tell me that it was random!

So the errors were about missing info to get the OCSP responder. Most of them were the OCSP URI which was missing, for one of them the it was the extension containing it (authorityInfoAccess) which was missing. I added specs for both case and here's the results after the fix:

irb(main):004:0> SSLTest.test("https://covid19.reserve-civique.gouv.fr/")
=> [true, "OCSP test couldn't be performed: Missing OCSP URI in authorityInfoAccess extension", #<OpenSSL::X509::Certificate: subject=#<OpenSSL::X509::Name serialNumber=S15762003,CN=covid19.reserve-civique.gouv.fr,OU=0002 13000680200016,O=SG MIN AFFAIRES SOC,L=PARIS SP 07,C=FR>, issuer=#<OpenSSL::X509::Name CN=Certigna Services CA,2.5.4.97=#0C144E545246522D3438313436333038313030303336,OU=0002 48146308100036,O=DHIMYOTIS,C=FR>, serial=#<OpenSSL::BN:0x0000000007bba240>, not_before="2020-03-19T17:20:17.000Z", not_after="2022-03-19T17:20:17.000Z">]

irb(main):006:0> SSLTest.test("https://www.anonymisation.gov.pf/code-anonymat/avec-dn")
=> [true, "OCSP test couldn't be performed: Missing authorityInfoAccess extension", #<OpenSSL::X509::Certificate: subject=#<OpenSSL::X509::Name CN=www.anonymisation.gov.pf,serialNumber=35785DAN953,OU=RC-PF 002238,2.5.4.97=#0C0C4E545250462D303032323338,O=Service de l'informatique du territoire,L=Papeete,C=FR>, issuer=#<OpenSSL::X509::Name CN=Certinomis - AA et Agents,OU=0002 433998903,O=Certinomis,C=FR>, serial=#<OpenSSL::BN:0x0000000007bcc4b8>, not_before="2018-07-05T10:19:00.000Z", not_after="2020-07-04T10:19:00.000Z">]

irb(main):007:0> ["https://www.demarches-simplifiees.fr", "https://histovec.interieur.gouv.fr", "https://www.i-mata.gov.pf/", "https://doc.projet.gov.pf/", "https://www.paraoa.gov.pf/", "https://www.hotufina.gov.pf/", "https://www.service-public.pf/", "https://www.otia.gov.pf/", "https://www.service-public.pf/sipf/wp-admin/", "https://gestion.projet.gov.pf", "https://www.tiarama.gov.pf/", "https://www.turama.gov.pf/", "https://www.anonymisation.gov.pf/code-anonymat/avec-dn", "https://auth.gov.pf/auth/", "https://www.mes-demarches.gov.pf/", "https://www.oini.gov.pf", "https://www.tefenua.gov.pf/"].map {|u| SSLTest.test(u) }.map {|r| r[1]}.uniq
=> ["OCSP test couldn't be performed: Missing OCSP URI in authorityInfoAccess extension", "OCSP test couldn't be performed: Missing authorityInfoAccess extension"]

@jarthod
Copy link
Owner

jarthod commented Apr 23, 2020

Come on, a Covid-19 site as the first one  You can't tell me that it was random!

Haha, well I'm not even ordering the checks so that's mongo choice ;)
I'm not so surprised that the bad players not supporting OCSP are gov websites though xD

Thanks works great now, I just deployed my branch using your branch on staging, works so far:
image

@jarthod jarthod merged commit 5ca16df into master Apr 25, 2020
jarthod added a commit that referenced this pull request Apr 25, 2020
@jarthod jarthod deleted the 2-detect-revoked-certificates branch April 25, 2020 10:36
@jarthod jarthod mentioned this pull request Apr 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants