Errors when running crawler behind corporate proxy #1094

Closed
azat-alimov-db opened this issue Dec 12, 2023 · 17 comments · Fixed by #1095
@azat-alimov-db commented Dec 12, 2023

Hello,

Thank you for helping with my earlier question about proxy settings for the crawler. Now when I try to run a test indexing, I get the following error:
2023-12-12 21:01:48 WARNING Macroscope.Main:317: Skipping due to an unexpected exception {"index":"test","crawler":"coder","err":"Decoding of CommitInfoRequest {commitInfoRequestIndex = "test", commitInfoRequestCrawler = "coder", commitInfoRequestEntity = Enumerated {enumerated = Right EntityTypeENTITY_TYPE_ORGANIZATION}, commitInfoRequestOffset = 0} failed with: "Error in $: Failed reading: not a valid json value at '<!DOCTYPEhtmlPUBLIC-W3CDTDXHTML1.0TransitionalENhttp:www.w3.orgTRxhtml1DTDxhtm'"\nCallStack (from HasCallStack):\n error, called at src/Relude/Debug.hs:289:11 in relude-1.2.0.0-Jiwa4gfuZvkK1snRof3V:Relude.Debug\n error, called at src/Monocle/Client.hs:107:17 in monocle-0.1.10.0-1juCsBb4vJ35WvYo0D138g:Monocle.Client"}

Here is a config:
workspaces:
  - name: test
    crawlers:
      - name: "coder"
        provider:
          github_organization: coder
        update_since: '2023-01-01'

Any idea what this means?
I'd appreciate any hints.

@TristanCacqueray (Contributor)

It seems this happens when the crawler uses the proxy to connect to the Monocle API. We probably need a different variable name for that case.

@azat-alimov-db (Author)

This is what we set as the proxy settings:

- name: HTTP_PROXY
  value: http://corporate_proxy:8080
- name: HTTPS_PROXY
  value: http://corporate_proxy:8080

@TristanCacqueray (Contributor)

Could you try removing the HTTP_PROXY variable? It should be the one used for the connections from the crawler to the API.
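In a Kubernetes-style deployment like the one discussed here, that suggestion amounts to something like the following sketch (the surrounding container spec is assumed):

```yaml
# Sketch: set only HTTPS_PROXY on the crawler container, so outbound
# HTTPS traffic to GitHub goes through the proxy while plain-HTTP
# connections from the crawler to the Monocle API do not.
env:
  - name: HTTPS_PROXY
    value: http://corporate_proxy:8080
  # HTTP_PROXY is intentionally omitted: it also gets applied to the
  # crawler-to-API connection, which produced the decoding error above.
```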

@azat-alimov-db (Author)

OK, I tried that. It looks like the connection can be established now, but I'm getting SSL errors:
2023-12-12 21:55:59 WARNING Monocle.Effects:526: network error {"index":"test","crawler":"coder","stream":"Projects","count":7,"limit":7,"loc":"api.github.com:443/graphql","failed":"InternalException ProtocolError "error:0A000086:SSL routines::certificate verify failed""}

Any hints on configuring SSL certs for the crawler (the proxy replaces the SSL certificate with one signed by our org's CA), or is there a way to run the crawler in insecure mode?

@TristanCacqueray (Contributor)

Alright, thanks.

SSL is implemented by OpenSSL, so setting SSL_CERT_FILE should work.
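For a Kubernetes deployment this could look roughly like the sketch below (the secret name and file path are hypothetical):

```yaml
# Sketch: mount the corporate CA certificate from a secret and point
# SSL_CERT_FILE at it. Secret and path names here are hypothetical.
# env/volumeMounts go under the crawler container; volumes under the pod spec.
env:
  - name: SSL_CERT_FILE
    value: /etc/pki/tls/certs/corporate-ca.pem
volumeMounts:
  - name: corporate-ca
    mountPath: /etc/pki/tls/certs
    readOnly: true
volumes:
  - name: corporate-ca
    secret:
      secretName: corporate-ca
```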

@azat-alimov-db (Author)

Gotcha, thank you. I'll work on that tomorrow, since I'll need to update the deployment YAML and mount the SSL certs somewhere as a secret.

@mergify mergify bot closed this as completed in #1095 Dec 13, 2023
@morucci morucci reopened this Dec 13, 2023
@morucci (Collaborator) commented Dec 13, 2023

The related change is merged. A new container image should be published soon: https://github.com/change-metrics/monocle/actions/runs/7199715334

@azat-alimov-db (Author) commented Dec 13, 2023

Hello,

I added a certificate to a deployment and set the env var to:

- name: SSL_CERT_FILE
  value: /etc/pki/tls/certs/db-server-ca-6.cer

Then I tested with curl, and the connection works fine via the proxy:

bash-4.2$ curl -v -o /dev/null https://api.github.com
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* About to connect() to proxy *** port 8080 (#0)
*   Trying 10.245.32.5...
* Connected to *** (10.245.32.5) port 8080 (#0)
* Establish HTTP proxy tunnel to api.github.com:443
> CONNECT api.github.com:443 HTTP/1.1
> Host: api.github.com:443
> User-Agent: curl/7.29.0
> Proxy-Connection: Keep-Alive
> 
< HTTP/1.0 200 Connection established
< 
* Proxy replied OK to CONNECT request
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/db-server-ca-6.cer
  CApath: none
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
***
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: api.github.com
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Wed, 13 Dec 2023 19:44:29 GMT
< ETag: W/"4f825cc84e1c733059d46e76e6df9db557ae5254f9625dfe8e1b09499c449438"
< Vary: Accept, Accept-Encoding, Accept, X-Requested-With
< Server: GitHub.com
< Connection: Keep-Alive
< Content-Type: application/json; charset=utf-8
< Accept-Ranges: bytes
< Cache-Control: public, max-age=60, s-maxage=60
< Content-Length: 2262
< Referrer-Policy: origin-when-cross-origin, strict-origin-when-cross-origin
< X-Frame-Options: deny
< X-RateLimit-Used: 1
< X-XSS-Protection: 0
< X-RateLimit-Limit: 60
< X-RateLimit-Reset: 1702500276
< X-GitHub-Media-Type: github.v3; format=json
< X-GitHub-Request-Id: 5F6A:3D26CA:1FEADE:204AFE:657A09A4
< X-RateLimit-Resource: core
< X-RateLimit-Remaining: 59
< X-Content-Type-Options: nosniff
< Content-Security-Policy: default-src 'none'
< Strict-Transport-Security: max-age=31536000; includeSubdomains; preload
< Access-Control-Allow-Origin: *
< Access-Control-Expose-Headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset
< x-github-api-version-selected: 2022-11-28
< 
{ [data not shown]
100  2262  100  2262    0     0   5835      0 --:--:-- --:--:-- --:--:--  5844
* Connection #0 to host *** left intact

But the crawler still receives the following:
"2023-12-13 18:46:49 WARNING Macroscope.Main:317: Skipping due to an unexpected exception {"index":"test","crawler":"coder","err":"HttpExceptionRequest Request {\n host = \"api.github.com\"\n port = 443\n secure = True\n requestHeaders = [(\"Authorization\",\"<REDACTED>\"),(\"User-Agent\",\"change-metrics/monocle\"),(\"Content-Type\",\"application/json\")]\n path = \"/graphql\"\n queryString = \"\"\n method = \"POST\"\n proxy = Nothing\n rawBody = False\n redirectCount = 10\n responseTimeout = ResponseTimeoutDefault\n requestVersion = HTTP/1.1\n proxySecureMode = ProxySecureWithConnect\n}\n (InternalException ProtocolError \"error:0A000086:SSL routines::certificate verify failed\")"}"

Is it possible to set it to insecure?

I'd appreciate any further suggestions.

@TristanCacqueray (Contributor)

Perhaps you can try setting TLS_NO_VERIFY to 1.
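As a sketch, in the same deployment env block that would be (insecure, so only reasonable for testing behind a proxy that re-signs certificates):

```yaml
# Sketch: disable TLS certificate verification for the crawler.
# Insecure -- prefer SSL_CERT_FILE with the proxy's CA where possible.
env:
  - name: TLS_NO_VERIFY
    value: "1"
```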

@azat-alimov-db (Author) commented Dec 13, 2023

Any idea why I get a "Network error" from the web UI (api) when trying to access it via the browser?

The api service logs don't show any suspicious errors; moreover, they show that I received a 200:
[13/Dec/2023:20:17:46 +0000] "GET / HTTP/1.1" 200 - "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36""

I've exposed the service via a Cloud Load Balancer on GCP GKE, using the LoadBalancer service type:

apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: api
    app.kubernetes.io/component: api
    app.kubernetes.io/part-of: monocle
  name: api-external
  annotations:
    networking.gke.io/internal-load-balancer-allow-global-access: "true"
    networking.gke.io/load-balancer-type: "Internal"
spec:
  type: LoadBalancer
  ports:
    - name: http-rest-api
      port: 8080
      targetPort: 8080
  selector:
    app.kubernetes.io/name: api
status:
  loadBalancer: {}

@TristanCacqueray (Contributor)

Have you tried setting COMPOSE_MONOCLE_PUBLIC_URL?

@azat-alimov-db (Author)

Yep, I set that for the api and the crawler, but I'm still getting the same "Network error" message.

@TristanCacqueray (Contributor)

Oops, I meant MONOCLE_PUBLIC_URL. This should be the URL you use to access the web UI, and it is only needed for the api container. It defaults to localhost, so if you look at your browser's network inspector tab, you should see that the network error happens because the client tries to connect to localhost.
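A sketch of that setting on the api container, assuming the web UI is reached via the internal load balancer at http://100.88.10.138:8080 (the address the browser reports as the request origin):

```yaml
# Sketch: point the web client at the URL actually used in the browser,
# instead of the default localhost.
env:
  - name: MONOCLE_PUBLIC_URL
    value: http://100.88.10.138:8080
```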

@azat-alimov-db (Author)

Sorry, I can't give you screenshots, but looking into Chrome developer tools, I see the following for the "about" request:
General:

Request URL:
http://localhost:8080/api/2/about
Referrer Policy:
strict-origin-when-cross-origin

Request Headers:

Accept:
*/*
Access-Control-Request-Headers:
content-type
Access-Control-Request-Method:
POST
Origin:
http://100.88.10.138:8080
Sec-Fetch-Mode:
cors
User-Agent:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36

@azat-alimov-db (Author)

Awesome, that did the trick. Thank you very much @TristanCacqueray !
Let me play around with that great tool.

Feel free to close this issue.

@TristanCacqueray (Contributor)

You're welcome, have fun!
