Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Payara6: URL encoding has changed so going through Apache or directly to 8080 can break things; export download, privateURL download. #9797

Closed
kcondon opened this issue Aug 18, 2023 · 10 comments · Fixed by #9835
Assignees
Milestone

Comments

@kcondon
Copy link
Contributor

kcondon commented Aug 18, 2023

Tested on payara 6 .8 on java 17: #9764

After running the installer, found siteURL had :8080 at the end of fqdn. Later I removed it but noticed it seemed to cause trouble when it was 8080 for private url file download, worked when removed and trouble for downloading dataset metadata export when not there and worked when 8080 was there, so opposite. Might this be a more general parsing/encoding issue that could appear elsewhere?

The errors:

  1. Trying to download a file from a private url with port 8080 set for siteURL:
    http://dataverse-internal.iq.harvard.edu:8080/privateurl.xhtml?token=f09ccd96-5e3e-4299-a57e-f0ae7dbe5d55
status | "ERROR" -- | -- code | 403 message | "Not authorized to access this object via this API endpoint. Please check your code for typos, or consult our API guide at http://guides.dataverse.org." requestUrl | "http://dataverse-internal.iq.harvard.edu:8080/api/v1/access/datafile/24?gbrecs=true" requestMethod | "GET"
  1. Trying to download dataset export metadata when port 8080 is not set (ie going through apache on 80):

{"status":"ERROR","message":"A dataset with the persistentId doi%3A10.70122/FK2/JAKDVW could not be found."}

[2023-08-18T15:47:47.365+0000] [Payara 6.2023.8] [INFO] [] [edu.harvard.iq.dataverse.AbstractGlobalIdServiceBean] [tid: _ThreadID=97 _ThreadName=http-thread-pool::jk-connector(4)] [timeMillis: 1692373667365] [levelValue: 800] [[
Error parsing identifier: doi%3A10.70122/FK2/JAKDVW: ':' not found in string]]

[Update] the export issue seemed to be corrected by adding https rather than http when going through port 80. I'd just removed 8080 from siteURL without adjusting http.

Note that my payara 5 installation uses http:// and port 8080 and it appears to work. Would need to retest to confirm.

@scolapasta scolapasta added this to the 6.0 milestone Aug 18, 2023
@landreev
Copy link
Contributor

I can't seem to be able to reproduce the broken metadata links issue (Error parsing identifier: doi%3A10.70122/FK2/JAKDVW: ':' not found).
I go through apache: https://dataverse-internal.iq.harvard.edu/dataset.xhtml?persistentId=doi:10.70122/FK2/WPCTZB, and everything in the metadata tab appears to be working.
Hmm.

@landreev
Copy link
Contributor

landreev commented Aug 24, 2023

[edit: not an issue - the <protocol> part of the error message must have gotten eaten because of the angle brackets in the original issue description] Also, as an extra weird piece, trying to pass a double-encoded identifier to the api, results in an error message in the log that's different from the one reported, above:

using https://dataverse-internal.iq.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi%253A10.70122/FK2/WPCTZB

Error parsing identifier: doi%3A10.70122/FK2/WPCTZB: '<protocol>:' not found in string

I'm very confused. [edit: don't be]

@landreev
Copy link
Contributor

(we need to look closer into this, there may be situations where this still can be reproduced. like maybe if there is a login page redirect involved, or something like that)

@qqmyers
Copy link
Member

qqmyers commented Aug 24, 2023

FWIW: The actual PID parsing changed when PermaLinks was added, so we may have different error messages now and/or be more sensitive to strings not getting decoded in the PID recognition code now.

@landreev
Copy link
Contributor

OK, the "different error message" was a non issue - the error messages in the original description were added as regular text, without the backticks, so the <protocol> part must have been dropped on account of the angle quotes.

@kcondon
Copy link
Contributor Author

kcondon commented Aug 24, 2023 via email

@landreev
Copy link
Contributor

not "angle quotes", lol... but you know what I meant.

@landreev
Copy link
Contributor

landreev commented Aug 24, 2023

So, anyway, this is what we are looking at: something double-encoded that ":" character when Kevin was testing it (more than once). The error messages are in /usr/local/payara6.8.installed/glassfish/domains/domain1/logs/server.log* on dataverse-internal, and here is one in the apache access log on the day/time the issue was opened:

98.118.33.138 - - [18/Aug/2023:15:50:20 +0000] "GET /api/datasets/export?exporter=OAI_ORE&persistentId=doi%253A10.70122/FK2/JAKDVW HTTP/1.1" 404 108

...but I haven't been able to make it happen for me now.
Based on the same access log, it happened after he got bounced to the login page and back to the dataset page. And of course the login page encodes the whole url string, like this:

98.118.33.138 - - [18/Aug/2023:15:50:14 +0000] "POST /loginpage.xhtml?redirectPage=%2Fdataset.xhtml%3FpersistentId%3Ddoi%3A10.70122%2FFK2%2FJAKDVW HTTP/1.1" 200 161

... but, again, I tried it with a draft, reproducing the same login page redirect loop - and it didn't happen for me.
OK, I can't spend any more time on this... but, weird, huh?

@landreev
Copy link
Contributor

I did figure out how to reproduce this, btw. It's not that bad.

@landreev
Copy link
Contributor

Here's what appears to be taking place:

Rather than trying to figure out why this has started happening under p6, I feel like we should just add a defensive '%3A' -> ':' substitution to the persistent id check. I made a draft pr with the fix. It looks like this has already been happening once in a while - I see an occasional doi%253A resulting in a 404 in the prod. access logs (probably on account of bookmarked or harvested urls? - idk). So, would be a reasonable fix to add regardless.

The fact that this is happening because of the extra http: -> https: redirect is by itself weird; because that's done entirely under apache... so not immediately clear how the p6 upgrade would even affect that... But, once again, I'm not sure we want to spend much time figuring out the why part.

The above feels like more text than this problem warranted already.

@landreev landreev self-assigned this Aug 28, 2023
kcondon added a commit that referenced this issue Aug 29, 2023
fix for double-url-encoded ":" in "hdl:". #9797
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants