
hanging? #7

Open
sje30 opened this issue Jan 31, 2017 · 12 comments

sje30 commented Jan 31, 2017

When I try this (one of my DOIs) it seems to hang for a long time. Do we need a timeout?

http_status(GET("http://doi.org/10.1016/j.neuropharm.2015.07.027"))$message

The paper loads fine, though, when I visit the DOI in a browser.


pboesu commented Jan 31, 2017

I have similar issues with a set of articles, also by Elsevier. The DOIs resolve fine in a browser, but GET() never completes (I aborted after several hours). The offending DOIs for me are

http://doi.org/10.1016/j.dsr2.2016.12.010
http://doi.org/10.1016/j.dsr2.2015.07.002
http://doi.org/10.1016/j.dsr2.2015.06.023
http://doi.org/10.1016/j.dsr2.2015.05.009

Adding a timeout and wrapping the GET call in try fixes the issue as far as completing the loop:

vec[doi] <- try(http_status(GET(uniquedois[doi], timeout(30)))$message)

but I'd be curious to know if anyone has an idea why GET fails in the first place.
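
Expanded into the full loop, that fix looks roughly like this (a sketch only; `uniquedois` and `vec` are assumed to be the same objects as in the snippet above):

library(httr)

# Sketch of the loop above: `uniquedois` is assumed to be a character vector
# of DOI URLs. try() keeps the loop going when GET() errors (e.g. hits the
# 30-second timeout) instead of stopping at the first problematic DOI.
vec <- character(length(uniquedois))
for (doi in seq_along(uniquedois)) {
  vec[doi] <- try(http_status(GET(uniquedois[doi], timeout(30)))$message)
}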


sje30 commented Jan 31, 2017

Is it possibly some kind of anti-crawling device? Simplifying a bit to a wget call from the command line shows that wget doesn't complete either.

$ wget http://doi.org/10.1016/j.dsr2.2015.05.009
--2017-01-31 21:26:26--  http://doi.org/10.1016/j.dsr2.2015.05.009
Resolving doi.org... 95.138.171.129, 38.100.138.162, 54.191.229.235, ...
Connecting to doi.org|95.138.171.129|:80... connected.
HTTP request sent, awaiting response... 303 See Other
Location: http://linkinghub.elsevier.com/retrieve/pii/S0967064515001757 [following]
--2017-01-31 21:26:26--  http://linkinghub.elsevier.com/retrieve/pii/S0967064515001757
Resolving linkinghub.elsevier.com... 198.185.19.37
Connecting to linkinghub.elsevier.com|198.185.19.37|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /retrieve/articleSelectSinglePerm?Redirect=http%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS0967064515001757%3Fvia%253Dihub&key=0975941c8eaa7e0d5f565539e746120441f9f269 [following]
--2017-01-31 21:26:27--  http://linkinghub.elsevier.com/retrieve/articleSelectSinglePerm?Redirect=http%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS0967064515001757%3Fvia%253Dihub&key=0975941c8eaa7e0d5f565539e746120441f9f269
Reusing existing connection to linkinghub.elsevier.com:80.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S0967064515001757?via%3Dihub [following]
--2017-01-31 21:26:27--  http://www.sciencedirect.com/science/article/pii/S0967064515001757?via%3Dihub
Resolving www.sciencedirect.com... 23.55.150.153
Connecting to www.sciencedirect.com|23.55.150.153|:80... connected.
HTTP request sent, awaiting response...   C-c 

@rossmounce (Owner)

@sje30 you're right.

You see, I wrote this in R to make it a bit more cross-platform / portable.
But originally, and if I were just doing this for myself, I'd have written it in pure shell script.
The notable thing about Elsevier is that they require a user-agent string with wget.

So this will work for you:

wget --user-agent="Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)" 'http://doi.org/10.1016/j.dsr2.2016.12.010'

So I need to update the script to use a user-agent string with httr.
Thanks for the feedback, guys! It's difficult for me to encounter this bug myself: I've never had the misfortune to publish with Elsevier 😜
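
A minimal sketch of that update, assuming the script can set the agent once globally rather than on each call:

library(httr)

# Set the user-agent once so every subsequent GET() sends it
ua <- user_agent("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)")
set_config(ua)

# ...or pass it per request instead:
# http_status(GET("http://doi.org/10.1016/j.dsr2.2016.12.010", ua))$message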

@rossmounce (Owner)

Hmmm...

wget with user-agent has no problem with the DOIs from my ORCID.

But when I try to set a user-agent with httr, odd things start happening: HTTP 400 and HTTP 401 codes!

vec[doi] <- http_status(GET(uniquedois[doi], user_agent("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)")))$message
2   Client error: (400) Bad Request  http://doi.org/10.7287/peerj.preprints.773v1
11 Client error: (401) Unauthorized            http://doi.org/10.1038/nature10266
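
One way to dig into this (a debugging sketch, not part of the script) is httr's verbose() config, which prints the request and response headers so they can be compared against what wget sends:

library(httr)

# Show the outgoing request headers (including the user-agent) and the
# response headers for one of the DOIs that comes back as a 400
GET("http://doi.org/10.7287/peerj.preprints.773v1",
    user_agent("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)"),
    verbose())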

@rossmounce (Owner)

For debugging purposes, it may be useful to compare httr against wget.
To run wget on one's list of DOIs, with a suitable user-agent, do:

wget --user-agent="Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)" -w1 -i mydois.txt -o newlog.log


rossmounce commented Jan 31, 2017

I think Nature returns a 401 if the article is paywalled.

Wget log

--2017-01-31 22:00:34--  http://doi.org/10.1038/nature10266
Connecting to doi.org (doi.org)|208.254.38.90|:80... connected.
HTTP request sent, awaiting response... 303 See Other
Location: http://www.nature.com/doifinder/10.1038/nature10266 [following]
--2017-01-31 22:00:36--  http://www.nature.com/doifinder/10.1038/nature10266
Resolving www.nature.com (www.nature.com)... 95.101.129.18, 95.101.129.16
Connecting to www.nature.com (www.nature.com)|95.101.129.18|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /nature/journal/v476/n7359/full/nature10266.html [following]
--2017-01-31 22:00:37--  http://www.nature.com/nature/journal/v476/n7359/full/nature10266.html
Reusing existing connection to www.nature.com:80.
HTTP request sent, awaiting response... 401 Unauthorized

Username/Password Authentication Failed.
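
If that reading is right, the loop could record the status code and flag 401s as likely paywalled rather than treating them as failures; a rough sketch (the 401-means-paywalled interpretation is the guess above):

library(httr)

# Keep the response object so the status code can be inspected directly
resp <- GET("http://doi.org/10.1038/nature10266", timeout(30))
if (status_code(resp) == 401) {
  message("Likely paywalled (401): ", resp$url)
}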


sje30 commented Jan 31, 2017 via email


wetneb commented Jan 31, 2017

I haven't read everything, but I'm not sure dissemin would help much here. Maybe Zotero's translation server?


sje30 commented Jan 31, 2017 via email


pboesu commented Jan 31, 2017

dissemin creates yet another set of problems for me: it won't parse publications from Inter Research (a small independent publisher), even though the papers have valid DOIs and are recognized by ORCID.

Thanks for the user_agent tip! That solves the Elsevier issues for me.


wetneb commented Jan 31, 2017

@sje30 yes, but we don't do any scraping (which would be necessary for accurate results here). The issue seems to be about making requests to the publisher's website, which is something we can't help with.


sje30 commented Jan 31, 2017

OK, thanks @wetneb.
