
hanging? #7

Open
sje30 opened this issue Jan 31, 2017 · 12 comments

sje30 commented Jan 31, 2017

When I try this (one of my DOIs) it seems to hang for a long time. Do we need a timeout?

http_status(GET("http://doi.org/10.1016/j.neuropharm.2015.07.027"))$message

The paper loads fine, though, when I visit the DOI in a browser.


pboesu commented Jan 31, 2017

I have similar issues with a set of articles, also by Elsevier. The DOIs resolve fine in a browser, but GET() never completes (I aborted after several hours). The offending DOIs for me are

http://doi.org/10.1016/j.dsr2.2016.12.010
http://doi.org/10.1016/j.dsr2.2015.07.002
http://doi.org/10.1016/j.dsr2.2015.06.023
http://doi.org/10.1016/j.dsr2.2015.05.009

Adding a timeout and wrapping the GET call in try fixes the issue as far as completing the loop:

vec[doi] <- try(http_status(GET(uniquedois[doi], timeout(30)))$message)

but I'd be curious to know if anyone has an idea why GET fails in the first place.
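
Expanded into the full loop, that fix looks roughly like this (a sketch only; `uniquedois` and `vec` are assumed to be the same objects as in the snippet above):

library(httr)

# Sketch of the loop above: `uniquedois` is assumed to be a character vector
# of DOI URLs. try() keeps the loop going when GET() errors (e.g. hits the
# 30-second timeout) instead of stopping at the first problematic DOI.
vec <- character(length(uniquedois))
for (doi in seq_along(uniquedois)) {
  vec[doi] <- try(http_status(GET(uniquedois[doi], timeout(30)))$message)
}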


sje30 commented Jan 31, 2017

Is it possibly some kind of anti-crawling device? Simplifying a bit to a wget call from the command line shows that wget doesn't complete either.

$ wget http://doi.org/10.1016/j.dsr2.2015.05.009
--2017-01-31 21:26:26--  http://doi.org/10.1016/j.dsr2.2015.05.009
Resolving doi.org... 95.138.171.129, 38.100.138.162, 54.191.229.235, ...
Connecting to doi.org|95.138.171.129|:80... connected.
HTTP request sent, awaiting response... 303 See Other
Location: http://linkinghub.elsevier.com/retrieve/pii/S0967064515001757 [following]
--2017-01-31 21:26:26--  http://linkinghub.elsevier.com/retrieve/pii/S0967064515001757
Resolving linkinghub.elsevier.com... 198.185.19.37
Connecting to linkinghub.elsevier.com|198.185.19.37|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /retrieve/articleSelectSinglePerm?Redirect=http%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS0967064515001757%3Fvia%253Dihub&key=0975941c8eaa7e0d5f565539e746120441f9f269 [following]
--2017-01-31 21:26:27--  http://linkinghub.elsevier.com/retrieve/articleSelectSinglePerm?Redirect=http%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS0967064515001757%3Fvia%253Dihub&key=0975941c8eaa7e0d5f565539e746120441f9f269
Reusing existing connection to linkinghub.elsevier.com:80.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S0967064515001757?via%3Dihub [following]
--2017-01-31 21:26:27--  http://www.sciencedirect.com/science/article/pii/S0967064515001757?via%3Dihub
Resolving www.sciencedirect.com... 23.55.150.153
Connecting to www.sciencedirect.com|23.55.150.153|:80... connected.
HTTP request sent, awaiting response...   C-c 

@rossmounce (Owner)

@sje30 you're right.

You see, I wrote this in R to make it a bit more cross-platform / portable.
But originally, and if I were just doing this for myself, I'd have written it in pure shell script.
The notable thing about Elsevier is that they require a user-agent string with wget.

So this will work for you:

wget --user-agent="Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)" 'http://doi.org/10.1016/j.dsr2.2016.12.010'

So I need to update the script to use a user-agent string with httr.
Thanks for the feedback, guys! It's difficult for me to encounter this bug myself: I've never had the misfortune to publish with Elsevier 😜
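
A minimal sketch of that update, assuming the script can set the agent once globally rather than on each call:

library(httr)

# Set the user-agent once so every subsequent GET() sends it
ua <- user_agent("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)")
set_config(ua)

# ...or pass it per request instead:
# http_status(GET("http://doi.org/10.1016/j.dsr2.2016.12.010", ua))$message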

@rossmounce (Owner)

Hmmm...

wget with user-agent has no problem with the DOIs from my ORCID.

But when I try to set a user-agent with httr, odd things start happening: HTTP 400 and HTTP 401 codes!

vec[doi] <- http_status(GET(uniquedois[doi], user_agent("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)")))$message
2   Client error: (400) Bad Request  http://doi.org/10.7287/peerj.preprints.773v1
11 Client error: (401) Unauthorized            http://doi.org/10.1038/nature10266
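
One way to dig into this (a debugging sketch, not part of the script) is httr's verbose() config, which prints the request and response headers so they can be compared against what wget sends:

library(httr)

# Show the outgoing request headers (including the user-agent) and the
# response headers for one of the DOIs that comes back as a 400
GET("http://doi.org/10.7287/peerj.preprints.773v1",
    user_agent("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)"),
    verbose())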

@rossmounce (Owner)

For debugging purposes, it may be useful to compare httr against wget.
To run wget on one's list of DOIs, with a suitable user-agent, do:

wget --user-agent="Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)" -w1 -i mydois.txt -o newlog.log


rossmounce commented Jan 31, 2017

I think Nature returns a 401 if the article is paywalled.

Wget log

--2017-01-31 22:00:34--  http://doi.org/10.1038/nature10266
Connecting to doi.org (doi.org)|208.254.38.90|:80... connected.
HTTP request sent, awaiting response... 303 See Other
Location: http://www.nature.com/doifinder/10.1038/nature10266 [following]
--2017-01-31 22:00:36--  http://www.nature.com/doifinder/10.1038/nature10266
Resolving www.nature.com (www.nature.com)... 95.101.129.18, 95.101.129.16
Connecting to www.nature.com (www.nature.com)|95.101.129.18|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /nature/journal/v476/n7359/full/nature10266.html [following]
--2017-01-31 22:00:37--  http://www.nature.com/nature/journal/v476/n7359/full/nature10266.html
Reusing existing connection to www.nature.com:80.
HTTP request sent, awaiting response... 401 Unauthorized

Username/Password Authentication Failed.
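
If that reading is right, the loop could record the status code and flag 401s as likely paywalled rather than treating them as failures; a rough sketch (the 401-means-paywalled interpretation is the guess above):

library(httr)

# Keep the response object so the status code can be inspected directly
resp <- GET("http://doi.org/10.1038/nature10266", timeout(30))
if (status_code(resp) == 401) {
  message("Likely paywalled (401): ", resp$url)
}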


sje30 commented Jan 31, 2017 via email


wetneb commented Jan 31, 2017

I haven't read everything, but I'm not sure dissemin would help much here. Maybe Zotero's translation server?


sje30 commented Jan 31, 2017 via email


pboesu commented Jan 31, 2017

dissemin creates yet another set of problems for me: it won't parse publications from Inter Research (a small independent publisher), even though the papers have valid DOIs and are recognized by ORCID.

Thanks for the user_agent tip! That solves the Elsevier issues for me.


wetneb commented Jan 31, 2017

@sje30 yes, but we don't do any scraping (which would be necessary for accurate results here). The issue seems to be about making requests to the publisher's website, which is something we can't help with.


sje30 commented Jan 31, 2017

OK, thanks @wetneb.
