robotstxt #25
Thanks for the submission 🚀 - seeking reviewers

Reviewers: @richfitz @Ironholds
As you may notice, most of these quibbles are, well, quibbles. On the whole the package is excellent :). I'll try to put time into a more detailed review later today or early tomorrow.
Thanks for the rOpenSci infrastructure and thanks for the feedback - much, much, much appreciated.

In response to @Ironholds:

A question of my own:
I think that's totally fine behaviour, but I'd caution you, in that scenario, to use more formal/explanatory description text, noting that problematic data will be returned. (On the subject of formality, working on a more formal review now.)
On the whole: solid work! Some thoughts:

In response to @Ironholds II:

In regard to how path_allowed() handles problems with permission deduction:
Dear Oliver (@Ironholds), thank you very much for taking the time to review my package and comment on it. I think the package gained a lot from your suggestions, and I learned a lot myself as well. I really appreciate the effort. I would like to add you as a contributor within DESCRIPTION and README, if that's ok with you - what is the rOpenSci policy on that matter?
I feel like I should say "the package looks really good now, sorry for not mentioning it earlier, this week has been nuts" before accepting or rejecting ;p.
I'm sorry, I have left this very late, so I will just jot down the thoughts I have rather than a full review, in the hope that it is useful. My biggest comment is that I don't see what R6 is adding here; it seems that the return value could be a list (which contains one function). Don't get me wrong -- I love R6. But in cases where reference semantics aren't needed or being used, it seems an unnecessary weirdness for most users (and in my experience users do find them weird). See the code below for a pure-list implementation of your function (purposefully left very similar) to show that there is no real difference in implementation. The only difference is that initialisation looks less weird:

```r
robotstxt <- function(domain, text) {
  ## check input
  self <- list()
  if (missing(domain)) {
    self$domain <- NA
  }
  if (!missing(text)) {
    self$text <- text
    if (!missing(domain)) {
      self$domain <- domain
    }
  } else {
    if (!missing(domain)) {
      self$domain <- domain
      self$text <- get_robotstxt(domain)
    } else {
      stop("robotstxt: I need text to initialize.")
    }
  }
  ## fill fields with default data
  tmp <- parse_robotstxt(self$text)
  self$bots <- tmp$useragents
  self$comments <- tmp$comments
  self$permissions <- tmp$permissions
  self$crawl_delay <- tmp$crawl_delay
  self$host <- tmp$host
  self$sitemap <- tmp$sitemap
  self$other <- tmp$other
  self$check <- function(paths = "/", bot = "*", permission = self$permissions) {
    paths_allowed(permissions = permission, paths = paths, bot = bot)
  }
  class(self) <- "robotstxt"
  self
}
```

(This passes checks in the package by deleting only the …)

I really like the idea of downloading the file once and testing repeatedly. I wonder if that could be extended with the assumption that the file does not change during an R session. Then one could memoise the calls and avoid having to have any sort of object for the users to worry about (see the usage sketch after the minor comments):

```r
cache <- new.env(parent = emptyenv())
path_allowed <- function(url) {
  x <- httr::parse_url(url)
  obj <- cache[[x$hostname]]
  if (is.null(obj)) {
    cache[[x$hostname]] <- obj <- robotstxt(x$hostname)
  }
  obj$check(x$path)
}
```

Minor comments:

- Avoid …
- In …
- I find that …
- You have a partial match of …
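A hypothetical usage sketch of the memoised `path_allowed()` above (the URLs are examples only): the first call for a host downloads and parses its robots.txt, and later calls for the same host reuse the cached object.

```r
path_allowed("http://pmeissner.com/index.html")  # downloads and parses robots.txt
path_allowed("http://pmeissner.com/_layouts/")   # same host: served from the cache
```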
Pfuh, some thoughts, aye? Sounds all very interesting and helpful, in the sense that I really have to think about it. Now, this will take me a while. Thank you very much - most appreciated!
I am working on implementing all of your suggestions. But I really would like to understand better what happens in ropensci/robotstxt#6. Do you maybe have a reference or a catch phrase for me that I can look up? What I specifically have a hard time wrapping my head around is that the function within the list knows where to find the data. Example:

```r
# function definition
robotstxt_dev <- function(domain, text) {
  ## check input
  self <- list()
  if (missing(domain)) {
    self$domain <- NA
  }
  if (!missing(text)) {
    self$text <- text
    if (!missing(domain)) {
      self$domain <- domain
    }
  } else {
    if (!missing(domain)) {
      self$domain <- domain
      self$text <- get_robotstxt(domain)
    } else {
      stop("robotstxt: I need text to initialize.")
    }
  }
  ## fill fields with default data
  tmp <- parse_robotstxt(self$text)
  self$bots <- tmp$useragents
  self$comments <- tmp$comments
  self$permissions <- tmp$permissions
  self$crawl_delay <- tmp$crawl_delay
  self$host <- tmp$host
  self$sitemap <- tmp$sitemap
  self$other <- tmp$other
  self$check <- function(paths = "/", bot = "*", permission = self$permissions) {
    paths_allowed(permissions = permission, paths = paths, bot = bot)
  }
  class(self) <- "robotstxt"
  self
}
# example
library(robotstxt)
blah <- robotstxt_dev("pmeissner.com")$check
blah("_layouts") At the moment, that feels like pure black magic. |
I have now implemented all suggestions.
Thank you very, very much for taking the time to go through my code. The package gained a lot from that, I think, and I learned a lot myself. I really appreciate the effort. Thanks.
Just some further thoughts ... In regard to discarding R6 because of not using "reference semantics": I have done some research - ah, so that's what it is.
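A minimal sketch (not from the package) of what reference semantics means in practice: R6 objects are modified in place, so every name bound to the object sees the change, while plain lists are copied on modification.

```r
library(R6)

Counter <- R6Class("Counter", public = list(
  n   = 0,
  add = function() self$n <- self$n + 1
))

a <- Counter$new()
b <- a           # `b` and `a` refer to the same object (reference semantics)
a$add()
b$n              # 1 - the change made via `a` is visible through `b`

x <- list(n = 0)
y <- x           # `y` is an independent copy (value semantics)
x$n <- x$n + 1
y$n              # 0 - `y` is unaffected
```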
Seems ready to merge in. I do see:

```
✔ checking for missing documentation entries
W checking for code/documentation mismatches
Codoc mismatches from documentation object 'robotstxt':
robotstxt
Code: function(domain = NULL, text = NULL)
Docs: function(domain = "mydomain.com")
Argument names in code not in docs:
text
Mismatches in argument default values:
Name: 'domain' Code: NULL Docs: "mydomain.com"
robotstxt
Code: function(domain = NULL, text = NULL)
Docs: function(text = "User-agent: *\nDisallow: /")
Argument names in code not in docs:
domain
Mismatches in argument names:
Position: 1 Code: domain Docs: text
Mismatches in argument default values:
Name: 'text' Code: NULL Docs: "User-agent: *\nDisallow: /"
W checking Rd \usage sections
Undocumented arguments in documentation object 'get_robotstxt'
‘warn’
Undocumented arguments in documentation object 'robotstxt'
‘domain’ ‘text’
```

Can these be fixed first?
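For context, a sketch of the kind of roxygen2 header that would bring docs and code back in sync - assuming the package's documentation is generated with roxygen2; the parameter descriptions are illustrative:

```r
#' robotstxt
#'
#' @param domain domain for which to download and parse a robots.txt file
#' @param text   content of a robots.txt file, supplied directly instead of
#'   being downloaded
#' @export
robotstxt <- function(domain = NULL, text = NULL) {
  # body omitted - only the documented signature matters here
}
```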
Sorry, I missed that. @sckott thanks for checking - it's all fixed. All checks now pass locally (Ubuntu) without errors, warnings, or notes; the same is true for winbuilder checks.
Great! Thanks. Approved. I added you to a team with admin access, and now you should be able to transfer the repo to ropenscilabs.
Coooool. Transfer repo? It is done via Settings > Danger Zone > Transfer ownership > "ropenscilabs" - right?
yes
Same as before - you are still the maintainer of your pkg and can submit to CRAN as you like. You can push a new version to CRAN before transferring if you like.
Transferred to ropenscilabs - everything works fine.
awesome, thanks for your work and contribution
Nah, I am the one indebted for getting all that great feedback - thanks for letting me join the club.
Web scraping makes it possible to gather information of scientific value - mainly social-science related, in my experience. While scraping web pages, one should respect the permissions declared in robots.txt files.
The package provides functions to retrieve and parse robots.txt files. The core functionality is to check a bot's/user's permission for one or more resources (paths) on a given domain. To ease checking, all functions have been bundled with relevant data into an R6 robotstxt class, but everything works functionally or object-oriented, depending on the user's preference.

https://github.com/petermeissner/robotstxt
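A short sketch of the two styles, reusing only calls that appear elsewhere in this thread (domain and path are just examples):

```r
library(robotstxt)

# object-oriented style: parse once, check repeatedly
rt <- robotstxt("pmeissner.com")
rt$check(paths = "_layouts", bot = "*")

# functional style: check paths directly against the parsed permissions
paths_allowed(permissions = rt$permissions, paths = "_layouts", bot = "*")
```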
A robots.txt file looks, for example, like this:
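An illustrative example; the directives map onto the fields the parser extracts (user agents, permissions, crawl delay, sitemap):

```
User-agent: *
Disallow: /private/
Crawl-delay: 10
Sitemap: http://example.com/sitemap.xml
```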
Package developers and users that want an easy way to be nice while gathering data from the web.
None that I know of.
Yes, good guidelines!
No.
With or without ropensci.
yes, MIT
Does devtools::check() produce any errors or warnings? If so paste them below: no.
Does not apply.
No, no resubmit.