robotstxt #25

Closed · 10 tasks done
petermeissner opened this issue Jan 9, 2016 · 24 comments

Comments

@petermeissner

    1. What does this package do? (explain in 50 words or less)

Web scraping allows one to gather information of scientific value - mainly social-science related, in my experience. While scraping web pages one should respect the permissions declared in robots.txt files.
The package provides functions to retrieve and parse robots.txt files. The core functionality is to check a bot's/user's permissions for one or more resources (paths) on a given domain. To ease checking, all functions have been bundled with the relevant data into an R6 robotstxt class, but everything can be used either functionally or object-oriented, depending on the user's preferences.
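A minimal usage sketch of the object-oriented style described above (the domain and paths are placeholders, not taken from the submission):

```r
library(robotstxt)

# fetch and parse the robots.txt once, then check paths repeatedly
rt <- robotstxt$new(domain = "example.com")
rt$check(paths = c("/", "/private/"), bot = "*")
```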

    2. Paste the full DESCRIPTION file inside a code block (bounded by ``` on either end).

```
Package: robotstxt
Type: Package
Title: A 'robots.txt' Parser and Webbot/Webspider/Webcrawler Permissions Checker
Version: 0.1.0
Author: Peter Meissner
Maintainer: Peter Meissner <retep.meissner@gmai.com>
Description: Class ('R6') and accompanying methods to
    parse and check 'robots.txt' files. Data fields are provided as
    data frames and vectors. Permissions can be checked by providing
    path character vectors and optional bot names.
License: MIT + file LICENSE
LazyData: TRUE
BugReports: https://github.com/petermeissner/robotstxt/issues
URL: https://github.com/petermeissner/robotstxt
Imports:
    R6 (>= 2.1.1),
    stringr (>= 1.0.0),
    httr (>= 1.0.0)
Suggests:
    knitr,
    rmarkdown,
    dplyr,
    testthat
Depends:
    R (>= 3.0.0)
VignetteBuilder: knitr
RoxygenNote: 5.0.1
```

    3. URL for the package (the development repository, not a stylized html page)

https://github.com/petermeissner/robotstxt

    4. What data source(s) does it work with (if applicable)?

robots.txt files.

    5. Who is the target audience?

Package developers and users that want an easy way to be nice while gathering data from the web.

    6. Are there other R packages that accomplish the same thing? If so, what is different about yours?

None that I know of.

    7. Check the box next to each policy below, confirming that you agree. These are mandatory.
  • This package does not violate the Terms of Service of any service it interacts with
  • The repository has continuous integration with Travis and/or another service: https://travis-ci.org/petermeissner/robotstxt
  • The package contains a vignette:
  • The package contains a reasonably complete readme with devtools install instructions
  • The package contains unit tests
  • The package only exports functions to the NAMESPACE that are intended for end users
    8. Do you agree to follow the [rOpenSci packaging guidelines]?

Yes, good guidelines!

  • Are there any package dependencies not on CRAN?

No.

  • Do you intend for this package to go on CRAN?

With or without ropensci.

  • Does the package have a CRAN accepted license?

yes, MIT

  • Did devtools::check() produce any errors or warnings? If so paste them below.

no:

* DONE
Status: OK

R CMD check succeeded
    9. Please add explanations below for any exceptions to the above:

Does not apply.

    10. If this is a resubmission following rejection, please explain the change in circumstances.

No, not a resubmission.

@sckott (Contributor) commented Jan 10, 2016

Thanks for the submission 🚀 - seeking reviewers

@sckott (Contributor) commented Jan 10, 2016

Reviewers: @richfitz @Ironholds

@Ironholds (Member)

  1. Unless you own gmai.com there's a typo in your DESCRIPTION file ;p.
  2. Do you need the alternate df implementations? They're fun but also a lot of extra code to replace stringsAsFactors = FALSE ;P.
  3. The assignment here seems extraneous.
  4. "did apply" to "applied" here

As you may notice, most of these quibbles are, well, quibbles. On the whole the package is excellent :). I'll try to put time into a more detailed review later today or early tomorrow.

@petermeissner (Author)

Thanks for the ropensci infrastructure and thanks for the feedback - much-much-much appreciated.

in response to @Ironholds

  1. Unless you own gmai.com there's a typo in your DESCRIPTION file ;p.
  • ... that is why devtools::build_win() never called back ... (Windows build only complains about qpdf missing link1 link2)
  • fixed
  2. Do you need the alternate df implementations? They're fun but also a lot of extra code to replace stringsAsFactors = FALSE ;P.
  • fixed
  3. The assignment here seems extraneous.
  • after checking extraneous here and here ...
  • yes, a development relic
  • fixed
  4. "did apply" to "applied" here
  • fixed

A question of my own:

path_allowed() (link) will respond to unforeseen permission constellations with a warning and return all the data that led to the problem.

  • Do you think that design decision is wise/ok?
  • Is it better to return a warning and FALSE for each path?
  • Should it stop with an error? I think not - when used by headless programs this might break them; also, the problem is not fatal/severe.

@Ironholds (Member)

I think that's totally fine behaviour, but I'd caution you, in that scenario, to use more formal/explanatory description text, noting that problematic data will be returned.

(On the subject of formality, working on a more formal review now)

@Ironholds (Member)

On the whole: solid work! Some thoughts:

  1. There are a lot of functions, such as rt_get_fields, which are documented but not exported. What this means in practice is that the code can't be found or used but the documentation does appear within the package index, which is pretty confusing. I'd suggest adding the "internal" roxygen2 keyword to the docs for those functions so that you don't index them (see the sketch after this list).
  2. The vignette is still using "Vignette Title" for indexing, so that's what CRAN will list it as (do not feel bad, I have made this mistake 150m times and never notice until I've already submitted it to CRAN and they've accepted >.>)
  3. The vignette opens "The package provides a class (‘R6’)". This makes it sound like the class is called R6, at least to me.
  4. The vignette could really benefit from a brief explanation of what robots.txt files are and why they matter at the top, with some examples, rather than just links at the bottom.
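For point 1, a minimal sketch of that roxygen2 keyword in use (the function body here is a placeholder, not the package's actual code):

```r
#' Get fields from a parsed robots.txt (internal helper)
#'
#' @keywords internal
rt_get_fields <- function(txt) {
  # placeholder body; the real implementation lives in the package
  txt
}
```

With @keywords internal the help page is still built and reachable via ?rt_get_fields, but the function no longer shows up in the package index.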

@petermeissner (Author)

in response to @Ironholds II

  1. drop unexported from manual
  • done
  2. vignette title / index entry
  • done
  3. The vignette opens "The package ...
  • done
  4. The vignette could really benefit from a brief explanation of what robots.txt files are and why they matter ...
  • done

@petermeissner (Author)

in regard to how path_allowed() handles problems with permission deduction

  • path_allowed() will now warn more informatively, describing what happened:

    robotstxt::path_allowed() says:
    Encountered problems while trying to determine bot permissions, returning NA instead of TRUE/FALSE

  • furthermore, path_allowed() will return NA in cases where the permission cannot be determined
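A stripped-down sketch of that warn-and-return-NA pattern (the `permissions` data frame layout is an assumption and the actual matching logic is elided):

```r
path_allowed_sketch <- function(permissions, paths = "/", bot = "*") {
  # keep only the rules that apply to this bot (or to all bots)
  relevant <- permissions[permissions$useragent %in% c(bot, "*"), ]
  if (nrow(relevant) == 0 || anyNA(relevant$value)) {
    warning(
      "robotstxt::path_allowed() says:\n",
      "Encountered problems while trying to determine bot permissions, ",
      "returning NA instead of TRUE/FALSE"
    )
    return(NA)
  }
  # ... actual allow/disallow matching against `paths` would go here ...
  TRUE
}
```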

@petermeissner (Author)

Dear Oliver (@Ironholds),

thank you very much for taking the time to review my package and comment on it. I think the package gained a lot from your suggestions and I learned a lot myself. I really appreciate the effort. I would like to add you as a contributor within DESCRIPTION and README, if that's ok with you - what is the rOpenSci policy on that matter?

@Ironholds (Member)

I feel like I should say "the package looks really good now, sorry for not mentioning it earlier, this week has been nuts" before accepting or rejecting ;p.

@richfitz (Member)

I'm sorry, I have left this very late so I will just jot down the thoughts I have rather than a full review, in the hope that it is useful.

My biggest major comment is that I don't see what R6 is adding here; it seems that the return value could be a list (which contains one function).

Don't get me wrong -- I love R6. But in cases where reference semantics aren't needed or being used it seems an unnecessary weirdness for most users (and in my experience users do find them weird). See the code below for a pure-list implementation of your function (purposefully left very similar) to show that there is no real difference in implementation. The only difference is that initialisation looks less weird (x <- robotstxt() not x <- robotstxt$new()).

```r
robotstxt <- function(domain, text) {
  ## check input
  self <- list()
  if (missing(domain)) {
    self$domain <- NA
  }
  if (!missing(text)) {
    self$text <- text
    if (!missing(domain)) {
      self$domain <- domain
    }
  } else {
    if (!missing(domain)) {
      self$domain <- domain
      self$text   <- get_robotstxt(domain)
    } else {
      stop("robotstxt: I need text to initialize.")
    }
  }

  ## fill fields with default data
  tmp <- parse_robotstxt(self$text)
  self$bots        <- tmp$useragents
  self$comments    <- tmp$comments
  self$permissions <- tmp$permissions
  self$crawl_delay <- tmp$crawl_delay
  self$host        <- tmp$host
  self$sitemap     <- tmp$sitemap
  self$other       <- tmp$other

  self$check <- function(paths="/", bot="*", permission=self$permissions) {
    paths_allowed(permissions=permission, paths=paths, bot=bot)
  }

  class(self) <- "robotstxt"
  self
}
```

(this passes checks in the package by deleting only the $new in the test file).

I really like the idea of downloading the file once and testing repeatedly. I wonder if that could be extended with the assumption that the file does not change during an R session. Then one could memoise the calls and avoid having to have any sort of object for the users to worry about.

```r
cache <- new.env(parent = emptyenv())
path_allowed <- function(url) {
  x <- httr::parse_url(url)
  obj <- cache[[x$hostname]]
  if (is.null(obj)) {
    cache[[x$hostname]] <- obj <- robotstxt(x$hostname)
  }
  obj$check(x$path)
}
```

Minor comments

Avoid httr::content(rtxt) in package code (see ?httr::content) in favour of checking that the return type was correct. In development versions of httr I find I also have to declare encoding="UTF-8" to avoid a message on every request, too.
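For example, something along these lines (a sketch; the exact call site in the package will differ):

```r
rtxt <- httr::GET("http://example.com/robots.txt")
httr::stop_for_status(rtxt)  # fail loudly on HTTP errors
# request text explicitly instead of letting content() guess the type,
# and declare the encoding to avoid a message on every request
txt <- httr::content(rtxt, as = "text", encoding = "UTF-8")
```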

In get_robotstxt, is warning the appropriate behaviour? I would have thought that for 403 and 5xx errors it should be an error. For 404 errors could we not assume that there _is_ no robots.txt and therefore it's all fair game?

I find that missing() lends itself to hard-to-diagnose errors when functions are several layers deep; consider explicit NULL arguments and is.null() tests for missing-ness (purely style).
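An illustrative sketch of that style (not the package's code; it just mirrors the constructor discussed above):

```r
# default both arguments to NULL and test with is.null() instead of missing()
robotstxt <- function(domain = NULL, text = NULL) {
  if (is.null(text)) {
    if (is.null(domain)) {
      stop("robotstxt: I need text to initialize.")
    }
    text <- get_robotstxt(domain)
  }
  # ... parse `text` and build the object as before ...
}
```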

You have a partial match of $useragent to $useragents in R/permissions.R and throughout tests/testthat/test_parser.R
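For context, R's `$` operator on lists partially matches names, so `x$useragent` silently returns `x$useragents`. A quick illustration:

```r
x <- list(useragents = c("*", "googlebot"))
x$useragent  # partial matching: returns c("*", "googlebot"), no warning by default
```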

@petermeissner (Author)

Phew, some thoughts, aye? Sounds all very interesting and helpful, in the sense that I really have to think about it. Now, this will take me a while. Thank you very much - most appreciated!

@petermeissner (Author)

@richfitz

I am working on implementing all of your suggestions. But I really would like to understand better what happens in ropensci/robotstxt#6.

Do you maybe have a reference or a catchphrase for me that I can look up? What I specifically have a hard time wrapping my head around is how the function within the list knows where to find the data.

example

```r
# function definition

robotstxt_dev <- function(domain, text) {
  ## check input
  self <- list()
  if (missing(domain)) {
    self$domain <- NA
  }
  if (!missing(text)) {
    self$text <- text
    if (!missing(domain)) {
      self$domain <- domain
    }
  } else {
    if (!missing(domain)) {
      self$domain <- domain
      self$text   <- get_robotstxt(domain)
    } else {
      stop("robotstxt: I need text to initialize.")
    }
  }

  ## fill fields with default data
  tmp <- parse_robotstxt(self$text)
  self$bots        <- tmp$useragents
  self$comments    <- tmp$comments
  self$permissions <- tmp$permissions
  self$crawl_delay <- tmp$crawl_delay
  self$host        <- tmp$host
  self$sitemap     <- tmp$sitemap
  self$other       <- tmp$other

  self$check <- function(paths="/", bot="*", permission=self$permissions) {
    paths_allowed(permissions=permission, paths=paths, bot=bot)
  }

  class(self) <- "robotstxt"
  self
}

# example
library(robotstxt)
blah <- robotstxt_dev("pmeissner.com")$check
blah("_layouts")
```

At the moment, that feels like pure black magic.
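A tiny self-contained illustration of the mechanism at work - an R closure, nothing specific to robotstxt (all names here are made up):

```r
make_counter <- function() {
  count <- 0
  function() {
    # `count` is looked up in the enclosing environment,
    # which survives after make_counter() has returned
    count <<- count + 1
    count
  }
}

counter <- make_counter()
counter()  # 1
counter()  # 2
```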

@petermeissner (Author)

@richfitz

I have now implemented all suggestions.

  • In regard to the caching / memoisation suggestion I took the liberty of implementing the caching in another function (get_robotstxt()) instead - I think it is more appropriate to do the caching at this step (see the sketch after this list).
  • Also, I implemented the related suggestion to allow for one single function call to check permissions. Nonetheless, I kept the possibility to create robots.txt objects (see the updated examples in the README and vignette).
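A rough sketch of what per-session caching inside the download step can look like (the function and cache names here are made up for illustration, not the package's actual implementation):

```r
rt_cache <- new.env(parent = emptyenv())

get_robotstxt_cached <- function(domain) {
  txt <- rt_cache[[domain]]
  if (is.null(txt)) {
    txt <- get_robotstxt(domain)  # the download happens only once per domain
    rt_cache[[domain]] <- txt
  }
  txt
}
```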

Thank you very, very much for taking the time to go through my code. The package gained a lot from that, I think, and I learned a lot myself. I really appreciate the effort. Thanks.

@petermeissner (Author)

Just some further thoughts ...

In regard to discarding R6 because of not using "reference semantics", I have done some research - ah, so that's what it is.

  • First, I will still leave R6 out because it can be done without - as you have suggested - and it's one dependency less.
  • But using reference semantics is not the only reason to use OO design.
  • In this case I think the tight coupling of data and functions would be reason enough to use it.
  • In regard to users who might be confused by robotstxt$new() instead of robotstxt(): I do not think that the former is more confusing than the latter. In the former case the user is explicitly told that he will use something that might be unusual. In the latter case it is more like giving the user something that looks like something he understands but then letting it do black magic (the black magic is called a closure and is very well explained in Wickham 2015: "Advanced R", p. 175ff, in particular p. 186).

@petermeissner (Author)

Hey everyone (@sckott), so what is the next step? All suggestions have been gratefully accepted and incorporated. Is another thumbs-up needed from @richfitz, or is there something else I should/can do?

@sckott (Contributor) commented Apr 26, 2016

seems ready to merge in. I do see

```
checking for missing documentation entries
W  checking for code/documentation mismatches
   Codoc mismatches from documentation object 'robotstxt':
   robotstxt
     Code: function(domain = NULL, text = NULL)
     Docs: function(domain = "mydomain.com")
     Argument names in code not in docs:
       text
     Mismatches in argument default values:
       Name: 'domain' Code: NULL Docs: "mydomain.com"
   robotstxt
     Code: function(domain = NULL, text = NULL)
     Docs: function(text = "User-agent: *\nDisallow: /")
     Argument names in code not in docs:
       domain
     Mismatches in argument names:
       Position: 1 Code: domain Docs: text
     Mismatches in argument default values:
       Name: 'text' Code: NULL Docs: "User-agent: *\nDisallow: /"

W  checking Rd \usage sections
   Undocumented arguments in documentation object 'get_robotstxt'
     'warn'
   Undocumented arguments in documentation object 'robotstxt'
     'domain' 'text'
```

can these be fixed first?

@petermeissner (Author)

Sorry, I missed that. @sckott thanks for checking - it's all fixed.

All checks now pass locally (Ubuntu) without errors, warnings or notes, the same is true for winbuilder checks.

@sckott (Contributor) commented Apr 27, 2016

Great! Thanks. Approved.

I added you to a team with admin access, and now you should be able to transfer the repo to ropenscilabs - we move pkgs to the ropensci GitHub org once they are more mature - your pkg is still a fully rOpenSci pkg even though it is in ropenscilabs.

@petermeissner (Author)

Coooool.

Transfer repo? : It is done via Settings > Danger Zone > Transfer ownership : "ropenscilabs" - right?
How is pushing to CRAN handled thereafter? I would like to push the current thing to CRAN.

@sckott (Contributor) commented Apr 27, 2016

> It is done via Settings > Danger Zone > Transfer ownership : "ropenscilabs"

yes

> How is pushing to CRAN handled thereafter? I would like to push the current thing to CRAN.

same as before - you are still the maintainer of your pkg and can submit to CRAN as you like. You can push a new version to CRAN before transferring if you like

@petermeissner (Author)

Transferred to ropenscilabs - everything works fine.

@sckott (Contributor) commented Apr 27, 2016

awesome, thanks for your work and contribution

@petermeissner (Author)

Nah..., I am the one indebted for getting all that great feedback - thanks for letting me join the club.
