Java advice from Thib / LOCKSS / cbeer #18

Closed
ndushay opened this issue May 2, 2017 · 4 comments

Comments


ndushay commented May 2, 2017

(may spawn sub-tickets)

  • CLI intro
  • CLI examples
  • do they use bagit?
  • what do they do for validation?
  • how do they do http proxy?
  • how do they do rate limiting?
@ndushay ndushay added the ready label May 2, 2017
@ndushay ndushay changed the title Java advice from Thib / LOCKSS Java advice from Thib / LOCKSS / cbeer May 2, 2017

ndushay commented May 3, 2017

Set up meeting with Thib for Wed May 3 at 11am

@ndushay ndushay added in progress and removed ready labels May 3, 2017

ndushay commented May 3, 2017

http://commons.apache.org/proper/commons-cli/

  • create a bunch of Option instances
    • do this with Option.Builder (OptionBuilder is a deprecated version of same)
      • see javadoc for example usage -- chaining methods ending with .build()
  • add Option instances to an Options instance (via .addOption())
  • create DefaultParser and use parse() to parse out options
    • if the arguments parse cleanly against the Options, a CommandLine instance is returned (otherwise parse() throws ParseException)
      • query CommandLine object, e.g. hasOption() and getOptionValue()

Given the above, how do we best use this in situ?

  • pass around CommandLine object?
    • means other classes need to know what CommandLine object is
    • CommandLine is not extensible
    • possible alternatives
      • local struct for Options and interface for CommandLine
        • e.g. a proxy arg that you want to split into hostname and port, where the port might be needed at a different point than the hostname; a local struct makes it possible to parse this arg once
      • put relevant arguments in appropriate constructors
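
One way to read the "local struct plus interface" alternative above: parse the command line once into a small immutable object, and let other classes depend on a narrow interface instead of on the Commons CLI CommandLine class. The following is only a minimal sketch under assumptions: it assumes a proxy argument of the form `host:port`, and `HarvestOptions`, `ProxySetting`, and `ParsedOptions` are hypothetical names invented for illustration, not anything from LOCKSS or Commons CLI.

```java
import java.util.Optional;

/** Hypothetical narrow view of the parsed command line; callers never see CommandLine. */
interface HarvestOptions {
    Optional<ProxySetting> proxy();
}

/** Hypothetical immutable holder for a proxy argument of the assumed form "host:port". */
final class ProxySetting {
    private final String host;
    private final int port;

    private ProxySetting(String host, int port) {
        this.host = host;
        this.port = port;
    }

    /** Splits "host:port" once, so hostname and port can be read wherever each is needed. */
    static ProxySetting parse(String arg) {
        int colon = arg.lastIndexOf(':');
        if (colon <= 0 || colon == arg.length() - 1) {
            throw new IllegalArgumentException("expected host:port, got: " + arg);
        }
        int port;
        try {
            port = Integer.parseInt(arg.substring(colon + 1));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("port is not a number in: " + arg, e);
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range in: " + arg);
        }
        return new ProxySetting(arg.substring(0, colon), port);
    }

    String host() { return host; }

    int port() { return port; }
}

/** Hypothetical "local struct": built once from raw option values, then passed around. */
final class ParsedOptions implements HarvestOptions {
    private final ProxySetting proxy; // null when no proxy argument was given

    /** proxyArg would come from something like CommandLine.getOptionValue(...); may be null. */
    ParsedOptions(String proxyArg) {
        this.proxy = (proxyArg == null) ? null : ProxySetting.parse(proxyArg);
    }

    @Override
    public Optional<ProxySetting> proxy() {
        return Optional.ofNullable(proxy);
    }
}
```

With this shape, only the class that builds `ParsedOptions` needs to know about Commons CLI; a downloader could take a `HarvestOptions` (or just a `ProxySetting`) in its constructor, which also covers the "put relevant arguments in appropriate constructors" alternative.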


ndushay commented May 3, 2017

Thib will send us links, e.g. org.lockss.tdb (tdb = title database), an exemplar of a locally created interface

  • CommandLineAdapter is perhaps the interface of interest
  • TdbParse
    • run()
    • addOptions()


ndushay commented May 3, 2017

LOCKSS practices (bagit, checksums)

  • bagit not used
  • no checksum validation
  • downloader/storage layers responsible for getting the files - "success or failure"
    • possibly checksum validation later in processing
  • most of their datasources don't have checksums
  • bagit might be useful approach if they start downloading from sites with checksums available

large file downloading

  • open a java.net.URL connection and hope it's okay (works better than you might think)
    • what if your target has multiple IP addresses?
  • LOCKSS uses HttpClient3 (timeouts, can configure for IP addresses, error handling, more resilient)
  • HttpClient4 vs. 3
    • we should use 4 from the get-go
    • Fernando is the go-to person for HttpClient
  • daemons in use
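
For reference, the "open a connection and hope it's okay" approach can be sketched with the JDK alone. This is a minimal illustration, not LOCKSS code: `SimpleDownloader` is a hypothetical name, the timeout values are left to the caller, and it has none of the resilience attributed to HttpClient above. It streams the response to disk so a large file is never held in memory, and any problem surfaces as an IOException, matching the "success or failure" contract.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: stream whatever a URL returns straight into a file. */
final class SimpleDownloader {
    private final int connectTimeoutMillis;
    private final int readTimeoutMillis;

    SimpleDownloader(int connectTimeoutMillis, int readTimeoutMillis) {
        this.connectTimeoutMillis = connectTimeoutMillis;
        this.readTimeoutMillis = readTimeoutMillis;
    }

    /** Returns the number of bytes written; any failure is reported as an IOException. */
    long download(URL source, Path target) throws IOException {
        URLConnection conn = source.openConnection();
        // Without explicit timeouts a stalled server can block the caller indefinitely.
        conn.setConnectTimeout(connectTimeoutMillis);
        conn.setReadTimeout(readTimeoutMillis);
        try (InputStream in = conn.getInputStream()) {
            // Files.copy reads the stream in chunks, so memory use stays flat for large files.
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

A caller would do something like `new SimpleDownloader(10_000, 60_000).download(url, path)`; retries, checksum validation, and choosing among multiple IP addresses are deliberately left out.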

Proxying

  • if there is a global proxy for the target, it just works
  • to use a particular proxy: java.net.URL doesn't allow it; use the HttpClient library
  • Thib will set up something with Fernando (Naomi, John, Tommy)

Rate Limiting

  • perhaps not as important for large files
    • e.g. article may be text and a lot of images
  • okay to be bursty as long as it's not too aggressive
  • pay attention to HTTP 429 (Too Many Requests) / 503 (Service Unavailable) (?), which can carry a Retry-After header - "please wait x seconds"
    • robots.txt applies to web crawling, not API harvesting (?)
    • set User-Agent
  • two approaches:
    • the request object has a method that answers "can I make this request now?"; if not, the caller sleeps/waits
    • rate limiter is the object that does the blocking
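
The two approaches above can live side by side in one small class. This is a rough sketch under assumptions, not what LOCKSS does: `IntervalRateLimiter` is a hypothetical name, it enforces a simple minimum interval between requests (so it does not permit bursts), and `backOff` assumes the caller has already read a "please wait x seconds" value from a 429/503 response.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical limiter that enforces a minimum interval between requests. */
final class IntervalRateLimiter {
    private final long minIntervalNanos;
    private long nextAllowedNanos; // earliest System.nanoTime() at which a request may start

    IntervalRateLimiter(long minIntervalMillis) {
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
        this.nextAllowedNanos = System.nanoTime();
    }

    /** Approach 1: answers "can I make this request now?" and reserves the slot if so. */
    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        if (now - nextAllowedNanos < 0) {
            return false; // too early; the caller decides whether to sleep or do other work
        }
        nextAllowedNanos = now + minIntervalNanos;
        return true;
    }

    /** Approach 2: the limiter itself blocks until the request is allowed. */
    synchronized void acquire() throws InterruptedException {
        long waitNanos;
        // Sleeping while holding the lock is intentional: callers are served one at a time.
        while ((waitNanos = nextAllowedNanos - System.nanoTime()) > 0) {
            TimeUnit.NANOSECONDS.sleep(waitNanos);
        }
        nextAllowedNanos = System.nanoTime() + minIntervalNanos;
    }

    /** Pushes the next slot back, e.g. after the server asked us to wait some seconds. */
    synchronized void backOff(long seconds) {
        long candidate = System.nanoTime() + TimeUnit.SECONDS.toNanos(seconds);
        if (candidate - nextAllowedNanos > 0) {
            nextAllowedNanos = candidate;
        }
    }
}
```

With approach 1 the download code calls `tryAcquire()` and sleeps or reschedules itself on `false`; with approach 2 it calls `acquire()` immediately before each request and lets the limiter do the waiting. Allowing controlled bursts would need a different scheme, such as a token bucket.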
