Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New features in google cloud storage #5703

Closed
sdavidfr opened this issue Jan 14, 2021 · 9 comments
Closed

New features in google cloud storage #5703

sdavidfr opened this issue Jan 14, 2021 · 9 comments
Assignees
Labels
api: storage Issues related to the Cloud Storage API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@sdavidfr
Copy link

What component of google-cloud-cpp is this feature request for?
google/cloud/storage

Is your feature request related to a problem? Please describe.

  1. I need to configure curl certificate
  2. I need to retrieve the metadata of the prefix in one http request (because it is very slow to get one by one)
  3. I need to start listing from a previous pageToken

Describe the solution you'd like

  1. Add ssl_dir_path, verify_ssl function in ChannelOptions
  2. Add includeTrailingDelimiter option in ListObjectsRequest
  3. Initialize next_page_token_ with the pageToken available in Request

My solution:

===================================================================
--- google/cloud/storage/client_options.h	(revision 182516)
+++ google/cloud/storage/client_options.h	(working copy)
@@ -46,6 +46,8 @@
 class ChannelOptions {
  public:
   std::string ssl_root_path() const { return ssl_root_path_; }
+  std::string ssl_dir_path() const { return ssl_dir_path_; }
+  bool verify_ssl() const { return verify_ssl_; }
 
   ChannelOptions& set_ssl_root_path(std::string ssl_root_path) {
     ssl_root_path_ = std::move(ssl_root_path);
@@ -52,8 +54,20 @@
     return *this;
   }
 
+  ChannelOptions& set_ssl_dir_path(std::string ssl_dir_path) {
+      ssl_dir_path_ = std::move(ssl_dir_path);
+      return *this;
+  }
+
+  ChannelOptions& set_verify_ssl(bool verify_ssl) {
+      verify_ssl_ = verify_ssl;
+      return *this;
+  }
+
  private:
   std::string ssl_root_path_;
+  std::string ssl_dir_path_;
+  bool verify_ssl_;
 };
 
 /**
Index: google/cloud/storage/internal/bucket_requests.cc
===================================================================
--- google/cloud/storage/internal/bucket_requests.cc	(revision 182516)
+++ google/cloud/storage/internal/bucket_requests.cc	(working copy)
@@ -461,7 +461,7 @@
 }
 
 std::ostream& operator<<(std::ostream& os, ListBucketsRequest const& r) {
-  os << "ListBucketsRequest={project_id=" << r.project_id();
+  os << "ListBucketsRequest={project_id=" << r.project_id() << ", pageToken=" << r.page_token();
   r.DumpOptions(os, ", ");
   return os << "}";
 }
Index: google/cloud/storage/internal/curl_client.cc
===================================================================
--- google/cloud/storage/internal/curl_client.cc	(revision 182516)
+++ google/cloud/storage/internal/curl_client.cc	(working copy)
@@ -308,6 +308,8 @@
     return status;
   }
   builder.AddQueryParameter("project", request.project_id());
+  if (!request.page_token().empty())
+      builder.AddQueryParameter("pageToken", request.page_token());
   return ParseFromHttpResponse<ListBucketsResponse>(
       builder.BuildRequest().MakeRequest(std::string{}));
 }
Index: google/cloud/storage/internal/curl_handle_factory.cc
===================================================================
--- google/cloud/storage/internal/curl_handle_factory.cc	(revision 182516)
+++ google/cloud/storage/internal/curl_handle_factory.cc	(working copy)
@@ -29,6 +29,13 @@
 
 void CurlHandleFactory::SetCurlOptions(CURL* handle,
                                        ChannelOptions const& options) {
+  curl_easy_setopt(handle, CURLOPT_SSL_VERIFYPEER, options.verify_ssl() ? 1L : 0L);
+  curl_easy_setopt(handle, CURLOPT_SSL_VERIFYHOST, options.verify_ssl() ? 1L : 0L);
+
+  if (!options.ssl_dir_path().empty()) {
+    SetCurlStringOption(handle, CURLOPT_CAPATH,
+                        options.ssl_dir_path().c_str());
+  }
   if (!options.ssl_root_path().empty()) {
     SetCurlStringOption(handle, CURLOPT_CAINFO,
                         options.ssl_root_path().c_str());
Index: google/cloud/storage/internal/object_requests.h
===================================================================
--- google/cloud/storage/internal/object_requests.h	(revision 182516)
+++ google/cloud/storage/internal/object_requests.h	(working copy)
@@ -60,7 +60,7 @@
 class ListObjectsRequest
     : public GenericRequest<ListObjectsRequest, MaxResults, Prefix, Delimiter,
                             StartOffset, EndOffset, Projection, UserProject,
-                            Versions> {
+                            Versions, IncludeTrailingDelimiter> {
  public:
   ListObjectsRequest() = default;
   explicit ListObjectsRequest(std::string bucket_name)
Index: google/cloud/storage/internal/range_from_pagination.h
===================================================================
--- google/cloud/storage/internal/range_from_pagination.h	(revision 182516)
+++ google/cloud/storage/internal/range_from_pagination.h	(working copy)
@@ -109,6 +109,7 @@
       : request_(std::move(request)),
         next_page_loader_(std::move(loader)),
         get_items_(std::move(get_items)),
+        next_page_token_(request_.page_token()),
         on_last_page_(false) {
     current_ = current_page_.begin();
   }
Index: google/cloud/storage/well_known_parameters.h
===================================================================
--- google/cloud/storage/well_known_parameters.h	(revision 182516)
+++ google/cloud/storage/well_known_parameters.h	(working copy)
@@ -453,6 +453,11 @@
   static char const* well_known_parameter_name() { return "delimiter"; }
 };
 
+struct IncludeTrailingDelimiter : public internal::WellKnownParameter<IncludeTrailingDelimiter, std::string> {
+    using WellKnownParameter<IncludeTrailingDelimiter, std::string>::WellKnownParameter;
+    static char const* well_known_parameter_name() { return "includeTrailingDelimiter"; }
+};
+
 /**
  * Filter results to objects whose names are lexicographically equal to or after
  * StartOffset.
@sdavidfr sdavidfr added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Jan 14, 2021
@coryan
Copy link
Contributor

coryan commented Jan 14, 2021

  1. I need to configure curl certificate

That seems useful, thanks. You also added an option to completely disable SSL verification, can you explain why that would be useful? Under which circumstances you do not have the certificates to verify that GCS is actually GCS? How do you protect against MITM attacks without SSL verification?

  1. I need to retrieve the metadata of the prefix in one http request (because it is very slow to get one by one)

I do not follow, can you explain in more detail?

  1. I need to start listing from a previous pageToken

Why not just specify the last received object name in StartOffset? Listings are always lexicographically ordered. Or better yet, specify "the successor of the last object name", something that can be calculated relatively easily?

Even if one implemented this, how do you know what was the last pageToken?

@coryan coryan added the api: storage Issues related to the Cloud Storage API. label Jan 14, 2021
@sdavidfr
Copy link
Author

  1. Disabling ssl verification is only for testing
  2. It is documented but not available in the current version:
    https://cloud.google.com/storage/docs/json_api/v1/objects/list
includeTrailingDelimiter boolean If true, objects that end in exactly one instance of delimiter have their metadata included in items[] in addition to the relevant part of the object name appearing in prefixes[].
  1. Correction: this change is for ListBucketsRequest which does not have "startOffset"

@coryan
Copy link
Contributor

coryan commented Jan 14, 2021

  1. Disabling ssl verification is only for testing

Any reason you cannot use http:// for testing? I am pushing back because the option seems dangerous, and I would rather not expose all the sharp edges to our customers.

  1. It is documented but not available in the current version:
    https://cloud.google.com/storage/docs/json_api/v1/objects/list

Ack. You probably noticed I created specific feature requests for this and the CAPATH one.

  1. Correction: this change is for ListBucketsRequest which does not have "startOffset"

I see. But how do you get the pageToken from a previous request? We do not expose those through the gcs::ListBucketsReader APIs, so how do you obtain a valid token?

@sdavidfr
Copy link
Author

  1. Ok. I can change to use "http"
  2. I obtain the pageToken with overloading:
    _reader = new ListBucketsReader(req,
        [this](internal::ListBucketsRequest const& r) {
            this->_current_page_token = r.page_token();
            return this->_gs->raw_client()->ListBuckets(r);
    });
    

@coryan
Copy link
Contributor

coryan commented Jan 14, 2021

  1. I obtain the pageToken with overloading:

I see ... I am sure you know that we reserve the right to change APIs in internal:: without notice:

https://github.com/googleapis/google-cloud-cpp#api-breaking-changes

I should have asked this earlier, but what is the motivation to resume listing buckets with a given page token? I am guessing you are either (1) trying to return a "page" of buckets when implementing some kind of UI that shows a subset of the buckets at a time, or (2) to checkpoint and resume listing after the application restarts.

For (1): maybe you can just save the iterator and reader objects?
For (2): maybe we can offer a way to convert the reader into a opaque string and then reconstruct it from an opaque string. I think that would expose fewer implementation details, and would give you an API that is stable into the future.

If my guesses are wrong, maybe you can give us some more details and we can design a solution that works well for your use-case.

@coryan coryan self-assigned this Jan 14, 2021
@sdavidfr
Copy link
Author

Right, it is to return a page of buckets (and application could also restart).

I understand that you can change API in ::internal but pageToken is exposed by the Google REST API so why it is not exposed by the SDK.

@coryan
Copy link
Contributor

coryan commented Jan 14, 2021

Right, it is to return a page of buckets (and application could also restart).

Ack.

I understand that you can change API in ::internal but pageToken is exposed by the Google REST API so why it is not exposed by the SDK.

A few reasons:

  • We did not think about your use-case.
  • We try to use the same idioms across all C++ client libraries, regardless of the underlying protocol. For example, Spanner returns large datasets using streaming RPCs, while Storage buckets are listed via page tokens. We use C++ iterators for both.
  • In general, we try to expose as little was possible because the SDK could (eventually) target a different API, say gRPC instead of REST.

In any case, seems like a way to save the state of the reader (token and current location) would solve your problem. I am thinking something like this, but I need to discuss it with my team:

struct Page {
  std::vector<std::string> bucket_names;
  std::string reader_state;
};

Page BucketsPage(gcs::Client client, std::string reader_state) {
  auto reader = client.ListBuckets(gcs::RestoreBucketReader(reader_state /* an empty string starts a new reader */));
  std::vector<std::string> names;
  for (auto& b : reader) {
    if (!b || names.size() >= 100) break;
    names.push_back(b->name());
  }
  return Page{std::move(names), std::move(reader).Checkpoint()};
}

There are boundary conditions to consider: if you create a reader with a page token, what should we do with the items already returned by the iterator? Return them again? Skip them? The latter suggests we need to save more than just the pageToken. What do you do with the items that have not been returned by the iterator? If you rebuild the stream with just the next page token those will be lost.

@sdavidfr
Copy link
Author

In my implementation, i store pageToken and the last bucket i read.
When starting a new ListBuckets, i pass pageToken (taking from my hack) then skip the buckets already returned.
I think it is the client's responsibility to skip the buckets since the SDK/REST API is page based

@coryan
Copy link
Contributor

coryan commented Jan 25, 2021

I have broken down that last feature request to #5742. I am closing this because I think all the work is now in separate issues. Thanks for taking the time to describe the problem(s).

@coryan coryan closed this as completed Jan 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: storage Issues related to the Cloud Storage API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

No branches or pull requests

2 participants