New features in google cloud storage #5703

sdavidfr · 2021-01-14T09:26:22Z

What component of google-cloud-cpp is this feature request for?
google/cloud/storage

Is your feature request related to a problem? Please describe.

I need to configure curl certificate
I need to retrieve the metadata of the prefix in one http request (because it is very slow to get one by one)
I need to start listing from a previous pageToken

Describe the solution you'd like

Add ssl_dir_path, verify_ssl function in ChannelOptions
Add includeTrailingDelimiter option in ListObjectsRequest
Initialize next_page_token_ with the pageToken available in Request

My solution:

===================================================================
--- google/cloud/storage/client_options.h	(revision 182516)
+++ google/cloud/storage/client_options.h	(working copy)
@@ -46,6 +46,8 @@
 class ChannelOptions {
  public:
   std::string ssl_root_path() const { return ssl_root_path_; }
+  std::string ssl_dir_path() const { return ssl_dir_path_; }
+  bool verify_ssl() const { return verify_ssl_; }
 
   ChannelOptions& set_ssl_root_path(std::string ssl_root_path) {
     ssl_root_path_ = std::move(ssl_root_path);
@@ -52,8 +54,20 @@
     return *this;
   }
 
+  ChannelOptions& set_ssl_dir_path(std::string ssl_dir_path) {
+      ssl_dir_path_ = std::move(ssl_dir_path);
+      return *this;
+  }
+
+  ChannelOptions& set_verify_ssl(bool verify_ssl) {
+      verify_ssl_ = verify_ssl;
+      return *this;
+  }
+
  private:
   std::string ssl_root_path_;
+  std::string ssl_dir_path_;
+  bool verify_ssl_;
 };
 
 /**
Index: google/cloud/storage/internal/bucket_requests.cc
===================================================================
--- google/cloud/storage/internal/bucket_requests.cc	(revision 182516)
+++ google/cloud/storage/internal/bucket_requests.cc	(working copy)
@@ -461,7 +461,7 @@
 }
 
 std::ostream& operator<<(std::ostream& os, ListBucketsRequest const& r) {
-  os << "ListBucketsRequest={project_id=" << r.project_id();
+  os << "ListBucketsRequest={project_id=" << r.project_id() << ", pageToken=" << r.page_token();
   r.DumpOptions(os, ", ");
   return os << "}";
 }
Index: google/cloud/storage/internal/curl_client.cc
===================================================================
--- google/cloud/storage/internal/curl_client.cc	(revision 182516)
+++ google/cloud/storage/internal/curl_client.cc	(working copy)
@@ -308,6 +308,8 @@
     return status;
   }
   builder.AddQueryParameter("project", request.project_id());
+  if (!request.page_token().empty())
+      builder.AddQueryParameter("pageToken", request.page_token());
   return ParseFromHttpResponse<ListBucketsResponse>(
       builder.BuildRequest().MakeRequest(std::string{}));
 }
Index: google/cloud/storage/internal/curl_handle_factory.cc
===================================================================
--- google/cloud/storage/internal/curl_handle_factory.cc	(revision 182516)
+++ google/cloud/storage/internal/curl_handle_factory.cc	(working copy)
@@ -29,6 +29,13 @@
 
 void CurlHandleFactory::SetCurlOptions(CURL* handle,
                                        ChannelOptions const& options) {
+  curl_easy_setopt(handle, CURLOPT_SSL_VERIFYPEER, options.verify_ssl() ? 1L : 0L);
+  curl_easy_setopt(handle, CURLOPT_SSL_VERIFYHOST, options.verify_ssl() ? 1L : 0L);
+
+  if (!options.ssl_dir_path().empty()) {
+    SetCurlStringOption(handle, CURLOPT_CAPATH,
+                        options.ssl_dir_path().c_str());
+  }
   if (!options.ssl_root_path().empty()) {
     SetCurlStringOption(handle, CURLOPT_CAINFO,
                         options.ssl_root_path().c_str());
Index: google/cloud/storage/internal/object_requests.h
===================================================================
--- google/cloud/storage/internal/object_requests.h	(revision 182516)
+++ google/cloud/storage/internal/object_requests.h	(working copy)
@@ -60,7 +60,7 @@
 class ListObjectsRequest
     : public GenericRequest<ListObjectsRequest, MaxResults, Prefix, Delimiter,
                             StartOffset, EndOffset, Projection, UserProject,
-                            Versions> {
+                            Versions, IncludeTrailingDelimiter> {
  public:
   ListObjectsRequest() = default;
   explicit ListObjectsRequest(std::string bucket_name)
Index: google/cloud/storage/internal/range_from_pagination.h
===================================================================
--- google/cloud/storage/internal/range_from_pagination.h	(revision 182516)
+++ google/cloud/storage/internal/range_from_pagination.h	(working copy)
@@ -109,6 +109,7 @@
       : request_(std::move(request)),
         next_page_loader_(std::move(loader)),
         get_items_(std::move(get_items)),
+        next_page_token_(request_.page_token()),
         on_last_page_(false) {
     current_ = current_page_.begin();
   }
Index: google/cloud/storage/well_known_parameters.h
===================================================================
--- google/cloud/storage/well_known_parameters.h	(revision 182516)
+++ google/cloud/storage/well_known_parameters.h	(working copy)
@@ -453,6 +453,11 @@
   static char const* well_known_parameter_name() { return "delimiter"; }
 };
 
+struct IncludeTrailingDelimiter : public internal::WellKnownParameter<IncludeTrailingDelimiter, std::string> {
+    using WellKnownParameter<IncludeTrailingDelimiter, std::string>::WellKnownParameter;
+    static char const* well_known_parameter_name() { return "includeTrailingDelimiter"; }
+};
+
 /**
  * Filter results to objects whose names are lexicographically equal to or after
  * StartOffset.

The text was updated successfully, but these errors were encountered:

coryan · 2021-01-14T13:59:21Z

I need to configure curl certificate

That seems useful, thanks. You also added an option to completely disable SSL verification, can you explain why that would be useful? Under which circumstances you do not have the certificates to verify that GCS is actually GCS? How do you protect against MITM attacks without SSL verification?

I need to retrieve the metadata of the prefix in one http request (because it is very slow to get one by one)

I do not follow, can you explain in more detail?

I need to start listing from a previous pageToken

Why not just specify the last received object name in StartOffset? Listings are always lexicographically ordered. Or better yet, specify "the successor of the last object name", something that can be calculated relatively easily?

Even if one implemented this, how do you know what was the last pageToken?

sdavidfr · 2021-01-14T14:27:39Z

Disabling ssl verification is only for testing
It is documented but not available in the current version:
https://cloud.google.com/storage/docs/json_api/v1/objects/list

includeTrailingDelimiter	boolean	If true, objects that end in exactly one instance of delimiter have their metadata included in items[] in addition to the relevant part of the object name appearing in prefixes[].

Correction: this change is for ListBucketsRequest which does not have "startOffset"

coryan · 2021-01-14T14:58:11Z

Disabling ssl verification is only for testing

Any reason you cannot use http:// for testing? I am pushing back because the option seems dangerous, and I would rather not expose all the sharp edges to our customers.

It is documented but not available in the current version:
https://cloud.google.com/storage/docs/json_api/v1/objects/list

Ack. You probably noticed I created specific feature requests for this and the CAPATH one.

Correction: this change is for ListBucketsRequest which does not have "startOffset"

I see. But how do you get the pageToken from a previous request? We do not expose those through the gcs::ListBucketsReader APIs, so how do you obtain a valid token?

sdavidfr · 2021-01-14T15:13:38Z

Ok. I can change to use "http"

I obtain the pageToken with overloading:

_reader = new ListBucketsReader(req,
    [this](internal::ListBucketsRequest const& r) {
        this->_current_page_token = r.page_token();
        return this->_gs->raw_client()->ListBuckets(r);
});

coryan · 2021-01-14T15:40:23Z

I obtain the pageToken with overloading:

I see ... I am sure you know that we reserve the right to change APIs in internal:: without notice:

https://github.com/googleapis/google-cloud-cpp#api-breaking-changes

I should have asked this earlier, but what is the motivation to resume listing buckets with a given page token? I am guessing you are either (1) trying to return a "page" of buckets when implementing some kind of UI that shows a subset of the buckets at a time, or (2) to checkpoint and resume listing after the application restarts.

For (1): maybe you can just save the iterator and reader objects?
For (2): maybe we can offer a way to convert the reader into a opaque string and then reconstruct it from an opaque string. I think that would expose fewer implementation details, and would give you an API that is stable into the future.

If my guesses are wrong, maybe you can give us some more details and we can design a solution that works well for your use-case.

sdavidfr · 2021-01-14T15:57:44Z

Right, it is to return a page of buckets (and application could also restart).

I understand that you can change API in ::internal but pageToken is exposed by the Google REST API so why it is not exposed by the SDK.

coryan · 2021-01-14T16:20:16Z

Right, it is to return a page of buckets (and application could also restart).

Ack.

I understand that you can change API in ::internal but pageToken is exposed by the Google REST API so why it is not exposed by the SDK.

A few reasons:

We did not think about your use-case.
We try to use the same idioms across all C++ client libraries, regardless of the underlying protocol. For example, Spanner returns large datasets using streaming RPCs, while Storage buckets are listed via page tokens. We use C++ iterators for both.
In general, we try to expose as little was possible because the SDK could (eventually) target a different API, say gRPC instead of REST.

In any case, seems like a way to save the state of the reader (token and current location) would solve your problem. I am thinking something like this, but I need to discuss it with my team:

struct Page {
  std::vector<std::string> bucket_names;
  std::string reader_state;
};

Page BucketsPage(gcs::Client client, std::string reader_state) {
  auto reader = client.ListBuckets(gcs::RestoreBucketReader(reader_state /* an empty string starts a new reader */));
  std::vector<std::string> names;
  for (auto& b : reader) {
    if (!b || names.size() >= 100) break;
    names.push_back(b->name());
  }
  return Page{std::move(names), std::move(reader).Checkpoint()};
}

There are boundary conditions to consider: if you create a reader with a page token, what should we do with the items already returned by the iterator? Return them again? Skip them? The latter suggests we need to save more than just the pageToken. What do you do with the items that have not been returned by the iterator? If you rebuild the stream with just the next page token those will be lost.

sdavidfr · 2021-01-14T17:50:03Z

In my implementation, i store pageToken and the last bucket i read.
When starting a new ListBuckets, i pass pageToken (taking from my hack) then skip the buckets already returned.
I think it is the client's responsibility to skip the buckets since the SDK/REST API is page based

coryan · 2021-01-25T14:05:12Z

I have broken down that last feature request to #5742. I am closing this because I think all the work is now in separate issues. Thanks for taking the time to describe the problem(s).

sdavidfr added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Jan 14, 2021

coryan added the api: storage Issues related to the Cloud Storage API. label Jan 14, 2021

This was referenced Jan 14, 2021

Support includeTrailingDelimiter in Client::ListObjects #5704

Closed

Support option to set ssl_dir_path (aka CURLOPT_CAPATH) #5705

Closed

coryan self-assigned this Jan 14, 2021

coryan mentioned this issue Jan 25, 2021

Provide a mechanism to save bucket pagination ranges #5742

Closed

coryan closed this as completed Jan 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New features in google cloud storage #5703

New features in google cloud storage #5703

sdavidfr commented Jan 14, 2021

coryan commented Jan 14, 2021

sdavidfr commented Jan 14, 2021

coryan commented Jan 14, 2021

sdavidfr commented Jan 14, 2021

coryan commented Jan 14, 2021

sdavidfr commented Jan 14, 2021

coryan commented Jan 14, 2021

sdavidfr commented Jan 14, 2021

coryan commented Jan 25, 2021

New features in google cloud storage #5703

New features in google cloud storage #5703

Comments

sdavidfr commented Jan 14, 2021

coryan commented Jan 14, 2021

sdavidfr commented Jan 14, 2021

coryan commented Jan 14, 2021

sdavidfr commented Jan 14, 2021

coryan commented Jan 14, 2021

sdavidfr commented Jan 14, 2021

coryan commented Jan 14, 2021

sdavidfr commented Jan 14, 2021

coryan commented Jan 25, 2021