
Mem limiter #368

Merged
merged 79 commits on Nov 21, 2023
Changes from all commits
79 commits
1e44954
initial buffer reuse
DmitriyMusatkin Nov 6, 2023
0f59f52
test fixes
DmitriyMusatkin Nov 6, 2023
511ce35
more test fixes
DmitriyMusatkin Nov 6, 2023
a524b08
lets not be too fancy
DmitriyMusatkin Nov 6, 2023
5804c39
enable for puts
DmitriyMusatkin Nov 6, 2023
bd17c6d
bump limits
DmitriyMusatkin Nov 6, 2023
690fc85
dont reinit buffer
DmitriyMusatkin Nov 6, 2023
be631c6
fixes
DmitriyMusatkin Nov 6, 2023
d1df7ce
remove logging
DmitriyMusatkin Nov 6, 2023
bc9b126
cleaning up
DmitriyMusatkin Nov 9, 2023
891cd90
test fixes
DmitriyMusatkin Nov 9, 2023
c695744
32 bit fix
DmitriyMusatkin Nov 9, 2023
6c9e6ae
test fixes
DmitriyMusatkin Nov 9, 2023
e7166e9
fix small buffer for gets
DmitriyMusatkin Nov 9, 2023
9c40011
dont cancel trim
DmitriyMusatkin Nov 9, 2023
5207457
move around trim canceling
DmitriyMusatkin Nov 10, 2023
6f7286c
typo
DmitriyMusatkin Nov 10, 2023
22f38eb
lets check metrics inside synced block
DmitriyMusatkin Nov 10, 2023
43f12e6
data race
DmitriyMusatkin Nov 10, 2023
3d467d8
addressing comments
DmitriyMusatkin Nov 10, 2023
a7b9e7f
logging
DmitriyMusatkin Nov 10, 2023
ca9e597
low mem limits on 32
DmitriyMusatkin Nov 11, 2023
b0a9d83
typo
DmitriyMusatkin Nov 11, 2023
d9197dc
add more logging
DmitriyMusatkin Nov 13, 2023
3d755bd
build warning
DmitriyMusatkin Nov 13, 2023
02bae74
comment out correctly
DmitriyMusatkin Nov 13, 2023
dc9373e
comment out unused
DmitriyMusatkin Nov 13, 2023
c62cec1
more logging
DmitriyMusatkin Nov 13, 2023
607e28e
correct specifier
DmitriyMusatkin Nov 13, 2023
1c3345f
fix default mem config
DmitriyMusatkin Nov 13, 2023
36ad5c0
fixup mem usage stats
DmitriyMusatkin Nov 13, 2023
7ca1995
more logging
DmitriyMusatkin Nov 13, 2023
20f5c35
more logging
DmitriyMusatkin Nov 13, 2023
c5991e3
telemetry callback logs
DmitriyMusatkin Nov 13, 2023
7b98645
remove trim cancelling
DmitriyMusatkin Nov 13, 2023
df65c14
scheduling change
DmitriyMusatkin Nov 13, 2023
bb3c04a
switch over to reserving
DmitriyMusatkin Nov 14, 2023
3228670
unused params
DmitriyMusatkin Nov 14, 2023
6d68522
test fixes
DmitriyMusatkin Nov 14, 2023
cfc5f47
fix tests
DmitriyMusatkin Nov 14, 2023
b1f372f
remove assert
DmitriyMusatkin Nov 14, 2023
834f629
remove log
DmitriyMusatkin Nov 14, 2023
6d427a3
tweak reserve algo
DmitriyMusatkin Nov 15, 2023
8a7c617
fix
DmitriyMusatkin Nov 15, 2023
efef34d
docs
DmitriyMusatkin Nov 16, 2023
8e7f0db
Merge branch 'main' into mem_ticket
DmitriyMusatkin Nov 16, 2023
6a6fa88
fix test
DmitriyMusatkin Nov 16, 2023
b32a2fb
move test back
DmitriyMusatkin Nov 16, 2023
59e58fa
addressing comments
DmitriyMusatkin Nov 16, 2023
6617ab1
fix block size check
DmitriyMusatkin Nov 17, 2023
a94d0eb
fix buf limits test
DmitriyMusatkin Nov 17, 2023
9d783fd
add logging for debug purposes
DmitriyMusatkin Nov 17, 2023
7fd8358
more logs
DmitriyMusatkin Nov 17, 2023
4078601
moar logs
DmitriyMusatkin Nov 17, 2023
30b3069
telemetry callback makes no sense
DmitriyMusatkin Nov 17, 2023
3cfceb7
lets wait for meta req to shutdown
DmitriyMusatkin Nov 17, 2023
cb943cd
addressing comments
DmitriyMusatkin Nov 18, 2023
4f1488b
address comments
DmitriyMusatkin Nov 19, 2023
b450054
fix test
DmitriyMusatkin Nov 19, 2023
db19f07
another test
DmitriyMusatkin Nov 19, 2023
f3e896a
data race
DmitriyMusatkin Nov 19, 2023
1a8d329
data race
DmitriyMusatkin Nov 19, 2023
c600680
reenable trim
DmitriyMusatkin Nov 20, 2023
849d655
tweak buffer
DmitriyMusatkin Nov 20, 2023
e0b0ab7
fix build error
DmitriyMusatkin Nov 20, 2023
7f14806
lint and update docs
DmitriyMusatkin Nov 20, 2023
3739d1c
more lint
DmitriyMusatkin Nov 20, 2023
00cc3c1
adjust validation
DmitriyMusatkin Nov 20, 2023
ed5baa3
trim test
DmitriyMusatkin Nov 20, 2023
ab23ab6
lint
DmitriyMusatkin Nov 20, 2023
8fd62e4
net test case
DmitriyMusatkin Nov 20, 2023
52ffae7
Update source/s3_buffer_pool.c
DmitriyMusatkin Nov 21, 2023
fccca7c
addressing comments
DmitriyMusatkin Nov 21, 2023
1630751
addressing comments
DmitriyMusatkin Nov 21, 2023
0028246
lint, fix docs
DmitriyMusatkin Nov 21, 2023
fb3e351
even more lint
DmitriyMusatkin Nov 21, 2023
79977a5
address comments
DmitriyMusatkin Nov 21, 2023
bb5c139
remove 0 size buffer test
DmitriyMusatkin Nov 21, 2023
a6c69c2
typo
DmitriyMusatkin Nov 21, 2023
76 changes: 76 additions & 0 deletions docs/memory_aware_request_execution.md
@@ -0,0 +1,76 @@
The CRT S3 client was designed with throughput as a primary goal. As such, the client
scales resource usage, such as the number of parallel requests in flight, to achieve
the target throughput. The client creates buffers to hold the data it is sending or
receiving for each request, and scaling requests in flight has a direct impact on
memory used. In practice, setting a high target throughput or a larger part size can
lead to high observed memory usage.

To mitigate high memory usage, memory reuse improvements were recently added to
the client, along with options to limit the maximum memory used. The following
sections go into more detail on those changes and how they affect the client.

### Memory Reuse
At a basic level, the CRT S3 client starts with a meta request for an operation like
put or get, breaks it into smaller part-sized requests, and executes those in
parallel. The client used to allocate a part-sized buffer for each of those
requests and release it right after the request was done. That approach
resulted in a lot of very short-lived allocations and allocator thrashing,
overall leading to memory-use spikes considerably higher than what is needed. To
address that, the client is switching to a pooled buffer approach, discussed
below.

Note: the approach described below is a work in progress and concentrates on
improving the common cases (default 8mb part sizes and part sizes smaller than 64mb).

Several observations about the client's usage of buffers:
- The client does not automatically switch to buffers above the default 8mb for
uploads until the upload passes 10,000 parts (~80 GB); see the sketch after this list.
- Get operations always use either the configured part size or the default of 8mb.
Part size for gets is not adjusted, since there is no 10,000 part limitation.
- Both Put and Get operations go through fill and drain phases. For example, for Put,
the client first schedules a number of reads to 'fill' the buffers from the source,
and as those reads complete, the buffers are sent over to the networking layer and
'drained'.
- Individual UploadPart or ranged get requests typically have a similar lifespan
(with some caveats). In practice, part buffers are acquired and released in bulk at
the same time.
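
A rough sketch of the arithmetic behind the first observation; the helper below is
purely illustrative (not part of the client) and assumes the S3 limit of 10,000 parts
per multipart upload:

```c
#include <stdint.h>

/* With the default 8mb part size, 10,000 parts cover roughly
 * 8mb * 10,000 ~= 80 GB before a larger part size becomes necessary.
 * Hypothetical helper: smallest part size that still fits the object
 * into 10,000 parts. */
static uint64_t s_min_part_size_for_upload(uint64_t content_length, uint64_t configured_part_size) {
    const uint64_t max_parts = 10000;
    uint64_t min_part_size = (content_length + max_parts - 1) / max_parts; /* ceiling division */
    return min_part_size > configured_part_size ? min_part_size : configured_part_size;
}
```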

The buffer pooling takes advantage of some of those allocation patterns and
works as follows.
The memory is split into primary and secondary areas. The secondary area is used for
requests with a part size bigger than a predefined value (currently 4 times the part
size); allocations from it go directly to the allocator and are effectively the old
way of doing things.

The primary memory area is split into blocks of a fixed size (16 times the part size
if one is defined, or 16 times 8mb otherwise). Blocks are allocated on demand. Each
block is logically subdivided into part-sized chunks. The pool allocates and releases
in chunk sizes only, and supports acquiring several chunks (up to 4) at once.

Blocks are kept around while there are ongoing requests and are released
asynchronously when memory pressure is low.
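
A minimal sketch of that split, using the sizes quoted above (names and layout here
are illustrative, not the actual pool implementation):

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative constants from the description above: primary blocks hold 16
 * part-sized chunks, and at most 4 chunks can be acquired at once. */
#define CHUNKS_PER_BLOCK 16
#define MAX_CHUNKS_PER_ACQUIRE 4

/* chunk_size is the part size configured on the client (8mb by default). */
static size_t s_primary_block_size(size_t chunk_size) {
    /* Blocks are allocated on demand and carved into chunk-sized pieces. */
    return CHUNKS_PER_BLOCK * chunk_size;
}

static bool s_goes_to_secondary(size_t request_size, size_t chunk_size) {
    /* Requests bigger than 4 chunks bypass primary storage and go directly
     * to the base allocator, i.e. the old way of doing things. */
    return request_size > MAX_CHUNKS_PER_ACQUIRE * chunk_size;
}
```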

### Scheduling
Running out of memory is a terminal condition within CRT, and in general it is not
practical to try to set an overall memory limit on all allocations, since that
dramatically increases the complexity of the code that deals with cases where
only part of the memory needed for a task was allocated.

Comparatively, the majority of memory usage within the S3 client comes from buffers
allocated for Put/Get parts. So to control memory usage, the client
concentrates on controlling the number of buffers allocated. Effectively, this
boils down to a back-pressure mechanism that limits the number of parts
scheduled as memory gets closer to the limit. Memory used for other resources,
e.g. HTTP connection data and various supporting structures, is not actively
controlled; instead, some memory is taken out of the overall limit to account for it.

Overall, scheduling does best-effort memory limiting. At the time of
scheduling, the client reserves memory using the buffer pool's ticketing mechanism.
The buffer is acquired from the pool using the ticket as close to the actual usage as
possible (this approach peaks at lower memory usage than preallocating all memory
upfront, because buffers cannot all be used right away; e.g. reading from a file will
fill buffers slower than they are sent, leading to a decent amount of buffer reuse).
The reservation mechanism is approximate and in some cases can lead to actual memory
usage being higher once tickets are redeemed. The client sets aside some memory to
mitigate overflows like that.
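
To make that flow concrete, here is a minimal sketch using the buffer pool API added
in this PR (error handling and the real scheduling loop are omitted; sizes are
illustrative):

```c
#include <aws/common/byte_buf.h>
#include <aws/s3/private/s3_buffer_pool.h>

/* Sketch of the reserve -> acquire -> release flow described above.
 * The client does this per part request. */
static void s_buffer_pool_flow_sketch(struct aws_allocator *allocator) {
    const size_t part_size = 8 * 1024 * 1024;    /* chunk size, typically the part size */
    const size_t mem_limit = 1024 * 1024 * 1024; /* overall memory limit */

    struct aws_s3_buffer_pool *pool = aws_s3_buffer_pool_new(allocator, part_size, mem_limit);

    /* At scheduling time: reserve memory. On failure a reservation hold is
     * placed on the pool and no further parts should be scheduled. */
    struct aws_s3_buffer_pool_ticket *ticket = aws_s3_buffer_pool_reserve(pool, part_size);
    if (ticket == NULL) {
        aws_s3_buffer_pool_destroy(pool);
        return; /* back off until memory is released, then retry */
    }

    /* As close to the actual usage as possible: trade the ticket for a buffer.
     * This never fails, even if the reservation estimate was low. */
    struct aws_byte_buf buffer = aws_s3_buffer_pool_acquire_buffer(pool, ticket);
    (void)buffer; /* fill/drain the part using this buffer */

    /* Buffer lifetime is tied to the ticket; releasing the ticket returns
     * the chunk(s) to the pool. */
    aws_s3_buffer_pool_release_ticket(pool, ticket);

    aws_s3_buffer_pool_destroy(pool);
}
```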
133 changes: 133 additions & 0 deletions include/aws/s3/private/s3_buffer_pool.h
@@ -0,0 +1,133 @@
#ifndef AWS_S3_BUFFER_ALLOCATOR_H
#define AWS_S3_BUFFER_ALLOCATOR_H

/**
* Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
* SPDX-License-Identifier: Apache-2.0.
*/

#include <aws/s3/s3.h>

/*
* S3 buffer pool.
* Buffer pool used for pooling part sized buffers for Put/Get operations.
* Provides additional functionality for limiting the overall memory used.
* High-level buffer pool usage flow:
* - Create buffer with overall memory limit and common buffer size, aka chunk
* size (typically part size configured on client)
* - For each request:
* -- call reserve to acquire a ticket for future buffer acquisition. this will
* mark memory reserved, but will not allocate it. if the reserve call hits the
* memory limit, it fails and reservation hold is put on the whole buffer
* pool. (aws_s3_buffer_pool_remove_reservation_hold can be used to remove
* reservation hold).
* -- once request needs memory, it can exchange ticket for a buffer using
* aws_s3_buffer_pool_acquire_buffer. this operation never fails, even if it
* ends up going over memory limit.
* -- buffer lifetime is tied to the ticket. so once request is done with the
* buffer, ticket is released and buffer returns back to the pool.
*/

AWS_EXTERN_C_BEGIN

struct aws_s3_buffer_pool;
struct aws_s3_buffer_pool_ticket;

struct aws_s3_buffer_pool_usage_stats {
/* Effective Max memory limit. Memory limit value provided during construction minus
* buffer reserved for overhead of the pool */
size_t mem_limit;

/* How much mem is used in primary storage. includes memory used by blocks
* that are waiting on all allocs to release before being put back in circulation. */
size_t primary_used;
/* Overall memory allocated for blocks. */
size_t primary_allocated;
/* Reserved memory. Does not account for how that memory will map into
* blocks and in practice can be lower than used memory. */
size_t primary_reserved;
/* Number of blocks allocated in primary. */
size_t primary_num_blocks;

/* Secondary mem used. Accurate, maps directly to base allocator. */
size_t secondary_used;
/* Secondary mem reserved. Accurate, maps directly to base allocator. */
size_t secondary_reserved;
};

/*
* Create new buffer pool.
* chunk_size - specifies the size of memory that will most commonly be acquired
* from the pool (typically part size).
* mem_limit - limit on how much memory the buffer pool can use. once the limit is hit,
* buffers can no longer be reserved from the pool (a reservation hold is placed on the pool).
* Returns buffer pool pointer on success and NULL on failure.
*/
AWS_S3_API struct aws_s3_buffer_pool *aws_s3_buffer_pool_new(
struct aws_allocator *allocator,
size_t chunk_size,
size_t mem_limit);

/*
* Destroys buffer pool.
* Does nothing if buffer_pool is NULL.
*/
AWS_S3_API void aws_s3_buffer_pool_destroy(struct aws_s3_buffer_pool *buffer_pool);

/*
* Reserves memory from the pool for later use.
* Best effort and can potentially reserve memory slightly over the limit.
* Reservation takes some memory out of the available pool, but does not
* allocate it right away.
* On success ticket will be returned.
* On failure NULL is returned, an error is raised and a reservation hold is placed
* on the buffer pool. Any further reservations while the hold is active will fail.
* Remove reservation hold to unblock reservations.
*/
AWS_S3_API struct aws_s3_buffer_pool_ticket *aws_s3_buffer_pool_reserve(
struct aws_s3_buffer_pool *buffer_pool,
size_t size);

/*
* Whether pool has a reservation hold.
*/
AWS_S3_API bool aws_s3_buffer_pool_has_reservation_hold(struct aws_s3_buffer_pool *buffer_pool);

/*
* Remove reservation hold on pool.
*/
AWS_S3_API void aws_s3_buffer_pool_remove_reservation_hold(struct aws_s3_buffer_pool *buffer_pool);

/*
* Trades in the ticket for a buffer.
* Cannot fail and can over allocate above mem limit if reservation was not accurate.
* Using the same ticket twice will return the same buffer.
* Buffer is only valid until the ticket is released.
*/
AWS_S3_API struct aws_byte_buf aws_s3_buffer_pool_acquire_buffer(
struct aws_s3_buffer_pool *buffer_pool,
struct aws_s3_buffer_pool_ticket *ticket);

/*
* Releases the ticket.
* Any buffers associated with the ticket are invalidated.
*/
AWS_S3_API void aws_s3_buffer_pool_release_ticket(
struct aws_s3_buffer_pool *buffer_pool,
struct aws_s3_buffer_pool_ticket *ticket);

/*
* Get pool memory usage stats.
*/
AWS_S3_API struct aws_s3_buffer_pool_usage_stats aws_s3_buffer_pool_get_usage(struct aws_s3_buffer_pool *buffer_pool);

/*
* Trims all unused mem from the pool.
* Warning: fairly slow operation, do not use in critical path.
* TODO: partial trimming? ex. only trim down to 50% of max?
*/
AWS_S3_API void aws_s3_buffer_pool_trim(struct aws_s3_buffer_pool *buffer_pool);

AWS_EXTERN_C_END

#endif /* AWS_S3_BUFFER_ALLOCATOR_H */
8 changes: 8 additions & 0 deletions include/aws/s3/private/s3_client_impl.h
@@ -196,6 +196,8 @@ struct aws_s3_upload_part_timeout_stats {
struct aws_s3_client {
struct aws_allocator *allocator;

struct aws_s3_buffer_pool *buffer_pool;

struct aws_s3_client_vtable *vtable;

struct aws_ref_count ref_count;
@@ -340,6 +342,9 @@ struct aws_s3_client {
/* Task for processing requests from meta requests on connections. */
struct aws_task process_work_task;

/* Task for trimming buffer pool. */
struct aws_task trim_buffer_pool_task;

/* Number of endpoints currently allocated. Used during clean up to know how many endpoints are still in
* memory.*/
uint32_t num_endpoints_allocated;
@@ -378,6 +383,9 @@ struct aws_s3_client {

/* Number of requests currently being prepared. */
uint32_t num_requests_being_prepared;

/* Whether or not the buffer pool trim task is currently scheduled. */
uint32_t trim_buffer_pool_task_scheduled : 1;
} threaded_data;
};

9 changes: 8 additions & 1 deletion include/aws/s3/private/s3_request.h
@@ -12,6 +12,7 @@
#include <aws/common/thread.h>
#include <aws/s3/s3.h>

#include <aws/s3/private/s3_buffer_pool.h>
#include <aws/s3/private/s3_checksums.h>

struct aws_http_message;
@@ -22,6 +23,7 @@ enum aws_s3_request_flags {
AWS_S3_REQUEST_FLAG_RECORD_RESPONSE_HEADERS = 0x00000001,
AWS_S3_REQUEST_FLAG_PART_SIZE_RESPONSE_BODY = 0x00000002,
AWS_S3_REQUEST_FLAG_ALWAYS_SEND = 0x00000004,
AWS_S3_REQUEST_FLAG_PART_SIZE_REQUEST_BODY = 0x00000008,
};

/**
@@ -112,6 +114,8 @@ struct aws_s3_request {
* retried.*/
struct aws_byte_buf request_body;

struct aws_s3_buffer_pool_ticket *ticket;

/* Beginning range of this part. */
/* TODO currently only used by auto_range_get, could be hooked up to auto_range_put as well. */
uint64_t part_range_start;
@@ -184,7 +188,10 @@ struct aws_s3_request {
uint32_t record_response_headers : 1;

/* When true, the response body buffer will be allocated in the size of a part. */
uint32_t part_size_response_body : 1;
uint32_t has_part_size_response_body : 1;

/* When true, the request body buffer will be allocated in the size of a part. */
uint32_t has_part_size_request_body : 1;

/* When true, this request is being tracked by the client for limiting the amount of in-flight-requests/stats. */
uint32_t tracked_by_client : 1;
1 change: 1 addition & 0 deletions include/aws/s3/s3.h
@@ -41,6 +41,7 @@ enum aws_s3_errors {
AWS_ERROR_S3_INCORRECT_CONTENT_LENGTH,
AWS_ERROR_S3_REQUEST_TIME_TOO_SKEWED,
AWS_ERROR_S3_FILE_MODIFIED,
AWS_ERROR_S3_EXCEEDS_MEMORY_LIMIT,
AWS_ERROR_S3_END_RANGE = AWS_ERROR_ENUM_END_RANGE(AWS_C_S3_PACKAGE_ID)
};

3 changes: 3 additions & 0 deletions include/aws/s3/s3_client.h
@@ -344,6 +344,9 @@ struct aws_s3_client_config {
/* Throughput target in Gbps that we are trying to reach. */
double throughput_target_gbps;

/* Limit on how much memory the client is allowed to use, in bytes. */
size_t memory_limit_in_bytes;

/* Retry strategy to use. If NULL, a default retry strategy will be used. */
struct aws_retry_strategy *retry_strategy;

1 change: 1 addition & 0 deletions source/s3.c
@@ -41,6 +41,7 @@ static struct aws_error_info s_errors[] = {
AWS_DEFINE_ERROR_INFO_S3(AWS_ERROR_S3_INCORRECT_CONTENT_LENGTH, "Request body length must match Content-Length header."),
AWS_DEFINE_ERROR_INFO_S3(AWS_ERROR_S3_REQUEST_TIME_TOO_SKEWED, "RequestTimeTooSkewed error received from S3."),
AWS_DEFINE_ERROR_INFO_S3(AWS_ERROR_S3_FILE_MODIFIED, "The file was modified during upload."),
AWS_DEFINE_ERROR_INFO_S3(AWS_ERROR_S3_EXCEEDS_MEMORY_LIMIT, "Request was not created due to used memory exceeding memory limit."),
};
/* clang-format on */

25 changes: 22 additions & 3 deletions source/s3_auto_ranged_get.c
@@ -177,13 +177,21 @@ static bool s_s3_auto_ranged_get_update(
meta_request,
AWS_S3_AUTO_RANGE_GET_REQUEST_TYPE_HEAD_OBJECT,
0,
AWS_S3_REQUEST_FLAG_RECORD_RESPONSE_HEADERS | AWS_S3_REQUEST_FLAG_PART_SIZE_RESPONSE_BODY);
AWS_S3_REQUEST_FLAG_RECORD_RESPONSE_HEADERS);

request->discovers_object_size = true;

auto_ranged_get->synced_data.head_object_sent = true;
}
} else if (auto_ranged_get->synced_data.num_parts_requested == 0) {

struct aws_s3_buffer_pool_ticket *ticket =
aws_s3_buffer_pool_reserve(meta_request->client->buffer_pool, meta_request->part_size);

if (ticket == NULL) {
goto has_work_remaining;
}

/* If we aren't using a head object, then discover the size of the object while trying to get the
* first part. */
request = aws_s3_request_new(
@@ -192,6 +200,7 @@
1,
AWS_S3_REQUEST_FLAG_RECORD_RESPONSE_HEADERS | AWS_S3_REQUEST_FLAG_PART_SIZE_RESPONSE_BODY);

request->ticket = ticket;
request->part_range_start = 0;
request->part_range_end = meta_request->part_size - 1; /* range-end is inclusive */
request->discovers_object_size = true;
@@ -253,12 +262,21 @@ static bool s_s3_auto_ranged_get_update(
auto_ranged_get->synced_data.read_window_warning_issued = 0;
}

struct aws_s3_buffer_pool_ticket *ticket =
aws_s3_buffer_pool_reserve(meta_request->client->buffer_pool, meta_request->part_size);

if (ticket == NULL) {
goto has_work_remaining;
}

request = aws_s3_request_new(
meta_request,
AWS_S3_AUTO_RANGE_GET_REQUEST_TYPE_PART,
auto_ranged_get->synced_data.num_parts_requested + 1,
AWS_S3_REQUEST_FLAG_PART_SIZE_RESPONSE_BODY);

request->ticket = ticket;

aws_s3_get_part_range(
auto_ranged_get->synced_data.object_range_start,
auto_ranged_get->synced_data.object_range_end,
@@ -412,10 +430,11 @@ static struct aws_future_void *s_s3_auto_ranged_get_prepare_request(struct aws_s
/* Success! */
AWS_LOGF_DEBUG(
AWS_LS_S3_META_REQUEST,
"id=%p: Created request %p for part %d",
"id=%p: Created request %p for part %d part sized %d",
(void *)meta_request,
(void *)request,
request->part_number);
request->part_number,
request->has_part_size_response_body);

success = true;
