GH-35260: [C++][Python][R] Allow users to adjust S3 log level by environment variable (#38267)

### Rationale for this change

It's useful when troubleshooting issues with Arrow's S3 filesystem implementation to raise the log level. Currently, this can be done in C++ and Python but not in R. In addition, the log level can only be set as part of S3 initialization, so users have to add explicit S3 initialization code to turn on logging and must make sure that code runs before S3 is otherwise initialized.
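
The ordering requirement follows from S3 being initialized once per process. Below is a minimal sketch of that read-once behavior, using only the standard library. `CapturedLogLevel` is a hypothetical helper, not an Arrow function, and Arrow's real initialization does considerably more than this.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not Arrow API): captures the environment variable
// the first time it is called and returns that same value afterwards.
const std::string& CapturedLogLevel() {
  static const std::string level = [] {
    const char* raw = std::getenv("ARROW_S3_LOG_LEVEL");
    // Fall back to the documented default when the variable is unset.
    return std::string(raw != nullptr ? raw : "FATAL");
  }();
  return level;
}
```

Once the value has been captured, later changes to the variable have no effect in this sketch, which is why a setting of this kind has to be in place before the first S3 call.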

While discussing how to expose control of the log level to R, we realized that controlling it through an environment variable may be more intuitive and useful, and would be a good addition for C++, Python, and R alike.

### What changes are included in this PR?

- A new environment variable `ARROW_S3_LOG_LEVEL` with documentation for controlling S3 log level
- Updated documentation for C++, Python, and R
- A new `InitializeS3()` as a quality-of-life thing for C++ users. Feel free to ask me to remove this.

No changes are needed directly for Python and R because these implementations use the internal implicit initializer `EnsureS3Initialized` rather than the explicit form, `InitializeS3`. And it's the behavior of the `EnsureS3Initialized` routine that's changed here.
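
The lookup that `EnsureS3Initialized` now relies on (read the variable, trim it, compare case-insensitively, fall back to `Fatal`) can be sketched with the standard library alone. This is an illustration of the technique, not the code in this commit: `LogLevel`, `Normalize`, `ParseLogLevel`, and `LogLevelFromEnv` are made-up names and are not part of Arrow's API.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for arrow::fs::S3LogLevel; not the Arrow type itself.
enum class LogLevel { Off, Fatal, Error, Warn, Info, Debug, Trace };

// Strip leading/trailing ASCII whitespace and lowercase what remains.
std::string Normalize(const std::string& raw) {
  const char* whitespace = " \t\r\n";
  const auto first = raw.find_first_not_of(whitespace);
  if (first == std::string::npos) return "";
  const auto last = raw.find_last_not_of(whitespace);
  std::string out = raw.substr(first, last - first + 1);
  std::transform(out.begin(), out.end(), out.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return out;
}

// Map a raw string to a level; unrecognized input yields the Fatal default.
LogLevel ParseLogLevel(const std::string& raw) {
  const std::string value = Normalize(raw);
  if (value == "error") return LogLevel::Error;
  if (value == "warn") return LogLevel::Warn;
  if (value == "info") return LogLevel::Info;
  if (value == "debug") return LogLevel::Debug;
  if (value == "trace") return LogLevel::Trace;
  if (value == "off") return LogLevel::Off;
  return LogLevel::Fatal;  // covers "fatal" and anything invalid
}

// Read the variable; an unset variable also yields the default.
LogLevel LogLevelFromEnv(const char* name = "ARROW_S3_LOG_LEVEL") {
  const char* raw = std::getenv(name);
  return raw == nullptr ? LogLevel::Fatal : ParseLogLevel(raw);
}
```

Falling back silently on invalid input keeps a typo in the variable from breaking S3 access, at the cost of not telling the user that the value was ignored.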

### Are these changes tested?

Yes. I added a unit test for the new `S3GlobalOptions::Defaults()` and tested from Python and R manually. I didn't add a test to make sure the underlying `AwsInstance` gets set up correctly because it looked like it would require a refactor and didn't seem worth it.
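
A unit test of this kind has to set the environment variable for one block of code and restore it afterwards, so that cases don't leak into each other. Below is a minimal sketch of that scoping pattern, assuming a POSIX platform (`setenv`/`unsetenv`). `ScopedEnvVar` is a hypothetical class written for illustration and is not the `EnvVarGuard` helper used in Arrow's tests.

```cpp
#include <cstdlib>
#include <stdlib.h>  // setenv, unsetenv (POSIX; Windows would need _putenv_s)
#include <string>
#include <utility>

// Hypothetical RAII helper: sets a variable on construction and restores the
// previous state (old value, or unset) on destruction.
class ScopedEnvVar {
 public:
  ScopedEnvVar(std::string name, const std::string& value)
      : name_(std::move(name)), had_previous_(false) {
    if (const char* old = std::getenv(name_.c_str())) {
      had_previous_ = true;
      previous_ = old;  // copy before setenv can invalidate the pointer
    }
    setenv(name_.c_str(), value.c_str(), /*overwrite=*/1);
  }

  ~ScopedEnvVar() {
    if (had_previous_) {
      setenv(name_.c_str(), previous_.c_str(), /*overwrite=*/1);
    } else {
      unsetenv(name_.c_str());
    }
  }

  // Copying would restore the variable twice, so forbid it.
  ScopedEnvVar(const ScopedEnvVar&) = delete;
  ScopedEnvVar& operator=(const ScopedEnvVar&) = delete;

 private:
  std::string name_;
  bool had_previous_;
  std::string previous_;
};
```

Wrapping each case in its own `{ ... }` block, as the test in this commit does, bounds the lifetime of the guard and therefore of the override.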

### Are there any user-facing changes?

Yes. A new way to turn on logging for S3 and matching docs in C++, Python, and R.

* Closes: #35260

Lead-authored-by: Bryce Mecum <petridish@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
3 people authored Oct 17, 2023
1 parent 4011058 commit 1e9f224
Showing 9 changed files with 139 additions and 2 deletions.
32 changes: 31 additions & 1 deletion cpp/src/arrow/filesystem/s3fs.cc
@@ -2987,7 +2987,7 @@ Status InitializeS3(const S3GlobalOptions& options) {
 }

 Status EnsureS3Initialized() {
-  return EnsureAwsInstanceInitialized({S3LogLevel::Fatal}).status();
+  return EnsureAwsInstanceInitialized(S3GlobalOptions::Defaults()).status();
 }

 Status FinalizeS3() {
@@ -3001,6 +3001,36 @@ bool IsS3Initialized() { return GetAwsInstance()->IsInitialized(); }

 bool IsS3Finalized() { return GetAwsInstance()->IsFinalized(); }

+S3GlobalOptions S3GlobalOptions::Defaults() {
+  auto log_level = S3LogLevel::Fatal;
+
+  auto result = arrow::internal::GetEnvVar("ARROW_S3_LOG_LEVEL");
+
+  if (result.ok()) {
+    // Extract, trim, and downcase the value of the environment variable
+    auto value =
+        arrow::internal::AsciiToLower(arrow::internal::TrimString(result.ValueUnsafe()));
+
+    if (value == "fatal") {
+      log_level = S3LogLevel::Fatal;
+    } else if (value == "error") {
+      log_level = S3LogLevel::Error;
+    } else if (value == "warn") {
+      log_level = S3LogLevel::Warn;
+    } else if (value == "info") {
+      log_level = S3LogLevel::Info;
+    } else if (value == "debug") {
+      log_level = S3LogLevel::Debug;
+    } else if (value == "trace") {
+      log_level = S3LogLevel::Trace;
+    } else if (value == "off") {
+      log_level = S3LogLevel::Off;
+    }
+  }
+
+  return S3GlobalOptions{log_level};
+}
+
 // -----------------------------------------------------------------------
 // Top-level utility functions

8 changes: 7 additions & 1 deletion cpp/src/arrow/filesystem/s3fs.h
@@ -332,9 +332,15 @@ struct ARROW_EXPORT S3GlobalOptions {
   ///
   /// For more details see Aws::Crt::Io::EventLoopGroup
   int num_event_loop_threads = 1;
+
+  /// \brief Initialize with default options
+  ///
+  /// For log_level, this method first tries to extract a suitable value from the
+  /// environment variable ARROW_S3_LOG_LEVEL.
+  static S3GlobalOptions Defaults();
 };

-/// \brief Initialize the S3 APIs.
+/// \brief Initialize the S3 APIs with the specified set of options.
 ///
 /// It is required to call this function at least once before using S3FileSystem.
 ///
26 changes: 26 additions & 0 deletions cpp/src/arrow/filesystem/s3fs_test.cc
@@ -1380,5 +1380,31 @@ class TestS3FSGeneric : public S3TestMixin, public GenericFileSystemTest {

 GENERIC_FS_TEST_FUNCTIONS(TestS3FSGeneric);

+////////////////////////////////////////////////////////////////////////////
+// S3GlobalOptions::Defaults tests
+
+TEST(S3GlobalOptions, DefaultsLogLevel) {
+  // Verify we get the default value of Fatal
+  ASSERT_EQ(S3LogLevel::Fatal, arrow::fs::S3GlobalOptions::Defaults().log_level);
+
+  // Verify we get the value specified by env var and not the default
+  {
+    EnvVarGuard log_level_guard("ARROW_S3_LOG_LEVEL", "ERROR");
+    ASSERT_EQ(S3LogLevel::Error, arrow::fs::S3GlobalOptions::Defaults().log_level);
+  }
+
+  // Verify we trim and case-insensitively compare the environment variable's value
+  {
+    EnvVarGuard log_level_guard("ARROW_S3_LOG_LEVEL", " eRrOr ");
+    ASSERT_EQ(S3LogLevel::Error, arrow::fs::S3GlobalOptions::Defaults().log_level);
+  }
+
+  // Verify we get the default value of Fatal if our env var is invalid
+  {
+    EnvVarGuard log_level_guard("ARROW_S3_LOG_LEVEL", "invalid");
+    ASSERT_EQ(S3LogLevel::Fatal, arrow::fs::S3GlobalOptions::Defaults().log_level);
+  }
+}
+
 } // namespace fs
 } // namespace arrow
2 changes: 2 additions & 0 deletions docs/source/cpp/api/filesystem.rst
@@ -66,6 +66,8 @@ S3 filesystem
 .. doxygenclass:: arrow::fs::S3FileSystem
    :members:

+.. doxygenfunction:: arrow::fs::InitializeS3(const S3GlobalOptions& options)
+
 Hadoop filesystem
 -----------------

22 changes: 22 additions & 0 deletions docs/source/cpp/env_vars.rst
@@ -85,6 +85,28 @@ that changing their value later will have an effect.
    ``libhdfs.dylib`` on macOS, ``libhdfs.so`` on other platforms).
    Alternatively, one can set :envvar:`HADOOP_HOME`.

+.. envvar:: ARROW_S3_LOG_LEVEL
+
+   Controls the verbosity of logging produced by S3 calls. Defaults to ``FATAL``
+   which only produces output in the case of fatal errors. ``DEBUG`` is recommended
+   when you're trying to troubleshoot issues.
+
+   Possible values include:
+
+   - ``FATAL`` (the default)
+   - ``ERROR``
+   - ``WARN``
+   - ``INFO``
+   - ``DEBUG``
+   - ``TRACE``
+   - ``OFF``
+
+   .. seealso::
+
+      `Logging - AWS SDK For C++
+      <https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/logging.html>`__
+
+
 .. envvar:: ARROW_TRACING_BACKEND

    The backend where to export `OpenTelemetry <https://opentelemetry.io/>`_-based
8 changes: 8 additions & 0 deletions docs/source/python/filesystems.rst
@@ -207,6 +207,14 @@ Here are a couple examples in code::

    :func:`pyarrow.fs.resolve_s3_region` for resolving region from a bucket name.

+Troubleshooting
+~~~~~~~~~~~~~~~
+
+When using :class:`S3FileSystem`, output is only produced for fatal errors or
+when printing return values. For troubleshooting, the log level can be set using
+the environment variable ``ARROW_S3_LOG_LEVEL``. The log level must be set prior
+to running any code that interacts with S3. Possible values include ``FATAL`` (the
+default), ``ERROR``, ``WARN``, ``INFO``, ``DEBUG`` (recommended), ``TRACE``, and ``OFF``.
+
 .. _filesystem-gcs:

22 changes: 22 additions & 0 deletions r/R/filesystem.R
@@ -239,6 +239,14 @@ FileSelector$create <- function(base_dir, allow_not_found = FALSE, recursive = F
 #' and no resource tags. To have more control over how buckets are created,
 #' use a different API to create them.
 #'
+#' On S3FileSystem, output is only produced for fatal errors or when printing
+#' return values. For troubleshooting, the log level can be set using the
+#' environment variable `ARROW_S3_LOG_LEVEL` (e.g.,
+#' `Sys.setenv("ARROW_S3_LOG_LEVEL"="DEBUG")`). The log level must be set prior
+#' to running any code that interacts with S3. Possible values include 'FATAL'
+#' (the default), 'ERROR', 'WARN', 'INFO', 'DEBUG' (recommended), 'TRACE', and
+#' 'OFF'.
+#'
 #' @usage NULL
 #' @format NULL
 #' @docType class
@@ -462,11 +470,25 @@ default_s3_options <- list(
 #'
 #' @param bucket string S3 bucket name or path
 #' @param ... Additional connection options, passed to `S3FileSystem$create()`
+#'
+#' @details By default, \code{\link{s3_bucket}} and other
+#' \code{\link{S3FileSystem}} functions only produce output for fatal errors
+#' or when printing their return values. When troubleshooting problems, it may
+#' be useful to increase the log level. See the Notes section in
+#' \code{\link{S3FileSystem}} for more information or see Examples below.
+#'
 #' @return A `SubTreeFileSystem` containing an `S3FileSystem` and the bucket's
 #' relative path. Note that this function's success does not guarantee that you
 #' are authorized to access the bucket's contents.
 #' @examplesIf FALSE
 #' bucket <- s3_bucket("voltrondata-labs-datasets")
+#'
+#' @examplesIf FALSE
+#' # Turn on debug logging. The following line of code should be run in a fresh
+#' # R session prior to any calls to `s3_bucket()` (or other S3 functions)
+#' Sys.setenv("ARROW_S3_LOG_LEVEL" = "DEBUG")
+#' bucket <- s3_bucket("voltrondata-labs-datasets")
+#'
 #' @export
 s3_bucket <- function(bucket, ...) {
   assert_that(is.string(bucket))
8 changes: 8 additions & 0 deletions r/man/FileSystem.Rd

13 changes: 13 additions & 0 deletions r/man/s3_bucket.Rd
