[DOP-18571] Collect and log Spark metrics in various method calls #303

Merged
dolfinus merged 1 commit into develop from feature/DOP-18571
Aug 9, 2024

Conversation

dolfinus
Member

@dolfinus dolfinus commented Aug 8, 2024

Change Summary

Use SparkMetricsRecorded to collect Spark metrics from the following methods:

  • DBWriter.run()
  • FileDFWriter.run()
  • JDBC.fetch(), JDBC.execute()
  • Hive.execute()

These metrics are then logged. Only the JDBC metrics are logged at INFO level, since they are reliable; all others are logged at DEBUG level only.
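The level selection described above could look roughly like the sketch below. It is not the code from this PR: `metrics_log_level`, `log_metrics` and the `reliable` flag are hypothetical names, and the only thing taken from the description is that reliable (JDBC) metrics go to INFO while everything else goes to DEBUG with a "may be incomplete" note.

```python
import logging
from typing import Sequence

log = logging.getLogger("metrics_sketch")  # hypothetical logger name


def metrics_log_level(reliable: bool) -> int:
    """Pick INFO for metrics considered reliable, DEBUG otherwise."""
    return logging.INFO if reliable else logging.DEBUG


def log_metrics(source: str, lines: Sequence[str], reliable: bool) -> None:
    """Log a header plus pre-formatted metric lines at the chosen level."""
    level = metrics_log_level(reliable)
    # Unreliable metrics get an explicit warning in the header
    suffix = "" if reliable else " (may be incomplete!)"
    log.log(level, "|%s| Recorded metrics%s:", source, suffix)
    for line in lines:
        log.log(level, "%s", line)
```

With a setup like this, a caller would pass `reliable=True` only for the JDBC paths, so the other writers stay silent unless DEBUG logging is enabled.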

I've also added a function that estimates the size of an in-memory dataframe using Spark's SizeEstimator. This metric is approximate, but it is always present.
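For readers unfamiliar with SizeEstimator, a minimal sketch of calling it from PySpark follows. The function name `estimate_in_memory_size` is hypothetical, and passing the dataframe's underlying JVM object (`df._jdf`) is an assumption; the description does not show what the PR actually hands to the estimator. `_jvm` and `_jdf` are private PySpark attributes, and `org.apache.spark.util.SizeEstimator.estimate` is the Spark utility being referred to.

```python
def estimate_in_memory_size(spark, df) -> int:
    """Return an approximate size, in bytes, of the JVM object behind `df`.

    Assumption: `spark` is a SparkSession and `df` a PySpark DataFrame.
    The result is an estimate of the object graph on the driver,
    not an exact measurement.
    """
    # Reach the JVM-side utility through the py4j gateway
    size_estimator = spark._jvm.org.apache.spark.util.SizeEstimator
    return int(size_estimator.estimate(df._jdf))
```

Because the estimate is computed on the driver from an object that already exists, it does not depend on executor task metrics being reported, which fits the "always present" remark above.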

A few examples:

DBWriter.run with Hive:

DEBUG |DBWriter| Recorded metrics (may be incomplete!):
DEBUG         Output:
DEBUG             Written rows: 100
DEBUG             Written size: 2.5 kB
DEBUG             Created files: 1
DEBUG         Executor:
DEBUG             Total run time: 0.13 seconds
DEBUG             Total CPU time: 0.04 seconds

DBWriter.run with Postgres:

DEBUG |DBWriter| Recorded metrics (may be incomplete!):
DEBUG         Output:
DEBUG             Written rows: 100
DEBUG         Executor:
DEBUG             Total run time: 0.13 seconds
DEBUG             Total CPU time: 0.05 seconds

Postgres.fetch:

INFO |Postgres| Recorded metrics:
INFO         Input:
INFO             Read rows: 100
INFO         Driver:
INFO             In-memory data (approximate): 44.2 MB
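The layout of the log examples above (a group name, then indented "name: value" lines, with sizes shown in human-readable form) can be reproduced with a small sketch like the one below. `format_size` and `render_metrics` are hypothetical helpers, not functions from this PR, and the use of decimal (1000-based) units is an assumption based only on the "kB" and "MB" suffixes in the examples.

```python
def format_size(num_bytes: float) -> str:
    """Format a byte count using decimal units (assumed: 1 kB = 1000 bytes)."""
    units = ["bytes", "kB", "MB", "GB", "TB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            if unit == "bytes":
                return f"{int(value)} bytes"
            return f"{value:.1f} {unit}"
        value /= 1000
    return f"{value:.1f} {units[-1]}"  # not reached; keeps the return type explicit


def render_metrics(metrics: dict, indent: int = 8) -> list:
    """Turn {group: {name: value}} into indented lines, one per metric."""
    lines = []
    for group, values in metrics.items():
        lines.append(" " * indent + f"{group}:")
        for name, value in values.items():
            lines.append(" " * (indent + 4) + f"{name}: {value}")
    return lines
```

For example, `render_metrics({"Driver": {"In-memory data (approximate)": format_size(44_200_000)}})` produces the two lines of the "Driver" group shown in the Postgres.fetch example, without the level prefix.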

Related issue number

Checklist

  • Commit message and PR title are comprehensive
  • Keep the change as small as possible
  • Unit and integration tests for the changes exist
  • Tests pass on CI and coverage does not decrease
  • Documentation reflects the changes where applicable
  • docs/changelog/next_release/<pull request or issue id>.<change type>.rst file added describing change
    (see CONTRIBUTING.rst for details.)
  • My PR is ready to review.

codecov bot commented Aug 8, 2024

Codecov Report

Attention: Patch coverage is 96.11650% with 4 lines in your changes missing coverage. Please review.

Project coverage is 95.36%. Comparing base (abff632) to head (489d72d).

Files                                          Patch %   Lines
onetl/db/db_writer/db_writer.py                90.00%    1 Missing and 1 partial ⚠️
onetl/file/file_df_writer/file_df_writer.py    88.23%    1 Missing and 1 partial ⚠️

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #303      +/-   ##
===========================================
+ Coverage    94.96%   95.36%   +0.39%     
===========================================
  Files          225      225              
  Lines         8823     8860      +37     
  Branches      1507     1499       -8     
===========================================
+ Hits          8379     8449      +70     
+ Misses         312      292      -20     
+ Partials       132      119      -13     


@dolfinus dolfinus marked this pull request as ready for review August 9, 2024 11:37
@dolfinus dolfinus merged commit 3c25405 into develop Aug 9, 2024
45 checks passed
@dolfinus dolfinus deleted the feature/DOP-18571 branch August 9, 2024 14:26