
[SPARK-33841][CORE][3.1] Fix issue with jobs disappearing intermittently from the SHS under high load #30847

Closed

Conversation

vladhlinsky
Contributor

What changes were proposed in this pull request?

Mark SHS event log entries that were `processing` at the beginning of the `checkForLogs` run as not stale, and check for this mark before deleting an event log. This fixes the issue where a particular job was displayed in the SHS, disappeared after some time, and then showed up again several minutes later.
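As a rough illustration of the idea, here is a minimal, self-contained sketch (this is not the actual `FsHistoryProvider` code; the `LogInfo` model, paths, and timestamps are made up for the example). Entries that were `processing` when the scan started are remembered in a `notStale` set and excluded from the stale check:

```scala
import scala.collection.mutable

// Hypothetical, simplified model of a listing entry; the real implementation
// in FsHistoryProvider uses richer types.
case class LogInfo(logPath: String, lastProcessed: Long)

object NotStaleMarkSketch {
  def main(args: Array[String]): Unit = {
    val newLastScanTime = 100L

    // app-2 was still being processed when checkForLogs started, so its
    // lastProcessed time was not refreshed in this pass.
    val stillProcessing = Set("s3a://bucket/eventlogs/app-2")
    val listing = Seq(
      LogInfo("s3a://bucket/eventlogs/app-1", lastProcessed = 100L),
      LogInfo("s3a://bucket/eventlogs/app-2", lastProcessed = 90L)
    )

    // The idea of the fix: remember which entries were processing at the start
    // of the run and never treat them as stale within the same run.
    val notStale = mutable.HashSet[String]() ++ stillProcessing

    val stale = listing
      .filter(_.lastProcessed < newLastScanTime)
      .filterNot(info => notStale.contains(info.logPath))

    // app-2 now survives the stale check and stays visible in the UI.
    println(s"Deleted from the listing: ${stale.map(_.logPath)}")
  }
}
```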

Why are the changes needed?

The issue is caused by [SPARK-29043](https://issues.apache.org/jira/browse/SPARK-29043), which was introduced to improve the concurrent performance of the History Server. That [change](https://github.com/apache/spark/pull/25797/files#) breaks the ["app deletion" logic](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R563) because `processing` event log entries are not properly synchronized. Since the SHS now [filters out](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462) all `processing` event log entries, such entries never get [updated with the new `lastProcessed`](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R472) time. As a result, any entry that completes processing right after the [filtering](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462) and before [the check for stale entries](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R560) is identified as stale and is deleted from the UI until the next `checkForLogs` run. This is because the [updated `lastProcessed` time is used as the staleness criterion](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R557), and event log entries that were not updated with the new time match that criterion.
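For illustration, a minimal, self-contained sketch of the pre-fix stale check (again, not the actual code; names, paths, and timestamps are made up). An entry that was filtered out as `processing` keeps its old `lastProcessed` value and therefore matches the stale criterion:

```scala
// Hypothetical, simplified model of a listing entry.
case class LogInfo(logPath: String, lastProcessed: Long)

object StaleRaceSketch {
  def main(args: Array[String]): Unit = {
    val newLastScanTime = 100L

    // app-1 was updated in this pass; app-2 was skipped because it was still
    // `processing`, so its lastProcessed time keeps the previous scan's value.
    val listing = Seq(
      LogInfo("s3a://bucket/eventlogs/app-1", lastProcessed = 100L),
      LogInfo("s3a://bucket/eventlogs/app-2", lastProcessed = 90L)
    )

    // Pre-fix criterion: anything whose lastProcessed time was not refreshed
    // in this scan is considered stale.
    val stale = listing.filter(_.lastProcessed < newLastScanTime)

    // app-2 is removed from the UI even though it is a valid application; it
    // only reappears after the next checkForLogs run re-parses it.
    println(s"Deleted from the listing: ${stale.map(_.logPath)}")
  }
}
```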

The issue can be reproduced by generating a large number of event logs and uploading them to the SHS event log directory on S3. In this case, around 439 (49.6 MB) copies of an event log file were created using the [shs-monitor](https://github.com/vladhlinsky/shs-monitor/tree/branch-3.1) script. The SHS then showed strange behavior when counting the total number of applications: at first the number increased as expected, but on a later page refresh the total decreased. No errors were logged by the SHS.

252 entities are displayed at `21:20:23`:
![1-252-entries-at-21-20](https://user-images.githubusercontent.com/61428392/102653857-40901f00-4178-11eb-9d61-6a20e359abb2.png)
178 entities are displayed at `21:22:15`:
![2-178-at-21-22](https://user-images.githubusercontent.com/61428392/102653900-530a5880-4178-11eb-94fb-3f28b082b25a.png)

Does this PR introduce any user-facing change?

Yes. SHS users will no longer see the number of displayed applications decrease periodically.

How was this patch tested?

Tested using the [shs-monitor](https://github.com/vladhlinsky/shs-monitor/tree/branch-3.1) script:

  • Build SHS with the proposed change
  • Download Hadoop AWS and the AWS Java SDK
  • Prepare an S3 bucket and a user for programmatic access, grant the required roles to the user, and get the access key and secret key
  • Configure SHS to read event logs from S3 (see the configuration sketch below)
  • Start the [monitor](https://github.com/vladhlinsky/shs-monitor/blob/branch-3.1/monitor.sh) script to query the SHS API
  • Run the [producers](https://github.com/vladhlinsky/shs-monitor/blob/branch-3.1/producer.sh)
  • Wait for the SHS to load all the applications
  • Verify that the number of loaded applications increases continuously over time

For more details, please refer to the [shs-monitor](https://github.com/vladhlinsky/shs-monitor/tree/branch-3.1) repository.
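For reference, the S3 configuration step might look roughly like the following sketch. This is a hedged example only: the bucket name and key placeholders are made up, and the hadoop-aws and aws-java-sdk jars from the earlier step must be on the history server's classpath.

```properties
# spark-defaults.conf (illustrative values only)
spark.history.fs.logDirectory    s3a://my-shs-bucket/eventlogs
spark.hadoop.fs.s3a.access.key   <ACCESS_KEY>
spark.hadoop.fs.s3a.secret.key   <SECRET_KEY>
```

The history server can then be started with `sbin/start-history-server.sh` and the monitor script pointed at its REST API.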

@vladhlinsky
Contributor Author

cc @HeartSaVioR

@github-actions bot added the CORE label on Dec 18, 2020
@dongjoon-hyun changed the title from [SPARK-33841][CORE] Fix issue with jobs disappearing intermittently from the SHS under high load to [SPARK-33841][CORE][3.1] Fix issue with jobs disappearing intermittently from the SHS under high load on Dec 18, 2020
@HeartSaVioR
Contributor

retest this, please

@HeartSaVioR
Contributor

ok to test

Contributor

@HeartSaVioR left a comment


+1, and I'd consider this PR approved by @tgravescs as well, since only the target branch is different.

@SparkQA

SparkQA commented Dec 18, 2020

Test build #133037 has finished for PR 30847 at commit 1cb2e1f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37636/

@dongjoon-hyun
Member

Retest this please

@SparkQA

SparkQA commented Dec 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37636/

@SparkQA

SparkQA commented Dec 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37638/

@SparkQA

SparkQA commented Dec 18, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37638/

Member

@dongjoon-hyun left a comment


+1, LGTM. Thanks, @vladhlinsky.
Merged to branch-3.1 for Apache Spark 3.1.0.

dongjoon-hyun pushed a commit that referenced this pull request Dec 18, 2020
[SPARK-33841][CORE][3.1] Fix issue with jobs disappearing intermittently from the SHS under high load

Closes #30847 from vladhlinsky/SPARK-33841-branch-3.1.

Authored-by: Vlad Glinsky <vladhlinsky@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@SparkQA

SparkQA commented Dec 19, 2020

Test build #133039 has finished for PR 30847 at commit 1cb2e1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
