How can I use this code to profile my GC? #6

Open
guimaluf opened this issue Apr 27, 2019 · 1 comment

Comments

@guimaluf

Hi all,

I read the article 'An Experimental Evaluation of GC on Big Data Applications' and I'd like to reproduce part of it in my setup.
It isn't clear to me how to use the SparkProfile.jar package: how it collects GC stats, where it prints its output, etc.

Thank you for the research. I'd appreciate any help.

@JerryLead
Owner

JerryLead commented Jun 6, 2019

@guimaluf

Hi guimaluf, thanks for your interest in our work.

I'm sorry that this profiler is a little complex to use: I developed a number of parsers and analyzers to obtain statistics from task logs, GC logs, CPU logs, etc. Some of them produce the statistical results presented in our paper, while others are obsolete. The profiler is used as follows.

After running a Spark application, e.g., app-20170623113634-0010, we first run SparkAppJsonSaver.java to save the application's performance metrics (e.g., application execution time, stage metrics, task metrics in each stage, executor metrics, etc.) via the REST APIs (see http://spark.apache.org/docs/latest/monitoring.html) to a directory (e.g., APPdir). SparkAppJsonSaver.java also fetches the GC log of each executor into a file (e.g., APPdir/executors/executor-id/stdout), provided that the executors are configured to output GC logs via spark.executor.extraJavaOptions="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime". Note that each executor is a JVM, and its GC activities are written to the executor's stdout log.
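To make the first step concrete, here is a minimal Python sketch of what "saving metrics via the REST APIs" can look like. It is not the repo's SparkAppJsonSaver.java. The function names and the output file layout (jobs.json, stages.json, executors.json) are assumptions made for illustration. The endpoint paths follow the Spark monitoring documentation linked above, and the option string is copied from the comment.

```python
import json
from pathlib import Path
from urllib.request import urlopen

# GC logging options quoted in the comment above.
GC_LOG_OPTS = (
    "-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails "
    "-XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps "
    "-XX:+PrintGCApplicationStoppedTime"
)


def gc_conf_arg():
    """Return spark-submit arguments that enable executor GC logging."""
    return ["--conf", f"spark.executor.extraJavaOptions={GC_LOG_OPTS}"]


def http_get_json(url):
    """Fetch a URL and decode the response body as JSON."""
    with urlopen(url) as resp:
        return json.load(resp)


def save_app_metrics(base_url, app_id, out_dir, fetch=http_get_json):
    """Save job, stage and executor metrics of one application as JSON files.

    The file names are hypothetical; the real tool may lay out APPdir differently.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in ("jobs", "stages", "executors"):
        url = f"{base_url}/api/v1/applications/{app_id}/{name}"
        path = out / f"{name}.json"
        path.write_text(json.dumps(fetch(url), indent=2))
        saved.append(path)
    return saved
```

With a history server reachable at a (hypothetical) address such as http://localhost:18080, calling `save_app_metrics("http://localhost:18080", "app-20170623113634-0010", "APPdir")` would write three JSON files into APPdir. The `fetch` parameter exists only so the function can be exercised without a running cluster.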

After that, we use SparkAppProfiler.java to analyze the saved data and output interesting statistics. In particular, for GC analysis, we use the GC log parsers in src/main/generalGC to parse the GC log of each executor into formatted statistics, such as:

[Young](YGC) time = 2.083, beforeGC = 126.4658203125, afterGC = 14.2392578125, allocated = 151.25, gcPause = 0.0663858s, gcCause = GC (Allocation Failure) 
[Young](YGC) time = 2.877, beforeGC = 141.3876953125, afterGC = 9.5703125, allocated = 151.25, gcPause = 0.1134074s, gcCause = GC (Allocation Failure) 
[Young](FGC) time = 3.01, beforeGC = 26.33984375, afterGC = 26.33984375, allocated = 151.25, gcPause = 0.0014977s, gcCause = GC (CMS Initial Mark) 
[Young](YGC) time = 4.527, beforeGC = 144.0703125, afterGC = 10.8642578125, allocated = 151.25, gcPause = 0.1209985s, gcCause = GC (Allocation Failure)

These formatted statistics record the GC pause time and the related memory usage for each young/old GC pause.
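If you want to post-process these formatted lines yourself, a small parser is enough. The sketch below is an assumption based only on the four sample lines shown above, not the repo's own code: the function name `parse_gc_record` and the dictionary keys `generation` and `kind` are hypothetical, and the memory fields are kept as plain numbers because their unit is not stated here.

```python
import re

# One "key = value" pair; the value runs up to the next comma.
_FIELD = re.compile(r"(\w+) = ([^,]+)")
# Leading "[Young](YGC)" style tag.
_HEADER = re.compile(r"^\[(\w+)\]\((\w+)\)")


def parse_gc_record(line):
    """Parse one formatted GC statistics line into a dict, or return None."""
    line = line.strip()
    header = _HEADER.match(line)
    if header is None:
        return None
    fields = {key: value.strip() for key, value in _FIELD.findall(line)}
    return {
        "generation": header.group(1),
        "kind": header.group(2),
        "time": float(fields["time"]),
        "beforeGC": float(fields["beforeGC"]),
        "afterGC": float(fields["afterGC"]),
        "allocated": float(fields["allocated"]),
        # gcPause is printed with a trailing "s" (seconds).
        "gcPause": float(fields["gcPause"].rstrip("s")),
        "gcCause": fields["gcCause"],
    }
```

Applying `parse_gc_record` to every line of an executor's formatted output gives a list of records from which totals such as the accumulated pause time can be computed with `sum(r["gcPause"] for r in records)`.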

Finally, we can use the Python code in src/python to plot GC curves like those in Figure 7 of our paper.
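For readers who only want a quick look at the data, the following is a rough sketch of one way to draw a memory-usage curve from (time, beforeGC, afterGC) values. It is not the script in src/python and may not match the exact layout of Figure 7. The function names are hypothetical, matplotlib is assumed to be installed, and the axis labels are left generic because the units are not stated in this thread.

```python
def usage_curve(records):
    """Turn (time, beforeGC, afterGC) tuples into sawtooth x/y points.

    Each GC event contributes two points at the same timestamp:
    usage right before the collection and usage right after it.
    """
    xs, ys = [], []
    for time, before, after in records:
        xs += [time, time]
        ys += [before, after]
    return xs, ys


def plot_usage(records, out_png):
    """Plot the sawtooth curve to a PNG file (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    xs, ys = usage_curve(records)
    fig, ax = plt.subplots()
    ax.plot(xs, ys)
    ax.set_xlabel("time")
    ax.set_ylabel("usage")
    fig.savefig(out_png)
    plt.close(fig)
```

For example, `plot_usage([(2.083, 126.47, 14.24), (2.877, 141.39, 9.57)], "gc.png")` draws the first two young-generation events from the sample output, with the values rounded for brevity.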

In general, this profiler covers almost all the fine-grained metrics of a Spark application, including the metrics of the application, its stages, tasks, executors, etc. If you focus on analyzing the GC logs of executors, please refer to the parsers in src/main/generalGC. If you only want to inspect the GC metrics of a few executors, you can also use https://gceasy.io/, a GUI-based general-purpose GC analyzer.
