-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Datapoints of a running analysis fail because of time/retry limit while worker attempts to upload datapoint.zip to server #366
Comments
I know a bunch of timeout / file size related issues have been addressed recently. Hopefully 2.6.0 will fix a bunch of those. There is also this commit from a while back that may help you: |
Hello @brianlball, Thank you for the quick response. I was aware of this, and I have already increased the nginx upload limit. This server already has this fix applied. Thank you. |
I imagine that one of the OpenStudio-server/server/app/workers/run_simulate_data_point.rb Lines 306 to 320 in d61dcbc
The other area smells like general performance issues. In version 2.5.1 and greater we moved away from delayed_jobs for background tasks and over to redis/resque for overall perfomance and better concurrency management. It would be nice to try version 2.6.0 and see if this problem still exists. Note that images for 2.6.0 are being built today. |
Similar issues were observed with 2.6.0 Sample Datapoint Simulation Log# Logfile created on 2018-08-17 14:07:41 +0000 by logger.rb/47272
[14:07:41.693608 INFO] Server host is http://web:80
[14:07:41.693784 INFO] Analysis directory is /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3
[14:07:41.694036 INFO] Simulation directory is /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad
[14:07:41.694096 INFO] Run datapoint type/file is workflow
[14:07:41.694151 INFO] Starting initialize_worker for datapoint 29ab789f-5958-49e1-9d2a-3b99af0335ad
[14:07:41.694247 INFO] receipt_file already exists, moving on
[14:08:03.898766 INFO] Downloading datapoint from http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad.json
[14:08:32.760602 INFO] Creating Workflow Manager instance
[14:08:32.760754 INFO] Directory is /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad
[14:08:32.760895 INFO] Opening run.log file '/mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/run/run.log'
[14:08:32.796586 INFO] Running workflow
[14:09:31.129801 INFO] Final run state is finished
[14:09:31.189768 INFO] Saving files/reports back to the server
[14:09:31.190108 INFO] Saving report /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/reports/openstudio_results_report.html to http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad/upload_file
[14:09:48.472696 INFO] Saving report responded with {"id":"5b76d729980f0b0d68ad2102","data_point_id":"29ab789f-5958-49e1-9d2a-3b99af0335ad","display_name":"openstudio_results_report","type":"Report","created_at":"2018-08-17T14:09:45.674Z","updated_at":"2018-08-17T14:09:45.674Z"}
[14:09:48.472989 INFO] Saving report /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/reports/btap_view_model_report.html to http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad/upload_file
[14:09:57.142299 INFO] Saving report responded with {"id":"5b76d734980f0b02320c874f","data_point_id":"29ab789f-5958-49e1-9d2a-3b99af0335ad","display_name":"btap_view_model_report","type":"Report","created_at":"2018-08-17T14:09:56.147Z","updated_at":"2018-08-17T14:09:56.147Z"}
[14:09:57.142641 INFO] Saving report /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/reports/eplustbl.html to http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad/upload_file
[14:10:07.078740 INFO] Saving report responded with {"id":"5b76d73a980f0b0de0ad20fb","data_point_id":"29ab789f-5958-49e1-9d2a-3b99af0335ad","display_name":"eplustbl","type":"Report","created_at":"2018-08-17T14:10:02.485Z","updated_at":"2018-08-17T14:10:02.485Z"}
[14:10:07.079081 INFO] Saving report /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/reports/btap_view_model_report.json to http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad/upload_file
[14:10:16.056748 INFO] Saving report responded with {"id":"5b76d747980f0b0e1bad2103","data_point_id":"29ab789f-5958-49e1-9d2a-3b99af0335ad","display_name":"btap_view_model_report","type":"Report","created_at":"2018-08-17T14:10:15.986Z","updated_at":"2018-08-17T14:10:15.986Z"}
[14:10:16.057133 INFO] Saving report /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/run/objectives.json to http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad/upload_file
[14:10:23.924702 INFO] Saving report responded with {"id":"5b76d74f980f0b026c0c872a","data_point_id":"29ab789f-5958-49e1-9d2a-3b99af0335ad","display_name":"objectives","type":"Report","created_at":"2018-08-17T14:10:23.891Z","updated_at":"2018-08-17T14:10:23.891Z"}
[14:10:23.925051 INFO] Saving report /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/out.osw to http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad/upload_file
[14:10:40.704653 INFO] Saving report responded with {"id":"5b76d75d980f0b02740c8723","data_point_id":"29ab789f-5958-49e1-9d2a-3b99af0335ad","display_name":"Final OSW File","type":"Report","created_at":"2018-08-17T14:10:37.392Z","updated_at":"2018-08-17T14:10:37.392Z"}
[14:10:40.705016 INFO] Saving report /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/in.osm to http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad/upload_file
[14:10:56.803330 INFO] Saving report responded with {"id":"5b76d76f980f0b0c2aad2108","data_point_id":"29ab789f-5958-49e1-9d2a-3b99af0335ad","display_name":"model","type":"OpenStudio Model","created_at":"2018-08-17T14:10:55.377Z","updated_at":"2018-08-17T14:10:55.377Z"}
[14:10:56.803659 INFO] Saving report /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/run/data_point.zip to http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad/upload_file
[14:16:05.921199 ERROR] Could not save report Zip File with message: Request Timeout in /usr/local/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:427:in `rescue in transmit'
/usr/local/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:350:in `transmit'
/usr/local/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:176:in `execute'
/usr/local/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:41:in `execute'
/usr/local/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0/lib/restclient.rb:69:in `post'
/opt/openstudio/server/app/workers/run_simulate_data_point.rb:462:in `block in upload_file'
/usr/local/lib/ruby/2.2.0/timeout.rb:88:in `block in timeout'
/usr/local/lib/ruby/2.2.0/timeout.rb:32:in `block in catch'
/usr/local/lib/ruby/2.2.0/timeout.rb:32:in `catch'
/usr/local/lib/ruby/2.2.0/timeout.rb:32:in `catch'
/usr/local/lib/ruby/2.2.0/timeout.rb:103:in `timeout'
/opt/openstudio/server/app/workers/run_simulate_data_point.rb:459:in `upload_file'
/opt/openstudio/server/app/workers/run_simulate_data_point.rb:230:in `perform'
/opt/openstudio/server/app/workers/run_simulate_data_point_resque.rb:42:in `perform'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/job.rb:168:in `perform'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/worker.rb:304:in `perform'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/worker.rb:894:in `block in perform_with_fork'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/worker.rb:892:in `fork'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/worker.rb:892:in `perform_with_fork'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/worker.rb:260:in `work_one_job'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/worker.rb:234:in `block in work'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/worker.rb:231:in `loop'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/worker.rb:231:in `work'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/tasks.rb:20:in `block (2 levels) in <top (required)>'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/task.rb:271:in `call'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/task.rb:271:in `block in execute'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/task.rb:271:in `each'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/task.rb:271:in `execute'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/task.rb:213:in `block in invoke_with_call_chain'
/usr/local/lib/ruby/2.2.0/monitor.rb:211:in `mon_synchronize'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/task.rb:193:in `invoke_with_call_chain'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/task.rb:182:in `invoke'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:160:in `invoke_task'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:116:in `block (2 levels) in top_level'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:116:in `each'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:116:in `block in top_level'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:125:in `run_with_threads'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:110:in `top_level'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:83:in `block in run'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:186:in `standard_exception_handling'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:80:in `run'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/exe/rake:27:in `<top (required)>'
/usr/local/bin/rake:22:in `load'
/usr/local/bin/rake:22:in `<main>'
[14:16:05.936854 INFO] Finished /opt/openstudio/server/app/workers/run_simulate_data_point.rb |
General Issue
Datapoints of a running analysis to fail because of time/retry limit while worker attempts to upload datapoint.zip to server.
c3.8xlarge Specs:
60 GB RAM
,36 cores
,640 GB
(2 * 320 GB SSD)Script used to start server: https://github.com/canmet-energy/start_aws_server
Server branch: https://github.com/NREL/OpenStudio-server/tree/nrcan_2.4.3_test
Server details
The server framework is 2.4.1 as a base, with OSApp bumped to 2.4.3
More details regarding the specs of ami can be found near the end
Description
Downloading a file approx 200 Mb (on a slow internet connection) from the OS-Server, while an analysis is running, causes the datapoints of another running analysis to fail.
On amazon cloud, the worker fails to upload the datapoint.zip file to the server. The
run_simulate_data_point.rb
file attempts to upload the datapoint.zip file up to 4 times within 120s. if the file has not been uploaded successfully within 120s, the status of the datapoint is set asdatapoint failure
No errors were found in the sdp log file
Figure 1: Shows that the worker tried multiple times to upload the file, and it sometimes passes and sometimes it fails.
Figure 2: Download was started after the 2nd analysis from bottom was finished. Then the 3rd analysis was started using PAT.
Note: Analysis 1&4 and 2&3 are identical (counted from bottom). The page refresh experiment was carried out during analysis 4
Reproduction steps:
As an experiment, I have also tried to refresh pages (periodically) with large text contents (e.g. blob data of osm files, etc) on a slow connection. This also causes the datapoints to fail.
Actual outcome:
Multiple datapoint failures are observed, because the worker fails to upload the datapoint.zip (size varying from 50 Mb to 110 Mb)
Note: the nginx upload file size limit was increased from 100 Mb to 1000Mb Commit Link
Expected outcome:
No datapoints should fail
Other information:
The datapoints do not fail if the download happens on a fast connection
Workaround
Download the results after all analysis are finished
Possible fix
Increase timeout or number of retries for
run_simulate_data_point.rb
Sample Simulate Datapoint Log
Error on Remote Server Details
AMI version being used
OpenStudio-Server-Docker-2.4.3-nrcan-dl-243_test-t4
ami-c3efdebc
-- private imageOSS version retrieval command is: ruby -r /home/ubuntu/OpenStudio-server/server/lib/openstudio_server/version.rb -e "puts OpenstudioServer::VERSION"
OSS version retrieved is 2.4.3
AMI version being overwritten from
2.4.3
to2.4.3
OSS version extension retrieval command is: ruby -r /home/ubuntu/OpenStudio-server/server/lib/openstudio_server/version.rb -e "puts OpenstudioServer::VERSION_EXT"
OSS version extension retrieved is
-nrcan
AMI version extension being overwritten from
-nrcan
to-nrcan-dl-243_test-t4
AMI author being overwritten from
GitLabCI
toCanmetENERGY
Replaced
nrel
withcanmet
in docker pull and docker tag commandsPacker command is: packer build -machine-readable -var-file=user_variables.json openstudio_server_docker_base.json 2>&1 | tee /home/ubuntu/build.log
STDOUT written to /home/ubuntu/build.log
OpenStudio Server SHA retrieval command is: git log -n 1 | grep commit
OpenStudio Server SHA retrieved is
d61dcbc8548d6366e176d64297db0912aa14a09a
openstudio-server container pull command is: docker pull canmet/openstudio-server:2.4.3-nrcan-dl-243_test-t4
openstudio-server OpenStudio version command is: docker run canmet/openstudio-server:2.4.3-nrcan-dl-243_test-t4 ruby -r openstudio -e "puts OpenStudio.openStudioLongVersion"
OpenStudio version retrieved is
2.4.3
, with SHA29a61f6637
openstudio-server OpenStudio-Standards version command is: docker run canmet/openstudio-server:2.4.3-nrcan-dl-243_test-t4 bundle exec ruby -e "require 'openstudio'; require 'openstudio-standards'; puts OpenstudioStandards::VERSION"
OpenStudio-Standards version retrieved is
0.2.2
openstudio-server OpenStudio-Analysis version command is: docker run canmet/openstudio-server:2.4.3-nrcan-dl-243_test-t4 bundle exec ruby -e "require 'openstudio'; require 'openstudio-analysis'; puts OpenStudio::Analysis::VERSION"
OpenStudio-Analysis version retrieved is
1.0.0.rc18
openstudio-server OpenStudio-Workflow version command is: docker run canmet/openstudio-server:2.4.3-nrcan-dl-243_test-t4 bundle exec ruby -e "require 'openstudio'; require 'openstudio-workflow'; puts OpenStudio::Workflow::VERSION"
OpenStudio-Workflow version retrieved is
1.3.2
openstudio-server EnergyPlus version command is: docker run canmet/openstudio-server:2.4.3-nrcan-dl-243_test-t4 ruby -r openstudio -e "puts OpenStudio.energyPlusVersion"
EnergyPlus version retrieved is
8.9
openstudio-server Radiance version command is: docker run canmet/openstudio-server:2.4.3-nrcan-dl-243_test-t4 /usr/Radiance/bin/rtrace -version
Radiance version retrieved is NREL
5.0.a
openstudio-rserve container pull command is:
docker pull canmet/openstudio-rserve:2.4.3-nrcan-dl-243_test-t4
openstudio-rserve R version command is: docker run canmet/openstudio-rserve:2.4.3-nrcan-dl-243_test-t4 R --version
R version retrieved is
3.4.2
nginx error log
error.log
The text was updated successfully, but these errors were encountered: