Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datapoints of a running analysis fail because of time/retry limit while worker attempts to upload datapoint.zip to server #366

Closed
padmassun opened this issue Jul 13, 2018 · 6 comments

Comments

@padmassun
Copy link
Member

General Issue

Datapoints of a running analysis to fail because of time/retry limit while worker attempts to upload datapoint.zip to server.

aws_instance_options = {:server_instance_type   =>	"c3.8xlarge",   
                        :worker_instance_type   =>	"c3.8xlarge", 
                        :worker_node_number     =>	5, 
                        :ami                    =>	'ami-c3efdebc' # private ami
                       }

c3.8xlarge Specs: 60 GB RAM, 36 cores, 640 GB (2 * 320 GB SSD)

Script used to start server: https://github.com/canmet-energy/start_aws_server

Server branch: https://github.com/NREL/OpenStudio-server/tree/nrcan_2.4.3_test

Server details

The server framework is 2.4.1 as a base, with OSApp bumped to 2.4.3

  • List of changes can be found here

More details regarding the specs of ami can be found near the end

Description

Downloading a file approx 200 Mb (on a slow internet connection) from the OS-Server, while an analysis is running, causes the datapoints of another running analysis to fail.

On amazon cloud, the worker fails to upload the datapoint.zip file to the server. The run_simulate_data_point.rb file attempts to upload the datapoint.zip file up to 4 times within 120s. if the file has not been uploaded successfully within 120s, the status of the datapoint is set as datapoint failure

No errors were found in the sdp log file

image

Figure 1: Shows that the worker tried multiple times to upload the file, and it sometimes passes and sometimes it fails.



fails

Figure 2: Download was started after the 2nd analysis from bottom was finished. Then the 3rd analysis was started using PAT.

Note: Analysis 1&4 and 2&3 are identical (counted from bottom). The page refresh experiment was carried out during analysis 4

Reproduction steps:

  1. Submit a large analysis to AWS (approx 1000 datapoints)
  2. Submit a second large analysis to AWS while the first one is running
  3. Wait for 1st analysis to finish
  4. After a while, when the 2nd analysis is running, download the results file (approx 60 Mb) over a slow connection which can take about 15-20 mins to download

As an experiment, I have also tried to refresh pages (periodically) with large text contents (e.g. blob data of osm files, etc) on a slow connection. This also causes the datapoints to fail.

Actual outcome:

Multiple datapoint failures are observed, because the worker fails to upload the datapoint.zip (size varying from 50 Mb to 110 Mb)

Note: the nginx upload file size limit was increased from 100 Mb to 1000Mb Commit Link

Expected outcome:

No datapoints should fail

Other information:

The datapoints do not fail if the download happens on a fast connection

Workaround

Download the results after all analysis are finished

Possible fix

Increase timeout or number of retries for run_simulate_data_point.rb

Sample Simulate Datapoint Log

# Logfile created on 2018-07-12 17:20:02 +0000 by logger.rb/47272
[17:20:02.901942 INFO] Server host is http://web:80
[17:20:02.902020 INFO] Analysis directory is /mnt/openstudio/analysis_05fa3bb9-73f7-4692-9321-ee3e5c360fb4
[17:20:02.902112 INFO] Simulation directory is /mnt/openstudio/analysis_05fa3bb9-73f7-4692-9321-ee3e5c360fb4/data_point_68819134-16e7-44d6-9bc5-83106689cbc8
[17:20:02.902163 INFO] Run datapoint type/file is workflow
[17:20:02.902218 INFO] Starting initialize_worker for datapoint 68819134-16e7-44d6-9bc5-83106689cbc8
[17:20:02.902443 INFO] Downloading analysis zip from http://web:80/analyses/05fa3bb9-73f7-4692-9321-ee3e5c360fb4/download_analysis_zip
[17:20:05.376792 INFO] Extracting analysis zip to /mnt/openstudio/analysis_05fa3bb9-73f7-4692-9321-ee3e5c360fb4
[17:20:05.552869 INFO] Downloading analysis.json from http://web:80/analyses/05fa3bb9-73f7-4692-9321-ee3e5c360fb4.json
[17:20:05.598607 INFO] Finished worker initialization
[17:20:05.610482 INFO] Downloading datapoint from http://web:80/data_points/68819134-16e7-44d6-9bc5-83106689cbc8.json
[17:20:05.626749 INFO] Creating Workflow Manager instance
[17:20:05.626830 INFO] Directory is /mnt/openstudio/analysis_05fa3bb9-73f7-4692-9321-ee3e5c360fb4/data_point_68819134-16e7-44d6-9bc5-83106689cbc8
[17:20:05.626882 INFO] Opening run.log file '/mnt/openstudio/analysis_05fa3bb9-73f7-4692-9321-ee3e5c360fb4/data_point_68819134-16e7-44d6-9bc5-83106689cbc8/run/run.log'
[17:20:05.638061 INFO] Running workflow
[17:39:33.115881 INFO] Final run state is finished
[17:39:33.139205 INFO] Saving files/reports back to the server
[17:39:33.139525 INFO] Saving report /mnt/openstudio/analysis_05fa3bb9-73f7-4692-9321-ee3e5c360fb4/data_point_68819134-16e7-44d6-9bc5-83106689cbc8/reports/openstudio_results_report.html to http://web:80/data_points/68819134-16e7-44d6-9bc5-83106689cbc8/upload_file
[17:39:36.661193 INFO] Saving report responded with {"id":"5b479258d30cf70158e37fe5","data_point_id":"68819134-16e7-44d6-9bc5-83106689cbc8","display_name":"openstudio_results_report","type":"Report","created_at":"2018-07-12T17:39:36.585Z","updated_at":"2018-07-12T17:39:36.585Z"}
[17:39:36.661461 INFO] Saving report /mnt/openstudio/analysis_05fa3bb9-73f7-4692-9321-ee3e5c360fb4/data_point_68819134-16e7-44d6-9bc5-83106689cbc8/reports/btap_view_model_report.html to http://web:80/data_points/68819134-16e7-44d6-9bc5-83106689cbc8/upload_file
[17:39:36.919939 INFO] Saving report responded with {"id":"5b479258d30cf7014ce38119","data_point_id":"68819134-16e7-44d6-9bc5-83106689cbc8","display_name":"btap_view_model_report","type":"Report","created_at":"2018-07-12T17:39:36.888Z","updated_at":"2018-07-12T17:39:36.888Z"}
[17:39:36.920216 INFO] Saving report /mnt/openstudio/analysis_05fa3bb9-73f7-4692-9321-ee3e5c360fb4/data_point_68819134-16e7-44d6-9bc5-83106689cbc8/reports/eplustbl.html to http://web:80/data_points/68819134-16e7-44d6-9bc5-83106689cbc8/upload_file
[17:40:46.447474 INFO] Saving report responded with {"id":"5b47929ed30cf70160e37e7c","data_point_id":"68819134-16e7-44d6-9bc5-83106689cbc8","display_name":"eplustbl","type":"Report","created_at":"2018-07-12T17:40:46.417Z","updated_at":"2018-07-12T17:40:46.417Z"}
[17:40:46.447780 INFO] Saving report /mnt/openstudio/analysis_05fa3bb9-73f7-4692-9321-ee3e5c360fb4/data_point_68819134-16e7-44d6-9bc5-83106689cbc8/reports/btap_view_model_report.json to http://web:80/data_points/68819134-16e7-44d6-9bc5-83106689cbc8/upload_file
[17:40:46.706730 INFO] Saving report responded with {"id":"5b47929ed30cf70160e37e7d","data_point_id":"68819134-16e7-44d6-9bc5-83106689cbc8","display_name":"btap_view_model_report","type":"Report","created_at":"2018-07-12T17:40:46.689Z","updated_at":"2018-07-12T17:40:46.689Z"}
[17:40:46.707622 INFO] Saving report /mnt/openstudio/analysis_05fa3bb9-73f7-4692-9321-ee3e5c360fb4/data_point_68819134-16e7-44d6-9bc5-83106689cbc8/run/objectives.json to http://web:80/data_points/68819134-16e7-44d6-9bc5-83106689cbc8/upload_file
[17:40:46.785331 INFO] Saving report responded with {"id":"5b47929ed30cf70160e37e7e","data_point_id":"68819134-16e7-44d6-9bc5-83106689cbc8","display_name":"objectives","type":"Report","created_at":"2018-07-12T17:40:46.759Z","updated_at":"2018-07-12T17:40:46.759Z"}
[17:40:46.785699 INFO] Saving report /mnt/openstudio/analysis_05fa3bb9-73f7-4692-9321-ee3e5c360fb4/data_point_68819134-16e7-44d6-9bc5-83106689cbc8/out.osw to http://web:80/data_points/68819134-16e7-44d6-9bc5-83106689cbc8/upload_file
[17:40:46.987419 INFO] Saving report responded with {"id":"5b47929ed30cf70160e37e7f","data_point_id":"68819134-16e7-44d6-9bc5-83106689cbc8","display_name":"Final OSW File","type":"Report","created_at":"2018-07-12T17:40:46.931Z","updated_at":"2018-07-12T17:40:46.931Z"}
[17:40:46.987855 INFO] Saving report /mnt/openstudio/analysis_05fa3bb9-73f7-4692-9321-ee3e5c360fb4/data_point_68819134-16e7-44d6-9bc5-83106689cbc8/in.osm to http://web:80/data_points/68819134-16e7-44d6-9bc5-83106689cbc8/upload_file
[17:40:48.329457 INFO] Saving report responded with {"id":"5b4792a0d30cf70160e37e80","data_point_id":"68819134-16e7-44d6-9bc5-83106689cbc8","display_name":"model","type":"OpenStudio Model","created_at":"2018-07-12T17:40:48.264Z","updated_at":"2018-07-12T17:40:48.264Z"}
[17:40:48.329938 INFO] Saving report /mnt/openstudio/analysis_05fa3bb9-73f7-4692-9321-ee3e5c360fb4/data_point_68819134-16e7-44d6-9bc5-83106689cbc8/run/data_point.zip to http://web:80/data_points/68819134-16e7-44d6-9bc5-83106689cbc8/upload_file
[17:45:12.654211 ERROR] Could not save report Zip File with message: Request Timeout in /usr/local/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:427:in `rescue in transmit'
/usr/local/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:350:in `transmit'
/usr/local/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:176:in `execute'
/usr/local/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:41:in `execute'
/usr/local/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0/lib/restclient.rb:69:in `post'
/opt/openstudio/server/app/workers/run_simulate_data_point.rb:456:in `block in upload_file'
/usr/local/lib/ruby/2.2.0/timeout.rb:88:in `block in timeout'
/usr/local/lib/ruby/2.2.0/timeout.rb:32:in `block in catch'
/usr/local/lib/ruby/2.2.0/timeout.rb:32:in `catch'
/usr/local/lib/ruby/2.2.0/timeout.rb:32:in `catch'
/usr/local/lib/ruby/2.2.0/timeout.rb:103:in `timeout'
/opt/openstudio/server/app/workers/run_simulate_data_point.rb:453:in `upload_file'
/opt/openstudio/server/app/workers/run_simulate_data_point.rb:230:in `perform'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/performable_method.rb:26:in `perform'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/backend/base.rb:81:in `block in invoke_job'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:61:in `call'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:61:in `block in initialize'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:66:in `call'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:66:in `execute'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:40:in `run_callbacks'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/backend/base.rb:78:in `invoke_job'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/worker.rb:230:in `block (2 levels) in run'
/usr/local/lib/ruby/2.2.0/timeout.rb:88:in `block in timeout'
/usr/local/lib/ruby/2.2.0/timeout.rb:98:in `call'
/usr/local/lib/ruby/2.2.0/timeout.rb:98:in `timeout'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/worker.rb:230:in `block in run'
/usr/local/lib/ruby/2.2.0/benchmark.rb:303:in `realtime'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/worker.rb:229:in `run'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/worker.rb:312:in `block in reserve_and_run_one_job'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:61:in `call'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:61:in `block in initialize'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:66:in `call'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:66:in `execute'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:40:in `run_callbacks'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/worker.rb:312:in `reserve_and_run_one_job'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/worker.rb:213:in `block in work_off'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/worker.rb:212:in `times'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/worker.rb:212:in `work_off'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/worker.rb:175:in `block (4 levels) in start'
/usr/local/lib/ruby/2.2.0/benchmark.rb:303:in `realtime'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/worker.rb:174:in `block (3 levels) in start'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:61:in `call'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:61:in `block in initialize'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:66:in `call'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:66:in `execute'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:40:in `run_callbacks'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/worker.rb:173:in `block (2 levels) in start'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/worker.rb:172:in `loop'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/worker.rb:172:in `block in start'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/plugins/clear_locks.rb:7:in `call'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/plugins/clear_locks.rb:7:in `block (2 levels) in <class:ClearLocks>'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:79:in `call'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:79:in `block (2 levels) in add'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:61:in `call'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:61:in `block in initialize'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:79:in `call'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:79:in `block in add'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:66:in `call'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:66:in `execute'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/lifecycle.rb:40:in `run_callbacks'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/worker.rb:171:in `start'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/command.rb:137:in `run'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/command.rb:125:in `block in run_process'
/usr/local/lib/ruby/gems/2.2.0/gems/daemons-1.2.6/lib/daemons/application.rb:270:in `call'
/usr/local/lib/ruby/gems/2.2.0/gems/daemons-1.2.6/lib/daemons/application.rb:270:in `block in start_proc'
/usr/local/lib/ruby/gems/2.2.0/gems/daemons-1.2.6/lib/daemons/application.rb:279:in `call'
/usr/local/lib/ruby/gems/2.2.0/gems/daemons-1.2.6/lib/daemons/application.rb:279:in `start_proc'
/usr/local/lib/ruby/gems/2.2.0/gems/daemons-1.2.6/lib/daemons/application.rb:300:in `start'
/usr/local/lib/ruby/gems/2.2.0/gems/daemons-1.2.6/lib/daemons/controller.rb:59:in `run'
/usr/local/lib/ruby/gems/2.2.0/gems/daemons-1.2.6/lib/daemons.rb:197:in `block in run_proc'
/usr/local/lib/ruby/gems/2.2.0/gems/daemons-1.2.6/lib/daemons/cmdline.rb:92:in `call'
/usr/local/lib/ruby/gems/2.2.0/gems/daemons-1.2.6/lib/daemons/cmdline.rb:92:in `catch_exceptions'
/usr/local/lib/ruby/gems/2.2.0/gems/daemons-1.2.6/lib/daemons.rb:196:in `run_proc'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/command.rb:123:in `run_process'
/usr/local/lib/ruby/gems/2.2.0/gems/delayed_job-4.1.5/lib/delayed/command.rb:98:in `daemonize'
/opt/openstudio/server/bin/delayed_job:6:in `<main>'
[17:45:12.664436 INFO] Finished /opt/openstudio/server/app/workers/run_simulate_data_point.rb

Error on Remote Server Details

AMI version being used
OpenStudio-Server-Docker-2.4.3-nrcan-dl-243_test-t4 ami-c3efdebc -- private image

OSS version retrieval command is: ruby -r /home/ubuntu/OpenStudio-server/server/lib/openstudio_server/version.rb -e "puts OpenstudioServer::VERSION"
OSS version retrieved is 2.4.3

AMI version being overwritten from 2.4.3 to 2.4.3
OSS version extension retrieval command is: ruby -r /home/ubuntu/OpenStudio-server/server/lib/openstudio_server/version.rb -e "puts OpenstudioServer::VERSION_EXT"
OSS version extension retrieved is -nrcan

AMI version extension being overwritten from -nrcan to -nrcan-dl-243_test-t4
AMI author being overwritten from GitLabCI to CanmetENERGY
Replaced nrel with canmet in docker pull and docker tag commands

Packer command is: packer build -machine-readable -var-file=user_variables.json openstudio_server_docker_base.json 2>&1 | tee /home/ubuntu/build.log
STDOUT written to /home/ubuntu/build.log

OpenStudio Server SHA retrieval command is: git log -n 1 | grep commit
OpenStudio Server SHA retrieved is d61dcbc8548d6366e176d64297db0912aa14a09a

openstudio-server container pull command is: docker pull canmet/openstudio-server:2.4.3-nrcan-dl-243_test-t4
openstudio-server OpenStudio version command is: docker run canmet/openstudio-server:2.4.3-nrcan-dl-243_test-t4 ruby -r openstudio -e "puts OpenStudio.openStudioLongVersion"
OpenStudio version retrieved is 2.4.3, with SHA 29a61f6637

openstudio-server OpenStudio-Standards version command is: docker run canmet/openstudio-server:2.4.3-nrcan-dl-243_test-t4 bundle exec ruby -e "require 'openstudio'; require 'openstudio-standards'; puts OpenstudioStandards::VERSION"
OpenStudio-Standards version retrieved is 0.2.2

openstudio-server OpenStudio-Analysis version command is: docker run canmet/openstudio-server:2.4.3-nrcan-dl-243_test-t4 bundle exec ruby -e "require 'openstudio'; require 'openstudio-analysis'; puts OpenStudio::Analysis::VERSION"
OpenStudio-Analysis version retrieved is 1.0.0.rc18

openstudio-server OpenStudio-Workflow version command is: docker run canmet/openstudio-server:2.4.3-nrcan-dl-243_test-t4 bundle exec ruby -e "require 'openstudio'; require 'openstudio-workflow'; puts OpenStudio::Workflow::VERSION"
OpenStudio-Workflow version retrieved is 1.3.2

openstudio-server EnergyPlus version command is: docker run canmet/openstudio-server:2.4.3-nrcan-dl-243_test-t4 ruby -r openstudio -e "puts OpenStudio.energyPlusVersion"
EnergyPlus version retrieved is 8.9

openstudio-server Radiance version command is: docker run canmet/openstudio-server:2.4.3-nrcan-dl-243_test-t4 /usr/Radiance/bin/rtrace -version
Radiance version retrieved is NREL 5.0.a

openstudio-rserve container pull command is: docker pull canmet/openstudio-rserve:2.4.3-nrcan-dl-243_test-t4

openstudio-rserve R version command is: docker run canmet/openstudio-rserve:2.4.3-nrcan-dl-243_test-t4 R --version
R version retrieved is 3.4.2

nginx error log

error.log

@padmassun padmassun changed the title Datapoints of a running analysis to fail because of time/retry limit while worker attempts to upload datapoint.zip to server Datapoints of a running analysis fail because of time/retry limit while worker attempts to upload datapoint.zip to server Jul 13, 2018
@brianlball
Copy link
Contributor

I know a bunch of timeout / file size related issues have been addressed recently. Hopefully 2.6.0 will fix a bunch of those. There is also this commit from a while back that may help you:

442537a

@padmassun
Copy link
Member Author

Hello @brianlball,

Thank you for the quick response. I was aware of this, and I have already increased the nginx upload limit. This server already has this fix applied.

Thank you.

bef391d

@nllong
Copy link
Member

nllong commented Jul 13, 2018

I imagine that one of the Timeout.timeout(...) methods is killing the process. Especially if it takes anytime greater than 240 seconds. I think we should relax these constraints. Also, it looks like we should fix the uploading on retry due to the multiple 'zip file' links that are appearing in the dashboard.

zip_download_timeout = 240
begin
Timeout.timeout(zip_download_timeout) do
loop do
break if File.exist? receipt_file
@sim_logger.info 'waiting for receipt file to appear'
sleep 3
end
end
@sim_logger.info 'receipt_file appeared, moving on'
return true
rescue ::Timeout::Error
@sim_logger.error "Required analysis objects were not retrieved after #{zip_download_timeout} seconds."
end

The other area smells like general performance issues. In version 2.5.1 and greater we moved away from delayed_jobs for background tasks and over to redis/resque for overall perfomance and better concurrency management. It would be nice to try version 2.6.0 and see if this problem still exists. Note that images for 2.6.0 are being built today.

@padmassun
Copy link
Member Author

Similar issues were observed with 2.6.0
Image built from https://github.com/ckirney/OpenStudio-server/tree/nrcan_2.6.0
( @ckirney did not have access to OS-Server yet, so it was forked)

image

Sample Datapoint Simulation Log

# Logfile created on 2018-08-17 14:07:41 +0000 by logger.rb/47272
[14:07:41.693608 INFO] Server host is http://web:80
[14:07:41.693784 INFO] Analysis directory is /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3
[14:07:41.694036 INFO] Simulation directory is /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad
[14:07:41.694096 INFO] Run datapoint type/file is workflow
[14:07:41.694151 INFO] Starting initialize_worker for datapoint 29ab789f-5958-49e1-9d2a-3b99af0335ad
[14:07:41.694247 INFO] receipt_file already exists, moving on
[14:08:03.898766 INFO] Downloading datapoint from http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad.json
[14:08:32.760602 INFO] Creating Workflow Manager instance
[14:08:32.760754 INFO] Directory is /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad
[14:08:32.760895 INFO] Opening run.log file '/mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/run/run.log'
[14:08:32.796586 INFO] Running workflow
[14:09:31.129801 INFO] Final run state is finished
[14:09:31.189768 INFO] Saving files/reports back to the server
[14:09:31.190108 INFO] Saving report /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/reports/openstudio_results_report.html to http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad/upload_file
[14:09:48.472696 INFO] Saving report responded with {"id":"5b76d729980f0b0d68ad2102","data_point_id":"29ab789f-5958-49e1-9d2a-3b99af0335ad","display_name":"openstudio_results_report","type":"Report","created_at":"2018-08-17T14:09:45.674Z","updated_at":"2018-08-17T14:09:45.674Z"}
[14:09:48.472989 INFO] Saving report /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/reports/btap_view_model_report.html to http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad/upload_file
[14:09:57.142299 INFO] Saving report responded with {"id":"5b76d734980f0b02320c874f","data_point_id":"29ab789f-5958-49e1-9d2a-3b99af0335ad","display_name":"btap_view_model_report","type":"Report","created_at":"2018-08-17T14:09:56.147Z","updated_at":"2018-08-17T14:09:56.147Z"}
[14:09:57.142641 INFO] Saving report /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/reports/eplustbl.html to http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad/upload_file
[14:10:07.078740 INFO] Saving report responded with {"id":"5b76d73a980f0b0de0ad20fb","data_point_id":"29ab789f-5958-49e1-9d2a-3b99af0335ad","display_name":"eplustbl","type":"Report","created_at":"2018-08-17T14:10:02.485Z","updated_at":"2018-08-17T14:10:02.485Z"}
[14:10:07.079081 INFO] Saving report /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/reports/btap_view_model_report.json to http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad/upload_file
[14:10:16.056748 INFO] Saving report responded with {"id":"5b76d747980f0b0e1bad2103","data_point_id":"29ab789f-5958-49e1-9d2a-3b99af0335ad","display_name":"btap_view_model_report","type":"Report","created_at":"2018-08-17T14:10:15.986Z","updated_at":"2018-08-17T14:10:15.986Z"}
[14:10:16.057133 INFO] Saving report /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/run/objectives.json to http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad/upload_file
[14:10:23.924702 INFO] Saving report responded with {"id":"5b76d74f980f0b026c0c872a","data_point_id":"29ab789f-5958-49e1-9d2a-3b99af0335ad","display_name":"objectives","type":"Report","created_at":"2018-08-17T14:10:23.891Z","updated_at":"2018-08-17T14:10:23.891Z"}
[14:10:23.925051 INFO] Saving report /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/out.osw to http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad/upload_file
[14:10:40.704653 INFO] Saving report responded with {"id":"5b76d75d980f0b02740c8723","data_point_id":"29ab789f-5958-49e1-9d2a-3b99af0335ad","display_name":"Final OSW File","type":"Report","created_at":"2018-08-17T14:10:37.392Z","updated_at":"2018-08-17T14:10:37.392Z"}
[14:10:40.705016 INFO] Saving report /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/in.osm to http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad/upload_file
[14:10:56.803330 INFO] Saving report responded with {"id":"5b76d76f980f0b0c2aad2108","data_point_id":"29ab789f-5958-49e1-9d2a-3b99af0335ad","display_name":"model","type":"OpenStudio Model","created_at":"2018-08-17T14:10:55.377Z","updated_at":"2018-08-17T14:10:55.377Z"}
[14:10:56.803659 INFO] Saving report /mnt/openstudio/analysis_bd9b4922-2e0a-4985-84ca-d86551420bc3/data_point_29ab789f-5958-49e1-9d2a-3b99af0335ad/run/data_point.zip to http://web:80/data_points/29ab789f-5958-49e1-9d2a-3b99af0335ad/upload_file
[14:16:05.921199 ERROR] Could not save report Zip File with message: Request Timeout in /usr/local/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:427:in `rescue in transmit'
/usr/local/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:350:in `transmit'
/usr/local/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:176:in `execute'
/usr/local/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0/lib/restclient/request.rb:41:in `execute'
/usr/local/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0/lib/restclient.rb:69:in `post'
/opt/openstudio/server/app/workers/run_simulate_data_point.rb:462:in `block in upload_file'
/usr/local/lib/ruby/2.2.0/timeout.rb:88:in `block in timeout'
/usr/local/lib/ruby/2.2.0/timeout.rb:32:in `block in catch'
/usr/local/lib/ruby/2.2.0/timeout.rb:32:in `catch'
/usr/local/lib/ruby/2.2.0/timeout.rb:32:in `catch'
/usr/local/lib/ruby/2.2.0/timeout.rb:103:in `timeout'
/opt/openstudio/server/app/workers/run_simulate_data_point.rb:459:in `upload_file'
/opt/openstudio/server/app/workers/run_simulate_data_point.rb:230:in `perform'
/opt/openstudio/server/app/workers/run_simulate_data_point_resque.rb:42:in `perform'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/job.rb:168:in `perform'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/worker.rb:304:in `perform'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/worker.rb:894:in `block in perform_with_fork'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/worker.rb:892:in `fork'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/worker.rb:892:in `perform_with_fork'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/worker.rb:260:in `work_one_job'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/worker.rb:234:in `block in work'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/worker.rb:231:in `loop'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/worker.rb:231:in `work'
/usr/local/lib/ruby/gems/2.2.0/gems/resque-1.27.4/lib/resque/tasks.rb:20:in `block (2 levels) in <top (required)>'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/task.rb:271:in `call'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/task.rb:271:in `block in execute'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/task.rb:271:in `each'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/task.rb:271:in `execute'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/task.rb:213:in `block in invoke_with_call_chain'
/usr/local/lib/ruby/2.2.0/monitor.rb:211:in `mon_synchronize'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/task.rb:193:in `invoke_with_call_chain'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/task.rb:182:in `invoke'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:160:in `invoke_task'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:116:in `block (2 levels) in top_level'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:116:in `each'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:116:in `block in top_level'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:125:in `run_with_threads'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:110:in `top_level'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:83:in `block in run'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:186:in `standard_exception_handling'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/lib/rake/application.rb:80:in `run'
/usr/local/lib/ruby/gems/2.2.0/gems/rake-12.3.1/exe/rake:27:in `<top (required)>'
/usr/local/bin/rake:22:in `load'
/usr/local/bin/rake:22:in `<main>'
[14:16:05.936854 INFO] Finished /opt/openstudio/server/app/workers/run_simulate_data_point.rb

@brianlball
Copy link
Contributor

bumping them up to an hour for the zips

#525
#524

This was referenced Dec 10, 2019
@brianlball
Copy link
Contributor

making timeouts user arguments in
#537
#538

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants