Problems of install v1.6.0 #5445
Comments
New problem, on Ubuntu 20.04.
|
@hzy46 Any comments for this? |
Caused by https://github.com/microsoft/pai/blob/master/contrib/kubespray/roles/docker-cache/install/files/add_docker_cache_config.py. @siaimes, for a workaround, please create the file. @debuggy, please take a look at this issue. |
Ansible 2.7.12 doesn't support ubuntu 20.04. So we use ansible 2.9.7. See the comments here: https://github.com/microsoft/pai/blob/master/contrib/kubespray/script/environment.sh#L46 |
But now, when I install on Ubuntu 16.04, ansible is also upgraded to 2.9.7, and then I am unable to deploy the service. |
So I think this is a bug: the master node does not have a GPU, so there is no /etc/docker/daemon.json. |
On Ubuntu 16.04, this error occurred.
|
I removed the line `sudo python3 -m pip install ansible==2.9.7`, and then a new error occurred.
|
I removed the line `ansible-playbook -i ${HOME}/pai-deploy/cluster-cfg/hosts.yml docker-cache-config-distribute.yml || exit $?`, and the error reappeared.
|
The original error occurred in this line, which may be caused by an incompatibility between python3 and python2. |
Even if I did not modify this part, /etc/docker/daemon.json was also modified to point the mirror to the master node.
|
Solve this problem by changing python3 to python2. |
It succeeded after changing python3 to python2, but the final output has no master node IP.
|
After the deployment, can you visit the webportal? Or does only the final output have a problem? |
Only the final output has a problem. |
The paictl version is inconsistent.
|
Maybe this is due to the docker-cache service not starting after
|
@siaimes Do you use our dev-box to run the ansible playbook, or do you run the playbook in the host env?
|
Thanks for this information. I updated the previous comments. I want to confirm: have you already fixed the deployment issue by using python2? (We need to figure out why this error happened.) For the issue where the job cannot start, can you run |
Indeed it is. |
The last reply has explained the situation. |
Thanks @siaimes, so everything works well after you start the docker-cache, and your job can be started now? If so, we will figure out why docker-cache was not started automatically. |
OK. |
Hi, I will check the issue about docker-cache, adding @SwordFaith in this thread. |
One comment was deleted for privacy reasons, and it contains the following information that may be useful.
I used a brand new ubuntu 16.04 virtual machine as a dev-box and configured it exactly according to the installation guide, except for the docker cache change, as follows.
In this way, the docker cache service should have been started after I started a dev-box container, and then I ran this command and it was all right.
|
Hi, have you tried the quick-start install in our dev-box docker container? I wonder if those errors are due to not using our dev-box to deploy. I'm a little confused by your info. It seems you started a brand new ubuntu 16.04 machine to reinstall openpai v1.6.0 and encountered several issues. Is the ansible version issue about the python version or the ansible version? It would be a big help in understanding the error if you shared the detailed error message with us. Kubectl service info and the log would help a lot as well. |
It's my bad for not considering this situation. We will fix this later. |
First, there are three machines: master, worker, and dev-box. My dev-box is an ubuntu 16.04 virtual machine. Then I followed this process exactly, except for the docker cache section. This document does not require |
Can you run the failed playbook with -vvv to get the detailed log? It will be a big help |
This error should come from ansible; maybe it is not compatible with python3. After changing this line
to this
it worked. |
I'm sorry, I just successfully installed v1.6.0, so I can no longer run these commands. |
It's OK, we will go through your situation by ourselves in order to find the bug, thanks for using our project again. |
Is this information sufficient to locate the problem? |
This seems to be a failed apt install task. kubernetes-sigs/kubespray#6231 |
I have always had a question. Since the document requires users to install docker in advance, why does the openpai installation script install it again? Because the way users add docker sources is different from openpai's, this often leads to apt conflicts. |
@SwordFaith This may be another bug. This is the output when running |
Related to issue 5465. |
Fix update docker cache error: [issue comment](#5445 (comment)). If /etc/docker/daemon.json doesn't exist or is an empty file, the script will fail.
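For readers hitting the same failure, here is a minimal Python sketch of the tolerant behavior described in that fix. It is not the actual `add_docker_cache_config.py`; the function names and the mirror URL are hypothetical and used only for illustration. It assumes the mirror is recorded under Docker's `registry-mirrors` key, and it treats a missing or empty `/etc/docker/daemon.json` as an empty config instead of failing.

```python
import json
import os


def load_daemon_config(path):
    """Return the parsed Docker daemon config, or {} if the file is missing or empty."""
    if not os.path.exists(path):
        return {}
    with open(path, "r") as f:
        content = f.read().strip()
    if not content:
        # An empty file is not valid JSON, so handle it before calling json.loads.
        return {}
    return json.loads(content)


def add_registry_mirror(config, mirror_url):
    """Add mirror_url to the "registry-mirrors" list without duplicating it."""
    mirrors = config.setdefault("registry-mirrors", [])
    if mirror_url not in mirrors:
        mirrors.append(mirror_url)
    return config


def update_daemon_config(path, mirror_url):
    """Load the config (tolerating a missing or empty file), add the mirror, write it back."""
    config = add_registry_mirror(load_daemon_config(path), mirror_url)
    with open(path, "w") as f:
        json.dump(config, f, indent=4)
    return config
```

A call such as `update_daemon_config("/etc/docker/daemon.json", "http://<master-ip>:<port>")` would then work on a master node without a GPU, where the file does not exist yet. The address is a placeholder, not a value taken from this thread.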
* Fix update docker cache error (#5539)
  Fix update docker cache error: [issue comment](#5445 (comment)). If /etc/docker/daemon.json doesn't exist or is an empty file, the script will fail.
* Fix: change tail log to 16KB (#5575)
* make enable_docker_cache effective (#5574)
* Use sed instead of pip to change ansible version (#5573)
  Signed-off-by: siaimes <34199488+siaimes@users.noreply.github.com>
* fix missing `WEBPORTAL_URL` issue when installing services (#5538)
  [issue comment](#5445 (comment))
* Add Prometheus Pushgateway as an optional service (#5590)
  - Add an optional service Prometheus Pushagteway
  - add a container `metrics-cleaner` to clean Pushgateway metrics by fixed interval
  - add prometheus-pushgateway in job-exporter
  - set `honor_lables` as true in Prometheus
* adjust grafana to fit more metrics (#5591)
  - support more metrics, including
  - node_memory_bytes with `type` label
  - node_disk_other_bytes_total, task_block_other_byte
  - get task cpu utilization with `task_cpu_seconds_total`
  - task_network_receive_bytes_total, task_network_transmit_bytes_total
  - avoid wrongly computed 100% cpu utilization by using `idelta`
  - use `irate` instead of `rate` for fast-moving metrics & change the computing interval
  - set `editable` as true in all the dashboards
* fix doc related to china deployment (#5593)
  * fix
  * fix
* Bump runtime version (#5600)
* fix link in readme (#5595)
* Bump merge-deep from 3.0.2 to 3.0.3 in /src/webportal (#5524)
* Bump postcss from 7.0.17 to 7.0.36 in /src/webportal (#5531)
* Bump postcss from 7.0.14 to 7.0.36 in /contrib/submit-job-v2 (#5532)
* Bump path-parse from 1.0.6 to 1.0.7 in /src/rest-server (#5597)
* Bump color-string from 1.5.3 to 1.6.0 in /src/webportal (#5594)
* Bump path-parse from 1.0.6 to 1.0.7 in /src/webportal (#5598)
* Bump path-parse from 1.0.6 to 1.0.7 in /contrib/submit-job-v2 (#5599)

Co-authored-by: siaimes <34199488+siaimes@users.noreply.github.com>
Co-authored-by: Binyang2014 <binyli@microsoft.com>
Co-authored-by: Zhiyuan He <362583303@qq.com>
Co-authored-by: Guoxin <suiguoxin@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Organization Name:
Short summary about the issue/question: TASK [container-engine/docker : ensure docker-ce repository is enabled] failed
Brief what process you are following:
How to reproduce it:
OpenPAI Environment: v1.6.0
`uname -a`:
Anything else we need to know: