
[ci] Upgrade Azure VMSS to use Mariner Linux #6222

Merged · 17 commits into master · Dec 19, 2023

Conversation

shiyu1994
Collaborator

Upgrade Azure VM Scale Set to use Mariner (Azure Linux) systems.
@shiyu1994 shiyu1994 closed this Dec 1, 2023
@shiyu1994 shiyu1994 reopened this Dec 1, 2023
@jameslamb
Collaborator

What is the benefit of this change?

@shiyu1994
Collaborator Author

The service team within the company has asked me to upgrade the OS of the Azure Virtual Machine Scale Set (VMSS) from the open-source Ubuntu image to an Azure-native Linux version called Mariner...

Since upgrading the existing VMSS in place is not feasible, I have to shut it down and create a new one using the Mariner system.

I'm almost done; only one CI test is reporting a segmentation fault, and I'm trying to fix that.

@shiyu1994 shiyu1994 changed the title from "Update .vsts-ci.yml for Azure Pipelines" to "[ci] Update .vsts-ci.yml for Azure Pipelines" on Dec 18, 2023
@shiyu1994
Collaborator Author

It is weird that the cpp tests sometimes segfault at random. Note that even though we are using the Mariner image for the VMSS, we still use Docker to pull the Ubuntu image within the VMSS and run the tests in the same Ubuntu container as before.

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=15574&view=logs&j=c1aa5445-36f0-58f5-2c69-68691d629457&t=c07c8f5b-ba24-5225-6ddc-80219dbcdc35
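
For context, the container setup described above follows roughly this pattern (a sketch only; the image tag and run options are assumptions, not copied from the actual CI scripts — the container name ci-container is taken from the .vsts-ci.yml snippet quoted later in this thread):

# pull the Ubuntu image and keep a long-running container for the test steps
docker pull ubuntu:22.04
docker run -d --name ci-container ubuntu:22.04 sleep infinity
# individual CI steps then exec into that container, e.g.:
docker exec -t ci-container sh -c "uname -a"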

@jameslamb
Collaborator

an Azure-native Linux version called Mariner.

Got it, ok! I guess that's this? https://github.com/microsoft/CBL-Mariner

No problem, like you said...we run all the Linux tests in containers, so hopefully it shouldn't be too disruptive.

It is weird that the cpp tests sometimes segfault at random

Thanks very much for providing that link to the CI run you're talking about! Yeah, I'm not sure if that's related to this change or not. There's been a very low level of activity in the repo over the last few weeks...I haven't seen that specific job segfault in other commits outside this PR, but there also just have not been very many CI runs recently.

I think it's ok to merge this and see if we observe more such problems. If we do that, can you leave the existing hosted runner VM running for a week or two so that we'd be able to switch back if the segfaults become more common?

@jameslamb previously approved these changes Dec 19, 2023
@shiyu1994
Collaborator Author

shiyu1994 commented Dec 19, 2023

Got it, ok! I guess that's this? https://github.com/microsoft/CBL-Mariner

Yes.

But I've debugged this using a VM with the same settings as the VMSS and found that the segmentation fault happens very frequently. I'm afraid that wouldn't be acceptable. Hopefully I can find some time to fix the issue.

It is weird because I'm debugging in a Docker container created from an Ubuntu 22.04 image within the Mariner system, which is the same container image our previous VMSS used in CI.

I manually triggered CI on the master branch several times yesterday and found no segmentation fault problems.

@shiyu1994 shiyu1994 changed the title from "[ci] Update .vsts-ci.yml for Azure Pipelines" to "[ci] Upgrade Azure VMSS to use Mariner Linux" on Dec 19, 2023
@jameslamb jameslamb dismissed their stale review December 19, 2023 02:33

not ready to merge

@jameslamb
Collaborator

I manually triggered CI on the master branch several times yesterday and found no segmentation fault problems.

Oh! Ok you're right, that's a serious problem and we should figure it out before merging. Thank you for being so thorough!

Let me know if there's anything I can do to help.

@shiyu1994
Collaborator Author

shiyu1994 commented Dec 19, 2023

The root cause is the sanitizers in clang. Here is a minimal reproducible case:

int main(int argc, char** argv) {
  return 0;
}

Even this program, compiled with clang++ and the -fsanitize=address flag, causes a segmentation fault in the Ubuntu container started on the Mariner system.

In other words, it has nothing to do with the code in our repo.

Do you have any suggestions on this? @jameslamb @guolinke @jmoralez @StrikerRUS

@jameslamb
Collaborator

Hmmm, I wasn't able to reproduce that with clang 14.0.0 (the version I see in the logs from the failed build) on godbolt (https://godbolt.org/z/MYx84dcv3) or in Docker locally on my Mac.

docker run \
  --rm \
  -it ubuntu:latest \
  /bin/bash

apt-get update
apt-get install -y \
    clang-14

cat << EOF >test.cpp
int main(int argc, char** argv) {
  return 0;
}
EOF

clang++-14 -Wall -O3 -fsanitize=address -o test.o ./test.cpp
./test.o
echo $?
# 0

Maybe this hack to allow the use of sudo is causing problems?

LightGBM/.vsts-ci.yml, lines 138 to 140 at 074b3e8:

  - script: |
      /tmp/docker exec -t -u 0 ci-container \
        sh -c "apt-get update && apt-get -o Dpkg::Options::="--force-confold" -y install sudo"

Like maybe Mariner (or the Docker build repackaged for Mariner's tdnf package manager) has a slightly different security model, and that somehow shows up as a segfault?


I have another idea too... maybe it's because Mariner is using such an old version of the Docker Engine API?

On a failed build on the new runner (build link), I see the following:

/usr/bin/docker version --format '{{.Server.APIVersion}}'
'1.41'
Docker daemon API version: '1.41'
/usr/bin/docker version --format '{{.Client.APIVersion}}'
'1.41'
Docker client API version: '1.41'

On the most recent build of master (from a few hours ago), which was successful and ran on the existing hosted runner not using Mariner (build link), I see:

/usr/bin/docker version --format '{{.Server.APIVersion}}'
'1.43'
Docker daemon API version: '1.43'
/usr/bin/docker version --format '{{.Client.APIVersion}}'
'1.43'
Docker client API version: '1.43'

The first release of v1.41 of the Docker Engine API was in May 2019 (moby/moby#39208). That series ran until... February 2021, I think? (moby/moby#42063).

v1.43 is the most recent version of the Docker Engine API (https://docs.docker.com/engine/api/version-history/).

@shiyu1994
Collaborator Author

Just tried with clang-17 and the issue disappears. I'll upgrade clang to clang-17.
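
For reference, one way to get clang-17 into the Ubuntu 22.04 container is the upstream LLVM apt helper script (a sketch only; the actual CI change may install it differently):

# inside the Ubuntu 22.04 container; the llvm.sh helper needs these support packages
apt-get update
apt-get install -y wget lsb-release gnupg software-properties-common
wget https://apt.llvm.org/llvm.sh
chmod +x llvm.sh
./llvm.sh 17          # adds the apt.llvm.org repository and installs clang-17
clang++-17 --version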

@jameslamb
Collaborator

Just tried with clang-17 and the issue disappears. I'll upgrade clang to clang-17

Ok I think that's alright for this.

@shiyu1994
Collaborator Author

The version of Docker could also be a potential cause. But given that we are under pressure to migrate to Mariner, I'll upgrade clang for now to quickly solve this issue, and investigate the Docker version later on.

@jameslamb
Collaborator

Yeah I think that's ok! I understand it's something Microsoft considers time-sensitive.

And we get so much more test coverage per commit by having that hosted runner than we'd be able to with only free-tier resources...I think it's worth it.

@shiyu1994
Collaborator Author

The gpu_source task fails with clang 17:
https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=15598&view=logs&j=582743ec-83fe-58c4-937e-4197b5816801

Retriggering to see if we can reproduce this error...

If this occurs frequently, I will restrict the use of clang 17 to only the cpp tests task.
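
A sketch of what that restriction could look like, assuming the CI setup script chooses the compiler based on a task environment variable (the exact variable name and task value here are assumptions):

# hypothetical snippet for the CI setup script: only the cpp-tests task uses clang-17,
# all other tasks keep the container's default clang
if [ "${TASK}" = "cpp-tests" ]; then
    export CC=clang-17
    export CXX=clang++-17
else
    export CC=clang
    export CXX=clang++
fi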

@shiyu1994
Collaborator Author

I found that even without -DUSE_GPU=ON and with device_type=cpu, test_engine.py can sometimes consume up to 70% of the 8GB of memory in the container.

I guess that with device_type=gpu and gpu_use_dp=true, the memory consumption reaches the total memory on the VM. That's why the task sometimes fails.
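
One rough way to spot-check this while the tests run (the container name comes from the CI config quoted earlier; the pytest invocation and test path are assumptions about how the tests are launched):

# in one shell: run the Python tests inside the CI container
docker exec -t ci-container sh -c "python -m pytest tests/python_package_test/test_engine.py"

# in another shell: sample the container's memory usage while the tests run
docker stats --no-stream ci-container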

@shiyu1994
Collaborator Author

I'll double the RAM of the VMSS...
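
For reference, resizing the scale set to a larger SKU with the Azure CLI looks roughly like this (a sketch; the resource group, scale-set name, and target SKU are placeholders, not the real ones):

# hypothetical names: <resource-group>, <vmss-name>; Standard_D4s_v3 is one example of a 16GB SKU
az vmss update --resource-group <resource-group> --name <vmss-name> --set sku.name=Standard_D4s_v3
# roll the new size out to the existing instances
az vmss update-instances --resource-group <resource-group> --name <vmss-name> --instance-ids "*"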

@jameslamb
Collaborator

reaches the total memory on the VM. That's why the task sometimes fails.

Interesting! So then could we revert the use of clang-17, to avoid the extra complexity in CI scripts?

@shiyu1994
Collaborator Author

So then could we revert the use of clang-17, to avoid the extra complexity in CI scripts?

Perhaps not, because we have two issues here:

  1. The segmentation fault in cpp_tests is due to using clang 14 with sanitizers on the Mariner VMSS.
  2. The OOM is a separate issue: the test_model_size test case consumes almost all of the RAM in the VM.

So we have to keep clang 17 for the Mariner VMSS and double the RAM in the VMSS.

@jameslamb
Collaborator

Oh I see. Ok no problem.

@shiyu1994
Collaborator Author

Note that the Mariner image does not natively include docker and git, so we use cloud-init to install them whenever a new VM is created in the VMSS. However, this doesn't seem to be 100% successful.

For example, this job fails because git is not available:
https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=15608&view=logs&j=11c46b0e-4322-5d4e-fafd-abe3157be7c4

But based on the jobs we've run so far, this only happens rarely (in less than 1 in 10 Azure Pipelines runs). So maybe we can ignore it for now, merge the changes, and improve the VM initialization later. WDYT? @jameslamb
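
For reference, the per-VM setup described above could be expressed as cloud-init user data roughly like this (a sketch only, not the actual configuration; the moby-engine/moby-cli package names are an assumption about how Mariner packages Docker):

#!/bin/bash
# hypothetical cloud-init user data (cloud-init also accepts a plain shell script):
# install git and the Docker engine on a freshly created Mariner VM, retrying a few
# times because the install occasionally fails on first boot
for attempt in 1 2 3; do
    if tdnf install -y git moby-engine moby-cli; then
        break
    fi
    sleep 10
done
# start Docker and keep it enabled across reboots
systemctl enable --now docker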

@shiyu1994
Collaborator Author

shiyu1994 commented Dec 19, 2023

I'll double the RAM of the VMSS...

Previously we had 8GB of RAM for each VM in the VMSS (the same as in the previous VMSS using Ubuntu images), and now we have 16GB. The larger RAM does make the CI jobs more stable.

I manually triggered 5 pipeline runs using the VMSS with 16GB of RAM, and 4 out of 5 succeeded. Only one failed, due to the initialization issue (git was not successfully installed). However, the initialization issue rarely happens.

[Screenshot: results of the manually triggered pipeline runs]

Collaborator

@jameslamb jameslamb left a comment

I think it's ok, let's merge this and see how often the jobs fail.

Even before this PR, I've observed Azure DevOps initialization errors, timeouts, and other not-related-to-LightGBM failures pretty regularly (maybe 1 out of every 5 commits has at least 1 failed job).

@shiyu1994 shiyu1994 merged commit bed9f3f into master Dec 19, 2023
41 checks passed
@shiyu1994 shiyu1994 deleted the azure-pipelines-mariner branch December 19, 2023 17:18