
Updating cloud-based OS images - configuration required for Azure RHEL LVM images #1355

Closed
przemyslavic opened this issue Jun 16, 2020 · 10 comments

@przemyslavic
Collaborator

przemyslavic commented Jun 16, 2020

Is your feature request related to a problem?
Azure RHEL RAW images are no longer being produced in favor of LVM-partitioned images (link). To get the latest updates, we need to switch to LVM images.
However, the default file system layout in the LVM images is not sufficient for Epiphany. Partitioning for a 64GB disk is as follows:

[operations@ci-devazurrhelflannel-kubernetes-master-vm-0 ~]$ df -h
Filesystem                 Size  Used Avail Use% Mounted on
devtmpfs                   3.4G     0  3.4G   0% /dev
tmpfs                      3.4G     0  3.4G   0% /dev/shm
tmpfs                      3.4G  9.1M  3.4G   1% /run
tmpfs                      3.4G     0  3.4G   0% /sys/fs/cgroup
/dev/mapper/rootvg-rootlv  2.0G   66M  2.0G   4% /
/dev/mapper/rootvg-usrlv    10G  1.4G  8.7G  14% /usr
/dev/mapper/rootvg-varlv   8.0G  7.9G   96M  99% /var
/dev/sda2                  494M   77M  418M  16% /boot
/dev/mapper/rootvg-homelv 1014M   33M  982M   4% /home
/dev/mapper/rootvg-optlv   2.0G  140M  1.9G   7% /opt
/dev/sda1                  500M  9.7M  491M   2% /boot/efi
/dev/mapper/rootvg-tmplv   2.0G   96M  1.9G   5% /tmp
/dev/sdb1                   14G   41M   13G   1% /mnt
tmpfs                      680M     0  680M   0% /run/user/1000

which results in an error while downloading the requirements (/var, where the downloaded blobs are staged, is only 8 GB and already 99% full in the listing above): Error writing blob: write /var/tmp/docker-tarfile-blob336544209: no space left on device.

Describe the solution you'd like

The LVM layout must be reconfigured to give us the ability to use the latest RHEL LVM image RedHat:RHEL:7-LVM:7.8.2020042719. Until then, we use the latest RAW image RedHat:RHEL:7-RAW:7.7.2019090418.

Describe alternatives you've considered

Additional context

@sk4zuzu
Contributor

sk4zuzu commented Nov 26, 2020

What we need to understand here is what is required to fully customize and update the LVM config inside the VM. We could use cloud-init, for example, but we should first check whether modifications of the config templates are destructive in Azure (and also in AWS, in case we would like to do the same there). @to-bar WDYT?
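For illustration, cloud-init can take a plain shell script as user data, so a first-boot resize might look roughly like the sketch below (an assumption about the approach, not a tested recipe; the rootvg/varlv names come from the stock Azure RHEL LVM image shown above, and the sizes are placeholders):

#!/usr/bin/env bash
# Hypothetical cloud-init user-data script: grow undersized LVs at first boot.
# VG/LV names match the stock Azure RHEL LVM image; sizes are illustrative only.
set -euo pipefail

# -r (--resizefs) grows the filesystem (XFS here) together with the LV
lvextend -r -L 30G /dev/rootvg/varlv         # /var is where downloaded blobs land
lvextend -r -l +100%FREE /dev/rootvg/rootlv  # hand any leftover space to /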

@to-bar
Contributor

to-bar commented Nov 27, 2020

I've done a spike. This task turns out to be really difficult.
I've prepared the following working steps, but this is a PoC (not a final script) to be evaluated and maybe implemented. During the procedure the SSH connection is closed twice (because of the switch-root "killing spree" logic), so further improvements may be needed.

Processes (run by the root user) whose first character of the zeroth command line argument is @ are excluded from the killing spree, much the same way as kernel threads are excluded too.

Source: https://systemd.io/ROOT_STORAGE_DAEMONS/
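As a quick illustration of that rule (the daemon name and path here are hypothetical; bash's exec -a sets argv[0]):

# Launching a storage daemon with '@' as the first character of argv[0]
# keeps it alive through the switch-root killing spree.
exec -a '@my-storage-daemon' /usr/local/bin/my-storage-daemon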

# https://unix.stackexchange.com/questions/450873/how-to-umount-var-usr-safely-on-systemd-without-reboot
# https://systemd.io/INITRD_INTERFACE/

# TODO:
# - Check if execution is needed and skip if not
# - Assert the script is run as root

# Create a temporary volume to clone the data into, then switch-root to it
# in order to unmount points that are in use (such as /var)
# TODO: 10G - set the size dynamically based on used space and disk size

lvcreate --size 10G --name clonelv rootvg
mkfs.ext4 -m0 /dev/rootvg/clonelv  # -m0: reserve no blocks for root, use all space

mkdir /tmp/clone
mount /dev/rootvg/clonelv /tmp/clone

# Clone the whole root tree, preserving ACLs, xattrs and SELinux contexts
# (took ~2m20s in the spike: real 2m20.914s)
time tar -cpSf - \
    --acls --xattrs --selinux \
    --exclude '/dev/*' \
    --exclude '/run/*' \
    --exclude '/sys/*' \
    --exclude '/proc/*' \
    --exclude '/tmp/*' \
    --exclude '/var/tmp/*' \
    --exclude '/var/run/*' \
    / |
    tar -xvf - \
        --acls --xattrs --selinux \
        -C /tmp/clone

# Back up fstab inside the clone, then empty it so the clone boots
# without trying to mount the original LVs
cp -a /tmp/clone/etc/fstab{,.bak}

truncate -s0 /tmp/clone/etc/fstab

# switch-root kills all running processes (the "killing spree"); the SSH connection is closed
systemctl switch-root /tmp/clone

# wait, reconnect and sudo -i

# map each LV to the directory (relative to the new root) that will receive its data
declare -A LV_TO_DIR_MAP
LV_TO_DIR_MAP[homelv]=home
LV_TO_DIR_MAP[optlv]=opt
LV_TO_DIR_MAP[rootlv]=root_fs
LV_TO_DIR_MAP[tmplv]=tmp
LV_TO_DIR_MAP[usrlv]=usr
LV_TO_DIR_MAP[varlv]=var

# mount the LVs that are about to be merged into rootlv
for LV in "${!LV_TO_DIR_MAP[@]}"; do
    mkdir /mnt/$LV
    mount /dev/rootvg/$LV /mnt/$LV
done

lvresize -l +100%FREE /dev/rootvg/rootlv
# xfs_growfs requires mounted FS
xfs_growfs /dev/rootvg/rootlv

# clone and remove LVs
for LV in "${!LV_TO_DIR_MAP[@]}"; do
    if [[ "$LV" != 'rootlv' ]]; then
        cp -a --verbose /mnt/$LV/. /mnt/rootlv/${LV_TO_DIR_MAP[$LV]}/
        umount /mnt/$LV
        lvremove -f /dev/mapper/rootvg-$LV
        sed -i "\|^/dev/mapper/rootvg-$LV|d" /mnt/rootlv/etc/fstab
    fi
done

# checkpoint
cat /mnt/rootlv/etc/fstab
df -h

# switch-root again kills all running processes; the SSH connection is closed
systemctl switch-root /mnt/rootlv

# wait, reconnect and sudo -i

lvremove -f /dev/mapper/rootvg-clonelv

lvresize -l +100%FREE /dev/rootvg/rootlv
xfs_growfs /dev/rootvg/rootlv

rmdir /tmp/clone

lvs
df -h

# [root@tb-rhel-lvm-grooming-repository-vm-3 ~]
# Filesystem                 Size  Used Avail Use% Mounted on
# devtmpfs                   1.6G     0  1.6G   0% /dev
# tmpfs                      1.7G     0  1.7G   0% /dev/shm
# tmpfs                      1.7G  8.5M  1.6G   1% /run
# tmpfs                      1.7G     0  1.7G   0% /sys/fs/cgroup
# /dev/mapper/rootvg-rootlv   64G  1.7G   62G   3% /
# /dev/sda2                  494M   77M  418M  16% /boot
# /dev/sdb1                  6.8G   32M  6.4G   1% /mnt
# /dev/sda1                  500M  9.9M  490M   2% /boot/efi
# tmpfs                      329M     0  329M   0% /run/user/1001

Tested with:

specification:
  storage_image_reference:
    publisher: RedHat
    offer: RHEL
    sku: 7lvm-gen2
    version: "7.9.2020111205"
  storage_os_disk:
    disk_size_gb: 64
  size: Standard_DS1_v2

@sk4zuzu
Contributor

sk4zuzu commented Nov 27, 2020

🤗

@mkyc mkyc modified the milestones: S20201203, S20201217 Dec 4, 2020
@sk4zuzu
Contributor

sk4zuzu commented Dec 8, 2020

It's quite crazy that this landed in ToDo just like that after the spike :) It's a complex and dangerous procedure; IMO it requires further research and maybe a working Ansible PoC first? 🤔

@sk4zuzu
Contributor

sk4zuzu commented Dec 10, 2020

We discussed this in a team meeting and decided that implementing this procedure is not something we really want to do. It's better to just resize specific logical volumes based on their real usage. Later, when we introduce additional data disks into our clusters, this problem will go away.
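A minimal sketch of that usage-based resize idea (assuming the rootvg layout from the df output above; the threshold and sizes are placeholders, not agreed values):

#!/usr/bin/env bash
# Sketch: grow an LV only when its filesystem is nearly full,
# instead of merging everything into rootlv.
set -euo pipefail

maybe_grow() {
    local mount_point=$1 lv=$2 extra=$3
    local used
    used=$(df --output=pcent "$mount_point" | tail -1 | tr -dc '0-9')
    if (( used >= 80 )); then
        # -r (--resizefs) grows the filesystem together with the LV
        lvextend -r -L "+$extra" "/dev/rootvg/$lv"
    fi
}

maybe_grow /var varlv 8G   # repository blobs are staged under /var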

@to-bar to-bar self-assigned this Dec 11, 2020
@mkyc mkyc modified the milestones: S20201217, S20201231 Dec 17, 2020
@mkyc
Contributor

mkyc commented Jan 14, 2021

@to-bar why did it go back to "ToDo"?

@mkyc mkyc modified the milestones: S20210114, S20210128 Jan 14, 2021
@to-bar
Contributor

to-bar commented Jan 19, 2021

@mkyc Moved to ToDo temporarily, just to reflect the fact that I stopped working on this task in order to handle urgent ones. Going to continue this sprint.

@przemyslavic
Collaborator Author

przemyslavic commented Mar 2, 2021

@to-bar I encountered a problem with the read-only file system.

2021-03-02T11:14:20.7451382Z 11:14:20 INFO cli.engine.ansible.AnsibleCommand - TASK [helm_charts : Create Helm charts directory] ******************************
2021-03-02T11:14:20.8067994Z 11:14:20 INFO cli.engine.ansible.AnsibleCommand - skipping: [ci-devazurrhelcanal-logging-vm-0]
2021-03-02T11:14:21.7224642Z 11:14:21 ERROR cli.engine.ansible.AnsibleCommand - fatal: [ci-devazurrhelcanal-repository-vm-0]: FAILED! => {"changed": false, "msg": "There was an issue creating /var/www as requested: [Errno 30] Read-only file system: '/var/www'", "path": "/var/www/html/epirepo/charts"}

As we discussed today, we have to wait for the epiphany-lvm-merge process to finish and only then continue with Ansible.

edit:
I have reproduced the problem on another cluster: Ansible failed at 13:05:17, while the init script only finished at 13:05:28.

Moving this task back to TODO.
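For reference, a shell equivalent of such a wait could look like this (a sketch only; it assumes epiphany-lvm-merge.service is a oneshot unit that reports activating while still running):

# Poll until the LVM merge unit is no longer running.
while [[ "$(systemctl is-active epiphany-lvm-merge.service)" == "activating" ]]; do
    sleep 5
done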

@to-bar
Contributor

to-bar commented Mar 3, 2021

Fixed in #2096.

@przemyslavic
Collaborator Author

✔️ Fixed.
Now Ansible waits for epiphany-lvm-merge.service to finish.

Mar 03 13:19:02 ci-devazurrhelflannel-repository-vm-0 epiphany-lvm-merge.sh[1876]: *** Finished script: /usr/local/sbin/epiphany-lvm-merge.sh at 13:19:02.516041465
Mar 03 13:19:02 ci-devazurrhelflannel-repository-vm-0 epiphany-lvm-merge.sh[1876]: *** Elapsed time: 0min 45s
13:18:55 INFO cli.engine.ansible.AnsibleCommand - TASK [preflight : Check if epiphany-lvm-merge.service exists] ******************
13:18:57 INFO cli.engine.ansible.AnsibleCommand - ok: [ci-devazurrhelflannel-logging-vm-0]
13:19:00 INFO cli.engine.ansible.AnsibleCommand - ok: [ci-devazurrhelflannel-repository-vm-0]

13:19:00 INFO cli.engine.ansible.AnsibleCommand - TASK [preflight : Wait for epiphany-lvm-merge.service to finish] ***************
13:19:03 INFO cli.engine.ansible.AnsibleCommand - ok: [ci-devazurrhelflannel-repository-vm-0]
13:19:03 INFO cli.engine.ansible.AnsibleCommand - ok: [ci-devazurrhelflannel-logging-vm-0]

Tested with the image RedHat:RHEL:7-LVM:7.9.2020111202 and the following config (had to increase the disk size to 64GB).

---
kind: infrastructure/virtual-machine
name: repository-machine-rhel
provider: azure
based_on: repository-machine
specification:
  storage_image_reference:
    publisher: $(image_publisher)
    offer: $(image_offer)
    sku: $(image_sku)
    version: $(image_version)
  storage_os_disk:
    disk_size_gb: 64

Partitioning for a 64GB disk is as follows:

[operations@ci-devazurrhelflannel-repository-vm-0 ~]$ df -h
Filesystem                 Size  Used Avail Use% Mounted on
devtmpfs                   1.6G     0  1.6G   0% /dev
tmpfs                      1.7G     0  1.7G   0% /dev/shm
tmpfs                      1.7G  9.2M  1.6G   1% /run
tmpfs                      1.7G     0  1.7G   0% /sys/fs/cgroup
/dev/mapper/rootvg-rootlv   54G  1.4G   52G   3% /
/dev/mapper/rootvg-usrlv    10G  1.4G  8.7G  14% /usr
/dev/sda2                  494M   76M  418M  16% /boot
/dev/sda1                  500M  9.9M  490M   2% /boot/efi
/dev/sdb1                  6.8G   32M  6.4G   1% /mnt
tmpfs                      329M     0  329M   0% /run/user/1001

@mkyc mkyc closed this as completed Mar 11, 2021