
not lowered: aten::_linalg_eigh #6017

Closed
mfatih7 opened this issue Dec 4, 2023 · 14 comments · Fixed by #7674


@mfatih7

mfatih7 commented Dec 4, 2023

Hello

I am getting the error below during training:

pt-xla-profiler: TransferFromServerTime too frequent: 132 counts during 1 steps
pt-xla-profiler: Op(s) not lowered: aten::_linalg_eigh,  Please open a GitHub issue with the above op lowering requests.

best regards

@JackCaoG
Collaborator

JackCaoG commented Dec 4, 2023

@wonjoolee95 do you know if this is one of the core aten ops?

@wonjoolee95
Collaborator

It is not part of the core aten ops, but we can find someone to work on this. Probably will be after the 2.2 release though.

@mfatih7
Author

mfatih7 commented Dec 14, 2023

Hello @wonjoolee95

Thank you for the answer.

When will the 2.2 release be?

Can the other problems (#6002, #6048) be related to this lowering issue?

Because during execution the problems occur near the torch.linalg.eigh() call.

Is there anything I can do to help?

@mfatih7
Author

mfatih7 commented Dec 15, 2023

Hello @wonjoolee95

I solved #6048 and it is not related to torch.linalg.eigh().

@wonjoolee95
Collaborator

wonjoolee95 commented Dec 15, 2023

Thanks for the update. To answer your questions:

When will the 2.2 release be?

PyTorch's 2.2 release is set to be on January 11th (if I recall correctly), and PyTorch/XLA's release should follow shortly after.

Can the other problems (#6002, #6048) be related to this lowering issue?

I haven't looked at these two issues deeply yet, but an unlowered op most likely will not cause a crash since it would just fall back to CPU. Glad that you found the solution to one of the issues already.

Is there anything I can do to help?

If you feel comfortable lowering the op yourself, that would be very much appreciated, so feel free to submit a PR! We have some guides on op lowering, such as https://github.com/pytorch/xla/blob/master/OP_LOWERING_GUIDE.md.

@mfatih7
Author

mfatih7 commented Dec 24, 2023

Hello

Even though there is a lowering problem with _linalg_eigh, I was still able to run my experiments.
But while training one of my network architectures I noticed that the training got stuck.
I switched from multi-core to single-core to observe more details,
and after a few batches were processed I got the error below.

_, v = torch.linalg.eigh(X[batch_idx,:,:].squeeze(), UPLO='L' ) # if upper else 'L'
torch._C._LinAlgError: linalg.eigh: The algorithm failed to converge because the input
matrix is ill-conditioned or has too many repeated eigenvalues (error code: 8).

Is this another error for _linalg_eigh?
I do not observe any errors regarding this function on GPUs.
I can provide a repo if needed.
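For context, the convergence failure above is a general numerical property of eigh on ill-conditioned symmetric inputs, not something specific to XLA. A common workaround, sketched here with NumPy since the issue is backend-independent (the helper name and jitter value are illustrative, not from this thread), is to symmetrize the input and add a small diagonal jitter before decomposing:

```python
import numpy as np

def robust_eigh(X, jitter=1e-6):
    """Symmetrize X and add a small diagonal jitter before eigh.

    This is a common numerical workaround, not a torch_xla fix.
    Adding jitter * I shifts every eigenvalue by the same amount
    (eigenvectors are unchanged), which moves near-zero eigenvalues
    away from zero and improves the conditioning of the problem.
    """
    X = 0.5 * (X + X.T)                      # enforce exact symmetry
    X = X + jitter * np.eye(X.shape[0])      # improve conditioning
    return np.linalg.eigh(X)                 # eigenvalues in ascending order

# A rank-deficient symmetric matrix: plain eigh sees eigenvalues {0, 2}
A = np.array([[1.0, 1.0], [1.0, 1.0]])
w, v = robust_eigh(A)
```

In torch the same idea applies (symmetrize with `0.5 * (X + X.mT)` and add `jitter * torch.eye(n)`); whether it avoids the TPU convergence error for this particular input would need to be verified on the actual workload.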

@wonjoolee95
Collaborator

Hey @mfatih7, aten::_linalg_eigh isn't lowered in PyTorch/XLA, so this behavior seems to come from how the upstream torch.linalg.eigh is implemented. But since it only fails on TPU (and not on GPU), if you can provide the repro code, we can still try to take a quick look.

@mfatih7
Author

mfatih7 commented Jan 12, 2024

Hello @wonjoolee95

Here is the repo.

Just run the file for a single-core TPU run.

If you want to run multi-core, change the selection on the lines.

I observe that both single-core and multi-core runs get stuck. (No accuracy lines are printed to the terminal.)

I did not observe the error above during the last couple of single-core runs.
The error print may also be non-deterministic.

I am ready to modify the repo if the current version does not help with debugging.

@mfatih7
Author

mfatih7 commented Jul 8, 2024

Hello all

Is there any opportunity for the lowering of aten::_linalg_eigh now?

@miladm miladm assigned miladm and unassigned miladm Jul 8, 2024
@miladm
Collaborator

miladm commented Jul 8, 2024

Thanks for bringing this up @mfatih7!
@vanbasten23 and Yifei to help with this lowering.

@vanbasten23 vanbasten23 removed their assignment Jul 9, 2024
@vanbasten23
Collaborator

@tengyifei is working on this.

@mfatih7
Author

mfatih7 commented Jul 13, 2024

Hello

In order to check your update, I am trying to install the nightly packages into a Python 3.10 environment on a TPUv4 VM.
I am following your instructions here and running the script below:

#!/bin/bash

# Update and install necessary packages
sudo apt update
sudo apt install -y software-properties-common

# Add the deadsnakes PPA
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update

# Install Python 3.10 and necessary packages
sudo apt install -y python3.10 python3.10-venv python3.10-dev

# Create a directory for the virtual environment
VENV_DIR=~/env3_10

# Remove existing virtual environment if it exists
if [ -d "$VENV_DIR" ]; then
    rm -rf "$VENV_DIR"
fi

# Create the virtual environment using Python 3.10
python3.10 -m venv "$VENV_DIR"

# Activate the virtual environment
source "$VENV_DIR/bin/activate"

# Upgrade pip
pip install --upgrade pip

# Install the nightly versions of PyTorch, torchvision, and torch_xla
pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cpu
pip3 install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-nightly-cp310-cp310-linux_x86_64.whl

# Verify the installations
echo "Installed torch version: $(python -c 'import torch; print(torch.__version__)')"
echo "Installed torchvision version: $(python -c 'import torchvision; print(torchvision.__version__)')"
echo "Installed torch_xla version: $(python -c 'import torch_xla; print(torch_xla.__version__)')"

echo "Setup complete. Virtual environment created at $VENV_DIR."

cd
pip list
echo "All Packages Are Installed"

But I get the error below:

ERROR: Invalid requirement: 'torch-xla==nightly': Expected end or semicolon (after name and no valid version specifier)
    torch-xla==nightly
             ^

What should I do?

@JackCaoG
Collaborator

check the discussion in #7622 (comment) maybe?

@mfatih7
Author

mfatih7 commented Aug 9, 2024

Hello @tengyifei, @JackCaoG

I could finally test torch.linalg.eigh() lowering on my setup.

Using the same num_workers, training in an environment (Python 3.10) with the current nightly versions of torch, torchvision, and torch_xla is 17% faster compared to my old environment (Python 3.8, 20240229 nightly).
I think the lowering of torch.linalg.eigh() is functioning correctly, since the learning characteristics are the same.

I will also test with an increased num_workers to maximize the data-feeding rate.
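Beyond comparing learning curves, one way to sanity-check that an eigh implementation behaves correctly is to verify the decomposition identities directly: the input should be reconstructed by V diag(w) Vᵀ, and the eigenvector matrix should be orthonormal. A minimal sketch in NumPy (the helper name and tolerances are illustrative; for an XLA run you would apply the same checks to the tensors moved back to CPU):

```python
import numpy as np

def check_eigh(A, atol=1e-6):
    """Verify the two defining identities of a symmetric eigendecomposition."""
    w, v = np.linalg.eigh(A)  # ascending eigenvalues, eigenvectors as columns
    recon_ok = np.allclose(v @ np.diag(w) @ v.T, A, atol=atol)  # A == V diag(w) V^T
    ortho_ok = np.allclose(v.T @ v, np.eye(A.shape[0]), atol=atol)  # V^T V == I
    return recon_ok and ortho_ok

# Symmetric positive semi-definite test input
rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
S = M @ M.T
```

Note that eigenvectors are only defined up to sign (and up to rotation within repeated-eigenvalue subspaces), so comparing eigenvector matrices element-wise across backends is not a valid check; the reconstruction test above is.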
