
not lowered: aten::_linalg_eigh #6017

Closed
mfatih7 opened this issue Dec 4, 2023 · 14 comments · Fixed by #7674


@mfatih7

mfatih7 commented Dec 4, 2023

Hello

I am getting the error below during training:

pt-xla-profiler: TransferFromServerTime too frequent: 132 counts during 1 steps
pt-xla-profiler: Op(s) not lowered: aten::_linalg_eigh,  Please open a GitHub issue with the above op lowering requests.

best regards

@JackCaoG
Collaborator

JackCaoG commented Dec 4, 2023

@wonjoolee95 do you know if this is one of the core aten ops?

@wonjoolee95
Collaborator

It is not part of the core aten ops, but we can find someone to work on this. Probably will be after the 2.2 release though.

@mfatih7
Author

mfatih7 commented Dec 14, 2023

Hello @wonjoolee95

Thank you for the answer.

When will the 2.2 release be?

Can the other problems (#6002, #6048) be related to this lowering issue?

Because during execution the problems occur near the torch.linalg.eigh() call.

Is there anything I can do to help?

@mfatih7
Author

mfatih7 commented Dec 15, 2023

Hello @wonjoolee95

I solved #6048 and it is not related to torch.linalg.eigh().

@wonjoolee95
Collaborator

wonjoolee95 commented Dec 15, 2023

Thanks for the update. To answer your questions:

When will the 2.2 release be?

PyTorch's 2.2 release is set to be on January 11th (if I recall correctly), and PyTorch/XLA's release should follow shortly after.

Can the other problems (#6002, #6048) be related to this lowering issue?

I haven't looked at these two issues deeply yet, but an unlowered op most likely will not cause a crash since it would just fall back to CPU. Glad that you found the solution to one of the issues already.

Is there anything I can do to help?

If you feel comfortable lowering the op yourself, that would be very much appreciated, so feel free to submit a PR! We have some guides on op lowering, such as https://github.com/pytorch/xla/blob/master/OP_LOWERING_GUIDE.md.

@mfatih7
Author

mfatih7 commented Dec 24, 2023

Hello

Even though there is a lowering problem with _linalg_eigh, I was still able to run my experiments.
But while training one of my network architectures I noticed that the training got stuck.
I switched from multi-core to single-core to observe more details,
and after a few batches were processed I got the error below.

_, v = torch.linalg.eigh(X[batch_idx,:,:].squeeze(), UPLO='L' ) # if upper else 'L'
torch._C._LinAlgError: linalg.eigh: The algorithm failed to converge because the input
matrix is ill-conditioned or has too many repeated eigenvalues (error code: 8).

Is this another error for _linalg_eigh?
I do not observe any errors regarding this function on GPUs.
I can provide a repo if needed.
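For context, the convergence failure above is a general numerical property of eigh on ill-conditioned symmetric inputs, not something specific to XLA. A common workaround, sketched here with NumPy since the issue is backend-independent (the helper name and jitter value are illustrative, not from this thread), is to symmetrize the input and add a small diagonal jitter before decomposing:

```python
import numpy as np

def robust_eigh(X, jitter=1e-6):
    """Symmetrize X and add a small diagonal jitter before eigh.

    This is a common numerical workaround, not a torch_xla fix.
    Adding jitter * I shifts every eigenvalue by the same amount
    (eigenvectors are unchanged), which moves near-zero eigenvalues
    away from zero and improves the conditioning of the problem.
    """
    X = 0.5 * (X + X.T)                      # enforce exact symmetry
    X = X + jitter * np.eye(X.shape[0])      # improve conditioning
    return np.linalg.eigh(X)                 # eigenvalues in ascending order

# A rank-deficient symmetric matrix: plain eigh sees eigenvalues {0, 2}
A = np.array([[1.0, 1.0], [1.0, 1.0]])
w, v = robust_eigh(A)
```

In torch the same idea applies (symmetrize with `0.5 * (X + X.mT)` and add `jitter * torch.eye(n)`); whether it avoids the TPU convergence error for this particular input would need to be verified on the actual workload.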

@wonjoolee95
Collaborator

Hey @mfatih7, aten::_linalg_eigh isn't lowered in PyTorch/XLA, so this behavior seems to come from how the upstream torch.linalg.eigh is implemented. But since it only fails on TPU (and not on GPU), if you can provide the repro code, we can still try to take a quick look.

@mfatih7
Author

mfatih7 commented Jan 12, 2024

Hello @wonjoolee95

Here is the repo.

Just run the file for a single-core TPU run.

If you want to run multi-core, change the selection on the lines.

I observe that both single-core and multi-core runs get stuck. (No accuracy lines are printed to the terminal.)

I did not observe the error above during the last couple of single-core runs.
The error print may also be non-deterministic.

I am ready to modify the repo if the current version does not help with debugging.

@mfatih7
Author

mfatih7 commented Jul 8, 2024

Hello all

Is there any opportunity for the lowering of aten::_linalg_eigh now?

@miladm miladm assigned miladm and unassigned miladm Jul 8, 2024
@miladm
Collaborator

miladm commented Jul 8, 2024

Thanks for bringing this up @mfatih7!
@vanbasten23 and Yifei to help with this lowering.

@vanbasten23 vanbasten23 removed their assignment Jul 9, 2024
@vanbasten23
Collaborator

@tengyifei is working on this.

@mfatih7
Author

mfatih7 commented Jul 13, 2024

Hello

In order to check your update, I am trying to install the nightly packages into a Python 3.10 environment on a TPUv4 VM.
I am following your instructions here and running the script below:

#!/bin/bash

# Update and install necessary packages
sudo apt update
sudo apt install -y software-properties-common

# Add the deadsnakes PPA
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update

# Install Python 3.10 and necessary packages
sudo apt install -y python3.10 python3.10-venv python3.10-dev

# Create a directory for the virtual environment
VENV_DIR=~/env3_10

# Remove existing virtual environment if it exists
if [ -d "$VENV_DIR" ]; then
    rm -rf "$VENV_DIR"
fi

# Create the virtual environment using Python 3.10
python3.10 -m venv "$VENV_DIR"

# Activate the virtual environment
source "$VENV_DIR/bin/activate"

# Upgrade pip
pip install --upgrade pip

# Install the nightly versions of PyTorch, torchvision, and torch_xla
pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cpu
pip3 install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-nightly-cp310-cp310-linux_x86_64.whl

# Verify the installations
echo "Installed torch version: $(python -c 'import torch; print(torch.__version__)')"
echo "Installed torchvision version: $(python -c 'import torchvision; print(torchvision.__version__)')"
echo "Installed torch_xla version: $(python -c 'import torch_xla; print(torch_xla.__version__)')"

echo "Setup complete. Virtual environment created at $VENV_DIR."

cd
pip list
echo "All Packages Are Installed"

But I get the error below:

ERROR: Invalid requirement: 'torch-xla==nightly': Expected end or semicolon (after name and no valid version specifier)
    torch-xla==nightly
             ^

What should I do?

@JackCaoG
Collaborator

check the discussion in #7622 (comment) maybe?

@mfatih7
Author

mfatih7 commented Aug 9, 2024

Hello @tengyifei, @JackCaoG

I could finally test torch.linalg.eigh() lowering on my setup.

Using the same num_workers, training in an environment (Python 3.10) with the current nightly versions of torch, torchvision, and torch_xla is 17% faster compared to my old environment (Python 3.8, 20240229 nightly).
I think the lowering of torch.linalg.eigh() is functioning correctly, since the learning characteristics are the same.

I will also test with an increased num_workers to maximize the data-feeding rate.
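Beyond comparing learning curves, one way to sanity-check that an eigh implementation behaves correctly is to verify the decomposition identities directly: the input should be reconstructed by V diag(w) Vᵀ, and the eigenvector matrix should be orthonormal. A minimal sketch in NumPy (the helper name and tolerances are illustrative; for an XLA run you would apply the same checks to the tensors moved back to CPU):

```python
import numpy as np

def check_eigh(A, atol=1e-6):
    """Verify the two defining identities of a symmetric eigendecomposition."""
    w, v = np.linalg.eigh(A)  # ascending eigenvalues, eigenvectors as columns
    recon_ok = np.allclose(v @ np.diag(w) @ v.T, A, atol=atol)  # A == V diag(w) V^T
    ortho_ok = np.allclose(v.T @ v, np.eye(A.shape[0]), atol=atol)  # V^T V == I
    return recon_ok and ortho_ok

# Symmetric positive semi-definite test input
rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
S = M @ M.T
```

Note that eigenvectors are only defined up to sign (and up to rotation within repeated-eigenvalue subspaces), so comparing eigenvector matrices element-wise across backends is not a valid check; the reconstruction test above is.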
