
This project can test whether a GPU is good or bad: if the following happens, your GPU has a problem #392

Closed
chaoqunxie opened this issue Jul 18, 2024 · 6 comments
Labels
bug Something isn't working

Comments


@chaoqunxie chaoqunxie added the bug Something isn't working label Jul 18, 2024
@chaoqunxie
Author

2024-07-18 11:34:41.122 | INFO | main:main:174 | RANK: 0 / 1 - Starting worker
2024-07-18 11:34:41.139 | INFO | main:main:185 | RANK: 0 / 1 - Processing 0/0 files
2024-07-18 11:34:52.680 | INFO | main:get_model:62 | RANK: 0 / 1 - Loaded model
Found 53 files
2024-07-18 11:34:52.683 | INFO | main:main:211 | RANK: 0 / 1 - Finished processing 0 files, 0.00 hours of audio
Loading /exp/data: 53it [00:00, 64044.40it/s]
Grouping /exp/data: 100%|██████████| 53/53 [00:00<00:00, 446.11it/s]
2024-07-18 11:34:55.661 | INFO | main:task_generator_folder:46 - Found 1 groups in /exp/data, ['/exp/data/lyf']...
1it [00:00, 5.50it/s]
2024-07-18 11:34:55.849 | INFO | main:main:165 - Finished writing 1 shards to data/quantized-dataset-ft
2024-07-18 11:34:56.298 | INFO | main:train_process:640 - lora_text2semantic_medium_20240718_113438
2024-07-18 11:34:56.299 | INFO | main:train_process:663 - ['python', 'fish_speech/train.py', '--config-name', 'text2semantic_finetune', 'project=lora_text2semantic_medium_20240718_113438', 'ckpt_path=checkpoints/text2semantic-sft-medium-v1.1-4k.pth', 'trainer.strategy.process_group_backend=nccl', 'model@model.model=dual_ar_2_codebook_medium', 'tokenizer.pretrained_model_name_or_path=checkpoints', "train_dataset.proto_files=['data/quantized-dataset-ft']", "val_dataset.proto_files=['data/quantized-dataset-ft']", 'model.optimizer.lr=1e-5', 'trainer.max_steps=1000', 'data.num_workers=4', 'data.batch_size=8', 'max_length=2048', 'trainer.precision=bf16-true', 'trainer.val_check_interval=100', 'trainer.accumulate_grad_batches=1', 'train_dataset.use_speaker=0.5', '+lora@model.lora_config=r_8_alpha_16']
[2024-07-18 11:34:58,590][main][INFO] - [rank: 0] Instantiating datamodule <fish_speech.datasets.text.TextDataModule>
[2024-07-18 11:34:58,942][datasets][INFO] - PyTorch version 2.3.1+cu121 available.
[2024-07-18 11:34:59,070][main][INFO] - [rank: 0] Instantiating model <fish_speech.models.text2semantic.TextToSemantic>
[2024-07-18 11:35:04,539][main][INFO] - [rank: 0] Instantiating callbacks...
[2024-07-18 11:35:04,539][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating callback <lightning.pytorch.callbacks.ModelCheckpoint>
[2024-07-18 11:35:04,580][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating callback <lightning.pytorch.callbacks.ModelSummary>
[2024-07-18 11:35:04,581][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating callback <lightning.pytorch.callbacks.LearningRateMonitor>
[2024-07-18 11:35:04,582][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating callback <fish_speech.callbacks.GradNormMonitor>
[2024-07-18 11:35:04,589][main][INFO] - [rank: 0] Instantiating loggers...
[2024-07-18 11:35:04,589][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating logger <lightning.pytorch.loggers.tensorboard.TensorBoardLogger>
[2024-07-18 11:35:04,592][main][INFO] - [rank: 0] Instantiating trainer <lightning.pytorch.trainer.Trainer>
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.model_summary.ModelSummary'>]. Skipping setting a default ModelSummary callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[2024-07-18 11:35:04,602][main][INFO] - [rank: 0] Logging hyperparameters!
[2024-07-18 11:35:04,641][main][INFO] - [rank: 0] Starting training!
[2024-07-18 11:35:04,642][main][INFO] - [rank: 0] Resuming from checkpoint: checkpoints/text2semantic-sft-medium-v1.1-4k.pth
[2024-07-18 11:35:04,642][main][INFO] - [rank: 0] Resuming weights only!
[2024-07-18 11:35:09,338][main][INFO] - [rank: 0] Error loading state dict: _IncompatibleKeys(missing_keys=['model.embeddings.lora_A', 'model.embeddings.lora_B', 'model.layers.0.attention.wqkv.lora_A', 'model.layers.0.attention.wqkv.lora_B', 'model.layers.0.attention.wo.lora_A', 'model.layers.0.attention.wo.lora_B', 'model.layers.0.feed_forward.w1.lora_A', 'model.layers.0.feed_forward.w1.lora_B', 'model.layers.0.feed_forward.w3.lora_A', 'model.layers.0.feed_forward.w3.lora_B', 'model.layers.0.feed_forward.w2.lora_A', 'model.layers.0.feed_forward.w2.lora_B', 'model.layers.1.attention.wqkv.lora_A', 'model.layers.1.attention.wqkv.lora_B', 'model.layers.1.attention.wo.lora_A', 'model.layers.1.attention.wo.lora_B', 'model.layers.1.feed_forward.w1.lora_A', 'model.layers.1.feed_forward.w1.lora_B', 'model.layers.1.feed_forward.w3.lora_A', 'model.layers.1.feed_forward.w3.lora_B', 'model.layers.1.feed_forward.w2.lora_A', 'model.layers.1.feed_forward.w2.lora_B', 'model.layers.2.attention.wqkv.lora_A', 'model.layers.2.attention.wqkv.lora_B', 'model.layers.2.attention.wo.lora_A', 'model.layers.2.attention.wo.lora_B', 'model.layers.2.feed_forward.w1.lora_A', 'model.layers.2.feed_forward.w1.lora_B', 'model.layers.2.feed_forward.w3.lora_A', 'model.layers.2.feed_forward.w3.lora_B', 'model.layers.2.feed_forward.w2.lora_A', 'model.layers.2.feed_forward.w2.lora_B', 'model.layers.3.attention.wqkv.lora_A', 'model.layers.3.attention.wqkv.lora_B', 'model.layers.3.attention.wo.lora_A', 'model.layers.3.attention.wo.lora_B', 'model.layers.3.feed_forward.w1.lora_A', 'model.layers.3.feed_forward.w1.lora_B', 'model.layers.3.feed_forward.w3.lora_A', 'model.layers.3.feed_forward.w3.lora_B', 'model.layers.3.feed_forward.w2.lora_A', 'model.layers.3.feed_forward.w2.lora_B', 'model.layers.4.attention.wqkv.lora_A', 'model.layers.4.attention.wqkv.lora_B', 'model.layers.4.attention.wo.lora_A', 'model.layers.4.attention.wo.lora_B', 'model.layers.4.feed_forward.w1.lora_A', 'model.layers.4.feed_forward.w1.lora_B', 'model.layers.4.feed_forward.w3.lora_A', 'model.layers.4.feed_forward.w3.lora_B', 'model.layers.4.feed_forward.w2.lora_A', 'model.layers.4.feed_forward.w2.lora_B', 'model.layers.5.attention.wqkv.lora_A', 'model.layers.5.attention.wqkv.lora_B', 'model.layers.5.attention.wo.lora_A', 'model.layers.5.attention.wo.lora_B', 'model.layers.5.feed_forward.w1.lora_A', 'model.layers.5.feed_forward.w1.lora_B', 'model.layers.5.feed_forward.w3.lora_A', 'model.layers.5.feed_forward.w3.lora_B', 'model.layers.5.feed_forward.w2.lora_A', 'model.layers.5.feed_forward.w2.lora_B', 'model.layers.6.attention.wqkv.lora_A', 'model.layers.6.attention.wqkv.lora_B', 'model.layers.6.attention.wo.lora_A', 'model.layers.6.attention.wo.lora_B', 'model.layers.6.feed_forward.w1.lora_A', 'model.layers.6.feed_forward.w1.lora_B', 'model.layers.6.feed_forward.w3.lora_A', 'model.layers.6.feed_forward.w3.lora_B', 'model.layers.6.feed_forward.w2.lora_A', 'model.layers.6.feed_forward.w2.lora_B', 'model.layers.7.attention.wqkv.lora_A', 'model.layers.7.attention.wqkv.lora_B', 'model.layers.7.attention.wo.lora_A', 'model.layers.7.attention.wo.lora_B', 'model.layers.7.feed_forward.w1.lora_A', 'model.layers.7.feed_forward.w1.lora_B', 'model.layers.7.feed_forward.w3.lora_A', 'model.layers.7.feed_forward.w3.lora_B', 'model.layers.7.feed_forward.w2.lora_A', 'model.layers.7.feed_forward.w2.lora_B', 'model.layers.8.attention.wqkv.lora_A', 'model.layers.8.attention.wqkv.lora_B', 'model.layers.8.attention.wo.lora_A', 'model.layers.8.attention.wo.lora_B', 
'model.layers.8.feed_forward.w1.lora_A', 'model.layers.8.feed_forward.w1.lora_B', 'model.layers.8.feed_forward.w3.lora_A', 'model.layers.8.feed_forward.w3.lora_B', 'model.layers.8.feed_forward.w2.lora_A', 'model.layers.8.feed_forward.w2.lora_B', 'model.layers.9.attention.wqkv.lora_A', 'model.layers.9.attention.wqkv.lora_B', 'model.layers.9.attention.wo.lora_A', 'model.layers.9.attention.wo.lora_B', 'model.layers.9.feed_forward.w1.lora_A', 'model.layers.9.feed_forward.w1.lora_B', 'model.layers.9.feed_forward.w3.lora_A', 'model.layers.9.feed_forward.w3.lora_B', 'model.layers.9.feed_forward.w2.lora_A', 'model.layers.9.feed_forward.w2.lora_B', 'model.layers.10.attention.wqkv.lora_A', 'model.layers.10.attention.wqkv.lora_B', 'model.layers.10.attention.wo.lora_A', 'model.layers.10.attention.wo.lora_B', 'model.layers.10.feed_forward.w1.lora_A', 'model.layers.10.feed_forward.w1.lora_B', 'model.layers.10.feed_forward.w3.lora_A', 'model.layers.10.feed_forward.w3.lora_B', 'model.layers.10.feed_forward.w2.lora_A', 'model.layers.10.feed_forward.w2.lora_B', 'model.layers.11.attention.wqkv.lora_A', 'model.layers.11.attention.wqkv.lora_B', 'model.layers.11.attention.wo.lora_A', 'model.layers.11.attention.wo.lora_B', 'model.layers.11.feed_forward.w1.lora_A', 'model.layers.11.feed_forward.w1.lora_B', 'model.layers.11.feed_forward.w3.lora_A', 'model.layers.11.feed_forward.w3.lora_B', 'model.layers.11.feed_forward.w2.lora_A', 'model.layers.11.feed_forward.w2.lora_B', 'model.layers.12.attention.wqkv.lora_A', 'model.layers.12.attention.wqkv.lora_B', 'model.layers.12.attention.wo.lora_A', 'model.layers.12.attention.wo.lora_B', 'model.layers.12.feed_forward.w1.lora_A', 'model.layers.12.feed_forward.w1.lora_B', 'model.layers.12.feed_forward.w3.lora_A', 'model.layers.12.feed_forward.w3.lora_B', 'model.layers.12.feed_forward.w2.lora_A', 'model.layers.12.feed_forward.w2.lora_B', 'model.layers.13.attention.wqkv.lora_A', 'model.layers.13.attention.wqkv.lora_B', 'model.layers.13.attention.wo.lora_A', 'model.layers.13.attention.wo.lora_B', 'model.layers.13.feed_forward.w1.lora_A', 'model.layers.13.feed_forward.w1.lora_B', 'model.layers.13.feed_forward.w3.lora_A', 'model.layers.13.feed_forward.w3.lora_B', 'model.layers.13.feed_forward.w2.lora_A', 'model.layers.13.feed_forward.w2.lora_B', 'model.layers.14.attention.wqkv.lora_A', 'model.layers.14.attention.wqkv.lora_B', 'model.layers.14.attention.wo.lora_A', 'model.layers.14.attention.wo.lora_B', 'model.layers.14.feed_forward.w1.lora_A', 'model.layers.14.feed_forward.w1.lora_B', 'model.layers.14.feed_forward.w3.lora_A', 'model.layers.14.feed_forward.w3.lora_B', 'model.layers.14.feed_forward.w2.lora_A', 'model.layers.14.feed_forward.w2.lora_B', 'model.layers.15.attention.wqkv.lora_A', 'model.layers.15.attention.wqkv.lora_B', 'model.layers.15.attention.wo.lora_A', 'model.layers.15.attention.wo.lora_B', 'model.layers.15.feed_forward.w1.lora_A', 'model.layers.15.feed_forward.w1.lora_B', 'model.layers.15.feed_forward.w3.lora_A', 'model.layers.15.feed_forward.w3.lora_B', 'model.layers.15.feed_forward.w2.lora_A', 'model.layers.15.feed_forward.w2.lora_B', 'model.layers.16.attention.wqkv.lora_A', 'model.layers.16.attention.wqkv.lora_B', 'model.layers.16.attention.wo.lora_A', 'model.layers.16.attention.wo.lora_B', 'model.layers.16.feed_forward.w1.lora_A', 'model.layers.16.feed_forward.w1.lora_B', 'model.layers.16.feed_forward.w3.lora_A', 'model.layers.16.feed_forward.w3.lora_B', 'model.layers.16.feed_forward.w2.lora_A', 'model.layers.16.feed_forward.w2.lora_B', 
'model.layers.17.attention.wqkv.lora_A', 'model.layers.17.attention.wqkv.lora_B', 'model.layers.17.attention.wo.lora_A', 'model.layers.17.attention.wo.lora_B', 'model.layers.17.feed_forward.w1.lora_A', 'model.layers.17.feed_forward.w1.lora_B', 'model.layers.17.feed_forward.w3.lora_A', 'model.layers.17.feed_forward.w3.lora_B', 'model.layers.17.feed_forward.w2.lora_A', 'model.layers.17.feed_forward.w2.lora_B', 'model.layers.18.attention.wqkv.lora_A', 'model.layers.18.attention.wqkv.lora_B', 'model.layers.18.attention.wo.lora_A', 'model.layers.18.attention.wo.lora_B', 'model.layers.18.feed_forward.w1.lora_A', 'model.layers.18.feed_forward.w1.lora_B', 'model.layers.18.feed_forward.w3.lora_A', 'model.layers.18.feed_forward.w3.lora_B', 'model.layers.18.feed_forward.w2.lora_A', 'model.layers.18.feed_forward.w2.lora_B', 'model.layers.19.attention.wqkv.lora_A', 'model.layers.19.attention.wqkv.lora_B', 'model.layers.19.attention.wo.lora_A', 'model.layers.19.attention.wo.lora_B', 'model.layers.19.feed_forward.w1.lora_A', 'model.layers.19.feed_forward.w1.lora_B', 'model.layers.19.feed_forward.w3.lora_A', 'model.layers.19.feed_forward.w3.lora_B', 'model.layers.19.feed_forward.w2.lora_A', 'model.layers.19.feed_forward.w2.lora_B', 'model.layers.20.attention.wqkv.lora_A', 'model.layers.20.attention.wqkv.lora_B', 'model.layers.20.attention.wo.lora_A', 'model.layers.20.attention.wo.lora_B', 'model.layers.20.feed_forward.w1.lora_A', 'model.layers.20.feed_forward.w1.lora_B', 'model.layers.20.feed_forward.w3.lora_A', 'model.layers.20.feed_forward.w3.lora_B', 'model.layers.20.feed_forward.w2.lora_A', 'model.layers.20.feed_forward.w2.lora_B', 'model.layers.21.attention.wqkv.lora_A', 'model.layers.21.attention.wqkv.lora_B', 'model.layers.21.attention.wo.lora_A', 'model.layers.21.attention.wo.lora_B', 'model.layers.21.feed_forward.w1.lora_A', 'model.layers.21.feed_forward.w1.lora_B', 'model.layers.21.feed_forward.w3.lora_A', 'model.layers.21.feed_forward.w3.lora_B', 'model.layers.21.feed_forward.w2.lora_A', 'model.layers.21.feed_forward.w2.lora_B', 'model.layers.22.attention.wqkv.lora_A', 'model.layers.22.attention.wqkv.lora_B', 'model.layers.22.attention.wo.lora_A', 'model.layers.22.attention.wo.lora_B', 'model.layers.22.feed_forward.w1.lora_A', 'model.layers.22.feed_forward.w1.lora_B', 'model.layers.22.feed_forward.w3.lora_A', 'model.layers.22.feed_forward.w3.lora_B', 'model.layers.22.feed_forward.w2.lora_A', 'model.layers.22.feed_forward.w2.lora_B', 'model.layers.23.attention.wqkv.lora_A', 'model.layers.23.attention.wqkv.lora_B', 'model.layers.23.attention.wo.lora_A', 'model.layers.23.attention.wo.lora_B', 'model.layers.23.feed_forward.w1.lora_A', 'model.layers.23.feed_forward.w1.lora_B', 'model.layers.23.feed_forward.w3.lora_A', 'model.layers.23.feed_forward.w3.lora_B', 'model.layers.23.feed_forward.w2.lora_A', 'model.layers.23.feed_forward.w2.lora_B', 'model.output.lora_A', 'model.output.lora_B', 'model.fast_embeddings.lora_A', 'model.fast_embeddings.lora_B', 'model.fast_layers.0.attention.wqkv.lora_A', 'model.fast_layers.0.attention.wqkv.lora_B', 'model.fast_layers.0.attention.wo.lora_A', 'model.fast_layers.0.attention.wo.lora_B', 'model.fast_layers.0.feed_forward.w1.lora_A', 'model.fast_layers.0.feed_forward.w1.lora_B', 'model.fast_layers.0.feed_forward.w3.lora_A', 'model.fast_layers.0.feed_forward.w3.lora_B', 'model.fast_layers.0.feed_forward.w2.lora_A', 'model.fast_layers.0.feed_forward.w2.lora_B', 'model.fast_layers.1.attention.wqkv.lora_A', 'model.fast_layers.1.attention.wqkv.lora_B', 
'model.fast_layers.1.attention.wo.lora_A', 'model.fast_layers.1.attention.wo.lora_B', 'model.fast_layers.1.feed_forward.w1.lora_A', 'model.fast_layers.1.feed_forward.w1.lora_B', 'model.fast_layers.1.feed_forward.w3.lora_A', 'model.fast_layers.1.feed_forward.w3.lora_B', 'model.fast_layers.1.feed_forward.w2.lora_A', 'model.fast_layers.1.feed_forward.w2.lora_B', 'model.fast_layers.2.attention.wqkv.lora_A', 'model.fast_layers.2.attention.wqkv.lora_B', 'model.fast_layers.2.attention.wo.lora_A', 'model.fast_layers.2.attention.wo.lora_B', 'model.fast_layers.2.feed_forward.w1.lora_A', 'model.fast_layers.2.feed_forward.w1.lora_B', 'model.fast_layers.2.feed_forward.w3.lora_A', 'model.fast_layers.2.feed_forward.w3.lora_B', 'model.fast_layers.2.feed_forward.w2.lora_A', 'model.fast_layers.2.feed_forward.w2.lora_B', 'model.fast_layers.3.attention.wqkv.lora_A', 'model.fast_layers.3.attention.wqkv.lora_B', 'model.fast_layers.3.attention.wo.lora_A', 'model.fast_layers.3.attention.wo.lora_B', 'model.fast_layers.3.feed_forward.w1.lora_A', 'model.fast_layers.3.feed_forward.w1.lora_B', 'model.fast_layers.3.feed_forward.w3.lora_A', 'model.fast_layers.3.feed_forward.w3.lora_B', 'model.fast_layers.3.feed_forward.w2.lora_A', 'model.fast_layers.3.feed_forward.w2.lora_B', 'model.fast_layers.4.attention.wqkv.lora_A', 'model.fast_layers.4.attention.wqkv.lora_B', 'model.fast_layers.4.attention.wo.lora_A', 'model.fast_layers.4.attention.wo.lora_B', 'model.fast_layers.4.feed_forward.w1.lora_A', 'model.fast_layers.4.feed_forward.w1.lora_B', 'model.fast_layers.4.feed_forward.w3.lora_A', 'model.fast_layers.4.feed_forward.w3.lora_B', 'model.fast_layers.4.feed_forward.w2.lora_A', 'model.fast_layers.4.feed_forward.w2.lora_B', 'model.fast_layers.5.attention.wqkv.lora_A', 'model.fast_layers.5.attention.wqkv.lora_B', 'model.fast_layers.5.attention.wo.lora_A', 'model.fast_layers.5.attention.wo.lora_B', 'model.fast_layers.5.feed_forward.w1.lora_A', 'model.fast_layers.5.feed_forward.w1.lora_B', 'model.fast_layers.5.feed_forward.w3.lora_A', 'model.fast_layers.5.feed_forward.w3.lora_B', 'model.fast_layers.5.feed_forward.w2.lora_A', 'model.fast_layers.5.feed_forward.w2.lora_B', 'model.fast_output.lora_A', 'model.fast_output.lora_B'], unexpected_keys=[])
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[2024-07-18 11:35:09,813][fish_speech.models.text2semantic.lit_module][INFO] - [rank: 0] Set weight decay: 0 for 459 parameters
[2024-07-18 11:35:09,814][fish_speech.models.text2semantic.lit_module][INFO] - [rank: 0] Set weight decay: 0.0 for 65 parameters

| Name | Type | Params | Mode

0 | model | DualARTransformer | 394 M | train
1 | model.embeddings | Embedding | 2.4 M | train
2 | model.layers | ModuleList | 311 M | train
3 | model.norm | RMSNorm | 1.0 K | train
4 | model.output | Linear | 280 K | train
5 | model.fast_embeddings | Embedding | 1.1 M | train
6 | model.fast_layers | ModuleList | 77.9 M | train
7 | model.fast_norm | RMSNorm | 1.0 K | train
8 | model.fast_output | Linear | 1.1 M | train

4.3 M Trainable params
390 M Non-trainable params
394 M Total params
1,577.969 Total estimated model params size (MB)
Sanity Checking: | | 0/? [00:00<?, ?it/s][2024-07-18 11:35:10,005][fish_speech.datasets.text][INFO] - [rank: 0] Reading 2 / 1 files
[2024-07-18 11:35:10,006][fish_speech.datasets.text][INFO] - [rank: 0] Read total 2 groups of data
[2024-07-18 11:35:10,007][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 11:35:10,008][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 11:35:10,016][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 11:35:10,017][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 11:35:10,021][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 11:35:10,021][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 11:35:15,341][fish_speech.datasets.text][INFO] - [rank: 0] Reading 2 / 1 files
[2024-07-18 11:35:15,360][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 11:35:15,361][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 11:35:15,361][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 11:35:15,373][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 11:35:15,373][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 11:35:15,373][fish_speech.datasets.text][INFO] - [rank: 0] Read total 2 groups of data
[2024-07-18 11:35:15,373][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data

Epoch 0: | | 100/? [16:22<00:00, 0.10it/s, v_num=0, train/loss=5.280, train/top_5_accuracy=0.192]
[rank0]:[W CUDAGuardImpl.h:118] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8aa5d7a897 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8aa5d2ab25 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8aa6123718 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1d8d6 (0x7f8aa60ee8d6 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x1f5e3 (0x7f8aa60f05e3 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x1f922 (0x7f8aa60f0922 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: + 0x5a5950 (0x7f8aa4686950 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x6a36f (0x7f8aa5d5f36f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f8aa5d581cb in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f8aa5d58379 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: c10d::Reducer::~Reducer() + 0x5c4 (0x7f8a922c69d4 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #11: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f8aa4dc9552 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7f8aa4551788 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #13: + 0xcec001 (0x7f8aa4dcd001 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #14: + 0x47b773 (0x7f8aa455c773 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #15: + 0x47c6f1 (0x7f8aa455d6f1 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
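
The abort above is reported asynchronously, so the stack trace points at teardown code (the DDP Reducer destructor) rather than the kernel that actually faulted. As the log itself suggests, rerunning with CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous so the error surfaces at the faulting call. A minimal soak test in that spirit, as a hypothetical standalone script (gpu_soak.py is not part of fish-speech), assuming a single visible CUDA device:

# gpu_soak.py: hypothetical standalone check, not part of fish-speech.
# Run as: CUDA_LAUNCH_BLOCKING=1 python gpu_soak.py
import time
import torch

def soak(minutes: float = 5.0, n: int = 4096) -> None:
    assert torch.cuda.is_available(), "CUDA not visible to PyTorch"
    a = torch.randn(n, n, device="cuda:0")
    b = torch.randn(n, n, device="cuda:0")
    deadline = time.monotonic() + minutes * 60
    steps = 0
    while time.monotonic() < deadline:
        c = a @ b
        torch.cuda.synchronize()  # force async kernel errors to surface here
        if not torch.isfinite(c).all():
            raise RuntimeError(f"non-finite result at step {steps}")
        steps += 1
    print(f"OK: {steps} checked matmuls with no CUDA error")

if __name__ == "__main__":
    soak()

A healthy card runs this indefinitely; a card with failing VRAM or power delivery will typically hit the same "illegal memory access" within minutes.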

@chaoqunxie
Author

[screenshot]
The training process suddenly turned into a zombie process and then disappeared.

@chaoqunxie
Author

Next, I clicked train again:
/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
[2024-07-18 12:04:28,841][main][INFO] - [rank: 0] Instantiating datamodule <fish_speech.datasets.text.TextDataModule>
[2024-07-18 12:04:29,011][datasets][INFO] - PyTorch version 2.3.1+cu121 available.
[2024-07-18 12:04:29,102][main][INFO] - [rank: 0] Instantiating model <fish_speech.models.text2semantic.TextToSemantic>
[2024-07-18 12:04:34,471][main][INFO] - [rank: 0] Instantiating callbacks...
[2024-07-18 12:04:34,471][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating callback <lightning.pytorch.callbacks.ModelCheckpoint>
[2024-07-18 12:04:34,474][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating callback <lightning.pytorch.callbacks.ModelSummary>
[2024-07-18 12:04:34,475][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating callback <lightning.pytorch.callbacks.LearningRateMonitor>
[2024-07-18 12:04:34,475][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating callback <fish_speech.callbacks.GradNormMonitor>
[2024-07-18 12:04:34,476][main][INFO] - [rank: 0] Instantiating loggers...
[2024-07-18 12:04:34,476][fish_speech.utils.instantiators][INFO] - [rank: 0] Instantiating logger <lightning.pytorch.loggers.tensorboard.TensorBoardLogger>
[2024-07-18 12:04:34,478][main][INFO] - [rank: 0] Instantiating trainer <lightning.pytorch.trainer.Trainer>
[2024-07-18 12:04:34,482][fish_speech.utils.utils][ERROR] - [rank: 0]
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 92, in _call_target
return target(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/utilities/argparse.py", line 70, in insert_env_defaults
return fn(self, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 400, in init
self._accelerator_connector = _AcceleratorConnector(
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/accelerator_connector.py", line 143, in init
self._accelerator_flag = self._choose_gpu_accelerator_backend()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/accelerator_connector.py", line 353, in _choose_gpu_accelerator_backend
raise MisconfigurationException("No supported gpu backend found!")
lightning.fabric.utilities.exceptions.MisconfigurationException: No supported gpu backend found!

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/exp/fish_speech/utils/utils.py", line 66, in wrap
metric_dict, object_dict = task_func(cfg=cfg)
File "/exp/fish_speech/train.py", line 65, in train
trainer: Trainer = hydra.utils.instantiate(
File "/usr/local/lib/python3.10/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 226, in instantiate
return instantiate_node(
File "/usr/local/lib/python3.10/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 347, in instantiate_node
return _call_target(target, partial, args, kwargs, full_key)
File "/usr/local/lib/python3.10/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 97, in _call_target
raise InstantiationException(msg) from e
hydra.errors.InstantiationException: Error in call to target 'lightning.pytorch.trainer.trainer.Trainer':
MisconfigurationException('No supported gpu backend found!')
full_key: trainer
[2024-07-18 12:04:34,484][fish_speech.utils.utils][INFO] - [rank: 0] Output dir: results/lora_text2semantic_medium_20240718_120416
Error executing job with overrides: ['project=lora_text2semantic_medium_20240718_120416', 'ckpt_path=checkpoints/text2semantic-sft-medium-v1.1-4k.pth', 'trainer.strategy.process_group_backend=nccl', 'model@model.model=dual_ar_2_codebook_medium', 'tokenizer.pretrained_model_name_or_path=checkpoints', "train_dataset.proto_files=['data/quantized-dataset-ft']", "val_dataset.proto_files=['data/quantized-dataset-ft']", 'model.optimizer.lr=1e-5', 'trainer.max_steps=1000', 'data.num_workers=4', 'data.batch_size=8', 'max_length=2048', 'trainer.precision=bf16-true', 'trainer.val_check_interval=100', 'trainer.accumulate_grad_batches=1', 'train_dataset.use_speaker=0.5', '+lora@model.lora_config=r_8_alpha_16']
Error in call to target 'lightning.pytorch.trainer.trainer.Trainer':
MisconfigurationException('No supported gpu backend found!')
full_key: trainer

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Now the machine has to be rebooted.
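
The "No supported gpu backend found!" failure means a fresh process can no longer see a CUDA device at all, which together with the "Can't initialize NVML" warning suggests the earlier crash left the driver or device in a bad state that only a driver reload or reboot clears. A quick probe to run before clicking train again, as a hypothetical helper script (cuda_probe.py, not part of fish-speech):

# cuda_probe.py: hypothetical helper, not part of fish-speech.
import torch

if not torch.cuda.is_available():
    # Same condition behind Lightning's "No supported gpu backend found!".
    raise SystemExit("CUDA unavailable: reload the NVIDIA driver or reboot")

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
    # A tiny allocation plus sync catches devices that enumerate but fault on use.
    torch.ones(1, device=f"cuda:{i}").sum().item()
    torch.cuda.synchronize(i)
print("CUDA devices respond")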

@chaoqunxie
Author

nvidia-smi
Thu Jul 18 20:07:24 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 2080 Ti Off | 00000000:01:00.0 Off | N/A |
|ERR! 51C P2 ERR! / 250W | 2289MiB / 22528MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 9725 C python3 2286MiB |
+-----------------------------------------------------------------------------------------+
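
The ERR! readings in the Fan and Pwr:Usage columns mean the driver can no longer read those sensors, which on a single consumer card usually points to a hardware fault rather than a software one. The same sensors can be polled through NVML; below is a sketch assuming the nvidia-ml-py package is installed (import name pynvml). On a failing card, the queries that nvidia-smi shows as ERR! typically raise NVMLError:

# nvml_probe.py: hypothetical sensor probe; assumes pip install nvidia-ml-py.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        checks = {
            "temperature (C)": lambda h=handle: pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU),
            "power draw (W)": lambda h=handle: pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0,
            "fan speed (%)": lambda h=handle: pynvml.nvmlDeviceGetFanSpeed(h),
        }
        for label, query in checks.items():
            try:
                print(f"GPU {i} ({name}) {label}: {query()}")
            except pynvml.NVMLError as exc:
                # These are the fields nvidia-smi prints as ERR! above.
                print(f"GPU {i} ({name}) {label}: sensor read failed ({exc})")
finally:
    pynvml.nvmlShutdown()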

@chaoqunxie
Author

[screenshot]

@chaoqunxie
Author

chaoqunxie commented Jul 19, 2024

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[2024-07-18 15:45:44,275][fish_speech.models.text2semantic.lit_module][INFO] - [rank: 0] Set weight decay: 0 for 459 parameters
[2024-07-18 15:45:44,275][fish_speech.models.text2semantic.lit_module][INFO] - [rank: 0] Set weight decay: 0.0 for 65 parameters

| Name | Type | Params | Mode

0 | model | DualARTransformer | 394 M | train
1 | model.embeddings | Embedding | 2.4 M | train
2 | model.layers | ModuleList | 311 M | train
3 | model.norm | RMSNorm | 1.0 K | train
4 | model.output | Linear | 280 K | train
5 | model.fast_embeddings | Embedding | 1.1 M | train
6 | model.fast_layers | ModuleList | 77.9 M | train
7 | model.fast_norm | RMSNorm | 1.0 K | train
8 | model.fast_output | Linear | 1.1 M | train

4.3 M Trainable params
390 M Non-trainable params
394 M Total params
1,577.969 Total estimated model params size (MB)
Sanity Checking: | | 0/? [00:00<?, ?it/s][2024-07-18 15:45:44,446][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 15:45:44,446][fish_speech.datasets.text][INFO] - [rank: 0] Reading 2 / 1 files
[2024-07-18 15:45:44,447][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 15:45:44,447][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 15:45:44,447][fish_speech.datasets.text][INFO] - [rank: 0] Read total 2 groups of data
[2024-07-18 15:45:44,447][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 15:45:44,448][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 15:45:44,448][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 15:45:44,448][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 15:45:44,448][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 15:45:44,450][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 15:45:44,451][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 15:45:44,452][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 15:45:44,452][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 15:45:44,452][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 15:45:44,452][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 15:45:47,450][fish_speech.datasets.text][INFO] - [rank: 0] Reading 2 / 1 files
[2024-07-18 15:45:47,451][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 15:45:47,451][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 15:45:47,452][fish_speech.datasets.text][INFO] - [rank: 0] Read total 2 groups of data
[2024-07-18 15:45:47,458][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 15:45:47,458][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 15:45:47,458][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 15:45:47,459][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 15:45:47,461][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 15:45:47,461][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 15:45:47,461][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 15:45:47,461][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 15:45:47,464][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 15:45:47,464][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-18 15:45:47,508][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-18 15:45:47,508][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4941b7a897 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4941b2ab25 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4941f06718 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f48f58598e6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f48f585d9e8 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f48f586305c in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f48f5863dcc in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd44a3 (0x7f49412ba4a3 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x89134 (0x7f4943751134 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: __clone + 0x40 (0x7f49437d0a40 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 0] Process group watchdog thread terminated with exception: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4941b7a897 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4941b2ab25 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4941f06718 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f48f58598e6 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f48f585d9e8 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f48f586305c in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f48f5863dcc in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd44a3 (0x7f49412ba4a3 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x89134 (0x7f4943751134 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: __clone + 0x40 (0x7f49437d0a40 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4941b7a897 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32119 (0x7f48f54e7119 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd44a3 (0x7f49412ba4a3 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x89134 (0x7f4943751134 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: __clone + 0x40 (0x7f49437d0a40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
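
An illegal memory access in one run and an unspecified launch failure in the next, at random points with an identical config, is the pattern of failing hardware rather than a bug in the training code. A plain VRAM write/readback pass can help confirm bad memory; a minimal hypothetical sketch (vram_check.py, not part of this repo), assuming a single CUDA device:

# vram_check.py: hypothetical VRAM write/readback test, not part of fish-speech.
import torch

def vram_pattern_test(device: str = "cuda:0", chunk_mb: int = 256, rounds: int = 8) -> None:
    n = chunk_mb * 1024 * 1024 // 4  # float32 elements per chunk
    for r in range(rounds):
        pattern = torch.full((n,), float(r), device=device)  # write a known value
        torch.cuda.synchronize()
        corrupted = int((pattern.cpu() != float(r)).sum())  # read back and verify
        if corrupted:
            raise RuntimeError(f"round {r}: {corrupted} corrupted elements, VRAM likely faulty")
    print(f"OK: {rounds} rounds of {chunk_mb} MiB written and verified")

if __name__ == "__main__":
    vram_pattern_test()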

@chaoqunxie chaoqunxie changed the title from "Good grief, it straight-up broke the second-hand card I bought [BUG]" to "This project can test whether a GPU is good or bad: if the following happens, your GPU has a problem" Jul 24, 2024