How to deal with the problem when starting data preprocess? #4

Open
liusfore opened this issue Jun 15, 2023 · 3 comments
Comments

@liusfore

(dyMEAN) dell@dell-Precision-7920-Tower:/mnt/e/code/dyMEAN$ bash scripts/data_preprocess.sh all_structures/imgt all_data
Locate the project folder at /mnt/e/code/dyMEAN
Processing SAbDab with output directory /mnt/e/code/dyMEAN/all_data
Processing RAbD with output directory /mnt/e/code/dyMEAN/all_data/RAbD
2023-06-15 15:59:18::INFO::Namespace(fout='/mnt/e/code/dyMEAN/all_data/rabd_all.json', n_cpu=4, numbering='imgt', pdb_dir='/mnt/e/code/dyMEAN/all_structures/imgt', pre_numbered=True, summary='/mnt/e/code/dyMEAN/all_data/sabdab_all.json', type='rabd')
2023-06-15 15:59:18::INFO::download rabd from summary file /mnt/e/code/dyMEAN/all_data/sabdab_all.json
2023-06-15 15:59:18::INFO::Extracting summary to json format
Traceback (most recent call last):
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/mnt/e/code/dyMEAN/data/download.py", line 376, in <module>
main(parse())
File "/mnt/e/code/dyMEAN/data/download.py", line 360, in main
items = read_rabd(fpath)
File "/mnt/e/code/dyMEAN/data/download.py", line 94, in read_rabd
with open(fpath, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/e/code/dyMEAN/all_data/sabdab_all.json'
Traceback (most recent call last):
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/mnt/e/code/dyMEAN/data/split.py", line 249, in <module>
main(parse())
File "/mnt/e/code/dyMEAN/data/split.py", line 72, in main
items = load_file(args.data)
File "/mnt/e/code/dyMEAN/data/split.py", line 37, in load_file
with open(fpath, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/e/code/dyMEAN/all_data/sabdab_all.json'

@kxz18
Collaborator

kxz18 commented Jun 15, 2023

Hi, sorry for the mistake in data_preprocess.sh. I accidentally commented out the processing logic for SAbDab, so sabdab_all.json was never generated. I've now uncommented it. Could you please run the script again and check whether the problem is solved? It should be fine now.
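
The first log shows the later steps (RAbD extraction, splitting) all failing with the same FileNotFoundError because sabdab_all.json was never written. A minimal sketch of a fail-fast check, assuming a hypothetical helper `require_file` that is not part of the dyMEAN code base, could surface the root cause once instead of repeating it in every downstream step:

```python
import os


def require_file(path: str, produced_by: str) -> str:
    """Return path if it exists; otherwise raise an error naming the step that should create it.

    Hypothetical helper for illustration only; not taken from the dyMEAN repository.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"{path} is missing; it should have been written by the {produced_by} step. "
            "Check that step's output before running later stages."
        )
    return path
```

For example, calling `require_file("all_data/sabdab_all.json", "SAbDab")` before the RAbD step would stop the run with a single message pointing at the SAbDab stage.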

@liusfore
Author

It seems the bug still occurs: the file sabdab_all.json has not been generated.
(dyMEAN) dell@dell-Precision-7920-Tower:/mnt/e/code/dyMEAN$ bash scripts/data_preprocess.sh all_structures/imgt all_data
Locate the project folder at /mnt/e/code/dyMEAN
Processing SAbDab with output directory /mnt/e/code/dyMEAN/all_data
2023-06-16 11:17:59::INFO::Namespace(fout='/mnt/e/code/dyMEAN/all_data/sabdab_all.json', n_cpu=4, numbering='imgt', pdb_dir='/mnt/e/code/dyMEAN/all_structures/imgt', pre_numbered=True, summary='summaries/sabdab_summary.tsv', type='sabdab')
2023-06-16 11:17:59::INFO::download sabdab from summary file summaries/sabdab_summary.tsv
2023-06-16 11:17:59::INFO::Extracting summary to json format
2023-06-16 11:18:00::INFO::Start downloading pdbs in the summary
2023-06-16 11:18:00::INFO::using local PDB files: /mnt/e/code/dyMEAN/all_structures/imgt
2023-06-16 11:18:00::INFO::Assume PDB file already renumbered with scheme imgt
2023-06-16 11:18:00::INFO::downloading raw files
6%|████████▎ | 390/6741 [00:00<00:10, 613.14it/s]6B3M not found in /mnt/e/code/dyMEAN/all_structures/imgt, try fetching from remote server
6TNP not found in /mnt/e/code/dyMEAN/all_structures/imgt, try fetching from remote server
6QXE not found in /mnt/e/code/dyMEAN/all_structures/imgt, try fetching from remote server
5FUU not found in /mnt/e/code/dyMEAN/all_structures/imgt, try fetching from remote server
2023-06-16 11:18:04::WARN::Trying for the 2 times
2023-06-16 11:18:05::WARN::Trying for the 3 times
fetched
6DZT not found in /mnt/e/code/dyMEAN/all_structures/imgt, try fetching from remote server
fetched
2023-06-16 11:18:06::WARN::Trying for the 4 times
fetched
7%|█████████▊ | 457/6741 [00:06<02:46, 37.78it/s]2023-06-16 11:18:07::WARN::Trying for the 5 times
2023-06-16 11:18:08::WARN::Get https://files.rcsb.org/download/5FUU.pdb failed
15%|█████████████████████▍ | 1013/6741 [00:08<00:45, 125.59it/s]
fetched
Traceback (most recent call last):
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/mnt/e/code/dyMEAN/data/download.py", line 376, in <module>
main(parse())
File "/mnt/e/code/dyMEAN/data/download.py", line 369, in main
items = download(items, out_path, args.n_cpu, args.pdb_dir, args.numbering, args.pre_numbered)
File "/mnt/e/code/dyMEAN/data/download.py", line 280, in download
valid_entries = thread_map(map_func, items, max_workers=ncpu)
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
for obj in iterable:
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
yield fs.pop().result()
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/concurrent/futures/_base.py", line 444, in result
return self.__get_result()
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/mnt/e/code/dyMEAN/data/download.py", line 189, in download_one_item_local
from_remote = fetch_from_pdb(pdb_id)
File "/mnt/e/code/dyMEAN/data/download.py", line 35, in fetch_from_pdb
data['pdb'] = text.text
AttributeError: 'NoneType' object has no attribute 'text'
Processing RAbD with output directory /mnt/e/code/dyMEAN/all_data/RAbD
2023-06-16 11:18:09::INFO::Namespace(fout='/mnt/e/code/dyMEAN/all_data/rabd_all.json', n_cpu=4, numbering='imgt', pdb_dir='/mnt/e/code/dyMEAN/all_structures/imgt', pre_numbered=True, summary='/mnt/e/code/dyMEAN/all_data/sabdab_all.json', type='rabd')
2023-06-16 11:18:09::INFO::download rabd from summary file /mnt/e/code/dyMEAN/all_data/sabdab_all.json
2023-06-16 11:18:09::INFO::Extracting summary to json format
Traceback (most recent call last):
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/mnt/e/code/dyMEAN/data/download.py", line 376, in <module>
main(parse())
File "/mnt/e/code/dyMEAN/data/download.py", line 360, in main
items = read_rabd(fpath)
File "/mnt/e/code/dyMEAN/data/download.py", line 94, in read_rabd
with open(fpath, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/e/code/dyMEAN/all_data/sabdab_all.json'
Traceback (most recent call last):
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/mnt/e/code/dyMEAN/data/split.py", line 249, in <module>
main(parse())
File "/mnt/e/code/dyMEAN/data/split.py", line 72, in main
items = load_file(args.data)
File "/mnt/e/code/dyMEAN/data/split.py", line 37, in load_file
with open(fpath, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/e/code/dyMEAN/all_data/sabdab_all.json'
2023-06-16 11:18:10::INFO::No meta-info file found, start processing
Traceback (most recent call last):
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/mnt/e/code/dyMEAN/data/dataset.py", line 323, in <module>
dataset = E2EDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
File "/mnt/e/code/dyMEAN/data/dataset.py", line 119, in __init__
self.preprocess(file_path, save_dir, num_entry_per_file)
File "/mnt/e/code/dyMEAN/data/dataset.py", line 192, in preprocess
with open(file_path, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/e/code/dyMEAN/all_data/RAbD/test.json'
2023-06-16 11:18:11::INFO::No meta-info file found, start processing
Traceback (most recent call last):
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/mnt/e/code/dyMEAN/data/dataset.py", line 323, in <module>
dataset = E2EDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
File "/mnt/e/code/dyMEAN/data/dataset.py", line 119, in __init__
self.preprocess(file_path, save_dir, num_entry_per_file)
File "/mnt/e/code/dyMEAN/data/dataset.py", line 192, in preprocess
with open(file_path, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/e/code/dyMEAN/all_data/RAbD/valid.json'
^CTraceback (most recent call last):
File "/mnt/e/code/dyMEAN/data/dataset.py", line 104, in __init__
with open(metainfo_file, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/e/code/dyMEAN/all_data/RAbD/train_processed/_metainfo'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/mnt/data/anaconda/envs/dyMEAN/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/mnt/e/code/dyMEAN/data/dataset.py", line 323, in <module>
dataset = E2EDataset(args.dataset, args.save_dir, num_entry_per_file=-1)
File "/mnt/e/code/dyMEAN/data/dataset.py", line 110, in __init__
print_log('No meta-info file found, start processing', level='INFO')
File "/mnt/e/code/dyMEAN/utils/logger.py", line 34, in print_log
print(s, end=end)
KeyboardInterrupt
2023-06-16 11:18:12::INFO::No meta-info file found, start processing

@kxz18
Collaborator

kxz18 commented Jun 16, 2023

It looks like this happens because the PDB file for 5FUU is no longer available in the PDB database, so fetching it over the network fails and the code then tries to read `.text` from a missing response. I've added a branch to detect this error. I've tested it, and it should be fine now.
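
For readers hitting the same `AttributeError: 'NoneType' object has no attribute 'text'`, the general pattern is to treat a failed download as "no entry" and skip it. The following is only a sketch under stated assumptions, not the actual fix committed to dyMEAN: the function names `get_with_retries`, `fetch_pdb_or_none`, and `fetch_all` and the injectable `get` callable are hypothetical, the retry-on-any-error policy is an assumption, and only the URL pattern and the five attempts are taken from the log above.

```python
from typing import Callable, Dict, List, Optional

# URL pattern as it appears in the log line
# "Get https://files.rcsb.org/download/5FUU.pdb failed".
PDB_URL = "https://files.rcsb.org/download/{pdb_id}.pdb"


def get_with_retries(url: str, get: Callable[[str], str], max_tries: int = 5) -> Optional[str]:
    """Call get(url) up to max_tries times; return None if every attempt raises."""
    for _ in range(max_tries):
        try:
            return get(url)
        except Exception:
            continue  # assumed policy: retry on any error
    return None


def fetch_pdb_or_none(pdb_id: str, get: Callable[[str], str]) -> Optional[Dict[str, str]]:
    """Return {'pdb': text}, or None when the entry cannot be downloaded."""
    text = get_with_retries(PDB_URL.format(pdb_id=pdb_id), get)
    if text is None:
        # Guard against the failed fetch instead of dereferencing a missing response.
        return None
    return {"pdb": text}


def fetch_all(pdb_ids: List[str], get: Callable[[str], str]) -> Dict[str, Dict[str, str]]:
    """Fetch every ID and keep only the entries that were actually retrieved."""
    results: Dict[str, Dict[str, str]] = {}
    for pdb_id in pdb_ids:
        entry = fetch_pdb_or_none(pdb_id, get)
        if entry is not None:
            results[pdb_id] = entry
    return results
```

Passing the downloader in as `get` keeps the sketch independent of any particular HTTP client and makes the unavailable-entry path easy to exercise offline with a stub that raises for 5FUU.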
