Querying `str_data`: `extract_from_distr.py`
`str_data` contains a lot of information, which makes it complicated to navigate. `utils/extract_from_distr.py` is a script that can extract and tabulate data from an `str_data` pickle file.
python3 utils/extract_from_distr.py str_data.pkl query.txt
The second mandatory argument of the script is a formatted text file containing the queried keys.
In the query file, each line specifies a key defined in `str_data_entry_current.json`. Since `str_data` is a nested structure and last-level keys are not unique (e.g. there can be two `sequence` keys, one under `OPM_files` and another under `PDB_files`), the whole path to that key must be specified as the tab-separated list of keys leading to it.
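The path-based lookup described above can be pictured with a minimal sketch. It is not the code of `extract_from_distr.py`: the helper names are hypothetical, the toy entry and its values are invented, and it assumes that `str_data` entries behave like nested dictionaries and that the `FixedDict::` prefix only marks the container type, so it is stripped before the lookup.

```python
# Hypothetical helpers for illustration only; not part of extract_from_distr.py.

def parse_query_line(line):
    """Split one query line into its tab-separated path components."""
    return [part for part in line.rstrip("\n").split("\t") if part]


def strip_type_prefix(component):
    """Drop a leading 'FixedDict::' or 'FixedList::' marker, if present.

    Assumption: the prefix describes the container type and is not part
    of the actual dictionary key.
    """
    return component.split("::", 1)[1] if "::" in component else component


def lookup(entry, path):
    """Follow a list of keys down a nested dictionary."""
    node = entry
    for component in path:
        node = node[strip_type_prefix(component)]
    return node


# Toy entry (invented values) with a last-level key that is not unique.
entry = {
    "OPM_files": {"sequence": "AAA"},
    "PDB_files": {"sequence": "BBB"},
}

opm_seq = lookup(entry, parse_query_line("FixedDict::OPM_files\tsequence"))
pdb_seq = lookup(entry, parse_query_line("FixedDict::PDB_files\tsequence"))
print(opm_seq, pdb_seq)
```

Asking for `sequence` alone would be ambiguous in this toy entry, which is why the full path is needed.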
Example: the tab-separated query file
FixedDict::ENCOMPASS class
FixedDict::ENCOMPASS FixedDict::structure kchains
will produce an output like
PDBcode class kchains
1a0s beta ['P', 'Q', 'R']
1a0t beta ['P', 'Q', 'R']
1a91 alpha ['A']
1af6 beta ['A', 'B', 'C']
1afo alpha ['A', 'B']
1aig alpha ['H', 'L', 'M']
1aij alpha ['H', 'L', 'M']
...
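Because the query file must be tab-separated, and tabs are easily turned into spaces by editors or copy-paste, it may be convenient to generate the file programmatically. The following sketch writes the query shown above; the helper name `write_query_file` and the temporary output location are illustrative choices, not part of the repository.

```python
import os
import tempfile

# Each query is a list of path components, joined below with a real tab.
queries = [
    ["FixedDict::ENCOMPASS", "class"],
    ["FixedDict::ENCOMPASS", "FixedDict::structure", "kchains"],
]


def write_query_file(path, queries):
    """Write one tab-separated key path per line (hypothetical helper)."""
    with open(path, "w") as handle:
        for components in queries:
            handle.write("\t".join(components) + "\n")


# A temporary directory is used here only to keep the example self-contained.
query_path = os.path.join(tempfile.mkdtemp(), "query.txt")
write_query_file(query_path, queries)

with open(query_path) as handle:
    written = handle.read()
print(repr(written))  # the tabs appear as \t in the printed representation
```

The resulting file can then be passed as the second argument of the script.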
If keys are nested, the output file will contain an extra column for each piece of intermediate information in `str_data`.
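One way to picture the extra column is as a flattening step: when the path crosses a list, each list element yields its own row, and the list index is kept as an additional column. The sketch below is an assumption about the general idea, not the script's actual implementation; the function name and the simplified toy structure are invented.

```python
def rows_for_entry(pdb_code, chains, value_key):
    """Yield one row per list element: (PDB code, list index, value)."""
    for index, chain in enumerate(chains):
        yield (pdb_code, index, chain[value_key])


# Toy list of per-chain records (simplified and invented for illustration).
toy_chains = [
    {"TM_coverage": 199},
    {"TM_coverage": 200},
    {"TM_coverage": 203},
]

rows = list(rows_for_entry("1a0s", toy_chains, "TM_coverage"))
for row in rows:
    print(*row, sep="\t")
```

A single entry with three chains therefore produces three rows, each carrying the index of the chain it refers to.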
Example: the tab-separated query file
FixedDict::ENCOMPASS FixedDict::structure kchains
FixedDict::ENCOMPASS resolution
FixedDict::ENCOMPASS FixedDict::structure FixedList::chains FixedDict::TM_regions TM_coverage
produces
PDBcode kchains resolution FixedList::chains TM_coverage
1a0s ['P', 'Q', 'R'] 2.4 0.0 199
1a0s ['P', 'Q', 'R'] 2.4 1.0 200
1a0s ['P', 'Q', 'R'] 2.4 2.0 203
1a0t ['P', 'Q', 'R'] 2.4 0.0 215
1a0t ['P', 'Q', 'R'] 2.4 1.0 215
1a0t ['P', 'Q', 'R'] 2.4 2.0 215
1a91 ['A'] None 0.0 55
1af6 ['A', 'B', 'C'] 2.4 0.0 218
...
Here, the first two queries correspond to the second and third columns, but the third query corresponds to both columns 4 and 5: the list index of `FixedList::chains` (currently reported as a float, TODO) is intermediate information that is needed to reach the desired value (without specifying the chain, asking for the TM coverage of a chain is meaningless).
Yet the `FixedList::chains` index does not help much in identifying which chain each line refers to. This is why you can also perform some basic operations on the data you want to extract:
- with the `|FUNC|` specification, you can apply any intrinsic (standard library) Python 3 function to the data you are specifying
- with the `|REPLACE|` specification, you can use data you have collected to replace some obscure index.
Example: the tab-separated query file
FixedDict::ENCOMPASS FixedDict::structure kchains
FixedDict::ENCOMPASS resolution
FixedDict::ENCOMPASS FixedDict::structure FixedList::chains|REPLACE|kchains FixedDict::TM_regions TM_coverage
now gives
PDBcode resolution FixedList::chains TM_coverage
1a0s 2.4 P 199
1a0s 2.4 Q 200
1a0s 2.4 R 203
1a0t 2.4 P 215
1a0t 2.4 Q 215
1a0t 2.4 R 215
1a91 None A 55
1af6 2.4 A 218
thanks to the use of the `FixedList::chains` list index on the `kchains` list.
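Conceptually, the replacement amounts to using each list index to look up the corresponding element of `kchains`. The sketch below illustrates that idea on the rows shown above; it is not the script's code, and `replace_index` is a hypothetical helper. The `int()` conversion accounts for the index being reported as a float.

```python
def replace_index(index, labels):
    """Map a list index (possibly stored as a float, e.g. 1.0) to its label."""
    return labels[int(index)]


kchains = ["P", "Q", "R"]

# Rows as (PDBcode, resolution, FixedList::chains index, TM_coverage).
rows = [
    ("1a0s", 2.4, 0.0, 199),
    ("1a0s", 2.4, 1.0, 200),
    ("1a0s", 2.4, 2.0, 203),
]

replaced = [
    (code, resolution, replace_index(index, kchains), coverage)
    for code, resolution, index, coverage in rows
]
for row in replaced:
    print(*row, sep="\t")
```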