
Querying `str_data`: `extract_from_distr.py`

Edoardo Sarti edited this page Sep 7, 2023 · 1 revision

str_data contains a lot of information, which makes it hard to navigate. utils/extract_from_distr.py is a script that extracts and tabulates data from an str_data pickle file.

python3 utils/extract_from_distr.py str_data.pkl query.txt

The second mandatory argument of the script is a formatted text file containing the queried keys.
In the query file, each line specifies a key defined in str_data_entry_current.json. Since str_data is a nested structure and last-level keys are not unique (e.g. there can be two sequence keys, one under OPM_files and another under PDB_files), each key must be given with its full path: the tab-separated list of keys leading to it.
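The idea of a full key path can be illustrated with a minimal sketch. It is not the script's actual code: the `entry` dictionary is a toy stand-in with made-up values, `resolve` and `strip_type_prefix` are hypothetical helpers, and the sketch assumes that a prefix such as `FixedDict::` is only a type tag in front of the real key name.

```python
# Toy stand-in for one str_data entry (hypothetical values); the real layout
# is defined in str_data_entry_current.json.
entry = {
    "ENCOMPASS": {
        "class": "beta",
        "structure": {"kchains": ["P", "Q", "R"]},
    },
    # Two last-level keys with the same name, as in the OPM_files/PDB_files case.
    "OPM_files": {"sequence": "toy-OPM-sequence"},
    "PDB_files": {"sequence": "toy-PDB-sequence"},
}


def strip_type_prefix(token):
    """Drop a leading 'FixedDict::'-style prefix (assumed to be a type tag)."""
    return token.split("::", 1)[-1]


def resolve(entry, query_line):
    """Follow one tab-separated query line down a nested dictionary."""
    node = entry
    for token in query_line.rstrip("\n").split("\t"):
        node = node[strip_type_prefix(token)]
    return node


print(resolve(entry, "FixedDict::ENCOMPASS\tclass"))
print(resolve(entry, "FixedDict::ENCOMPASS\tFixedDict::structure\tkchains"))
# The bare key "sequence" would be ambiguous; the full path is not.
print(resolve(entry, "FixedDict::PDB_files\tsequence"))
```

Only the full path distinguishes the two `sequence` keys, which is why the query file asks for it.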

Example: the tab-separated query file

FixedDict::ENCOMPASS    class
FixedDict::ENCOMPASS    FixedDict::structure    kchains

will produce an output like

PDBcode class   kchains
1a0s    beta    ['P', 'Q', 'R']
1a0t    beta    ['P', 'Q', 'R']
1a91    alpha   ['A']
1af6    beta    ['A', 'B', 'C']
1afo    alpha   ['A', 'B']
1aig    alpha   ['H', 'L', 'M']
1aij    alpha   ['H', 'L', 'M']
...
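A table like the one above can be loaded back into Python for further processing. The sketch below assumes that the output is tab-separated with a header row and that list-valued cells such as kchains are written as Python literals. Both are inferred from the example rather than confirmed by the script, and `read_table` is a hypothetical helper.

```python
import ast
import csv
import io

# A few lines copied from the example output, with tabs as separators (assumed).
sample = (
    "PDBcode\tclass\tkchains\n"
    "1a0s\tbeta\t['P', 'Q', 'R']\n"
    "1a91\talpha\t['A']\n"
)


def read_table(handle):
    """Read a tab-separated table and turn the kchains cell into a real list."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # literal_eval only parses Python literals, so it is safe on plain text.
        row["kchains"] = ast.literal_eval(row["kchains"])
        rows.append(row)
    return rows


rows = read_table(io.StringIO(sample))
for row in rows:
    print(row["PDBcode"], row["class"], len(row["kchains"]))
```

With a real output file, `io.StringIO(sample)` would be replaced by an open file handle.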

If keys are nested, the output file will contain an extra column for each piece of intermediate information in str_data. Example: the tab-separated query file

FixedDict::ENCOMPASS    FixedDict::structure    kchains
FixedDict::ENCOMPASS    resolution
FixedDict::ENCOMPASS    FixedDict::structure    FixedList::chains       FixedDict::TM_regions   TM_coverage

produces

PDBcode kchains resolution      FixedList::chains       TM_coverage
1a0s    ['P', 'Q', 'R'] 2.4     0.0     199
1a0s    ['P', 'Q', 'R'] 2.4     1.0     200
1a0s    ['P', 'Q', 'R'] 2.4     2.0     203
1a0t    ['P', 'Q', 'R'] 2.4     0.0     215
1a0t    ['P', 'Q', 'R'] 2.4     1.0     215
1a0t    ['P', 'Q', 'R'] 2.4     2.0     215
1a91    ['A']   None    0.0     55
1af6    ['A', 'B', 'C'] 2.4     0.0     218
...

Here, the first two queries correspond to the second and third columns, while the third query produces both columns 4 and 5. The list index of FixedList::chains (currently reported as a float, TODO) is intermediate information needed to reach the requested value: without specifying the chain, asking for the TM coverage of a chain is meaningless.
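The way a list along the path turns into one output row per element, plus an index column, can be sketched as follows. This is an illustration under assumptions, not the script's implementation: the `entry` dictionary is toy data modelled on the 1a0s rows, `expand` is a hypothetical helper, and the `FixedList::` prefix is assumed to mark a list-valued key. The sketch yields integer indices, whereas the script currently reports floats.

```python
# Toy entry modelled on the 1a0s rows above (hypothetical structure).
entry = {
    "ENCOMPASS": {
        "resolution": 2.4,
        "structure": {
            "kchains": ["P", "Q", "R"],
            "chains": [
                {"TM_regions": {"TM_coverage": 199}},
                {"TM_regions": {"TM_coverage": 200}},
                {"TM_regions": {"TM_coverage": 203}},
            ],
        },
    }
}


def expand(node, tokens):
    """Yield (indices, value) pairs, one per combination of list elements."""
    if not tokens:
        yield (), node
        return
    head, rest = tokens[0], tokens[1:]
    kind, _, key = head.rpartition("::")  # kind is "" when there is no prefix
    child = node[key]
    if kind == "FixedList":
        # A list on the path: emit one result per element, remembering its index.
        for i, item in enumerate(child):
            for indices, value in expand(item, rest):
                yield (i,) + indices, value
    else:
        yield from expand(child, rest)


query = ("FixedDict::ENCOMPASS\tFixedDict::structure\tFixedList::chains"
         "\tFixedDict::TM_regions\tTM_coverage")
results = list(expand(entry, query.split("\t")))
for indices, value in results:
    print(indices, value)
```

Each index tuple plays the role of the extra FixedList::chains column in the table.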

However, the FixedList::chains index alone does not tell us which chain each line refers to. For this reason, you can also perform some basic operations on the data you want to extract:

  • with the |FUNC| specification, you can apply any built-in (standard library) Python 3 function to the data you are specifying;
  • with the |REPLACE| specification, you can use data you have already collected to replace an otherwise obscure index.

Example: the tab-separated query file

FixedDict::ENCOMPASS    FixedDict::structure    kchains
FixedDict::ENCOMPASS    resolution
FixedDict::ENCOMPASS    FixedDict::structure    FixedList::chains|REPLACE|kchains       FixedDict::TM_regions   TM_coverage

now gives

PDBcode resolution      FixedList::chains       TM_coverage
1a0s    2.4     P       199
1a0s    2.4     Q       200
1a0s    2.4     R       203
1a0t    2.4     P       215
1a0t    2.4     Q       215
1a0t    2.4     R       215
1a91    None    A       55
1af6    2.4     A       218

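This works because each FixedList::chains index is used to look up the corresponding element of the kchains list. In the output shown above, kchains no longer appears as a separate column.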
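The effect of |REPLACE| can be reproduced by hand on the earlier, unreplaced table, which may help when checking results. The sketch below is only a hedged illustration of the index lookup, not the script's code: `replace_index` is a hypothetical helper, and the rows are copied from the 1a0s lines of the previous example.

```python
def replace_index(index, labels):
    """Map a list index (possibly reported as a float such as 1.0) to its label."""
    return labels[int(float(index))]


# (PDBcode, kchains, resolution, FixedList::chains index, TM_coverage)
rows = [
    ("1a0s", ["P", "Q", "R"], 2.4, 0.0, 199),
    ("1a0s", ["P", "Q", "R"], 2.4, 1.0, 200),
    ("1a0s", ["P", "Q", "R"], 2.4, 2.0, 203),
]

# Swap the numeric index for the chain name and drop the kchains column.
replaced = [
    (code, resolution, replace_index(index, kchains), coverage)
    for code, kchains, resolution, index, coverage in rows
]
for row in replaced:
    print(*row, sep="\t")
```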
