RECOVAR: Regularized covariance estimation for cryo-EM heterogeneity analysis

RECOVAR is a software tool for analyzing different conformations in heterogeneous cryo-EM and cryo-ET datasets. RECOVAR can reconstruct high-resolution volumes, estimate conformational density and low free-energy motions, and automatically identify subsets of images with a particular volume feature.

Click here to see more details about the features RECOVAR

High resolution. According to a thorough benchmarking led by Ellen Zhong’s lab, cryobench, it achieves higher resolution than other methods in most cases. See analyzing the output.
Extracting images based on a volume. If you see a particular feature in a volume generated by RECOVAR, there is a feature to automatically extract the set of images that have created that feature. See extracting
Accurate estimation of conformational density.
Estimation of low free-energy motions (if you believe conformational density and free energy are related). See analyzing the output.
It can use a focus mask. See option --focus_mask in pipeline.
It supports tomography data (same format as cryoDRGN-ET). One practical advantage over the cryoDRGN-ET and tomodrgn is that a focus mask can be input. This is a new feature (no paper yet) so it may be less stable. See tomography
You can also use the RECOVAR volume generation on latent space generated by your favorite heterogeneity method. For example, I have used it to improve cryoDRGN reconstructions. Since the volume generation scheme is transparent, you can be quite confident that it produces no hallucination which may also be useful to validate the output of another method. See kernel regression

If you are interested in the research and methods see preprint here and recorded talk here.

Installation

Running RECOVAR:

1. Preprocessing
2. Specifying a mask
3. Running the pipeline
- Uses pipeline.py
4. Analyzing results
- Uses analyze.py
- Generating volumes at specific points: compute_state.py
- Generating and analyzing trajectories: compute_trajectory.py
- Extracting image subsets:
  - Based on volumes: extract_image_subset.py
  - Based on k-means clusters: extract_image_subset_from_kmeans.py
- Estimating conformational density: estimate_conformational_density.py
5. Visualizing results

Other:

RECOVAR for Cryo-ET
Testing your installation

Using kernel regression with other embeddings

TLDR

(OUT OF DATE) Peak at what output looks like on a synthetic dataset and real dataset.

Also: using the source code, limitations, contact

Installation

To run this code, CUDA and JAX are required. See information about JAX installation here. Assuming you already have CUDA, installation should take less than 5 minutes. Below are a set of commands which runs on our university cluster (Della), but may need to be tweaked to run on other clusters. You may need to load CUDA before installing JAX, E.g., on our university cluster with module load cudatoolkit/12.3

Then create an environment, download JAX-cuda (for some reason the latest version is causing issues, so make sure to use 0.4.23), clone the directory and install the requirements (note the --no-deps flag. This is because of some conflict with dependencies of cryodrgn. Will fix it soon.).

git clone https://github.com/ma-gilles/recovar.git
conda create --name recovar python=3.11 -y
conda activate recovar
pip install -U "jax[cuda12_pip]"==0.4.23 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html --no-input
pip install --no-deps -r recovar/recovar_install_requirements.txt --no-input
python -m ipykernel install --user --name=recovar

It is recommanded to test your installation before running on a real dataset, see Testing your installation.

I. Preprocessing

NOTE: Until I fix versioning issues, you need to install cryoDRGN and RECOVAR in two different conda environments.

This means you should install cryoDRGN in the crydrgn environment as below, process the data using the cryoDRGN environment after doing conda activate cryodrgn, then activate the recovar environement with conda activate recovar, and go on with the analysis.

The input interface of RECOVAR is borrowed directly from the excellent cryoDRGN toolbox. Particles, poses and CTF must be prepared in the same way, and below is copy-pasted part of cryoDRGN's documentation.

You should first install cryoDRGN, and prepare the dataset as below before going on to step 2.

cryoDRGN Installation

cryodrgn may be installed via pip, and we recommend installing cryodrgn in a clean conda environment.

# Create and activate conda environment
(base) $ conda create --name cryodrgn python=3.9
(cryodrgn) $ conda activate cryodrgn

# install cryodrgn
(cryodrgn) $ pip install cryodrgn

(NOTE: right now you need to install cryoDRGN and RECOVAR in two different environments, will fix soon!)

You can alternatively install a newer, less stable, development version of cryodrgn using our beta release channel:

(cryodrgn) $ pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ cryodrgn --pre

More installation instructions are found in the documentation.

1. Preprocess image stack

First resize your particle images using the cryodrgn downsample command:

$ cryodrgn downsample -h

usage: cryodrgn downsample [-h] -D D -o MRCS [--is-vol] [--chunk CHUNK]
                           [--datadir DATADIR]
                           mrcs

Downsample an image stack or volume by clipping fourier frequencies

positional arguments:
  mrcs               Input images or volume (.mrc, .mrcs, .star, .cs, or .txt)

optional arguments:
  -h, --help         show this help message and exit
  -D D               New box size in pixels, must be even
  -o MRCS            Output image stack (.mrcs) or volume (.mrc)
  --is-vol           Flag if input .mrc is a volume
  --chunk CHUNK      Chunksize (in # of images) to split particle stack when
                     saving
  --relion31         Flag for relion3.1 star format
  --datadir DATADIR  Optionally provide path to input .mrcs if loading from a
                     .star or .cs file
  --max-threads MAX_THREADS
                     Maximum number of CPU cores for parallelization (default: 16)
  --ind PKL          Filter image stack by these indices

We recommend first downsampling images to 128x128 since larger images can take much longer to train:

$ cryodrgn downsample [input particle stack] -D 128 -o particles.128.mrcs

The maximum recommended image size is D=256, so we also recommend downsampling your images to D=256 if your images are larger than 256x256:

$ cryodrgn downsample [input particle stack] -D 256 -o particles.256.mrcs

The input file format can be a single .mrcs file, a .txt file containing paths to multiple .mrcs files, a RELION .star file, or a cryoSPARC .cs file. For the latter two options, if the relative paths to the .mrcs are broken, the argument --datadir can supply the path to where the .mrcs files are located.

If there are memory issues with downsampling large particle stacks, add the --chunk 10000 argument to save images as separate .mrcs files of 10k images.

2. Parse image poses from a consensus homogeneous reconstruction

CryoDRGN expects image poses to be stored in a binary pickle format (.pkl). Use the parse_pose_star or parse_pose_csparc command to extract the poses from a .star file or a .cs file, respectively.

Example usage to parse image poses from a RELION 3.1 starfile:

$ cryodrgn parse_pose_star particles.star -o pose.pkl -D 300

Example usage to parse image poses from a cryoSPARC homogeneous refinement particles.cs file:

$ cryodrgn parse_pose_csparc cryosparc_P27_J3_005_particles.cs -o pose.pkl -D 300

Note: The -D argument should be the box size of the consensus refinement (and not the downsampled images from step 1) so that the units for translation shifts are parsed correctly.

3. Parse CTF parameters from a .star/.cs file

CryoDRGN expects CTF parameters to be stored in a binary pickle format (.pkl). Use the parse_ctf_star or parse_ctf_csparc command to extract the relevant CTF parameters from a .star file or a .cs file, respectively.

Example usage for a .star file:

$ cryodrgn parse_ctf_star particles.star -D 300 --Apix 1.03 -o ctf.pkl

The -D and --Apix arguments should be set to the box size and Angstrom/pixel of the original .mrcs file (before any downsampling).

Example usage for a .cs file:

$ cryodrgn parse_ctf_csparc cryosparc_P27_J3_005_particles.cs -o ctf.pkl

II. Specifying a real-space mask

A real space mask is important to boost SNR. Most consensus reconstruction software output a mask, which you can use as input (--mask=path_to_mask.mrc). Make sure the mask is not too tight; you can use the input --dilate-mask-iter to expand the mask if needed. You may also want to use a focusing mask to focus on heterogeneity in one part of the volume click here to find instructions to generate one with Chimera.

If you don't input a mask, you can ask the software to estimate one using the two halfmaps of the mean ( --mask=from_halfmaps). You may also want to run with a loose spherical mask (option --mask=sphere) and use the computed variance map to observe which parts have large variance.

III. Running the RECOVAR Pipeline

To run the RECOVAR pipeline, ensure that the input images (.mrcs), poses (.pkl), and CTF parameters (.pkl) are prepared. Then, execute the following command:

$ python [recovar_directory]/pipeline.py particles.128.mrcs -o output_test --ctf ctf.pkl --poses poses.pkl --mask [path_to_your_mask.mrc]

To view a list of all available options and their usage, use the help command:

$ python pipeline.py -h

usage: pipeline.py [-h] -o OUTDIR [--zdim ZDIM] --poses POSES --ctf pkl --mask mrc [--focus-mask mrc] 
                   [--mask-dilate-iter MASK_DILATE_ITER] [--correct-contrast] [--ignore-zero-frequency] 
                   [--ind PKL] [--uninvert-data UNINVERT_DATA] [--lazy] [--datadir DATADIR] [--n-images N_IMAGES] 
                   [--padding PADDING] [--halfsets HALFSETS] [--keep-intermediate] [--noise-model NOISE_MODEL] 
                   [--mean-fn MEAN_FN] [--accept-cpu] [--test-covar-options] [--low-memory-option] 
                   [--very-low-memory-option] [--dont-use-image-mask] [--do-over-with-contrast] [--tilt-series] 
                   [--tilt-series-ctf TILT_SERIES_CTF] [--dose-per-tilt DOSE_PER_TILT] [--angle-per-tilt ANGLE_PER_TILT] 
                   [--only-mean] [--ntilts NTILTS]
                   particles

positional arguments:
  particles             Input particles (.mrcs, .star, .cs, or .txt)

optional arguments:
  -h, --help            Show this help message and exit
  -o OUTDIR, --outdir OUTDIR
                        Output directory to save model
  --zdim ZDIM           Dimensions of latent variable. Default = 1, 2, 4, 10, 20
  --poses POSES         Image poses (.pkl)
  --ctf pkl             CTF parameters (.pkl)
  --mask mrc            Solvent mask (.mrc). Options: from_halfmaps, sphere, none
  --focus-mask mrc      Focus mask (.mrc)
  --mask-dilate-iter MASK_DILATE_ITER
                        Number of iterations to dilate solvent and focus mask
  --correct-contrast    Estimate and correct for amplitude scaling (contrast) variation across images
  --ignore-zero-frequency
                        Ignore zero frequency. Useful if images are normalized to 0 mean
  --tilt-series         Use tilt series data
  --tilt-series-ctf TILT_SERIES_CTF
                        Specify CTF for tilt series. Default is the same as tilt series
  --dose-per-tilt DOSE_PER_TILT
                        Specify dose per tilt
  --angle-per-tilt ANGLE_PER_TILT
                        Specify angle per tilt
  --only-mean           Only compute mean
  --ntilts NTILTS       Number of tilts to use per tilt series. Default is all

Dataset loading:
  --ind PKL             Filter particles by these indices (.pkl)
  --uninvert-data UNINVERT_DATA
                        Invert data sign: Options are true, false, automatic (default). 
                        Automatic will swap signs if sum(estimated mean) < 0
  --lazy                Use lazy loading if the full dataset is too large to fit in memory
  --datadir DATADIR     Path prefix to particle stack if loading relative paths from a .star or .cs file
  --n-images N_IMAGES   Number of images to use (useful for quick runs)
  --padding PADDING     Real-space padding
  --halfsets HALFSETS   Path to a file with indices of split dataset (.pkl)
  --keep-intermediate   Save intermediate results (useful for debugging)
  --noise-model NOISE_MODEL
                        Noise model to use. Options: radial (default), white
  --mean-fn MEAN_FN     Mean function to use. Options: triangular (default), old, triangular_reg
  --accept-cpu          Accept running on CPU if no GPU is found
  --test-covar-options  Test different covariance estimation options (for development)
  --low-memory-option   Use low memory options for covariance estimation
  --very-low-memory-option
                        Use very low memory options for covariance estimation
  --dont-use-image-mask Do not use image mask
  --do-over-with-contrast
                        Run again once contrast is estimated

Required Arguments:

particles: Input particle stack file (.mrcs, .star, .cs, or .txt)
-o OUTDIR, --outdir OUTDIR: Output directory to save the model
--poses POSES: Image poses file (.pkl)
--ctf pkl: CTF parameters file (.pkl)
--mask mrc: Solvent mask file (.mrc). Options: from_halfmaps, sphere, none (no mask used).

Commonly Used Optional Arguments:

--focus-mask mrc: Specify a focus mask file (.mrc). If using only a solvent mask, pass it with --mask. If only a focus mask is available, you can use --mask=sphere.
--mask-dilate-iter MASK_DILATE_ITER: Number of iterations to dilate the mask (default = 0).
--zdim ZDIM: Dimensions of the PCA to use for embedding. You can specify one integer (--zdim=20) or a comma-separated list (--zdim=10,50,100). Default is --zdim=1,2,4,10,20 with no regularization.
--only-mean: Only compute the mean. Fast and useful to make dataset is properly preprocessed.

IV. Analyzing results

After the pipeline is run, you can find the mean, eigenvectors, variance maps, and embeddings in the outdir/results directory, where outdir is the option given above by -o. You can run some standard analysis by running:

python analyze.py [pipeline_output_dir] --zdim=10

It will run k-means, generate volumes corresponding to the centers, generate trajectories between pairs of cluster centers, and run UMAP. See more input details below.

If you want to sample many points of the distribution (e.g 100-200), and want to do it fast, you can use --n-clusters=200 and reduce run time (at the cost of resolution) with --n-bins=10. Afterwards, you can recompute the same cluster to higher resolution using compute_state.py.

Usage of `analyze.py`

Command:

python analyze.py [result_dir] --zdim=10

Positional Arguments:

result_dir: Path to the result directory (output directory of the pipeline).

Required Arguments:

--zdim ZDIM: Dimension of the latent variable (must be a single integer, not a list).

Optional Arguments

-o OUTDIR, --outdir OUTDIR: Output directory to save model. If not provided, results will be saved in result_dir/analysis_{zdim}/.
--n-clusters N_CLUSTERS: Number of k-means clusters (default: 40).
--n-trajectories N_TRAJECTORIES: Number of trajectories to compute between k-means clusters (default: 0).
--skip-umap: Skip UMAP embedding (useful for large datasets where UMAP can be slow).
--skip-centers: Skip generating volumes of the k-means cluster centers.
--n-vols-along-path N_VOLS_ALONG_PATH: Number of volumes to compute along each trajectory (default: 6).
--Bfactor BFACTOR: B-factor for sharpening (default: 0).
--n-bins N_BINS: Number of bins for kernel regression.
--maskrad-fraction MASKRAD_FRACTION: Radius of mask used in kernel regression. Default is 20, which means radius = grid_size/20 pixels, or grid_size * voxel_size / 20 Å.
--n-min-images N_MIN_IMAGES: Minimum number of images required for kernel regression. Default is 100 for SPA and 10 particles for tilt series.
--density DENSITY: Path to a .pkl file containing density data with keys density and latent_space_bounds.
--normalize-kmeans: Normalize latent variables (z) before computing k-means clustering.
--no-z-regularization: Use latent variables without regularization (e.g., use 2_noreg instead of 2).
--lazy: Enable lazy loading if the full dataset is too large to fit in memory.

Generating Volumes at Specific Points in Latent Space

To generate volumes at specific coordinates in latent space, use the compute_state.py script. Ensure you specify the coordinates using a .txt file readable by np.loadtxt and provide a B-factor if sharpening is required.

Command:

python compute_state.py [pipeline_output_dir] -o [volume_output_dir] --latent-points [zfiles.txt] --Bfactor=[Bfac]

Positional Arguments:

result_dir: The output directory provided to pipeline.py using the --o option. For analyze.py, this is set by default to result_dir/output/analysis_[zdim].

Required Arguments:

--latent-points LATENT_POINTS: Path to latent points (.txt file). You can use the output of k-means, such as output/analysis_2/centers.txt from analyze.py, or create your own latent points. The file should have a shape of (n_points, zdim).

Optional Arguments

-o OUTDIR, --outdir OUTDIR: Output directory to save all outputs. For analyze.py, it defaults to result_dir/output/analysis_[zdim]. For other functions, this argument is required.
--Bfactor BFACTOR: B-factor for sharpening. The B-factor of the consensus reconstruction is likely a good guess. Default is 0, meaning no sharpening.
--n-bins N_BINS: Number of bins for kernel regression. Default is 50, which works well for most cases. This setting was used to generate all figures in the paper.
--maskrad-fraction MASKRAD_FRACTION: Radius of mask used in kernel regression. Default is 20, meaning radius = grid_size/20 pixels or grid_size * voxel_size / 20 Å. The default setting works well for most cases. This was used for all figures in the paper. If using cryo-ET or very noisy (or very non-noisy) data, you might want to adjust this value accordingly. For low-resolution data (less than 128x128 images), consider increasing this value. For high-resolution data (more than 512x512 images), consider decreasing it.
--n-min-images N_MIN_IMAGES: Minimum number of images required for kernel regression. The default is 100 for SPA and 10 particles for tilt series. The default works well for most cases, as it was used to generate all figures in the paper. For cryo-ET or very noisy (or very non-noisy) data, consider adjusting this value.
--zdim1: Enable this if dimension 1 is used. This setting addresses a specific issue with np.loadtxt.
--no-z-regularization: If set, uses latent variables without regularization.
--lazy: Enables lazy loading when the full dataset is too large to fit in memory.
--particles PARTICLES: Path to the particle stack dataset. If not specified, the same stack as provided to pipeline.py will be used. Use this option to utilize a higher-resolution stack.
--datadir DATADIR: Path prefix to the particle stack if loading relative paths from a .star or .cs file. Similar to the --datadir option in pipeline.py. If not specified, the same stack as provided to pipeline.py will be used. Use this option to load a higher-resolution stack.

Generating and Analyzing Trajectories in Latent Space

To compute a high-density (low free energy) trajectory in latent space and generate corresponding volumes, use the compute_trajectory.py script.

Command:

python compute_trajectory.py [pipeline_output_dir] -o [volume_output_dir] --zdim [dimension] [options]

Positional Arguments:

result_dir: The output directory provided to pipeline.py using the -o option. This is where the results from pipeline.py are stored.

Required Arguments:

-o OUTDIR, --outdir OUTDIR: Output directory to save all results. This is a required argument.
--zdim ZDIM: Dimension of the latent variable (a single integer). This should match the zdim used in the pipeline.

Endpoint Specification (One of the following is required):

To define the start and end points of the trajectory in latent space, you need to specify endpoints using one of the following options:

--ind IND: Indices of points in the list of coordinates to use as endpoints. Provide indices as a comma-separated list, e.g., --ind 0,1.

This option is used in conjunction with the --endpts option.
--endpts ENDPTS_FILE: Path to a file containing endpoint coordinates in latent space. The file should be a .txt file with at least two rows, each containing the coordinates of an endpoint. If the file has more than two lines, it will use the first two lines as endpoints by default. If you want to specify different lines, use the --ind option to specify the indices of the lines to use.
--z_st Z_ST_FILE and --z_end Z_END_FILE: Paths to files containing the starting point (z_st) and ending point (z_end) coordinates in latent space, respectively. Each file should be a .txt file containing a single row of coordinates.

Optional Arguments:

--density DENSITY: Path to the density file saved in .pkl format, containing keys 'density' and 'latent_space_bounds'. This is typically generated by estimate_conformational_density.py, e.g., density/deconv_density_knee.pkl.
--n-vols-along-path N_VOLS_ALONG_PATH: Number of volumes to compute along the trajectory. Default is 6.
--Bfactor BFACTOR: B-factor for sharpening. The B-factor of the consensus reconstruction is a good estimate. Default is 0 (no sharpening).
--n-bins N_BINS: Number of bins for kernel regression. Default is 50.
--maskrad-fraction MASKRAD_FRACTION: Radius of mask used in kernel regression. Default is 20, which means radius = grid_size / 20 pixels, or grid_size * voxel_size / 20 Å.
--n-min-images N_MIN_IMAGES: Minimum number of images required for kernel regression. Default is 100 for SPA data.
--lazy: Use lazy loading if the dataset is too large to fit in memory.
--particles PARTICLES: Path to the particle stack dataset. If not specified, the same stack used in pipeline.py will be used. Use this option if you want to use a higher-resolution stack.
--datadir DATADIR: Path prefix to particle stack if loading relative paths from a .star or .cs file.
--zdim1: Enable this if using a 1-dimensional latent space. This addresses a specific issue with np.loadtxt.
--override_z_regularization: Override the z regularization setting from the pipeline. Use with caution.

Examples:

Compute trajectory using indices in a coordinate file:

If you have a coordinate file (e.g., kmeans_center_coords.txt generated by analyze.py), you can specify which lines (points) to use as endpoints using --ind.
```
python compute_trajectory.py [pipeline_output_dir] -o [volume_output_dir] --zdim [dimension] \
    --density [density.pkl] --endpts kmeans_center_coords.txt --ind 0,1
```
This will use the first two lines (indices 0 and 1) in kmeans_center_coords.txt as the endpoints.
Compute trajectory using endpoint coordinates from a file:

If your endpoint file has exactly two lines, you can simply specify:
```
python compute_trajectory.py [pipeline_output_dir] -o [volume_output_dir] --zdim [dimension] \
    --density [density.pkl] --endpts endpoints.txt
```
If the file has more than two lines, and you want to use specific lines, use --ind to specify the indices.

Compute trajectory using specific start and end point files:

If you have separate .txt files for the starting point (z_st.txt) and ending point (z_end.txt):

python compute_trajectory.py [pipeline_output_dir] -o [volume_output_dir] --zdim [dimension] \
    --density [density.pkl] --z_st z_st.txt --z_end z_end.txt

Additional Notes:

The --density option is important if you want to compute trajectories that follow regions of high conformational density (low free energy). The density file can be generated using estimate_conformational_density.py.
Ensure that the zdim matches the dimension used in the pipeline and in density estimation.
The volumes generated along the trajectory will be saved in the specified output directory (--outdir), typically under subdirectories like vol0000, vol0001, etc.

Help Message:

For more details and options, you can view the help message by running:

python compute_trajectory.py -h

Help Output:

usage: compute_trajectory.py [-h] -o OUTDIR [--zdim ZDIM] [--Bfactor BFACTOR] [--n-bins N_BINS]
                             [--maskrad-fraction MASKRAD_FRACTION] [--n-min-images N_MIN_IMAGES] [--zdim1]
                             [--no-z-regularization] [--override_z_regularization] [--lazy]
                             [--particles PARTICLES] [--datadir DATADIR] [--n-vols-along-path N_VOLS_ALONG_PATH]
                             [--density DENSITY] [--ind IND] [--endpts ENDPTS_FILE]
                             [--z_st Z_ST_FILE] [--z_end Z_END_FILE]
                             result_dir

positional arguments:
  result_dir            Output directory provided to pipeline.py using the --o option.

optional arguments:
  -h, --help            Show this help message and exit.
  -o OUTDIR, --outdir OUTDIR
                        Output directory to save all outputs. This is a required argument.
  --zdim ZDIM           Dimension of latent variable (a single int, not a list). Must match the zdim used in the pipeline.
  --Bfactor BFACTOR     B-factor sharpening. The B-factor of the consensus reconstruction is probably a good guess. Default is 0, which means no sharpening.
  --n-bins N_BINS       Number of bins for kernel regression. Default is 50 and works well for most cases.
  --maskrad-fraction MASKRAD_FRACTION
                        Radius of mask used in kernel regression. Default = 20, which means radius = grid_size/20 pixels, or grid_size * voxel_size / 20 angstrom.
  --n-min-images N_MIN_IMAGES
                        Minimum number of images to compute kernel regression. Default = 100 for SPA, and 10 particles for tilt series.
  --zdim1               Enable if using a 1-dimensional latent space. Addresses an issue with np.loadtxt.
  --no-z-regularization
                        Use unregularized latent variables. This should be consistent with how the density was estimated.
  --override_z_regularization
                        Override the z regularization setting from the pipeline. Use with caution.
  --lazy                Use lazy loading if the dataset is too large to fit in memory.
  --particles PARTICLES
                        Particle stack dataset. If not specified, the same stack as provided to pipeline.py will be used. Use this option if you want to use a higher resolution stack.
  --datadir DATADIR     Path prefix to particle stack if loading relative paths from a .star or .cs file.
  --n-vols-along-path N_VOLS_ALONG_PATH
                        Number of volumes to compute along each trajectory (default 6)
  --density DENSITY     Density saved in pkl file, keys are 'density' and 'latent_space_bounds'.
  --ind IND             Indices of points in the list of coordinates to use as endpoints. Provide as a comma-separated list, e.g., --ind 0,1.
  --endpts ENDPTS_FILE  Endpoints file (txt). If it has more than 2 lines, it will use the first two lines as endpoints. If you want to specify different lines, use --ind.
  --z_st Z_ST_FILE      Starting point file (txt).
  --z_end Z_END_FILE    Ending point file (txt).

Extracting image subset based on volumes

Overview

This command will automatically figure out which images produced a particular feature of a volume generated by RECOVAR. You may want to use these images for a downstream tasks, or to re-import into RELION/cryosparc.

You can use cryodrgn_utils write_star and cryodrgn write_cs to do that after running extract_image_subset.

The volume generation method of RECOVAR uses different sets of images to generate different parts of the volume. Thus, you need to tell this function which part of the volume is of interest. This can be done by giving a small mask around a region (note that it is really the center of mass of the mask which is used, so if you give a strange non convex mask, it will give strange result).

Usage

python extract_image_subset.py --input-dir [input_directory] --output [output_path] --subvol-idx [subvolume_idx] --mask [mask_file] --coordinate [x,y,z]

Parameters

--input-dir (required):
- Type: str
- Description: Path to the directory containing the volume data, typically generated by volume-processing scripts like analyze, compute_state, or compute_trajectory. input-dir should point to a folder that looks like \vol00**\. For example, if you ran pipeline.py then analyze.py --zdim 20, this folder should exist: [pipeline_output_dir]/output/analysis_20/centers/vol0000/, which contains all the parameters of the volume stored in [pipeline_output_dir]/output/analysis_20/centers/all_volumes/vol0000.mrc.
--output (required):
- Type: str
- Description: Output path where the .pkl file containing the indices of the extracted image subset will be saved.
--subvol-idx (optional):
- Type: int
- Description: Specifies the subvolume index to use for extracting the subset of images. This index directs the function to focus on a particular subvolume.
--mask (optional):
- Type: str (file path)
- Description: Path to an MRC file that defines a mask around the feature of interest. If provided, the center of mass of the mask is calculated, and the subset of images closest to this center is selected.
--coordinate (optional):
- Type: list of int
- Description: Specifies the pixel coordinates of the feature for which images are to be extracted. This should be provided in the format x,y,z.

Example Commands

Extracting by Subvolume Index:

python extract_image_subset.py --input-dir "/path/to/volume_data" --output "/path/to/output_indices.pkl" --subvol-idx 3

Extracting by Mask:

python extract_image_subset.py --input-dir "/path/to/volume_data" --output "/path/to/output_indices.pkl" --mask "/path/to/mask.mrc"

Extracting by Coordinates:

python extract_image_subset.py --input-dir "/path/to/volume_data" --output "/path/to/output_indices.pkl" --coordinate 50,50,50

Notes

Ensure that the correct path is provided for both the --input-dir and --output parameters.
Only one of --subvol-idx, --mask, or --coordinate should be provided to define the criteria for image extraction. Providing more than one of these parameters will result in an error.

Command: `extract_image_subset_from_kmeans`

Overview

This command extracts a subset of images based on the k-means clustering indices from a specified .pkl file. The images can be selected either by directly specifying the k-means indices to keep, or by excluding the specified indices using the inverse option.

Usage

python extract_image_subset_from_kmeans.py [path_to_centers] [output_path] [kmeans_indices] [-i]

Parameters

path_to_centers (required):
- Type: str
- Description: Path to the centers.pkl file. This file is usually found in the folder [recovar_output_folder]/output/analysis_[zdim]/centers.pkl and contains the labels of the k-means clustering.
output_path (required):
- Type: str
- Description: Path where the output .pkl file containing the indices of the extracted image subset will be saved.
kmeans_indices (required):
- Type: list of int
- Description: A comma-separated list of k-means indices that you wish to keep. For example, 20,30,50 would keep the images that belong to these k-means clusters.
-i, --inverse (optional):
- Type: bool
- Description: If this option is provided, the function will exclude the images that correspond to the specified k-means indices. By default, the function keeps the images that correspond to the specified indices.

Example Commands

Extracting images corresponding to specific k-means clusters:

python extract_image_subset_from_kmeans.py /path/to/centers.pkl /path/to/output_indices.pkl 20,30,50

Excluding specific k-means clusters:

python extract_image_subset_from_kmeans.py /path/to/centers.pkl /path/to/output_indices.pkl 20,30,50 -i

Notes

The path_to_centers must be a valid .pkl file containing the k-means clustering labels, typically located at [recovar_output_folder]/output/analysis_[zdim]/centers.pkl.
The output_path must also be a .pkl file where the subset of indices will be saved.
Only one of the --inverse option should be used. If provided, the images not in the specified clusters will be saved.

Command: `estimate_conformational_density.py`

Overview

This script estimates the conformational density from RECOVAR results using a deconvolution approach. It processes the results in a specified principal component space and outputs the estimated densities along with related plots.

Note that this conformational density works better if you have a relatively simple distribution, which can be well represented in a small number of PCS (<=4). Often this is achievable by using focused mask to focus the analysis on a part of the biomoecule.

Note that the density estimation requires a regularization parameter $\alpha$ in the paper. By default, the code will try a range of different $\alpha$'s and estimate the correct one by finding the knee in the L-curve. This does not always work great, so you should look at the all_densities.png plot and the L-curve.png plots to make sure the densities don't look overfit at the selected knee. The estimated densities with all values of $\alpha$ are stored in all_densities/, and you can select the corresponding one if you do not think the knee selection is accurate, e.g. if you think the a[3] is the most accurate density on the all_densities.png plot, select the density all_densities/deconv_density_3.pkl for further analysis (e.g. by passing it to compute_trajectory.py).

Usage

python estimate_conformational_density.py [recovar_result_dir]

Positional Arguments:

recovar_result_dir
Directory containing RECOVAR results provided to pipeline.py.

Optional Arguments

--pca_dim PCA_DIM
Dimension of PCA space in which the density is estimated (default 4). The runtime increases exponentially with this number, so <=5 is recommended.
--z_dim_used Z_DIM_USED
Dimension of latent variable used (default 4). Should be at least as big as pca_dim, and should be one of the dims used in analyze.py
--output_dir OUTPUT_DIR
Directory to save the density estimation results. Default is recovar_result_dir/density/.
--percentile_reject PERCENTILE_REJECT
Percentile of data to reject due to large covariance (default: 10%).
--num_disc_points NUM_DISC_POINTS
Number of discretization points in each dimension for the grid density estimation (default: 50). For example, 50^4 points for a 4-dimensional grid.
--alphas ALPHAS
List of alphas for regularization. Provide as space-separated values, e.g., --alphas 1e-9 1e-8 1e-7. If not provided, defaults to values from 1e-9 to 1e1 in logarithmic spacing.
--percentile_bound PERCENTILE_BOUND
Reject zs with coordinates above this percentile when deciding the bounds of the grid (default: 1%).

Example Commands

Basic Density Estimation:

python estimate_conformational_density.py /path/to/recovar_results --pca_dim 3

Advanced Example with Custom Parameters:

python estimate_conformational_density.py /path/to/recovar_results --z_dim_used 4 \
 --pca_dim 2 \
  --output_dir /path/to/output \
  --percentile_reject 15 \
  --num_disc_points 100 \
  --alphas 1e-9 1e-5 1e-3 1e-1 \
  --percentile_bound 5

Notes

The dimensionality of the deconvolved space (--pca_dim) determines the runtime complexity, which grows exponentially with higher dimensions.
Ensure that the correct recovar_result_dir is provided, as it should contain the necessary results generated by pipeline.py.
The --alphas parameter allows for fine-tuning regularization, and the script will automatically select an optimal alpha based on the L-curve.

Output

The script will create a density/ directory inside the recovar_result_dir (or in the specified output directory), which includes:

all_densities/: Contains density estimations for different alphas.
deconv_density_knee.pkl: Density file corresponding to the optimal alpha determined by the knee point of the L-curve.
all_densities.png: Visualization of all densities across different alphas.
Lcurve.png: L-curve plot used for selecting the optimal alpha.

V. Visualizing results

Output Structure

After running the RECOVAR pipeline and the subsequent analysis scripts (e.g., analyze.py, compute_state.py, compute_trajectory.py, estimate_conformational_density.py), the outputs will be organized in a structured directory tree. Understanding this structure will help you navigate the results and locate the files of interest.

If you are running on a remote server and wish to visualize the results locally, it is recommended to copy only the [output_dir]/output directory to your local machine, as the model files can be quite large. You can then use visualization tools like ChimeraX to inspect the generated volumes.

Below is the updated directory tree structure (click to expand):

Output File Structure

.
├── analysis_2_noreg                   # Generated by `analyze.py` with zdim=2
│   ├── contrast_histogram.png         # Histogram of contrast values
│   ├── density_plots                  # Density plots from `estimate_conformational_density.py`
│   │   └── density_01.png
│   ├── density_plots_sliced           # Sliced density plots
│   │   └── density_01.png
│   ├── kmeans_center_coords.txt       # Coordinates of k-means cluster centers in latent space
│   ├── kmeans_center_volumes          # Volumes corresponding to k-means cluster centers
│   │   ├── all_volumes
│   │   │   ├── locres0000.mrc
│   │   │   ├── locres0001.mrc
│   │   │   ├── locres0002.mrc
│   │   │   ├── vol0000.mrc
│   │   │   ├── vol0001.mrc
│   │   │   └── vol0002.mrc
│   │   ├── latent_coords.txt          # Latent coordinates of volumes
│   │   ├── vol0000                    # Subdirectories for each cluster center volume
│   │   │   ├── half1_unfil.mrc
│   │   │   ├── half2_unfil.mrc
│   │   │   ├── heterogeneity_distances.txt
│   │   │   ├── locres_filtered.mrc
│   │   │   ├── locres_filtered_nob.mrc
│   │   │   ├── locres.mrc
│   │   │   ├── params.pkl
│   │   │   ├── split_choice.pkl
│   │   │   ├── unfil.mrc
│   │   │   └── volume_sampling.mrc
│   │   ├── vol0001
│   │   │   └── [same structure as vol0000]
│   │   └── vol0002
│   │       └── [same structure as vol0000]
│   ├── kmeans_result.pkl              # Results of k-means clustering
│   ├── PCA                            # Principal Component Analysis results
│   │   ├── PC_01.png
│   │   └── PC_01no_annotate.png
│   ├── run.log                        # Log file from `analyze.py`
│   ├── traj0                          # Trajectory directory generated by `compute_trajectory.py`
│   │   ├── all_volumes
│   │   │   ├── locres0000.mrc
│   │   │   ├── locres0001.mrc
│   │   │   ├── locres0002.mrc
│   │   │   ├── locres0003.mrc
│   │   │   ├── locres0004.mrc
│   │   │   ├── locres0005.mrc
│   │   │   ├── vol0000.mrc
│   │   │   ├── vol0001.mrc
│   │   │   ├── vol0002.mrc
│   │   │   ├── vol0003.mrc
│   │   │   ├── vol0004.mrc
│   │   │   └── vol0005.mrc
│   │   ├── density
│   │   │   └── density_01.png
│   │   ├── latent_coords.txt
│   │   ├── path.json                  # Trajectory path in latent space
│   │   ├── vol0000
│   │   │   └── [volume files as in kmeans_center_volumes]
│   │   ├── vol0001
│   │   │   └── [volume files as in kmeans_center_volumes]
│   │   ├── vol0002
│   │   │   └── [volume files as in kmeans_center_volumes]
│   │   ├── vol0003
│   │   │   └── [volume files as in kmeans_center_volumes]
│   │   ├── vol0004
│   │   │   └── [volume files as in kmeans_center_volumes]
│   │   └── vol0005
│   │       └── [volume files as in kmeans_center_volumes]
│   ├── trajectory_endpoints.pkl       # Endpoints used for trajectory computation
│   └── umap                           # UMAP embedding results
│       ├── kmeans_centers.png
│       ├── kmeans_centers_no_annotate.png
│       ├── sns.png
│       ├── sns_hex.png
│       └── umap_embedding.pkl
├── command.txt                        # Command used to run the pipeline
├── density                            # Generated by `estimate_conformational_density.py`
│   ├── all_densities
│   │   ├── deconv_density_0.pkl
│   │   ├── deconv_density_1.pkl
│   │   ├── deconv_density_2.pkl
│   │   ├── deconv_density_3.pkl
│   │   ├── deconv_density_4.pkl
│   │   ├── deconv_density_5.pkl
│   │   ├── deconv_density_6.pkl
│   │   ├── deconv_density_7.pkl
│   │   ├── deconv_density_8.pkl
│   │   ├── deconv_density_9.pkl
│   │   └── deconv_density_10.pkl
│   ├── all_densities.png              # Visualization of densities across alphas
│   ├── deconv_density_knee.pkl        # Optimal density selected via L-curve
│   └── Lcurve.png                     # L-curve plot for regularization selection
├── model                              # Generated by `pipeline.py`
│   ├── covariance_cols.pkl            # Covariance matrices
│   ├── embeddings.pkl                 # Embeddings of the data
│   ├── halfsets.pkl                   # Information about half-sets
│   ├── params.pkl                     # Parameters used during pipeline execution
│   └── particles_halfsets.pkl         # Indices of particles in each half-set
├── output
│   ├── plots                          # Plots generated during the pipeline
│   │   ├── contrast_histogram.png     # Histogram of contrast values
│   │   ├── eigenvalues.png            # Eigenvalues plot
│   │   ├── mean_fsc.png               # Fourier Shell Correlation of the mean volume
│   │   └── mean_variance_eigenvolume_plots.png
│   └── volumes                        # Volumes generated by `pipeline.py`
│       ├── dilated_mask.mrc           # Dilated version of the mask
│       ├── eigen_neg*.mrc             # Negative eigenvolumes (principal components)
│       ├── eigen_pos*.mrc             # Positive eigenvolumes (principal components)
│       ├── focus_mask.mrc             # Focus mask if provided
│       ├── mask.mrc                   # Solvent mask used
│       ├── mean.mrc                   # Mean volume
│       ├── mean_filt.mrc              # Filtered mean volume
│       ├── mean_half1_unfil.mrc       # Unfiltered mean volume from half-set 1
│       ├── mean_half2_unfil.mrc       # Unfiltered mean volume from half-set 2
│       ├── variance4.mrc              # Variance map for zdim=4
│       ├── variance10.mrc             # Variance map for zdim=10
│       └── variance20.mrc             # Variance map for zdim=20
├── run.log                            # Log file from the pipeline execution
└── trajectory1                        # Generated by `compute_trajectory.py`
    ├── density
    │   └── density_01.png
    ├── path.json                      # Trajectory path in latent space
    └── vol0000
        └── [volume files as in kmeans_center_volumes]

Directory and File Descriptions

analysis_2_noreg/: Results from analyze.py with zdim=2 and no regularization.
- contrast_histogram.png: Histogram of estimated contrast values across images.
- density_plots/: Density plots generated from conformational density estimation.
- kmeans_center_coords.txt: Coordinates of k-means cluster centers in latent space.
- kmeans_center_volumes/: Contains volumes corresponding to k-means cluster centers.
  - all_volumes/: Aggregated volumes for all centers.
    - vol****.mrc: Volumes for each cluster center.
    - locres****.mrc: Locally filtered volumes.
  - latent_coords.txt: Latent space coordinates of the volumes.
  - vol****/: Subdirectories for each cluster center containing detailed volume files.
- kmeans_result.pkl: Results of the k-means clustering, including labels and centers.
- PCA/: Principal Component Analysis plots.
  - PC_01.png: Plot of the first principal component.
  - PC_01no_annotate.png: Same plot without annotations.
- traj0/: Trajectory results from compute_trajectory.py.
  - all_volumes/: Volumes along the trajectory.
  - density/: Density plots along the trajectory.
  - path.json: JSON file containing the trajectory path.
  - vol****/: Volumes at specific points along the trajectory.
- trajectory_endpoints.pkl: Endpoints used for trajectory computations.
- umap/: UMAP embedding results.
  - sns.png, sns_hex.png: Scatter plots of the embeddings.
  - kmeans_centers.png: UMAP plot showing k-means centers.
  - umap_embedding.pkl: UMAP embedding coordinates.
command.txt: Contains the command used to run the pipeline, useful for record-keeping and reproducing results.
density/: Generated by estimate_conformational_density.py.
- all_densities/: Deconvolved densities for different regularization parameters (alphas).
  - deconv_density_*.pkl: Density files for each alpha value.
- all_densities.png: Visualization of all densities across different alphas.
- deconv_density_knee.pkl: Deconvolved density corresponding to the optimal regularization parameter (knee point).
- Lcurve.png: L-curve plot used for selecting the optimal alpha.
model/: Generated by pipeline.py.
- covariance_cols.pkl: Covariance matrices of the columns.
- embeddings.pkl: Embeddings of the data.
- halfsets.pkl: Information about half-sets used in the analysis.
- params.pkl: Parameters used during the pipeline execution.
- particles_halfsets.pkl: Indices of particles in each half-set.
output/: Main output directory containing results and volumes.
- plots/: Various plots generated during the pipeline.
  - contrast_histogram.png: Histogram of estimated contrast values.
  - eigenvalues.png: Plot showing the decay of eigenvalues.
  - mean_fsc.png: Fourier Shell Correlation of the mean volume.
  - mean_variance_eigenvolume_plots.png: Combined plots of mean, variance, and eigenvolumes.
- volumes/: Volumes generated by pipeline.py.
  - mean.mrc: The mean volume reconstructed from the data.
  - mean_filt.mrc: Filtered mean volume.
  - mean_half1_unfil.mrc & mean_half2_unfil.mrc: Unfiltered mean volumes from half-sets.
  - variance*.mrc: Variance maps computed for different dimensions.
  - eigen_pos*.mrc & eigen_neg*.mrc: Positive and negative eigenvolumes representing principal components of variability.
  - mask.mrc: The solvent mask used during reconstruction.
  - focus_mask.mrc: The focus mask if provided.
  - dilated_mask.mrc: Dilated version of the mask.
run.log: Log file containing messages and errors during the pipeline execution.
trajectory1/: Generated by compute_trajectory.py.
- density/: Density plots along the trajectory.
  - density_01.png: Density plot for the first pair of dimensions.
- path.json: JSON file containing the trajectory path in latent space.
- vol0000/: Volume at the starting point of the trajectory.
  - [Contains volume files as in kmeans_center_volumes/vol****/]

Functions and Their Outputs

pipeline.py: Generates the model/ directory and the output/ directory.
- Computes the mean volume, variance maps, eigenvolumes, and saves necessary parameters and embeddings.
- Generates plots such as eigenvalues and FSC curves.
analyze.py: Creates the analysis_[zdim]_noreg/ directories.
- Performs k-means clustering, generating cluster center volumes, and computing UMAP and PCA embeddings.
- Generates contrast histograms and density plots.
compute_state.py: Generates volumes at specific points in latent space provided by the user.
- Outputs are stored in directories like vol****/ under kmeans_center_volumes/ or specified output directory.
compute_trajectory.py: Produces the traj0/ directory within analysis_[zdim]_noreg/ or a separate trajectory directory.
- Contains volumes along a trajectory in latent space, density plots, and the trajectory path (path.json).
estimate_conformational_density.py: Creates the density/ directory with deconvolved density estimations, plots, and related data.
- Computes the conformational density in the principal component space and selects the optimal regularization parameter using the L-curve method.

Copying and Visualizing Results

For visualization purposes, especially when working remotely, it's recommended to copy only the necessary output directories (e.g., output/, analysis_[zdim]_noreg/, trajectory1/, density/) to your local machine. This allows you to use visualization tools without transferring large model files.

You can then use software like ChimeraX to open and visualize the .mrc volume files found in these directories.

Visualization in jupyter notebook

You can visualize the results using this notebook, which will show a bunch of results including:

the FSC of the mean estimation, which you can interpret as an upper bound of the resolution you can expect.
decay of eigenvalues to help you pick the right zdim
and standard clustering visualization (borrowed from the cryoDRGN output).

Small test dataset

To verify that RECOVAR is installed properly and functioning correctly, you can run a small test using a synthetic dataset. The script run_test_dataset.py is already included in the repository and will:

Generate a small synthetic dataset.
Run the RECOVAR pipeline on this dataset.
Perform analysis, including clustering and embedding.
Estimate conformational density.
Compute trajectories in the latent space.
Report which steps passed or failed.
Delete the test dataset directory at the end if all steps pass.

Steps to Run the Test Script

Activate your recovar environment:
```
conda activate recovar
```
Run the test script

Navigate to the directory where RECOVAR is installed and execute:
```
python run_test_dataset.py
```
This script is located in the root directory of the RECOVAR repository.

What the Script Does

The script performs the following steps:

Generates a small synthetic dataset using make_test_dataset.py.
Runs the RECOVAR pipeline on the test dataset using pipeline.py.
Analyzes the results using analyze.py, including clustering and embedding.
Estimates the conformational density with estimate_conformational_density.py.
Computes trajectories in the latent space using compute_trajectory.py.
Reports which steps passed or failed.
Deletes the test_dataset directory at the end if all steps pass.

Verifying the Results

Success Message: If all steps complete successfully, you'll see:

Total steps passed: 6
Total steps failed: 0

All functions completed successfully!
Test dataset directory 'test_dataset' has been deleted.

Failure Notification: If any steps fail, the script will indicate which functions failed, and you can check the output for details.

Total steps passed: 5
Total steps failed: 1

The following functions failed:
- estimate_conformational_density.py

Please check the output above for details.

Notes

Verifying Outputs: You can compare the volumes generated in test_dataset/pipeline_output/output/analysis_2_noreg/kmeans_center_volumes/all_volumes with the simulated volumes in recovar/data/vol*.mrc to ensure the pipeline is producing expected results (the order may differ).
Cleanup: The script will delete the test_dataset directory at the end if all steps pass. This helps keep your workspace clean.

[Rest of the README remains unchanged.]

TLDR

A short example illustrating the steps to run the code on EMPIAR-10076. Assuming you have downloaded the code and have a GPU, the code should take less than an hour to run, and less than 10 minutes if you downsample to 128 instead (exact running time depends on your hardware). and Read above for more details:

# Downloaded poses from here: https://github.com/zhonge/cryodrgn_empiar.git
git clone https://github.com/zhonge/cryodrgn_empiar.git
cd cryodrgn_empiar/empiar10076/inputs/

# Download particles stack from here. https://www.ebi.ac.uk/empiar/EMPIAR-10076/ with your favorite method.
# My method of choice is to use https://www.globus.org/ 
# Move the data into cryodrgn_empiar/empiar10076/inputs/

conda activate recovar
# Downsample images to D=256
cryodrgn downsample Parameters.star -D 256 -o particles.256.mrcs --chunk 50000

# Extract pose and ctf information from cryoSPARC refinement
cryodrgn parse_ctf_csparc cryosparc_P4_J33_004_particles.cs -o ctf.pkl
cryodrgn parse_pose_csparc cryosparc_P4_J33_004_particles.cs -D 320 -o poses.pkl

# run recovar
python [recovar_dir]/pipeline.py particles.256.mrcs --ctf --ctf.pkl -poses poses.pkl --o recovar_test 

# run analysis
python [recovar_dir]/analysis.py recovar_test --zdim=20 

# Open notebook output_visualization.ipynb
# Change the recovar_result_dir = '[path_to_this_dir]/recovar_test' and

Note that this is different from the one in the paper. Run the following pipeline command to get the one in the paper (runs on the filtered stack from the cryoDRGN paper, and uses a predefined mask):

# Download mask
git clone https://github.com/ma-gilles/recovar_masks.git

python ~/recovar/pipeline.py particles.256.mrcs --ctf ctf.pkl --poses poses.pkl -o test-mask --mask recovar_masks/mask_10076.mrc --ind filtered.ind.pkl

The output should be the same as this notebook.

Using kernel regression with other embeddings

You can generate volumes from embedding not generated by RECOVAR using generate_from_embedding. E.g., for a cryoDRGN embedding:

python [recovar_dir/]generate_from_embedding.py particles.256.mrcs --poses poses.pkl --ctf ctf.pkl --embedding 02_cryodrgn256/z.24.pkl --o [output_dir] --target zfile.txt

$ python generate_from_embedding.py -h

usage: generate_from_embedding.py [-h] -o OUTDIR [--zdim ZDIM] --poses POSES --ctf pkl [--ind PKL] [--uninvert-data UNINVERT_DATA]
                                [--datadir DATADIR] [--n-images N_IMAGES] [--padding PADDING] [--halfsets HALFSETS]
                                [--noise-model NOISE_MODEL] [--Bfactor BFACTOR] [--n-bins N_BINS] --embedding EMBEDDING --target
                                TARGET [--zdim1]
                                particles

positional arguments:
particles             Input particles (.mrcs, .star, .cs, or .txt)

optional arguments:
-h, --help            show this help message and exit
-o OUTDIR, --outdir OUTDIR
                        Output directory to save model
--zdim ZDIM           Dimensions of latent variable. Default=1,2,4,10,20
--poses POSES         Image poses (.pkl)
--ctf pkl             CTF parameters (.pkl)
--Bfactor BFACTOR     0
--n-bins N_BINS       number of bins for kernel regression
--embedding EMBEDDING
                        Image embeddings zs (.pkl), e.g. 00_cryodrgn256/z.24.pkl if you want to use a cryoDRGN embedding.
--target TARGET       Target zs to evaluate the kernel regression (.txt)
--zdim1               Whether dimension 1 embedding is used. This is an annoying corner case for np.loadtxt...

Dataset loading:
--ind PKL             Filter particles by these indices
--uninvert-data UNINVERT_DATA
                        Invert data sign: options: true, false (default)
--datadir DATADIR     Path prefix to particle stack if loading relative paths from a .star or .cs file
--n-images N_IMAGES   Number of images to use (should only use for quick run)
--padding PADDING     Real-space padding
--halfsets HALFSETS   Path to a file with indices of split dataset (.pkl).
--noise-model NOISE_MODEL
                        what noise model to use. Options are radial (default) computed from outside the masks, and white computed by
                        power spectrum at high frequencies

Using the source code

I hope some developers find parts of the code useful for their projects. See this notebook for a short tutorial. (OUT OF DATE, see cryoJAX for a much better documented JAX cryo-EM code.)

Some of the features which may be of interest:

The basic building block operations of cryo-EM efficiently, in batch and on GPU: shift images, slice volumes, do the adjoint slicing, apply CTF. See recovar.core. Though I have not tried it, all of these operations should be differentiable thus you could use JAX's autodiff.
A heterogeneity dataset simulator that includes variations in contrast, realistic CTF and pose distribution (loaded from real dataset), junk proteins, outliers, etc. See recovar.simulator.
A code to go from atomic positions to volumes or images. Does not run on GPU. Thanks to prody, if you have an internet connection, you can generate the volume from only the PDB ID. E.g., you can do recovar.simulate_scattering_potential.generate_molecule_spectrum_from_pdb_id('6VXX', 2, 256) to generate the volume of the spike protein with voxel size 2 on a 256^3 grid. Note that this code exactly evaluates the Fourier transform of the potential, thus it is exact in Fourier space, which can produce some oscillations artifact in the spatial domain. Also see cryoJAX
Some other features that aren't very well separated from the rest of the code, but I think could easily be stripped out: trajectory computation recovar.trajectory, per-image mask generation recovar.covariance_core, regularization schemes recovar.regularization, various noise estimators recovar.noise.
Some features that are not there (but soon, hopefully): pose search, symmetry handling.

Tomography

I am developping the tomography extension. You can try it out by passing the the --tilt-series, --tilt-series-ctf=v2, --angle-per-tilt=[insert angle here] to pipeline.py. The input should be identical to cryoDRGN-ET (see here).

E.g, on the M-file:

$ python [recovar_dir]/pipeline.py M_particles.star --ctf ctf.pkl --poses pose.pkl -o v2_nocont_$ntilts --datadir=128 --mask=path_to_mask.mrc --tilt-series-ctf=v2  --ntilts=10 --tilt-series  --angle-per-tilt=3.0 --dose-per-tilt=2.93

You can use all tilts by not passing the argument --ntilts.

Limitations

Symmetry: there is currently no support for symmetry. If you got your poses through symmetric refinement, it will probably not work. It should probably work if you make a symmetry expansion of the particle stack, but I have not tested it.
Memory: RECOVAR uses a lot of memory by default. For a stack of images of size 256, you need approximately 200 GB + size of dataset. You can also use the --lazy option, which will do lazy loading, in which case you need a little more than 200. If you run out of memory, you can use the --low-memory-option, in which case you need 60GB. If even that fails, you can try --very-low-memory-option. Finally, you can downsample the data and run RECOVAR. You can generate states from the full resolution images even if you ran the RECOVAR pipeline on downsampled images by using compute_state.py and passing the --particles argument. The memory requirement for the compute_state is approximately n_bins * volume_size * 32 bytes where n_bins is 50 by default (~26 GB for size 256 images).
ignore-zero-frequency: I haven't thought much about the best way to do this. I would advise against using it for now.
Other ones, probably?: if you run into issues, please let me know.

Acknowledgement

A lot of code in RECOVAR is based on cryoDRGN (https://cryodrgn.cs.princeton.edu/) and RELION (https://relion.readthedocs.io/en/release-5.0/). Hopefully, it is clear from the code naming/comments. Much of this documentation is generated using chatGTP4.

Citation

If you use this software for analysis, please cite:

@article{gilles2023bayesian,
  title={A Bayesian Framework for Cryo-EM Heterogeneity Analysis using Regularized Covariance Estimation},
  author={Gilles, Marc Aurele T and Singer, Amit},
  journal={bioRxiv},
  pages={2023--10},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}

Contact

You can reach me (Marc) at gilles@princeton.edu with questions or comments, or open an issue on Github.

Dataset from manuscript

Input files to replicate the results in the manuscript can be found in this dropbox.

Name		Name	Last commit message	Last commit date
Latest commit History 194 Commits
recovar		recovar
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
analyze.py		analyze.py
compute_embedding.py		compute_embedding.py
compute_state.py		compute_state.py
compute_trajectory.py		compute_trajectory.py
estimate_conformational_density.py		estimate_conformational_density.py
estimate_stable_states.py		estimate_stable_states.py
ewald_test_cg.ipynb		ewald_test_cg.ipynb
extract_image_subset.py		extract_image_subset.py
extract_image_subset_from_kmeans.py		extract_image_subset_from_kmeans.py
make_spike_datasets.py		make_spike_datasets.py
make_test_dataset.py		make_test_dataset.py
output_visualization.ipynb		output_visualization.ipynb
output_visualization_empiar10076.ipynb		output_visualization_empiar10076.ipynb
output_visualization_simple_synthetic.ipynb		output_visualization_simple_synthetic.ipynb
output_visualization_synthetic-Copy2.ipynb		output_visualization_synthetic-Copy2.ipynb
pipeline.py		pipeline.py
recovar_coding_tutorial.ipynb		recovar_coding_tutorial.ipynb
recovar_install_requirements.txt		recovar_install_requirements.txt
run_test_dataset.py		run_test_dataset.py

License

ma-gilles/recovar

Folders and files

Latest commit

History

Repository files navigation

RECOVAR: Regularized covariance estimation for cryo-EM heterogeneity analysis

Installation

I. Preprocessing

NOTE: Until I fix versioning issues, you need to install cryoDRGN and RECOVAR in two different conda environments.

cryoDRGN Installation

1. Preprocess image stack

2. Parse image poses from a consensus homogeneous reconstruction

3. Parse CTF parameters from a .star/.cs file

II. Specifying a real-space mask

III. Running the RECOVAR Pipeline

Required Arguments:

Commonly Used Optional Arguments:

IV. Analyzing results

Usage of analyze.py

Generating Volumes at Specific Points in Latent Space

Generating and Analyzing Trajectories in Latent Space

Positional Arguments:

Required Arguments:

Endpoint Specification (One of the following is required):

Optional Arguments:

Examples:

Additional Notes:

Help Message:

Extracting image subset based on volumes

Overview

Usage

Parameters

Example Commands

Notes

Command: extract_image_subset_from_kmeans

Overview

Usage

Parameters

Example Commands

Notes

Command: estimate_conformational_density.py

Overview

Usage

Example Commands

Notes

Output

V. Visualizing results

Output Structure

Directory and File Descriptions

Functions and Their Outputs

Copying and Visualizing Results

Visualization in jupyter notebook

Small test dataset

Steps to Run the Test Script

What the Script Does

Verifying the Results

Notes

TLDR

Using kernel regression with other embeddings

Using the source code

Tomography

Limitations

Acknowledgement

Citation

Contact

Dataset from manuscript

Other relevant github repos:

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Usage of `analyze.py`

Command: `extract_image_subset_from_kmeans`

Command: `estimate_conformational_density.py`

Packages