
Command Line Interface for Extraction and Modeling

Sherry edited this page Jun 28, 2022 · 47 revisions

Overview

Besides the Jupyter notebooks, MoSeq also offers a series of Command Line Interface (CLI) tools. You can access a short description of each tool by running moseq2-extract --help, moseq2-pca --help, moseq2-model --help, and moseq2-viz --help in the Terminal. You can use --help to access the available options for a specific command, e.g. moseq2-extract extract --help.

Project Setup

If you are using Conda, and the environment name is moseq2-app, please run conda activate moseq2-app to activate the environment. If you are using the Docker container, please make sure your Docker is up and running, and you should see moseq next to your terminal prompt.

You can verify that all MoSeq modules are installed by running

moseq2-extract --version
# moseq2-extract, version 1.1.2
moseq2-pca --version
# moseq2-pca, version 1.1.3
moseq2-model --version
# moseq2-model, version 1.1.2
moseq2-viz --version
# moseq2-viz, version 1.2.0

Currently Supported Depth File Extensions

We currently support .dat, .tar.gz, .avi and .mkv. You can read more about these depth data extensions here.

Directory Structure

Each MoSeq project is contained within a base directory. When using the CLI, you can list the contents of your current working directory with ls to find the base directory and change directories with cd. If your working directory is not <base_dir>, you may want to pass <base_dir> as both the input directory and output directory in the CLI commands to keep the output organized. At this stage, the base directory should contain a separate subfolder for each depth recording session, as shown below, and you can read more about it here:

.
└── <base_dir>/
    ├── session_1/
    │   ├── depth.dat
    │   ├── depth_ts.txt
    │   └── metadata.json
    ...
    └── session_n/
        ├── depth.dat
        ├── depth_ts.txt
        └── metadata.json

Note: if your data was acquired using an Azure Kinect, you will not have depth_ts.txt or metadata.json in your session subfolders. MoSeq will automatically generate the necessary files. The directory structure would be the following:

.
└── <base_dir>/
    ├── session_1/
    │   └── session_1.mkv
    ...
    └── session_n/
        └── session_n.mkv
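As a quick sanity check, the expected layout can be sketched and verified programmatically. This illustrative Python snippet is not part of the MoSeq CLI; it builds a stand-in layout in a temp directory and lists each session's files, mirroring the tree above:

```python
import tempfile
from pathlib import Path

# Build a stand-in <base_dir> layout in a temp directory and list each
# session's files, mirroring the tree above (illustrative sketch only).
base_dir = Path(tempfile.mkdtemp())
for name in ("session_1", "session_2"):
    session = base_dir / name
    session.mkdir()
    for fname in ("depth.dat", "depth_ts.txt", "metadata.json"):
        (session / fname).touch()  # create empty stand-in files

layout = {
    s.name: sorted(f.name for f in s.iterdir())
    for s in sorted(base_dir.iterdir())
}
print(layout)
```

A check like this can catch a missing depth_ts.txt or metadata.json before you run the extraction commands below.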
    

Raw Data Extraction

Generate config.yaml

config.yaml is the configuration file that holds all configurable parameters for all steps in the MoSeq pipeline, such as extraction parameters and PCA parameters. The parameters in config.yaml are used to set relevant parameters in CLI commands.

To generate the initial default config.yaml file in the current working directory, run the following command.

moseq2-extract generate-config # generates a config.yaml file in the current directory

Alternatively, to generate the config.yaml in a specified location, pass the -o flag.

moseq2-extract generate-config -o <specific path>/config.yaml # generates a config.yaml file in provided subfolder (if it exists)

Download the Flip Classifier

MoSeq2 uses a Random Forest flip classifier to guarantee that the mouse's nose is always pointed to the right after cropping and rotationally aligning the depth videos.

The flip classifiers we provide are trained for experiments run with C57BL/6 mice using Kinect v2 depth cameras. We provide three kinds of pre-trained flip classifiers: large mice with fibers, adult male C57BL/6 mice, and mice with Inscopix cables.

To download a pre-trained flip classifier, run the following command. If config.yaml is in your current working directory, the command will add the downloaded flip classifier's path to the config.yaml file.

# The command below will prompt for input to indicate which one of 3 flip classifiers to download
moseq2-extract download-flip-file 

Alternatively, you can pass the path to your config file to the command:

# The command below will prompt for input to indicate which one of 3 flip classifiers to download
moseq2-extract download-flip-file <specific path>/config.yaml

Train Your Own Flip Classifier

If your dataset does not work well with our pre-trained flip classifiers, we provide a flip-classifier training notebook. After using this notebook, add the absolute path of your custom classifier to the flip_classifier field in the config.yaml file.

(Optional) Using the Interactive Arena Detection Tool to find Extraction Parameters

You can use the Interactive Arena Detection Tool in the MoSeq2 Extract Modeling Notebook to interactively find extraction parameters such as depth range, dilation iterations, and mouse height, and to preview the detected arena and sample extractions. Before running the cell for the interactive tool, run the Setup/Restore cell to set up the progress variables.

Running the Interactive Arena Detection Tool is optional. If it is not run, the default parameters in the config.yaml file will be used in the extraction step. You can find the file structure after running this tool here.

[Screenshot: Interactive Arena Detection Tool]

Instructions:

  • Run the following cell to initialize the Arena detection widget. The cell renders a control panel to configure parameters for detecting the arena.
  • By default the widget selects the first session in your dataset, sorted alphanumerically.
  • Adjust the depth range for detecting the floor of the arena.
  • Adjust the dilation iterations to include more of the wall of the arena.
  • Click the Compute arena mask button to compute and display the mask for the detected floor given the parameters. The displayed mask won't recompute and refresh when you change the parameters unless you click the button.
  • Check the "Show advanced arena mask parameters" checkbox to display more advanced arena mask parameters. You can find more information about these parameters by running moseq2-extract extract --help, and you can find documentation for the CLI here.
  • If you like the arena mask, click the Compute extraction button to extract a subset of the data.
  • Once you are satisfied with the extraction, click the Save parameters... button to move on to the next session and save this session's parameters.

Extract Data

To extract data, pass the path to a depth.dat file in a session subfolder and specify the path to config.yaml using the --config-file option. The extraction step uses the parameters specified in config.yaml and the path to the flip classifier. If you used the interactive arena detection tool to find the parameters for each session, you should still use --config-file ./<base_dir>/config.yaml to specify the path to the config file.

moseq2-extract extract ./<base_dir>/session_1/depth.dat --config-file ./<base_dir>/config.yaml 

If everything worked, you should see an extraction movie that looks like the following video (within reason).

You can run the following command to extract the sessions sequentially. If your depth file extension is not .dat, you can specify your file extension using the --extensions flag, e.g. --extensions .avi. If you used the interactive arena detection tool to find the parameters for each session, use --config-file ./<base_dir>/session_config.yaml instead of --config-file ./<base_dir>/config.yaml.

moseq2-extract batch-extract <base_dir> --config-file ./<base_dir>/config.yaml

You can find the file structure after data extraction here.


Slurm

If you are running batch-extract locally, the sessions will be extracted sequentially and the process can be fairly slow. You can extract the sessions in parallel with Google Compute Engine using Slurm. If you have a GCE Slurm cluster running (or Slurm locally), you can use --cluster-type slurm to generate a bash script that runs an extraction job for each session. Before running the bash script, don't forget to activate the virtual environment the MoSeq packages are installed in. You can find more information about the CLI options by running moseq2-extract batch-extract --help.

ssh $slurm_login_node # login to a slurm cluster
moseq2-extract batch-extract <base_dir> --config-file ./<base_dir>/config.yaml --cluster-type slurm

After the bash script is generated, run the following command to run the script:

conda activate moseq2-app #activate virtual environment
bash ./<base_dir>/extract_out.sh

Aggregate Results and Generate moseq2-index.yaml

Once all of your raw data recordings have been extracted and are of good quality, you should consolidate all the extraction output files in a single folder called aggregate_results/ to keep track of the training data. The command below recursively searches the current working directory for fully extracted recordings, copies the files contained within their respective proc/ directories into a new aggregate_results folder, and generates moseq2-index.yaml (more information).

To aggregate your extraction results and generate the corresponding Index File, run the following command:

# assuming you are in the same working directory as the previous step
moseq2-extract aggregate-results --input-dir <base_dir> --output-dir <base_dir>
# assuming you are in the <base_dir> where the session subfolders live
moseq2-extract aggregate-results

The copied files in the aggregate_results/ will be named according to the variables in the recording's metadata.json file with the following naming scheme: {start_time}_{session_name}_{subject_name}.
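The naming scheme can be sketched as a simple string format. The metadata keys below are assumptions for illustration: SessionName and SubjectName appear elsewhere on this page, while StartTime is a guess at the start-time field's name.

```python
# Illustrative sketch of the aggregate_results/ naming scheme
# ({start_time}_{session_name}_{subject_name}). The metadata keys are
# assumptions: SessionName and SubjectName appear elsewhere on this page;
# StartTime is a guess at the start-time field.
metadata = {
    "StartTime": "2022-06-28_10-15-00",
    "SessionName": "saline",
    "SubjectName": "000069",
}

name = "{}_{}_{}".format(
    metadata["StartTime"], metadata["SessionName"], metadata["SubjectName"]
)
print(name)  # 2022-06-28_10-15-00_saline_000069
```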

Assign Groups

Assign Groups in moseq2-index.yaml

The session information from metadata.json is stored in the metadata field in moseq2-index.yaml for each session, and a unique key (UUID) is given to each session. You can assign group labels to sessions for analyses comparing different cohorts or experimental conditions, and the labels will also be stored in moseq2-index.yaml.

moseq2-viz add-group is used to specify a group for each session, and it should be run from the directory that contains aggregate_results. For example, if aggregate_results is a folder in the base directory, your current working directory should be the base directory.

In this command, the -k flag specifies the keyword field in the metadata to look into, the -v flag specifies the value to look for, and the -g flag specifies the group name. You can find more instructions for the command by running moseq2-viz add-group --help.

For example, the following command will assign group name saline to all the sessions whose SessionName field in their metadata matches the value saline.

moseq2-viz add-group -k SessionName -v saline -g saline moseq2-index.yaml

You can also specify multiple values to look for. For example, the following command will assign group name saline to all the sessions whose SubjectName field in their metadata matches one of the specified values.

moseq2-viz add-group -k SubjectName -v 000069 -v 000077 -v 000086 -g saline moseq2-index.yaml
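After assigning groups, you can spot-check the assignments by loading moseq2-index.yaml (e.g. with yaml.safe_load) and counting sessions per group. The parsed structure sketched below, with one entry per session under a "files" list carrying its UUID and group label, is an illustrative assumption about the file layout:

```python
from collections import Counter

# Illustrative stand-in for the result of parsing moseq2-index.yaml;
# the "files"/"group" layout is an assumption for this sketch.
index = {
    "files": [
        {"uuid": "aaa-111", "group": "saline"},
        {"uuid": "bbb-222", "group": "saline"},
        {"uuid": "ccc-333", "group": "default"},
    ]
}

# Count how many sessions carry each group label.
counts = Counter(entry["group"] for entry in index["files"])
print(dict(counts))  # {'saline': 2, 'default': 1}
```

Sessions still labeled "default" are ones the add-group commands did not match.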

(Optional) Interactive Assign Group Tool

You can use the interactive assign group tool in the MoSeq2 Extract Modeling Notebook to assign groups to sessions. Before running the cell for the interactive tool, run the Setup/Restore cell to set up progress variables.

The tool is intended for specifying groups for the sessions interactively. The group field in moseq2-index.yaml stores the group labels, so sessions can be grouped by experimental design for downstream analyses comparing different cohorts or experimental conditions. Initially, all sessions are labeled "default", and the Group Setter tool below is used to assign group labels to sessions. This step requires that all your sessions have a metadata.json file containing a session name.

Instructions:

  • Run the following cell to launch the Group Setter tool.
  • Click on a column name to sort the table by values in the column.
  • Click the filter button to filter the values in a column.
  • Click on the session to select the session. To select multiple sessions, click the sessions while holding the CTRL/COMMAND key, or click the first and last entry while holding the SHIFT key.
  • Enter the group name in the text field and click Set Group to update the group column for the selected sessions.
  • Click the Update Index File button to save current group assignments.

PCA

Fit PCA for dimensionality reduction

Fit a PCA to the extracted data to determine the pose trajectories. This process computes Principal Components (PCs) that explain the largest possible percentage of the variance in your data.

Once all of your recordings have been correctly extracted and aggregated into the aggregate_results/ folder, run the following command to fit a PCA using input from aggregate_results/ and write the PCA output to _pca/.

Upon completion, a new folder, _pca/, will have been created containing the following files:

  • pca.h5: HDF5 file that contains the principal components.
  • pca.yaml: YAML file that contains the configuration variables and metadata used to fit the PCA.
  • pca_components.pdf/png: Image containing a grid of 2D images representing each computed Principal Component.
  • pca_scree.pdf/png: Scree plot indicating the number of computed PCs that explain 90% of the variance in the data.
# Assuming you are in the same directory as aggregate_results
moseq2-pca train-pca -i aggregate_results/ -o _pca/ --config-file config.yaml 

Note: You may see a warning message that says: distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: X GB -- Worker memory limit: Y GB. This message doesn't mean there is an error, and you can ignore it as long as the process is not terminated. If the process gets killed or terminated, consider adding --nworkers 1 to limit the number of workers to 1.

You should see pca_components.png and pca_scree.png in _pca/ along with pca.h5, which stores the results of the computation. Here's a typical example of the first 50 components:

And the corresponding scree plot:


Slurm

If you run train-pca locally, the process can be fairly slow. You can train the PCA in parallel with Google Compute Engine using Slurm. If you have a GCE Slurm cluster running (or Slurm locally), you can use --cluster-type slurm to train the PCA with Slurm. You can find more information about the CLI options by running moseq2-pca train-pca --help.

ssh $slurm_login_node # login to a slurm cluster
srun --pty --mem=30G -n 5 bash # open an interactive node
# Assuming you are in the same directory as aggregate_results and the partition you intend to use is short
moseq2-pca train-pca -i aggregate_results/ -o _pca/ --cluster-type slurm -q short --config-file config.yaml 

Computing Principal Component Scores

Apply the computed PCs to the extracted data to output the dimensionality-reduced data points that will be used to train the AR-HMM in the following pipeline step. Run the following command to apply the PC coefficients and compute the principal component scores.

Upon completion, a pca_scores.h5 file will be created in the same directory as the files generated from train-pca, for example _pca.

# assuming you are in the same directory as _pca and the data aggregate_results
moseq2-pca apply-pca -i aggregate_results/ -o _pca/ --config-file config.yaml 

You can find the file structure after the PCA steps here.

Compute Model-free Syllable Changepoints

Compute the distribution of block durations of the behaviors as captured by the PC scores. The model-free changepoints are used for comparison with your AR-HMM model fits.

Computing the Model-Free Changepoints of your dataset requires both the pca.h5 and pca_scores.h5. Run the following command to compute the Model-Free Changepoints captured by your Principal Components.

Note: Please make sure you specify -i aggregate_results in the command so that the sessions that go into the computation have no duplicates; otherwise the command will fail.

moseq2-pca compute-changepoints -i aggregate_results/ --config-file config.yaml 

You can find the file structure after computing the model-free changepoints here.

The changepoint distribution is typically a left-skewed distribution that has a median block duration of ~0.3 seconds.

Below is an example of an outputted changepoint distribution.

AR-HMM Modeling

Train AR-HMM

MoSeq fits an AR-HMM to the PC scores from the previous step to generate syllables. You can fit different variations of AR-HMMs to your input data using moseq2-model, and you can access information about the CLI flags associated with the command by running moseq2-model learn-model --help.

Below are examples of how to train the different model types. Each command specifies _pca/pca_scores.h5 as the input and model_dir/my_model.p as the output model. --num-iter specifies the number of times to resample the model; the default is 100. 100 iterations are good enough to explore the model parameters, but we recommend setting --num-iter to 1000 to get a more accurate model once you have decided on a set of parameters.


Non-Robust VS Robust

In non-robust models, the noise in the autoregressive process is z-distributed (Gaussian), whereas in robust models, the noise is t-distributed. Non-robust models generate fewer syllables than robust models.

Single Transition VS Separate Group Transition

Single transition means all groups will have one transition matrix and separate group transition means different groups will have different transition matrices. If the size of the data is small, we don't recommend modeling your data with separate group transitions.
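The distinction can be sketched with toy transition matrices. The 3-state matrices and group names below are made-up illustrative values, not real model output; the only structural requirement shown is that each matrix is row-stochastic (every row sums to 1):

```python
import numpy as np

# A single shared transition matrix: all groups use this one (toy values).
shared_trans = np.array([[0.8, 0.1, 0.1],
                         [0.2, 0.7, 0.1],
                         [0.1, 0.2, 0.7]])

# With --separate-trans, each group gets its own matrix (toy values).
separate_trans = {
    "saline": shared_trans.copy(),
    "drug":   np.array([[0.6, 0.3, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.2, 0.2, 0.6]]),
}

# Every row of a transition matrix sums to 1.
print(all(np.allclose(m.sum(axis=1), 1.0) for m in separate_trans.values()))  # True
```

Separate matrices add parameters per group, which is why small datasets are better served by a single shared transition matrix.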

Note: To attain accurate results, we recommend training at least 100 models, each with at least 1000 iterations by setting --num-iter 1000. After all the models are trained, you can use moseq2-viz get-best-model to find the best model that matches the PC changepoints.


  1. Non-Robust (z-distributed) Single Transition Graph AR-HMM: the noise in the autoregressive process is Gaussian and all groups are modeled with one single transition matrix.
# assuming you are in the same directory as _pca
moseq2-model learn-model _pca/pca_scores.h5 model_dir/my_model.p --index ./moseq2-index.yaml
  2. Non-Robust (z-distributed) Separate Group Transition Graph AR-HMM
# assuming you are in the same directory as _pca
moseq2-model learn-model _pca/pca_scores.h5 model_dir/my_model.p --index ./moseq2-index.yaml --separate-trans
  3. Robust (t-distributed) Single Transition Graph AR-HMM
# assuming you are in the same directory as _pca
moseq2-model learn-model _pca/pca_scores.h5 model_dir/my_model.p --index ./moseq2-index.yaml --robust
  4. Robust (t-distributed) Separate Group Transition Graph AR-HMM
# assuming you are in the same directory as _pca
moseq2-model learn-model _pca/pca_scores.h5 model_dir/my_model.p --index ./moseq2-index.yaml --robust --separate-trans

The most important free parameter is kappa, which corresponds to the model's prior probability distribution for outputted syllable durations. By default, kappa is set to the total number of frames in the dataset. Increasing the value of kappa will increase the outputted syllable durations by the model, and vice versa. To find the best kappa value that matches the PC score changepoints, you can use moseq2-model kappa-scan to run models with a series of kappa values. You can find more information about the CLI options by running moseq2-model kappa-scan --help.

You can find the file structure after fitting AR-HMM model here.


kappa-scan Using Slurm

Currently, we support running kappa-scan with Google Compute Engine using Slurm. If you have a GCE Slurm cluster running (or Slurm locally), you can use --cluster-type slurm to generate a bash script that runs a series of models with different kappa values. You can specify the minimum kappa value using the --min-kappa flag (e.g. --min-kappa 10000) and the maximum kappa value using the --max-kappa flag (e.g. --max-kappa 10000000). You can find more information on scanning kappa and best practices for running models in the analysis tips.
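One common way to sample kappa values between a minimum and a maximum is on a log scale, since useful kappa values span orders of magnitude. The sketch below is illustrative only and is not how moseq2-model necessarily spaces its scan; the values match the --min-kappa / --max-kappa examples above:

```python
import numpy as np

# Illustrative log-spaced kappa scan between the example min and max values.
min_kappa, max_kappa, n_models = 1e4, 1e7, 4

kappas = np.logspace(np.log10(min_kappa), np.log10(max_kappa), n_models)
print(kappas.tolist())  # [10000.0, 100000.0, 1000000.0, 10000000.0]
```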

The bash script will be generated in the specified model directory. Before running the bash script, don't forget to activate the virtual environment the MoSeq packages are installed in.

ssh $slurm_login_node # login to a slurm cluster
moseq2-model kappa-scan _pca/pca_scores.h5 model_dir --cluster-type slurm

After the bash script is generated, run the following command to run the script:

conda activate moseq2-app #activate the virtual environment
bash ./model_dir/out.sh

When all the models finish running, you can run moseq2-viz get-best-model to find the model that best fits the PC changepoints.

Note that the model results have -5 prepended to the actual labels to account for the number of lags in the model. Thus, if you set nlags to 3 (the default value), the first three entries of the label array will be -5, as in the following example:

import joblib
results = joblib.load('my_model.p')
print(results['labels'])

[-5, -5, -5, 0, 0, 48, 57, 57] 
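When analyzing the labels yourself, you will usually want to drop these lag placeholders first. A minimal sketch, using the example label array above:

```python
# Drop the leading lag placeholders (-5) before counting or analyzing
# syllable labels (example array from above).
labels = [-5, -5, -5, 0, 0, 48, 57, 57]

syllables = [label for label in labels if label >= 0]
print(syllables)  # [0, 0, 48, 57, 57]
```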

Get Best Fit Model

This command will save a plot model_pca_changepoints.png/pdf showing the comparative changepoint distribution curves between the trained model and the PCA scores changepoints.

best-model

The command supports comparison with respect to two objectives: duration and jsd. duration finds the model whose median syllable duration best matches that of the principal components' changepoints. jsd finds the model whose distribution of syllable durations best matches that of the principal components' changepoints.
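The jsd objective compares distributions with the Jensen-Shannon divergence (smaller means a closer match). The sketch below implements a textbook JSD for illustration, not the exact computation inside moseq2-viz; the histograms are made-up probabilities:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2 log, so the result is in [0, 1])."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability bins.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Made-up duration histograms for a model and the PC changepoints.
model_durations = [0.10, 0.40, 0.30, 0.20]
pc_durations = [0.15, 0.35, 0.30, 0.20]
print(js_divergence(model_durations, pc_durations))
```

Across candidate models, the one with the smallest divergence from the changepoint duration distribution would be selected.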

If there are multiple models in the input folder, the output figure will plot multiple dashed distribution curves representing the distributions of unselected models and 2 solid distribution curves showing the "best"/chosen model and the principal components' changepoint durations.

moseq2-viz get-best-model model_dir/ _pca/changepoints.h5 model_pca_changepoints

Additional Resources

moseq2-extract [Repository][Documentation]

moseq2-pca [Repository][Documentation]

moseq2-model [Repository][Documentation]

moseq2-viz [Repository][Documentation]
