
Directory Structures and YAML Files in the MoSeq Pipeline


Overview

Project setup and directory structures

Currently Supported Depth File Extensions

The currently accepted depth data extensions are:

  • .dat (raw depth files from our kinect2 data acquisition software)
  • .tar.gz (compressed depth files from our kinect2 data acquisition software)
  • .avi (compressed depth files from the moseq2-extract CLI)
  • .mkv (generated from Microsoft's recording software for the Azure Kinect)

The kinect2nidaq acquisition software produces 3-5 files after a recording:

  • depth.dat
  • depth_ts.txt
  • metadata.json
  • (optionally, if the Nidaq data stream box is checked) nidaq.dat
  • (optionally, if the RGB stream box is checked) RGB.mp4

Contents of the depth.dat file

The depth.dat file is a 3D depth video stored in raw byte form. Each pixel of each movie frame is a little-endian unsigned 16-bit integer (uint16) representing the distance from the camera, in millimeters.
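
If you ever need to read the raw depth video outside of MoSeq's own tools, the frames can be loaded directly with NumPy. Below is a minimal sketch; the 512x424 frame size is an assumption based on the Kinect v2's depth resolution, so verify it against the video parameters in your metadata.json.

import numpy as np

# Assumed Kinect v2 depth resolution; verify against metadata.json.
WIDTH, HEIGHT = 512, 424

# Memory-map the raw little-endian uint16 stream and reshape it into frames.
depth = np.memmap("session_1/depth.dat", dtype="<u2", mode="r")
n_frames = depth.size // (WIDTH * HEIGHT)
frames = depth[: n_frames * WIDTH * HEIGHT].reshape(n_frames, HEIGHT, WIDTH)

print(frames.shape, frames[0].max(), "mm")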

Contents of the depth_ts.txt file

The depth_ts.txt file records the timestamps of each video frame in plain text format. The file has two columns separated by a single space. The first column contains the hardware timestamps of the camera in milliseconds, while the second column contains timestamps from the NIDAQ if you enabled data capture from it; otherwise, the second column is populated with zeros. The MoSeq analysis pipeline only uses the first column.
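
As a quick sanity check on a recording, the timestamps can be loaded with NumPy, keeping only the first column:

import numpy as np

# Two whitespace-separated columns: camera timestamps (ms) and NIDAQ timestamps.
ts = np.loadtxt("session_1/depth_ts.txt")
camera_ms = ts[:, 0]  # MoSeq only uses this first column

# Estimate the frame rate from the median inter-frame interval.
print("frames:", len(camera_ms), "~fps:", 1000.0 / np.median(np.diff(camera_ms)))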

Contents of the metadata.json file

The metadata.json file contains the following information in JSON format:

  • mouse name
  • session name
  • time of the recording
  • NIDAQ-specific parameters (not important for typical behavioral recordings)
  • video-specific parameters (i.e., resolution, data type)
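
To check a session's metadata before processing, the file can be read with Python's json module. A minimal sketch; the key names shown are assumptions based on the acquisition metadata copied into the extraction hdf5 file (see the listing at the end of this page), so verify them on your own files:

import json

with open("session_1/metadata.json") as f:
    metadata = json.load(f)

# Key names mirror /metadata/acquisition in the extraction hdf5 file; verify.
print(metadata.get("SubjectName"), metadata.get("SessionName"), metadata.get("StartTime"))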

We recommend recording more than 10 hours of depth video (~1 million frames at 30 frames per second) to ensure quality MoSeq models.

Directory Structures

Each MoSeq project is contained within a base directory.

To better organize the extraction, modeling, and analysis results, you can copy the MoSeq notebooks to the base directory and cd into the base directory before using the notebooks.

When using the CLI, you should be able to see the base directory with ls from your current working directory, and you can cd to change directories as needed. To better organize the output, you may want to specify <base_dir> as both the input directory and the output directory in the CLI commands if your working directory is not <base_dir>.

At the beginning of the MoSeq pipeline, the base directory should contain separate subfolders for each depth recording session. The directory structure is as shown below:

.                   ** current working directory
└── <base_dir>/     ** base directory with all depth recordings
    ├── session_1/  ** - the folder containing all of a single session's data
    ├   ├── depth.dat        ** depth data - the recording itself
    ├   ├── depth_ts.txt     ** timestamps - csv/txt file of the frame timestamps (2 columns, recording timestamps in ms and nidaq timestamps)
    ├   └── metadata.json    ** metadata - json file that contains the rodent's info (group, subjectName, etc.)
    ...
    ├── session_n/
    ├   ├── depth.dat
    ├   ├── depth_ts.txt
    └── └── metadata.json

After generating progress.yaml and config.yaml, and downloading the Flip Classifier File

The first time you run the MoSeq2 Extract Modeling Notebook on your data, the notebook generates the YAML files needed by the analysis pipeline. You can find more information in the section describing YAML files in the MoSeq pipeline below. After running the generate progress.yaml cell, a progress.yaml file will be added to the base directory if the file doesn't already exist.

MoSeq uses a config.yaml file to hold all the configurable parameters in the pipeline. After running the generate config.yaml cell, a config.yaml file will be added to the base directory.

MoSeq uses a flip classifier to orient the mouse's head to always point right during the extraction. After running the download flip classifier cell, a file with .pkl extension will be added to the base directory.

After running the generate progress.yaml, generate config.yaml, and download flip classifier cells, the directory structure is as shown below:

.                   ** current working directory
└── <base_dir>/
    ├── config.yaml  ** - NEW FILE -
    ├── progress.yaml  ** - NEW FILE -
    ├── flip_classifier_k2_c57_10to13weeks.pkl  ** - NEW FILE -
    ├── session_1/ 
    ├   ├── depth.dat        
    ├   ├── depth_ts.txt     
    ├   └── metadata.json    
    ...
    ├── session_n/ 
    ├   ├── depth.dat
    ├   ├── depth_ts.txt
    └── └── metadata.json

After running the Interactive Arena Detection Tool

Running the Interactive ROI Detection Tool in the MoSeq2 Extract Modeling Notebook is an optional step. After you run the tool, the notebook generates a session_config.yaml file in the base directory that is later used in the extraction step. The directory structure is as shown below:

.                   ** current working directory
└── <base_dir>/
    ├── config.yaml
    ├── session_config.yaml  ** - NEW FILE -
    ├── progress.yaml
    ├── flip_classifier_k2_c57_10to13weeks.pkl
    ├── session_1/ 
    ├   ├── depth.dat
    ├   ├── depth_ts.txt
    ├   └── metadata.json
    ...
    ├── session_n/ 
    ├   ├── depth.dat
    ├   ├── depth_ts.txt
    └── └── metadata.json

After extracting the data

A folder called proc that contains all the extraction results is generated within each session sub-folder. The proc folder contains roi.tiff, first_frame.tiff, bground.tiff, results_00.yaml, results_00.h5 and results_00.mp4.

.                   ** current working directory
└── <base_dir>/
    ├── config.yaml
    ├── session_config.yaml
    ├── progress.yaml
    ├── flip_classifier_k2_c57_10to13weeks.pkl
    ├── session_1/
    ├   ...
    ├   └── proc/  ** - NEW FOLDER -
    ├   ├   ├── roi.tiff          ** the detected arena
    ├   ├   ├── first_frame.tiff  ** the first frame of the recording
    ├   ├   ├── bground.tiff      ** the background of the recording
    ├   ├   ├── results_00.yaml   ** .yaml file storing extraction parameters
    ├   ├   ├── results_00.h5     ** .h5 file storing extraction
    ├   └   └── results_00.mp4    ** extracted video
    └── session_n/
    ├   ...
    ├   └── proc/  ** - NEW FOLDER -
    ├   ├   ├── roi.tiff
    ├   ├   ...
    ├   ├   ├── results_00.yaml
    ├   ├   ├── results_00.h5
    └   └   └── results_00.mp4
        

After aggregating the extraction results

The following cell will search for the proc/ subfolders containing the extraction output, and copy them to a single aggregate_results/ folder. An index file called moseq2-index.yaml will also be generated with metadata for all extracted sessions.

After running the aggregate results cell, a folder called aggregate_results will be generated in the base directory. The aggregate_results/ folder contains all the data you need to run the rest of the pipeline. The PCA and modeling step will use data in this folder.

.                   ** current working directory
└── <base_dir>/
    ├── aggregate_results/ ** - NEW FOLDER -
    ├   ├── session_1_results_00.h5   ** session 1 compressed extraction + metadata 
    ├   ├── session_1_results_00.yaml ** session 1 extraction parameters
    ├   ├── session_1_results_00.mp4  ** session 1 extracted video
    ...
    ├   ├── session_n_results_00.h5   ** session n compressed extraction + metadata 
    ├   ├── session_n_results_00.yaml ** session n extraction parameters
    ├   └── session_n_results_00.mp4  ** session n extracted video
    ├── config.yaml
    ├── moseq2-index.yaml ** - NEW FILE -
    ├── session_1/
    ...
    └── session_n/

After running the train PCA step

After running the train PCA step, a new folder called _pca will be generated and the PCA results, pca.h5, pca.yaml, pca_components.png and pca_scree.png will be stored in the newly generated directory _pca.
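
If you want to inspect the PCA results outside of the notebook, the hdf5 file can be explored with h5py. A minimal sketch that just lists the stored datasets, since the exact dataset names may vary across moseq2-pca versions:

import h5py

# Print the path of every group and dataset stored in the PCA results file.
with h5py.File("_pca/pca.h5", "r") as f:
    f.visit(print)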

.                   ** current working directory
└── <base_dir>/
    ├── _pca/  ** - NEW FOLDER -
    ├   ├── pca.h5      ** - NEW FILE - pca model compressed file
    ├   ├── pca.yaml    ** - NEW FILE - pca model YAML metadata file
    ├   ├── pca_components.png  ** - NEW FILE -
    ├   └── pca_scree.png  ** - NEW FILE -
    ├── aggregate_results/
    ├── config.yaml
    ├── moseq2-index.yaml
    ├── session_1/
    ...
    └── session_n/

After running the apply PCA step

After running the apply PCA step, a pca_scores.h5 will be added to the _pca folder.
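
The PC scores file can be inspected the same way. The per-session layout sketched below (one dataset of scores per session UUID under a scores group) is an assumption, so list the keys first to confirm:

import h5py

with h5py.File("_pca/pca_scores.h5", "r") as f:
    f.visit(print)  # confirm the actual layout first
    # Assumed layout: one dataset of PC scores per session UUID under /scores.
    if "scores" in f:
        first_uuid = sorted(f["scores"])[0]
        print(first_uuid, f["scores"][first_uuid].shape)  # (frames, components)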

.                   ** current working directory
└── <base_dir>/
    ├── _pca/
    ├   ├── pca.h5 
    ├   ├── pca.yaml
    ├   ├── pca_components.png
    ├   ├── pca_scree.png 
    ├   └── pca_scores.h5   ** - NEW FILE - depth video PC scores
    ├── aggregate_results/
    ├── config.yaml
    ├── moseq2-index.yaml
    ├── session_1/
    ...
    └── session_n/

After computing model-free changepoints

After computing model-free changepoints, changepoints.h5 and changepoints_dist plots (.pdf/.png) will be added to _pca.

.                   ** current working directory
└── <base_dir>/
    ├── _pca/ 
    ├   ├── pca.h5
    ├   ├── pca_scores.h5
    ├   ...
    ├   ├── changepoints.h5  ** - NEW FILE - HDF5 file that contains the computed changepoints for each session used to produce the block duration distribution plot.
    ├   └── changepoints_dist.pdf/png ** - NEW FILES - Images that contain the distribution of behavior block durations captured by the PCs.
    ├── aggregate_results/ 
    ├── config.yaml
    ├── moseq2-index.yaml
    ├── session_1/
    ...
    └── session_n/

After training the AR-HMM model(s)

Running the train AR-HMM step will generate the folder specified by base_model_path if it doesn't already exist. Trained model(s) will be stored in that newly generated directory.

After training one model (model.p), the directory structure will be as shown below.

.                   ** current working directory
└── <base_dir>/
    ├── _pca/ 
    ├── aggregate_results/ 
    ├── <base_model_path>/
    ├   └── model.p  ** - NEW FILE -
    ├── moseq2-index.yaml
    ├── config.yaml
    ├── session_1/
    ...
    └── session_n/
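
If you need to inspect a trained model outside of the notebooks, the .p file is a Python pickle. A minimal sketch, assuming the file can be read with joblib and contains a results dictionary (both assumptions; confirm on your own files):

import joblib

# Trained AR-HMM results are saved as a pickled dictionary.
model = joblib.load("<base_dir>/<base_model_path>/model.p")  # substitute your paths
print(sorted(model.keys()))  # confirm the available keys before using them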

After training multiple models (e.g., three models: model1.p, model2.p, and model3.p), the directory structure will be as shown below.

.                   ** current working directory
└── <base_dir>/
    ├── _pca/ 
    ├── aggregate_results/ 
    ├── <base_model_path>/
    ├   ├── model1.p  ** - NEW FILE -
    ├   ├── model2.p  ** - NEW FILE -
    ├   └── model3.p  ** - NEW FILE -
    ├── moseq2-index.yaml
    ├── config.yaml
    ├── session_1/
    ...
    └── session_n/

The Setup Directory Structure for Analyzing Model(s) cell in the MoSeq2-Analysis-Visualization-Notebook detects all the models, creates a model-specific folder for each model, and copies each model into its model-specific folder. For example, if there are three models in base_model_path (e.g., model1.p, model2.p, and model3.p), the directory structure will be as shown below.

.                   ** current working directory
└── <base_dir>/
    ├── _pca/ 
    ├── aggregate_results/ 
    ├── <base_model_path>/
    ├   ├── model1.p
    ├   ├── model2.p
    ├   ├── model3.p
    ├   ├── model1/  ** - NEW FOLDER -
    ├   ├   └── model1.p
    ├   ├── model2/  ** - NEW FOLDER -
    ├   ├   └── model2.p
    ├   └── model3/  ** - NEW FOLDER -
    ├   ├   └── model3.p
    ├── moseq2-index.yaml
    ├── config.yaml
    ├── session_1/
    ...
    └── session_n/

After running the syllable labeler

After running the syllable labeler tool, syllable crowd movies based on the model of interest will be generated in its model-specific folder. In this example, there are three models in base_model_path (model1.p, model2.p and model3.p), and model1.p is the model of interest.

.                   ** current working directory
└── <base_dir>/
    ├── _pca/ 
    ├── aggregate_results/ 
    ├── <base_model_path>/
    ├   ├── model1.p
    ├   ├── model2.p
    ├   ├── model3.p
    ├   ├── model1/
    ├   ├   ├── model1.p
    ├   ├   ├── syll_info.yaml  ** - NEW FILE - yaml file that stores the syllable names and descriptions
    ├   ├   └── crowd_movies/   ** - NEW FOLDER - syllable crowd movies
    ├   ├   ├   ├── syll1.mp4
    ├   ├   ├   ...
    ├   ├   ├   └── sylln.mp4
    ├   ├── model2/
    ├   ├   └── model2.p
    ├   └── model3/
    ├   ├   └── model3.p
    ...

YAML Files in the MoSeq Pipeline

Generally, these files store metadata, configuration parameters, or file paths.

progress.yaml

The notebook generates a progress.yaml file that stores the file paths to data generated by the notebook, including extraction data files, PC scores from the extractions, and model results. This file is project-specific: each time you start a new project with MoSeq, a new progress.yaml file is generated. Below we show the contents of the progress.yaml file and what each field is for; a loading sketch follows the table.

name                description
base_dir            path to the data directory with all depth recordings
config_file         path to the config.yaml file
index_file          path to the moseq2-index.yaml file
train_data_dir      path to the aggregate_results folder
pca_dirname         path to the folder containing PCA results
scores_filename     name of the file containing PCA scores
scores_path         path to the PCA scores
changepoints_path   file name or path to the file containing model-free changepoints
base_model_path     folder where all models are saved
model_session_path  path to the one model you have selected for analysis, used in the analysis pipeline
crowd_dir           path to the folder that saves crowd movies; this folder should be a subdirectory of the model_path
plot_path           folder where most plots are saved, excluding the plots generated during the PCA step
session_config      path to the session_config.yaml file (see below for description)
syll_info           path to the syll_info.yaml file (see below for description)
df_info_path        path to the syllable statistics dataframe; contains the same information as the mean_df but is saved in a different location and format
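
If you ever need to read or patch these paths outside of the notebook (for example, after running a step through the CLI), the progress file is ordinary YAML. A minimal sketch using PyYAML:

import yaml

# Load the progress file into a plain dictionary (the notebook's progress_paths).
with open("progress.yaml") as f:
    progress_paths = yaml.safe_load(f)

print(progress_paths["base_dir"], progress_paths["config_file"])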

If your notebook kernel is shut down, you can load the progress file to 'restore' your progress. The progress file may not correctly track MoSeq pipeline operations that were executed outside this notebook (for example, if you were to run PCA using the command line interface). If necessary, you can manually modify the paths in the progress file or the corresponding progress_paths dictionary to access the output of these external operations.

We recommend running the notebooks from the folder where your data is located so the results are better organized. In that case, you can specify the base_dir like ./ (or the current folder).

When you run the MoSeq2 Extract Modeling Notebook to initialize or restore a progress.yaml file, a progress.yaml file will be generated if no such file exists in the base directory. When generating the progress.yaml file, the program scans the base directory that stores all the depth recordings and determines the progress of the analysis pipeline. Otherwise, the program tries to find the existing progress.yaml file or the last saved checkpoint to determine the progress of the analysis pipeline.

  • If there is a progress.yaml file in the directory, the information is loaded into the progress_paths dictionary. The check_progress function prints progress bars for each pipeline step in the notebook.
  • The extraction progress bar indicates the total number of extracted sessions detected in the provided base_dir path and prints the names of sessions that haven't been extracted. Note: this progress does not reflect the contents of the aggregate_results/ folder.
  • The remaining progress bars are derived from reading the paths in the progress_paths dictionary; each bar fills up when its corresponding path is found.

config.yaml

The notebook generates a config.yaml that holds all configurable parameters for all steps in the MoSeq pipeline, such as extraction parameters and PCA parameters. The file is initialized with default values we found to work best for the common C57BL/6J mouse strain. Here is an example config.yaml file, with settings for each MoSeq package clearly demarcated.

Parameters will be added to this file as you progress through the notebook. The config file can be used to run an identical pipeline in future analyses.
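
Because the file is ordinary YAML, individual parameters can also be inspected or edited programmatically rather than by hand. A minimal sketch using PyYAML, assuming a flat key layout (verify on your file); crop_size is used only as an example parameter and appears in the extraction parameter listing at the end of this page:

import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Example edit; use values appropriate for your setup.
config["crop_size"] = [80, 80]

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f)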

session_config.yaml

The notebook generates a session_config.yaml that holds the configurable extraction parameters for each session. Each session entry contains the same parameters as the moseq2-extract section of the example config file. During initialization, the depth of the bucket in each session is detected and the values are stored in session_config.yaml. If you use the Interactive ROI Detection Tool to configure parameters for specific sessions, the new parameters are stored in session_config.yaml.

The file structure looks like the following:

session1_name:
    moseq2_extract_parameter1: value1
    moseq2_extract_parameter2: value2
session2_name:
    moseq2_extract_parameter1: value1
    moseq2_extract_parameter2: value2

moseq2-index.yaml

During the aggregating results step, the proc/ subfolders generated by extraction are copied to a single aggregate_results/ folder. The notebook generates a moseq2-index.yaml from the metadata for all extracted sessions. The aggregate_results/ folder contains all the data you need to run the rest of the pipeline. The PCA and modeling step will use data in this folder.

Important Note: The index file contains UUIDs to map each session to a specific extraction. If you want to re-extract session(s), delete the existing moseq2-index.yaml file and re-aggregate the extracted results to keep the moseq2-index.yaml updated. Not doing so may cause KeyErrors in the PCA and modeling steps.

syll_info.yaml

The syllable labeler widget in the MoSeq2 Analysis Visualization Notebook generates a syll_info.yaml that saves the syllable names and descriptions.

The contents of the file look like the following:

0:
   label: walk
   desc: ''
   crowd_movie_path: /data/saline-amphetamine/model/crowd_movies/syllable_sorted-id-00_(usage)_original-id-64.mp4
   sorted_id: 0
   sort_type: usage
   original_id: 64
1:
   label: ''
   desc: ''
   crowd_movie_path: /data/saline-amphetamine/model/crowd_movies/syllable_sorted-id-01_(usage)_original-id-75.mp4
   sorted_id: 1
   sort_type: usage
   original_id: 75

Where the first number denotes the syllable label and the indented data contain the syllable description and crowd movie links.

Extraction hdf5 file contents

If you ever need to access the extracted raw data, you can look into the results_00.h5 file directly. Each hdf5 file produced by the MoSeq extraction contains the following data structure (a minimal reading sketch follows the listing):

 /
   - /frames
   - /frames_mask
   - /timestamps
   /metadata
     /metadata/acquisition
       - /metadata/acquisition/ColorDataType
       - /metadata/acquisition/ColorResolution
       - /metadata/acquisition/DepthDataType
       - /metadata/acquisition/DepthResolution
       - /metadata/acquisition/IsLittleEndian
       - /metadata/acquisition/NidaqChannels
       - /metadata/acquisition/NidaqSamplingRate
       - /metadata/acquisition/SessionName
       - /metadata/acquisition/StartTime
       - /metadata/acquisition/SubjectName
     /metadata/extraction
       - /metadata/extraction/background
       - /metadata/extraction/extract_version
       - /metadata/extraction/first_frame
       - /metadata/extraction/first_frame_idx
       - /metadata/extraction/flips
       - /metadata/extraction/last_frame_idx
       /metadata/extraction/parameters
         - /metadata/extraction/parameters/angle_hampel_sig
         - /metadata/extraction/parameters/angle_hampel_span
         - /metadata/extraction/parameters/bg_roi_depth_range
         - /metadata/extraction/parameters/bg_roi_dilate
         - /metadata/extraction/parameters/bg_roi_erode
         - /metadata/extraction/parameters/bg_roi_fill_holes
         - /metadata/extraction/parameters/bg_roi_gradient_filter
         - /metadata/extraction/parameters/bg_roi_gradient_kernel
         - /metadata/extraction/parameters/bg_roi_gradient_threshold
         - /metadata/extraction/parameters/bg_roi_index
         - /metadata/extraction/parameters/bg_roi_shape
         - /metadata/extraction/parameters/bg_roi_weights
         - /metadata/extraction/parameters/bg_sort_roi_by_position
         - /metadata/extraction/parameters/bg_sort_roi_by_position_max_rois
         - /metadata/extraction/parameters/cable_filter_iters
         - /metadata/extraction/parameters/cable_filter_shape
         - /metadata/extraction/parameters/cable_filter_size
         - /metadata/extraction/parameters/camera_type
         - /metadata/extraction/parameters/centroid_hampel_sig
         - /metadata/extraction/parameters/centroid_hampel_span
         - /metadata/extraction/parameters/chunk_overlap
         - /metadata/extraction/parameters/chunk_size
         - /metadata/extraction/parameters/cluster_type
         - /metadata/extraction/parameters/compress
         - /metadata/extraction/parameters/compress_chunk_size
         - /metadata/extraction/parameters/compress_threads
         - /metadata/extraction/parameters/compute_raw_scalars
         - /metadata/extraction/parameters/config_file
         - /metadata/extraction/parameters/crop_size
         - /metadata/extraction/parameters/delete
         - /metadata/extraction/parameters/detected_true_depth
         - /metadata/extraction/parameters/dilate_iterations
         - /metadata/extraction/parameters/erode_iterations
         - /metadata/extraction/parameters/flip_classifier
         - /metadata/extraction/parameters/flip_classifier_smoothing
         - /metadata/extraction/parameters/fps
         - /metadata/extraction/parameters/frame_dtype
         - /metadata/extraction/parameters/frame_trim
         - /metadata/extraction/parameters/graduate_walls
         - /metadata/extraction/parameters/manual_set_depth_range
         - /metadata/extraction/parameters/mapping
         - /metadata/extraction/parameters/max_height
         - /metadata/extraction/parameters/min_height
         - /metadata/extraction/parameters/model_smoothing_clips
         - /metadata/extraction/parameters/movie_dtype
         - /metadata/extraction/parameters/noise_tolerance
         - /metadata/extraction/parameters/num_frames
         - /metadata/extraction/parameters/output_dir
         - /metadata/extraction/parameters/output_file
         - /metadata/extraction/parameters/pixel_format
         - /metadata/extraction/parameters/progress_bar
         - /metadata/extraction/parameters/recompute_bg
         - /metadata/extraction/parameters/skip_completed
         - /metadata/extraction/parameters/spatial_filter_size
         - /metadata/extraction/parameters/tail_filter_iters
         - /metadata/extraction/parameters/tail_filter_shape
         - /metadata/extraction/parameters/tail_filter_size
         - /metadata/extraction/parameters/temporal_filter_size
         - /metadata/extraction/parameters/threads
         - /metadata/extraction/parameters/tracking_model_init
         - /metadata/extraction/parameters/tracking_model_ll_clip
         - /metadata/extraction/parameters/tracking_model_ll_threshold
         - /metadata/extraction/parameters/tracking_model_mask_threshold
         - /metadata/extraction/parameters/tracking_model_segment
         - /metadata/extraction/parameters/use_cc
         - /metadata/extraction/parameters/use_plane_bground
         - /metadata/extraction/parameters/use_tracking_model
         - /metadata/extraction/parameters/widen_radius
         - /metadata/extraction/parameters/write_movie
       - /metadata/extraction/roi
       - /metadata/extraction/true_depth
     - /metadata/uuid
   /scalars
     - /scalars/angle
     - /scalars/area_mm
     - /scalars/area_px
     - /scalars/centroid_x_mm
     - /scalars/centroid_x_px
     - /scalars/centroid_y_mm
     - /scalars/centroid_y_px
     - /scalars/height_ave_mm
     - /scalars/length_mm
     - /scalars/length_px
     - /scalars/velocity_2d_mm
     - /scalars/velocity_2d_px
     - /scalars/velocity_3d_mm
     - /scalars/velocity_3d_px
     - /scalars/velocity_theta
     - /scalars/width_mm
     - /scalars/width_px
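
The paths above can be read directly with h5py. A minimal sketch that pulls the extracted frames and one scalar, using dataset paths taken from the listing:

import h5py

with h5py.File("session_1/proc/results_00.h5", "r") as f:
    frames = f["/frames"][:100]                  # first 100 cropped depth frames
    velocity = f["/scalars/velocity_2d_mm"][()]  # per-frame 2D velocity in mm
    uuid = f["/metadata/uuid"][()]               # maps this extraction to the index file

print(frames.shape, velocity.mean(), uuid)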