Updated guides and pipeline docs #927

Merged 2 commits on Apr 23, 2021
11 changes: 11 additions & 0 deletions content/community-update/community-update/dcp-updates.md
@@ -7,6 +7,17 @@ description: "Latest updates for the HCA Data Coordination Platform (DCP)."
# DCP Platform Updates


## Raw data for 16 new projects now available
#### April 12, 2021

Raw sequencing data for 16 new projects are now available in the DCP [Data Browser](https://data.humancellatlas.org/explore/projects). These projects include single-cell data derived from:
- Human and mouse
- 10x 3’, 10x 5’, Smart-seq2 technologies
- Small intestine, aorta, brain, skeletal muscle, blood, pancreas, tonsil, lung, skin, immune system, kidney, and eye
- Airway basal stem cells exposed to SARS-CoV-2
- Disease states, including Crohn’s Disease, aneurysm, Multiple Sclerosis, HIV, Type 2 Diabetes, and glioblastoma


## Processed data now available for 26 HCA 10x datasets
#### April 02, 2021

6 changes: 3 additions & 3 deletions content/guides/userguides/exploring-projects.md
@@ -8,14 +8,14 @@ description: "Overview of exploring projects in the HCA Data Browser."

# Exploring Projects

Projects are a basic unit of data organization in the HCA Data Coordination Platform (HCA DCP). Project contributors contribute raw sequencing and associated [files](/metadata/dictionary/file/sequence_file) along with rich metadata describing:
Projects are a basic unit of data organization in the Data Coordination Platform (DCP). Project contributors contribute raw sequencing and associated [files](/metadata/dictionary/file/sequence_file) along with rich metadata describing:

1. the [origin and type of the cells](/metadata/dictionary/biomaterial/cell_line) used in the project
1. the [processes](/metadata/dictionary/process/analysis_process) and [protocols](/metadata/dictionary/protocol/aggregate_generation_protocol) used to collect and process the cells prior to sequencing
1. the [sequencing](/metadata/dictionary/protocol/sequencing_protocol) methods used
1. details about the [project](/metadata/dictionary/project/project) contributors and their institutions

This [Metadata](/metadata/dictionary/process/analysis_process) is included in the project's Metadata Manifest (TSV file). When the HCA DCP [processes](/pipelines) the contributor's raw data with standardized pipelines, this processing information is also added to the Metadata Manifest.
This [Metadata](/metadata/dictionary/process/analysis_process) is included in the project's Metadata Manifest (TSV file). When the DCP [processes](/pipelines) the contributor's raw data with standardized pipelines, this processing information is also added to the Metadata Manifest.


## Finding a Project of Interest
@@ -54,7 +54,7 @@ The project information page contains:

## Downloading Project Metadata

For each project, the HCA DCP maintains a project specific TSV file containing the full project metadata. The TSV contains a row for each file in the project and columns for each metadata property. Meanings of the metadata properties are listed in the [HCA Metadata Dictionary](/metadata).
For each project, the DCP maintains a project-specific TSV file containing the full project metadata. The TSV contains a row for each file in the project and columns for each metadata property. Meanings of the metadata properties are listed in the [Metadata Dictionary](/metadata).

The metadata TSV file gives a representation of the project's metadata graph that can be sorted and filtered using standard spreadsheet or data manipulation tools.
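
For example, a minimal sketch of loading and filtering a downloaded Metadata Manifest with pandas; the file name and the `file_format` column below are illustrative, so check the columns actually present in your manifest:

```python
# Load a project's Metadata Manifest (TSV): one row per file, one column per
# metadata property. File and column names here are placeholders.
import pandas as pd

manifest = pd.read_csv("project-metadata.tsv", sep="\t")

print(manifest.shape)               # (number of files, number of metadata properties)
print(list(manifest.columns)[:10])  # preview the first few metadata properties

# Example filter: keep only FASTQ files, if the manifest has a "file_format" column.
if "file_format" in manifest.columns:
    fastqs = manifest[manifest["file_format"] == "fastq.gz"]
    print(f"{len(fastqs)} FASTQ files in this project")
```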

20 changes: 10 additions & 10 deletions content/guides/userguides/matrices.md
@@ -5,10 +5,10 @@ title: "Matrices"
description: "An overview of the available matrices"
---

# HCA DCP 2.0 Data Matrix Overview
Cell-by-gene matrices (commonly referred to as "count matrices" or "expression matrices") are files that contain a measure of gene expression for every gene in every cell in your single-cell sample(s). These matrices can be used for downstream analyses and cell type annotations. This overview describes the DCP 2.0 matrix types, how to download them, and how to link them back to the HCA metadata.
# DCP 2.0 Data Matrix Overview
Cell-by-gene matrices (commonly referred to as "count matrices" or "expression matrices") are files that contain a measure of gene expression for every gene in every cell in your single-cell sample(s). These matrices can be used for downstream analyses and cell type annotations. This overview describes the Data Coordination Platform (DCP) 2.0 matrix types, how to download them, and how to link them back to the DCP metadata.

Overall, three types of matrices are currently available for HCA DCP 2.0 data:
Overall, three types of matrices are currently available for DCP 2.0 data:
- DCP-generated matrices (Loom file format) for projects
- DCP-generated matrices (Loom file format) for individual library preparations within a project
- Contributor-generated matrices (variable file format) provided by the project-contributor
@@ -36,16 +36,16 @@ Both project matrices and library-level matrices have unique filenames. Project

![Project Matrices Filenames](../_images/project_matrix_name.png "Matrix Name")

Library-level matrices have filenames matching the numerical ID in the HCA metadata field `sequencing_process.provenance.document_id`.
Library-level matrices have filenames matching the numerical ID in the DCP metadata field `sequencing_process.provenance.document_id`.



#### DCP Project-level Matrices
Project-level matrices are Loom files that contain standardized cell-by-gene measures and metrics for all the data in a project that are of the same species, organ, and sequencing method. This means each HCA project can have multiple project-level matrices if for example, the project contains both human and mouse or 10x and Smart-seq2 data.
Project-level matrices are Loom files that contain standardized cell-by-gene measures and metrics for all the data in a project that are of the same species, organ, and sequencing method. This means each DCP project can have multiple project-level matrices if, for example, the project contains both human and mouse or 10x and Smart-seq2 data.

The gene measures in project matrices vary based on the pipeline used for analysis. Matrices produced with the Optimus Pipeline (10x data) will have UMI-aware counts whereas matrices produced with the Smart-seq2 pipeline will have TPMs and estimated counts. Additionally, 10x matrices have been minimally filtered based on the number of UMIs (only cells with 100 molecules or more are retained).

Each project matrix also has HCA metadata (see table below) stored in the Loom file's global and column attributes. This metadata may be useful when exploring the data and linking it back to the additional Project metadata in the Data Manifest. Read more about each metadata field in the [Metadata Dictionary](/metadata/).
Each project matrix also has DCP metadata (see table below) stored in the Loom file's global and column attributes. This metadata may be useful when exploring the data and linking it back to the additional Project metadata in the Data Manifest. Read more about each metadata field in the [Metadata Dictionary](/metadata/).

| Metadata Attribute Name in DCP Generated Matrix | Metadata Description |
| --- | --- |
@@ -57,15 +57,15 @@ Each project matrix also has HCA metadata (see table below) stored in the Loom f
| `input_id` | Metadata values for `sequencing_process.provenance.document_id` |
| `input_name` | Metadata values for `sequencing_input.biomaterial_core.biomaterial_id` |
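
As a rough sketch, these attributes can be inspected with [loompy](http://loompy.org/); the file name below is a placeholder, and because attribute names other than `input_id` and `input_name` vary by matrix, the code lists whatever is present rather than assuming specific keys:

```python
# Inspect the global and column attributes of a DCP project-level matrix.
# Requires `pip install loompy`; "project-matrix.loom" is a placeholder name.
import loompy

with loompy.connect("project-matrix.loom") as ds:
    print(ds.shape)                 # (genes, cells)
    print(list(ds.attrs.keys()))    # global attributes (project-level metadata)
    print(list(ds.ca.keys()))       # column attributes (cell-level metadata)

    if "input_id" in ds.ca.keys():
        # Library preparations represented in this matrix
        print(set(ds.ca["input_id"]))
```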

More information about HCA post-processing for the project-level matrices can be found in the Matrix Overview for the [Optimus Pipeline](https://broadinstitute.github.io/warp/documentation/Pipelines/Optimus_Pipeline/Loom_schema.html#hca-data-coordination-platform-matrix-processing) and the [Smart-seq2 Pipeline](https://broadinstitute.github.io/warp/documentation/Pipelines/Smart-seq2_Multi_Sample_Pipeline/Loom_schema.html#table-2-column-attributes-cell-metrics) (in development).
More information about DCP post-processing for the project-level matrices can be found in the Matrix Overview for the [Optimus Pipeline](https://broadinstitute.github.io/warp/documentation/Pipelines/Optimus_Pipeline/Loom_schema.html#hca-data-coordination-platform-matrix-processing) and the [Smart-seq2 Pipeline](https://broadinstitute.github.io/warp/documentation/Pipelines/Smart-seq2_Multi_Sample_Pipeline/Loom_schema.html#table-2-column-attributes-cell-metrics) (in development).


#### DCP Library-level Matrices
Library-level matrices (also Loom files) are cell-by-gene matrices for each individual library preparation in a project. These matrices contain the same standardized gene (row) metrics, cell (column) metrics and counts as the project-level matrices, but are separated by the HCA metadata field for library preparation, `sequencing_process.provenance.document_id`, allowing you to only use a sub-sampling of all the project's data.
Library-level matrices (also Loom files) are cell-by-gene matrices for each individual library preparation in a project. These matrices contain the same standardized gene (row) metrics, cell (column) metrics and counts as the project-level matrices, but are separated by the DCP metadata field for library preparation, `sequencing_process.provenance.document_id`, allowing you to work with a subset of the project's data.

While a library preparation for 10x datasets will likely include all the cells for a single donor, a library preparation for Smart-seq2 data corresponds to an individual cell (i.e. if your Smart-seq2 data has 200 cells, it will have 200 library-level matrices).

Unlike project matrices, **library-level matrices are not filtered** and they do not contain all the HCA metadata for species, organ, and sequencing method in the matrix global attributes. Instead, they only contain the metadata for `input_id` and `input_name` (described in table above).
Unlike project matrices, **library-level matrices are not filtered** and they do not contain all the DCP metadata for species, organ, and sequencing method in the matrix global attributes. Instead, they only contain the metadata for `input_id` and `input_name` (described in table above).
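
Because no filter has been applied, you may want to impose your own minimal cell filter. A rough sketch with loompy follows, reusing the 100-molecule threshold applied to project-level 10x matrices; file names are placeholders, and the whole matrix is read into memory, which is reasonable for a single library preparation but not for very large files:

```python
# Apply a simple ">= 100 molecules per cell" filter to an unfiltered
# library-level matrix and write the result to a new Loom file.
import loompy

with loompy.connect("library-level-matrix.loom") as ds:
    counts = ds[:, :]                      # dense genes x cells array
    keep = counts.sum(axis=0) >= 100       # cells passing the threshold

    loompy.create(
        "library-level-matrix.filtered.loom",
        counts[:, keep],
        {k: ds.ra[k] for k in ds.ra.keys()},        # gene (row) attributes
        {k: ds.ca[k][keep] for k in ds.ca.keys()},  # cell (column) attributes
    )
```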

## Contributor Generated Matrices
Contributor Generated Matrices are optionally provided by the data contributors. These can be useful when trying to annotate cell types or when comparing results back to a contributor's published results. When these matrices are available, you can download them from the individual Project page. Across projects, these matrices will vary in file format and content. For questions about the Contributor Generated Matrix, reach out to the contributors listed in the Project page Contact section.
@@ -78,7 +78,7 @@ DCP-generated project-level matrices and contributor-generated matrices may be d
## Linking Project-level DCP Generated Matrices to the Data Manifest (Metadata)
DCP 2.0 project-level matrices only contain some of the available project metadata (species, organs, library methods, etc.). However, there are several metadata facets in the Metadata Manifest, such as disease state or donor information, that you might want to link back to the DCP-generated cell-by-gene matrix.

To link a metadata field in the Metadata Manifest back to an individual sample in a DCP Generated Matrix, use the matrix `input_id` field. This field includes all the values for the HCA metadata `sequencing_process.provenance.document_id`, the ID used to demarcate each library preparation.
To link a metadata field in the Metadata Manifest back to an individual sample in a DCP Generated Matrix, use the matrix `input_id` field. This field includes all the values for the DCP metadata `sequencing_process.provenance.document_id`, the ID used to demarcate each library preparation.
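
A rough sketch of that join with loompy and pandas follows; the file names are placeholders, and the manifest column holding the `sequencing_process.provenance.document_id` values may be named differently in your download, so inspect the manifest's columns first:

```python
# Link the cells of a DCP project matrix back to rows of the Metadata Manifest
# via the library preparation ID stored in the matrix `input_id` attribute.
import loompy
import pandas as pd

manifest = pd.read_csv("project-metadata.tsv", sep="\t")

with loompy.connect("project-matrix.loom") as ds:
    cells = pd.DataFrame({"input_id": ds.ca["input_id"]})

linked = cells.merge(
    manifest,
    left_on="input_id",
    right_on="sequencing_process.provenance.document_id",  # illustrative column name
    how="left",
)
print(linked.head())
```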



@@ -7,10 +7,10 @@ description: "Overview of the data processing pipelines in the HCA DCP."

# Overview of Data Processing Pipelines
## What is Data Processing?
In the HCA DCP, data processing refers to the use of a computational pipeline to analyze raw experimental data from a specific assay. Processing of HCA data produces collections of quality metrics and features that can be used for further analysis. For example, the processing of single-cell RNA-Seq data produces aligned, QC’d reads, a matrix of gene expression, and a matrix of quality control metrics describing the data.
Data processing refers to the use of a computational pipeline to analyze raw experimental data from a specific assay. Processing data submitted to the Data Coordination Platform (DCP) produces collections of quality metrics and features that can be used for further analysis. For example, the processing of single-cell RNA-Seq data produces aligned, QC’d reads, a matrix of gene expression, and a matrix of quality control metrics describing the data.

## What is the Data Processing Pipeline Service?
The Data Processing Pipeline Service consists of analysis pipelines and execution infrastructure that move raw data through analysis, producing measurements that are available for download by the community from the Data Portal. These include both the submitted raw data and data resulting from data processing. As new single-cell technologies and analysis methods are developed and accepted by the research community, we will implement new data processing pipelines and make both the pipelines and the data publically available.
The Data Processing Pipeline Service consists of analysis pipelines and execution infrastructure that move raw data through analysis, producing measurements that are available for download by the community from the Data Portal. These include both the submitted raw data and data resulting from data processing. As new single-cell technologies and analysis methods are developed and accepted by the research community, we will implement new data processing pipelines and make both the pipelines and the data publicly available.

Data processing pipelines are each bespoke to the characteristics of the data they process. These pipelines can address many concerns, including the quality of the measurements, detection of false positives or negatives, and optimal processing (such as aligning reads, collapsing UMIs, or segmenting images into accurate features). Please see the details about each of our pipelines and send us your feedback!

@@ -25,7 +25,7 @@ The following are pipelines in development or in production in this platform:
> Pipeline code and detailed documentation are hosted in the [WDL Analysis Research Pipelines (WARP)](https://github.com/broadinstitute/warp) repository on GitHub.

## Access to Pipeline Outputs
Matrices are publicly available and can be accessed through the HCA Data Browser or from the individual Project page.
Matrices are publicly available and can be accessed through the DCP Data Browser or from the individual Project page.



@@ -10,9 +10,9 @@ description: "Overview of the file formats used by the data processing pipelines
## DCP Matrix Download File Format

> **MTX and CSV Matrix Deprecation Notice:**
DCP 1.0 matrices will be deprecated in the DCP 2.0. Loom files will be the default format.
DCP 1.0 matrices are deprecated in the DCP 2.0. Loom files are now the default format.

Cell by gene count matrices are provided in [Loom](http://loompy.org/) file format and be downloaded through the HCA Data Portal. From the Portal's Data Browser, you can make a multifaceted search to download matrices for multiple projects. Alternatively, you can explore the matrices available for download on the individual Project pages.
Cell by gene count matrices are provided in [Loom](http://loompy.org/) file format and can be downloaded through the Data Coordination Platform's (DCP) Data Portal. From the Portal's Data Browser, you can make a multifaceted search to download matrices for multiple projects. Alternatively, you can explore the matrices available for download on the individual Project pages.

#### Working with Loom Files
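
As a starting point, a rough sketch of reading a downloaded Loom matrix into an AnnData object with the `anndata` package and, if your downstream tools still expect one, exporting a dense CSV; file names are placeholders and a dense export can be impractical for large projects:

```python
# Read a downloaded Loom matrix into AnnData and export a dense cell-by-gene CSV.
# Requires `pip install anndata`; "project-matrix.loom" is a placeholder name.
import anndata as ad

adata = ad.read_loom("project-matrix.loom")  # cells x genes AnnData object
print(adata)                                 # dimensions plus available metadata

adata.to_df().to_csv("project-matrix.csv")   # dense export; large for big projects
```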

@@ -7,7 +7,7 @@ description: "Overview of best practices for building data processing pipelines

# Best Practices for Building Data Processing Pipelines

Each of our pipelines are developed using the best practices detailed below and are approved by the [Human Cell Atlas Analysis Working Group](https://www.humancellatlas.org/learn-more/working-groups/). We describe each of these best practices to give insight as to why they are important and we provide examples to give you a sense of how to apply them.
Each of our Data Coordination Platform (DCP) standardized pipelines is developed using the best practices detailed below and is approved by the [Human Cell Atlas Analysis Working Group](https://www.humancellatlas.org/learn-more/working-groups/). We describe each of these best practices to give insight into why they are important, and we provide examples to give you a sense of how to apply them.

Overall, the best pipelines should be:
- automated
@@ -60,7 +60,7 @@ _Impact._ Pipelines will have greatest impact when they can be leveraged in mult

_Maintainability._ Over the long term, it is easier to maintain pipelines that can be run in multiple environments. Portability avoids being tied to specific infrastructure and enables ease of deployment to development environments.

Within the scope of the HCA, to ensure that others will be able to use your pipeline, avoid building in assumptions about environments and infrastructures in which it will run.
Within the scope of the HCA project, to ensure that others will be able to use your pipeline, avoid building in assumptions about environments and infrastructures in which it will run.

### Configurability for running on different technical infrastructures.
Code should not change to enable a pipeline to run on a different technical architecture; this change in execution environment should be configurable outside of the pipeline code.