From 77e4f49a55981037aec5b3878341ae0766bf9243 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 6 Aug 2022 17:09:13 -0400 Subject: [PATCH 01/48] first try at revised index --- doc/new.md | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 75 insertions(+) create mode 100644 doc/new.md diff --git a/doc/new.md b/doc/new.md new file mode 100644 index 0000000000..6fce123fb9 --- /dev/null +++ b/doc/new.md @@ -0,0 +1,75 @@ +# Welcome to sourmash! + +sourmash is a command-line tool and Python/Rust library for +**metagenome analysis** and **genome comparison** with k-mers. It +supports the compositional analysis of metagenomes, rapid search of +large sequence databases, and flexible taxonomic analysis with both +NCBI and GTDB taxonomies. sourmash works well with sequences 30kb or +larger, including bacterial and viral genomes. + +You might try sourmash if you want to - + +* identify which reference genomes are present in a metagenome +* search all Genbank microbial genomes with a genome query +* build an annotation-free clustering of many genomes using k-mers or average nucleotide identity (ANI) +* taxonomically classify genomes or metagenomes against NCBI and/or GTDB; +* search all available reference genomes +* search thousands of metagenomes with a query genome or sequence + +Underneath, sourmash uses [FracMinHash sketches](@@) for fast and +lightweight sequence comparison; FracMinHash builds on +[MinHash sketching](@@wikipedia) to support both Jaccard similarity +_and_ containment analyses with k-mers. This significantly expands +the range of operations that can be done quickly and in low +memory. sourmash also implements a number of new and powerful analysis +techniques, including minimum metagenome covers and alignment-free ANI +estimation. + +sourmash is inspired by [mash](https://mash.readthedocs.io), and +supports most mash analyses. sourmash also implements an expanded set +of functionality for metagenome and taxonomic analysis. + +## Using sourmash + +### Tutorials and examples + +These tutorials are command line tutorials that should work on Mac OS +X and Linux. They require about 5 GB of disk space and 5 GB of RAM. + +* [The first sourmash tutorial - making signatures, comparing, and searching](tutorial-basic.md) + +* [Using sourmash LCA to do taxonomic classification](tutorials-lca.md) + +* [Analyzing the genomic and taxonomic composition of an environmental genome using GTDB and sample-specific MAGs with sourmash](tutorial-lemonade.md) + +* [Some sourmash command line examples!](sourmash-examples.md) + +### How-To Guides + +* Installing sourmash + +* [Classifying genome sketches](classifying-signatures.md) + +* [Working with private collections of genome sketches.](sourmash-collections.md) + +* [Using the `LCA_Database` API.](using-LCA-database-API.ipynb) + +* [Building plots from `sourmash compare` output](plotting-compare.md). + +* [A short guide to using sourmash output with R](other-languages.md). + +### How sourmash works under the hood + +* [An introduction to k-mers for genome comparison and analysis](kmers-and-minhash.md) +* [Support, versioning, and migration between versions](support.md) + +### Reference material + +* [UNIX command-line documentation](command-line.md) +* [Genbank and GTDB databases and taxonomy files](databases.md) +* [Python examples using the API](api-example.md) +* [Publications about sourmash](publications.md) + +## Developing and extending sourmash + +* [Releasing a new version of sourmash](release.md) From 33eeb062c8fda67bf9c612841b0fbc3e558dfc4f Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Mon, 15 Aug 2022 15:19:51 -1000 Subject: [PATCH 02/48] lots of details --- doc/sourmash-internals.md | 334 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 334 insertions(+) create mode 100644 doc/sourmash-internals.md diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md new file mode 100644 index 0000000000..977b0c5ef1 --- /dev/null +++ b/doc/sourmash-internals.md @@ -0,0 +1,334 @@ +## Signatures and sketches + +Default signature format is stored as JSON, gzipped. Read by Rust. + +Complication with multiple sketches per signature. + +Right now signatures are 1:1 with sketches. + +### Scaled (FracMinHash) sketches support similarity and containment + +downsampling - mostly done dynamically + +### Num (MinHash) sketches support Jaccard similarity + +less well supported; see issue @@. + +downsampling may need to be done manually. + +## K-mer sizes + +## Molecule types - DNA, protein, Dayhoff, and hydrophobic-polar + +sourmash supports four different sequence encodings, which we refer to +as "molecule: DNA (`--dna`), protein (`--protein`), Dayhoff, +(`--dayhoff`), and hydrophobic-polar (`--hp`). + +All FracMinHash sketches have exactly one molecule type, and can only +be compared to the same molecule type (and ksize). + +DNA moltype sketches can be constructed from DNA input sequences using +`sourmash sketch dna`. + +Protein, Dayhoff, and HP moltype sketches can be constructed from +protein input sequences using `sourmash sketch protein`, or from DNA +input sequences using `sourmash sketch translate`; `translate` will +translate in all six reading frames (see also @orpheum). By default +protein sketches will be created; dayhoff sketches can be created by +including `dayhoff` in the param string, e.g. `sourmash sketch protein +-p dayhoff`, and hydrophobic-polar sketches can be built with `hp` in +the param string. + +## Manifests + +sourmash makes extensive use of signature manifests to support +rapid selection and lazy loading of signatures based on signature +metadata (name, ksize, moltype, etc.) See @@this blog post for +some of the motivation. + +Manifests are an internal format that is not meant to be particularly +human readable, but the CSV format can be loaded into a spreadsheet +program if you're curious :). + +If in a zipfile (`.zip`) or SBT zipfile (`.sbt.zip`), manifests must +be named `SOURMASH-MANIFEST.csv`. They can also be stored directly on +disk in CSV/gzipped CSV, or in a sqlite database; see +`sourmash sig manifest`, `sourmash sig check`, and `sourmash sig collect` +for manifest creation, management, and export utilities. + +Where signatures are stored individually in `Index` collections, +e.g. in a zipfile, manifests may be stored alongside them; for other +subclasses of `Index` such as the inverted indices, manifests are +generated dynamically by the class itself. + +Currently (sourmash 4.x) manifests do not contain information about the +hash seed or ... This will be fixed in future, see @@issue. + +Manifests are very flexible and, especially when stored in a sqlite +database, can be extremely performant for organizing hundreds of +thousands to millions of sketches. Please see `StandaloneManifestIndex` +for a lazy-loading `Index` class that supports such massive-scale +organization. + +## Index implementations + +The `Index` class and its various subclasses (in `sourmash.index`) are +containers that provide an API for organizing, selecting, and +searching (potentially) large numbers of signatures. + +Loading and saving of `Index` objects is handled separately from the +class: loading can be done in Python via the +`sourmash.load_file_as_index(...)` method, while creation and/or +updating of `Index` objects is done via +`sourmash.sourmash_args.SaveSignaturesToLocation(...)`. These are the +same APIs used by the command-line functionality. + +There are quite a few different `Index` subclasses and they all have +distinct features. We have a high-level guide to which collection +type to use @@here in command-line. + +Conceptually, `Index` classes are either organized around storing +individual signatures often with metadata that permits loading, +selecting, and/or searching them more efficiently +(e.g. `ZipFileLinearIndex` and `SBTs`); or they store signatures +as inverted indices (`LCA_Database` and `SqliteIndex`) that permit +certain kinds of fast queries. + +### In-memory storage and search. + +The simplest way to handle collections of signatures is to load them +into memory, but it is also the least performant and most memory +intensive mechanism! + +`LinearIndex` and `MultiIndex` both support sketches loaded from +JSON files; both will load the sketches once and then keep them in +memory. `LinearIndex` does not use manifests while `MultiIndex` builds +a manifest as it loads the sketches. + +Note that `MultiIndex` is the class used to load signatures from +pathlists, directory hierarchies, and so on; because it stores +sketches in memory, this can incur a significant memory penalty (see +issue @@). Therefore where possible we suggest building a standalone +manifest (`StandaloneManifestIndex`) to do lazy loading from the disk +instead; you can use `sourmash sig collect` to do this. + +### Zipfile collections + +`ZipFileLinearIndex` stores signature files in a zip file with an +accompanying manifest. This is the most versatile and compressed +option for working with large collections of sketches - it supports +rapid selection and loading of specific sketches from disk, and can +store and search any mixture of sketch types (ksize, molecule type, +scaled values, etc.) + +By default, `ZipFileLinearIndex` stores one signature (equiv. one +sketch) in each member file in the zip archive. Each signature is +stored uncompressed. The accompanying manifest stores the full member +file path in `internal_location`, so that sketches can be retrieved +directly. + +Searching a `ZipFileLinearIndex` is done linearly, as implied by the +name. This is fine for `gather` but if you are doing repeated queries +with `search` you may want to use an SBT or LCA database instead; see +below. + +In the future we expect to parallelize searching `ZipFileLinearIndex` +files in Rust; see @@greyhound. + +`ZipFileLinearIndex` does support zip files without manifests as well +as multiple signatures in a single file; this was originally intended +to support simply zipping entire directory hierarchies into a zipfile. +However, this slows down performance and is not recommended. If you +have an existing zipfile (or really any collection of signatures) and +you want to turn them into a proper `ZipFileLinearIndex`, you can use +`sig cat -o combined.zip` to create a +`ZipFileLinearIndex` file named `combined.zip` that will have a +manifest and signatures broken out into individual files. + +### Sequence Bloom Trees (SBTs) + +Sequence Bloom Trees (SBTs; @@link) provide a faster (but more memory +intensive) on-disk storage and search mechanism. In brief, SBTs implement +a binary tree organization of an arbitrary number of signatures; each +internal node is a Bloom filter containing all of the hashes for the +nodes below them. This permits potentially rapid elimination of +irrelevant nodes on search. + +SBTs are restricted to storing and searching sketches with the +same/single k-mer size and molecule type, as well as either a single +num value or a single scaled value. + +We suggest using SBTs when you are doing multiple Jaccard search or +containment searches with genomes via `sourmash search`. + +### Lowest common ancestor (LCA) databases + +The `LCA_Database` index class stores signatures in an inverted index, +where a Python dictionary is used to link individual hashes back to +signatures and/or taxonomic lineages. This supports the individualized +hash analyses used in the `lca` submodule. + +LCA databases only support a single ksize, moltype, and scaled. They +can only be used with FracMinHash (scaled) sketches. + +The default `LCA_Database` class is serialized via JSON, and loads +everything into memory when requested. The load time incurs a +significant latency penalty when used from the command line, as well +as having a potentially large memory footprint; this makes it +difficult to use the default `LCA_Database` for very large databases, +e.g. genbank bacteria. + +The newer `LCA_SqliteDatabase` (based on `SqliteIndex`, described +below) also supports LCA-style queries, and is stored on disk, is fast to +load, and uses very little memory. The tradeoff is that the underlying +sqlite database can be quite large. `LCA_SqliteDatabase` should also +support rapid concurrent access (see issue @@). + +Both types of LCA database can be constructed with `sourmash lca index`. + +### SqliteIndex + +The `SqliteIndex` storage class uses sqlite3 to store hashes and +sketch information for search and retrieval. (@@see blog post) These +are fast, low-memory, on-disk databases, with the tradeoff that they +can be quite large. This is probably currently the best solution for +concurrent access to sketches via e.g. a Web server (see issue@). + +`SqliteIndex` can only contain FracMinHash sketches and can only store +sketches with the same scaled parameter. However, it can store +multiple ksizes and moltypes as long as the same scaled is used. + +`SqliteIndex` objects can be constructed using `sourmash sig cat +... -o filename.sqldb`. + +### Standalone manifests + +The `StandaloneManifestIndex` class loads standalone manifests generated +by `sourmash sig collect`. They support rapid selection and lazy loading +on potentially extremely large collections of signatures. + +The underlying mechanism uses the `internal_location` field of +manifests to point to the container file. When particular sketches are +requested, the container file is loaded into an `Index` object with +`sourmash.load_file_as_index` and the `md5` values of the requested +sketches are used as a picklist to retrieve the desired signatures. + +Thus, while standalone manifests can point at any kind of container, +including JSON files or LCA databases, they are most efficient when +`internal_location` points at a file with either a single sketch in +it, or a manifest that supports direct loading of sketches. Therefore, +we suggest using standalone manifest indices. + +Note that searching a standalone manifest is currently done through a +linear iteration, and does not use any features of indexed containers +such as SBTs or LCAs. This is fine for `gather` with the default +approach, but is probably suboptimal for a `search`. + +## Speeding up `gather` and `search` + +The basics - gather/prefetch, vs search. + +Repeated gathers and searches. + +Using prefetch explicitly. + +### Using a higher scaled value + +### + +## Taxonomy and assigning lineages + +All sourmash taxonomy handling is done within the `lca` and `tax` +subcommands (CLI) and submodules (Python). + +In the case of the `lca` subcommands, the taxonomic information is +incorporated into the LCA database construction (see the `lca index` +command), while the `tax` subcommands load taxonomic information +on demand from taxonomy databases (CSVs or databases). + +sourmash anchors all taxonomy to identifiers, and uses the signature +name to do so - this is the name as set by the `--name` parameter to +`sourmash sketch`, and output by `sourmash sig describe` as the +`signature:` field. + +### Identifier handling + +sourmash prefers identifiers to be the first space-separated token in +the signature name. This token can contain any alphanumeric letters +other than space (@@ctb check me), and should contain at most one +period. The version of the identifier will be the + +So, for example, for a signature name of + +``` +CP001941.1 Aciduliprofundum boonei T469, complete genome +``` +the identifier would be `CP001941.1` and the version would be 1. +There are no other constraints placed on the identifier, and +versions are not handled in any special way other than as below. + +The `lca index` and `tax` commands both support some modified +identifier handling in sourmash 3.x and 4.x, but in the future, we +plan to deprecate these as they mostly cause confusion and internal +complexity. + +The two modifiers are: + +* `--keep-full-identifiers` will use the entire signature +name instead of just the first space-separated token. It is by default +off (set to false). + +* `--keep-identifier-versions` turns on keeping the full identifier, +including what is after the first period. It is by default on (set to +True) @@ctb check me. When it is off (False), identifiers are stripped +of their version on load. + +### Taxonomies, or lineage spreadsheets + +sourmash supports arbitrary (free) taxonomies, and new taxonomic +lineages can be created and used internally as long as they +are provided in the appropriate spreadsheet format. + +You can also mix and match taxonomies as you need; for example, it is +entirely legitimate in sourmash-land to combine the GTDB taxonomy for +bacterial and archaeal sequence classification, with the NCBI taxonomy +for eukaryotic and viral sequence classification. (You probably don't +want to mix and match within superkingdoms, though!) + +As of sourmash v4, lineage spreadsheets should contain columns for +superkingdom, phylum, class, order, family, genus, and species. Some +commands may also support a 'strain' column, although this is +inconsistently handled within sourmash internally. + +For spreadsheet organization, `lca index` expects the columns to be +present in order from superkingdom on down, while the `tax` +subcommands use CSV column headers instead. We are planning to +consolidate around the `tax` subcommand handling in the future (@@ctb +issue). + +@@link to example spreadsheets, talk about how to extract. + +### `LCA_SqliteDatabase` - a special case + +The `LCA_SqliteDatabase` index class can serve multiple purposes: as +an index of sketches (for regular search and gather); as a taxonomy +database for use with the `tax` subcommands; and as an LCA database +for use with the `lca` subcommands. + +When used as a taxonomy database, an `LCA_SqliteDatabase` file +contains the same SQL tables as a sqlite taxonomy database. + +When used as an LCA database, an `LCA_SqliteDatabase` dynamically loads +the taxonomic lineages from the sqlite database and applies them to the +individual hashes, permitting the same kind of hash-to-lineage query +capability as the `LCA_Database`. + +## Picklists + +picklist types and handling + +metatypes. + +using manifests as picklists. + +using taxonomy as picklists. From 68fe6d75e8c9de89a98dda17702d51e3d719391a Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Mon, 15 Aug 2022 16:23:09 -1000 Subject: [PATCH 03/48] much docs, so wow --- doc/new.md | 1 + doc/sourmash-internals.md | 200 +++++++++++++++++++++++++++++++++++++- 2 files changed, 197 insertions(+), 4 deletions(-) diff --git a/doc/new.md b/doc/new.md index 6fce123fb9..ea19160f71 100644 --- a/doc/new.md +++ b/doc/new.md @@ -69,6 +69,7 @@ X and Linux. They require about 5 GB of disk space and 5 GB of RAM. * [Genbank and GTDB databases and taxonomy files](databases.md) * [Python examples using the API](api-example.md) * [Publications about sourmash](publications.md) +* [A guide to the internals of sourmash](sourmash-internals.md) ## Developing and extending sourmash diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index 977b0c5ef1..926088f041 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -1,3 +1,22 @@ +# A guide to the internals of sourmash + +```{contents} Contents +:depth: 3 +``` + +sourmash was created in 2015, and has been repeatedly reorganized, +refactored, and optimized to support ever larger databases, faster +queries, and new use cases. We've also regularly added new +functionality and features. So sourmash can be pretty complicated +internally, and our user-facing documentation only covers a fraction +of its potential! + +This document is a brain dump intended for expert users and sourmash +developers who want to understand how, why, and when to use various +sourmash features. It is unlikely ever to be comprehensive, so the +information you are interested in may not yet exist in this document, +but we are always happy to add to it - [just ask in an issue!](@@) + ## Signatures and sketches Default signature format is stored as JSON, gzipped. Read by Rust. @@ -8,7 +27,9 @@ Right now signatures are 1:1 with sketches. ### Scaled (FracMinHash) sketches support similarity and containment -downsampling - mostly done dynamically +@paper ref + +downsampling - mostly done dynamically. How it works with index classes. #1799 ### Num (MinHash) sketches support Jaccard similarity @@ -76,6 +97,9 @@ The `Index` class and its various subclasses (in `sourmash.index`) are containers that provide an API for organizing, selecting, and searching (potentially) large numbers of signatures. +`sourmash sig summarize` is a good way to determine what type of `Index` +class is used to handle a collection. + Loading and saving of `Index` objects is handled separately from the class: loading can be done in Python via the `sourmash.load_file_as_index(...)` method, while creation and/or @@ -94,6 +118,15 @@ selecting, and/or searching them more efficiently as inverted indices (`LCA_Database` and `SqliteIndex`) that permit certain kinds of fast queries. +Unless otherwise noted, the `Index` classes below can be loaded +concurrently in "read only" mode - that is, you should build the +collection _once_, and then use it from multiple processes. We +currently do not test for or support concurrent read/write. Note also +that (generally speaking) memory footprints will be additive, so +loading the same `LCA_Database` twice will consume twice the memory. +(If you're interested in concurrency, we suggest using the sqlite +containers - see `SqliteIndex`.) + ### In-memory storage and search. The simplest way to handle collections of signatures is to load them @@ -224,9 +257,116 @@ linear iteration, and does not use any features of indexed containers such as SBTs or LCAs. This is fine for `gather` with the default approach, but is probably suboptimal for a `search`. +### Pathlists and `--from-file` + +All (or most) sourmash commands natively support taking in lists of +signature collections via pathlists, `--from-file`, or paths to +directories. This is useful for situations where you have thousands of +signature files and don't want to provide them explicitly on the +command line; you can simply put a list of the files in a text file, +and pass it in directly (or use `--from-file` to pass it in). + +Both pathlists and files passed to `--from-file` contain a list of +paths to be loaded; relatives paths will be interpreted relative to +the current working directory of sourmash. Pathlists should be +universally available on sourmash commands. When `--from-file` is +available for a command, sourmash will behave as if the file paths in +the file were provided on the command line. + +We suggest avoiding pathlists. Instead, we suggest using `--from-file` +or a standalone manifest index (generated with `sourmash sig +collect`). This is because the signatures from pathlists are loaded +into memory (see `MultiIndex`, above) it is generally a bad idea to +use them - they may be slow to load and may consume a lot of +memory. They also do not support good loading error messages +(@@example/issue). + +### Extensions for outputting index classes + +Most commands that support saving signatures will save them in a +variety of formats, based on the extension provided (see @@exceptions +issue for exceptions). The supported extensions are - + +* `.zip` for `ZipFileLinearIndex` +* `.sqldb` for `SqliteIndex` +* `.sig` or `.sig.gz` for JSON/gzipped JSON +* `dirname/` to save in a directory hierarchy + +The default signature save format is JSON, if the extension is not +recognized. + ## Speeding up `gather` and `search` -The basics - gather/prefetch, vs search. +There are two primary search commands in sourmash: `gather` and +`search`. + +`gather` calculates a minimum metagenome cover as discussed in @@. It +is mostly intended for querying a database with a metagenome, although +it can be used with genome queries, as well. It depends on overlap +analyses and can only be used with FracMinHash sketches. + +`search` does a straight Jaccard similarity search on MinHash and +FracMinHash sketches (or, with `--containment`, a containment search +on FracMinHash sketches). It is typically used to find matches to a +query genome sketch in a large database of sketches. + +The `prefetch` command does a containment search and is intended for +power users; it is a standalone implementation of the prefetch +algorithm discussed below for `gather`. It only works with FracMinHash +sketches. + +Note that all of these commands work with any and all `Index` +collection/container types, and will return the same results however +the collections are organized - see the "online behavior" section, +below. In practice this means that you can provide additional +collections of signatures via the command line without building a +combined index of all your signatures. It also means that the only +reason to choose different collections/containers is for +optimization - you should select the conatiners that help you achieve +the desired performance characteristics for your search +(i.e. the right memory/time/disk space tradeoffs). + +### Running `search` many times on the same database + +`search` typically is used to search a large database of sketches for +all similarity or containment matches above a threshold. Depending on +the query and the database, certain kinds of database indices may make +search much faster, especially when only a few matches are expected. + +If you are doing many searches against the same database, indexing the +database as an SBT (with `sourmash index`) or as a `SqliteIndex`/sqldb +database is likely to provide a significant speed increase, albeit +with increased memory usage (SBT) or increased disk space (sqldb). + +Conversely, `ZipFileLinearIndex` and the default `LCA_Database` are likely +to be poor choices for many searches - the former only supports linear +searches, and the latter needs to be loaded from disk and deserialized each +time. + +### Running `gather` once + +`gather` is typically used to search a metagenome against a large +database of sketches, as part of finding a minimum set cover. This can +be quite slow! Our current implementation (as of sourmash vXX, @pull +request) does a single pass across the database to find all matches +with an overlap above the provided threshold, and then organizes +the matches for rapid min-set-cov analysis. This single pass across the +database is called a "prefetch", and it is also implemented in the +`prefetch` subcommand. + +With this single pass approach, benchmarks (@@ link) show that a +linearly searchable database is performant enough to be used with +`gather`. We therefore suggest using a `ZipFileLinearIndex` container +with gather, or in cases where low-memory concurrency is desired, a +`SqliteIndex` container. + +### Using `prefetch` and `gather` together + +If you want to use `prefetch` independently of `gather`, you can +use the prefetch output as a picklist passed into gather - see picklists, +below@@. This can be useful when you want to experiment with different +threshold parameters for `gather` - first, do a very sensitive/low-threshold +search with `prefetch` and save the results to a CSV file with `-o`, Repeated gathers and searches. @@ -234,7 +374,25 @@ Using prefetch explicitly. ### Using a higher scaled value -### +With FracMinHash sketches, you can downsample the query to make both +`search` and `gather` _much_ faster. A good rule of thumb is to use a +scaled value that is about 5x smaller than the minimum overlap to +detect; so, if you want to be able to detect 50kb of similarity, you +can use a scaled value of 10,000. Conversely, the default scaled value +of 1,000 (for DNA sketches) should robustly detect overlaps of 5kb. + +You can supply `--scaled` to `gather` and `prefetch` to dynamically +downsample the query FracMinHash. For `search` you will need to use +`sourmash sig downsample` to generate a downsampled sketch. + +### Running `gather` many times - `multigather` + +In situations where loading the search database is slow (e.g. +`LCA_Database` or zipfiles with very large manifests), the `sourmash +multigather` command supports many queries against many databases. + +(We don't particularly suggest using `multigather`; we would prefer +to make search databases faster. But it's there! :) ## Taxonomy and assigning lineages @@ -325,10 +483,44 @@ capability as the `LCA_Database`. ## Picklists -picklist types and handling +picklist types and handling. metatypes. +picklist on SBT/LCA vs ZipFileLinearIndex. + using manifests as picklists. using taxonomy as picklists. + +## Similarity matrices with `sourmash compare` + +## ANI + +## Online and streaming + +n+1 problem + +## Formats natively understood by sourmash + +sourmash should always autodetect the format of a collection or +database, in most cases based on its content (and not its +filename). Please file a bug report if this doesn't work for you! + +`sourmash sig summarize` is a good way to examine the properties of a +signature collection. + +### Reading and writing gzipped CSV files + +(As of sourmash v4.5) + +When a CSV filename is specified (e.g. `sourmash gather ... -o +mygather.csv`), you can always provide a name that ends with `.gz` to +produce a gzip-compressed file instead. This can save quite a bit of +space for prefetch results and manifests in particular! + +All sourmash commands that take in a CSV (via manifest, or picklist, +or taxonomy) will autodetect a gzipped CSV based on content (the file +does not need to end with `.gz`). The one exception is manifests, +where the CSV needs to end with `.gz` to be loaded as a gzipped CSV +(but see issue @@). From 0bf170d1c12cf104806684ab27362b775dc71134 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Mon, 15 Aug 2022 17:22:02 -1000 Subject: [PATCH 04/48] even more docs --- doc/sourmash-internals.md | 87 +++++++++++++++++++++++++++++++++++---- 1 file changed, 79 insertions(+), 8 deletions(-) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index 926088f041..d0d044d6eb 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -19,26 +19,97 @@ but we are always happy to add to it - [just ask in an issue!](@@) ## Signatures and sketches -Default signature format is stored as JSON, gzipped. Read by Rust. +sourmash operates on sketches. Each sketch is a collection of hashes. +Each sketch is contained in a signature. -Complication with multiple sketches per signature. +Internally, sketches (class `MinHash`) contain the following information: +* a set of hashes; +* an optional abundance for each hash (when `track_abund` is True); +* a seed; +* a k-mer size; +* a molecule type; +* either a `num` (for MinHash) or a `scaled` value (for FracMinHash); -Right now signatures are 1:1 with sketches. +Signature objects (class `SourmashSignature`) contain a sketch (property `.minhash`) as well as additional information: +* an optional `name` +* an optional `filename` +* a license (currently must be CC0); +* an `md5sum(...)` method that returns a hash of the sketch. + +For now, we tend to refer to signatures and sketches interchangeably, +because they are almost entirely 1:1 in the code base (but see issue +@@). + +The default signature interchange/serialization format is JSON, optionally +gzipped. This format is written and read by Rust code. + +In general, a lot of effort in sourmash is spent managing collections of +signatures _before_ actually doing comparisons with them; see manifests, +and `Index` objects, below. + +### Making sketches + +Sketches are produced by hashing k-mers with murmurhash and then +keeping either the lowest `num` hashes (for MinHashes sketches) or +keeping all hashes below `2**64 / scaled` (for FracMinHash sketches). + +The default MinHash sketches use parameters so that they are +compatible with mash sketches. + +See utils/compute-dna-mh-another-way.py @@ for details on how k-mers are +hashed. + +Note that if hashes are produced some other way (with a different hash +function) or from some source other than DNA, sourmash can still work +with them; only `sourmash sketch` does input checking. + +### Compatibility checking + +The point of the signatures and sketches is to enable certain kinds of +rapid comparisons - Jaccard similarity and overlap analysis, +specifically. However, these comparisons can only be done between +compatible sketches. + +Here, "compatible" means - +* the same MurmurHash seed (default 42); +* the same k-mer size/ksize (see k-mer sizes, below); +* the same molecule type (see molecule types, below); +* the same `num` or `scaled` (although see downsampling discussion @@, and below) + +sourmash uses selectors (`Index.select(...)`) to select sketches with +compatible ksizes, molecule types, and sketch types. ### Scaled (FracMinHash) sketches support similarity and containment -@paper ref +Per our discussion in @paper ref, FracMinHash sketches can always be compared +by downsampling to the max of the two scaled values. (This is not always +true of sketches stored in indexed collections, e.g. SBTs; see @@ #1799.) -downsampling - mostly done dynamically. How it works with index classes. #1799 +In practice, sourmash does all necessary downsampling dynamically, but +returns the original sketches. This means that (for example) you can +do a low-resolution/high-scaled large scale search by specifying a +high scaled value, and then do a higher resolution comparison with +only the highly similar matches the results to do a more refined (see +below, Speeding up `gather` and `search`.) ### Num (MinHash) sketches support Jaccard similarity -less well supported; see issue @@. +"Regular" MinHash (or "num MinHash") sketches are implemented the same +way as in mash. However, they are less well supported in sourmash, +because they don't offer the same opportunities for metagenome +analysis. (See issue @@.) -downsampling may need to be done manually. +Num MinHash sketches can always be compared by downsampling to a +common `num` value. This may need to be done manually using `sourmash +sig downsample`, however. ## K-mer sizes +There is no explicit restriction on k-mer sizes built into sourmash. + +For highly specific genome and metagenome comparisons, we typically +use k=21, k=31, or k=51. @@more links. + ## Molecule types - DNA, protein, Dayhoff, and hydrophobic-polar sourmash supports four different sequence encodings, which we refer to @@ -322,7 +393,7 @@ below. In practice this means that you can provide additional collections of signatures via the command line without building a combined index of all your signatures. It also means that the only reason to choose different collections/containers is for -optimization - you should select the conatiners that help you achieve +optimization - you should select the containers that help you achieve the desired performance characteristics for your search (i.e. the right memory/time/disk space tradeoffs). From f3ccf11831a9362fc791698fa3398500eb7e8484 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Mon, 15 Aug 2022 17:54:22 -1000 Subject: [PATCH 05/48] even more docs --- doc/sourmash-internals.md | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index d0d044d6eb..2bba16b643 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -485,7 +485,8 @@ name to do so - this is the name as set by the `--name` parameter to sourmash prefers identifiers to be the first space-separated token in the signature name. This token can contain any alphanumeric letters other than space (@@ctb check me), and should contain at most one -period. The version of the identifier will be the +period. The version of the identifier will be the component after +the period. So, for example, for a signature name of @@ -554,7 +555,21 @@ capability as the `LCA_Database`. ## Picklists -picklist types and handling. +Picklists are a generic mechanism used to select a (potentially small) +subset of signatures for search/display. + +The general idea of picklists is that you create a list of signatures +you're interested in - by name, or identifier, or md5sum - and then +supply that list in a csvfile on the command line via `--picklist`. +For example, `--picklist list.csv:colname:ident` would load the +values in the column named `colname` in the file `list.csv` as identifiers +to be used to restrict the search. + +The support picklist column types are `name`, `ident` +(space-delimieted identifier), `identprefix` (identifier with version +removed), `md5`, `md5prefix8`, and `md5short`. Generally the `md5` +and derived values are used to reference signatures found some other +way with sourmash, while the identifiers are more broadly useful. metatypes. From 4c60cd7fe731ad8ed281e98e56475adb5038cc84 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Mon, 15 Aug 2022 18:02:39 -1000 Subject: [PATCH 06/48] mo' --- doc/sourmash-internals.md | 24 ++++++++++++++++++++---- 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index 2bba16b643..6adea8a5b3 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -571,13 +571,29 @@ removed), `md5`, `md5prefix8`, and `md5short`. Generally the `md5` and derived values are used to reference signatures found some other way with sourmash, while the identifiers are more broadly useful. -metatypes. +There are also four special column types that can be used without a column +name: `gather`, `prefetch`, `search`, and `manifest`. These take the +CSV output of the respective sourmash commands as inputs for picklists, +so that you can use prefetch to generate a picklist and then use that +picklist with `--picklist prefetch_out.csv.gz::prefetch`. -picklist on SBT/LCA vs ZipFileLinearIndex. +### Differing internal behavior -using manifests as picklists. +Picklists behave differently with different `Index` classes. -using taxonomy as picklists. +For indexed databases like SBT and LCA, the search is done _first_, and +then only those results that match the picklist are selected. + +For linear search databases like `ZipFileLinearIndex` or standalone +manifests, picklists are _first_ used to subselect the desired +signatures, and then only those signatures are searched. + +So, performance can vary widely, but the results should be the same. + +### Taxonomy / lineage spreadsheets as picklists + +Note that lineage CSV spreadsheets as consumed by `sourmash tax` commands, +and as output by `sourmash tax grep`, can be used as `ident` picklists. ## Similarity matrices with `sourmash compare` From 471ddf46cce5382bd51379973611eb31ba031a5f Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Tue, 16 Aug 2022 04:16:08 -1000 Subject: [PATCH 07/48] add issue links --- doc/sourmash-internals.md | 104 ++++++++++++++++++++------------------ 1 file changed, 56 insertions(+), 48 deletions(-) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index 6adea8a5b3..763a03418c 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -15,7 +15,8 @@ This document is a brain dump intended for expert users and sourmash developers who want to understand how, why, and when to use various sourmash features. It is unlikely ever to be comprehensive, so the information you are interested in may not yet exist in this document, -but we are always happy to add to it - [just ask in an issue!](@@) +but we are always happy to add to it - +[just ask in an issue!](https://github.com/sourmash-bio/sourmash/issues) ## Signatures and sketches @@ -37,8 +38,7 @@ Signature objects (class `SourmashSignature`) contain a sketch (property `.minha * an `md5sum(...)` method that returns a hash of the sketch. For now, we tend to refer to signatures and sketches interchangeably, -because they are almost entirely 1:1 in the code base (but see issue -@@). +because they are almost entirely 1:1 in the code base (but see [sourmash#616](https://github.com/sourmash-bio/sourmash/issues/616)). The default signature interchange/serialization format is JSON, optionally gzipped. This format is written and read by Rust code. @@ -56,7 +56,7 @@ keeping all hashes below `2**64 / scaled` (for FracMinHash sketches). The default MinHash sketches use parameters so that they are compatible with mash sketches. -See utils/compute-dna-mh-another-way.py @@ for details on how k-mers are +See [utils/compute-dna-mh-another-way.py](https://github.com/sourmash-bio/sourmash/blob/latest/utils/compute-dna-mh-another-way.py) for details on how k-mers are hashed. Note that if hashes are produced some other way (with a different hash @@ -74,16 +74,16 @@ Here, "compatible" means - * the same MurmurHash seed (default 42); * the same k-mer size/ksize (see k-mer sizes, below); * the same molecule type (see molecule types, below); -* the same `num` or `scaled` (although see downsampling discussion @@, and below) +* the same `num` or `scaled` (although see [this downsampling discussion](api-example.md#downsampling-signatures), and the next two sections); sourmash uses selectors (`Index.select(...)`) to select sketches with compatible ksizes, molecule types, and sketch types. ### Scaled (FracMinHash) sketches support similarity and containment -Per our discussion in @paper ref, FracMinHash sketches can always be compared +Per our discussion in [Irber et al., 2022](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2), FracMinHash sketches can always be compared by downsampling to the max of the two scaled values. (This is not always -true of sketches stored in indexed collections, e.g. SBTs; see @@ #1799.) +true of sketches stored in indexed collections, e.g. SBTs; see [sourmash#1799](https://github.com/sourmash-bio/sourmash/issues/1799).) In practice, sourmash does all necessary downsampling dynamically, but returns the original sketches. This means that (for example) you can @@ -97,7 +97,7 @@ below, Speeding up `gather` and `search`.) "Regular" MinHash (or "num MinHash") sketches are implemented the same way as in mash. However, they are less well supported in sourmash, because they don't offer the same opportunities for metagenome -analysis. (See issue @@.) +analysis. (See also [sourmash#1354](https://github.com/sourmash-bio/sourmash/issues/1354).) Num MinHash sketches can always be compared by downsampling to a common `num` value. This may need to be done manually using `sourmash @@ -108,7 +108,7 @@ sig downsample`, however. There is no explicit restriction on k-mer sizes built into sourmash. For highly specific genome and metagenome comparisons, we typically -use k=21, k=31, or k=51. @@more links. +use k=21, k=31, or k=51. For a longer discussion, see [Assembly Free Analysis with K-mers](https://github.com/mblstamps/stamps2022/blob/main/kmers_and_sourmash/2022-stamps-assembly-free%20class.pdf) from STAMPS 2022. ## Molecule types - DNA, protein, Dayhoff, and hydrophobic-polar @@ -154,7 +154,7 @@ subclasses of `Index` such as the inverted indices, manifests are generated dynamically by the class itself. Currently (sourmash 4.x) manifests do not contain information about the -hash seed or ... This will be fixed in future, see @@issue. +hash seed or sketch license. This will be fixed in the future - see [sourmash#1849](https://github.com/sourmash-bio/sourmash/issues/1849). Manifests are very flexible and, especially when stored in a sqlite database, can be extremely performant for organizing hundreds of @@ -180,7 +180,7 @@ same APIs used by the command-line functionality. There are quite a few different `Index` subclasses and they all have distinct features. We have a high-level guide to which collection -type to use @@here in command-line. +type to use [here](command-line.md#loading-many-signatures). Conceptually, `Index` classes are either organized around storing individual signatures often with metadata that permits loading, @@ -212,7 +212,7 @@ a manifest as it loads the sketches. Note that `MultiIndex` is the class used to load signatures from pathlists, directory hierarchies, and so on; because it stores sketches in memory, this can incur a significant memory penalty (see -issue @@). Therefore where possible we suggest building a standalone +[sourmash#1899](https://github.com/sourmash-bio/sourmash/issues/1899)). Therefore where possible we suggest building a standalone manifest (`StandaloneManifestIndex`) to do lazy loading from the disk instead; you can use `sourmash sig collect` to do this. @@ -237,7 +237,7 @@ with `search` you may want to use an SBT or LCA database instead; see below. In the future we expect to parallelize searching `ZipFileLinearIndex` -files in Rust; see @@greyhound. +files in Rust; see [sourmash#1752](https://github.com/sourmash-bio/sourmash/issues/1752). `ZipFileLinearIndex` does support zip files without manifests as well as multiple signatures in a single file; this was originally intended @@ -251,12 +251,13 @@ manifest and signatures broken out into individual files. ### Sequence Bloom Trees (SBTs) -Sequence Bloom Trees (SBTs; @@link) provide a faster (but more memory -intensive) on-disk storage and search mechanism. In brief, SBTs implement -a binary tree organization of an arbitrary number of signatures; each -internal node is a Bloom filter containing all of the hashes for the -nodes below them. This permits potentially rapid elimination of -irrelevant nodes on search. +Sequence Bloom Trees (SBTs; see +[the Kingsford Lab page for details](http://www.cs.cmu.edu/~ckingsf/software/bloomtree/)) +provide a faster (but more memory intensive) on-disk storage and +search mechanism. In brief, SBTs implement a binary tree organization +of an arbitrary number of signatures; each internal node is a Bloom +filter containing all of the hashes for the nodes below them. This +permits potentially rapid elimination of irrelevant nodes on search. SBTs are restricted to storing and searching sketches with the same/single k-mer size and molecule type, as well as either a single @@ -286,17 +287,19 @@ The newer `LCA_SqliteDatabase` (based on `SqliteIndex`, described below) also supports LCA-style queries, and is stored on disk, is fast to load, and uses very little memory. The tradeoff is that the underlying sqlite database can be quite large. `LCA_SqliteDatabase` should also -support rapid concurrent access (see issue @@). +support rapid concurrent access (see [sourmash#909](https://github.com/sourmash-bio/sourmash/issues/909)). Both types of LCA database can be constructed with `sourmash lca index`. ### SqliteIndex The `SqliteIndex` storage class uses sqlite3 to store hashes and -sketch information for search and retrieval. (@@see blog post) These -are fast, low-memory, on-disk databases, with the tradeoff that they -can be quite large. This is probably currently the best solution for -concurrent access to sketches via e.g. a Web server (see issue@). +sketch information for search and retrieval; see +[this blog post](http://ivory.idyll.org/blog/2022-storing-ulong-in-sqlite-sourmash.html) +for background information and details. These are fast, low-memory, +on-disk databases, with the tradeoff that they can be quite large. +This is probably currently the best solution for concurrent access to +sketches via e.g. a Web server (see issue@). `SqliteIndex` can only contain FracMinHash sketches and can only store sketches with the same scaled parameter. However, it can store @@ -349,14 +352,15 @@ or a standalone manifest index (generated with `sourmash sig collect`). This is because the signatures from pathlists are loaded into memory (see `MultiIndex`, above) it is generally a bad idea to use them - they may be slow to load and may consume a lot of -memory. They also do not support good loading error messages -(@@example/issue). +memory. They also do not support good loading error messages; +see [sourmash#1414](https://github.com/sourmash-bio/sourmash/issues/1414). ### Extensions for outputting index classes Most commands that support saving signatures will save them in a -variety of formats, based on the extension provided (see @@exceptions -issue for exceptions). The supported extensions are - +variety of formats, based on the extension provided (see +[sourmash#1890](https://github.com/sourmash-bio/sourmash/issues/1890) +for exceptions). The supported extensions are - * `.zip` for `ZipFileLinearIndex` * `.sqldb` for `SqliteIndex` @@ -371,7 +375,7 @@ recognized. There are two primary search commands in sourmash: `gather` and `search`. -`gather` calculates a minimum metagenome cover as discussed in @@. It +`gather` calculates a minimum metagenome cover as discussed in [Irber et al., 2022](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2). It is mostly intended for querying a database with a metagenome, although it can be used with genome queries, as well. It depends on overlap analyses and can only be used with FracMinHash sketches. @@ -425,7 +429,7 @@ the matches for rapid min-set-cov analysis. This single pass across the database is called a "prefetch", and it is also implemented in the `prefetch` subcommand. -With this single pass approach, benchmarks (@@ link) show that a +With this single pass approach, benchmarks - [sourmash#2014](https://github.com/sourmash-bio/sourmash/issues/2014) - show that a linearly searchable database is performant enough to be used with `gather`. We therefore suggest using a `ZipFileLinearIndex` container with gather, or in cases where low-memory concurrency is desired, a @@ -433,11 +437,12 @@ with gather, or in cases where low-memory concurrency is desired, a ### Using `prefetch` and `gather` together -If you want to use `prefetch` independently of `gather`, you can -use the prefetch output as a picklist passed into gather - see picklists, -below@@. This can be useful when you want to experiment with different -threshold parameters for `gather` - first, do a very sensitive/low-threshold -search with `prefetch` and save the results to a CSV file with `-o`, +If you want to use `prefetch` independently of `gather`, you can use +the prefetch output as a picklist passed into gather - see +[picklists](#picklists), below. This can be useful when you want to +experiment with different threshold parameters for `gather` - first, +do a very sensitive/low-threshold search with `prefetch` and save the +results to a CSV file with `-o`, Repeated gathers and searches. @@ -506,12 +511,11 @@ The two modifiers are: * `--keep-full-identifiers` will use the entire signature name instead of just the first space-separated token. It is by default -off (set to false). +off (set to False). * `--keep-identifier-versions` turns on keeping the full identifier, -including what is after the first period. It is by default on (set to -True) @@ctb check me. When it is off (False), identifiers are stripped -of their version on load. +including what is after the first period. It is by default off (set to +False), stripping identifiers of their version on load. When it is on (True), identifiers are not stripped of their version on load. ### Taxonomies, or lineage spreadsheets @@ -533,8 +537,7 @@ inconsistently handled within sourmash internally. For spreadsheet organization, `lca index` expects the columns to be present in order from superkingdom on down, while the `tax` subcommands use CSV column headers instead. We are planning to -consolidate around the `tax` subcommand handling in the future (@@ctb -issue). +consolidate around the `tax` subcommand handling in the future (see [sourmash#2198](https://github.com/sourmash-bio/sourmash/issues/2198)). @@link to example spreadsheets, talk about how to extract. @@ -581,18 +584,21 @@ picklist with `--picklist prefetch_out.csv.gz::prefetch`. Picklists behave differently with different `Index` classes. -For indexed databases like SBT and LCA, the search is done _first_, and -then only those results that match the picklist are selected. +For indexed databases like SBT, LCA, and `SqliteIndex`, the search is +done _first_, and then only those results that match the picklist are +selected. For linear search databases like `ZipFileLinearIndex` or standalone manifests, picklists are _first_ used to subselect the desired -signatures, and then only those signatures are searched. +signatures, and only those signatures are searched. -So, performance can vary widely, but the results should be the same. +This means that picklists can dramatically speed up searches on some +`Index` types, but won't affect performance on others. But +the results will be the same. ### Taxonomy / lineage spreadsheets as picklists -Note that lineage CSV spreadsheets as consumed by `sourmash tax` commands, +Note that lineage CSV spreadsheets, as consumed by `sourmash tax` commands and as output by `sourmash tax grep`, can be used as `ident` picklists. ## Similarity matrices with `sourmash compare` @@ -624,5 +630,7 @@ space for prefetch results and manifests in particular! All sourmash commands that take in a CSV (via manifest, or picklist, or taxonomy) will autodetect a gzipped CSV based on content (the file does not need to end with `.gz`). The one exception is manifests, -where the CSV needs to end with `.gz` to be loaded as a gzipped CSV -(but see issue @@). +where the CSV needs to end with `.gz` to be loaded as a gzipped CSV; +see +[sourmash#2214](https://github.com/sourmash-bio/sourmash/issues/2214) +for an issue to fix this. From 419e4335b6c45ab3c50e0f0db83c2154941acbc0 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Tue, 16 Aug 2022 04:23:38 -1000 Subject: [PATCH 08/48] fix more links --- doc/sourmash-internals.md | 26 ++++++++++++++------------ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index 763a03418c..3df87f6277 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -125,18 +125,21 @@ DNA moltype sketches can be constructed from DNA input sequences using Protein, Dayhoff, and HP moltype sketches can be constructed from protein input sequences using `sourmash sketch protein`, or from DNA input sequences using `sourmash sketch translate`; `translate` will -translate in all six reading frames (see also @orpheum). By default -protein sketches will be created; dayhoff sketches can be created by -including `dayhoff` in the param string, e.g. `sourmash sketch protein --p dayhoff`, and hydrophobic-polar sketches can be built with `hp` in -the param string. +translate in all six reading frames (see also +[orpheum](https://github.com/czbiohub/orpheum) from +[Botvinnik et al., 2021](https://www.biorxiv.org/content/10.1101/2021.07.09.450799v1)). +By default protein sketches will be created; dayhoff sketches can be +created by including `dayhoff` in the param string, e.g. `sourmash +sketch protein -p dayhoff`, and hydrophobic-polar sketches can be +built with `hp` in the param string. ## Manifests -sourmash makes extensive use of signature manifests to support -rapid selection and lazy loading of signatures based on signature -metadata (name, ksize, moltype, etc.) See @@this blog post for -some of the motivation. +sourmash makes extensive use of signature manifests to support rapid +selection and lazy loading of signatures based on signature metadata +(name, ksize, moltype, etc.) See +[Blog post: Scaling sourmash to millions of samples](http://ivory.idyll.org/blog/2021-sourmash-scaling-to-millions.html) +for some of the motivation. Manifests are an internal format that is not meant to be particularly human readable, but the CSV format can be loaded into a spreadsheet @@ -299,7 +302,7 @@ sketch information for search and retrieval; see for background information and details. These are fast, low-memory, on-disk databases, with the tradeoff that they can be quite large. This is probably currently the best solution for concurrent access to -sketches via e.g. a Web server (see issue@). +sketches via e.g. a Web server (see also [sourmash#909](https://github.com/sourmash-bio/sourmash/issues/909)). `SqliteIndex` can only contain FracMinHash sketches and can only store sketches with the same scaled parameter. However, it can store @@ -422,8 +425,7 @@ time. `gather` is typically used to search a metagenome against a large database of sketches, as part of finding a minimum set cover. This can -be quite slow! Our current implementation (as of sourmash vXX, @pull -request) does a single pass across the database to find all matches +be quite slow! Our current implementation (as of sourmash 4.1.0, [pull request sourmash#1370](https://github.com/sourmash-bio/sourmash/pull/1370)) does a single pass across the database to find all matches with an overlap above the provided threshold, and then organizes the matches for rapid min-set-cov analysis. This single pass across the database is called a "prefetch", and it is also implemented in the From 8e9248fe0e77ffb17585915e6621eb0df2347496 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Wed, 17 Aug 2022 06:02:42 -0700 Subject: [PATCH 09/48] notes --- doc/sourmash-internals.md | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index 3df87f6277..78eacf641d 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -603,10 +603,16 @@ the results will be the same. Note that lineage CSV spreadsheets, as consumed by `sourmash tax` commands and as output by `sourmash tax grep`, can be used as `ident` picklists. -## Similarity matrices with `sourmash compare` - ## ANI +estimated from k-mers + +containment preferred + +ani variability, point at issue + +minimum size for ani calculation + ## Online and streaming n+1 problem From 91fbf7c1daad71f5f3d43b8081f799b67c80e6d7 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Wed, 17 Aug 2022 12:24:07 -0700 Subject: [PATCH 10/48] update text --- doc/sourmash-internals.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index 78eacf641d..267faf0622 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -52,6 +52,10 @@ and `Index` objects, below. Sketches are produced by hashing k-mers with murmurhash and then keeping either the lowest `num` hashes (for MinHashes sketches) or keeping all hashes below `2**64 / scaled` (for FracMinHash sketches). +This has the effect of selecting approximately one hash for every +`scaled` k-mers - so, when sketching a set of 100,000 distinct k-mers, +a scaled value of 1,000 would yield approximately 100 hashes to be +retained in the sketch. The default MinHash sketches use parameters so that they are compatible with mash sketches. From b67d38f69e188fdedb6a0c33982b294bf9c15b42 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Thu, 18 Aug 2022 13:50:07 -0700 Subject: [PATCH 11/48] Apply suggestions from code review thanks! Co-authored-by: ccbaumler <63077899+ccbaumler@users.noreply.github.com> --- doc/sourmash-internals.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index 267faf0622..0d4531521c 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -92,7 +92,7 @@ true of sketches stored in indexed collections, e.g. SBTs; see [sourmash#1799](h In practice, sourmash does all necessary downsampling dynamically, but returns the original sketches. This means that (for example) you can do a low-resolution/high-scaled large scale search by specifying a -high scaled value, and then do a higher resolution comparison with +high `scaled` value, and then do a higher resolution comparison with only the highly similar matches the results to do a more refined (see below, Speeding up `gather` and `search`.) @@ -135,7 +135,7 @@ translate in all six reading frames (see also By default protein sketches will be created; dayhoff sketches can be created by including `dayhoff` in the param string, e.g. `sourmash sketch protein -p dayhoff`, and hydrophobic-polar sketches can be -built with `hp` in the param string. +built with `hp` in the param string, e.g. `sourmash sketch protein -p hp`. ## Manifests @@ -430,7 +430,7 @@ time. `gather` is typically used to search a metagenome against a large database of sketches, as part of finding a minimum set cover. This can be quite slow! Our current implementation (as of sourmash 4.1.0, [pull request sourmash#1370](https://github.com/sourmash-bio/sourmash/pull/1370)) does a single pass across the database to find all matches -with an overlap above the provided threshold, and then organizes +with a Jaccard similarity or containment above the provided threshold, and then organizes the matches for rapid min-set-cov analysis. This single pass across the database is called a "prefetch", and it is also implemented in the `prefetch` subcommand. From fbbcee56c7892816092052d457232669a519fc46 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Thu, 18 Aug 2022 13:58:06 -0700 Subject: [PATCH 12/48] address many of @ccbaumler suggestions --- doc/sourmash-internals.md | 27 ++++++++++++++------------- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index 0d4531521c..17bde8672f 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -91,10 +91,10 @@ true of sketches stored in indexed collections, e.g. SBTs; see [sourmash#1799](h In practice, sourmash does all necessary downsampling dynamically, but returns the original sketches. This means that (for example) you can -do a low-resolution/high-scaled large scale search by specifying a -high `scaled` value, and then do a higher resolution comparison with -only the highly similar matches the results to do a more refined (see -below, Speeding up `gather` and `search`.) +do a low-resolution/high-scaled search quickly by specifying a +high `scaled` value, and then use a higher resolution comparison with +the resulting matches for a more refined and accurate analysis (see +below, [Speeding up `gather` and `search`](#speeding-up-gather-and-search).) ### Num (MinHash) sketches support Jaccard similarity @@ -112,13 +112,14 @@ sig downsample`, however. There is no explicit restriction on k-mer sizes built into sourmash. For highly specific genome and metagenome comparisons, we typically -use k=21, k=31, or k=51. For a longer discussion, see [Assembly Free Analysis with K-mers](https://github.com/mblstamps/stamps2022/blob/main/kmers_and_sourmash/2022-stamps-assembly-free%20class.pdf) from STAMPS 2022. +use k=21, k=31, or k=51. For a longer discussion, see [Assembly Free Analysis with K-mers](https://github.com/mblstamps/stamps2022/blob/main/kmers_and_sourmash/2022-stamps-assembly-free%20class.pdf) from STAMPS 2022 +and a more general overview at [Using sourmash:a practical guide](using-sourmash-a-guide.md). ## Molecule types - DNA, protein, Dayhoff, and hydrophobic-polar sourmash supports four different sequence encodings, which we refer to -as "molecule: DNA (`--dna`), protein (`--protein`), Dayhoff, -(`--dayhoff`), and hydrophobic-polar (`--hp`). +as "molecule": DNA (`--dna`), protein (`--protein`), [Dayhoff encoding](https://en.wikipedia.org/wiki/Margaret_Oakley_Dayhoff#Table_of_Dayhoff_encoding_of_amino_acids), +(`--dayhoff`), and [hydrophobic-polar](sourmash-sketch.md#protein-encodings) (`--hp`). All FracMinHash sketches have exactly one molecule type, and can only be compared to the same molecule type (and ksize). @@ -149,16 +150,16 @@ Manifests are an internal format that is not meant to be particularly human readable, but the CSV format can be loaded into a spreadsheet program if you're curious :). -If in a zipfile (`.zip`) or SBT zipfile (`.sbt.zip`), manifests must +If a collection of signatures is in a zipfile (`.zip`) or SBT zipfile (`.sbt.zip`), manifests must be named `SOURMASH-MANIFEST.csv`. They can also be stored directly on disk in CSV/gzipped CSV, or in a sqlite database; see `sourmash sig manifest`, `sourmash sig check`, and `sourmash sig collect` for manifest creation, management, and export utilities. Where signatures are stored individually in `Index` collections, -e.g. in a zipfile, manifests may be stored alongside them; for other -subclasses of `Index` such as the inverted indices, manifests are -generated dynamically by the class itself. +e.g. as separate files in a zipfile, manifests may be stored alongside +them; for other subclasses of `Index` such as the inverted indices, +manifests are generated dynamically by the class itself. Currently (sourmash 4.x) manifests do not contain information about the hash seed or sketch license. This will be fixed in the future - see [sourmash#1849](https://github.com/sourmash-bio/sourmash/issues/1849). @@ -384,8 +385,8 @@ There are two primary search commands in sourmash: `gather` and `gather` calculates a minimum metagenome cover as discussed in [Irber et al., 2022](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2). It is mostly intended for querying a database with a metagenome, although -it can be used with genome queries, as well. It depends on overlap -analyses and can only be used with FracMinHash sketches. +it can be used with genome queries, as well. This approach relies on overlaps +between genomes and metagenomes and can only be used with FracMinHash sketches. `search` does a straight Jaccard similarity search on MinHash and FracMinHash sketches (or, with `--containment`, a containment search From 87ab0dcde3d9dd52cd611e7bf59852a6a9ea1b6d Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Thu, 18 Aug 2022 14:16:35 -0700 Subject: [PATCH 13/48] Update doc/sourmash-internals.md Co-authored-by: ccbaumler <63077899+ccbaumler@users.noreply.github.com> --- doc/sourmash-internals.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index 17bde8672f..94d343263e 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -107,6 +107,10 @@ Num MinHash sketches can always be compared by downsampling to a common `num` value. This may need to be done manually using `sourmash sig downsample`, however. +### Conversion between Scaled (FracMinHash) and Num (MinHash) signatures with `downsample` + +As discussed in the previous sections, it is possible to adjust the `scaled` and `num` values to compare two FracMinHash signatures or two Num MinHash signatures. However, it is also possible to covert between the `scaled` and `num` signatures with the `sourmash sig downsample` command. For more details, review [command line](https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-signature-downsample-decrease-the-size-of-a-signature) document. + ## K-mer sizes There is no explicit restriction on k-mer sizes built into sourmash. From ee17cf3e22336149b1fb7168960a07b3237a332f Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Thu, 18 Aug 2022 14:17:32 -0700 Subject: [PATCH 14/48] misc changes --- doc/sourmash-internals.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index 94d343263e..a26fd1a5b0 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -20,8 +20,10 @@ but we are always happy to add to it - ## Signatures and sketches -sourmash operates on sketches. Each sketch is a collection of hashes. -Each sketch is contained in a signature. +sourmash operates on sketches. Each sketch is a collection of hashes, +which are in turn built from k-mers by applying a hash function +(currently always murmurhash) and a filtering function. Each sketch +is contained in a signature wrapper that contains some metadata. Internally, sketches (class `MinHash`) contain the following information: * a set of hashes; @@ -70,7 +72,7 @@ with them; only `sourmash sketch` does input checking. ### Compatibility checking The point of the signatures and sketches is to enable certain kinds of -rapid comparisons - Jaccard similarity and overlap analysis, +rapid comparisons - Jaccard similarity and number of overlapping k-mers, specifically. However, these comparisons can only be done between compatible sketches. @@ -580,7 +582,7 @@ values in the column named `colname` in the file `list.csv` as identifiers to be used to restrict the search. The support picklist column types are `name`, `ident` -(space-delimieted identifier), `identprefix` (identifier with version +(space-delimited identifier), `identprefix` (identifier with version removed), `md5`, `md5prefix8`, and `md5short`. Generally the `md5` and derived values are used to reference signatures found some other way with sourmash, while the identifiers are more broadly useful. From 8fce224ec7ae59277a3dcb3a27b8ac4ea12287e7 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 20 Aug 2022 08:37:41 -0700 Subject: [PATCH 15/48] whups, add faq.md --- doc/faq.md | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 72 insertions(+) create mode 100644 doc/faq.md diff --git a/doc/faq.md b/doc/faq.md new file mode 100644 index 0000000000..fed5fd4e64 --- /dev/null +++ b/doc/faq.md @@ -0,0 +1,72 @@ +# Frequently Asked Questions + +## How is sourmash different from mash? + +mash is an awesome piece of software that inspired sourmash - hence the +name we chose for sourmash! But are they the same? No, they are not! + +mash is based on MinHash sketching, a technique for randomly selecting +a very small set of k-mers to represent a potentially very large +sequence. MinHash sketches can be used to estimate Jaccard similarity +(a distance metric that lets you find closely related sequences) +with very high accuracy, under one condition - that the two sequences +be of similar size. + +So, MinHash sketching is great when comparing bacterial genomes, which are +all around 2-10 MB in size. + +But... MinHash sketching doesn't work when comparing _metagenomes_ to +genomes, because metagenomes are usually _much_ larger than genomes. + +So we built sourmash around a different kind of sketch - FracMinHash - +which can estimate Jaccard similarity between sets of very different +sizes. FracMinHash sketches also support overlap and containment analysis, +and are convenient in a variety of other ways - see our paper on +FracMinHash, [Lightweight compositional analysis of metagenomes with +FracMinHash and minimum metagenome covers](@@). + +There are some drawbacks to FracMinHash sketches - read on! + +## What are the drawbacks to FracMinHash and sourmash? + +There are two drawbacks to FracMinHash relative to MinHash. Neither +one is a showstopper, but you should know about them! + +One drawback is that FracMinHash sketches can be a _lot_ bigger than +MinHash sketches - technically speaking, MinHash sketches are bounded +in size (you pick the number of hashes to keep!) while FracMinHash +sketches can get arbitrarily large. In practice this means that when +you're sketching large metagenomes, you end up with large sketches. +How large depends on the _cardinality_ of the metagenome k-mers - if a +metagenome has 1 billion unique k-mers, a FracMinHash sketch with scaled +parameter of 1000 will contain a million hashes. + +The other drawback is that FracMinHash sketches _don't work well_ for +very small sequences. Our default parameter choice for DNA +(scaled=1000) works well for finding 10 kb or larger matches between +sequences - some simple Poisson matching math suggests that about 99.98% +of 10kb overlaps will be found with scaled=1000. + +## What k-mer size(s) should I use with sourmash? + +The short answer is: for DNA, use k=31. + +Slightly longer answer: when we look at the k-mer distribution +across all of the bacterial genomes in GTDB, we find that 99% or +more of 31-mers are _genome_, _species_ or _genus_ specific. + +If you go lower (say, k=21), then you get a few percent of k-mers +that match above the genus level - family or above. + +If you go higher (k=51), a higher percentage of k-mers are genome-specific. + +For the core sourmash operations - search, gather, and compare - +we believe (with evidence!) that (a) the differences between k=21, +k=31, and k=51 are negligible; and that (b) k=31 works fine for +most day-to-day use of sourmash. + +We also provide [Genbank and GTDB databases](@@) for k=21, k=31, and +k=51, too. + +## How do k-mer-based analyses compare with read mapping? + From 2f1eb4b47bc451ed9e763616c2c785ec6077d35f Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 23 Sep 2023 16:15:00 -0700 Subject: [PATCH 16/48] update some text --- doc/new.md | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/doc/new.md b/doc/new.md index ea19160f71..8ee9b11c71 100644 --- a/doc/new.md +++ b/doc/new.md @@ -9,11 +9,10 @@ larger, including bacterial and viral genomes. You might try sourmash if you want to - -* identify which reference genomes are present in a metagenome -* search all Genbank microbial genomes with a genome query -* build an annotation-free clustering of many genomes using k-mers or average nucleotide identity (ANI) +* identify which reference genomes to map your metagenomic reads to +* search all Genbank microbial genomes with a sequence query +* cluster many genomes by similarity * taxonomically classify genomes or metagenomes against NCBI and/or GTDB; -* search all available reference genomes * search thousands of metagenomes with a query genome or sequence Underneath, sourmash uses [FracMinHash sketches](@@) for fast and @@ -22,8 +21,8 @@ lightweight sequence comparison; FracMinHash builds on _and_ containment analyses with k-mers. This significantly expands the range of operations that can be done quickly and in low memory. sourmash also implements a number of new and powerful analysis -techniques, including minimum metagenome covers and alignment-free ANI -estimation. +techniques, including [minimum metagenome covers](@@) and [alignment-free ANI +estimation](@@). sourmash is inspired by [mash](https://mash.readthedocs.io), and supports most mash analyses. sourmash also implements an expanded set From 87ae6a2dae9f72594e8046406c059291c2af54e2 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Tue, 26 Sep 2023 07:59:38 -0700 Subject: [PATCH 17/48] more FAQ --- doc/faq.md | 116 ++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 98 insertions(+), 18 deletions(-) diff --git a/doc/faq.md b/doc/faq.md index fed5fd4e64..5cf5d1bf26 100644 --- a/doc/faq.md +++ b/doc/faq.md @@ -1,16 +1,22 @@ # Frequently Asked Questions +```{contents} Contents +:depth: 3 +``` + ## How is sourmash different from mash? -mash is an awesome piece of software that inspired sourmash - hence the -name we chose for sourmash! But are they the same? No, they are not! +[mash](https://mash.readthedocs.org/) is an awesome piece of software +that inspired sourmash - hence the name we chose for sourmash! But are +they the same? No, they are not! -mash is based on MinHash sketching, a technique for randomly selecting -a very small set of k-mers to represent a potentially very large -sequence. MinHash sketches can be used to estimate Jaccard similarity -(a distance metric that lets you find closely related sequences) -with very high accuracy, under one condition - that the two sequences -be of similar size. +mash is based on +[MinHash sketching](https://en.wikipedia.org/wiki/MinHash), a +technique for randomly selecting a very small set of k-mers to +represent a potentially very large sequence. MinHash sketches can be +used to estimate Jaccard similarity (a distance metric that lets you +find closely related sequences) with very high accuracy, under one +condition - that the two sequences be of similar size. So, MinHash sketching is great when comparing bacterial genomes, which are all around 2-10 MB in size. @@ -20,10 +26,10 @@ genomes, because metagenomes are usually _much_ larger than genomes. So we built sourmash around a different kind of sketch - FracMinHash - which can estimate Jaccard similarity between sets of very different -sizes. FracMinHash sketches also support overlap and containment analysis, -and are convenient in a variety of other ways - see our paper on -FracMinHash, [Lightweight compositional analysis of metagenomes with -FracMinHash and minimum metagenome covers](@@). +sizes. FracMinHash sketches also support overlap and containment +analysis, and are convenient in a variety of other ways - see our +paper on FracMinHash, +[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2). There are some drawbacks to FracMinHash sketches - read on! @@ -60,13 +66,87 @@ that match above the genus level - family or above. If you go higher (k=51), a higher percentage of k-mers are genome-specific. -For the core sourmash operations - search, gather, and compare - -we believe (with evidence!) that (a) the differences between k=21, -k=31, and k=51 are negligible; and that (b) k=31 works fine for -most day-to-day use of sourmash. +For the core sourmash operations - search, gather, and compare - we +believe (with evidence!) that (a) the differences between k=21, k=31, +and k=51 are negligible; and that (b) k=31 works fine for most +day-to-day use of sourmash. + +We also provide [Genbank and GTDB databases](databases.md) for k=21, +k=31, and k=51. -We also provide [Genbank and GTDB databases](@@) for k=21, k=31, and -k=51, too. +For some background on k-mer specificity, we recommend this paper: +[MetaPalette: a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation](https://journals.asm.org/doi/10.1128/msystems.00020-16), +Koslicki & Falush, 2016. ## How do k-mer-based analyses compare with read mapping? +tl;dr very well! But it's a bit one sided: if k-mers match, reads will +map, but not necessarily vice versa. So read mapping rates are almost always +higher than k-mer matching rates. + +### Mapping reads to reference vs k-mer "detection" or containment + +Let's start by looking at a simple example: suppose you have Illumina +shotgun sequencing of a new isolate, and you want to compare it to a +reference genome for a member of the same species. You calculate +k-mer containment (using e.g. `sourmash search --containment +ref.sig.gz isolate.sig.gz`) of the reference genome in the isolate shotgun +sequence to be 65%. You then map the reads from the isolate data to the +reference genome. What should you expect to see? + +What you should see is that 65% or more of the reference genome is +covered by at least one read. This is known as the mapping-based +"detection" of the reference genome in the read data set, and k-mer +detection typically *underestimates* mapping-based detection. + +(If you want to know _how many of the reads will map_, you need to use +a different number that is output by `sourmash gather` - although, for +single genomes with only a few repeats, the percentage of reads that +map should match the k-mer detection number you calculate with +containment.) + +### Why do k-mers underestimate read mapping? + +K-mers underestimate read mapping because k-mers rely on exact matches, +while read mapping tolerates mismatches. So unless you are looking at +entirely error free data with no strain variation, mismatches will +sometimes prevent k-mers from matching. + +For some math: consider a k-mer of size 21. It will only match to a +specific 21 base sequence exactly. If you are matching two isolate genomes +that align completely but have a 95% average nucleotide identity, then +one out of every 20 bases will be different (on average), and, on average, +every 21-mer will be different! (In practice you'd expect about 1/3 of the +k-mers to match if the variation is random, since there will be many 21-base +windows that don't have any mismatches.) However, alignment-based approaches +like read mapping will happily align across single-base mismatches, and so +even at 95% ANI you would expect many reads to align. + +In practice, the numbers will differ, but the intuition remains! + +### How do read mapping rates for metagenomes compare with k-mer statistics? + +@@ + +### Further reading and links on k-mers and mapping: + +* The paper + [Biogeographic Distribution of Five Antarctic Cyanobacteria Using Large-Scale k-mer Searching with sourmash branchwater](https://www.biorxiv.org/content/10.1101/2022.10.27.514113v1), + Lumian et al., 2022, shows that k-mer detection almost always + underestimates mapping, and k-mer abundance analysis is always more + conservative than mapping-based analyses. + +* The paper + [Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash](https://genome.cshlp.org/content/33/7/1061), + Rahman Hera et al., 2023, shows how to translate between average + nucleotide identity (ANI) and k-mer statistics. + +* The paper +[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2), +Irber et al., 2022, describes how `sourmash gather` assigns k-mers +from metagenomes to a set of reference genomes, and shows that read +mapping correlates pretty well with k-mer overlap. Note that it is +focused on *systematic decomposition of metagenomes against reference +databases*, so it tackles the question of analyzing an entire +metagenome against all available references, not just a single +matching genome. From 1d5b75459f7d41d0c3c1d6bd698e3857bb8d09dc Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Tue, 26 Sep 2023 08:03:20 -0700 Subject: [PATCH 18/48] update / add verify notes --- doc/faq.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/doc/faq.md b/doc/faq.md index 5cf5d1bf26..8baebd4ce8 100644 --- a/doc/faq.md +++ b/doc/faq.md @@ -50,8 +50,9 @@ parameter of 1000 will contain a million hashes. The other drawback is that FracMinHash sketches _don't work well_ for very small sequences. Our default parameter choice for DNA (scaled=1000) works well for finding 10 kb or larger matches between -sequences - some simple Poisson matching math suggests that about 99.98% -of 10kb overlaps will be found with scaled=1000. +sequences - some simple Poisson matching math suggests that about +99.98% of 10kb overlaps will be found with scaled=1000. @CTB verify I +think this is 5kb. ## What k-mer size(s) should I use with sourmash? @@ -90,7 +91,7 @@ Let's start by looking at a simple example: suppose you have Illumina shotgun sequencing of a new isolate, and you want to compare it to a reference genome for a member of the same species. You calculate k-mer containment (using e.g. `sourmash search --containment -ref.sig.gz isolate.sig.gz`) of the reference genome in the isolate shotgun +ref.sig.gz isolate.sig.gz` @@verify) of the reference genome in the isolate shotgun sequence to be 65%. You then map the reads from the isolate data to the reference genome. What should you expect to see? @@ -103,7 +104,8 @@ detection typically *underestimates* mapping-based detection. a different number that is output by `sourmash gather` - although, for single genomes with only a few repeats, the percentage of reads that map should match the k-mer detection number you calculate with -containment.) +containment. Read the section below on metagenome analysis for more +information on read mapping rates!) ### Why do k-mers underestimate read mapping? From 7a254ee5592595c82df80e105e4e3efd72f386d2 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Tue, 26 Sep 2023 12:28:19 -0700 Subject: [PATCH 19/48] more writing --- doc/faq.md | 27 ++++++++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/doc/faq.md b/doc/faq.md index 8baebd4ce8..7bb5818d71 100644 --- a/doc/faq.md +++ b/doc/faq.md @@ -128,7 +128,32 @@ In practice, the numbers will differ, but the intuition remains! ### How do read mapping rates for metagenomes compare with k-mer statistics? -@@ +Shotgun metagenome sequencing adds several complicating factors when +comparing metagenome reads to a reference database. First, each genome +will generally be present at a different abundance in the metagenome. +Second, metagenomes rarely contain the exact strain genome present in +a reference database; often the genomes in the metagenome differ both +in terms of SNPs _and_ accessory elements from what's in the reference +database. And third, metagenomes often contain complicated mixtures of +strains - even if one strain is dominant. + +The `sourmash gather` method is built for analyzing metagenomes against +reference databases, and it does so by finding the shortest list of +reference genomes that "cover" all of the k-mers in the metagenome. +This list is arranged by how many k-mers from the metagenome are covered +by that entry in the output: the first match is the biggest, and the +second match is the second biggest, and so on. We call this a "minimum +metagenome cover" and it is described in the Irber et al., 2022, paper below. + +When we construct the minimum metagenome cover, it correlates well with +mapping (per Irber et al., 2022), with one caveat: you need to assign +reads to the reference genomes in the rank order output by gather. +This is needed to properly assign reads that map to multiple genomes +to the "best" match - the one that will capture the most reads. + +`sourmash gather` is still a research method but it seems to work +pretty well - in particular, it is both highly sensitive _and_ highly +specific in taxonomic benchmarking. Please ask questions as you have them! ### Further reading and links on k-mers and mapping: From 651ca6ec8010f61c14f528b0df04507aa8d247fa Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Tue, 26 Sep 2023 12:57:09 -0700 Subject: [PATCH 20/48] more --- doc/command-line.md | 13 +++++++++++++ doc/faq.md | 8 ++++---- 2 files changed, 17 insertions(+), 4 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index f19d2aa5be..c1be8b8b72 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -358,8 +358,21 @@ overlap p_query p_match 0.9 Mbp 7.4% 11.8% BA000019.2 Nostoc sp. PCC 7120 DNA, c... 0.7 Mbp 5.9% 23.0% FOVK01000036.1 Proteiniclasticum rumi... 0.7 Mbp 5.3% 17.6% AE017285.1 Desulfovibrio vulgaris sub... +... +found less than 50.0 kbp in common. => exiting + +found 64 matches total; +the recovered matches hit 94.0% of the abundance-weighted query. +the recovered matches hit 45.6% of the query k-mers (unweighted). ``` +For each match, +* 'overlap', the first column, is the estimated number of k-mers shared between the match and the query. +* 'p_query' is the _percentage_ of the query that overlaps with the match; it is the amount of the metagenome "explained" by this match. +* 'p_match' is the percentage of the _match_ that overlaps with the query; it is the "detection" of the match in the metagenome. + +@CTB bottom recovered matches + The command line option `--threshold-bp` sets the threshold below which matches are no longer reported; by default, this is set to 50kb. see the Appendix in diff --git a/doc/faq.md b/doc/faq.md index 7bb5818d71..7659530693 100644 --- a/doc/faq.md +++ b/doc/faq.md @@ -90,10 +90,10 @@ higher than k-mer matching rates. Let's start by looking at a simple example: suppose you have Illumina shotgun sequencing of a new isolate, and you want to compare it to a reference genome for a member of the same species. You calculate -k-mer containment (using e.g. `sourmash search --containment -ref.sig.gz isolate.sig.gz` @@verify) of the reference genome in the isolate shotgun -sequence to be 65%. You then map the reads from the isolate data to the -reference genome. What should you expect to see? +k-mer containment of the reference genome in the isolate shotgun +sequence to be 65%, using e.g. `sourmash search --containment +ref.sig.gz isolate.sig.gz`. You then map the reads from the isolate +data to the reference genome. What should you expect to see? What you should see is that 65% or more of the reference genome is covered by at least one read. This is known as the mapping-based From 01ed1837332b0eb99c670002a7660307aca0282d Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Wed, 27 Sep 2023 06:30:26 -0700 Subject: [PATCH 21/48] more --- doc/command-line.md | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index c1be8b8b72..675d36ae8c 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -371,11 +371,27 @@ For each match, * 'p_query' is the _percentage_ of the query that overlaps with the match; it is the amount of the metagenome "explained" by this match. * 'p_match' is the percentage of the _match_ that overlaps with the query; it is the "detection" of the match in the metagenome. -@CTB bottom recovered matches +Quite a bit more information per match row is available in the CSV +output saved with `-o`; for details, see +[Classifying signatures: how sourmash gather works](classifying-signatures.html#appendix-a-how-sourmash-gather-works). + +The "recovered matches" lines detail how much of the query is +explained by the entire collection of matches. You will get two numbers if +your metagenome sketch has been calculated with `-p abund`, and only +one if it does not have abundances. The abundance-weighted +number should approximate the fraction of metagenome reads that will +map to at least one reference genome, while the unweighted number +describes how much of the metagenome itself matches to genomes. +Here's another way to put it: if the metagenome could be perfectly +assembled into contigs, the unweighted number would approximate the +number of bases from the contigs that would match perfectly to at +least one genome in the reference database. More practically, +the abundance-weighted number is less sensitive to sequencing errors. +See @CTB classifying signatures or FAQ for more information here! The command line option `--threshold-bp` sets the threshold below which matches are no longer reported; by default, this is set to -50kb. see the Appendix in +50kb. See the Appendix in [Classifying Signatures](classifying-signatures.md) for details. As of sourmash 4.2.0, `gather` supports `--picklist`, to From d99d7c94a053a4d51945a6c289c6f2dc6458627b Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Wed, 27 Sep 2023 06:38:58 -0700 Subject: [PATCH 22/48] fix links etc --- doc/classifying-signatures.md | 3 ++- doc/command-line.md | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/doc/classifying-signatures.md b/doc/classifying-signatures.md index 756e5b4728..d30c36b3fa 100644 --- a/doc/classifying-signatures.md +++ b/doc/classifying-signatures.md @@ -293,7 +293,8 @@ match that is _unique_ with respect to the original query. It will always decrease as you get more matches. The sum of `f_unique_to_query` across all rows is what is reported in by gather as the fraction of query k-mers hit by the recovered matches -(unweighted) and should never be greater than 1! +(unweighted) and should never be greater than 1! This column should +be used in any analysis that needs to avoid double-counting matches. The second column, `f_match_orig`, is how much of the match is in the _original_ query. For this fictional metagenome, each match is diff --git a/doc/command-line.md b/doc/command-line.md index aeb0a85ca9..6557413e4b 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -373,7 +373,7 @@ For each match, Quite a bit more information per match row is available in the CSV output saved with `-o`; for details, see -[Classifying signatures: how sourmash gather works](classifying-signatures.html#appendix-a-how-sourmash-gather-works). +[Classifying signatures: how sourmash gather works](classifying-signatures.md#appendix-a-how-sourmash-gather-works). The "recovered matches" lines detail how much of the query is explained by the entire collection of matches. You will get two numbers if From c60da9656bc807ab1a7e3c7d1798bf4469d9f3f2 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Wed, 27 Sep 2023 09:57:15 -0700 Subject: [PATCH 23/48] add faq --- doc/faq.md | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) diff --git a/doc/faq.md b/doc/faq.md index 7659530693..ce171e9963 100644 --- a/doc/faq.md +++ b/doc/faq.md @@ -54,6 +54,42 @@ sequences - some simple Poisson matching math suggests that about 99.98% of 10kb overlaps will be found with scaled=1000. @CTB verify I think this is 5kb. +## How can I better understand FracMinHash and sourmash intuitively? + +@@ tutorial on k-mers + +## What papers should I read to better understand the FracMinHash approach used by sourmash? + +I would suggest reading these four papers, in order: + +[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2), +Irber et al., 2022. This is the fullest technical description of FracMinHash available. + +[Mash: fast genome and metagenome distance estimation using MinHash](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x), +Ondov et al., 2016. This is the original paper that inspired +sourmash. It discusses sketching with MinHash and does a great job of +showing how well Jaccard estimation works for comparing genomes! A +good contrasting point to take into account is that _MinHash cannot do +overlap or containment estimation_, which nicely motivates the +previous paper and the next two. + +[Mash Screen: high-throughput sequence containment estimation for genome discovery](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1841-x), +Ondov et al., 2019; and +[CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices](https://academic.oup.com/bioinformatics/article/38/Supplement_1/i28/6617499). Both +papers discusses containment and metagenome analysis extensively, and +use an approach that can be usefully contrasted with sourmash. There +is a nice blog post on +[mash screen](https://genomeinformatics.github.io/mash-screen/) that +is worth reading, too! + +If you want a nice chaser, please see this section at the end of the +blog post above: + +>It would be great to see additional methods developed to process +containment scores, reduce the output redundancy, and report accurate +compositional estimates for metagenomes. One easy approach is a +“winner take all” model, like sourmash implements. + ## What k-mer size(s) should I use with sourmash? The short answer is: for DNA, use k=31. From b21282ef7991a99b813e609f73a72b28f8df60cd Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Fri, 29 Sep 2023 16:52:22 -0700 Subject: [PATCH 24/48] add funding acks --- doc/funding.md | 9 +++++++++ doc/new.md | 6 ++++++ 2 files changed, 15 insertions(+) create mode 100644 doc/funding.md diff --git a/doc/funding.md b/doc/funding.md new file mode 100644 index 0000000000..9a717042b7 --- /dev/null +++ b/doc/funding.md @@ -0,0 +1,9 @@ +# Funding + +Sourmash development was initiated under award GBMF4551 from the Gordon +and Betty Moore Foundation to C. Titus Brown. + +Further development was sponsored by: + +* [NSF DBI-2018911](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2018911&HistoricalAwards=false), BBSRC-NSF/BIO:Collaborative Research: genomeRxiv: a microbial whole-genome database and diagnostic marker design resource for classification, identification, and data sharing. +* [NIH 1R03OD030596-01](https://reporter.nih.gov/search/aktj48mYz0ObXo62p_Tx7Q/project-details/10112077), Large-scale annotation-free disease correlation analysis of the iHMP. diff --git a/doc/new.md b/doc/new.md index 8ee9b11c71..29cd529779 100644 --- a/doc/new.md +++ b/doc/new.md @@ -28,6 +28,11 @@ sourmash is inspired by [mash](https://mash.readthedocs.io), and supports most mash analyses. sourmash also implements an expanded set of functionality for metagenome and taxonomic analysis. +sourmash development was initiated with a grant from the Moore +Foundation under the Data Driven Discovery program, and has been +supported by further funding from the NIH and NSF. Please see +[funding acknowledgements](funding.md) for details! + ## Using sourmash ### Tutorials and examples @@ -69,6 +74,7 @@ X and Linux. They require about 5 GB of disk space and 5 GB of RAM. * [Python examples using the API](api-example.md) * [Publications about sourmash](publications.md) * [A guide to the internals of sourmash](sourmash-internals.md) +* [Funding acknowledgements](funding.md) ## Developing and extending sourmash From de2b5a4d79c4f7e2bc12a65f0be0fbb070ad21ad Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Fri, 29 Sep 2023 18:03:31 -0700 Subject: [PATCH 25/48] detailed gather output --- doc/classifying-signatures.md | 137 ++++++++++++++++++++++++++++++++++ 1 file changed, 137 insertions(+) diff --git a/doc/classifying-signatures.md b/doc/classifying-signatures.md index d30c36b3fa..089793c731 100644 --- a/doc/classifying-signatures.md +++ b/doc/classifying-signatures.md @@ -396,3 +396,140 @@ overlap p_query p_match avg_abund The cause of the poor approximation for genome s10 is unclear; it could be due to low coverage, or small genome size, or something else. More investigation needed. + +## Appendix C: sourmash gather output examples + +Below we show two real gather analyses done with a mock metagenome, +SRR606249 (from +[Shakya et al., 2014](https://pubmed.ncbi.nlm.nih.gov/23387867/)) and +three of the known genomes contained within it - two *Shewanella baltica* +strains and one *Akkermansia muciniphila* genome + +### sourmash gather with a query containing abundance information + +``` +% sourmash gather -k 31 SRR606249.trim.sig.zip podar-ref/2.fa.sig podar-ref/47.fa.sig podar-ref/63.fa.sig + +== This is sourmash version 4.8.5.dev0. == +== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == + +selecting specified query k=31 +loaded query: SRR606249... (k=31, DNA) +-- +loaded 9 total signatures from 3 locations. +after selecting signatures compatible with search, 3 remain. + +Starting prefetch sweep across databases. +Prefetch found 3 signatures with overlap >= 50.0 kbp. +Doing gather to generate minimum metagenome cover. + +overlap p_query p_match avg_abund +--------- ------- ------- --------- +5.2 Mbp 0.8% 99.0% 11.7 NC_011663.1 Shewanella baltica OS223... +2.7 Mbp 0.9% 100.0% 24.5 CP001071.1 Akkermansia muciniphila A... +5.2 Mbp 0.3% 51.0% 8.1 NC_009665.1 Shewanella baltica OS185... +found less than 50.0 kbp in common. => exiting + +found 3 matches total; +the recovered matches hit 2.0% of the abundance-weighted query. +the recovered matches hit 2.5% of the query k-mers (unweighted). +``` + +### sourmash gather with the same query, *ignoring* abundances + +``` +% sourmash gather -k 31 SRR606249.trim.sig.zip podar-ref/2.fa.sig podar-ref/47.fa.sig podar-ref/63.fa.sig --ignore-abundance + +== This is sourmash version 4.8.5.dev0. == +== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == + +selecting specified query k=31 +loaded query: SRR606249... (k=31, DNA) +-- +loaded 9 total signatures from 3 locations. +after selecting signatures compatible with search, 3 remain. + +Starting prefetch sweep across databases. +Prefetch found 3 signatures with overlap >= 50.0 kbp. +Doing gather to generate minimum metagenome cover. + +overlap p_query p_match +--------- ------- ------- +5.2 Mbp 1.2% 99.0% NC_011663.1 Shewanella baltica OS223, complete... +2.7 Mbp 0.6% 100.0% CP001071.1 Akkermansia muciniphila ATCC BAA-83... +5.2 Mbp 0.6% 51.0% NC_009665.1 Shewanella baltica OS185, complete... +found less than 50.0 kbp in common. => exiting + +found 3 matches total; +the recovered matches hit 2.5% of the query k-mers (unweighted). +``` + +### Notes and comparisons + +There are a few interesting things to point out about the above output: + +* `p_match` is the same whether or not abundance information is used. + This is because it is the fraction of the matching genome detected in + the metagenome, which is inherently "flat". It is also reported + progressively: only the portions of the metagenome that have not + matched to any previous matches are used in `p_match`; read on for + details :). +* `p_query` is different when abundance information is used. For + queries with abundance information, `p_query` provides a weighted + estimate that approximates the number of metagenome reads that would + map to this genome _after_ mapping reads to all previously reported + matches, i.e. all matches above this match. +* When abundance information is not available or + not used, `p_query` is an estimate of what fraction of the remaining k-mers + in the metagenome match to this genome, after all previous matches + have been removed. +* The `avg_abund` column only shows up when abundance information is + supplied. This is the k-mer coverage of the detected portion of the + match; it is a lower bound on the expected mapping-based coverage + for metagenome reads mapped to the detected portion of the match. +* The percent of recovered matches for the abundance-weighted query + is the sum of the `p_query` column and estimates the total fraction + of metagenome reads that will map across all of the matching references. +* The percent of recovered matches when _ignoring_ abundances is likewise + the sum of the (unweighted) `p_query` column and is not particularly + informative - it will always be low for real metagenomes, because sourmash + cannot match erroneous k-mers created by sequencing errors. +* The `overlap` column is the estimated size of the overlap between the + (unweighted) original query metagenome and the match. It does not take + into account previous matches. + +Last but not least, something interesting is going on here with strains. +While it is not highlighted in the text output of gather, there is +substantial overlap between the two *Shewanella baltica* genomes. And, +in fact, both of them are entirely (well, 99%) present in the metagenome +if measured individually with `sourmash search --containment`. + +Consider a few more details: + +* `p_match` for the first *Shewanella* match, `NC_011663.1`, is 99%! +* `p_match` for the second *Shewanella* match, `NC_009665.1`, is only 50%! +* and, confusingly, the `overlap` for both matches is 5.2 Mbp! + +What's up?! + +What's happening here is that `sourmash gather` is subtracting the match +to the first *Shewanella* genome from the metagenome before moving on to +the next result, and `p_match` reports only the amount of the match +detected in the _remaining_ metagenome after that subtraction. +However, `overlap` is reported as the amount of overlap with the +_original_ metagenome, which is essentially the entire genome in all +three cases. + +The main things to keep in mind for gather are this: + +* `p_query` and `p_match` do not double-count k-mers or matches; in particular, you can sum across `p_query` for a metagenome without + counting anything more than once. +* `overlap` _does_ count matches redundantly. +* the percent of recovered matches is a useful summary of the whole + metagenome! + +We know it's confusing but it's the best output we've been able to +figure out across all of the different use cases for gather. Perhaps +in the future we'll find a better way to represent all of these +numbers in a more clear, concise, and interpretable way; in the +meantime, we welcome your questions and comments! From b789b73ecdb698787967de1b40fb27e0f3de4319 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Fri, 29 Sep 2023 20:56:10 -0700 Subject: [PATCH 26/48] minor updates --- doc/sourmash-internals.md | 25 ++++++++++++++++++++----- 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index a26fd1a5b0..81aa950b09 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -67,7 +67,8 @@ hashed. Note that if hashes are produced some other way (with a different hash function) or from some source other than DNA, sourmash can still work -with them; only `sourmash sketch` does input checking. +with them; only `sourmash sketch` actually cares about DNA sequences, +everything else works with hashes. ### Compatibility checking @@ -89,7 +90,7 @@ compatible ksizes, molecule types, and sketch types. Per our discussion in [Irber et al., 2022](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2), FracMinHash sketches can always be compared by downsampling to the max of the two scaled values. (This is not always -true of sketches stored in indexed collections, e.g. SBTs; see [sourmash#1799](https://github.com/sourmash-bio/sourmash/issues/1799).) +indexed collections of sketches, e.g. SBTs; see [sourmash#1799](https://github.com/sourmash-bio/sourmash/issues/1799).) In practice, sourmash does all necessary downsampling dynamically, but returns the original sketches. This means that (for example) you can @@ -111,7 +112,7 @@ sig downsample`, however. ### Conversion between Scaled (FracMinHash) and Num (MinHash) signatures with `downsample` -As discussed in the previous sections, it is possible to adjust the `scaled` and `num` values to compare two FracMinHash signatures or two Num MinHash signatures. However, it is also possible to covert between the `scaled` and `num` signatures with the `sourmash sig downsample` command. For more details, review [command line](https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-signature-downsample-decrease-the-size-of-a-signature) document. +As discussed in the previous sections, it is possible to adjust the `scaled` and `num` values to compare two FracMinHash signatures or two Num MinHash signatures. However, it is also possible to covert between the `scaled` and `num` signatures with the `sourmash sig downsample` command. For more details, review the [command line docs for `sig downsample`](https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-signature-downsample-decrease-the-size-of-a-signature). ## K-mer sizes @@ -146,6 +147,11 @@ built with `hp` in the param string, e.g. `sourmash sketch protein -p hp`. ## Manifests +Manifests are catalogs of sketches: they include most of the information +about a sketch, except for the actual hashes. The idea of manifests is +that you can use them to identify *which* sketches you are interested +in before actually working with them (loading them, for example). + sourmash makes extensive use of signature manifests to support rapid selection and lazy loading of signatures based on signature metadata (name, ksize, moltype, etc.) See @@ -457,9 +463,9 @@ experiment with different threshold parameters for `gather` - first, do a very sensitive/low-threshold search with `prefetch` and save the results to a CSV file with `-o`, -Repeated gathers and searches. +Repeated gathers and searches. CTB -Using prefetch explicitly. +Using prefetch explicitly. CTB ### Using a higher scaled value @@ -483,6 +489,15 @@ multigather` command supports many queries against many databases. (We don't particularly suggest using `multigather`; we would prefer to make search databases faster. But it's there! :) +### Much faster search and gather with branchwater + +We also have a reasonably stable plugin, +[pyo3_branchwater](https://github.com/sourmash-bio/pyo3_branchwater), +that implements multithreaded operations using Rust. It is 100-1000 +times faster than sourmash, and 5-50 times lower memory. In exchange, +it's not quite as flexible as the full sourmash package. But if you're +running into speed or memory problems, you should give it a try! + ## Taxonomy and assigning lineages All sourmash taxonomy handling is done within the `lca` and `tax` From fb45b6c0c25762a5a5d8aede4adfdda955c147c5 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 30 Sep 2023 06:27:23 -0700 Subject: [PATCH 27/48] add information about save formats and memory usage --- doc/command-line.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/doc/command-line.md b/doc/command-line.md index d65408c2d6..c5a536682d 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -2059,6 +2059,13 @@ All of these save formats can be loaded by sourmash commands. **We strongly suggest using .zip files to store signatures: they are fast, small, and fully supported by all the sourmash commands.** +Note that when outputting large collections of signatures, some save +formats require holding all the sketches in memory until they can be +written out, and others can save progressively. This can affect memory +usage! Currently `.sig` and `.sig.gz` formats are held in memory, +while `.zip`, directory outputs, and `.sqldb` formats write progressively +to disk. + For more detailed information on database formats and performance tradeoffs, please see [the advanced usage information for databases!](databases-advanced.md) From 69dfbc08aa2a79f276f99755bf51bd6d407bbec4 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 30 Sep 2023 06:54:58 -0700 Subject: [PATCH 28/48] add section on choosing a reference for read mapping --- doc/faq.md | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/doc/faq.md b/doc/faq.md index ce171e9963..d6c3a8fbae 100644 --- a/doc/faq.md +++ b/doc/faq.md @@ -213,3 +213,21 @@ focused on *systematic decomposition of metagenomes against reference databases*, so it tackles the question of analyzing an entire metagenome against all available references, not just a single matching genome. + +## Can I use sourmash to determine the best reference genome for mapping my reads? + +Yes! (And see the FAQ above, +[How do k-mer analyses compare with read mapping?](#how-do-k-mer-based-analyses-compare-with-read-mapping) + +If you're interested in picking a single best reference genome (from a +large database) for read mapping, you can do the following: + +* sketch all your reference genomes with `sourmash sketch dna`, and/or download one of our [prepared databases](databases.md). You can use default parameters for `sourmash sketch`. +* sketch your read collection with `sourmash sketch dna`. +* run `sourmash prefetch reads.sig -o matches.csv` +* sort `matches.csv` on the `f_match_query` column and pick the highest value - this is the k-mer detection - and pick the match name from the `match_name` column; +* use that reference genome as your mapping target. + +If you want to map a metagenome to _multiple_ references, consider +using `sourmash gather` and/or +[the genome-grist workflow](https://dib-lab.github.io/genome-grist/). From e828ff99a89bc09af9b8d3a415bb8858a5a47e08 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 30 Sep 2023 07:00:17 -0700 Subject: [PATCH 29/48] add section on retrieving reads to FAQ --- doc/faq.md | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/doc/faq.md b/doc/faq.md index d6c3a8fbae..0be10ca51e 100644 --- a/doc/faq.md +++ b/doc/faq.md @@ -217,7 +217,7 @@ matching genome. ## Can I use sourmash to determine the best reference genome for mapping my reads? Yes! (And see the FAQ above, -[How do k-mer analyses compare with read mapping?](#how-do-k-mer-based-analyses-compare-with-read-mapping) +[How do k-mer analyses compare with read mapping?](#how-do-k-mer-based-analyses-compare-with-read-mapping).) If you're interested in picking a single best reference genome (from a large database) for read mapping, you can do the following: @@ -231,3 +231,18 @@ large database) for read mapping, you can do the following: If you want to map a metagenome to _multiple_ references, consider using `sourmash gather` and/or [the genome-grist workflow](https://dib-lab.github.io/genome-grist/). + +(This is also known as "read recruitment.") + +## How do I get the sequences for a particular reference genome from a metagenome, using sourmash? + +If sourmash reports that a particular strain or genome is present in a +metagenome, how do you retrieve the reads using sourmash? + +The short answer is: you have to use a different tool. You can do +read mapping between the metagenome and the relevant reference genome +(which can be automated with +[the genome-grist workflow](https://dib-lab.github.io/genome-grist/); +or, if you are interested in retrieving accessory elements, you can +try out +[spacegraphcats](https://spacegraphcats.github.io/spacegraphcats/02-spacegraphcats-use-cases/). From 4d2a91b7464515842984faf5d2a52fe896be44fa Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 30 Sep 2023 07:01:58 -0700 Subject: [PATCH 30/48] collection load order --- doc/sourmash-internals.md | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index 81aa950b09..2799f21a4e 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -643,6 +643,29 @@ minimum size for ani calculation n+1 problem +### Gather on multiple collections, and order of search and reporting + +Since `sourmash gather` will pick only one "best match" if there +are several (and will ignore the others), the order of searching +can matter for multiple collections. How does this work? + +In brief, sourmash doesn't guarantee a particular load order for +sketches in a single collection, but it _does_ guarantee that +collections are loaded and searched in their entirety in the order +that you provide them. So, for example, if you have a large zipfile +database of sketches that contain some duplicates, you can't pick +which of the duplicates will be chosen as a match; but you _can_ +provide your own collection of prioritized matches as a separate +database. A practical application of this might be to provide the +GTDB "representatives" database first on the command line, with the +full GTDB database second, in order to prioritize choosing +representative genomes as matches over the rest. + +This also plays a role in the order of reporting for `prefetch` - +`prefetch` will report matching sketches in the order it encounters +them, which will match the order in which collections are given +to `sourmash prefetch` on the command line. + ## Formats natively understood by sourmash sourmash should always autodetect the format of a collection or From 4e97e8e3e7223fc34c6448d781e6accda33d1475 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 30 Sep 2023 07:04:23 -0700 Subject: [PATCH 31/48] update author/copyright --- doc/conf.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/conf.py b/doc/conf.py index 23331efe08..3035b6274c 100644 --- a/doc/conf.py +++ b/doc/conf.py @@ -59,8 +59,8 @@ # General information about the project. project = 'sourmash' -copyright = '2016-2020, C. Titus Brown and Luiz Irber' -author = 'C. Titus Brown and Luiz Irber' +copyright = '2016-2023, C. Titus Brown, Luiz Irber, and N. Tessa Pierce-Ward' +author = 'C. Titus Brown, Luiz Irber, and N. Tessa Pierce-Ward' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the From 571c86588559ed7f5c8c6ca913000ac213065d7e Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 30 Sep 2023 07:07:45 -0700 Subject: [PATCH 32/48] minor corrections --- doc/faq.md | 2 +- doc/sourmash-internals.md | 26 +++++++++++++------------- 2 files changed, 14 insertions(+), 14 deletions(-) diff --git a/doc/faq.md b/doc/faq.md index 0be10ca51e..2f1e0fecd6 100644 --- a/doc/faq.md +++ b/doc/faq.md @@ -217,7 +217,7 @@ matching genome. ## Can I use sourmash to determine the best reference genome for mapping my reads? Yes! (And see the FAQ above, -[How do k-mer analyses compare with read mapping?](#how-do-k-mer-based-analyses-compare-with-read-mapping).) +[How do k-mer analyses compare with read mapping?](#how-do-k-mer-based-analyses-compare-with-read-mapping)) If you're interested in picking a single best reference genome (from a large database) for read mapping, you can do the following: diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index 2799f21a4e..6aaeee90e5 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -647,24 +647,24 @@ n+1 problem Since `sourmash gather` will pick only one "best match" if there are several (and will ignore the others), the order of searching -can matter for multiple collections. How does this work? +can matter for large collections. How does this work? In brief, sourmash doesn't guarantee a particular load order for sketches in a single collection, but it _does_ guarantee that collections are loaded and searched in their entirety in the order that you provide them. So, for example, if you have a large zipfile -database of sketches that contain some duplicates, you can't pick -which of the duplicates will be chosen as a match; but you _can_ -provide your own collection of prioritized matches as a separate -database. A practical application of this might be to provide the -GTDB "representatives" database first on the command line, with the -full GTDB database second, in order to prioritize choosing -representative genomes as matches over the rest. - -This also plays a role in the order of reporting for `prefetch` - -`prefetch` will report matching sketches in the order it encounters -them, which will match the order in which collections are given -to `sourmash prefetch` on the command line. +database of sketches that contains duplicates, you can't predict which +of the duplicates will be chosen as a match; but you _can_ build your +own collection of prioritized matches as a separate database, and put +it first on the command line. A practical application of this might +be to list the GTDB "representatives" database first on the command +line, with the full GTDB database second, in order to prioritize +choosing representative genomes as matches over the rest. + +This also plays a role in the order of reporting for `prefetch` +output - `prefetch` will report matching sketches in the order it +encounters them, which will match the order in which collections are +given to `sourmash prefetch` on the command line. ## Formats natively understood by sourmash From c6251d340a301318b287f0d8bd7872d563a67d77 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 1 Oct 2023 07:48:39 -0700 Subject: [PATCH 33/48] close to done with a first pass --- doc/sourmash-internals.md | 89 ++++++++++++++++++++++++++++++++++++--- 1 file changed, 84 insertions(+), 5 deletions(-) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index 6aaeee90e5..a9d9cca1de 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -114,6 +114,29 @@ sig downsample`, however. As discussed in the previous sections, it is possible to adjust the `scaled` and `num` values to compare two FracMinHash signatures or two Num MinHash signatures. However, it is also possible to covert between the `scaled` and `num` signatures with the `sourmash sig downsample` command. For more details, review the [command line docs for `sig downsample`](https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-signature-downsample-decrease-the-size-of-a-signature). +### Operations you can do safely with FracMinHash sketches + +As described in the @gather paper, FracMinHash sketches support a wide +range of operations that mirror actions taken on the underlying data +set _without_ revisiting the underlying data. This allows users to +build sketches once (requiring the original data) and then do all +sorts of manipulations on the sketches, and know that the results of +the sketch manipulations represent what would happen if they did the same +thing on the original data. +For example, + +* set unions, intersections, and subtractions all perform the same + when done on the sketches as when applied to the underling data. + So, for example, you can sketch two files separately and merge + the sketches (with `sig merge`), and get the same result as if you'd + concatenated the files first and then sketched them. +* if you filter hashes on abundance with `sig filter`, you get the + same result as if you filtered the data set on k-mer abundance and + then sketched it. +* downsampling: you can sketch the original data set at a high resolution + (e.g. scaled=100) and then downsample it later (to e.g. scaled=1000), + and get the same result as if you'd sketched the data set at scaled=1000. + ## K-mer sizes There is no explicit restriction on k-mer sizes built into sourmash. @@ -560,14 +583,18 @@ want to mix and match within superkingdoms, though!) As of sourmash v4, lineage spreadsheets should contain columns for superkingdom, phylum, class, order, family, genus, and species. Some commands may also support a 'strain' column, although this is -inconsistently handled within sourmash internally. +inconsistently handled within sourmash internally at the moment. For spreadsheet organization, `lca index` expects the columns to be present in order from superkingdom on down, while the `tax` subcommands use CSV column headers instead. We are planning to consolidate around the `tax` subcommand handling in the future (see [sourmash#2198](https://github.com/sourmash-bio/sourmash/issues/2198)). -@@link to example spreadsheets, talk about how to extract. +An example spreadsheet is +[here, bacteria_refseq_lineage.csv](https://github.com/sourmash-bio/sourmash/blob/latest/tests/test-data/tax/bacteria_refseq_lineage.csv). (The +`taxid` column is not used by most sourmash functions and is mostly +ignored, but it is needed for the `kreport` and `bioboxes` report +formats.) ### `LCA_SqliteDatabase` - a special case @@ -629,6 +656,7 @@ the results will be the same. Note that lineage CSV spreadsheets, as consumed by `sourmash tax` commands and as output by `sourmash tax grep`, can be used as `ident` picklists. + + +## Online and streaming; and adding to collections of sketches. + +One of the big challenges with Big Data is looking at it all at once - +loading all your data into memory, for example, will fail with really large +data sets. The ability to look at subsets of data without looking at _all_ +of it is called "streaming" (much like when you watch a streaming +movie online - you can start watching the movie without downloading the +whole video, and you can also usually jump to a particular +location in the video without downloading the intervening bits.) + +Another related challenge is analyzing data against a database that is +constantly growing, either because you're adding to it or because it's +being updated by others. For example, in genomics, often you want to +repeat the same analysis you did last time but with more reference +genomes. With many software packages, this requires rebuilding your +indexed database, which can be challenging for large genomes. In +computer science parlance, the ability to add new data at the end +_without_ performing an expensive reindexing operation is referred +to as "online". + +sourmash tackles these challenges in a few different ways, and does its +best to support streaming and online behavior. + +First, all sourmash commands can take multiple databases and will +return the same results with multiple databases as they would with a +single database containing the same sketches, unless otherwise +noted. This allows you to incrementally expand your sketch collections +over time without building new databases. _Performance_ may vary +(i.e. if you're using an SBT to do search, and you add an unindexed +collection of sketches to the search, the search may take longer than +if you'd add the new sketches to the SBT) but the _results_ will be +the same. In this sense, many of the sourmash algorithms are online. + +Second, several sourmash algorithms use _streaming_ when searching +databases - in particular, `prefetch` will load and unload sketches as +it goes, as long as the underlying collection data structure supports +it (`.sig.gz` and LCA JSON databases do _not_, but zip files, SBTs, +and SQLite databases _do_). This lets you do containment searches +against really large collections without consuming large amounts of +memory. Another example is the `manysearch` command in the +[pyo3_branchwater](https://github.com/sourmash-bio/pyo3_branchwater) +plugin, which loads and searches a limited number of metagenomes from +a large collection, rather than loading the entire collection into +memory - which would be impossible. + +Last but not least, one of the interesting guarantees that FracMinHash +sketches provide is that no hash is ever _removed_ when sketching. +This supports various types of input streaming, which we haven't spent +too much time exploring, but (for example) means that "watching" +sequencing runs and/or downloads of sequencing data, and reporting +interim results with certainty, is possible. If you're interested +in making use of this, please reach out! ### Gather on multiple collections, and order of search and reporting From 1d1e4605c35fade2f2259ceffe5a71850606dde2 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 1 Oct 2023 12:23:51 -0700 Subject: [PATCH 34/48] finish off internals --- doc/sourmash-internals.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index a9d9cca1de..fd26293a67 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -116,14 +116,15 @@ As discussed in the previous sections, it is possible to adjust the `scaled` and ### Operations you can do safely with FracMinHash sketches -As described in the @gather paper, FracMinHash sketches support a wide -range of operations that mirror actions taken on the underlying data -set _without_ revisiting the underlying data. This allows users to -build sketches once (requiring the original data) and then do all -sorts of manipulations on the sketches, and know that the results of -the sketch manipulations represent what would happen if they did the same -thing on the original data. -For example, +As described in +[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2), +FracMinHash sketches support a wide range of operations that mirror +actions taken on the underlying data set _without_ revisiting the +underlying data. This allows users to build sketches once (requiring +the original data) and then do all sorts of manipulations on the +sketches, and know that the results of the sketch manipulations +represent what would happen if they did the same thing on the original +data. For example, * set unions, intersections, and subtractions all perform the same when done on the sketches as when applied to the underling data. @@ -540,9 +541,8 @@ name to do so - this is the name as set by the `--name` parameter to sourmash prefers identifiers to be the first space-separated token in the signature name. This token can contain any alphanumeric letters -other than space (@@ctb check me), and should contain at most one -period. The version of the identifier will be the component after -the period. +other than space, and should contain at most one period. The version +of the identifier will be the component after the period. So, for example, for a signature name of From 4ed0367726002a51d8c9695ecbb5b748eb21d278 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sun, 1 Oct 2023 12:30:43 -0700 Subject: [PATCH 35/48] finish things off? --- doc/command-line.md | 2 +- doc/faq.md | 5 ++--- doc/new.md | 8 ++++---- 3 files changed, 7 insertions(+), 8 deletions(-) diff --git a/doc/command-line.md b/doc/command-line.md index c5a536682d..bfd44b0d26 100644 --- a/doc/command-line.md +++ b/doc/command-line.md @@ -387,7 +387,7 @@ assembled into contigs, the unweighted number would approximate the number of bases from the contigs that would match perfectly to at least one genome in the reference database. More practically, the abundance-weighted number is less sensitive to sequencing errors. -See @CTB classifying signatures or FAQ for more information here! +See [classifying signatures](classifying-signatures.md#abundance-weighting) or [the FAQ](faq.md) for more information! The command line option `--threshold-bp` sets the threshold below which matches are no longer reported; by default, this is set to diff --git a/doc/faq.md b/doc/faq.md index 2f1e0fecd6..e54eff9a0c 100644 --- a/doc/faq.md +++ b/doc/faq.md @@ -51,12 +51,11 @@ The other drawback is that FracMinHash sketches _don't work well_ for very small sequences. Our default parameter choice for DNA (scaled=1000) works well for finding 10 kb or larger matches between sequences - some simple Poisson matching math suggests that about -99.98% of 10kb overlaps will be found with scaled=1000. @CTB verify I -think this is 5kb. +99.98% of 5kb overlaps will be found with scaled=1000. ## How can I better understand FracMinHash and sourmash intuitively? -@@ tutorial on k-mers +Please see [the k-mers and minhash tutorial](kmers-and-minhash.ipynb). ## What papers should I read to better understand the FracMinHash approach used by sourmash? diff --git a/doc/new.md b/doc/new.md index 29cd529779..e73d0d51a9 100644 --- a/doc/new.md +++ b/doc/new.md @@ -15,14 +15,14 @@ You might try sourmash if you want to - * taxonomically classify genomes or metagenomes against NCBI and/or GTDB; * search thousands of metagenomes with a query genome or sequence -Underneath, sourmash uses [FracMinHash sketches](@@) for fast and +Underneath, sourmash uses [FracMinHash sketches](https://www.biorxiv.org/content/10.1101/2022.01.11.475838) for fast and lightweight sequence comparison; FracMinHash builds on -[MinHash sketching](@@wikipedia) to support both Jaccard similarity +[MinHash sketching](https://en.wikipedia.org/wiki/MinHash) to support both Jaccard similarity _and_ containment analyses with k-mers. This significantly expands the range of operations that can be done quickly and in low memory. sourmash also implements a number of new and powerful analysis -techniques, including [minimum metagenome covers](@@) and [alignment-free ANI -estimation](@@). +techniques, including minimum metagenome covers and alignment-free ANI +estimation. sourmash is inspired by [mash](https://mash.readthedocs.io), and supports most mash analyses. sourmash also implements an expanded set From a4c124428e672c4a587628a57d643ec1c0d09f11 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Mon, 2 Oct 2023 20:42:04 -0700 Subject: [PATCH 36/48] clean up missing refs --- doc/new.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/doc/new.md b/doc/new.md index e73d0d51a9..d0f6983cf0 100644 --- a/doc/new.md +++ b/doc/new.md @@ -46,7 +46,7 @@ X and Linux. They require about 5 GB of disk space and 5 GB of RAM. * [Analyzing the genomic and taxonomic composition of an environmental genome using GTDB and sample-specific MAGs with sourmash](tutorial-lemonade.md) -* [Some sourmash command line examples!](sourmash-examples.md) +* [Some sourmash command line examples!](sourmash-examples.ipynb) ### How-To Guides @@ -54,17 +54,17 @@ X and Linux. They require about 5 GB of disk space and 5 GB of RAM. * [Classifying genome sketches](classifying-signatures.md) -* [Working with private collections of genome sketches.](sourmash-collections.md) +* [Working with private collections of genome sketches.](sourmash-collections.ipynb) * [Using the `LCA_Database` API.](using-LCA-database-API.ipynb) -* [Building plots from `sourmash compare` output](plotting-compare.md). +* [Building plots from `sourmash compare` output](plotting-compare.ipynb). * [A short guide to using sourmash output with R](other-languages.md). ### How sourmash works under the hood -* [An introduction to k-mers for genome comparison and analysis](kmers-and-minhash.md) +* [An introduction to k-mers for genome comparison and analysis](kmers-and-minhash.ipynb) * [Support, versioning, and migration between versions](support.md) ### Reference material From 3967b15044e03799820715c78077d3a5b47d36dd Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Tue, 3 Oct 2023 06:07:18 -0700 Subject: [PATCH 37/48] add publications --- doc/new.md | 33 ++++++++++++++++++++++++++++ doc/publications.md | 53 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 86 insertions(+) create mode 100644 doc/publications.md diff --git a/doc/new.md b/doc/new.md index d0f6983cf0..14d3a2e0ec 100644 --- a/doc/new.md +++ b/doc/new.md @@ -1,5 +1,9 @@ # Welcome to sourmash! +```{contents} Contents +:depth: 3 +``` + sourmash is a command-line tool and Python/Rust library for **metagenome analysis** and **genome comparison** with k-mers. It supports the compositional analysis of metagenomes, rapid search of @@ -15,6 +19,12 @@ You might try sourmash if you want to - * taxonomically classify genomes or metagenomes against NCBI and/or GTDB; * search thousands of metagenomes with a query genome or sequence +Our **vision**: sourmash strives to support biologists in analyzing +modern sequencing data at high resolution and with full context, +including all public reference genomes and metagenomes. + +## How does sourmash work? + Underneath, sourmash uses [FracMinHash sketches](https://www.biorxiv.org/content/10.1101/2022.01.11.475838) for fast and lightweight sequence comparison; FracMinHash builds on [MinHash sketching](https://en.wikipedia.org/wiki/MinHash) to support both Jaccard similarity @@ -33,6 +43,28 @@ Foundation under the Data Driven Discovery program, and has been supported by further funding from the NIH and NSF. Please see [funding acknowledgements](funding.md) for details! +## Mission statement + +The project mission is to provide practical tools and approaches for +analyzing extremely large sequencing data sets, with an emphasis on +high resolution results. We design around the following principles: + +* genomic and metagenomic analyses should be able to make use of all + available reference genomes. +* metagenomic analyses should support assembly independent approaches, + to avoid biases stemming from low coverage or high strain + variability. +* private and public databases should be equally well supported. +* a variety of data structures and algorithms are necessary to support + a wide set of use cases, including efficient command-line analysis, + real-time queries, and massive-scale batch analyses. +* our tools should be well behaved members of the bioinformatics + analysis tool ecosystem, and use common installation approaches, + standard formats, and semantic versioning. +* our tools should be robustly tested, well documented, and supported. +* we discuss scientific and computational tradeoffs and make specific + recommendations where possible, admitting uncertainty as needed. + ## Using sourmash ### Tutorials and examples @@ -79,3 +111,4 @@ X and Linux. They require about 5 GB of disk space and 5 GB of RAM. ## Developing and extending sourmash * [Releasing a new version of sourmash](release.md) + diff --git a/doc/publications.md b/doc/publications.md new file mode 100644 index 0000000000..44e8e9c4f9 --- /dev/null +++ b/doc/publications.md @@ -0,0 +1,53 @@ +# Publications + +Below are publications from the sourmash team. + +## sourmash fundamentals + +[Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2), Irber et al., 2022. This is the core technical paper describing both FracMinHash and `sourmash gather`. + +[Large-scale sequence comparisons with sourmash](https://f1000research.com/articles/8-1006), +Pierce et al., 2019. This is the original sourmash use case paper. + +## Evaluation and benchmarking + +[Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05103-0), +Portik et al., 2022. This paper shows that sourmash is extremely +sensitive and very specific for taxonomic classification. + +## Petabase-scale search + +[Biogeographic Distribution of Five Antarctic Cyanobacteria Using Large-Scale k-mer Searching with sourmash branchwater](https://www.biorxiv.org/content/10.1101/2022.10.27.514113v1), +Lumian et al., 2022. This paper uses sourmash and branchwater to +search ~500,000 public metagenomes for 5 query genomes, validates the +results using mapping, and discusses the biogeography of the query +species. + +[Sourmash Branchwater Enables Lightweight Petabyte-Scale Sequence Search](https://www.biorxiv.org/content/10.1101/2022.11.02.514947v1), +Irber et al., 2022. This paper describes the technical underpinnings +of the first version of sourmash branchwater, for petabase-scale search. + +## Advanced uses of sourmash + +[Single-cell transcriptomics for the 99.9% of species without reference genomes](https://www.biorxiv.org/content/10.1101/2021.07.09.450799v1.abstract), +Botvinnik et al., 2021. This paper uses sourmash (and many other +techniques!) to analyze single cell data from the Chinese horseshoe +bat. + +[Meta-analysis of metagenomes via machine learning and assembly graphs reveals strain switches in Crohn’s disease](https://www.biorxiv.org/content/10.1101/2022.06.30.498290v1.abstract), +Reiter et al., 2022. This paper uses sourmash and +[spacegraphcats](https://spacegraphcats.github.io/spacegraphcats/) to +detect and analyze strain-specific signals in fecal microbiomes from +the iHMP. + +[Protein k-mers enable assembly-free microbial metapangenomics](https://www.biorxiv.org/content/10.1101/2022.06.27.497795v1), +Reiter et al., 2022. This paper develops a technique to use protein +k-mers to analyze metapangenome graphs from metagenomes. + +## Additional works + +Dr. Luiz Irber's PhD thesis, +[Decentralizing Indices for Genomic Data](https://github.com/luizirber/phd/releases), +describes several additional features of the sourmash ecosystem, +including wort, which monitors the SRA for new data sets and sketches +them automatically. From 535d7eb04ffa5f1ce0bb2cf2f28615fef6a88ae4 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Fri, 13 Oct 2023 14:37:38 -0400 Subject: [PATCH 38/48] update new -> index --- doc/index.md | 237 +++++++++++++++++---------------------------------- doc/new.md | 114 ------------------------- 2 files changed, 80 insertions(+), 271 deletions(-) delete mode 100644 doc/new.md diff --git a/doc/index.md b/doc/index.md index 4e7a9927e3..14d3a2e0ec 100644 --- a/doc/index.md +++ b/doc/index.md @@ -1,191 +1,114 @@ # Welcome to sourmash! -sourmash is a command-line tool and Python library for computing -[hash sketches](https://en.wikipedia.org/wiki/MinHash) from DNA -sequences, comparing them to each other, and plotting the results. -This allows you to estimate sequence similarity between even very -large data sets quickly and accurately. - -sourmash can be used to quickly search large databases of genomes -for matches to query genomes and metagenomes; see [our list of -available databases](databases.md). - -sourmash also includes k-mer based taxonomic exploration and -classification routines for genome and metagenome analysis. These -routines can use the NCBI and GTDB taxonomies but do not depend on them -specifically. - -We have [several tutorials](tutorials.md) available! Start with -[Making signatures, comparing, and searching](tutorial-basic.md). - -The paper [Large-scale sequence comparisons with sourmash (Pierce et al., 2019)](https://f1000research.com/articles/8-1006) -gives an overview of how sourmash works and what its major use cases are. -Please also see the `mash` [software](http://mash.readthedocs.io/en/latest/) and -[paper (Ondov et al., 2016)](http://dx.doi.org/10.1186/s13059-016-0997-x) for -background information on how and why MinHash works. - -**Questions? Thoughts?** Ask us on the [sourmash issue tracker](https://github.com/sourmash-bio/sourmash/issues/)! - -**Want to migrate to sourmash v4?** sourmash v4 is now available, and -has a number of incompatibilites with v2 and v3. Please see -[our migration guide](support.md#migrating-from-sourmash-v3x-to-sourmash-v4x)! - ----- - -To use sourmash, you must be comfortable with the UNIX command line; -programmers may find the [Python library and API](api.md) useful as well. - -If you use sourmash, please cite us! - -> Brown and Irber (2016), -> **sourmash: a library for MinHash sketching of DNA**. -> Journal of Open Source Software, 1(5), 27, [doi:10.21105/joss.00027](https://joss.theoj.org/papers/3d793c6e7db683bee7c03377a4a7f3c9) - -## sourmash in brief - -sourmash uses MinHash-style sketching to create "signatures", compressed -representations of DNA/RNA sequence. These signatures can then -be stored, searched, explored, and taxonomically annotated. - -* `sourmash` provides command line utilities for creating, comparing, - and searching signatures, as well as plotting and clustering - signatures by similarity (see [the command-line docs](command-line.md)). - -* `sourmash` can **search very large collections of signatures** to find matches - to a query. - -* `sourmash` can also **identify parts of metagenomes that match known genomes**, - and can **taxonomically classify genomes and metagenomes** against databases - of known species. +```{contents} Contents +:depth: 3 +``` -* `sourmash` can be used to **search databases of public sequences** - (e.g. all of GenBank) and can also be used to create and search databases - of **private sequencing data**. +sourmash is a command-line tool and Python/Rust library for +**metagenome analysis** and **genome comparison** with k-mers. It +supports the compositional analysis of metagenomes, rapid search of +large sequence databases, and flexible taxonomic analysis with both +NCBI and GTDB taxonomies. sourmash works well with sequences 30kb or +larger, including bacterial and viral genomes. -* `sourmash` supports saving, loading, and communication of signatures - via [JSON](http://www.json.org/), a ~human-readable and editable format. +You might try sourmash if you want to - -* `sourmash` also has a simple Python API for interacting with signatures, - including support for online updating and querying of signatures - (see [the API docs](api.md)). +* identify which reference genomes to map your metagenomic reads to +* search all Genbank microbial genomes with a sequence query +* cluster many genomes by similarity +* taxonomically classify genomes or metagenomes against NCBI and/or GTDB; +* search thousands of metagenomes with a query genome or sequence -* `sourmash` relies on an underlying Rust core for performance. +Our **vision**: sourmash strives to support biologists in analyzing +modern sequencing data at high resolution and with full context, +including all public reference genomes and metagenomes. -* `sourmash` is developed [on GitHub](https://github.com/sourmash-bio/sourmash) - and is **freely and openly available** under the BSD 3-clause license. - Please see [the README](https://github.com/sourmash-bio/sourmash/blob/latest/README.md) - for more information on development, support, and contributing. +## How does sourmash work? -You can take a look at sourmash analyses on real data -[in a saved Jupyter notebook](https://github.com/sourmash-bio/sourmash/blob/latest/doc/sourmash-examples.ipynb), -and experiment with it yourself -[interactively in a Jupyter Notebook](https://mybinder.org/v2/gh/sourmash-bio/sourmash/latest?labpath=doc%2Fsourmash-examples.ipynb) -at [mybinder.org](http://mybinder.org). +Underneath, sourmash uses [FracMinHash sketches](https://www.biorxiv.org/content/10.1101/2022.01.11.475838) for fast and +lightweight sequence comparison; FracMinHash builds on +[MinHash sketching](https://en.wikipedia.org/wiki/MinHash) to support both Jaccard similarity +_and_ containment analyses with k-mers. This significantly expands +the range of operations that can be done quickly and in low +memory. sourmash also implements a number of new and powerful analysis +techniques, including minimum metagenome covers and alignment-free ANI +estimation. -## Installing sourmash +sourmash is inspired by [mash](https://mash.readthedocs.io), and +supports most mash analyses. sourmash also implements an expanded set +of functionality for metagenome and taxonomic analysis. -You can use pip: -```bash -$ pip install sourmash -``` +sourmash development was initiated with a grant from the Moore +Foundation under the Data Driven Discovery program, and has been +supported by further funding from the NIH and NSF. Please see +[funding acknowledgements](funding.md) for details! -or conda: -```bash -$ conda install -c conda-forge -c bioconda sourmash -``` +## Mission statement -Please see [the README file in github.com/sourmash-bio/sourmash](https://github.com/sourmash-bio/sourmash/blob/latest/README.md) -for more information. +The project mission is to provide practical tools and approaches for +analyzing extremely large sequencing data sets, with an emphasis on +high resolution results. We design around the following principles: -## Memory and speed +* genomic and metagenomic analyses should be able to make use of all + available reference genomes. +* metagenomic analyses should support assembly independent approaches, + to avoid biases stemming from low coverage or high strain + variability. +* private and public databases should be equally well supported. +* a variety of data structures and algorithms are necessary to support + a wide set of use cases, including efficient command-line analysis, + real-time queries, and massive-scale batch analyses. +* our tools should be well behaved members of the bioinformatics + analysis tool ecosystem, and use common installation approaches, + standard formats, and semantic versioning. +* our tools should be robustly tested, well documented, and supported. +* we discuss scientific and computational tradeoffs and make specific + recommendations where possible, admitting uncertainty as needed. -sourmash has relatively small disk and memory requirements compared to -many other software programs used for genome search and taxonomic -classification. +## Using sourmash -`sourmash search` and `sourmash gather` can be used to search 100k -genbank microbial genomes ([using our prepared databases](databases.md)) -with about 20 GB of disk and in under 1 GB of RAM. -Typically a search for a single genome takes about 30 seconds on a laptop. +### Tutorials and examples -`sourmash lca` can be used to search/classify against all genbank -microbial genomes with about 200 MB of disk space and about 10 GB of -RAM. Typically a metagenome classification takes about 1 minute on a -laptop. +These tutorials are command line tutorials that should work on Mac OS +X and Linux. They require about 5 GB of disk space and 5 GB of RAM. -## sourmash versioning +* [The first sourmash tutorial - making signatures, comparing, and searching](tutorial-basic.md) -We support the use of sourmash in pipelines and applications -by communicating clearly about bug fixes, feature additions, and feature -changes. We use version numbers as follows: +* [Using sourmash LCA to do taxonomic classification](tutorials-lca.md) -* Major releases, like v4.0.0, may break backwards compatibility at - the command line as well as top-level Python/Rust APIs. -* Minor releases, like v4.1.0, will remain backwards compatible but - may introduce significant new features. -* Patch releases, like v4.1.1, are for minor bug fixes; full backwards - compatibility is retained. +* [Analyzing the genomic and taxonomic composition of an environmental genome using GTDB and sample-specific MAGs with sourmash](tutorial-lemonade.md) -If you are relying on sourmash in a pipeline or application, we -suggest specifying your version requirements at the major release, -e.g. in conda you would specify `sourmash>=3,<4`. +* [Some sourmash command line examples!](sourmash-examples.ipynb) -See [the Versioning docs](support.md) for more information on what our -versioning policy means in detail, and how to migrate between major -versions! +### How-To Guides -## Limitations +* Installing sourmash -**sourmash cannot find matches across large evolutionary distances.** +* [Classifying genome sketches](classifying-signatures.md) -sourmash seems to work well to search and compare data sets for -nucleotide matches at the species and genus level, but does not have much -sensitivity beyond that. (It seems to be particularly good at -strain-level analysis.) You should use protein-based analyses -to do searches across larger evolutionary distances. +* [Working with private collections of genome sketches.](sourmash-collections.ipynb) -**sourmash signatures can be very large.** +* [Using the `LCA_Database` API.](using-LCA-database-API.ipynb) -We use a modification of the MinHash sketch approach that allows us -to search the contents of metagenomes and large genomes with no loss -of sensitivity, but there is a tradeoff: there is no guaranteed limit -to signature size when using 'scaled' signatures. +* [Building plots from `sourmash compare` output](plotting-compare.ipynb). -## Logo +* [A short guide to using sourmash output with R](other-languages.md). -The sourmash logo was designed by Stéfanie Fares Sabbag, -with feedback from Clara Barcelos, -Taylor Reiter and Luiz Irber. +### How sourmash works under the hood -
+* [An introduction to k-mers for genome comparison and analysis](kmers-and-minhash.ipynb) +* [Support, versioning, and migration between versions](support.md) -The logo -is licensed under a Creative Commons -Attribution-ShareAlike 4.0 International License. +### Reference material -## Contents: +* [UNIX command-line documentation](command-line.md) +* [Genbank and GTDB databases and taxonomy files](databases.md) +* [Python examples using the API](api-example.md) +* [Publications about sourmash](publications.md) +* [A guide to the internals of sourmash](sourmash-internals.md) +* [Funding acknowledgements](funding.md) -```{toctree} ---- -maxdepth: 2 ---- - -command-line -tutorials -using-sourmash-a-guide -classifying-signatures -databases -api -more-info -support -developer -``` +## Developing and extending sourmash -# Indices and tables +* [Releasing a new version of sourmash](release.md) -* {ref}`genindex` -* {ref}`modindex` -* {ref}`search` diff --git a/doc/new.md b/doc/new.md deleted file mode 100644 index 14d3a2e0ec..0000000000 --- a/doc/new.md +++ /dev/null @@ -1,114 +0,0 @@ -# Welcome to sourmash! - -```{contents} Contents -:depth: 3 -``` - -sourmash is a command-line tool and Python/Rust library for -**metagenome analysis** and **genome comparison** with k-mers. It -supports the compositional analysis of metagenomes, rapid search of -large sequence databases, and flexible taxonomic analysis with both -NCBI and GTDB taxonomies. sourmash works well with sequences 30kb or -larger, including bacterial and viral genomes. - -You might try sourmash if you want to - - -* identify which reference genomes to map your metagenomic reads to -* search all Genbank microbial genomes with a sequence query -* cluster many genomes by similarity -* taxonomically classify genomes or metagenomes against NCBI and/or GTDB; -* search thousands of metagenomes with a query genome or sequence - -Our **vision**: sourmash strives to support biologists in analyzing -modern sequencing data at high resolution and with full context, -including all public reference genomes and metagenomes. - -## How does sourmash work? - -Underneath, sourmash uses [FracMinHash sketches](https://www.biorxiv.org/content/10.1101/2022.01.11.475838) for fast and -lightweight sequence comparison; FracMinHash builds on -[MinHash sketching](https://en.wikipedia.org/wiki/MinHash) to support both Jaccard similarity -_and_ containment analyses with k-mers. This significantly expands -the range of operations that can be done quickly and in low -memory. sourmash also implements a number of new and powerful analysis -techniques, including minimum metagenome covers and alignment-free ANI -estimation. - -sourmash is inspired by [mash](https://mash.readthedocs.io), and -supports most mash analyses. sourmash also implements an expanded set -of functionality for metagenome and taxonomic analysis. - -sourmash development was initiated with a grant from the Moore -Foundation under the Data Driven Discovery program, and has been -supported by further funding from the NIH and NSF. Please see -[funding acknowledgements](funding.md) for details! - -## Mission statement - -The project mission is to provide practical tools and approaches for -analyzing extremely large sequencing data sets, with an emphasis on -high resolution results. We design around the following principles: - -* genomic and metagenomic analyses should be able to make use of all - available reference genomes. -* metagenomic analyses should support assembly independent approaches, - to avoid biases stemming from low coverage or high strain - variability. -* private and public databases should be equally well supported. -* a variety of data structures and algorithms are necessary to support - a wide set of use cases, including efficient command-line analysis, - real-time queries, and massive-scale batch analyses. -* our tools should be well behaved members of the bioinformatics - analysis tool ecosystem, and use common installation approaches, - standard formats, and semantic versioning. -* our tools should be robustly tested, well documented, and supported. -* we discuss scientific and computational tradeoffs and make specific - recommendations where possible, admitting uncertainty as needed. - -## Using sourmash - -### Tutorials and examples - -These tutorials are command line tutorials that should work on Mac OS -X and Linux. They require about 5 GB of disk space and 5 GB of RAM. - -* [The first sourmash tutorial - making signatures, comparing, and searching](tutorial-basic.md) - -* [Using sourmash LCA to do taxonomic classification](tutorials-lca.md) - -* [Analyzing the genomic and taxonomic composition of an environmental genome using GTDB and sample-specific MAGs with sourmash](tutorial-lemonade.md) - -* [Some sourmash command line examples!](sourmash-examples.ipynb) - -### How-To Guides - -* Installing sourmash - -* [Classifying genome sketches](classifying-signatures.md) - -* [Working with private collections of genome sketches.](sourmash-collections.ipynb) - -* [Using the `LCA_Database` API.](using-LCA-database-API.ipynb) - -* [Building plots from `sourmash compare` output](plotting-compare.ipynb). - -* [A short guide to using sourmash output with R](other-languages.md). - -### How sourmash works under the hood - -* [An introduction to k-mers for genome comparison and analysis](kmers-and-minhash.ipynb) -* [Support, versioning, and migration between versions](support.md) - -### Reference material - -* [UNIX command-line documentation](command-line.md) -* [Genbank and GTDB databases and taxonomy files](databases.md) -* [Python examples using the API](api-example.md) -* [Publications about sourmash](publications.md) -* [A guide to the internals of sourmash](sourmash-internals.md) -* [Funding acknowledgements](funding.md) - -## Developing and extending sourmash - -* [Releasing a new version of sourmash](release.md) - From 6026ecbe1188a5a00845d602ac9073208b85c0e1 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Fri, 13 Oct 2023 13:14:24 -0700 Subject: [PATCH 39/48] Apply suggestions from code review thanks! Co-authored-by: ccbaumler <63077899+ccbaumler@users.noreply.github.com> --- doc/faq.md | 4 ++-- doc/new.md | 20 ++++++++++---------- 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/doc/faq.md b/doc/faq.md index e54eff9a0c..88af9dc82c 100644 --- a/doc/faq.md +++ b/doc/faq.md @@ -94,8 +94,8 @@ compositional estimates for metagenomes. One easy approach is a The short answer is: for DNA, use k=31. Slightly longer answer: when we look at the k-mer distribution -across all of the bacterial genomes in GTDB, we find that 99% or -more of 31-mers are _genome_, _species_ or _genus_ specific. +across all of the bacterial genomes in GTDB, we find that 99% (or +more) of 31-mers are _genome_, _species_, or _genus_ specific. If you go lower (say, k=21), then you get a few percent of k-mers that match above the genus level - family or above. diff --git a/doc/new.md b/doc/new.md index 14d3a2e0ec..52eab4177b 100644 --- a/doc/new.md +++ b/doc/new.md @@ -5,10 +5,10 @@ ``` sourmash is a command-line tool and Python/Rust library for -**metagenome analysis** and **genome comparison** with k-mers. It +**metagenome analysis** and **genome comparison** using k-mers. It supports the compositional analysis of metagenomes, rapid search of -large sequence databases, and flexible taxonomic analysis with both -NCBI and GTDB taxonomies. sourmash works well with sequences 30kb or +large sequence databases, and flexible taxonomic profiling with both +NCBI and GTDB taxonomies ([See our prepared databases for more information](databases.md)). sourmash works well with sequences 30kb or larger, including bacterial and viral genomes. You might try sourmash if you want to - @@ -16,7 +16,7 @@ You might try sourmash if you want to - * identify which reference genomes to map your metagenomic reads to * search all Genbank microbial genomes with a sequence query * cluster many genomes by similarity -* taxonomically classify genomes or metagenomes against NCBI and/or GTDB; +* taxonomically classify genomes or metagenomes against NCBI and/or GTDB * search thousands of metagenomes with a query genome or sequence Our **vision**: sourmash strives to support biologists in analyzing @@ -30,8 +30,8 @@ lightweight sequence comparison; FracMinHash builds on [MinHash sketching](https://en.wikipedia.org/wiki/MinHash) to support both Jaccard similarity _and_ containment analyses with k-mers. This significantly expands the range of operations that can be done quickly and in low -memory. sourmash also implements a number of new and powerful analysis -techniques, including minimum metagenome covers and alignment-free ANI +memory. sourmash also implements a number of new and powerful techniques +for analysis, including minimum metagenome covers and alignment-free ANI estimation. sourmash is inspired by [mash](https://mash.readthedocs.io), and @@ -45,9 +45,9 @@ supported by further funding from the NIH and NSF. Please see ## Mission statement -The project mission is to provide practical tools and approaches for +This project's mission is to provide practical tools and approaches for analyzing extremely large sequencing data sets, with an emphasis on -high resolution results. We design around the following principles: +high resolution results. Our designs follow these guiding principles: * genomic and metagenomic analyses should be able to make use of all available reference genomes. @@ -86,9 +86,9 @@ X and Linux. They require about 5 GB of disk space and 5 GB of RAM. * [Classifying genome sketches](classifying-signatures.md) -* [Working with private collections of genome sketches.](sourmash-collections.ipynb) +* [Working with private collections of genome sketches](sourmash-collections.ipynb) -* [Using the `LCA_Database` API.](using-LCA-database-API.ipynb) +* [Using the `LCA_Database` API](using-LCA-database-API.ipynb) * [Building plots from `sourmash compare` output](plotting-compare.ipynb). From 82aa9ed823a2a5f1eae832178c38b4c0e50b7389 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 14 Oct 2023 06:51:42 -0400 Subject: [PATCH 40/48] update --- doc/api-example.md | 2 +- doc/conf.py | 6 +++- doc/index.md | 68 +++++++++++++++++++++++++++++-------------- doc/tutorial-basic.md | 2 +- 4 files changed, 53 insertions(+), 25 deletions(-) diff --git a/doc/api-example.md b/doc/api-example.md index 782ca86c07..0be78a4fc2 100644 --- a/doc/api-example.md +++ b/doc/api-example.md @@ -476,7 +476,7 @@ building the minhashes. If you're interested in building comparison matrices and dendrograms, please see the notebook -[Building plots from `sourmash compare` output](plotting-compare.md). +[Building plots from `sourmash compare` output](plotting-compare.ipynb). ## Saving and loading signature files diff --git a/doc/conf.py b/doc/conf.py index 3035b6274c..fdd819b93a 100644 --- a/doc/conf.py +++ b/doc/conf.py @@ -114,6 +114,9 @@ # If true, `todo` and `todoList` produce output, else they produce nothing. todo_include_todos = False +# CTB: suppress warnings about circularity in ToC. +# see https://github.com/sphinx-doc/sphinx/issues/7410. +suppress_warnings = ['toc.circular'] # -- Options for HTML output ---------------------------------------------- @@ -127,7 +130,8 @@ html_theme_options = { 'logo': 'logo.png', 'logo_name': True, - 'description': 'Quickly search, compare, and analyze genomic and metagenomic data sets' + 'description': 'Quickly search, compare, and analyze genomic and metagenomic data sets', + 'sidebar_collapse': False, } # Add any paths that contain custom themes here, relative to this directory. diff --git a/doc/index.md b/doc/index.md index 52eab4177b..b0a2fd34d6 100644 --- a/doc/index.md +++ b/doc/index.md @@ -8,8 +8,10 @@ sourmash is a command-line tool and Python/Rust library for **metagenome analysis** and **genome comparison** using k-mers. It supports the compositional analysis of metagenomes, rapid search of large sequence databases, and flexible taxonomic profiling with both -NCBI and GTDB taxonomies ([See our prepared databases for more information](databases.md)). sourmash works well with sequences 30kb or -larger, including bacterial and viral genomes. +NCBI and GTDB taxonomies +([see our prepared databases for more information](databases.md)). sourmash +works well with sequences 30kb or larger, including bacterial and +viral genomes. You might try sourmash if you want to - @@ -23,26 +25,6 @@ Our **vision**: sourmash strives to support biologists in analyzing modern sequencing data at high resolution and with full context, including all public reference genomes and metagenomes. -## How does sourmash work? - -Underneath, sourmash uses [FracMinHash sketches](https://www.biorxiv.org/content/10.1101/2022.01.11.475838) for fast and -lightweight sequence comparison; FracMinHash builds on -[MinHash sketching](https://en.wikipedia.org/wiki/MinHash) to support both Jaccard similarity -_and_ containment analyses with k-mers. This significantly expands -the range of operations that can be done quickly and in low -memory. sourmash also implements a number of new and powerful techniques -for analysis, including minimum metagenome covers and alignment-free ANI -estimation. - -sourmash is inspired by [mash](https://mash.readthedocs.io), and -supports most mash analyses. sourmash also implements an expanded set -of functionality for metagenome and taxonomic analysis. - -sourmash development was initiated with a grant from the Moore -Foundation under the Data Driven Discovery program, and has been -supported by further funding from the NIH and NSF. Please see -[funding acknowledgements](funding.md) for details! - ## Mission statement This project's mission is to provide practical tools and approaches for @@ -65,6 +47,26 @@ high resolution results. Our designs follow these guiding principles: * we discuss scientific and computational tradeoffs and make specific recommendations where possible, admitting uncertainty as needed. +## How does sourmash work? + +Underneath, sourmash uses [FracMinHash sketches](https://www.biorxiv.org/content/10.1101/2022.01.11.475838) for fast and +lightweight sequence comparison; FracMinHash builds on +[MinHash sketching](https://en.wikipedia.org/wiki/MinHash) to support both Jaccard similarity +_and_ containment analyses with k-mers. This significantly expands +the range of operations that can be done quickly and in low +memory. sourmash also implements a number of new and powerful techniques +for analysis, including minimum metagenome covers and alignment-free ANI +estimation. + +sourmash is inspired by [mash](https://mash.readthedocs.io), and +supports most mash analyses. sourmash also implements an expanded set +of functionality for metagenome and taxonomic analysis. + +sourmash development was initiated with a grant from the Moore +Foundation under the Data Driven Discovery program, and has been +supported by further funding from the NIH and NSF. Please see +[funding acknowledgements](funding.md) for details! + ## Using sourmash ### Tutorials and examples @@ -101,6 +103,7 @@ X and Linux. They require about 5 GB of disk space and 5 GB of RAM. ### Reference material +* [Full table of contents for all docs](toc.md) * [UNIX command-line documentation](command-line.md) * [Genbank and GTDB databases and taxonomy files](databases.md) * [Python examples using the API](api-example.md) @@ -112,3 +115,24 @@ X and Linux. They require about 5 GB of disk space and 5 GB of RAM. * [Releasing a new version of sourmash](release.md) + + +```{toctree} +--- +maxdepth: 3 +hidden: true +--- + +index.md +``` + +# Indices and tables + +* {ref}`genindex` +* {ref}`modindex` +* {ref}`search` diff --git a/doc/tutorial-basic.md b/doc/tutorial-basic.md index ebc8cb9ea3..1ea7cc1416 100644 --- a/doc/tutorial-basic.md +++ b/doc/tutorial-basic.md @@ -308,4 +308,4 @@ and `gather`; see `sourmash index`, above, [the LCA tutorial][4], or [2]:https://pubmed.ncbi.nlm.nih.gov/23387867/ [3]:index.md [4]:tutorials-lca.md -[5]:sourmash-collections.md +[5]:sourmash-collections.ipynb From 14e9a51fd857438425baf93f7094ce08be36204a Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 14 Oct 2023 06:52:15 -0400 Subject: [PATCH 41/48] add global toc --- doc/toc.md | 41 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) create mode 100644 doc/toc.md diff --git a/doc/toc.md b/doc/toc.md new file mode 100644 index 0000000000..5e41325360 --- /dev/null +++ b/doc/toc.md @@ -0,0 +1,41 @@ +# Full table of contents + +```{toctree} +--- +maxdepth: 2 +--- + +api-example.md +api.md +classifying-signatures.md +command-line.md +databases-advanced.md +databases.md +dev_plugins.md +developer.md +faq.md +funding.md +index.md +kmers-and-minhash.ipynb +legacy-databases.md +more-info.md +other-languages.md +plotting-compare.ipynb +publications.md +release.md +requirements.md +sourmash-collections.ipynb +sourmash-examples.ipynb +sourmash-internals.md +sourmash-sketch.md +storage.md +support.md +tutorial-basic.md +tutorial-lemonade.md +tutorial-lin-taxonomy.md +tutorial-long.md +tutorials-lca.md +tutorials.md +using-LCA-database-API.ipynb +using-sourmash-a-guide.md +``` From c30aea6bced79f6167cc026b27d1785b16cf4823 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 14 Oct 2023 08:17:54 -0400 Subject: [PATCH 42/48] remove ToC at top --- doc/index.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/doc/index.md b/doc/index.md index b0a2fd34d6..648dafab10 100644 --- a/doc/index.md +++ b/doc/index.md @@ -1,9 +1,5 @@ # Welcome to sourmash! -```{contents} Contents -:depth: 3 -``` - sourmash is a command-line tool and Python/Rust library for **metagenome analysis** and **genome comparison** using k-mers. It supports the compositional analysis of metagenomes, rapid search of From a08a36f0b7c32b30f5240d40d97a21e77594ea4c Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 14 Oct 2023 09:44:31 -0400 Subject: [PATCH 43/48] try unhiding --- doc/index.md | 1 - 1 file changed, 1 deletion(-) diff --git a/doc/index.md b/doc/index.md index 648dafab10..0c512225c0 100644 --- a/doc/index.md +++ b/doc/index.md @@ -121,7 +121,6 @@ doesn't show up redundantly at the bottom of the index page. ```{toctree} --- maxdepth: 3 -hidden: true --- index.md From d128c6521977106b3b4135fe06beea683e777ac2 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 14 Oct 2023 11:22:59 -0400 Subject: [PATCH 44/48] add sidebar --- doc/index.md | 4 +++- doc/sidebar.md | 46 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 49 insertions(+), 1 deletion(-) create mode 100644 doc/sidebar.md diff --git a/doc/index.md b/doc/index.md index 0c512225c0..9301aa329d 100644 --- a/doc/index.md +++ b/doc/index.md @@ -65,6 +65,8 @@ supported by further funding from the NIH and NSF. Please see ## Using sourmash + + ### Tutorials and examples These tutorials are command line tutorials that should work on Mac OS @@ -123,7 +125,7 @@ doesn't show up redundantly at the bottom of the index page. maxdepth: 3 --- -index.md +sidebar ``` # Indices and tables diff --git a/doc/sidebar.md b/doc/sidebar.md new file mode 100644 index 0000000000..721d82757d --- /dev/null +++ b/doc/sidebar.md @@ -0,0 +1,46 @@ +## Tutorials and examples + +These tutorials are command line tutorials that should work on Mac OS +X and Linux. They require about 5 GB of disk space and 5 GB of RAM. + +* [The first sourmash tutorial - making signatures, comparing, and searching](tutorial-basic.md) + +* [Using sourmash LCA to do taxonomic classification](tutorials-lca.md) + +* [Analyzing the genomic and taxonomic composition of an environmental genome using GTDB and sample-specific MAGs with sourmash](tutorial-lemonade.md) + +* [Some sourmash command line examples!](sourmash-examples.ipynb) + +## How-To Guides + +* Installing sourmash + +* [Classifying genome sketches](classifying-signatures.md) + +* [Working with private collections of genome sketches](sourmash-collections.ipynb) + +* [Using the `LCA_Database` API](using-LCA-database-API.ipynb) + +* [Building plots from `sourmash compare` output](plotting-compare.ipynb). + +* [A short guide to using sourmash output with R](other-languages.md). + +## How sourmash works under the hood + +* [An introduction to k-mers for genome comparison and analysis](kmers-and-minhash.ipynb) +* [Support, versioning, and migration between versions](support.md) + +## Reference material + +* [Full table of contents for all docs](toc.md) +* [UNIX command-line documentation](command-line.md) +* [Genbank and GTDB databases and taxonomy files](databases.md) +* [Python examples using the API](api-example.md) +* [Publications about sourmash](publications.md) +* [A guide to the internals of sourmash](sourmash-internals.md) +* [Funding acknowledgements](funding.md) + +## Developing and extending sourmash + +* [Releasing a new version of sourmash](release.md) + From 13fde86ecd4b13dc6bb959b58913fb03bf60047a Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 14 Oct 2023 11:28:08 -0400 Subject: [PATCH 45/48] add TOC to sidebar --- doc/sidebar.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/sidebar.md b/doc/sidebar.md index 721d82757d..9303bcf50f 100644 --- a/doc/sidebar.md +++ b/doc/sidebar.md @@ -32,7 +32,6 @@ X and Linux. They require about 5 GB of disk space and 5 GB of RAM. ## Reference material -* [Full table of contents for all docs](toc.md) * [UNIX command-line documentation](command-line.md) * [Genbank and GTDB databases and taxonomy files](databases.md) * [Python examples using the API](api-example.md) @@ -44,3 +43,4 @@ X and Linux. They require about 5 GB of disk space and 5 GB of RAM. * [Releasing a new version of sourmash](release.md) +## [Full table of contents for all docs](toc.md) From 8144c3cb4db09f8ddf34f9ca0d4ae523e9dce2c8 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 14 Oct 2023 11:50:55 -0400 Subject: [PATCH 46/48] upd sidebar --- doc/index.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/doc/index.md b/doc/index.md index 9301aa329d..3d23ab85a2 100644 --- a/doc/index.md +++ b/doc/index.md @@ -122,10 +122,13 @@ doesn't show up redundantly at the bottom of the index page. ```{toctree} --- -maxdepth: 3 +maxdepth: 1 --- sidebar +command-line +api-example +databases ``` # Indices and tables From 6a5bb1a5ab47bbea6cc1edfcf0760f15d0f98b5e Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 14 Oct 2023 11:58:17 -0400 Subject: [PATCH 47/48] rename internals page --- doc/index.md | 2 +- doc/sidebar.md | 2 +- doc/sourmash-internals.md | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/index.md b/doc/index.md index 3d23ab85a2..d1502ba1b6 100644 --- a/doc/index.md +++ b/doc/index.md @@ -106,7 +106,7 @@ X and Linux. They require about 5 GB of disk space and 5 GB of RAM. * [Genbank and GTDB databases and taxonomy files](databases.md) * [Python examples using the API](api-example.md) * [Publications about sourmash](publications.md) -* [A guide to the internals of sourmash](sourmash-internals.md) +* [A guide to the internal design and structure of sourmash](sourmash-internals.md) * [Funding acknowledgements](funding.md) ## Developing and extending sourmash diff --git a/doc/sidebar.md b/doc/sidebar.md index 9303bcf50f..b0d5e6e22d 100644 --- a/doc/sidebar.md +++ b/doc/sidebar.md @@ -36,7 +36,7 @@ X and Linux. They require about 5 GB of disk space and 5 GB of RAM. * [Genbank and GTDB databases and taxonomy files](databases.md) * [Python examples using the API](api-example.md) * [Publications about sourmash](publications.md) -* [A guide to the internals of sourmash](sourmash-internals.md) +* [A guide to the internal design and structure of sourmash](sourmash-internals.md) * [Funding acknowledgements](funding.md) ## Developing and extending sourmash diff --git a/doc/sourmash-internals.md b/doc/sourmash-internals.md index fd26293a67..a615844290 100644 --- a/doc/sourmash-internals.md +++ b/doc/sourmash-internals.md @@ -1,4 +1,4 @@ -# A guide to the internals of sourmash +# A guide to the internal design and structure of sourmash ```{contents} Contents :depth: 3 From 817092f1ee8865a1b5ad7b9938d16b5fea941d47 Mon Sep 17 00:00:00 2001 From: "C. Titus Brown" Date: Sat, 14 Oct 2023 15:07:58 -0400 Subject: [PATCH 48/48] clean up --- doc/index.md | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/doc/index.md b/doc/index.md index d1502ba1b6..2adbf317e9 100644 --- a/doc/index.md +++ b/doc/index.md @@ -123,6 +123,7 @@ doesn't show up redundantly at the bottom of the index page. ```{toctree} --- maxdepth: 1 +hidden: true --- sidebar @@ -130,9 +131,3 @@ command-line api-example databases ``` - -# Indices and tables - -* {ref}`genindex` -* {ref}`modindex` -* {ref}`search`