Kraken2

Introduction

Kraken is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences. Kraken examines the $k$-mers within a query sequence and uses the information within those $k$-mers to query a database. That database maps $k$-mers to the lowest common ancestor (LCA) of all genomes known to contain a given $k$-mer.
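To make the $k$-mer idea concrete, the sketch below lists every overlapping $k$-mer of a short sequence using awk. The helper name `kmers` and the toy sequence are illustrative assumptions; this is not how Kraken itself is implemented, only a picture of what "the $k$-mers within a query sequence" means.

```shell
# Print every overlapping k-mer of a sequence, one per line.
# Usage: kmers SEQUENCE K   (hypothetical helper, for illustration only)
kmers() {
    printf '%s\n' "$1" | awk -v k="$2" '{
        n = length($0) - k + 1          # number of windows of width k
        for (i = 1; i <= n; i++)
            print substr($0, i, k)      # window starting at position i
    }'
}

# A 6-base sequence has 6 - 3 + 1 = 4 windows for k = 3.
kmers ACGTAC 3
```

Each printed $k$-mer is what would be looked up in the database to obtain an LCA taxon.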

Kraken2 is not itself a pipeline, but it is a core component of many metagenomic analysis pipelines. A typical pipeline that includes Kraken2 has the following stages:

Preprocessing:
- Reads from DNA sequencers (Illumina, Nanopore, etc.) are quality checked and filtered.

Taxonomic Classification with Kraken2:
- Kraken2 takes the preprocessed reads as input and classifies them by analyzing small subsequences (k-mers) within each read.
- It maps each k-mer to the lowest common ancestor (LCA) of all organisms in its reference database that contain that k-mer, and labels each read based on the taxa its k-mers hit.
- This builds a profile of the organisms likely present in the sample.

(Optional) Refining Classifications with Bracken:
- Bracken (often used alongside Kraken2) re-estimates abundances at a chosen taxonomic level (for example, species) from the Kraken2 report.
- Reads that Kraken2 could only assign to higher ranks are redistributed statistically, which reduces ambiguity in the final abundance estimates.

Downstream Analysis:
- The taxonomic profile generated by Kraken2 (and potentially refined by Bracken) is used for various downstream analyses, such as:
  - Abundance estimation of different species in the sample.
  - Identification of functional genes and pathways.
  - Comparison of microbial communities across different samples.

Visualization (Optional):
- Tools like Krona can be used to visualize the taxonomic data in interactive reports.

It's important to note that this is a general outline, and specific pipelines may vary depending on the research goals and the type of data being analyzed.
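The stages above can be sketched as a short shell script. This is a minimal dry-run sketch, not a recommended pipeline: the file names, thread count, and read length of 150 are placeholders, and it assumes `kraken2`, `bracken`, and Krona's `ktImportTaxonomy` are installed and that a Bracken database has been built alongside the Kraken2 database. The `run` helper is hypothetical; it prints each command unless `DRY_RUN=0` is set.

```shell
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 once kraken2, bracken and Krona are installed.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

DBNAME="${DBNAME:-k2_db}"      # placeholder database directory
SAMPLE="${SAMPLE:-sample}"     # placeholder sample prefix

# 1. Classify quality-filtered paired-end reads.
run kraken2 --db "$DBNAME" --threads 4 --paired \
    --report "$SAMPLE.kreport" --output "$SAMPLE.kraken" \
    "${SAMPLE}_R1.fastq" "${SAMPLE}_R2.fastq"

# 2. Optional: re-estimate species-level abundance with Bracken
#    (-r is the read length, -l S selects the species level).
run bracken -d "$DBNAME" -i "$SAMPLE.kreport" -o "$SAMPLE.bracken" -r 150 -l S

# 3. Optional: interactive Krona chart from the Kraken2 report
#    (-t 5 = taxonomy ID column, -m 3 = magnitude column).
run ktImportTaxonomy -t 5 -m 3 -o "$SAMPLE.krona.html" "$SAMPLE.kreport"
```

The preprocessing stage is left out on purpose, since the choice of quality-filtering tool depends on the sequencing platform.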

System Requirements

Disk space: Construction of a Kraken 2 standard database requires approximately 100 GB of disk space. A test on 01 Jan 2018 of the default installation showed 42 GB of disk space was used to store the genomic library files, 26 GB was used to store the taxonomy information from NCBI, and 29 GB was used to store the Kraken 2 compact hash table.
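The three figures above add up to 42 + 26 + 29 = 97 GB, consistent with the "approximately 100 GB" estimate. Before starting a build, free space can be checked with a sketch like the following. The helper name `has_free_gb` and the `DB_PARENT` variable are hypothetical, and the check counts in GiB (1024-based) as a simplification.

```shell
# Succeed if DIR has at least NEED_GB GiB free.
# Usage: has_free_gb DIR NEED_GB   (hypothetical helper)
has_free_gb() {
    # df -P gives POSIX-format output; -k reports 1024-byte blocks.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    need_kb=$(( $2 * 1024 * 1024 ))
    [ "$avail_kb" -ge "$need_kb" ]
}

DB_PARENT="${DB_PARENT:-.}"   # placeholder: directory that will hold the database
if has_free_gb "$DB_PARENT" 100; then
    echo "enough space for a standard build in $DB_PARENT"
else
    echo "less than 100 GiB free in $DB_PARENT" >&2
fi
```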

Like in Kraken 1, we strongly suggest against using NFS storage to store the Kraken 2 database if at all possible.

Memory: To run efficiently, Kraken 2 requires enough free memory to hold the database (primarily the hash table) in RAM. While this can be accomplished with a ramdisk, Kraken 2 will by default load the database into process-local RAM; the --memory-mapping switch to kraken2 will avoid doing so. The default database size is 29 GB (as of Jan. 2018), and you will need slightly more than that in RAM if you want to build the default database.

Dependencies: Kraken 2 currently makes extensive use of Linux utilities such as sed, find, and wget. Many scripts are written using the Bash shell, and the main scripts are written using Perl. Core programs needed to build the database and run the classifier are written in C++11, and need to be compiled using a somewhat recent version of g++ that will support C++11. Multithreading is handled using OpenMP. Downloads of NCBI data are performed by wget and rsync. Most Linux systems will have all of the above listed programs and development libraries available either by default or via package download.
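The memory and dependency notes above can be turned into a small preflight check. This is a sketch under stated assumptions: both helper names are hypothetical, and choosing --memory-mapping whenever the hash table exceeds available RAM is only one possible policy, not a rule from the Kraken 2 developers.

```shell
# Print each tool from the argument list that is missing from PATH.
# Prints nothing if all are present.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools named in the dependency list above.
missing_tools sed find wget rsync perl bash g++

# Print "--memory-mapping" when DBDIR/hash.k2d is larger than
# AVAILABLE_KB kilobytes, otherwise print nothing.
# Usage: k2_memory_flag DBDIR AVAILABLE_KB   (hypothetical helper)
k2_memory_flag() {
    hash_bytes=$(wc -c < "$1/hash.k2d")
    hash_kb=$(( hash_bytes / 1024 ))
    if [ "$hash_kb" -gt "$2" ]; then
        echo "--memory-mapping"
    fi
}

# On Linux, available memory can be read from /proc/meminfo.
DBNAME="${DBNAME:-k2_db}"     # placeholder database directory
if [ -r /proc/meminfo ] && [ -f "$DBNAME/hash.k2d" ]; then
    avail_kb=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
    echo "suggested flag: $(k2_memory_flag "$DBNAME" "${avail_kb:-0}")"
fi
```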

Unlike Kraken 1, Kraken 2 does not use an external $k$-mer counter. However, by default, Kraken 2 will attempt to use the dustmasker or segmasker programs provided as part of NCBI's BLAST suite to mask low-complexity regions (see [Masking of Low-complexity Sequences]).

MacOS NOTE: MacOS and other non-Linux operating systems are not explicitly supported by the developers, and MacOS users should refer to the Kraken-users group for support in installing the appropriate utilities to allow for full operation of Kraken 2. We will attempt to use MacOS-compliant code when possible, but development and testing time is at a premium and we cannot guarantee that Kraken 2 will install and work to its full potential on a default installation of MacOS.

In particular, we note that the default MacOS X installation of GCC does not have support for OpenMP. Without OpenMP, Kraken 2 is limited to single-threaded operation, resulting in slower build and classification runtimes.

Network connectivity: Kraken 2's standard database build and download commands expect unfettered FTP and rsync access to the NCBI FTP server. If you're working behind a proxy, you may need to set certain environment variables (such as ftp_proxy or RSYNC_PROXY) in order to get these commands to work properly.
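As an illustration of the proxy variables mentioned above, the following sketch exports both before a download. The proxy host and port are hypothetical placeholders; the value formats follow the usual conventions of wget (a URL) and rsync (host:port), and your site may need different settings.

```shell
# Hypothetical proxy endpoint; replace with your site's proxy.
PROXY_HOST="${PROXY_HOST:-proxy.example.org}"
PROXY_PORT="${PROXY_PORT:-3128}"

# wget reads ftp_proxy as a URL; rsync reads RSYNC_PROXY as host:port.
export ftp_proxy="http://$PROXY_HOST:$PROXY_PORT"
export RSYNC_PROXY="$PROXY_HOST:$PROXY_PORT"

echo "ftp_proxy=$ftp_proxy"
echo "RSYNC_PROXY=$RSYNC_PROXY"
```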

Kraken 2's scripts default to using rsync for most downloads; however, you may find that your network situation prevents use of rsync. In such cases, you can try the --use-ftp option to kraken2-build to force the downloads to occur via FTP.

MiniKraken: At present, users with low-memory computing environments can replicate the "MiniKraken" functionality of Kraken 1 in two ways: first, by increasing the value of $k$ with respect to $\ell$ (using the --kmer-len and --minimizer-len options to kraken2-build); and secondly, through downsampling of minimizers (from both the database and query sequences) using a hash function. This second option is performed if the --max-db-size option to kraken2-build is used; however, the two options are not mutually exclusive. In a difference from Kraken 1, Kraken 2 does not require building a full database and then shrinking it to obtain a reduced database.
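A reduced database build using the second approach might look like the sketch below. The 8 GB cap and the database name are hypothetical, and the sketch assumes --max-db-size takes a size in bytes; check `kraken2-build --help` on your installation before relying on it. The command is printed rather than executed.

```shell
DBNAME="${DBNAME:-k2_mini}"                 # placeholder database directory
MAX_BYTES=$(( 8 * 1000 * 1000 * 1000 ))     # hypothetical 8 GB cap on the hash table

# Printed only; run the printed command once the library and taxonomy are in place.
echo "kraken2-build --build --db $DBNAME --max-db-size $MAX_BYTES"
```

The --kmer-len and --minimizer-len options could be appended to the same command to combine both approaches.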

Kraken 2 Databases

A Kraken 2 database is a directory containing at least 3 files:

hash.k2d: Contains the minimizer to taxon mappings
opts.k2d: Contains information about the options used to build the database
taxo.k2d: Contains taxonomy information used to build the database

None of these three files are in a human-readable format. Other files may also be present as part of the database build process, and can, if desired, be removed after a successful build of the database.

In interacting with Kraken 2, you should not have to directly reference any of these files, but rather simply provide the name of the directory in which they are stored. Kraken 2 allows both the use of a standard database as well as custom databases; these are described in the sections [Standard Kraken 2 Database] and [Custom Databases] below, respectively.
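A quick way to confirm that a directory holds the three required files is sketched below. The helper name `is_k2_db` is hypothetical, and the check only tests that the files exist, not that their contents are valid.

```shell
# Succeed if DIR contains the three files that make up a Kraken 2 database.
# Usage: is_k2_db DIR   (hypothetical helper)
is_k2_db() {
    for f in hash.k2d opts.k2d taxo.k2d; do
        [ -f "$1/$f" ] || { echo "missing: $1/$f" >&2; return 1; }
    done
}

DBNAME="${DBNAME:-k2_db}"     # placeholder database directory
if is_k2_db "$DBNAME"; then
    echo "$DBNAME looks like a Kraken 2 database"
fi
```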

Custom Databases

To build a custom database, first install a taxonomy. Usually, you will just use the NCBI taxonomy, which you can easily download using:

kraken2-build --download-taxonomy --db $DBNAME

This will download the accession number to taxon maps, as well as the taxonomic name and tree information from NCBI. These files can be found in $DBNAME/taxonomy/. If you need to modify the taxonomy, edits can be made to the names.dmp and nodes.dmp files in this directory; you may also need to modify the *.accession2taxid files appropriately.

Some of the standard sets of genomic libraries have taxonomic information associated with them, and do not need the accession number to taxon maps to build the database successfully. These libraries include all those available through the --download-library option to kraken2-build, except for the plasmid and non-redundant databases. If you are not using custom sequences (see the --add-to-library option) and are not using one of the plasmid or non-redundant database libraries, you may want to skip downloading of the accession number to taxon maps. This can be done by passing --skip-maps to the kraken2-build --download-taxonomy command.
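Put together, a custom build that skips the accession maps might look like the following dry-run sketch. The database name and the choice of the bacteria library are placeholders, and the `k2b` helper is hypothetical; it only prints the kraken2-build command it would run.

```shell
DBNAME="${DBNAME:-k2_custom}"     # placeholder database directory

# Dry run: print the kraken2-build command instead of executing it.
k2b() { echo "+ kraken2-build $* --db $DBNAME"; }

# Taxonomy without the accession-to-taxon maps.
k2b --download-taxonomy --skip-maps
# A reference library that carries its own taxonomic information.
k2b --download-library bacteria
# Build the database from the downloaded taxonomy and library.
k2b --build
```

If custom sequences are added with --add-to-library, the --skip-maps option should be dropped, since those sequences rely on the accession maps.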

We provide support for building Kraken 2 databases from three publicly available 16S databases:

Greengenes (Kraken 2 database name: greengenes), using all available 16S data.
RDP (Kraken 2 database name: rdp), using the bacterial and archaeal 16S data.
SILVA (Kraken 2 database name: silva), using the Small subunit NR99 sequence set.
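The database names above are passed to kraken2-build through its --special option, which does not appear elsewhere in this text and is stated here from the Kraken 2 command-line usage. The sketch below validates the name and prints the command; the helper `k2_16s_cmd` and the default database directory are hypothetical.

```shell
# Print the build command for one of the three supported 16S databases.
# Usage: k2_16s_cmd TYPE DBDIR   (hypothetical helper; prints, does not run)
k2_16s_cmd() {
    case "$1" in
        greengenes|rdp|silva)
            echo "kraken2-build --db $2 --special $1" ;;
        *)
            echo "unknown 16S database: $1" >&2
            return 1 ;;
    esac
}

k2_16s_cmd silva "${DBNAME:-k2_silva}"
```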

Standard Kraken 2 Database

To create the standard Kraken 2 database, you can use the following command:

kraken2-build --standard --db $DBNAME

Replace "$DBNAME" above with your preferred database name/location. Please note that the database will use approximately 100 GB of disk space during creation.
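A slightly fuller version of the standard build is sketched below, adding a thread count and a cleanup step. The thread count and database name are placeholders, and the sketch assumes the --threads and --clean options of kraken2-build; both commands are printed rather than executed.

```shell
DBNAME="${DBNAME:-k2_standard}"   # placeholder database directory
THREADS="${THREADS:-4}"           # placeholder thread count

# Build the standard database using several threads.
echo "kraken2-build --standard --threads $THREADS --db $DBNAME"
# After a successful build, remove intermediate files to reclaim disk space.
echo "kraken2-build --clean --db $DBNAME"
```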