Next-generation sequencing (NGS) technologies have revolutionized the field of genomics, enabling the rapid and cost-effective generation of massive amounts of data. However, the analysis and interpretation of NGS data poses significant challenges, requiring specialized bioinformatics tools and databases. In this blog post, we will review some of the common tasks and applications of NGS data analysis, and introduce some of the tools and databases that are available for researchers and practitioners.
Tools and Databases for Analysis of Next Generation Sequencing
De novo genome assembly
One of the main tasks in NGS data analysis is de novo genome assembly, which aims to reconstruct the complete genome sequence of an organism from short reads, without relying on a reference genome. De novo assembly is especially useful for studying non-model organisms, novel variants, and complex genomes. However, de novo assembly is also computationally intensive and prone to errors, especially due to the presence of repetitive sequences in the genome.
There are several algorithms and software tools for de novo genome assembly, which can be broadly classified into two categories: overlap-layout-consensus (OLC) and de Bruijn graph (DBG).
OLC methods rely on finding overlaps between reads and constructing a layout of the genome, followed by resolving conflicts and generating a consensus sequence. OLC methods can produce high-quality assemblies, but they are also memory-intensive and sensitive to sequencing errors. Examples of OLC assemblers include Celera Assembler, Minia, and SGA.
DBG methods, on the other hand, rely on breaking the reads into smaller units called k-mers, and constructing a graph where each node represents a k-mer and each edge represents an overlap of k-1 bases between two k-mers. DBG methods can handle large and complex genomes, and are more robust to sequencing errors, but they also tend to produce fragmented assemblies and lose information about repeats and heterozygosity. Examples of DBG assemblers include Velvet, SOAPdenovo, and SPAdes .
Sequence alignment and variant calling
Another common task in NGS data analysis is sequence alignment and variant calling, which aims to map the reads to a reference genome and identify the differences between the sample and the reference. Sequence alignment and variant calling are essential for studying the genetic diversity, evolution, and function of genomes, as well as for detecting mutations and diseases.
There are many tools and algorithms for sequence alignment and variant calling, which can be divided into two categories: heuristic and probabilistic .
Heuristic methods rely on finding exact or approximate matches between the reads and the reference, using techniques such as hash tables, suffix arrays, and Burrows-Wheeler transform. Heuristic methods are fast and scalable, but they may miss some alignments or variants due to the limitations of the matching criteria. Examples of heuristic aligners include BWA, Bowtie, and Novoalign.
Probabilistic methods, on the other hand, rely on calculating the likelihood of the reads given a reference and a model of sequencing errors, using techniques such as hidden Markov models, Bayesian inference, and maximum likelihood estimation. Probabilistic methods are more accurate and sensitive, but they are also more computationally demanding and require more parameters. Examples of probabilistic aligners include MAQ, SAMtools, and GATK.
Functional annotation and analysis
The final task in NGS data analysis is functional annotation and analysis, which aims to assign biological meaning and significance to the genomic sequences and variants. Functional annotation and analysis are crucial for understanding the structure, function, and regulation of genes and genomes, as well as for discovering novel genes and pathways, and for identifying the molecular mechanisms of diseases and phenotypes.
There are many tools and databases for functional annotation and analysis, which can be classified into three categories: gene prediction, gene annotation, and gene expression .
Gene prediction methods rely on finding the coding regions and the boundaries of genes in the genome, using techniques such as ab initio, homology-based, and transcriptome-based methods. Gene prediction methods can help to discover new genes and to improve the accuracy of genome assembly and alignment. Examples of gene prediction tools include AUGUSTUS, GeneMark, and Glimmer.
Gene annotation methods rely on assigning functional information to the genes and their products, such as gene names, gene ontology terms, protein domains, pathways, and interactions. Gene annotation methods can help to elucidate the biological roles and relationships of genes and proteins, and to facilitate further analysis and interpretation of genomic data. Examples of gene annotation tools and databases include NCBI RefSeq, Ensembl, and UniProt.
Gene expression methods rely on measuring the abundance and activity of genes and their products, such as mRNA, proteins, and metabolites, using techniques such as microarrays, RNA-seq, proteomics, and metabolomics. Gene expression methods can help to reveal the dynamic and regulatory aspects of gene function, and to identify the differential expression and co-expression of genes and pathways under different conditions and stimuli. Examples of gene expression tools and databases include Cufflinks, DESeq2, and GEO.
Conclusion
In this blog post, we have briefly introduced some of the common tasks and applications of NGS data analysis, and some of the tools and databases that are available for researchers and practitioners. However, this is by no means a comprehensive or exhaustive list, and there are many more tools and databases that can be found online or in the literature. Moreover, the field of bioinformatics is constantly evolving and developing, and new tools and databases are being created and improved every day. Therefore, it is important for bioinformaticians and biologists to keep up with the latest advances and innovations, and to choose the most suitable and reliable tools and databases for their specific needs and goals.
References
- Lee, H. C., Lai, K., Lorenc, M. T., Imelfort, M., Duran, C., & Edwards, D. (2012). Bioinformatics tools and databases for analysis of next-generation sequence data. Briefings in functional genomics, 11(1), 12-241
- Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., … & Pevzner, P. A. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology, 19(5), 455-4772
- Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. bioinformatics, 25(14), 1754-1760.