Next-generation sequencing (NGS) is a revolutionary technology that has enabled the rapid and high-throughput generation of massive amounts of genomic data. NGS has transformed the fields of biology, medicine, and biotechnology by providing unprecedented insights into the structure, function, and evolution of genomes. However, NGS also poses significant challenges for data analysis, as the raw data are complex, noisy, and often incomplete. Therefore, bioinformatics tools and databases are essential for processing, interpreting, and annotating NGS data.
In this blog, we will review some of the common steps and tools involved in NGS data analysis, from raw data to biological discoveries. We will also highlight some of the open-source resources available for the genomics community.
Raw Data Processing
The first step in NGS data analysis is to process the raw data generated by the sequencing machines. This involves quality control, filtering, trimming, and error correction of the reads. Some of the tools that can perform these tasks are:
- FastQC: For quality control and assessment of NGS data. It generates various statistics and plots to evaluate the quality of the reads, such as base quality, sequence length, GC content, duplication levels, and overrepresented sequences.
- Trimmomatic: For trimming and filtering NGS data. It removes low-quality bases, adapter sequences, and other contaminants from the reads, and outputs high-quality reads for downstream analysis.
- SPAdes: For error correction and de novo assembly of NGS data. It uses a combination of k-mers and graph algorithms to correct sequencing errors and assemble the reads into contigs or scaffolds.
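To make the trimming step concrete, here is a minimal Python sketch of 3'-end quality trimming, the kind of operation Trimmomatic performs. This is an illustration of the idea, not Trimmomatic's actual sliding-window algorithm; the Phred+33 quality encoding and the threshold of 20 are assumptions for the example.

```python
def phred33_to_q(ch):
    """Convert a Phred+33 ASCII quality character to an integer score."""
    return ord(ch) - 33

def trim_3prime(seq, qual, threshold=20):
    """Trim bases from the 3' end until a base with quality >= threshold
    is reached. A simplified stand-in for the quality trimming done by
    tools like Trimmomatic (illustrative only)."""
    i = len(seq)
    while i > 0 and phred33_to_q(qual[i - 1]) < threshold:
        i -= 1
    return seq[:i], qual[:i]

# 'I' encodes Q40 (high quality); '#' and '!' encode Q2 and Q0.
read = "ACGTACGTAC"
qual = "IIIIIII##!"
trimmed_seq, trimmed_qual = trim_3prime(read, qual)
```

In practice you would run this logic over every record in a FASTQ file and discard reads that become too short after trimming.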
Alignment and Mapping
The next step in NGS data analysis is to align or map the processed reads to a reference genome or transcriptome. This allows the identification of the genomic regions or genes that are covered by the reads and, in the case of RNA-seq, the quantification of their expression levels. Some of the tools that can perform these tasks are:
- BWA: This tool aligns short reads to a reference genome. It uses a Burrows-Wheeler transform and a seed-and-extend strategy to efficiently and accurately map the reads to the reference, allowing for gaps and mismatches.
- Bowtie2: This tool also aligns short reads to a reference genome. It builds an FM-index of the reference and uses a seed-and-extend strategy, combining index-based seed lookup with dynamic-programming extension, to rapidly and sensitively map the reads, allowing for gaps and mismatches.
- STAR: A tool for aligning RNA-seq reads to a reference genome. It uses an uncompressed suffix array and a splice-aware algorithm to accurately and comprehensively map the reads across exon boundaries, accounting for annotated splicing events and novel junctions.
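The seed-and-extend strategy mentioned above can be sketched in a few lines of Python. Real aligners like BWA and Bowtie2 use compressed FM-indexes and dynamic-programming extension to handle gaps; the hash-based k-mer index and simple mismatch count below are assumptions made to keep the example short.

```python
def build_seed_index(reference, k=11):
    """Index every k-mer in the reference by its start positions
    (a toy stand-in for an FM-index or suffix array)."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index

def seed_and_extend(read, reference, index, k=11, max_mismatches=2):
    """Map a read: look up its first k-mer as a seed, then verify the
    full read at each candidate position, tolerating a few mismatches.
    Returns the 0-based reference positions of accepted hits."""
    hits = []
    seed = read[:k]
    for pos in index.get(seed, []):
        window = reference[pos:pos + len(read)]
        if len(window) == len(read):
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.append(pos)
    return hits

reference = "GGGGACGTACGTACGTCCCC"
index = build_seed_index(reference)
positions = seed_and_extend("ACGTACGTACGT", reference, index)
```

The key design idea, shared by the real aligners, is that exact seed lookup quickly narrows the search to a handful of candidate loci, so the expensive base-by-base comparison only runs where a match is plausible.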
Variant Calling and Annotation
The final step in NGS data analysis is to call and annotate the variants that are detected in the aligned or mapped reads. This involves identifying the differences between the reads and the reference, such as single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variants. It also involves annotating the variants with their functional and clinical implications, such as their effects on genes, proteins, and diseases. Some of the tools that can perform these tasks are:
- GATK: A toolkit for variant discovery and genotyping in NGS data. It uses a Bayesian model and a haplotype-based approach to call and genotype variants across multiple samples, and provides various filters and quality metrics to assess the reliability of the calls.
- Strelka2: A tool for small variant calling in NGS data. It uses a probabilistic, haplotype-based model to call and genotype SNVs and indels, both in germline samples and in tumor-normal pairs for somatic variant detection, and provides various filters and quality metrics to assess the reliability of the calls.
- ANNOVAR: A tool for variant annotation and prioritization in NGS data. It uses a comprehensive database and a flexible framework to annotate the variants with their functional and clinical effects, such as gene names, amino acid changes, population frequencies, and disease associations.
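The core idea behind variant calling can be illustrated with a naive pileup: at each reference position, tally the bases observed in the reads that cover it and report a variant when the majority base disagrees with the reference. Production callers such as GATK and Strelka2 instead use probabilistic, haplotype-aware models with local reassembly; the depth and allele-fraction cutoffs below, and the assumption of gapless alignments, are simplifications for this sketch.

```python
from collections import Counter

def call_snvs(reference, aligned_reads, min_depth=3, min_fraction=0.8):
    """Naive pileup-based SNV calling (conceptual sketch only).

    aligned_reads is a list of (start_position, sequence) tuples,
    with gapless alignment to the reference assumed. Returns a list
    of (position, ref_base, alt_base) tuples.
    """
    # Build the pileup: for each reference position, collect the
    # bases contributed by every read covering it.
    pileup = {}
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pileup.setdefault(start + offset, []).append(base)
    variants = []
    for pos in sorted(pileup):
        bases = pileup[pos]
        if len(bases) < min_depth:
            continue  # too little coverage to call confidently
        alt, count = Counter(bases).most_common(1)[0]
        if alt != reference[pos] and count / len(bases) >= min_fraction:
            variants.append((pos, reference[pos], alt))
    return variants

reference = "ACGTACGTAC"
reads = [(2, "GTTCG"), (3, "TTCGT"), (4, "TCGTA")]
snvs = call_snvs(reference, reads)
```

A real caller would emit these calls in VCF format and attach genotype likelihoods and quality annotations, which is exactly the output that annotation tools like ANNOVAR then consume.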
Conclusion
In this blog, we have briefly introduced some of the common steps and tools involved in NGS data analysis, from raw data to biological discoveries. However, this is not an exhaustive list, and there are many other tools and databases that can be used for different purposes and applications. Moreover, NGS data analysis is an active and evolving field, and new tools and databases are constantly being developed and improved. Therefore, we encourage the readers to explore the open-source resources available for the genomics community, and to keep up with the latest advances and innovations in NGS data analysis.