DrOmics Labs

NGS File Formats : The Simple Guide for Beginners

Next-generation sequencing (NGS) is a powerful technology that has revolutionized the field of genomics. NGS generates vast amounts of data, which are stored in various file formats. These file formats are used to store different types of data, including nucleotide sequences, quality scores, alignments, genomic annotations, and coverage data.

The most commonly encountered file formats in NGS data analysis include:

FASTA (FAST-All):

Purpose: Primarily used to store biological sequences such as nucleotide sequences (DNA or RNA) or amino acid sequences (proteins).

Format: It consists of a single-line description, preceded by a greater-than symbol (“>”), followed by the sequence data on subsequent lines.

Example: 

>Sequence_1

ATCGTGCATGCA

FASTQ (FAST-QuaLity):

Purpose: It extends the FASTA format by including quality scores for each base in the sequence.

Format: Each sequence is represented in four lines: a sequence identifier, the sequence itself, a plus sign (“+”), and quality scores.

Example:

@Sequence_1

ATCGTGCATGCA

+

IIIIIIIIIIII

 

SAM/BAM/CRAM (Sequence Alignment/Map, Binary Alignment/Map, and Compressed RAM):

 

Purpose: These formats are used to store information about read alignments to a reference genome.

Format:

SAM: Human-readable text format.

BAM: Binary version of SAM, more efficient for storage and processing.

CRAM: A compressed version of BAM.

SAM example:

Example:

@SQ SN:ref LN:45

read1 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *

To convert the above SAM entry to BAM, you would use a tool like Samtools. Here’s a simplified command:

samtools view -bS input.sam > output.bam

To convert BAM to CRAM, you would use a tool like Samtools again. Here’s a simplified command:

samtools view -T reference.fasta -C -o output.cram input.bam

 

BED/GTF (Browser Extensible Data/General Transfer Format):

Purpose: Used for representing genomic annotations, such as gene structures, transcript locations, and other features.

Format:

BED: A simple, tab-delimited format.

GTF: A more elaborate format with additional information.

bash

BED format example:

chr1    100     200    gene1   0   +

chr1     150      250    exon1   0   +

GTF format example:

chr1   .     gene   100   200    .    +    .             gene_id “gene1”;

chr1 .       exon    150    250    .    +    .          gene_id   “gene1”;      transcript_id

“transcript1”;

 

bedgraph

Purpose: Used to represent continuous-valued data across a genome, such as coverage information from sequencing experiments.

Format: A four-column, tab-delimited format.

Example:

chr1    100    200    0.5

chr1    201    300    1.2

These file formats are crucial for storing and exchanging biological data in bioinformatics and genomics research. Each format serves a specific purpose and is tailored to the type of information it encapsulates.

Leave a Comment

Your email address will not be published. Required fields are marked *

Bitbucket