NGS File Formats : The Simple Guide for Beginners

Next-generation sequencing (NGS) is a powerful technology that has revolutionized the field of genomics. NGS generates vast amounts of data, which are stored in various file formats. These file formats are used to store different types of data, including nucleotide sequences, quality scores, alignments, genomic annotations, and coverage data.

The most commonly encountered file formats in NGS data analysis include:

FASTA (FAST-All):

Purpose: Primarily used to store biological sequences such as nucleotide sequences (DNA or RNA) or amino acid sequences (proteins).

Format: It consists of a single-line description, preceded by a greater-than symbol (“>”), followed by the sequence data on subsequent lines.

Example:

>Sequence_1

ATCGTGCATGCA

FASTQ (FAST-QuaLity):

Purpose: It extends the FASTA format by including quality scores for each base in the sequence.

Format: Each sequence is represented in four lines: a sequence identifier, the sequence itself, a plus sign (“+”), and quality scores.

Example:

@Sequence_1

ATCGTGCATGCA

IIIIIIIIIIII

SAM/BAM/CRAM (Sequence Alignment/Map, Binary Alignment/Map, and Compressed RAM):

Purpose: These formats are used to store information about read alignments to a reference genome.

Format:

SAM: Human-readable text format.

BAM: Binary version of SAM, more efficient for storage and processing.

CRAM: A compressed version of BAM.

SAM example:

Example:

@SQ SN:ref LN:45

read1 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *

To convert the above SAM entry to BAM, you would use a tool like Samtools. Here’s a simplified command:

samtools view -bS input.sam > output.bam

To convert BAM to CRAM, you would use a tool like Samtools again. Here’s a simplified command:

samtools view -T reference.fasta -C -o output.cram input.bam

BED/GTF (Browser Extensible Data/General Transfer Format):

Purpose: Used for representing genomic annotations, such as gene structures, transcript locations, and other features.

Format:

BED: A simple, tab-delimited format.

GTF: A more elaborate format with additional information.

bash

BED format example:

chr1 100 200 gene1 0 +

chr1 150 250 exon1 0 +

GTF format example:

chr1 . gene 100 200 . + . gene_id “gene1”;

chr1 . exon 150 250 . + . gene_id “gene1”; transcript_id

“transcript1”;

bedgraph

Purpose: Used to represent continuous-valued data across a genome, such as coverage information from sequencing experiments.

Format: A four-column, tab-delimited format.

Example:

chr1 100 200 0.5

chr1 201 300 1.2

These file formats are crucial for storing and exchanging biological data in bioinformatics and genomics research. Each format serves a specific purpose and is tailored to the type of information it encapsulates.

Wellness DNA Test

Precision Medicine

Consumer Genomics

NGS File Formats : The Simple Guide for Beginners

FASTA (FAST-All):

FASTQ (FAST-QuaLity):

SAM/BAM/CRAM (Sequence Alignment/Map, Binary Alignment/Map, and Compressed RAM):

BED/GTF (Browser Extensible Data/General Transfer Format):

bedgraph

Leave a Comment Cancel Reply

Important Links

Our Services

Our Policies

Wellness DNA Test

Precision Medicine

Consumer Genomics

FASTA (FAST-All):

FASTQ (FAST-QuaLity):

SAM/BAM/CRAM (Sequence Alignment/Map, Binary Alignment/Map, and Compressed RAM):

BED/GTF (Browser Extensible Data/General Transfer Format):

bedgraph

Related Posts

Leave a Comment Cancel Reply

Important Links

Our Services

Our Policies