Next-generation sequencing (NGS) is a powerful technology that has revolutionized the field of genomics. NGS generates vast amounts of data, which are stored in various file formats. These file formats are used to store different types of data, including nucleotide sequences, quality scores, alignments, genomic annotations, and coverage data.
The most commonly encountered file formats in NGS data analysis include:
FASTA (FAST-All):
Purpose: Primarily used to store biological sequences such as nucleotide sequences (DNA or RNA) or amino acid sequences (proteins).
Format: It consists of a single-line description, preceded by a greater-than symbol (“>”), followed by the sequence data on subsequent lines.
Example:
>Sequence_1
ATCGTGCATGCA
FASTQ (FAST-QuaLity):
Purpose: It extends the FASTA format by including quality scores for each base in the sequence.
Format: Each sequence is represented in four lines: a sequence identifier, the sequence itself, a plus sign (“+”), and quality scores.
Example:
@Sequence_1
ATCGTGCATGCA
+
IIIIIIIIIIII
SAM/BAM/CRAM (Sequence Alignment/Map, Binary Alignment/Map, and Compressed RAM):
Purpose: These formats are used to store information about read alignments to a reference genome.
Format:
SAM: Human-readable text format.
BAM: Binary version of SAM, more efficient for storage and processing.
CRAM: A compressed version of BAM.
SAM example:
Example:
@SQ SN:ref LN:45
read1 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
To convert the above SAM entry to BAM, you would use a tool like Samtools. Here’s a simplified command:
samtools view -bS input.sam > output.bam
To convert BAM to CRAM, you would use a tool like Samtools again. Here’s a simplified command:
samtools view -T reference.fasta -C -o output.cram input.bam
BED/GTF (Browser Extensible Data/General Transfer Format):
Purpose: Used for representing genomic annotations, such as gene structures, transcript locations, and other features.
Format:
BED: A simple, tab-delimited format.
GTF: A more elaborate format with additional information.
bash
BED format example:
chr1 100 200 gene1 0 +
chr1 150 250 exon1 0 +
GTF format example:
chr1 . gene 100 200 . + . gene_id “gene1”;
chr1 . exon 150 250 . + . gene_id “gene1”; transcript_id
“transcript1”;
bedgraph
Purpose: Used to represent continuous-valued data across a genome, such as coverage information from sequencing experiments.
Format: A four-column, tab-delimited format.
Example:
chr1 100 200 0.5
chr1 201 300 1.2
These file formats are crucial for storing and exchanging biological data in bioinformatics and genomics research. Each format serves a specific purpose and is tailored to the type of information it encapsulates.