DrOmics Labs

Bioinformatics Secrets: 10 Powerful Sequence and Structure Databases for Breakthrough Discoveries

Sequence and structure databases are essential resources for bioinformatics, as they store and provide access to the vast amount of biological data generated by experimental methods. Sequence databases contain the sequences of nucleic acids (DNA and RNA) and proteins, while structure databases store the three-dimensional coordinates of macromolecules and their complexes. In this blog, we will introduce some of the most commonly used sequence and structure databases, and discuss their features and applications.

Sequence Databases

Sequence databases can be classified into three categories: primary, secondary, and specialised databases. Primary databases are the repositories of raw sequence data submitted by researchers or large-scale sequencing projects. Secondary databases are derived from primary databases by adding annotations, classifications, or predictions. Specialised databases focus on specific types of sequences or organisms.

Primary Sequence Databases

The three major primary sequence databases are:

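  1. GenBank (NCBI): The nucleotide sequence archive hosted by the US National Center for Biotechnology Information. Together with the EMBL and DDBJ archives it forms the International Nucleotide Sequence Database Collaboration (INSDC), whose members exchange records so that the data stay synchronised and accessible.
  2. EMBL Database (now the European Nucleotide Archive, ENA, at EMBL-EBI): Europe's repository for nucleotide sequences and a member of the INSDC.
  3. DDBJ (DNA Data Bank of Japan): Collects nucleotide sequences and is the third member of the INSDC.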

The primary databases accept all types of nucleotide sequences, such as genomic, transcriptomic, or metagenomic data, from any source organism, including bacteria, viruses, plants, animals, and humans. Their hosts also provide tools for searching, retrieving, and analysing the sequence data, such as BLAST for similarity searches and Entrez for text-based retrieval, alongside archives for raw sequencing reads such as the Sequence Read Archive (SRA).
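
As a rough illustration of programmatic retrieval, the sketch below builds a request URL for the EFetch endpoint of NCBI's E-utilities (the web API behind Entrez) and parses FASTA-formatted text of the kind such a request returns. The helper names `efetch_url` and `parse_fasta`, the accession `U49845`, and the toy sequences are chosen here for illustration only; the code makes no network call, and real use should follow NCBI's usage guidelines.

```python
from urllib.parse import urlencode

# Base URL of the EFetch endpoint in NCBI's E-utilities.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build (but do not send) an EFetch URL for one accession."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_EFETCH}?{urlencode(params)}"


def parse_fasta(text):
    """Parse FASTA text into a dict mapping header line -> sequence."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


if __name__ == "__main__":
    # Example accession, used only to show the URL shape.
    print(efetch_url("U49845"))

    # Made-up FASTA text standing in for a downloaded response.
    demo = ">seq1 demo\nACGT\nTTGA\n\n>seq2\nGGCC\n"
    for name, seq in parse_fasta(demo).items():
        print(name, len(seq))
```

Keeping URL construction separate from parsing means the parser can be reused on FASTA files obtained from any of the three INSDC archives.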

Secondary Sequence Databases

Secondary sequence databases are built upon the primary databases by adding value to the raw sequence data. For example, some secondary databases annotate the sequences with functional or structural information, such as gene names, protein domains, or binding sites. 

  1. RefSeq: NCBI's curated, non-redundant collection of reference genome, transcript, and protein sequences, annotated with functional or structural information such as gene names and protein domains.
  2. UniProt: Provides comprehensive information on proteins, including functional annotations and protein-protein interactions.
  3. Ensembl: Offers annotated genomes with a focus on vertebrates and other eukaryotic organisms.

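Other databases classify sequences based on similarities or evolutionary relationships: Pfam and InterPro group proteins into families and domains, Rfam does the same for RNA families, and NCBI Taxonomy organises the source organisms. Additionally, resources such as PSORTb (subcellular localisation of bacterial proteins) and STRING (protein-protein interaction networks) predict properties or functional associations of sequences.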

Specialised Sequence Databases

Specialised sequence databases are dedicated to specific types of sequences or organisms, and often provide more detailed or comprehensive information than the general databases. 

Some examples of specialised sequence databases:

  1. miRBase: Focuses on non-coding RNA sequences, specifically microRNA.
  2. snoRNA-LBME-db: Stores small nucleolar RNA sequences and their targets.
  3. WormBase: Provides genomic and biological data for nematodes.
  4. KEGG: Integrates genomic, chemical, and functional information for various organisms.

Structure Databases

Structure databases store the three-dimensional coordinates of macromolecules and their complexes, such as proteins, nucleic acids, or protein-nucleic acid complexes. The structures are usually determined by experimental methods, such as X-ray crystallography, nuclear magnetic resonance (NMR), or cryo-electron microscopy (cryo-EM), or by computational methods, such as homology modelling or molecular docking. The structures can then be visualised, compared, and analysed with standalone programs such as PyMOL, VMD, or Chimera, as well as with the viewers built into many database websites.

Primary Structure Databases

The main primary structure database is the Protein Data Bank (PDB), the single worldwide repository of experimentally determined macromolecular structure data. The PDB accepts structures determined by X-ray crystallography, NMR, cryo-EM, and other experimental methods (purely computational models are not deposited there), and assigns a unique four-character alphanumeric identifier, such as 1CRN, to each entry. The classic PDB format is a fixed-column text format that contains the atomic coordinates, the connectivity, the secondary structure, and other information; the archive's current standard format is PDBx/mmCIF. The PDB is maintained by the Worldwide Protein Data Bank (wwPDB), whose members include RCSB PDB (USA), PDBe (Europe), PDBj (Japan), BMRB (NMR data), and EMDB (electron microscopy maps). The wwPDB ensures that the structure data are consistent and accessible through the partner websites and file download servers.
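
To make the fixed-column layout of the PDB format concrete, here is a minimal sketch that parses a single ATOM record using the column positions defined for the coordinate section of the format. The `Atom` tuple, the function name `parse_atom_line`, and the sample record are illustrative choices made for this post, not part of any official library; production code would normally rely on an established parser instead.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    element: str


# Illustrative ATOM record, assembled field by field so the columns are visible.
SAMPLE_LINE = (
    "ATOM  "      # cols  1-6   record name
    "    1"       # cols  7-11  atom serial number
    " "           # col  12     unused
    " N  "        # cols 13-16  atom name
    " "           # col  17     alternate location indicator
    "THR"         # cols 18-20  residue name
    " "           # col  21     unused
    "A"           # col  22     chain identifier
    "   1"        # cols 23-26  residue sequence number
    " "           # col  27     insertion code
    "   "         # cols 28-30  unused
    "  17.047"    # cols 31-38  x coordinate (angstroms)
    "  14.099"    # cols 39-46  y coordinate
    "   3.625"    # cols 47-54  z coordinate
    "  1.00"      # cols 55-60  occupancy
    " 13.79"      # cols 61-66  temperature factor
    "          "  # cols 67-76  unused
    " N"          # cols 77-78  element symbol
)


def parse_atom_line(line: str) -> Atom:
    """Parse one ATOM/HETATM record by slicing its fixed columns."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    line = line.ljust(80)  # pad short lines so every slice is valid
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain_id=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        element=line[76:78].strip(),
    )


if __name__ == "__main__":
    print(parse_atom_line(SAMPLE_LINE))
```

Because the fields are identified by position rather than by a delimiter, splitting the line on whitespace is unreliable: adjacent columns can run together when values are wide.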

Secondary Structure Databases

Secondary structure databases are derived from the primary structure database by adding annotations, classifications, or predictions. For example, some secondary databases annotate the structures with functional or biological information, such as ligands, interactions, or pathways. 

Some examples of secondary structure databases:

  1. PDBsum: Annotates structures with functional or biological information, providing a graphical overview.
  2. PDBbind: Collects binding affinities of protein-ligand complexes.
  3. SCOP: Organises structures hierarchically into folds, superfamilies, and families.
  4. CATH: Classifies structures based on class, architecture, topology, and homology.
  5. ProTherm: Contains thermodynamic data for protein stability.
  6. DynaMine: Predicts backbone dynamics of proteins.
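
The hierarchical idea behind classifications such as CATH can be sketched in a few lines. CATH identifies a superfamily with a dotted four-number code, one number per level (class, architecture, topology, homologous superfamily). The code below splits such an identifier and lists its ancestors in the hierarchy. The names `CathNode`, `parse_cath_id`, and `lineage` are hypothetical helpers written for this post, and `3.40.50.720` is used purely as an example identifier.

```python
from typing import List, NamedTuple

# Names of the four main CATH classes (the top level of the hierarchy).
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


class CathNode(NamedTuple):
    cath_class: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_id(cath_id: str) -> CathNode:
    """Split a dotted C.A.T.H identifier into its four integer levels."""
    parts = cath_id.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {cath_id!r}")
    return CathNode(*(int(p) for p in parts))


def lineage(node: CathNode) -> List[str]:
    """Return the identifiers of every level from class down to superfamily."""
    c, a, t, h = node
    return [f"{c}", f"{c}.{a}", f"{c}.{a}.{t}", f"{c}.{a}.{t}.{h}"]


if __name__ == "__main__":
    node = parse_cath_id("3.40.50.720")
    print(CATH_CLASSES.get(node.cath_class, "other"))
    print(lineage(node))
```

Two domains that share a longer prefix of this lineage sit closer together in the classification, which is what makes such identifiers convenient for grouping structures.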

 

Specialised Structure Databases

Specialised structure databases are dedicated to specific types of structures or complexes, and often provide more detailed or comprehensive information than the general databases.

Some examples of specialised structure databases:

  1. NDB (Nucleic Acid Database): Focuses on nucleic acid structures determined by any method.
  2. RNA 3D Hub: Provides annotations and classifications of RNA structures and motifs.
  3. PDBFlex: Analyses conformational changes in protein structures and complexes.
  4. Proteopedia: Offers a wiki-style platform for exploring protein structures and their functions.

These specialised databases offer in-depth information on specific types of structures or interactions.

The Importance of Sequence and Structure Databases in Biological Research

Sequence and structure databases advance biological research by serving as the repositories for the vast and diverse data generated through genome sequencing and protein structure determination projects. Their significance lies in five key aspects:

  1. Data Storage:

   – Sequence databases (e.g., those hosted by NCBI and EMBL-EBI) store vast genomic, transcriptomic, and metagenomic data.

   – The Protein Data Bank (PDB) houses the three-dimensional coordinates of macromolecular structures.

  2. Global Access:

   – INSDC ensures synchronised access to sequence data with unique accession numbers.

   – PDB serves as a universal repository for diverse macromolecular structures.

  3. Annotation and Enrichment:

   – Secondary databases (e.g., RefSeq, UniProt) add functional and structural annotations to sequence data.

   – Structural databases (e.g., SCOP, CATH) classify proteins, aiding in understanding evolutionary relationships.

  4. Analysis Tools:

   – Bioinformatics tools (e.g., BLAST, Entrez) offered alongside sequence databases aid in efficient data retrieval and analysis.

   – Structures from the PDB can be analysed in three dimensions with tools like PyMOL and VMD.

  5. Specialised Information:

   – Specialised databases (e.g., miRBase, NDB) focus on specific elements like microRNA or nucleic acid structures.

   – They provide detailed, often organism-specific information complementing general databases.

Conclusion

Sequence and structure databases are indispensable for bioinformatics, as they provide the raw material and the foundation for many analyses and applications. By storing and organising biological data in a systematic and standardised way, they make it possible to access, retrieve, and compare data across different sources and methods. By adding value through annotations, classifications, or predictions, they aid the interpretation of the data and facilitate the discovery of new knowledge. By focusing on specific types or aspects of the data, specialised resources offer greater depth and cater to the needs of different users and purposes. In doing so, these databases foster collaboration, standardisation, and innovation in the pursuit of a deeper understanding of life at the molecular level.

In this blog, we have introduced some of the most commonly used sequence and structure databases and discussed their features and applications. We hope it will help you find and use the appropriate databases for your bioinformatics projects.
