Biological Databases
Dr. Himanshu Avashthi
Assistant Professor
Department of Bioinformatics,
Faculty of Engineering & Technology,
Marwadi University, Rajkot
Biological Databases
• Biological databases are organized collections of biological data that are
stored electronically in a computer system.
• These are essential for storing, managing, and retrieving biological data for
research and analysis.
• These databases enable scientists to access and use vast amounts of
information generated from various experiments and studies in the field of
biology.
Importance of Databases
• Databases act as a storehouse of information.
• Databases are used to store and organize data in such a way that the
information can be easily retrieved through various search criteria.
• They facilitate the discovery of new biological insights from raw data.
Features of an Ideal Biological Database
• An ideal biological database should possess several key features to ensure it is useful,
reliable, and accessible for researchers and scientists. Some essential features are:
1. Comprehensive and Accurate Data
✓Should cover a wide range of relevant data.
✓Data should be correct, validated, and up-to-date to maintain scientific credibility.
2. User-Friendly Interface
✓ A user-friendly interface allows researchers to find and use the data efficiently.
✓Should offer advanced search capabilities with filters and easy retrieval options.
3. Regular Updates
✓Frequent updates to incorporate new data and research findings.
4. Accessibility and Availability
✓Preferably free and open (open access) to the scientific community to maximize its
utility.
5. Security and Privacy
✓Should protect data from unauthorized access and breaches.
6. Non-redundant
✓Avoid unnecessary duplication of data.
Classification of Biological Databases is based on:
1. Data Sources
2. Data types
1. Based on the source of biological data, databases are of three types:
Primary Databases
Secondary Databases
Composite Databases (OWL, NRDB, UniProt)
1. Primary Databases
• They are also called archival databases.
• They are populated with experimentally derived (obtained) data such as
nucleotide sequence, protein sequence, or macromolecular structure.
• Experimental results are submitted directly into the database by researchers.
• The data entered here remains uncurated (no modifications are performed
over the data).
• The data are given accession numbers when entered into the database.
• Once given a database accession number, the data in primary databases are
never changed: they form part of the scientific record.
• The same data can later be retrieved using the accession number.
• An accession number uniquely identifies each record, and it never changes.
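The accession-number behaviour described above can be illustrated with a minimal sketch. Everything here is hypothetical: `ToyArchive`, the `TOY` prefix, and the example sequence are made up for illustration and do not reflect the schema or numbering scheme of any real database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One archived submission; frozen so it cannot be modified after creation."""
    accession: str
    sequence: str


class ToyArchive:
    """Hypothetical in-memory archive, not the design of any real database."""

    def __init__(self, prefix="TOY"):
        self._prefix = prefix
        self._records = {}

    def submit(self, sequence):
        # Assign the next unused accession number; records are never deleted,
        # so a number is never handed out twice.
        accession = f"{self._prefix}{len(self._records) + 1:06d}"
        self._records[accession] = Record(accession, sequence)
        return accession

    def fetch(self, accession):
        # Retrieval by accession returns the record exactly as submitted.
        return self._records[accession]


archive = ToyArchive()
first = archive.submit("ATGGCCATTGTAATGGGCCGC")
second = archive.submit("ATGGCCATTGTAATGGGCCGC")  # same data, new accession
print(first, second, archive.fetch(first).sequence)
```

Note that submitting the same sequence twice yields two distinct accessions, which is one reason archival databases tend to accumulate redundant entries.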
Primary Nucleotide Sequence Databases
• The primary repositories of nucleotide sequences are:
• GenBank (National Center for Biotechnology Information, USA)
• EMBL-EBI (European Molecular Biology Laboratory's European Bioinformatics Institute, UK), which hosts the European Nucleotide Archive (ENA)
• DDBJ (DNA Data Bank of Japan)
• They synchronize on a daily basis, and the unique accession numbers are
managed consistently.
• These databases contain a considerable amount of redundancy.
GenBank
• GenBank is the NIH genetic sequence database.
• It is an annotated collection of all publicly available DNA sequences.
EMBL-EBI
• The European Nucleotide Archive (ENA) at EMBL-EBI is a repository providing free and
unrestricted access to annotated DNA and RNA sequences.
• Data arrive at ENA from various sources such as submitted raw sequencing
data, sequence assembly information, and routine exchange with INSDC.
DDBJ
• DDBJ collects nucleotide sequence data as a member of INSDC and provides freely
available nucleotide sequence data to support research activities in life sciences.
International Nucleotide Sequence Database Collaboration
• INSDC integrates the information of the NCBI, EMBL-EBI, and DDBJ
databases.
• This information is exchanged, updated, and synchronized daily.
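As a rough illustration of what a daily exchange achieves, the sketch below treats each member database as a dictionary keyed by accession and gives every member the union of all records. The function name `daily_exchange` and the accession strings are hypothetical; the real INSDC exchange mechanism is not described in these slides and is certainly more involved.

```python
def daily_exchange(*nodes):
    """Give every node the union of all records, keyed by accession."""
    merged = {}
    for node in nodes:
        for accession, sequence in node.items():
            # An accession must refer to the same record at every node.
            if merged.setdefault(accession, sequence) != sequence:
                raise ValueError(f"conflicting data for {accession}")
    for node in nodes:
        node.update(merged)
    return merged


# Made-up accessions and sequences, for illustration only.
genbank = {"X0001": "ATGC"}
ena = {"Y0001": "GGCA"}
ddbj = {"Z0001": "TTAG", "X0001": "ATGC"}

daily_exchange(genbank, ena, ddbj)
print(sorted(genbank) == sorted(ena) == sorted(ddbj))
```

After the exchange, a record submitted to any one member can be retrieved from all three under the same accession.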
Primary Protein Sequence Databases
1. PIR-PSD: (Protein Information Resource-International Protein Sequence Database)
2. SWISS-PROT
3. TrEMBL (Translated EMBL)
1. PIR-PSD
• World’s first superfamily-based classified, functionally annotated, comprehensive,
and expertly curated protein sequence database.
2. SWISS-PROT
• It is a manually curated (annotated) protein sequence database that provides a high
level of annotation on each protein's function, domain structure, and
post-translational modifications, with minimal redundancy.
3. TrEMBL
• It is a computationally annotated protein sequence database.
• It contains translations of all coding sequences in the EMBL nucleotide sequence database
that are not yet integrated into Swiss-Prot.
Primary Protein Sequence Databases
• All three databases (PIR-PSD, Swiss-Prot, and TrEMBL) were merged into a single
resource, UniProt (Universal Protein Resource), maintained by the UniProt Consortium.
UniProt
• It comprises three sub-databases:
1. The UniProt Knowledgebase (UniProtKB): Includes Swiss-Prot (manually
annotated) and TrEMBL (automatically annotated).
It is the main source of functional protein information, providing accurate,
consistent, and detailed annotations.
2. The UniProt Archive (UniParc):
A comprehensive sequence archive that stores all protein sequences—annotated or
unannotated—from major public databases. It is updated daily.
3. The UniProt Reference Clusters (UniRef):
Offers clustered sets of sequences at different identity levels to reduce redundancy
and improve search efficiency across UniProtKB and UniParc.
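To make the idea of clustering at different identity levels concrete, here is a minimal greedy-clustering sketch. It is a toy, not UniRef's actual procedure: it assumes equal-length, already-aligned sequences, measures identity as the fraction of matching positions, and uses made-up peptide strings. The thresholds 1.0, 0.9, and 0.5 mirror the UniRef100, UniRef90, and UniRef50 levels.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold):
    """Put each sequence in the first cluster whose representative it matches."""
    clusters = {}  # representative sequence -> list of member sequences
    for seq in sequences:
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


# Made-up sequences, for illustration only.
seqs = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MKTGYLAKHK", "GSSHHHHHHS"]
for level in (1.0, 0.9, 0.5):
    print(level, len(greedy_cluster(seqs, level)))
```

With these five sequences the sketch yields 4, 3, and 2 clusters at the three levels: lowering the identity threshold merges more sequences under one representative, which is how redundancy is reduced.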
2. Secondary Databases
• Secondary databases contain data that are derived from the results of analyzing
primary data.
• In other words, they store the interpreted or processed results of primary data.
• These are also known as curated or annotated databases.
• Secondary databases often integrate information from multiple sources, including
other databases and the scientific literature.
• They are highly curated, often using a complex combination of computational
tools, manual analysis, and interpretation to generate new information or
knowledge from existing scientific data.
Secondary Protein Sequence Databases
1. PROSITE
2. Pfam
3. InterPro
1. PROSITE
• It is a manually curated database of protein domains, families, and
functional (catalytic) sites.
2. Pfam
• It is a collection of curated protein families that aims to provide complete and
accurate information.
• Rather than relying on a typical BLAST search, Pfam uses profile hidden Markov
models (HMMs), which give greater weight to matches at conserved sites,
allowing better homology detection.
3. InterPro
• It provides functional analysis of proteins by classifying them into families and
predicting domains and important functional sites.
• To classify proteins, InterPro uses predictive models, known as signatures,
provided by several different databases (referred to as member databases) that
make up the InterPro consortium.
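The point made under Pfam, that matches at conserved sites carry greater weight, can be sketched with position-specific log-odds scores. This is a deliberate simplification and not a profile HMM: it has no insert or delete states, it assumes an ungapped alignment, a uniform background frequency, and a pseudocount of 1, and the four-residue alignment is made up.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1 / len(ALPHABET)  # uniform background, a simplifying assumption


def build_profile(alignment, pseudocount=1.0):
    """Per-column log-odds scores from an ungapped, equal-length alignment."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in ALPHABET
        })
    return profile


def score(profile, sequence):
    """Sum of per-position log-odds for a sequence as long as the profile."""
    return sum(col[aa] for col, aa in zip(profile, sequence))


# Made-up alignment: columns 1 and 4 are fully conserved (C and H),
# columns 2 and 3 vary from sequence to sequence.
alignment = ["CAKH", "CGRH", "CSEH", "CTDH"]
profile = build_profile(alignment)

full_match = score(profile, "CAKH")
variable_mismatch = score(profile, "CWKH")   # mismatch in a variable column
conserved_mismatch = score(profile, "AAKH")  # mismatch in a conserved column
print(full_match > variable_mismatch > conserved_mismatch)
```

A mismatch at the conserved cysteine costs more than a mismatch in a variable column, which a single substitution matrix applied uniformly along the sequence would not capture.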