Introduction to Bioinformatics
MBG1002
Genomics data mining
(Biological Databases)
Assistant Prof. Cemalettin Bekpen
Sandra Porter
• What is a Database ?
• Bioinformatics, definition 1: the collection, classification, storage, and analysis of biochemical and biological
information using computers especially as applied to molecular genetics and genomics
(Merriam-Webster dictionary)
• A collection of
• structured
• searchable (index) → table of contents
• updated periodically (release) → new edition
• cross-referenced (hyperlinks) → links with other db data
Bioinformatics
Bioinformatics database: bioinformatics is a field that works on problems at the intersection of
biology, computer science, and statistics; a bioinformatics database stores this information in a structured form
Archana Bhardwaj
• Also includes the associated tools (software) needed for
db access, db updating, db information insertion, and db
information deletion
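The properties listed above (structured, searchable, cross-referenced) and the four operations (access, update, insertion, deletion) can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration, not how any particular biological database is implemented: the table names, column names, accession numbers, and the external database name are all hypothetical.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Structured: every entry has the same named fields.
cur.execute("""
    CREATE TABLE sequence_entry (
        accession TEXT PRIMARY KEY,
        organism  TEXT NOT NULL,
        sequence  TEXT NOT NULL
    )
""")
# Cross-referenced: each row links an entry to a record in another database.
cur.execute("""
    CREATE TABLE cross_reference (
        accession   TEXT NOT NULL REFERENCES sequence_entry(accession),
        external_db TEXT NOT NULL,
        external_id TEXT NOT NULL
    )
""")
# Searchable: an index lets lookups by organism avoid scanning every row.
cur.execute("CREATE INDEX idx_entry_organism ON sequence_entry(organism)")

# Insertion (all accessions and sequences are made-up examples).
cur.executemany(
    "INSERT INTO sequence_entry VALUES (?, ?, ?)",
    [
        ("DEMO0001", "Homo sapiens", "ATGGCC"),
        ("DEMO0002", "Mus musculus", "ATGGCA"),
        ("DEMO0003", "Homo sapiens", "ATGTTT"),
    ],
)
cur.execute(
    "INSERT INTO cross_reference VALUES (?, ?, ?)",
    ("DEMO0001", "DemoBibliographyDB", "REF42"),
)

# Update: correct the sequence of an existing entry.
cur.execute(
    "UPDATE sequence_entry SET sequence = ? WHERE accession = ?",
    ("ATGGCCTAA", "DEMO0001"),
)
# Deletion: remove an entry.
cur.execute("DELETE FROM sequence_entry WHERE accession = ?", ("DEMO0002",))
conn.commit()

# Access: query by organism, then follow the cross-reference.
human = [
    row[0]
    for row in cur.execute(
        "SELECT accession FROM sequence_entry WHERE organism = ? ORDER BY accession",
        ("Homo sapiens",),
    )
]
links = cur.execute(
    "SELECT e.sequence, x.external_db, x.external_id "
    "FROM sequence_entry e JOIN cross_reference x ON x.accession = e.accession"
).fetchall()

print(human)
print(links)
```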
• Why biological databases ?
• Explosive growth in biological data
• Data (sequences, 3D structures, 2D gel analysis, Expression
analysis) are no longer published in a conventional manner,
but directly submitted to databases
• Essential tools for biological research, as classical publications
used to be !
Bioinformatic Databases
The amount of data deposited in
databanks is increasing rapidly due
to the availability of NGS (Next-
Generation Sequencing)
technologies.
E. Hemond
L., Ghambir
• Biological Databases
• More than 1000 different databases
• Generally accessible through the web
• Variable size: <100 kb to >10 Gb
• DNA: >10 Gb
• Protein: 1 Gb
• 3D structure: 5 Gb
• Update frequency: daily to annually
• Different kinds of bioinformatic databases
• General Purpose
• Data type specific (structure, expression)
• Organism specific (human, yeast)
• Pathway Information
• Specialized data
• Categories of Databases
• Bibliography
• Sequences (DNA, protein)
• Genomics
• Protein domain/family
• Mutation/polymorphism
• Proteomics (2D gel, MS)
• 3D structure
• Metabolic networks
• Regulatory networks
• others
[Figure: Databases and other databases. Credit: Archana Bhardwaj]
• Bibliographic databases
• Bibliographic reference databases contain citations and
abstracts of published life science articles
• Example: PubMed – developed by the National Center
for Biotechnology Information.
• PubMed provides access to bibliographic information
such as MEDLINE, PreMEDLINE, HealthSTAR, and to
integrated molecular biology databases (composite db)
• PubMed (Medline)
• MEDLINE covers the fields of medicine, nursing, dentistry,
veterinary medicine, the health care system, and the preclinical
sciences
• Contains citations from approximately 5,200 journals worldwide in
37 languages; 60 languages for older journals
• Contains over 20 million citations from 1948 to the present
• Contains links to biological db and to some journals
• New records are added to PreMEDLINE daily
• A search by subject: “Aging”
• A search by authors: “Bekpen”
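The two searches above can also be expressed as query URLs for the NCBI E-utilities ESearch service. The sketch below only builds the URLs and does not contact the server. The helper name `build_pubmed_search_url` is hypothetical, and mapping the subject search to the `MeSH Terms` field tag is an assumption; a free-text search would simply drop the tag.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, field, max_results=20):
    """Return an ESearch URL that looks for `term` in one PubMed field.

    `field` is a PubMed field tag such as "Author" or "MeSH Terms".
    """
    query = {
        "db": "pubmed",               # search the PubMed database
        "term": f"{term}[{field}]",   # restrict the term to a single field
        "retmax": max_results,        # maximum number of record IDs returned
        "retmode": "json",            # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(query)}"


# Search by subject (assumed here to mean a MeSH term) and by author.
subject_url = build_pubmed_search_url("Aging", "MeSH Terms")
author_url = build_pubmed_search_url("Bekpen", "Author")

print(subject_url)
print(author_url)
```

Opening either URL in a browser, or fetching it with an HTTP client, should return the matching PubMed record identifiers; the same searches can of course be typed directly into the PubMed web interface.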
• Sequence Databases
• Sequences (Gene, RNA, Protein, Genome)
• Accession number (AC)
• References
• Taxonomic data
• Annotation
• Keywords
• Cross references
Sources of data: research groups (direct submission),
literature supplementary information, genome sequencing
institutes, patents
Main databases: GenBank, EMBL-EBI, DDBJ
(These three databases exchange information routinely)
Benson DA, et al., 2012
• General Sequence Databases – disadvantages
• Huge amount of Data
• Sequence redundancy (archive: nothing is ever removed, so it is highly
redundant)
• Sequence accuracy – (full of errors in sequences, in
annotations, in CDS attributions)
• Sequence annotations – (no consistency of annotations;
most annotations are done by the submitters)
• Heterogeneity of the quality
• Sequence contamination
• The solution:
• Highly curated databases of non-redundant sequences
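To make the redundancy problem concrete, the sketch below collapses entries whose sequences are exactly identical and keeps track of which identifiers were merged. It is a deliberately simplified illustration: curated databases also depend on expert review and on merging near-identical entries, which this code does not attempt. The function name and the example records are hypothetical.

```python
def remove_exact_duplicates(records):
    """Collapse records that share exactly the same sequence.

    `records` is an iterable of (identifier, sequence) pairs.
    Returns a list of (sequence, [identifiers]) in first-seen order.
    """
    merged = {}
    for identifier, sequence in records:
        key = sequence.upper()  # treat upper- and lower-case letters as equal
        merged.setdefault(key, []).append(identifier)
    return list(merged.items())


# Made-up submissions: the first two carry the same sequence.
submissions = [
    ("entry_a", "ATGC"),
    ("entry_b", "atgc"),
    ("entry_c", "GGCC"),
]
non_redundant = remove_exact_duplicates(submissions)
print(non_redundant)
```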
[Link]
archived-data-in-ecology-and-evolution-The_fig6_283717086
[Link]
• Sequence format: GenBank format
• Sequence format: FASTA format
>Sequence Name,
[sequence …….]
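A FASTA file is simple enough to read with a few lines of code: each record starts with a ">" header line, and every following line up to the next ">" belongs to that record's sequence. The sketch below is a minimal reader written for illustration; the function name `parse_fasta` and the example records are hypothetical, and established libraries handle more edge cases.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are concatenated; line breaks carry no meaning.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example with two records; the first sequence spans two lines.
example = """>demo_seq_1 hypothetical example
ATGGCCATTG
TAATGGGCC
>demo_seq_2
GGGTTTAAA
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```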
• Genomic databases
• Contain information on genes, gene location (mapping),
gene nomenclature, and links to sequence databases
• Exist for most organisms important for life science
research
• Examples: UCSC, GDB (human), MGI (mouse), FlyBase
(Drosophila), SGD (yeast), MaizeDB (maize), SubtiList (B.
subtilis), etc.
• Summary
• What is the best db for sequence analysis?
• Which contains the highest-quality data?
• Which is the most comprehensive?
• Which is the most up to date?
• Which is the least redundant?
• Which is the best indexed (allows complex queries)?
• Which web server responds most quickly?