Bioinformatics Resources - PDB -

Bioinformatics Resources - PDB - Lecture & Exercises, Prof. B. Rost, Dr. L. Richter, J. Reeb, Institut für Informatik I12

Orga - Exam Date
- Exam takes place on Friday, July 31st
- Room: MW 0250 (Mechanical Engineering Building)
- Time scheduled: 8.30-10.30 (might be later)
- Duration: approx. 90 min

Advertisement
- Bachelor thesis: Carry your Genes (CyG)
- In collaboration with Certgate GmbH and Iteratec GmbH
- Affects: personalized medicine, mobile apps, encryption
- Hiwi opportunity included
- see https://www.rostlab.org/teaching/theses

PDB History
- 1968: Brookhaven RAster Display (BRAD)
- 1969: Edgar Meyer came up with a file format for atomic coordinates
- 1971: remote access with the SEARCH program written by Meyer -> PDB functional
- 1998: transfer to RCSB (Research Collaboratory for Structural Biology)
- 2003: formation of wwPDB (PDBe, RCSB, PDBj, BMRB (2006))

References
- F.C. Bernstein, T.F. Koetzle, G.J.B. Williams, E.F. Meyer Jr., M.D. Brice, J.R. Rodgers, O. Kennard, T. Shimanouchi, M. Tasumi (1977) The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112: 535-542.
- H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne (2000) The Protein Data Bank. Nucleic Acids Research 28: 235-242.
- H.M. Berman, K. Henrick, H. Nakamura (2003) Announcing the worldwide Protein Data Bank. Nature Structural Biology 10 (12): 980.
- http://www.rcsb.org/pdb/home/home.do

Current Composition*
Experimental Method   Proteins   Nucleic Acids   Protein/Nucleic Acid complexes   Other     Total
-------------------------------------------------------------------------------------------------
X-ray diffraction       90,662           1,622                            4,510       4    96,798
NMR                      9,597           1,118                              225       8    10,948
Electron microscopy        566              29                              184       0       779
Hybrid                      70               3                                2       1        76
Other                      165               4                                6      13       188
Total                  101,060           2,776                            4,927      26   108,789
* as of May 18th, 2015
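The totals in the composition table can be re-derived from the individual cells. The following is a minimal sketch that only uses the counts from the slide; the variable names are illustrative and not part of any PDB tooling.

```python
# Entry counts per experimental method, copied from the composition slide.
# Tuple order: proteins, nucleic acids, protein/nucleic acid complexes, other.
composition = {
    "X-ray diffraction": (90662, 1622, 4510, 4),
    "NMR": (9597, 1118, 225, 8),
    "Electron microscopy": (566, 29, 184, 0),
    "Hybrid": (70, 3, 2, 1),
    "Other": (165, 4, 6, 13),
}

# Row totals (per method), column totals (per molecule type), grand total.
row_totals = {method: sum(counts) for method, counts in composition.items()}
column_totals = tuple(sum(column) for column in zip(*composition.values()))
grand_total = sum(row_totals.values())

# Share of each method in the whole archive.
for method, total in row_totals.items():
    print(f"{method:<20} {total:>7} {100 * total / grand_total:6.2f} %")
```

Run this way, the numbers make the dominance of X-ray crystallography explicit: it accounts for roughly nine out of ten entries.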

Growth of PDB, All Entries [figure: yearly and total number of entries, 1972-2014; y-axis 0 to 120000]

Entries According to Method [figure: number of entries for the series Total, X-Ray, NMR, and EM; y-axis 0 to 120000]

Growth of X-Ray Structures [figure: yearly and total number of structures, 1972-2014; y-axis 0 to 100000]

Growth of NMR Structures [figure: yearly and total number of structures, 1972-2014; y-axis 0 to 12000]

Growth of EM Structures [figure: yearly and total number of structures, 1972-2014; y-axis 0 to 800]

Unique CATH Folds (Topologies) [figure: yearly and total number of unique folds, 1972-2014; y-axis 0 to 1600]

Unique CATH Superfamilies [figure: yearly and total number of unique superfamilies, 1972-2014; y-axis 0 to 3000]

Atomic Coordinate Entry Format
- aka PDB format
- current version 3.30
- comprises 190 pages
- ftp://ftp.wwpdb.org/pub/pdb/doc/format_descriptions/format_v33_a4.pdf

Record Format
- allowed characters: abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ 1234567890 `-=[]\;',./~!@#$%^&*()_+{}|:"<>?
- , : ; are delimiters, otherwise they need to be escaped by \
- a file consists of multiple lines
- each line is 80 characters wide including EOL
- lines are self-identifying: the first six columns contain the record name (left-justified, padded with blanks)
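Because every line identifies itself through its first six columns, a reader can dispatch on the record name without looking at the rest of the line. A minimal sketch follows; the function names are hypothetical, and the width check assumes that a line may be at most 80 columns once the end-of-line marker is stripped.

```python
def record_name(line: str) -> str:
    """Return the record name stored in columns 1-6 of a PDB line."""
    return line[:6].strip()


def is_well_formed(line: str) -> bool:
    """Check column width and character set of a single line.

    Assumption: at most 80 columns after removing the end-of-line marker,
    and only printable ASCII characters (space through tilde).
    """
    body = line.rstrip("\r\n")
    return len(body) <= 80 and all(32 <= ord(ch) < 127 for ch in body)


# Hypothetical lines, padded to the full width for illustration.
lines = ["END".ljust(80) + "\n", "NUMMDL    2".ljust(80) + "\n"]
names = [record_name(line) for line in lines]
```

Slicing with fixed offsets instead of splitting on whitespace matters here: fields in PDB files may be empty or may touch each other without a separating blank.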

Single Line Records, One Time/One Line
- CRYST1: Unit cell parameters, space group, and Z.
- END: Last record in the file.
- HEADER: First line of the entry, contains PDB ID code, classification, and date of deposition.
- NUMMDL: Number of models.
- ...

One Time/Multiple Lines (incompl.)
- AUTHOR: List of contributors.
- KEYWDS: List of keywords describing the macromolecule.
- SOURCE: Biological source of macromolecules in the entry.
- TITLE: Description of the experiment represented in the entry.
Subsequent lines have a continuation number.
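Records that span several lines have to be reassembled using the continuation number. The sketch below does this for TITLE under the assumption, taken from my reading of the format description and worth checking there, that the continuation number sits in columns 9-10 (blank on the first line) and the text starts in column 11. Other multi-line records may use slightly different columns. The example lines are invented.

```python
def join_continued(lines, name):
    """Concatenate the text of all lines belonging to one multi-line record."""
    parts = []
    for line in lines:
        if line[:6].strip() != name:
            continue
        continuation = line[8:10].strip()  # blank on the first line, then 2, 3, ...
        order = int(continuation) if continuation else 1
        parts.append((order, line[10:80].strip()))
    # Sort by continuation number so the result does not depend on line order.
    return " ".join(text for _, text in sorted(parts))


# Hypothetical TITLE record split over two lines.
example = [
    "TITLE     HYPOTHETICAL STRUCTURE OF AN EXAMPLE PROTEIN IN COMPLEX WITH",
    "TITLE    2 AN EXAMPLE LIGAND",
]
title = join_continued(example, "TITLE")
```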

Multiple Times/One Line (incompl.)
- ATOM: Atomic coordinate records for standard groups.
- CONECT: Connectivity records.
- DBREF: Reference to the entry in the sequence database(s).
- HELIX: Identification of helical substructures.
- SHEET: Identification of sheet substructures.
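ATOM is the record most programs actually read, so a small parser is a useful illustration of the fixed-column idea. The column positions below are not given on the slide; they follow the layout as I recall it from format version 3.3 (serial 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z 31-54) and should be verified against the format description. The `Atom` class and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom(line: str) -> Atom:
    """Parse one ATOM line by column position (assumed v3.3 layout)."""
    if line[:6] != "ATOM  ":
        raise ValueError("not an ATOM record")
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain_id=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


# Build an invented ATOM line with exact column widths, then parse it back.
example = "ATOM  {:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   {:>8.3f}{:>8.3f}{:>8.3f}".format(
    1, " N", "", "MET", "A", 1, "", 27.340, 24.430, 2.614
)
atom = parse_atom(example)
```

For real work, an established parser such as the one in Biopython is the safer choice; the sketch only shows why splitting on whitespace is not sufficient.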

Multiple Times/Multiple Lines (incompl.)
- FORMUL: Chemical formula of non-standard groups.
- HETNAM: Compound name of the heterogens.
- SEQRES: Primary sequence of backbone residues.
- SITE: Identification of groups comprising important entity sites.
Subsequent lines have a continuation number.

Record Order
- Records have to appear in a defined order
- There are mandatory and optional records
- Some mandatory records depend on conditions
- Mandatory records without content are NULL
- examples of mandatory records:
  - HEADER
  - TITLE
  - COMPND
  - ...

Records Belong to Sections
Section               Record Type
------------------------------------------------------------------------------------
Title                 HEADER, OBSLTE, TITLE, SPLIT, CAVEAT, COMPND, SOURCE, KEYWDS, EXPDTA, NUMMDL, MDLTYP, AUTHOR, REVDAT, SPRSDE, JRNL
Remark                REMARKs 0-999
Primary structure     DBREF, SEQADV, SEQRES, MODRES
Secondary structure   HELIX, SHEET
Coordinate            MODEL, ATOM, ANISOU, TER, HETATM, ENDMDL
...
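The section table translates directly into a lookup structure. The sketch below covers only the sections listed on the slide (the table is abbreviated there), so records from other sections map to `None`; the names `SECTIONS` and `section_of` are illustrative.

```python
# Section -> record types, exactly as listed on the slide (incomplete).
SECTIONS = {
    "Title": [
        "HEADER", "OBSLTE", "TITLE", "SPLIT", "CAVEAT", "COMPND", "SOURCE",
        "KEYWDS", "EXPDTA", "NUMMDL", "MDLTYP", "AUTHOR", "REVDAT", "SPRSDE",
        "JRNL",
    ],
    "Remark": ["REMARK"],
    "Primary structure": ["DBREF", "SEQADV", "SEQRES", "MODRES"],
    "Secondary structure": ["HELIX", "SHEET"],
    "Coordinate": ["MODEL", "ATOM", "ANISOU", "TER", "HETATM", "ENDMDL"],
}

# Inverted index: record name -> section.
RECORD_TO_SECTION = {
    record: section for section, records in SECTIONS.items() for record in records
}


def section_of(line: str):
    """Return the section a line belongs to, or None if it is not listed."""
    return RECORD_TO_SECTION.get(line[:6].strip())
```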

Records Even Have Formats
- A record consists of fields, each holding a specified type of data
- Data could be: A-Z, a-z, an atom name, a nine-character string representing a date, a number, ...
- Complex data: a token (a string followed by a colon), a comma-separated list of strings, a fixed-format string literal, ...

Example Header
COLUMNS    DATA TYPE      FIELD            DEFINITION
------------------------------------------------------------------------------------
 1 -  6    Record name    HEADER
11 - 50    String(40)     classification   Classifies the molecule(s). *
51 - 59    Date           depdate          Deposition date. This is the date the coordinates were received at the PDB.
63 - 66    IDcode         idcode           This identifier is unique within the PDB.
* taken from a class list from the current wwPDB Annotation Documentation Appendices (http://www.wwpdb.org/docs.html)
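The column table for HEADER maps one-to-one onto string slices (column n corresponds to index n-1). A minimal sketch, with an invented example line whose classification, date, and ID code are placeholders and do not refer to a real entry:

```python
def parse_header(line: str) -> dict:
    """Extract the HEADER fields using the column ranges from the table."""
    if line[:6] != "HEADER":
        raise ValueError("not a HEADER record")
    return {
        "classification": line[10:50].strip(),  # columns 11-50
        "depdate": line[50:59].strip(),         # columns 51-59
        "idcode": line[62:66].strip(),          # columns 63-66
    }


# Hypothetical line assembled with exact field widths (10 + 40 + 9 + 3 + 4).
example = f"{'HEADER':<10}{'HYDROLASE':<40}{'01-JAN-00'}{'':3}{'1ABC'}"
header = parse_header(example)
```

The date is kept as the nine-character string here; converting it to a date object is left out because month abbreviations are locale-sensitive in many date parsers.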

Classification of Structures: CATH/SCOP
- both came up in the middle of the 1990s
- both are quite similar
- aim: organize the protein structures available in PDB, based on single domains
- hierarchical system (roughly):
  - secondary structure content
  - fold
  - superfamilies
  - families

SCOP: a Structural Classification of Proteins
- Murzin, A., Brenner, S. E., Hubbard, T. J. P. and Chothia, C. (1995) J. Mol. Biol., 247, 536-540
- Hubbard, T. J. P., Murzin, A., Brenner, S. E. and Chothia, C. (1997) Nucl. Acids Res., 25(1), 236-239 (easier to obtain)
- fully manually curated, driven by expert analysis
- associated with the ASTRAL compendium
- latest news: SCOPe (UC Berkeley), SCOP2 (MRC Lab Mol Biol, Cambridge, UK)

CATH - Faces [photos; image sources: http://www.tgac.ac.uk/scientific-advisory-board/ and http://www.ebi.ac.uk/about/people/janet-thornton]

CATH
- semi-automatic procedure for deriving a novel hierarchical classification of protein domain structures
- four main levels:
  - C: protein class, mainly secondary structure composition of each domain
  - A: architecture, summarizes shapes based on orientation of secondary structure elements
  - T: topology, sequential connectivity is considered
  - H: homologous superfamily, high similarity with similar functions, evolutionary relationship assumed
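The four levels form a strict hierarchy, which is why CATH nodes are commonly written as four dot-separated numbers, one per level. That notation is not shown on the slide, so the sketch below treats it as an assumption; the two codes used are purely illustrative, and the names `CathNode`, `parse_cath`, and `shared_levels` are hypothetical.

```python
from typing import NamedTuple


class CathNode(NamedTuple):
    class_: int        # C: protein class
    architecture: int  # A: architecture
    topology: int      # T: topology (fold)
    superfamily: int   # H: homologous superfamily


def parse_cath(code: str) -> CathNode:
    """Split a dotted code of the assumed form 'C.A.T.H' into its levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated levels")
    return CathNode(*map(int, parts))


def shared_levels(a: CathNode, b: CathNode) -> int:
    """Count how many leading levels (0 to 4) two domains have in common."""
    depth = 0
    for left, right in zip(a, b):
        if left != right:
            break
        depth += 1
    return depth


# Two illustrative codes that agree in C, A, and T but differ in H.
first = parse_cath("3.40.50.300")
second = parse_cath("3.40.50.720")
common = shared_levels(first, second)
```

Comparing from the left and stopping at the first mismatch mirrors the hierarchy: two domains can only share a topology if they already share class and architecture.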