The major driving force in their folding is hydrophobic interactions
Important evolutionary units
~40% of known structures in the PDB are multidomain proteins
Of these >40% contain discontinuous domains
Domain insertion is a common evolutionary mechanism
60-80% of genes in genomes code for multidomain proteins
Combine with different partners
Extend functional repertoire
Comparing proteins
Basic scoring system
Sequence identity % = (number of identical residues / number of residues aligned) x 100
>35% identity is generally taken to mean proteins are evolutionarily related
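The identity formula above can be sketched in a few lines. This is a minimal illustration, assuming the two sequences are already aligned and that gaps are written as "-"; the function name `percent_identity` is hypothetical, and counting only columns where both sequences have a residue is one of several conventions for "residues aligned".

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the aligned (non-gap) columns of two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue (assumed convention).
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


# Toy example: 8 aligned columns, 6 of them identical.
print(percent_identity("MKT-AYIAK", "MKTQAYLAR"))  # 75.0
```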
Superimposition
Root mean square deviation (RMSD)
Measures how close equivalent residues are to one another after superimposition
< 3.5 Å if related
Make alignment
Sequence
Structure
Calculate distance between C-alpha atoms of each pair of aligned residues
Pythagoras in 3D
Square each distance and add together for all residue pairs
Divide by number of residue pairs
Take the square root
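As a worked illustration of the RMSD calculation, the sketch below takes the C-alpha distance for each aligned pair (Pythagoras in 3D), squares it, averages over all pairs, and takes the square root. It assumes the two structures have already been superimposed and that the coordinates are listed in alignment order; the function name `rmsd` and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) C-alpha coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared 3D distance between the aligned pair of atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    # Mean of the squared distances, then the square root.
    return math.sqrt(total / len(coords_a))


# Toy example: every atom is displaced by exactly 1 Å along z.
a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0)]
print(rmsd(a, b))  # 1.0
```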
So - if 2 proteins have 99% identity...
They will have a common 3D structure as long as <50 residues have this
Their function is probably similar, but even a single residue difference can completely change function
Recognising domains
DNA sequence
Limited use
Very closely related proteins will have very similar DNA sequences
Perhaps useful for very short evolutionary distances
Amino acid sequences
Domains have similar amino acid sequences
As they diverge the sequence pattern is lost
Protein structure
Domains have the same fold
Has it been seen before?
Independently folding units
Many computer programs try to use structural data to identify domains
Each domain takes a specific topology/fold
There is a limit to how many folds are possible in nature
~10^3 to 10^4
Even though there are millions of protein sequences there are only so many different ways that structures can be fitted together!
Why structural similarity?
Divergent evolution from common ancestor
If something works it is unlikely to be selected against
Structure is much more highly conserved than sequence
Makes sense: a difference in sequence is absolute, but some amino acids have common properties, so the structure can be retained even when the sequence is not
Convergent evolution
Only so many ways to pack helices and strands in 3D space
Energetically favourable
Protein classification
Domains are an important part of structure
Structure is conserved because it determines function
Protein data bank
CATH
Class
Assigned automatically for >90%
Major secondary structure
Architecture
Gross orientation of secondary structures
Shape of the fold
Beta-roll
Up-down bundle
Alpha-beta prism
alpha-beta-alpha sandwich
Topology
Connections
Number of secondary structures
Homologous superfamily
Highly similar structures and functions
Is there enough evidence for shared evolutionary origin?
Process
1. Chop proteins into domains
2. Sequence and structural analysis programs group by evolutionary and structural families
CATH domain structures and sequence relatives in sequenced genomes are fairly representative
Applications
"Ab initio" methods using protein structure
Algorithms for recognising boundaries
Swindells, 1995 - Detective
Each domain should have a recognisable hydrophobic core
Siddiqui and Barton, 1995 - DOMAK
Residues comprising a domain make more internal contacts than external ones
Holm & Sander, 1994 - PUU
Parser for protein folding units
Maximal interaction within domains
Minimal interaction between domains
Seek consensus; in practice ~20% of cases
E.g. Beta-amylase
PUU: 1 domain
DOMAK: 2 domains
Detective: 3 domains
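The shared idea behind these boundary-finding programs (many contacts within a domain, few between domains) can be illustrated with a toy calculation. The sketch below is not DOMAK, PUU or Detective: it only tries every single cut point along a chain and scores it by internal versus external contacts. The function names, the scoring formula and the `min_size` parameter are all assumptions made for illustration.

```python
def split_contacts(contacts, split):
    """Count contacts inside each segment and across the cut for a single split point."""
    internal_a = internal_b = external = 0
    for i, j in contacts:
        if i < split and j < split:
            internal_a += 1
        elif i >= split and j >= split:
            internal_b += 1
        else:
            external += 1
    return internal_a, internal_b, external


def best_split(contacts, n_residues, min_size=2):
    """Return (split, score) for the cut that best separates the chain into two segments."""
    best = None
    for split in range(min_size, n_residues - min_size + 1):
        ia, ib, ext = split_contacts(contacts, split)
        # Illustrative score only: reward internal contacts, penalise external ones.
        score = (ia * ib) / (ext + 1) ** 2
        if best is None or score > best[1]:
            best = (split, score)
    return best


# Toy chain of 8 residues: two fully connected blocks (0-3 and 4-7) joined by one contact.
block = lambda lo, hi: [(i, j) for i in range(lo, hi) for j in range(i + 1, hi)]
contacts = block(0, 4) + block(4, 8) + [(3, 4)]
print(best_split(contacts, 8))  # (4, 9.0)
```

Real methods also have to cope with discontinuous domains, which a single cut point cannot represent; that is one reason the programs can disagree on the same protein.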
Sequence methods
Sequence-sequence comparison
BLAST
Needleman-Wunsch (NW)
Sequence-profile comparison
PSI-BLAST
Align sequences and colour-code degree of conservation
If identity <35%
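For the sequence-sequence comparison listed above, NW (Needleman-Wunsch) is a global alignment by dynamic programming. The sketch below is a minimal version with a flat match/mismatch score and a linear gap penalty; these scoring values are assumptions for illustration, whereas real protein alignments normally use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```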
Structural methods
Contact map
Points of contact between residues in a protein
Distance matrices
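A distance matrix and a contact map can both be derived from the C-alpha coordinates. The sketch below is illustrative only: the 8 Å cutoff is an assumed, commonly used threshold rather than a value given in these notes, and the function names are hypothetical.

```python
import math


def distance_matrix(coords):
    """All-against-all distances between (x, y, z) coordinates."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def contact_map(coords, cutoff=8.0):
    """1 where two different residues lie within `cutoff` Å of each other, else 0."""
    d = distance_matrix(coords)
    n = len(coords)
    return [[1 if i != j and d[i][j] <= cutoff else 0 for j in range(n)] for i in range(n)]


# Toy example: residues 0 and 1 are neighbours, residue 2 is far away.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
for row in contact_map(coords):
    print(row)
# [0, 1, 0]
# [1, 0, 0]
# [0, 0, 0]
```

Because both matrices depend only on distances within one structure, they can be compared between proteins without superimposing them first.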
Structural classification of proteins
Class
Fold
Superfamily
Family
Species
Protein family database Pfam
Superfamily
Clade
Sequence based
Concurs with SCOP
Methods vary but agree on 60-70% of cases
Homology
An absolute value
Either homologous or not!
Orthologs
Common ancestor
Different species
Same function
Paralogs
Common ancestor
Gene duplication
Same or different species
Different but related function
At least 2 conditions must be met:
Significant structural similarity
Significant sequence similarity
Functional similarity
e.g. toxins
Cholera
Pertussis
Heat stable enterotoxin
High structural similarity
Related functions
No evidence that they evolved from a common ancestor
Rossmann fold
NAD binding domain
Cofactor that reversibly accepts a hydride ion
Lost or gained by the substrate in the redox reaction
Found in all living cells
Metabolism
Accepts or donates electrons in redox
Remove two hydrogen atoms from reactant (R)
Hydride ion H-
Reduces NAD+ to NADH
Proton H+
Released into solution
2 nucleotides joined through their phosphate groups
One contains an adenine base
The other nicotinamide
One of the most ubiquitous domains
Alpha-beta fold
Central beta sheet
Surrounded by approximately 5 alpha helices
Strands in characteristic order 654123
Crossover forms binding site
Lactate dehydrogenase
Has a Rossmann fold as its N-terminal domain
Interconverts pyruvate and L-lactate
Reduction of pyruvate to lactate is the last step in anaerobic glycolysis
C-terminal catalytic domain
Substrate specificity
Precise reaction
Specific to lactate/malate dehydrogenases
Malate dehydrogenase
Interconversion of malate and oxaloacetate
N-terminal domain is a Rossmann fold
C-terminal catalytic domain
Paralogs
17% sequence identity in humans
Duplication event
BUT structurally very similar
Function of NAD-binding domain conserved
Change in sequence confers change in substrate specificity
Human/zebrafish orthologs
76% sequence identity
Suggests speciation from common ancestor
Alcohol dehydrogenase
Two ADCDs flank a Rossmann fold
Same structure as LDH
17% identity
Combine with different catalytic domains to achieve different functions
Sequence diversity
Different evolutionary constraints in different positions in the protein structure
Surface residues have the fewest evolutionary constraints and can also accommodate small insertions and deletions
Core residues more highly conserved, critical for folding and stability
Functional residues also highly conserved
Critical for enzyme function or for protein-protein interactions
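Position-dependent constraint shows up as per-column conservation in a multiple sequence alignment. The sketch below scores each column by the fraction of sequences carrying the most common residue; this simple measure, the function name `column_conservation` and the toy alignment are assumptions for illustration, and published conservation scores usually also account for residue similarity.

```python
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Per-column fraction of sequences that carry the most common residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != gap]
        if not residues:
            scores.append(0.0)  # column made entirely of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        # Gaps count against conservation because we divide by all sequences.
        scores.append(top_count / len(column))
    return scores


alignment = ["MKVLA", "MKILA", "MRVLG", "MKV-A"]
print(column_conservation(alignment))  # [1.0, 0.75, 0.75, 0.75, 0.75]
```

In a real family, core and functional positions would be expected to score near 1.0 and surface loops noticeably lower.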
Structural diversity
The core is usually highly conserved
Within a family ~50%
Residue insertions in loops connecting secondary structures
Substitutions can cause shifts in orientations of secondary structures
Domain embellishments
Functional diversity
Domain superfamily
Dependent on the fold
Some can support many similar functions
P-loop hydrolases
Some have limited repertoire of functions
Globins
1 amino acid change can change the function of a protein
Can share <10% sequence identity but still have the same function in different organisms
Defining function
Biochemical
Conservation
Chemistry?
Substrate?
Product?
Biological
Is cell localisation conserved?
Myoglobin
Haemoglobin
Schema
Enzyme classification system
GO terms
Challenges in comparing protein structures
Insertions or deletions of residues
Usually in connecting loops not secondary structures
Indels
Structural cores usually highly conserved
Can still be considerable structural differences between relatives outside the core
Insertions usually in loops connecting secondary structures
Substitutions can shift orientations of secondary structures
Coping
Ignore variable loop regions
Only compare secondary structures
Use algorithms which explicitly handle insertions/deletions
Dynamic programming
Simulated annealing
CATHEDRAL
Combines a rapid graph-theory secondary structure filter with dynamic programming
Accurate residue alignment
Support vector machine (SVM)
Combine scores
Assess significance of match
Scan against all domain structural representatives in CATH
Fast structure comparison
Dihedral angles + chirality
Create overlap graph
Largest common structural motif
Compare using Bron-Kerbosch algorithm
Largest common graph
Generally ~1000x faster than residue-based methods
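The Bron-Kerbosch step can be sketched as follows. This is the basic recursive form of the algorithm (no pivoting), which enumerates maximal cliques; the largest one then stands in for the largest common structural motif. How the overlap graph is built from secondary structures is not covered here, so the toy graph, the node names and the helper `largest_clique` are assumptions for illustration.

```python
def bron_kerbosch(r, p, x, adj, out):
    """Collect all maximal cliques of the graph `adj` (dict: node -> set of neighbours)."""
    if not p and not x:
        out.append(set(r))  # r cannot be extended, so it is a maximal clique
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p = p - {v}  # v has been fully explored
        x = x | {v}  # exclude v from later cliques at this level


def largest_clique(adj):
    """Return one largest clique, or an empty set for an empty graph."""
    out = []
    bron_kerbosch(set(), set(adj), set(), adj, out)
    return max(out, key=len) if out else set()


# Toy overlap graph: nodes a, b, c are mutually compatible; d is only compatible with c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(sorted(largest_clique(adj)))  # ['a', 'b', 'c']
```

Working on a graph with one node per secondary structure element rather than per residue keeps the graph small, which is consistent with the speed advantage noted above.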