Protein Evolution

Common supersecondary motifs
1. Helix-loop-helix
2. Coiled-coil
3. Helix bundle
4. Beta-alpha-beta unit
5. Hairpin
6. Beta-meander
7. Greek key
8. Beta-sandwich
9. Motifs combine to form domains
  1. Independent folding units in tertiary structure
    1. Individual domains have specific function
      1. The major driving force in their folding is hydrophobic interactions
  2. Important evolutionary units
    1. ~40% of known structures in PDB are multidomain proteins
      1. Of these >40% contain discontinuous domains
        Domain insertion is a common evolutionary mechanism
    2. 60-80% of genes in genomes code for multidomain proteins
  3. Combine with different partners
    1. Extend functional repertoire
Comparing proteins
1. Basic scoring system
  1. Sequence identity % = (number of identical residues / number of residues aligned) x 100
  2. >35% identity is generally taken to mean the proteins are evolutionarily related
2. Superimposition
3. Root mean square deviation
  1. Compares the proximity of residues to one another
    1. < 3.5 Å if related
    2. Make alignment
      1. Sequence
      2. Structure
      3. Calculate distance between C-alpha of each pair of aligned residues
        Pythagoras in 3D
        Square each distance and add together for all residues
        Divide by number of residues
        Take the square root
4. So, if 2 proteins have 99% identity...
  1. They will have a common 3D structure, as long as the identity holds over more than 50 aligned residues
  2. Their function is probably similar, but even a single residue difference can completely change function
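
The two scores above (sequence identity and root mean square deviation) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are made up for this example, gap columns are assumed to be excluded from "residues aligned", and the two coordinate lists are assumed to hold already-superimposed C-alpha positions of equivalent residues.

```python
import math


def percent_identity(aligned_a, aligned_b):
    """Sequence identity % over aligned (gap-free) columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Assumption: a column counts as "aligned" only if neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


def rmsd(coords_a, coords_b):
    """RMSD between equivalent C-alpha positions of two superimposed structures."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords_a, coords_b):
        # Squared distance between the pair (Pythagoras in 3D, before the square root).
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    # Mean of the squared distances, then the square root.
    return math.sqrt(total / len(coords_a))


print(percent_identity("ACDE-G", "ACDFHG"))  # 4 identical out of 5 aligned columns
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)]))
```

In practice the superimposition step comes first; the sketch assumes it has already been done by some other tool.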
Recognising domains
1. DNA sequence
  1. Limited use
    1. Very closely related proteins will have very similar DNA sequences
    2. Perhaps useful for very short evolutionary distances
2. Amino acid sequences
  1. Domains have similar amino acid sequences
  2. As they diverge the sequence pattern is lost
3. Protein structure
  1. Domains have the same fold
    1. Has it been seen before?
  2. Independently folding units
    1. Many computer programs that try to use structural data to identify domains
4. Each domain takes a specific topology/fold
  1. There is a limit to how many folds are possible in nature
    1. ~10^3 to 10^4
      1. Even though there are millions of protein sequences there are only so many different ways that structures can be fitted together!
Why structural similarity?
1. Divergent evolution from common ancestor
  1. If something works it is unlikely to be selected against
  2. Structure is much more highly conserved than sequence
    1. Makes sense: a residue either matches or it does not, but some amino acids share properties, so the structure can be retained even when the sequence is not
2. Convergent evolution
  1. Only so many ways to pack helices and strands in 3D space
  2. Energetically favourable
Protein classification
1. Domains are an important part of structure
  1. Structure is conserved because it determines function
    1. Protein data bank
2. CATH
  1. Class
    1. Assigned automatically for >90%
    2. Major secondary structure
  2. Architecture
    1. Gross orientation of secondary structures
      1. Shape of the fold
    2. Beta-roll
    3. Up-down bundle
    4. Alpha-beta prism
    5. Alpha-beta-alpha sandwich
  3. Topology
    1. Connections
    2. Number of secondary structures
  4. Homologous superfamily
    1. Highly similar structures and functions
    2. Is there enough evidence for shared evolutionary origin?
  5. Process
    1. Chop proteins into domains
    2. Sequence and structural analysis programs group by evolutionary and structural families
  6. CATH domain structures and sequence relatives in sequenced genomes are fairly representative
  7. Applications
    1. "Ab initio" methods using protein structure
      1. Algorithms for recognising boundaries
        Swindells, 1995 - DETECTIVE
          Each domain should have a recognisable hydrophobic core
        Siddiqui and Barton, 1995 - DOMAK
          Residues comprising a domain make more internal contacts than external ones
        Holm & Sander, 1994 - PUU
          Parser for protein folding units
          Maximal interaction within domains
          Minimal interaction between domains
        Seek consensus; in practice ~20% of cases
        E.g. Beta-amylase
          PUU: 1 domain
          DOMAK: 2 domains
          DETECTIVE: 3 domains
    2. Sequence methods
      1. Sequence-sequence comparison
        BLAST
        Needleman-Wunsch (NW)
      2. Sequence-profile comparison
        PSI-BLAST
        Align sequences and colour-code degree of conservation
        Used when identity is <35%
    3. Structural methods
      1. Contact map
        Points of contact between residues in a protein
      2. Distance matrices
3. Structural Classification of Proteins (SCOP)
  1. Class
  2. Fold
  3. Superfamily
  4. Family
  5. Species
4. Protein family database Pfam
  1. Superfamily
  2. Clade
  3. Sequence based
    1. Concurs with SCOP
5. Methods vary but agree on 60-70% of cases
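
The distance matrices and contact maps listed under structural methods, and the DOMAK idea that a domain makes more internal than external contacts, can be illustrated with a small sketch. Everything here is an assumption for illustration: the function names, the 8 Å contact cutoff, and the restriction to a single contiguous split point (real domains can be discontinuous, and the published programs use their own scoring).

```python
import math


def distance_matrix(coords):
    """All-against-all distances between residue positions (e.g. C-alpha atoms)."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def contact_map(coords, cutoff=8.0):
    """Boolean matrix: True where two residues lie within `cutoff` (assumed 8 Å)."""
    return [[d <= cutoff for d in row] for row in distance_matrix(coords)]


def split_contacts(cmap, boundary):
    """Count contacts inside and between the segments [0, boundary) and [boundary, n)."""
    n = len(cmap)
    internal = external = 0
    for i in range(n):
        for j in range(i + 1, n):  # each residue pair counted once
            if cmap[i][j]:
                if (i < boundary) == (j < boundary):
                    internal += 1  # both residues in the same segment
                else:
                    external += 1  # contact crosses the proposed boundary
    return internal, external


# Toy chain: two tight clusters of three residues, 20 Å apart.
coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (20, 0, 0), (21, 0, 0), (22, 0, 0)]
cmap = contact_map(coords)
print(split_contacts(cmap, 3))  # split between the clusters
print(split_contacts(cmap, 2))  # split inside the first cluster
```

A boundary that maximises internal contacts relative to external ones is the kind of split these domain-assignment programs look for.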
Homology
1. An absolute value
  1. Either homologous or not!
2. Orthologs
  1. Common ancestor
    1. Different species
    2. Same function
3. Paralogs
  1. Common ancestor
    1. Gene duplication
    2. Same or different species
    3. Different but related function
4. At least 2 of the following 3 conditions must be met:
  1. Significant structural similarity
  2. Significant sequence similarity
  3. Functional similarity
5. e.g. toxins
  1. Cholera
  2. Pertussis
  3. Heat-labile enterotoxin
  4. High structural similarity
    1. Related functions
    2. No evidence that they evolved from a common ancestor
Rossmann fold
1. NAD binding domain
  1. Cofactor that reversibly accepts a hydride ion
    1. Lost or gained by the substrate in the redox reaction
    2. Found in all living cells
      1. Metabolism
        Accepts or donates electrons in redox
        Remove two hydrogen atoms from reactant (R)
        Hydride ion H-
        Reduces NAD+ to NADH
        Proton H+
        Released into solution
    3. 2 nucleotides joined through their phosphate groups
      1. One contains an adenine base
      2. The other nicotinamide
2. One of the most ubiquitous domains
3. Alpha-beta fold
  1. Central beta sheet
    1. Surrounded by approximately 5 alpha helices
    2. Strands in characteristic order 654123
      1. Crossover forms binding site
4. Lactate dehydrogenase
  1. Has a Rossmann fold at N-terminal domain
  2. Interconverts L-lactate and pyruvate
    1. Pyruvate to lactate is the last step in anaerobic glycolysis
  3. C-terminal catalytic domain
    1. Substrate specificity
      1. Precise reaction
    2. Specific to lactate/malate dehydrogenases
      1. Malate dehydrogenase
        Interconversion of malate and oxaloacetate
        N-terminal domain is a Rossmann fold
        C-terminal catalytic domain
      2. Paralogs
        17% sequence identity in humans
        Duplication event
        BUT structurally very similar
        Function of NAD-binding domain conserved
        Change in sequence confers change in substrate specificity
  4. Human/zebrafish orthologs
    1. 76% sequence identity
      1. Suggests speciation from common ancestor
5. Alcohol dehydrogenase
  1. Two ADCDs flank a Rossmann fold
  2. Same structure as LDH
    1. 17% identity
6. Combine with different catalytic domains to achieve different functions
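
The redox role of NAD described at the start of this section can be written as a balanced equation. The first line is the general case for a reactant RH2 (the notation RH2 is chosen here for illustration); the second is the lactate dehydrogenase reaction as an instance of it.

```latex
\[
\mathrm{RH_2} + \mathrm{NAD^+} \;\longrightarrow\; \mathrm{R} + \mathrm{NADH} + \mathrm{H^+}
\]
\[
\text{L-lactate} + \mathrm{NAD^+} \;\rightleftharpoons\; \text{pyruvate} + \mathrm{NADH} + \mathrm{H^+}
\]
```

The hydride ion (H-) goes to NAD+ and the remaining proton (H+) is released into solution, so both hydrogen atoms removed from the reactant are accounted for.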
Sequence diversity
1. Different evolutionary constraints in different positions in the protein structure
  1. Surface residues have the least evolutionary constraints and can also accommodate small insertions and deletions
  2. Core residues more highly conserved, critical for folding and stability
  3. Functional residues also highly conserved
    1. Critical for enzyme function or for protein-protein interactions
2. Structural diversity
  1. The core is usually highly conserved
    1. Within a family ~50%
  2. Residue insertions in loops connecting secondary structures
  3. Substitutions can cause shifts in orientations of secondary structure
  4. Domain embellishments
3. Functional diversity
  1. Domain superfamily
    1. Dependent on the fold
      1. Some can support many similar functions
        P-loop hydrolases
      2. Some have limited repertoire of functions
        Globins
  2. 1 amino acid change can change the function of a protein
  3. Can share <10% sequence identity but still have the same function in different organisms
  4. Defining function
    1. Biochemical
      1. Conservation
        Chemistry?
        Substrate?
        Product?
    2. Biological
      1. Is cell localisation conserved?
        Myoglobin
        Haemoglobin
    3. Schema
      1. Enzyme classification system
      2. GO terms
Challenges in comparing protein structures
1. Insertions or deletions of residues (indels)
  1. Usually in loops connecting secondary structures, not in the secondary structures themselves
  2. Structural cores usually highly conserved
    1. Can still be considerable structural differences between relatives outside the core
  3. Substitutions can shift orientations of secondary structures
  4. Coping
    1. Ignore variable loop regions
      1. Only compare secondary structures
    2. Use algorithms which explicitly handle insertions/deletions
      1. Dynamic programming
      2. Simulated annealing
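
To show how dynamic programming handles insertions and deletions explicitly, here is a minimal Needleman-Wunsch-style global alignment for sequences. It is a sketch under simple assumptions: the match, mismatch and gap scores are arbitrary illustrative values, whereas real protein comparisons would normally use a substitution matrix and tuned gap penalties.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (aligned_a, aligned_b, score)."""
    n, m = len(seq_a), len(seq_b)
    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in seq_b (deletion)
                score[i][j - 1] + gap,      # gap in seq_a (insertion)
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


print(needleman_wunsch("GATTACA", "GATACA"))
```

The single deleted residue shows up as one gap character in the shorter sequence; the rest of the alignment is unaffected, which is the behaviour wanted when loops differ in length.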
CATHEDRAL
1. Combines rapid graph theory secondary structure filter with dynamic programming
  1. Accurate residue alignment
  2. Support vector machine (SVM)
    1. Combine scores
      1. Assess significance of match
  3. Scan against all domain structural representatives in CATH
2. Fast structure comparison
  1. Dihedral angles + chirality
    1. Create overlap graph
      1. Largest common structural motif
      2. Compare using Bron-Kerbosch algorithm
        Largest common subgraph
  2. Generally ~1000x faster than residue-based methods
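
The Bron-Kerbosch algorithm mentioned above enumerates maximal cliques in a graph. The sketch below is the basic textbook version without pivoting, applied to a toy graph. How CATHEDRAL builds its overlap graph from secondary structure elements is not reproduced here; the node labels and the `largest_clique` helper are illustrative assumptions.

```python
def bron_kerbosch(graph):
    """Return all maximal cliques of an undirected graph {node: set_of_neighbours}."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that can extend it, x: already-processed nodes
        if not p and not x:
            cliques.append(r)  # nothing can extend r, so it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}  # v has been fully explored
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


def largest_clique(graph):
    """Largest maximal clique (ties broken arbitrarily); empty set for an empty graph."""
    return max(bron_kerbosch(graph), key=len, default=set())


# Toy graph: a triangle a-b-c with an extra node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(largest_clique(graph))
```

If each node stands for a pair of equivalent elements from two structures and each edge for a compatible geometric relationship, the largest clique corresponds to the largest mutually consistent set of matches.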
SSAP
1. Residue-based method
  1. Double dynamic programming
2. Compare vector environments
  1. Path matrix
    1. Generate path for each
      1. Compare with dynamic programming
        Add path to summary matrix
        Apply dynamic programming to summary matrix
        Final alignment
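
The double dynamic programming scheme outlined above can be sketched in heavily simplified form. This is not the published SSAP implementation. The assumptions are: each residue's "environment" is reduced to its distances to all other residues (rather than vectors in a local frame), the lower-level path is not forced through the residue pair being compared, the scoring constants `a` and `b` are arbitrary, and gaps are unpenalised by default.

```python
import math


def align_path(score, gap=0.0):
    """Global dynamic programming over a score matrix; returns the matched (i, j) cells."""
    n, m = len(score), len(score[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] - gap
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score[i - 1][j - 1],
                best[i - 1][j] - gap,
                best[i][j - 1] - gap,
            )
    path = []
    i, j = n, m
    while i > 0 and j > 0:  # trace back, recording only matched cells
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            path.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif best[i][j] == best[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path


def ssap_like(coords_a, coords_b, a=50.0, b=2.0, gap=0.0):
    """Simplified double dynamic programming alignment of two coordinate lists."""
    n, m = len(coords_a), len(coords_b)
    da = [[math.dist(p, q) for q in coords_a] for p in coords_a]
    db = [[math.dist(p, q) for q in coords_b] for p in coords_b]
    summary = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # Lower level: compare the view from residue i with the view from residue j.
            lower = [[a / (abs(da[i][k] - db[j][l]) + b) for l in range(m)]
                     for k in range(n)]
            # Add the best lower-level path to the summary matrix.
            for k, l in align_path(lower, gap):
                summary[k][l] += lower[k][l]
    # Upper level: dynamic programming on the summary matrix gives the final alignment.
    return align_path(summary, gap)


print(ssap_like([(0, 0, 0), (3, 0, 0)], [(0, 0, 0), (3, 0, 0)]))
```

The sketch runs one lower-level alignment per residue pair, so it is only practical for very small inputs; it is meant to show the two levels of dynamic programming, not to be used on real structures.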

Created by Jen Harris more than 11 years ago