Bioinformatics

Bioinformatics History

1995: First whole genome sequenced (Haemophilus influenzae)
1996: First whole genome of a eukaryotic organism sequenced (Saccharomyces cerevisiae)
2001: First human genome sequenced (in draft form)

Basic Local Alignment Search Tool (BLAST)

BLAST is a tool which conducts local alignments, finding regions of similarity such as functional domains which are conserved (at least in part) across a variety of sequences/organisms. Its algorithms allow a query to be compared against all known sequences in NCBI GenBank, and it is not limited to like-for-like comparisons: for example, a sample of mRNA can be queried in BLAST and matched to a segment of genomic DNA.

The most commonly used BLAST programs are BLASTx and BLASTn. Other programs also exist, such as:

BLASTp: Takes a protein sequence as the query and generates a list of database entries with homologous protein sequences.
tBLASTn: Compares a protein query against all 6 reading frames of the nucleotide database (one of the slowest BLAST types).
PSI-BLAST: Used to identify relatives of a specific protein, as it is more effective at detecting distant evolutionary relationships.

BLASTx: Translates each nucleotide read in all 6 reading frames and compares the translations against a protein database for amino acid homology. It is more sensitive to distant homology than BLASTn, because protein sequences are more conserved than the nucleotide sequences encoding them, but the six-frame translation typically makes it slower.

BLASTn: Compares each read against the non-redundant NCBI GenBank nucleotide database for nucleotide homology. It is more specific and typically faster than BLASTx, but less sensitive to distant homology.

Pairs of segments with high levels of homology are known as 'high-scoring segment pairs' (HSPs).

BLAST bit score = A normalised quality measure derived from the raw alignment score, typically used for quantitative comparison of results. A higher score indicates a better alignment to the query sequence.

The Expectation (E) value is a related quantitative indicator: an estimate of the number of alignments with scores greater than or equal to the raw alignment score that are expected to occur by chance. A lower E value indicates a more significant result.
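The relationship between raw score, bit score and E value described above can be sketched with the standard Karlin-Altschul formulas. This is an illustrative sketch only: the function names are made up for this example, and the LAMBDA and K values are assumed, typical-looking parameters for a gapped protein search rather than values taken from these notes. Real BLAST also applies length corrections that are ignored here.

```python
import math

# Assumed statistical parameters for the scoring system (illustrative only;
# BLAST reports the actual lambda and K values at the end of each search).
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalise a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits scoring at least this well: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    # Hypothetical search: a 250-residue query against a database of 10^9 residues.
    for raw in (50, 100, 200):
        bits = bit_score(raw)
        print(f"raw={raw:>3}  bits={bits:6.1f}  E={e_value(bits, 250, 1_000_000_000):.2e}")
```

Running the loop shows the pattern stated in the notes: as the raw score rises, the bit score rises and the E value falls sharply.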
In terms of BLASTn, Max Identity reports the highest percentage of nucleotides in a sequence alignment which are identical between the query and the matched database sequence.

Substitution Matrix

BLAST utilises observed substitution patterns in order to predict homologous alignments for a query sequence, by which a score is attributed for aligning any feasible pair of amino acids (BLASTx) or nucleotides (BLASTn). Such scores represent the likelihood of a respective amino acid/nucleotide being substituted for another. They can also be thought of as reflecting the rate at which an amino acid/nucleotide is substituted for another over a set period.

The BLAST programs mainly utilise the BLOSUM (BLOcks SUbstitution Matrix) substitution matrices (aside from the nucleotide programs, which use simple match/mismatch scores rather than substitution matrices). These were published in 1992 by Steven and Jorja Henikoff. It has been shown that the BLOSUM-62 matrix, which is the default substitution matrix used by the protein-based BLAST software, is one of the most effective at identifying protein similarities.

PAM (Point Accepted Mutation) substitution matrices may also be used in homology searches. These matrices were published by Margaret Dayhoff in 1978 and are based upon the analysis of over 1500 mutations in the phylogenetic trees of ~71 protein families.

The main difference between PAM and BLOSUM matrices is that the former are derived from whole alignments of closely related proteins and extrapolated to greater evolutionary distances, whereas the latter are based on substitutions observed directly in segments (or 'blocks'), the most similar regions present within the sequences being compared.

Multiple Sequence Alignments

Clustal Omega is one of the most widely used tools available for multiple sequence alignment. The first version of the Clustal software was created by Higgins in 1988. It creates a distance matrix, uses this to build a guide tree, and then uses the guide tree to progressively build a multiple alignment.
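The first step of that process, building a pairwise distance matrix, can be illustrated with a toy sketch. The assumptions here are loud ones: the sequences are made up, they are taken to be already aligned and of equal length, and the distance used is the simple proportion of differing sites (p-distance). This is not the distance measure Clustal Omega itself uses; it only shows what a distance matrix is and how the closest pair, which a guide tree would join first, falls out of it.

```python
def p_distance(a, b):
    """Proportion of differing positions between two pre-aligned sequences.

    Positions where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be the same (aligned) length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 1.0  # nothing comparable: treat as maximally distant
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(seqs):
    """Return a nested dict: matrix[name1][name2] -> p-distance."""
    names = list(seqs)
    return {n1: {n2: p_distance(seqs[n1], seqs[n2]) for n2 in names} for n1 in names}


def closest_pair(matrix):
    """Find the two different sequences with the smallest distance."""
    names = list(matrix)
    pairs = [(matrix[a][b], a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    dist, a, b = min(pairs)
    return a, b, dist


# Hypothetical toy sequences, for illustration only.
SEQS = {"seqA": "ACGTACGT", "seqB": "ACGTACGA", "seqC": "TCGAACTT"}
MATRIX = distance_matrix(SEQS)

if __name__ == "__main__":
    for name, row in MATRIX.items():
        print(name, [round(d, 3) for d in row.values()])
    print("join first:", closest_pair(MATRIX))
```

Here seqA and seqB differ at one of eight sites (distance 0.125), so they would be grouped first in a guide tree, with seqC joining afterwards.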
Protein Structure Prediction

JPred4: Estimates the secondary structure of a protein using the JNet algorithm and presents its data either visually in Jalview or as a text file.

SWISS-MODEL: Homology modelling of 3D protein structure based on templates identified by BLAST searches and stored in the Protein Data Bank (PDB). As well as predicting secondary and tertiary structure, it is also able to give a prediction of the quaternary structure of the models it produces. You must submit your sequence in FASTA format (if this comes up in the exam, remember to briefly describe FASTA vs FASTQ formats for extra marks).

QMEAN is the confidence score used by SWISS-MODEL to assess the likelihood of its prediction being accurate. A prediction with a QMEAN score of less than -4.00 is considered to be inaccurate.

GMQE (Global Model Quality Estimation) assesses the quality of the model built using a specific alignment and template which has a certain coverage of the query sequence. A higher GMQE suggests a more reliable model.

QSQE (Quaternary Structure Quality Estimate) assesses the quality of the predicted quaternary structure of the given model.

Resultant models can then be viewed using the PyMOL software.

Phyre2: Protein Homology/AnalogY Recognition Engine that predicts secondary and tertiary structure. Gives a confidence score and a percentage coverage showing how much of the query sequence is covered by the alignment to the template.

Process: It first scans the protein sequence against protein databases using PSI-BLAST, before using PsiPred to predict secondary structure. It then builds a Hidden Markov Model (HMM) of the sequence. These models capture the tendency of each amino acid to mutate, based on previously analysed similar sequences.
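On the FASTA vs FASTQ point mentioned above for SWISS-MODEL submissions: a FASTA record is a '>' header line followed by the sequence, whereas a FASTQ record has four lines ('@' header, sequence, '+' separator, and a quality string with one character per base). The sketch below is a minimal, hedged illustration of the difference. The parser functions and example records are made up for this note, and the quality decoding assumes the common Phred+33 encoding; for real work a tested library would be the safer choice.

```python
def parse_fasta(text):
    """Return {name: sequence} from FASTA text ('>' header, then sequence lines)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header is the ID
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequences may wrap over several lines
    return {n: "".join(parts) for n, parts in records.items()}


def parse_fastq(text):
    """Return [(name, sequence, phred_scores)] from simple 4-line FASTQ records."""
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    records = []
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33 encoding (ASCII code minus 33 gives the quality score).
        phred = [ord(c) - 33 for c in qual]
        records.append((header[1:].split()[0], seq, phred))
    return records


# Hypothetical example records, for illustration only.
FASTA_TEXT = ">query1 example protein\nMKTAYIAK\nQRQISFVK\n"
FASTQ_TEXT = "@read1\nACGT\n+\nII5!\n"

if __name__ == "__main__":
    print(parse_fasta(FASTA_TEXT))
    print(parse_fastq(FASTQ_TEXT))
```

The key exam point is visible in the output: FASTA carries only an identifier and a sequence, while FASTQ additionally carries a per-base quality score, which is why it is used for raw sequencing reads rather than for submitting a protein sequence to a modelling server.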
I-TASSER: Iterative Threading ASSEmbly Refinement

The C-score assesses the confidence of the program in the models it has predicted, calculated from the significance of the threading templates used as well as the convergence of the structural predictions.

Process: Locates structural templates by threading the query sequence against the Protein Data Bank using the Local Meta-Threading Server (LOMETS), before assembling and refining its best predicted models from those templates. The function of the structure is then predicted by comparing the model against a protein function database.

I-TASSER is currently recognised as one of the most extensive and reliable publicly available protein structure prediction programs and has been recognised as such at 6 of the biennial Critical Assessment of protein Structure Prediction (CASP) competitions.

Interestingly, at the last CASP competition, I-TASSER was blown out of the water by the AlphaFold program produced by DeepMind, Google's AI company. It made the best structure prediction for 25 out of the 43 proteins given to the competitors. To put that into perspective, the previous best attempt was 3/43 by the Zhang group (producers of I-TASSER) 2 years previously. Despite this, at the time of writing AlphaFold has not been made publicly available, and there is a danger that Google may keep it to themselves and find a way to monetise protein structure prediction, making accurate scientific research in this area ever more expensive and decreasing its accessibility to groups with insufficient funding.

	Created by Matthew Coulson almost 5 years ago