Next Generation Sequencing

Sequencing: Investigating and establishing which nucleotides make up a chain of genetic material (DNA/RNA).

Whole Genome Sequencing: Establishing the entire genetic makeup of an organism. Note that plasmids are included in this analysis when determining the genetic makeup of an organism.

First Generation Sequencing: Sanger Sequencing

Developed by Frederick Sanger in 1977, involving in vitro DNA synthesis. Based on the principle of DNA replication.

Process:

First, amplify the chosen section of DNA. Denature the double-stranded DNA into single strands using heat. Add a primer to each strand of ssDNA (a short oligonucleotide sequence complementary to the sequence at the 3' end of the template, which serves as the starting point for DNA synthesis in the 5'-->3' direction). This primer provides the initial 3' hydroxyl group required to form a phosphodiester bond with the first deoxynucleotide. Dideoxynucleotides, however, have this hydroxyl group replaced with a hydrogen, so no further nucleotide can be added after them; they are thus known as chain terminators. Sanger sequencing is therefore also referred to as the chain termination method of DNA sequencing.

The primed DNA is then divided equally between four separate reactions, and all four types of deoxynucleotide, one type of dideoxynucleotide and DNA polymerase are added to each. The dideoxynucleotide used is the only difference between the four reactions. Because dideoxynucleotides are incorporated at random points within each reaction, this produces many partially replicated strands of complementary DNA, ending at each individual nucleotide in the DNA sequence.

Polyacrylamide gel electrophoresis (PAGE) is then used to separate the partially replicated strands of DNA based on their size. The smallest fragments migrate the furthest in the gel and the largest fragments the least.
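As a rough illustration of the four-reaction scheme, the sketch below lists the fragment lengths each dideoxynucleotide reaction could produce and then reads the bands from shortest to longest fragment. It is a simplification under stated assumptions: the template sequence and all function names are hypothetical, primer length is ignored, and every possible termination point is assumed to be represented in the gel.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def synthesised_strand(template_3_to_5):
    """Return the strand built by DNA polymerase, written 5'->3'.

    The template is supplied 3'->5', so base i of the template pairs
    with base i of the new strand.
    """
    return "".join(COMPLEMENT[base] for base in template_3_to_5)


def termination_fragments(new_strand):
    """One reaction per dideoxynucleotide (ddA, ddC, ddG, ddT).

    A fragment can end wherever that base occurs in the new strand,
    so each occurrence contributes one fragment length.
    """
    reactions = {base: [] for base in "ACGT"}
    for position, base in enumerate(new_strand, start=1):
        reactions[base].append(position)
    return reactions


def read_gel(reactions):
    """Order bands from shortest to longest and read off the bases."""
    bands = sorted(
        (length, base)
        for base, lengths in reactions.items()
        for length in lengths
    )
    return "".join(base for _, base in bands)


template = "TACGGATC"  # hypothetical template, written 3'->5'
new_strand = synthesised_strand(template)
reactions = termination_fragments(new_strand)
sequence = read_gel(reactions)

print(reactions)  # fragment lengths per reaction
print(sequence)   # sequence of the synthesised strand, 5'->3'
```

The sequence recovered is that of the newly synthesised strand, which is the complement of the template.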
Using autoradiography (for radioactively labelled fragments) or fluorescence detection (for fluorescently tagged nucleotides), each terminating nucleotide can be identified in the PAGE gel. By reading the gel from bottom (smallest fragment = start of DNA sequence) to top, you are able to determine the specific nucleotide present at each point in the sequence.

Second Generation Sequencing: Next Generation Sequencing

This type of sequencing, also known as deep or high-throughput sequencing, is a cell-free system which does not require bacterial plasmids for DNA amplification.

Process:

Library Preparation and Amplification: Let's use the example of blood being extracted from a human volunteer, which is then to be sequenced using NGS. First, DNA must be synthesised from the RNA in the sample. There are multiple ways of doing this; one of the most commonly used is described here. Reverse transcriptase enzymes synthesise the first strand of DNA, before the second strand is generated with deoxyuridine triphosphate (dUTP) incorporated. A single adenine overhang is left to allow one of two sequencing-compatible primers to be attached. Following this, the uracil-containing strand is destroyed, leaving the other single strand to be 'clustered'.

The DNA library must then be prepared via 'clustering', a flow cell-mediated process by which complementary DNA is amplified. In laboratory work in the Centre for Virus Research, Glasgow, the KAPA low throughput library preparation kit is typically used. Each flow cell is coated with a large quantity of two different oligonucleotide adaptors, one for each of the two sequencing-compatible primers mentioned previously. When the ssDNA is added to the flow cell, it attaches to one of the two types of adaptor, before DNA polymerase generates a second strand of DNA from its sequence.
The original strand is washed off and the newly generated strand folds over, forming a 'hybridisation bridge' with the other type of oligonucleotide adaptor on the flow cell. DNA polymerase then synthesises another strand of DNA from this hybridisation bridge. The now double-stranded DNA is then denatured, resulting in two single strands of DNA, which then repeat the above process millions of times in what is known as bridge amplification. To complete the amplification process, the reverse strands are washed away and the 3' ends of the forward strands are blocked, thereby preventing undesirable priming reactions.

Sequencing: There are a few different NGS sequencing methods; the main one is described here, the Illumina NextSeq sequencing method. Firstly, the sequencing-compatible primers are extended by addition of the sequencing reaction mixture (containing four fluorescently tagged reversible terminating nucleotides as well as DNA polymerase). This yields the first of two reads. As each terminator is incorporated into the DNA strand, it generates a fluorescent signal. The wavelength, intensity and exact position of this signal are recorded by the onboard charge-coupled device (CCD) camera. This happens on a massive scale, with millions of reads being sequenced at one time, allowing large datasets to be processed with relative ease and speed.

Once all templates are filled, the first read is finished and the product and the 3' block are washed off, allowing the strands to fold over and bind to the second oligonucleotide adaptor. DNA polymerase synthesises a complementary strand and the first strand is subsequently washed off, before the second sequencing read is produced by the same process as the first.

Data Export: The analysis software exports the data in the form of base call files, which are then converted to FASTQ files for bioinformatic analysis.
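To make the FASTQ format concrete, the sketch below parses records of the four-line form (identifier, bases, '+' separator, quality string) and converts the quality characters into Phred scores. It is illustrative only: the example read is made up, the names `FastqRecord`, `parse_fastq` and `error_probability` are hypothetical, and the ASCII offset of 33 is an assumption about the quality encoding that these notes do not specify.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str       # text after the leading '@'
    sequence: str         # base calls
    qualities: List[int]  # one Phred score per base


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text.

    `offset` is the ASCII offset of the quality encoding; 33 is an
    assumption made for this sketch.
    """
    rows = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for start in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record number {start // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


def error_probability(phred: int) -> float:
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# Made-up record for illustration only.
example = """@read_1
GATTACA
+
IIIIII#
"""

records = list(parse_fastq(example.splitlines()))
for record in records:
    worst = min(record.qualities)
    print(record.identifier, record.sequence, record.qualities)
    print("lowest quality:", worst, "error probability:", error_probability(worst))
```

A Phred score of 40 corresponds to an estimated 1 in 10,000 chance that the base call is wrong, so low-scoring bases such as the last one in this example are the usual candidates for trimming before analysis.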
Each record in a FASTQ file is made up of four lines which can be analysed by a researcher:

The first line always begins with '@' and displays the identifying tag of the DNA sequence.
The second line displays the specific nucleotides which have been recorded at each position in the sequence.
The third line usually takes the simple form of '+', separating the second and fourth lines to avoid confusion.
The fourth line displays the quality score for each base call recorded by the sequencer, expressed as Phred scores.

Third Generation Sequencing: Single Molecule

This type of sequencing (also known as long-read sequencing) reads DNA sequences by scanning the sequence at the single-molecule level, instead of separating strands into smaller segments as is currently required for first and second generation sequencing.

Advantages:

This would overcome the barrier posed by the middle sections of NGS-read DNA not being read properly, as well as the limited quality of the reads at the start and end of a segment.

This will allow for much longer reads than other currently available technologies, as the DNA does not have to be segmented before sequencing.

Minimal sample preparation and processing is required prior to single molecule sequencing, so the sequencing process can be a lot faster and cheaper, making sequencing more accessible.

The equipment is much smaller; for example, the MinION device currently available is the size of a USB stick.

Challenges:

The error rates of nucleotide identification are currently much higher than in NGS, and the technology is therefore not yet considered accurate enough for widespread use. This is typically caused by instability of some of the components; one reported example is the DNA polymerase becoming damaged the more the sequencer is used.
Due to the high speed of the process, fluorescent signals given off by nucleotides can be blurred by neighbouring signals, bringing about inaccurate results. Computational approaches such as Hidden Markov Models have been highlighted as potential ways around this issue.

Genome Assembly

De Novo Assembly: Uses short reads produced by NGS to build contiguous segments (contigs) of the original genome sequence by identifying segments of significant overlap between reads. A 'scaffold' is a set of contigs joined together in an attempt to reconstruct the original sequence.

The N50 statistic is the length of the contig at which, when contigs are added up from largest to smallest, the running total first reaches or exceeds 50% of the total assembly length.

"Given a set of contigs, the N50 is defined as the sequence length of the shortest contig at 50% of the total assembly length. It can be thought of as the point of half of the mass of the distribution; the number of bases from all contigs longer than the N50 will be close to the number of bases from all contigs shorter than the N50. For example, consider 9 contigs with the lengths 2, 3, 4, 5, 6, 7, 8, 9, and 10; their sum is 54, half of the sum is 27, and the size of the assembly also happens to be 54. 50% of this assembly would be 10 + 9 + 8 = 27 (half the length of the sequence). Thus the N50 = 8, which is the size of the contig which, along with the larger contigs, contain half of sequence of a particular assembly (NOT genome, as the assembly may only be of a few contigs, or the reference genome may not be available)." (Quoted; not my own words.)

The NG50 statistic is calculated in the same way, except that the running total must reach or exceed 50% of the estimated/reference genome length rather than 50% of the assembly length.

Read Mapping: Aligns reads against a reference sequence for comparison, ideally generating a sequence identical to the original, although in practice it often gives a sequence which is only similar.
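Returning to the N50 and NG50 statistics defined under De Novo Assembly, the calculation can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function name `n50` and its `target_length` parameter are hypothetical, and the genome length of 60 used for the NG50 call is invented purely for the example.

```python
from typing import Iterable, Optional


def n50(contig_lengths: Iterable[int],
        target_length: Optional[int] = None) -> Optional[int]:
    """Return the N50 of a set of contigs.

    If `target_length` (an estimated or reference genome length) is
    given, half of that value is used as the threshold instead of half
    of the assembly length, which yields the NG50.
    Returns None if the contigs never reach the threshold.
    """
    lengths = sorted(contig_lengths, reverse=True)  # largest contig first
    total = sum(lengths) if target_length is None else target_length
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:  # running total has reached 50%
            return length
    return None


contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # lengths from the quoted example

print(n50(contigs))                     # N50  -> 8
print(n50(contigs, target_length=60))   # NG50 -> 7 (hypothetical genome length)
```

With the nine contigs from the quoted example, 10 + 9 + 8 = 27 reaches half of the assembly length of 54, so the N50 is 8. Against the invented genome length of 60 the threshold rises to 30, one more contig is needed, and the NG50 drops to 7.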

Created by Matthew Coulson almost 5 years ago