November 2017 – Melissa Hernandez's Molecular Ecology Lab Notebook

Lab 14 – Genomics

In this lab, we began by downloading tutorials for reference-based assembly and for de novo assembly. Once downloaded, files were imported into Geneious.

The reference-based assembly worked with a dataset of Illumina sequence reads that map to a single gene in the E. coli genome. The first step was trimming the data using the default settings. Next, the trimmed reads and the reference sequence (yghJ CDS) were assembled using Map to Reference. The assembly generated a contig of the reads mapped to the reference, along with an assembly report. Regions of high and low coverage were then examined on the reference sequence and the consensus sequence. The next step was detecting SNPs in the mapped data using Find Variations/SNPs under the Annotate/Predict menu, which added an annotation track to the reference sequence. A table of all the annotations on the sequence, including the polymorphism annotations, was also generated. Finally, SNPs were filtered based on their overlap with another annotation track or annotation type. Specifically, SNPs that fell in regions of low coverage were filtered out. Below are responses to questions regarding the reference-based assembly tutorial.

1. Five reads that had their ends trimmed due to low quality bases were #10 (185658/2), #11 (55687/1), #18 (191505/2), #23 (135909), and #53 (190007/2).
2. I think paired-end reads are more important for de novo assembly. Reference-based assembly uses an already sequenced genome as a guide for locating reads; in other words, the reads are aligned to the reference sequence. With de novo assembly there is no prior sequence knowledge, so paired-end reads, which provide sequence from both ends of each fragment, supply information that helps place reads relative to one another.
3. The reads mapped to the reference relatively quickly. According to the yghJ paired Illumina reads assembled to yghJ CDS (divergent reference) Report, it took about 7.02 seconds.
4. When the yghJ paired Illumina reads were assembled to the yghJ CDS (divergent reference) sequence, 5,058 of 5,060 reads were assembled. Coverage spanned 4,581 bases, with a mean coverage of about 98.1, a maximum of about 139, and a minimum of 1. The following intervals of the assembly had the lowest coverage: 1 – 43, 121 – 173, 294 – 310, 4303, 4338 – 4401, 4435, 4499 – 4581. These intervals were located at either the beginning or the end of the consensus sequence, so there might have been too few reads to accurately identify the bases at these ends.
5. When the consensus was switched from “100% Identical” to “Highest Quality,” the following four sites changed: #20 (R → A), #38 (M → C), #84 (W → T), #127 (R → G). In terms of polymorphism, each of these ambiguity codes indicates that two different bases occur at that site among the reads (for example, R stands for A or G), so these sites are variable.
6. One region in the sequence with coverage more than 2 standard deviations below the mean was 4467 – 4483. We would not want to call SNPs in this region because there was too much variation in it. With SNPs, we are only looking for a site where there is a single base difference. In this region, however, it is possible that an incorrect fragment was added, even though there was a single base pair difference.
7. Two CDS positions with a transition mutation were CDS position 4,575 (A to G) and CDS position 4,464 (A to G). Two sites with a transversion mutation were CDS position 4,485 (G to T) and CDS position 3,614 (C to A). The latter transversion had an effect on the protein; the former had no effect on the protein.
8. Below is a screenshot of the polymorphism table.

9. One region of low coverage that was excluded using the “Compare Annotations” tool in Geneious without excluding any SNPs was region 4,499 – 4,581. In this region, 12 SNPs were not excluded. One region where SNPs were excluded was 4,438 – 4,581.
10. Below is a screenshot of the annotated reference genome with SNP calls.
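
Two of the checks in the answers above (classifying a substitution as a transition or transversion, and flagging regions whose coverage falls more than 2 standard deviations below the mean) can be sketched in Python. This is only an illustration and not how Geneious computes them; the function names are hypothetical, the coverage values in the usage note are made up, and the population standard deviation is an assumption (Geneious may use a different estimator).

```python
from statistics import mean, pstdev

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or {ref, alt} - (PURINES | PYRIMIDINES):
        raise ValueError("expected two different bases from A, C, G, T")
    # Purine <-> purine or pyrimidine <-> pyrimidine is a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def low_coverage_intervals(coverage, n_sd=2.0):
    """Return 1-based inclusive (start, end) runs with coverage < mean - n_sd * SD.

    Assumption: the population standard deviation is used.
    """
    threshold = mean(coverage) - n_sd * pstdev(coverage)
    intervals, start = [], None
    for pos, depth in enumerate(coverage, start=1):
        if depth < threshold:
            if start is None:
                start = pos  # a low-coverage run begins here
        elif start is not None:
            intervals.append((start, pos - 1))
            start = None
    if start is not None:  # run extends to the last position
        intervals.append((start, len(coverage)))
    return intervals
```

For example, `classify_substitution("A", "G")` gives a transition and `classify_substitution("G", "T")` a transversion, matching the calls in answer 7. With a toy coverage list such as `[100] * 20 + [1] * 2 + [100] * 20`, `low_coverage_intervals` flags positions 21 to 22.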

The de novo assembly tutorial used short-read next-generation sequencing data to perform a de novo assembly of a section of the Staphylococcus aureus genome. First, the reads were assembled de novo, which produced an assembly with 4 contigs. To see how the contigs align to the original sequence, they were mapped using Map to Reference under the Align/Assemble menu. In the region around 90,000 there was no reconstructed contig, which is why the two longest contigs could not be joined. Next, two sets of reads were combined into a single paired reads file. Once the paired reads file was created, De Novo Assemble was selected from the Align/Assemble menu. The resulting consensus sequence was mapped to the NC_009487 reference sequence. This generated a final contig that was nearly full length, with a few positions that were ambiguous due to errors in the original data. The bases were corrected using Find Variants/SNPs under the Annotate and Predict menu. Ambiguous bases in the consensus sequence, such as “R,” were changed according to a “0% – Majority” threshold. The final step in the tutorial was remapping the new consensus sequence to the NC_009487 sequence.
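
The idea behind replacing an ambiguous base such as “R” with the majority base can be sketched as follows. This is a simplified illustration, not the Geneious consensus algorithm (which can also take base quality into account); `majority_base` is a hypothetical helper, and the table lists only the standard two-base IUPAC ambiguity codes.

```python
from collections import Counter

# Standard IUPAC codes that stand for exactly two bases.
IUPAC_TWO_BASE = {"R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT"}
BASES = {"A", "C", "G", "T"}


def majority_base(column):
    """Return the most common base in one alignment column.

    Gaps and ambiguity codes in the reads are ignored; ties go to the base
    seen first. Returns 'N' if the column has no unambiguous bases.
    """
    counts = Counter(b.upper() for b in column if b.upper() in BASES)
    if not counts:
        return "N"
    return counts.most_common(1)[0][0]
```

For a column where the reads show A, A, A, G, a strict consensus would report R (A or G), while `majority_base("AAAG")` returns `"A"`.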

In the assembly, 25,172 reads were assembled into 4 contigs. The largest contig (contig 1) was assembled from 17,994 reads. The shortest contig (contig 4), assembled from 42 reads, was 512 bp long. The mean length of contigs >1,000 bp was 133,444 bp. Finally, the N50 score for this assembly was 1.
The de novo assembly took about 17.06 seconds, slightly longer than the reference-based assembly. Contig 4 had the lowest maximum coverage and contig 1 had the highest maximum coverage.
The region that had no coverage was 90,759 – 92,269. This region was where the two longest contigs could not be joined.
The de novo assembly with paired data took about 5.61 seconds; this assembly took less time than the two previous assemblies.
Correcting the Paired Reads Assembly sequence was necessary because it contained ambiguous bases. These sites might have been the product of poor assembly or read errors. Correcting the bases ensures that the consensus sequence agrees with the reads it was built from and that each position contains the most common base among those reads.
The final assembly sequence that was constructed using the paired reads sequence and the NC_009487 extraction sequence was 285,156 bp long. Below is a screenshot of the final consensus sequence that was generated.
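
For reference, the N50 statistic mentioned in the assembly answers can be computed from a list of contig lengths as in the sketch below. The function is a generic illustration, not the Geneious report code, and the contig lengths in the usage note are hypothetical apart from the 512 bp contig. The second returned value is the number of contigs needed to reach half of the total assembly length, which may be what a reported value of 1 refers to (an assumption).

```python
def n50(lengths):
    """Return (N50 length, number of contigs needed to reach half the total)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contig lengths given")
```

With hypothetical lengths `[200000, 80000, 3000, 512]`, the largest contig alone covers more than half of the total, so `n50` returns `(200000, 1)`.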

The second part of the lab entailed scoring the ISSR gels for each primer (Omar and 17898) for banding – a “1” if a band was present and a “0” if a band was not present. After the bands were scored, results were transferred into an Excel spreadsheet. Next, a nexus data file was formatted using the “ISSR_data_format_example” as a template. The matrix was pasted into the data file, “taxa labels” were added that corresponded to the names of the individuals, and the nchar was set to 11 to account for the total number of bands scored. Because not all classmates chose the same three individuals to use for both primers, two nexus files were generated, one for each marker. Below is a screenshot of each nexus file.

Nexus file for 17898 primer:

Nexus file for Omar primer:
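
The formatting step described above (a 0/1 band matrix, taxa labels, and nchar set to 11) can be sketched in Python. This is a minimal illustration that assumes a single NEXUS data block; the class template “ISSR_data_format_example” may be laid out differently (for example, with a separate taxa block). The function name is hypothetical and the band scores in the usage note are placeholders, not the scored gel data.

```python
def issr_to_nexus(scores):
    """Build a NEXUS data block from {individual: '0101...'} band scores."""
    lengths = {len(row) for row in scores.values()}
    if len(lengths) != 1:
        raise ValueError("every individual needs the same number of scored bands")
    nchar = lengths.pop()
    if any(set(row) - set("01?") for row in scores.values()):
        raise ValueError("scores may only contain 0, 1, or ?")
    width = max(len(name) for name in scores)  # pad names so the matrix lines up
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(scores)} nchar={nchar};",
        '  format datatype=standard symbols="01" missing=?;',
        "  matrix",
    ]
    lines += [f"  {name.ljust(width)}  {row}" for name, row in scores.items()]
    lines += ["  ;", "end;"]
    return "\n".join(lines) + "\n"


# Placeholder scores for illustration only (11 bands per individual).
example_scores = {
    "PRK01": "10110010011",
    "PRK02": "10010010111",
    "PRK03": "00110011011",
}
nexus_text = issr_to_nexus(example_scores)
```

Deriving nchar from the matrix rather than typing it by hand avoids a mismatch between the declared number of characters and the number of bands actually scored.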

Lab 13 – Population Genetics Analysis I

The first part of this lab was performing a gel electrophoresis of our ISSR samples. After obtaining the ISSR strips (17898 and Omar ISSRs were used) assembled last week, three samples from each strip were selected to perform the run. The following samples were selected from strip 1 (17898): PRK01, PRK02, PRK03. From strip 2 (Omar), samples PRK13, PRK14, PRK15 were selected. 1 µl of loading dye was transferred to each of the six samples using a p10.

The samples were loaded into the appropriate ISSR gel tray. The gels ran at 60 volts for 1.5 hours.

Next, an ITS alignment was assembled in Geneious utilizing forward and reverse reads of the ITS marker for 41 individuals of Lupinus arboreus collected from 13 geographic regions.

An assembly was constructed using the forward and reverse sequences for each individual. Once the assembly was complete, the sequences were edited. Unsuccessful regions at the beginning and at the end were trimmed and incorrect base calls and ambiguous bases were amended. Then, edited consensus sequences were extracted. These steps were repeated for all 41 individuals.

An alignment was assembled using the 41 consensus sequences extracted in the previous step. The alignment was edited by trimming the ends so that sequence lengths were in agreement. The edited ITS alignment was used to construct a phylogenetic tree using the MrBayes tool in Geneious. The MrBayes analysis was run using the following parameters: Chain length was set as 1,500,000; Burn-in Length was set at 100,000; HKY85 was selected as the substitution model; and 500 was set for the Subsampling Frequency.
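
As a rough check on what these MrBayes settings imply, the number of sampled trees kept after burn-in can be worked out as below. This sketch assumes the burn-in length is expressed in chain generations (the same units as the chain length) and ignores the sample taken at the starting state; the function name is hypothetical.

```python
def mcmc_sample_counts(chain_length, subsample_freq, burn_in):
    """Return (total samples, samples discarded as burn-in, samples retained)."""
    total = chain_length // subsample_freq
    discarded = burn_in // subsample_freq
    return total, discarded, total - discarded


# Settings from this run: chain length 1,500,000; subsampling every 500; burn-in 100,000.
total, discarded, retained = mcmc_sample_counts(1_500_000, 500, 100_000)
```

Under these assumptions the run yields 3,000 samples, of which 200 are discarded as burn-in and 2,800 contribute to the support values.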

The following phylogenetic tree with support values was generated.

Below are the posterior distribution and the trace from the MrBayes run.

The tree included one clade with a support value of 0.9895. The individuals of this clade were PRD05, PSF03, GWC01, and GWC02. PRD05 was located at Drakes Beach in Point Reyes, PSF03 was obtained from Presidio, GWC01 and GWC02 were obtained from Grey Whale Cove State Beach. PRD05 was collected from a mixed lupine plant. PSF03, GWC01, and GWC02 were sampled from a yellow flowered lupine plant.

Based on the individuals represented within each clade of the phylogenetic tree, it can be concluded that individuals in a clade come from geographically close populations rather than from the same population. Some individuals with purple flowers were found in the larger clade, where most yellow flowered individuals were grouped. In addition, not all purple flowered individuals were grouped in the same clade. Individuals from purple flowered samples (PHO, PHT, and PRD) were present in the largest clade and in clades with a support value of 0.9805 or 0.7472.

As previously mentioned, the internal transcribed spacer (ITS) is a region of nuclear ribosomal DNA that is widely used as a marker in fungi and plants. Here, it was amplified by PCR from DNA extracted from the leaflets collected from lupine plants. In assessing the phylogenetic relationships of the Lupinus populations, it appears that ITS does not provide enough resolution to distinguish populations phylogenetically. Some samples of either the purple or yellow flowered individuals were not grouped together in the same clade. If these samples had grouped together, then ITS could have been considered successful at the population level. Finally, the presence of the polytomy indicates phylogenetic uncertainty, in that we cannot determine how the individuals are related.

Lab 12 – Plant DNA PCR III

ISSR amplification reactions made last week using 5 random extracted DNA samples (belonging to other students) with the 17898 ISSR were used for gel electrophoresis. 1 µl of loading dye was added to each PCR reaction tube using a p10. Pipette tips were changed between each sample.

Using a pipette of the same size, 10 µl of each DNA-loading-dye mixture was added to the appropriate well. The gel was run at 60 volts for 1.5 hours.

After the run, the gel tray was scanned to determine successful ISSRs.

Successful ISSRs included Omar and 17898.

Next, 1:10 dilutions of our extracted DNA samples (Lab 9) were made. Five 1.5 mL centrifuge tubes were labeled with the corresponding sample ID and “1:10.” Labels are as follows: PRK01 1:10; PRK02 1:10; PRK03 1:10; PRK04 1:10; PRK05 1:10. Using a p10, 10 µl of sample DNA was added to the appropriate centrifuge tube. A p200 was used to add 90 µl of ddH₂O to each tube. Making the dilutions ensured that the concentration of any plant secondary compounds present was minimal, as these compounds could affect PCR success.

Two 0.2 mL 5-tube strips suitable for PCR were labeled with the appropriate sample ID (tubes of strip 1 were labeled PRK01, PRK02, PRK03, PRK04, PRK05, and tubes of strip 2 were labeled PRK11, PRK12, PRK13, PRK14, PRK15). Strip 1 was used for PCR with the 17898 ISSR, and strip 2 was assigned to the Omar ISSR.

1 µl of 1:10 diluted DNA was added to the corresponding tubes of strips 1 and 2 using a p10. Pipette tips were changed after each sample.

For each ISSR, a master mix was made that included all the reagents necessary to perform the PCR reaction. Creating a master mix minimized the likelihood of errors associated with pipetting small volumes. The ingredients and associated volumes used for each PCR reaction are as follows: 12.5 µl ddH₂O; 3.00 µl 10x buffer +Mg; 1.00 µl BSA; 2.00 µl dNTPs; 0.25 µl primer; 0.25 µl Taq. The primer corresponding to each ISSR was added to the appropriate master mix. Volumes were multiplied by 20 to account for the number of reactions for the group.

A p200 was used to transfer 19 µl of the master mix containing the 17898 ISSR to each tube of strip 1. 19 µl of the master mix containing the Omar ISSR was added to each tube of strip 2. Finally, the strips were placed into the PCR machine.
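
The master mix arithmetic can be checked with a short Python sketch. The per-reaction volumes are the ones listed above; the dictionary and function names are hypothetical. The per-reaction volumes sum to 19 µl, which matches the 19 µl of master mix added to each tube, and the 1 µl of diluted DNA brings each reaction to 20 µl.

```python
# Per-reaction volumes in microliters, as listed in the protocol above.
PER_REACTION_UL = {
    "ddH2O": 12.5,
    "10x buffer + Mg": 3.00,
    "BSA": 1.00,
    "dNTPs": 2.00,
    "primer": 0.25,
    "Taq": 0.25,
}


def master_mix(per_reaction, n_reactions):
    """Scale each per-reaction volume by the number of reactions."""
    return {reagent: round(vol * n_reactions, 2) for reagent, vol in per_reaction.items()}


mix = master_mix(PER_REACTION_UL, 20)
per_tube_total = sum(PER_REACTION_UL.values())  # master mix volume per tube
mix_total = sum(mix.values())                   # total master mix volume
```

For 20 reactions this works out to 250 µl ddH₂O, 60 µl buffer, 20 µl BSA, 40 µl dNTPs, 5 µl primer, and 5 µl Taq, for 380 µl of master mix in total.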