Wednesday, August 31, 2011

ITS RNA secondary structure

I have recently been conducting phylogenetic and taxonomic studies of selected groups of lichen-forming fungi using sequences from the quickly evolving nuclear ribosomal ITS (internal transcribed spacer) region to examine relationships within and between species (e.g., Hodkinson & Lendemer 2011, Hodkinson et al. 2010, Lendemer & Hodkinson 2009, 2010, in prep). In order to properly analyze the evolutionary relationships between the organisms from which these molecules were derived, I built secondary structure models for the RNA molecules encoded by ITS1 and ITS2 (the two rapidly evolving sections of the ITS region) for some of the groups. 

The ITS1 and ITS2 spacer regions encode stretches of RNA that fold up in specific conformations and help to assemble the ribosomes (the pieces of cellular machinery that build protein molecules based on specific messenger RNA sequences transcribed from DNA). The particular folding pattern is referred to as the molecule's "secondary structure."  Here is an example of a secondary structure model that I put together for ITS2 of Parmotrema perforatum:

Notice the A(adenine)-U(uracil) pairings and the G(guanine)-C(cytosine) pairings, just like the complementary strands of DNA (except that with DNA you have T for thymine instead of U for uracil).

There are two main reasons that one might want to have a secondary structure model when inferring phylogeny:

[1] Nucleotide Alignment - An understanding of the overall structure of the molecule can aid in discerning which sets of sites in different organisms actually represent the same character when they have different states and there are adjacent nucleotides that have been inserted or deleted in some taxa (Kjer 1995). Many studies use principles of secondary structure to aid in alignment.

[2] Phylogenetic Inference - Since paired sites in some sense evolve in tandem (if one nucleotide changes, the linked nucleotide will often change to compensate over evolutionary time), it is most appropriate within a likelihood framework to apply a different model of evolution to the paired nucleotides so that this can be taken into consideration. This type of inference can be done with RAxML (Stamatakis 2006) and I have recently integrated this into my workflow (Hodkinson & Lendemer in prep).
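
To make the idea of "paired" versus "unpaired" sites concrete, here is a minimal Python sketch (not part of the original workflow) that reads a structure written in dot-bracket notation and separates alignment columns into stem and loop sets, which is the kind of split needed before applying a different model to the paired sites. It assumes the structure is a single string over '(', ')' and '.', and the function names and toy structure are hypothetical.

```python
def paired_columns(structure):
    """Return sorted (i, j) pairs of 0-based columns from a dot-bracket string."""
    stack = []
    pairs = []
    for col, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(col)                 # remember the opening side of a pair
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at column {col}")
            pairs.append((stack.pop(), col))  # closes the most recent opening
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at column {col}")
    if stack:
        raise ValueError(f"unmatched '(' at column {stack[-1]}")
    return sorted(pairs)


def split_columns(structure):
    """Split columns into stem (paired) and loop (unpaired) lists."""
    stem = sorted(col for pair in paired_columns(structure) for col in pair)
    stem_set = set(stem)
    loop = [col for col in range(len(structure)) if col not in stem_set]
    return stem, loop


# Hypothetical toy structure: a 3-bp stem around a 4-nt loop, plus one free column.
toy_structure = "(((....)))."
print(paired_columns(toy_structure))
print(split_columns(toy_structure))
```

The two column lists could then be written out as separate partitions in whatever format the phylogenetics program expects.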

The really interesting thing to think about is that this type of macromolecule needs to be able to move in order to function, which means that the structure is not static but dynamic. While we usually use the 'best' structure for phylogenetic inference, there are many structures that are nearly equally good, and the molecule changes its conformation through space and time, flipping between these different conformations in order to perform its functions in the cell. To drive the point home, here is a quick video I made of the ITS2 molecule of Cladonia stipitata Lendemer & Hodkinson (2009) shifting between different likely conformations:


Sources cited:

Hodkinson, B. P., and J. C. Lendemer. 2011. Molecular analyses reveal semi-cryptic species in Xanthoparmelia tasmanica. Bibliotheca Lichenologica 106: 115-126.

Hodkinson, B. P., and J. C. Lendemer. In prep. Systematics of an enigmatic sterile crustose lichen.

Hodkinson, B. P., J. C. Lendemer, and T. L. Esslinger. 2010. Parmelia barrenoae, a macrolichen new to North America and Africa. North American Fungi 5(3): 1-5.

Kjer, K. M. 1995. Use of rRNA secondary structure in phylogenetic studies to identify homologous positions: an example of alignment and data presentation from the frogs. Molecular Phylogenetics and Evolution 4: 314-330.

Lendemer, J. C., and B. P. Hodkinson. 2009. The Wisdom of Fools: new molecular and morphological insights into the North American apodetiate species of Cladonia. Opuscula Philolichenum 7: 79-100.

Lendemer, J. C., and B. P. Hodkinson. 2010. A new perspective on Punctelia subrudecta in North America: previously-rejected morphological characters corroborate molecular phylogenetic evidence and provide insight into an old problem. The Lichenologist 42(4): 405-421.

Stamatakis, A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688-2690.

Wednesday, August 17, 2011

Taxonomy: Art or Science?

When I Googled "science definition," the first thing that came up was "The intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment." After a little more research, I was surprised to see that this seems to be one of the stricter definitions of science (others may be as broad as "the state of knowing" or some such...), but it is one with which I can get on board. I tend to think of science itself in a very strict sense, as the process of developing and testing hypotheses. However, my big caveat is that there are many activities that are involved in (and are absolutely essential to) the practice of science that are not science per se according to that definition. This does not diminish their value to science. Some of this has to do with the acquisition of background knowledge that informs the hypotheses to be tested, while some of it is associated with making the results of inquiry available and comprehensible to the scientific community and the public.

So then is taxonomy art or science? With taxonomy, there is not a "right" answer, although there are plenty of wrong answers if one wishes to have a system that is informed by the results of scientific inquiry. Taxonomic units are all in some sense arbitrary. Although a group of organisms may form a "clade," whether we recognize that clade with a certain name is somewhat arbitrary. I personally like to think of taxonomic units being defined by specific innovations (morphological, molecular, ecological, etc.) that have changed the evolutionary trajectory of a group, but that rule is certainly not universally applied, and there could certainly be many alternative taxonomies even if such standards were applied.

For me, the argument for taxonomy as an art does not actually diminish taxonomy in any way as part of what we must do in order to be effective and responsible scientists. In fact, having this perspective on taxonomy can help to enhance the understanding of the significance of taxonomy for science. As scientists, we must use what we discover through the scientific process to help facilitate communication about natural phenomena. Taxonomy is a tool that we use to communicate ideas about organisms, so taxonomy is an absolutely necessary part of the pursuit of scientific truth, even if it is not "science" itself.

One test for me of whether taxonomy is itself a science in the very strictest sense of the word is whether it is directly involved in the process of hypothesis testing. One can use principles of phylogenetics, ecology, or molecular biology to test hypotheses, but taxonomic principles would not be used. When we begin to dissect some of the scientific questions that are often deemed "taxonomic questions," it can be argued that they are not actually taxonomic in nature, and that the taxonomic repercussions would really only be a byproduct of obtaining results through scientific inquiry. For instance, a question like "Is this a good genus?" is really asking something like "Do the species form a distinct clade?", which is a question that is evolutionary in nature. Likewise, the question "Do these individuals make up one species?" is perhaps just a way of saying "How can we properly apply a biological, morphological, chemical, ecological, and/or phylogenetic species concept to this group of individuals?", a question that draws on different fields of biology.

I can see that many systematists would hesitate to state that taxonomy is an art, because of what it implies. If it is an art, then it opens the door for people to say that those who do taxonomy are not really scientists at all. But a consummate scientist is not just someone who constantly tests hypotheses one after another without consideration for anything else. To be a scientist, one must also lay the groundwork for scientific pursuits, and defining the terms used to communicate ideas about specific units of the tree of life (whether or not it is itself an artistic pursuit) is crucial to the advancement of science.

- Brendan

Thursday, August 4, 2011

Using Sequencher for Multiple Sequence Alignments

Much of the molecular research that I have done over the years has involved working with DNA sequences generated through Sanger sequencing. These sequences are never perfect, and always require manual correction. It is especially helpful to correct sequences and align them to other similar sequences simultaneously. In this way, alignment and structural data can be taken into consideration when interpreting the chromatograms for the DNA sequences.

So I wrote a couple of simple Perl scripts that would allow me to make my alignments in Sequencher (the standard program for editing raw sequence reads) and easily move them over to Mesquite or MacClade (standard programs for assembling data matrices for downstream phylogenetic analyses) so that they could be joined with a reference alignment that I had made previously. In this way, I could avoid completely realigning all sequences to one another through an automatic alignment program, thereby preserving certain sequence alignment patterns (note that I often deal with over 1000 sequences at a time). If you use Linux or Macintosh, running a Perl script is generally a pretty simple matter (since Perl interpreters are typically built into the operating system). If you use Windows, you will probably need to download an interpreter like Strawberry Perl or ActivePerl.

The type of data that I was dealing with was a set of bidirectional Sanger sequences (one forward, one reverse primer for each sequence) of fragments ~650 bp in length. These sequences were cloned and therefore had vector overhang on both ends of both strands, which had to be deleted. If you have data that are similar, here is a procedure that can be used to preserve the Sequencher alignment pattern and bring it into MacClade/Mesquite (potentially for merging with a curated reference alignment, if you have one of these):

[a] In the Sequencher alignment, make sure at least one sequencing strand of each pair of strands (from the bidirectionally-sequenced pool of DNA fragments) has all of the corrected bases, and delete the second strand for each pair. This gives an alignment with one strand for each sequence. [This Sequencher alignment can be tweaked visually to align with a reference set that is already pre-aligned by introducing gaps into the Sequencher alignment to accommodate the gaps in the reference alignment.]
[b] The Sequencher alignment can then be exported as a contig in aligned fasta format and subsequently opened in MacClade/Mesquite. [Note: If you have exported the sequences from Sequencher as a concatenated set of sequence fragments, it might use ':' instead of '-' to represent the gaps; make sure all of the gaps are changed to '-' for integration into MacClade or Mesquite (this can be done as a simple search and replace with any text editor).]
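
The gap replacement in step [b] can also be scripted, which avoids accidentally changing a ':' that happens to occur in a sequence name. The following is a minimal Python sketch rather than anything from the original workflow; the function name and the example sequences are made up for illustration.

```python
def normalize_gaps(fasta_text, old=":", new="-"):
    """Replace `old` gap symbols with `new` in sequence lines only."""
    fixed = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fixed.append(line)                    # header: leave the name untouched
        else:
            fixed.append(line.replace(old, new))  # sequence line: swap gap symbols
    return "\n".join(fixed) + "\n"


# Hypothetical two-sequence export; the names and bases are invented.
example = ">seq:one\nAC::GT\n>seq_two\n::ACGT\n"
print(normalize_gaps(example), end="")
```

A plain search and replace in a text editor does the same job as long as none of the sequence names contain a ':'.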

For my particular sequences, I had to deal with the issue of all of the sequence names being preceded by my initials and having strand-specific information tacked on to the end (both standard pieces of information added by the sequencing facility). Here is another blog post with the Perl script that I wrote for editing the fasta file to extract the 10-character alphanumeric code used to identify my sequences. Also, I had to line my sequence block up with the portion of my reference alignment with which it correlated. In my particular situation, the block of sequences that I had aligned began 488 bases into the reference alignment. Here is the script that I used to add 488 gap characters to the front of each sequence in the fasta file (this script relies on having a 10-character code name for each sequence):


#!/usr/bin/perl
use strict;
use warnings;

# Pad each sequence with 488 leading gaps so it lines up with the reference alignment.
print "\nPlease type the name of your input file: ";
my $filename = <STDIN>;
chomp $filename;
open(my $fasta, '<', $filename) or die "Cannot open $filename: $!";
(my $base = $filename) =~ s/\.[^.]*$//;    # drop the final extension, if any
open(my $out, '>', "$base.ed.fasta") or die "Cannot write $base.ed.fasta: $!";

my $pad = '-' x 488;    # 488 gap characters
while (my $line = <$fasta>) {
    if ($line =~ /^>(.{10})/) {
        print $out ">$1\n$pad\n";    # keep the 10-character name, then the gaps
    }
    else {
        print $out $line;            # sequence lines pass through unchanged
    }
}
close $fasta;
close $out;

The final step was to simply open up my reference alignment in MacClade and import the newly-generated fasta file of aligned cloned sequences... and they lined up perfectly!  I then tweaked exclusion sets, saved the full alignment, and was ready for downstream phylogenetic analyses.
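
Before importing, it can be worth confirming that the padding landed where expected. Here is a minimal Python sketch of such a check, not part of the original protocol: it reports any sequence whose first base falls before a given column. The function names are hypothetical, and the toy example uses an offset of 4 rather than 488 to keep it readable.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of name -> sequence (wrapped lines joined)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def first_base_column(sequence, gap="-"):
    """0-based index of the first non-gap character, or None if all gaps."""
    for col, symbol in enumerate(sequence):
        if symbol != gap:
            return col
    return None


def misplaced(text, offset):
    """Names of sequences whose first base sits before column `offset`."""
    bad = []
    for name, seq in read_fasta(text).items():
        start = first_base_column(seq)
        if start is not None and start < offset:
            bad.append(name)
    return bad


# Hypothetical padded file: the second record is one gap short of the offset.
toy = ">AAAAAAAAAA\n----\nACGT\n>BBBBBBBBBB\n---\nACGT\n"
print(misplaced(toy, 4))
```

An empty list means every sequence starts at or after the expected column.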

Even though MacClade and Mesquite are very good programs overall for alignment, aligning a set of 1000+ sequences in them is extremely cumbersome, and Sequencher can be much faster and easier as long as the sequences are relatively conserved. With the Perl scripts discussed above, hopefully researchers will no longer perceive impediments or inefficiency in a process that includes aligning and correcting relatively conserved sequences in Sequencher (with all of the raw sequence data) before moving them over to MacClade/Mesquite for final data set assembly and formatting.

- Brendan



The above protocols are published in the following sources:

Hodkinson, B. P. 2011. A Phylogenetic, Ecological, and Functional Characterization of Non-Photoautotrophic Bacteria in the Lichen Microbiome. Doctoral Dissertation, Duke University, Durham, NC.

Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. 2011. Data from: Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Dryad Digital Repository doi:10.5061/dryad.t99b1.

Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. In press. Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Environmental Microbiology.


This work was funded in part by NSF DEB-1011504.