Wednesday, March 23, 2011

Characters, characters everywhere...

Given the sequencing technology available today and the speed at which it runs, it's no wonder that researchers want the assembly of double-barrel shotgun sequencing fragments to be faster as well. Thus this contest. Nature doesn't provide the species names or further details about the assembly procedures, but these don't concern me much.

When it comes to full genome sequencing projects, as a systematist I'm more concerned with the characters (the actual sequence of base pairs) and how they are used in phylogenetic inference. With pyrosequencing and other next-generation sequencers, it will very soon become inexpensive and fast to sequence the entire genome of any organism. Setting aside the sheer amount of information involved and the digital space it will take to store it all, what is one to do with all these characters?

Some people, or I should say, many modern systematists would like nothing better than to shove the entire genomes of species within a taxon into an analysis (never mind that genomes are character sets of individuals, not species) and let an algorithm sift through the mess and work it out. I'd like to point out that this has been done before, and is generally regarded as an unfortunate but necessary stepping stone on the way to more scientifically acceptable methods. Still, the temptation of easy answers is strong.

Consider this, however. We have a number of assembled genomes (by whatever method) and we have aligned them (hopefully not manually) so we can examine the shared areas. Could we possibly design a program which will automatically find shared sequence lengths, highlight them from longest to shortest, and discard those below a cutoff length? Then we could actually look at the sequence lengths that may matter (there will still be homoplasy) and consider these entire shared sections to be our hypothetical homologs. We could then code the sequence lengths as individual characters and run a more traditional style of phylogenetic inference. This may actually be faster than the "mass shoving" scenario, as there are fewer potential relationships for a computer to compare. It also removes a great deal of the homoplasy which interferes with our hypothesis testing. More characters (if the characters are not specially shared) are not always better. Millions of characters will not give greater resolution to a phylogeny if 80% of them are either different or shared single base pairs scattered among non-shared lengths. Part of scientific efficiency is designing a crucial experiment which will quickly eliminate alternative hypotheses (PDF). Using an entire genome in a phylogenetic inference is like setting sail on Lake Superior in a kayak without a map or compass and hoping you'll hit Isle Royale after several days' travel.
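To make the idea a little more concrete, here is a rough Python sketch of what such a program might look like. It is only an illustration under my own assumptions, not an existing tool: the alignment is taken to be a dictionary of equal-length strings with "-" for gaps, "shared" is simplified to mean identical and gap-free between at least one pair of taxa, and the function names (`shared_runs`, `candidate_blocks`, `code_block`), the toy sequences, and the cutoff are all made up for the example.

```python
from itertools import combinations


def shared_runs(seq_a, seq_b, min_len):
    """Return half-open (start, end) column ranges where two aligned
    sequences are identical and gap-free for at least min_len columns."""
    n = min(len(seq_a), len(seq_b))
    runs = []
    start = None
    for i in range(n):
        match = seq_a[i] == seq_b[i] and seq_a[i] != "-"
        if match and start is None:
            start = i                      # a new shared run begins here
        elif not match and start is not None:
            if i - start >= min_len:       # keep only runs above the cutoff
                runs.append((start, i))
            start = None
    if start is not None and n - start >= min_len:
        runs.append((start, n))            # run extends to the end
    return runs


def candidate_blocks(alignment, min_len):
    """Collect shared runs over all pairs of taxa, longest first."""
    blocks = set()
    for name_a, name_b in combinations(sorted(alignment), 2):
        blocks.update(shared_runs(alignment[name_a], alignment[name_b], min_len))
    # Sort by length (descending), then by start column for a stable order.
    return sorted(blocks, key=lambda r: (-(r[1] - r[0]), r[0]))


def code_block(alignment, start, end):
    """Code one block as a single multistate character: taxa carrying the
    same subsequence in the block receive the same integer state."""
    states = {}
    coded = {}
    for name in sorted(alignment):
        fragment = alignment[name][start:end]
        coded[name] = states.setdefault(fragment, len(states))
    return coded


# Toy alignment, invented purely for illustration.
alignment = {
    "A": "ACGTACGTTT",
    "B": "ACGTACGAAA",
    "C": "TTGTACGAAA",
    "D": "TT-TTCGCCC",
}

blocks = candidate_blocks(alignment, min_len=4)
matrix = {block: code_block(alignment, *block) for block in blocks}
```

Each entry in `matrix` is one candidate character, which a systematist would still need to inspect as a hypothesis of homology before handing the coded matrix to a conventional phylogenetic program. The pairwise comparison is the crudest possible choice; a real implementation would have to deal with overlapping blocks, missing data, and genomes far too large for this kind of brute-force scan.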


Edit: The process of selecting characters a priori for a phylogenetic inference is neither new nor unusual. All morphological cladistics works this way. Leaving non-pattern (homoplasy) in sequences out of the analysis is no different from disregarding highly variable morphology (e.g. color) or bland uniformity. As in all good science, leaving out unnecessary data is a matter of efficiency.
