Monday, December 20, 2010

Google's Ngram Viewer

I've been playing off and on with Google's Ngram viewer since it was announced on Friday. This is the tool that lets you graph how often given words or phrases are used over time. All sorts of interesting experiments are possible -- for example, try comparing the usage of a word against a synonym, an antonym or a euphemism (or you could examine those three terms themselves -- "antonym" seems to be much less frequently used but growing in frequency!).

But I've already noted some anomalies. The plot for "United States of America" is surprisingly spiky, with remarkably few mentions in the early 1800s. That is perhaps an artifact of the sources available for the Google book digitization project, but it does cast doubt on some of the conclusions being drawn from this tool.

But worse, there are definitely some issues with dating and with automated text recognition. Search for "Genomics", and some awfully early references show up. These seem to fall into two categories: serious book dating errors and text recognition errors. In the former category, I don't believe Nucleic Acids Research was publishing in 1835, and a number of other periodicals seem to be afflicted with similar misdatings. In the latter, "générales" seems to be a favorite to transmute to "genomics".

These issues do not invalidate the tool, but they do urge caution in interpreting results -- particularly if trying to explore the emergence and acceptance of a new term.

An approach to deal with this would be to turn the problem around. A systematic search for anachronistic word patterns could identify misdatings or questionable datings in either direction. Not only would this identify documents transported backwards in time, but also ones which should be flagged for time travel in the other direction. For example, using the tool I discovered that someone sharing my surname co-authored a screed against Masonry back in the 1700s -- and this same work shows up as a modern book due to a reprinting in recent years.
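
As a rough sketch of that idea (not anything Google provides), here is how such a screen might look in Python. Everything in it is an assumption for illustration: the `Document` record, the function names, and especially the `FIRST_SEEN` table, whose years are placeholders rather than attested first-use dates. A real version would need a curated table of first-use dates and would only queue documents for human review, since a modern book can legitimately use nothing but old words.

```python
from dataclasses import dataclass


@dataclass
class Document:
    title: str
    claimed_year: int
    words: frozenset


def anachronistic_words(doc, first_seen):
    """Words whose earliest known use is later than the document's claimed year."""
    return sorted(w for w in doc.words
                  if w in first_seen and first_seen[w] > doc.claimed_year)


def newest_word_year(doc, first_seen):
    """Latest first-use year among the document's known words, or None."""
    years = [first_seen[w] for w in doc.words if w in first_seen]
    return max(years) if years else None


def possible_reprint(doc, first_seen, gap=100):
    """True if every known word predates the claimed year by at least `gap` years."""
    newest = newest_word_year(doc, first_seen)
    return newest is not None and doc.claimed_year - newest >= gap


# Placeholder first-use years for illustration only, not attested dates.
FIRST_SEEN = {"genomics": 1986, "masonry": 1400, "screed": 1300}

misdated = Document("journal volume", 1835, frozenset({"genomics", "acids"}))
reprint = Document("anti-Masonry tract", 2009, frozenset({"masonry", "screed"}))

print(anachronistic_words(misdated, FIRST_SEEN))  # words too new for 1835
print(possible_reprint(reprint, FIRST_SEEN))      # vocabulary far older than 2009
```

The first check catches documents transported backwards in time; the second is a much weaker hint for the opposite case, where an old text carries a recent reprint date.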

But in any case, it is an interesting way to explore language and culture, even without a little tidying & curation.

Sunday, December 19, 2010

Bone Marrow Registries of Contention, and the Future of Tissue Typing?

This summer I took TNG to the Portsmouth Air Show to enjoy viewing aerobatics and looking at some aircraft up close. As with many such events, there is a vendors section & at this one I came across the Caitlyn Raymond Bone Marrow Registry. Curious to check out someone else's consent form, experience a buccal swab & contribute to a good cause, I signed up. Quick & painless. It's also only the second time I've consented to have my own DNA analyzed, and the first professional job (we sequenced one polymorphism from each student in an undergrad class at Delaware).

I don't regret that decision, but one part that was a bit odd at first was filling out my medical insurance information. Okay, someone has to pay for the DNA testing but it seemed a little odd to stick my insurance with it -- but I didn't give it a lot of thought at the time. Since that time, I've regularly seen the registry at various community events as well as a kiosk at the local mall.

Yesterday's Globe had an article causing me to revisit that memory. U Mass Medical Center runs the Caitlyn Raymond registry, and someone there saw a dubious opportunity and ran with it. The lead for the article focused on the fact that professional models had been used as greeters at many events, helping a very high recruitment rate. Okay, that could be seen as just creative. But, the back end is that U Mass has been charging as much as $4K per sample for testing. YIKES! That's in excess of what I've heard BRCA testing goes for. Now U Mass will be getting a lot of attention from a number of attorneys general.

One point in the article is a concern that the use of models may compromise the informed consent process. The proof, the article continued, will be whether registrants from the Raymond pool fail to follow through with donations at an unusually high rate, but given that most will never be contacted it may never be known.

But it got me thinking: since the testing is purely a DNA analysis, then presumably each complete human genome sequence can be used to type an individual. Perhaps even the tests from 23 et al* hit the right markers, or at least some tightly linked ones.

So, is it ethical to reach out to such individuals? Given that I could, in theory, search the released DNA sequences from the personal genome project, would it be reasonable to try to track one down and beg for a donation? Of course, the odds of a successful match are tiny -- but as more and more PGP sequences pile up, the chance of such a search succeeding goes up.

What about non-public DNA databases? Again, suppose a 23 et al had the right markers (or nearly so). Should you have to opt-in to be notified that you are predicted to be a possible donor match? Is there a mechanism to publish profiles to a central database, with an ability to ping the user back if a match is made? And if every newborn is being typed for a few thousand other markers, will testing for transplantation markers also be required?

* -- a great term, I believe originated by Kevin Davies in his $1K genome book.

Friday, December 10, 2010

Is Pacific Biosciences Really Polaroid Genomics?

The New England Journal of Medicine this week carries a paper from Harvard and Pacific Biosciences detailing the sequence of the Vibrio cholerae strain responsible for the outbreak of cholera in Haiti. The paper and supplementary materials (which contain detailed methods and some ginormous tables) are free for now. There's also a nice piece in BioIT World giving a lot of backstory. Quite a few other media outlets have carried it as well, but those are the ones I've read.

All in all, the project took about a month from an initial phone call from Harvard to Pacific Biosciences until the publication in NEJM. Yow!! Actual sequence generation took place 2 days after PacBio received the DNA. And this is sequencing two isolates (which turned out to be essentially identical) of the Haitian bug plus three reference strains. While full sequence generation took longer, useful data emerged three hours after getting on the sequencer (though there are apparently around 10 wall clock hours of prep before you can get on the sequencer). With the right software & sufficient computational horsepower, one really could imagine standing around the sequencer and watching a genome develop before your eyes (just don't shake the machine!).

Between this & the data on PacBio's DevNet site (you'll need to register to get a password), in theory one could find the answers to all the nagging questions about the performance specs. In practice, this dataset is apparently available only as the assembled sequence plus summary statistics for certain aspects of it. For example, dropped bases are apparently concentrated at C's & G's, so these were discounted.

Read lengths were 1,100+/-170bp, which is quite good -- and this is after filtering out lower quality data -- and 5% of the reads were monsters bigger than 2800 bases. It is interesting that they did not use the circular consensus method, which was previously published in a method paper (which I covered earlier) and yields higher qualities but shorter fragments. It would be particularly useful to know if the circular consensus approach effectively dealt with the C/G dropout issue.

One small focus of the paper, especially in the supplement, is depth of sequence analysis to infer copy number variation. There is a nice plot in Supplementary Figure 2 illustrating how the copy number varies with distance from the origin of replication. If you haven't looked at bacterial replication before, most bacteria have a single circular chromosome and initiate synthesis starting at one point (the 0 minute point in E. coli). In very rapidly dividing bacteria, the cell may not even wait for one round of synthesis to complete before firing off another synthesis round, but in any case, in any dividing population there will be more DNA near the origin than near the terminus of replication. Presumably one could estimate the growth kinetics based on the slope of the copy number from ori to ter!
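
To make that last idea concrete, here is a minimal sketch under a textbook assumption that is mine, not the paper's: in a steadily, exponentially growing culture with constant fork speed, the copy number of a locus at fractional distance m from the origin (0 at ori, 1 at ter) is proportional to 2^(C(1-m)/tau), where C is the time to replicate the chromosome and tau is the doubling time. Then log2 of the depth is a straight line in m with slope -C/tau. The function names and the synthetic depths below are hypothetical, and the helper assumes ter sits exactly opposite ori on the circle.

```python
import math


def fraction_from_ori(pos, ori, genome_len):
    """Fractional distance from ori along the shorter arc (0 = ori, 1 = ter).

    Assumes bidirectional replication with ter diametrically opposite ori.
    """
    d = abs(pos - ori) % genome_len
    return min(d, genome_len - d) / (genome_len / 2)


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def replication_to_doubling_ratio(fractions, depths):
    """Estimate C/tau as minus the slope of log2(depth) against distance from ori."""
    slope, _ = fit_line(fractions, [math.log2(d) for d in depths])
    return -slope


# Synthetic depths generated with C/tau = 0.5, for illustration only.
fractions = [0.0, 0.25, 0.5, 0.75, 1.0]
depths = [30 * 2 ** (0.5 * (1 - m)) for m in fractions]
ratio = replication_to_doubling_ratio(fractions, depths)
print(ratio)  # recovers the C/tau used to build the synthetic data
```

The fit only gives the ratio C/tau; turning it into an actual doubling time would need an independent estimate of C for the organism and growth conditions.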

After subtracting out this effect, most of the copy number fits a Poisson model quite nicely (Supplementary Figure 3). However, there is still some variation. Much of this is around ribosomal RNA operons, which are challenging to assemble correctly since they appear in arrays of nearly (or completely) perfect repeats which are quite long. There's actually even a table of the sequencing depth for each strain at 500 nucleotide intervals! Furthermore, Supplementary Figure 4 shows the depth of coverage (uncorrected for the replication polarity effect) at 6X, 12X, 30X and 60X coverage, illustrating how many of the trends are actually noticeable in the 6X data.
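
For anyone wanting to poke at a per-window depth table like that one, a simple check is sketched below. It is my own illustration, not the method used in the supplement. For Poisson-distributed depths the variance equals the mean, so the variance-to-mean ratio should sit near 1, and a window whose depth is many multiples of sqrt(mean) away from the typical depth is suspect. The function names, the cutoff of 4 and the example depths are all assumptions, and the input should already have the ori-to-ter gradient removed.

```python
import math
from statistics import mean, median, pvariance


def dispersion_index(depths):
    """Variance-to-mean ratio; close to 1 if depths look Poisson."""
    return pvariance(depths) / mean(depths)


def outlier_windows(depths, z_cutoff=4.0):
    """Indices of windows far from the median depth, in units of sqrt(median).

    The median stands in for the Poisson mean so that a few extreme
    windows (e.g. collapsed repeats) do not shift the baseline.
    """
    centre = median(depths)
    sd = math.sqrt(centre)
    return [i for i, d in enumerate(depths) if abs(d - centre) > z_cutoff * sd]


# Made-up window depths: one window with roughly triple coverage.
depths = [30, 28, 33, 31, 29, 30, 95, 32, 27, 30]
print(dispersion_index(depths))  # well above 1 because of the one extreme window
print(outlier_windows(depths))   # index of the suspicious window
```

A window flagged this way is a candidate for a collapsed repeat such as a ribosomal RNA operon, but it could equally be a genuine copy number difference, so it is a starting point for inspection rather than a verdict.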

What biology came out of this? A number of genetic elements were identified in the Haitian strains which are consistent with its being a very bad actor and with its being a variant of a nasty Asian strain.

All in all, this neatly demonstrates how PacBio could be the backbone of a very rapid biosurveillance network. It is surprising that in this day and age the CDC (as detailed in the BioIT article) even bothered with a pulsed field study; even on other platforms the turnaround for a complete sequence wouldn't be much longer than to do the gel study, and the results are so much richer. Other technologies might work too, but the very long read lengths and fast turnaround offered should be very appealing, even if the cost of the instrument (much closer to $1M than to my budget!) isn't. But a few instruments around the world serving other customers, with priority given to such samples, could form an important tripwire for new infections, whether they be acts of nature or of evil persons. Now, it is important to note that this involved a known, culturable bug and the DNA was derived from pure cultures, not straight environmental isolates.

On a personal note, I am quite itchy to try out one of these beasts. As a result, I'm making sure we stash some DNA generated by several projects so that we could use them as test samples. We know something about the sequence of these samples and how they performed with their intended platforms, so they would be ideal test items. None of my applications are nearly as exciting as this work, but they are the workaday sorts of things which could be the building blocks of a major flow of business. Anyone with a PacBio interested in collaborating is welcome to leave me a private comment (I won't moderate it through), and of course my employer would pay reasonable costs for such an exercise. Or, I certainly wouldn't stamp "return to sender" on the crate if an instrument showed up on the loading dock! I don't see PacBio clearing the stage of all competitors, but I do see it both opening new markets and throwing some serious elbows against certain competing technologies.

Friday, December 03, 2010

Arsenic and New Microbes

Yesterday's announcement of a microbe which not only tolerates arsenic but actually appears to incorporate it in place of phosphorus has traveled a typical path for such a discovery: while it is quite a find, the media has generated more than a few ridiculous headlines. Yes, this potentially expands the definition of life, at least in an elemental sense, but it hardly suggests that such life forms exist elsewhere. A similar absurd atmosphere briefly reigned around a discovery of a potentially habitable world around a distant star -- the discoverer was quoted in at least one outlet as saying that his find was guaranteed to have life. Given that we know very little about the probability of life starting, I always cringe when I hear someone announce that such events are either certain or certainly impossible; we simply can't calculate believable odds given our poor knowledge base. On the other end, suggestions have been raised as to this bug being a starting point for bioremediation of arsenic-contaminated aquifers; but really this discovery isn't a huge step in that direction beyond species already known to tolerate the stuff. It's also disappointing that none of the popular news items I've seen have pointed out how a periodic table can be read to show the chemical similarity of phosphorus and arsenic.

That said, it is an intriguing discovery. The idea that all those phosphates on the metabolic diagrams might be substituted with arsenate is quite jarring. No reader of this space will be surprised to hear me advocate for immediate sequencing of this bug (if it hasn't already happened and just not yet been reported). A microbial genome these days can be roughed out in well under a month (actually, sequence generation for Mycoplasma a decade ago took that long; clearly we can go faster now).

In order to interpret that genome, though, another whole line of experiments is needed. Assuming that the ability of this organism to incorporate arsenate in place of phosphate is confirmed, some of the precise enzymes capable of doing this trick need to be located. Simply finding arsenate analogs of some key metabolites (such as phosphorylated intermediates in glycolysis) would point at a few enzymes, and then it would be valuable to demonstrate the purified enzymes pulling the trick. The next step then would be to test whether more conventional enzymes have this activity. Despite what many of us learned in various exposures to biochemistry from elementary school on up, enzymes aren't utterly specific for their substrates. Instead, there is a certain degree of promiscuity, though generally not with equal activity. So, to extend my example, if the new bug's triose phosphate isomerase can work on triose arsenates, then testing that activity in well-characterized TPIs would be in order.

Assuming that such enzymes (from E. coli or human or yeast or what-not) do not have the activity, then crystal structures of the arsenate-lover would be an important next step. Of course, repeating this for the whole roster of enzymes in the bug would be quite an undertaking, but perhaps a number could be modeled to see if a consistent pattern of substitutions or other alterations emerges.

At one time, it was in vogue to speculate on life forms which used silicon in place of carbon, given its location one rung down on the periodic table. Did any author ever dare suggest arsenic for phosphorus? I doubt it, but perhaps there was some mind playing with the possibilities who wrote it down somewhere (along with a large pile of other guesses that will not pan out).