Nucleotide Repetition Lengths

clojure code genomics .....

Later: Fun with Instaparse
Earlier: Updating the Genome Decoder

After a long hiatus, let’s continue our foray into amateur genomics with Clojure, in which we continue to encounter elegant expressions of Clojure’s power as well as a few additional wrinkles in the Clojure ecosystem (in this case, Incanter).

In our last post we upgraded our genome decoder and used it to show frequencies of the four nucleotides, illustrating Chargaff’s Rule.

Another simple question to be asked of the data is, What is the distribution of nucleotide repetition lengths? In other words, if we represent the number of times any nucleotide is repeated as a histogram, then AAGGCATTTT would yield two entries for 1, two entries for 2, zero for 3, and one for 4. What does this distribution look like for the entire human or yeast genome?

As a “null hypothesis”, consider the likelihood of getting $N$ A’s when drawing at random from A, G, C or T. The probability for the first selection to be A is 25%; for the first two draws to be AA the probability is $P(AA) = {1 \over {4^2}} = 0.625$, and for $N$ consecutive draws, $P(A \times N) = 4^{-N}$.

You can easily simulate (“Monto Carlo”) such a situation using a purely random sequence:

(defn randgenome []
  (repeatedly #(rand-nth [:A :G :C :T])))

This function yields a sequence like (:A :A :C :A :T :G :G :T :G :A :C :C ...).

To convert this into repetition lengths, we can first divide the sequence up into repeating elements using partition-by and the identity function:

(partition-by identity (randgenome))
;;=> ((:A :A) (:C) (:A) (:T) (:G :G) (:T) (:G) (:A) (:C :C) ...)

Extracting the lengths is then trivial: just map count over the sequences:

(defn get-lengths [s]
  (->> s
      (partition-by identity)
      (map count)))

(get-lengths (randgenome))
;;=> (2 1 1 1 2 1 1 1 2 ...)

(Note: running these functions unmodified will hang your REPL, since randgenome yields an infinite sequence! Wrap them with take if needed.)

Histogramming the data

The plots in my previous posts were made by the simple easy expedient of copying data into a Numbers spreadsheet on my Mac. We’d like to eventually do some more ambitious plotting, however, so let’s investigate a Clojure-based solution.

Incanter provides a broad set of Clojure-based utilities for data analysis, including matrices and linear algebra operations, statistical distributions, and so on. Built on top of JFreeChart, it implements several different kinds of plots, including histograms, and it should be an obvious candidate for displaying our length distribution histogram.

Unfortunately, because of two issues I discovered, Incanter will not work for our purposes. To make a long story short, it does not handle empty (zero) entries in histograms properly when displaying on a logarithmic scale, and it generates out-of-memory errors when given more than a million or so entries. (Coming from high energy physics, where such histograms are extremely common, these both seem like serious flaws to me, which I hope will be remedied in the next Incanter release.)

I wound up using the JFreeChart API directly, though this was very time consuming, as the API has approximately the same complexity as the entire US space program or the Large Hadron Collider. (In what follows below I elide details about false starts, descriptions of hair pulling, and expletives.)

First, I needed to convert the lengths from get-lengths into a histogram with arbitrarily many entries:

(defn make-hist
  Convert seq of input xs into a histogram of nbins bins, from xmin to
  xmax.  Discard overflows or underflows
  [xmin xmax nbins xs]
  (let [;; "base" histogram (zeros):
        zero-map (into (sorted-map)
                       (map (fn [x] [x 0]) (range nbins)))
        ;; get actual bin values for every input in xs:
        xbins (map #(int (* nbins (/ (- % xmin)
                                     (- xmax xmin))))
        ;; strip out undeflows & overflows:
        no-overflows (->> xbins
                          (remove #(< % 0))
                          (remove #(>= % nbins)))]
    ;; yield histogram as array of [ibin, height] pairs:
    (into [] (reduce #(update-in %1 [%2] inc) zero-map no-overflows))))

Supplying the minimum and maximum value ahead of time simplifies the histogramming code substantially and allows us to choose the same range for multiple plots, as will be shown below.

Armed with this transformation, the generation of the histogram plot is as follows:

(ns jenome.graphs
  (:import [ XYSeriesCollection XYSeries]
           [org.jfree.chart ChartFrame JFreeChart]
           [org.jfree.chart.plot XYPlot]
           [org.jfree.chart.axis NumberAxis]
           [org.jfree.chart.renderer.xy XYBarRenderer StandardXYBarPainter]

(defn trim-zeros 
  Convert zeros (or negatives) to small positive values to allow for
  graphing on log scale
  (map (fn [[x y]] [x (if (> y 0) y 0.0001)]) vals))

(defn draw-hist
  Draw histogram of bins as generated by make-hist
  [x-label values]
  (let [renderer (XYBarRenderer.)
        painter (StandardXYBarPainter.)
        series (XYSeries. [])
        blue (java.awt.Color. 0x3b 0x6c 0x9d)
        coll (XYSeriesCollection. series)
        y-axis (org.jfree.chart.axis.LogarithmicAxis. "Entries")
        plot (XYPlot. coll (NumberAxis. x-label) y-axis renderer)
        panel (JFreeChart. plot)
        frame (ChartFrame. "Histogram" panel)]
    (doto plot
      (.setBackgroundAlpha 0.0)
      (.setRangeGridlinesVisible false)
      (.setDomainGridlinesVisible false))
    (doto renderer
      (.setBarPainter painter)
      (.setPaint blue)
      (.setDrawBarOutline true)
      (.setOutlinePaint blue)
      (.setOutlineStroke (java.awt.BasicStroke. 1))
      (.setShadowVisible false))
    (doseq [[x y] values]
      (.add series (+ x 0.5) y))
    (.setLowerBound y-axis 0.5)
    (.setVisible (.getLegend panel) false)
    (doto frame
      (.setSize 800 250)
      (.setVisible true))))

I will omit a detailed explanation of how draw-hist works because I’m still not a JFreeChart expert (though if I do many more blog posts on this topic I may be forced to become one, however reluctantly). The principal difference with the Incanter implementation of histograms is that we provide our own set of bin heights and simply plot those. This follows a better separation of concerns anyways: our fairly simple binning function remains separate from the visual presentation of the bin positions and heights (we could, for example, add the counting of overflows and underflows – this is left as an exercise to the reader).

(trim-zeros exists to transform bin values of zero to small positive numbers to avoid taking the logarithm of zero, which is undefined.)

Armed with these tools, we can now make our first distribution:

(->> (randgenome)
     (take 1000000)
     (make-hist 0.5 60.5 60)
     (draw-hist "Repeat Lengths, random hypothesis"))
Nucleotide repetition lengths for the random hypothesis

As expected by our analysis above, this shows the falling “power-law” distribution of randomly-occuring nucleotide repetitions. Compare this with the genome for the yeast S. Cerviciae:

(->> (genome-sequence yeast)
     (make-hist 0.5 60.5 60)
     (draw-hist "Repeat Lengths, S. Cerviciae"))
Nucleotide repetition lengths for the yeast S. Cerviciae

And, for humans (here we did not wait for the result for the entire genome):

(->> (genome-sequence human)
     (take 10000000)
     (make-hist 0.5 60.5 60)
     (draw-hist "Repeat Lengths, human genome"))
Nucleotide repetition lengths for humans

The actual genome data clearly deviate from our null hypothesis of randomness, as one might expect. However, these graphs raise more questions than they answer. In particular, note that there is a sort of bimodal characteristic or extended, secondary bump in the data for humans, with hints of an outlier feature on the tail which may or may not just be a statistical fluctuation.

One complicating factor we neglected is the role of “N-blocks”, which should be eliminated without introducing artificially longer lengths; e.g. ANNNNA with the Ns removed should be considered two 1-blocks rather than a two-block AA group. This simple modification to get-lengths is left as another exercise to the reader (the change does not affect the resulting distributions).

A real biologist could no doubt tell us a lot about these distributions. Meanwhile, the latest code has been pushed to GitHub.

Later: Fun with Instaparse
Earlier: Updating the Genome Decoder