Nucleotide Repetition Lengths

code clojure

John Jacobsen
Sunday, November 3, 2013


Earlier post Updating the Genome Decoder code clojure

Later post Fun with Instaparse code clojure

After a long hiatus, let’s resume our foray into amateur genomics with Clojure. Along the way we keep encountering elegant expressions of Clojure’s power, as well as a few additional wrinkles in the Clojure ecosystem (in this case, Incanter).

In our last post we upgraded our genome decoder and used it to show frequencies of the four nucleotides, illustrating Chargaff’s Rule.

Another simple question to be asked of the data is, What is the distribution of nucleotide repetition lengths? In other words, if we represent the number of times any nucleotide is repeated as a histogram, then AAGGCATTTT would yield two entries for 1, two entries for 2, zero for 3, and one for 4. What does this distribution look like for the entire human or yeast genome?

As a “null hypothesis”, consider the likelihood of getting \(N\) A’s when drawing at random from A, G, C or T. The probability for the first selection to be A is 25%; for the first two draws to be AA the probability is \(P(AA) = {1 \over {4^2}} = 0.0625\), and for \(N\) consecutive draws, \(P(A \times N) = 4^{-N}\).
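As a quick numerical check of this arithmetic, here is a minimal sketch that assumes independent, uniform draws from the four letters. The names p-all-same and p-run-length are made up for illustration. The second function gives the chance that a run has exactly length N (the letter repeats N-1 times, then changes), which is the quantity a repetition-length histogram actually counts.

```clojure
(defn p-all-same
  "Probability that n independent uniform draws from 4 letters
  all equal one particular letter: 4^-n."
  [n]
  (Math/pow 0.25 n))

(defn p-run-length
  "Probability that a run has exactly length n (n >= 1): the letter
  repeats n-1 times (1/4 each) and then changes (3/4)."
  [n]
  (* 0.75 (Math/pow 0.25 (dec n))))

(p-all-same 2)    ;;=> 0.0625
(p-run-length 1)  ;;=> 0.75
(p-run-length 2)  ;;=> 0.1875
```

Under this assumption each extra repeat costs a factor of four, so the expected histogram falls off geometrically.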

You can easily simulate (“Monte Carlo”) such a situation using a purely random sequence:

(defn randgenome []
  (repeatedly #(rand-nth [:A :G :C :T])))

This function yields a sequence like (:A :A :C :A :T :G :G :T :G :A :C :C ...).

To convert this into repetition lengths, we can first divide the sequence up into repeating elements using partition-by and the identity function:

(partition-by identity (randgenome))
;;=> ((:A :A) (:C) (:A) (:T) (:G :G) (:T) (:G) (:A) (:C :C) ...)

Extracting the lengths is then trivial: just map count over the sequences:

(defn get-lengths [s]
  (->> s
      (partition-by identity)
      (map count)))

(get-lengths (randgenome))
;;=> (2 1 1 1 2 1 1 1 2 ...)

(Note: running these functions unmodified will hang your REPL, since randgenome yields an infinite sequence! Wrap them with take if needed.)
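For instance, a REPL-safe variant might look like the sketch below. It repeats the two definitions so that it can be pasted on its own, bounds the infinite sequence with take, and then checks get-lengths against the AAGGCATTTT example from earlier (a string works here because partition-by accepts any seqable).

```clojure
(defn randgenome []
  (repeatedly #(rand-nth [:A :G :C :T])))

(defn get-lengths [s]
  (->> s
       (partition-by identity)
       (map count)))

;; Safe at the REPL: only the first ten run lengths are realized.
(take 10 (get-lengths (randgenome)))

;; Deterministic check on a finite input:
(get-lengths "AAGGCATTTT")
;;=> (2 2 1 1 4)

(frequencies (get-lengths "AAGGCATTTT"))
;;=> {2 2, 1 2, 4 1}
```

The frequencies map reads as length to number of entries: two runs of length 1, two of length 2, none of length 3, and one of length 4.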

Histogramming the data

The plots in my previous posts were made by the simple expedient of copying data into a Numbers spreadsheet on my Mac. We’d like to eventually do some more ambitious plotting, however, so let’s investigate a Clojure-based solution.

Incanter provides a broad set of Clojure-based utilities for data analysis, including matrices and linear algebra operations, statistical distributions, and so on. Built on top of JFreeChart, it implements several different kinds of plots, including histograms, and it should be an obvious candidate for displaying our length distribution histogram.

Unfortunately, because of two issues I discovered, Incanter will not work for our purposes. To make a long story short, it does not handle empty (zero) entries in histograms properly when displaying on a logarithmic scale, and it generates out-of-memory errors when given more than a million or so entries. (Coming from high energy physics, where such histograms are extremely common, these both seem like serious flaws to me, which I hope will be remedied in the next Incanter release.)

I wound up using the JFreeChart API directly, though this was very time consuming, as the API has approximately the same complexity as the entire US space program or the Large Hadron Collider. (In what follows below I elide details about false starts, descriptions of hair pulling, and expletives.)

First, I needed to convert the lengths from get-lengths into a histogram with arbitrarily many entries:

(defn make-hist
  "Convert seq of input xs into a histogram of nbins bins, from xmin to
  xmax.  Discard overflows or underflows."
  [xmin xmax nbins xs]
  (let [;; "base" histogram (zeros):
        zero-map (into (sorted-map)
                       (map (fn [x] [x 0]) (range nbins)))
        ;; get actual bin values for every input in xs:
        xbins (map #(int (* nbins (/ (- % xmin)
                                     (- xmax xmin))))
                   xs)
        ;; strip out underflows & overflows:
        no-overflows (->> xbins
                          (remove #(< % 0))
                          (remove #(>= % nbins)))]
    ;; yield histogram as array of [ibin, height] pairs:
    (into [] (reduce #(update-in %1 [%2] inc) zero-map no-overflows))))

Supplying the minimum and maximum value ahead of time simplifies the histogramming code substantially and allows us to choose the same range for multiple plots, as will be shown below.
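To make the binning arithmetic concrete, here is a compact, self-contained sketch of the same idea on a tiny input. It is not the code used for the plots: bin-index and hist-sketch are illustrative names, the counting is done with frequencies, and Math/floor is used so that values just below xmin are discarded rather than truncated into bin 0.

```clojure
(defn bin-index
  "Index of the bin that x falls into, for nbins equal-width bins
  spanning [xmin, xmax). May be negative or >= nbins when x is out of range."
  [xmin xmax nbins x]
  (int (Math/floor (double (* nbins (/ (- x xmin)
                                       (- xmax xmin)))))))

(defn hist-sketch
  "Vector of [ibin height] pairs; out-of-range values are dropped."
  [xmin xmax nbins xs]
  (let [counts (->> xs
                    (map #(bin-index xmin xmax nbins %))
                    (filter #(< -1 % nbins))
                    frequencies)]
    (mapv (fn [i] [i (get counts i 0)]) (range nbins))))

;; Lengths 1..4 fall into bins 0..3; the 7 is an overflow and is discarded.
(hist-sketch 0.5 4.5 4 [2 2 1 1 4 7])
;;=> [[0 2] [1 2] [2 0] [3 1]]
```

With bin edges at half-integers (0.5, 1.5, ...), each integer length sits in the middle of its own bin, which is why the later calls use 0.5 and 60.5 as the range for 60 bins.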

Armed with this transformation, the generation of the histogram plot is as follows:

(ns jenome.graphs
  (:import [org.jfree.data.xy XYSeriesCollection XYSeries]
           [org.jfree.chart ChartFrame JFreeChart]
           [org.jfree.chart.plot XYPlot]
           [org.jfree.chart.axis NumberAxis]
           [org.jfree.chart.renderer.xy XYBarRenderer StandardXYBarPainter]))

(defn trim-zeros
  "Convert zeros (or negatives) to small positive values to allow for
  graphing on log scale."
  [vals]
  (map (fn [[x y]] [x (if (> y 0) y 0.0001)]) vals))

(defn draw-hist
  "Draw histogram of bins as generated by make-hist."
  [x-label values]
  (let [renderer (XYBarRenderer.)
        painter (StandardXYBarPainter.)
        series (XYSeries. [])
        blue (java.awt.Color. 0x3b 0x6c 0x9d)
        coll (XYSeriesCollection. series)
        y-axis (org.jfree.chart.axis.LogarithmicAxis. "Entries")
        plot (XYPlot. coll (NumberAxis. x-label) y-axis renderer)
        panel (JFreeChart. plot)
        frame (ChartFrame. "Histogram" panel)]
    (doto plot
      (.setBackgroundAlpha 0.0)
      (.setRangeGridlinesVisible false)
      (.setDomainGridlinesVisible false))
    (doto renderer
      (.setBarPainter painter)
      (.setPaint blue)
      (.setDrawBarOutline true)
      (.setOutlinePaint blue)
      (.setOutlineStroke (java.awt.BasicStroke. 1))
      (.setShadowVisible false))
    (doseq [[x y] values]
      (.add series (+ x 0.5) y))
    (.setLowerBound y-axis 0.5)
    (.setVisible (.getLegend panel) false)
    (doto frame
      (.setSize 800 250)
      (.setVisible true))))

I will omit a detailed explanation of how draw-hist works because I’m still not a JFreeChart expert (though if I do many more blog posts on this topic I may be forced to become one, however reluctantly). The principal difference from the Incanter implementation of histograms is that we provide our own set of bin heights and simply plot those. This is a better separation of concerns anyway: our fairly simple binning function remains separate from the visual presentation of the bin positions and heights. (We could, for example, add the counting of overflows and underflows; this is left as an exercise for the reader.)

(trim-zeros exists to transform bin values of zero to small positive numbers to avoid taking the logarithm of zero, which is undefined.)
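A small standalone illustration of that transformation follows. The function is named trim-zeros-sketch to make clear it is only a sketch of the idea, and the input pairs are made up.

```clojure
(defn trim-zeros-sketch
  "Replace non-positive bin heights with a small positive placeholder
  so that every height has a finite logarithm."
  [bins]
  (map (fn [[x y]] [x (if (pos? y) y 0.0001)]) bins))

(trim-zeros-sketch [[0 2] [1 0] [2 1]])
;;=> ([0 2] [1 1.0E-4] [2 1])
```

Since draw-hist sets the lower bound of the logarithmic axis to 0.5, bins holding the placeholder value presumably fall below the visible range and so still look empty on the plot.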

Armed with these tools, we can now make our first distribution:

(->> (randgenome)
     (take 1000000)
     get-lengths
     (make-hist 0.5 60.5 60)
     trim-zeros
     (draw-hist "Repeat Lengths, random hypothesis"))


Figure 1: Nucleotide repetition lengths for the random hypothesis

As expected from our analysis above, this shows the exponentially falling distribution of randomly occurring nucleotide repetitions (a geometric fall-off, which is a straight line on a logarithmic scale). Compare this with the genome for the yeast S. cerevisiae:

(->> (genome-sequence yeast)
     get-lengths
     (make-hist 0.5 60.5 60)
     trim-zeros
     (draw-hist "Repeat Lengths, S. cerevisiae"))


Figure 2: Nucleotide repetition lengths for the yeast S. cerevisiae

And, for humans (here we did not wait for the result for the entire genome):

(->> (genome-sequence human)
     (take 10000000)
     get-lengths
     (make-hist 0.5 60.5 60)
     trim-zeros
     (draw-hist "Repeat Lengths, human genome"))


Figure 3: Nucleotide repetition lengths for humans

The actual genome data clearly deviate from our null hypothesis of randomness, as one might expect. However, these graphs raise more questions than they answer. In particular, note that there is a sort of bimodal characteristic or extended, secondary bump in the data for humans, with hints of an outlier feature on the tail which may or may not just be a statistical fluctuation.

One complicating factor we neglected is the role of “N-blocks”, which should be eliminated without introducing artificially longer lengths; e.g. ANNNNA with the Ns removed should be considered two 1-blocks rather than a two-block AA group. This simple modification to get-lengths is left as another exercise to the reader (the change does not affect the resulting distributions).
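One possible way to approach that exercise is sketched below. It assumes the unknown bases appear in the sequence as the keyword :N (the decoder’s actual representation may differ), and get-lengths-skipping-n is a hypothetical name. The idea is to partition first and drop the N runs afterwards, so the neighbors of an N-block are never merged.

```clojure
(defn get-lengths-skipping-n
  "Run lengths of a nucleotide seq, ignoring runs of :N without
  joining the runs on either side of them."
  [s]
  (->> s
       (partition-by identity)
       (remove #(= :N (first %)))
       (map count)))

;; ANNNNA: two separate 1-blocks, as desired.
(get-lengths-skipping-n [:A :N :N :N :N :A])
;;=> (1 1)

;; Removing the Ns *before* partitioning would wrongly merge the As:
(->> [:A :N :N :N :N :A]
     (remove #{:N})
     (partition-by identity)
     (map count))
;;=> (2)
```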

A real biologist could no doubt tell us a lot about these distributions. Meanwhile, the latest code has been pushed to GitHub.
