Validating the Genome Decoder

John Jacobsen
Sunday, July 7, 2013

Earlier post A Two Bit Decoder code clojure

Later post Getting Our Hands Dirty (with the Human Genome) code clojure

Today we’ll validate the genome decoder we described yesterday, once again with our friend the yeast Saccharomyces cerevisiae (you may want to enjoy a slice of freshly-baked bread and a stein of Pilsner with this post).

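We are aided in this case by the availability of the sacCer3 genome in both 2bit and FASTA formats. We can get the FASTA version, chromFa.tar.gz, from the same place we got the 2bit file. The commands below assume the archive has been downloaded into the new directory before it is unpacked: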

mkdir /tmp/sacCer3_fasta
cd /tmp/sacCer3_fasta
tar xvzf chromFa.tar.gz

There is one FASTA file per sequence. Each starts with a header line (a > followed by the sequence name), followed by the bases in lines of 50.

Back in the REPL, we can now spit out our own copy in the same format (carrying over yeast and other functions and vars from the previous post).

First we need a new directory for the FASTA files we’ll generate:

(.mkdir (clojure.java.io/file "/tmp/decoded"))

Then we convert the keywords in our sequence to strings:

(defn genome-str
  "Convert e.g. [:A :G :T :C] to \"AGTC\""
  [s]
  (->> s
       (map name)
       (apply str)))
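
To see what the two steps in that pipeline do on their own, here is a small REPL sketch. The var `bases` is a made-up example value, not part of the decoder:

```clojure
;; A tiny stand-in for decoded genome data (hypothetical example value).
(def bases [:A :G :T :C])

;; `name` turns each keyword into its string form...
(map name bases)
;; => ("A" "G" "T" "C")

;; ...and `apply str` concatenates those strings into one.
(apply str (map name bases))
;; => "AGTC"
```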

A simple function will spit out the files, given a seq:

(defn write-seq
  "Write a (potentially very long) sequence of lines to a text file"
  [filename s]
  (with-open [wrt (clojure.java.io/writer filename)]
    (doseq [x s]
      (.write wrt (str x "\n")))))

And now for the actual converter:

(doseq [{:keys [name dna-offset dna-size]} (sequence-headers yeast)]
  (let [fname (str "/tmp/decoded/" name ".fa")]
    (write-seq fname
               (cons (str ">" name)
                     (->> (genome-sequence yeast dna-offset dna-size)
                          (partition-all 50)
                          (map genome-str))))))
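
The line-building part of that converter can be tried in isolation on a toy sequence. The sketch below is only an illustration: `fasta-lines` is a hypothetical helper (not one of the decoder's functions), and the line width is a parameter so the example fits on screen.

```clojure
(defn fasta-lines
  "Hypothetical helper: a header line followed by the bases
   in rows of at most `width` characters."
  [seq-name bases width]
  (cons (str ">" seq-name)
        (->> bases
             (partition-all width)            ; keeps the short final chunk
             (map #(apply str (map name %))))))

(fasta-lines "chrTest" [:A :C :G :T :A :C :G] 3)
;; => (">chrTest" "ACG" "TAC" "G")
```

Note the use of partition-all rather than partition: plain partition would silently drop the last, incomplete row, and the diff below would fail for any sequence whose length is not a multiple of 50.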

Did it work?

cd /tmp/sacCer3_fasta
for f in *.fa; do diff "$f" "/tmp/decoded/$f"; done

No output – it succeeded! This builds more confidence that we didn’t screw anything up in the decoder. (There are also a few other hard-coded unit tests in test_core.clj.)
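
If you would rather stay in the REPL, a rough equivalent of that check might look like the following. This is a sketch, not what was actually run for this post: `same-lines?` is a hypothetical name, and because it compares line by line it will not notice a difference in the final trailing newline the way diff would.

```clojure
(require '[clojure.java.io :as io])

(defn same-lines?
  "Hypothetical helper: true when two text files contain
   the same sequence of lines."
  [path-a path-b]
  (with-open [a (io/reader path-a)
              b (io/reader path-b)]
    ;; = walks both lazy line seqs and stops at the first mismatch;
    ;; the comparison finishes before with-open closes the readers.
    (= (line-seq a) (line-seq b))))

;; Usage, assuming a file of this name exists in both directories:
;; (same-lines? "/tmp/sacCer3_fasta/chrI.fa" "/tmp/decoded/chrI.fa")
```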

Comparing file sizes, the FASTA files are about 4 times larger than the original 2bit file. If you tar and compress both versions, the FASTA files are still about 30% larger. It is left as an exercise to the reader (with some spare hard disk) to do the same comparison with the human genome. (Though larger, the FASTA files would clearly be simpler to work with, and these days 4 GB isn’t too terribly much data; nevertheless, we’ll continue to use the 2bit file and decoder for our explorations.)
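
The factor of about 4 is what a back-of-the-envelope estimate predicts. The sketch below assumes the 2bit file packs four bases into each byte and ignores its header, index, and N/mask blocks; for FASTA it assumes one byte per base plus one newline per line of 50, ignoring the header line. Both function names are made up for this illustration.

```clojure
(defn fasta-bytes
  "Estimated FASTA size: one byte per base plus one newline per line."
  [n-bases line-width]
  (+ n-bases
     (quot (+ n-bases (dec line-width)) line-width))) ; number of lines, rounded up

(defn twobit-bytes
  "Estimated packed size: four bases per byte, rounded up."
  [n-bases]
  (quot (+ n-bases 3) 4))

(double (/ (fasta-bytes 1000000 50)
           (twobit-bytes 1000000)))
;; => 4.08
```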

Now we are ready to begin playing with the actual data – starting in the next post.
