BASHing biology

Just a quick post from work… I’ve recently been using UNIX stalwarts sed, tr and grep quite regularly to manipulate DNA sequence files, so I thought I’d post a few handy little commands for your perusal.

To strip out all the annotations in a FASTA file, leaving just the bare, unseparated sequences:
grep -v '^>' sequence.fasta | tr -d "[:space:]"
To then output the total length of the sequences, simply append | wc -c.
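
As a quick sanity check, both commands can be tried on a throwaway file. The sketch below assumes a made-up two-record file called demo.fasta and wraps the pipeline in a helper named fasta_total_length; both names are purely illustrative and not part of any real tool.

```shell
# Build a tiny two-record FASTA file (hypothetical demo data).
printf '>seq1 first record\nACGTAC\nGT\n>seq2 second record\nGGCC\n' > demo.fasta

# Illustrative helper: total number of bases across all records in a FASTA file.
fasta_total_length() {
    # Drop header lines, delete every whitespace character (including
    # newlines), then count the characters that remain.
    grep -v '^>' "$1" | tr -d '[:space:]' | wc -c
}

fasta_total_length demo.fasta
```

The demo file holds 6 + 2 + 4 bases, so the count should come out as 12 (some versions of wc pad the number with leading spaces).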

To calculate the total number of sequences contained in a FASTA file:
grep -c '^>' sequence.fasta

To read out a DNA sequence, capitalise it and group the bases into threes (i.e. codons):
sed 's/.../& /g' sequence.seq | tr "[:lower:]" "[:upper:]"
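
To see what the grouping does with a sequence whose length isn't a multiple of three, here is a small sketch. The helper name codons is an illustrative choice, and it reads from standard input rather than from a file.

```shell
# Illustrative helper: read a sequence on stdin, print it upper-cased in triplets.
codons() {
    # Insert a space after every run of three characters, then upper-case.
    # A trailing partial codon (one or two bases) is left as-is at the end.
    sed 's/.../& /g' | tr '[:lower:]' '[:upper:]'
}

printf 'atggcctaag\n' | codons   # prints "ATG GCC TAA G"
```

When the length is an exact multiple of three, the output ends with a trailing space, since sed adds one after every complete triplet.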

To obtain the complement of a sequence, use tr as follows:
cat sequence.seq | tr TACGtacg ATGCATGC
For the reverse complement, append | rev.
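
Note that the mapping above also upper-cases the result, because the lower-case bases are translated to upper-case complements. If you'd rather keep the original case, a case-preserving variant is a small change. The helper name complement below is just an illustrative choice.

```shell
# Illustrative helper: complement a DNA sequence on stdin, preserving case.
complement() {
    # Each base maps to its partner in the same case; any character not
    # listed (N, gaps, newlines) passes through unchanged.
    tr 'ACGTacgt' 'TGCAtgca'
}

printf 'ATGCatgcN\n' | complement   # prints "TACGtacgN"
```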

I’ve also found the pbcopy and pbpaste utilities (available on macOS) to be particularly useful. To grab the reverse complement of a sequence copied from, say, a web page, just type the following:
pbpaste | tr TACGtacg ATGCATGC | rev | pbcopy
This takes the sequence currently on the clipboard and replaces it with its reverse complement. There’s one caveat: rev reverses each line of its input separately, not the input as a whole. So if you’ve copied a sequence containing line breaks, you’ll need to add tr -d "[:space:]" to the pipeline before rev to strip out the white space first.
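
One way to avoid tripping over line breaks is to fold the white-space removal into a small shell function. This is only a sketch, and the name revcomp is an illustrative choice.

```shell
# Illustrative helper: reverse complement of a (possibly multi-line) sequence on stdin.
revcomp() {
    # 1. Delete all whitespace so the input becomes a single line.
    # 2. Complement each base (this mapping also upper-cases the result).
    # 3. Reverse the resulting single line.
    tr -d '[:space:]' | tr 'TACGtacg' 'ATGCATGC' | rev
}

printf 'acgt\nAAC\n' | revcomp   # prints "GTTACGT"
```

With that defined, the clipboard version becomes pbpaste | revcomp | pbcopy.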

Bit Wrangling

A meandering discourse on binary coercion
