moving the lamppost

random musings of a molecular biologist turned code jockey in the era of big data and open science.

Lab meeting 2013-08-21

1 Introduction

1.1 Purpose

Introduce core concepts relating to my project such as:

  1. Importance of Dengue and Malaria
  2. Transmission cycle dependant on Mosquitoes
  3. Limited effectiveness (medical/finacial) of treatment of the diseases in rural poor human communities
  4. Vector Control Strategies
  5. How Transgenesis modifies the vector control landscape
  6. Elements of Successful Transgenic Vector Control
  7. Introduction to control of transcription and importance of CRE/CRMs

2 Review of Literature Contributions and Development of Analytical Approach

2.1 Purpose

Review a selection of my contributions to the literature that have informed the approach used in this project throughout my tenure as a member of the lab and explain the methods that I have used based on the lessons learned.

  1. Sieglaff, D. H., Dunn, W. A., Xie, X. S., Megy, K., Marinotti, O., & James, A. A. (2009). Comparative genomics allows the discovery of cis-regulatory elements in mosquitoes. Proceedings of the National Academy of Sciences of the United States of America, 106(9), 3053–8. doi:10.1073/pnas.0813264106

    • Use of comparative genomics to discover putative CREs from orthologous promoter regions in mosquitoes that were able to be correlated with bloodmeal associated transcription control
    • For any specific transcription profile, typically a single motif was found to be significantly enriched in the corresponding promoter regions (modules not obvious)
  2. Bonizzoni, M., Dunn, W. A., Campbell, C. L., Olson, K. E., Dimon, M. T., Marinotti, O., & James, A. A. (2011). RNA-seq analyses of blood-induced changes in gene expression in the mosquito vector species, Aedes aegypti. BMC genomics, 12(1), 82. doi:10.1186/1471-2164-12-82

    • Provided first experience in data handling and analysis of RNA-seq big data

    • Illustrated challenges to using flanking regions as putative promoters in genomes at this level of annotation completion/quality
      • annotation contains missed first exons and other incorrect feature definitions
      • upstream genes sometimes more likely to represent exon of downstream gene
    • Nevertheless using SCOPE, putative CRMs were able to be predicted

  3. Bonizzoni, M., Dunn, W. a., Campbell, C. L., Olson, K. E., Marinotti, O., James, a. a., & Kulathinal, R. (2012). Strain Variation in the Transcriptome of the Dengue Fever Vector, Aedes aegypti. G3: Genes|Genomes|Genetics, 2(1), 103–114. doi:10.1534/g3.111.001107

    • Provided continued lessons on managing and analyzing RNAseq big data
    • Revealed that the transcriptomes of Ae. aegypti mosquitoes from distinct strains vary significantly in complexity and abundance of specific transcripts. This variation is evident in non- blood-fed mosquitoes and is enhanced after a bloodmeal.
    • differential use of paralogous genes
    • Support the need to take precautions when comparing differences in transcript abundance profiles between species that may have last shared a common ancestor as far back as 200 mya.
  4. Bonizzoni, M., Dunn, W. A., Campbell, C. L., Olson, K. E., Marinotti, O., & James, A. A. (2012). Complex Modulation of the Aedes aegypti Transcriptome in Response to Dengue Virus Infection. PloS one, 7(11), e50512. doi:10.1371/journal.pone.0050512

    • Example of how results (CRE-discovery in this case) are sensitive to the gene-sets use for analysis.
  5. Functional Genomics can be thought of as a problem of generating meaningful gene-sets

    • signal-to-noise is major challenge in generating useful gene-sets
  6. This project aims to generate meaningful gene-sets through integration of multiple data-types

    • comparative genomics (N-way 1-to-1 orthologs)
    • comparative transcriptomics (identical RNA-seq experiements on divergent Mosquito species with shared trait: bloodfeeding)
    • phylogenetics (estimated divergence since last common ancestor)
    • existing knowledge about certain bloodmeal related transcriptional control programs (20E interacting TFBS motifs)
  7. The last item presents analysis challenges that I addressed through custom graph-based analysis model (gFunc)

3 Software Development to Support Analytical Approach

3.1 Purpose

Illustrate how lessons learned in previous chapter (as well as new lessons introduced here) are addressed through custom software solutions.

  1. RNA-seq analysis, reproducibility, updating (blacktie)
  2. Integration of multiple disparate Omics scale data types (gFunc)
  3. Documentation of exactly how analyses were carried out (IPython notebooks and automated log keeping built in to blacktie)

Python Gotcha: generators not always drop-in replacements for lists

I was recently bitten by this one.

Generators, Iterators and Lists

In Python, there is a data construct/concept called a generator. In brief it is a special type of function that returns a type of iterator that in most use cases behaves just like a Python list. That is, you can do stuff like this to it:

# define a simple generator
In [1]: def my_generator(count):
   ...:      for i in range(count):
   ...:          yield i
In [2]: my_iterator = my_generator(3)

In [3]: for each in my_iterator:
   ...:      print each

The benefit of an iterator is that it mitigates the memory footprint of the output of something like my_generator(). So when you have an idea that a list of your resulting data will be very large, it may be wise to use an iterator rather than a list. They work by generating the data on the fly rather than all at once.


Though the title mentions generators specifically (this is because that’s what bit me just now) all of this pertains to all iterators. I did not mean to imply that generators were special in this regard.

Special thanks to reddit user Megatron_McLargeHuge for pointing out my sin of omission.

In our example above, my_iterator has no length as demonstrated below.

In [4]: my_iterator = my_generator(3)

In [5]: len(my_iterator)
TypeError                                 Traceback (most recent call last)
<ipython-input-5-110da8bf0288> in <module>()
----> 1 len(my_iterator)

TypeError: object of type 'generator' has no len()

This can bite you if you don’t know (or remember) that you are dealing with a dynamically generated iterator. In my case, I was using a generator function that returns an iterator for all unique 2-way combinations of a list of values. I wanted these unique combinations to be used inside a subsequent for loop.

Here was the code that got me:

def calc_mean_ptcis(graphHandler,n_way_ortho_table):
    returns list of n-way averaged PTCI values for N-way ortholog subgraphs if and only if
    all edges successfully generated non-None value PTCI results.

    # dictionary of nodes/edges indexed by gene names
    node_dict = graphHandler.node_dict
    edge_dict = graphHandler.edge_dict

    graph = graphHandler.graph

    mean_ptcis = []

    # calculate all pairwise combinations of indexes
    # so that each ortho-edge of n-way orthos are obtained

    # !---- look here ----! #
    index_combos = xuniqueCombinations(range(len(n_way_ortho_table.columns)),2)

    for node_list in n_way_ortho_table.itertuples():

        node_list = node_list[1:]
        ortho_edges = []
        for i in index_combos:
            key = tuple(sorted([node_list[i[0]],node_list[i[1]]]))

            except KeyError:

        ptcis = [dev.get_ptci(edge) for edge in ortho_edges]

            mean_ptci = np.mean(ptcis)
        except TypeError:

    return mean_ptcis

Iterators are consumed at the time of access

I expected index_combos to persist and be re-used in each iteration of the loop. However, index_combos is an iterator, so it is actually consumed in that first loop and returns nothing useful in ANY of the subsequent loops. I did not write xuniqueCombinations which is how I got bitten; I do host a gist of it here. It’s actually quite useful, and you should check it out.

The point is that I was expecting a list and got an iterator. The solution was to use a list comprehension to consume the iterator and store the output in a list. That new list is of course persistent and fixed the whole problem.

index_combos = [ x for x in xuniqueCombinations(range(len(n_way_ortho_table.columns)),2) ]

Final thoughts

I spent a good long time dissecting and re-writing all of the rest of that code before I realized what was happening. I hope that others might benefit from my pain by reading this.

Generators and iterators are awesome and help make Python the amazing language that it is, but the fact that they look like a list in many code use-cases can result in some not-so-fun debugging sessions unless you pay close attention!

Blacktie BUGFIX release:


The Bug:

In short, on a Mac pprocess complained:

"AttributeError: 'module' object has no attribute 'poll'"

when trying to set up a queue.

This caused a fatal unhandled exception that was discovered on a Macintosh. Version should restore functionality of blacktie for people experiencing this problem.

For more detail see github issue:

The Fix:

This is a quick fix that looks for that exception and continues without using pprocess if it is encountered. A more permanent fix will be included once I understand what is going on.


  • It is recommended that all users update to to avoid this issue.
  • Anyone with knowledge of this issue is encouraged to comment on the issues thread above or in the Comments section below.

What is blacktie?

blacktie is a pipeline to streamline the deployment of the popular “Tuxedo Protocol” for RNA-seq analysis. Please take a look at the Project Summary section of blacktie‘s documentation for more details.

Installing PyQt... because it’s too good for pip or easy_install.

This is a tutorial on installing PyQt and its dependency SIP. I am currently working on a Python package to integrate multiple disparate types of Omics data using a graph model. Some of my code requires ETE: a Python Environment for phylogenetic Tree Exploration which requires PyQt.

Normally, I would simply use

pip install -U package_name

to install package_name and ensure that all of its dependencies are installed and updated. However, PyQt and SIP do not use the standard file for encoding the installation procedures. They use So pip and other PyPI installer scripts will locate and download the code, but they will fail during the installation phase and complain that they can not be installed with setuptools.

This was frustrating me since I know that I had these installed on my last system. I just couldn’t remember how I did it. I must have done the whole thing manually. In any case, thanks to a post I found using the google box at, I found a slightly easier way to get the job done. I am posting it here to maybe help others, but also so that I have it documented for myself the next time I have to do this.

I am borrowing heavily from the post above but adding some extras that I ran into on my Arch Linux system.


First pass Trinity Results for Anopheles stephensi midgut RNA-seq (non-bloodfed)

Exploration of the first set of results from a Trinity run with default settings on one library of Anopheles stephensi midgut RNA-seq (As_00 lib one)

Analyze Lengths Of Assembled Contigs/Transcripts

Count all transcripts of length X:


from collections import defaultdict
from rSeq.utils.files import ParseFastA

def measure_and_count_transcripts(fasta_path):

    tx_lens = defaultdict(list)

    p = ParseFastA(fasta_path)
    seqs = p.to_dict()

    for name,seq in seqs.iteritems():

    print "The number of different transcript sizes encountered: %s" % (len(tx_lens))
    return tx_lens

Now I will plot the size data in a number of different ways.


# Prep to plot a Histogram of the numbers of transcripts in each length bin

trinity_tx_lens = measure_and_count_transcripts('/home/gus/Dropbox/common/projects/trinity_As/trinity_As_00_0_test/Trinity.fasta')

trinity_lengths = []

for item in trinity_tx_lens.iteritems():
    trinity_lengths.extend([item[0]] * len(item[1]))

The number of different transcript sizes encountered: 4632


hist(trinity_lengths, bins=50, histtype='stepfilled',log=1)
xlabel('transcript length')
ylabel('number of transcripts')
<matplotlib.text.Text at 0x3ecb2d0>


xlabel('Trinity Contigs (sorted by length)')
ylabel('Length of Contig')
<matplotlib.text.Text at 0x2ad3290>

Now I calculate the N50_length and N50_index for the trinity assembly.

N50_index: the number of largest contigs(transcripts) whose summed lengths equal at least 50% of the sum of ALL contigs (smaller numbers are ‘better’)

N50_length: the length of the last added contig from above (larger numbers are ‘better’)


def calc_N50s(lengths):
    total_length = sum(lengths)
    print "total_length:\t\t%s" % (total_length)

    running_total = 0
    collected_lengths = []
    n50_i = None

    for length in reversed(lengths):
        running_total += length

        if running_total >= total_length * 0.5:
            n50_i = len(collected_lengths)
            n50_l = length

    print "running_total:\t\t%s" % (running_total)
    print "N50_index:\t\t%s of %s" % (n50_i,len(lengths))
    print "N50_length:\t\t%s" % (n50_l)
    print "median contig length\t%s" % (median(lengths))


total_length:               31858558
running_total:              15931068
N50_index:                  3927 of 25167
N50_length:                 2469
median contig length        647.0