moving the lamppost

random musings of a molecular biologist turned code jockey in the era of big data and open science.

Python Gotcha: generators not always drop-in replacements for lists

I was recently bitten by this one.

Generators, Iterators and Lists

In Python, there is a data construct/concept called a generator. In brief it is a special type of function that returns a type of iterator that in most use cases behaves just like a Python list. That is, you can do stuff like this to it:

# define a simple generator
In [1]: def my_generator(count):
   ...:      for i in range(count):
   ...:          yield i
   ...: 
In [2]: my_iterator = my_generator(3)

In [3]: for each in my_iterator:
   ...:      print each
   ...: 
0
1
2

The benefit of an iterator is that it mitigates the memory footprint of the output of something like my_generator(). So when you have an idea that a list of your resulting data will be very large, it may be wise to use an iterator rather than a list. They work by generating the data on the fly rather than all at once.

Note

Though the title mentions generators specifically (this is because that’s what bit me just now) all of this pertains to all iterators. I did not mean to imply that generators were special in this regard.

Special thanks to reddit user Megatron_McLargeHuge for pointing out my sin of omission.

In our example above, my_iterator has no length as demonstrated below.

In [4]: my_iterator = my_generator(3)

In [5]: len(my_iterator)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-110da8bf0288> in <module>()
----> 1 len(my_iterator)

TypeError: object of type 'generator' has no len()

This can bite you if you don’t know (or remember) that you are dealing with a dynamically generated iterator. In my case, I was using a generator function that returns an iterator for all unique 2-way combinations of a list of values. I wanted these unique combinations to be used inside a subsequent for loop.

Here was the code that got me:

def calc_mean_ptcis(graphHandler,n_way_ortho_table):
    """
    returns list of n-way averaged PTCI values for N-way ortholog subgraphs if and only if
    all edges successfully generated non-None value PTCI results.
    """

    # dictionary of nodes/edges indexed by gene names
    node_dict = graphHandler.node_dict
    edge_dict = graphHandler.edge_dict

    graph = graphHandler.graph

    mean_ptcis = []

    # calculate all pairwise combinations of indexes
    # so that each ortho-edge of n-way orthos are obtained

    # !---- look here ----! #
    index_combos = xuniqueCombinations(range(len(n_way_ortho_table.columns)),2)

    for node_list in n_way_ortho_table.itertuples():

        node_list = node_list[1:]
        ortho_edges = []
        for i in index_combos:
            key = tuple(sorted([node_list[i[0]],node_list[i[1]]]))

            try:
                ortho_edges.append(edge_dict[key])
            except KeyError:
                break

        ptcis = [dev.get_ptci(edge) for edge in ortho_edges]

        try:
            mean_ptci = np.mean(ptcis)
            mean_ptcis.append(mean_ptci)
        except TypeError:
            pass

    return mean_ptcis

Iterators are consumed at the time of access

I expected index_combos to persist and be re-used in each iteration of the loop. However, index_combos is an iterator, so it is actually consumed in that first loop and returns nothing useful in ANY of the subsequent loops. I did not write xuniqueCombinations which is how I got bitten; I do host a gist of it here. It’s actually quite useful, and you should check it out.

The point is that I was expecting a list and got an iterator. The solution was to use a list comprehension to consume the iterator and store the output in a list. That new list is of course persistent and fixed the whole problem.

index_combos = [ x for x in xuniqueCombinations(range(len(n_way_ortho_table.columns)),2) ]

Final thoughts

I spent a good long time dissecting and re-writing all of the rest of that code before I realized what was happening. I hope that others might benefit from my pain by reading this.

Generators and iterators are awesome and help make Python the amazing language that it is, but the fact that they look like a list in many code use-cases can result in some not-so-fun debugging sessions unless you pay close attention!