My favorite piece of code, written when I worked at Facebook, is a function that takes no arguments, returns nothing, and has no side-effects. Yet, it was still used in production. Here it is:
def _f():
    value
That’s it. That’s literally the whole body of the function.
Let me step back a bit and tell the story of why I needed to write this function in the first place; eventually it will make sense how it was used.
When I joined Facebook back in early 2014, my first project was to migrate pipelines from a soon-to-be-deprecated framework, called Databee, to a newer one, called Dataswarm. The old framework consisted of more than a thousand pipelines distributed across dozens of namespaces, each one with its logic defined in two parts:
- a Hive query; and
- a Python lambda function, applied to each row of the query's output.
Each pipeline would run the Hive query, store its output in a staging table, and process each row with the lambda function, storing the result in a MySQL table. This was used to compute aggregate metrics for different products, usually by a small number of dimensions — two, unless people got creative.
(And they were creative, but I’ll get to that in a bit.)
Since we had more than a thousand pipelines, I would have to automate the migration process. Migrating the Hive queries to run in the new framework was trivial: I could just use an operator designed to run Hive queries. Loading data to MySQL was also trivial, using an operator that moved data from Hive to MySQL. But I didn’t know how I would apply the Python lambdas to each row.
As I quickly discovered, it was possible to run Python code on individual rows from a Hive table in the new framework. It had a transformer module that allowed you to run custom Python functions on each row of your input, which was exactly what I needed.
It seemed perfect: a lambda is a custom function, a nameless one. So all I had to do was write code that automatically generated pipelines, reading from one Python file (in Databee) and writing to another one (in Dataswarm). It was all string manipulation. Tedious — sure — and full of corner cases, but still string manipulation.
But I quickly ran into a problem: the module only worked for functions defined at the top level of the file, and with a name. You see, the way this worked was: the Dataswarm pipeline would send the name of the module and the function to a transformer, saying "run this function", and the transformer would import the module and call the function.
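Conceptually, the transformer's side of that protocol boils down to importing by name and calling, something like the sketch below (run_transformer and its signature are my invention for illustration, not Dataswarm's actual API):

import importlib

def run_transformer(module_name, function_name, rows):
    # Import the module by name, look up the top-level function,
    # and apply it to each input row.
    module = importlib.import_module(module_name)
    func = getattr(module, function_name)
    return [func(row) for row in rows]

This import-by-name scheme is exactly why the function had to live at the top level of a module with a name: getattr can only find it if it is bound to one.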
Remember when I said lambdas are nameless? Yeah, this would not work for lambdas.
At this point I thought to myself, "How hard can this be? I’m a Python programmer, I’m just going to read the source code and make it work with lambdas." So I took a look at the transformer class. It said:
"Of course, that'd be a bit too easy. Hive won't run the transformer command through the shell, so we have to add a 'bash -c' around it. Now we encounter the Boss Demon, the Prince of Hell, Lord Quot-shel-escap himself. We'll use a base64 charm to disguise ourselves as a big pile of crap and thus escape His Dark Eye."
That sounded scary, and I started to realize it wasn’t going to be as easy as I thought. But it seemed doable: for regular functions we could just pass the module and function name, and for lambdas we could just serialize the function to bytecode and send that instead. The transformer would then either import the function or deserialize the lambda, and apply the resulting function to each row.
How do you serialize things in Python? The traditional way back then was to use the pickle[archived] module. Although unsafe if you don’t trust the source, it allows you to reliably serialize and deserialize objects even across different Python versions:
>>> import pickle
>>> pickle.dumps("Hello, World!")
"S'Hello, World!'\np0\n."
>>> pickle.loads("S'Hello, World!'\np0\n.")
'Hello, World!'
There’s just one problem:
>>> pickle.dumps(lambda: "Hello, World!")
Traceback (most recent call last):
  ...
pickle.PicklingError: Can't pickle <function <lambda> at 0x103fe3c08>: it's not found as __main__.<lambda>
Lambdas cannot be serialized using pickle.
There was another option left: Python also has a module called marshal[archived], used for reading and writing the "pseudo-compiled" pyc files. Unlike pickle, the marshal format might change between versions, but it should be stable enough for our use case... right? Or would we be running different versions of Python in Dataswarm and in the transformer?
Turns out it didn’t matter:
>>> import marshal
>>> marshal.dumps(lambda: "Hello, World!")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: unmarshallable object
I discovered lambdas could not be marshalled either.
"There are two problems when serializing lambdas: first, they cannot be serialized, and second, their closures cannot be serialized."
Why can’t we serialize lambdas? A lambda function f is composed of three parts:
- its bytecode (f.func_code);
- its default argument values (f.func_defaults); and
- its closure (f.func_closure).

The first two can be serialized using marshal, but the third one cannot. In Python, a closure is represented by a tuple of cell objects[archived], which are not serializable. But I discovered I could extract the value of a cell object, and this allowed me to serialize a tuple of values instead of a tuple of cells! It worked!
Later, I discovered there was just one simple gotcha: the values themselves could be lambdas. Remember I mentioned creative users? A lot of people would define lambdas in their pipelines and use them inside the main lambda, making them present in the closure. This was easy to fix: I just had to recursively serialize the values and all was fine.
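The helper doing that extraction, the get_values used in the snippet below, might have looked roughly like this (my reconstruction in Python 2, matching the func_* attributes used throughout this post):

def get_values(closure):
    # A closure is a tuple of cell objects; a cell can't be marshalled,
    # but its value, exposed via cell_contents, often can.
    if closure is None:
        return None
    # The production version also recursively serialized any value
    # that was itself a lambda, per the gotcha above.
    return tuple(cell.cell_contents for cell in closure)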
With this, I was able to serialize lambdas by marshalling a tuple with these three parts:
from marshal import dumps

def serialize_lambda(f):
    # Marshal the bytecode, the default arguments, and the closure's values.
    return dumps((f.func_code, f.func_defaults, get_values(f.func_closure)))
This should be trivial to deserialize, returning the original lambda... right?
It’s pretty straightforward to build a lambda in Python from the serialized tuple, using the types[archived] module. All we need to do is:
from types import LambdaType

# code, globals dict, name, default arguments, closure
f = LambdaType(func_code, globals(), '<lambda>', func_defaults, func_closure)
Here, func_code and func_defaults can be trivially unmarshalled from the first two elements in the tuple, since they can be marshalled directly. The only problem is that we need to build func_closure back from the tuple of values, i.e., we need to convert the values back into cell objects.
And how does one create a cell object in Python from a given value?
At this point I discovered it’s simply not possible. There’s no Python builtin that will return the cell object I needed in order to build func_closure. You can read and manipulate cell objects from a given function — as we did when serializing the lambda — you just can’t create them from a value.
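Both halves are easy to see in a Python 2 session (the exact wording of the error may vary by version):

>>> def outer(x):
...     return lambda: x
...
>>> cell = outer(42).func_closure[0]
>>> cell.cell_contents
42
>>> type(cell)()
Traceback (most recent call last):
  ...
TypeError: cannot create 'cell' instances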
I thought all my work had been in vain and that I would have to backtrack and find a different solution.
Google gave me no answers.
The documentation[archived] for cell objects said:
"Cell objects are not likely to be useful elsewhere."
I then had an insight: even though I couldn’t create cell objects, I could still create functions with closures! So all I had to do was create a function whose closure captured the value I wanted to put in the cell object:
def make_cell(value):
    def _f():
        value  # merely referencing value captures it in _f's closure
    return _f.func_closure[0]
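Defining _f inside make_cell forces Python to wrap value in a cell so that the inner function can reference it; we never call _f, we only steal its closure:

>>> cell = make_cell(42)
>>> cell.cell_contents
42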
And here it is, my infamous function _f: used in production even though it’s never called, and even though it has no side-effects. Probably the smallest useful Python function, and virtually bug-free.
It was the last piece needed to deserialize lambdas, allowing me to migrate 1300+ pipelines automatically without changing anything in their logic.
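Put together, the deserializer could have looked roughly like this (my reconstruction, not the original Dataswarm code, again in Python 2):

from marshal import loads
from types import LambdaType

def deserialize_lambda(data):
    # Reverse of serialize_lambda: unmarshal the tuple, rebuild the
    # cells with make_cell, and assemble a new function object.
    func_code, func_defaults, values = loads(data)
    func_closure = None
    if values is not None:
        func_closure = tuple(make_cell(v) for v in values)
    return LambdaType(func_code, globals(), '<lambda>', func_defaults, func_closure)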