Lazy Properties in Python Using Descriptors
This is a bit of a side tangent from my normal at-least-vaguely-embedded-related articles, but I wanted to share a moment of enlightenment I had recently about descriptors in Python. The easiest way to explain a descriptor is as a way to outsource attribute lookup and modification.
Python has a bunch of “magic” methods that are hooks into various object-oriented mechanisms that let you do all sorts of ridiculously clever things. Whether or not they’re a good idea is another story.
For example, we can override string conversion with __str__, and we can override the + behavior with __add__, and so forth:
    class Excessive(object):
        def __init__(self, value):
            self.value = value
        def __add__(self, other):
            return Excessive(self.value + other.value + 1)
        def __str__(self):
            return "Excessive(%d)" % self.value

    a = Excessive(10)
    b = Excessive(20)
    print("a=%s" % a)
    print("b=%s" % b)
    print("a+b=%s" % (a+b))

    a=Excessive(10)
    b=Excessive(20)
    a+b=Excessive(31)
Those are the easy ones. Write a custom __getattr__ or __getattribute__ implementation and you’re starting to mess around with some real black magic.
Anyway, our topic today is descriptors, and to understand what they do, you need to understand how attribute lookup in Python works. If I have an object like a = Excessive(10) and I evaluate a.value, there’s some secret weird stuff that goes on behind the scenes. For many objects, value is just a key in a dictionary inside the object, which you can actually access with __dict__:
    print(a.__dict__)
    print(a.value)

    {'value': 10}
    10
In the Excessive class, the value attribute gets set in the class’s __init__ method, and I can access it as .value. Nothing fancy there. Just a plain old entry in a dictionary.
You may have also seen properties in Python, which are intended to act like properties in other dynamic languages, namely things that look like plain attributes but are actually controlled by getter and setter functions.
    class Plain(object):
        def __init__(self, value):
            self.real_value = value
        def value(self):
            return self.real_value
        def set_value(self, newvalue):
            self.real_value = newvalue

    class Exaggerated(object):
        def __init__(self, value):
            self.real_value = value
        @property
        def value(self):
            return 10*self.real_value
        @value.setter
        def value(self, newvalue):
            raise ValueError("Denied!")

    p = Plain(50)
    print("p.value=%d" % p.value())
    p.set_value(27)
    print("p.value=%d" % p.value())

    e = Exaggerated(50)
    print("e.value=%d" % e.value)
    e.value = 1234

    p.value=50
    p.value=27
    e.value=500

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    in ()
         24 e = Exaggerated(50)
         25 print("e.value=%d" % e.value)
    ---> 26 e.value = 1234

    in value(self, newvalue)
         15     @value.setter
         16     def value(self, newvalue):
    ---> 17         raise ValueError("Denied!")
         18
         19 p = Plain(50)

    ValueError: Denied!
The Plain class above doesn’t use properties, but uses plain Python methods to get and set the object’s value, which is stored in the real_value attribute. You call .value() to read the value, and you call .set_value() to change it.
But people are lazy, and those two parentheses are just infinitesimally harder to type and read, so we have the property decorator, which is that @property thing. It takes the function you give it and attaches it to the class under the function’s name, so that reading that attribute on any instance calls the function. With the Exaggerated class above, you just evaluate .value, and it computes the result on demand. Properties can have setters (or, less frequently, deleters) to control write access to a property, so when I assign e.value = 1234 the setter function gets called and, in the case of Exaggerated, it raises an error.
You can learn to use property fairly easily without really knowing how it works. Under the hood it uses something called a descriptor, which is really just a Python object that has a __get__ method (or a __set__ or __delete__ method). The way Python evaluates object attributes is a little quirky: when you look up an attribute, it checks whether the class has an attribute of that name, and whether that class attribute has a __get__ method. If it doesn’t, then the attribute is just returned and we have the plain behavior. If the __get__ method exists, then it’s called, and whatever it returns is the result of the attribute access.
    class Hijacker(object):
        """ a descriptor that hijacks a particular attribute """
        def __init__(self, attrname):
            self.attrname = attrname
        def __get__(self, instance, owner=None):
            if instance is None:
                return self
            return getattr(instance, self.attrname) + " (hijacked)"
        def __repr__(self):
            return "Hijacker!"

    class Victim(object):
        """ a class that uses a descriptor """
        def __init__(self, name, title):
            self._name = name
            self._title = title
        name = Hijacker('_name')
        title = Hijacker('_title')

    v = Victim("James","pilot")
    print(v.name)
    print(v.title)
    print(Victim.name)

    James (hijacked)
    pilot (hijacked)
    Hijacker!
Again, a descriptor is a way to outsource attribute lookup and modification. In the above listing, Hijacker is a descriptor, and it serves to look up a particular attribute from another object and append " (hijacked)" to the result. In the class Victim, the class attributes Victim.name and Victim.title are Hijacker objects. In designing Victim, we give control to Hijacker to evaluate the name and title attributes.
When we have an object like v = Victim("James", "pilot") and we evaluate v.name, what happens is that Python first evaluates Victim.name and sees that it’s an object which has a __get__ method, so the __get__ method gets called.
Essentially this:

    v.name

evaluates as something like this:

    type(v).name.__get__(v, type(v))

and the descriptor in this case has access to both the instance (v) and the class (Victim) and can do whatever it likes.
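To make that equivalence concrete, here is a minimal sketch. The Shout descriptor and Person class are hypothetical names invented for this illustration; the sketch fetches the descriptor out of the class dictionary with vars() so that it can call __get__ by hand and compare the result with ordinary attribute access.

```python
class Shout(object):
    """Hypothetical descriptor: returns the stored name in upper case."""
    def __get__(self, instance, owner=None):
        if instance is None:
            return self          # accessed on the class, not on an instance
        return instance._name.upper()

class Person(object):
    name = Shout()               # class attribute that is a descriptor
    def __init__(self, name):
        self._name = name

p = Person("james")

# Pull the raw descriptor out of the class dictionary (no __get__ triggered),
# then call __get__ by hand with the instance and its class.
descriptor = vars(type(p))["name"]
explicit = descriptor.__get__(p, type(p))

print(p.name)      # JAMES
print(explicit)    # JAMES
```

Both lines print the same thing, which is roughly what the attribute lookup machinery does for us when it finds a descriptor on the class.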
Why use descriptors?
Descriptors are relatively low-level features that are used to implement property and a couple of other built-in Python features like classmethod and staticmethod. It’s harder to understand why you would go out of your way to use a descriptor. Why would you outsource attribute lookup? In the case of property, it’s a general behavior that can be reused, so rather than coding it specially into each new object, we just use the logic that’s baked into property.
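As a rough illustration of how a property-like feature can be built on top of the descriptor protocol, here is a simplified sketch. ReadOnlyProperty and Circle are hypothetical names; the built-in property is implemented in C and also handles setters, deleters and docstrings, so treat this as a getter-only approximation rather than the real thing.

```python
class ReadOnlyProperty(object):
    """Simplified, getter-only stand-in for the built-in property."""
    def __init__(self, fget):
        self.fget = fget
    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return self.fget(instance)       # compute the value on demand
    def __set__(self, instance, value):
        # Defining __set__ makes this a data descriptor, so assignment
        # is intercepted here instead of landing in the instance __dict__.
        raise AttributeError("can't set attribute")

class Circle(object):
    def __init__(self, radius):
        self.radius = radius
    @ReadOnlyProperty
    def diameter(self):
        return 2 * self.radius

c = Circle(3)
print(c.diameter)   # 6
```

The reusable part is the descriptor class: any number of classes can decorate methods with it without repeating the lookup logic.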
I recently had a case where one of the comments recommended using descriptors, and at the time I didn’t understand them well enough to do what I wanted, so I just left things alone. In the intervening weeks, there was an article posted to reddit on one use case for descriptors, which I read, but it left a really bad taste in my mouth. The theme for articles like that one is that descriptors can implement managed attributes; by outsourcing attribute lookup you can take away all the control of a particular object’s attributes, including the storage for that attribute, and give it to something else. That article talks about a validated email descriptor, which won’t let you assign a value to a particular attribute unless it matches a regular expression; otherwise it raises an error. But to do this, it stores the attribute away from the object itself, and the article gets muddled in a discussion of a WeakKeyDictionary pool, without really distinguishing this aspect from descriptors in general. So I gave up trying to understand.
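To separate the validation idea from the storage question, here is a sketch of a validating descriptor that keeps the value in the instance’s own __dict__ instead of an external pool. This is not the code from that article: ValidatedString and Account are hypothetical names, and the regular expression is deliberately crude.

```python
import re

class ValidatedString(object):
    """Hypothetical data descriptor: rejects values that don't match a pattern."""
    def __init__(self, attrname, pattern):
        self.attrname = attrname
        self.regex = re.compile(pattern)
    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__[self.attrname]
    def __set__(self, instance, value):
        if not self.regex.match(value):
            raise ValueError("invalid value: %r" % (value,))
        # Storage stays in the instance itself; because this class defines
        # __set__, the descriptor still takes priority on lookup.
        instance.__dict__[self.attrname] = value

class Account(object):
    # crude pattern for illustration only, not a real email validator
    email = ValidatedString('email', r'^[^@\s]+@[^@\s]+$')
    def __init__(self, email):
        self.email = email       # goes through ValidatedString.__set__

acct = Account("bob@example.com")
print(acct.email)                # bob@example.com
```

Assigning something like acct.email = "nope" raises ValueError, and no weak-reference bookkeeping is needed because the value lives and dies with the object.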
Then a few days ago, I was working on libgf2, the Python module I maintain for computing in \( GF(2) \) (which no one actually uses), and I had an epiphany.
In libgf2 there’s a class called GF2QuotientRing that allows computation with a particular representation of binary finite fields. This is a class that has a lot of methods, but as far as state goes, it subclasses from GF2Polynomial which only has two state variables, one called coeffs which is a bit vector of polynomial coefficients (0b10011 represents \( x^4 + x + 1 \)), and the other called bitlength that just caches the bit length of the coeffs attribute. GF2QuotientRing doesn’t add any state variables on top of it.
Or at least it didn’t. But I wanted to add discrete logarithm and trace calculation to GF2QuotientRing, and each of these has some computational cost. For discrete logarithm, my implementation is table driven, so at some point you have to build the tables, and there are a few choices:
- on construction, in GF2QuotientRing.__init__(), but this saddles every GF2QuotientRing object with the cost of building the tables, whether or not they’re ever used
- at point of use, in GF2QuotientRing.lograw() (which is called by .log()), but if I called lograw() 10 times, it would build the tables every time
- at point of use, in GF2QuotientRing.lograw(), but only on first use; this is the so-called “lazy evaluation” approach. If I had 10 GF2QuotientRing objects and only one of them called .lograw() then the log tables would only be built for that one object; furthermore if I call .lograw() 10 times, the first time it will build the log tables and then afterwards it will reuse the same tables.
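The third option can be sketched by hand, without descriptors, as follows. This is not libgf2’s implementation: TableUser is a hypothetical class, and as a simplifying assumption it builds a discrete-log table for powers of 2 modulo the prime 11 (where 2 happens to be a generator) rather than for a binary finite field. The build_count attribute exists only to show how often the table gets built.

```python
class TableUser(object):
    """Hypothetical stand-in for a class with an expensive lookup table."""
    build_count = 0              # counts table builds, for demonstration only

    def __init__(self, n):
        self.n = n
        self._table = None       # nothing built at construction time

    def _build_table(self):
        TableUser.build_count += 1
        # map 2**k mod n -> k; assumes 2 generates the nonzero residues mod n
        return dict((pow(2, k, self.n), k) for k in range(self.n - 1))

    def lograw(self, x):
        if self._table is None:  # first use: build and remember the table
            self._table = self._build_table()
        return self._table[x]

unused = TableUser(11)           # never calls lograw, never pays for a table
used = TableUser(11)
print(used.lograw(8))            # 3, since 2**3 = 8
print(used.lograw(5))            # 4, since 2**4 = 16 = 5 mod 11
print(TableUser.build_count)     # 1
```

This works, but the "is it built yet?" check has to be repeated in every method that needs the table, which is the boilerplate a lazy property can absorb.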
This is a textbook example of lazy evaluation, and I really wanted to try it. I was familiar with the concept in Java, but the situation is different there. In Java, there are no such things as properties, just fields to store state and methods to access it, so there’s no syntactic sugar to make it easy to calculate what looks like a field on-demand. Most of the discussion in Java on lazy evaluation involves how to make it threadsafe, and you get into things like double-checked locking which have all sorts of pitfalls.
In my Python library I’m not interested in thread-safety as much as a design that feels right and is reasonably efficient. Anyway, the descriptor approach works really well for lazy properties. It turns out that I’m not the only one to figure this out, which is kind of a bummer to realize, but whatever. My approach is a little bit different, and yes, it does use WeakKeyDictionary, but there’s a good reason for it:
    from weakref import WeakKeyDictionary

    class LazyProperty(object):
        def __init__(self, deferred_computation, attrkey=None, cacheClass=WeakKeyDictionary):
            self.deferred_computation = deferred_computation
            self.cache = cacheClass()
            self.attrkey = attrkey
        def __get__(self, instance, type=None):
            if instance is None:
                return self
            key = (instance if self.attrkey is None
                   else getattr(instance, self.attrkey))
            if key in self.cache:
                return self.cache[key]
            else:
                result = self.deferred_computation(instance)
                self.cache[key] = result
                return result
        @staticmethod
        def decorate(**kwargs):
            def decorator(func):
                return LazyProperty(func, **kwargs)
            return decorator
Rather than explain in the abstract, here’s a use example:
    import primefac
    import time

    class Factorization(object):
        def __init__(self, n):
            self.n = n
        @LazyProperty
        def factors(self):
            print("computing factors of %d" % self.n)
            t1 = time.time()
            result = primefac.factorint(self.n)
            t2 = time.time()
            print("elapsed time: %.3f s" % (t2-t1))
            return result
        def __repr__(self):
            return "Factorization(%d)" % self.n

    f1 = Factorization(45133901693*97895698669)
    print(f1)

    Factorization(4418414839894196946617)
Factorization can be expensive, so the factors attribute is computed on demand. The way this works is that the function def factors(self) is decorated by LazyProperty, so it’s equivalent (more or less) to this:
    def factors_function(self):
        print("computing factors of %d" % self.n)
        t1 = time.time()
        result = primefac.factorint(self.n)
        t2 = time.time()
        print("elapsed time: %.3f s" % (t2-t1))
        return result

    class Factorization(object):
        def __init__(self, n):
            self.n = n
        factors = LazyProperty(factors_function)
        def __repr__(self):
            return "Factorization(%d)" % self.n
And the LazyProperty object takes in a function as the deferred_computation argument and stashes it for later. When the __get__ method is called, it looks up the result in a cache, and if the cache doesn’t contain a value, then the deferred_computation is executed and the result stored for next time.
    f1.factors

    computing factors of 4418414839894196946617
    elapsed time: 1.320 s
    {45133901693L: 1, 97895698669L: 1}
So factors gets computed the first time it’s accessed. The next time, the cached value is used:

    f1.factors

    {45133901693L: 1, 97895698669L: 1}
Now the tricky and debatable part is where these cached values live.
The simplest way to handle lazy properties is to store the result of deferred computation in the target object itself, not in the descriptor. That makes sense most of the time, and some of the lazy property implementations, like Rick Copeland’s, do this.
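A minimal sketch of that store-it-in-the-object approach might look like the following. It is my own illustration, not anyone’s published implementation, and InstanceLazyProperty and Squares are hypothetical names. It relies on the fact that a descriptor with no __set__ method loses to an entry of the same name in the instance’s __dict__.

```python
class InstanceLazyProperty(object):
    """Non-data descriptor that caches its result in the instance __dict__."""
    def __init__(self, func):
        self.func = func
        self.name = func.__name__
    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        result = self.func(instance)
        # From now on the instance attribute shadows this descriptor,
        # so __get__ is not called again for this instance.
        instance.__dict__[self.name] = result
        return result

class Squares(object):
    calls = 0                    # counts computations, for demonstration only
    def __init__(self, n):
        self.n = n
    @InstanceLazyProperty
    def table(self):
        Squares.calls += 1
        return [k * k for k in range(self.n)]

s = Squares(4)
print(s.table)        # [0, 1, 4, 9]
print(s.table)        # [0, 1, 4, 9], served from s.__dict__ this time
print(Squares.calls)  # 1
```

In Python 3.8 and later, functools.cached_property works along similar lines. The cached value is released along with the object, so there is no separate cache to leak.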
The other way to do it is to let the descriptor manage the cache. Which seems kind of weird. When you have data associated with an object, why not store it in the object itself? Otherwise you need a cache, and if you aren’t careful then this can be the cause of a memory leak — for example if I have 10000 objects and get rid of 9999 of them, but the cache holds onto a value associated with each of those 10000 objects and never gets rid of them, then we have unreleased resources, which can lead to problems. By “get rid of” an object, I mean that I remove all references to it that are reachable from variables in the global namespace (aka “rooted references”), so that it can be garbage collected.
The way to handle a cache without such a memory leak is to purge entries in the cache when appropriate, and one way is to tie those cache entries directly to the lifetime of their associated objects, which is what WeakKeyDictionary does. If you have wkd = WeakKeyDictionary() and you assign wkd[someObject] = someValue and at some point later on, someObject has no other rooted references, then the WeakKeyDictionary will automatically remove the dictionary entry when its key is garbage collected.
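Here is that behavior in isolation, as a small sketch. Token is a hypothetical key class; the call to gc.collect() is there to make the example deterministic, although in CPython the entry typically disappears as soon as the last strong reference to the key is dropped.

```python
import gc
from weakref import WeakKeyDictionary

class Token(object):
    """Hypothetical key class; plain class instances support weak references."""
    pass

wkd = WeakKeyDictionary()
key = Token()
wkd[key] = "some value"
size_before = len(wkd)     # one entry while the key is alive

del key                    # drop the only strong reference to the key
gc.collect()
size_after = len(wkd)      # the entry went away with its key

print(size_before, size_after)
```

The dictionary only holds a weak reference to the key, so it never keeps the key alive on its own.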
    print("cache before garbage collection:", Factorization.factors.cache.items())
    f1 = None
    import gc
    gc.collect()
    print("cache after garbage collection: ", Factorization.factors.cache.items())

    ('cache before garbage collection:', [(Factorization(4418414839894196946617), {45133901693L: 1, 97895698669L: 1})])
    ('cache after garbage collection: ', [])
Okay, so when the factorization object goes away, so does the cached computed value. Why take this sort of approach rather than storing it in the object itself?
The best reason I can think of involves value objects. These are objects with equivalency that depends only on the content of the objects rather than the fact that they are separate objects. If I create two distinct Factorization objects, they’ll each have separate entries in the cache:
    f1 = Factorization(45133901693*97895698669)
    f2 = Factorization(45133901693*97895698669)
    f1.factors
    f2.factors
    print("cache:", Factorization.factors.cache.items())

    computing factors of 4418414839894196946617
    elapsed time: 1.362 s
    computing factors of 4418414839894196946617
    elapsed time: 1.329 s
    ('cache:', [(Factorization(4418414839894196946617), {45133901693L: 1, 97895698669L: 1}), (Factorization(4418414839894196946617), {45133901693L: 1, 97895698669L: 1})])
This seems unnecessary and I can do better by making a class of value objects, where I define equivalency by overriding __eq__ and __hash__:
    class Factorization2(object):
        def __init__(self, n):
            self.n = n
        @LazyProperty
        def factors(self):
            print("computing factors of %d" % self.n)
            t1 = time.time()
            result = primefac.factorint(self.n)
            t2 = time.time()
            print("elapsed time: %.3f s" % (t2-t1))
            return result
        def __eq__(self, other):
            return type(self) == type(other) and self.n == other.n
        def __hash__(self):
            return self.n.__hash__()
        def __repr__(self):
            return "Factorization2(%d)" % self.n

    f1 = Factorization2(45133901693*97895698669)
    f2 = Factorization2(45133901693*97895698669)
    f1.factors
    f2.factors
    print("cache:", Factorization2.factors.cache.items())

    computing factors of 4418414839894196946617
    elapsed time: 1.321 s
    ('cache:', [(Factorization2(4418414839894196946617), {45133901693L: 1, 97895698669L: 1})])
By doing this, even if I create a new object, if it is equivalent to one already in the cache, then the two will share the same computed value. And that’s the real power of storing cached computed values outside the objects themselves.
My LazyProperty implementation also supports alternative caches; for example, maybe you don’t want computed value lookup on the object itself, but rather some designated attribute within that object:
    class ThingyWithFactors(object):
        def __init__(self, n, name):
            self.n = n
            self.name = name
        @LazyProperty.decorate(attrkey='n', cacheClass=dict)
        def factors(self):
            print("computing factors of %d" % self.n)
            t1 = time.time()
            result = primefac.factorint(self.n)
            t2 = time.time()
            print("elapsed time: %.3f s" % (t2-t1))
            return result
        def __repr__(self):
            return "ThingyWithFactors(%d,%r)" % (self.n, self.name)

    t1 = ThingyWithFactors(45133901693*97895698669, "Bill")
    t2 = ThingyWithFactors(45133901693*97895698669, "Ted")
    t1.factors
    t2.factors
    print("cache:", ThingyWithFactors.factors.cache.items())

    computing factors of 4418414839894196946617
    elapsed time: 1.326 s
    ('cache:', [(4418414839894196946617L, {45133901693L: 1, 97895698669L: 1})])
Here t1 and t2 are two separate objects, and are not equivalent, but they share the computed factors cache because they have the same value of n. Here I used a dict cache rather than a WeakKeyDictionary, so the cache will never automatically purge its entries even if t1 and t2 are garbage collected; WeakKeyDictionary doesn’t work on types that don’t allow weak references, and plain integers are one of those types. (The cachetools module might be a good thing to look at, in this case, as another way to handle a cache of managed values.)
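If an unbounded dict is a concern, one option is a size-limited cache. The sketch below is a standard-library illustration of that idea and does not use the cachetools API; BoundedCache is a hypothetical class that provides only the operations LazyProperty relies on (construction with no arguments, the in test, and item get and set), plus items() for inspection.

```python
from collections import OrderedDict

class BoundedCache(object):
    """Hypothetical least-recently-used cache with a fixed maximum size."""
    def __init__(self, maxsize=128):
        self.maxsize = maxsize
        self.data = OrderedDict()
    def __contains__(self, key):
        return key in self.data
    def __getitem__(self, key):
        value = self.data.pop(key)   # raises KeyError if the key is absent
        self.data[key] = value       # re-insert to mark as most recently used
        return value
    def __setitem__(self, key, value):
        self.data.pop(key, None)
        self.data[key] = value
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)   # evict the least recently used
    def items(self):
        return list(self.data.items())

cache = BoundedCache(maxsize=2)
cache[3] = "three"
cache[5] = "five"
cache[3]                 # touching 3 makes 5 the least recently used entry
cache[7] = "seven"       # exceeds maxsize, so 5 is evicted
print(cache.items())     # [(3, 'three'), (7, 'seven')]
```

Since LazyProperty calls cacheClass() with no arguments, something like cacheClass=BoundedCache should slot in where dict was used above, at the cost of occasionally recomputing an evicted value.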
In one of my cases, I have GF2QuotientRing objects which can have different polynomials of the same degree \( N \), where I have to factor \( 2^N-1 \), so having them share a cache based on \( N \) makes sense. (Also there aren’t that many choices of \( N \) that are feasible to factor in a short amount of time, so I don’t have to worry about the cache becoming too large and causing memory leak problems, so a plain dict is fine.)
Other uses of descriptors
A lazy property only makes sense if the property’s value is immutable. There still might be other reasons to outsource an attribute lookup — for example, suppose you have a database and you’re creating objects that are essentially “views” of the database, so rather than store attribute values in these view objects, the values should be stored in and retrieved from the database. The classical way to do this is to have an object that contains only two state variables:
- some kind of ID
- a reference to a database connection
This doesn’t need generalized descriptors to handle it; you could use the Python property decorator to use getters and setters. But you could also implement it by just storing an ID in the object itself, and let a descriptor manage the database connection. (That approach seems like a code smell to me: then the database connection is a singleton, and you can’t create sets of objects with separate database connections.)
In researching this article, I ran across a 2016 book by Jacob Zimmerman which is 64 pages devoted solely to the subject of Python descriptors. Some of the Python frameworks like django use descriptors, so a book like this might be good for further reading.
Wrapup
A descriptor is an object in Python that allows outsourcing of attribute lookups. We showed a couple of examples, including the LazyProperty class which facilitates caches of computed values.
© 2017 Jason M. Sachs, all rights reserved.