Sunday, December 25, 2022

Metaprogramming in Python 2: An Interning Decorator


Note:
the code related to this post can be found in the github IBM-Metaprogramming-for-Python repo, in intern.py, available under the Apache 2.0 license.

Contents

Part 1: Introduction 

Part 2: An Interning Decorator

Interning

Interning is a mechanism for ensuring that any two equal objects of the same class are actually identical (that is, the same object).  This can save memory, by preventing the proliferation of many objects that are equal to each other.  It can also speed up computation by using pointer comparison (which takes small and constant time) instead of object equality, which can be arbitrarily complex.  And, in case object creation requires expensive computation, this can be avoided whenever an equal object to the one being requested already exists.

Sometimes it is necessary to ensure that a change to the state of an object is seen by every part of the program that has access to that object or any other object that is equal to it.  Interning achieves this goal by ensuring that there is only one such object, so that a change to it is automatically seen everywhere in the program.

Interning has a cost of its own; this includes storage for a table that keeps track of existing objects, as well as a small overhead on the creation of objects, which must look at the table before creating new objects.

Interning is used for strings in many programming languages; for example, the Java compiler interns all literal strings (those that appear in double quotes in the source files); Java, Python, and many other languages provide an intern() method for strings, for use at the programmer's discretion.

Interning is useful for many other types of objects besides strings. If it is only used to save memory, it need not be perfectly accurate; as long as it drastically reduces the number of new objects created, it serves its purpose.  However, in order to allow the replacement of equality comparisons by object identity, it must ensure that there will never be two distinct but equal objects.

Interning can be implemented using a static method that creates objects instead of using the constructor.  However, this is cumbersome and unsafe, since users can still call the constructor to generate new objects, bypassing the interning mechanisms.  (In Java this can be prevented by making the constructor private, but Python doesn't have a corresponding mechanism.)

A better way is to change the behavior of the constructor so that it doesn't always create a new object.  In some languages, including C++ and Java, this is impossible, since a constructor is guaranteed to return a fresh object.  However, languages such as Python do provide mechanisms to change the behavior of constructors (and other parts of the language) in a very flexible way, using metaprogramming (see Part 1).

To implement a general interning mechanism in Python, we will create an @intern decorator for classes whose elements should be interned.  This decorator will add or modify the class's __new__() method to use an interning table, and the __init__() method to ensure that object initialization is done exactly once per object.

Applying Interning to a Class

Default Interning

The simplest way to apply interning to a class is just to decorate it with @intern, as in the following example:

@intern
class Foo:
    def __init__(self,
                 s1: str,
                 i1: int,
                 t1: tuple,
                 set1: frozenset,
                 kt1: tuple = ()):
        self.s1 = s1
        self.i1 = i1
        self.t1 = t1
        self.set1 = set1
        self.kt1 = kt1

(All the examples shown here are taken from the test file.)

This class has five fields, which are initialized from corresponding parameters of the constructor.  If you create two instances of this class with equal (but not necessarily identical) parameters, you will get the same identical instance in both cases, as shown by the first test case:

obj1 = Foo('test1', 1, ((1, 2), (3, 4)), frozenset((10, 11)))
obj2 = Foo('test1', 1, ((1, 2), (3, 4)), frozenset((11, 10)))
self.assertIs(obj1, obj2)
 

Note that the tuple and sets are distinct objects, yet the values of obj1 and obj2 are the same object (in other words, obj1 is obj2 would evaluate to True).

This implementation of the @intern decorator indexes objects based on the arguments to the constructor rather than the object's actual fields (which are not available unless the object is actually created, defeating the purpose of the exercise).  This is based on the assumption that equal constructor arguments create equal objects; an assumption that often but not always true.  But don't worry!  This behavior can be customized, as we will see shortly.

Another caveat of the default implementation is that supplying the default value of a keyword argument in the constructor call looks different than a call that omits that keyword argument.  Therefore, the next object, obj3, is not the same as the previous two:

obj3 = Foo('test1', 1, ((1, 2), (3, 4)), frozenset((11, 10)), kt1=())
self.assertIsNot(obj1, obj3)

Similarly, providing a keyword argument positionally, without using the keyword, is different from anything else we have seen above:

obj4 = Foo('test1', 1, ((1, 2), (3, 4)), frozenset((11, 10)), ())
self.assertIsNot(obj1, obj4)
self.assertIsNot(obj3, obj4)

Using a Key Function

These issues can be solved by a very general mechanism built into the intern decorator: user-provided key functions.  These are functions that accept the same parameters as the class constructor, and return an object that can be used to index class objects in the internal table.  A simple and common example is the following:

def key_func(s1: str, i1: int, t1: tuple, set1: frozenset, kt1: tuple = ()):
    return s1, i1, t1, set1, kt1


@intern(key=key_func)
class Bar:
    def __init__(self, s1: str, i1: int, t1: tuple, set1: frozenset, kt1: tuple = ()):
        self.s1 = s1
        self.i1 = i1
        self.t1 = t1
        self.set1 = set1
        self.kt1 = kt1
This class is defined with the @intern decorator, but in this case it is given a key parameter, which is the function key_func.  This function simply returns a tuple of all the constructor parameters, but it puts the value of the keyword parameter into the same tuple, regardless of how it was specified in the construtor call (positionally, by keyword, or omitted).  The Bar constructor calls corresponding to the four examples shown above for Foo therefore all return the same object:

obj1 = Bar('test1', 1, ((1, 2), (3, 4)), frozenset((10, 11)))
obj2 = Bar('test1', 1, ((1, 2), (3, 4)), frozenset((11, 10)))
self.assertIs(obj1, obj2)
obj3 = Bar('test1', 1, ((1, 2), (3, 4)), frozenset((11, 10)), kt1=())
self.assertIs(obj1, obj3)
obj4 = Bar('test1', 1, ((1, 2), (3, 4)), frozenset((11, 10)), ())
self.assertIs(obj1, obj4)

Mutable Constructor Arguments

Python refuses to use objects that are not hashable as keys of tables (such as instances of dict).  In particular, mutable objects (those whose state can change during runtime) should not be hashable, otherwise they will be lost if used as table keys when their values (and therefore also their hash values) change.

This means that any mutable (or, in general, non hashable) argument to the constructor of a class that uses interning must not be used as a key.  For example, the following test succeeds because the creation of the object raises an exception:

@intern
class Qux:
    def __init__(self, s1: str, i1: int, l1: list, set1: set, kt1: tuple = ()):
        self.s1 = s1
        self.i1 = i1
        self.l1 = l1
        self.set1 = set1
        self.kt1 = kt1

try:
    Qux('test1', 1, [(1, 2), (3, 4)], {10, 11})
except TypeError:
    pass
else:
    self.fail("Expected TypeError: unhashable type: 'list'")

The class Qux is different from Foo only in the type of the third constructor parameter, which is a list instead of a tuple.  (Unless you activate runtime type checking, you will not see a difference between the behaviors of Foo and Qux.)  Since the default behavior of @intern is to use the constructor arguments to compose the key, Python will raise a TypeError when a mutable list and a mutable set are used as the key to a table.

Key functions can be used to solve this issue as well.  The key function needs to construct a hashable key; for example, it can replace lists by tuples, sets by frozen sets (elements of frozenset), and tables by tuples of key-value pairs (each of which is itself a tuple).  This is done in the following example:

def key_func_for_list(s1: str, i1: int, l1: list, set1: set, kt1: tuple = ()):
    return s1, i1, tuple(l1), frozenset(set1), kt1

@intern(key=key_func_for_list)
class Quux:
    def __init__(self, s1: str, i1: int, l1: list, set1: set, kt1: tuple = ()):
        self.s1 = s1
        self.i1 = i1
        self.l1 = l1
        self.set1 = set1
        self.kt1 = kt1

obj1 = Quux('test1', 1, [(1, 2), (3, 4)], {10, 11})
obj2 = Quux('test1', 1, [(1, 2), (3, 4)], {11, 10})
self.assertIs(obj1, obj2)
obj3 = Quux('test1', 1, [(1, 2), (3, 4)], {11, 10}, kt1=())
self.assertIs(obj1, obj3)
obj4 = Quux('test1', 1, [(1, 2), (3, 4)], {11, 10}, ())
self.assertIs(obj1, obj4)

This now works as expected, with all four constructor calls returning the same object.

In general, of course, you need to take care of deep data structures that have unhashable elements; for example, a tree that is represented as lists nested within other lists must be converted to another form, such as tuples nested in tuples.

Indexing by Object Identity

Sometimes you want to intern your objects based on the identity of the constructor arguments rather than by equality.  This is also easy to do using a key function, as in the following example:

def key_by_id(s1: str, i1: int, l1: list, set1: set, kt1: tuple = ()):
    return tuple(map(id, (s1, i1, l1, set1, kt1)))

@intern(key=key_by_id)
@dataclass
class Xyzzy:
    s1: str
    i1: int
    l1: list
    set1: set
    kt1: tuple = ()

list1 = [(1, 2), (3, 4)]
set1 = {10, 11}
obj1 = Xyzzy('test1', 1, list1, set1)
obj2 = Xyzzy('test1', 1, list1, set1, kt1=())
self.assertIs(obj1, obj2)
obj3 = Xyzzy('test1', 1, [(1, 2), (3, 4)], set1, kt1=())
self.assertIsNot(obj1, obj3) 

In this case, the key consists of the "identities" of the constructor parameters, as given by the built-in Python function id (in CPython these are memory addresses).

Both obj1 and obj2 are constructed with identical parameters, and are therefore the same object.  (Note that the string and int arguments are identical because of the interning behavior of Python for these data types; this behavior is undocumented and may differ between implementations of versions of Python.)

However, obj3 is given a different list, and is therefore not identical to the previous object.

This example also demonstrates that @intern can be applied on data classes (decorated with @dataclass).  However, in such cases you must ensure that the @intern decorator comes before @dataclass (which means that it is applied after it), since @dataclass will not create an __init__() method if one has been created by @intern, and @intern must have access to the __init__() method generated by @dataclass.

Caveat

Suppose you use object identities as keys, as in the Xyzzy example, but the generated object does not keep a pointer to the argument, as in the following example:

@intern(key=key_by_id)
class Gritch:
    def __init__(self, s1: str, i1: int, l1: list, set1: set, kt1: tuple = ()):
        self.s1 = s1
        self.i1 = i1
        self.l1 = tuple(l1)
        self.set1 = tuple(set1)
        self.kt1 = kt1

The following test case passes on my current Python installation (CPython 3.11.1 for x64):

obj1 = Gritch('test1', 1, [(1, 2), (3, 4)], {10, 11})
obj2 = Gritch('test1', 1, [(10, 20), (30, 40)], {100, 110})
self.assertIs(obj1, obj2)

How is this possible?  The values of obj1 and obj2 should be different, and so they can never be represented by the same object!

The reason for this is that the first list can be garbage-collected immediately after obj1 is created, since there are no longer any references to it.  The second list (used to initialize obj2) can in principle be allocated at the same address as the first one, and the same holds for the set argument.  The Python garbage collector (in my installation, at least) takes advantage of this fact, and reuses the memory without even waiting for a full garbage-collection cycle.  The keys used for interning both objects therefore contain the same ids for the list and set, and the constructor returns obj1, which is indexed under this key.

This can never happen if you object keeps a reference to the arguments used in the key, since they can't be garbage collected as long as the object exists in the table, and so this object will never be confused for any other.

Singleton

The same approach can be used to create singleton classes, where the first call to the constructor creates the singleton object, and further calls return the same object.

In fact, if the constructor has no parameters (except for self, of course), the @intern decorator will create a singleton.  However, a direct implementation of the singleton pattern may be very slightly more efficient.

Implementing Interning

Note: the implementation described below can be found in the github IBM-Metaprogramming-for-Python repo, in intern.py.

The @intern  decorator is actually an instance of the class Intern, defined as follows:

intern = Intern()

This enables users to create independent interning objects, with possibly different behaviors (in this implementation, default key functions).  The Intern class constructor accepts an optional default_key parameter; if provided, it should be a function that takes a class object as a parameter and returns the appropriate key function.  (Of course, it may ignore the class parameter and return the same key function for all classes decorated with this instance.)

Let's call an instance of the Intern class an interner.  In order to be used as a decorator, an interner needs to be callable; the main interning functionality is therefore implemented inside the __call__() method, whose header is:

def __call__(self,
             cls: Optional[type] = None, *,
             key: Callable = None,
             reset_method: Optional[str] = 'reset_intern_memory'):

The first (positional) parameter cls is the decorated class.  The keyword parameters can be provided as arguments to the decorator; for example, we used @intern(key=key_by_id) as a decorator for Xyzzy and Gritch.  The key parameter is an optional key function, as explained above.  The reset_method parameter is an optional name for a method that can be used to reset the interning table, or None to prevent the creation of such a method.

The decorator returns the same class it received, suitably modified.

The following lines allow the decorator to be used with or without arguments:

if cls is None:
    return partial(self, key=key, reset_method=reset_method)

In brief, if no class argument is provided, the call expects to receive the keyword arguments, and returns another decorator function that encapsulates those arguments and can be called on the class.

The next lines choose a key function, giving precedence to the one given as a decorator parameter over the default_key of the interner:

if key is None and self.default_key is not None:
    key = self.default_key(cls)

The table used by the interner is created by the expression

memory = WeakValueDictionary()

A WeakValueDictionary is a data structure similar to a dict, except that it doesn't prevent the garbage collector to collect its values (provided, of course, they have no other [strong] references).  This is done in order to avoid memory leaks due to stale objects being held only by the interner.

The next order of business, and perhaps the most significant, is the creation (or modification) of the decorated class's __new__() method.  This is where the table is searched for an existing object, which will be returned if it is found without creating a new object.  If no existing object is found under the key, a new object will be generated, put into the table, and returned.

The code has two possible versions of this method, depending on whether or not the class already has a __new__() method.  If it does, it will be called to create the new object if necessary.

If the class has no __new__() method of its own (that is, it uses object.__new__), the code calls object.__new__().  That method does not accept any additional arguments beyond the class itself, unlike custom __new__() methods that can be given the constructor arguments.  The definition of the new __new__() method in this case is:

def __new__(cls, *args, **kwargs):
    element_key = (args, tuple(sorted(kwargs.items()))) if key is None else key(*args, **kwargs)
    try:
        result = memory[element_key]
        setattr(result, '*already_initialized*', True)
        return result
    except KeyError:
        result = object.__new__(cls)
        memory[element_key] = result
        return result

The try block attempts to retrieve a previously-created object from the table; if successful, this object will be returned.  (We will discuss the strangely-named *already_initialized* attribute a little later.)

If the key doesn't exist in the table, a KeyError exception will be raised.  In this case, we will call object.__new__() to create a new (uninitialized) object, put it in the table, and return it.

If the class has a custom __new__() method, the definition of the new __new__() method is only slightly different, as marked in color below; old_new is the value of getattr(cls, '__new__'):

def __new__(cls, *args, **kwargs):
    element_key = ((args, tuple(sorted(kwargs.items())))
                   if key is None
                   else key(*args, **kwargs))
    try:
        result = memory[element_key]
        setattr(result, '*already_initialized*', True)
        return result
    except KeyError:
        result = old_new(cls, *args, **kwargs)
        memory[element_key] = result
        return result

In either case, the new method is installed on the class using the call setattr(cls, '__new__', __new__).

According to Python's data model, the object returned by a __new__() method should be initialized by invoking on it the __init__() method, with the provided constructor arguments.  However, this rule is not followed if the __new__() method returns an object whose type is different from the class whose constructor was called.  Recall that the __new__() method is allowed to return a value of any type; it would not make sense to initialize an object or a different class with the arguments appropriate to the original class.

This rule implies that if __new__() does return an object of the same class, it will be initialized.  In our case, this will be wrong, because __new__() will return an already-initialized object from its table.  The state of this object may already have been modified, and initializing it again would incorrectly overwrite these changes.  We therefore need to prevent a second initialization.  This is the purpose of the *already_initialized* attribute; we set this to indicate that the object shouldn't be initialized again.  We use a name that is not a legal identifier in order to prevent accidental collisions with users who may choose a similar, but legal, name for a class attribute.

The way to prevent re-initialization is to create or modify the class's __init__() method so it skips initialization if *already_initialized* is true for the object.  Again there are two cases, depending on whether or not the class already has an __init__() method:

init = getattr(cls, '__init__', None)
if init is None:
    def __init__(self, *args, **kwargs):
        if not (hasattr(self, '*already_initialized*')
                and getattr(self, '*already_initialized*')):
            super().__init__(self, *args, **kwargs)
else:
    @wraps(init)
    def __init__(self, *args, **kwargs):
        if not (hasattr(self, '*already_initialized*')
                and getattr(self, '*already_initialized*')):
    
       init(self, *args, **kwargs)

setattr(cls, '__init__', __init__)

Finally a reset method is created if requested, and the original (but modified) class is returned:

if reset_method:
    def reset_intern_method():
        nonlocal memory
        memory = WeakValueDictionary()

    setattr(cls, reset_method, reset_intern_method)

return cls


Metaprogramming in Python 1: Introduction

Contents

Part 1: Introduction 

Part 2: An Interning Decorator

See also the discussion of how to do super() in Python correctly; the implementation uses inspection to find the correct super method to apply.

Series Introduction

Metaprogramming is (among other things) a way of programming general techniques that operate by transforming a target program in various ways.  Python has many features that are implemented using metaprogramming, including static and class methods, abstract classes and methods, data classes, properties, and context managers.

Python also exposes metaprogramming facilities that enable Python developers to create their own features.  These include an elaborate data model that has many hooks for modifying the standard behavior of the language, and decorators, which are a handy shorthand mechanism for applying new features to user code.

In this series of posts I will present several new and useful features and show how they can be implemented using Python's metaprogramming facilities.

A Little History

Metaprogramming, in the sense of programs operating on other programs, includes assemblers, compilers, linkers, and loaders, and therefore harks back to the early days of computing.  Metaprogramming in the more restricted sense used in this series, of language facilities that enable modifying the default behavior of the language itself, also has its roots in one of the earliest high-level language, Lisp.  Unlike other early languages, which focused on scientific or business processing, Lisp was all about symbolic processing, and was one of the major languages used for Artificial Intelligence research for several decades.

Because of its focus on symbolic processing, a list is the basic data structure of Lisp (hence the name of the language, a contraction of list processing).  A Lisp program is also a list (with other lists nested inside it), so that writing Lisp interpreters and manipulating Lisp programs in Lisp is easy and natural.  The creators and many of the users of Lisp were also interested in language design, and created many features and language dialects by modifying the Lisp interpreter, and using metaprogramming.  These came together in the early 1980s to create a specification for Common Lisp, an attempt to collect the best features of previous dialects into a single language.  The Common Lisp Object System (CLOS) exposed its implementation through the CLOS Meta-Object Protocol, which enabled extensive customizations of the behavior of the basic language mechanisms.

Lisp, and Common Lisp in particular, had a strong influence on many newer languages, including Python, which follows the tradition of providing powerful metaprogramming capabilities.  This series will demonstrate how these can be used to add useful features to the language in a way that is very easy to use.

Metaprogramming

There are many definitions of metaprogramming; very generally, it refers to programs that operate on other programs (https://cs.lmu.edu/~ray/notes/metaprogramming/).  In the context of this series, it refers to a set of mechanisms for changing the standard behavior of a programming language in flexible ways.  The major mechanisms we will use are:

  • dynamically and programmatically adding or modifying class methods;
  • modifying the behavior of class constructors so that they don't necessarily create objects of the class, or even any new objects;
  • modifying classes, methods, or functions using decorators; and
  • inspecting classes, methods, and functions to find relevant properties at runtime.

Adding or Modifying Class Methods

Adding or modifying class methods can be done as for any class attribute, using setattr() and getattr().  These methods are similar to the use of dot notation to access a class attribute (as in "abc".capitalize()).  Unlike dot notation, you can use them to get and set attributes whose names are not fixed in the source code.  We will use these to create or modify the behavior of classes according to the feature we are implementing.

Modifying Class Constructors

Many of Python's metaprogramming facilities operate through "magic" methods, whose names have double underscores at the beginning and end (and are therefore called "dunder methods"). Most of these are part of Python's data model.

Changing the behavior of class constructors is done in Python through the magic method __new__().  This is a static method that is automatically invoked during the processing of a constructor call.  By default, it creates a new object, which is then initialized by the __init__() method.  However, you can override this method to return any object of any class, including previously-created objects.

Decorators

Decorators are a very powerful Python mechanism that can apply arbitrary code to a class, method, or function object and modify or replace it to customize its behavior.  One of the best known decorators in Python is @staticmethod, which turns a method defined inside a class into a static method (one that does not take the self object as a parameter).  A decorator is a function that takes the decorated class, method, or function, and returns another one to replace the original.  Syntactically, it is written as the name of the decorator preceded by an at-sign (@); the decorator is placed before the class, method, or function definition it modifies.

The meaning of a decorator written this way is to replace the defined element by whatever the decorator function returns when given the result of the original definition.  For example, the effect of the definition

@dataclass
class Foo:
    f1: int
    f2: str

will be first to create the class Foo, which has two class-level fields, f1 and f2.  Without the decorator, these fields belong to the class rather than to specific instances, and are therefore shared between all instances.  However, the decorator corresponds to the following statement:

Foo = dataclass(Foo)

The dataclass function examines the class object it receives (the one that has two class-level fields), and modifies it by (among other things) creating a method __init__(f1: int, f2: str); this new method create two instance-level fields in every object of the class.

The dataclass decorator can take various arguments that change the way it works on the given class, making it very flexible.  For more details, see the documentation.

Inspection

Inspection (also called introspection) is the ability of a program to investigate properties of itself.  This can include its global data (variables, classes, functions), the methods of a class, the arguments of a function or method, and much more.  This is useful for metaprogramming because the behavior being implemented may depend on these properties of the parts of the program it modifies.

As one example, modifying the behavior of object initialization, as done by the __init__() method, depends on whether or not a class defines this method.  Finding such a method definition in the class, as described above, is a simple case of inspection.

For a more complex example, see the series of posts on how to do super() in Python correctly and the corresponding implementation.

Putting Things Together

When adding a feature to the language, you will often have to create new classes, methods, or functions, or modify existing ones.  Your code will often need to override some of the magic dunder methods.  You will then probably deliver this functionality in the form of a decorator, like dataclass .  Subsequent posts will illustrate how this is done.