Note: the code related to this post can be found in the github IBM-Metaprogramming-for-Python repo, in intern.py, available under the Apache 2.0 license.
Contents
Part 2: An Interning DecoratorInterning
Interning is a mechanism for ensuring that any two equal objects of the same class are actually identical (that is, the same object). This can save memory, by preventing the proliferation of many objects that are equal to each other. It can also speed up computation by using pointer comparison (which takes small and constant time) instead of object equality, which can be arbitrarily complex. And, in case object creation requires expensive computation, this can be avoided whenever an equal object to the one being requested already exists.
Sometimes it is necessary to ensure that a change to the state of an object is seen by every part of the program that has access to that object or any other object that is equal to it. Interning achieves this goal by ensuring that there is only one such object, so that a change to it is automatically seen everywhere in the program.
Interning has a cost of its own; this includes storage for a table that keeps track of existing objects, as well as a small overhead on the creation of objects, which must look at the table before creating new objects.
Interning is used for strings in many programming languages; for example, the Java compiler interns all literal strings (those that appear in double quotes in the source files); Java, Python, and many other languages provide an intern() method for strings, for use at the programmer's discretion.
Interning is useful for many other types of objects besides strings. If it is only used to save memory, it need not be perfectly accurate; as long as it drastically reduces the number of new objects created, it serves its purpose. However, in order to allow the replacement of equality comparisons by object identity, it must ensure that there will never be two distinct but equal objects.
Interning can be implemented using a static method that creates objects instead of using the constructor. However, this is cumbersome and unsafe, since users can still call the constructor to generate new objects, bypassing the interning mechanisms. (In Java this can be prevented by making the constructor private, but Python doesn't have a corresponding mechanism.)
A better way is to change the behavior of the constructor so that it doesn't always create a new object. In some languages, including C++ and Java, this is impossible, since a constructor is guaranteed to return a fresh object. However, languages such as Python do provide mechanisms to change the behavior of constructors (and other parts of the language) in a very flexible way, using metaprogramming (see Part 1).
To implement a general interning mechanism in Python, we will create an @intern decorator for classes whose elements should be interned. This decorator will add or modify the class's __new__() method to use an interning table, and the __init__() method to ensure that object initialization is done exactly once per object.
Applying Interning to a Class
Default Interning
The simplest way to apply interning to a class is just to decorate it with @intern, as in the following example:
@intern
class Foo:
def __init__(self,
s1: str,
i1: int,
t1: tuple,
set1: frozenset,
kt1: tuple = ()):
self.s1 = s1
self.i1 = i1
self.t1 = t1
self.set1 = set1
self.kt1 = kt1
(All the examples shown here are taken from the test file.)
This class has five fields, which are initialized from corresponding parameters of the constructor. If you create two instances of this class with equal (but not necessarily identical) parameters, you will get the same identical instance in both cases, as shown by the first test case:
obj1 = Foo('test1', 1, ((1, 2), (3, 4)), frozenset((10, 11)))
obj2 = Foo('test1', 1, ((1, 2), (3, 4)), frozenset((11, 10)))
self.assertIs(obj1, obj2)
Note that the tuple and sets are distinct objects, yet the values of obj1 and obj2 are the same object (in other words, obj1 is obj2 would evaluate to True).
This implementation of the @intern decorator indexes objects based on the arguments to the constructor rather than the object's actual fields (which are not available unless the object is actually created, defeating the purpose of the exercise). This is based on the assumption that equal constructor arguments create equal objects; an assumption that often but not always true. But don't worry! This behavior can be customized, as we will see shortly.
Another caveat of the default implementation is that supplying the default value of a keyword argument in the constructor call looks different than a call that omits that keyword argument. Therefore, the next object, obj3, is not the same as the previous two:
obj3 = Foo('test1', 1, ((1, 2), (3, 4)), frozenset((11, 10)), kt1=())
self.assertIsNot(obj1, obj3)
Similarly, providing a keyword argument positionally, without using the keyword, is different from anything else we have seen above:
obj4 = Foo('test1', 1, ((1, 2), (3, 4)), frozenset((11, 10)), ())
self.assertIsNot(obj1, obj4)
self.assertIsNot(obj3, obj4)
Using a Key Function
These issues can be solved by a very general mechanism built into the intern decorator: user-provided key functions. These are functions that accept the same parameters as the class constructor, and return an object that can be used to index class objects in the internal table. A simple and common example is the following:
def key_func(s1: str, i1: int, t1: tuple, set1: frozenset, kt1: tuple = ()):
return s1, i1, t1, set1, kt1
@intern(key=key_func)
class Bar:
def __init__(self, s1: str, i1: int, t1: tuple, set1: frozenset, kt1: tuple = ()):
self.s1 = s1
self.i1 = i1
self.t1 = t1
self.set1 = set1
self.kt1 = kt1
This class is defined with the @intern decorator, but in this case it is given a key parameter, which is the function key_func. This function simply returns a tuple of all the constructor parameters, but it puts the value of the keyword parameter into the same tuple, regardless of how it was specified in the construtor call (positionally, by keyword, or omitted). The Bar constructor calls corresponding to the four examples shown above for Foo therefore all return the same object:
obj1 = Bar('test1', 1, ((1, 2), (3, 4)), frozenset((10, 11)))
obj2 = Bar('test1', 1, ((1, 2), (3, 4)), frozenset((11, 10)))
self.assertIs(obj1, obj2)
obj3 = Bar('test1', 1, ((1, 2), (3, 4)), frozenset((11, 10)), kt1=())
self.assertIs(obj1, obj3)
obj4 = Bar('test1', 1, ((1, 2), (3, 4)), frozenset((11, 10)), ())
self.assertIs(obj1, obj4)
Mutable Constructor Arguments
Python refuses to use objects that are not hashable as keys of tables (such as instances of dict). In particular, mutable objects (those whose state can change during runtime) should not be hashable, otherwise they will be lost if used as table keys when their values (and therefore also their hash values) change.
This means that any mutable (or, in general, non hashable) argument to the constructor of a class that uses interning must not be used as a key. For example, the following test succeeds because the creation of the object raises an exception:
@intern
class Qux:
def __init__(self, s1: str, i1: int, l1: list, set1: set, kt1: tuple = ()):
self.s1 = s1
self.i1 = i1
self.l1 = l1
self.set1 = set1
self.kt1 = kt1
try:
Qux('test1', 1, [(1, 2), (3, 4)], {10, 11})
except TypeError:
pass
else:
self.fail("Expected TypeError: unhashable type: 'list'")
The class Qux is different from Foo only in the type of the third constructor parameter, which is a list instead of a tuple. (Unless you activate runtime type checking, you will not see a difference between the behaviors of Foo and Qux.) Since the default behavior of @intern is to use the constructor arguments to compose the key, Python will raise a TypeError when a mutable list and a mutable set are used as the key to a table.
Key functions can be used to solve this issue as well. The key function needs to construct a hashable key; for example, it can replace lists by tuples, sets by frozen sets (elements of frozenset), and tables by tuples of key-value pairs (each of which is itself a tuple). This is done in the following example:
def key_func_for_list(s1: str, i1: int, l1: list, set1: set, kt1: tuple = ()):
return s1, i1, tuple(l1), frozenset(set1), kt1
@intern(key=key_func_for_list)
class Quux:
def __init__(self, s1: str, i1: int, l1: list, set1: set, kt1: tuple = ()):
self.s1 = s1
self.i1 = i1
self.l1 = l1
self.set1 = set1
self.kt1 = kt1
obj1 = Quux('test1', 1, [(1, 2), (3, 4)], {10, 11})
obj2 = Quux('test1', 1, [(1, 2), (3, 4)], {11, 10})
self.assertIs(obj1, obj2)
obj3 = Quux('test1', 1, [(1, 2), (3, 4)], {11, 10}, kt1=())
self.assertIs(obj1, obj3)
obj4 = Quux('test1', 1, [(1, 2), (3, 4)], {11, 10}, ())
self.assertIs(obj1, obj4)
This now works as expected, with all four constructor calls returning the same object.
In general, of course, you need to take care of deep data structures that have unhashable elements; for example, a tree that is represented as lists nested within other lists must be converted to another form, such as tuples nested in tuples.
Indexing by Object Identity
Sometimes you want to intern your objects based on the identity of the constructor arguments rather than by equality. This is also easy to do using a key function, as in the following example:
def key_by_id(s1: str, i1: int, l1: list, set1: set, kt1: tuple = ()):
return tuple(map(id, (s1, i1, l1, set1, kt1)))
@intern(key=key_by_id)
@dataclass
class Xyzzy:
s1: str
i1: int
l1: list
set1: set
kt1: tuple = ()
list1 = [(1, 2), (3, 4)]
set1 = {10, 11}
obj1 = Xyzzy('test1', 1, list1, set1)
obj2 = Xyzzy('test1', 1, list1, set1, kt1=())
self.assertIs(obj1, obj2)
obj3 = Xyzzy('test1', 1, [(1, 2), (3, 4)], set1, kt1=())
self.assertIsNot(obj1, obj3)
In this case, the key consists of the "identities" of the constructor parameters, as given by the built-in Python function id (in CPython these are memory addresses).
Both obj1 and obj2 are constructed with identical parameters, and are therefore the same object. (Note that the string and int arguments are identical because of the interning behavior of Python for these data types; this behavior is undocumented and may differ between implementations of versions of Python.)
However, obj3 is given a different list, and is therefore not identical to the previous object.
This example also demonstrates that @intern can be applied on data classes (decorated with @dataclass). However, in such cases you must ensure that the @intern decorator comes before @dataclass (which means that it is applied after it), since @dataclass will not create an __init__() method if one has been created by @intern, and @intern must have access to the __init__() method generated by @dataclass.
Caveat
Suppose you use object identities as keys, as in the Xyzzy example, but the generated object does not keep a pointer to the argument, as in the following example:@intern(key=key_by_id)
class Gritch:
def __init__(self, s1: str, i1: int, l1: list, set1: set, kt1: tuple = ()):
self.s1 = s1
self.i1 = i1
self.l1 = tuple(l1)
self.set1 = tuple(set1)
self.kt1 = kt1
The following test case passes on my current Python installation (CPython 3.11.1 for x64):
obj1 = Gritch('test1', 1, [(1, 2), (3, 4)], {10, 11})
obj2 = Gritch('test1', 1, [(10, 20), (30, 40)], {100, 110})
self.assertIs(obj1, obj2)
How is this possible? The values of obj1 and obj2 should be different, and so they can never be represented by the same object!
The reason for this is that the first list can be garbage-collected immediately after obj1 is created, since there are no longer any references to it. The second list (used to initialize obj2) can in principle be allocated at the same address as the first one, and the same holds for the set argument. The Python garbage collector (in my installation, at least) takes advantage of this fact, and reuses the memory without even waiting for a full garbage-collection cycle. The keys used for interning both objects therefore contain the same ids for the list and set, and the constructor returns obj1, which is indexed under this key.
This can never happen if you object keeps a reference to the arguments used in the key, since they can't be garbage collected as long as the object exists in the table, and so this object will never be confused for any other.
Singleton
The same approach can be used to create singleton classes, where the first call to the constructor creates the singleton object, and further calls return the same object.
In fact, if the constructor has no parameters (except for self, of course), the @intern decorator will create a singleton. However, a direct implementation of the singleton pattern may be very slightly more efficient.
Implementing Interning
Note: the implementation described below can be found in the github IBM-Metaprogramming-for-Python repo, in intern.py.
The @intern decorator is actually an instance of the class Intern, defined as follows:
intern = Intern()
This enables users to create independent interning objects, with possibly different behaviors (in this implementation, default key functions). The Intern class constructor accepts an optional default_key parameter; if provided, it should be a function that takes a class object as a parameter and returns the appropriate key function. (Of course, it may ignore the class parameter and return the same key function for all classes decorated with this instance.)
Let's call an instance of the Intern class an interner. In order to be used as a decorator, an interner needs to be callable; the main interning functionality is therefore implemented inside the __call__() method, whose header is:
def __call__(self,
cls: Optional[type] = None, *,
key: Callable = None,
reset_method: Optional[str] = 'reset_intern_memory'):
The first (positional) parameter cls is the decorated class. The keyword parameters can be provided as arguments to the decorator; for example, we used @intern(key=key_by_id) as a decorator for Xyzzy and Gritch. The key parameter is an optional key function, as explained above. The reset_method parameter is an optional name for a method that can be used to reset the interning table, or None to prevent the creation of such a method.
The decorator returns the same class it received, suitably modified.
The following lines allow the decorator to be used with or without arguments:
if cls is None:
return partial(self, key=key, reset_method=reset_method)
In brief, if no class argument is provided, the call expects to receive the keyword arguments, and returns another decorator function that encapsulates those arguments and can be called on the class.
The next lines choose a key function, giving precedence to the one given as a decorator parameter over the default_key of the interner:
if key is None and self.default_key is not None:
key = self.default_key(cls)
The table used by the interner is created by the expression
memory = WeakValueDictionary()
A WeakValueDictionary is a data structure similar to a dict, except that it doesn't prevent the garbage collector to collect its values (provided, of course, they have no other [strong] references). This is done in order to avoid memory leaks due to stale objects being held only by the interner.
The next order of business, and perhaps the most significant, is the creation (or modification) of the decorated class's __new__() method. This is where the table is searched for an existing object, which will be returned if it is found without creating a new object. If no existing object is found under the key, a new object will be generated, put into the table, and returned.
The code has two possible versions of this method, depending on whether or not the class already has a __new__() method. If it does, it will be called to create the new object if necessary.
If the class has no __new__() method of its own (that is, it uses object.__new__), the code calls object.__new__(). That method does not accept any additional arguments beyond the class itself, unlike custom __new__() methods that can be given the constructor arguments. The definition of the new __new__() method in this case is:
def __new__(cls, *args, **kwargs):
element_key = (args, tuple(sorted(kwargs.items()))) if key is None else key(*args, **kwargs)
try:
result = memory[element_key]
setattr(result, '*already_initialized*', True)
return result
except KeyError:
result = object.__new__(cls)
memory[element_key] = result
return result
The try block attempts to retrieve a previously-created object from the table; if successful, this object will be returned. (We will discuss the strangely-named *already_initialized* attribute a little later.)
If the key doesn't exist in the table, a KeyError exception will be raised. In this case, we will call object.__new__() to create a new (uninitialized) object, put it in the table, and return it.
If the class has a custom __new__() method, the definition of the new __new__() method is only slightly different, as marked in color below; old_new is the value of getattr(cls, '__new__'):
def __new__(cls, *args, **kwargs):
element_key = ((args, tuple(sorted(kwargs.items())))
if key is None
else key(*args, **kwargs))
try:
result = memory[element_key]
setattr(result, '*already_initialized*', True)
return result
except KeyError:
result = old_new(cls, *args, **kwargs)
memory[element_key] = result
return result
In either case, the new method is installed on the class using the call setattr(cls, '__new__', __new__).
According to Python's data model, the object returned by a __new__() method should be initialized by invoking on it the __init__() method, with the provided constructor arguments. However, this rule is not followed if the __new__() method returns an object whose type is different from the class whose constructor was called. Recall that the __new__() method is allowed to return a value of any type; it would not make sense to initialize an object or a different class with the arguments appropriate to the original class.
This rule implies that if __new__() does return an object of the same class, it will be initialized. In our case, this will be wrong, because __new__() will return an already-initialized object from its table. The state of this object may already have been modified, and initializing it again would incorrectly overwrite these changes. We therefore need to prevent a second initialization. This is the purpose of the *already_initialized* attribute; we set this to indicate that the object shouldn't be initialized again. We use a name that is not a legal identifier in order to prevent accidental collisions with users who may choose a similar, but legal, name for a class attribute.
The way to prevent re-initialization is to create or modify the class's __init__() method so it skips initialization if *already_initialized* is true for the object. Again there are two cases, depending on whether or not the class already has an __init__() method:
init = getattr(cls, '__init__', None)
if init is None:
def __init__(self, *args, **kwargs):
if not (hasattr(self, '*already_initialized*')
and getattr(self, '*already_initialized*')):
super().__init__(self, *args, **kwargs)
else:
@wraps(init)
def __init__(self, *args, **kwargs):
if not (hasattr(self, '*already_initialized*')
and getattr(self, '*already_initialized*')):
init(self, *args, **kwargs)
setattr(cls, '__init__', __init__)
Finally a reset method is created if requested, and the original (but modified) class is returned:
if reset_method:
def reset_intern_method():
nonlocal memory
memory = WeakValueDictionary()
setattr(cls, reset_method, reset_intern_method)
return cls