
Advanced Python Data Classes: Custom Tools

Python’s dataclasses module, added in 3.7, is a great way to create classes designed to hold data. Although data classes don’t do anything that a regular class couldn’t do, they take out a lot of boilerplate code and let you focus on the data.

If you aren’t already familiar with dataclasses, check out the docs. There are also plenty of great tutorials covering their features.

In this tutorial, we’re going to look at a way to write tools that extend dataclasses.

Let’s start with a simple dataclass that holds a UUID, username, and email address of a user.

from dataclasses import dataclass, field
import uuid


@dataclass
class UserData:
    username: str
    email: str
    _id: uuid.UUID = field(default_factory=uuid.uuid4)


if __name__ == "__main__":
    username = input("Enter username: ")
    email = input("Enter your email address: ")

    data = UserData(username, email)
    print(data)

This is pretty simple. Ask the user for a username and an email address, then show them the shiny new data class instance that we made using their information. The class will, by default, generate a unique id for every user.
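To see the default id in action without typing anything in, here’s a minimal sketch that builds two users directly. The usernames and email addresses are made up for illustration.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class UserData:
    username: str
    email: str
    # default_factory is called once per instance, so ids are not shared.
    _id: uuid.UUID = field(default_factory=uuid.uuid4)


# Made-up example values.
first = UserData("alice", "alice@example.com")
second = UserData("bob", "bob@example.com")

print(first)
print(second)
```

Each instance gets its own UUID because uuid.uuid4 is called again for every new object, rather than once when the class is defined.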

But what if we have sneaky users who might try giving an invalid email address, just to break things?

It’s simple enough to extend data classes to support field validation. dataclass is just a decorator that takes a class and adds various methods and attributes to it, so let’s make our own decorator that does the same thing.

def validated_dataclass(cls):
    cls.__post_init__ = lambda self: print("Initializing!")
    cls = dataclass(cls)
    return cls

@validated_dataclass
class UserData:
    ...  # same fields as before

Here, we add a simple __post_init__ method to the class before handing it to dataclass. The generated __init__ calls __post_init__ every time we create an instance, right after the fields have been set. But how can we use this power to validate an email address?
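Before answering that, here’s a quick sketch to confirm when the hook runs. It’s a variation on the decorator above, assuming a hypothetical seen_email attribute that exists only to record what the hook could see.

```python
from dataclasses import dataclass


def validated_dataclass(cls):
    def __post_init__(self):
        # By the time this runs, the generated __init__ has set every field.
        # seen_email is a hypothetical attribute used only for this demo.
        self.seen_email = self.email

    # Attach the hook first so that dataclass() notices it.
    cls.__post_init__ = __post_init__
    return dataclass(cls)


@validated_dataclass
class UserData:
    username: str
    email: str


user = UserData("alice", "alice@example.com")
print(user.seen_email)
```

So the hook has access to the finished instance, but it has no idea which fields need checking or how to check them.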

This is where the metadata argument of a field comes in. Basically, it’s a mapping (usually a dict) that we can set when defining a field in the data class. The regular dataclass implementation stores it on the field as a read-only mapping and otherwise ignores it completely, so we can use it to include information about the field for our own purposes.
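As a small sketch of how that works, the class below attaches metadata to one field and reads it back with fields(). The Example class and the "unit" key are invented for illustration; dataclasses itself never looks at them.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Example:
    name: str
    # "unit" is an arbitrary key chosen for this sketch.
    length: float = field(default=0.0, metadata={"unit": "metres"})


# Every field has a metadata mapping; it is simply empty if none was given.
for f in fields(Example):
    print(f.name, dict(f.metadata))

length_field = next(f for f in fields(Example) if f.name == "length")

# The stored mapping is read-only, so item assignment is rejected.
try:
    length_field.metadata["unit"] = "feet"
    metadata_is_read_only = False
except TypeError:
    metadata_is_read_only = True
```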

Here’s how UserData looks after adding a validator for the email field.

from dataclasses import dataclass, field
import uuid

def validate_email(value):
    # Deliberately minimal check: reject anything without an "@".
    if "@" not in value:
        raise ValueError("There must be an '@' in your email!")

    return value


@validated_dataclass
class UserData:
    username: str
    email: str = field(metadata={"validator": validate_email})
    _id: uuid.UUID = field(default_factory=uuid.uuid4)

Now the email field of the data class will carry around that validator function, so that anyone can access it. Let’s update the decorator to make use of it.

from dataclasses import dataclass, field, fields

def validated_dataclass(cls):
    cls = dataclass(cls)

    def _set_attribute(self, attr, value):
        # Named f so the loop variable doesn't shadow the imported field().
        for f in fields(self):
            if f.name == attr and "validator" in f.metadata:
                # The validator may raise, or return a (possibly cleaned) value.
                value = f.metadata["validator"](value)
                break

        object.__setattr__(self, attr, value)

    cls.__setattr__ = _set_attribute
    return cls

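The new decorator replaces the regular __setattr__ with a function that first looks at the metadata of the fields. If there is a validator function associated with the attribute, it calls the function and uses its return value as the value to set. Because the generated __init__ sets each field through ordinary attribute assignment, the validator runs when an instance is created as well as on any later assignment.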
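Putting the pieces together, here’s a self-contained sketch that exercises the decorator. The usernames and addresses are made up, and the errors list is only there to collect the messages for the demo.

```python
from dataclasses import dataclass, field, fields
import uuid


def validate_email(value):
    if "@" not in value:
        raise ValueError("There must be an '@' in your email!")
    return value


def validated_dataclass(cls):
    cls = dataclass(cls)

    def _set_attribute(self, attr, value):
        for f in fields(self):
            if f.name == attr and "validator" in f.metadata:
                value = f.metadata["validator"](value)
                break
        object.__setattr__(self, attr, value)

    cls.__setattr__ = _set_attribute
    return cls


@validated_dataclass
class UserData:
    username: str
    email: str = field(metadata={"validator": validate_email})
    _id: uuid.UUID = field(default_factory=uuid.uuid4)


# A valid address passes straight through.
user = UserData("alice", "alice@example.com")

errors = []  # demo-only: collects the validation messages

try:
    UserData("bob", "not-an-email")  # rejected while __init__ sets the field
except ValueError as exc:
    errors.append(str(exc))

try:
    user.email = "still-not-an-email"  # rejected on later assignment too
except ValueError as exc:
    errors.append(str(exc))

print(errors)
print(user.email)
```

Note that the failed assignment leaves the original email in place, since the validator raises before object.__setattr__ is reached.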

The power of this approach is that now anybody can validate fields on their data classes by importing this decorator and defining a validator function in the metadata of their field. It’s a drop-in replacement for the bare @dataclass decorator.

One downside to this is the performance cost. Even attributes that don’t need validation will run through the list of fields every time they’re set. In another article, I’ll look at how much of a cost this actually is, and explore some optimizations we can make to reduce the overhead.
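As one possible direction (a sketch only, and not necessarily the optimization a follow-up would settle on), the validators could be collected into a dict once, when the class is decorated, so that each assignment costs a single dictionary lookup. The lowercase validator and Account class are hypothetical names used for the demo.

```python
from dataclasses import dataclass, field, fields


def validated_dataclass(cls):
    cls = dataclass(cls)

    # Build the name -> validator lookup once, at decoration time.
    validators = {
        f.name: f.metadata["validator"]
        for f in fields(cls)
        if "validator" in f.metadata
    }

    def _set_attribute(self, attr, value):
        validator = validators.get(attr)
        if validator is not None:
            value = validator(value)
        object.__setattr__(self, attr, value)

    cls.__setattr__ = _set_attribute
    return cls


def lowercase(value):
    # Hypothetical validator that normalizes instead of rejecting.
    return value.lower()


@validated_dataclass
class Account:
    username: str
    email: str = field(metadata={"validator": lowercase})


account = Account("Alice", "ALICE@Example.com")
print(account)
```

The trade-off is that the lookup is frozen at decoration time, which is fine here because a data class’s fields don’t change after the class is created.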

Another downside is that setting metadata on every field can hurt readability. If that becomes a problem, you could try defining the metadata dict elsewhere, so the field would look like email: str = field(metadata=email_metadata).
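A minimal sketch of that idea follows. The Subscriber class is invented to show the same dict being reused, and the plain @dataclass decorator is used only to keep the example short; swapping in validated_dataclass would make the validator actually run.

```python
from dataclasses import dataclass, field, fields


def validate_email(value):
    if "@" not in value:
        raise ValueError("There must be an '@' in your email!")
    return value


# Defined once, away from the class bodies, and reusable across classes.
email_metadata = {"validator": validate_email}


@dataclass
class UserData:
    username: str
    email: str = field(metadata=email_metadata)


@dataclass
class Subscriber:  # hypothetical second class sharing the same metadata
    email: str = field(metadata=email_metadata)


for cls in (UserData, Subscriber):
    email_field = next(f for f in fields(cls) if f.name == "email")
    print(cls.__name__, email_field.metadata["validator"].__name__)
```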

The possible uses of metadata are limitless! Combined with custom decorators that use dataclass behind the scenes, we can add all sorts of functionality to data classes.

For serious validation needs, you’re probably still better off using something like Pydantic or Marshmallow rather than making your own. Each of them either supports data classes out of the box or has separate packages available that add that support.

If you have any ideas for extending data classes, let me know in the comments!
