Python Dataclasses vs Pydantic: The Complete Production Guide
TL;DR
Python dataclasses (standard library) give you type-annotated classes with auto-generated __init__, __repr__, and __eq__ (plus ordering methods on request), which removes the boilerplate from plain data containers. Pydantic v2 adds runtime validation, type coercion, JSON parsing, and settings management on top of the same annotation-driven style, powered by a Rust core for performance. Use dataclasses for internal domain models whose types are already guaranteed correct. Use Pydantic at system boundaries (APIs, configs, external data) where validation matters. This guide covers the features, footguns, and real-world patterns of both.
PART 1 — Python Dataclasses: Complete Coverage
1. Motivation and Design Goals
Before Python 3.7, creating a simple data container required verbose boilerplate:
class User:
    def __init__(self, id: int, name: str, email: str):
        self.id = id
        self.name = name
        self.email = email

    def __repr__(self):
        return f"User(id={self.id!r}, name={self.name!r}, email={self.email!r})"

    def __eq__(self, other):
        if not isinstance(other, User):
            return NotImplemented
        return (self.id, self.name, self.email) == (other.id, other.name, other.email)
Maintaining this is tedious and error-prone. Add a field? Update three methods. Forget to update __eq__? Subtle bugs.
Dataclasses solve this: they auto-generate these methods from type annotations. The design goals:
- Zero runtime overhead: Generated code is identical to what you’d write by hand
- Type-checker friendly: Annotations drive behavior, not runtime inspection
- Opt-in magic: Control exactly which methods get generated
- No new base class: Works with regular classes, inheritance, descriptors
from dataclasses import dataclass

@dataclass
class User:
    id: int
    name: str
    email: str
That’s it. You get __init__, __repr__, and __eq__ for free. The decorator introspects class annotations at definition time and injects methods.
2. The @dataclass Decorator In Depth
@dataclass(
    init=True,           # Generate __init__
    repr=True,           # Generate __repr__
    eq=True,             # Generate __eq__
    order=False,         # Generate __lt__, __le__, __gt__, __ge__
    unsafe_hash=False,   # Force-generate __hash__ (dangerous, see below)
    frozen=False,        # Make instances immutable
    match_args=True,     # Generate __match_args__ for pattern matching (3.10+)
    kw_only=False,       # Make all fields keyword-only (3.10+)
    slots=False,         # Generate __slots__ (3.10+)
    weakref_slot=False,  # Add __weakref__ to __slots__ (3.11+, requires slots=True)
)
class Example:
    ...
init, repr, eq
These are self-explanatory. Setting init=False means you’ll provide your own __init__. Useful when you need custom initialization logic but still want __repr__ and __eq__.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(init=False)
class Timestamped:
    created_at: datetime

    def __init__(self):
        self.created_at = datetime.now(timezone.utc)
order=True
Generates comparison methods based on field order. Fields are compared as tuples.
@dataclass(order=True)
class Version:
major: int
minor: int
patch: int
v1 = Version(1, 2, 3)
v2 = Version(1, 3, 0)
assert v1 < v2 # Compares (1,2,3) < (1,3,0)
Footgun: order=True requires eq=True. Passing eq=False together with order=True raises ValueError when the class is defined, because ordering without equality would break the invariant that a <= b and b <= a implies a == b.
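In practice the decorator refuses eq=False combined with order=True outright. A minimal sketch, assuming the standard CPython dataclasses module; the class name Broken is made up for illustration:

```python
from dataclasses import dataclass

try:
    @dataclass(eq=False, order=True)
    class Broken:
        value: int
except ValueError as exc:
    # The combination is rejected when the class is defined,
    # before any instance can be created.
    error_message = str(exc)
else:
    error_message = None

print(error_message)
```

Because the failure happens at import time, this mistake cannot reach production unnoticed.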
unsafe_hash=True: The Danger Zone
Hashing mutable objects is a bug waiting to happen:
@dataclass(unsafe_hash=True)  # BAD IDEA
class Mutable:
    value: int

d = {}
m = Mutable(value=1)
d[m] = "found"
m.value = 2  # MUTATE
print(d[m])  # KeyError! Hash changed, dict can't find it
Safe pattern: Only use unsafe_hash=True with frozen=True.
frozen=True: Immutability
Makes instances immutable after __init__. Attempts to assign raise FrozenInstanceError.
@dataclass(frozen=True)
class Point:
x: float
y: float
p = Point(1.0, 2.0)
p.x = 3.0 # FrozenInstanceError
Frozen dataclasses are hashable by default (no need for unsafe_hash), making them safe for dict keys and sets.
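The hashing rules can be checked directly. The sketch below assumes default decorator arguments apart from frozen; MutablePoint is a made-up counterpart that shows what happens without frozen=True:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    x: float
    y: float

@dataclass  # eq=True but not frozen: __hash__ is set to None
class MutablePoint:
    x: float
    y: float

visited = {Point(1.0, 2.0)}
# A second instance with equal fields hashes the same, so it is found.
is_member = Point(1.0, 2.0) in visited

try:
    hash(MutablePoint(1.0, 2.0))
    mutable_is_hashable = True
except TypeError:
    mutable_is_hashable = False
```

The default (eq=True, frozen=False) deliberately makes instances unhashable, which is why unsafe_hash exists as an explicit override.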
Performance: Frozen dataclasses aren’t faster at runtime. The immutability is enforced by replacing __setattr__ and __delattr__, not through memory layout tricks.
slots=True: Memory Optimization
Python 3.10+ allows slots=True to use __slots__ for memory efficiency:
@dataclass(slots=True)
class Compact:
a: int
b: str
Without slots, each instance carries a __dict__ (~200 bytes overhead). With slots, attributes are stored in a fixed array. For millions of instances, this matters.
Trade-off: You can’t add arbitrary attributes:
c = Compact(1, "hi")
c.new_field = 123 # AttributeError: 'Compact' object has no attribute 'new_field'
Inheritance caveat: All classes in the hierarchy must use slots=True, or you lose the benefit.
3. Field Mechanics
field() Function
from dataclasses import dataclass, field
@dataclass
class Record:
id: int
tags: list[str] = field(default_factory=list)
metadata: dict = field(default_factory=dict, repr=False)
_internal: int = field(default=0, init=False)
Parameters:
- default: Default value. Mutable defaults such as list, dict, and set are rejected; use default_factory for those
- default_factory: Zero-argument callable that returns the default (for mutable defaults)
- init: Include in __init__ (default True)
- repr: Include in __repr__ (default True)
- compare: Include in __eq__ and ordering (default True)
- hash: Include in __hash__ (default None means use compare)
- metadata: Arbitrary dict for tooling (not used by dataclasses itself)
The Mutable Default Footgun
@dataclass
class Bad:
    items: list = []  # ValueError at class definition: mutable default not allowed

@dataclass
class Good:
    items: list = field(default_factory=list)
Why? A plain class-level list would be shared by every instance, so dataclasses rejects list, dict, and set defaults outright. default_factory creates a new list per instance.
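A quick check of both behaviours, assuming the standard CPython dataclasses module: the list default is refused with a ValueError when the class is defined, and default_factory hands each instance its own list.

```python
from dataclasses import dataclass, field

try:
    @dataclass
    class Bad:
        items: list = []
except ValueError:
    # Raised by the decorator, not by the parser.
    rejected = True
else:
    rejected = False

@dataclass
class Good:
    items: list = field(default_factory=list)

first, second = Good(), Good()
first.items.append("a")
print(rejected, first.items, second.items)
```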
init=False: Manual Initialization
@dataclass
class Computed:
width: int
height: int
area: int = field(init=False)
def __post_init__(self):
self.area = self.width * self.height
c = Computed(10, 20)
print(c.area) # 200
repr=False: Hide Sensitive Data
@dataclass
class Credentials:
username: str
password: str = field(repr=False)
creds = Credentials("admin", "secret123")
print(creds) # Credentials(username='admin') # password hidden
compare=False: Exclude From Comparisons
@dataclass
class CachedData:
key: str
value: str
cache_time: float = field(compare=False)
# Two instances are equal if key/value match, ignoring cache_time
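The effect of compare=False is easiest to see with two instances that differ only in the excluded field. A small sketch with made-up values:

```python
from dataclasses import dataclass, field

@dataclass
class CachedData:
    key: str
    value: str
    cache_time: float = field(compare=False)

old = CachedData("user:1", "Alice", cache_time=100.0)
new = CachedData("user:1", "Alice", cache_time=250.0)

same = old == new  # cache_time is ignored
different = old == CachedData("user:1", "Bob", cache_time=100.0)
print(same, different)
```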
metadata: Custom Annotations
Used by third-party libraries (e.g., serialization frameworks):
@dataclass
class APIModel:
user_id: int = field(metadata={"json_name": "userId"})
created_at: datetime = field(metadata={"format": "iso8601"})
# Access via fields()
from dataclasses import fields
for f in fields(APIModel):
print(f.name, f.metadata)
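As an illustration of how a tool might consume metadata, the sketch below renames keys on output. The to_wire helper and the "json_name" key are hypothetical conventions for this example, not part of the dataclasses module:

```python
from dataclasses import dataclass, field, fields

@dataclass
class APIModel:
    user_id: int = field(metadata={"json_name": "userId"})
    display_name: str = field(default="")

def to_wire(obj) -> dict:
    """Hypothetical helper: use the 'json_name' metadata entry as the key,
    falling back to the Python field name."""
    return {
        f.metadata.get("json_name", f.name): getattr(obj, f.name)
        for f in fields(obj)
    }

payload = to_wire(APIModel(user_id=7, display_name="Alice"))
print(payload)
```

Field.metadata is exposed as a read-only mapping, so .get() with a fallback works for fields that declare no metadata.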
4. Post-Init Lifecycle
__post_init__
Called after __init__ completes. Use for validation, computed fields, or normalization:
@dataclass
class Email:
    address: str

    def __post_init__(self):
        if "@" not in self.address:
            raise ValueError(f"Invalid email: {self.address}")
        self.address = self.address.lower()
Modifying Frozen Dataclasses
You can’t assign to frozen instances normally, but __post_init__ has a workaround:
@dataclass(frozen=True)
class Normalized:
    name: str
    normalized: str = field(init=False)

    def __post_init__(self):
        # Use object.__setattr__ to bypass the frozen check
        object.__setattr__(self, "normalized", self.name.lower())

n = Normalized("HELLO")
print(n.normalized)  # hello
n.normalized = "x"   # FrozenInstanceError
5. Inheritance Rules and Pitfalls
Subclasses inherit fields from parents:
@dataclass
class Base:
x: int
@dataclass
class Derived(Base):
y: int
# Derived.__init__(x, y)
Ordering matters: Fields without defaults must come before fields with defaults:
@dataclass
class Parent:
a: int
b: int = 10
@dataclass
class Child(Parent):
c: int # ERROR! Non-default after default
Fix: Give c a default, make it keyword-only (kw_only=True, Python 3.10+), or rework the hierarchy.
Override fields:
@dataclass
class Parent:
x: int = 10
@dataclass
class Child(Parent):
x: int = 20 # Overrides default
c = Child()
print(c.x) # 20
Slots inheritance: If the parent doesn’t use slots, the child won’t get slot benefits even if it specifies slots=True.
6. Dataclasses + Typing
Optional, Union, Literal
from typing import Optional, Literal
@dataclass
class Config:
host: str
port: int
tls: bool
log_level: Literal["DEBUG", "INFO", "ERROR"] = "INFO"
proxy: Optional[str] = None
Dataclasses don’t validate types at runtime. Type checkers (mypy, pyright) will catch errors, but:
c = Config(host=123, port="wat", tls="yes") # No error at runtime!
This is intentional. Dataclasses are about reducing boilerplate, not validation.
ClassVar: Class-Level Attributes
from dataclasses import dataclass
from typing import ClassVar
@dataclass
class Versioned:
VERSION: ClassVar[int] = 2
data: str
ClassVar tells the dataclass decorator to ignore this field (not in __init__, etc.).
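This can be confirmed with fields(), which lists only real dataclass fields:

```python
from dataclasses import dataclass, fields
from typing import ClassVar

@dataclass
class Versioned:
    VERSION: ClassVar[int] = 2
    data: str

field_names = [f.name for f in fields(Versioned)]
record = Versioned(data="payload")  # VERSION is not an __init__ parameter
print(field_names, record.VERSION)
```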
InitVar: Init-Only Parameters
Fields that exist only during __init__, not as instance attributes:
from dataclasses import dataclass, field, InitVar

@dataclass
class Database:
    host: str
    port: int
    timeout: InitVar[int] = 30
    connection_string: str = field(init=False)

    def __post_init__(self, timeout: int):
        self.connection_string = f"postgresql://{self.host}:{self.port}?timeout={timeout}"

db = Database("localhost", 5432, timeout=60)
print(db.connection_string)  # postgresql://localhost:5432?timeout=60
# timeout is not stored on the instance. Because this InitVar has a default,
# db.timeout falls back to the class attribute (30), not the 60 passed in.
# Without a default, db.timeout raises AttributeError.
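One subtlety deserves a check: an InitVar is never stored on the instance, but when it has a default, that default stays behind as a class attribute. The sketch below reuses the Database example to show both facts:

```python
from dataclasses import dataclass, field, fields, InitVar

@dataclass
class Database:
    host: str
    port: int
    timeout: InitVar[int] = 30
    connection_string: str = field(init=False)

    def __post_init__(self, timeout: int):
        self.connection_string = f"postgresql://{self.host}:{self.port}?timeout={timeout}"

db = Database("localhost", 5432, timeout=60)

field_names = [f.name for f in fields(db)]   # InitVar is not a real field
stored_on_instance = "timeout" in vars(db)   # nothing stored per instance
class_level_value = db.timeout               # resolves to the class default
print(field_names, stored_on_instance, class_level_value)
```

If the value passed to __init__ is needed later, copy it to a regular field inside __post_init__.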
7. Dataclasses + Serialization
asdict() and astuple()
from dataclasses import asdict, astuple
@dataclass
class Point:
x: float
y: float
p = Point(1.5, 2.5)
print(asdict(p)) # {'x': 1.5, 'y': 2.5}
print(astuple(p)) # (1.5, 2.5)
Deep conversion: Nested dataclasses are recursively converted:
@dataclass
class Line:
start: Point
end: Point
line = Line(Point(0, 0), Point(10, 10))
print(asdict(line))
# {'start': {'x': 0, 'y': 0}, 'end': {'x': 10, 'y': 10}}
Footgun: asdict() doesn’t handle arbitrary objects gracefully:
@dataclass
class Record:
timestamp: datetime
r = Record(datetime.now())
asdict(r) # Returns {'timestamp': <datetime object>}, NOT a string
You need custom serialization for complex types. Common pattern:
def to_serializable(obj):
if isinstance(obj, datetime):
return obj.isoformat()
return obj
def serialize_dataclass(dc):
return {k: to_serializable(v) for k, v in asdict(dc).items()}
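The pattern above only converts top-level values. An alternative sketch that also reaches nested values is to pass a fallback encoder to json.dumps via its default parameter; the Event and Invoice classes and the encode_extra function are invented for this example:

```python
import json
from dataclasses import asdict, dataclass
from datetime import date, datetime, timezone
from decimal import Decimal

@dataclass
class Event:
    name: str
    at: datetime

@dataclass
class Invoice:
    total: Decimal
    events: list[Event]

def encode_extra(obj):
    """Called by json.dumps for any value it cannot encode itself."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

invoice = Invoice(
    Decimal("9.99"),
    [Event("paid", datetime(2024, 1, 1, tzinfo=timezone.utc))],
)
text = json.dumps(asdict(invoice), default=encode_extra)
print(text)
```

The hook runs wherever an unsupported value appears, so nesting depth does not matter.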
Deserializing: No Built-In Support
Dataclasses don’t have from_dict(). You must construct instances manually:
data = {"x": 1.5, "y": 2.5}
p = Point(**data)
For nested structures, you need recursion or a library (e.g., dacite, cattrs).
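For simple cases the recursion is short. The from_dict helper below is a hypothetical sketch, not a library function: it handles fields annotated directly with another dataclass, ignores containers such as list[Point], and performs no validation or coercion.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import get_type_hints

@dataclass
class Point:
    x: float
    y: float

@dataclass
class Line:
    start: Point
    end: Point

def from_dict(cls, data: dict):
    """Build cls from a dict, recursing into dataclass-typed fields."""
    hints = get_type_hints(cls)  # resolves string annotations too
    kwargs = {}
    for f in fields(cls):
        if not f.init or f.name not in data:
            continue  # let defaults or __post_init__ fill these in
        value = data[f.name]
        target = hints[f.name]
        if is_dataclass(target) and isinstance(value, dict):
            value = from_dict(target, value)
        kwargs[f.name] = value
    return cls(**kwargs)

line = from_dict(Line, {"start": {"x": 0, "y": 0}, "end": {"x": 10, "y": 10}})
print(line)
```

Libraries such as dacite and cattrs cover the cases this sketch skips (containers, unions, type checks).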
8. Performance Characteristics
Memory Layout
Standard dataclass with __dict__:
- Instance overhead: ~200 bytes + attribute storage
- Attribute access: hash table lookup
With slots=True:
- Instance overhead: ~40 bytes + attribute storage
- Attribute access: direct array index (faster)
Illustrative figures for 1 million instances (actual numbers vary with Python version and platform):
@dataclass
class NoSlots:
a: int
b: int
c: int
# Memory: ~350 MB
@dataclass(slots=True)
class WithSlots:
a: int
b: int
c: int
# Memory: ~120 MB
Construction Speed
Dataclasses are just Python code. No metaclass overhead, no dynamic dispatch. Construction time is identical to hand-written __init__.
Comparison time: Generated __eq__ is tuple comparison under the hood. With slots=True, attribute access is faster, so equality checks are slightly faster.
9. Common Anti-Patterns
Mutable Default Anti-Pattern
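@dataclass
class Container:
    items: list = []  # NO! Rejected with ValueError at class definition
Always use default_factory. The built-in check does not catch every mutable type: a custom mutable object used as a default is accepted and then shared across instances.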
Overusing frozen=True
Frozen dataclasses force immutability at the Python level, but they’re not truly immutable if they contain mutable objects:
@dataclass(frozen=True)
class Config:
settings: dict
c = Config(settings={"debug": True})
c.settings["debug"] = False # Mutates "frozen" object!
True immutability requires immutable data structures (e.g., frozendict, tuples).
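With only the standard library, tuples and a read-only mapping view get close to this. The sketch below wraps the dict in types.MappingProxyType inside __post_init__; this is one possible approach, and the field names are illustrative:

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Mapping

@dataclass(frozen=True)
class Config:
    hosts: tuple[str, ...]
    settings: Mapping[str, bool] = field(default_factory=dict)

    def __post_init__(self):
        # Copy the caller's dict, then expose it through a read-only view.
        object.__setattr__(self, "settings", MappingProxyType(dict(self.settings)))

config = Config(hosts=("a", "b"), settings={"debug": True})

try:
    config.settings["debug"] = False
    mutated = True
except TypeError:
    # mappingproxy does not support item assignment
    mutated = False

print(mutated, config.settings["debug"])
```

Copying the input first matters: without it, the caller could still mutate the original dict behind the proxy.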
Using Dataclasses for Validation
Dataclasses don’t validate. This silently succeeds:
@dataclass
class Age:
value: int
age = Age(value=-50) # No error!
Solution: Use __post_init__ for checks, or switch to Pydantic.
10. When NOT to Use Dataclasses
Don’t use dataclasses when:
- You need validation: Dataclasses don’t validate types or constraints
- You’re parsing untrusted input: No coercion or error handling
- You need complex serialization: asdict() recurses into nested dataclasses, lists, and dicts, but does not convert custom types such as datetime
- You want ORM features: Just use SQLAlchemy or similar
- You need before/after hooks: Dataclasses only have __post_init__
Use cases for dataclasses:
- Internal domain models with trusted data
- Type-safe configuration objects (when initialized from code)
- DTOs between layers (when types are guaranteed)
- Replacing namedtuples with better type checking
PART 2 — Pydantic v2: Complete Guide
1. Philosophy and Design Differences
Pydantic solves a different problem: validating and parsing data from untrusted external sources. While dataclasses reduce boilerplate, Pydantic adds:
- Runtime validation: Type annotations are enforced at runtime
- Type coercion: Convert "123" → 123, "true" → True
- JSON parsing: Direct deserialization from JSON strings
- Error aggregation: Collect all validation errors, not just the first
- Settings management: Parse environment variables with validation
Key insight: Pydantic is built for system boundaries (APIs, configs, files). Dataclasses are for internal models.
2. BaseModel Deep Dive
from pydantic import BaseModel
class User(BaseModel):
id: int
name: str
email: str
Unlike dataclasses, you inherit from BaseModel. This gives you:
- __init__ with validation
- model_dump() for serialization
- model_validate() for parsing dicts
- model_validate_json() for parsing JSON strings
Model Construction
user = User(id=1, name="Alice", email="alice@example.com")
print(user.id)  # 1
# Type coercion
user2 = User(id="2", name="Bob", email="bob@example.com")
print(user2.id, type(user2.id))  # 2 <class 'int'>
Pydantic converts compatible types automatically.
Immutability
By default, Pydantic models are mutable:
user.name = "Eve" # OK
For immutability:
from pydantic import ConfigDict
class ImmutableUser(BaseModel):
model_config = ConfigDict(frozen=True)
id: int
name: str
u = ImmutableUser(id=1, name="Alice")
u.name = "Eve" # ValidationError: Instance is frozen
Slots
Pydantic v2 models do not offer slotted field storage: field values live in the instance __dict__, and no model_config option changes that.
class CompactModel(BaseModel):
    # Field values are stored in __dict__; there is no slots switch in model_config
    a: int
    b: str
# For memory efficiency, use dataclasses with Pydantic validation
For memory-sensitive use cases, consider hybrid patterns (covered later).
3. Validation System
Type Coercion Rules
Pydantic tries to convert input to the annotated type:
class Data(BaseModel):
    count: int
    ratio: float
    active: bool

# Strings are parsed into the annotated types
d1 = Data(count="42", ratio="3.14", active="yes")
print(d1.count, d1.ratio, d1.active)  # 42 3.14 True

# Numbers convert only when nothing is lost: 42.0 becomes 42, but 42.7 is rejected for an int field
d2 = Data(count=42.0, ratio=5, active=1)
print(d2.count, d2.ratio, d2.active)  # 42 5.0 True
Bool coercion: "yes", "true", "1", "on" → True; "no", "false", "0", "off" → False.
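The bool rules can be exercised without defining a model. A small sketch, assuming Pydantic v2 is installed, that runs the listed strings through TypeAdapter(bool) and confirms that an unrecognised string is rejected rather than guessed:

```python
from pydantic import TypeAdapter, ValidationError

bool_adapter = TypeAdapter(bool)

truthy = [bool_adapter.validate_python(s) for s in ("yes", "true", "1", "on")]
falsy = [bool_adapter.validate_python(s) for s in ("no", "false", "0", "off")]

try:
    bool_adapter.validate_python("maybe")
    rejected = False
except ValidationError:
    # Strings outside the accepted set are an error, not False.
    rejected = True

print(truthy, falsy, rejected)
```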
Strict vs Non-Strict Mode
Disable coercion:
from pydantic import Field
class StrictData(BaseModel):
count: int = Field(strict=True)
StrictData(count="42") # ValidationError: Input should be a valid integer
StrictData(count=42) # OK
Global strict mode:
class AllStrict(BaseModel):
model_config = ConfigDict(strict=True)
count: int
ratio: float
Field Validators
After validators (run after type coercion):
from pydantic import field_validator
class User(BaseModel):
    username: str
    age: int

    @field_validator("username")
    @classmethod
    def username_alphanumeric(cls, v: str) -> str:
        if not v.isalnum():
            raise ValueError("Username must be alphanumeric")
        return v

    @field_validator("age")
    @classmethod
    def age_non_negative(cls, v: int) -> int:
        if v < 0:
            raise ValueError("Age must be non-negative")
        return v

User(username="alice", age=25)   # OK
User(username="alice!", age=25)  # ValidationError: username
User(username="alice", age=-5)   # ValidationError: age
Before validators (run before type coercion):
class Normalized(BaseModel):
email: str
@field_validator("email", mode="before")
@classmethod
def lowercase_email(cls, v):
if isinstance(v, str):
return v.lower()
return v
Normalized(email="Alice@Example.COM")  # Stores "alice@example.com"
Wrap validators (control the entire validation):
from pydantic import BaseModel, field_validator
class Logged(BaseModel):
value: int
@field_validator("value", mode="wrap")
@classmethod
def log_validation(cls, v, handler):
print(f"Validating: {v}")
result = handler(v) # Call default validation
print(f"Result: {result}")
return result
Logged(value="123")
# Output:
# Validating: 123
# Result: 123
Model Validators
Validate across multiple fields:
from pydantic import model_validator
class DateRange(BaseModel):
start: datetime
end: datetime
@model_validator(mode="after")
def check_dates(self) -> "DateRange":
if self.end <= self.start:
raise ValueError("end must be after start")
return self
DateRange(start=datetime(2024, 1, 1), end=datetime(2023, 1, 1)) # ValidationError
Use mode="before" to access raw dict:
class FlexibleInput(BaseModel):
value: int
@model_validator(mode="before")
@classmethod
def handle_legacy(cls, data):
if isinstance(data, dict) and "old_value" in data:
data["value"] = data.pop("old_value")
return data
FlexibleInput(old_value=42) # Works, converts to new format
4. Field Definitions
from pydantic import Field
class Product(BaseModel):
id: int
name: str = Field(min_length=1, max_length=100)
price: float = Field(gt=0, le=1_000_000)
quantity: int = Field(default=0, ge=0)
description: str | None = Field(default=None, description="Product description")
tags: list[str] = Field(default_factory=list)
Constraints:
- Strings: min_length, max_length, pattern (regex)
- Numbers: gt, ge, lt, le, multiple_of
- Collections: min_length, max_length
default vs default_factory
Same as dataclasses:
class Config(BaseModel):
options: dict = Field(default_factory=dict) # New dict per instance
Aliasing
Map Python names to JSON/external names:
class APIResponse(BaseModel):
user_id: int = Field(alias="userId")
created_at: datetime = Field(alias="createdAt")
# Parse from API
data = {"userId": 123, "createdAt": "2024-01-01T00:00:00Z"}
response = APIResponse(**data)
print(response.user_id) # 123
# Serialize with aliases
print(response.model_dump(by_alias=True))
# {'userId': 123, 'createdAt': datetime(...)}
Population by name:
class Flexible(BaseModel):
model_config = ConfigDict(populate_by_name=True)
user_id: int = Field(alias="userId")
# Accept both
Flexible(userId=1) # OK
Flexible(user_id=1) # Also OK
5. Parsing Inputs
From Dicts
data = {"id": 1, "name": "Alice", "email": "alice@example.com"}
user = User(**data) # OK
user = User.model_validate(data) # Explicit validation
From JSON Strings
json_data = '{"id": 1, "name": "Alice", "email": "alice@example.com"}'
user = User.model_validate_json(json_data)
This is faster than json.loads() + User(**data) because Pydantic’s Rust core parses JSON natively.
Lists of Models
users_data = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
]
# Option 1: List comprehension
users = [User(**d) for d in users_data]
# Option 2: TypeAdapter (preferred)
from pydantic import TypeAdapter
UserList = TypeAdapter(list[User])
users = UserList.validate_python(users_data)
TypeAdapter for Non-BaseModel Types
from pydantic import TypeAdapter
# Validate basic types
IntValidator = TypeAdapter(int)
print(IntValidator.validate_python("123")) # 123
# Validate complex structures
DictAdapter = TypeAdapter(dict[str, list[int]])
result = DictAdapter.validate_python({"nums": ["1", "2", "3"]})
print(result) # {'nums': [1, 2, 3]}
6. Serialization
model_dump()
user = User(id=1, name="Alice", email="alice@example.com")
# Default
print(user.model_dump())
# {'id': 1, 'name': 'Alice', 'email': 'alice@example.com'}
# Exclude fields
print(user.model_dump(exclude={"email"}))
# {'id': 1, 'name': 'Alice'}
# Include only certain fields
print(user.model_dump(include={"id", "name"}))
# {'id': 1, 'name': 'Alice'}
# Use aliases
class APIModel(BaseModel):
user_id: int = Field(alias="userId")
m = APIModel(userId=123)
print(m.model_dump(by_alias=True)) # {'userId': 123}
model_dump_json()
json_str = user.model_dump_json()
print(json_str)  # {"id":1,"name":"Alice","email":"alice@example.com"}
# With indentation
print(user.model_dump_json(indent=2))
Nested Models
class Address(BaseModel):
street: str
city: str
class Person(BaseModel):
name: str
address: Address
p = Person(name="Alice", address={"street": "123 Main", "city": "NYC"})
print(p.model_dump())
# {'name': 'Alice', 'address': {'street': '123 Main', 'city': 'NYC'}}
Exclude Unset
Only serialize fields that were explicitly set:
class Partial(BaseModel):
a: int = 1
b: int = 2
c: int = 3
p = Partial(a=10)
print(p.model_dump()) # {'a': 10, 'b': 2, 'c': 3}
print(p.model_dump(exclude_unset=True)) # {'a': 10}
Useful for PATCH operations in REST APIs.
7. Error System
from pydantic import ValidationError
class User(BaseModel):
id: int
age: int = Field(gt=0, lt=150)
email: str
try:
User(id="not_an_int", age=-5, email=123)
except ValidationError as e:
print(e.json())
Output (formatted):
[
{
"type": "int_parsing",
"loc": ["id"],
"msg": "Input should be a valid integer, unable to parse string as an integer",
"input": "not_an_int"
},
{
"type": "greater_than",
"loc": ["age"],
"msg": "Input should be greater than 0",
"input": -5
},
{
"type": "string_type",
"loc": ["email"],
"msg": "Input should be a valid string",
"input": 123
}
]
Key properties:
- type: Error code
- loc: Field location (a tuple, with one element per nesting level)
- msg: Human-readable message
- input: The invalid value
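These properties are also available as plain Python dicts through ValidationError.errors(), which is convenient for building API error responses. A sketch, assuming Pydantic v2 is installed; the field_errors helper and its dotted-path format are invented for this example:

```python
from pydantic import BaseModel, Field, ValidationError

class User(BaseModel):
    id: int
    age: int = Field(gt=0, lt=150)
    email: str

def field_errors(payload: dict) -> dict[str, str]:
    """Hypothetical helper: map a dotted field path to its error message."""
    try:
        User.model_validate(payload)
    except ValidationError as exc:
        return {
            ".".join(str(part) for part in err["loc"]): err["msg"]
            for err in exc.errors()
        }
    return {}

problems = field_errors({"id": "not_an_int", "age": -5, "email": 123})
clean = field_errors({"id": 1, "age": 30, "email": "alice@example.com"})
print(problems)
```

All three failures are reported from a single validation pass, which is the error aggregation described earlier.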
Custom Error Messages
class User(BaseModel):
age: int = Field(gt=0, lt=150, description="User age")
@field_validator("age")
@classmethod
def validate_age(cls, v):
if v < 13:
raise ValueError("Users must be at least 13 years old")
return v
8. Settings Management
Pydantic’s killer feature for application config. In v2 it ships as the separate pydantic-settings package:
from pydantic_settings import BaseSettings
class AppSettings(BaseSettings):
database_url: str
redis_host: str = "localhost"
redis_port: int = 6379
debug: bool = False
api_key: str
# Reads from environment variables
settings = AppSettings()
Set environment variables:
export DATABASE_URL="postgresql://localhost/mydb"
export API_KEY="secret123"
Run Python:
print(settings.database_url) # "postgresql://localhost/mydb"
print(settings.debug) # False (default)
Custom Prefix
from pydantic_settings import BaseSettings, SettingsConfigDict

class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="APP_")
    database_url: str
# Now looks for APP_DATABASE_URL
.env File Support
from pydantic_settings import BaseSettings, SettingsConfigDict

class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")
    database_url: str
    api_key: str
.env file:
DATABASE_URL=postgresql://localhost/mydb
API_KEY=secret123
Nested Settings
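from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict

class DatabaseSettings(BaseModel):
    host: str
    port: int
    name: str

class AppSettings(BaseSettings):
    # Nested lookup is opt-in: without env_nested_delimiter the variables below are ignored
    model_config = SettingsConfigDict(env_nested_delimiter="__")
    database: DatabaseSettings

# Set via environment:
# DATABASE__HOST=localhost
# DATABASE__PORT=5432
# DATABASE__NAME=mydb
The double underscore separates nesting levels once env_nested_delimiter="__" is configured.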
Secrets Support
from pydantic import SecretStr
from pydantic_settings import BaseSettings

class AppSettings(BaseSettings):
    api_key: SecretStr

settings = AppSettings(api_key="secret123")
print(settings.api_key)                     # **********
print(repr(settings.api_key))               # SecretStr('**********')
print(settings.api_key.get_secret_value())  # secret123
Prevents accidental logging of secrets.
9. Advanced Features
Computed Fields
Values derived from other fields:
from pydantic import computed_field
class Rectangle(BaseModel):
width: float
height: float
@computed_field
@property
def area(self) -> float:
return self.width * self.height
r = Rectangle(width=10, height=5)
print(r.area) # 50.0
print(r.model_dump()) # {'width': 10.0, 'height': 5.0, 'area': 50.0}
Computed fields are included in serialization by default.
Private Attributes
Not validated, not serialized:
class Stateful(BaseModel):
public_value: int
_cache: dict = {}
def compute(self):
if "result" not in self._cache:
self._cache["result"] = self.public_value * 2
return self._cache["result"]
s = Stateful(public_value=10)
s._cache["custom"] = 123
print(s.model_dump()) # {'public_value': 10} # _cache excluded
Root Models
Validate types that aren’t dicts:
from pydantic import RootModel
class ItemList(RootModel[list[int]]):
pass
items = ItemList([1, 2, 3])
print(items.root) # [1, 2, 3]
data = ItemList.model_validate(["1", "2", "3"])
print(data.root) # [1, 2, 3] (coerced)
Useful for APIs that return arrays at the top level.
Discriminated Unions
Type-safe polymorphism:
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field

class Cat(BaseModel):
    type: Literal["cat"]
    meow_volume: int

class Dog(BaseModel):
    type: Literal["dog"]
    bark_volume: int

class Snake(BaseModel):
    type: Literal["snake"]
    length: float

# The discriminator must be attached to the union itself, not to the list that contains it
Animal = Annotated[Union[Cat, Dog, Snake], Field(discriminator="type")]

class Zoo(BaseModel):
    animals: list[Animal]
zoo_data = {
"animals": [
{"type": "cat", "meow_volume": 8},
{"type": "dog", "bark_volume": 10},
{"type": "snake", "length": 2.5},
]
}
zoo = Zoo(**zoo_data)
for animal in zoo.animals:
if isinstance(animal, Cat):
print(f"Cat: {animal.meow_volume}")
elif isinstance(animal, Dog):
print(f"Dog: {animal.bark_volume}")
Pydantic looks at the type field to decide which model to use.
Generics
from typing import Generic, TypeVar
T = TypeVar("T")
class Response(BaseModel, Generic[T]):
data: T
status: int
# Use with different types
IntResponse = Response[int]
r1 = IntResponse(data=123, status=200)
UserResponse = Response[User]
r2 = UserResponse(data={"id": 1, "name": "Alice", "email": "alice@example.com"}, status=200)
Recursive Models
class TreeNode(BaseModel):
    value: int
    children: list["TreeNode"] = []

# Pydantic v2 resolves the self-reference automatically. Call TreeNode.model_rebuild()
# only if a forward reference could not be resolved when the class was defined.
tree = TreeNode(
value=1,
children=[
TreeNode(value=2),
TreeNode(value=3, children=[TreeNode(value=4)])
]
)
10. Performance and Internals
The Rust Core
Pydantic v2 rewrote core validation in Rust (pydantic-core). Benefits:
- 5-50x faster validation than v1
- Native JSON parsing: Parses and validates in one pass, typically faster than json.loads() followed by validation
- Lower memory overhead: Efficient internal repr
Benchmark (parsing 10k user objects from JSON; illustrative figures, so measure on your own workload):
- Pydantic v1: ~500ms
- Pydantic v2: ~20ms
- Manual json.loads() + dict access: ~15ms (but no validation)
When Validation is Expensive
Complex validators can be slow:
class Expensive(BaseModel):
data: list[int]
@field_validator("data")
@classmethod
def unique_check(cls, v):
if len(v) != len(set(v)): # O(n) check
raise ValueError("Items must be unique")
return v
# For 1 million items, this is slow
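Optimization: Let the type do the work where the semantics fit. A set field drops duplicates during validation, so no custom validator is needed. Note that this deduplicates silently rather than rejecting input that contains duplicates:
class Better(BaseModel):
    data: set[int]  # Duplicates are removed, not reported as errors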
When to Avoid Pydantic
- Hot inner loops: If you’re validating the same trusted data millions of times per second, validation overhead matters. Use dataclasses or plain classes.
- Memory-constrained environments: Pydantic models use more memory than slotted dataclasses.
- No external data: If your data is generated internally, dataclasses are simpler.
11. Migration Notes: v1 → v2
Major breaking changes:
Config class → model_config
v1:
class Model(BaseModel):
class Config:
frozen = True
v2:
class Model(BaseModel):
model_config = ConfigDict(frozen=True)
Validators
v1: @validator
v2: @field_validator
v1:
@validator("field")
def check_field(cls, v):
return v
v2:
@field_validator("field")
@classmethod
def check_field(cls, v):
return v
Serialization
v1: .dict(), .json()
v2: .model_dump(), .model_dump_json()
Parsing
v1: .parse_obj(), .parse_raw()
v2: .model_validate(), .model_validate_json()
12. Common Mistakes and Footguns
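Assuming Validation Runs on Assignment
By default, Pydantic v2 validates only when a model is constructed. Assigning to an attribute afterwards skips validators entirely:
class Unchecked(BaseModel):
    values: list[int]

    @field_validator("values")
    @classmethod
    def validate_values(cls, v):
        print("Validating!")
        return v

m = Unchecked(values=[1, 2, 3])  # "Validating!"
m.values = "not a list"          # No output, no error: assignment is not validated
Set model_config = ConfigDict(validate_assignment=True) when reassignment must be validated. Validators then run on every assignment, so leave it off for models that are mutated in hot paths.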
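Mutable Defaults: Safe, but Be Explicit
class Model(BaseModel):
    items: list = []  # Pydantic copies this default for each instance

b1 = Model()
b2 = Model()
b1.items.append(1)
print(b2.items)  # [] (not shared, unlike a plain class attribute)
Unlike dataclasses, Pydantic accepts a mutable default and copies it per instance. Field(default_factory=list) is still the clearer choice and matches the dataclass idiom.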
Over-Validating
Don’t validate internal data that’s already correct:
# BAD: Re-validating trusted data deep inside domain logic
def process_user(raw: dict):
    user = User(**raw)  # Pays the validation cost on every call
    ...
# GOOD: Validate once with Pydantic at the boundary, then pass dataclasses internally
PART 3 — Dataclasses vs Pydantic
1. Feature Comparison Table
| Feature | Dataclasses | Pydantic |
|---|---|---|
| Purpose | Reduce boilerplate | Validate external data |
| Runtime validation | ❌ No | ✅ Yes |
| Type coercion | ❌ No | ✅ Yes |
| JSON parsing | ❌ Manual | ✅ Built-in |
| Serialization | asdict() (recursive, no type conversion) | model_dump() (rich) |
| Immutability | frozen=True | model_config |
| Slots | slots=True (3.10+) | ❌ No (uses __dict__) |
| Memory overhead | Low (especially with slots) | Higher |
| Speed | Fastest (no validation) | Fast (Rust core), but slower than no validation |
| Settings from env | ❌ Manual | ✅ BaseSettings |
| Error aggregation | ❌ N/A | ✅ All errors at once |
| Nested validation | ❌ No | ✅ Recursive |
| Field constraints | ❌ Manual via __post_init__ | ✅ Built-in (Field) |
| Standard library | ✅ Yes | ❌ No (third-party) |
2. Performance Comparison
Benchmark: Create 100k instances from dicts
# Dataclass (no validation)
@dataclass
class DC:
id: int
name: str
value: float
for d in data:
DC(**d) # ~50ms
# Pydantic (with validation)
class PM(BaseModel):
id: int
name: str
value: float
for d in data:
PM(**d) # ~200ms
# Pydantic (construct without validation)
for d in data:
PM.model_construct(**d) # ~70ms
Lessons:
- Dataclasses are faster when data is trusted
- Pydantic validation adds roughly 4x overhead in this illustrative benchmark; measure on your own data
- model_construct() bypasses validation for internal use
3. Correct Use Cases
Use Dataclasses When:
✅ Internal domain models
@dataclass(frozen=True, slots=True)
class OrderLine:
product_id: int
quantity: int
unit_price: Decimal
✅ Performance-critical paths
# Processing millions of records
@dataclass(slots=True)
class LogEntry:
timestamp: float
level: str
message: str
✅ Simple DTOs between layers
@dataclass
class ServiceResult:
success: bool
data: Any
error: str | None = None
Use Pydantic When:
✅ API request/response models
class CreateUserRequest(BaseModel):
username: str = Field(min_length=3, max_length=20)
email: str
age: int = Field(ge=13)
✅ Configuration from environment
class AppConfig(BaseSettings):
database_url: str
api_key: SecretStr
debug: bool = False
✅ Parsing external data (JSON, YAML, etc.)
class APIResponse(BaseModel):
user_id: int
created_at: datetime
response = APIResponse.model_validate_json(api_response_text)
✅ Validation boundaries
# Validate at system edge
def create_user(request: CreateUserRequest) -> User:
# request is validated
# Convert to internal domain model (dataclass)
return User(id=generate_id(), username=request.username)
4. Hybrid Patterns
Pattern 1: Pydantic for Input, Dataclasses for Domain
# API layer
class CreateOrderRequest(BaseModel):
customer_id: int
items: list[dict]
# Domain layer
@dataclass(frozen=True)
class Order:
id: int
customer_id: int
items: list[OrderLine]
created_at: datetime
# Service layer
def create_order(request: CreateOrderRequest) -> Order:
# Validate at boundary
items = [OrderLine(**item) for item in request.items]
return Order(
id=generate_id(),
customer_id=request.customer_id,
items=items,
created_at=datetime.now(timezone.utc)
)
Pattern 2: Pydantic Settings + Dataclass Models
# Config with Pydantic
class DatabaseConfig(BaseSettings):
host: str
port: int
name: str
# Runtime models with dataclasses
@dataclass(slots=True)
class User:
id: int
name: str
Pattern 3: Dataclasses with Pydantic Validation
Use pydantic.dataclasses for dataclass syntax with Pydantic validation:
from pydantic.dataclasses import dataclass as pydantic_dataclass

@pydantic_dataclass
class User:
    id: int
    name: str
    age: int

# This is a dataclass, but with Pydantic validation!
User(id="123", name="Alice", age="30")  # Coerces id and age to int
Trade-off: You get validation but lose some performance.
5. Decision Framework
Is data from external source?
+-- Yes --> Use Pydantic
+-- No  --> Do you need validation?
            +-- Yes --> Pydantic or dataclass + __post_init__
            +-- No  --> Use dataclasses
Both the Pydantic branch and the dataclasses branch meet at the validation boundary.
Questions to ask:
- Is the data coming from users, APIs, files, or environment? → Pydantic
- Do I need type coercion? → Pydantic
- Is this a performance bottleneck? → Dataclasses (especially with slots)
- Do I need settings management? → Pydantic BaseSettings
- Is this an internal domain model used everywhere? → Dataclasses
- Do I need comprehensive validation rules? → Pydantic
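The "dataclass + __post_init__" branch of the diagram can be sketched with the standard library alone. The example below is a minimal illustration, not a substitute for Pydantic: the Unchecked and Checked classes and their age rule are hypothetical. It shows that a plain dataclass silently accepts a mistyped value unless __post_init__ rejects it.

```python
from dataclasses import dataclass


@dataclass
class Unchecked:
    # Hypothetical model: the annotation is not enforced at runtime
    age: int


@dataclass
class Checked:
    # Hypothetical model: same field, validated by hand
    age: int

    def __post_init__(self) -> None:
        # Runs after the generated __init__ has assigned the fields
        if not isinstance(self.age, int) or isinstance(self.age, bool):
            raise TypeError(f"age must be an int, got {type(self.age).__name__}")
        if self.age < 0:
            raise ValueError("age must be non-negative")


unchecked = Unchecked(age="thirty")  # accepted: no runtime check happens

try:
    Checked(age="thirty")
    rejected = False
except TypeError:
    rejected = True
```

Hand-written checks like this stay manageable for one or two rules. Once coercion or many constraints are involved, the Pydantic branch of the diagram is usually the better fit.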
PART 4 — Real-World Patterns
1. API Request/Response Models
FastAPI with Pydantic:
from datetime import datetime
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
app = FastAPI()
class CreateUserRequest(BaseModel):
    username: str = Field(min_length=3, max_length=20, pattern=r"^[a-zA-Z0-9_]+$")
    email: str
    age: int = Field(ge=13, le=120)
class UserResponse(BaseModel):
    id: int
    username: str
    email: str
    created_at: datetime
@app.post("/users", response_model=UserResponse)
def create_user(request: CreateUserRequest):
    # Request is automatically validated
    user = save_user(request)  # Internal logic
    return UserResponse(
        id=user.id,
        username=user.username,
        email=user.email,
        created_at=user.created_at
    )
Key points:
- Pydantic validates incoming JSON automatically
- response_model validates outgoing data
- Errors return 422 with detailed validation messages
2. Config Systems
Three-tier config with Pydantic:
from pydantic_settings import BaseSettings, SettingsConfigDict
from pydantic import BaseModel, SecretStr, Field
class DatabaseConfig(BaseModel):
    host: str = "localhost"
    port: int = 5432
    name: str
    user: str
    password: SecretStr
    @property
    def url(self) -> str:
        pwd = self.password.get_secret_value()
        return f"postgresql://{self.user}:{pwd}@{self.host}:{self.port}/{self.name}"
class RedisConfig(BaseModel):
    host: str = "localhost"
    port: int = 6379
    db: int = 0
class AppSettings(BaseSettings):
    # Settings classes take SettingsConfigDict, not pydantic's ConfigDict
    model_config = SettingsConfigDict(env_nested_delimiter="__")
    env: str = "development"
    debug: bool = False
    database: DatabaseConfig
    redis: RedisConfig
    secret_key: SecretStr
    api_keys: list[str] = Field(default_factory=list)
# Load from environment
# DATABASE__HOST=localhost
# DATABASE__PORT=5432
# DATABASE__NAME=myapp
# DATABASE__USER=postgres
# DATABASE__PASSWORD=secret
# REDIS__HOST=redis
# SECRET_KEY=supersecret
# API_KEYS=["key1","key2"]
settings = AppSettings()
print(settings.database.url)
3. Domain Models
Internal domain models with dataclasses:
from dataclasses import dataclass, field
from decimal import Decimal
from datetime import datetime
@dataclass(frozen=True)
class Money:
    amount: Decimal
    currency: str = "USD"
    def __add__(self, other: "Money") -> "Money":
        if self.currency != other.currency:
            raise ValueError(f"Cannot add {self.currency} and {other.currency}")
        return Money(self.amount + other.amount, self.currency)
@dataclass(frozen=True)
class OrderLine:
    product_id: int
    product_name: str
    quantity: int
    unit_price: Money
    @property
    def total(self) -> Money:
        return Money(self.unit_price.amount * self.quantity, self.unit_price.currency)
@dataclass(frozen=True)
class Order:
    id: int
    customer_id: int
    lines: tuple[OrderLine, ...]
    created_at: datetime
    @property
    def total(self) -> Money:
        if not self.lines:
            return Money(Decimal(0))
        return sum((line.total for line in self.lines[1:]), start=self.lines[0].total)
# Use frozen dataclasses for immutable domain objects
# Use tuples instead of lists for truly immutable collections
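The two closing comments can be checked directly. In this small sketch (the ListBasket and TupleBasket classes are hypothetical and not part of the domain model above), frozen=True blocks attribute reassignment, but a list field can still be mutated in place. That gap is why the Order model stores its lines in a tuple.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class ListBasket:
    # Hypothetical: frozen, but holds a mutable list
    items: list[str]


@dataclass(frozen=True)
class TupleBasket:
    # Hypothetical: frozen and holds an immutable tuple
    items: tuple[str, ...]


list_basket = ListBasket(items=["apple"])
list_basket.items.append("pear")  # allowed: the list itself is not frozen

tuple_basket = TupleBasket(items=("apple",))
try:
    tuple_basket.items = ("pear",)  # reassignment is blocked by frozen=True
    reassigned = True
except FrozenInstanceError:
    reassigned = False

# Frozen dataclasses (with the default eq=True) get a field-based __hash__,
# which only works when every field value is itself hashable
try:
    hash(list_basket)
    list_hashable = True
except TypeError:
    list_hashable = False
```

A side benefit of the tuple version is that instances can be used as dictionary keys or set members, which the list version cannot.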
4. Validation Boundaries
Clean architecture with validation at edges:
# == API Layer (Pydantic) ==
class CreateOrderAPI(BaseModel):
    customer_id: int
    items: list[dict]
# == Application Layer ==
# OrderLineData comes first because CreateOrderCommand's annotation refers to it
@dataclass
class OrderLineData:
    product_id: int
    quantity: int
@dataclass
class CreateOrderCommand:
    customer_id: int
    items: list[OrderLineData]
class OrderService:
    def create_order(self, command: CreateOrderCommand) -> Order:
        # Business logic with validated data
        lines = [
            OrderLine(
                product_id=item.product_id,
                product_name=self.get_product_name(item.product_id),
                quantity=item.quantity,
                unit_price=self.get_product_price(item.product_id)
            )
            for item in command.items
        ]
        return Order(
            id=self.generate_id(),
            customer_id=command.customer_id,
            lines=tuple(lines),
            created_at=datetime.now(timezone.utc)
        )
# == API Handler ==
@app.post("/orders")
def create_order_endpoint(request: CreateOrderAPI):
    # Validate at boundary
    command = CreateOrderCommand(
        customer_id=request.customer_id,
        items=[OrderLineData(**item) for item in request.items]
    )
    # Pass validated data to domain
    order = order_service.create_order(command)
    return OrderResponse.from_domain(order)
Pattern:
- API layer: Pydantic validates external input
- Application layer: Simple dataclasses (commands/queries)
- Domain layer: Rich dataclasses with business logic
- No validation inside domain: Data is pre-validated
5. Large-Scale Codebase Recommendations
Directory Structure
project/
├── api/
│ ├── models/ # Pydantic request/response models
│ │ ├── requests.py
│ │ └── responses.py
│ └── routes/
├── domain/
│ ├── entities/ # Dataclass domain entities
│ ├── value_objects/ # Frozen dataclass value objects
│ └── services/
├── infrastructure/
│ ├── database/
│ └── external_apis/ # Pydantic models for external APIs
└── config/
└── settings.py # Pydantic BaseSettings
Naming Conventions
- Pydantic models: CreateUserRequest, UserResponse, ExternalAPIModel
- Dataclass entities: User, Order, Product
- Value objects: Email, Money, Address (frozen dataclasses)
Type Hints
# Use Protocol for interfaces (not BaseModel or dataclass)
from dataclasses import dataclass, field
from typing import Protocol
class UserRepository(Protocol):
    def find_by_id(self, user_id: int) -> User | None: ...
    def save(self, user: User) -> None: ...
# Implementations use dataclasses
@dataclass
class InMemoryUserRepository:
    users: dict[int, User] = field(default_factory=dict)
    def find_by_id(self, user_id: int) -> User | None:
        return self.users.get(user_id)
    def save(self, user: User) -> None:
        self.users[user.id] = user
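Because Protocol relies on structural typing, the implementation never has to inherit from UserRepository. The sketch below illustrates this with a hypothetical, trimmed-down User and a hypothetical rename helper. It also uses the optional typing.runtime_checkable decorator, which permits an isinstance check that verifies only that the methods exist, not their signatures.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol, runtime_checkable


@dataclass
class User:
    # Hypothetical minimal entity for this example
    id: int
    name: str


@runtime_checkable
class UserRepository(Protocol):
    def find_by_id(self, user_id: int) -> Optional[User]: ...
    def save(self, user: User) -> None: ...


@dataclass
class InMemoryUserRepository:
    # Note: no inheritance from UserRepository
    users: dict[int, User] = field(default_factory=dict)

    def find_by_id(self, user_id: int) -> Optional[User]:
        return self.users.get(user_id)

    def save(self, user: User) -> None:
        self.users[user.id] = user


def rename(repo: UserRepository, user_id: int, new_name: str) -> bool:
    # Depends only on the Protocol, not on a concrete repository class
    user = repo.find_by_id(user_id)
    if user is None:
        return False
    repo.save(User(id=user.id, name=new_name))
    return True


repo = InMemoryUserRepository()
repo.save(User(id=1, name="Ada"))
renamed = rename(repo, 1, "Grace")
```

Any other class with matching find_by_id and save methods, such as a database-backed repository, could be passed to rename without changes.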
Testing
# Use dataclasses for test fixtures
from dataclasses import dataclass, replace
@dataclass
class UserBuilder:
    id: int = 1
    name: str = "Test User"
    email: str = "test@example.com"
    def with_id(self, id: int) -> "UserBuilder":
        # replace() is a module-level function in dataclasses, not a decorator attribute
        return replace(self, id=id)
    def build(self) -> User:
        return User(id=self.id, name=self.name, email=self.email)
# In tests
def test_user_service():
    user = UserBuilder().with_id(42).build()
    result = service.process(user)
    assert result.success
Performance Guidelines
- Hot paths: Use slotted dataclasses
- API boundaries: Pydantic is fine (amortized over network I/O)
- Bulk processing: Consider model_construct() for Pydantic models to skip re-validating trusted data
- Serialization: Consider orjson with Pydantic for JSON-heavy paths, and benchmark it against the built-in model_dump_json()
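import orjson
from pydantic import BaseModel
class FastModel(BaseModel):
    id: int
    name: str
model = FastModel(id=1, name="Alice")
# Pydantic v2 removed the v1 json_dumps/json_loads config options,
# so orjson has to be called explicitly around the model
payload = orjson.dumps(model.model_dump(mode="json"))  # returns bytes
restored = FastModel.model_validate(orjson.loads(payload))
# The built-in Rust-backed path needs no extra dependency; benchmark both
payload = model.model_dump_json()
restored = FastModel.model_validate_json(payload)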
Migration Strategy
Migrating a large codebase from ad-hoc dicts to typed models:
- Start with API boundaries: Add Pydantic models to all endpoints
- Config next: Move to BaseSettings
- Domain models: Gradually introduce dataclasses for core entities
- Don’t refactor everything: Focus on high-value areas
- Use TypedDict as intermediate: When full model migration is too much
from typing import TypedDict
class UserDict(TypedDict):
    id: int
    name: str
    email: str
# Later, upgrade to dataclass or Pydantic
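When the time comes to upgrade, a TypedDict converts to a dataclass with little friction because both use the same field names. A minimal sketch, assuming a hypothetical User dataclass that mirrors UserDict field for field:

```python
from dataclasses import asdict, dataclass
from typing import TypedDict


class UserDict(TypedDict):
    id: int
    name: str
    email: str


@dataclass(frozen=True)
class User:
    # Hypothetical dataclass with the same fields as UserDict
    id: int
    name: str
    email: str


legacy: UserDict = {"id": 1, "name": "Ada", "email": "ada@example.com"}

# Keys match the field names, so the dict unpacks straight into the constructor
user = User(**legacy)

# asdict() goes the other way for callers that still expect a plain dict
round_tripped = asdict(user)
```

Converting in both directions lets old dict-based call sites and new typed code coexist while the migration is in progress.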
Conclusion
Dataclasses are for reducing boilerplate in trusted internal code. They’re fast, memory-efficient (with slots), and part of the standard library. Use them for domain models, DTOs, and anywhere you need structured data without validation overhead.
Pydantic is for validating data from external sources. It coerces types, aggregates errors, parses JSON natively, and handles settings management. Use it at system boundaries: APIs, configs, file parsing.
The right approach: Don’t choose one or the other. Use both. Pydantic at the edges, dataclasses in the core. This gives you safety where you need it and performance where it matters.
Key takeaway: Type hints alone don’t validate. Dataclasses enforce structure at development time (via type checkers). Pydantic enforces correctness at runtime. Know which problem you’re solving.