Introduction
So you know what Avro is, but how do you actually write these schemas? What types can you use? How do you handle optional fields or nested data?
This guide covers everything about Avro schemas - from basic types to complex nested structures. Use it as a reference when building schemas for Kafka or Hadoop projects.
What's a Schema Anyway?
It's a JSON document that defines your data structure. Think of it as a contract - it tells both the data writer and reader exactly what fields exist, what types they are, and whether they're required or optional.
Primitive Types (The Basics)
Let's start simple. Avro has these basic types that cover most everyday data:
null
When something has no value.
{ "type": "null" }
boolean
True or false. Simple as that.
{ "type": "boolean" }
int
32-bit integer. Good for most numbers (±2 billion range).
{ "type": "int" }
long
64-bit integer. For really big numbers or timestamps.
{ "type": "long" }
float
32-bit (single-precision) IEEE 754 floating-point number.
{ "type": "float" }
double
64-bit (double-precision) IEEE 754 floating-point number. More precision than float.
{ "type": "double" }
bytes
Raw binary data (images, files, whatever).
{ "type": "bytes" }
string
Text. The most common type you'll use.
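{ "type": "string" }
Quick Tip: int vs long
Use int for values that comfortably fit in 32 bits, like counts. Use long for timestamps, IDs that may grow large, or any time you need more range. In Avro's binary encoding both types use the same variable-length zig-zag format, so a small value takes the same number of bytes either way. Picking long costs little on the wire and prevents overflow headaches.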
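To see why the choice is about range rather than size, here is a minimal sketch of the zig-zag variable-length encoding that the Avro specification describes for both int and long. The function name zigzag_varint is made up for this illustration, and real projects would let an Avro library do this.

```python
def zigzag_varint(n: int) -> bytes:
    """Hand-rolled sketch of the zig-zag varint encoding used for int and long."""
    if not -(2**63) <= n < 2**63:
        raise ValueError("value does not fit in a 64-bit long")
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned numbers
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


# Small values stay small; only large values need more bytes.
for value in (42, -1, 2**31 - 1, 1_700_000_000_000):
    print(value, "->", len(zigzag_varint(value)), "byte(s)")
```

A count like 42 takes one byte whether the field is declared int or long, while a millisecond timestamp takes six.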
Records (Structured Objects)
Records are what you'll use most. They're like objects in JavaScript or structs in C - a collection of fields grouped together.
Basic Record
{
  "type": "record",
  "name": "User",
  "namespace": "com.example.users",
  "doc": "A user in the system",
  "fields": [
    {
      "name": "userId",
      "type": "long",
      "doc": "Unique ID"
    },
    {
      "name": "username",
      "type": "string"
    },
    {
      "name": "email",
      "type": "string"
    },
    {
      "name": "createdAt",
      "type": "long",
      "doc": "Timestamp in milliseconds"
    }
  ]
}
Here's what each part means:
- type: Always "record" for objects
- name: What you call this thing (like a class name)
- namespace: Optional, but helps avoid name conflicts
- doc: Comments explaining what this is
- fields: The actual data fields
Why Use Namespaces?
In Schema Registry, you might have multiple "User" records from different teams. Namespaces keep them separate: com.team1.User vs com.team2.User.
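As a rough illustration of how a name and a namespace combine, here is a small sketch. The helper full_name and the two team schemas are invented for this example. The rule it follows (a name that already contains a dot is treated as a full name) comes from the Avro specification.

```python
from typing import Optional


def full_name(schema: dict, enclosing_namespace: Optional[str] = None) -> str:
    """Return the full name of a record, enum or fixed schema (illustrative helper)."""
    name = schema["name"]
    if "." in name:  # the name is already fully qualified
        return name
    namespace = schema.get("namespace", enclosing_namespace)
    return f"{namespace}.{name}" if namespace else name


# Two hypothetical teams both define a "User" record.
team1 = {"type": "record", "name": "User", "namespace": "com.team1", "fields": []}
team2 = {"type": "record", "name": "User", "namespace": "com.team2", "fields": []}

print(full_name(team1))
print(full_name(team2))
```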
Enums (Fixed Choices)
Enums are great when a field can only have specific values. Status fields, categories, types - stuff like that.
{
  "type": "record",
  "name": "Order",
  "fields": [
    {
      "name": "orderId",
      "type": "string"
    },
    {
      "name": "status",
      "type": {
        "type": "enum",
        "name": "OrderStatus",
        "symbols": ["PENDING", "PROCESSING", "SHIPPED", "DELIVERED", "CANCELLED"]
      }
    }
  ]
}
Now status can only be one of those five values. Try to set it to "FINALIZED" and you'll get an error. Catches bugs early.
Evolving Enums
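You can add new values and readers on the new schema can still read old data. Adding "RETURNED" to OrderStatus won't break existing data. The catch is the other direction: a consumer still on the old schema will fail when it meets "RETURNED", unless its enum declares a default symbol to fall back on.
But don't remove symbols - a reader that meets a symbol it no longer knows will fail unless it has an enum default. Avoid reordering too: Avro matches symbols by name when the writer's schema is available, but reordering changes sort order and can surprise code that relies on ordinals. If you need flexibility, just use strings instead.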
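Here is a minimal sketch of how a reader could resolve an enum symbol, following the resolution rule in the Avro specification: use the symbol if the reader knows it, otherwise fall back to the reader's enum default, otherwise fail. The function name and the "UNKNOWN" fallback symbol are assumptions made for this illustration, not part of the Order schema above.

```python
def resolve_enum_symbol(written_symbol: str, reader_enum: dict) -> str:
    """Pick the symbol a reader would see for a value written with another schema."""
    if written_symbol in reader_enum["symbols"]:
        return written_symbol
    if "default" in reader_enum:
        return reader_enum["default"]
    raise ValueError(f"unknown symbol {written_symbol!r} and no enum default")


old_reader = {
    "type": "enum",
    "name": "OrderStatus",
    "symbols": ["PENDING", "PROCESSING", "SHIPPED", "DELIVERED", "CANCELLED"],
}
# Hypothetical variant that declares a fallback symbol.
tolerant_reader = {
    **old_reader,
    "symbols": old_reader["symbols"] + ["UNKNOWN"],
    "default": "UNKNOWN",
}

print(resolve_enum_symbol("SHIPPED", old_reader))
print(resolve_enum_symbol("RETURNED", tolerant_reader))
```

Calling resolve_enum_symbol("RETURNED", old_reader) raises ValueError, which is the failure an old consumer would hit after a producer starts sending the new value.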
Arrays (Lists)
Arrays are for lists of things - tags, items, IDs, whatever. All elements must be the same type.
{
  "type": "record",
  "name": "BlogPost",
  "fields": [
    {
      "name": "title",
      "type": "string"
    },
    {
      "name": "tags",
      "type": {
        "type": "array",
        "items": "string"
      }
    },
    {
      "name": "viewCounts",
      "type": {
        "type": "array",
        "items": "int"
      }
    }
  ]
}
The items field says what type goes in the array. Tags are strings, viewCounts are ints. Pretty straightforward.
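To make the "same type" rule concrete, here is a tiny hand-rolled check. It is only a sketch that handles string and int items. The helper name bad_indexes is invented, and an Avro library would do this validation for you when serializing.

```python
def bad_indexes(array_schema: dict, values: list) -> list:
    """Return the positions of elements that don't match the array's items type."""
    expected = {"string": str, "int": int}[array_schema["items"]]
    # type(...) is used instead of isinstance so that True/False don't pass as ints.
    return [i for i, value in enumerate(values) if type(value) is not expected]


tags_schema = {"type": "array", "items": "string"}
view_counts_schema = {"type": "array", "items": "int"}

print(bad_indexes(tags_schema, ["avro", "kafka"]))
print(bad_indexes(view_counts_schema, [10, "20", 30]))
```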
Maps (Key-Value Pairs)
Maps store key-value pairs. Keys are always strings, but values can be any type. Perfect for metadata or dynamic properties.
{
  "type": "record",
  "name": "KafkaEvent",
  "fields": [
    {
      "name": "eventId",
      "type": "string"
    },
    {
      "name": "metadata",
      "type": {
        "type": "map",
        "values": "string"
      }
    }
  ]
}
Example data:
{
  "eventId": "evt-12345",
  "metadata": {
    "source": "web-app",
    "userId": "user-789",
    "ipAddress": "192.168.1.1"
  }
}
You'll see this pattern all the time in Kafka event messages. Different events have different metadata, and maps handle that nicely.
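One thing to keep in mind with "values": "string" is that every value has to be a string, including things that look like numbers. The sketch below checks a metadata map by hand. The helper name map_errors and the "retries" key are made up for the example.

```python
def map_errors(map_schema: dict, value: dict) -> list:
    """List problems with a map value (sketch covering string and int values only)."""
    expected = {"string": str, "int": int}[map_schema["values"]]
    errors = []
    for key, item in value.items():
        if type(key) is not str:
            errors.append(f"key {key!r} is not a string")
        if type(item) is not expected:
            errors.append(f"value for {key!r} is not a {map_schema['values']}")
    return errors


metadata_schema = {"type": "map", "values": "string"}

print(map_errors(metadata_schema, {"source": "web-app", "userId": "user-789"}))
print(map_errors(metadata_schema, {"source": "web-app", "retries": 3}))
```

The second call reports the numeric value. To store it in this map you would write it as the string "3".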
Unions (Optional Fields)
Unions let a field accept multiple types. Most commonly used to make fields optional by combining with null.
{
  "type": "record",
  "name": "Product",
  "fields": [
    {
      "name": "productId",
      "type": "string"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "description",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "discount",
      "type": ["null", "double"],
      "default": null
    }
  ]
}
["null", "string"] means "this can be null OR a string". The "default": null helps with schema evolution: a reader can fill the field in with null when older data doesn't contain it.
Important: Order Matters
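Put "null" first when the default is null: ["null", "string"] not ["string", "null"]. The Avro specification has long required a union field's default to match the first type in the union, so "default": null is only safe across implementations when "null" is the first branch.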
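Here is a rough sketch of that rule as a check you could run over your own field definitions. It only understands primitive branches, and both helper names are invented for this illustration. Schema parsers in Avro libraries may be stricter or more lenient than this.

```python
def possible_avro_types(default) -> set:
    """Rough guess at which primitive Avro types a JSON default could satisfy."""
    if default is None:
        return {"null"}
    if isinstance(default, bool):
        return {"boolean"}
    if isinstance(default, str):
        return {"string", "bytes"}
    if isinstance(default, int):
        return {"int", "long", "float", "double"}
    if isinstance(default, float):
        return {"float", "double"}
    return set()


def default_matches_first_branch(field: dict) -> bool:
    """True when the field's default fits the first type listed in its union."""
    first = field["type"][0]
    return isinstance(first, str) and first in possible_avro_types(field["default"])


good = {"name": "description", "type": ["null", "string"], "default": None}
bad = {"name": "description", "type": ["string", "null"], "default": None}

print(default_matches_first_branch(good))
print(default_matches_first_branch(bad))
```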
Nested Records
You can nest records inside other records. Common pattern for things like customers with addresses.
{
  "type": "record",
  "name": "Customer",
  "namespace": "com.example.customers",
  "fields": [
    {
      "name": "customerId",
      "type": "long"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "shippingAddress",
      "type": {
        "type": "record",
        "name": "Address",
        "fields": [
          { "name": "street", "type": "string" },
          { "name": "city", "type": "string" },
          { "name": "state", "type": "string" },
          { "name": "zipCode", "type": "string" }
        ]
      }
    },
    {
      "name": "billingAddress",
      "type": ["null", "Address"],
      "default": null
    }
  ]
}
Notice how billingAddress reuses the Address type we defined in shippingAddress. Once you name a type, you can reference it anywhere.
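To show how a name like "Address" can be looked up later, here is a simplified sketch that walks a schema and indexes every named type by its full name. The function collect_named_types is invented for this illustration, and the Customer schema is trimmed down to the two address fields. Note that the nested Address record picks up the namespace of the record it sits in.

```python
def collect_named_types(schema, namespace=None, found=None):
    """Walk a schema and index every record, enum and fixed by its full name."""
    found = {} if found is None else found
    if isinstance(schema, list):  # a union: visit each branch
        for branch in schema:
            collect_named_types(branch, namespace, found)
    elif isinstance(schema, dict):
        kind = schema["type"]
        if kind in ("record", "enum", "fixed"):
            name = schema["name"]
            if "." in name:  # already a full name
                namespace = name.rsplit(".", 1)[0]
            else:  # inherit the enclosing namespace unless one is given
                namespace = schema.get("namespace", namespace)
                name = f"{namespace}.{name}" if namespace else name
            found[name] = schema
        if kind == "record":
            for field in schema["fields"]:
                collect_named_types(field["type"], namespace, found)
        elif kind == "array":
            collect_named_types(schema["items"], namespace, found)
        elif kind == "map":
            collect_named_types(schema["values"], namespace, found)
    return found


customer = {
    "type": "record", "name": "Customer", "namespace": "com.example.customers",
    "fields": [
        {"name": "shippingAddress", "type": {
            "type": "record", "name": "Address",
            "fields": [{"name": "city", "type": "string"}],
        }},
        {"name": "billingAddress", "type": ["null", "Address"], "default": None},
    ],
}

types = collect_named_types(customer)
print(sorted(types))
```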
Fixed (Fixed-Length Bytes)
Fixed types are for binary data with a known length. UUIDs, hashes, crypto keys - that kind of thing.
{
  "type": "record",
  "name": "SecurityEvent",
  "fields": [
    {
      "name": "eventId",
      "type": {
        "type": "fixed",
        "name": "UUID",
        "size": 16
      }
    },
    {
      "name": "hash",
      "type": {
        "type": "fixed",
        "name": "SHA256",
        "size": 32
      }
    }
  ]
}
UUID is 16 bytes, SHA-256 is 32 bytes. When you know the exact size, fixed is slightly more compact than bytes because no length prefix is written.
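The size difference comes from the length prefix: per the Avro specification, bytes is written as a long length followed by the data, while fixed is just the data. The sketch below hand-rolls both encodings to compare sizes. The helper names and the sample UUID and payload are made up for the example.

```python
import hashlib
import uuid


def encode_long(n: int) -> bytes:
    """Zig-zag varint, the encoding Avro uses for long values such as lengths."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_bytes(data: bytes) -> bytes:
    """bytes: a length prefix, then the raw data."""
    return encode_long(len(data)) + data


def encode_fixed(data: bytes, size: int) -> bytes:
    """fixed: exactly `size` raw bytes, nothing else."""
    if len(data) != size:
        raise ValueError(f"expected {size} bytes, got {len(data)}")
    return data


event_id = uuid.UUID("12345678-1234-5678-1234-567812345678").bytes
digest = hashlib.sha256(b"payload").digest()

print(len(encode_bytes(event_id)), "vs", len(encode_fixed(event_id, 16)))
print(len(encode_bytes(digest)), "vs", len(encode_fixed(digest, 32)))
```

The saving is one byte per value at these sizes. The other thing fixed gives you is that a value of the wrong length is rejected outright.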
Logical Types (Dates, Times, Decimals)
Logical types add meaning to an underlying type. They're stored as a plain int, long, bytes, string or fixed but interpreted specially.
{
  "type": "record",
  "name": "Transaction",
  "fields": [
    {
      "name": "transactionDate",
      "type": {
        "type": "int",
        "logicalType": "date"
      },
      "doc": "Days since 1970-01-01"
    },
    {
      "name": "transactionTime",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      },
      "doc": "Milliseconds since 1970-01-01"
    },
    {
      "name": "amount",
      "type": {
        "type": "bytes",
        "logicalType": "decimal",
        "precision": 10,
        "scale": 2
      },
      "doc": "Money amount with 2 decimal places"
    }
  ]
}
Common Ones:
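- date: int for days since Unix epoch
- time-millis: int for time of day, in milliseconds after midnight
- timestamp-millis: long for timestamps, in milliseconds since Unix epoch (UTC)
- timestamp-micros: long for microsecond timestamps
- decimal: bytes (or fixed) for precise decimal numbers, with a declared precision and scale
- uuid: string for UUIDs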
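To make the stored values less abstract, here is a sketch that converts Python values into the underlying representations the Avro specification describes for date, timestamp-millis and decimal. The function names are invented for this example, precision is not checked, and an Avro library would normally handle these conversions.

```python
from datetime import date, datetime, timedelta, timezone
from decimal import Decimal

EPOCH_DATE = date(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_date(day: date) -> int:
    """date: number of days since 1970-01-01."""
    return (day - EPOCH_DATE).days


def to_timestamp_millis(moment: datetime) -> int:
    """timestamp-millis: milliseconds since the epoch (moment must be timezone-aware)."""
    return (moment - EPOCH_UTC) // timedelta(milliseconds=1)


def to_decimal_bytes(value: Decimal, scale: int) -> bytes:
    """decimal: big-endian two's-complement bytes of the unscaled integer."""
    scaled = value.scaleb(scale)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{value} has more than {scale} decimal places")
    unscaled = int(scaled)
    length = (unscaled.bit_length() + 8) // 8  # leaves room for the sign bit
    return unscaled.to_bytes(length, "big", signed=True)


print(to_date(date(2024, 1, 1)))
print(to_timestamp_millis(datetime(2024, 1, 1, tzinfo=timezone.utc)))
print(to_decimal_bytes(Decimal("19.99"), scale=2).hex())
```

With scale 2, the amount 19.99 is stored as the integer 1999, which is why the schema has to carry the scale.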
Schema Evolution Rules
Here's what you can and can't do when changing schemas over time:
Safe Changes:
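- ✓ Add fields with defaults
- ✓ Remove fields that have defaults
- ✓ Add enum values at the end (readers still on the old schema need an enum default to cope with the new value)
- ✓ Add union types (same caveat: old readers can't decode a branch they don't know)
Breaking Changes:
- ✗ Change field types (except the promotions Avro allows, such as int to long or float to double)
- ✗ Rename fields (unless your Avro implementation supports aliases and the reader schema lists the old name)
- ✗ Remove enum values
- ✗ Add required fields (fields without a default)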
Golden Rule
Always provide defaults when adding new fields. This lets old data work with new schemas and new data work with old schemas. Read more in the Confluent Avro guide.
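The rule is easier to see in a small sketch of what a reader does with a record written under an older schema: take the written value if there is one, otherwise use the reader's default, otherwise fail. The function resolve_record is a simplified stand-in for real schema resolution, and the "sku" field is hypothetical.

```python
def resolve_record(written: dict, reader_schema: dict) -> dict:
    """Build the record a reader sees from data written with another schema."""
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            result[name] = written[name]
        elif "default" in field:
            result[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return result  # fields the reader doesn't declare are simply dropped


old_data = {"productId": "p-1", "name": "Mug"}

with_default = {"type": "record", "name": "Product", "fields": [
    {"name": "productId", "type": "string"},
    {"name": "name", "type": "string"},
    {"name": "discount", "type": ["null", "double"], "default": None},
]}
# Hypothetical new required field: no default, so old data can't be read.
without_default = {"type": "record", "name": "Product", "fields": [
    {"name": "productId", "type": "string"},
    {"name": "sku", "type": "string"},
]}

print(resolve_record(old_data, with_default))
try:
    resolve_record(old_data, without_default)
except ValueError as error:
    print("breaking change:", error)
```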
Official Resources
- Avro Specification: complete schema spec
- Confluent Avro Guide: schema evolution guide
- Apache Avro GitHub: source and examples