Avro Schema Guide - Complete Reference

Everything you need to know about designing and using Avro schemas

January 2026 · 12 min read

Introduction

So you know what Avro is, but how do you actually write these schemas? What types can you use? How do you handle optional fields or nested data?

This guide covers everything about Avro schemas - from basic types to complex nested structures. Use it as a reference when building schemas for Kafka or Hadoop projects.

What's a Schema Anyway?

It's a JSON document that defines your data structure. Think of it as a contract - it tells both the data writer and reader exactly what fields exist, what types they are, and whether they're required or optional.
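Because a schema is plain JSON, any JSON parser can read it. Here is a minimal sketch using only Python's standard library; the Greeting schema and the describe_fields helper are made up for illustration and are not part of any Avro library.

```python
import json

# Hypothetical schema, used only for illustration.
schema_json = """
{
  "type": "record",
  "name": "Greeting",
  "fields": [
    {"name": "message", "type": "string"},
    {"name": "language", "type": ["null", "string"], "default": null}
  ]
}
"""

schema = json.loads(schema_json)


def describe_fields(record_schema):
    """Return (name, type, has_default) for each field of a record schema."""
    return [(f["name"], f["type"], "default" in f) for f in record_schema["fields"]]


for name, field_type, has_default in describe_fields(schema):
    print(name, field_type, "has default" if has_default else "no default")
```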

Primitive Types (The Basics)

Let's start simple. Avro has these basic types that cover most everyday data:

null

When something has no value.

{ "type": "null" }

boolean

True or false. Simple as that.

{ "type": "boolean" }

int

32-bit signed integer. Good for most numbers (about ±2.1 billion).

{ "type": "int" }

long

64-bit integer. For really big numbers or timestamps.

{ "type": "long" }

float

32-bit floating-point number (IEEE 754 single precision).

{ "type": "float" }

double

64-bit floating-point number (IEEE 754 double precision). More precision than float, but still not exact for values like money.

{ "type": "double" }

bytes

Raw binary data (images, files, whatever).

{ "type": "bytes" }

string

Text. The most common type you'll use.

{ "type": "string" }

Quick Tip: int vs long

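Use int for values that comfortably fit in 32 bits, like counts or small quantities. Use long for timestamps and for IDs that could outgrow 2 billion. In Avro's binary encoding both types are variable-length, so a small value takes the same space on the wire either way, and long prevents overflow headaches.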
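On the wire, Avro's binary encoding writes both int and long as zig-zag, variable-length integers, so the encoded size depends on the value rather than on the declared type. A minimal sketch of that encoding follows; encode_long is an illustrative name, not a library function.

```python
def encode_long(n: int) -> bytes:
    """Encode an integer as Avro's binary format does for int and long:
    zig-zag mapping, then 7 bits per byte with a continuation bit."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)  # final byte, continuation bit clear
    return bytes(out)


print(len(encode_long(42)))                 # a small count
print(len(encode_long(1_700_000_000_000)))  # a millisecond timestamp
```

A small count fits in a single byte whichever type the schema declares, while a millisecond timestamp needs several bytes and would not fit in an int at all.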

Records (Structured Objects)

Records are what you'll use most. They're like objects in JavaScript or structs in C - a collection of fields grouped together.

Basic Record

{
  "type": "record",
  "name": "User",
  "namespace": "com.example.users",
  "doc": "A user in the system",
  "fields": [
    {
      "name": "userId",
      "type": "long",
      "doc": "Unique ID"
    },
    {
      "name": "username",
      "type": "string"
    },
    {
      "name": "email",
      "type": "string"
    },
    {
      "name": "createdAt",
      "type": "long",
      "doc": "Timestamp in milliseconds"
    }
  ]
}

Here's what each part means:

  • type: Always "record" for objects
  • name: What you call this thing (like a class name)
  • namespace: Optional, but helps avoid name conflicts
  • doc: Comments explaining what this is
  • fields: The actual data fields

Why Use Namespaces?

In Schema Registry, you might have multiple "User" records from different teams. Namespaces keep them separate: com.team1.User vs com.team2.User.
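The full name of a type is the namespace and the name joined by a dot. A rough sketch of that rule, with full_name as a hypothetical helper:

```python
def full_name(schema: dict, enclosing_namespace: str = "") -> str:
    """Build the fully qualified name of a named type (record, enum, or fixed)."""
    name = schema["name"]
    if "." in name:  # a dotted name is already fully qualified
        return name
    # Without its own namespace, a type uses the one it is nested in.
    namespace = schema.get("namespace", enclosing_namespace)
    return f"{namespace}.{name}" if namespace else name


team1_user = {"type": "record", "name": "User", "namespace": "com.team1", "fields": []}
team2_user = {"type": "record", "name": "User", "namespace": "com.team2", "fields": []}

print(full_name(team1_user))
print(full_name(team2_user))
```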

Enums (Fixed Choices)

Enums are great when a field can only have specific values. Status fields, categories, types - stuff like that.

{
  "type": "record",
  "name": "Order",
  "fields": [
    {
      "name": "orderId",
      "type": "string"
    },
    {
      "name": "status",
      "type": {
        "type": "enum",
        "name": "OrderStatus",
        "symbols": ["PENDING", "PROCESSING", "SHIPPED", "DELIVERED", "CANCELLED"]
      }
    }
  ]
}

Now status can only be one of those five values. Try to set it to "FINALIZED" and you'll get an error. Catches bugs early.

Evolving Enums

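You can add new values to the end, and existing data stays readable. Adding "RETURNED" to OrderStatus won't break data that's already written. Consumers still on the old schema will fail on the new symbol, though, unless their enum declares a "default" symbol to fall back on.

But don't remove symbols - readers can no longer handle data that used them. Avoid reordering too: Avro matches symbols by name, but ordinal positions and sort order change. If you need flexibility, just use strings instead.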
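What a reader does with a symbol it doesn't know depends on whether its enum declares a default. The Avro specification (1.9 and later) allows a "default" attribute on enums for exactly this case. Here is a rough sketch of the rule, working on plain Python values rather than encoded data; resolve_enum and the UNKNOWN symbol are assumptions made for illustration.

```python
def resolve_enum(written_symbol: str, reader_enum: dict) -> str:
    """Map a symbol from the writer onto the reader's version of the enum."""
    if written_symbol in reader_enum["symbols"]:
        return written_symbol
    if "default" in reader_enum:
        return reader_enum["default"]  # fall back instead of failing
    raise ValueError(f"unknown symbol {written_symbol!r} and no enum default")


# The reader still has the original five symbols and no default.
OLD_READER = {
    "type": "enum",
    "name": "OrderStatus",
    "symbols": ["PENDING", "PROCESSING", "SHIPPED", "DELIVERED", "CANCELLED"],
}

# Hypothetical variant that planned ahead with a catch-all symbol.
SAFER_READER = {
    "type": "enum",
    "name": "OrderStatus",
    "symbols": ["UNKNOWN", "PENDING", "PROCESSING", "SHIPPED", "DELIVERED", "CANCELLED"],
    "default": "UNKNOWN",
}

print(resolve_enum("SHIPPED", OLD_READER))
print(resolve_enum("RETURNED", SAFER_READER))
```

With OLD_READER, a "RETURNED" value raises an error, which is the failure an older consumer would hit after a producer starts using the new symbol.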

Arrays (Lists)

Arrays are for lists of things - tags, items, IDs, whatever. All elements must be the same type.

{
  "type": "record",
  "name": "BlogPost",
  "fields": [
    {
      "name": "title",
      "type": "string"
    },
    {
      "name": "tags",
      "type": {
        "type": "array",
        "items": "string"
      }
    },
    {
      "name": "viewCounts",
      "type": {
        "type": "array",
        "items": "int"
      }
    }
  ]
}

The items field says what type goes in the array. Tags are strings, viewCounts are ints. Pretty straightforward.
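A quick way to see the "same type" rule is to check a Python list against an array schema. This is only a sketch for primitive item types: matches_array and PYTHON_TYPES are illustrative names, and real Avro libraries also check things this skips, such as the 32-bit range of int.

```python
# Python type expected for a few Avro primitive types (deliberately incomplete).
PYTHON_TYPES = {
    "string": str,
    "boolean": bool,
    "int": int,
    "long": int,
    "float": float,
    "double": float,
}


def matches_array(value, array_schema: dict) -> bool:
    """Check that value is a list whose items all match the schema's items type."""
    if not isinstance(value, list):
        return False
    expected = PYTHON_TYPES[array_schema["items"]]
    # type(...) is used instead of isinstance so that True is not accepted as an int.
    return all(type(item) is expected for item in value)


tags_schema = {"type": "array", "items": "string"}
view_counts_schema = {"type": "array", "items": "int"}

print(matches_array(["avro", "kafka"], tags_schema))
print(matches_array([10, "20"], view_counts_schema))
```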

Maps (Key-Value Pairs)

Maps store key-value pairs. Keys are always strings, but values can be any type. Perfect for metadata or dynamic properties.

{
  "type": "record",
  "name": "KafkaEvent",
  "fields": [
    {
      "name": "eventId",
      "type": "string"
    },
    {
      "name": "metadata",
      "type": {
        "type": "map",
        "values": "string"
      }
    }
  ]
}

Example data:

{
  "eventId": "evt-12345",
  "metadata": {
    "source": "web-app",
    "userId": "user-789",
    "ipAddress": "192.168.1.1"
  }
}

You'll see this pattern all the time in Kafka event messages. Different events have different metadata, and maps handle that nicely.
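Since every value in this map must be a string, mixed metadata has to be converted before it is written. One possible approach is sketched below; to_string_map is a hypothetical helper, and encoding non-string values as JSON text is a design choice for this example, not something Avro requires.

```python
import json


def to_string_map(metadata: dict) -> dict:
    """Convert metadata so it fits {"type": "map", "values": "string"}."""
    result = {}
    for key, value in metadata.items():
        if not isinstance(key, str):
            raise TypeError(f"Avro map keys must be strings, got {key!r}")
        # Strings pass through; everything else is stored as its JSON text.
        result[key] = value if isinstance(value, str) else json.dumps(value)
    return result


print(to_string_map({"source": "web-app", "retries": 3, "beta": True}))
```

The consumer then has to know which keys hold numbers or booleans and parse them back. If that becomes a burden, a union or a nested record is usually a better fit than a map of strings.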

Unions (Optional Fields)

Unions let a field accept multiple types. Most commonly used to make fields optional by combining with null.

{
  "type": "record",
  "name": "Product",
  "fields": [
    {
      "name": "productId",
      "type": "string"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "description",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "discount",
      "type": ["null", "double"],
      "default": null
    }
  ]
}

["null", "string"] means "this can be null OR a string". The "default": null lets readers fill in the field when it's missing from data written with an older schema.

Important: Order Matters

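Put "null" first when the default is null: ["null", "string"] not ["string", "null"]. In most Avro versions and libraries, a union field's default is checked against the first type in the union, so "default": null is only valid when "null" is listed first.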
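The ordering rule is easy to get wrong by hand, so it can help to generate optional fields from a small helper. The sketch below assumes the common rule that a null default needs "null" as the first branch of the union; optional_field and null_default_is_valid are illustrative names, not library functions.

```python
import json


def optional_field(name: str, avro_type) -> dict:
    """Build a field that is either null or avro_type, defaulting to null."""
    return {"name": name, "type": ["null", avro_type], "default": None}


def null_default_is_valid(field: dict) -> bool:
    """True if a null default is paired with a type whose first branch is "null"."""
    if "default" not in field or field["default"] is not None:
        return True  # no default, or a non-null default: not checked in this sketch
    field_type = field["type"]
    if isinstance(field_type, list):  # a union is a JSON array of types
        return field_type[0] == "null"
    return field_type == "null"


print(json.dumps(optional_field("description", "string")))
print(null_default_is_valid({"name": "note", "type": ["string", "null"], "default": None}))
```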

Nested Records

You can nest records inside other records. Common pattern for things like customers with addresses.

{
  "type": "record",
  "name": "Customer",
  "namespace": "com.example.customers",
  "fields": [
    {
      "name": "customerId",
      "type": "long"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "shippingAddress",
      "type": {
        "type": "record",
        "name": "Address",
        "fields": [
          { "name": "street", "type": "string" },
          { "name": "city", "type": "string" },
          { "name": "state", "type": "string" },
          { "name": "zipCode", "type": "string" }
        ]
      }
    },
    {
      "name": "billingAddress",
      "type": ["null", "Address"],
      "default": null
    }
  ]
}

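Notice how billingAddress reuses the Address type we defined in shippingAddress. Once a named type is defined, later fields in the same schema can refer to it by name. Address has no namespace of its own, so it inherits com.example.customers from Customer.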
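To see which names a schema makes available, you can walk it and collect every record, enum, and fixed definition. The following is a simplified sketch: collect_named_types is a hypothetical helper, the Customer schema is trimmed, and dotted names are not handled.

```python
def collect_named_types(schema, namespace="", found=None):
    """Return {full name: definition} for every named type inside a schema."""
    if found is None:
        found = {}
    if isinstance(schema, list):  # union: visit each branch
        for branch in schema:
            collect_named_types(branch, namespace, found)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind in ("record", "enum", "fixed"):
            # A nested type without its own namespace inherits the enclosing one.
            namespace = schema.get("namespace", namespace)
            full = f"{namespace}.{schema['name']}" if namespace else schema["name"]
            found[full] = schema
        if kind == "record":
            for field in schema["fields"]:
                collect_named_types(field["type"], namespace, found)
        elif kind == "array":
            collect_named_types(schema["items"], namespace, found)
        elif kind == "map":
            collect_named_types(schema["values"], namespace, found)
    # Plain strings are primitives or references to earlier definitions.
    return found


customer = {
    "type": "record",
    "name": "Customer",
    "namespace": "com.example.customers",
    "fields": [
        {"name": "customerId", "type": "long"},
        {"name": "shippingAddress", "type": {
            "type": "record",
            "name": "Address",
            "fields": [{"name": "city", "type": "string"}],
        }},
        {"name": "billingAddress", "type": ["null", "Address"], "default": None},
    ],
}

print(sorted(collect_named_types(customer)))
```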

Fixed (Fixed-Length Bytes)

Fixed types are for binary data with a known length. UUIDs, hashes, crypto keys - that kind of thing.

{
  "type": "record",
  "name": "SecurityEvent",
  "fields": [
    {
      "name": "eventId",
      "type": {
        "type": "fixed",
        "name": "UUID",
        "size": 16
      }
    },
    {
      "name": "hash",
      "type": {
        "type": "fixed",
        "name": "SHA256",
        "size": 32
      }
    }
  ]
}

UUID is 16 bytes, SHA-256 is 32 bytes. When the size is known, fixed is slightly more compact than bytes: bytes stores a length prefix before the data, fixed doesn't.
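The size difference comes from how the two types are written in Avro's binary encoding: bytes is a length (encoded as a long) followed by the data, while fixed is just the data. A small sketch of both, with encode_long, encode_bytes, and encode_fixed as illustrative names and sample values chosen for the example:

```python
import hashlib
import uuid


def encode_long(n: int) -> bytes:
    """Zig-zag, variable-length integer as used by Avro's binary encoding."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_bytes(data: bytes) -> bytes:
    """bytes: length prefix, then the raw data."""
    return encode_long(len(data)) + data


def encode_fixed(data: bytes, size: int) -> bytes:
    """fixed: the raw data only; the size lives in the schema."""
    if len(data) != size:
        raise ValueError(f"expected {size} bytes, got {len(data)}")
    return data


event_id = uuid.UUID("12345678-1234-5678-1234-567812345678").bytes
digest = hashlib.sha256(b"example payload").digest()

print(len(encode_bytes(event_id)), len(encode_fixed(event_id, 16)))
print(len(encode_bytes(digest)), len(encode_fixed(digest, 32)))
```

The saving is one byte per value at these sizes. The bigger benefit is that fixed rejects data of the wrong length.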

Logical Types (Dates, Times, Decimals)

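Logical types add meaning to an underlying type. They're stored as a plain int, long, bytes, fixed, or string but interpreted specially. Readers that don't recognize a logical type just see the underlying type.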

{
  "type": "record",
  "name": "Transaction",
  "fields": [
    {
      "name": "transactionDate",
      "type": {
        "type": "int",
        "logicalType": "date"
      },
      "doc": "Days since 1970-01-01"
    },
    {
      "name": "transactionTime",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      },
      "doc": "Milliseconds since 1970-01-01"
    },
    {
      "name": "amount",
      "type": {
        "type": "bytes",
        "logicalType": "decimal",
        "precision": 10,
        "scale": 2
      },
      "doc": "Money amount with 2 decimal places"
    }
  ]
}

Common Ones:

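  • date: int, days since the Unix epoch (1970-01-01)
  • time-millis: int, milliseconds after midnight
  • timestamp-millis: long, milliseconds since the Unix epoch (UTC)
  • timestamp-micros: long, microseconds since the Unix epoch (UTC)
  • decimal: bytes or fixed, an exact decimal number with a set precision and scale
  • uuid: string, a UUID in its standard text form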
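To make the "stored as a primitive" idea concrete, here is a sketch of how three of these logical types map onto their underlying values. Avro libraries normally do these conversions for you; the function names below are made up for this example.

```python
from datetime import date, datetime, timedelta, timezone
from decimal import Decimal

EPOCH_DATE = date(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_to_days(d: date) -> int:
    """date: number of days since 1970-01-01, stored as an int."""
    return (d - EPOCH_DATE).days


def datetime_to_millis(dt: datetime) -> int:
    """timestamp-millis: milliseconds since the epoch, stored as a long.
    Expects a timezone-aware datetime."""
    return (dt - EPOCH_UTC) // timedelta(milliseconds=1)


def decimal_to_bytes(value: Decimal, scale: int) -> bytes:
    """decimal: the unscaled integer as big-endian two's-complement bytes."""
    scaled = value.scaleb(scale)  # 19.99 with scale 2 becomes 1999
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{value} has more than {scale} decimal places")
    unscaled = int(scaled)
    length = (unscaled.bit_length() + 8) // 8  # leaves room for the sign bit
    return unscaled.to_bytes(length, "big", signed=True)


print(date_to_days(date(2026, 1, 1)))
print(datetime_to_millis(datetime(2026, 1, 1, tzinfo=timezone.utc)))
print(decimal_to_bytes(Decimal("19.99"), scale=2).hex())
```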

Schema Evolution Rules

Here's what you can and can't do when changing schemas over time:

Safe Changes:

  • Add fields with defaults
  • Remove fields with defaults
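  • Add enum values at the end (readers on the old schema need an enum default to handle them)
  • Add union types (upgrade readers before writers, since old readers can't handle the new type)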

Breaking Changes:

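  • Change field types (except promotions Avro allows, such as int to long or float to double)
  • Rename fields (unless the new schema lists the old name in "aliases")
  • Remove enum values
  • Add fields without defaults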

Golden Rule

Always provide defaults when adding new fields. This lets old data work with new schemas and new data work with old schemas. Read more in the Confluent Avro guide.
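Here is why defaults matter, as a simplified sketch. Real Avro resolution works on encoded data and uses the writer's schema; this version works on plain dicts, and read_with_schema is a hypothetical helper. The field rule is the same, though: take the value if it was written, otherwise fall back to the reader's default, otherwise fail.

```python
def read_with_schema(record: dict, reader_schema: dict) -> dict:
    """Shape a decoded record to match the reader's schema."""
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            result[name] = record[name]
        elif "default" in field:
            result[name] = field["default"]  # field added after the data was written
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return result  # fields the reader doesn't know about are dropped


old_record = {"productId": "p-1", "name": "Mug"}

new_schema = {
    "type": "record",
    "name": "Product",
    "fields": [
        {"name": "productId", "type": "string"},
        {"name": "name", "type": "string"},
        {"name": "description", "type": ["null", "string"], "default": None},
    ],
}

print(read_with_schema(old_record, new_schema))
```

If description had been added without a default, the same call would raise, which is why adding a field with no default counts as a breaking change.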
