Supported Data Types
High-level overview
Mitzu distinguishes several data type categories:
- scalar - numbers with floating point precision, integers
- text - strings, varchars, and chars
- date - date types
- timestamp - timestamp types with or without timezones
- map/json/variant - dynamically varying complex type
- struct - static complex types
- array - repeated value type
- exotic types - UUID, Blob, etc.
Generic type support
Mitzu generally supports all type categories except "Exotic types" across all data warehouse and data lake types.
| Type | Support |
|---|---|
| Scalar | ✅ |
| Text | ✅ |
| Date | ✅ |
| Timestamp | ✅ |
| Map / JSON / Variant | ✅ |
| Struct | ✅ |
| Array | ✅ - only arrays of scalars and arrays of strings |
| Exotic types | ❌ (GEO spatial types, UUIDs, Binaries etc) |
Exotic types support is not included, as every data warehouse has a different set of exotic types.
Type recommendations per use-case
In this section, we describe our recommended types for each use case.
Data type for user ids
Generally, the data community has agreed to use UUID4 (or later versions once released) or CUIDs to identify users in data. For simplicity, we recommend using UUID4s. Except for a single character in the middle, these are strings of random characters.
✅ The recommended type category for user ids is text.
Data type for event times
Event times can mean four things in product event tracking:
- client time - the client's timestamp for when the event was logged
- client upload time - the client's timestamp for when the event was sent to a server or CDP
- server time - the timestamp for when the server received the event
- corrected client time - the timestamp calculated from the client time and the difference between the server time and client upload time
Generally, we recommend using the corrected client time for product analytics. Most CDPs store the corrected client timestamp as the client time in the data warehouse.
✅ The recommended type category for event times is timestamp (without timezones).
Data type for event names
Event names should be human-readable or quasi-human-readable text.
✅ The recommended type category for event names is text.
Data type for date partitions
Date partitions help efficiently read data from a data lake. Data lakes store files in "folders" with formats such as:
/date=2024-04-01/raw_data.parquet
Naturally, you might assume that date partitions should use a date type. However, experience shows that many data teams keep this information in text format in the table definition. Text format is acceptable as long as the date string is in ISO format.
Mitzu supports both text and date formats for partition columns.
✅ The recommended type category for date partitions is DATE or TEXT.
Other recommendations
Dynamic complex type support for data warehouses
This section discusses which data types can be used in each data warehouse for storing dynamic types.
| Data warehouse | Supported dynamic type |
|---|---|
| Snowflake | ✅ Flat Object type - no nesting |
| Databricks | ✅ Map<TEXT,SCALAR> or MAP<TEXT, TEXT> |
| BigQuery | ✅ JSON Strings |
| Trino | ✅ Map<TEXT,SCALAR> or MAP<TEXT, TEXT> |
| Athena | ✅ Map<TEXT,SCALAR> or MAP<TEXT, TEXT> |
| Postgres | ✅ Json or jsonb |
| Clickhouse | ✅ Json types |
Dynamic complex types are essential for modeling a large number of events with different sets of properties.
Static complex type support
Generally, we don't recommend using static complex types when storing events in a data warehouse or data lake.
These types are common in:
- Databricks, Athena, Trino - struct types
- Clickhouse - tuples
Try to avoid these types and use a flat-column hierarchy instead.