Supported Data Types
High-level overview
Mitzu distinguishes several data type categories
- scalar - numbers with floating point precision, integers
- text - strings, varchars, and chars
- date - date types
- timestamp - timestamp types with or without timezones
- map/json - dynamically varying complex type
- struct - static complex types
- array - repeated value type
- exotic types - UUID, Blob, etc.
Generic type support
Mitzu generally supports all type categories except "Exotic types" for all data warehouse or data lake types.
Type | Support |
---|---|
Scalar | ✅ |
Text | ✅ |
Date | ✅ |
Timestamp | ✅ |
Map / JSON | ✅ |
Struct | ✅ |
Array | ✅ - only arrays of scalars and arrays of strings |
Exotic types | ❌ (GEO spatial types, UUIDs, Binaries etc) |
Exotic types support is not added as every data warehouse have different set of exotic types.
Type recommendations per use-case
In this section, we are describing our recommended types per use case.
Data type for user ids
Generally, the data communities have agreed to use UUID4 (or later versions once released) or CUIDs to identify users in data. For simplicity, we recommend using UUID4s. Except for a single character in the middle, these are strings of random characters.
✅ The recommended type category for user ids is text.
Data type for event times
Event times can mean 4 things in product event tracking:
- client time - the client's timestamp for logging the event
- client upload time - the client's timestamp for sending the event to a server or CDP
- server time - the timestamp for the server receiving the event.
- corrected client time - the calculated timestamp based on the client time and the difference between the server time and client upload time.
Generally, the recommendation is to use the corrected client time for product analytics. Most CDPs store the corrected client timestamp as the client time in the data warehouse.
✅ The recommended type category for event times is the timestamp (without timezones)
Data type for event names
Event names should be human-readable or quasi-human-readable text.
✅ The recommended type category for event names is text.
Data type for date partitions
Date partitions help efficiently read the data from a data lake. Data lakes store files in "folders" that have formats, e.g.:
/date=2024-04-01/raw_data.parquet
Naturally, it would come to that date partitions should have date type. However, experience shows that many data teams keep this information in text format in the table definition. Text format is acceptable if the date string is in ISO format.
Mitzu supports both text and date formats for partition columns.
✅ The recommended type category for event names is DATE or TEXT.
Other recommendations
Dynamic complex type support for data warehouses
This section discusses which data types can be used in each data warehouse for storing dynamic types.
Data warehouse | Supported dynamic type |
---|---|
Snowflake | ✅ Flat Object type - no nesting |
Databricks | ✅ Map<TEXT,SCALAR> or MAP<TEXT, TEXT> |
BigQuery | ✅ JSON Strings |
Trino | ✅ Map<TEXT,SCALAR> or MAP<TEXT, TEXT> |
Athena | ✅ Map<TEXT,SCALAR> or MAP<TEXT, TEXT> |
Postgres | ✅ Json or jsonb |
Clickhouse | ✅ Json types |
Dynamic complex types are essential for modeling a large number of events with a different set of properties.
Static complex type support
Generally, we don't recommend using static complex types while storing events in a data warehouse or data lake.
These types are common in:
- Databricks, Athena, Trino - struct types
- Clickhouse - tupples
Try avoiding these types and use a flat-column hierarchy instead.