Querying Avro Files in Stage
Syntax
Avro Querying Features Overview
TiDB Cloud Lake provides comprehensive support for querying Avro files directly from stages. This allows for flexible data exploration and transformation without needing to load the data into a table first.
- Variant Representation: Each row in an Avro file is treated as a variant, referenced by
$1. This allows for flexible access to nested structures within the Avro data. - Type Mapping: Each Avro type is mapped to a corresponding variant type in TiDB Cloud Lake.
- Metadata Access: You can access metadata columns like
METADATA$FILENAMEandMETADATA$FILE_ROW_NUMBERfor additional context about the source file and row.
Tutorial
This tutorial demonstrates how to query Avro files stored in a stage.
Step 1. Prepare an Avro File
Consider an Avro file with the following schema named user:
{
"type": "record",
"name": "user",
"fields": [
{
"name": "id",
"type": "long"
},
{
"name": "name",
"type": "string"
}
]
}
Step 2. Create an External Stage
Create an external stage with your own S3 bucket and credentials where your Avro files are stored.
CREATE STAGE avro_query_stage
URL = 's3://load/avro/'
CONNECTION = (
ACCESS_KEY_ID = '<your-access-key-id>'
SECRET_ACCESS_KEY = '<your-secret-access-key>'
);
Step 3. Query Avro Files
Basic Query
Query Avro files directly from a stage:
SELECT
CAST($1:id AS INT) AS id,
$1:name AS name
FROM @avro_query_stage
(
FILE_FORMAT => 'AVRO',
PATTERN => '.*[.]avro'
);
Query with Metadata
Query Avro files directly from a stage, including metadata columns like METADATA$FILENAME and METADATA$FILE_ROW_NUMBER:
SELECT
METADATA$FILENAME,
METADATA$FILE_ROW_NUMBER,
CAST($1:id AS INT) AS id,
$1:name AS name
FROM @avro_query_stage
(
FILE_FORMAT => 'AVRO',
PATTERN => '.*[.]avro'
);
Type Mapping to Variant
Variants in TiDB Cloud Lake are stored as JSONB. While most Avro types map straightforwardly, some special considerations apply:
- Time Types:
TimeMillisandTimeMicrosare mapped toINT64as JSONB does not have a native Time type. Users should be aware of the original type when processing these values. - Decimal Types: Decimals are loaded as
DECIMAL128orDECIMAL256. An error may occur if the precision exceeds the supported limits. - Enum Types: Avro
ENUMtypes are mapped toSTRINGvalues in TiDB Cloud Lake.