Querying & Transforming
TiDB Cloud Lake enables direct querying of staged files without loading data into tables first. Query files from any stage type (user, internal, external) or directly from object storage and HTTPS URLs. Ideal for data inspection, validation, and transformation before or after loading.
Syntax
query only
SELECT {
[<alias>.]<column> [, [<alias>.]<column> ...] -- Query columns by name
| [<alias>.]$<col_position> [, [<alias>.]$<col_position> ...] -- Query columns by position
| [<alias>.]$1[:<column>] [, [<alias>.]$1[:<column>] ...] -- Query rows as Variants
}
FROM {@<stage_name>[/<path>] | '<uri>'} -- stage table function
[( -- stage table function parameters
[<connection_parameters>],
[ PATTERN => '<regex_pattern>'],
[ FILE_FORMAT => 'CSV | TSV | NDJSON | PARQUET | ORC | Avro | <custom_format_name>'],
[ FILES => ( '<file_name>' [ , '<file_name>' ... ])],
[ CASE_SENSITIVE => true | false ]
)]
[<alias>]
copy with transform
COPY INTO [<database_name>.]<table_name> [ ( <col_name> [ , <col_name> ... ] ) ]
FROM (
SELECT {
[<alias>.]<column> [, [<alias>.]<column> ...] -- Query columns by name
| [<alias>.]$<col_position> [, [<alias>.]$<col_position> ...] -- Query columns by position
| [<alias>.]$1[:<column>] [, [<alias>.]$1[:<column>] ...] -- Query rows as Variants
} ]
FROM {@<stage_name>[/<path>] | '<uri>'}
)
[ FILES = ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]
[ PATTERN = '<regex_pattern>' ]
[ FILE_FORMAT = (
FORMAT_NAME = '<your-custom-format>'
| TYPE = { CSV | TSV | NDJSON | PARQUET | ORC | AVRO } [ formatTypeOptions ]
) ]
[ copyOptions ]
FROM Clause
the FROM Clause use similar syntax of Table Function. Like ordinary tables, table alias can be used when join with other tables.
table function parameters:
Query File Data
The select list supports three syntaxes; only one may be used, with no mixing.
Query rows as Variants
- Supported File Formats: NDJSON, AVRO, Parquet, ORC
syntax:
SELECT [<alias>.]$1[:<column>] [, [<alias>.]$1[:<column>] ...] <FROM Clause>
- Example:
SELECT $1:id, $1:name FROM ... - Table Schema: ($1: Variant). i.e. Single Column with Variant Object Type, each Variant representing a whole row
- Notes:
- The type of path expressions like
$1:columnis Variant too, it can be auto cast to native types when used in expressions or load to dest table column, sometimes you may want to cast manually before for type-specific operations (e.g.,CAST($1:id AS INT)) to make the semantics more explicit.
- The type of path expressions like
Query columns by name
- supported File Formats: NDJSON, AVRO, Parquet, ORC
SELECT [<alias>.]<column> [, [<alias>.]<column> ...] <FROM Clause>
- Example:
SELECT id, name FROM ... - Table Schema: Columns Mapping from Parquet or ORC file schema
- Notes:
- All files are required to have the same Parquet/ORC schema; otherwise, an error will be returned
Query columns by Position
- supported File Formats: CSV, TSV
SELECT [<alias>.]$<col_position>[, [<alias>.]$<col_position>, ...] <FROM Clause>
- Example:
SELECT $1, $2 FROM ... - Table Schema: Columns of type
VARCHAR NULL - Notes
<col_position>starts from 1
Query Metadata
You can also include file metadata in your queries, which is useful for tracking data lineage and debugging:
SELECT METADATA$FILENAME, METADATA$FILE_ROW_NUMBER, $1, <FROM Clause>
(
FILE_FORMAT => 'ndjson_query_format',
PATTERN => '.*[.]ndjson'
);
The following file-level metadata fields are available for the supported file formats:
Use cases:
- Data lineage: Track which source file contributed each record
- Debugging: Identify problematic records by file and line number
- Incremental processing: Process only specific files or ranges within files
Tutorials by File Formats
- Querying Parquet Files
- Querying ORC Files
- Querying NDJSON Files
- Querying Avro Files
- Querying CSV Files
- Querying TSV Files
Schema Evolution
- Schema Evolution: Automatically add new columns to a table when loading Parquet files with evolving schemas.