📣

TiDB Cloud Premium is now in public preview. Unlimited growth, instant elasticity, advanced security for enterprise workloads. Try it out →

Get Started

Overview

Quick Start

Guides

Connect
Manage Resources
Data Integration
Load Data
Unload Data
Multimodal Data Analytics
Performance Tuning
Security & Reliability
Management

Troubleshooting

Tutorials

Ingest & Stream Data
Migrate Data

Reference

SQL Reference
General Reference

Support Services

INFER_SCHEMA

Automatically detects the file metadata schema and retrieves the column definitions.

infer_schema currently supports the following file formats:

Parquet - Native support for schema inference
CSV - With customizable delimiters and header detection
NDJSON - Newline-delimited JSON files

Compression Support: All formats also support compressed files with extensions .zip, .xz, .zst.

Note

Each individual file has a maximum size limit of 100MB for schema inference.

Note

When processing multiple files, infer_schema automatically merges different schemas:

Compatible types are promoted (e.g., INT8 + INT16 → INT16)
Incompatible types fall back to VARCHAR (e.g., INT + FLOAT → VARCHAR)
Missing columns in some files are marked as nullable
New columns from later files are added to the final schema

This ensures all files can be read using the unified schema.

Syntax

INFER_SCHEMA(
  LOCATION => '{ internalStage | externalStage }'
  [ PATTERN => '<regex_pattern>']
  [ FILE_FORMAT => '<format_name>' ]
  [ MAX_RECORDS_PRE_FILE => <number> ]
  [ MAX_FILE_COUNT => <number> ]
)

Parameters

Parameter	Description	Default	Example
`LOCATION`	Stage location: `@<stage_name>[/<path>]`	Required	`'@my_stage/data/'`
`PATTERN`	Regular expression pattern to match staged files. It matches the file path portion after `@<stage_name>[/<path>]`. See Filtering Staged Files with PATTERN.	All files	`'.[.]csv'`, `'.[.]parquet'`
`FILE_FORMAT`	File format name for parsing	Stage's format	`'csv_format'`, `'NDJSON'`
`MAX_RECORDS_PRE_FILE`	Max records to sample per file	All records	`100`, `1000`
`MAX_FILE_COUNT`	Max number of files to process	All files	`5`, `10`

Examples

Parquet Files

-- Create stage and export data
CREATE STAGE test_parquet;
COPY INTO @test_parquet FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'PARQUET');

-- Infer schema from parquet files using pattern
SELECT * FROM INFER_SCHEMA(
    location => '@test_parquet',
    pattern => '.*[.]parquet'
);

Result:

+-------------+-----------------+----------+----------+----------+
| column_name | type            | nullable | filenames| order_id |
+-------------+-----------------+----------+----------+----------+
| number      | BIGINT UNSIGNED |    false | data_... |        0 |
+-------------+-----------------+----------+----------+----------+

CSV Files

-- Create stage and export CSV data
CREATE STAGE test_csv;
COPY INTO @test_csv FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'CSV');

-- Create a CSV file format
CREATE FILE FORMAT csv_format TYPE = 'CSV';

-- Infer schema using pattern and file format
SELECT * FROM INFER_SCHEMA(
    location => '@test_csv',
    pattern => '.*[.]csv',
    file_format => 'csv_format'
);

Result:

+-------------+---------+----------+----------+----------+
| column_name | type    | nullable | filenames| order_id |
+-------------+---------+----------+----------+----------+
| column_1    | BIGINT  |     true | data_... |        0 |
+-------------+---------+----------+----------+----------+

For CSV files with headers:

-- Create CSV file format with header support
CREATE FILE FORMAT csv_headers_format
TYPE = 'CSV'
field_delimiter = ','
skip_header = 1;

-- Export data with headers
CREATE STAGE test_csv_headers;
COPY INTO @test_csv_headers FROM (
  SELECT number as user_id, 'user_' || number::string as user_name
  FROM numbers(5)
) FILE_FORMAT = (TYPE = 'CSV', output_header = true);

-- Infer schema with headers
SELECT * FROM INFER_SCHEMA(
    location => '@test_csv_headers',
    file_format => 'csv_headers_format'
);

Limit records for faster inference:

-- Sample only first 5 records for schema inference
SELECT * FROM INFER_SCHEMA(
    location => '@test_csv',
    pattern => '.*[.]csv',
    file_format => 'csv_format',
    max_records_pre_file => 5
);

NDJSON Files

-- Create stage and export NDJSON data
CREATE STAGE test_ndjson;
COPY INTO @test_ndjson FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'NDJSON');

-- Infer schema using pattern and NDJSON format
SELECT * FROM INFER_SCHEMA(
    location => '@test_ndjson',
    pattern => '.*[.]ndjson',
    file_format => 'NDJSON'
);

Result:

+-------------+---------+----------+----------+----------+
| column_name | type    | nullable | filenames| order_id |
+-------------+---------+----------+----------+----------+
| number      | BIGINT  |     true | data_... |        0 |
+-------------+---------+----------+----------+----------+

Limit records for faster inference:

-- Sample only first 5 records for schema inference
SELECT * FROM INFER_SCHEMA(
    location => '@test_ndjson',
    pattern => '.*[.]ndjson',
    file_format => 'NDJSON',
    max_records_pre_file => 5
);

Schema Merging with Multiple Files

When files have different schemas, infer_schema merges them intelligently:

-- Suppose you have multiple CSV files with different schemas:
-- file1.csv: id(INT), name(VARCHAR)
-- file2.csv: id(INT), name(VARCHAR), age(INT)
-- file3.csv: id(FLOAT), name(VARCHAR), age(INT)

SELECT * FROM INFER_SCHEMA(
    location => '@my_stage/',
    pattern => '.*[.]csv',
    file_format => 'csv_format'
);

Result shows merged schema:

+-------------+---------+----------+-----------+----------+
| column_name | type    | nullable | filenames | order_id |
+-------------+---------+----------+-----------+----------+
| id          | VARCHAR |     true | file1,... |        0 |  -- INT+FLOAT→VARCHAR
| name        | VARCHAR |     true | file1,... |        1 |
| age         | BIGINT  |     true | file1,... |        2 |  -- Missing in file1→nullable
+-------------+---------+----------+-----------+----------+

Pattern Matching and File Limits

Use pattern matching to infer schema from multiple files:

-- Infer schema from all CSV files in the directory
SELECT * FROM INFER_SCHEMA(
    location => '@my_stage/',
    pattern => '.*[.]csv'
);

Limit the number of files processed to improve performance:

-- Process only the first 5 matching files
SELECT * FROM INFER_SCHEMA(
    location => '@my_stage/',
    pattern => '.*[.]csv',
    max_file_count => 5
);

Compressed Files

infer_schema automatically handles compressed files:

-- Works with compressed CSV files
SELECT * FROM INFER_SCHEMA(location => '@my_stage/data.csv.zip');

-- Works with compressed NDJSON files
SELECT * FROM INFER_SCHEMA(
    location => '@my_stage/data.ndjson.xz',
    file_format => 'NDJSON',
    max_records_pre_file => 50
);

Create Table from Inferred Schema

The infer_schema function displays the schema but doesn't create tables. To create a table from the inferred schema:

-- Create table structure from file schema
CREATE TABLE my_table AS
SELECT * FROM @my_stage/ (pattern=>'.*[.]parquet')
LIMIT 0;

-- Verify the table structure
DESC my_table;

INFER_SCHEMA

Syntax

Parameters

Examples

Parquet Files

CSV Files

NDJSON Files

Schema Merging with Multiple Files

Pattern Matching and File Limits

Compressed Files

Create Table from Inferred Schema

Was this page helpful?