📣
TiDB Cloud Premium is now in public preview. Unlimited growth, instant elasticity, advanced security for enterprise workloads. Try it out →

INFER_SCHEMA



Automatically detects the file metadata schema and retrieves the column definitions.

infer_schema currently supports the following file formats:

  • Parquet - Native support for schema inference
  • CSV - With customizable delimiters and header detection
  • NDJSON - Newline-delimited JSON files

Compression Support: All formats also support compressed files with extensions .zip, .xz, .zst.

Syntax

INFER_SCHEMA( LOCATION => '{ internalStage | externalStage }' [ PATTERN => '<regex_pattern>'] [ FILE_FORMAT => '<format_name>' ] [ MAX_RECORDS_PRE_FILE => <number> ] [ MAX_FILE_COUNT => <number> ] )

Parameters

ParameterDescriptionDefaultExample
LOCATIONStage location: @<stage_name>[/<path>]Required'@my_stage/data/'
PATTERNRegular expression pattern to match staged files. It matches the file path portion after @<stage_name>[/<path>]. See Filtering Staged Files with PATTERN.All files'.*[.]csv', '.*[.]parquet'
FILE_FORMATFile format name for parsingStage's format'csv_format', 'NDJSON'
MAX_RECORDS_PRE_FILEMax records to sample per fileAll records100, 1000
MAX_FILE_COUNTMax number of files to processAll files5, 10

Examples

Parquet Files

-- Create stage and export data CREATE STAGE test_parquet; COPY INTO @test_parquet FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'PARQUET'); -- Infer schema from parquet files using pattern SELECT * FROM INFER_SCHEMA( location => '@test_parquet', pattern => '.*[.]parquet' );

Result:

+-------------+-----------------+----------+----------+----------+ | column_name | type | nullable | filenames| order_id | +-------------+-----------------+----------+----------+----------+ | number | BIGINT UNSIGNED | false | data_... | 0 | +-------------+-----------------+----------+----------+----------+

CSV Files

-- Create stage and export CSV data CREATE STAGE test_csv; COPY INTO @test_csv FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'CSV'); -- Create a CSV file format CREATE FILE FORMAT csv_format TYPE = 'CSV'; -- Infer schema using pattern and file format SELECT * FROM INFER_SCHEMA( location => '@test_csv', pattern => '.*[.]csv', file_format => 'csv_format' );

Result:

+-------------+---------+----------+----------+----------+ | column_name | type | nullable | filenames| order_id | +-------------+---------+----------+----------+----------+ | column_1 | BIGINT | true | data_... | 0 | +-------------+---------+----------+----------+----------+

For CSV files with headers:

-- Create CSV file format with header support CREATE FILE FORMAT csv_headers_format TYPE = 'CSV' field_delimiter = ',' skip_header = 1; -- Export data with headers CREATE STAGE test_csv_headers; COPY INTO @test_csv_headers FROM ( SELECT number as user_id, 'user_' || number::string as user_name FROM numbers(5) ) FILE_FORMAT = (TYPE = 'CSV', output_header = true); -- Infer schema with headers SELECT * FROM INFER_SCHEMA( location => '@test_csv_headers', file_format => 'csv_headers_format' );

Limit records for faster inference:

-- Sample only first 5 records for schema inference SELECT * FROM INFER_SCHEMA( location => '@test_csv', pattern => '.*[.]csv', file_format => 'csv_format', max_records_pre_file => 5 );

NDJSON Files

-- Create stage and export NDJSON data CREATE STAGE test_ndjson; COPY INTO @test_ndjson FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'NDJSON'); -- Infer schema using pattern and NDJSON format SELECT * FROM INFER_SCHEMA( location => '@test_ndjson', pattern => '.*[.]ndjson', file_format => 'NDJSON' );

Result:

+-------------+---------+----------+----------+----------+ | column_name | type | nullable | filenames| order_id | +-------------+---------+----------+----------+----------+ | number | BIGINT | true | data_... | 0 | +-------------+---------+----------+----------+----------+

Limit records for faster inference:

-- Sample only first 5 records for schema inference SELECT * FROM INFER_SCHEMA( location => '@test_ndjson', pattern => '.*[.]ndjson', file_format => 'NDJSON', max_records_pre_file => 5 );

Schema Merging with Multiple Files

When files have different schemas, infer_schema merges them intelligently:

-- Suppose you have multiple CSV files with different schemas: -- file1.csv: id(INT), name(VARCHAR) -- file2.csv: id(INT), name(VARCHAR), age(INT) -- file3.csv: id(FLOAT), name(VARCHAR), age(INT) SELECT * FROM INFER_SCHEMA( location => '@my_stage/', pattern => '.*[.]csv', file_format => 'csv_format' );

Result shows merged schema:

+-------------+---------+----------+-----------+----------+ | column_name | type | nullable | filenames | order_id | +-------------+---------+----------+-----------+----------+ | id | VARCHAR | true | file1,... | 0 | -- INT+FLOAT→VARCHAR | name | VARCHAR | true | file1,... | 1 | | age | BIGINT | true | file1,... | 2 | -- Missing in file1→nullable +-------------+---------+----------+-----------+----------+

Pattern Matching and File Limits

Use pattern matching to infer schema from multiple files:

-- Infer schema from all CSV files in the directory SELECT * FROM INFER_SCHEMA( location => '@my_stage/', pattern => '.*[.]csv' );

Limit the number of files processed to improve performance:

-- Process only the first 5 matching files SELECT * FROM INFER_SCHEMA( location => '@my_stage/', pattern => '.*[.]csv', max_file_count => 5 );

Compressed Files

infer_schema automatically handles compressed files:

-- Works with compressed CSV files SELECT * FROM INFER_SCHEMA(location => '@my_stage/data.csv.zip'); -- Works with compressed NDJSON files SELECT * FROM INFER_SCHEMA( location => '@my_stage/data.ndjson.xz', file_format => 'NDJSON', max_records_pre_file => 50 );

Create Table from Inferred Schema

The infer_schema function displays the schema but doesn't create tables. To create a table from the inferred schema:

-- Create table structure from file schema CREATE TABLE my_table AS SELECT * FROM @my_stage/ (pattern=>'.*[.]parquet') LIMIT 0; -- Verify the table structure DESC my_table;

Was this page helpful?