
Amazon S3 Integration Task



This page describes how to create an Amazon S3 integration task that imports files from an S3 bucket into TiDB Cloud Lake. CSV, Parquet, and NDJSON file formats are supported, and the task can be configured for one-time import or continuous ingestion.

If you need to create reusable AWS credentials first, see Amazon S3 - Credentials.

Supported File Formats

| Format | Description |
| --- | --- |
| CSV | Comma-separated values with configurable delimiters and headers |
| Parquet | Columnar storage format, efficient for analytical workloads |
| NDJSON | Newline-delimited JSON, one JSON object per line |

Prerequisites

  • An Amazon S3 - Credentials data source has already been created
  • The AWS credentials have read access to the target S3 bucket
  • If you plan to enable Clean Up Original Files, the credentials also need write and delete permissions
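As a rough illustration of the permissions listed above, the following Python sketch builds an IAM policy document. The bucket name `mybucket` and the helper `build_policy` are hypothetical, and the exact set of actions TiDB Cloud Lake requires is an assumption here: this page only names `s3:DeleteObject` explicitly, so check the Amazon S3 - Credentials page before relying on it.

```python
import json


def build_policy(bucket: str, allow_cleanup: bool = False) -> dict:
    """Build an IAM policy dict: read access, plus write/delete when cleanup is on.

    The action names are real IAM actions, but which ones the service
    actually needs is an assumption made for this sketch.
    """
    object_actions = ["s3:GetObject"]
    if allow_cleanup:
        # Only needed when Clean Up Original Files is enabled.
        object_actions += ["s3:PutObject", "s3:DeleteObject"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object-level actions apply to keys inside the bucket.
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# Print a policy suitable for a task with Clean Up Original Files enabled.
print(json.dumps(build_policy("mybucket", allow_cleanup=True), indent=2))
```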

Creating an S3 Integration Task

Step 1: Basic Info

  1. Navigate to Data > Data Integration and click Create Task.

  2. Select an S3 data source, then configure the basic settings:

    | Field | Required | Description |
    | --- | --- | --- |
    | Data Source | Yes | Select an existing Amazon S3 - Credentials data source from the dropdown |
    | Name | Yes | A name for this integration task |
    | File Path | Yes | S3 URI with optional wildcard pattern (e.g., s3://mybucket/data/2025-*.csv) |
    | File Type | Auto | Auto-detected from file extension. Supported: CSV, Parquet, NDJSON |

CSV Options

When the file type is CSV, additional options are available:

| Field | Default | Description |
| --- | --- | --- |
| Record Delimiter | \n | Line separator. Options: \n, \r, \r\n |
| Field Delimiter | , | Column separator. Supports custom values |
| Has Header | Yes | Whether the first row contains column names. If disabled, columns are auto-named as c1, c2, c3, etc. |
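To make the effect of these options concrete, here is a minimal Python sketch that applies the same three settings to a small in-memory CSV string. The function `parse_csv` is hypothetical and only approximates the behavior described in the table; it is not how TiDB Cloud Lake parses files internally.

```python
import csv


def parse_csv(text, record_delimiter="\n", field_delimiter=",", has_header=True):
    """Return (column_names, data_rows) for a CSV string.

    Simplification: records are split on the delimiter before parsing, so
    quoted fields containing the record delimiter are not supported here.
    Python's csv module also requires a single-character field delimiter.
    """
    # Drop empty records, such as the one after a trailing delimiter.
    records = [r for r in text.split(record_delimiter) if r != ""]
    rows = list(csv.reader(records, delimiter=field_delimiter))
    if not rows:
        return [], []
    if has_header:
        return rows[0], rows[1:]
    # No header: name the columns c1, c2, c3, ... as the table describes.
    names = [f"c{i}" for i in range(1, len(rows[0]) + 1)]
    return names, rows


columns, data = parse_csv(
    "1;alice\r\n2;bob\r\n",
    record_delimiter="\r\n",
    field_delimiter=";",
    has_header=False,
)
print(columns, data)
```

With Has Header disabled, the first record is treated as data and the columns receive generated names.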

File Path Patterns

The file path supports wildcard patterns for matching multiple files:

s3://mybucket/data/2025-*.csv        # All CSV files starting with "2025-"
s3://mybucket/logs/*.parquet         # All Parquet files in the logs directory
s3://mybucket/events/data.ndjson     # A single specific file
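If you want to check a pattern against a list of object keys before creating a task, a sketch along these lines may help. It uses Python's `fnmatch` as a stand-in matcher, which is an assumption: the service's exact wildcard semantics are not documented on this page, and `fnmatch` lets `*` match `/` and also treats `?` and `[...]` as special. The helper names are hypothetical.

```python
from fnmatch import fnmatchcase


def split_s3_uri(uri):
    """Split 's3://bucket/key-or-pattern' into (bucket, key_pattern)."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket, _, key_pattern = uri[len("s3://"):].partition("/")
    return bucket, key_pattern


def match_keys(pattern_uri, keys):
    """Return the keys that match the pattern part of the URI (case-sensitive)."""
    _, key_pattern = split_s3_uri(pattern_uri)
    return [k for k in keys if fnmatchcase(k, key_pattern)]


# Example keys; in practice these would come from listing the bucket.
keys = [
    "data/2025-01.csv",
    "data/2024-12.csv",
    "data/2025-02.csv",
    "logs/app.parquet",
]
print(match_keys("s3://mybucket/data/2025-*.csv", keys))
```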

Step 2: Preview Data

After configuring the basic settings, click Next to preview the source data.


The system reads the first matching file and displays:

  • Sample data with column names and types
  • A list of matching files (up to 25 files) with their sizes

Step 3: Set Target Table

Configure the destination in TiDB Cloud Lake:

| Field | Description |
| --- | --- |
| Warehouse | Select the target TiDB Cloud Lake warehouse for running the import |
| Target Database | Choose the target database in TiDB Cloud Lake |
| Target Table | The table name in TiDB Cloud Lake |


The system auto-detects columns from the source files. You can review and edit column names and types before proceeding.

Ingestion Options

| Option | Default | Description |
| --- | --- | --- |
| Continuous Ingestion | On | When enabled, the system periodically (every 30 seconds) polls the S3 path and imports new files |
| Error Handling | Abort | Abort: Stop on first error. Continue: Skip failed rows and continue importing |
| Clean Up Original Files | Off | When enabled, deletes source files from S3 after successful import |
| Allow Duplicate Imports | Off | When enabled, allows re-importing files that have already been imported |

Click Create to finalize the integration task.

Task Behavior

| Continuous Ingestion | Behavior |
| --- | --- |
| On | Runs continuously, polling S3 every 30 seconds for new files and importing them automatically. |
| Off | Imports matching files once and stops. Already-imported files are skipped unless Allow Duplicate Imports is enabled. |

Advanced Configuration

Continuous Ingestion

When enabled, the task runs as a long-lived process that periodically scans the S3 path for new files. Each cycle:

  1. Lists objects matching the file path pattern
  2. Identifies new files not yet imported
  3. Imports new files into the target table using COPY INTO
  4. Records import results in the task history

This is useful for data pipelines where upstream systems continuously write new files to S3.
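The four steps of a cycle can be sketched in Python as follows. This is a simplified model under stated assumptions, not the service's implementation: `list_objects` and `import_file` are hypothetical callables standing in for the S3 listing and the COPY INTO step, and import tracking is reduced to an in-memory set.

```python
POLL_INTERVAL_SECONDS = 30  # polling interval stated on this page


def run_cycle(list_objects, import_file, imported, history):
    """Run one ingestion cycle and return the files imported in it."""
    matching = list_objects()                                # 1. list matching objects
    new_files = [k for k in matching if k not in imported]   # 2. find new files
    for key in new_files:
        row_count = import_file(key)                         # 3. stand-in for COPY INTO
        imported.add(key)
        history.append({"file": key, "rows": row_count})     # 4. record the result
    return new_files


# A dict of key -> row count plays the role of the bucket in this sketch.
bucket = {"data/2025-01.csv": 3}
imported, history = set(), []

print(run_cycle(lambda: sorted(bucket), bucket.get, imported, history))
bucket["data/2025-02.csv"] = 5  # an upstream system writes a new file
print(run_cycle(lambda: sorted(bucket), bucket.get, imported, history))
```

A long-lived task would repeat `run_cycle` and wait `POLL_INTERVAL_SECONDS` between iterations; the second call above picks up only the newly written file.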

Error Handling

  • Abort (default): The import stops at the first error encountered. Use this when data quality is critical and you want to investigate any issues before proceeding.
  • Continue: Skips rows that cause errors and continues importing the remaining data. Use this when partial imports are acceptable and you want to load as much valid data as possible.
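The difference between the two modes can be illustrated with a small Python sketch. The function `load_rows` and its `on_error` parameter are hypothetical names chosen for this example. The sketch also does not model what happens to rows loaded before an abort, since this page does not say whether they are rolled back.

```python
def load_rows(raw_rows, parse, on_error="abort"):
    """Parse rows one by one, either stopping at or skipping bad rows.

    Returns (loaded_rows, skipped_count). In "abort" mode the first
    parsing error is re-raised to the caller.
    """
    loaded, skipped = [], 0
    for raw in raw_rows:
        try:
            loaded.append(parse(raw))
        except ValueError:
            if on_error == "abort":
                raise          # Abort: stop at the first error
            skipped += 1       # Continue: skip the row and keep going
    return loaded, skipped


rows = ["1", "not-a-number", "3"]
print(load_rows(rows, int, on_error="continue"))

try:
    load_rows(rows, int, on_error="abort")
except ValueError as exc:
    print("import aborted:", exc)
```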

Clean Up Original Files (PURGE)

When enabled, source files are deleted from S3 after they are successfully imported into TiDB Cloud Lake. This helps manage storage costs and prevents reprocessing. Ensure your AWS credentials have s3:DeleteObject permission on the target bucket.

Allow Duplicate Imports (FORCE)

By default, the system tracks which files have been imported and skips them in subsequent runs. Enabling this option forces re-import of all matching files, regardless of whether they have been previously imported. This is useful when you need to reload data after schema changes or data corrections.
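How the two options above interact with file tracking can be sketched in Python. This is an illustrative model only: `import_run`, `force`, and `purge` are hypothetical names mirroring the FORCE and PURGE labels, and a plain dict stands in for the S3 bucket.

```python
def import_run(bucket, imported, force=False, purge=False):
    """Simulate one import run and return the files it processed.

    bucket   -- dict of object key -> contents (stand-in for S3)
    imported -- set of keys already imported in earlier runs
    """
    # Without force, previously imported files are skipped.
    targets = [k for k in sorted(bucket) if force or k not in imported]
    for key in targets:
        imported.add(key)      # stand-in for loading the file
        if purge:
            del bucket[key]    # stand-in for deleting the S3 object
    return targets


bucket = {"a.csv": "...", "b.csv": "..."}
imported = {"a.csv"}

print(import_run(bucket, imported))                          # only b.csv is new
print(import_run(bucket, imported, force=True))              # both files reloaded
print(import_run(bucket, imported, force=True, purge=True))  # reloaded, then deleted
print(bucket)                                                # the bucket is now empty
```

Note that with cleanup enabled, successfully imported files no longer exist in the bucket, so there is nothing left for a later forced re-import to pick up.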
