Data Loading
Jan 21, 2026

Optimize COPY Commands: Load Data 3x Faster, Use 50% Fewer Credits

File size, compression, and parallel loading significantly impact load performance and costs. Discover the optimal configuration for your data pipeline.

Raj
CEO, MaxMyCloud


The Problem

Many organizations use default COPY settings, loading data inefficiently. This results in slow loads, wasted compute, and higher costs than necessary.

File Size Optimization

  • Too Small (<100MB): excessive per-file overhead and poor parallelization
  • Optimal (100-250MB compressed): best parallelization (a quick check of staged file sizes is shown after this list)
  • Too Large (>1GB): reduced parallelization and higher retry costs
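
To confirm that your staged files fall in this range, list the stage and look at the size column (reported in bytes). A minimal check, reusing the stage path from the COPY example later in this post:

LIST @my_stage/path/;
-- Returns name, size (bytes), md5, and last_modified for each staged file.
-- Sizes well under ~100MB suggest merging files; sizes over ~1GB suggest splitting.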

Compression Matters

GZIP compression typically provides a 3-5x size reduction. Loading the same dataset uncompressed versus compressed:

  • 10GB uncompressed: 15 minutes, 20 credits
  • 2GB compressed: 8 minutes, 10 credits
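
If the source system does not compress the files for you, SnowSQL's PUT command can gzip them as they are uploaded to the stage. A minimal sketch, assuming local files under a hypothetical /data/daily/ path and the same stage used in the COPY example below (PUT runs from SnowSQL or a client driver, not from the web worksheet):

-- Upload local CSVs to the stage, gzipping each file on the way up
PUT file:///data/daily/*.csv @my_stage/path/
    AUTO_COMPRESS = TRUE  -- files land in the stage as .csv.gz
    PARALLEL = 8;         -- number of upload threads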

Optimal COPY Configuration

COPY INTO target_table
FROM @my_stage/path/
FILE_FORMAT = (
    TYPE = CSV
    COMPRESSION = GZIP
    FIELD_DELIMITER = ','
    SKIP_HEADER = 1
)
SIZE_LIMIT = 250000000 -- cap (in bytes) on the total data loaded by this COPY
ON_ERROR = CONTINUE
PURGE = TRUE;

Note that SIZE_LIMIT caps the total amount of data a single COPY statement loads; it does not split files. Splitting data into 100-250MB files has to happen before the files are staged.
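
Once a load has run, the COPY_HISTORY table function shows how the work was spread across files, which makes oversized or undersized files easy to spot. A sketch against the target table above, looking back 24 hours:

SELECT file_name,
       file_size,            -- bytes; ideally clustered around 100-250MB compressed
       row_count,
       status,
       first_error_message
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'TARGET_TABLE',
    START_TIME => DATEADD('hour', -24, CURRENT_TIMESTAMP())))
ORDER BY last_load_time DESC;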

Best Practices

  1. File Size: 100-250MB compressed per file
  2. Compression: Always use GZIP or ZSTD
  3. Parallelization: Split large files before loading
  4. Error Handling: Use ON_ERROR = CONTINUE for resilience (see the validation query after this list)
  5. Cleanup: Always use PURGE = TRUE
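
Because ON_ERROR = CONTINUE silently skips bad rows, it is worth checking what was rejected after each load. One way is Snowflake's VALIDATE table function against the most recent COPY into the table:

-- Rows rejected by the most recent COPY into target_table
SELECT error, file, line, rejected_record
FROM TABLE(VALIDATE(target_table, JOB_ID => '_last'));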

Real-World Example

A data engineering team was loading 100GB of daily data using 10 large uncompressed files (10GB each). Loads took 45 minutes and consumed 60 credits. By compressing files with GZIP (reduced to 20GB total), splitting into 80 files of 250MB each, and optimizing COPY settings, load time dropped to 12 minutes and credit consumption to 20 credits - a 67% cost reduction.

Key Takeaways

  • Optimal file size is 100-250MB compressed
  • Always use compression (GZIP or ZSTD)
  • More files = better parallelization (to a point)
  • Use PURGE = TRUE to clean up staged files


Start Optimizing Your Snowflake Costs Today

Uncover hidden inefficiencies and start reducing Snowflake spend in minutes, with no disruption and no risk.