Build a Lightweight Serverless ETL Pipeline to Iceberg Tables with AWS Lambda × Athena

This content originally appeared on DEV Community and was authored by Aki

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

Previously, I published two articles on ETL implementations using AWS Lambda × DuckDB × PyIceberg and AWS Glue Python Shell × DuckDB × PyIceberg:

ETL with AWS Lambda × DuckDB × PyIceberg
https://zenn.dev/penginpenguin/articles/77d4a9b1e90e3a

ETL with AWS Glue Python Shell × DuckDB × PyIceberg
https://zenn.dev/penginpenguin/articles/5072c1270f33f2
https://dev.to/aws-builders/lightweight-etl-with-aws-glue-python-shell-duckdb-and-pyiceberg-153l

In both implementations, PyIceberg played a key role in registering and updating data.
In this article, I will introduce a simpler approach that leverages Athena without using PyIceberg.

Architecture used in this article

In this setup, an S3 file upload triggers a Lambda function, which runs an Athena query and stores the transformed data in Apache Iceberg format back to S3.

Key Points

  • By executing Athena asynchronously from Lambda, you can handle workloads that exceed Lambda's 15-minute execution limit.
  • The S3 bucket for file uploads must be registered as an external table beforehand.

For example, you can create the external table with the Classification set to Parquet and a schema matching the target Iceberg table.
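A minimal DDL sketch for such an external table (the column names follow the NYC taxi dataset used later in this article; the bucket path is a placeholder, so adjust both to your schema):

```sql
-- Hypothetical external table over the upload bucket,
-- with Classification = Parquet and columns matching the target Iceberg table.
CREATE EXTERNAL TABLE yellow_tripdata_incoming (
    VendorID int,
    tpep_pickup_datetime timestamp,
    tpep_dropoff_datetime timestamp,
    passenger_count bigint,
    trip_distance double
)
STORED AS PARQUET
LOCATION 's3://your-bucket/yellow_tripdata/'
TBLPROPERTIES ('classification' = 'parquet');
```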

Sample Code

I prepared both synchronous and asynchronous versions of the Lambda function.

  • Synchronous execution: waits for Athena query results. It may hit Lambda's timeout for heavy workloads but allows easier error handling.

  • Asynchronous execution: does not wait for Athena results, avoiding Lambda timeout issues, but requires careful planning for error handling.

To use synchronous execution, call wait_for_query_completion().
To use asynchronous execution, comment out this call.

import boto3
import json
import time
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    try:
        athena_client = boto3.client('athena')
        DATABASE = 'icebergdb'
        OUTPUT_LOCATION = 's3://your-bucket/test/'

        # Get bucket name and key from the S3 event
        # (object keys arrive URL-encoded in S3 notifications, so decode them)
        s3_bucket = event['Records'][0]['s3']['bucket']['name']
        s3_key = unquote_plus(event['Records'][0]['s3']['object']['key'])
        s3_path = f"s3://{s3_bucket}/{s3_key}"

        print(f"Processing file: {s3_path}")

        query = f"""
        INSERT INTO icebergdb.yellow_tripdata
        SELECT *
        FROM yellow_tripdata_incoming
        WHERE "$path" = '{s3_path}'
        AND VendorID = 1
        """

        # Execute query
        response = athena_client.start_query_execution(
            QueryString=query,
            QueryExecutionContext={'Database': DATABASE},
            ResultConfiguration={'OutputLocation': OUTPUT_LOCATION},
            WorkGroup='primary'
        )

        query_execution_id = response['QueryExecutionId']
        print(f"Query started: {query_execution_id}")

        # Wait for query completion (optional)
        wait_for_query_completion(athena_client, query_execution_id)

        return {
            'statusCode': 200,
            'body': json.dumps({
                'message': 'Data insertion completed successfully',
                'queryExecutionId': query_execution_id,
                'processedFile': s3_path
            })
        }

    except Exception as e:
        print(f"Error processing file: {str(e)}")
        return {
            'statusCode': 500,
            'body': json.dumps({
                'error': str(e),
                'message': 'Failed to process file'
            })
        }

def wait_for_query_completion(athena_client, query_execution_id, max_wait_time=300):
    """Function to wait for query completion"""
    start_time = time.time()

    while time.time() - start_time < max_wait_time:
        response = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
        status = response['QueryExecution']['Status']['State']

        if status == 'SUCCEEDED':
            print(f"Query {query_execution_id} completed successfully")
            return True
        elif status in ('FAILED', 'CANCELLED'):
            error_message = response['QueryExecution']['Status'].get('StateChangeReason', 'Unknown error')
            raise Exception(f"Query failed: {error_message}")

        time.sleep(5)

    raise Exception(f"Query timeout after {max_wait_time} seconds")
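For local experimentation, the handler can be exercised with a minimal S3 event notification. This sketch shows only the fields the handler actually reads (the bucket name and object key are placeholders; real notifications carry many more fields):

```python
# Minimal S3 event containing only the fields the handler above reads.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "your-bucket"},
                "object": {"key": "incoming/yellow_tripdata_2024-01.parquet"},
            }
        }
    ]
}

# The handler builds the S3 path from these two fields:
bucket = sample_event["Records"][0]["s3"]["bucket"]["name"]
key = sample_event["Records"][0]["s3"]["object"]["key"]
s3_path = f"s3://{bucket}/{key}"
print(s3_path)
```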

Results

The sample data is the commonly used NYC taxi dataset:
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

We successfully registered data into the Iceberg table (an Open Table Format) using this approach.

Optimization Tips

Partitioning

The query selects from yellow_tripdata_incoming and inserts into the Iceberg table.
Adding partitions to the incoming table can significantly improve performance.

CREATE EXTERNAL TABLE yellow_tripdata_incoming (
    VendorID int,
    tpep_pickup_datetime timestamp,
    ...
)
PARTITIONED BY (
    year int,
    month int,
    day int
)
STORED AS PARQUET
LOCATION 's3://your-bucket/yellow_tripdata/';
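Note that Athena only sees partitions that are registered in the Glue catalog. A sketch of registering a partition after files land under its prefix (the partition values and S3 path are placeholders):

```sql
-- Register a single partition explicitly.
ALTER TABLE yellow_tripdata_incoming
ADD IF NOT EXISTS PARTITION (year = 2024, month = 1, day = 15)
LOCATION 's3://your-bucket/yellow_tripdata/year=2024/month=1/day=15/';

-- Or discover all Hive-style partitions under the table location at once:
MSCK REPAIR TABLE yellow_tripdata_incoming;
```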

SQL Optimization

Best practices like selecting only necessary columns and applying early filtering can reduce query costs.
For more complex cases, consider filtering data in a subquery before inserting into the Iceberg table.
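As an illustration, a pre-filtering subquery keeps unwanted rows out of the insert; the `$path` value and the extra predicates below are placeholders:

```sql
INSERT INTO icebergdb.yellow_tripdata
SELECT *
FROM (
    -- Filter early so only matching rows flow into the insert.
    SELECT *
    FROM yellow_tripdata_incoming
    WHERE "$path" = 's3://your-bucket/incoming/sample.parquet'
      AND VendorID = 1
      AND trip_distance > 0
);
```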

Asynchronous Execution

Using asynchronous execution reduces Lambda runtime and can help optimize cost.
However, if you need to handle errors or act on results, consider integrating CloudWatch or EventBridge.
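Athena publishes query state changes to EventBridge, so failed asynchronous queries can trigger an alert without Lambda waiting for them. A minimal sketch of the matching event pattern (the rule name and target wiring are assumptions; the pattern fields follow the "Athena Query State Change" event):

```python
import json

def failed_athena_query_pattern():
    """EventBridge event pattern matching failed or cancelled Athena queries."""
    return {
        "source": ["aws.athena"],
        "detail-type": ["Athena Query State Change"],
        "detail": {"currentState": ["FAILED", "CANCELLED"]},
    }

# With boto3 (not run here), this pattern could back a rule that notifies SNS:
# events = boto3.client("events")
# events.put_rule(
#     Name="athena-etl-query-failed",  # hypothetical rule name
#     EventPattern=json.dumps(failed_athena_query_pattern()),
# )

print(json.dumps(failed_athena_query_pattern()))
```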

Operational Considerations

  • This architecture uses both an input external table and an Iceberg table, requiring management of both (e.g., schema changes).

  • Query cost and S3 scan optimization should be considered in design.

Comparison with Other Approaches

  • Lambda × Athena
      Pros: lightweight and serverless; no external libraries needed
      Cons: asynchronous failures are not caught directly; not ideal for fine-grained transformations

  • Lambda × PyIceberg
      Pros: flexible data processing inside Lambda; directly manipulates Iceberg tables
      Cons: watch for Lambda runtime limits; library management required

  • Glue Python Shell × PyIceberg
      Pros: flexible processing in the shell; supports longer and larger batch workloads
      Cons: longer startup than Lambda in some cases (cost impact)

Conclusion

In this article, I demonstrated an ETL implementation to Iceberg tables using AWS Lambda and Athena.
The main advantages of this approach are its simplicity and lightweight setup, without relying on external libraries like PyIceberg.

By executing Athena asynchronously, you can overcome Lambda's 15-minute runtime limit, enabling low-cost, flexible ETL pipelines.

For asynchronous execution, monitoring using CloudWatch, EventBridge, or SNS is crucial since Lambda cannot directly detect query failures.
Using $path filtering and proper partitioning reduces the risk of importing unintended data and minimizes query cost.

I hope this article provides a reference for those considering lightweight data processing or near-real-time ETL pipelines with Iceberg tables.

