Building a Serverless Data Ingestion – Difficulties

This is part three in a four part series on implementing a serverlessJSON based approach using AWS for data ingestion

Outlining the architecture and development process, I glossed over all of the problems and issues that had to be overcome along the way. The majority of my work life and free time isn’t spent using Python, so the majority of the issues confronted are likely to be straightforward for more experienced developers. Doing something new though, I did run into a few issues which were interesting and warrant at least jotting down for my own memory.

  1. Learning about the Docker File
  2. AWS Lambda events and layers
  3. Learning Boto3

Learning about the Docker File

When starting off with Docker, I was throwing things at the wall and seeing what stuck. Originally, I was using a standard Ubuntu image to do testing from for the final function which would be up in AWS Lambda. This was not the right approach in retrospect. I should have started with the amazonlinux image that is readily available on Docker Hub. Once understanding how to create the Docker File from that image, the next step was understanding how to get the code into the container.

The first instinct I had was to create the Docker File in a specific subdirectory of the code base. I’d have a structure like follows:

The entirety of the GitHub repo is Forsta, with subdirectories serving specific purposes.

  • Database: Contains code to create the DynamoDB database tables, and other configurations.
  • Parser: Has the code for moving the data between S3 buckets and into DynamoDB from S3 buckets. Additionally, it contains functions to clean the data and create a primary/unique key for the DynamoDB table.
  • Test: Contain all unit tests or end to end tests I would need to create. It ended up containing the function executed by the AWS Lamdba function, which needs to be rectified in the future.
  • Docker: The final directory was aiming to be Docker, which would have contained a repository of different Docker Files which would be used for different Lambda functions. That’s where I ran into some issues with pathing.

Based upon where the Docker File was in this path, I was unable to easily use the “add” command which made me unable to pull the required files into the Docker container to test my code. My recommendation, have one main area which the Docker File lives in the topmost directory of your repo (in this case, right below Forsta), and you can easily get all of the code you need into the container.

AWS Lambda

This was my first time using AWS Lambda, and it was a bit bumpy at first. My original approach was to create a class, which I would then call in Lambda. While this was basically what the end result was, the route getting there involved some discovery/mistakes.

The first time I attempted to deploy code to Lambda, I just had the class, zipped it up, and tried publishing the Lambda function. In order to use these published classes, I didn’t think through the fact that something would have to call the function, other than my test scripts.

The second time I published a function, one of my test scripts which worked in the container, to run the desired code to see if it worked. Again, this did not work out. After doing some further research, I found that the AWS Lambda function requires an event to kick off the execution of the desired code. In retrospect, this makes complete sense.

The third attempt I got right, after looking a this great tutorial. The key is to create a wrapper which accepts the right events from the AWS environment which kicks off the underlying code I was looking to execute. You can see the repo here with

from tests import test_parser

def lambda_handler(event, context):

All of this could have been averted by reading the documentation before trying to deploy. In order to get to this point, I had to refactor the directory structure a couple times (leading to code impacts), and deploy multiple times. Lesson learned, documentation is in fact worth reading.

Learning Boto3

Let’s start off with the basics. What is Boto3? Luckily, the Boto3 documentation has a simple overview on the landing page.

You use the AWS SDK for Python (Boto3) to create, configure, and manage AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). The SDK provides an object-oriented API as well as low-level access to AWS services.

– Boto3 Documentation

This library underpins everything that was done as part of the effort. Really, the complications came in the form of understanding how to get the data cleaned in a format that would be useful to have in a DynamoDB table. Trying to get the data to where I needed it was easy.

This can be seen here primarily in the ingester class located here and pictured below.

import boto3
import time as time
import gzip
import json
from io import BytesIO

class ingester():
 #def __init__(self):
 #print the name of all the buckets the configured account has access to
 def s3_list_buckets(self):
  for bucket in boto3.resource('s3').buckets.all():
   response_dict = boto3.client('s3').list_objects(
   #ensures bucket has content before trying to pull content info out
    print('No objects in ' + + ' exist.')
    objs_contents = response_dict['Contents']
    #unnecessary, good for reference
    #for i in range(len(objs_contents)):
    # file_name = objs_contents[i]['Key']
    # print(file_name)

# Read data file from S3 location
# Unpack/Unzip into JSON
# Load to landing bucket location
 def copy_object(self,source_bucket,object_key,target_bucket):
   target_object = object_key + str(time.time())
   copy_source = {
    'Bucket' : source_bucket,
    'Key' : object_key
   s3 = boto3.resource('s3')
   landing_bucket = s3.Bucket(target_bucket)
    landing_bucket.copy(copy_source, target_object)
   except Exception as ex:
    print('Success! Object loaded to: ' + target_object)
    return (target_object)

# turns the data contained in the s3 gzip compressed file to text document
 def convert_object(self,target_bucket,target_key):
   data = []
   s3_client = boto3.client('s3')
   read_object = s3_client.get_object(
     Bucket = target_bucket,
     Key = target_key
   read_byte_object = BytesIO(read_object['Body'].read()) 
   raw_data = gzip.GzipFile(None, 'rb', fileobj=read_byte_object).read().decode('ASCII') #.decode('utf-8')
   s3_client.put_object(Body=raw_data, Bucket=target_bucket,Key=target_key[target_key.rindex('/')+1:] + str(time.time())+'.txt')

Looking at the convert_object function, you can see there was quite a bit of finagling needed in order to get the required data format and move the contents into my single landing bucket. This single bucket is where I’m storing all of my information, as outlined in the architecture. After doing this project, I realized the hard part of the library, just like anything, is learning how the different functions return the data and should be used in tandem to make a coherent solution. But I will say, the documentation is great and there are a plethora of resources/blogs.

Specifically, I’ll call out the following as a great place to start when looking to get something like this off the ground and into the cloud.