Building a Serverless Data Ingestion – Development Process

This is part two in a four-part series on implementing a serverless, JSON-based approach to data ingestion using AWS:

  • Architecture: What’s the approach?
  • Development Process: How did I set up an environment that was effective and efficient for development?
  • Difficulties: What issues came up, and how did they get resolved?
  • End results: Does this architecture achieve the goals that it set out to achieve?

One of the biggest blockers to getting started with building out the serverless data ingestion was figuring out the best way to develop code that could be deployed to the different AWS services being used. Traditionally I’ve deployed code to a central server or cluster where everything could be tested and promoted: deploy to a server, test on the server, then move to a production server (or the location on the same server where production files/code live). What happens when there is no server?


I’d put off learning Docker for quite a while because of the complexity it introduces, but in this case, being able to replicate the environment Lambda functions run in was the first time Docker clicked for me. Loosely following the excellent tutorial from Nicola Pietroluongo located here, I stumbled my way through creating my first Dockerfile, resulting in the code below, which can be found here on GitHub.

FROM amazonlinux
RUN yum update -y
RUN yum install python3 -y
RUN yum install nano -y
RUN yum install zip -y
RUN yum install unzip -y

#AWS CLI Installation
#RUN curl "" -o ""
#RUN unzip
#RUN ./aws/install

#Copy the current directory into the container
ADD . /user/src
#Install boto3 alongside the function code so it can be zipped for Lambda
RUN pip3 install boto3 -t /user/src/Forsta/Parser

#Pull base image
#FROM ubuntu:latest

#Installation packages
#RUN apt-get update
#RUN apt-get install -y curl
#RUN apt-get install -y unzip
#RUN apt-get install -y python3
#RUN apt-get update
#RUN apt-get install -y python3-pip
#RUN pip3 install boto3
#RUN apt-get install nano

#AWS CLI installation
#RUN curl "" -o ""
#RUN unzip
#RUN ./aws/install

#Adds the current directory into the container home directory.
#ADD . /home

While tinkering with different approaches to the code, I was able to add or remove dependencies from the environment as needed (as you can see from all of the commented-out packages above). I was constantly creating and destroying environments by running two or three commands in my local command line.
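For reference, that loop was essentially just a rebuild and an interactive shell into the fresh container. A minimal sketch of those commands (the image tag here is a placeholder, not a name from the project):

docker build -t lambda-dev .
docker run -it --rm lambda-dev bash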

When developing on EC2 or other server environments, dependency management has always been a pain and has at times resulted in a bloated environment (slower, higher maintenance costs, etc.). This happens because unnecessary or never-used packages end up installed in the environment: they may have been tried while developing the code, found not to be the best solution, and then never deleted from the server. Using Docker was awesome because each time I iterated on the code, I was spinning the environment up from the Dockerfile and commenting out the dependencies that weren’t needed, which prevented this issue entirely.


The development of the code was all done in Visual Studio Code, and once the code was ready for unit tests, the Dockerfile above would be run. The Python code, along with all of its dependencies, is placed in the container at /user/src/Forsta/Parser, as specified in the Dockerfile. If the code produced the desired outcome, I then zipped the files along with their dependencies.

This zip file is what we ultimately wanted to get into a Lambda function. Once the zip file was present in the container I had spun up, I pulled it down to my local machine and uploaded it through the AWS Management Console (this could all be automated). It was then ready to execute, since I’d already set up the correct account through IAM.
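For anyone wanting to script that last step, a minimal sketch using boto3 is below; the function name and zip path are placeholder assumptions, not the actual values from this project:

import boto3

# Push a local deployment package to an existing Lambda function.
# 'forsta-parser' and 'parser.zip' are hypothetical placeholders.
lambda_client = boto3.client('lambda', region_name='us-east-1')

with open('parser.zip', 'rb') as package:
    response = lambda_client.update_function_code(
        FunctionName='forsta-parser',
        ZipFile=package.read()
    )

print('Deployed at: ' + response['LastModified'])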

The actual code getting executed is located here, and shown below.

from tests import test_parser

def lambda_handler(event, context):
    # Entry point that Lambda invokes; the full handler body and the
    # parser it exercises are in the linked repo.
    ...
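For context, if a handler like this is wired to an S3 trigger (one common way to kick off this kind of ingestion, though the exact trigger isn’t shown here), the bucket and object key come in through the event argument. A minimal sketch:

def lambda_handler(event, context):
    # An S3-triggered event contains one record per uploaded object
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        print('New object: ' + bucket + '/' + key)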

Code Repo

Most people are familiar with GitHub at this point. I used GitHub Desktop and maintained the code base entirely on the main branch. Nothing fancy here, as I was working alone on this and went with the quickest solution. As a side note, one item worth mentioning is that I started this by picking up a code repo from a few years ago (as shown below). I’ve had multiple computers and storage media die since then, but being able to pick up the repo and see its history was super useful.

Even if the code had been stored on my local C drive, who knows if I’d have been able to find it or remember why certain things were done. Being able to go through the version history and compare files between commits helped greatly in refreshing my memory and picking the code back up to create these pipelines.

DynamoDB

From an AWS standpoint, nothing too special. Everything was set up manually, but, just like the deployment process, it could all be automated. I did script out the creation of the landing table, as shown below and available in the repo here.

import boto3

def create_landing_table():
    dynamodb = boto3.resource('dynamodb', region_name='us-east-1')

    # Table name is not shown in the original snippet; 'landing' is a placeholder
    landing_table = dynamodb.create_table(
        TableName='landing',
        KeySchema=[
            {'AttributeName': 'uuid', 'KeyType': 'HASH'},
            {'AttributeName': 'upload_date', 'KeyType': 'RANGE'}
        ],
        AttributeDefinitions=[
            {'AttributeName': 'uuid', 'AttributeType': 'S'},
            {'AttributeName': 'upload_date', 'AttributeType': 'S'}
        ],
        # On-demand capacity, matching the pricing discussed in this series
        BillingMode='PAY_PER_REQUEST'
    )

    print('Table Status: ', landing_table.table_status)

In the future, I’m hoping to parameterize the creation of tables as needed. Because this is a document database, all that needs to be defined up front is the unique identifier. Eventually, I’ll also parameterize the creation of sort keys as necessary for performance.
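A minimal sketch of what that parameterization might look like, assuming the same on-demand billing as the landing table (names and defaults here are illustrative, not the final design):

import boto3

def create_table(table_name, partition_key, sort_key=None):
    dynamodb = boto3.resource('dynamodb', region_name='us-east-1')

    # Build the key schema and attribute definitions from the parameters
    key_schema = [{'AttributeName': partition_key, 'KeyType': 'HASH'}]
    attributes = [{'AttributeName': partition_key, 'AttributeType': 'S'}]
    if sort_key:
        key_schema.append({'AttributeName': sort_key, 'KeyType': 'RANGE'})
        attributes.append({'AttributeName': sort_key, 'AttributeType': 'S'})

    return dynamodb.create_table(
        TableName=table_name,
        KeySchema=key_schema,
        AttributeDefinitions=attributes,
        BillingMode='PAY_PER_REQUEST'
    )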

With all of the above, you now have an idea of how I developed on my local machine, deployed code to Lambda, and set up my final landing table in DynamoDB. If you missed the first post in the series, which provides an overview of what I was trying to build, you can find that post here.

Building a Serverless Data Ingestion – Architecture

Data and analytics always seem to start with the same problem: how do we get the data where it’s needed so that we can start getting insights? The problem isn’t getting the data from point A to point B, but doing so in a way that is easy, cost-effective, reliable, and appropriately scalable for the use case. With the rise of the different cloud providers and their toolsets, I thought it would be fun to take a swing at implementing a serverless, JSON-based approach using AWS.

This will be a series of articles broken down into the following:

  • Architecture: What’s the approach?
  • Development Process: How did I set up an environment that was effective and efficient for development?
  • Difficulties: What issues came up, and how did they get resolved?
  • End results: Does this architecture achieve the goals that it set out to achieve?

The architecture plan is outlined below. We’ll go into each of the boxes in detail, but first let’s frame the use case for this project:

I want a solution that can be used in my personal data projects, can scale up to N data ingestion pipelines as needed, and is cheap to operate.

With that goal in mind, the solution uses technologies that support these objectives:

  1. Scalability: As fully managed services, all of these technologies can scale from gigabytes to terabytes of data automatically. Additionally, the Lambda Python functions that have been written are entirely serverless.
  2. Cost: Cost is entirely usage-based, so if nothing is running, all I’m paying for is the storage of persistent data. With DynamoDB’s on-demand capacity mode, storage runs $0.25 per GB-month (so, for example, 10 GB of landed data costs about $2.50 a month), which makes this service an extremely affordable landing location before moving data into Snowflake.
  3. Upkeep/maintenance: Everything but the data layer is serverless, so there is no EC2 to keep up: no patching, no server statuses to monitor, and, in the worst case, no script kiddies getting into an unprotected server in my VPC and forcing me to start over from scratch.

So, pretty straightforward from an overall technology standpoint, right? The other item to note is how the Lambda functions are written in Python. The idea behind the S3 bucket structure is to funnel all of the data for ingestion into a single location and ensure that the data is in a similar format before being landed in DynamoDB.

With the Lambda functions in the GitHub repo here, we ensure that each upload record carries a key that uniquely identifies it and its origination, so I can reuse the upload process for as many different feeds as we want, from whatever buckets we want. It’s completely configurable: point it at a bucket you own, or someone else’s bucket, and land the data in your own bucket.
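As an illustration of the kind of identifying metadata described above, a record key could be assembled like this; the uuid and upload_date fields mirror the landing table’s keys, while the source field name is an assumption for illustration:

import time
import uuid

def build_record_key(source_bucket, object_key):
    # uuid/upload_date match the landing table's partition and sort keys;
    # 'source' captures where the record originated.
    return {
        'uuid': str(uuid.uuid4()),
        'upload_date': time.strftime('%Y-%m-%dT%H:%M:%S'),
        'source': source_bucket + '/' + object_key
    }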

Here’s one of the functions, demonstrating a straightforward movement/copy operation that gets our data into a single ingestion bucket:

# Read data file from S3 location
# Unpack/Unzip into JSON
# Load to landing bucket location
# (Method from the parser class in the repo; requires `import time` and `import boto3`)
def copy_object(self, source_bucket, object_key, target_bucket):
    # Suffix a timestamp so repeated uploads of the same key don't collide
    target_object = object_key + str(time.time())
    copy_source = {
        'Bucket': source_bucket,
        'Key': object_key
    }
    s3 = boto3.resource('s3')
    landing_bucket = s3.Bucket(target_bucket)
    try:
        landing_bucket.copy(copy_source, target_object)
    except Exception as ex:
        print('Copy failed: ' + str(ex))
        raise
    print('Success! Object loaded to: ' + target_object)
    return target_object

After this, it’s a matter of moving the data along the layers with our Lambda functions, manipulating it as necessary, and ending up with that data inside DynamoDB. The idea here being that, once the required functions are built out in Lambda, the core Python classes they use can load data from as many sources as we want, as long as the sources are similar, as sketched below.
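A minimal sketch of what that reuse could look like: drive the same copy routine from configuration so a new feed only needs a new entry (bucket names here are placeholders):

import boto3

s3 = boto3.client('s3')

# Hypothetical feed configuration; each entry is one ingestion source
FEEDS = [
    {'source': 'vendor-a-drop', 'target': 'ingestion-landing'},
    {'source': 'vendor-b-drop', 'target': 'ingestion-landing'},
]

for feed in FEEDS:
    listing = s3.list_objects_v2(Bucket=feed['source'])
    for obj in listing.get('Contents', []):
        # Copy every object from the source feed into the single landing bucket
        s3.copy_object(
            Bucket=feed['target'],
            Key=obj['Key'],
            CopySource={'Bucket': feed['source'], 'Key': obj['Key']}
        )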

As an example, do you have customer data being sent from many different sources in slightly different formats? We can get that data into a single DynamoDB table to load into our relational Snowflake database for analytics, or access the data directly using DynamoDB’s API. All of the data in this example lands in a single table and can be identified by source for individual processing/analytics.
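For example, pulling one source’s records back out of the shared landing table might look like the following; the table name and the source attribute are assumptions for illustration:

import boto3
from boto3.dynamodb.conditions import Attr

# 'landing' and the 'source' attribute are hypothetical names
dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamodb.Table('landing')

# Scan for all records that originated from a single feed
response = table.scan(FilterExpression=Attr('source').eq('vendor-a-drop'))
for item in response['Items']:
    print(item['uuid'], item['upload_date'])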

Although this all sounds straightforward, developing this architecture was truly easier than other side projects/tinkering I’ve done, thanks to the tools available for developing Lambda functions and interacting with AWS infrastructure. In the next section I’ll talk about the tools I used, how code was deployed, and a few other relevant items that made all of this easier to do than expected.