Information Technology

AI for Analysis in 2026

It’s been a long time since I’ve looked at using AI tools for data analysis. Frankly, after playing around with Chat GPT in 2023 it made me skeptical that these tools would be useful in the near term. Thinking back to 2023, we were at the start of the true AI boom. Chat GPT was just released in November of 2022 and entering the public zeitgeist. In the past few years we’ve seen an explosion of AI features in the tools and technologies we use in our personal and professional lives.

In this post we’ll be attempting to use Claude as a data analyst to perform hypothesis testing and analysis, and measuring it against our 2023 Chat GPT Analysis findings.

Can Claude act as an analyst for us?
Are the issues we found in 2023’s Chat GPT model resolved?
How could we use this going forward?

Can Claude perform as an analyst?

When we looked at Chat GPT for data analysis in 2023 we found it to be quick, but lacking the ability to understand context and detect data quality issues. Let’s see how Claude measures up against the 2023 Chat GPT model. To do this analysis we will be using the free tier of Claude and the Sonnet 4.6 model. This model is listed as “The best combination of speed and intelligence“.

Using the Claude desktop client we’ll now get our hypothesis tested. Using the desktop client you have the ability to upload data to a project, and then query that dataset within the project. The interface is intuitive and extremely straightforward. No tagging or explaining of the dataset was done to produce the output. Just upload the data, and prompt the model.

Prompt:

“You are an analyst who has been provided traffic datasets for Texas. These tell us by date and time how many traffic accidents occurred in Texas. I would like you to provide a response that supports or rejects this hypothesis: Traffic accidents were lower in 2020 due to Covid.”

Output Received:

This is a large step up from what I expected to be standard for the model. Not only did it produce a basic visualization, it directly provided KPI’s that would be used to validate the hypothesis.

In the details section below, they also provide the more in depth explanation. Not only does it walk me through the data, it also calls out the partial years of 2016 and 2023, removing them from the detailed analysis. Finally, and most importantly, it gives a clear explicit response to the hypothesis and went a level deeper than expected.

In the full response, detailed explanation of the factors that impact accepting or rejecting the hypothesis.

Conclusion

Can Claude act as an analyst?

Yes. For a basic analysis and hypothesis testing Claude is able to clearly analyze data, provide clarity into how it arrived there, and summarize data.

Are the issues we found in 2023’s Chat GPT model resolved?

The Claude model used for this analysis was clearly better. It caught the anomaly with the partial years of data and excluded them from the conclusion. Claude also gave in depth analysis of data points that both supported and refuted the hypothesis. Also, while not an issue in the original model, the visualization provided as a response to the prompt was useful for framing up KPIs and showing the trend over time.

How could we use this going forward?

The Claude model seems well positioned to be a tool for both experienced analysts doing a first pass of a dataset, and users who aren’t trying to do anything too complex with data. For standalone datasets that are small and generally simple this could expedite the time to insight and enable less data focused users to get valuable insights with the tools they have on hand.

Chat GPT for Data Analysis: Part 2

In the previous blog post we took a look at Chat GPT and it’s capabilities to perform simple analysis on data. For this next article, we’re going to look at using Chat GPT for some more advanced analysis. As we walk through this attempt, all resources used and analysis created will be made available.

The Data Set

For this analysis we’ll be using a dataset from Kaggle, titled US Accidents (2016-2023) which provides us with a nationwide view of all traffic accidents reported between 2016-2023. All citations can be found at the bottom of this article.

The dataset is a total of 3GB, but for the purpose of this analysis we will be focusing on Texas only data for ease of use.

Hypothesis Being Tested

2020 will have a lower total volume of accidents due to Covid.
Higher volumes of crashes during rush hour in both urban and rural parts of the state.

Hypothesis 1: 2020 will have a lower total volume of accidents due to Covid.

To start of with the descriptive analysis, we begin by loading the data into a Tableau dashboard. The reason for this is that the size of the data is ~3GB, which is too big to handle with straight excel analysis. Plugging into Tableau, we are able to build visuals and a way to extract the data for analysis as necessary.

Tableau and Excel

Looking at the dashboard, we can see 2020 is the third lowest in accidents. This raises the question, why do 2016 and 2023 have such low rates of accidents? What we’ll be looking for is any large gaps/inconsistencies in the data.

My hunch is that there is partial reporting in both 2023 and 2016. To prove this, I’ll pop Tableau back open to add the data into the dashboard for analysis. Below you can see the data with month added in, and the assumption was correct. There is partial data in 2016 and 2023.

This shows that 2020 did in fact have lower volumes of accidents. Looking at the months when Covid lockdowns were occurring (March-July), we can see that accidents rates were lower than previous and later years suggesting in part that this was due to Covid. In order to prove causation we would have to go a lot more in depth, which is beyond what I’m looking to do right now.

Chat GPT

Plugging in the data that can be found in the “Raw Data” section of the Tableau dashboard, let’s try to enter the prompt and see what it comes up with. The full chat log can be found by clicking here.

*Question is answered effectively, but doesn’t catch the anomaly*

As you can see based upon the above prompt and response, Chat GPT arrived at the same conclusion we did with our Tableau dashboard with about 10% of the effort. No visualization, no pivot table creation and interpretation. Chat GPT did not catch the anomaly in the data however.

When asked to check for anomalies and re-analyze the data, the same conclusion we arrived at in Tableau and excel is generated by Chat GPT.

Hypothesis 2: We can clearly see higher volumes of crashes during rush hour in both urban and rural parts of the state.

For the first step in this process we have to identify whether a county is urban and rural. This is not identified in the core data set being used, but is readily available in the US census information. We’ll come up with a rough rule that any county with a 2020 population of less than 50,000 is rural.

Tableau and Excel

Performing this one time analysis in Tableau and Excel was relatively easy.

1. Upload county information to the Tableau dashboard and joined with existing excel data. (~5 minutes)

2. Create LOD and calculated fields to provide the variables (~15 minutes)

I had to create quite a few calculated fields and an LOD expression to handle the join between the accident data and the county information (different granularity).

*Provides the number of hours, count of accidents, and population*

3. Saved the data to excel (~2 minutes)

4. Perform the calculation myself using formulas (~5 minutes)

Chat GPT

We start by uploading the two datasets, and having Chat GPT featurize the county information to identify whether a county is urban or rural. After that, it’s as easy as asking the questions and getting the response. The data is already joined together correctly using the county column from both datasets.

The original response given by the tool is below. Which is too straightforward of an interpretation, as it does not take into account population numbers or the duration of rush hour (6 hours out of the day). We need a different metric than what was provided by Chat GPT.

*Too simple, and not doesn’t account for other variables*

Let’s re-prompt with average accidents per an hour for 100,000 in population.

While this would be a seemingly straightforward request, unfortunately Chat GPT did not perform well. It took about 30 minutes to get Chat GPT to use the right numbers and make the right assumptions. When you compare something like excel with Chat GPT for a simple model where assumptions need to be made, there’s no doubt in my mind that excel is more efficient. In order to troubleshoot Chat GPT and steer towards the correct answers, I had to spin up an excel.

Conclusion:

Chat GPT continues to prove effective for basic exploration of data, and even joining datasets together for more complex analysis. When it came to the assumptions made for the calculation, it did however miss the mark and have to be reinforced multiple times, and in some cases corrected. While we eventually arrived at the right conclusion, I found dealing with these issues in Chat GPT uncomfortable. It was hard to tell how much more time it would take to get Chat GPT corrected due to sometimes having to correct the same issue multiple times.

Appendix

Tableau
Kaggle Data Set
- Citations:
  - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, 2019.
  - Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. “Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.” In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.
US Census Dataset
Chat GPT Log #1 – Correct Answer
Chat GPT Log #2

Chat GPT for Data Analysis

AI is everywhere nowadays. It is going to replace all the office jobs, and make Google obsolete. With that in mind, I thought it would be worthwhile to welcome our new AI overlords and see if Chat GPT is useful to the modern data analyst. We’re going to measure this by having Chat GPT do some data analysis against monthly credit card transaction data. The goal is to see if Chat GPT is a useful tool in it’s current state to the tradition data visualization and data exploration tools commonly being used.

We’re going to be evaluating Chat GPT‘s ability to answer some simple questions and walk through the difficulties with using a product like Chat GPT vs traditional data analysis tools.

The Data

To start off with, I uploaded credit card transaction data from Chase. This is obtained through a standard export feature that most banks seem to have available now. The table has the following information:

Transaction Date: Date the transaction occurred at the point of sale.
Post Date: Date the transaction was reflected on the account.
Description: Usually contains the business and other information. Different for every vendor.
Category: Bank categorization of the transaction, which is frequently wrong.
Type: If it was a purchase or payment on the credit card. Payment being reducing the outstanding balance.
Amount: The credit/debit amount for the transaction.
Memo: Always blank for my credit card at the moment.

Loading to Chat GPT

Using Chat GPT at chat.openai.com, there are two options to loading the data. We’ll be trying out both, starting with the Free Tier.

Free Tier: Upload the data in the chat window, using Chat GPT 3.5.
Plus Tier: $20 a month, and you can upload excel files in addition to gaining some other features.

Free Tier

Uploading to the free tier was super easy, but I could imagine would cause issues with less well structured data. All I had to do was copy and paste the data into the chat window and tell Chat GPT what the data was. Looking at the video, you can see the upload is super simple. However, we run in to the first problem. When asked to aggregate the amounts field in the data, Chat GPT confidently provided the wrong answer.

*Uploading and describing data to Chat GPT*

Providing the wrong answer on the first pass isn’t great, but it isn’t a dealbreaker. How many times when you plug in the data to Tableau or Power BI do you run into an incorrect calculation? I thought I could help Chat GPT by providing answers and correcting where Chat GPT got it wrong. Unfortunately, that did not help.

At this point, I’m going to throw in the towel with the free tier. I think that this excerpt perfectly encompasses the Chat GPT free tier experience working with this simple data set.

*How can you trust a data analysis tool that does this*?

Chat GPT Plus

Using the Chat GPT plus tier allows the use of the Advanced Data Analysis feature which is currently in Beta (as of October 18th, 2023). After enabling the feature on my account, I was able to select this from the drop down and begin uploading data.

Uploading the file was super simple from a UI perspective, but there are a couple issues that would prevent me from adopting Chat GPT, or replacing Tableau entirely:

File size limit of 100 MB
You have to trust Open AI to safeguard or delete any data uploaded

On the plus side, I was blown away by the results of using Chat GPT on a simple dataset. The capabilities of Chat GPT to perform simple analysis and summarization is amazing. Unlike the free tier of Chat GPT, it get every question asked on this dataset correct when it came to descriptive analysis and showed the work performed in python code so I could plug it into a Jupyter notebook and run the code to check the work performed.

*Thorough and correct results after uploading the file*

Conclusion

For those folks who regularly extract small datasets from a dashboard, database, or application, this could be an extremely useful tool to supplement the tools already common in enterprises for analysts. Pay $20 dollars a month to automate and save time doing ad-hoc analysis. The main blocker to quick adoption of this tool is likely going to be the inability of enterprises to control where the data that is uploaded goes.

Building a Serverless Data Ingestion – End Result

In this final post, we’ll go over the final implementation of this serverless data ingestion pipeline. What is the result of all the effort put forward to build this serverless data ingestion process? I think the best way to break this down is to compare what we were originally aiming for, and what was implemented. Below you can see diagram that was created in the post outlining the overall architecture.

Pictured on the left is what was originally proposed. On the right, is what was actually implemented. Turns out the implementation pretty much stuck to the plan, with the additional enhancement of using an event to kick off the Lambda functions. This allows for everything to kick off in the appropriate sequence once a file is placed in our “ingestion-bucket-11.15.2021”.

The usage of AWS events to kick off Lambda functions was extraordinarily easy, and there’s plenty of good documentation to get started. The S3 event passes through all the metadata needed to parameterize and operate the pipeline in JSON. The documentation from Amazon, makes it super easy to access and use when setting up your Lambda functions.

Below you can see the actual code executed in the Lambda function. Notice that the variables bucket_name and file_name are both retrieved from the event.

from classes import ingester as ing
from classes import forstaparser as fp

def lambda_handler(event, context):
 bucket_name = event['Records'][0]['s3']['bucket']['name']
 file_name = event['Records'][0]['s3']['object']['key']
 target_bucket = 'landing-bucket-11.15.2021'
 upload_table = 'landing_table'
 source_type = 's3'

 if bucket_name == target_bucket:
  upload_file_name = ing.ingester.convert_object(None,bucket_name,file_name)
  raw_logs = fp.parser.read_logs(None,source_type,target_bucket,upload_file_name)
  fp.parser.dynamo_landing_load(None,upload_table,raw_logs,file_name)
 else:
  landing_ingester = ingester.ingester()
  landing_ingester.copy_bucket(bucket_name,target_bucket)
  #test_parser.t_parser()
 print("Completed")

Put simply, the function does the following. First it receives the event metadata, parses through the JSON, and obtains the bucket name that we want to transfer the files from. Notice we can point to any bucket, and it will always drop the file into ‘landing-bucket-11.15.2021’. Using the event metadata means I can reuse this Lambda function as often as I want to create a central dumping ground for staging data to be loaded.

Second, once files are put into ‘landing-bucket-11.15.2021’ another event kicks off. This event cleans the data, ensuring proper encoding (UTF-8), and then loads the data into our DynamoDB landing table. All in all pretty simple.

Below you can see everything running in action.

As we can see above, the files were automatically copied, and for posterity’s sake, we can check the cloud logs to see we have a 100% success rate in the last hour. Looking at the below we can see the result of 100% successful executions for our first Lambda function.

The next step is automatically kicked off whenever an object is PUT into the ‘landing-bucket-11.15.2021’ and loads the data into DynamoDB. With the current setup, we can see the data uploads successfully and the data is now available in DynamoDB to be ingested into whatever processes/analytics we want! The best part being that once this is setup it is automated, and auditable going forward due to the tools AWS offers.

In order to build this, it may not have seemed like a long journey. But keep in mind, in the process of building this little project out I’ve had to pickup and learn quite a few tools. Docker, Lambda, S3, IAM, Python, Boto3, and a few more tools which we’ve covered in the previous posts. If I need to do this again, it’ll be much simpler based upon what I’ve learned.

Architecture: What’s the approach?
Development Process: How did I set up my environment that was effective and efficient for developing?
Difficulties: What issues came up, and how did they get resolved?
End Result: Does this architecture achieve the goals that it set out to achieve?

Thanks for reading along!

Building a Serverless Data Ingestion – Difficulties

This is part three in a four part series on implementing a serverless, JSON based approach using AWS for data ingestion

Architecture: What’s the approach?
Development Process: How did I set up my environment that was effective and efficient for developing?
Difficulties: What issues came up, and how did they get resolved?
End results: Does this architecture achieve the goals that it set out to achieve?

Outlining the architecture and development process, I glossed over all of the problems and issues that had to be overcome along the way. The majority of my work life and free time isn’t spent using Python, so the majority of the issues confronted are likely to be straightforward for more experienced developers. Doing something new though, I did run into a few issues which were interesting and warrant at least jotting down for my own memory.

Learning about the Docker File
AWS Lambda events and layers
Learning Boto3

Learning about the Docker File

When starting off with Docker, I was throwing things at the wall and seeing what stuck. Originally, I was using a standard Ubuntu image to do testing from for the final function which would be up in AWS Lambda. This was not the right approach in retrospect. I should have started with the amazonlinux image that is readily available on Docker Hub. Once understanding how to create the Docker File from that image, the next step was understanding how to get the code into the container.

The first instinct I had was to create the Docker File in a specific subdirectory of the code base. I’d have a structure like follows:

The entirety of the GitHub repo is Forsta, with subdirectories serving specific purposes.

Database: Contains code to create the DynamoDB database tables, and other configurations.
Parser: Has the code for moving the data between S3 buckets and into DynamoDB from S3 buckets. Additionally, it contains functions to clean the data and create a primary/unique key for the DynamoDB table.
Test: Contain all unit tests or end to end tests I would need to create. It ended up containing the function executed by the AWS Lamdba function, which needs to be rectified in the future.
Docker: The final directory was aiming to be Docker, which would have contained a repository of different Docker Files which would be used for different Lambda functions. That’s where I ran into some issues with pathing.

Based upon where the Docker File was in this path, I was unable to easily use the “add” command which made me unable to pull the required files into the Docker container to test my code. My recommendation, have one main area which the Docker File lives in the topmost directory of your repo (in this case, right below Forsta), and you can easily get all of the code you need into the container.

AWS Lambda

This was my first time using AWS Lambda, and it was a bit bumpy at first. My original approach was to create a class, which I would then call in Lambda. While this was basically what the end result was, the route getting there involved some discovery/mistakes.

The first time I attempted to deploy code to Lambda, I just had the class, zipped it up, and tried publishing the Lambda function. In order to use these published classes, I didn’t think through the fact that something would have to call the function, other than my test scripts.

The second time I published a function, one of my test scripts which worked in the container, to run the desired code to see if it worked. Again, this did not work out. After doing some further research, I found that the AWS Lambda function requires an event to kick off the execution of the desired code. In retrospect, this makes complete sense.

The third attempt I got right, after looking a this great tutorial. The key is to create a wrapper which accepts the right events from the AWS environment which kicks off the underlying code I was looking to execute. You can see the repo here with

from tests import test_parser

def lambda_handler(event, context):
    test_parser.t_parser()
    print("Completed")

All of this could have been averted by reading the documentation before trying to deploy. In order to get to this point, I had to refactor the directory structure a couple times (leading to code impacts), and deploy multiple times. Lesson learned, documentation is in fact worth reading.

Learning Boto3

Let’s start off with the basics. What is Boto3? Luckily, the Boto3 documentation has a simple overview on the landing page.

You use the AWS SDK for Python (Boto3) to create, configure, and manage AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). The SDK provides an object-oriented API as well as low-level access to AWS services.
– Boto3 Documentation

This library underpins everything that was done as part of the effort. Really, the complications came in the form of understanding how to get the data cleaned in a format that would be useful to have in a DynamoDB table. Trying to get the data to where I needed it was easy.

This can be seen here primarily in the ingester class located here and pictured below.

import boto3
import time as time
import gzip
import json
from io import BytesIO

class ingester():
 #def __init__(self):
 #print the name of all the buckets the configured account has access to
 def s3_list_buckets(self):
  for bucket in boto3.resource('s3').buckets.all():
   print(bucket.name)
   response_dict = boto3.client('s3').list_objects(Bucket=bucket.name)
   print(response_dict.keys())
   #ensures bucket has content before trying to pull content info out
   try:
    response_dict['Contents']
   except:
    print('No objects in ' + bucket.name + ' exist.')
   else:
    print(response_dict['Contents']) 
    objs_contents = response_dict['Contents']
    print(objs_contents)
    #unnecessary, good for reference
    #for i in range(len(objs_contents)):
    # file_name = objs_contents[i]['Key']
    # print(file_name)

# Read data file from S3 location
# Unpack/Unzip into JSON
# Load to landing bucket location
 def copy_object(self,source_bucket,object_key,target_bucket):
   target_object = object_key + str(time.time())
   copy_source = {
    'Bucket' : source_bucket,
    'Key' : object_key
   }
   s3 = boto3.resource('s3')
   landing_bucket = s3.Bucket(target_bucket)
   try:
    landing_bucket.copy(copy_source, target_object)
   except Exception as ex:
    print(ex)
   else:
    print('Success! Object loaded to: ' + target_object)
    return (target_object)

# turns the data contained in the s3 gzip compressed file to text document
 def convert_object(self,target_bucket,target_key):
   data = []
   s3_client = boto3.client('s3')
   read_object = s3_client.get_object(
     Bucket = target_bucket,
     Key = target_key
   )
   read_byte_object = BytesIO(read_object['Body'].read()) 
   raw_data = gzip.GzipFile(None, 'rb', fileobj=read_byte_object).read().decode('ASCII') #.decode('utf-8')
   s3_client.put_object(Body=raw_data, Bucket=target_bucket,Key=target_key[target_key.rindex('/')+1:] + str(time.time())+'.txt')

Looking at the convert_object function, you can see there was quite a bit of finagling needed in order to get the required data format and move the contents into my single landing bucket. This single bucket is where I’m storing all of my information, as outlined in the architecture. After doing this project, I realized the hard part of the library, just like anything, is learning how the different functions return the data and should be used in tandem to make a coherent solution. But I will say, the documentation is great and there are a plethora of resources/blogs.

Specifically, I’ll call out the following as a great place to start when looking to get something like this off the ground and into the cloud.

Building a Serverless Data Ingestion – Development Process

This is part two in a four part series on implementing a serverless, JSON based approach using AWS for data ingestion

Architecture: What’s the approach?
Development Process: How did I set up my environment that was effective and efficient for developing?
Difficulties: What issues came up, and how did they get resolved?
End results: Does this architecture achieve the goals that it set out to achieve?

One of the biggest blockers to getting started with building out the serverless data ingestion was figuring out the best way to develop code which could be deployed on the different AWS services being used. Traditionally I’ve deployed code to a central server or cluster from which everything could be tested and promoted. Deploy to a server, test on the server, then move to a production server or location on the same server where production files/code live. What happens when there is no server?

Docker

I’d put off learning Docker for quite a while due to the complexity introduced when running Docker, but in this case, being able to replicate the environment Lambda functions run on was the first time Docker clicked for me. Loosely following the excellent tutorial from Nicola Pietroluongo located here, I was able to stumble my way through creating my first dockerfile, resulting the below code which can be found here on GitHub.

FROM amazonlinux
RUN yum update -y
RUN yum install python3 -y
RUN yum install nano -y
RUN yum install zip -y
RUN yum install unzip -y

#AWS CLI Installation
#RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
#RUN unzip awscliv2.zip
#RUN ./aws/install

#create working directory
ADD . /user/src 
RUN pip3 install boto3 -t /user/src/Forsta/Parser

#v1
#Pull base image
#FROM ubuntu:latest

#Installation packages
#RUN apt-get update
#RUN apt-get install -y curl
#RUN apt-get install -y unzip
#RUN apt-get install -y python3
#RUN apt-get update
#RUN apt-get install -y python3-pip
#RUN pip3 install boto3
#RUN apt-get install nano

#AWS CLI installation
#RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
#RUN unzip awscliv2.zip
#RUN ./aws/install



#Add's current directory into container home directory.
#ADD . /home

While tinkering around with different approaches to the code, I was able to add or remove dependencies from the environment as needed (as you can see with all of the commented out packages in the above). I was constantly creating and destroying the environments running two or three commands in my local command line.

When developing on EC2 or other server environments, dependency management has always been a pain and has at times resulted in a bloated environment (slower, higher maintenance costs, etc.). This is due to to the packages that are unnecessary or never used are installed in the environment because they may have been used while developing the code, found to not be the best solution, but not deleted from the server environment. Using Docker was awesome due to the fact that each time I was iterating my code, I was spinning up the environment through the Dockerfile and commenting out the dependencies that weren’t needed, preventing this issue.

Deployment

The development of the code was all done using visual studio code, and once ready for unit tests, the Dockerfile above would be run. The actual python code, along with all dependencies are all placed in the container at the directory location /user/src/Forsta/Parser, as specified in the Dockerfile. If the code resulted in the desired outcome, I then Zip the files along with dependencies.

This Zip file is what we wanted to eventually get into a Lambda function. Once this Zip file was present in the container I spun up, the file was pulled down to my local machine, then uploaded through the AWS Management Console (this could all be automated) and ready to execute since I’d already setup the correct account through IAM.

The actual code getting executed is located here, and shown below.

from tests import test_parser

def lambda_handler(event, context):
    test_parser.t_parser()
    print("Completed")

Code Repo

Most people are familiar with Github at this point. I used GitHub Desktop to maintain the code base entirely on the main branch. Nothing fancy here, as I was working alone on this and able to do the quickest/fastest solution. As a side note, one of the items worth mentioning is that I picked up a code repo from a few years ago to start this (as shown below). I’ve had multiple computers and storage mediums die since then, but being able to pickup the repo and see the history was super useful.

Even if the code had been stored on my local C drive, who knows if I’d have been able to find it or remember why certain things were done. Being able to go through the version history and see the files between commits, helped greatly in refreshing and picking up the code to create these pipelines.

Dynamo DB

From an AWS standpoint, nothing too special. Everything was setup manually, but this could all be automated. Just like the deployment process. I did script out the creation of the landing table as shown below and available in the repo here.

import boto3

def create_landing_table():
    dynamodb = boto3.resource('dynamodb',region_name='us-east-1')

    landing_table = dynamodb.create_table(
        TableName='landing_table',
        KeySchema=[
            {
            'AttributeName': 'uuid',
            'KeyType': 'HASH'
            },
            {
            'AttributeName': 'upload_date',
            'KeyType': 'RANGE'
            }
        ],
        AttributeDefinitions=[
            {
                'AttributeName': 'uuid',
                'AttributeType': 'S'
            },
            {
                'AttributeName': 'upload_date',
                'AttributeType': 'S'
            }
        ],
        ProvisionedThroughput={
            'ReadCapacityUnits':10,
            'WriteCapacityUnits':10
        }
    )

    print('Table Status: ',landing_table.table_status)

In the future, I’m hoping to parameterize the creation of tables as needed. Due to this being a document database, all that needs to be defined is the creation of the unique identifiers. Eventually, I’ll parameterize the creation of the sort keys as necessary for performance.

With all the above, you now have an idea of how I developed on my local machine, deployed code to Lambda, and setup my final landing table in DynamoDB. If you missed the first post in the series which provides an overview of what I was trying to build, you can find that post here.

Building a Serverless Data Ingestion – Architecture

Data and analytics always seems to start with the same problem. How do get the data where it’s needed so that we can start getting insights? The problem isn’t getting the data from point A to B, but doing this in a way that is easy, cost-effective, reliable, and appropriately scalable for the use case. With the rise of the different cloud providers and their toolsets, I thought it would be fun to give a swing at implementing a serverless, JSON based approach using AWS.

This will be series of articles which will be broken down into the following:

Architecture: What’s the approach?
Development Process: How did I set up my environment that was effective and efficient for developing?
Difficulties: What issues came up, and how did they get resolved?
End results: Does this architecture achieve the goals that it set out to achieve?

Diving into the architecture plan is outlined below. We’ll go into each of the boxes in detail, but first let’s frame the use case for this project:

I want a solution that can be used in my personal data projects, can scale up to N data ingestion pipelines as needed, and is cheap to operate.

With that goal in mind, the solution uses technologies that support these objectives:

Scalability: All of these technologies can scale from gigabytes to terabytes of data automatically, being fully managed services. Additionally, the Lambda python functions that have been written are entirely serverless.
Cost: Cost is all based upon usage. So if nothing is used, all I’m paying for is storage costs for the storage of persistent data. DynamoDB’s on-demand capacity based pricing charges $.25 per a Gb, so using this service as a landing location before moving into Snowflake is extremely affordable considering the budget.
Upkeep/maintenance: Everything but the data layer is server-less, so no EC2 to keep up. No patching or server status’ needing to monitored. Or the worst case, no script kitties entering into an unprotected servers in my VPC that require me to start over from scratch.

So pretty straightforward from an overall technology standpoint right? The other item to note is how the Lambda functions are written in the python. The idea behind the S3 bucket structure is to funnel all of the data for ingestion into a single location, and ensure that the data is in a similar format to be landed in Dynamo DB.

With the Lambda functions in the GitHub repo here, we ensure that there is a key present that uniquely identifies the exact upload record and it’s origination so I can reuse the upload process for as many different feeds as we want, from whatever buckets we want. Completely configurable to point to a bucket you own, or someone else’s bucket, you can and land it in your own bucket.

Here’s one of the functions demonstrating a super straight forward movement/copy function to get our data to a single ingestion bucket:

# Read data file from S3 location
# Unpack/Unzip into JSON
# Load to landing bucket location
 def copy_object(self,source_bucket,object_key,target_bucket):
   target_object = object_key + str(time.time())
   copy_source = {
    'Bucket' : source_bucket,
    'Key' : object_key
   }
   s3 = boto3.resource('s3')
   landing_bucket = s3.Bucket(target_bucket)
   try:
    landing_bucket.copy(copy_source, target_object)
   except Exception as ex:
    print(ex)
   else:
    print('Success! Object loaded to: ' + target_object)
    return (target_object)

After this, it’s a matter of moving the data along the layers with our Lambda functions, manipulating the data as necessary, and ending up with that data inside of DynamoDB. The idea here being, if we build out the required functions in Lambda, these core python classes used in the Lambda functions to load the data for as many sources as we want, as long as they are similar.

As an example, do you have customer data being sent from many different sources, slightly differently? Well we can get that data into a single DynamoDB table to load into our relational Snowflake database for analytics, or access the data directly using DynamoDB’s API. All of the data in this example is landed in a single table, and can be identified by source for individual processing/analytics.

Although this all sounds straightforward, developing this architecture was truly easier than other side projects/tinkering I’ve done due to the tools that are available to develop Lambda functions and interact with AWS infrastructure. In the next section I’ll talk about the tools I used, how code was deployed, and few other relevant items that made all of this easier to do than expected.

Using Redash for Visualization

Over the winter break I was having a conversation with my cousin concerning the awesomeness of Tableau and all it offers. While Tableau is a best in class product, there are a couple points that he raised which are valid points against the effectiveness of Tableau.

Tableau uses it’s own proprietary language and functions for a lot of aggregations/advanced functionality that could be done in SQL. SQL based tools are better, he referred to Metabase explicitly among others, due to the fact that most analysts know the language and therefore will easily be able to pick it up.
Tableau isn’t open source, so as a user, if something doesn’t exist in an open source tool and I know the language, a custom feature to the tool can be added easily (depending upon your ability to code in the respective language).

With that in mind, that got me down the path of looking at open source reporting and dash boarding tools that are heavily SQL based. When looking, I cracked upon both a Metabase and Redash instance with which to play around with. Metabase had a good number of features available, but an extremely limited amount of rows which could be ingested by the tool on the free tier.

Heroku – Different pricing tiers

So, not wanting to spend any money upfront, I went over to Redash and started using it to build the first dashboards in the tool using the free trial. Needless to say, I fell in love with the tool instantly then my love was instantly tempered by other limitations present in in Redash. Below is the dashboard that was created, which can no longer be accessed unfortunately due to the free trial period ending, but appeared as pictured below.

Redash dashboard for Steam and Sony Marketplace price changes

Running through the charts, you can see the game and pricing data that has been collected by my scrapers. Scrolling through the different charts you’ll see the following:

Largest Price Drops and Increases: Shows daily what the maximum price increase has been, and largest decrease has been, along with the average price change for all items which had a price change that day.
Average Price Change: The average price for items which changed price, along with the average previous price for all items in the Steam Marketplace and Sony Play Store which had a price change for the day. In green is average price change between the old price and new price.
Most Recent Price Decreases: The last 20 items which have decreased in price, and associated data.
Most Recent Price Increases: The last 20 items which have increased in price, along with associated data.

At the bottom of each of these panels you’ll see when the data last was refreshed from the database.

Now onto the original reason for writing this article. The pros and cons of using and building dashboards in this tool.

Pros

No row limit encountered on the free trial! What a great feature. The first requirement I had was that I wanted to be able to ingest a large amount of data and do aggregations using that data. Not having a hard row limit that I ran into with my small 40,000 record data-set originally sold me on this tool.
Easy to get up and running. Setting up with tool, and making use of the connectors already present in the tool was extremely easy.
SQL interface is extremely intuitive to anyone who has used SQL Server Management Studio/PgAdmin, or any other database querying GUI tool.
Refreshes are extremely easy to schedule and reliable.

Cons

Free trial, and free software if self hosted. If hosted using Redash.io like I did, the price is $49 a month on the lowest tier. Metabase hosting on Heroku is just as easy, and cheaper to use in the long term for small side projects.
No aggregations can be done within visualizations, which in my view is a must-have with any dash boarding/reporting tool. Redash forces you to push that logic into SQL code, which results in redundant/complex queries. It also forces you to pre-aggregate, so the feature of “No row limit” in the pro’s section no longer applies.
Visualization features are basic/limited. Other dash boarding tools allow you do stacked bars, generally more customization option, tool-tips, choosing color schemes, etc. which are not easily available or are limited compared to a tool like Tableau.

Redash query editor. Combine multiple query visualizations to create a dashboard.

Redash was a tool meant for another use case. Perhaps one where you need a basic tool for monitoring ETL, or some other system…Using Redash was a good experience, but using something like Metabase is the tool for me at the moment. Due to the fact that in Redash all calculations have to be pushed down to the SQL and cannot be done with aggregations in the tool, along with the pricing, it doesn’t seem to suit my use case.

From Nothing to Something (The Beginning)

As part of a personal project, which I’m managing (and actively working) here, I’ve decided to do a little write up on my approach, what I’m learning, and other technical things I’ve encountered. This is as much for my own memory, as it is in the hopes that I can help some others avoid the technical pitfalls that I have encountered.

The Product:

I’ve always been someone extremely interested in data, especially data that no one else is looking at. So, what is the logical place to go? The most accessible data is the data that is already out there for the grabbing. So…scraping.

What has no one else scraped, or at least scraped and aggregated AND displayed well? Game prices across different platforms. There’s aggregators for all different kinds of products (ammo, outdoor gear, etc.) but no one seems to have implemented one for games well, although they have tried.

With that goal in mind we are building a product for people to track game prices, and favorite games so that they no longer have to track news on multiple sites and check multiple web marketplaces for the best prices on games. This means we will be scraping Reddit, Twitter, and other news/social media sites, in addition to game marketplaces like Steam and Sony’s Playstore.

What I Hope to Gain:

At the end of the day, maybe we strike gold by building the coolest website and app that ever existed and people love. More realistically, I want to build a platform with which I can add data as needed for my own wants/needs. I want to become expert level using certain libraries and frameworks, and be at a point where I’m not just a Business Intelligence and ETL developer but can develop all over the stack as needed with ease.

Also, I want to gain experience in setting up a highly performant, extensible, ETL platform off of which I end up with an app on a marketplace and at least one download. All of which will be done on a shoe-string budget. I can then use that platform to pivot and build any sort of data-centric application for whatever purpose/reason I want.

The Steps:

So, with all this being said, there are three main topics I will be writing about on a broad level.

Writing scrapers with Python’s Scrapy library, which run 24/7 around the clock
Writing ETL’s to a Postgresql database with near real time availability and using a budget AWS instance
Serving up the data to end users using an open source tool

More updates in the coming days!

What Should Documentation Be?

The problems Business Intelligence organizations solve in organizations are generally the same. Pull some data out of somewhere, synthesize the data, analyze it, then create a picture of what is happening, what is going to happen, or what has happened. Working in small and large organizations, I’ve had the pleasure of seeing a variety of different processes used to deliver these insights. These range from the overbearing, and associated documentation that crush people’s productivity, to the lightweight that creates quicksand beneath teams feet through the lack of knowledge transfer.

Seeing the overbearing and the extremely light weight, there’s one conclusion I’ve arrived at concerning documentation…

Document as Little as Possible

Relevant literature

Don’t commit time to things that aren’t creating revenue or helping the business. Looking at IT projects, there is no doubt that the more documentation that there is, the less value there is. The perfect example came about when having a beer with a former co-worker.

It was brought up that the process at the company we both previously worked at had documentation that took longer to create then coding, testing, and implementing the change. Additionally, this painstakingly crafted documentation that the engineer had to spend time tracking down information for didn’t result in documents that would be useful to the team doing the work going forward. The process decreed that you must document X, Y, and Z in order to deploy the change/implementation so that’s what was done. The fundamental truth is that the “…benefit of having documentation must be greater than the cost of creating and maintaining it.”

Some people believe in the exact opposite of over documentation. Nothing should be documented. The code/implementation should speak for itself. This may work when you have a small size IT application the will always be managed by the same group of individuals (which likely won’t happen). Once you reach an application spanning multiple servers, teams, and databases the expectation for the code/implementation to “speak for itself” in a timely manner to those who have to report and get analytics out is unreasonable.

So, what’s being proposed in this rant? The only useful documentation that I’ve seen documents the “Why” and the “How”. Everything else doesn’t create value for the organization, as the cost to maintain and develop the documentation is too high.

Why

Creating a BI Product entails connecting the business process to an application(s) or database(s). Depending on the environment that you’re working in, Inmon, Kimball, or something else entirely, you need to know the answer to why things in your system exist. The “Why” is important not only from a high leadership level, but also at a low technical implementation level. The “Why” statement done at the low level helps to ensure that a team is using previously created tools and implementations as designed. And if a change is made that goes against the original “Why”, it is intentional and by design.

As an example, working on the Vehicle Profitability by VIN project, the Data Architect created both Inmon (3-NF) and Kimball (dimensional reporting) structures on the project. The “Why” was made extremely apparent through documentation, so the teams knew how to use the current implementation to achieve their goal in the best way possible.

Are you importing new invoice data? That should go into the wholesale invoice structure so that it flows up in the existing fact that contains the revenue information for vehicles which our reports feed from. Why? Because we want a single source of truth for vehicle revenue.

When documentation providing the “Why” for technical implementations exists, it makes adding on and changing the existing processes and assets easier. As opposed to re-inventing the wheel over and over.

How

basic data flow diagram

So after we know why something exists, the other piece that is useful for documentation is the “How”. The “How” shouldn’t be step by step instructions, it should function like a high level map. Data Flow Diagrams are a great example of “How” documentation that I’ve found useful for Business Intelligence products. Armed with the Data Flow Diagram and the “Why” of the design, team members who need to report on, extend, maintain, or refactor a system will be able to make informed decisions.

Make It Useful

At the end of the day, documentation gets in the way of creating code/analysis/direct business value. So the argument for spending time creating documentation is hard to make when someone hasn’t experienced the pains associated with lack of documentation. Lack of architecture that makes sense, misreported numbers, time wasted building processes that do exactly what existing processes already do.

Without documentation, maintaining and using a system or process as intended is impossible. With documentation that is accessible, searchable, and focuses on the “How” and “Why”, organizations can make smart and informed decisions of where to spend time, how to tweak things, and how to get value from their assets.

	Chat GPT for Data An… on Chat GPT for Data Analysi…
	Building a Serverles… on Building a Serverless Data Ing…
	Building a Serverles… on Building a Serverless Data Ing…
	Building a Serverles… on Building a Serverless Data Ing…
	Building a Serverles… on Building a Serverless Data Ing…

Michael Greis