Technical Learnings

AI for Analysis in 2026

It’s been a long time since I’ve looked at using AI tools for data analysis. Frankly, after playing around with Chat GPT in 2023 it made me skeptical that these tools would be useful in the near term. Thinking back to 2023, we were at the start of the true AI boom. Chat GPT was just released in November of 2022 and entering the public zeitgeist. In the past few years we’ve seen an explosion of AI features in the tools and technologies we use in our personal and professional lives.

In this post we’ll be attempting to use Claude as a data analyst to perform hypothesis testing and analysis, and measuring it against our 2023 Chat GPT Analysis findings.

Can Claude act as an analyst for us?
Are the issues we found in 2023’s Chat GPT model resolved?
How could we use this going forward?

Can Claude perform as an analyst?

When we looked at Chat GPT for data analysis in 2023 we found it to be quick, but lacking the ability to understand context and detect data quality issues. Let’s see how Claude measures up against the 2023 Chat GPT model. To do this analysis we will be using the free tier of Claude and the Sonnet 4.6 model. This model is listed as “The best combination of speed and intelligence“.

Using the Claude desktop client we’ll now get our hypothesis tested. Using the desktop client you have the ability to upload data to a project, and then query that dataset within the project. The interface is intuitive and extremely straightforward. No tagging or explaining of the dataset was done to produce the output. Just upload the data, and prompt the model.

Prompt:

“You are an analyst who has been provided traffic datasets for Texas. These tell us by date and time how many traffic accidents occurred in Texas. I would like you to provide a response that supports or rejects this hypothesis: Traffic accidents were lower in 2020 due to Covid.”

Output Received:

This is a large step up from what I expected to be standard for the model. Not only did it produce a basic visualization, it directly provided KPI’s that would be used to validate the hypothesis.

In the details section below, they also provide the more in depth explanation. Not only does it walk me through the data, it also calls out the partial years of 2016 and 2023, removing them from the detailed analysis. Finally, and most importantly, it gives a clear explicit response to the hypothesis and went a level deeper than expected.

In the full response, detailed explanation of the factors that impact accepting or rejecting the hypothesis.

Conclusion

Can Claude act as an analyst?

Yes. For a basic analysis and hypothesis testing Claude is able to clearly analyze data, provide clarity into how it arrived there, and summarize data.

Are the issues we found in 2023’s Chat GPT model resolved?

The Claude model used for this analysis was clearly better. It caught the anomaly with the partial years of data and excluded them from the conclusion. Claude also gave in depth analysis of data points that both supported and refuted the hypothesis. Also, while not an issue in the original model, the visualization provided as a response to the prompt was useful for framing up KPIs and showing the trend over time.

How could we use this going forward?

The Claude model seems well positioned to be a tool for both experienced analysts doing a first pass of a dataset, and users who aren’t trying to do anything too complex with data. For standalone datasets that are small and generally simple this could expedite the time to insight and enable less data focused users to get valuable insights with the tools they have on hand.

Chat GPT for Data Analysis

AI is everywhere nowadays. It is going to replace all the office jobs, and make Google obsolete. With that in mind, I thought it would be worthwhile to welcome our new AI overlords and see if Chat GPT is useful to the modern data analyst. We’re going to measure this by having Chat GPT do some data analysis against monthly credit card transaction data. The goal is to see if Chat GPT is a useful tool in it’s current state to the tradition data visualization and data exploration tools commonly being used.

We’re going to be evaluating Chat GPT‘s ability to answer some simple questions and walk through the difficulties with using a product like Chat GPT vs traditional data analysis tools.

The Data

To start off with, I uploaded credit card transaction data from Chase. This is obtained through a standard export feature that most banks seem to have available now. The table has the following information:

Transaction Date: Date the transaction occurred at the point of sale.
Post Date: Date the transaction was reflected on the account.
Description: Usually contains the business and other information. Different for every vendor.
Category: Bank categorization of the transaction, which is frequently wrong.
Type: If it was a purchase or payment on the credit card. Payment being reducing the outstanding balance.
Amount: The credit/debit amount for the transaction.
Memo: Always blank for my credit card at the moment.

Loading to Chat GPT

Using Chat GPT at chat.openai.com, there are two options to loading the data. We’ll be trying out both, starting with the Free Tier.

Free Tier: Upload the data in the chat window, using Chat GPT 3.5.
Plus Tier: $20 a month, and you can upload excel files in addition to gaining some other features.

Free Tier

Uploading to the free tier was super easy, but I could imagine would cause issues with less well structured data. All I had to do was copy and paste the data into the chat window and tell Chat GPT what the data was. Looking at the video, you can see the upload is super simple. However, we run in to the first problem. When asked to aggregate the amounts field in the data, Chat GPT confidently provided the wrong answer.

*Uploading and describing data to Chat GPT*

Providing the wrong answer on the first pass isn’t great, but it isn’t a dealbreaker. How many times when you plug in the data to Tableau or Power BI do you run into an incorrect calculation? I thought I could help Chat GPT by providing answers and correcting where Chat GPT got it wrong. Unfortunately, that did not help.

At this point, I’m going to throw in the towel with the free tier. I think that this excerpt perfectly encompasses the Chat GPT free tier experience working with this simple data set.

*How can you trust a data analysis tool that does this*?

Chat GPT Plus

Using the Chat GPT plus tier allows the use of the Advanced Data Analysis feature which is currently in Beta (as of October 18th, 2023). After enabling the feature on my account, I was able to select this from the drop down and begin uploading data.

Uploading the file was super simple from a UI perspective, but there are a couple issues that would prevent me from adopting Chat GPT, or replacing Tableau entirely:

File size limit of 100 MB
You have to trust Open AI to safeguard or delete any data uploaded

On the plus side, I was blown away by the results of using Chat GPT on a simple dataset. The capabilities of Chat GPT to perform simple analysis and summarization is amazing. Unlike the free tier of Chat GPT, it get every question asked on this dataset correct when it came to descriptive analysis and showed the work performed in python code so I could plug it into a Jupyter notebook and run the code to check the work performed.

*Thorough and correct results after uploading the file*

Conclusion

For those folks who regularly extract small datasets from a dashboard, database, or application, this could be an extremely useful tool to supplement the tools already common in enterprises for analysts. Pay $20 dollars a month to automate and save time doing ad-hoc analysis. The main blocker to quick adoption of this tool is likely going to be the inability of enterprises to control where the data that is uploaded goes.

Building a Serverless Data Ingestion – Architecture

Data and analytics always seems to start with the same problem. How do get the data where it’s needed so that we can start getting insights? The problem isn’t getting the data from point A to B, but doing this in a way that is easy, cost-effective, reliable, and appropriately scalable for the use case. With the rise of the different cloud providers and their toolsets, I thought it would be fun to give a swing at implementing a serverless, JSON based approach using AWS.

This will be series of articles which will be broken down into the following:

Architecture: What’s the approach?
Development Process: How did I set up my environment that was effective and efficient for developing?
Difficulties: What issues came up, and how did they get resolved?
End results: Does this architecture achieve the goals that it set out to achieve?

Diving into the architecture plan is outlined below. We’ll go into each of the boxes in detail, but first let’s frame the use case for this project:

I want a solution that can be used in my personal data projects, can scale up to N data ingestion pipelines as needed, and is cheap to operate.

With that goal in mind, the solution uses technologies that support these objectives:

Scalability: All of these technologies can scale from gigabytes to terabytes of data automatically, being fully managed services. Additionally, the Lambda python functions that have been written are entirely serverless.
Cost: Cost is all based upon usage. So if nothing is used, all I’m paying for is storage costs for the storage of persistent data. DynamoDB’s on-demand capacity based pricing charges $.25 per a Gb, so using this service as a landing location before moving into Snowflake is extremely affordable considering the budget.
Upkeep/maintenance: Everything but the data layer is server-less, so no EC2 to keep up. No patching or server status’ needing to monitored. Or the worst case, no script kitties entering into an unprotected servers in my VPC that require me to start over from scratch.

So pretty straightforward from an overall technology standpoint right? The other item to note is how the Lambda functions are written in the python. The idea behind the S3 bucket structure is to funnel all of the data for ingestion into a single location, and ensure that the data is in a similar format to be landed in Dynamo DB.

With the Lambda functions in the GitHub repo here, we ensure that there is a key present that uniquely identifies the exact upload record and it’s origination so I can reuse the upload process for as many different feeds as we want, from whatever buckets we want. Completely configurable to point to a bucket you own, or someone else’s bucket, you can and land it in your own bucket.

Here’s one of the functions demonstrating a super straight forward movement/copy function to get our data to a single ingestion bucket:

# Read data file from S3 location
# Unpack/Unzip into JSON
# Load to landing bucket location
 def copy_object(self,source_bucket,object_key,target_bucket):
   target_object = object_key + str(time.time())
   copy_source = {
    'Bucket' : source_bucket,
    'Key' : object_key
   }
   s3 = boto3.resource('s3')
   landing_bucket = s3.Bucket(target_bucket)
   try:
    landing_bucket.copy(copy_source, target_object)
   except Exception as ex:
    print(ex)
   else:
    print('Success! Object loaded to: ' + target_object)
    return (target_object)

After this, it’s a matter of moving the data along the layers with our Lambda functions, manipulating the data as necessary, and ending up with that data inside of DynamoDB. The idea here being, if we build out the required functions in Lambda, these core python classes used in the Lambda functions to load the data for as many sources as we want, as long as they are similar.

As an example, do you have customer data being sent from many different sources, slightly differently? Well we can get that data into a single DynamoDB table to load into our relational Snowflake database for analytics, or access the data directly using DynamoDB’s API. All of the data in this example is landed in a single table, and can be identified by source for individual processing/analytics.

Although this all sounds straightforward, developing this architecture was truly easier than other side projects/tinkering I’ve done due to the tools that are available to develop Lambda functions and interact with AWS infrastructure. In the next section I’ll talk about the tools I used, how code was deployed, and few other relevant items that made all of this easier to do than expected.

Getting Started – Snowflake and S3

Snowflakehas been quickly throughout different organizations and geographies over the past few years. With exponential growth each year when looking at their customer base (~100% growth in 2019 fiscal year) and revenue (257% increase year over year in 2019). There has to be something amazing behind this product besides great marketing right?

With that in mind, I figured it would be a good use of time to take a look. One approach to take when assessing new tools is focusing on a few different use cases and evaluating how it stacks up against the competition. When looking at data and analytics tools, one of the first questions always is “can I get the data were it needs to go quickly?”. In this case, we’ll be looking at loading a basic pipe delimited data set used to populate an old Tableau dashboard from an S3 bucket into Snowflake.

For the purpose of this blog post we’re going to focus on bulk loading. The reason for this being that the most basic use case for many data warehousing initiatives are going to be based on nightly loads or a similar non-continuous schedule. Additionally, we’re going to focus on using an S3 bucket which is external to Snowflake for this attempt.

External Tables
Bulk Loading
Continuous loading

Step 1: What data to load?

In this case, I grabbed the pipe delimited text file and dropped it into an S3 bucket. The the complexity will come in the form of managing my VPC and transferring the data from my S3 bucket to the Snowflake’s internal S3 storage.

File loaded in Snowflake…Note the size of the file in S3.

Step 2: Grant Snowflake Access to S3

Snowflake exists outside of my main AWS account’s VPC so I need to grant my Snowflake account access to my S3 bucket to copy the data into Snowflake. One important thing to note, is that it is definitely a good idea that you have your S3 data in the same region as the Snowflake instance so that you’re data/traffic stays internal to the AWS network (non-public).

Integration flow for S3 Stage.

Per the Snowflake documentation there are three main options for performing this piece of work. The recommended option is to configure a Snowflake Storage Integration so that we can avoid credential requests while trying to get data loaded.

The creation of a policy on my AWS account is the first thing that’s needed.

On the left is my policy, on the right Snowflake’s standard recommendation for the policy.

Luckily, the process to do this is straightforward. Snowflake goes as far as providing the exact JSON that needs to be pasted in, so it’s relatively easy. The main exception between what Snowflake recommends and the policy I created is the use of the wildcard “*” in the s3:prefix member. It’s fine for my prototyping use case, but if you have more than 1,000 objects in the bucket, a “*” will cause an error when trying to read/copy data to Snowflake.

The second item that needs to be completed is the configuration of an IAM user which will allow your Snowflake account to access the specified S3 bucket. This is the same process used to allow two separate AWS accounts to access one another’s resources because, well, that’s exactly what’s occurring with Snowflake being hosted on AWS. Detail can be found here, but at the end of the day you’ll end up with an IAM user with the policy we created earlier assigned to the account.

Step 3: Setting up a Snowflake Integration

The creation of the integration is what we were striving for all along. The reason for this is that by creating an integration, we have a way to access our S3 buckets without having to log in or supply credentials in Snowflake. If we don’t create this integration, then each time data is loaded credentials will have to be supplied.

Integration creation which will be used to create Stages.

Step 4: Create the Stage

Finally, we have the ability to connect to our S3 bucket from our Snowflake account. We can now specify the Stage that’s needed in Snowflake to actually load the data from the S3 bucket into our Snowflake. There are many options to customize the stage, but for this basic example I know that I have a text file format with “|” as a delimiter and a header row.

Simple stage creation for loading CSV formatted file.

Step 5: Create the Table

Pretty straightforward. We need somewhere for the data to live once copied into Snowflake from S3. Everyone should be familiar with creating a table using standard SQL. One of the great parts of dealing with Snowflake is the auto-indexing and partitioning right out of the box. So I used the most basic DDL to ensure no data is dropped or skipped in loading due to data type mismatches.

DDL for table creation, based off the structure of the CSV. All String datatypes for ease of loading.

Step 6: Create the Snowpipe

The final step is to get the data loaded from our S3 to Snowflake’s internal S3. This is done through S3, and involves the final step of creating the Snowpipe for loading the data.

Creating a Snowpipe that automatically loads the data upon creation.

Final Outcome:

Our data has landed!

As we see when running a query against our table, the data is now present in Snowflake and available for querying. Additionally, the file that was loaded into Snowflake still exists in our S3 bucket for use in any QA or validation that we want to perform. In our final state, we can also see that Snowflake automatically compressed our data from over 50MB to 15.49MB without any manual intervention.

With automatic compression, the file now takes about ~30% of the storage space that the raw file does on S3.

Going forward, we’ll be exploring some more interesting/impressive capabilities of Snowflake. But starting here, you can see the ease of use, going from no infrastructure to having data feeding into a enterprise scalable data warehouse in less than a couple hours.

Re-establishing a Broken Cloud

This week, I cracked open Tableau to log into my Amazon RDS instance and noticed that the connection wasn’t working. Logging into the AWS console, my AWS RDS instance had disappeared (along with all the data in it). On perusing my emails, I noticed that I had an unpaid bill in my inbox from Amazon from ~1 month prior. So…along with the instance no longer running, I had lost all data contained which I had been collecting over the past 3 months which is more than slightly disappointing.

This does present an opportunity though. My EC2 instance is still running, and has been trying to push data to a server that no longer exists, meaning I need to set the RDS instance backup. This was an opportunity to document setting up a new RDS instance on AWS from scratch, with all necessary users, objects, and privileges and document how long it took.

Here’s the process form start to finish;

Start: 4:12 pm

First step, logging in and getting the instance created. You’ll notice during this step that I flip from free-tier to get more storage, then flip back to free-tier. Why pay more money to get increased storage I won’t need for a couple weeks? All I need to do to up the storage is change a configuration which will cause my RDS instance to be down for a couple minutes.

Second step, making sure security privileges are setup. After my first project a couple years ago, and getting my web server destroyed by a script kiddie, I now only open specific ports (which I should have been doing all along).

The third step. I should be able to login to the server using the account that I set up as admin. Once I log in, all I have to do is execute all the create scripts I have.

The way that I created the DDL for my tables and schemas means that I can copy and paste them into query window in PgAdmin4 and execute the scripts. You’ll notice I have a couple semicolon issues that I’ve resolved.

Finally, looks like everything has been created. Just need to validate that my different accounts can login to the server and have appropriate privileges, which they did.

Finish: 5:17 pm

Connections exist from my EC2. With no alteration to any code on the EC2 server!

This process did not include any alteration of the EC2 instance and allowed me to go from a web server scraping the internet and sending files into the ether (nowhere) to having a full database stood up with all objects. This was done in a little over an hour, and ~30 minutes of that was spent executing sql, copy and pasting sql into the query editor for execution, testing to ensure objects/configuration was successful, and fixing minor syntax issues. All of which could be automated away.

I was debating whether or not to get a personal server for my projects, but this in my mind firmly helps cement the cloud as being a better choice when it comes to infrastructure. Comparing to my experience setting up a local SSAS and SQL server instance, this took about 10% of the time and was extremely easy to get running.

A Foray Into Serious Scraping

It’s been a while since my last post. Getting married, honeymooning, buying a house, etc. took away the time I had for this. But all of that is nearing it’s end, so I’m getting back into the regular cadence of working on the scraping project. Since I’m now back into it, and got everything up and running, I figured the most sensible place to start is the architecture that has been implemented.

The Problem:

The most sensible place to start any discussions of architecture is clearly stating what the system is supposed to do. I need a system that accomplishes these three things.

I need a system that is able to reliably scrape data from any website or consume data from any source.
I need a place where this data can be loaded and reported on in a cohesive format.
I need the product to be lightweight as far as storage space required and CPU so I don’t have to pay out the wazoo.

Overview:

Overview of the architecture, from inputs to database inserts.

In order to meet this, a straightforward architecture was implemented. Using Amazon Web Services both a EC2 instance and RDS instance were set up, with the EC2 being an Ubuntu instance and the RDS being Postgresql. In sequential order, here is how the scraper works.

Using python’s Scrapy library, we’ve written Scrapy projects which look to specific sources to bring in data based upon the HTML on websites. Right now, we’ve targeted two, but can expand to as many as needed. These Scrapy spiders are scheduled through Scrapyd, a framework that no only allows for scheduling and management of spiders, but also offers better performance by operating on Twistd making it asynchronous.
As the spiders are constantly running they are outputting to JSON files on the server. Basically, the driver here is to have a place to drop the output of the data onto the server so that data won’t be lost if something happens with one of the processes.
A Python class was written with Psycopg2 in a way that is meant to be extensible for future data sources. The idea being, that as the data model and data sources are changed/expanded upon, the only thing that will need to change is the class itself. None of the scripts that call the class to insert data from our existing data sources will need to change.
A staging area was created within the RDS PostgreSQL instance which ingests the raw data from the data source. Where possible, a unique index was created that checks for changes before accepting the data into the staging area. As we have scrapers hitting sources repeatedly, we are going to be grabbing the same data. What we’re interested in are the changes, especially in regards to new items or price changes. Also, we want to make as efficient as possible of a architecture so storing only the data we are interested in just makes sense.
Once data has been accepted into the landing zone, the Ubuntu instance is used to schedule a slew of ETL jobs written in SQL and passed to PostgreSQL for execution using Psycopg2. Postgresql doesn’t have a native scheduler readily available, so we use the Crontab functionality of Ubuntu to execute a script for each of our sources that calls from a class containing all of our ETL functions. The end result of this is a 3NF model populated with data and appropriate relationships made.

So, it’s now up and running, and data is flowing through into the objects. The data is populating for all tables where it is expected and I could begin reporting price changes today. The best part? All of this was built using $0’s of infrastructure from Amazon Web Services (and a lot of my time). I’m running out of storage space rapidly (20 gigs from the free tier ran out over a couple days), and the CPU is not beefy at all, so stalls out if more than one scraper is running at a time (as pictured below).

Performance goes down dramatically in yellow. In red, my scraper has been blocked from accessing the site (which didn’t happen before refactor…I’ll go into that another time)

To refer back to the original goal, I would say it has been achieved. Not to say that it couldn’t be improved upon and optimized. But overall, the first serious foray into scraping seems to of gone well. Feel free to reach out with any questions, or suggestions!

From Nothing to Something (The Beginning)

As part of a personal project, which I’m managing (and actively working) here, I’ve decided to do a little write up on my approach, what I’m learning, and other technical things I’ve encountered. This is as much for my own memory, as it is in the hopes that I can help some others avoid the technical pitfalls that I have encountered.

The Product:

I’ve always been someone extremely interested in data, especially data that no one else is looking at. So, what is the logical place to go? The most accessible data is the data that is already out there for the grabbing. So…scraping.

What has no one else scraped, or at least scraped and aggregated AND displayed well? Game prices across different platforms. There’s aggregators for all different kinds of products (ammo, outdoor gear, etc.) but no one seems to have implemented one for games well, although they have tried.

With that goal in mind we are building a product for people to track game prices, and favorite games so that they no longer have to track news on multiple sites and check multiple web marketplaces for the best prices on games. This means we will be scraping Reddit, Twitter, and other news/social media sites, in addition to game marketplaces like Steam and Sony’s Playstore.

What I Hope to Gain:

At the end of the day, maybe we strike gold by building the coolest website and app that ever existed and people love. More realistically, I want to build a platform with which I can add data as needed for my own wants/needs. I want to become expert level using certain libraries and frameworks, and be at a point where I’m not just a Business Intelligence and ETL developer but can develop all over the stack as needed with ease.

Also, I want to gain experience in setting up a highly performant, extensible, ETL platform off of which I end up with an app on a marketplace and at least one download. All of which will be done on a shoe-string budget. I can then use that platform to pivot and build any sort of data-centric application for whatever purpose/reason I want.

The Steps:

So, with all this being said, there are three main topics I will be writing about on a broad level.

Writing scrapers with Python’s Scrapy library, which run 24/7 around the clock
Writing ETL’s to a Postgresql database with near real time availability and using a budget AWS instance
Serving up the data to end users using an open source tool

More updates in the coming days!

	Chat GPT for Data An… on Chat GPT for Data Analysi…
	Building a Serverles… on Building a Serverless Data Ing…
	Building a Serverles… on Building a Serverless Data Ing…
	Building a Serverles… on Building a Serverless Data Ing…
	Building a Serverles… on Building a Serverless Data Ing…

Michael Greis