Building a Serverless Data Ingestion – End Result

In this final post, we'll go over the finished implementation of this serverless data ingestion pipeline. What came of all the effort put into building it? The best way to break that down is to compare what we were originally aiming for with what was actually implemented. Below is the diagram created in the post outlining the overall architecture.

Pictured on the left is what was originally proposed; on the right is what was actually implemented. It turns out the implementation pretty much stuck to the plan, with the additional enhancement of using S3 events to kick off the Lambda functions. This lets everything run in the appropriate sequence as soon as a file is placed in our “ingestion-bucket-11.15.2021”.

Using S3 events to kick off Lambda functions was extraordinarily easy, and there's plenty of good documentation to get started. The S3 event passes along, as JSON, all the metadata needed to parameterize and operate the pipeline, and Amazon's documentation makes it easy to access and use when setting up your Lambda functions.
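For reference, the event payload Lambda receives from S3 looks roughly like the abbreviated structure below. This is illustrative only; the real payload carries additional metadata (region, timestamps, request IDs), and the object key shown is a made-up example.

example_event = {
    'Records': [
        {
            'eventSource': 'aws:s3',
            's3': {
                'bucket': {'name': 'ingestion-bucket-11.15.2021'},
                'object': {'key': 'logs/export-2021-11-15.json.gz'}  # hypothetical key
            }
        }
    ]
}

# The handler below pulls out exactly these two values
bucket_name = example_event['Records'][0]['s3']['bucket']['name']
file_name = example_event['Records'][0]['s3']['object']['key']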

Below you can see the actual code executed in the Lambda function. Notice that the variables bucket_name and file_name are both retrieved from the event.

from classes import ingester as ing
from classes import forstaparser as fp

def lambda_handler(event, context):
    # Bucket and object key come straight from the S3 event metadata
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    file_name = event['Records'][0]['s3']['object']['key']
    target_bucket = 'landing-bucket-11.15.2021'
    upload_table = 'landing_table'
    source_type = 's3'

    if bucket_name == target_bucket:
        # File is already in the landing bucket: convert it, parse the logs, load DynamoDB
        upload_file_name = ing.ingester.convert_object(None, bucket_name, file_name)
        raw_logs = fp.parser.read_logs(None, source_type, target_bucket, upload_file_name)
        fp.parser.dynamo_landing_load(None, upload_table, raw_logs, file_name)
    else:
        # File arrived in the ingestion bucket: copy it into the landing bucket
        landing_ingester = ing.ingester()
        landing_ingester.copy_bucket(bucket_name, target_bucket)
    print("Completed")

Put simply, the function does the following. First, it receives the event metadata, parses the JSON, and obtains the name of the bucket we want to transfer files from. Notice that we can point to any bucket, and it will always drop the file into ‘landing-bucket-11.15.2021’. Using the event metadata means I can reuse this Lambda function as often as I want, creating a central dumping ground for staging data to be loaded.

Second, once files are put into ‘landing-bucket-11.15.2021’, another event fires. That event triggers a second function which cleans the data, ensures proper encoding (UTF-8), and loads it into our DynamoDB landing table. All in all, pretty simple.
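For illustration, a minimal sketch of what that landing load could look like with Boto3's DynamoDB resource is below. The table name comes from the handler above, but the attribute names and the way the unique key is built are assumptions for the sketch rather than the exact logic in the parser class.

import time
import boto3

def dynamo_landing_load(table_name, raw_logs, source_file):
    # Sketch: write each cleaned log line as an item in the landing table,
    # using the source file plus a line number as a simple unique key (assumed scheme)
    table = boto3.resource('dynamodb').Table(table_name)
    with table.batch_writer() as batch:
        for line_number, log_line in enumerate(raw_logs):
            batch.put_item(Item={
                'log_id': f'{source_file}#{line_number}',  # hypothetical partition key
                'raw_log': log_line,
                'loaded_at': str(time.time())
            })

dynamo_landing_load('landing_table', ['log line one', 'log line two'], 'example.txt')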

Below you can see everything running in action.

As we can see above, the files were automatically copied, and for posterity's sake we can check the CloudWatch logs, which show a 100% success rate for our first Lambda function over the last hour.

The next step kicks off automatically whenever an object is PUT into ‘landing-bucket-11.15.2021’ and loads the data into DynamoDB. With the current setup, the data uploads successfully and is now available in DynamoDB to be ingested into whatever processes or analytics we want! The best part is that once this is set up, it is automated and auditable going forward thanks to the tooling AWS offers.
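For reference, the trigger itself can also be wired up in code rather than through the console. Below is a rough sketch using Boto3; the Lambda ARN is a placeholder, and the function's resource policy would already need to allow S3 to invoke it.

import boto3

s3 = boto3.client('s3')

# Sketch: invoke the loader Lambda whenever an object is PUT into the landing bucket
s3.put_bucket_notification_configuration(
    Bucket='landing-bucket-11.15.2021',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:landing-loader',  # placeholder ARN
                'Events': ['s3:ObjectCreated:Put']
            }
        ]
    }
)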

Building this may not have seemed like a long journey, but keep in mind that in the process of building this little project out I've had to pick up and learn quite a few tools: Docker, Lambda, S3, IAM, Python, Boto3, and a few more covered in the previous posts. If I need to do this again, it'll be much simpler based on what I've learned.

Thanks for reading along!

Building a Serverless Data Ingestion – Difficulties

This is part three in a four-part series on implementing a serverless, JSON-based approach to data ingestion using AWS.

When outlining the architecture and development process, I glossed over the problems that had to be overcome along the way. Most of my work life and free time isn't spent using Python, so the issues I confronted are likely trivial for more experienced developers. Still, doing something new, I ran into a few problems that were interesting and worth jotting down, if only for my own memory.

  1. Learning about the Dockerfile
  2. AWS Lambda events and layers
  3. Learning Boto3

Learning about the Dockerfile

When starting off with Docker, I was throwing things at the wall and seeing what stuck. Originally, I was testing against a standard Ubuntu image for the final function that would end up in AWS Lambda. In retrospect, this was not the right approach; I should have started with the amazonlinux image that is readily available on Docker Hub. Once I understood how to create the Dockerfile from that image, the next step was understanding how to get the code into the container.

My first instinct was to create the Dockerfile in a specific subdirectory of the code base, with a structure like the following:

The entirety of the GitHub repo is Forsta, with subdirectories serving specific purposes.

  • Database: Contains code to create the DynamoDB database tables, and other configurations.
  • Parser: Contains the code for moving data between S3 buckets and from S3 into DynamoDB. It also contains functions to clean the data and create a primary/unique key for the DynamoDB table.
  • Test: Contains all unit tests or end-to-end tests I would need to create. It ended up containing the function executed by the AWS Lambda function, which needs to be rectified in the future.
  • Docker: The final directory was meant to be Docker, which would have held a collection of Dockerfiles used for different Lambda functions. That's where I ran into some issues with pathing.

Because of where the Dockerfile sat in this structure, I couldn't easily use the ADD command, which meant I couldn't pull the required files into the Docker container to test my code. My recommendation: keep the Dockerfile in the topmost directory of your repo (in this case, directly under Forsta), and you can easily get all of the code you need into the container.

AWS Lambda

This was my first time using AWS Lambda, and it was a bit bumpy at first. My original approach was to create a class, which I would then call in Lambda. While that is basically what the end result was, the route there involved some discovery and mistakes.

The first time I attempted to deploy code to Lambda, I just took the class, zipped it up, and tried publishing the Lambda function. I hadn't thought through the fact that something other than my test scripts would have to call the published code.

The second time, I published one of my test scripts that worked in the container, hoping it would run the desired code. Again, this did not work out. After some further research, I found that an AWS Lambda function requires a handler that receives an event to kick off the execution of the desired code. In retrospect, this makes complete sense.

The third attempt I got right, after looking at this great tutorial. The key is to create a wrapper that accepts the event from the AWS environment and kicks off the underlying code I was looking to execute. You can see the repo here, with the handler shown below:

from tests import test_parser

def lambda_handler(event, context):
    test_parser.t_parser()
    print("Completed")

All of this could have been averted by reading the documentation before trying to deploy. To get to this point, I had to refactor the directory structure a couple of times (leading to code changes) and deploy multiple times. Lesson learned: documentation is in fact worth reading.

Learning Boto3

Let’s start off with the basics. What is Boto3? Luckily, the Boto3 documentation has a simple overview on the landing page.

You use the AWS SDK for Python (Boto3) to create, configure, and manage AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). The SDK provides an object-oriented API as well as low-level access to AWS services.

– Boto3 Documentation

This library underpins everything done as part of this effort. Getting the data to where I needed it was easy; the complications came from understanding how to clean the data into a format that would be useful in a DynamoDB table.

This can be seen primarily in the ingester class, located here and pictured below.

import boto3
import time
import gzip
import json
from io import BytesIO

class ingester():

    # Print the name of every bucket the configured account has access to,
    # along with the objects each bucket contains
    def s3_list_buckets(self):
        for bucket in boto3.resource('s3').buckets.all():
            print(bucket.name)
            response_dict = boto3.client('s3').list_objects(Bucket=bucket.name)
            print(response_dict.keys())
            # Ensure the bucket has content before trying to pull content info out
            try:
                response_dict['Contents']
            except KeyError:
                print('No objects in ' + bucket.name + ' exist.')
            else:
                print(response_dict['Contents'])
                objs_contents = response_dict['Contents']
                print(objs_contents)
                # Unnecessary, but good for reference
                #for i in range(len(objs_contents)):
                #    file_name = objs_contents[i]['Key']
                #    print(file_name)

    # Copy a data file from its source S3 location into the landing bucket,
    # timestamping the key so repeated loads don't overwrite each other
    def copy_object(self, source_bucket, object_key, target_bucket):
        target_object = object_key + str(time.time())
        copy_source = {
            'Bucket': source_bucket,
            'Key': object_key
        }
        s3 = boto3.resource('s3')
        landing_bucket = s3.Bucket(target_bucket)
        try:
            landing_bucket.copy(copy_source, target_object)
        except Exception as ex:
            print(ex)
        else:
            print('Success! Object loaded to: ' + target_object)
            return target_object

    # Turn the gzip-compressed file in S3 into a plain-text object in the same bucket
    def convert_object(self, target_bucket, target_key):
        s3_client = boto3.client('s3')
        read_object = s3_client.get_object(
            Bucket=target_bucket,
            Key=target_key
        )
        read_byte_object = BytesIO(read_object['Body'].read())
        raw_data = gzip.GzipFile(None, 'rb', fileobj=read_byte_object).read().decode('utf-8')
        new_key = target_key[target_key.rindex('/')+1:] + str(time.time()) + '.txt'
        s3_client.put_object(Body=raw_data, Bucket=target_bucket, Key=new_key)
        return new_key

Looking at the convert_object function, you can see quite a bit of finagling was needed to get the data into the required format and move the contents into my single landing bucket, which is where I'm storing all of my information, as outlined in the architecture. After doing this project, I realized the hard part of the library, as with anything, is learning how the different functions return data and how they should be used in tandem to make a coherent solution. That said, the documentation is great and there is a plethora of resources and blogs.
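As a rough illustration of using these functions in tandem (instantiating the class normally rather than passing None for self, and with a made-up object key), the flow from the ingestion bucket to a converted text object looks something like this:

# Illustrative usage only; bucket names match the architecture above, the key is hypothetical
landing_ingester = ingester()
copied_key = landing_ingester.copy_object(
    'ingestion-bucket-11.15.2021',       # source bucket
    'logs/export-2021-11-15.json.gz',    # hypothetical object key
    'landing-bucket-11.15.2021'          # target bucket
)
if copied_key:
    landing_ingester.convert_object('landing-bucket-11.15.2021', copied_key)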

Specifically, I’ll call out the following as a great place to start when looking to get something like this off the ground and into the cloud.

Delivering Change

When you think of great products, what do you think of? Products that have altered the way office jobs are done (Excel, Outlook, Dropbox), or changed the way we interface with computers and technology (the iPhone). Hearing from the VP of Product at Change.org recently was an opportunity to hear from the head of one of the most impactful products on the market today, as evidenced by its impact on politics around the globe. This presentation concerned the organizational tools the team at Change.org used, which led to the successful creation of an entirely new product line, circa 2011, to support a major shift in Change.org‘s business model.

French president responding to Change.org petition

The problem facing Change.org was that its revenue had come from a pure B2B model, which was no longer viable; the grassroots petition platform that exists today did not yet exist. It became clear to the organization's leadership that the business model had to change due to external market factors. Nick (the VP of Product) led this effort from the product side and, using a few different concepts and methodologies, successfully led the team to create the product line that made Change.org what it is today.

Problems Faced

The main internal problems that the company faced in developing an entirely new product line were the following:

  1. Speed of Delivery: Slow progress towards product completion
  2. Alignment: Getting the organization to work in the same direction
  3. Focus: Inability to make meaningful progress towards the goals that have been set

To confront these problems, the team at Change.org used a few different tools that stood out to me. They pulled some of the obvious efficiency levers, like automating tasks, but the strategic and organizational decisions are the most interesting.

Avalanche

As a former product owner, one of the main frustrations I regularly encountered was the inability to make forward progress because of all the work needed to “keep the lights on,” as it's commonly phrased. Change.org made the conscious decision to focus solely on creating new features for a limited, focused period of time. That meant no fixing features that broke (other than P1s, I assume), and no DevOps work on performance tuning. The whole organization was fully focused on features.

One anecdote that stuck with me was that two college hires ended up working on a feature with the CTO. When you have an entire organization aligned behind a single goal, people naturally move across, up, and down traditional power structures to achieve it, which seems to be a huge gain in creating innovative solutions. Beyond delivering more towards that single goal, it also gives people the ability to see the pain points and issues other teams are dealing with.

Once the avalanche was done, Nick mentioned that DevOps had a much better understanding of the issues facing the application development team, because those DevOps engineers had been doing feature work on the application. With that understanding, they were able to take those learnings and implement solutions that made a big impact within the organization.

Shots on Goal

Anyone who has worked in an agile environment should be aware of this, but the effect it had on the organization is worth mentioning, specifically how deliberately the discussion was had with leadership. Nick realized they were entering an entirely new market, so it was unlikely the organization could accurately predict which features new consumers would adopt without releasing many features and finding out what worked.

To that point, Nick organized his team to focus on flexing their speed muscle rather than quality. Why deliver 10 perfect features when you could deliver 100 working features and have a much higher chance the organization will strike gold? That strategy led to multiple successes in the new product line and an ability to run experiments faster than they thought possible under the previous paradigm, which tried to achieve both speed and quality. When a team is focused on an immense effort, making demonstrable progress is usually one of the hardest things. Focusing on speed enabled much faster feedback loops, with some of these features contributing to more than 20% revenue growth from the new product line.

OKR

The Objectives and Key Results methodology. Google used it, Intel used it, and so have countless other organizations. It's not reasonable to think an organization can deliver an entirely new product line and business model without complete organizational focus. Using OKRs, Change.org was able to create that focus. As a result, the teams were able to deliver on what was meaningful and impacted the bottom line, rather than working towards multiple objectives and goals that caused people to pull against one another.


Great book which provides a history and overview of OKRs

Having worked in multiple organizations, I've noticed that people don't know what to do when there are too many goals. Being able to walk into a room and point to a single goal means the organization can coordinate and implement more easily. When there is no clear goal, unnecessary roadblocks and fiefdoms tend to spring up, pulling the organization in different directions and getting in the way of its goals.

Operating teams using the above three strategies worked for Change.org. Using some of these strategies has worked for other large and small organizations, and from what I’ve seen thus far in my career, could be used in many other organizations to transform the way products are delivered and impact the bottom line.

Using Redash for Visualization

Over the winter break I had a conversation with my cousin about the awesomeness of Tableau and all it offers. While Tableau is a best-in-class product, he raised a couple of valid points against it.

  1. Tableau uses its own proprietary language and functions for a lot of aggregations and advanced functionality that could be done in SQL. SQL-based tools (he referred to Metabase explicitly, among others) are better because most analysts already know the language and can pick the tool up easily.
  2. Tableau isn't open source. With an open-source tool, if a feature doesn't exist and I know the language, I can add it myself (depending on my ability to code in the respective language).

That got me looking at open-source reporting and dashboarding tools that are heavily SQL-based. I cracked open both a Metabase and a Redash instance to play around with. Metabase had a good number of features available, but an extremely limited number of rows that could be ingested on the free tier.


Heroku – Different pricing tiers

So, not wanting to spend any money upfront, I went over to Redash and started building my first dashboards using the free trial. Needless to say, I fell in love with the tool instantly, and then my love was just as instantly tempered by Redash's limitations. Below is the dashboard I created; unfortunately it can no longer be accessed since the free trial period ended, but it appeared as pictured.


Redash dashboard for Steam and Sony Marketplace price changes

Running through the charts, you can see the game and pricing data that has been collected by my scrapers. Scrolling through the different charts you’ll see the following:

  1. Largest Price Drops and Increases: Shows, by day, the maximum price increase and the largest decrease, along with the average price change for all items that changed price that day.
  2. Average Price Change: The average new price and average previous price for all items in the Steam Marketplace and Sony Play Store that had a price change that day. In green is the average change between the old and new price.
  3. Most Recent Price Decreases: The last 20 items which have decreased in price, and associated data.
  4. Most Recent Price Increases: The last 20 items which have increased in price, along with associated data.

At the bottom of each of these panels you’ll see when the data last was refreshed from the database.

Now, onto the original reason for writing this article: the pros and cons of building dashboards in this tool.

Pros

  1. No row limit encountered on the free trial! What a great feature. My first requirement was being able to ingest a large amount of data and do aggregations on it. Not running into a hard row limit with my small 40,000-record data set, as I had elsewhere, originally sold me on this tool.
  2. Easy to get up and running. Setting up the tool and making use of the connectors already present was extremely easy.
  3. SQL interface is extremely intuitive to anyone who has used SQL Server Management Studio/PgAdmin, or any other database querying GUI tool.
  4. Refreshes are extremely easy to schedule and reliable.

Cons

  1. Pricing. The software is free if self-hosted, but if hosted through Redash.io like I did, the price is $49 a month on the lowest tier. Hosting Metabase on Heroku is just as easy and cheaper in the long term for small side projects.
  2. No aggregations can be done within visualizations, which in my view is a must-have for any dashboarding/reporting tool. Redash forces you to push that logic into the SQL, which results in redundant, complex queries. It also forces you to pre-aggregate, so the “no row limit” feature in the pros section matters less in practice.
  3. Visualization features are basic and limited. Other dashboarding tools give you stacked bars, tooltips, color-scheme choices, and generally more customization options, which are either unavailable or limited compared to a tool like Tableau.


Redash query editor. Combine multiple query visualizations to create a dashboard.

Redash feels like a tool meant for another use case, perhaps one where you need a basic tool for monitoring ETL or some other system. Using Redash was a good experience, but something like Metabase is the tool for me at the moment. Because all calculations in Redash have to be pushed down into SQL and can't be done as aggregations in the tool, and given the pricing, it doesn't suit my use case.

Re-establishing a Broken Cloud

This week, I opened Tableau to connect to my Amazon RDS instance and noticed the connection wasn't working. Logging into the AWS console, I found my RDS instance had disappeared (along with all the data in it). Perusing my emails, I noticed an unpaid bill from Amazon from about a month prior. So, along with the instance no longer running, I had lost all the data I'd been collecting over the past three months, which is more than slightly disappointing.

This does present an opportunity, though. My EC2 instance is still running and has been trying to push data to a server that no longer exists, meaning I need to set the RDS instance back up. This was a chance to document setting up a new RDS instance on AWS from scratch, with all the necessary users, objects, and privileges, and to record how long it took.

Here's the process from start to finish:

Start: 4:12 pm

First step: logging in and getting the instance created. You'll notice during this step that I flip away from the free tier to get more storage, then flip back. Why pay for increased storage I won't need for a couple of weeks? All I need to do to up the storage later is change a configuration setting, which will take my RDS instance down for a couple of minutes.

Second step: making sure security settings are set up. After my first project a couple of years ago, when my web server was destroyed by a script kiddie, I now only open specific ports (which I should have been doing all along).

Third step: I should be able to log in to the server using the account I set up as admin. Once I'm in, all I have to do is execute the create scripts I have.

The way I created the DDL for my tables and schemas means I can copy and paste it into a query window in pgAdmin 4 and execute the scripts. You'll notice a couple of semicolon issues, which I resolved.

Finally, it looks like everything has been created. I just need to validate that my different accounts can log in to the server and have the appropriate privileges, which they did.

Finish: 5:17 pm

Successfully back up: connections exist from my EC2 instance, with no alteration to any code on the server!

This process didn't require any alteration of the EC2 instance, and it took me from a web server scraping the internet and sending files into the ether (nowhere) to having a full database stood up with all its objects. It was done in a little over an hour, and roughly 30 minutes of that was spent copying and pasting SQL into the query editor, executing it, testing that objects and configuration were correct, and fixing minor syntax issues, all of which could be automated away.
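As a sketch of how that last ~30 minutes could be automated, something like the following could run every create script in order against the new instance. The host, credentials, and file layout are placeholders, not my actual setup.

import glob
import psycopg2

# Placeholder connection details for the rebuilt RDS instance
conn = psycopg2.connect(
    host='my-rds-instance.abc123.us-east-1.rds.amazonaws.com',
    dbname='scraper',
    user='admin_user',
    password='********'
)

with conn:
    with conn.cursor() as cur:
        # Run each DDL script (schemas, tables, privileges) in file-name order
        for script in sorted(glob.glob('ddl/*.sql')):
            with open(script) as f:
                cur.execute(f.read())
conn.close()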

I had been debating whether to get a personal server for my projects, but in my mind this firmly cements the cloud as the better choice for infrastructure. Compared to my experience setting up a local SSAS and SQL Server instance, this took about 10% of the time and was extremely easy to get running.

A Foray Into Serious Scraping

It's been a while since my last post. Getting married, honeymooning, buying a house, and so on took away the time I had for this. But all of that is nearing its end, so I'm getting back into a regular cadence of working on the scraping project. Since I'm back into it and have everything up and running, the most sensible place to start is the architecture that was implemented.

The Problem:

The most sensible place to start any discussion of architecture is clearly stating what the system is supposed to do. I need a system that accomplishes three things.

  1. I need a system that is able to reliably scrape data from any website or consume data from any source.
  2. I need a place where this data can be loaded and reported on in a cohesive format.
  3. I need the product to be lightweight in terms of storage and CPU so I don't have to pay out the wazoo.

Overview:


Overview of the architecture, from inputs to database inserts.

To meet these requirements, a straightforward architecture was implemented. Using Amazon Web Services, both an EC2 instance and an RDS instance were set up, with the EC2 instance running Ubuntu and the RDS instance running PostgreSQL. In sequential order, here is how the scraper works.

  1. Using Python's Scrapy library, we've written Scrapy projects that target specific sources and bring in data based on the HTML of those websites. Right now we've targeted two sources, but we can expand to as many as needed. The Scrapy spiders are scheduled through Scrapyd, a framework that not only allows scheduling and management of spiders, but also offers better performance because it runs on Twisted, making it asynchronous.
  2. As the spiders run, they output JSON files on the server. The driver here is to have a place to drop the scraped data so it won't be lost if something happens to one of the processes.
  3. A Python class was written with Psycopg2 in a way that is meant to be extensible for future data sources. The idea is that as the data model and data sources change or expand, the only thing that needs to change is the class itself; none of the scripts that call the class to insert data from existing sources will need to change. (A rough sketch of this idea follows the list.)
  4. A staging area was created within the RDS PostgreSQL instance to ingest the raw data from each source. Where possible, a unique index was created that checks for changes before accepting data into the staging area. Since the scrapers hit their sources repeatedly, we're going to grab the same data over and over; what we're interested in are the changes, especially new items or price changes. We also want the architecture to be as efficient as possible, so storing only the data we're interested in just makes sense.
  5. Once data has been accepted into the landing zone, the Ubuntu instance schedules a slew of ETL jobs written in SQL and passed to PostgreSQL for execution through Psycopg2. PostgreSQL doesn't have a native scheduler readily available, so we use Ubuntu's crontab to execute a script for each source, which calls into a class containing all of our ETL functions. The end result is a 3NF model populated with data and the appropriate relationships in place.
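
To make the extensibility idea in step 3 concrete, here is a rough sketch of what such a loader class could look like. The table name, columns, and conflict handling are placeholders rather than the real data model.

import json
import psycopg2

class StagingLoader:
    def __init__(self, connection_string):
        self.conn = psycopg2.connect(connection_string)

    def load_json_file(self, file_path, staging_table):
        # Insert each scraped record, letting the unique index reject rows we already have;
        # the table and column names here are placeholders
        with open(file_path) as f:
            records = json.load(f)
        with self.conn, self.conn.cursor() as cur:
            for record in records:
                cur.execute(
                    f"INSERT INTO {staging_table} (item_name, price, scraped_at) "
                    "VALUES (%s, %s, %s) ON CONFLICT DO NOTHING",
                    (record['name'], record['price'], record['scraped_at'])
                )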

So, it's now up and running, and data is flowing into the objects. The data is populating all the tables where it's expected, and I could begin reporting price changes today. The best part? All of this was built with $0 of infrastructure spend on Amazon Web Services (and a lot of my time). I am running out of storage rapidly (the 20 GB from the free tier ran out over a couple of days), and the CPU is not beefy at all, so it stalls if more than one scraper runs at a time (as pictured below).


Performance goes down dramatically in yellow. In red, my scraper has been blocked from accessing the site (which didn't happen before the refactor…I'll go into that another time)

Referring back to the original goal, I would say it has been achieved. That's not to say it couldn't be improved and optimized, but overall, the first serious foray into scraping seems to have gone well. Feel free to reach out with any questions or suggestions!

What Should Documentation Be?

The problems Business Intelligence teams solve are generally the same: pull some data out of somewhere, synthesize it, analyze it, then create a picture of what is happening, what is going to happen, or what has happened. Working in small and large organizations, I've had the pleasure of seeing a variety of processes used to deliver these insights. They range from the overbearing, with documentation requirements that crush people's productivity, to the ultra-lightweight, which creates quicksand beneath teams' feet through a lack of knowledge transfer.

Having seen both the overbearing and the extremely lightweight, there's one conclusion I've arrived at concerning documentation…

Document as Little as Possible

 


Relevant literature

Don't commit time to things that aren't creating revenue or helping the business. Looking at IT projects, there is no doubt that the more documentation there is, the less value it tends to deliver. The perfect example came up while having a beer with a former co-worker.

 

He brought up that, at the company we had both worked at, the documentation for a change took longer to create than coding, testing, and implementing the change itself. Worse, this painstakingly crafted documentation, which the engineer had to spend time tracking down information for, didn't result in documents that were useful to the team doing the work going forward. The process decreed that you must document X, Y, and Z in order to deploy the change, so that's what was done. The fundamental truth is that the “…benefit of having documentation must be greater than the cost of creating and maintaining it.”

Some people believe the exact opposite of over-documentation: nothing should be documented, and the code or implementation should speak for itself. This may work when you have a small application that will always be managed by the same group of individuals (which rarely happens). Once you reach an application spanning multiple servers, teams, and databases, expecting the code to “speak for itself” in a timely manner to the people who have to report on it and get analytics out is unreasonable.

So, what am I proposing in this rant? The only useful documentation I've seen documents the “Why” and the “How”. Everything else fails to create value for the organization, because the cost to develop and maintain it is too high.

Why

Creating a BI product entails connecting a business process to one or more applications or databases. Whatever environment you're working in (Inmon, Kimball, or something else entirely), you need to know why the things in your system exist. The “Why” matters not only at a high leadership level, but also at the low level of technical implementation. A “Why” statement at the low level helps ensure a team uses previously created tools and implementations as designed, and that if a change is made that goes against the original “Why”, it is intentional and by design.

As an example, on the Vehicle Profitability by VIN project, the data architect created both Inmon (3NF) and Kimball (dimensional reporting) structures. The “Why” was made extremely apparent through documentation, so the teams knew how to use the current implementation to achieve their goal in the best way possible.

Are you importing new invoice data? That should go into the wholesale invoice structure so that it flows up into the existing fact that contains vehicle revenue information, which our reports feed from. Why? Because we want a single source of truth for vehicle revenue.

When documentation provides the “Why” for technical implementations, it makes adding to and changing existing processes and assets easier, as opposed to reinventing the wheel over and over.

How


basic data flow diagram

Once we know why something exists, the other piece of useful documentation is the “How”. The “How” shouldn't be step-by-step instructions; it should function like a high-level map. Data flow diagrams are a great example of “How” documentation that I've found useful for Business Intelligence products. Armed with a data flow diagram and the “Why” of the design, team members who need to report on, extend, maintain, or refactor a system can make informed decisions.

 

 

Make It Useful

At the end of the day, writing documentation takes time away from creating code, analysis, and direct business value, so the argument for it is hard to make to someone who hasn't experienced the pains of its absence: architecture that makes no sense, misreported numbers, time wasted building processes that do exactly what existing processes already do.

Without documentation, maintaining and using a system or process as intended is impossible. With documentation that is accessible, searchable, and focused on the “How” and “Why”, organizations can make smart, informed decisions about where to spend time, how to tweak things, and how to get value from their assets.

 

Data Day Texas 2017 – A Few Thoughts

Earlier this month I had the opportunity to attend Data Day Texas, and thought that it would be worthwhile to jot down a few thoughts. For those that aren’t aware of Data Day Texas, think of it as a gathering of nerdy IT people and Data Scientists. It was an interesting weekend with a wide range of topics that encompassed everything from machine learning algorithms to more approachable subjects like data dashboarding.

Graphs Are Here

There's a reason the keynote by Emil Eifrem was named “The Year of the Graph”. Looking at the popularity trend on db-engines.com, you can see a large gain in the popularity of Neo4j, which leads naturally to the question: so what?

Neo4j Popularity

I think the major selling point for graph databases, beyond performance on certain types of analytics, is that they are defined by the relationships between data. This is in opposition to the traditional RDBMS approach, which requires explicitly defining the tables in the schema, with relationships as an optional afterthought in most cases. When constructing a graph database, you explicitly define the nodes (core pieces of data) and the edges (relationships between nodes), so you are enforcing the relationships between data rather than just the structure of the data itself. This creates another level of abstraction between the end user and the data, which should make the data more approachable. Oh, and if you haven't guessed, graph databases are schema-less, which is a plus in many cases.
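As a small illustration of that relationship-first mindset, here is a toy example using the official Neo4j Python driver; the connection details, labels, and relationship name are all made up for the sketch.

from neo4j import GraphDatabase

# Toy example: the model is just nodes plus an explicitly named relationship between them
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))

with driver.session() as session:
    session.run(
        "MERGE (p:Person {name: $person}) "
        "MERGE (c:Company {name: $company}) "
        "MERGE (p)-[:SHAREHOLDER_OF]->(c)",
        person='Jane Doe', company='Example Holdings Ltd'
    )
driver.close()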

Issues Are Similar Across Companies/Technology

In particular, two talks hit this point home. The first, given by Chris LaCava from Expero Inc., discussed visualization techniques for graph databases. The second was Stefan Krawczyk's discussion of how Stitch Fix sets up its environment for data scientists.

What's the root of this? People want to use the tools that work and that they like. Chris LaCava discussed how to do visualization on graph databases. While graph databases can meet some cool use cases as far as data sets and real-time analytics go, what he presented was a straightforward, common-sense approach to dashboarding. Anyone familiar with Business Intelligence and dashboarding should roughly be following the process pictured, or something close to it.

Look familiar? From Chris’ presentation on graph database dashboarding

Stefan‘s talk was all about using Docker to let data scientists use the tools they want, a solution to the complaint many of us in the industry have about being locked into a specific tool-set. The differentiator here is that Stitch Fix has done containerization at scale, allowing each of their data scientists to run their own environment with whatever tool-set they favor to deliver business value.

The Story is What Makes Things Interesting

The final point, which I've written about before, is that the story is what makes things interesting. The specific story presented at Data Day? The Panama Papers, and how Neo4j was used to discover the unlikely connection that led to the downfall of a prime minister. It was the best marketing I have ever seen for a database. Having a database GUI that allows easy, native exploration of the data? That's a game changer.


The comparison showed a traditional RDBMS GUI (SQL Server Management Studio) versus Neo4j's GUI. There's a reason people don't pull up SQL Server Management Studio to tell a story. Having a database platform that can tell a story about the data on its own is an awesome approach.

 

A Framework for Innovation

Creating change. A fun subject, and an admirable goal according to the American ethos and the media our society has spawned. Even though innovative ideas may go against the grain or the way things are currently done, many consider it a virtue to pursue them. With so much positive emphasis on innovation in our culture, why is it so difficult?

The macro effect of innovation on a company

There are thousands (if not millions) of reasons why, and I couldn't hope to answer that in a short blog post. But looking at the reverse question, “how do successful innovations occur?”, many insights are available. There are many theories with many names, but reading through them, what I've found boils down to two things:

  1. Let people know how the innovation will affect them
  2. Make things easy for the people who are going to be using the innovation

The book Switch lays out the best framework (in my opinion) for accomplishing these two goals.

Framework for Change

The basic idea is that every person can be pictured as a rider on top of an elephant going down a path. To change where the rider ends up, you can alter three things. You may have guessed it: the rider, the elephant, and the path.

The Rider: In the metaphor, this is the logic. Everyone has logic (although some riders are weaker than others) that helps shape how they behave. The rider is the part of the person who, when starting a new habit like running in the morning, sets the alarm.

The Elephant: Emotions and subconscious drive. At the end of the day, the elephant dictates where the rider goes. The average ~150-pound rider can only control a 13,000-pound animal for so long before becoming exhausted.

The elephant is the reason the planned morning run gets cancelled by multiple pushes of the snooze button. The elephant is also the reason people work 90-hour weeks and are excited to do so.

The Path: The final component for creating change is the external environment in which every individual operates. These are the external forces that affect behavior. Shaping those forces, and how they act upon elephants and their riders, gets the rider moving towards the desired destination.


All in all, it's pretty straightforward, right? Well, it is definitely much easier to conceptualize and talk about than it is to implement. So many people, myself included, fail to address all three at the same time, leading to great ideas falling by the wayside.

For those who want to change things for the better, hopefully this framework can help you turn innovation into reality.

Down with Vertical Database Architecture

The goal of gathering data can be broken down into a combination of three things: understanding what has happened, what is happening, or projecting what will happen. When getting answers to these questions, as long as the answer is obtained, why does it matter how it was obtained?

Getting a view into this information can be done many different ways, and with the products available on the market it can be done for free and with minimal IT know-how. There is, however, a time and a place to pay a premium on IT projects to obtain capabilities none of your competitors have: when a solution needs to be scalable and tailored to your unique needs.

This is when architecture comes in.

Just like any structure, a database architecture can be flat or tall. What is the difference? To run with the analogy of comparing database architecture to buildings, a skyscraper (vertical) is much more complex to build and maintain than a house (horizontal).

Horizontal

A horizontal architecture can be pictured like a house in the suburbs: commissioned by you, easily customizable, and suited to your needs and wants. Do you want a pool? Easy. A larger living room or a smaller kitchen? That can be done.

Bringing the analogy back to databases, a flat architecture means your data is served from a single layer, or as few layers as possible. It is much easier to understand how the wiring, plumbing, and lighting were put into a house than into a skyscraper. And when you want to install a pool, it's far easier to install and maintain in a backyard than on the 23rd floor of a high-rise.


Vertical

A vertical architecture means many structural layers in the database, and with them comes complexity. The difference between a physical skyscraper and a database? Skyscrapers are generally built when there is no more land to build flat on; that constraint doesn't apply to databases.

Why would anyone build a vertical architecture, then? In my experience, it comes from time and resource constraints (two-thirds of the magic time-resource-quality triangle) and short-term thinking.

Benefits of Horizontal

  1. Decreased cost: Fewer people needed to maintain complex solutions, and more time spent creating value for you.
  2. Higher quality: More visibility into what is happening where. Instead of having to dig through and learn how everything was built, people working with the data only have to learn the specific portions they are interested in.
  3. Faster delivery: The final win for a flat architecture is speed of delivery. By reducing complexity, people spend less time learning and more time creating value. While you may save time in the short term with a vertical architecture, you will pay dearly in the long term.