Getting Started – Snowflake and S3

Snowflake has been adopted quickly across organizations and geographies over the past few years, with exponential growth in both its customer base (~100% growth in the 2019 fiscal year) and its revenue (a 257% year-over-year increase in 2019). There has to be something amazing behind this product besides great marketing, right?

With that in mind, I figured it would be a good use of time to take a look. One approach when assessing new tools is to focus on a few different use cases and evaluate how the tool stacks up against the competition. With data and analytics tools, one of the first questions is always "can I get the data where it needs to go quickly?" In this case, we'll be loading a basic pipe-delimited data set, previously used to populate an old Tableau dashboard, from an S3 bucket into Snowflake.

For the purposes of this blog post we're going to focus on bulk loading, since the most basic use case for many data warehousing initiatives is a nightly load or a similar non-continuous schedule. Additionally, we're going to use an S3 bucket that is external to Snowflake for this attempt. For reference, Snowflake's three main approaches to loading data are:

  1. External Tables
  2. Bulk Loading
  3. Continuous Loading

Step 1: What data to load?

In this case, I grabbed the pipe-delimited text file and dropped it into an S3 bucket. The complexity will come in managing my VPC and transferring the data from my S3 bucket into Snowflake's internal S3 storage.

File loaded in Snowflake. Note the size of the file in S3.

Step 2: Grant Snowflake Access to S3

Snowflake exists outside of my main AWS account's VPC, so I need to grant my Snowflake account access to my S3 bucket in order to copy the data into Snowflake. One important thing to note is that it's definitely a good idea to keep your S3 data in the same region as your Snowflake instance so that your data/traffic stays internal to the AWS network (non-public).

Integration flow for an S3 Stage.

Per the Snowflake documentation there are three main options for performing this piece of work. The recommended option is to configure a Snowflake Storage Integration, so that we can avoid supplying credentials every time we load data.

The creation of a policy on my AWS account is the first thing that’s needed.

On the left is my policy; on the right, Snowflake's standard recommendation for the policy.

 

Luckily, the process is straightforward. Snowflake goes as far as providing the exact JSON that needs to be pasted in. The main difference between what Snowflake recommends and the policy I created is the use of the wildcard "*" in the s3:prefix element. It's fine for my prototyping use case, but if you have more than 1,000 objects in the bucket, a "*" will cause an error when trying to read/copy data into Snowflake.

The second item that needs to be completed is the configuration of an IAM user that will allow your Snowflake account to access the specified S3 bucket. This is the same process used to allow two separate AWS accounts to access one another's resources because, well, that's exactly what's occurring, with Snowflake being hosted on AWS. Details can be found in Snowflake's documentation, but at the end of the day you'll end up with an IAM user with the policy we created earlier attached to it.

Step 3: Setting up a Snowflake Integration

The creation of the integration is what we were striving for all along: it gives us a way to access our S3 bucket without having to log in or supply credentials in Snowflake. If we don't create this integration, then credentials will have to be supplied each time data is loaded.

Integration creation, which will be used to create Stages.
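For reference, the statement behind that screenshot looks roughly like the following. This is a sketch rather than my exact command; the integration name, role ARN, and bucket are placeholders:

    -- Sketch of the storage integration; name, ARN, and bucket are placeholders.
    CREATE STORAGE INTEGRATION s3_integration
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = 'S3'
      ENABLED = TRUE
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake_access_role'
      STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/');

    -- DESC INTEGRATION returns the AWS user and external ID needed to finish
    -- the trust relationship on the AWS side.
    DESC INTEGRATION s3_integration;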

Step 4: Create the Stage

Finally, we have the ability to connect to our S3 bucket from our Snowflake account. We can now create the Stage that's needed in Snowflake to actually load the data from the S3 bucket into our Snowflake tables. There are many options to customize the stage, but for this basic example I know that I have a text file with "|" as a delimiter and a header row.

Simple stage creation for loading a CSV-formatted file.
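The stage definition behind that screenshot would look something like this; the stage name, bucket, and file format options are placeholders matching my pipe-delimited file with a header row:

    -- Sketch of the stage; it ties the integration, the bucket, and the file format together.
    CREATE STAGE my_s3_stage
      STORAGE_INTEGRATION = s3_integration
      URL = 's3://my-bucket/'
      FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' SKIP_HEADER = 1);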

Step 5: Create the Table

Pretty straightforward. We need somewhere for the data to live once it's copied into Snowflake from S3, and everyone should be familiar with creating a table using standard SQL. One of the great parts of dealing with Snowflake is the automatic partitioning and optimization right out of the box. So I used the most basic DDL, with every column as a string, to ensure no data is dropped or skipped during loading due to data type mismatches.

DDL for table creation, based on the structure of the CSV. All string data types for ease of loading.
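A minimal sketch of that DDL; the column names here are made up, since they just mirror the file's header, and everything is a string so nothing gets rejected on load:

    -- Illustrative table; real column names come from the file's header row.
    CREATE TABLE dashboard_data (
        region     VARCHAR,
        product    VARCHAR,
        sale_date  VARCHAR,
        quantity   VARCHAR,
        revenue    VARCHAR
    );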

Step 6: Create the Snowpipe

The final step is to get the data loaded from our S3 bucket into Snowflake's internal S3 storage. This is done by creating the Snowpipe that performs the load.

Creating a Snowpipe that automatically loads the data upon creation.
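A pipe is essentially a named wrapper around a COPY INTO statement. A minimal sketch, reusing the placeholder objects from the earlier steps:

    -- Sketch of the pipe: copy everything the stage can see into the table.
    CREATE PIPE dashboard_pipe AS
      COPY INTO dashboard_data
      FROM @my_s3_stage;

    -- For a one-off bulk load, the COPY INTO statement can also be run on its own.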

Final Outcome:

Our data has landed!

As we see when running a query against our table, the data is now present in Snowflake and available for querying. Additionally, the file that was loaded into Snowflake still exists in our S3 bucket for use in any QA or validation that we want to perform. In our final state, we can also see that Snowflake automatically compressed our data from over 50MB to 15.49MB without any manual intervention.

With automatic compression, the file now takes up roughly 30% of the storage space that the raw file does on S3.

Going forward, we'll be exploring some more interesting and impressive capabilities of Snowflake. But even starting here, you can see the ease of use: going from no infrastructure to data feeding into an enterprise-scale data warehouse in less than a couple of hours.

Using Redash for Visualization

Over the winter break I was having a conversation with my cousin about the awesomeness of Tableau and all it offers. While Tableau is a best-in-class product, he raised a couple of valid points against it.

  1. Tableau uses its own proprietary language and functions for a lot of aggregations and advanced functionality that could be done in SQL. SQL-based tools are better, he argued (referring explicitly to Metabase, among others), because most analysts already know the language and can therefore pick the tool up easily.
  2. Tableau isn't open source. With an open source tool, if a feature doesn't exist and I know the language, I can add it myself (depending on my ability to code in the respective language).

That got me down the path of looking at open source reporting and dashboarding tools that are heavily SQL-based. I cracked open both a Metabase and a Redash instance to play around with. Metabase had a good number of features available, but an extremely limited number of rows that could be ingested by the tool on the free tier.

Heroku – different pricing tiers.

So, not wanting to spend any money upfront, I went over to Redash and started building my first dashboards using the free trial. Needless to say, I fell in love with the tool instantly; then my love was tempered by other limitations present in Redash. The dashboard I created can unfortunately no longer be accessed because the free trial period ended, but it appeared as pictured below.

Redash dashboard for Steam and Sony Marketplace price changes.

Running through the dashboard, you can see the game and pricing data that has been collected by my scrapers. Scrolling through the different charts, you'll see the following:

  1. Largest Price Drops and Increases: Shows, by day, the largest price increase and the largest decrease, along with the average price change for all items that had a price change that day.
  2. Average Price Change: The average price for items that changed price, along with the average previous price, for all items in the Steam Marketplace and Sony Play Store that had a price change that day. In green is the average change between the old price and the new price.
  3. Most Recent Price Decreases: The last 20 items which have decreased in price, and associated data.
  4. Most Recent Price Increases: The last 20 items which have increased in price, along with associated data.

At the bottom of each of these panels you'll see when the data was last refreshed from the database.
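To give a sense of what sits behind those panels, the daily summary is driven by a query along these lines; the price_changes table and its columns are hypothetical stand-ins for my actual schema:

    -- Hypothetical daily price-change summary; table and column names are illustrative.
    SELECT
        change_date,
        MAX(new_price - old_price) AS largest_increase,
        MIN(new_price - old_price) AS largest_decrease,
        AVG(new_price - old_price) AS avg_price_change
    FROM price_changes
    GROUP BY change_date
    ORDER BY change_date DESC;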

Now, onto the original reason for writing this article: the pros and cons of using and building dashboards in this tool.

Pros

  1. No row limit encountered on the free trial! What a great feature. My first requirement was being able to ingest a large amount of data and do aggregations on it. Not hitting a hard row limit with my small 40,000-record data set is what originally sold me on this tool.
  2. Easy to get up and running. Setting up the tool and making use of the connectors already built into it was extremely easy.
  3. SQL interface is extremely intuitive to anyone who has used SQL Server Management Studio/PgAdmin, or any other database querying GUI tool.
  4. Refreshes are extremely easy to schedule and reliable.

Cons

  1. Pricing. The software is free if self-hosted, but if it's hosted through Redash.io like I did, the price is $49 a month on the lowest tier. Hosting Metabase on Heroku is just as easy, and cheaper in the long term for small side projects.
  2. No aggregations can be done within visualizations, which in my view is a must-have in any dashboarding/reporting tool. Redash forces you to push that logic into SQL, which results in redundant/complex queries. It also forces you to pre-aggregate, so the "no row limit" feature in the pros section no longer applies.
  3. Visualization features are basic/limited. Other dashboarding tools offer stacked bars, tooltips, custom color schemes, and generally more customization options, which are either unavailable or limited compared to a tool like Tableau.

Redash query editor. Combine multiple query visualizations to create a dashboard.

Redash feels like a tool meant for another use case, perhaps one where you need a basic tool for monitoring ETL or some other system. Using Redash was a good experience, but something like Metabase is the better tool for me at the moment. Because Redash requires all calculations to be pushed down into SQL rather than aggregated within the tool, and because of the pricing, it doesn't suit my use case.

Re-establishing a Broken Cloud

This week, I cracked open Tableau to log into my Amazon RDS instance and noticed that the connection wasn't working. Logging into the AWS console, I found my RDS instance had disappeared (along with all the data in it). On perusing my emails, I noticed an unpaid bill from Amazon sitting in my inbox from about a month prior. So, along with the instance no longer running, I had lost all the data I'd been collecting over the past three months, which is more than slightly disappointing.

This does present an opportunity though. My EC2 instance is still running and has been trying to push data to a server that no longer exists, meaning I need to set the RDS instance back up. This was an opportunity to document setting up a new RDS instance on AWS from scratch, with all necessary users, objects, and privileges, and to record how long it took.

Here's the process from start to finish:

Start: 4:12 pm

First step, logging in and getting the instance created. You’ll notice during this step that I flip from free-tier to get more storage, then flip back to free-tier. Why pay more money to get increased storage I won’t need for a couple weeks? All I need to do to up the storage is change a configuration which will cause my RDS instance to be down for a couple minutes.

Second step, making sure security privileges are set up. After my first project a couple of years ago, when my web server got destroyed by a script kiddie, I now only open specific ports (which I should have been doing all along).

The third step: I should be able to log in to the server using the account I set up as admin. Once I'm in, all I have to do is execute all the create scripts I have.

The way I created the DDL for my tables and schemas means I can copy and paste the scripts into a query window in pgAdmin 4 and execute them. You'll notice I had a couple of semicolon issues to resolve along the way.
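The scripts themselves are nothing exotic; roughly this shape, with the schema, user, and table names below being placeholders rather than my actual objects:

    -- Illustrative create scripts: schema, table, application user, and grants.
    CREATE SCHEMA staging;

    CREATE TABLE staging.raw_prices (
        item_name   TEXT,
        price       TEXT,
        scraped_at  TIMESTAMP
    );

    CREATE USER scraper_app WITH PASSWORD 'not-my-real-password';
    GRANT USAGE ON SCHEMA staging TO scraper_app;
    GRANT SELECT, INSERT ON ALL TABLES IN SCHEMA staging TO scraper_app;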

Finally, it looks like everything has been created. I just need to validate that my different accounts can log in to the server and have the appropriate privileges, which they did.

Finish: 5:17 pm

Connections from my EC2 instance are back up, with no alteration to any code on the EC2 server!

This process did not include any alteration of the EC2 instance and took me from a web server scraping the internet and sending files into the ether (nowhere) to having a full database stood up with all its objects. It was done in a little over an hour, and ~30 minutes of that was spent executing SQL, copying and pasting SQL into the query editor, testing to ensure objects and configuration were correct, and fixing minor syntax issues. All of which could be automated away.

I was debating whether or not to get a personal server for my projects, but in my mind this firmly cements the cloud as the better choice when it comes to infrastructure. Compared to my experience setting up a local SSAS and SQL Server instance, this took about 10% of the time and was extremely easy to get running.

A Foray Into Serious Scraping

It's been a while since my last post. Getting married, honeymooning, buying a house, etc. took away the time I had for this. But all of that is nearing its end, so I'm getting back into a regular cadence of working on the scraping project. Since I'm now back into it and have everything up and running, I figured the most sensible place to start is the architecture that has been implemented.

The Problem:

The most sensible place to start any discussion of architecture is clearly stating what the system is supposed to do. I need a system that accomplishes three things:

  1. I need a system that is able to reliably scrape data from any website or consume data from any source.
  2. I need a place where this data can be loaded and reported on in a cohesive format.
  3. I need the product to be lightweight as far as storage space required and CPU so I don’t have to pay out the wazoo.

Overview:

Scraper Architecture

Overview of the architecture, from inputs to database inserts.

In order to meet this, a straightforward architecture was implemented. Using Amazon Web Services, both an EC2 instance and an RDS instance were set up, the EC2 instance running Ubuntu and the RDS instance running PostgreSQL. In sequential order, here is how the scraper works:

  1. Using Python's Scrapy library, we've written Scrapy projects that target specific sources and bring in data based on the HTML of those websites. Right now we've targeted two, but we can expand to as many as needed. These Scrapy spiders are scheduled through Scrapyd, a framework that not only allows for scheduling and management of spiders, but also offers better performance by running on Twisted, making it asynchronous.
  2. As the spiders run, they output to JSON files on the server. The driver here is to have a place to drop the output on the server so that data won't be lost if something happens to one of the processes.
  3. A Python class was written with Psycopg2 in a way that is meant to be extensible for future data sources. The idea being that as the data model and data sources are changed or expanded upon, the only thing that will need to change is the class itself. None of the scripts that call the class to insert data from our existing data sources will need to change.
  4. A staging area was created within the RDS PostgreSQL instance, which ingests the raw data from each data source. Where possible, a unique index was created that checks for changes before accepting the data into the staging area (see the sketch after this list). Since we have scrapers hitting the same sources repeatedly, we are going to keep grabbing the same data; what we're interested in are the changes, especially new items or price changes. We also want the architecture to be as efficient as possible, so storing only the data we're interested in just makes sense.
  5. Once data has been accepted into the landing zone, the Ubuntu instance is used to schedule a slew of ETL jobs written in SQL and passed to PostgreSQL for execution using Psycopg2. PostgreSQL doesn't have a native scheduler readily available, so we use Ubuntu's crontab to execute a script for each of our sources, which calls a class containing all of our ETL functions. The end result is a 3NF model populated with data and the appropriate relationships.
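As a rough sketch of the pattern described in steps 4 and 5 (table and column names are invented for illustration), the unique index is what lets repeated scrapes land without storing duplicates:

    -- Hypothetical staging table; the unique index rejects rows we've already captured.
    CREATE TABLE staging.item_prices (
        source      TEXT,
        item_id     TEXT,
        price       NUMERIC,
        scraped_at  TIMESTAMP DEFAULT now()
    );

    CREATE UNIQUE INDEX ux_item_prices
        ON staging.item_prices (source, item_id, price);

    -- Inserts issued by the psycopg2 class can then skip duplicates outright.
    INSERT INTO staging.item_prices (source, item_id, price)
    VALUES ('steam', '12345', 19.99)
    ON CONFLICT DO NOTHING;

Only rows where something actually changed make it past the index, which is exactly the behavior described above.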

So, it's now up and running, and data is flowing through into the objects. The data is populating in all the tables where it's expected, and I could begin reporting price changes today. The best part? All of this was built using $0 of infrastructure from Amazon Web Services (and a lot of my time). I'm running out of storage space rapidly (the 20 gigs from the free tier ran out over a couple of days), and the CPU is not beefy at all, so it stalls out if more than one scraper is running at a time (as pictured below).

Performance goes down dramatically in yellow. In red, my scraper has been blocked from accessing the site (which didn't happen before the refactor; I'll go into that another time).

To refer back to the original goal, I would say it has been achieved. That's not to say it couldn't be improved upon and optimized, but overall the first serious foray into scraping seems to have gone well. Feel free to reach out with any questions or suggestions!

What Should Documentation Be?

The problems Business Intelligence teams solve are generally the same across organizations: pull some data out of somewhere, synthesize it, analyze it, then create a picture of what is happening, what is going to happen, or what has happened. Working in small and large organizations, I've had the pleasure of seeing a variety of processes used to deliver these insights. These range from the overbearing, with documentation requirements that crush people's productivity, to the lightweight, which creates quicksand beneath teams' feet through a lack of knowledge transfer.

Having seen both the overbearing and the extremely lightweight, there's one conclusion I've arrived at concerning documentation…

Document as Little as Possible

 

Relevant literature.

Don't commit time to things that aren't creating revenue or helping the business. Looking at IT projects, there is no doubt that the more documentation there is, the less value there is. The perfect example came up while having a beer with a former co-worker.

 

He brought up that the process at the company we both previously worked at required documentation that took longer to create than coding, testing, and implementing the change itself. Additionally, this painstakingly crafted documentation, which the engineer had to spend time tracking down information for, didn't result in documents that would be useful to the team doing the work going forward. The process decreed that you must document X, Y, and Z in order to deploy the change, so that's what was done. The fundamental truth is that the "…benefit of having documentation must be greater than the cost of creating and maintaining it."

Some people believe in the exact opposite of over-documentation: nothing should be documented; the code/implementation should speak for itself. This may work when you have a small IT application that will always be managed by the same group of individuals (which likely won't happen). Once you reach an application spanning multiple servers, teams, and databases, expecting the code/implementation to "speak for itself" in a timely manner to those who have to report and pull analytics out of it is unreasonable.

So, what’s being proposed in this rant? The only useful documentation that I’ve seen documents the “Why” and the “How”. Everything else doesn’t create value for the organization, as the cost to maintain and develop the documentation is too high.

Why

Creating a BI Product entails connecting the business process to an application(s) or database(s). Depending on the environment that you’re working in, Inmon, Kimball, or something else entirely, you need to know the answer to why things in your system exist. The “Why” is important not only from a high leadership level, but also at a low technical implementation level. The “Why” statement done at the low level helps to ensure that a team is using previously created tools and implementations as designed. And if a change is made that goes against the original “Why”, it is intentional and by design.

As an example, working on the Vehicle Profitability by VIN project, the Data Architect created both Inmon (3-NF) and Kimball (dimensional reporting) structures on the project. The “Why” was made extremely apparent through documentation, so the teams knew how to use the current implementation to achieve their goal in the best way possible.

Are you importing new invoice data? That should go into the wholesale invoice structure so that it flows up in the existing fact that contains the revenue information for vehicles which our reports feed from. Why? Because we want a single source of truth for vehicle revenue.

When documentation providing the "Why" for technical implementations exists, it makes adding on to and changing the existing processes and assets easier, as opposed to re-inventing the wheel over and over.

How

A basic data flow diagram.

So after we know why something exists, the other piece that is useful for documentation is the “How”. The “How” shouldn’t be step by step instructions, it should function like a high level map. Data Flow Diagrams are a great example of “How” documentation that I’ve found useful for Business Intelligence products. Armed with the Data Flow Diagram and the “Why” of the design, team members who need to report on, extend, maintain, or refactor a system will be able to make informed decisions.

 

 

Make It Useful

At the end of the day, documentation gets in the way of creating code, analysis, and direct business value. So the argument for spending time on documentation is hard to make to someone who hasn't experienced the pains associated with the lack of it: architecture that doesn't make sense, misreported numbers, time wasted building processes that do exactly what existing processes already do.

Without documentation, maintaining and using a system or process as intended is impossible. With documentation that is accessible, searchable, and focuses on the “How” and “Why”, organizations can make smart and informed decisions of where to spend time, how to tweak things, and how to get value from their assets.

 

Data Day Texas 2017 – A Few Thoughts

Earlier this month I had the opportunity to attend Data Day Texas, and thought that it would be worthwhile to jot down a few thoughts. For those that aren’t aware of Data Day Texas, think of it as a gathering of nerdy IT people and Data Scientists. It was an interesting weekend with a wide range of topics that encompassed everything from machine learning algorithms to more approachable subjects like data dashboarding.

Graphs Are Here

There's a reason the keynote by Emil Eifrem was named "The Year of the Graph". Looking at the popularity trend on db-engines.com, you can see a large gain in the popularity of Neo4j. Which leads naturally to the question: so what?

Neo4j popularity trend on db-engines.com.

I think the major winning point for graph databases, other than performance on certain types of analytics, is that they are defined by the relationships between data. This is in opposition to the traditional RDBMS approach, which requires explicitly defining the tables in the schema, with relationships as a non-required afterthought in most cases. When constructing a graph database, you are explicitly defining the node (a core piece of data) and the edge (a relationship between nodes), so you are enforcing the relationships between the data rather than just the structure of the data itself. This creates another level of abstraction between the end user and the data, which should make the data in the database more approachable. Oh, and if you haven't guessed, graph databases are schema-less, which is a plus in many cases.
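For contrast, here's a hedged sketch (table and column names invented for illustration) of what the same idea looks like in a traditional RDBMS: the relationship isn't a first-class object, it's another table plus foreign keys that you define yourself and join through.

    -- In an RDBMS the "edge" is modeled as its own table with foreign keys.
    CREATE TABLE person (
        person_id  INT PRIMARY KEY,
        name       TEXT
    );

    CREATE TABLE knows (
        person_id  INT REFERENCES person (person_id),
        friend_id  INT REFERENCES person (person_id),
        PRIMARY KEY (person_id, friend_id)
    );

    -- Traversing the relationship means joining through the edge table.
    SELECT p2.name
    FROM person p1
    JOIN knows k   ON k.person_id = p1.person_id
    JOIN person p2 ON p2.person_id = k.friend_id
    WHERE p1.name = 'Alice';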

Issues Are Similar Across Companies/Technology

In particular, there were two talks that hit this point home. The first was given by Chris LaCava from Expero Inc. in which he discussed visualization techniques with graph databases. The second was the discussion of how Stitch Fix sets up their environment for data scientists to work by Stefan Krawczyk.

What's the root of this? People want to use the tools that work and that they like. Chris LaCava discussed how to do visualization on top of graph databases. While graph databases can meet some cool use cases as far as data sets and real-time analytics go, what he presented was a straightforward, common-sense approach to dashboarding. Anyone familiar with Business Intelligence and dashboarding should roughly be following the process he laid out, or something near to it.

Look familiar? From Chris' presentation on graph database dashboarding.

Stefan's talk was all about using Docker to enable data scientists to use the tools they want to use, a solution to the complaint many of us in the industry have when we're locked into a specific tool-set. The differentiator here is that Stitch Fix has done containerization at scale, which solves the problem by allowing each of their data scientists to run and operate their own environment, with whatever tool-set they favor, to deliver business value.

The Story is What Makes Things Interesting

The final point, which I've written about before, is that the story is what makes things interesting. The specific story presented at Data Day? The Panama Papers, and how Neo4j was used to discover the unlikely connection that led to the downfall of a Prime Minister. It was the best marketing I have ever seen for a database. Having a database GUI that allows for easy exploration of the data natively? That's a game changer.


Comparing a traditional RDBMS GUI (SQL Server Management Studio) with Neo4j's GUI, there's a reason why people don't pull up SQL Server Management Studio to tell a story. Having a database platform that can automatically tell a story about the data is an awesome approach.

 

Working For The Small Guys

I recently left my last job, at what is by some rankings the 23rd largest company in the world, for a much smaller company. The general consensus from those I talked to was that a change of pace would be good experience, and many recommended the shift to a smaller company.

With that said, I think I've settled in enough to pick out the advantages of working for a small company, and luckily for me, it appears that those I confided in gave me good advice. So here we go: about four months in, here are the three best things I've found working at a smaller company.

Clear Goals

Working for a smaller company, especially one that runs lean, means the company needs to use resources as efficiently as possible. This is evident in the fact that everyone knows what the company's current goals are. Instead of being lost in the back of the office doing menial work without visibility into how you are helping the company, it's clear what problems the business is facing and how specific pieces of work fit in to accomplish the larger goal.

In addition, clearly outlined goals and objectives help build the feeling that teams are actually working together. In my experience, large companies have many objectives, and everyone has a different idea of how to achieve them. That isn't a bad thing in itself. What is a bad thing is that, in pursuing these many ideas for solving some overarching problem, different teams sometimes actively work against one another's interests, all in the hope that their solution is the one that gets noticed and makes a successful career.

The Chance to Get Your Hands Dirty

The lack of bureaucracy. I love it. Instead of having to fight through multiple approval processes and layers of pointless requests to get access to data/tools, you get the ability to solve problems however you would like (and pay for the consequences).

What types of problems? Well, real problems. Instead of trying to solve the problem of moving data from one place to another, or making a banner on a homepage appear differently, you are doing things that have direct impact. Like what? In my team's case, it's creating customer segmentation strategies, or delivering insights on data that has never been seen before.

The best part? Instead of performance being judged on how quickly a problem is solved with a pre-defined approach, it is judged on the results. Instead of the mantra “How fast did you deliver to specifications”, the mantra is “How much value did you deliver“.

Lots of Opportunity to Push the Envelope

Looking at the points above, this is fairly obvious. New ideas are easier to implement in smaller organizations. Instead of having to fight with layers of process, you are up against reality. What do I mean by that? I mean that instead of fighting people over nothing, you are fighting the limits of technology, hardware, and business processes.

The only limit is your own lack of knowledge and passion.

A Framework for Innovation

Creating change. A fun subject, and an admirable goal according to the American ethos and the media our society has spawned. Even though innovative ideas may go against the grain or the way things are currently being done, many consider it a virtue to pursue them. With so much positive emphasis on innovation in our culture, why is it so difficult?

The macro effect of innovation on a company.

There are thousands (if not millions) of reasons why, and I couldn't hope to answer that in a short blog post. But looking at the reverse question, "how do successful innovations occur?", many insights are available. There are plenty of theories with plenty of names, but across the different materials I've found and read, they boil down to two things:

  1. Let people know how the innovation will affect them
  2. Make things easy for the people who are going to be using the innovation

Switch lays out the best framework (in my opinion) for accomplishing these two goals.

Framework for Change

The basic idea is that every person can be pictured as a rider on top of an elephant going down a trail. In order to change where the rider ends up, we can alter three things. You may have guessed it: the rider, the elephant, and the trail.

The Rider: In the metaphor of the rider and the elephant, this is the logic. Everyone has logic (although some riders may be weaker than others) that helps shape how they behave. The rider is the part of a person that, when starting a new habit like running in the morning, causes them to set an alarm.

The Elephant: Emotions and subconscious drive. At the end of the day, the elephant dictates where the rider is going to go. The average ~150-pound rider will only be able to control a 13,000-pound animal for so long before becoming exhausted.

The elephant is the reason why the planned morning run gets cancelled by multiple pushes of the snooze button. The elephant is also the reason why people work 90-hour work weeks and are excited to do so.

The Path: The final component for creating change is the external environment in which every individual operates. These are the external forces that affect behavior. Shaping these forces, and how they act upon elephants and their riders, gets the rider to move toward the desired end location.


All in all, it's pretty straightforward, right? Well, it is definitely much easier to conceptualize and talk about than it is to implement. Many people fail, myself included, to work all three levers at the same time, leading to great ideas being dropped by the wayside.

For those who want to change things for the better, hopefully this framework can help you get to actualization of innovation.

Who You Should Hire

Every person is hired to perform or deliver something that you value. Considering the simplicity of this statement, why is it so hard? Why do large companies continuously hire dead weight?

It may sound like these are baseless questions, but looking at the numbers there is a serious problem with the talent that companies are hiring. In a macro-analysis study by Gallup, 68.5% of employees were found to be disengaged at work. What does this mean? It means that in the US, if you manage a team of 10 people or have 9 coworkers, almost 7 of those people will be "dead weight" employees, ones who are "…essentially 'checked out'. They're sleepwalking through their work day…" Even worse, 1-2 of those 7 will be actively disengaged. Actively disengaged employees cost US businesses an estimated $450-$550 billion a year.
So how can hiring this dead weight be avoided?

The biggest indicator of how someone is going to perform comes down to two things: personality and willingness to learn. Identifying the personality traits needed for the job and sussing out a potential candidate's personality is up to the person hiring, but with a shift in thinking and process it can be done more effectively.

Make a Biography

At the end of the day, hiring someone just based upon interviews is flawed. With the candidate sitting in front of you most likely trying to sell themselves, how can we get to the core of someone’s personality and goals to ensure that they align to the organization?

According to Praxent's Tim Hamilton, adapting the methodology from Who, the goal of an interview process should be to gather as much data on a candidate as possible. The focus should be broader than the professional life or the face that is put forward in an interview. By asking about the details of someone's past and constructing a story of the overall trajectory of their life, you can find trends and characteristics that reveal personality traits a person may not be explicitly aware of (or may be unwilling to reveal). The more data points that can be gleaned while gathering this biography, the more confident you can be that the person you are interviewing has the personality traits the team needs. By the time an individual is going through the hiring process, it is reasonable to assume their personality is relatively solidified, so the usefulness of the biography should be high. If a candidate doesn't have the right personality now, it is unlikely anyone will be able to change or mold that personality once they're hired.

Importance of Learners

From the details presented here (and other research), the finding is that, across employee types, learning on the job had a high correlation with job performance compared to other variables. The implication? Hiring individuals who can and want to learn is the most likely way to hire top job performers. This may seem obvious, but what is not correlated with job performance, and how does that match up with the regular hiring process?

The correlation of job performance with experience is .18 in this study, and as low as .03 in others. That means one of the most heavily weighted factors used to assess candidates, experience, is not the most predictive of job performance.

If an organization conducted the best interview process ever and hired the most experienced and educated individuals, the hiring would be relying on a flawed methodology that doesn’t accurately account for and weigh the real predictors of job performance.

So what can we do?

Rely on measuring what is proven to lead to high job performance. Don't heavily weigh an applicant's experience beyond ensuring the required technical skills. Ensure that the candidate's goals and personality fit what is needed for the position. Create an organization comprised of learners. Construct a biography, ascertain personality traits from it, and determine whether the candidate is willing to learn based upon their personality and goals. This may not be a sure-fire methodology, but it seems to be the best way currently available to ensure that the billions of dollars in losses incurred every year due to disengaged employees will not include a contribution from your organization.

Why Storytelling is Required

Storytelling and marketing seem to be undervalued by technical individuals in the information technology field. Why am I talking about this? Recently at SXSW in Austin, Contently hosted a talk where Shane Snow discussed the power of storytelling. While the audience was definitely skewed towards the marketing industry, the concepts presented can be applied to any idea or presentation that technical people are trying to sell to customers, managers, or co-workers.

Story Continuation

Shane, in his talk, brought up some interesting statistics that make a powerful point: people tend to gravitate towards stories that build on existing lore and story lines. The area he pointed to as proving it? Movies, and specifically movie revenue. The question is, does Shane's theory hold up?

Spider-Man movie revenue, from The Numbers.

If you look at the above, grabbed from The Numbers, it clearly shows a trend of decreasing revenue for Spider-Man movies. On closer examination though, the biggest drop in revenue (~15%) from one movie to its predecessor occurred between Spider-Man 3 and The Amazing Spider-Man, when the continuous story line from the first three Spider-Man movies was broken.

Jurassic Park – revenue by film.

But… doing some spot checking also reveals the opposite to be true. Looking at Jurassic Park's history of revenues, it appears that sequels where the story line is broken can make just as much (or much more). More analysis would be needed to prove this point objectively; maybe this is an oddity with reboots of classic series?

Regardless, in SOME cases, when movies break from a continuous narrative there appears to be increased risk of people abandoning interest in the movie/idea.

Familiarity

Additionally, Shane mentioned the power of familiarity. When traveling abroad and surrounded by unfamiliar scents, sounds, and tastes, people tend to gravitate towards the known. The perfect example that many can relate to? Beer. Heineken is sold in over 170 countries. Given the choice between a familiar brand, even one they dislike, and an unfamiliar one, people generally choose the brand they know. Thinking about the odd concoctions one might encounter when travelling abroad, what would you rather have?

Complexity of Content

The last point presented? The easier content is to read and understand, the more popular it will be. Mark Twain's books come in at around a 5th-grade reading level according to Scholastic's system. Even the more modern classics, depending on your point of view, come in at around the same level. The lesson? It may make us feel good to communicate with big words, but it is not the most effective way to communicate.

So What?

At the end of the day, while these ideas are interesting, what can we learn? Everyone is trying to sell stories every day. In technology/knowledge work, it's a new design or an approach to solve a problem. These strategies, combined with logical arguments, can be used to communicate an idea effectively and gain the support of others. Establish a narrative that creates a vision and a compelling, continuous story line. Do it in a way that anyone could understand, from developers to directors, technical to non-technical. Establish a brand, identifier, or name that people can familiarize themselves with. If technical people peddled their ideas half as well as they implement them, it would be to everyone's advantage. Playing to people's logic usually works… playing to logic and human nature? Couldn't hurt.