Getting Started – Snowflake and S3

Snowflake has been adopted quickly across organizations and geographies over the past few years, with exponential growth in both its customer base (~100% growth in the 2019 fiscal year) and its revenue (a 257% year-over-year increase in 2019). There has to be something amazing behind this product besides great marketing, right?

With that in mind, I figured it would be a good use of time to take a look. One approach when assessing new tools is to focus on a few different use cases and evaluate how the tool stacks up against the competition. With data and analytics tools, one of the first questions is always "can I get the data where it needs to go quickly?" In this case, we'll be loading a basic pipe-delimited data set, previously used to populate an old Tableau dashboard, from an S3 bucket into Snowflake.

For the purposes of this blog post we're going to focus on bulk loading, since the most basic use case for many data warehousing initiatives is a nightly load or a similar non-continuous schedule. Additionally, we're going to use an S3 bucket that is external to Snowflake for this attempt. For reference, Snowflake's three main approaches to loading data are:

  1. External Tables
  2. Bulk Loading
  3. Continuous Loading

Step 1: What data to load?

In this case, I grabbed the pipe-delimited text file and dropped it into an S3 bucket. The complexity will come in managing my VPC and transferring the data from my S3 bucket into Snowflake's internal S3 storage.

File loaded in Snowflake. Note the size of the file in S3.

Step 2: Grant Snowflake Access to S3

Snowflake exists outside of my main AWS account's VPC, so I need to grant my Snowflake account access to my S3 bucket in order to copy the data into Snowflake. One important thing to note is that it's definitely a good idea to keep your S3 data in the same region as your Snowflake instance so that your data/traffic stays internal to the AWS network (non-public).

Integration flow for an S3 Stage.

Per the Snowflake documentation there are three main options for performing this piece of work. The recommended option is to configure a Snowflake Storage Integration, so that we can avoid supplying credentials every time we load data.

The creation of a policy on my AWS account is the first thing that’s needed.

On the left is my policy; on the right, Snowflake's standard recommendation for the policy.

 

Luckily, the process is straightforward. Snowflake goes as far as providing the exact JSON that needs to be pasted in. The main difference between what Snowflake recommends and the policy I created is the use of the wildcard "*" in the s3:prefix element. It's fine for my prototyping use case, but if you have more than 1,000 objects in the bucket, a "*" will cause an error when trying to read/copy data into Snowflake.

The second item that needs to be completed is the configuration of an IAM user that will allow your Snowflake account to access the specified S3 bucket. This is the same process used to allow two separate AWS accounts to access one another's resources because, well, that's exactly what's occurring, with Snowflake being hosted on AWS. Details can be found in Snowflake's documentation, but at the end of the day you'll end up with an IAM user with the policy we created earlier attached to it.

Step 3: Setting up a Snowflake Integration

The creation of the integration is what we were striving for all along: it gives us a way to access our S3 bucket without having to log in or supply credentials in Snowflake. If we don't create this integration, then credentials will have to be supplied each time data is loaded.

Integration creation, which will be used to create Stages.
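For reference, the statement behind that screenshot looks roughly like the following. This is a sketch rather than my exact command; the integration name, role ARN, and bucket are placeholders:

    -- Sketch of the storage integration; name, ARN, and bucket are placeholders.
    CREATE STORAGE INTEGRATION s3_integration
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = 'S3'
      ENABLED = TRUE
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake_access_role'
      STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/');

    -- DESC INTEGRATION returns the AWS user and external ID needed to finish
    -- the trust relationship on the AWS side.
    DESC INTEGRATION s3_integration;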

Step 4: Create the Stage

Finally, we have the ability to connect to our S3 bucket from our Snowflake account. We can now create the Stage that's needed in Snowflake to actually load the data from the S3 bucket into our Snowflake tables. There are many options to customize the stage, but for this basic example I know that I have a text file with "|" as a delimiter and a header row.

Simple stage creation for loading a CSV-formatted file.
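The stage definition behind that screenshot would look something like this; the stage name, bucket, and file format options are placeholders matching my pipe-delimited file with a header row:

    -- Sketch of the stage; it ties the integration, the bucket, and the file format together.
    CREATE STAGE my_s3_stage
      STORAGE_INTEGRATION = s3_integration
      URL = 's3://my-bucket/'
      FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' SKIP_HEADER = 1);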

Step 5: Create the Table

Pretty straightforward. We need somewhere for the data to live once it's copied into Snowflake from S3, and everyone should be familiar with creating a table using standard SQL. One of the great parts of dealing with Snowflake is the automatic partitioning and optimization right out of the box. So I used the most basic DDL, with every column as a string, to ensure no data is dropped or skipped during loading due to data type mismatches.

DDL for table creation, based on the structure of the CSV. All string data types for ease of loading.
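A minimal sketch of that DDL; the column names here are made up, since they just mirror the file's header, and everything is a string so nothing gets rejected on load:

    -- Illustrative table; real column names come from the file's header row.
    CREATE TABLE dashboard_data (
        region     VARCHAR,
        product    VARCHAR,
        sale_date  VARCHAR,
        quantity   VARCHAR,
        revenue    VARCHAR
    );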

Step 6: Create the Snowpipe

The final step is to get the data loaded from our S3 bucket into Snowflake's internal S3 storage. This is done by creating the Snowpipe that performs the load.

Creating a Snowpipe that automatically loads the data upon creation.
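A pipe is essentially a named wrapper around a COPY INTO statement. A minimal sketch, reusing the placeholder objects from the earlier steps:

    -- Sketch of the pipe: copy everything the stage can see into the table.
    CREATE PIPE dashboard_pipe AS
      COPY INTO dashboard_data
      FROM @my_s3_stage;

    -- For a one-off bulk load, the COPY INTO statement can also be run on its own.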

Final Outcome:

Our data has landed!

As we see when running a query against our table, the data is now present in Snowflake and available for querying. Additionally, the file that was loaded into Snowflake still exists in our S3 bucket for use in any QA or validation that we want to perform. In our final state, we can also see that Snowflake automatically compressed our data from over 50MB to 15.49MB without any manual intervention.

With automatic compression, the file now takes up roughly 30% of the storage space that the raw file does on S3.

Going forward, we'll be exploring some more interesting and impressive capabilities of Snowflake. But even starting here, you can see the ease of use: going from no infrastructure to data feeding into an enterprise-scale data warehouse in less than a couple of hours.

Using Redash for Visualization

Over the winter break I was having a conversation with my cousin about the awesomeness of Tableau and all it offers. While Tableau is a best-in-class product, he raised a couple of valid points against it.

  1. Tableau uses its own proprietary language and functions for a lot of aggregations and advanced functionality that could be done in SQL. SQL-based tools are better, he argued (referring explicitly to Metabase, among others), because most analysts already know the language and can therefore pick the tool up easily.
  2. Tableau isn't open source. With an open source tool, if a feature doesn't exist and I know the language, I can add it myself (depending on my ability to code in the respective language).

That got me down the path of looking at open source reporting and dashboarding tools that are heavily SQL-based. I cracked open both a Metabase and a Redash instance to play around with. Metabase had a good number of features available, but an extremely limited number of rows that could be ingested by the tool on the free tier.

Heroku – different pricing tiers.

So, not wanting to spend any money upfront, I went over to Redash and started building my first dashboards using the free trial. Needless to say, I fell in love with the tool instantly; then my love was tempered by other limitations present in Redash. The dashboard I created can unfortunately no longer be accessed because the free trial period ended, but it appeared as pictured below.

Redash dashboard for Steam and Sony Marketplace price changes.

Running through the dashboard, you can see the game and pricing data that has been collected by my scrapers. Scrolling through the different charts, you'll see the following:

  1. Largest Price Drops and Increases: Shows, by day, the largest price increase and the largest decrease, along with the average price change for all items that had a price change that day.
  2. Average Price Change: The average price for items that changed price, along with the average previous price, for all items in the Steam Marketplace and Sony Play Store that had a price change that day. In green is the average change between the old price and the new price.
  3. Most Recent Price Decreases: The last 20 items which have decreased in price, and associated data.
  4. Most Recent Price Increases: The last 20 items which have increased in price, along with associated data.

At the bottom of each of these panels you'll see when the data was last refreshed from the database.
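To give a sense of what sits behind those panels, the daily summary is driven by a query along these lines; the price_changes table and its columns are hypothetical stand-ins for my actual schema:

    -- Hypothetical daily price-change summary; table and column names are illustrative.
    SELECT
        change_date,
        MAX(new_price - old_price) AS largest_increase,
        MIN(new_price - old_price) AS largest_decrease,
        AVG(new_price - old_price) AS avg_price_change
    FROM price_changes
    GROUP BY change_date
    ORDER BY change_date DESC;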

Now, onto the original reason for writing this article: the pros and cons of using and building dashboards in this tool.

Pros

  1. No row limit encountered on the free trial! What a great feature. My first requirement was being able to ingest a large amount of data and do aggregations on it. Not hitting a hard row limit with my small 40,000-record data set is what originally sold me on this tool.
  2. Easy to get up and running. Setting up the tool and making use of the connectors already built into it was extremely easy.
  3. SQL interface is extremely intuitive to anyone who has used SQL Server Management Studio/PgAdmin, or any other database querying GUI tool.
  4. Refreshes are extremely easy to schedule and reliable.

Cons

  1. Pricing. The software is free if self-hosted, but if it's hosted through Redash.io like I did, the price is $49 a month on the lowest tier. Hosting Metabase on Heroku is just as easy, and cheaper in the long term for small side projects.
  2. No aggregations can be done within visualizations, which in my view is a must-have in any dashboarding/reporting tool. Redash forces you to push that logic into SQL, which results in redundant/complex queries. It also forces you to pre-aggregate, so the "no row limit" feature in the pros section no longer applies.
  3. Visualization features are basic/limited. Other dashboarding tools offer stacked bars, tooltips, custom color schemes, and generally more customization options, which are either unavailable or limited compared to a tool like Tableau.

Redash query editor. Combine multiple query visualizations to create a dashboard.

Redash feels like a tool meant for another use case, perhaps one where you need a basic tool for monitoring ETL or some other system. Using Redash was a good experience, but something like Metabase is the better tool for me at the moment. Because Redash requires all calculations to be pushed down into SQL rather than aggregated within the tool, and because of the pricing, it doesn't suit my use case.

Re-establishing a Broken Cloud

This week, I cracked open Tableau to log into my Amazon RDS instance and noticed that the connection wasn't working. Logging into the AWS console, I found my RDS instance had disappeared (along with all the data in it). On perusing my emails, I noticed an unpaid bill from Amazon sitting in my inbox from about a month prior. So, along with the instance no longer running, I had lost all the data I'd been collecting over the past three months, which is more than slightly disappointing.

This does present an opportunity though. My EC2 instance is still running and has been trying to push data to a server that no longer exists, meaning I need to set the RDS instance back up. This was an opportunity to document setting up a new RDS instance on AWS from scratch, with all necessary users, objects, and privileges, and to record how long it took.

Here's the process from start to finish:

Start: 4:12 pm

First step, logging in and getting the instance created. You’ll notice during this step that I flip from free-tier to get more storage, then flip back to free-tier. Why pay more money to get increased storage I won’t need for a couple weeks? All I need to do to up the storage is change a configuration which will cause my RDS instance to be down for a couple minutes.

Second step, making sure security privileges are set up. After my first project a couple of years ago, when my web server got destroyed by a script kiddie, I now only open specific ports (which I should have been doing all along).

The third step: I should be able to log in to the server using the account I set up as admin. Once I'm in, all I have to do is execute all the create scripts I have.

The way I created the DDL for my tables and schemas means I can copy and paste the scripts into a query window in pgAdmin 4 and execute them. You'll notice I had a couple of semicolon issues to resolve along the way.
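The scripts themselves are nothing exotic; roughly this shape, with the schema, user, and table names below being placeholders rather than my actual objects:

    -- Illustrative create scripts: schema, table, application user, and grants.
    CREATE SCHEMA staging;

    CREATE TABLE staging.raw_prices (
        item_name   TEXT,
        price       TEXT,
        scraped_at  TIMESTAMP
    );

    CREATE USER scraper_app WITH PASSWORD 'not-my-real-password';
    GRANT USAGE ON SCHEMA staging TO scraper_app;
    GRANT SELECT, INSERT ON ALL TABLES IN SCHEMA staging TO scraper_app;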

Finally, it looks like everything has been created. I just need to validate that my different accounts can log in to the server and have the appropriate privileges, which they did.

Finish: 5:17 pm

Connections from my EC2 instance are back up, with no alteration to any code on the EC2 server!

This process did not include any alteration of the EC2 instance and took me from a web server scraping the internet and sending files into the ether (nowhere) to having a full database stood up with all its objects. It was done in a little over an hour, and ~30 minutes of that was spent executing SQL, copying and pasting SQL into the query editor, testing to ensure objects and configuration were correct, and fixing minor syntax issues. All of which could be automated away.

I was debating whether or not to get a personal server for my projects, but in my mind this firmly cements the cloud as the better choice when it comes to infrastructure. Compared to my experience setting up a local SSAS and SQL Server instance, this took about 10% of the time and was extremely easy to get running.

A Foray Into Serious Scraping

It's been a while since my last post. Getting married, honeymooning, buying a house, etc. took away the time I had for this. But all of that is nearing its end, so I'm getting back into a regular cadence of working on the scraping project. Since I'm now back into it and have everything up and running, I figured the most sensible place to start is the architecture that has been implemented.

The Problem:

The most sensible place to start any discussion of architecture is clearly stating what the system is supposed to do. I need a system that accomplishes three things:

  1. I need a system that is able to reliably scrape data from any website or consume data from any source.
  2. I need a place where this data can be loaded and reported on in a cohesive format.
  3. I need the product to be lightweight as far as storage space required and CPU so I don’t have to pay out the wazoo.

Overview:

Scraper Architecture

Overview of the architecture, from inputs to database inserts.

In order to meet this, a straightforward architecture was implemented. Using Amazon Web Services, both an EC2 instance and an RDS instance were set up, the EC2 instance running Ubuntu and the RDS instance running PostgreSQL. In sequential order, here is how the scraper works:

  1. Using Python's Scrapy library, we've written Scrapy projects that target specific sources and bring in data based on the HTML of those websites. Right now we've targeted two, but we can expand to as many as needed. These Scrapy spiders are scheduled through Scrapyd, a framework that not only allows for scheduling and management of spiders, but also offers better performance by running on Twisted, making it asynchronous.
  2. As the spiders run, they output to JSON files on the server. The driver here is to have a place to drop the output on the server so that data won't be lost if something happens to one of the processes.
  3. A Python class was written with Psycopg2 in a way that is meant to be extensible for future data sources. The idea being that as the data model and data sources are changed or expanded upon, the only thing that will need to change is the class itself. None of the scripts that call the class to insert data from our existing data sources will need to change.
  4. A staging area was created within the RDS PostgreSQL instance, which ingests the raw data from each data source. Where possible, a unique index was created that checks for changes before accepting the data into the staging area (see the sketch after this list). Since we have scrapers hitting the same sources repeatedly, we are going to keep grabbing the same data; what we're interested in are the changes, especially new items or price changes. We also want the architecture to be as efficient as possible, so storing only the data we're interested in just makes sense.
  5. Once data has been accepted into the landing zone, the Ubuntu instance is used to schedule a slew of ETL jobs written in SQL and passed to PostgreSQL for execution using Psycopg2. PostgreSQL doesn't have a native scheduler readily available, so we use Ubuntu's crontab to execute a script for each of our sources, which calls a class containing all of our ETL functions. The end result is a 3NF model populated with data and the appropriate relationships.
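As a rough sketch of the pattern described in steps 4 and 5 (table and column names are invented for illustration), the unique index is what lets repeated scrapes land without storing duplicates:

    -- Hypothetical staging table; the unique index rejects rows we've already captured.
    CREATE TABLE staging.item_prices (
        source      TEXT,
        item_id     TEXT,
        price       NUMERIC,
        scraped_at  TIMESTAMP DEFAULT now()
    );

    CREATE UNIQUE INDEX ux_item_prices
        ON staging.item_prices (source, item_id, price);

    -- Inserts issued by the psycopg2 class can then skip duplicates outright.
    INSERT INTO staging.item_prices (source, item_id, price)
    VALUES ('steam', '12345', 19.99)
    ON CONFLICT DO NOTHING;

Only rows where something actually changed make it past the index, which is exactly the behavior described above.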

So, it's now up and running, and data is flowing through into the objects. The data is populating in all the tables where it's expected, and I could begin reporting price changes today. The best part? All of this was built using $0 of infrastructure from Amazon Web Services (and a lot of my time). I'm running out of storage space rapidly (the 20 gigs from the free tier ran out over a couple of days), and the CPU is not beefy at all, so it stalls out if more than one scraper is running at a time (as pictured below).

Performance goes down dramatically in yellow. In red, my scraper has been blocked from accessing the site (which didn't happen before the refactor; I'll go into that another time).

To refer back to the original goal, I would say it has been achieved. That's not to say it couldn't be improved upon and optimized, but overall the first serious foray into scraping seems to have gone well. Feel free to reach out with any questions or suggestions!

What Should Documentation Be?

The problems Business Intelligence teams solve are generally the same across organizations: pull some data out of somewhere, synthesize it, analyze it, then create a picture of what is happening, what is going to happen, or what has happened. Working in small and large organizations, I've had the pleasure of seeing a variety of processes used to deliver these insights. These range from the overbearing, with documentation requirements that crush people's productivity, to the lightweight, which creates quicksand beneath teams' feet through a lack of knowledge transfer.

Having seen both the overbearing and the extremely lightweight, there's one conclusion I've arrived at concerning documentation…

Document as Little as Possible

 

Relevant literature.

Don't commit time to things that aren't creating revenue or helping the business. Looking at IT projects, there is no doubt that the more documentation there is, the less value there is. The perfect example came up while having a beer with a former co-worker.

 

He brought up that the process at the company we both previously worked at required documentation that took longer to create than coding, testing, and implementing the change itself. Additionally, this painstakingly crafted documentation, which the engineer had to spend time tracking down information for, didn't result in documents that would be useful to the team doing the work going forward. The process decreed that you must document X, Y, and Z in order to deploy the change, so that's what was done. The fundamental truth is that the "…benefit of having documentation must be greater than the cost of creating and maintaining it."

Some people believe in the exact opposite of over-documentation: nothing should be documented; the code/implementation should speak for itself. This may work when you have a small IT application that will always be managed by the same group of individuals (which likely won't happen). Once you reach an application spanning multiple servers, teams, and databases, expecting the code/implementation to "speak for itself" in a timely manner to those who have to report and pull analytics out of it is unreasonable.

So, what’s being proposed in this rant? The only useful documentation that I’ve seen documents the “Why” and the “How”. Everything else doesn’t create value for the organization, as the cost to maintain and develop the documentation is too high.

Why

Creating a BI Product entails connecting the business process to an application(s) or database(s). Depending on the environment that you’re working in, Inmon, Kimball, or something else entirely, you need to know the answer to why things in your system exist. The “Why” is important not only from a high leadership level, but also at a low technical implementation level. The “Why” statement done at the low level helps to ensure that a team is using previously created tools and implementations as designed. And if a change is made that goes against the original “Why”, it is intentional and by design.

As an example, working on the Vehicle Profitability by VIN project, the Data Architect created both Inmon (3-NF) and Kimball (dimensional reporting) structures on the project. The “Why” was made extremely apparent through documentation, so the teams knew how to use the current implementation to achieve their goal in the best way possible.

Are you importing new invoice data? That should go into the wholesale invoice structure so that it flows up in the existing fact that contains the revenue information for vehicles which our reports feed from. Why? Because we want a single source of truth for vehicle revenue.

When documentation providing the "Why" for technical implementations exists, it makes adding on to and changing the existing processes and assets easier, as opposed to re-inventing the wheel over and over.

How

A basic data flow diagram.

So after we know why something exists, the other piece that is useful for documentation is the “How”. The “How” shouldn’t be step by step instructions, it should function like a high level map. Data Flow Diagrams are a great example of “How” documentation that I’ve found useful for Business Intelligence products. Armed with the Data Flow Diagram and the “Why” of the design, team members who need to report on, extend, maintain, or refactor a system will be able to make informed decisions.

 

 

Make It Useful

At the end of the day, documentation gets in the way of creating code, analysis, and direct business value. So the argument for spending time on documentation is hard to make to someone who hasn't experienced the pains associated with the lack of it: architecture that doesn't make sense, misreported numbers, time wasted building processes that do exactly what existing processes already do.

Without documentation, maintaining and using a system or process as intended is impossible. With documentation that is accessible, searchable, and focuses on the “How” and “Why”, organizations can make smart and informed decisions of where to spend time, how to tweak things, and how to get value from their assets.

 

Data Day Texas 2017 – A Few Thoughts

Earlier this month I had the opportunity to attend Data Day Texas, and thought that it would be worthwhile to jot down a few thoughts. For those that aren’t aware of Data Day Texas, think of it as a gathering of nerdy IT people and Data Scientists. It was an interesting weekend with a wide range of topics that encompassed everything from machine learning algorithms to more approachable subjects like data dashboarding.

Graphs Are Here

There's a reason the keynote by Emil Eifrem was named "The Year of the Graph". Looking at the popularity trend on db-engines.com, you can see a large gain in the popularity of Neo4j. Which leads naturally to the question: so what?

Neo4j popularity trend on db-engines.com.

I think the major winning point for graph databases, other than performance on certain types of analytics, is that they are defined by the relationships between data. This is in opposition to the traditional RDBMS approach, which requires explicitly defining the tables in the schema, with relationships as a non-required afterthought in most cases. When constructing a graph database, you are explicitly defining the node (a core piece of data) and the edge (a relationship between nodes), so you are enforcing the relationships between the data rather than just the structure of the data itself. This creates another level of abstraction between the end user and the data, which should make the data in the database more approachable. Oh, and if you haven't guessed, graph databases are schema-less, which is a plus in many cases.
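For contrast, here's a hedged sketch (table and column names invented for illustration) of what the same idea looks like in a traditional RDBMS: the relationship isn't a first-class object, it's another table plus foreign keys that you define yourself and join through.

    -- In an RDBMS the "edge" is modeled as its own table with foreign keys.
    CREATE TABLE person (
        person_id  INT PRIMARY KEY,
        name       TEXT
    );

    CREATE TABLE knows (
        person_id  INT REFERENCES person (person_id),
        friend_id  INT REFERENCES person (person_id),
        PRIMARY KEY (person_id, friend_id)
    );

    -- Traversing the relationship means joining through the edge table.
    SELECT p2.name
    FROM person p1
    JOIN knows k   ON k.person_id = p1.person_id
    JOIN person p2 ON p2.person_id = k.friend_id
    WHERE p1.name = 'Alice';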

Issues Are Similar Across Companies/Technology

In particular, there were two talks that hit this point home. The first was given by Chris LaCava from Expero Inc. in which he discussed visualization techniques with graph databases. The second was the discussion of how Stitch Fix sets up their environment for data scientists to work by Stefan Krawczyk.

What's the root of this? People want to use the tools that work and that they like. Chris LaCava discussed how to do visualization on top of graph databases. While graph databases can meet some cool use cases as far as data sets and real-time analytics go, what he presented was a straightforward, common-sense approach to dashboarding. Anyone familiar with Business Intelligence and dashboarding should roughly be following the process he laid out, or something near to it.

Look familiar? From Chris' presentation on graph database dashboarding.

Stefan's talk was all about using Docker to enable data scientists to use the tools they want to use, a solution to the complaint many of us in the industry have when we're locked into a specific tool-set. The differentiator here is that Stitch Fix has done containerization at scale, which solves the problem by allowing each of their data scientists to run and operate their own environment, with whatever tool-set they favor, to deliver business value.

The Story is What Makes Things Interesting

The final point, which I've written about before, is that the story is what makes things interesting. The specific story presented at Data Day? The Panama Papers, and how Neo4j was used to discover the unlikely connection that led to the downfall of a Prime Minister. It was the best marketing I have ever seen for a database. Having a database GUI that allows for easy exploration of the data natively? That's a game changer.


Comparing a traditional RDBMS GUI (SQL Server Management Studio) with Neo4j's GUI, there's a reason why people don't pull up SQL Server Management Studio to tell a story. Having a database platform that can automatically tell a story about the data is an awesome approach.

 

Working For The Small Guys

I recently left my last job, at what is by some rankings the 23rd largest company in the world, for a much smaller company. The general consensus from those I talked to was that a change of pace would be good experience, and many recommended the shift to a smaller company.

With that said, I think I've settled in enough to pick out the advantages of working for a small company, and luckily for me, it appears that those I confided in gave me good advice. So here we go: about four months in, here are the three best things I've found working at a smaller company.

Clear Goals

Working for a smaller company, especially one that runs lean, means the company needs to use resources as efficiently as possible. This is evident in the fact that everyone knows what the company's current goals are. Instead of being lost in the back of the office doing menial work without visibility into how you are helping the company, it's clear what problems the business is facing and how specific pieces of work fit in to accomplish the larger goal.

In addition, clearly outlined goals and objectives help build the feeling that teams are actually working together. In my experience, large companies have many objectives, and everyone has a different idea of how to achieve them. That isn't a bad thing in itself. What is a bad thing is that, in pursuing these many ideas for solving some overarching problem, different teams sometimes actively work against one another's interests, all in the hope that their solution is the one that gets noticed and makes a successful career.

The Chance to Get Your Hands Dirty

The lack of bureaucracy. I love it. Instead of having to fight through multiple approval processes and layers of pointless requests to get access to data/tools, you get the ability to solve problems however you would like (and pay for the consequences).

What types of problems? Well, real problems. Instead of trying to solve the problem of moving data from one place to another, or making a banner on a homepage appear differently, you are doing things that have direct impact. Like what? In my team's case, it's creating customer segmentation strategies, or delivering insights on data that has never been seen before.

The best part? Instead of performance being judged on how quickly a problem is solved with a pre-defined approach, it is judged on the results. Instead of the mantra “How fast did you deliver to specifications”, the mantra is “How much value did you deliver“.

Lots of Opportunity to Push the Envelope

Looking at the points above, this is fairly obvious. New ideas are easier to implement in smaller organizations. Instead of having to fight with layers of process, you are up against reality. What do I mean by that? I mean that instead of fighting people over nothing, you are fighting the limits of technology, hardware, and business processes.

The only limit is your own lack of knowledge and passion.

A Framework for Innovation

Creating change. A fun subject, and an admirable goal according to the American ethos and the media our society has spawned. Even though innovative ideas may go against the grain or the way things are currently being done, many consider it a virtue to pursue them. With so much positive emphasis on innovation in our culture, why is it so difficult?

The macro effect of innovation on a company.

There are thousands (if not millions) of reasons why, and I couldn't hope to answer that in a short blog post. But looking at the reverse question, "how do successful innovations occur?", many insights are available. There are plenty of theories with plenty of names, but across the different materials I've found and read, they boil down to two things:

  1. Let people know how the innovation will affect them
  2. Make things easy for the people who are going to be using the innovation

Switch lays out the best framework (in my opinion) for accomplishing these two goals.

Framework for Change

The basic idea is that every person can be pictured as a rider on top of an elephant going down a trail. In order to change where the rider ends up, we can alter three things. You may have guessed it: the rider, the elephant, and the trail.

The Rider: In the metaphor of the rider and the elephant, this is the logic. Everyone has logic (although some riders may be weaker than others) that helps shape how they behave. The rider is the part of a person that, when starting a new habit like running in the morning, causes them to set an alarm.

The Elephant: Emotions and subconscious drive. At the end of the day, the elephant dictates where the rider is going to go. The average ~150-pound rider will only be able to control a 13,000-pound animal for so long before becoming exhausted.

The elephant is the reason why the planned morning run gets cancelled by multiple pushes of the snooze button. The elephant is also the reason why people work 90-hour work weeks and are excited to do so.

The Path: The final component for creating change is the external environment in which every individual operates. These are the external forces that affect behavior. Shaping these forces, and how they act upon elephants and their riders, gets the rider to move toward the desired end location.


All in all, it's pretty straightforward, right? Well, it is definitely much easier to conceptualize and talk about than it is to implement. Many people fail, myself included, to work all three levers at the same time, leading to great ideas being dropped by the wayside.

For those who want to change things for the better, hopefully this framework can help you get to actualization of innovation.

Who You Should Hire

Every person is hired to perform or deliver something that you value. Considering the simplicity of this statement, why is it so hard? Why do large companies continuously hire dead weight?

It may sound like these are baseless questions, but looking at the numbers there is a serious problem with the talent that companies are hiring. In a macro-analysis study by Gallup, 68.5% of employees were found to be disengaged at work. What does this mean? It means that in the US, if you manage a team of 10 people or have 9 coworkers, almost 7 of those people will be "dead weight" employees, ones who are "…essentially 'checked out'. They're sleepwalking through their work day…" Even worse, 1-2 of those 7 will be actively disengaged. Actively disengaged employees cost US businesses an estimated $450-$550 billion a year.
So how can hiring this dead weight be avoided?

The biggest indicator of how someone is going to perform comes down to two things: personality and willingness to learn. Identifying the personality traits needed for the job and sussing out a potential candidate's personality is up to the person hiring, but with a shift in thinking and process it can be done more effectively.

Make a Biography

At the end of the day, hiring someone just based upon interviews is flawed. With the candidate sitting in front of you most likely trying to sell themselves, how can we get to the core of someone’s personality and goals to ensure that they align to the organization?

According to Praxent's Tim Hamilton, adapting the methodology from Who, the goal of an interview process should be to gather as much data on a candidate as possible. The focus should be broader than the professional life or the face that is put forward in an interview. By asking about the details of someone's past and constructing a story of the overall trajectory of their life, you can find trends and characteristics that reveal personality traits a person may not be explicitly aware of (or may be unwilling to reveal). The more data points that can be gleaned while gathering this biography, the more confident you can be that the person you are interviewing has the personality traits the team needs. By the time an individual is going through the hiring process, it is reasonable to assume their personality is relatively solidified, so the usefulness of the biography should be high. If a candidate doesn't have the right personality now, it is unlikely anyone will be able to change or mold that personality once they're hired.

Importance of Learners

From the details presented here (and other research), the finding is that, across employee types, learning on the job had a high correlation with job performance compared to other variables. The implication? Hiring individuals who can and want to learn is the most likely way to hire top job performers. This may seem obvious, but what is not correlated with job performance, and how does that match up with the regular hiring process?

The correlation of job performance with experience is .18 in this study, and as low as .03 in others. That means one of the most heavily weighted factors used to assess candidates, experience, is not the most predictive of job performance.

If an organization conducted the best interview process ever and hired the most experienced and educated individuals, the hiring would be relying on a flawed methodology that doesn’t accurately account for and weigh the real predictors of job performance.

So what can we do?

Rely on measuring what is proven to lead to high job performance. Don't heavily weigh an applicant's experience beyond ensuring the required technical skills. Ensure that the candidate's goals and personality fit what is needed for the position. Create an organization comprised of learners. Construct a biography, ascertain personality traits from it, and determine whether the candidate is willing to learn based upon their personality and goals. This may not be a sure-fire methodology, but it seems to be the best way currently available to ensure that the billions of dollars in losses incurred every year due to disengaged employees will not include a contribution from your organization.

Why Storytelling is Required

Storytelling and marketing seem to be undervalued by technical individuals in the information technology field. Why am I talking about this? Recently at SXSW in Austin, Contently hosted a talk where Shane Snow discussed the power of storytelling. While the audience was definitely skewed towards the marketing industry, the concepts presented can be applied to any idea or presentation that technical people are trying to sell to customers, managers, or co-workers.

Story Continuation

Shane, in his talk, brought up some interesting statistics that make a powerful point: people tend to gravitate towards stories that build on existing lore and story lines. The area he pointed to as proving it? Movies, and specifically movie revenue. The question is, does Shane's theory hold up?

Spider-Man movie revenue, from The Numbers.

If you look at the above, grabbed from The Numbers, it clearly shows a trend of decreasing revenue for Spider-Man movies. On closer examination though, the biggest drop in revenue (~15%) from one movie to its predecessor occurred between Spider-Man 3 and The Amazing Spider-Man, when the continuous story line from the first three Spider-Man movies was broken.

Jurassic Park – revenue by film.

But… doing some spot checking also reveals the opposite to be true. Looking at Jurassic Park's history of revenues, it appears that sequels where the story line is broken can make just as much (or much more). More analysis would be needed to prove this point objectively; maybe this is an oddity with reboots of classic series?

Regardless, in SOME cases, when movies break from a continuous narrative there appears to be increased risk of people abandoning interest in the movie/idea.

Familiarity

Additionally, Shane mentioned the power of familiarity. When traveling abroad and surrounded by unfamiliar scents, sounds, and tastes, people tend to gravitate towards the known. The perfect example that many can relate to? Beer. Heineken is sold in over 170 countries. Given the choice between a familiar brand, even one they dislike, and an unfamiliar one, people generally choose the brand they know. Thinking about the odd concoctions one might encounter when travelling abroad, what would you rather have?

Complexity of Content

The last point presented? The easier content is to read and understand, the more popular it will be. Mark Twain's books come in at around a 5th-grade reading level according to Scholastic's system. Even the more modern classics, depending on your point of view, come in at around the same level. The lesson? It may make us feel good to communicate with big words, but it is not the most effective way to communicate.

So What?

At the end of the day, while these ideas are interesting, what can we learn? Everyone is trying to sell stories every day. In technology/knowledge work, it's a new design or an approach to solve a problem. These strategies, combined with logical arguments, can be used to communicate an idea effectively and gain the support of others. Establish a narrative that creates a vision and a compelling, continuous story line. Do it in a way that anyone could understand, from developers to directors, technical to non-technical. Establish a brand, identifier, or name that people can familiarize themselves with. If technical people peddled their ideas half as well as they implement them, it would be to everyone's advantage. Playing to people's logic usually works… playing to logic and human nature? Couldn't hurt.