Building a Serverless Data Ingestion – End Result

In this final post, we'll go over the finished implementation of this serverless data ingestion pipeline. What came of all the effort put into building it? The best way to break that down is to compare what we were originally aiming for with what was actually implemented. Below is the diagram created in the post outlining the overall architecture.

Pictured on the left is what was originally proposed; on the right is what was actually implemented. It turns out the implementation pretty much stuck to the plan, with the additional enhancement of using S3 events to kick off the Lambda functions. This lets everything run in the appropriate sequence as soon as a file is placed in our “ingestion-bucket-11.15.2021”.

Using S3 events to kick off Lambda functions was extraordinarily easy, and there's plenty of good documentation to get started. The S3 event passes along, as JSON, all the metadata needed to parameterize and operate the pipeline, and Amazon's documentation makes it easy to access and use when setting up your Lambda functions.
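For reference, the event payload Lambda receives from S3 looks roughly like the abbreviated structure below. This is illustrative only; the real payload carries additional metadata (region, timestamps, request IDs), and the object key shown is a made-up example.

example_event = {
    'Records': [
        {
            'eventSource': 'aws:s3',
            's3': {
                'bucket': {'name': 'ingestion-bucket-11.15.2021'},
                'object': {'key': 'logs/export-2021-11-15.json.gz'}  # hypothetical key
            }
        }
    ]
}

# The handler below pulls out exactly these two values
bucket_name = example_event['Records'][0]['s3']['bucket']['name']
file_name = example_event['Records'][0]['s3']['object']['key']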

Below you can see the actual code executed in the Lambda function. Notice that the variables bucket_name and file_name are both retrieved from the event.

from classes import ingester as ing
from classes import forstaparser as fp

def lambda_handler(event, context):
    # Bucket and object key come straight from the S3 event metadata
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    file_name = event['Records'][0]['s3']['object']['key']
    target_bucket = 'landing-bucket-11.15.2021'
    upload_table = 'landing_table'
    source_type = 's3'

    if bucket_name == target_bucket:
        # File is already in the landing bucket: convert it, parse the logs, load DynamoDB
        upload_file_name = ing.ingester.convert_object(None, bucket_name, file_name)
        raw_logs = fp.parser.read_logs(None, source_type, target_bucket, upload_file_name)
        fp.parser.dynamo_landing_load(None, upload_table, raw_logs, file_name)
    else:
        # File arrived in the ingestion bucket: copy it into the landing bucket
        landing_ingester = ing.ingester()
        landing_ingester.copy_bucket(bucket_name, target_bucket)
    print("Completed")

Put simply, the function does the following. First, it receives the event metadata, parses the JSON, and obtains the name of the bucket we want to transfer files from. Notice that we can point to any bucket, and it will always drop the file into ‘landing-bucket-11.15.2021’. Using the event metadata means I can reuse this Lambda function as often as I want, creating a central dumping ground for staging data to be loaded.

Second, once files are put into ‘landing-bucket-11.15.2021’, another event fires. That event triggers a second function which cleans the data, ensures proper encoding (UTF-8), and loads it into our DynamoDB landing table. All in all, pretty simple.
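For illustration, a minimal sketch of what that landing load could look like with Boto3's DynamoDB resource is below. The table name comes from the handler above, but the attribute names and the way the unique key is built are assumptions for the sketch rather than the exact logic in the parser class.

import time
import boto3

def dynamo_landing_load(table_name, raw_logs, source_file):
    # Sketch: write each cleaned log line as an item in the landing table,
    # using the source file plus a line number as a simple unique key (assumed scheme)
    table = boto3.resource('dynamodb').Table(table_name)
    with table.batch_writer() as batch:
        for line_number, log_line in enumerate(raw_logs):
            batch.put_item(Item={
                'log_id': f'{source_file}#{line_number}',  # hypothetical partition key
                'raw_log': log_line,
                'loaded_at': str(time.time())
            })

dynamo_landing_load('landing_table', ['log line one', 'log line two'], 'example.txt')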

Below you can see everything running in action.

As we can see above, the files were automatically copied, and for posterity's sake we can check the CloudWatch logs, which show a 100% success rate for our first Lambda function over the last hour.

The next step kicks off automatically whenever an object is PUT into ‘landing-bucket-11.15.2021’ and loads the data into DynamoDB. With the current setup, the data uploads successfully and is now available in DynamoDB to be ingested into whatever processes or analytics we want! The best part is that once this is set up, it is automated and auditable going forward thanks to the tooling AWS offers.
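For reference, the trigger itself can also be wired up in code rather than through the console. Below is a rough sketch using Boto3; the Lambda ARN is a placeholder, and the function's resource policy would already need to allow S3 to invoke it.

import boto3

s3 = boto3.client('s3')

# Sketch: invoke the loader Lambda whenever an object is PUT into the landing bucket
s3.put_bucket_notification_configuration(
    Bucket='landing-bucket-11.15.2021',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:landing-loader',  # placeholder ARN
                'Events': ['s3:ObjectCreated:Put']
            }
        ]
    }
)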

Building this may not have seemed like a long journey, but keep in mind that in the process of building this little project out I've had to pick up and learn quite a few tools: Docker, Lambda, S3, IAM, Python, Boto3, and a few more covered in the previous posts. If I need to do this again, it'll be much simpler based on what I've learned.

Thanks for reading along!

Building a Serverless Data Ingestion – Difficulties

This is part three in a four-part series on implementing a serverless, JSON-based approach to data ingestion using AWS.

When outlining the architecture and development process, I glossed over the problems that had to be overcome along the way. Most of my work life and free time isn't spent using Python, so the issues I confronted are likely trivial for more experienced developers. Still, doing something new, I ran into a few problems that were interesting and worth jotting down, if only for my own memory.

  1. Learning about the Dockerfile
  2. AWS Lambda events and layers
  3. Learning Boto3

Learning about the Dockerfile

When starting off with Docker, I was throwing things at the wall and seeing what stuck. Originally, I was testing against a standard Ubuntu image for the final function that would end up in AWS Lambda. In retrospect, this was not the right approach; I should have started with the amazonlinux image that is readily available on Docker Hub. Once I understood how to create the Dockerfile from that image, the next step was understanding how to get the code into the container.

My first instinct was to create the Dockerfile in a specific subdirectory of the code base, with a structure like the following:

The entirety of the GitHub repo is Forsta, with subdirectories serving specific purposes.

  • Database: Contains code to create the DynamoDB database tables, and other configurations.
  • Parser: Contains the code for moving data between S3 buckets and from S3 into DynamoDB. It also contains functions to clean the data and create a primary/unique key for the DynamoDB table.
  • Test: Contains all unit tests or end-to-end tests I would need to create. It ended up containing the function executed by the AWS Lambda function, which needs to be rectified in the future.
  • Docker: The final directory was meant to be Docker, which would have held a collection of Dockerfiles used for different Lambda functions. That's where I ran into some issues with pathing.

Because of where the Dockerfile sat in this structure, I couldn't easily use the ADD command, which meant I couldn't pull the required files into the Docker container to test my code. My recommendation: keep the Dockerfile in the topmost directory of your repo (in this case, directly under Forsta), and you can easily get all of the code you need into the container.

AWS Lambda

This was my first time using AWS Lambda, and it was a bit bumpy at first. My original approach was to create a class, which I would then call in Lambda. While that is basically what the end result was, the route there involved some discovery and mistakes.

The first time I attempted to deploy code to Lambda, I just took the class, zipped it up, and tried publishing the Lambda function. I hadn't thought through the fact that something other than my test scripts would have to call the published code.

The second time, I published one of my test scripts that worked in the container, hoping it would run the desired code. Again, this did not work out. After some further research, I found that an AWS Lambda function requires a handler that receives an event to kick off the execution of the desired code. In retrospect, this makes complete sense.

The third attempt I got right, after looking at this great tutorial. The key is to create a wrapper that accepts the event from the AWS environment and kicks off the underlying code I was looking to execute. You can see the repo here, with the handler shown below:

from tests import test_parser

def lambda_handler(event, context):
    test_parser.t_parser()
    print("Completed")

All of this could have been averted by reading the documentation before trying to deploy. To get to this point, I had to refactor the directory structure a couple of times (leading to code changes) and deploy multiple times. Lesson learned: documentation is in fact worth reading.

Learning Boto3

Let’s start off with the basics. What is Boto3? Luckily, the Boto3 documentation has a simple overview on the landing page.

You use the AWS SDK for Python (Boto3) to create, configure, and manage AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). The SDK provides an object-oriented API as well as low-level access to AWS services.

– Boto3 Documentation

This library underpins everything done as part of this effort. Getting the data to where I needed it was easy; the complications came from understanding how to clean the data into a format that would be useful in a DynamoDB table.

This can be seen primarily in the ingester class, located here and pictured below.

import boto3
import time
import gzip
import json
from io import BytesIO

class ingester():

    # Print the name of every bucket the configured account has access to,
    # along with the objects each bucket contains
    def s3_list_buckets(self):
        for bucket in boto3.resource('s3').buckets.all():
            print(bucket.name)
            response_dict = boto3.client('s3').list_objects(Bucket=bucket.name)
            print(response_dict.keys())
            # Ensure the bucket has content before trying to pull content info out
            try:
                response_dict['Contents']
            except KeyError:
                print('No objects in ' + bucket.name + ' exist.')
            else:
                print(response_dict['Contents'])
                objs_contents = response_dict['Contents']
                print(objs_contents)
                # Unnecessary, but good for reference
                #for i in range(len(objs_contents)):
                #    file_name = objs_contents[i]['Key']
                #    print(file_name)

    # Copy a data file from its source S3 location into the landing bucket,
    # timestamping the key so repeated loads don't overwrite each other
    def copy_object(self, source_bucket, object_key, target_bucket):
        target_object = object_key + str(time.time())
        copy_source = {
            'Bucket': source_bucket,
            'Key': object_key
        }
        s3 = boto3.resource('s3')
        landing_bucket = s3.Bucket(target_bucket)
        try:
            landing_bucket.copy(copy_source, target_object)
        except Exception as ex:
            print(ex)
        else:
            print('Success! Object loaded to: ' + target_object)
            return target_object

    # Turn the gzip-compressed file in S3 into a plain-text object in the same bucket
    def convert_object(self, target_bucket, target_key):
        s3_client = boto3.client('s3')
        read_object = s3_client.get_object(
            Bucket=target_bucket,
            Key=target_key
        )
        read_byte_object = BytesIO(read_object['Body'].read())
        raw_data = gzip.GzipFile(None, 'rb', fileobj=read_byte_object).read().decode('utf-8')
        new_key = target_key[target_key.rindex('/')+1:] + str(time.time()) + '.txt'
        s3_client.put_object(Body=raw_data, Bucket=target_bucket, Key=new_key)
        return new_key

Looking at the convert_object function, you can see quite a bit of finagling was needed to get the data into the required format and move the contents into my single landing bucket, which is where I'm storing all of my information, as outlined in the architecture. After doing this project, I realized the hard part of the library, as with anything, is learning how the different functions return data and how they should be used in tandem to make a coherent solution. That said, the documentation is great and there is a plethora of resources and blogs.
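As a rough illustration of using these functions in tandem (instantiating the class normally rather than passing None for self, and with a made-up object key), the flow from the ingestion bucket to a converted text object looks something like this:

# Illustrative usage only; bucket names match the architecture above, the key is hypothetical
landing_ingester = ingester()
copied_key = landing_ingester.copy_object(
    'ingestion-bucket-11.15.2021',       # source bucket
    'logs/export-2021-11-15.json.gz',    # hypothetical object key
    'landing-bucket-11.15.2021'          # target bucket
)
if copied_key:
    landing_ingester.convert_object('landing-bucket-11.15.2021', copied_key)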

Specifically, I’ll call out the following as a great place to start when looking to get something like this off the ground and into the cloud.

Delivering Change

When you think of great products, what do you think of? Products that have altered the way office jobs are done (Excel, Outlook, Dropbox), or changed the way we interface with computers and technology (the iPhone). Hearing from the VP of Product at Change.org recently was an opportunity to hear from the head of one of the most impactful products on the market today, as evidenced by its impact on politics around the globe. This presentation concerned the organizational tools the team at Change.org used, which led to the successful creation of an entirely new product line, circa 2011, to support a major shift in Change.org‘s business model.

French president responding to Change.org petition

The problem facing Change.org was that its revenue had come from a pure B2B model, which was no longer viable; the grassroots petition platform that exists today did not yet exist. It became clear to the organization's leadership that the business model had to change due to external market factors. Nick (the VP of Product) led this effort from the product side and, using a few different concepts and methodologies, successfully led the team to create the product line that made Change.org what it is today.

Problems Faced

The main internal problems that the company faced in developing an entirely new product line were the following:

  1. Speed of Delivery: Slow progress towards product completion
  2. Alignment: Getting the organization to work in the same direction
  3. Focus: Inability to make meaningful progress towards the goals that have been set

To confront these problems, the team at Change.org used a few different tools that stood out to me. They pulled some of the obvious efficiency levers, like automating tasks, but the strategic and organizational decisions are the most interesting.

Avalanche

As a former product owner, one of the main frustrations I regularly encountered was the inability to make forward progress because of all the work needed to “keep the lights on,” as it's commonly phrased. Change.org made the conscious decision to focus solely on creating new features for a limited, focused period of time. That meant no fixing features that broke (other than P1s, I assume), and no DevOps work on performance tuning. The whole organization was fully focused on features.

One anecdote that stuck with me was that two college hires ended up working on a feature with the CTO. When you have an entire organization aligned behind a single goal, people naturally move across, up, and down traditional power structures to achieve it, which seems to be a huge gain in creating innovative solutions. Beyond delivering more towards that single goal, it also gives people the ability to see the pain points and issues other teams are dealing with.

Once the avalanche was done, Nick mentioned that DevOps had a much better understanding of the issues facing the application development team, because those DevOps engineers had been doing feature work on the application. With that understanding, they were able to take those learnings and implement solutions that made a big impact within the organization.

Shots on Goal

Anyone who has worked in an agile environment should be aware of this, but the effect it had on the organization is worth mentioning, specifically how deliberately the discussion was had with leadership. Nick realized they were entering an entirely new market, so it was unlikely the organization could accurately predict which features new consumers would adopt without releasing many features and finding out what worked.

To that point, Nick organized his team to focus on flexing their speed muscle rather than quality. Why deliver 10 perfect features when you could deliver 100 working features and have a much higher chance the organization will strike gold? That strategy led to multiple successes in the new product line and an ability to run experiments faster than they thought possible under the previous paradigm, which tried to achieve both speed and quality. When a team is focused on an immense effort, making demonstrable progress is usually one of the hardest things. Focusing on speed enabled much faster feedback loops, with some of these features contributing to more than 20% revenue growth from the new product line.

OKR

The Objectives and Key Results methodology. Google used it, Intel used it, and so have countless other organizations. It's not reasonable to think an organization can deliver an entirely new product line and business model without complete organizational focus. Using OKRs, Change.org was able to create that focus. As a result, the teams were able to deliver on what was meaningful and impacted the bottom line, rather than working towards multiple objectives and goals that caused people to pull against one another.


Great book which provides a history and overview of OKRs

Having worked in multiple organizations, I've noticed that people don't know what to do when there are too many goals. Being able to walk into a room and point to a single goal means the organization can coordinate and implement more easily. When there is no clear goal, unnecessary roadblocks and fiefdoms tend to spring up, pulling the organization in different directions and getting in the way of its goals.

Operating teams using the above three strategies worked for Change.org. Using some of these strategies has worked for other large and small organizations, and from what I’ve seen thus far in my career, could be used in many other organizations to transform the way products are delivered and impact the bottom line.

Using Redash for Visualization

Over the winter break I had a conversation with my cousin about the awesomeness of Tableau and all it offers. While Tableau is a best-in-class product, he raised a couple of valid points against it.

  1. Tableau uses its own proprietary language and functions for a lot of aggregations and advanced functionality that could be done in SQL. SQL-based tools (he referred to Metabase explicitly, among others) are better because most analysts already know the language and can pick the tool up easily.
  2. Tableau isn't open source. With an open-source tool, if a feature doesn't exist and I know the language, I can add it myself (depending on my ability to code in the respective language).

That got me looking at open-source reporting and dashboarding tools that are heavily SQL-based. I cracked open both a Metabase and a Redash instance to play around with. Metabase had a good number of features available, but an extremely limited number of rows that could be ingested on the free tier.


Heroku – Different pricing tiers

So, not wanting to spend any money upfront, I went over to Redash and started building my first dashboards using the free trial. Needless to say, I fell in love with the tool instantly, and then my love was just as instantly tempered by Redash's limitations. Below is the dashboard I created; unfortunately it can no longer be accessed since the free trial period ended, but it appeared as pictured.


Redash dashboard for Steam and Sony Marketplace price changes

Running through the charts, you can see the game and pricing data that has been collected by my scrapers. Scrolling through the different charts you’ll see the following:

  1. Largest Price Drops and Increases: Shows, by day, the maximum price increase and the largest decrease, along with the average price change for all items that changed price that day.
  2. Average Price Change: The average new price and average previous price for all items in the Steam Marketplace and Sony Play Store that had a price change that day. In green is the average change between the old and new price.
  3. Most Recent Price Decreases: The last 20 items which have decreased in price, and associated data.
  4. Most Recent Price Increases: The last 20 items which have increased in price, along with associated data.

At the bottom of each of these panels you’ll see when the data last was refreshed from the database.

Now, onto the original reason for writing this article: the pros and cons of building dashboards in this tool.

Pros

  1. No row limit encountered on the free trial! What a great feature. My first requirement was being able to ingest a large amount of data and do aggregations on it. Not running into a hard row limit with my small 40,000-record data set, as I had elsewhere, originally sold me on this tool.
  2. Easy to get up and running. Setting up the tool and making use of the connectors already present was extremely easy.
  3. SQL interface is extremely intuitive to anyone who has used SQL Server Management Studio/PgAdmin, or any other database querying GUI tool.
  4. Refreshes are extremely easy to schedule and reliable.

Cons

  1. Pricing. The software is free if self-hosted, but if hosted through Redash.io like I did, the price is $49 a month on the lowest tier. Hosting Metabase on Heroku is just as easy and cheaper in the long term for small side projects.
  2. No aggregations can be done within visualizations, which in my view is a must-have for any dashboarding/reporting tool. Redash forces you to push that logic into the SQL, which results in redundant, complex queries. It also forces you to pre-aggregate, so the “no row limit” feature in the pros section matters less in practice.
  3. Visualization features are basic and limited. Other dashboarding tools give you stacked bars, tooltips, color-scheme choices, and generally more customization options, which are either unavailable or limited compared to a tool like Tableau.


Redash query editor. Combine multiple query visualizations to create a dashboard.

Redash feels like a tool meant for another use case, perhaps one where you need a basic tool for monitoring ETL or some other system. Using Redash was a good experience, but something like Metabase is the tool for me at the moment. Because all calculations in Redash have to be pushed down into SQL and can't be done as aggregations in the tool, and given the pricing, it doesn't suit my use case.

Re-establishing a Broken Cloud

This week, I opened Tableau to connect to my Amazon RDS instance and noticed the connection wasn't working. Logging into the AWS console, I found my RDS instance had disappeared (along with all the data in it). Perusing my emails, I noticed an unpaid bill from Amazon from about a month prior. So, along with the instance no longer running, I had lost all the data I'd been collecting over the past three months, which is more than slightly disappointing.

This does present an opportunity, though. My EC2 instance is still running and has been trying to push data to a server that no longer exists, meaning I need to set the RDS instance back up. This was a chance to document setting up a new RDS instance on AWS from scratch, with all the necessary users, objects, and privileges, and to record how long it took.

Here's the process from start to finish:

Start: 4:12 pm

First step: logging in and getting the instance created. You'll notice during this step that I flip away from the free tier to get more storage, then flip back. Why pay for increased storage I won't need for a couple of weeks? All I need to do to up the storage later is change a configuration setting, which will take my RDS instance down for a couple of minutes.

Second step: making sure security settings are set up. After my first project a couple of years ago, when my web server was destroyed by a script kiddie, I now only open specific ports (which I should have been doing all along).

Third step: I should be able to log in to the server using the account I set up as admin. Once I'm in, all I have to do is execute the create scripts I have.

The way I created the DDL for my tables and schemas means I can copy and paste it into a query window in pgAdmin 4 and execute the scripts. You'll notice a couple of semicolon issues, which I resolved.

Finally, it looks like everything has been created. I just need to validate that my different accounts can log in to the server and have the appropriate privileges, which they did.

Finish: 5:17 pm

Successfully back up: connections exist from my EC2 instance, with no alteration to any code on the server!

This process didn't require any alteration of the EC2 instance, and it took me from a web server scraping the internet and sending files into the ether (nowhere) to having a full database stood up with all its objects. It was done in a little over an hour, and roughly 30 minutes of that was spent copying and pasting SQL into the query editor, executing it, testing that objects and configuration were correct, and fixing minor syntax issues, all of which could be automated away.
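As a sketch of how that last ~30 minutes could be automated, something like the following could run every create script in order against the new instance. The host, credentials, and file layout are placeholders, not my actual setup.

import glob
import psycopg2

# Placeholder connection details for the rebuilt RDS instance
conn = psycopg2.connect(
    host='my-rds-instance.abc123.us-east-1.rds.amazonaws.com',
    dbname='scraper',
    user='admin_user',
    password='********'
)

with conn:
    with conn.cursor() as cur:
        # Run each DDL script (schemas, tables, privileges) in file-name order
        for script in sorted(glob.glob('ddl/*.sql')):
            with open(script) as f:
                cur.execute(f.read())
conn.close()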

I had been debating whether to get a personal server for my projects, but in my mind this firmly cements the cloud as the better choice for infrastructure. Compared to my experience setting up a local SSAS and SQL Server instance, this took about 10% of the time and was extremely easy to get running.

A Foray Into Serious Scraping

It's been a while since my last post. Getting married, honeymooning, buying a house, and so on took away the time I had for this. But all of that is nearing its end, so I'm getting back into a regular cadence of working on the scraping project. Since I'm back into it and have everything up and running, the most sensible place to start is the architecture that was implemented.

The Problem:

The most sensible place to start any discussion of architecture is clearly stating what the system is supposed to do. I need a system that accomplishes three things.

  1. I need a system that is able to reliably scrape data from any website or consume data from any source.
  2. I need a place where this data can be loaded and reported on in a cohesive format.
  3. I need the product to be lightweight in terms of storage and CPU so I don't have to pay out the wazoo.

Overview:


Overview of the architecture, from inputs to database inserts.

To meet these requirements, a straightforward architecture was implemented. Using Amazon Web Services, both an EC2 instance and an RDS instance were set up, with the EC2 instance running Ubuntu and the RDS instance running PostgreSQL. In sequential order, here is how the scraper works.

  1. Using Python's Scrapy library, we've written Scrapy projects that target specific sources and bring in data based on the HTML of those websites. Right now we've targeted two sources, but we can expand to as many as needed. The Scrapy spiders are scheduled through Scrapyd, a framework that not only allows scheduling and management of spiders, but also offers better performance because it runs on Twisted, making it asynchronous.
  2. As the spiders run, they output JSON files on the server. The driver here is to have a place to drop the scraped data so it won't be lost if something happens to one of the processes.
  3. A Python class was written with Psycopg2 in a way that is meant to be extensible for future data sources. The idea is that as the data model and data sources change or expand, the only thing that needs to change is the class itself; none of the scripts that call the class to insert data from existing sources will need to change. (A rough sketch of this idea follows the list.)
  4. A staging area was created within the RDS PostgreSQL instance to ingest the raw data from each source. Where possible, a unique index was created that checks for changes before accepting data into the staging area. Since the scrapers hit their sources repeatedly, we're going to grab the same data over and over; what we're interested in are the changes, especially new items or price changes. We also want the architecture to be as efficient as possible, so storing only the data we're interested in just makes sense.
  5. Once data has been accepted into the landing zone, the Ubuntu instance schedules a slew of ETL jobs written in SQL and passed to PostgreSQL for execution through Psycopg2. PostgreSQL doesn't have a native scheduler readily available, so we use Ubuntu's crontab to execute a script for each source, which calls into a class containing all of our ETL functions. The end result is a 3NF model populated with data and the appropriate relationships in place.
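
To make the extensibility idea in step 3 concrete, here is a rough sketch of what such a loader class could look like. The table name, columns, and conflict handling are placeholders rather than the real data model.

import json
import psycopg2

class StagingLoader:
    def __init__(self, connection_string):
        self.conn = psycopg2.connect(connection_string)

    def load_json_file(self, file_path, staging_table):
        # Insert each scraped record, letting the unique index reject rows we already have;
        # the table and column names here are placeholders
        with open(file_path) as f:
            records = json.load(f)
        with self.conn, self.conn.cursor() as cur:
            for record in records:
                cur.execute(
                    f"INSERT INTO {staging_table} (item_name, price, scraped_at) "
                    "VALUES (%s, %s, %s) ON CONFLICT DO NOTHING",
                    (record['name'], record['price'], record['scraped_at'])
                )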

So, it's now up and running, and data is flowing into the objects. The data is populating all the tables where it's expected, and I could begin reporting price changes today. The best part? All of this was built with $0 of infrastructure spend on Amazon Web Services (and a lot of my time). I am running out of storage rapidly (the 20 GB from the free tier ran out over a couple of days), and the CPU is not beefy at all, so it stalls if more than one scraper runs at a time (as pictured below).


Performance goes down dramatically in yellow. In red, my scraper has been blocked from accessing the site (which didn't happen before the refactor…I'll go into that another time)

Referring back to the original goal, I would say it has been achieved. That's not to say it couldn't be improved and optimized, but overall, the first serious foray into scraping seems to have gone well. Feel free to reach out with any questions or suggestions!

What Should Documentation Be?

The problems Business Intelligence teams solve are generally the same: pull some data out of somewhere, synthesize it, analyze it, then create a picture of what is happening, what is going to happen, or what has happened. Working in small and large organizations, I've had the pleasure of seeing a variety of processes used to deliver these insights. They range from the overbearing, with documentation requirements that crush people's productivity, to the ultra-lightweight, which creates quicksand beneath teams' feet through a lack of knowledge transfer.

Having seen both the overbearing and the extremely lightweight, there's one conclusion I've arrived at concerning documentation…

Document as Little as Possible

 


Relevant literature

Don't commit time to things that aren't creating revenue or helping the business. Looking at IT projects, there is no doubt that the more documentation there is, the less value it tends to deliver. The perfect example came up while having a beer with a former co-worker.

 

He brought up that, at the company we had both worked at, the documentation for a change took longer to create than coding, testing, and implementing the change itself. Worse, this painstakingly crafted documentation, which the engineer had to spend time tracking down information for, didn't result in documents that were useful to the team doing the work going forward. The process decreed that you must document X, Y, and Z in order to deploy the change, so that's what was done. The fundamental truth is that the “…benefit of having documentation must be greater than the cost of creating and maintaining it.”

Some people believe the exact opposite of over-documentation: nothing should be documented, and the code or implementation should speak for itself. This may work when you have a small application that will always be managed by the same group of individuals (which rarely happens). Once you reach an application spanning multiple servers, teams, and databases, expecting the code to “speak for itself” in a timely manner to the people who have to report on it and get analytics out is unreasonable.

So, what am I proposing in this rant? The only useful documentation I've seen documents the “Why” and the “How”. Everything else fails to create value for the organization, because the cost to develop and maintain it is too high.

Why

Creating a BI product entails connecting a business process to one or more applications or databases. Whatever environment you're working in (Inmon, Kimball, or something else entirely), you need to know why the things in your system exist. The “Why” matters not only at a high leadership level, but also at the low level of technical implementation. A “Why” statement at the low level helps ensure a team uses previously created tools and implementations as designed, and that if a change is made that goes against the original “Why”, it is intentional and by design.

As an example, on the Vehicle Profitability by VIN project, the data architect created both Inmon (3NF) and Kimball (dimensional reporting) structures. The “Why” was made extremely apparent through documentation, so the teams knew how to use the current implementation to achieve their goal in the best way possible.

Are you importing new invoice data? That should go into the wholesale invoice structure so that it flows up into the existing fact that contains vehicle revenue information, which our reports feed from. Why? Because we want a single source of truth for vehicle revenue.

When documentation provides the “Why” for technical implementations, it makes adding to and changing existing processes and assets easier, as opposed to reinventing the wheel over and over.

How


basic data flow diagram

Once we know why something exists, the other piece of useful documentation is the “How”. The “How” shouldn't be step-by-step instructions; it should function like a high-level map. Data flow diagrams are a great example of “How” documentation that I've found useful for Business Intelligence products. Armed with a data flow diagram and the “Why” of the design, team members who need to report on, extend, maintain, or refactor a system can make informed decisions.

 

 

Make It Useful

At the end of the day, writing documentation takes time away from creating code, analysis, and direct business value, so the argument for it is hard to make to someone who hasn't experienced the pains of its absence: architecture that makes no sense, misreported numbers, time wasted building processes that do exactly what existing processes already do.

Without documentation, maintaining and using a system or process as intended is impossible. With documentation that is accessible, searchable, and focused on the “How” and “Why”, organizations can make smart, informed decisions about where to spend time, how to tweak things, and how to get value from their assets.

 

Data Day Texas 2017 – A Few Thoughts

Earlier this month I had the opportunity to attend Data Day Texas, and thought that it would be worthwhile to jot down a few thoughts. For those that aren’t aware of Data Day Texas, think of it as a gathering of nerdy IT people and Data Scientists. It was an interesting weekend with a wide range of topics that encompassed everything from machine learning algorithms to more approachable subjects like data dashboarding.

Graphs Are Here

There's a reason the keynote by Emil Eifrem was named “The Year of the Graph”. Looking at the popularity trend on db-engines.com, you can see a large gain in the popularity of Neo4j, which leads naturally to the question: so what?

Neo4j Popularity

I think the major selling point for graph databases, beyond performance on certain types of analytics, is that they are defined by the relationships between data. This is in opposition to the traditional RDBMS approach, which requires explicitly defining the tables in the schema, with relationships as an optional afterthought in most cases. When constructing a graph database, you explicitly define the nodes (core pieces of data) and the edges (relationships between nodes), so you are enforcing the relationships between data rather than just the structure of the data itself. This creates another level of abstraction between the end user and the data, which should make the data more approachable. Oh, and if you haven't guessed, graph databases are schema-less, which is a plus in many cases.
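As a small illustration of that relationship-first mindset, here is a toy example using the official Neo4j Python driver; the connection details, labels, and relationship name are all made up for the sketch.

from neo4j import GraphDatabase

# Toy example: the model is just nodes plus an explicitly named relationship between them
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))

with driver.session() as session:
    session.run(
        "MERGE (p:Person {name: $person}) "
        "MERGE (c:Company {name: $company}) "
        "MERGE (p)-[:SHAREHOLDER_OF]->(c)",
        person='Jane Doe', company='Example Holdings Ltd'
    )
driver.close()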

Issues Are Similar Across Companies/Technology

In particular, two talks hit this point home. The first, given by Chris LaCava from Expero Inc., discussed visualization techniques for graph databases. The second was Stefan Krawczyk's discussion of how Stitch Fix sets up its environment for data scientists.

What's the root of this? People want to use the tools that work and that they like. Chris LaCava discussed how to do visualization on graph databases. While graph databases can meet some cool use cases as far as data sets and real-time analytics go, what he presented was a straightforward, common-sense approach to dashboarding. Anyone familiar with Business Intelligence and dashboarding should roughly be following the process pictured, or something close to it.

Look familiar? From Chris’ presentation on graph database dashboarding

Stefan‘s talk was all about using Docker to let data scientists use the tools they want, a solution to the complaint many of us in the industry have about being locked into a specific tool-set. The differentiator here is that Stitch Fix has done containerization at scale, allowing each of their data scientists to run their own environment with whatever tool-set they favor to deliver business value.

The Story is What Makes Things Interesting

The final point, which I've written about before, is that the story is what makes things interesting. The specific story presented at Data Day? The Panama Papers, and how Neo4j was used to discover the unlikely connection that led to the downfall of a prime minister. It was the best marketing I have ever seen for a database. Having a database GUI that allows easy, native exploration of the data? That's a game changer.


The comparison showed a traditional RDBMS GUI (SQL Server Management Studio) versus Neo4j's GUI. There's a reason people don't pull up SQL Server Management Studio to tell a story. Having a database platform that can tell a story about the data on its own is an awesome approach.

 

A Framework for Innovation

Creating change. A fun subject, and an admirable goal according to the American ethos and the media our society has spawned. Even though innovative ideas may go against the grain or the way things are currently done, many consider it a virtue to pursue them. With so much positive emphasis on innovation in our culture, why is it so difficult?

The macro effect of innovation on a company

There are thousands (if not millions) of reasons why, and I couldn't hope to answer that in a short blog post. But looking at the reverse question, “how do successful innovations occur?”, many insights are available. There are many theories with many names, but reading through them, what I've found boils down to two things:

  1. Let people know how the innovation will affect them
  2. Make things easy for the people who are going to be using the innovation

The book Switch lays out the best framework (in my opinion) for accomplishing these two goals.

Framework for Change

The basic idea is that every person can be pictured as a rider on top of an elephant going down a path. To change where the rider ends up, you can alter three things. You may have guessed it: the rider, the elephant, and the path.

The Rider: In the metaphor, this is the logic. Everyone has logic (although some riders are weaker than others) that helps shape how they behave. The rider is the part of the person who, when starting a new habit like running in the morning, sets the alarm.

The Elephant: Emotions and subconscious drive. At the end of the day, the elephant dictates where the rider goes. The average ~150-pound rider can only control a 13,000-pound animal for so long before becoming exhausted.

The elephant is the reason the planned morning run gets cancelled by multiple pushes of the snooze button. The elephant is also the reason people work 90-hour weeks and are excited to do so.

The Path: The final component for creating change is the external environment in which every individual operates. These are the external forces that affect behavior. Shaping those forces, and how they act upon elephants and their riders, gets the rider moving towards the desired destination.


All in all, it's pretty straightforward, right? Well, it is definitely much easier to conceptualize and talk about than it is to implement. So many people, myself included, fail to address all three at the same time, leading to great ideas falling by the wayside.

For those who want to change things for the better, hopefully this framework can help you turn innovation into reality.

Down with Vertical Database Architecture

The goal of gathering data can be broken down into a combination of three things: understanding what has happened, what is happening, or projecting what will happen. When getting answers to these questions, as long as the answer is obtained, why does it matter how it was obtained?

Getting a view into this information can be done many different ways, and with the products available on the market it can be done for free and with minimal IT know-how. There is, however, a time and a place to pay a premium on IT projects to obtain capabilities none of your competitors have: when a solution needs to be scalable and tailored to your unique needs.

This is when architecture comes in.

Just like any structure, a database architecture can be flat or tall. What is the difference? To run with the analogy of comparing database architecture to buildings, a skyscraper (vertical) is much more complex to build and maintain than a house (horizontal).

Horizontal

A horizontal architecture can be pictured like a house in the suburbs: commissioned by you, easily customizable, and suited to your needs and wants. Do you want a pool? Easy. A larger living room or a smaller kitchen? That can be done.

Bringing the analogy back to databases, a flat architecture means your data is served from a single layer, or as few layers as possible. It is much easier to understand how the wiring, plumbing, and lighting were put into a house than into a skyscraper. And when you want to install a pool, it's far easier to install and maintain in a backyard than on the 23rd floor of a high-rise.


Vertical

A vertical architecture means many structural layers in the database, and with them comes complexity. The difference between a physical skyscraper and a database? Skyscrapers are generally built when there is no more land to build flat on; that constraint doesn't apply to databases.

Why would anyone build a vertical architecture, then? In my experience, it comes from time and resource constraints (two-thirds of the magic time-resource-quality triangle) and short-term thinking.

Benefits of Horizontal

  1. Decreased cost: Fewer people needed to maintain complex solutions, and more time spent creating value for you.
  2. Higher quality: More visibility into what is happening where. Instead of having to dig through and learn how everything was built, people working with the data only have to learn the specific portions they are interested in.
  3. Faster delivery: The final win for a flat architecture is speed of delivery. By reducing complexity, people spend less time learning and more time creating value. While you may save time in the short term with a vertical architecture, you will pay dearly in the long term.