A Foray Into Serious Scraping

It’s been a while since my last post. Getting married, honeymooning, buying a house, etc. took away the time I had for this. But all of that is nearing its end, so I’m getting back into a regular cadence of working on the scraping project. Now that I’m back into it and have everything up and running, I figured the most sensible place to start is the architecture that has been implemented.

The Problem:

The most sensible place to start any discussion of architecture is clearly stating what the system is supposed to do. I need a system that accomplishes three things.

  1. I need a system that can reliably scrape data from any website or consume data from any source.
  2. I need a place where this data can be loaded and reported on in a cohesive format.
  3. I need the product to be lightweight in terms of storage and CPU so I don’t have to pay out the wazoo.

Overview:

Scraper Architecture

Overview of the architecture, from inputs to database inserts.

In order to meet this, a straightforward architecture was implemented. Using Amazon Web Services, both an EC2 instance and an RDS instance were set up, the EC2 running Ubuntu and the RDS running PostgreSQL. In sequential order, here is how the scraper works.

  1. Using Python’s Scrapy library, we’ve written Scrapy projects that target specific sources and bring in data based on the HTML of those websites. Right now we’ve targeted two, but we can expand to as many as needed. These Scrapy spiders are scheduled through Scrapyd, a framework that not only allows for scheduling and management of spiders, but also offers better performance by running on Twisted, making it asynchronous. (A minimal spider sketch follows this list.)
  2. As the spiders run, they are constantly outputting to JSON files on the server. The driver here is to have a place to drop the output onto the server so that data won’t be lost if something happens to one of the processes.
  3. A Python class was written with Psycopg2 in a way that is meant to be extensible for future data sources. The idea is that as the data model and data sources are changed or expanded, the only thing that will need to change is the class itself; none of the scripts that call the class to insert data from our existing data sources will need to change. (A sketch of this loader follows below.)
  4. A staging area was created within the RDS PostgreSQL instance, which ingests the raw data from each data source. Where possible, a unique index was created that checks for changes before accepting the data into the staging area. Because the scrapers hit their sources repeatedly, we will keep grabbing the same data; what we’re interested in are the changes, especially new items and price changes. We also want the architecture to be as efficient as possible, so storing only the data we are interested in just makes sense.
  5. Once data has been accepted into the landing zone, the Ubuntu instance schedules a slew of ETL jobs written in SQL and passed to PostgreSQL for execution using Psycopg2. PostgreSQL doesn’t have a native scheduler readily available, so we use cron on Ubuntu to execute a script for each of our sources that calls a class containing all of our ETL functions. The end result is a 3NF model populated with data, with the appropriate relationships made. (A sketch of the cron-driven ETL runner also follows below.)
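
To make step 1 concrete, here’s a minimal sketch of what one of these spiders looks like. Everything in it is a hypothetical placeholder: the spider name, the URL, and the CSS selectors all depend on the actual source being scraped.

```python
import scrapy


class GamePriceSpider(scrapy.Spider):
    # Hypothetical spider; each real source gets its own spider and selectors.
    name = "game_prices"
    start_urls = ["https://example-marketplace.com/games"]

    def parse(self, response):
        # Yield one item per listing on the page.
        for listing in response.css("div.game-listing"):
            yield {
                "title": listing.css("h2::text").get(),
                "price": listing.css("span.price::text").get(),
                "url": listing.css("a::attr(href)").get(),
            }
        # Walk the whole catalog by following pagination links.
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Once deployed to Scrapyd, a run can be kicked off (or scheduled repeatedly) with a POST to its schedule.json endpoint, and Scrapy’s feed exports handle writing each run’s output to JSON on the server, which is step 2.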
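
Steps 3 and 4 together look roughly like the sketch below. To be clear, this is a hedged illustration rather than the actual class: the staging.game_prices table, its columns, and the connection string are made up, and ON CONFLICT DO NOTHING (available since PostgreSQL 9.5) is one way to let the staging area’s unique index silently reject rows we’ve already seen.

```python
import json

import psycopg2


class StagingLoader:
    """Extensible loader: one method (or config entry) per data source, so the
    scripts calling it never have to change as the data model evolves."""

    def __init__(self, dsn):
        self.conn = psycopg2.connect(dsn)

    def load_game_prices(self, json_path):
        with open(json_path) as f:
            rows = json.load(f)
        # One transaction per feed file; the unique index on the staging
        # table drops rows that haven't changed since the last scrape.
        with self.conn, self.conn.cursor() as cur:
            for row in rows:
                cur.execute(
                    """INSERT INTO staging.game_prices (title, price, url)
                       VALUES (%s, %s, %s)
                       ON CONFLICT DO NOTHING""",
                    (row["title"], row["price"], row["url"]),
                )


# Usage: point it at a feed file dropped by one of the spiders.
loader = StagingLoader("dbname=scraper user=etl host=<rds-endpoint>")
loader.load_game_prices("/var/scraper/output/game_prices.json")
```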
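
And here is a sketch of step 5: a small per-source script that cron runs on the Ubuntu instance, pushing SQL ETL into PostgreSQL through Psycopg2. Again, the class, the function, and the SQL are illustrative stand-ins, not the real ETL.

```python
import psycopg2


class EtlJobs:
    """Holds one function per ETL step; a thin script per source calls them."""

    def __init__(self, dsn):
        self.conn = psycopg2.connect(dsn)

    def _run(self, sql):
        # Each statement runs in its own transaction.
        with self.conn, self.conn.cursor() as cur:
            cur.execute(sql)

    def promote_game_prices(self):
        # Hypothetical promotion of new staging rows into the 3NF model.
        self._run("""
            INSERT INTO model.price_history (game_id, price, observed_at)
            SELECT g.game_id, s.price, s.scraped_at
            FROM staging.game_prices s
            JOIN model.games g ON g.url = s.url
        """)


if __name__ == "__main__":
    # Example crontab entry on the Ubuntu instance (every 15 minutes):
    # */15 * * * * /usr/bin/python3 /opt/scraper/etl_game_prices.py
    EtlJobs("dbname=scraper user=etl host=<rds-endpoint>").promote_game_prices()
```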

So, it’s now up and running, and data is flowing through into the objects. The data is populating all of the tables where it is expected, and I could begin reporting price changes today. The best part? All of this was built using $0 of infrastructure from Amazon Web Services (and a lot of my time). I’m running out of storage space rapidly (the 20 GB from the free tier filled up over a couple of days), and the CPU is not beefy at all, so it stalls out if more than one scraper is running at a time (as pictured below).

scrapyd-performance

Performance goes down dramatically in yellow. In red, my scraper has been blocked from accessing the site (which didn’t happen before the refactor…I’ll go into that another time).

To refer back to the original goal, I would say it has been achieved. That’s not to say it couldn’t be improved upon and optimized, but overall, the first serious foray into scraping seems to have gone well. Feel free to reach out with any questions or suggestions!

From Nothing to Something (The Beginning)

As part of a personal project, which I’m managing (and actively working) here, I’ve decided to do a little write-up on my approach, what I’m learning, and other technical things I’ve encountered. This is as much for my own memory as it is in the hope that I can help others avoid the technical pitfalls that I have encountered.

The Product:

I’ve always been someone extremely interested in data, especially data that no one else is looking at. So, what is the logical place to go? The most accessible data is the data that is already out there for the grabbing. So…scraping.

What has no one else scraped, or at least scraped, aggregated, AND displayed well? Game prices across different platforms. There are aggregators for all kinds of products (ammo, outdoor gear, etc.), but no one seems to have implemented one for games well, although they have tried.

With that goal in mind, we are building a product for people to track game prices and favorite games, so that they no longer have to follow news on multiple sites and check multiple web marketplaces for the best prices on games. This means we will be scraping Reddit, Twitter, and other news/social media sites, in addition to game marketplaces like Steam and Sony’s PlayStation Store.

What I Hope to Gain:

At the end of the day, maybe we strike gold by building the coolest website and app that ever existed, one that people love. More realistically, I want to build a platform with which I can add data as needed for my own wants and needs. I want to become expert-level with certain libraries and frameworks, and get to a point where I’m not just a Business Intelligence and ETL developer but can develop across the whole stack as needed, with ease.

I also want to gain experience setting up a highly performant, extensible ETL platform, and to end up with an app on a marketplace with at least one download, all of which will be done on a shoestring budget. I can then use that platform to pivot and build any sort of data-centric application for whatever purpose I want.

The Steps:

So, with all this being said, there are three main topics I will be writing about on a broad level.

  1. Writing scrapers with Python’s Scrapy library that run around the clock
  2. Writing ETLs to a PostgreSQL database with near-real-time availability, using a budget AWS instance
  3. Serving up the data to end users using an open source tool

More updates in the coming days!

What Should Documentation Be?

The problems Business Intelligence teams solve are generally the same across organizations: pull some data out of somewhere, synthesize it, analyze it, then create a picture of what is happening, what is going to happen, or what has happened. Working in small and large organizations, I’ve had the pleasure of seeing a variety of processes used to deliver these insights. These range from the overbearing, with documentation requirements that crush people’s productivity, to the lightweight, which creates quicksand beneath teams’ feet through a lack of knowledge transfer.

Having seen both the overbearing and the extremely lightweight, there’s one conclusion I’ve arrived at concerning documentation…

Document as Little as Possible


documentation_joke

Relevant literature

Don’t commit time to things that aren’t creating revenue or helping the business. Looking at IT projects, there is no doubt that the more documentation there is, the less value there is. The perfect example came up while having a beer with a former co-worker.


He brought up that the process at the company we both previously worked at required documentation that took longer to create than coding, testing, and implementing the change did. Worse, this painstakingly crafted documentation, which the engineer had to spend time tracking down information for, wasn’t useful to the team doing the work going forward. The process decreed that you must document X, Y, and Z in order to deploy the change/implementation, so that’s what was done. The fundamental truth is that the “…benefit of having documentation must be greater than the cost of creating and maintaining it.”

Some people believe the exact opposite of over-documentation: nothing should be documented; the code/implementation should speak for itself. This may work when you have a small IT application that will always be managed by the same group of individuals (which likely won’t happen). Once you reach an application spanning multiple servers, teams, and databases, expecting the code/implementation to “speak for itself” in a timely manner to the people who have to report and get analytics out of it is unreasonable.

So, what’s being proposed in this rant? The only useful documentation I’ve seen documents the “Why” and the “How”. Everything else doesn’t create value for the organization, because the cost to develop and maintain it is too high.

Why

Creating a BI product entails connecting the business process to an application(s) or database(s). Whether you’re working in an Inmon environment, a Kimball environment, or something else entirely, you need to know why the things in your system exist. The “Why” is important not only at a high leadership level, but also at the low level of technical implementation. A “Why” statement at the low level helps ensure that a team uses previously created tools and implementations as designed, and that if a change is made that goes against the original “Why”, it is intentional and by design.

As an example, working on the Vehicle Profitability by VIN project, the Data Architect created both Inmon (3NF) and Kimball (dimensional reporting) structures. The “Why” was made extremely apparent through documentation, so the teams knew how to use the current implementation to achieve their goal in the best way possible.

Are you importing new invoice data? That should go into the wholesale invoice structure so that it flows up in the existing fact that contains the revenue information for vehicles which our reports feed from. Why? Because we want a single source of truth for vehicle revenue.

When documentation providing the “Why” for technical implementations exists, it makes adding on to and changing the existing processes and assets easier, as opposed to re-inventing the wheel over and over.

How

Payment-Data-Flow-Diagram

basic data flow diagram

So after we know why something exists, the other piece that is useful to document is the “How”. The “How” shouldn’t be step-by-step instructions; it should function like a high-level map. Data Flow Diagrams are a great example of “How” documentation that I’ve found useful for Business Intelligence products. Armed with a Data Flow Diagram and the “Why” of the design, team members who need to report on, extend, maintain, or refactor a system can make informed decisions.


Make It Useful

At the end of the day, documentation gets in the way of creating code, analysis, and direct business value. So the argument for spending time creating documentation is hard to make to someone who hasn’t experienced the pains associated with a lack of documentation: architecture that makes no sense, misreported numbers, and time wasted building processes that do exactly what existing processes already do.

Without documentation, maintaining and using a system or process as intended is impossible. With documentation that is accessible, searchable, and focused on the “How” and “Why”, organizations can make smart, informed decisions about where to spend time, how to tweak things, and how to get value from their assets.


Data Day Texas 2017 – A Few Thoughts

Earlier this month I had the opportunity to attend Data Day Texas, and I thought it would be worthwhile to jot down a few thoughts. For those who aren’t aware of Data Day Texas, think of it as a gathering of nerdy IT people and data scientists. It was an interesting weekend with a wide range of topics, encompassing everything from machine learning algorithms to more approachable subjects like data dashboarding.

Graphs Are Here

There’s a reason that the keynote by Emil Eifrem was named “The Year of the Graph”. Looking at the popularity trend on db-engines.com, you can see a large gain in the popularity of Neo4j. Which leads naturally to the question: so what?

neo4j-popularity

Neo4j Popularity

I think the major winning point for graph databases, other than performance on certain types of data analytics, is that they are defined by the relationships between data. This is in opposition to the traditional RDBMS approach, which requires explicitly defining the tables in the schema, with relationships as a non-required afterthought in most cases. While constructing a graph database, you explicitly define the nodes (core pieces of data) and the edges (relationships between nodes), so you are enforcing the relationships between data rather than the structure of the data itself. This creates another level of abstraction between the end user and the data, which should make the data in the database more approachable. Oh, and if you haven’t guessed, graph databases are schema-less, which is a plus in many cases.
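
As a tiny illustration of the difference, here is a hedged sketch using the official neo4j Python driver (the URI, credentials, and data are all made up): the nodes and the relationship between them are created in a single statement, with no table schema declared up front.

```python
from neo4j import GraphDatabase

# Assumed local instance and credentials, purely for illustration.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # MERGE creates the nodes and the edge only if they don't already exist;
    # the relationship is a first-class part of the data model.
    session.run(
        "MERGE (p:Person {name: $person}) "
        "MERGE (c:Company {name: $company}) "
        "MERGE (p)-[:WORKS_FOR]->(c)",
        person="Alice", company="Initech",
    )

driver.close()
```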

Issues Are Similar Across Companies/Technology

In particular, two talks hit this point home. The first was given by Chris LaCava from Expero Inc., in which he discussed visualization techniques for graph databases. The second was Stefan Krawczyk’s discussion of how Stitch Fix sets up its environment for data scientists.

What’s the root of this? People want to use tools that work and that they like. Chris LaCava discussed how to do visualization on graph databases. While graph databases can meet some cool use cases as far as data sets and real-time analytics go, what he presented was a straightforward, common-sense approach to dashboarding.

dashboarding-design-process

Look familiar? From Chris’ presentation on graph database dashboarding

Anyone familiar with Business Intelligence and dashboarding should roughly be following the process above, or something near to it.

Stefan’s talk was all about using Docker to enable data scientists to use the tools they want to use: the solution to the complaint many of us in the industry have about being locked into a specific tool-set. The differentiation here is that Stitch Fix has done containerization at scale, letting each of their data scientists run and operate in their own environment, with whatever tool-set they favor, to deliver business value.

The Story is What Makes Things Interesting

The final point, which I’ve written about before, is that the story is what makes things interesting. The specific story presented at Data Day? The Panama Papers, and how Neo4j was used to discover the unlikely connection that led to the downfall of a Prime Minister. It was the best marketing tool I have ever seen for a database. Having a database GUI that allows for easy exploration of the data natively? That’s a game changer.

(Slideshow: SQL Server Management Studio vs. Neo4j’s GUI)

Looking at the above, you can see a traditional RDBMS GUI (SQL Server Management Studio) versus Neo4j’s GUI. There’s a reason people don’t pull up SQL Server Management Studio to tell a story. Having a database platform that can natively tell a story about the data is an awesome approach.


Working For The Small Guys

I recently left my last job, at what is by some rankings the 23rd largest company in the world, for a much smaller company. The general consensus from those I talked to was that a change of pace would be good experience, and many recommended a shift to a smaller company.

With that said, I think I’ve settled in enough to pick out what I see as the advantages of working for a small company. Luckily for me, it appears those I confided in gave me good advice. So here we go: about four months in, here are the three best things I’ve found working at a smaller company.

Clear Goals

Working for a smaller company, especially one that runs lean, means the company needs to use resources as efficiently as possible. This is evident in the fact that everyone knows what the company’s current goals are. Instead of being lost in the back of the office doing menial work without visibility into how you are helping the company, it’s clear what problems the business is facing and how specific pieces of work fit in to accomplish the larger goal.

business-commerce-work_ethic-office_job-corporate_culture-corporate_environments-hard_workers-wmi111019l

In addition, clearly outlined goals and objectives help build the feeling that teams are actually working together. In my experience, large companies have many objectives, and everyone has a different idea of how to meet them. That isn’t a bad thing. What is a bad thing is that in pursuing these many ideas, different teams sometimes actively work against one another’s interests while solving the same problem, all in the hope that their solution is the one that gets noticed and makes a successful career.

The Chance to Get Your Hands Dirty

The lack of bureaucracy. I love it. Instead of having to fight through multiple approval processes and layers of pointless requests to get access to data/tools, you get the ability to solve problems however you would like (and pay for the consequences).

What types of problems? Well, real problems. Instead of trying to solve the problem of moving data from one place to another, or making a banner on a homepage appear differently, you are doing things that have direct impact. Like what? In my team’s case, it’s creating customer segmentation strategies, or delivering insights on data that has never been seen before.

The best part? Instead of performance being judged on how quickly a problem is solved with a pre-defined approach, it is judged on results. Instead of the mantra “How fast did you deliver to specifications?”, the mantra is “How much value did you deliver?”

Lots of Opportunity to Push the Envelope

Looking at the points above, this is fairly obvious: new ideas are easier to implement in smaller organizations. Instead of having to fight through layers of process, you are up against reality. What do I mean by that? I mean that instead of fighting people over nothing, you are fighting the limits of technology, hardware, and business processes.

The only limit is your own lack of knowledge and passion.

A Framework for Innovation

Creating change. A fun subject, and an admirable goal according to the American ethos and the media our society has spawned. Even though innovative ideas may go against the grain or the way things are currently done, many consider it a virtue to pursue them. With so much positive emphasis on innovation in our culture, why is it so difficult?

innovation-vs-no-innovation

The macro effect of innovation on a company

There are thousands (if not millions) of reasons why, and I couldn’t hope to cover them in a short blog post. But looking at the reverse of that question, “how do successful innovations occur?”, many insights are available. There are many theories with many names, but across the different materials I’ve found and read, they boil down to two things.

  1. Let people know how the innovation will affect them
  2. Make things easy for the people who are going to be using the innovation

Switch lays out the best framework (in my opinion) for accomplishing these two goals.

Framework for Change

The basic idea is that every person can be pictured as a rider on top of an elephant going down a path. To change where the rider ends up, you can alter three things. You may have guessed it: the rider, the elephant, and the path.

The Rider: In the metaphor, this is logic. Everyone has logic (although some riders may be weaker than others) that helps form how they behave. The rider is the part of a person that, when starting a new habit like running in the morning, sets the alarm.

the-rider-the-elephant-and-the-path_50290b0771b02_w1500.png

The Elephant: Emotions and subconscious drive. At the end of the day, the elephant dictates where the rider is going to go. The average ~150-pound rider will only be able to control a 13,000-pound animal for so long before becoming exhausted.

The elephant is the reason why the planned morning run will be cancelled by multiple pushes of the snooze button. The elephant is also the reason why people work 90-hour weeks and are excited to do so.

The Path: The final component for creating change is the external environment in which every individual operates. These are the external forces that affect behavior. Shaping these forces, and how they act upon elephants and their riders, gets the rider moving towards the desired destination.


All in all, it’s pretty straightforward, right? Well, it is definitely much easier to conceptualize and talk about than it is to implement. Many people fail, myself included, to address all three at the same time, leading to great ideas falling by the wayside.

For those who want to change things for the better, hopefully this framework can help you get to the actualization of innovation.

Visualization Exploration and Explanation

Recently I’ve been looking at the visualization tools available on the web, and noticed a distinct difference. A quick Google search shows that I am far from the first person to notice it. The basic concept, and what many have found before me, is that visualizations fall into one of two categories: explanatory and exploratory.

Explanatory:

These are the visualizations used to illustrate an idea or result in a clear and concise way, removing the need for the audience to be 100% familiar with visualization and data. The key? These answer “why” some statement, assertion, or idea is true. The best examples are the more advanced and complicated tools used to visualize complex ideas that may be hard to explain or illustrate by traditional means.


Exploratory:

Generally quick to implement, these are the analyst-type visualizations. Think bar charts, simple line graphs…anything that has the goal of showing what is in a specific data set, without answering why those data items are there. These generally answer the “what”. What is happening? Where are high-cost goods going? Think financial dashboards, and simple line and bar charts for the most part.

Why is this distinction important? Being extremely conscious of what you are trying to do can help you select tools and define what you are trying to accomplish. When this distinction is made up front, the right tool can be chosen and the right result created, without wasting time.

Inspiration vs Manipulation

A few weeks ago I was lent a copy of the book Start with Why. The idea of using manipulation versus inspiration to change human behavior is one that has struck a chord with me thus far. Looking at companies today, there is a clear differentiation in the way organizations position themselves based upon whether they rely on manipulation or inspiration.

Inspiration

These are the companies that get the best employees, deliver innovative solutions, and much of the time enjoy higher margins and growth than their competitors. How is this achieved? You guessed it: they have a great and inspiring vision of why they do what they do.

spacex mission.PNG

This vision and end goal is lofty, and certainly something that people (especially rocket designers) can aspire to. Having such a vision for where the company is headed has apparently worked to SpaceX’s advantage: SpaceX has managed to take significant market share from the much older Arianespace. There must be something behind SpaceX’s success.

commercial_launchers_web_1.12.151-879x485

A focus on a lofty vision of “why” an organization does what it does drives not only profits, but also clear and concise decision making and motivation for everyone involved with the company. Driving behavior towards a goal that is inspiring and internally motivated is much more effective in the long term than manipulation.

Manipulation

Price, promotions, fear, aspirations. These are tactics that can change human behavior. When a company leans on these strategies, it has most likely lost its sense of why it exists. The reason? When a company is offering to cut prices, or marketing to customer aspirations, there is no longer an internal motivating factor driving the company. The company’s compass for decision making has been lost.

The great (or terrible, depending on your perspective) example of this is General Motors’ use of promotions to drive sales. In the 1990s, General Motors, along with other US auto manufacturers, relied on sales incentives to retain market share when faced with an onslaught of foreign automakers. By taking this route, the US automakers effectively weakened their brands. It may have allowed them to retain higher market share in the short term, but it obviously didn’t help the long-term growth and profitability of the companies.

Manipulations create addictions for companies; they may create some short-term value, but at the expense of harming the organization in the long term. The more fear, promotions, price cuts, and aspirations a company uses to sell products, the cheaper the brand perception will be.

Bottom line: knowing why a company exists provides an internal locus of control, which has been proven to be a more powerful motivator than manipulation. There’s a reason Apple customers pay more than a 20% premium over competitor products, and it’s because Apple knows why.

Down with Vertical Database Architecture

The goal of gathering data can be broken down into a combination of the following three: understanding what has happened, understanding what is happening, or projecting what will happen. When getting answers to these questions, as long as the answer is obtained, why does it matter how it was obtained?

Getting a view into this information can be done many different ways, and with the products available on the market, it can be done for free and with minimal IT know-how. But there is a time and a place to pay a premium on IT projects: to obtain capabilities that none of your competitors will have, when a solution needs to be scalable and tailored to your unique needs.

Skyscraper_Diagram

This is when architecture comes in.

Just like any structure, a database architecture can be flat or tall. What is the difference? To run with the analogy of comparing database architecture to buildings, a skyscraper (vertical) is much more complex to build and maintain than a house (horizontal).

Horizontal

A horizontal architecture can be pictured like a suburb: a house commissioned by you that is easily customizable and suited to your needs and wants. Do you want a pool? Easy. Do you want a larger living room or a smaller kitchen? That can be done.

Taking this analogy from buildings to databases, a flat architecture means that your data is served from a single level (or as few as possible). It is much easier to understand how the wiring, plumbing, lighting, etc. were put into a house than into a skyscraper. And when you want to install a pool, it’s much easier to install and maintain than a pool on the 23rd floor of a high-rise.

Architecture Diagram

Vertical

A vertical architecture means many structural layers in the database, and with them comes complexity. The difference between physical skyscrapers and databases? Skyscrapers are generally built when there is no more land to build flat on; that law of physics doesn’t apply to databases.
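
To make that concrete with a made-up example: in a vertical design, a reporting number is served off a tower of views, so tracing it means descending every layer; in a horizontal design, the same number comes from a single layer.

```python
# Illustrative SQL only; the schemas, tables, and business rule are hypothetical.

# Vertical: the report reads layer 4, which reads layer 3, and so on down.
vertical = """
CREATE VIEW agg.revenue AS                        -- layer 2, on staged orders
    SELECT order_id, SUM(amount) AS revenue
    FROM stg.orders GROUP BY order_id;
CREATE VIEW sem.revenue_adjusted AS               -- layer 3, business rule
    SELECT order_id, revenue * 0.98 AS revenue FROM agg.revenue;
CREATE VIEW rpt.revenue AS                        -- layer 4, what reports hit
    SELECT order_id, revenue FROM sem.revenue_adjusted;
"""

# Horizontal: one layer between the staged data and the report.
horizontal = """
CREATE VIEW rpt.revenue AS
    SELECT order_id, SUM(amount) * 0.98 AS revenue
    FROM stg.orders GROUP BY order_id;
"""
```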

Why would anyone build a vertical architecture, then? In my experience: time and resource constraints (two-thirds of the magic time-resource-quality triangle), and short-term thinking.

Benefits of Horizontal

  1. Decreased cost: fewer people needed to maintain complex solutions, and more time spent creating value for you.
  2. Higher quality: more visibility into what is happening where. Instead of having to dive through and learn how everything was built, people playing with the data only have to learn the specific portions they are interested in.
  3. Faster delivery: the final win of a flat architecture is speed of delivery. By reducing complexity, people spend less time learning and more time creating value. You may save time in the short term with a vertical architecture, but you will pay dearly in the long term.

Social Media from the Seasoned

history-of-social-media

Recorded human history began around 4,000 BCE. Social media in its modern form? It started in 1997 with the site Six Degrees. Putting these two timelines together may seem ridiculous, and admittedly it is a bit, but there is a point. With social media being such a recent innovation compared to many skills still heavily used and valued by society today, such as accounting, what can social media professionals teach us about this roughly 20-year-old skill set?

I’ve had the opportunity to hear from industry experts at McGarrah Jessee, Splash Media Group LLC, and other social media marketing firms both large and small. These individuals have covered everything from creating useful branded content to establishing client relationships (and more). Although it’s a new field, there is definitely a large group of driven individuals working to continuously evolve and grow the usefulness of social media for business. I’m going to attempt to encapsulate a few of the important points at an extremely high level.

Biggest Applicable Advice

When interacting with people on a day-to-day basis, what do you prefer? A person who talks about themselves and what they have accomplished, or a person who provides valuable advice and interesting conversation? Well, social media experts have realized that businesses can create relationships of the latter kind with people through content marketing.

Content marketing is a phrase that all of these social media professionals, across multiple industries and business sizes, have been repeating or hinting at. “Make it interesting” or “Make your content something people want to read.” With the deluge of ads that people see every day, and the myriad ways to avoid seeing them, advertisers now have to convince people that they want to spend time looking at ads. This same concept can be applied everywhere.

Whenever communicating with others, be genuine and interesting, and most importantly, create value so that people want to listen. Sticking with these pillars can go a long way.

The days of useless and noisy banner advertising are over.

Attributes of Social Media Professionals

There seem to be two primary skills needed to be successful at social media marketing. The first is an analytical ability to extrapolate data into stories and actionable items for advertising. What do I mean?

Building profiles of the target customer, measuring what works and what doesn’t: basically collecting every piece of data available and having the ability to develop strategies around those data points. Looking at Hops and Grain, a local brewery, the social media strategist has been able to learn from data and create a social media profile and brand that people find interesting and go out of their way to view.

Looking at the pictures below, can you see a common theme? Outdoorsy, lifestyle-type photos that sneak the beer and branding into the picture in somewhat subtle ways.

Hops and Grain insta.PNG

The second major attribute I noticed was that all of these individuals, no matter how data-driven they are, seem to be creatives. What I mean is that these professionals love coming up with new and clever ideas.

This seems to be needed because social media is a medium that can produce widely varied results. Content must be instantaneous, in the moment, and clever, or risk becoming a Red Lobster. If a professional can’t come up with new ideas that capture people’s attention, success in social media is unlikely.

Quick Takeaways

  1. At the end of the day, using social media as a microphone doesn’t work.
  2. Be consistent online (and in life)…but really, nothing disappears once it’s online.
  3. Don’t feed the trolls. There are toxic followers, just like there are toxic customers. Be aware of whom you’re spending time with on social media, and how.

After hearing these different speakers, I don’t plan to apply these lessons by pursuing a career in social media. Luckily, they can be applied outside of advertising for a business client: I plan to apply the concepts to my career and to developing a personal professional brand.