A Foray Into Serious Scraping

It’s been a while since my last post. Getting married, honeymooning, buying a house, and so on took away the time I had for this. But all of that is nearing its end, so I’m getting back into a regular cadence of working on the scraping project. Now that I’m back into it and have everything up and running, the most sensible place to start is the architecture that has been implemented.

The Problem:

The most sensible place to start any discussion of architecture is clearly stating what the system is supposed to do. I need a system that accomplishes these three things.

  1. I need a system that is able to reliably scrape data from any website or consume data from any source.
  2. I need a place where this data can be loaded and reported on in a cohesive format.
  3. I need the product to be lightweight in terms of storage space and CPU so I don’t have to pay out the wazoo.

Overview:

Scraper Architecture

Overview of the architecture, from inputs to database inserts.

In order to meet these goals, a straightforward architecture was implemented. Using Amazon Web Services, both an EC2 instance and an RDS instance were set up, with the EC2 instance running Ubuntu and the RDS instance running PostgreSQL. In sequential order, here is how the scraper works.

  1. Using Python’s Scrapy library, we’ve written Scrapy projects whose spiders target specific sources and extract data from the HTML of each website. Right now we’ve targeted two, but we can expand to as many as needed. These Scrapy spiders are scheduled through Scrapyd, a framework that not only allows for scheduling and management of spiders, but also offers better performance by running on Twisted, making it asynchronous. (A minimal spider sketch follows this list.)
  2. As the spiders run, they write their output to JSON files on the server. The driver here is durability: dropping the output onto disk means data won’t be lost if something happens to one of the processes.
  3. A Python class was written with Psycopg2 in a way that is meant to be extensible for future data sources. The idea is that as the data model and data sources are changed or expanded, the only thing that will need to change is the class itself; none of the scripts that call the class to insert data from our existing data sources will need to change. (A rough sketch of such a class also appears below.)
  4. A staging area was created within the RDS PostgreSQL instance, which ingests the raw data from each data source. Where possible, a unique index was created that checks for changes before accepting data into the staging area (see the DDL sketch below). Since the scrapers hit their sources repeatedly, we keep grabbing the same data; what we’re interested in are the changes, especially new items and price changes. We also want the architecture to be as efficient as possible, so storing only the data we care about just makes sense.
  5. Once data has been accepted into the staging area, the Ubuntu instance is used to schedule a slew of ETL jobs written in SQL and passed to PostgreSQL for execution using Psycopg2. PostgreSQL doesn’t have a native scheduler readily available, so we use cron on Ubuntu to execute a script for each of our sources, calling from a class containing all of our ETL functions (a sketch of one such runner closes out this list). The end result is a 3NF model populated with data and the appropriate relationships.
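
To make step 1 concrete, here is a minimal sketch of what one of these spiders might look like. The domain, CSS selectors, and field names are placeholders rather than the real targets, and the FEEDS setting shows the JSON-on-disk output described in step 2.

    # listings_spider.py -- a minimal spider sketch; the domain,
    # selectors, and field names are illustrative placeholders.
    import scrapy

    class ListingsSpider(scrapy.Spider):
        name = "listings"
        start_urls = ["https://example.com/listings"]

        # Write scraped items to a JSON Lines file on the server
        # (step 2), so output survives if the process dies mid-run.
        custom_settings = {
            "FEEDS": {"output/listings.jl": {"format": "jsonlines"}},
        }

        def parse(self, response):
            for row in response.css("div.listing"):
                yield {
                    "title": row.css("h2::text").get(),
                    "price": row.css("span.price::text").get(),
                    "url": row.css("a::attr(href)").get(),
                }
            # Follow pagination if the site exposes a next link.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

Scrapyd then schedules runs of a deployed spider with a POST to its schedule.json endpoint, e.g. curl http://localhost:6800/schedule.json -d project=scrapers -d spider=listings.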
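
Step 3’s loader class could be sketched roughly like this. The connection string, table name, and source-to-column mappings are invented for illustration; the point is that a new source only needs a new mapping entry, so the calling scripts stay untouched.

    # loader.py -- rough sketch of the extensible Psycopg2 loader.
    # The DSN, table names, and column mappings are placeholders.
    import json
    import psycopg2

    class StagingLoader:
        # One entry per data source; adding a source means adding a
        # mapping here, not changing any of the calling scripts.
        SOURCES = {
            "listings": ("staging.listings", ["title", "price", "url"]),
        }

        def __init__(self, dsn):
            self.conn = psycopg2.connect(dsn)

        def load_file(self, source, path):
            table, cols = self.SOURCES[source]
            placeholders = ", ".join(["%s"] * len(cols))
            sql = (
                f"INSERT INTO {table} ({', '.join(cols)}) "
                f"VALUES ({placeholders}) ON CONFLICT DO NOTHING"
            )
            # ON CONFLICT DO NOTHING leans on the staging area's
            # unique index (step 4) to silently skip unchanged rows.
            with self.conn, self.conn.cursor() as cur:
                with open(path) as f:
                    for line in f:  # one JSON object per line
                        item = json.loads(line)
                        cur.execute(sql, [item.get(c) for c in cols])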
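
And here is a guess at what the step 4 staging table and its change-detecting unique index might look like; the schema and column names are again illustrative, with the DDL passed to PostgreSQL through Psycopg2.

    # staging_ddl.py -- illustrative DDL for the staging area (step 4).
    import psycopg2

    DDL = """
    CREATE SCHEMA IF NOT EXISTS staging;

    CREATE TABLE IF NOT EXISTS staging.listings (
        id         bigserial PRIMARY KEY,
        title      text,
        price      text,
        url        text,
        scraped_at timestamptz DEFAULT now()
    );

    -- Re-scraping unchanged rows becomes a no-op: the loader's
    -- ON CONFLICT DO NOTHING only keeps rows whose (url, title, price)
    -- combination is new, i.e. new items or price changes.
    CREATE UNIQUE INDEX IF NOT EXISTS listings_change_idx
        ON staging.listings (url, title, price);
    """

    with psycopg2.connect("dbname=scraper") as conn, conn.cursor() as cur:
        cur.execute(DDL)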
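
Finally, step 5 boils down to crontab entries that each call a per-source script. A stripped-down version of such a runner might look like the following; the schedule, schema, and column names are made up, and the real project keeps the SQL in a class of ETL functions.

    # run_etl_listings.py -- sketch of a per-source ETL runner (step 5).
    # Scheduled from the Ubuntu crontab with an entry along the lines of:
    #   */30 * * * * /usr/bin/python3 /opt/scraper/run_etl_listings.py
    import psycopg2

    # Stand-ins for the SQL jobs that populate the 3NF model; assumes
    # unique constraints on core.item(url) and on core.price rows.
    ETL_STEPS = [
        """INSERT INTO core.item (url, title)
           SELECT DISTINCT url, title FROM staging.listings
           ON CONFLICT DO NOTHING;""",
        """INSERT INTO core.price (item_id, price, observed_at)
           SELECT i.item_id, s.price, s.scraped_at
           FROM staging.listings s
           JOIN core.item i USING (url)
           ON CONFLICT DO NOTHING;""",
    ]

    with psycopg2.connect("dbname=scraper") as conn, conn.cursor() as cur:
        for step in ETL_STEPS:
            cur.execute(step)  # plain SQL, executed by PostgreSQL itself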

So, it’s now up and running, and data is flowing through into the objects. The data is populating all the tables where it’s expected, and I could begin reporting price changes today. The best part? All of this was built using $0 of infrastructure from Amazon Web Services (and a lot of my time). I’m running out of storage space rapidly (the free tier’s 20 GB ran out over a couple of days), and the CPU is not beefy at all, so it stalls out if more than one scraper is running at a time (as pictured below).

Scrapyd Performance

Performance drops dramatically in the yellow region. In the red region, my scraper has been blocked from accessing the site (which didn’t happen before the refactor… I’ll go into that another time).

To refer back to the original goal, I would say it has been achieved. That’s not to say it couldn’t be improved upon and optimized, but overall, this first serious foray into scraping seems to have gone well. Feel free to reach out with any questions or suggestions!

Down with Vertical Database Architecture

The goal of gathering data can be broken down into a combination of any of the following three: understanding what has happened, understanding what is happening, or projecting what will happen. And when getting answers to these questions, as long as the answer is obtained, why does it matter how it was obtained?

Getting a view into this information can be done many different ways, and with the products available on the market it can be done for free and with minimal IT know-how. Still, there is a time and a place to pay a premium on IT projects to obtain the capabilities that none of your competitors will have: when a solution needs to be scalable and tailored to your unique needs.

This is when architecture comes in.

Just like any structure, a database architecture can be flat or tall. What is the difference? To run with the analogy of comparing database architecture to buildings, a skyscraper (vertical) is much more complex to build and maintain than a house (horizontal).

Horizontal

A horizontal architecture can be pictured like a suburb. Think of a house commissioned by you, easily customizable and suited to your needs and wants. Do you want a pool? Easy. Do you want a larger living room or a smaller kitchen? That can be done.

Taking this analogy from buildings to databases, a flat architecture means that your data is served from a single level (or as few levels as possible). It is much easier to understand how the wiring, plumbing, lighting, etc. were put into a house than into a skyscraper. Likewise, when you want a pool, it’s much easier to install and maintain one in a backyard than on the 23rd floor of a high-rise.

Architecture Diagram

Vertical

A vertical architecture means many structural layers in the database, and with them comes complexity. The difference between physical skyscrapers and databases? Skyscrapers are generally built when there is no more land to build flat; that constraint doesn’t apply to databases.
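
As a toy illustration of the difference (every table and view name here is invented), the same report can be answered in one level, or through a stack of views that each add something to learn, debug, and maintain:

    # flat_vs_vertical.py -- toy illustration only; all names invented.

    # Flat: the answer comes straight from a single base table.
    FLAT_REPORT = """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region;
    """

    # Vertical: the same number filtered through stacked views, each of
    # which a newcomer has to find, read, and trust before believing it.
    VERTICAL_REPORT = """
    CREATE VIEW clean_sales    AS SELECT * FROM raw_sales WHERE valid;
    CREATE VIEW keyed_sales    AS SELECT sale_id, region, amount FROM clean_sales;
    CREATE VIEW regional_sales AS SELECT region, amount FROM keyed_sales;

    SELECT region, SUM(amount) AS total
    FROM regional_sales
    GROUP BY region;
    """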

Why would anyone build a vertical architecture, then? In my experience: time and resource constraints (two-thirds of the magic time-resource-quality triangle), and short-term thinking.

Benefits of Horizontal

  1. Decrease in cost: Fewer people are needed to maintain complex solutions, and more time is spent creating value for you.
  2. Higher quality: More visibility into what is happening where. Instead of having to dive in and learn how everything was built, people working with the data only have to learn the specific portions they are interested in.
  3. Faster delivery: The final win of a flat architecture is speed of delivery. By reducing complexity, people spend less time learning and more time creating value. You may save time in the short term with a vertical architecture, but you will pay dearly in the long term.