As part of a personal project, which I’m managing (and actively working) here, I’ve decided to do a little write up on my approach, what I’m learning, and other technical things I’ve encountered. This is as much for my own memory, as it is in the hopes that I can help some others avoid the technical pitfalls that I have encountered.
The Product:
I’ve always been someone extremely interested in data, especially data that no one else is looking at. So, what is the logical place to go? The most accessible data is the data that is already out there for the grabbing. So…scraping.
What has no one else scraped, or at least scraped and aggregated AND displayed well? Game prices across different platforms. There’s aggregators for all different kinds of products (ammo, outdoor gear, etc.) but no one seems to have implemented one for games well, although they have tried.
With that goal in mind we are building a product for people to track game prices, and favorite games so that they no longer have to track news on multiple sites and check multiple web marketplaces for the best prices on games. This means we will be scraping Reddit, Twitter, and other news/social media sites, in addition to game marketplaces like Steam and Sony’s Playstore.
What I Hope to Gain:
At the end of the day, maybe we strike gold by building the coolest website and app that ever existed and people love. More realistically, I want to build a platform with which I can add data as needed for my own wants/needs. I want to become expert level using certain libraries and frameworks, and be at a point where I’m not just a Business Intelligence and ETL developer but can develop all over the stack as needed with ease.
Also, I want to gain experience in setting up a highly performant, extensible, ETL platform off of which I end up with an app on a marketplace and at least one download. All of which will be done on a shoe-string budget. I can then use that platform to pivot and build any sort of data-centric application for whatever purpose/reason I want.
The Steps:
So, with all this being said, there are three main topics I will be writing about on a broad level.
- Writing scrapers with Python’s Scrapy library, which run 24/7 around the clock
- Writing ETL’s to a Postgresql database with near real time availability and using a budget AWS instance
- Serving up the data to end users using an open source tool
More updates in the coming days!