Sporting Stylish SQL with Stephen

Note: this was published originally on Medium in July 2019

When I started using Structured Query Language in my job, it was a baptism by fire. I understood how SQL worked, but I had to figure a lot of things out on my own about its operational uses. I decided I had to learn by doing, and though I very much did not want to make a mess of the data my company was keeping, I did really want to learn how best to use the technology at my fingertips, to really stretch my wings. So I spun up a PostgreSQL database on my personal computer and started playing around with publicly available soccer data. I figured that one of the best ways to learn is to practice, and what better way to practice than to toy around with data you already understand. Having to really work to obtain the data and then to mold them into something from which you can actually gain insight makes the whole endeavor — dare I say it — actually kind of fun.


When dealing with modern business intelligence platforms, it behooves you to have a solid comprehension of SQL. Tableau has SQL features it uses (inner and left joins, unions, etc.), Looker generates SQL to run its queries and even if the tool you use is not itself explicitly SQL based, a lot of the most popular data bases and warehouses (e.g Amazon Redshift and Snowflake) are built to run SQL. Tableau was recently acquired by Salesforce, and Looker just recently announced that it is being integrated into Google Cloud; major players are betting big on this straightforward technology, so learning how to use it effectively is, to my mind, a no-brainer.

So from here on out, consider us, you the reader and I the author, coworkers on the same data analysis team. You are a new hire and have been paired up with me to create tables and run analyses in our SQL database. I will presume you have a basic knowledge of how SQL works, but I will also provide links to more in-depth explanations of some of the concepts we run through in this series.

But first, if we’re going to start working together, we should get to know each other. So here goes:

ALLOW ME TO INTRODUCE MYSELF

My name is Stephen, and I’ve been a Business Intelligence engineer for about five years. I have lived in the Bay Area for almost 15 years, I have an incredible wife, two terrific twin boys (pictured) and a Boston terrier (also pictured). I love movies, TV, math and, relevantly for this series, sports. I spend an inordinate amount of time following the ins-and-outs of European and world soccer and NBA Basketball, I just cannot get enough. I was a math major and soccer player in college, and taught myself a little Python, R and SQL programming after I got out of school, just enough to get a job at a startup out here. It was there that I learned the day-to-day of using SQL for data storage and analysis, and picked up some tips and tricks for getting stuff done quick-, clean-, effective- and efficiently. I am excited to share some of these tips with you, my esteemed and dutiful coworker, but before that, let’s lay some groundwork:

SOME TOOLS OF THE TRADE

Since I am driving in this particular pairing paradigm, I’m running everything on my computer, so here’s a rundown of what I will mostly be using:

  • A Macbook Pro (currently running Sierra), but most of the tools here are open source and for the better part also platform agnostic (as far as I know).

  • A local PostgreSQL server and database set up on my Macbook. Here’s a great tutorial on how to set that up on your own computer if you’re interested in following along. It’s fairly straightforward, only one thing you have to install first (a package manager called Homebrew, but the tutorial will walk you through all of that).

  • psql is a “a terminal based front-end to PostgreSQL”, allows you to access and query any PostgreSQL database (and indeed Redshift as well), local or remote as long as you have access to it.

  • pgAdmin is lovely, easy to use and free, just gets a lot of updates. psql is a great tool for running scripts and maybe a QA query, but its display is trash. If you’re going to run any kind of SELECT query, pgAdmin is industry standard. Other options are TablePlus (free) or Postico (unlimited free trial, comes with an upsell). To underline the difference between psql and pgAdmin, behold the same query in the screenshots below

psql, not hot

pgadmin, hot

  • If you’re not typing queries directly into psql or pgAdmin, you can write them out first into a SQL script and run the script using psql. The best way to write that script, and indeed many other types of text files, is a text editor. Atom is my text editor of choice, Sublime Text (currently on version 3) is a great alternative. If you’re not all that into using your mouse, you can try VIM, one of the oldest and most robust text editors that’s built right into your terminal, (not as intuitive as Atom or Sublime at first, but supposedly amazing once you get the hang of it and will give you a little cache amongst fellow engineers)

  • Finally, you should have some kind of spreadsheet software, Numbers is what I use on the Mac, essentially because I am too frugal to purchase Microsoft Office just to use Excel. Spreadsheet software can come in very handy in some cases, not just with data manipulation but also for data cleaning.


STARTING STYLE TIPS

Now we get to the nitty gritty.

I am a naturally forgetful person, and when I build a SQL query, I don’t want to have to remember everything about the construction of that query and the data underlying it. Instead, I would prefer that the query itself be written in a straightforward (hopefully elegant) way so as to communicate clearly what is being requested of the data.

Let’s go through some of the basics with an example query:

The query above lays out many but not all of the salient points, (no WHERE statement or GROUP BY, we’ll get to those later). The important things to note are these:

  • Capitalize the reserved words — anything that’s not a name or alias for something in the database goes in upper case.

  • Indent for the SELECT statement inside Common Table Expressions (CTEs, but not the bad kind). I prefer a tab size of two spaces, it makes everything else line up very elegantly, but to each their own.

  • Indent the column names and put them on a new line from the SELECT command.

  • FROM goes back (outdents) to the same margin on which the SELECT command lies.

  • The closed parenthesis that terminates the CTE gets its own line, even if there’s another CTE coming.

  • The line for a new CTE starts with a comma

  • UNION and INTERSECT commands get their own line.

Let’s stop there for now, we don’t want to overdo it on day one. It was nice introducing myself and onboarding you a little bit, I look forward to working together and hope we can help each other grow in our respective careers!

Now, if you want to join me for a bonus after work beer and chat, I am going to break down what this query is actually doing over a nice cold one. If you are over it and are uninterested in the nitty gritty, no worries, I’ll see you next time!


STRUCTURED BEER-Y LANGUAGE

This particular dataset is built off Basketball Reference’s advanced stats for Game 6 (the clincher) of the Warriors-Rockets series in the 2019 playoffs. Daryl Morey reportedly contends that you need three straight up stars to win a championship. So what I wanted to see was out of the top three players on both teams, who was the best player defensively in that cohort. I defined top three in this case by choosing the three players out on the floor for their respective teams the longest, and since the game was close, there were no garbage time fourth quarter minutes for the far side of the bench to eat up, i.e it should be that your best players are on the floor the longest) and for my defensive measure, I used points allowed per one hundred possessions (BR’s de facto defensive efficiency rating). The result:

I’m not crying you’re crying

I’m not crying you’re crying

Did I reverse engineer (contrive) an explanation for singling out Andre Iguodala; a player who also had the highest points made per 100 possessions amongst the top three Warriors during that game 6 which they won without KD; a player who shut down a 🐐contender to win Finals MVP (tbf I think Lebron should have won that award); a defense-first, level-headed, tone-setting vet who may have cemented his hall of fame candidacy as a sixth man; would I do that? Never.

Cheers.

Previous
Previous

Create a table in PostgreSQL out of a CSV using Atom and psql