Create a table in PostgreSQL out of a CSV using Atom and psql

Part 1 of x of Sporting Stylish SQL with Stephen

Note: this was published originally on Medium in August 2019

(and put behind the paywall shortly thereafter, but no worries, this is a friend link)



The Premier (League) Edition

Welcome to the next fixture in our Sporting Stylish SQL with Stephen series. This edition is about using some of the tools we outlined last time to more efficiently (read: quickly) create SQL tables on demand.

When it comes to the myriad of documents used to communicate data, Comma Separated Variable files or CSVs are almost definitely one of the easiest ways to interact with a SQL database. (Since the data within have to adhere to a very strict structure, they interface easily with Structured aka Relational Databases).

Pure text, no formatting, every dividing line in the row becomes a comma and every row ends when the commas stop. It is so effective in its simplicity as to be ubiquitous, thus you may find need at some point to take the column names off an exemplar CSV file and out of its contents create a SQL table into which you upload those CSVs. Doing so is straightforward but can be quite tedious, a tedium which I will to assuage with some shortcuts I have found along the way.

CSV as seen on Google SheetsOR PUT ANOTHER WAY:Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR 2018-08-10,Man United,Leicester,2,1,H,1,0,H 2018-08-11,Bournemouth,Cardiff,2,0,H,1,0,H 2018-08-11,Fulham,CrystalPalace,0,2,A,0,1,A 2018-08-11,Huddersf…

CSV as seen on Google Sheets



OR PUT ANOTHER WAY:

Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR
2018-08-10,Man United,Leicester,2,1,H,1,0,H
2018-08-11,Bournemouth,Cardiff,2,0,H,1,0,H
2018-08-11,Fulham,CrystalPalace,0,2,A,0,1,A
2018-08-11,Huddersfield,Chelsea,0,3,A,0,2,A
2018-08-11,Newcastle,Tottenham,1,2,A,1,2,A

So let’s say our manager comes to us with this (part of a) CSV (here’s the real thing if you want to follow along)

She wants us to create a table in our PostgreSQL database into which we can pour the contents of CSVs so we can track the progress of the Premier League this season. To do this, we are going to create a SQL file that will:

  • Create a schema to house the table that houses the data

  • Create a table that houses the data (we’ll actually be creating an empty table and altering it by adding the columns on individually and sequentially)

  • Fill the table up by copying the contents of our provided CSV

Copy and Paste

First step: we take the column names directly off the CSV and copy them into a new document.

Command + C >> Command + V

Up- and (more pertinent) Downcasing

Next we use Atom’s downcasing functionality (command + K + L) to make all the terms lower case. Remember the capitalization: In SQL, Reserved words are upper case(e.g CREATE, INSERT, SELECT, etc.), and anything we name ourselves (schemas, tables, columns, etc.) are lower case.

Command + K + L

Now we are going to edit the column names so they’re easier to read and understand.

date,hometeam,awayteam,fthg,ftag,ftr,hthg,htag,htr

BECOMES

game_date,home_team,away_team,full_time_home_goals, full_time_away_goals,full_time_result,half_time_home_goals, half_time_away_goals, half_time_result

We can’t use the word date as a column name because DATE is a datatype and thus a reserved word, so we have to change that to something simple yet descriptive: game_date. For the rest of the values, I expanded their acronyms to make the more descriptive (and longer) column names so we could more easily understand and use the table after it’s created.

Now Copy and Paste those into a lower line and put them in parentheses. [note: if you highlight the line in Atom and press (, it will automatically put the highlighted portion into the parentheses for you]

game_date,home_team,away_team,full_time_home_goals, full_time_away_goals,full_time_result,half_time_home_goals, half_time_away_goals,half_time_result(game_date,home_team,away_team,full_time_home_goals, full_time_away_goals,full_time_result,half_time_home_goals, half_time_away_goals,half_time_result)

[another note: this part of the process may appear repetitive and asinine, but trust me it will pay off later]

Fancy Find and Replace

This next step is a little tricky, but let’s go through it step by step:

  1. Highlight the top line with the header names

  2. Press command + F to open the Find and Replace tool

  3. Select the Use Regex and Only in Selection buttons on the tool

  4. In the Find line, put a comma “,

  5. In the replace line put “\n” [Signifies a new line]

  6. Hit Replace All.

Command + F

Excellent! We’ve successfully found and replaced all the commas in our selections with new lines!

We’ve (finally) arrived at a place where we get to add some SQL statements. And in addition to the columns we’ve gotten from the CSV, I’m going to add a couple of basic but dutiful columns to make our table more useable and performant:

  1. id: this column just puts the row number on the row in the order in which it was created, thereby creating a value that is guaranteed to be unique across row to row, and is therefore a perfect candidate for a primary key.

  2. created_at: this column records the time (according to the time zone to which your DB is set) that the row was created.

CREATE SCHEMA xpert; # SCHEMA TO HOUSE TABLE
CREATE TABLE xpert.england(); # TABLE TO HOUSE DATA
id # ROW COUNTER AND THUS PRIMARY KEY
created_at # RECORD OF FILE UPLOAD TIME
game_date
home_team
away_team
full_time_home_goals
full_time_away_goals
full_time_result
half_time_home_goals
half_time_away_goals
half_time_result

(game_date,home_team,away_team,full_time_home_goals,full_time_away_goals,full_time_result,half_time_home_goals,half_time_away_goals,half_time_result)

Note: the schema is called xpert in deference to 12xpert, aka Joseph Buchdal, a betting analyst in the UK who keeps and maintains historical and current results along with betting data, of which the CSV we’re using is a small and pruned subset.

Multi-Cursor

We now have our columns lined up, but we have some commands to add to all those lines.

Highlight your newly vertical column names (but not the line in the parentheses) and press Command + Shift + L.

Now that we have all those cursors, we can easily type out the necessary statements we need to turn these names into commands. On the left of the name, you put in the ALTER statements:

ALTER TABLE xpert.england ADD COLUMN

And on the right put a space and a semicolon (;). Semi-colons signify the end of the SQL command, and the space we need so we can finish the statements with the datatype of each column.

Command + Shift + L

We are going to use the multi-cursor again for identifying the column types, but instead of selecting all those rows, we simply hold the command button down and click the mouse exactly where we want our other cursors to go.

Command + [[Mouse Click]]

Once we have all the datatypes properly placed, we have the frame of the table set up, now we just need to get in its contents. We can use PostgreSQL’s \copy utility, which can shuttle data in and out of your database.

A note on why we have all those column names in parentheses on the bottom of this file: the \copy command doesn’t need the column name of the table you are using unless some of the columns on the table are not on the CSV. I’m sure you remember when we added the id and created_at columns onto the table, well the CSV doesn’t have those. To successfully copy in the data in that case is to put all the column names you plan to insert your table in between parentheses, separated by commas of course. [I told you it would pay off]

```SQL
CREATE SCHEMA xpert;
CREATE TABLE xpert.england();
ALTER TABLE xpert.england ADD COLUMN id SERIAL PRIMARY KEY;
ALTER TABLE xpert.england ADD COLUMN created_at TIMESTAMP NOT NULL DEFAULT NOW();
ALTER TABLE xpert.england ADD COLUMN game_date DATE;
ALTER TABLE xpert.england ADD COLUMN home_team TEXT;
ALTER TABLE xpert.england ADD COLUMN away_team TEXT;
ALTER TABLE xpert.england ADD COLUMN full_time_home_goals INTEGER;
ALTER TABLE xpert.england ADD COLUMN full_time_away_goals INTEGER;
ALTER TABLE xpert.england ADD COLUMN full_time_result TEXT;
ALTER TABLE xpert.england ADD COLUMN half_time_home_goals INTEGER;
ALTER TABLE xpert.england ADD COLUMN half_time_away_goals INTEGER;
ALTER TABLE xpert.england ADD COLUMN half_time_result TEXT;

\copy xpert.england (game_date,home_team,away_team,full_time_home_goals,full_time_away_goals,full_time_result,half_time_home_goals,half_time_away_goals,half_time_result) FROM <path_to_csv_file> CSV HEADER
```

Let’s ignore the <path_to_csv_file> for a moment, let’s concentrate on the CSV HEADER at the end: PostgreSQL needs you to tell it what kind of file to expect and whether it has a header or not, so it will know how to read the file and whether to treat the first row of data as column names or simply values.

Recall: psql is a simple tool which allows you to control your database from the command line on your terminal.

I’m not here to tell you how to structure your files, but I’m going to show you a really simple way to do it for this situation and you can tweak it as you please. I put the SQL file (named xpert_table_build.sql) in a folder dedicated to the project and the CSV file (named england_18_19.csv) in the same directory in a subfolder labeled data. After navigating to this project’s directory, the copy command should look like this:

```SQL
\copy xpert.england (game_date,home_team,away_team,full_time_home_goals,full_time_away_goals,full_time_result,half_time_home_goals,half_time_away_goals,half_time_result) FROM data/england_18_19.csv CSV HEADER
```

Important to note that the <path_to_csv_file> is relative in case above, but it can be absolute if you prefer. Now we run psql from the project directory, and use the \i command, which is how you run SQL files in psql.

Table Creation with a little QA to boot

Table Creation with a little QA to boot

On the top there you see your CREATE & ALTER statements have successfully run, along with the COPY.

And as you can see, after the copy command, we have also run the all-important…

QA!

Whenever you build something, or do anything in your database really, you should run tests to confirm everything goes according to plan. In this case, we have to confirm the table built and filled successfully:

  1. I ran psql’s describe command /d on the table to confirm the column structure conformed to our CSV file structure (it did).

  2. I counted the number of rows to make sure that matched with how many rows in our CSV (316, it did).

And there you have it, a table you can fill with the CSVs and refill till kingdom come (or you run out of room in your DB). We’ll be querying this table in future editions of this series, but for now, it’s beer o’clock.

Structured Beer-y Language

After finishing a project, I like to unwind and chat with my coworkers over a beer. This time we’re talking searching for data with metaphorical blinders on and the general availability of data:

Once, early in my days of data devotion, I was flailing around the internet trying to find a downloadable version of the Premier League table. Woe was me when all I could find were these files with all the results of all the games, tracked over years. Since it was not exactly what I was looking for, I ignored it as I tried desperately to find my ideal dataset. SMDH.

Then I realized that not only was this better than the league table, this data would allow me to construct the league table at any date since 1993. Little did I know that this treasure trove of data was not only invaluable, it was meticulously updated and maintained and consistent and free and even provides half time scores, which is a paid feature of football-data.org. I have validated this data against multiple other sources time and again, and it’s accurate every time.

All this is to say we should appreciate the work people like 12xpert do, maintaining data sources like this one and allow others to use it free of charge. As I learned in those early days, data out in the world can often be harder to come by than you think, even more so if you let the perfect be the enemy of the good and ignore entirely the suitable, and indeed fruitful data like I did.

Cheers.

Previous
Previous

Create a View in PostgreSQL to Remove Duplicate Data

Next
Next

Sporting Stylish SQL with Stephen