Create a View in PostgreSQL to Remove Duplicate Data

Part 2 of x of Sporting Stylish SQL with Stephen

Note: this was published originally on Medium in August 2019



Quick note at the top: it turns out that when we created and filled our table, I forgot to mention that we were also collectively going back in time ever so slightly to February 9th, 2019. Happy National Pizza Day!

So here’s where we stand so far: we’ve created a table, xpert.england, where every row is a game in the English Premier League, but since, as we now know, it’s “currently” the middle of the season, as we go forward there are going to be more CSVs with even more games. Here’s the rub though, these CSVs are also going to contain the data for the games we have already input, which means we are now faced with a dreaded burden of the data base: duplicate data. So how do we address this looming specter of duplication? Let me count some ways:

  1. Edit the CSV itself by removing the rows which you already have in the DB. risk accidentally mutilating the data

  2. Truncate (aka delete everything inside) the table before you upload a new CSV.

  3. Upload the full CSV to the table every time, and use a View to weed out duplicate rows.

When it comes to making a decision on which option to go here, an important question to ask is: are we going to have to do this re-upload over and over again or is this an ad hoc, one-and-done kind of project?

I spoke with our manager and she said we’re going to be uploading this file after every game day, thus we are going to have to pick the option most suitable for setting up a reproducible process.

So, right off the bat, for me, Option 1 is a loser. It requires editing the data file and anytime you edit something you run the risk of, for lack of a better word, mutilating it somehow. Not to mention that it’s an extra step, external to the database, and would have to be done manually by someone. This sounds like just a hassle for us to QA and for whomever has to clean out those dupes (assuming that job doesn’t also fall to us, in which case it’s all just the one big hassle).

Option 1 is off the island, we’re left with Options 2 and 3. Both are good options, in that both can fairly easily be turned into set and forget processes. Option 2 (Truncating before uploading) could be set up with a trigger function which is a little more work but which would save storage space in your DB. Option 3 (Using a SQL View) is pretty simple to write and light on the database, the only downside, really, is that when push comes to shove, you have to remember to query the view and not the table (the table being full of duplicate data).

I will bet you have guessed by now that my preferred option is number 3 (it’s the title of the piece for Pete’s sake). It’s easy like I said, and it doesn’t tax the DB very much if at all. And if we do it right we get the added bonus of obviating the need for SQL updates while fixing errors or affecting changes in the scores (if you don’t think scores can change after the fact, think again).

If that updating part sounds a little confusing, don’t worry for now, we’ll come back to it when we run our QA.

Now we have been informed that a new CSV has been uploaded to the table, let’s open up pgAdmin and take a look at one game in particular, I’ll pick Tottenham at Everton. Let’s build that query together. Let’s look at the whole table for wherever the home team is Everton and the away team is Tottenham.

 
 

You see here that I begin the WHERE clause with the statement “1 = 1”. I do this so all my WHERE statements get their own line and begin with AND, which keeps them lined up and more easy on the eye than

```sql
WHERE
  home_team = 'Everton'
  AND away_team = 'Tottenham'
 ```

It also allows me to comment out a WHERE condition out if I want to poke around the data a little bit without having to add any more ANDs.

 
 

Now what do we get when we put our lovely little query into pgAdmin?

Dupe-town

Dupe-town

Above we can see a duplicate row that has been copied into the table in a subsequent upload, and we can also see the key (wink!) we need to tracking down the most recent upload. Since the id, the primary key, is constantly incrementing, it’s guaranteed to be higher for the new row, thus if we pick the version of the row with the highest (or MAXimum) id, it will be the most recently uploaded version of the row. And before you say it, yes, the created_at column would also tell us which was uploaded more recently, but it is a bit simpler for the processor to find a larger number than a later date.

Judiciously commenting out lines (as I did on line 2) is super effective  in exploring your data,  but be wary of cluttering up your query and  losing the plot

Judiciously commenting out lines (as I did on line 2) is super effective in exploring your data,
but be wary of cluttering up your query and losing the plot

So now that we know we are going to use the id column to narrow down the rows from the most recent upload, we have to make a decision about what columns we use to define the data we are looking for. This question boils down to what do we presume about what are the columns in the data that should stay constant and which ones can possibly be variable. In our case let’s agree that the date the game took place and the teams involved will not change, even if the other particulars of the game (i.e. the final score) may. So it follows that game_date, home_team and away_team will be our mainstays, we’ll call them our constant columns.

Now how do we use these constant columns to create our view? Well the query we have been building is only aiming at one game, so lets see what introducing our constant columns into our query gets us.

Now you will notice that I added a GROUP BY statement at the bottom, allow me to explain. The MAX is what’s called an aggregate function, which means if you call it along with other columns, you have to indicate how & in what order you will be doing the aggregating; in our case, we have an arbitrary ordering of all of our constant columns.

Now that we have a way to pick our last rows, all we have to do is pick only those last rows. I am going to do that by turning the query we have built so far into a CTE and then joining it back to the table with an INNER JOIN so they are the only rows we get.

INNER JOIN on the last rows CTE so we only get the most recent upload

You see for our CTE version of our query, we no longer have the constant columns in the SELECT section, because once we have it in the GROUP BY, putting it in the SELECT becomes superfluous and indeed actually slows the query slightly. The INNER JOIN ensures that we only get those last rows we found with the CTE, thus we have found the most recent uploaded row for every game, i.e we have DE-DUPED this bad boy!

Now that we have the query we are going to use, let’s build that View!

 
 

QA!

Once we build this view, we have to make sure it works properly, which means creating a query that confirms the view is doing what it purports to be doing, which is displaying the most recently updated game data. Taking that into account, we have two things we have to assess:

  1. The row counts line up.

  2. If we change the CSV and then upload it, that change is reflected in the view.

The easiest way to see if the data in the CSV that has most recently been uploaded are the only data in the view is to count the number of rows in the CSV, and compare that with the number of rows in the table and the view.

Let’s build a single query that can check the number of rows in both the view and the table. We can join them on id, since that is the basis for the view anyway.

 
 

Now we just need to know our row counts so we know what we are looking to confirm. The most recent CSV has 261 rows, which added to the 258 we uploaded when we built the table should come to 519 rows total.

Drum roll, please.

t for Table ; v for View

t for Table ; v for View

We see that the duplicate rows from the table are not visible to the view. Now let’s see about changing scores:

Let’s say you are a die-hard Liverpool fan and you just hate the idea of Manchester City winning their game in hand here. Well, since we are just doing some QA, let’s rewrite the tale and edit the table such that Chelsea don’t just absolutely capitulate to City and the score is a much more respectable (sadly) 0–0 draw:

GRIM REALITY

GRIM REALITY

BECOMES

After we upload our conspicuously (though not ostentatiously) edited CSV, we should see our new ‘reality’ reflected in our view.

Dispatches from the Reddest timeline

Dispatches from the Reddest timeline

Our view has successfully removed duplicates and updated the score as we hoped, we have done what we set out to do which means we are done with QA. Let’s make sure we upload the uncorrupted (yeah I said it) data to make everything copacetic, and then it’s time for a beer!


Structured Beer-y Language

After finishing a project, I like to unwind and chat with my coworkers over a beer. This time we’re talking soccer highlight videos and Mariah Carey covers:

What a shellacking City gave Chelsea in that game, huh? It was actually difficult to find a suitably short highlight video, the rest were just at least 6 minutes of techno-backed schadenfreude which would have been too much to bear.

Also, I don’t know if you clicked on that Mariah Carey link above, but I had totally forgotten about the carnival-themed video with the weird clown intro, the roller blades and the singing-while-in-the-front-row-of-a-gd-rollercoaster. And to be perfectly honest, while I love that track and many other Mariah hits, my favorite rendition of that song, for which you did not ask, is this one by Owen Pallett.

Cheers.

Previous
Previous

Transform Your Table with a Materialized View

Next
Next

Create a table in PostgreSQL out of a CSV using Atom and psql