How to create and query a PostgreSQL database
Welcome to the second article in our series on PostgreSQL. In our previous post, found here###, we covered some of the basics of PostgreSQL; what it is, what it’s used for and how to install it on macOS. Here we will cover the basics of setting up and querying a PostgreSQL database, including table joins and a few real world challenges you might come across. The data used in this tutorial is originally from the Nobel Prize API, the interface for which can be found here. The datasets can be found here***.
Below we cover some basic commands for creating and interacting with databases. For more comprehensive coverage we suggest this useful cheat sheet.
# To check the current logged in user (By default this will be the new # owner of any databases created): SELECT current_user; # To list all databases: \l # To connect to any database within postgres: \c databasename # To list all tables in the current database: \dt # To list columns and data types in a table + tablename \d # To create a database called nobel: CREATE DATABASE nobel;
PostgreSQL supports multiple different data types. Including:
Booleans, such as TRUE, FALSE, 0 and 1
Character types including TEXT, CHAR and VARCHAR
Numeric types such as INT and NUMERIC
Temporal types such as DATE, TIME and TIMESTAMP.
A full list of data types can be found here.
In this exercise we will create two tables, laureates, with personal data about Nobel prize laureates, and nobelprizes with data about Nobel prizes awarded. They are linked by an id field and we will have a go at joining these tables in PostgreSQL below.
For now let’s create the tables, to create each table, initialise a column by naming it and specifying the data type for said column, as below:
# Typical structure for creating tables: table_name ( CREATE TABLE [IF NOT EXISTS] datatype(length) column_constraint, column1 datatype(length) column_constraint, column2 datatype(length) column_constraint, column3 table_constraints ); # To delete a table DROP TABLE table_name;
In the following we will import our Nobel datasets from csv files into our PostgreSQL database. First we initialise the table with column headings and types to ensure that our data falls into the right place.
One feature of the laureates dataset is that dates where only the year, but not the date is known are entered as YYYY-00-00. This will not be accepted by PostgreSQL, presenting some challenges. One way of tackling this is shown below:
# Initialise the table with column names and data types laureates ( CREATE TABLE id SMALLINT, name TEXT, gender TEXT, birthdate TEXT, birthcountry TEXT, birthcontinent TEXT ); # This copies the file from its location to the laureates table. The call identifies # the file as a csv file with a semicolon delimiter and a header with column names. # PostgreSQL does not recognise NA values so we tell it to recognise them as NULL '~/howtoapi/_posts/queryingpostgres/laureates.csv' WITH \COPY laureates FROM ';', NULL 'NA', HEADER); (FORMAT CSV, DELIMITER # The data in our birthdate column won't be accepted by PostgreSQL as a # date format. We tackle this is by initialising the column as a TEXT column and # converting it to a DATE column after ingesting the data, # Postgres will convert any invalid dates to January 1st, for example '1948-01-01'. ALTER TABLE laureatesto_date(birthdate, 'YYYY-MM-DD'); ALTER COLUMN birthdate TYPE DATE USING # We now initialise the nobelprizes table nobelprizes ( CREATE TABLE np_id SMALLINT, awardyear SMALLINT, category TEXT, awarddate DATE, motivation TEXT, prizeamount INT, prizeamountadjusted INT ); # And copy from csv to the database in the same way '~/howtoapi/_posts/queryingpostgres/nobelPrizes.csv' WITH \COPY nobelprizes FROM ';', NULL 'NA', HEADER);(FORMAT CSV, DELIMITER
Here we cover some basic table queries using our Nobel Prize data. You can try these commands on your own machine.
It’s useful to note that in PostgreSQL double quotation marks are interpreted as columns. So postgres will assume queries specifying gender = “female” are looking for results where values in the gender column and the female column are equivalent, throwing an error since there isn’t a female column. For character strings always use single quotation marks, so gender = ‘female’.
# The most fundamental psql command is the following # which selects all columns: * SELECT FROM tablename; # You can also select single or multiple columns and add conditions SELECT name, gender FROM laureates= 'female'; WHERE gender # Select all columns and rows from laureates where the # gender isn't male and the birth continent is Africa. * SELECT FROM laureates!= 'male' AND WHERE gender = 'Africa'; birthcontinent # Select all rows and columns from laureates where the award year is after the millennium. * from laureates WHERE awardyear > 2000-01-01 ; SELECT # You can also do counts, for example the below counts the rows in the nobelprizes table COUNT(*) SELECT FROM nobelprizes;
PostgreSQL is at its most powerful when used for relational queries. To query multiple tables is called a join query. We will be using a join query to determine whether any female laureates have won the Nobel Prize more than once.
There are several types of table join, including LEFT JOIN, RIGHT JOIN and INNER JOIN. The type of join you do will determine the data returned. We will be using an equi join, which only returns rows that have equivalent values for the columns specified.
Below we join our laureates and nobelprizes column together on the id column, which is shared between the two tables. We then filter the result by gender, and we store our result in bigtable.
Next, we creates query which selects columns from bigtable, groups them and filters for rows where the id column occurs more than once, meaning there is more than one entry in the database for that person.
SELECT * INTO bigtable FROM laureates JOIN nobelprizes ON id = np_id WHERE laureates.gender = 'female'; SELECT id, name, category, awardyear FROM bigtable GROUP BY id, name, category, awardyear HAVING COUNT(id) > 1;
Below we can see that the only female laureate to have won a Nobel Prize twice was Marie Curie, who won in 1903 and again in 1911.
And that’s it!
Thanks for reading through this introduction to PostgreSQL, We hope you found it helpful. If you’d like to learn more about using PostgreSQL with R, you can follow along with our tutorial here***.