SQL (Structured Query Language) Querying
Introduction
SQL is a declarative data manipulation language. Simple SELECT queries follow the general form of
SELECT [projection] FROM [tables] WHERE [selection]
.
[projection]can be column names from the tables, constants or computations between them. A*stands for "all columns".[tables]can be one or multiple table names.[selection]is a set of logical data filter conditions combined withANDorOR
Most of the query parts are optional. A very simple query is SELECT 42, which results in a single-row single-column result set containing the value 42. A more common simple query is SELECT * FROM sometable, which retrieves all row from sometable. In the following, we will slowly expand from these simple queries.
For the more formally inclined, you can look at SQLite's formal grammar for SELECT queries
The sample data used below is based on the material published in the book J.R. Bruijn, F.S. Gaastra and I. Schaar, Dutch-Asiatic Shipping in the 17th and 18th Centuries, which gives an account of the trips made to the East and ships returned safely (or wrecked on the way). This is the main table, voyages:
The exercises here will work as follows: For every task, you write a SQL query into the text box and press the "play" button left of it. This will execute the query and check the result. The following query is already given, so you can directly run it. Try it, you will see the query result. If you get the result right, the task box turns green and the result is recorded.
Basic Queries
Projections
Projections determine the columns that are retrieved from the database. Columns are usually data, but can also be the the result of computation on the data or constants. Several columns can be specified in the query separated by ,. For example, SELECT boatname, master FROM voyages retrieves the two columns boatname and master from the table voyages.
voyages.
We can also modify the data we retrieve. For example, the (pointless) query SELECT number, number + 1 FROM voyages will add 1 to every value in the numbers column. A quite large amount of computation is possible here, comparable to common programming languages.
2 from the table voyages.
In the previous task, you can see that the column name of the result of the computation on the tonnage column is tonnage*2. This is sometimes quite unhandy, especially when continuing to use that column in some computation. For this reason, columns and computation results can be renamed using the AS keyword. For example, SELECT boatname AS shipname FROM voyages will retrieve the boatname column, but name it shipname.
2 renamed to tonnage_times_two from the table voyages.
Selections
To determine which rows are retrieved, we can filter the table with some criteria in the WHERE clause. For example, the query SELECT boatname FROM voyages WHERE number > 8395 will only retrieve voyages where the value for the number column is greater than 8395. Note that the number column is not mentioned in the projection list, this is common.
150 from the table voyages.
You can specify multiple such filter criteria and connect them with Boolean logic using AND and OR. For example, the query SELECT boatname FROM voyages WHERE number > 8395 AND type_of_boat = 'fluit' will only retrieve rows where both conditions are met.
150 and for ships departing from the harbour of Batavia from the table voyages. Note that string constants (Batavia) need to be quoted with single quotes in SQL, like so: 'Batavia'.
One special case for selections are NULL values. Those typically cannot be filtered out with (non-)equality checks (because NULL != NULL, but have special syntax. In particular, WHERE departure_date IS NULL selects rows where departure_date has the value NULL, WHERE departure_date IS NOT NULL does the opposite.
Another special case is string pattern matching. Sometimes, we want not to compare exact equality of strings, but use pattern matching instead (much like regular expressions). SQL uses the LIKE keyword for this. Patterns can contain two special "magic" characters, % and _. % matches an arbitrary number (zero or more) characters, _ matches a single arbitrary character. For example, if one wanted to retrieve all boat names from the voyages table where the boat name starts with D, the query would be SELECT boatname FROM voyages WHERE boatname LIKE 'D%'. If we wanted to retrieve all five-character boat names starting with D, we would use LIKE 'D____'. There is also the ILIKE (Borat) variant, which is case-insensitive.
NULL and the boat name consists of five-characters ending with AI.
Output Modifiers
Sometimes we want to change the result set independent from the query. Two common tasks are limiting the amount of rows that are retrieved (keyword LIMIT) and changing the ordering of the result set (keyword ORDER BY. Using LIMIT is generally a good idea since databases can grow to billions of records and they will duly output all of them if so instructed.
Note that the relational model does not specify an explicit order of rows, hence it is important to specify an order when using LIMIT. For example, to retrieve the first five boat names in alphabetical order, we can use the query SELECT boatname FROM voyages ORDER BY boatname LIMIT 5;. We can order by multiple columns and use the DESC keyword to invert the order.
All Combined
Now, let's combine all of the above in a single query.
2 renamed to half_tonnage from the table voyages for boats built after 1788 and the chamber code is A. The result should be ordered by departure date and only include the top 3 results.
Aggregation Queries
Often, we are not interested in "raw" data, but aggregates like averages or counts. These can be expressed in SQL by adding an additional GROUP BY clause after the WHERE clause of the query. In addition, the projection list is modified by adding aggregation functions and groups. The commonly supported aggregate functions are COUNT, MIN, MAX and AVG.
Ungrouped Aggregation
The easiest case of aggregating values is aggregating without a GROUP BY clause. In this case, the grouping is implicit and all rows fall into a single group. For example, the query SELECT MIN(number) FROM voyages will compute the smallest value for the column number.
max_tonnage from the voyages table.
Single-Column Groups
The next step of aggregation is the aggregation by a group determined by the values in the data. For example, the query SELECT type_of_boat, COUNT() AS n FROM voyages GROUP BY type_of_boat generates the aggregate COUNT for each distinct value of type_of_boat. Its important to know that the projection of the query (the part behind the SELECT) can only contain column names of columns that are used in the GROUP BY part of the query as well. All other columns need to be wrapped in an aggregation function.
chamber and the maximum tonnage for each distinct value of chamber as the column max_tonnage from the voyages table grouped by chamber.
Multi-Column Groups
We can also group by multiple columns. In this case, all combinations of values between the two columns that occur in the data are grouping values. The query will simply list all grouping columns separated by , after the GROUP BY keyword.
n from the voyages table grouped by chamber and type_of_boat.
Filtering Values and Groups
In one of the previous excersises, we have seen how we can select (WHERE ...) only part of the table to be relevant to the query. This also applies to groups. For example, before grouping, we might decide to filter out values. We can also filter out groups based on their aggregation values using the HAVING keyword.
For example, we could aggregate only the type of boat for chamber A using a query like SELECT type_of_boat, COUNT() AS n FROM voyages WHERE chamber = 'A' GROUP BY type_of_boat. The logic here tells the database to conceptually first compute the result of the WHERE clause and then run the aggregation according to the GROUP BY column. The result of this query is:
n from the voyages table grouped by departure_harbour while filtering out rows where the field departure_harbour is NULL (IS NOT NULL, see above).
From the previous result, we have seen some groups with low values for n. Let's say we want to remove those. For this we can use the HAVING clause, which operates on the result of aggregation functions. For example, we can modify the example from above to not consider boat types for which the COUNT value in n is less than 5: SELECT type_of_boat, COUNT() AS n FROM voyages GROUP BY type_of_boat HAVING n > 5. All expressions that can used following the SELECT in a grouped query are acceptable in the HAVING clause to filter the groups.
chamber and sum of tonnage as field s from the voyages table grouped by chamber while filtering out groups where the sum of tonnage is less than or equal to 5000.
Distinct Values
A special case of grouping is the DISTINCT keyword. It retrieves the set of unique values from a set of columns. An example is SELECT DISTINCT type_of_boat FROM voyages. This is set-equivalent to the query SELECT type_of_boat FROM voyages GROUP BY type_of_boat. Try it!
voyages table.
All Combined
Now let's try to combine a lot of what we have learned so far about grouping (and before).
n, the minimum and maximum tonnage in each group (min_t / max_t) from the voyages table. Group by departure harbour and departure date. Filter out rows where departure harbour is NULL or equal to Batavia. Filter the groups to have at least two voyages in them.
Join
One of the most powerful features of relational databases is the JOIN, the horizontal combination of two tables according to the data. For example, if one table contains information about voyages and another table contains information about the invoices (value of ship's cargo) for those voyages, the JOIN operator can be used to combine the two.
To extend the example, we use an additional table, chambers, which contains the expansion of the VOC chamber (department) code we used in the previous section:
We will also use the table invoices, which contains the total amount of money charged for a voyage:
Equi-Join
The basic form of JOIN is the "Equi-Join". "Equi" stands for "Equality" and means that values have to be equal to create a match for the join. In SQL, there are various syntaxes for this, but generally we need to name the two tables in the FROM clause and then either immediately or later in the WHERE specify the columns that should be considered.
For example, we can join the voyages table and the chambers using an Equi-Join on the respective chamber columns to add the actual chamber name to the output with SELECT boatname, chamber, name FROM voyages JOIN chambers USING (chamber) LIMIT 5;. The result of this query is:
Note the list of tables in the FROM clause and the USING keyword, which specifies that values from the column chamber, which exists in both tables, should be compared to generate the join result. There are at least two equivalent ways to formulate this query. First, the join condition can also be more verbose: SELECT boatname, voyages.chamber, name FROM voyages JOIN chambers ON voyages.chamber = chambers.chamber LIMIT 5;. Here, we use the ON keyword to specify a more precise and expressive join condition. We explicitly name the tables from which the join columns are coming using the tablename.columname syntax. We also explicitly use the = comparision operator. Both of these definitions are implicit when using USING. The USING keyword also adds an implicit projection which removes one of the chamber results from the result set. The ON version does not, this is why we need to explicitly name one of them in the SELECT part of the query. Second, the join condition can also be in the WHERE clause of the query: SELECT boatname, voyages.chamber, name FROM voyages JOIN chambers WHERE voyages.chamber = chambers.chamber LIMIT 5;. This method of specifying a join condition is widespread, but confusing and thus discouraged.
In the result above, also note how several rows from voyages (# 1, 3 and 5) are joined up with the same row from chambers. This is normal and expected behavior, rows are re-used if they match multiple rows in the join partner.
Now try to write a JOIN query yourself!
voyages and invoices table using equality on the number column. Project only the boatname and invoice column.
Joins can (of course) use more than one pair of columns to determine that a match is present. For example, the query SELECT boatname, invoice FROM voyages JOIN invoices USING (number, chamber) joins the two tables by checking equality for both the number and chamber columns. The equivalent form with the ON keyword would be SELECT boatname, invoice FROM voyages JOIN invoices ON voyages.number = invoices.number AND voyages.chamber = invoices.chamber.
Natural Join
A special case of the Equi-join is the NATURAL JOIN. What it does is perform a join between two tables while using all columns from both tables with matching names as join criteria. The following two queries are equivalent: SELECT boatname, invoice FROM voyages JOIN invoices USING (number, chamber) and SELECT boatname, invoice FROM voyages NATURAL JOIN invoices, because the columns number and chamber are the only column names the two tables have in common. Think of natural joins as syntactic sugar, which often reduces readability of your query due to its implicit nature.
Outer Join
The previous section used (by default) what is called an INNER JOIN. Here, only rows where a match in the other table is found for are candidates for the result. There is also an opposite version, the LEFT OUTER JOIN (The OUTER keyword is optional). Here, all rows from the left side of the join (the table named first) are included in the result. If a matching row from the right side is found, it is joined. If none is found, the left row is still returned with the missing fields from the right side set to NULL
voyages table have a match in the invoices table. Use a LEFT OUTER JOIN on number and a filter for invoice equal to NULL to only retrieve rows where no match was found. Retrieve only the number column.
Self-Join
A powerful use case of joins in SQL is the self-join. Here, we combine the table with itself, which can for example be used to express queries about relationships between rows in the same table. In our dataset, we have such a relationship with the next_voyage column in the voyages table. This column indicates the next voyage number of a particular ship. Since both table have the same name in a self-join, it is required to rename them using the AS statement to unique temporary table names. For example, SELECT v1.boatname FROM voyages AS v1 where v1.chamber='A' demonstrates such a renaming.
c1 and c2.
More Tables
We can join over more than two tables by chaining JOIN keywords. We also use multiple ON or USING statements to define the join criteria.
voyages, invoices and chambers. Join voyages and invoices on the number and chamber columns, and the resulting table with chambers using only chamber column. Limit the result set to 5 rows.
Subqueries
SQL supports nested queries, the so-called subqueries. They can be used in all parts of the query and often simplify query expression as opposed to a JOIN. Other uses are creation of more convenient join partners or computation of projection results. These queries mostly differ in the allowed cardinality of their results. In projections, they can only return a single value whiled in the FROMclause, entire tables may be produced.
Filters
Subqueries in a WHERE clause are typically comparing an attribute, either against a single value (e.g., using =) or against a set of values (e.g., with IN or EXISTS). In both cases, it is possible to refer to tuple variables from the enclosing query but not the other way around. Subqueries are enclosed in brackets () and are otherwise complete queries on their own. For example, the rather over-complicated query SELECT boatname FROM voyages WHERE tonnage = (SELECT 884) uses a subquery to retrieve all ships with a tonnage of 884.
WHERE clause to retrieve boat names from the voyages table where there is no matching entry in the invoices table for the particular voyage number. Use a subquery with either NOT IN or NOT EXISTS (with a reference to the outer query).
Tables
The result of a subquery is also a table, hence they can also be used to use a subquery result where a table could be used, i.e., the FROM clause. For example, the (again) over-complicated query SELECT number FROM (SELECT * FROM voyages) AS v uses a table-creating subquery. Note how we have to use the AS clause to give the table created by the subquery a name. We often use subqueries to avoid complex filter conditions after joining tables.
FROM clause to only retrieve invoices from chamber 'H' and the invoice amount of larger than 10000 and join the result with the voyages table using the number column. Project to only retrieve the boatname and the invoice amount of the join result. Order by invoice amount.
Set Operations
SQL is grounded in the set semantics of the relational model. Hence, result sets can be interpreted as sets of tuples and we can use set operations to modify them. The most common set operation is the UNION, but intersections and set differences are also possible. These operators are often used to combine tuples from different relations.
Unions
The UNION keyword combines two result sets while eliminating duplicates, the UNION ALL keyword does the same while keeping duplicates. For example, the query SELECT name, chamber FROM chambers UNION ALL SELECT boatname, chamber FROM voyages will stack the otherwise unrelated boat and chamber names into one result set. Often, constants are also selected to denote the source of the respective tables.
UNION keyword to add a (modern-day) province column to the chambers values. 'Noord-Holland for 'Amsterdam', 'Hoorn' and 'Enkhuizen', 'Zeeland for 'Zeeland', 'Zuid-Holland for 'Delft' and 'Rotterdam'. Select chamber code, name and province. Order entire result set by chamber code.
You made it to the end, congratulations.