Relational Databases

Data helps solve problems. Anne Wojcicki Most businesses store data, and most of those do so in a sophisticated database system. Better database engines have lead to increased business capabilities, so a few decades of market pressure has lead to highly capable, complex systems. But with great power comes great responsibility: it can be costly to misunderstand database engines. This course provides a first look under the hood of database systems, to help give you the understanding needed to make wise choices.

Readings in Database Systems, more commonly known as the databases “Red Book” is my suggested next stop for papers after you’ve finished this course, or if you have extra time during itBy completing the main exercise sequence for this course, you will incidentally implement a simple relational database management system. As usual, you should aim to complete each exercise yourself, using the walkthrough videos and supplementary explainers as needed. In some courses, it's very clear when an exercise is "done"—for this one, I leave the scope more to you, implementing just the core objectives and a handful of interesting stretch goals myself. For more suggestions on how to approach CS Primer, see the how-to guide.

Also included are seminars recorded with CS Primer students, designed to cover more conceptually rich material that's harder to cover in the form of an exercise. Typically students will have worked through some of the exercises in the corresponding module before attending the seminar.

For supplementary resources, I recommend using one of two sets of video lectures: the Spring 2015 session of CS 186 by Joe Hellerstein at Berkeley, or Intro to Database Systems by Andy Pavlo at CMU. Both are fantastic and preference may come down to style. There are some databases textbooks, but past students have tended not to find them particularly helpful. I do however link to a number of papers, particularly from the databases Red Book, as well as The Internals of PostgreSQL.

Note from Oz: I'm currently actively working on this course. It's ready to start, and I'll aim to release problems fast enough to keep up with your progress, but note that the structure may change on short notice.

Query execution

We'll start our exploration of relational database management systems by focusing on basic query execution. Our main project goal will be to write a query executor for simple single-table queries in memory, designed to be easy to extend in future project steps. Along the way we'll also start exploring PostgreSQL as our case study, dipping in to the source and reading explain statements.

Problems

Basic query executor

write the core query executor for a relational database (58:26)

Explainers

A brief overview of query execution using EXPLAIN (42:42)

Observing PostgreSQL client and server communication (12:26)

What is the "core" of a "core dump" or "out of core sort"? (04:50)

A very brief history of early DBMSs (22:18)

Column vs row stores (08:42)

Physical storage

Database systems become a little more interesting when they're expected to operate over data that's too large to hold in memory all at once. We'll need to read from and write to disk, and ensure that any blocking operations (such as sort) can spill out of memory as needed.

Problems

CSV FileScan	update your DBMS to read from a (CSV) file, to start testing with real data (1:03:35)
Heap File	design and implement a custom file format for data storage (2:06:33)
Basic inserts	implement a first version of an insert operation (1:01:09)

Explainers

The layout of a PostgreSQL table (35:21)

Joins

Here we progress from single table queries to those across multiple tables. There is more than one way to perform a join, and a sophisticated DBMS will be able to choose between options. We'll start by implementing the most obvious join algorithm, then progress to something a little more sophisticated.

For further background, see the CS 186 lecture on join algorithms, although you may wish to just try implementing them yourself, first!

Problems

Nested loops join	implement the most basic join algorithm you can think of (1:00:54)
Hash join	implement a more efficient join algorithm by hashing one relation (40:17)
Sort-merge join	join two sorted tables in linear time (1:38:54)
Aggregation	support operations like count, sum and avg (58:21)

Explainers

Nested loop variants and practical considerations (15:24)

What is a join? (20:49)

The three basic join stratgies (18:53)

Hash join variants and practical considerations (14:57)

Indexes

Indexes help us avoid slow sequential scans, by storing offsets into our underlying data.

Query optimization

The beauty of the relational model is that a declarative query can be translated into a potentially large number of logically equivalent query plans, some of which may have dramatically different performance characteristics. In this module, we start to explore a simple method for cost calculation and optimization.

Transactions and concurrency

For most systems, it's ideal to be able to support multiple concurrent queries, rather than expecting one to run to completion before the next is able to commence. One of the major challenges in databases has been to device a way to do this which balances performance against anomalies, ie allowing multiple concurrent queries without excessive restriction, but also without allowing one query to "see" the partial effects of another. We'll learn that there is no single best compromise, so we will need to understand various concurrency levels and their tradeoffs.

Explainers

Transactions: the big picture (30:27)

Logging, atomicity and durability

In order to guarantee durability (the result of any confirmed transaction will not be lost, even if power is lost) as well as atomicity (transactions will entirely either fail or succeed, nothing in between) we will need to maintain a running log of all transactions, as we process them.