The Structure of a Relational Database.
Database is a term that can mean different but similar things depending on context. Database may refer to the instance, the data or the server. Useful shorthand but sufficiently broad that unless you have the context, meaningless.
If you have sufficient context, the words “The production receivables master database is down” galvanize you. If you don’t have the context, it’s as meaningless as asking why a production Database Administrator (DBA) would need to be dipped in molten Zinc. If you have ever worked with production DBA’s (context again), you’ll know why it does not make sense to ask if they need to be dipped in molten Zinc.
Databases have been commercially available now for 40 years and are relatively straightforward. The way that they work is that
- Your data, called the DATABASE is managed by a piece of software, called appropriately enough
- The Relational Database Management System (or RDBMS, DBMS or again database) and the combination of your DBMS and your Data is called
- An INSTANCE (or database), all of which needs to run on a machine, typically a server called appropriately enough
- A DATABASE SERVER (or database).
And Relational Databases (RDBMS, DBMS, Database) look like this:
Since the only operations that take place on the disk are reads and writes the data on disk is constrained by the limitations of the media, but is organized in a way to exploit the advantages of disk storage (bits, bytes, blocks, tracks and cylinders). This physical organization is managed by the operating system and to all extents and purposes is invisible to the database users. The database administrator (DBA) can specify that different pieces of your database are placed on different disks and that this data can be duplicated but at the lowest level, you generally don’t care how it gets on and off the disk or how it is organized. The organization of the disk data is optimized for reading and writing.
Having said this there are lots of different sets of data that make up the structure of the database – the data itself, which we think of as tables containing rows and columns, indices on this data, log files, control files, parameter files, configuration files and application logic. It’s vitally important that the DBMS knows this structure, and the way that the DBMS ‘sees’ this is through a storage model.
The storage model is a layer of abstraction that maps the requirements of simply reading and writing data to the needs of the DBMS. While you don’t care how the data is read and written, you do care that changes made to user data are not written to configuration files, for example. Thus, each DBMS has a representation of the different types of data on disk that it needs to do its job.
The storage model for the DBMS describes the physical structures of the DBMS (Control Files, Data Files, Log Files), and it is this structure that is called the DATABASE.
Disk data has the following advantages. It’s cheap, it’s virtually limitless, and most importantly it is persistent.
The place where work gets done on data is in memory. Memory is orders of magnitude faster to access than disk, and the types of structures that can be built up in memory are less constrained than those on disk – they can be complex and arbitrary, they can be modified and restructured easily. This gives the DBMS its speed and flexibility.
There is a structure that is common between memory and disk, and this basic building block of information is called appropriately enough a block. Structures in memory are organized in blocks and this facilitates reading blocks from disk into memory. Since chunks of data in the database are logically related, it is efficient to move not only the data that you need into memory, but also data that it is likely that you are going to need at the same time. These ‘anticipatory’ reads are called pre-fetching, and since they can fetch several blocks of data at a time, the blocks are stacked up next to each other in structures called extents.
In addition to holding data retrieved from disk, memory will hold such things as transient structures created during the databases operations (i.e. intermediate sort tables), blocks of executable code, the system catalog and dictionary, and the plan.
The structures in memory, described by the memory model, can be accessed quickly and contain majority of the information needed by the DBMS to operate. This includes parts of the larger population of user data held on disk, and structures required for the database management system itself (i.e. the plan, meta code etc.). Because it might not be possible or desirable to have all the data that resides on disk in memory, and for security (persistence), the DBMS will swap data between memory and disk.
There are many things happening simultaneously in the DBMS, such as reading and writing data from and to disk, recording recovery information in database logs, checking to see if data has changed in memory, holding connections for user and system programs, monitoring the health of the database, running user written code, locking parts of the database and detecting deadlocks. This list is by no means exhaustive, and there are differences between different DBMS implementations but there are a core set of processes that deal with memory management, paging to disk, and writing information in the event of recovery or rollback.
It is these working in concert with the structures in memory that are called the INSTANCE.
Process Model Implementations
Databases need to be able to service large numbers of requests from concurrent users. The Operating Systems that they run on handle concurrency using Processes and Threads. A thread is the smallest unit of execution that can be scheduled by an operating system, and in general a process contains one or more threads. Multiple threads in a process have access to the same resources such as memory, while the processes that contain them do not. This has speed advantages in that threads can communicate with each other without the overhead of an inter process call (IPC). The disadvantages are that because threads can trample on each other, the structures that they are working on need to be locked, and these locks need management (which incurs its own overhead). Even with these mechanisms in place it’s still possible for threads working on the same data structures to change the data only to have another thread change it back – the so-called ‘race-condition’. These are notoriously hard to isolate and reproduce. With this being said, threading is implemented differently from one operating system to another, and some do a better job of this than others.
On single processor systems threading is handled by multiplexing. This is where the processor switches between threads (called context switching), and while the processor is working on an execution thread, other threads wait. It is generally the case that the processor is running much faster than the requirements of the individual threads (which may be waiting for I/O, or user input), and the switching happens so fast that it looks like each thread is running simultaneously. On multi-processor systems, the threads will generally run at the same time, but in different processors.
A database will map its concurrent user needs onto the underlying operating systems processes and threads. This mapping is called the database process model. Because of differences in how databases are used (single user vs multi-user, stand-alone vs parallel), and because different operating systems implement threads differently. The units of work that need to be performed are the same whether they are implemented as threads or processes, and somewhat confusingly the term process model and process tends to be used for threads, user processes, database processes and operating system processes interchangeably (just like database can mean the database instance or the user data on disk), and this document is no exception. When the word process is used it means all three (database process, operating system process, and user process) AND the work being done by a thread. This is a convenient term to use when we want to describe a unit of work that the database needs to perform where the underlying implementation is not a consideration (abstraction, if you will). Where it is important to differentiate between types of processes and processes and threads this difference will be specific (i.e. Checkpointing is a process vs Checkpointing is a threaded database process.)
The structures on disk comprise the DATABASE, the structures and processes in memory are the DATABASE INSTANCE, and the DATABASE INSTANCE and DATABASE together are called the DATABASE SERVER.