Ganglia in the cloud ( unicast instead of multicast )

Last time, I talked about how much I like using Ganglia for monitoring a large number of distributed servers.

One of the issues I ran into that is barely covered in the documentation is how to set this up if you cannot use multicast.  Multicast is the default method that ganglia uses to discover nodes.  This is great, it means that auto-discovery works… kinda.  The issue is that most cloud providers squash your ability to do multicast.  This is a good thing; can you imagine having to share a room with the guy who can’t stop screaming through the bull-horn every 2 milliseconds?  So, if I want to use ganglia in EC2, the Amazon cloud, how do I go about doing that?

To get around this issue, you need to configure ganglia in unicast mode.  This is the mysterious part: what exactly is it, where do I set it, and how do I have multiple clusters in unicast mode all report to the same web UI?  Most of the tutorials I read alluded to the fact that you *could* have multiple clusters set up in ganglia, and most speculated [ some even correctly ] about how to do it, but none really implemented it.  So, here is how you can disable multicast in ganglia and, instead, enable unicast with multiple clusters.

First, to get started with this, there are a couple of ganglia components that you really need to be familiar with.

gmetad

gmetad is the ‘server’ side of ganglia.  It is responsible for taking the data from the remote collectors and stuffing it into the backend database ( ganglia uses rrdtool ).  You’ll have one of these bad-boys running for each web UI you have set up.

Configuration

First of all, take a look at the full, default config file.  It’s got a lot of great comments in there and really helps to explain everything from soup to nuts.  That being said, here’s what I used ( and my comments ) to get up and running.

Configuration is done in /etc/gmetad.conf ( the default location ).

# Each 'cluster' is its own data-source
# I have two clusters, so, 2 data-sources
# ... plus my local host
data_source "Local" localhost
data_source "ClusterA" localhost:8650
data_source "ClusterB" localhost:8655

# I have modified this from the default rrdtool
# storage config for my purposes, I want to
# store 3 full years of datapoints. Sure, there
# is a storage requirement, but that's what I need.
RRAs "RRA:AVERAGE:0.5:1:6307199" "RRA:AVERAGE:0.5:4:1576799" "RRA:AVERAGE:0.5:40:52704"

Essentially, the above sets up two clusters, ClusterA and ClusterB.  The sources for these are coming from localhost:8650 and localhost:8655 respectively ( don’t worry, I’ll explain that bit below… ).  The other thing for me is that I need to keep 3 full years of real datapoints.  ( rrdtool is designed to ‘aggregate’ your data after some time.  If you don’t adjust it, you lose resolution to the aggregation, which can be frustrating. )
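
If you ever want to sanity-check what a set of RRAs actually retains, the arithmetic is just rows x steps-per-row x step.  Here is a minimal sketch, assuming gmetad’s default 15-second RRD step ( that step size is an assumption; adjust it if yours differs ):

# Back-of-the-envelope check of what each RRA keeps.
# Assumes gmetad's default RRD step of 15 seconds; adjust STEP_SECONDS
# if your gmetad is configured differently.
STEP_SECONDS = 15

rras = [
    ("RRA:AVERAGE:0.5:1:6307199", 1, 6307199),
    ("RRA:AVERAGE:0.5:4:1576799", 4, 1576799),
    ("RRA:AVERAGE:0.5:40:52704", 40, 52704),
]

for definition, steps_per_row, rows in rras:
    resolution = steps_per_row * STEP_SECONDS   # seconds per stored point
    span_days = rows * resolution / 86400.0     # total history kept
    print("%-32s %5ds resolution, ~%.0f days" % (definition, resolution, span_days))

With a 15-second step, the first two RRAs above cover roughly three years ( at 15-second and 1-minute resolution ), and the third keeps about a year of 10-minute averages.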

gmond

gmond is a data-collector.  It will, essentially, collect data from a host and send it … somewhere.  Let’s discuss where.

Before we address the multiple clusters piece, here’s how you disable multicast.  The default config file contains a few sections that you really care about; the things we need to change are:

  • cluster -> name
  • comment out the udp_send_channel -> mcast_join parameter
  • comment out the udp_recv_channel -> mcast_join parameter
  • comment out the udp_recv_channel -> bind parameter


/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
* of a <CLUSTER> tag. If you do not specify a cluster tag, then all <HOSTS> will
* NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
name = "unspecified"
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like. Gmond
used to only support having a single channel */
udp_send_channel {
# Comment this out for unicast
#mcast_join = 239.2.11.71
port = 8649
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
# Comment this out for unicast
#mcast_join = 239.2.11.71
port = 8649
#Comment this out for unicast
#bind = 239.2.11.71
}

So, in order to convert this to unicast, you just comment out the mcast_join and bind lines above and set the port to some available port… that simple!

So, I have 3 clusters: Local, ClusterA and ClusterB.  To get this working with unicast ( unicast meaning that each node talks to one specific endpoint ), I need to have a separate gmond running on my server for EACH cluster.

So, on the ganglia server, I have 3 gmond config files:

Local:

/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
 name = "Local"
 owner = "Scottie"
 latlong = "unspecified"
 url = "unspecified"
}
/* The host section describes attributes of the host, like the location */
host {
 location = "GangliaSever"
}

/* Feel free to specify as many udp_send_channels as you like. Gmond
 used to only support having a single channel */
udp_send_channel {
host = localhost
port = 8649
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
port = 8649
}

/* You can specify as many tcp_accept_channels as you like to share
 an xml description of the state of the cluster */
tcp_accept_channel {
 port = 8649
}

Remember the ‘data_source’ lines from your gmetad.conf file?  Well, if you look up, you’ll see that the data source for the ‘Local’ cluster was just ‘localhost’ ( with no port given, gmetad defaults to 8649 ).  Essentially, gmetad will talk to this gmond on localhost:8649 to receive its data.  Now, the remainder of your gmond.conf file is important too; it dictates all of the monitoring that this gmond instance will do.  Only change the sections that I have listed above.
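
Before pointing gmetad at a gmond, it’s handy to confirm that the gmond is actually answering on its tcp_accept_channel.  Anything that opens a TCP connection to that port ( telnet, nc, etc. ) gets back an XML dump of the cluster state; here’s a small Python sketch that does the same thing against the ‘Local’ gmond above:

#!/usr/bin/env python3
# Sanity check: gmond dumps an XML description of its cluster state to any
# client that connects to its tcp_accept_channel, then closes the socket.
# localhost:8649 matches the "Local" gmond configured above.
import socket

def dump_gmond_xml(host="localhost", port=8649):
    chunks = []
    with socket.create_connection((host, port), timeout=5) as sock:
        while True:
            data = sock.recv(4096)
            if not data:          # gmond closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("iso-8859-1")

if __name__ == "__main__":
    print(dump_gmond_xml()[:400])   # should start with the <GANGLIA_XML> dump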

Now for the two remaining clusters:

ClusterA:

/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
 name = "ClusterA"
 owner = "Scottie"
 latlong = "unspecified"
 url = "unspecified"
}
/* The host section describes attributes of the host, like the location */
host {
 location = "GangliaSever"
}

/* Feel free to specify as many udp_send_channels as you like. Gmond
 used to only support having a single channel */
udp_send_channel {
host = localhost
port = 8650
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
port = 8650
}

/* You can specify as many tcp_accept_channels as you like to share
 an xml description of the state of the cluster */
tcp_accept_channel {
 port = 8650
}

ClusterB:

/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
 name = "ClusterB"
 owner = "Scottie"
 latlong = "unspecified"
 url = "unspecified"
}
/* The host section describes attributes of the host, like the location */
host {
 location = "GangliaSever"
}

/* Feel free to specify as many udp_send_channels as you like. Gmond
 used to only support having a single channel */
udp_send_channel {
host = localhost
port = 8655
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
port = 8655
}

/* You can specify as many tcp_accept_channels as you like to share
 an xml description of the state of the cluster */
tcp_accept_channel {
 port = 8655
}

Now that we’ve got our ‘server’ set up to receive data for each of our clusters, we need to configure the actual hosts that are part of each cluster to forward data in.  Essentially, this is going to be the same gmond configuration, but it will forward data to the gmond that we just set up on the server.

Let’s say we have three hosts:

Grumpy ( our ganglia server )

Sleepy ( ClusterA )

Doc ( ClusterB )

Now, let’s configure their gmonds to talk to our server (Grumpy) and start saving off our data.  First of all, Grumpy is already configured and running, so if you connect to the ganglia interface at this point ( and your gmetad is running ), you should see ‘Grumpy’ showing up in the ‘Local’ cluster.

On each of these hosts, you only change the host field ( udp_send_channel->host ) to be the name or IP address of your ganglia ‘server’.  On Sleepy ( ClusterA ):

/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
 name = "ClusterA"
 owner = "Scottie"
 latlong = "unspecified"
 url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
 location = "GangliaSever"
}

/* Feel free to specify as many udp_send_channels as you like. Gmond
 used to only support having a single channel */
udp_send_channel {
host = grumpy
port = 8650
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
port = 8650
}

/* You can specify as many tcp_accept_channels as you like to share
 an xml description of the state of the cluster */
tcp_accept_channel {
 port = 8650
}

On Doc ( ClusterB ), you make the same change ( udp_send_channel->host ):

/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
 name = "ClusterB"
 owner = "Scottie"
 latlong = "unspecified"
 url = "unspecified"
}
/* The host section describes attributes of the host, like the location */
host {
 location = "GangliaSever"
}

/* Feel free to specify as many udp_send_channels as you like. Gmond
 used to only support having a single channel */
udp_send_channel {
host = grumpy
port = 8655
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
port = 8655
}

/* You can specify as many tcp_accept_channels as you like to share
 an xml description of the state of the cluster */
tcp_accept_channel {
 port = 8655
}

Once you start the gmond process on each host, wait a bit and they will appear in the ganglia interface.  Simple as that!
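
If a new host doesn’t show up, you can check the plumbing without the web UI by asking each per-cluster gmond on the ganglia server which hosts it has heard from.  A rough sketch ( the XML layout is GANGLIA_XML > CLUSTER > HOST, and the ports match this post: 8649, 8650, 8655 ):

#!/usr/bin/env python3
# Pull the XML from each per-cluster gmond on the ganglia server and list
# the hosts it knows about.
import socket
import xml.etree.ElementTree as ET

def fetch_xml(host, port):
    chunks = []
    with socket.create_connection((host, port), timeout=5) as sock:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

for port in (8649, 8650, 8655):
    root = ET.fromstring(fetch_xml("localhost", port))
    for cluster in root.findall("CLUSTER"):
        hosts = sorted(h.get("NAME") for h in cluster.findall("HOST"))
        print("port %d, cluster %s: %s" % (port, cluster.get("NAME"), hosts))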



SQL NoSQL Why Choose?

Let’s quickly take a look at the strengths and weaknesses of current SQL and NoSQL databases, with an eye towards where they are today.  It’s clear there is tremendous investment and development extending both SQL and NoSQL solutions ( often to provide functions inherent in the other! ), and over time innovation will close these gaps.  Exactly when and how that will play out over the coming years remains to be seen, but we feel that TODAY any serious data strategy must embrace BOTH storage paradigms.


SQL

  • Mature technology
  • Tremendous support options
  • Decades of applications developed (both open and closed source)
  • Low Latency transactions (OLTP)
  • Robust transaction support (ACID)
  • High concurrency
  • Full ANSI compliant SQL
  • Robust security


NoSQL

  • Massive datastore capacity
  • Massive parallel query
  • Commodity hardware
  • Horizontal scalability
  • Ultra fast bulk data loads
  • Unstructured data support


It’s interesting to note that many of the strengths of NoSQL arose specifically from innovations driven to get past current SQL scaling limitations.  In doing so, an entirely new paradigm has been created and is being embraced not just by high-flying tech companies but now across industry domains.  Data that was ignored by necessity is now being harvested to drive additional business insights.  The key word here is additional.  To fully gain the potential afforded by BigData computing, it’s critical to integrate with your existing data environments.  Successful organizations are investing with an eye towards extending value, looking not to “replace” but to augment current data infrastructure investments.
At OpenSCG we are focused on integrating existing SQL infrastructure WITH NoSQL extensions to ensure our clients get the most out of their investment dollars.  To support this for our PostgreSQL clients we have created BigSQL ( www.bigsql.org ).  This open source distribution includes both SQL (PostgreSQL) and NoSQL (Hadoop), plus extensions to smooth data integration across both stacks.

Ganglia monitoring of large (and small) clusters

Lately, I’ve been spending more time back in the performance testing world.  The problem is that the testing we’re doing is taking days and in some cases, weeks to run.  No matter how diligent you are, you’re not going to be staring at your CPU / memory graphs the whole time.  The question is, what Open-Source tool should I be using to collect my metrics?

Previously, I’ve always used nmon and nmon analyser to collect and inspect ( respectively ) my metrics.  There are a few issues with it however, the most glaring of which is that the analyser tool is an Excel macro ( gross, out comes the windows VM ).  More recently I’ve been using cacti, which is a great tool for collecting system metrics, but the rrdtool defaults are a bit weak on data retention.  Basically, you end up losing your data after 24 hours.  Now, granted, I can modify cacti to increase the number of data points stored, but there are a few issues:

  1. Data Collection is kludgy
  2. The server has a LOT to do
  3. The interface is beginning to age
  4. Adding a new host is kind of like pulling out your wisdom teeth

So, dentistry aside, I found Ganglia.  Under the covers, ganglia is really using the same database technology as cacti ( rrdtool ), but, the defaults are changed in one simple place.  In 30 seconds, I had reconfigured ALL rrd databases and metrics to store 3 years of full data points.  Pretty simple and powerful.

The big win for me though was provisioning.  The environment I’m working in has a new machine showing up each day ( or an old machine re-purposed ), so setup needs to be quick.  With Ganglia, there are two methods for doing this:

1. Multicast ( The default)

It is what it sounds like.  You turn on the data collector on a host and before you even know it… your host is in the web interface.  This is really great when dealing with large clusters ( http://coen.boisestate.edu/ece/raspberry-pi/ ) in a lab where boxes come in and out before you know it.

2. Unicast ( The reality )

Multicast doesn’t work in EC2, or in most corporate networks for that matter.  Your production environment is 4 firewalls and 9 routers away from where your graphing node is.  The configuration for this mode is a bit more up-front work, but once you get it set up, you just start the collector daemon and it connects to the mothership and does the rest ( provisioning, initial graphing, etc… )

 

If you’re looking for a monitoring solution that gives you all the data-points, is easy to provision and open-source… gotta go Ganglia!

 


Hadoop Hive Tutorial – Zero to Results in 15 Minutes

This Tutorial will walk through using Hadoop’s Hive to access data stored in HDFS.  This is the first tutorial (more to follow) and covers installing Hadoop as part of BigSQL and running some simple queries.

The Platform:  Linux (RHEL, CentOS or Ubuntu) or Mac OSX.  The Hadoop and Postgres install was from www.BigSQL.org.  It takes about 10 – 15 minutes to install the developer bundle and get up and running.  ( don’t skip the pre-requisites ).

The Data: The original source data is from the NYC finance department and contains Rolling Sales files for the last 12 months in New York City.   The data for Manhattan was used in the tutorial.   A slightly cleaned file that can be used for the tutorial is available below.

Let’s get started!

  • If you have not previously installed BigSQL, download the bundle and untar the file.  Make sure to use the exact filename you downloaded:

tar -xf bigsql-9.3xxx-linux64.tar.bz2

  • navigate to the home directory of your BigSQL install.  

I have installed BigSQL in the following directory:  /home/cgerth/bigsql/bigsql-9.3beta1-7.  Simply cd to this directory.

  • start BigSQL

./bigsql start

  • execute the environment script setENV.sh

. setENV.sh

  • create a tutorial directory, change to that directory

mkdir tutorial

cd tutorial

  • put the following three files in the tutorial directory: ex1data.csv, ex1createHiveTables.sql and ex1loadHiveTables.sql.  The files are zipped here

use the tool of your choice!

  • create the hive table structure.  Take a look at the ex1createHiveTables.sql script, which is used to create the Hive table.

CREATE TABLE IF NOT EXISTS salesdata (
s_num FLOAT,
s_borough INT,
s_neighbor STRING,
s_b_class STRING,
s_c_p STRING,
s_block STRING,
s_lot STRING,
s_easement STRING,
w_c_p_2 STRING,
s_address STRING,
s_app_num STRING,
s_zip STRING,
s_res_units STRING,
s_com_units STRING,
s_tot_units INT,
s_sq_ft FLOAT,
s_g_sq_ft FLOAT,
s_yr_built INT,
s_tax_c INT,
s_b_class2 STRING,
s_price FLOAT,
s_sales_dt STRING

)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

$ hive -v -f ex1createHiveTables.sql

  • load the datafile into the structure created.  The ex1loadHiveTables.sql script can be used for this.  Take a look at the script before running it.

LOAD DATA LOCAL INPATH 'ex1data.csv'
OVERWRITE INTO TABLE salesdata;

hive -v -f ex1loadHiveTables.sql

  • Run a quick query to test the data load.  Run the following commands:

hive

SELECT * from salesdata LIMIT 5;

You should see 5 records from the table.

  • Take a quick look at the hive table created.

DESCRIBE salesdata;

OK
col_name    data_type    comment
s_num    float    
s_borough    int    
s_neighbor    string    
s_b_class    string    
s_c_p    string    
s_block    string    
s_lot    string    
s_easement    string    
w_c_p_2    string    
s_address    string    
s_app_num    string    
s_zip    string    
s_res_units    string    
s_com_units    string    
s_tot_units    int    
s_sq_ft    float    
s_g_sq_ft    float    
s_yr_built    int    
s_tax_c    int    
s_b_class2    string    
s_price    float    
s_sales_dt    string    
Time taken: 0.194 seconds
  • Let’s run a query to see the sales by neighborhood in Manhattan:

SELECT s_neighbor, sum(s_price) FROM  salesdata GROUP BY s_neighbor;

Results should look similar to:
ALPHABET CITY 2.11818189E8
CHELSEA 2.250264798E9
CHINATOWN 2.88838109E8
CIVIC CENTER 6.12244158E8
CLINTON 8.06269355E8
EAST VILLAGE 4.68213911E8
FASHION 1.714396813E9
FINANCIAL 1.402998865E9
FLATIRON 1.612799353E9
GRAMERCY 1.021019603E9
GREENWICH VILLAGE-CENTRAL 1.210462635E9
GREENWICH VILLAGE-WEST 1.670496319E9
HARLEM-CENTRAL 7.91802512E8
HARLEM-EAST 2.17612503E8
HARLEM-UPPER 2.5843908E8
HARLEM-WEST 1.60041248E8
INWOOD 1.20332695E8
JAVITS CENTER 1.48603249E8
KIPS BAY 4.22209848E8
LITTLE ITALY 2.50821741E8
LOWER EAST SIDE 6.58796147E8
MANHATTAN VALLEY 2.20308024E8
MANHATTAN-UNKNOWN 0.0
MIDTOWN CBD 3.203149051E9
MIDTOWN EAST 1.294451495E9
MIDTOWN WEST 6.728533424E9
MORNINGSIDE HEIGHTS 2.9809357E7
MURRAY HILL 1.441188759E9
SOHO 2.929228594E9
SOUTHBRIDGE 9.78509618E8
TRIBECA 1.663415012E9
UPPER EAST SIDE (59-79) 4.325223036E9
UPPER EAST SIDE (79-96) 3.159172312E9
UPPER EAST SIDE (96-110) 2.10717582E8
UPPER WEST SIDE (59-79) 3.4046098E9
UPPER WEST SIDE (79-96) 1.422565321E9
UPPER WEST SIDE (96-116) 5.95009679E8
WASHINGTON HEIGHTS LOWER 3.67326009E8
WASHINGTON HEIGHTS UPPER 3.30247015E8
Time taken: 3.656 seconds
  • How many properties in each community?

SELECT s_neighbor, count(s_price) FROM salesdata GROUP BY s_neighbor;

ALPHABET CITY                128
 CHELSEA                      961
 CHINATOWN                    175
 CIVIC CENTER                 263
 CLINTON                      495
 EAST VILLAGE                 233
 FASHION                      249
 FINANCIAL                    491
 FLATIRON                     528
 GRAMERCY                     504
 GREENWICH VILLAGE-CENTRAL    699
 GREENWICH VILLAGE-WEST       710
 HARLEM-CENTRAL               867
 HARLEM-EAST                  298
 HARLEM-UPPER                 161
 HARLEM-WEST                  67
 INWOOD                       117
 JAVITS CENTER                38
 KIPS BAY                     304
 LITTLE ITALY                 101
 LOWER EAST SIDE              439
 MANHATTAN VALLEY             268
 MANHATTAN-UNKNOWN            3
 MIDTOWN CBD                  235
 MIDTOWN EAST                 1196
 MIDTOWN WEST                 5856
 MORNINGSIDE HEIGHTS          65
 MURRAY HILL                  771
 SOHO                         432
 SOUTHBRIDGE                  70
 TRIBECA                      694
 UPPER EAST SIDE (59-79)      2408
 UPPER EAST SIDE (79-96)      1929
 UPPER EAST SIDE (96-110)     131
 UPPER WEST SIDE (59-79)      1985
 UPPER WEST SIDE (79-96)      1028
 UPPER WEST SIDE (96-116)     461
 WASHINGTON HEIGHTS LOWER     186
 WASHINGTON HEIGHTS UPPER     350
 Time taken: 3.79 seconds
  • Interesting that MANHATTAN-UNKNOWN had 3 transactions but no $ volume; let’s take a closer look:

SELECT s_price FROM salesdata WHERE s_neighbor LIKE '%UNKNOWN%';

Mapred Local Task Succeeded . Convert the Join into MapJoin
 OK
 s_price
 0.0
 0.0
 0.0
 Time taken: 3.479 seconds

Yes, as expected, they are all zero-value transactions.

This completes the first Hive tutorial in the series.

The next tutorial in the series is PostgreSQL FDW to Hive.

Please share feedback and look for more tutorials moving forward.

Click here for more information on BigSql!

 


PostgreSQL Clustering with Postgres HA

PostgreSQL Clustering made easy with PostgresHA.

Authored jointly with Scott Mead, Sr. Architect OpenSCG and Architect of PostgresHA.

PostgresHA is Highly Available (HA) PostgreSQL.

High Availability

High availability is not new.  Financial trading systems and the telephone network were federally mandated to provide high availability.  People didn’t do HA because it was a fun project; they did it because they were told to.

What is new is that EVERYTHING these days is HA: your email and blog posts on Google, your shopping cart on Amazon, your pictures on Facebook.  People have gone from doing this because it’s critical for their business to doing it because it’s good for business.

Open Source is also good for business, and because all the components of a Big Data solution are publicly available, some very smart people have gone to the trouble of making this stuff accessible to the likes of you and me.  They have made it so that we can understand it and use it.  You don’t have to be Ma Bell or the NYSE to enjoy the benefits of always-available computing.

PostgreSQL

The world’s most advanced Open Source database.  PostgreSQL has synchronous replication.  In computer-theory terms it is called 2-safe replication; in English it means that you have two database instances running simultaneously and your Primary (Master) database is synchronized with a Secondary (Slave) database.  Unless both databases crash simultaneously, you won’t lose data.  If your databases are on separate machines in separate racks in separate buildings you can reduce the possibility of simultaneous failure to near zero.  Building an HA solution that survives an Extinction Level Event like a meteor strike is beyond the scope of this paper and would not get read anyway.

With synchronous replication, each write waits until confirmation is received from the transaction log of both the Master and the Slave.  This increases the transaction response time, and the minimum increase is the roundtrip time between them.  All two-phase commit actions require waits (prepare and commit).
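
You can watch this from the Master’s side via the pg_stat_replication view, which shows each connected standby and whether it is synchronous.  A small sketch using psycopg2 ( host, user and password are placeholders ):

# A quick look at replication status from the Master's side.
# Connection details are placeholders; the monitoring role needs permission
# to read pg_stat_replication.
import psycopg2

conn = psycopg2.connect(host="master.example.com", dbname="postgres",
                        user="monitor", password="secret")
cur = conn.cursor()
cur.execute("""
    SELECT application_name, client_addr, state, sync_state
    FROM pg_stat_replication
""")
for app_name, addr, state, sync_state in cur.fetchall():
    # With synchronous replication the Slave should show sync_state = 'sync'
    print(app_name, addr, state, sync_state)
cur.close()
conn.close()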

However, there is good news: read-only transactions and transaction rollbacks need not wait for replies from standby servers (PostgreSQL, we’re optimized for rollback!).  Only top-level transaction commits require the wait; sub-transaction commits do not need to wait for standby servers, and long-running data loads or index builds only wait on the very last commit.  Good for speed, but if you lose it halfway through, you lose it all – you pays your money, you takes your choice.

PostgreSQL uses the same concept of active standby as ZooKeeper – keep a standby hot for failover (Master / Slave, Primary / Secondary, Active / Standby – you say potato, I say potato.)

OpenSCG makes PostgreSQL HA with the Open Source project PostgresHA .

At a high level, PostgresHA monitors both Master and Slave databases. If it spots something that could be a failure it enters a state of heightened awareness and starts yelling about it. Heightened awareness means continuous pinging, more pinging and deeper pinging.
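
PostgresHA’s internals aren’t reproduced here, but the “ping, then ping harder” idea looks roughly like this toy sketch: probe both nodes, shorten the interval once a probe fails, and flag a failover decision after repeated misses ( all hostnames, credentials and thresholds below are made up for illustration ):

# Not PostgresHA itself, just a toy sketch of escalating health probes.
import time
import psycopg2

NODES = {
    "master": "host=master.example.com dbname=postgres user=monitor connect_timeout=2",
    "slave": "host=slave.example.com dbname=postgres user=monitor connect_timeout=2",
}
NORMAL_INTERVAL = 10   # seconds between probes when everything looks healthy
ALERT_INTERVAL = 1     # heightened awareness: probe much more often
FAILURE_THRESHOLD = 5  # consecutive misses before we start yelling about failover

def ping(dsn):
    try:
        conn = psycopg2.connect(dsn)
        try:
            cur = conn.cursor()
            cur.execute("SELECT 1")
            cur.fetchone()
        finally:
            conn.close()
        return True
    except psycopg2.Error:
        return False

failures = {name: 0 for name in NODES}
while True:
    for name, dsn in NODES.items():
        failures[name] = 0 if ping(dsn) else failures[name] + 1
        if failures[name] >= FAILURE_THRESHOLD:
            print("%s has missed %d probes: consider failover" % (name, failures[name]))
    time.sleep(ALERT_INTERVAL if any(failures.values()) else NORMAL_INTERVAL)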

At some point it’s serious enough to warrant failover.

This is where you need to consider what you want (In advance).

Some things (like a head crash) you want to know right away and react as soon as possible, and things like auto failover sound like a good idea.

It’s possible that there is a transient problem that no amount of intelligent probing is going to find out and it will go away without causing too much trouble. In this case you probably don’t want to fail over automatically, you might want to watch it for a little bit and when you can bear it no more, you punch a big red failover button.

Sometimes you might want to crank up the sensitivity of the probes anyway just to see what nominal behavior looks like.

PostgresHA essentially does this for you.  The great part is that OpenSCG has a free distribution of PostgreSQL HA here!

Here’s how:

PostgresHA Architecture

Your Applications (clients) attach to a connection pooler (pgBouncer). The connection pooler simply holds all the client connections and switches them between a smaller number of database connections. Everyone does it, it saves expensive database connection resources. Another nice property is that it can switch these connections to a different database if needed.

[Diagram: PostgresHA Steady State]

With Streaming Replication in PostgreSQL, we can have a Master and Slave running synchronously, and through pgBouncer we can route read-only connections to the slave and write transactions to the Master (nice). To protect against a bad hardware failure, you can put the Master and Slave databases in different racks. For Disaster Recovery, you can use log streaming to replicate to a separate data center.
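
As a rough application-side illustration of that read/write split ( this is not PostgresHA or pgBouncer code ), assume two pgBouncer pool entries: a hypothetical appdb_rw routed to the Master and appdb_ro routed to the Slave.  The orders table is also just for illustration.

# Application-side routing through two hypothetical pgBouncer pools.
import psycopg2

WRITE_DSN = "host=pgbouncer.example.com port=6432 dbname=appdb_rw user=app"
READ_DSN = "host=pgbouncer.example.com port=6432 dbname=appdb_ro user=app"

def get_conn(readonly=False):
    return psycopg2.connect(READ_DSN if readonly else WRITE_DSN)

# Read-only work goes to the Slave through the read-only pool...
conn = get_conn(readonly=True)
cur = conn.cursor()
cur.execute("SELECT count(*) FROM orders")
print(cur.fetchone()[0])
conn.close()

# ...while writes go to the Master through the read-write pool.
conn = get_conn()
cur = conn.cursor()
cur.execute("INSERT INTO orders (status) VALUES (%s)", ("new",))
conn.commit()
conn.close()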

PostgresHA handles failure scenarios the following way:

Master cannot communicate with Slave

[Diagram: PostgresHA, Master Slave Disconnect]

Slave becomes isolated:

[Diagram: PostgresHA, Slave Isolated]

Master Isolated:

[Diagram: Master Isolated]


Relational Data

The Structure of a Relational Database.

Database is a term that can mean different but similar things depending on context.  Database may refer to the instance, the data or the server.  It’s useful shorthand, but sufficiently broad that, unless you have the context, it’s meaningless.

If you have sufficient context, the words “The production receivables master database is down” galvanize you.  If you don’t have the context, it’s as meaningless as asking why a production Database Administrator (DBA) would need to be dipped in molten Zinc.  If you have ever worked with production DBAs ( context again ), you’ll know why it does not make sense to ask if they need to be dipped in molten Zinc.

Databases have been commercially available now for 40 years and are relatively straightforward. The way that they work is that

  • Your data, called the DATABASE is managed by a piece of software, called appropriately enough
  • The Relational Database Management System (or RDBMS, DBMS or again database) and the combination of your DBMS and your Data is called
  • An INSTANCE (or database), all of which needs to run on a machine, typically a server called appropriately enough
  • A DATABASE SERVER (or database).

And Relational Databases (RDBMS, DBMS, Database) look like this:

[Diagram: Relational Database Architecture]

Storage Model

Since the only operations that take place on the disk are reads and writes, the data on disk is constrained by the limitations of the media, but it is organized in a way that exploits the advantages of disk storage (bits, bytes, blocks, tracks and cylinders).  This physical organization is managed by the operating system and, to all intents and purposes, is invisible to the database users.  The database administrator (DBA) can specify that different pieces of your database are placed on different disks and that this data can be duplicated, but at the lowest level, you generally don’t care how it gets on and off the disk or how it is organized.  The organization of the disk data is optimized for reading and writing.

Having said this there are lots of different sets of data that make up the structure of the database – the data itself, which we think of as tables containing rows and columns, indices on this data, log files, control files, parameter files, configuration files and application logic. It’s vitally important that the DBMS knows this structure, and the way that the DBMS ‘sees’ this is through a storage model.

The storage model is a layer of abstraction that maps the requirements of simply reading and writing data to the needs of the DBMS. While you don’t care how the data is read and written, you do care that changes made to user data are not written to configuration files, for example. Thus, each DBMS has a representation of the different types of data on disk that it needs to do its job.

The storage model for the DBMS describes the physical structures of the DBMS (Control Files, Data Files, Log Files), and it is this structure that is called the DATABASE.

Disk data has the following advantages. It’s cheap, it’s virtually limitless, and most importantly it is persistent.

Memory Model

The place where work gets done on data is in memory. Memory is orders of magnitude faster to access than disk, and the types of structures that can be built up in memory are less constrained than those on disk – they can be complex and arbitrary, they can be modified and restructured easily. This gives the DBMS its speed and flexibility.

There is a structure that is common between memory and disk, and this basic building block of information is called appropriately enough a block. Structures in memory are organized in blocks and this facilitates reading blocks from disk into memory. Since chunks of data in the database are logically related, it is efficient to move not only the data that you need into memory, but also data that it is likely that you are going to need at the same time. These ‘anticipatory’ reads are called pre-fetching, and since they can fetch several blocks of data at a time, the blocks are stacked up next to each other in structures called extents.
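
To make blocks, extents and pre-fetching concrete, here is a toy sketch ( purely illustrative; the block size, extent size and file name are made up, not taken from any particular DBMS ) comparing a single-block read with pre-fetching a whole extent of adjacent blocks in one sequential read:

# Toy illustration of block reads versus extent pre-fetching.
BLOCK_SIZE = 8 * 1024      # a common page/block size
BLOCKS_PER_EXTENT = 8      # arbitrary extent size for the example

def read_block(f, block_no):
    f.seek(block_no * BLOCK_SIZE)
    return f.read(BLOCK_SIZE)

def read_extent(f, first_block):
    # One larger sequential read grabs the neighbouring blocks we are likely
    # to need next: the "anticipatory" pre-fetch described above.
    f.seek(first_block * BLOCK_SIZE)
    data = f.read(BLOCK_SIZE * BLOCKS_PER_EXTENT)
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

with open("datafile.dbf", "rb") as f:   # placeholder data file name
    one_block = read_block(f, 42)
    extent = read_extent(f, 40)         # blocks 40..47 in a single read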

In addition to holding data retrieved from disk, memory will hold such things as transient structures created during the database’s operations (i.e. intermediate sort tables), blocks of executable code, the system catalog and dictionary, and the plan.

The structures in memory, described by the memory model, can be accessed quickly and contain the majority of the information needed by the DBMS to operate.  This includes parts of the larger population of user data held on disk, and structures required for the database management system itself (i.e. the plan, meta code etc.).  Because it might not be possible or desirable to have all the data that resides on disk in memory, and for security (persistence), the DBMS will swap data between memory and disk.

Process Model

There are many things happening simultaneously in the DBMS, such as reading and writing data from and to disk, recording recovery information in database logs, checking to see if data has changed in memory, holding connections for user and system programs, monitoring the health of the database, running user-written code, locking parts of the database and detecting deadlocks.  This list is by no means exhaustive, and there are differences between different DBMS implementations, but there is a core set of processes that deal with memory management, paging to disk, and writing information in the event of recovery or rollback.

It is these working in concert with the structures in memory that are called the INSTANCE.

Process Model Implementations

Databases need to be able to service large numbers of requests from concurrent users.  The operating systems that they run on handle concurrency using processes and threads.  A thread is the smallest unit of execution that can be scheduled by an operating system, and in general a process contains one or more threads.  Multiple threads in a process have access to the same resources, such as memory, while the processes that contain them do not.  This has speed advantages in that threads can communicate with each other without the overhead of an inter-process call (IPC).  The disadvantages are that because threads can trample on each other, the structures that they are working on need to be locked, and these locks need management (which incurs its own overhead).  Even with these mechanisms in place it’s still possible for threads working on the same data structures to change the data only to have another thread change it back – the so-called ‘race condition’.  These are notoriously hard to isolate and reproduce.  That being said, threading is implemented differently from one operating system to another, and some do a better job of it than others.
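
The classic demonstration of that race condition is two threads doing an unlocked read-modify-write on the same value.  A toy sketch:

# Toy demonstration of a race condition: two threads do an unlocked
# read-modify-write on the same counter, and updates can be lost.
import threading

counter = 0

def bump(times):
    global counter
    for _ in range(times):
        counter += 1   # read, add, write back: another thread can interleave here

threads = [threading.Thread(target=bump, args=(100000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Without a lock the result is often less than 200000, and it varies from run
# to run, which is exactly why race conditions are so hard to reproduce.
print(counter)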

On single processor systems threading is handled by multiplexing. This is where the processor switches between threads (called context switching), and while the processor is working on an execution thread, other threads wait. It is generally the case that the processor is running much faster than the requirements of the individual threads (which may be waiting for I/O, or user input), and the switching happens so fast that it looks like each thread is running simultaneously. On multi-processor systems, the threads will generally run at the same time, but in different processors.

A database will map its concurrent user needs onto the underlying operating system’s processes and threads.  This mapping is called the database process model.  Because of differences in how databases are used (single-user vs multi-user, stand-alone vs parallel), and because different operating systems implement threads differently, the process model varies from one DBMS to another.  The units of work that need to be performed are the same whether they are implemented as threads or processes, and somewhat confusingly the terms process model and process tend to be used for threads, user processes, database processes and operating system processes interchangeably (just like database can mean the database instance or the user data on disk), and this document is no exception.  When the word process is used it means all three (database process, operating system process, and user process) AND the work being done by a thread.  This is a convenient term to use when we want to describe a unit of work that the database needs to perform where the underlying implementation is not a consideration (abstraction, if you will).  Where it is important to differentiate between types of processes, or between processes and threads, the difference will be made specific (i.e. checkpointing is a process vs checkpointing is a threaded database process.)

Summary

The structures on disk comprise the DATABASE, the structures and processes in memory are the DATABASE INSTANCE, and the DATABASE INSTANCE and DATABASE together are called the DATABASE SERVER.


dell xps15 laptop

My brand-spanking-new Dell XPS 15 laptop is surely sweet, and windoze 8 doesn’t suck, but… even with 16 GB of RAM and a huge SSD… I personally prefer a 13-year-old version of Red Hat Linux running on a 10-year-old i386 desktop.

Seems it’s likely the same Chinese hardware that Apple was using on its now-defunct 17-inch MacBook Pro two years ago.  Apple is still four steps ahead of Dell on innovation, three steps ahead on hardware, two steps ahead on laptops… and 15 steps ahead on software AND twenty steps ahead on mobile phones and…

–BS

PS: Dell is now only one year ahead of Apple on laptop pricing… this is where it gets interesting as the biggies battle it out in the retail space….