Low Power (and cost) ARM Servers Get Real

Lately I’ve been fiddling with the new Cubietruck boards. For about $100 you get:

– Dual-core Allwinner A20 (ARM Cortex-A7) CPU
– 2 GB RAM
– Full connections for a 2.5″ SATA drive
– Gigabit Ethernet

It all runs nicely off of a little 10-watt, $10 power supply. Couple this with a $100 2.5″ Seagate 1 TB SSHD and you’ve got an efficient, powerful, scale-out little Linux server for $200 apiece. Heck, what is that these days, 100 euros? :-)

This is going to change the world… and it is why I am an investor in ARMH, STX, and AMD. Note that AMD has nothing to do with my Cubietruck project, but they recently signed a contract with ARMH to manufacture ARM chips in bulk.

–Luss

Browser Wars 2013

For the most recent month, the breakouts by Browser, Platform & OS for OpenSCG’s corporate website are below. From this data it doesn’t take a rocket scientist to conclude that:

  • Microsoft definitely lost the browser & mobile wars
  • Microsoft has NOT lost the OS war (yet)
  • The folks interested in OpenSCG’s content are dweebs
  • Our content is not compelling enough for folks to want to see it from their smartphones

Summarized by Browser

Google Chrome 40%
Firefox 31%
Internet Explorer 10%
Safari 10%
Android 7%
iOS 5%
BlackBerry 0.6%
Windows Phone 0.04%

Summarized by Platform

PC 91%
Mobile 9%

Summarized by OS

Windows 52%
Linux 18%
OSX 17%
Other 3%

Ganglia in the cloud (unicast instead of multicast)

Last time, I talked about how much I like using Ganglia for monitoring a large number of distributed servers.

One of the issues I ran into that is barely covered in the documentation is how to set this up if you cannot use multicast.  Multicast is the default method that ganglia uses to discover nodes, and it’s great because auto-discovery works… kinda.  The issue is that most cloud providers squash your ability to do multicast.  This is a good thing; can you imagine having to share a room with the guy who can’t stop screaming through the bull-horn every 2 milliseconds?  So, if I want to use ganglia in EC2, the Amazon cloud, how do I go about doing that?

To get around this issue, you need to configure ganglia in unicast mode.  This is the mysterious part: what exactly is it, where do I set it, and how do I have multiple clusters in unicast mode all report to the same web UI?  Most of the tutorials I read alluded to the fact that you *could* have multiple clusters set up in ganglia, and most speculated [some even correctly] about how to do it, but none really implemented it.  So, here is how you can disable multicast in ganglia and instead enable unicast with multiple clusters.

To get started, there are a couple of ganglia components that you really need to be familiar with.

gmetad

gmetad is the ‘server’ side of ganglia.  It is responsible for taking the data from the remote collectors and stuffing it into the backend database (ganglia uses rrdtool).  You’ll have one of these bad-boys running for each web UI you have set up.

Configuration

First of all, take a look at the full, default config file.  It’s got a lot of great comments in there and really helps to explain everything from soup to nuts.  That being said, here’s what I used (with my comments) to get up and running.

Configuration is done in /etc/gmetad.conf (by default):

# Each 'cluster' is its own data-source
# I have two clusters, so, 2 data-sources
# ... plus my local host
data_source "Local" localhost
data_source "ClusterA" localhost:8650
data_source "ClusterB" localhost:8655

# I have modified this from the default rrdtool
# storage config for my purposes: I want to
# store 3 full years of datapoints. Sure, there
# is a storage requirement, but that's what I need.
RRAs "RRA:AVERAGE:0.5:1:6307199" "RRA:AVERAGE:0.5:4:1576799" "RRA:AVERAGE:0.5:40:52704"

Essentially, the above sets up two clusters, ClusterA and ClusterB.  The data for these comes from localhost:8650 and localhost:8655 respectively (don’t worry, I’ll explain that bit below…).  The other thing for me is that I need to keep 3 full years of real datapoints.  (rrdtool is designed to ‘aggregate’ your data after some time.  If you don’t adjust it, you lose resolution to the aggregation, which can be frustrating.)
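
In case you’re wondering where a row count like 6307199 comes from: it works out to roughly three years of datapoints, assuming gmetad’s default 15-second RRD step (verify the step your gmetad build uses before copying this verbatim):

# 3 years expressed in seconds:
#   3 * 365 * 24 * 3600 = 94,608,000 seconds
# one row per 15-second step:
#   94,608,000 / 15 = 6,307,200 rows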

gmond

gmond is a data-collector.  It will, essentially, collect data from a host and send it … somewhere.  Let’s discuss where.

Before we address the multiple clusters piece, here’s how you disable multicast.  The default config file will contain three sections that you really care about:

The things we need to change are:

  • cluster -> name
  • comment out the udp_send_channel -> mcast_join parameter
  • comment out the udp_recv_channel -> mcast_join parameter
  • comment out the udp_recv_channel -> bind parameter


/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
* of a <CLUSTER> tag. If you do not specify a cluster tag, then all <HOSTS> will
* NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
name = "unspecified"
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like. Gmond
used to only support having a single channel */
udp_send_channel {
# Comment this out for unicast
#mcast_join = 239.2.11.71
port = 8649
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
# Comment this out for unicast
#mcast_join = 239.2.11.71
port = 8649
#Comment this out for unicast
#bind = 239.2.11.71
}

So, in order to convert this to unicast, you just comment out the mcast_join and bind parameters as shown above and set the port to an available port… that simple!

So, I have 3 clusters: Local, ClusterA, and ClusterB.  To get this working with unicast (unicast meaning that I talk to one specific endpoint), I need to have a separate gmond running on my server for EACH cluster.

So, on the ganglia server, I have 3 gmond config files:

(Local)

/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
 name = "Local"
 owner = "Scottie"
 latlong = "unspecified"
 url = "unspecified"
}
/* The host section describes attributes of the host, like the location */
host {
 location = "GangliaServer"
}

/* Feel free to specify as many udp_send_channels as you like. Gmond
 used to only support having a single channel */
udp_send_channel {
host = localhost
port = 8649
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
port = 8649
}

/* You can specify as many tcp_accept_channels as you like to share
 an xml description of the state of the cluster */
tcp_accept_channel {
 port = 8649
}

Remember the ‘data_source’ lines from your gmetad.conf file? If you look up, you’ll see that the data source for the ‘Local’ cluster was simply ‘localhost’, which defaults to port 8649.  Essentially, gmetad will poll this gmond on localhost:8649 for data.  Now, the remainder of your gmond.conf file is important, it dictates all of the monitoring that the gmond instance will do.  Only change the sections that I have listed above.
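
A quick way to sanity-check this wiring is to poke the tcp_accept_channel by hand; a gmond will dump its cluster state as XML to anything that connects (this assumes you have nc installed; telnet works too):

nc localhost 8649 | head -n 20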

Now for the two remaining clusters:

ClusterA:

/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
 name = "ClusterA"
 owner = "Scottie"
 latlong = "unspecified"
 url = "unspecified"
}
/* The host section describes attributes of the host, like the location */
host {
 location = "GangliaServer"
}

/* Feel free to specify as many udp_send_channels as you like. Gmond
 used to only support having a single channel */
udp_send_channel {
host = localhost
port = 8650
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
port = 8650
}

/* You can specify as many tcp_accept_channels as you like to share
 an xml description of the state of the cluster */
tcp_accept_channel {
 port = 8650
}

ClusterB:

/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
 name = "ClusterB"
 owner = "Scottie"
 latlong = "unspecified"
 url = "unspecified"
}
/* The host section describes attributes of the host, like the location */
host {
 location = "GangliaServer"
}

/* Feel free to specify as many udp_send_channels as you like. Gmond
 used to only support having a single channel */
udp_send_channel {
host = localhost
port = 8655
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
port = 8655
}

/* You can specify as many tcp_accept_channels as you like to share
 an xml description of the state of the cluster */
tcp_accept_channel {
 port = 8655
}
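
To run all three collectors side by side on the server, point each gmond instance at its own config file (the paths below are just my convention; wire them into your init scripts however you like):

gmond --conf=/etc/ganglia/gmond-local.conf
gmond --conf=/etc/ganglia/gmond-clusterA.conf
gmond --conf=/etc/ganglia/gmond-clusterB.conf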

Now that we’ve got our ‘server’ set up to receive data for each of our clusters, we need to configure the actual hosts that are part of each cluster to forward data in.  Essentially, this is the same ‘gmond’ configuration, but it forwards data to the ‘gmond’ instances we just set up on the server.

Let’s say we have three hosts:

Grumpy (our local server)

Sleepy (Cluster A)

Doc (Cluster B)

Now, let’s configure their gmonds to talk to our server (Grumpy) and start saving off our data.  First of all, Grumpy is already configured and running, so if you connect to the ganglia interface at this point (and your gmetad is running), you should see ‘Grumpy’ showing up in the ‘Local’ cluster.

On each of these hosts, you only change the host field to the name or IP address of your ganglia ‘server’ (udp_send_channel -> host). On Sleepy (Cluster A):

/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
 name = "ClusterA"
 owner = "Scottie"
 latlong = "unspecified"
 url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
 location = "GangliaServer"
}

/* Feel free to specify as many udp_send_channels as you like. Gmond
 used to only support having a single channel */
udp_send_channel {
host = grumpy
port = 8650
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
port = 8650
}

/* You can specify as many tcp_accept_channels as you like to share
 an xml description of the state of the cluster */
tcp_accept_channel {
 port = 8650
}

On Doc (Cluster B), you make the same change (udp_send_channel -> host):

/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
 name = "ClusterB"
 owner = "Scottie"
 latlong = "unspecified"
 url = "unspecified"
}
/* The host section describes attributes of the host, like the location */
host {
 location = "GangliaServer"
}

/* Feel free to specify as many udp_send_channels as you like. Gmond
 used to only support having a single channel */
udp_send_channel {
host = grumpy
port = 8655
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
port = 8655
}

/* You can specify as many tcp_accept_channels as you like to share
 an xml description of the state of the cluster */
tcp_accept_channel {
 port = 8655
}

Once you start the gmond process on each host, wait a few minutes and they will appear in the ganglia interface. Simple as that!
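
If you’d rather not refresh the web UI while you wait, you can ask the server-side gmond on Grumpy for ClusterA’s XML directly; every host reporting in should show up as a <HOST> element (again assuming nc is installed):

nc localhost 8650 | grep '<HOST '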


BigSQL Architecture


From data to information, from information to insight.

A state-of-the-art Big Data Warehouse solution that is fast, secure, and continuously available. BigSQL scales from your desktop to the cloud. Run real-time OLAP directly from the world’s most secure RDBMS.

Get started with BigSQL right now.

You can immediately put BigSQL to work on your Relational Data and Big Data. BigSQL is an integrated bundle of:

  • PostgresHA – Highly Available PostgreSQL, the world’s most advanced open source database,
  • Hadoop, the archetypal Big Data solution, and
  • Hive, an implementation of relational table abstractions.

BigSQL Architecture.

These components form the core BigSQL engine, and together they give you a Highly Available Big Data Warehouse solution.

When you add in components like Enterprise Search (Solr), stream processing (Pig), and ETL (Sqoop), you have all the components required to analyze real-time data directly from PostgreSQL, including your NoSQL data in Hadoop.

Linear Scalability.

BigSQL leverages the linear scalability of Hadoop and HDFS across low cost commodity hardware and/or the cloud. It can easily scale to petabytes of information.

Platform Ubiquity.

BigSQL installs cleanly on 64-bit Linux (production) and 64-bit OS X (development).

24 x 7.

Every part of your Big Data stack should be hardened. The Hive Relational Metastore in BigSQL is PostgresHA, a Highly Available PostgreSQL implementation that can be set up and distributed exactly the same way as any Big Data implementation. You can have Active Standby clusters in the same datacenter but in different racks, and you can stream to a remote Disaster Recovery node.

Open Source.

Every component of BigSQL is Open Source. Some components serve double duty.

ZooKeeper is used as the distributed coordinator for HDFS and as the distributed lock manager in Hive. PostgreSQL, through PostgresHA, is the relational metastore in Hive and a Relational Data Warehouse in its own right.

Each software component is free, and it runs on cheap, freely available hardware. If you cobble together enough Raspberry Pis, your entire hardware and software stack could be open source.

Security.

BigSQL is built on PostgreSQL, the world’s most secure RDBMS.

Data Equivalence.

BigSQL gives you equivalent access to your Big Data and Relational Data through psql (the PostgreSQL command-line interface) and the Hadoop Foreign Data Wrapper.
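
To make “equivalent access” concrete, here is a rough sketch of what querying Hadoop-resident data from psql could look like. The extension, server, and option names below are assumptions for illustration, not the documented BigSQL interface; only the CREATE SERVER / CREATE FOREIGN TABLE DDL shape is stock PostgreSQL:

-- names below are illustrative; check the BigSQL docs for the real ones
CREATE EXTENSION hadoop_fdw;
CREATE SERVER hadoop_srv FOREIGN DATA WRAPPER hadoop_fdw
    OPTIONS (host 'namenode', port '10000');
CREATE FOREIGN TABLE page_views (ts timestamp, url text)
    SERVER hadoop_srv OPTIONS (table 'page_views');
-- Hadoop data queried with plain SQL, right next to your relational tables
SELECT url, count(*) FROM page_views GROUP BY url;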

Help.

OpenSCG, the people who built BigSQL, are here to help, from package customization through on-site consulting to 24 x 7 database administration.


“From data to information, from information to insight.”

PostgreSQL HA Zero Downtime


PostgreSQL Clustering made easy with PostgresHA.

Authored jointly with Scott Mead, Sr. Architect at OpenSCG and architect of PostgresHA.

PostgresHA is Highly Available (HA) PostgreSQL.

High Availability

High availability is not new. Financial trading systems and the telephone network were federally mandated to provide high availability. People didn’t do HA because it was a fun project; they did it because they were told to.

What is new is that EVERYTHING these days is HA: your email and blog posts on Google, your shopping cart on Amazon, your pictures on Facebook. People have gone from doing this because it’s critical for their business to doing it because it’s good for business.

Open Source is also good for business, and because all the components of a Big Data solution are publicly available, some very smart people have gone to the trouble of making this stuff accessible to the likes of you and me. They have made it so that we can understand it and use it. You don’t have to be Ma Bell or the NYSE to enjoy the benefits of always-available computing.

PostgreSQL

The world’s most advanced Open Source database, PostgreSQL, has synchronous replication. In computer-theory terms it is called 2-safe replication; in English it means that you have two database instances running simultaneously and your Primary (Master) database is synchronized with a Secondary (Slave) database. Unless both databases crash simultaneously, you won’t lose data. If your databases are on separate machines in separate racks in separate buildings, you can reduce the possibility of simultaneous failure to near zero. Building an HA solution that survives an Extinction Level Event like a meteor strike is beyond the scope of this paper, and would not get read anyway.

With synchronous replication, each commit waits until confirmation that the write has reached the transaction log of both the Master and the Slave. This increases transaction response time, and the minimum increase is the roundtrip time between them. All two-phase commit actions require waits (prepare and commit).
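
For reference, here is a minimal sketch of the Master-side settings involved; these are stock PostgreSQL 9.1-era parameters, and the standby name is illustrative:

# postgresql.conf on the Master
wal_level = hot_standby                  # ship enough WAL for a hot standby
max_wal_senders = 3                      # allow replication connections
synchronous_commit = on                  # commits wait for the standby flush
synchronous_standby_names = 'standby1'   # must match the standby's application_name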

However, there is good news: read-only transactions and transaction rollbacks need not wait for replies from standby servers (PostgreSQL: We’re optimized for Rollback!). Only top-level commits require the wait; sub-transaction commits do not need to wait for standby servers, and a long-running data load or index build only waits on the very last commit. Good for speed, but if you lose it halfway through, you lose it all; you pays your money, you takes your choice.
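
There is also a middle ground worth knowing about: synchronous_commit can be lowered for a single transaction, so a re-runnable bulk load can skip the standby wait without touching the server config (the table and file names here are made up):

-- a bulk load that opts out of waiting on the standby;
-- safe only for work you could redo after a crash
BEGIN;
SET LOCAL synchronous_commit TO OFF;
COPY staging_events FROM '/tmp/events.csv' CSV;
COMMIT;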

PostgreSQL uses the same concept of active standby as ZooKeeper: keep a standby hot for failover (Master/Slave, Primary/Secondary, Active/Standby; you say potato, I say potato).

OpenSCG makes PostgreSQL HA with the Open Source project PostgresHA.

At a high level, PostgresHA monitors both Master and Slave databases. If it spots something that could be a failure it enters a state of heightened awareness and starts yelling about it. Heightened awareness means continuous pinging, more pinging and deeper pinging.

At some point it’s serious enough to warrant failover.

This is where you need to consider, in advance, what you want.

Some things (like a head crash) you want to know about right away and react to as soon as possible; for those, auto failover sounds like a good idea.

It’s possible that there is a transient problem that no amount of intelligent probing is going to find, and it will go away without causing too much trouble. In this case you probably don’t want to fail over automatically; you might want to watch it for a little bit, and when you can bear it no more, you punch a big red failover button.

Sometimes you might want to crank up the sensitivity of the probes anyway just to see what nominal behavior looks like.

PostgresHA essentially does this for you. The great part is that OpenSCG has a free distribution of PostgresHA here!

Here’s how:

PostgresHA Architecture

Your Applications (clients) attach to a connection pooler (pgBouncer). The connection pooler simply holds all the client connections and switches them between a smaller number of database connections. Everyone does it; it saves expensive database connection resources. Another nice property is that it can switch these connections to a different database if needed.
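
As a rough sketch, the pooler side is just a small ini file; the host name and pool sizes below are illustrative, not recommendations:

; pgbouncer.ini (illustrative)
[databases]
appdb = host=master-db port=5432 dbname=appdb

[pgbouncer]
; clients connect here instead of the database port
listen_addr = *
listen_port = 6432
; hand a real server connection to a client per transaction
pool_mode = transaction
; many cheap client connections multiplexed onto a few real ones
max_client_conn = 1000
default_pool_size = 20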

PostgresHA Steady State.

With Streaming Replication in PostgreSQL, we can have a Master and Slave running synchronously, and through pgBouncer we can route read-only connections to the Slave and write transactions to the Master (nice). To protect against a bad hardware failure, you can put the Master and Slave databases in different racks. For Disaster Recovery, you can use log streaming to replicate to a separate data center.
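
For the curious, the Slave side of that picture is, in the PostgreSQL 9.x vintage this was written for, a recovery.conf in the data directory plus one setting in postgresql.conf (host and user names here are illustrative):

# recovery.conf on the Slave
standby_mode = 'on'
primary_conninfo = 'host=master-db port=5432 user=replicator application_name=standby1'

# and in the Slave's postgresql.conf:
hot_standby = on    # allow read-only queries while in recovery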

PostgresHA handles failure scenarios the following way:

Master cannot communicate with Slave:

[diagram: PostgresHA, Master/Slave disconnect]

Slave becomes isolated:

[diagram: PostgresHA, Slave isolated]

Master isolated:

[diagram: Master isolated]

Bash Script for setting JAVA_HOME on Linux & OSX

#!/bin/bash

# Parse the JDK version string, e.g. java version "1.6.0_45" -> 1.6.0_45
jvmVer=`java -version 2>&1 | grep "java version" | awk '{print $3}' | tr -d \"`

# Note: because of the pipeline above, $? would reflect 'tr', not 'java',
# so test the parsed string instead of the exit status.
if [ -z "$jvmVer" ]; then
  echo " "
  echo "ERROR: JDK does not appear to be installed."
  exit 3
fi

# Only set JAVA_HOME if the caller hasn't already done so.
if [ "x$JAVA_HOME" == "x" ]; then
  osName=`uname`
  if [ "$osName" == "Darwin" ]; then
    JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home
  else
    JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64
  fi
fi
export JAVA_HOME