Load balancing & High Availability
Heimdall provides a multi-tiered approach to supporting High Availability that supports a wider variety of databases vendors and database topologies than any other product on the market. This includes failover logic at the JDBC driver itself, and coordinated failover via the central manager. With it, the following aspects of database behavior can be accounted for and made highly reliable, while accounting for the limits that may be inherent in any database synchronization technology:
True active+active load balancing One or more write-only masters One or more read-only slaves Hot-standby servers Cold-standby servers Control of pre-emption on restoration of services Scripted orchestration on failures with multi-phase configuration commits Automatic disaster recovery activation *Custom query routing
How to account for each of these will be described, and a variety of configuration scenario examples explained in detail.
While simple at a high level, high availability of database servers is an extremely complex topic, and the interaction that Heimdall has with the databases is as such also complex topic. Below are a few questions that need to be asked both in setting up the database cluster, and the answers are important in evaluating what should be done with the Heimdall configuration:
- Should the infrastructure have a shared-nothing approach to the data processing of queries? This implies that all disks and disk infrastructure should be redundant. Solutions such as Oracle RAC leverage a shared-disk infrastructure in order to simplify the task of managing data, when load balancing of the database nodes is performed. While an excellent solution for many environments, this infrastructure is very expensive, and still leaves the risk of a storage corruption resulting in a complete outage.
- If not sharing the disk infrastructure, will synchronous or asynchronous replication be performed between nodes?
- Synchronous replication indicates that between two or more nodes, all nodes must agree that a commit can be performed, and that it completed before a response is provided to the client. This results in a significant overhead in writes, but guarantees that database nodes are in sync, and allows a more flexible active-active approach to load balancing of write traffic. For the purposes of configuration, a shared disk infrastructure will be considered the same as synchronous replication.
- Asynchronous replication is the method of executing a complete transaction on a single server, then replicating the changes to a second server. In most cases, this works well, but can have unfortunate side-effects.
- Will a single write-only node be acceptable, with reads spread across nodes, or will all activity take place on a single node, with a second node acting as a pure standby node?
- Will the replication be unidirectional or bidirectional, i.e. will all changes be pushed from one node to one or more, or will all changes on all nodes be pushed to all other nodes.
An excellent article on the issues with Asynchronous replication and load balancing can be found at: http://datacharmer.blogspot.com/2013/03/multi-master-data-conflicts-part-1.html, and covers much of what is necessary to help answer these questions and an overview of the issues related with them.
Once the question of if the server is doing synchronous replication or asynchronous replication and unidirectional vs. bidirectional replication is answered, then the Heimdall configuration can be designed.
First, for load balancing to operate, the “enable load balancing” option must be selected. If not, then the only server Heimdall will connect to will be the primary JDBC URL. If load balancing is enabled however, then the jdbc URL specified in the top section of the configuration will be ignored, although used as the initially configured server in the load balancing configuration.
The next option, that of track cluster changes, enables the cluster detection and tracking logic to become active with Heimdall. The purpose of this is to allow Heimdall to reconfigure itself based on the cluster’s configuration as it may change. This works for the following cluster types:
- MySQL (stock) Master/slave replication
- Galera Cluster Master/Master replication
- Postgres Master/Read-slave replication (not PG 10 logical replication)
- SQL Server Always-on availability groups
- SQL Server Mirroring
In an RDS environment, to support global RDS clusters, the AWS RDS ARN can be filled in, and will then be used to discover the entire global cluster.
The use response metrics option is used in situations where the readers may be distributed, either in multiple availability zones, or even globally distributed. This option will result in each proxy using the response time metrics from health checks to determine the closest or best performing reader, then filtering out any readers that are not within 25% of the best reader's response time. In most cases, this is sufficient enough so that only the readers in the same availability zone will be selected to be read from, although some combinations of AZ's may be paired together for reading due to close proximity.
The Track Replication Lag option is described below, as is the Lag window Buffer.
When enabled, Heimdall will on a failure event and every 30 seconds check the state of the database cluster, and will reconfigure itself as necessary based on the cluster configuration, allowing for automatic role selection. Please see below for the rules in determining what server takes what role (when it is not deterministic) via the target read and write capacity.
The node configurations contain the following:
- NAME: A simple name used internally for tracking the individual server, and may be used in various logging operations. When cluster tracking is used, the name will indicate the role and ID of the server, and will be auto-generated.
- URL: The URL is the JDBC URL that will be used to access this data source. Note that when load balancing is configured, this JDBC URL overrides the normal JDBC URL. In the case a database server has multiple ports, each can be used as an independent database in this configuration.
- Enabled: The Enabled flag specifies if the server should be used or not. If a configuration change is made from enabled to not enabled, then all active connections will be migrated away from the server as soon as transactions on each connection are completed. Enabled (even if not used) also means that the server will be monitored with internal monitoring, so will allow state change scripts to operate properly. Internal to the driver is another value similar to Enabled, called “Active”. This may be observed in the logs during failures, with the state being changed automatically by the driver. This value represents if the server is considered healthy or not, i.e. it is passing health checks. This value is managed by the driver itself and can’t be set.
- Writeable: Specifies if the server should be considered usable for queries that are flagged as writes. If a server is not writable, then if and only if a query matches a “reader eligible” rule or specially flagged forwarded request will the server be used, i.e. read/write split.
- Weight: Represents what share of connections each server should receive, as a random share based on the weight. If one server was configured with a weight of 1, and another as a weight of 2, then the second will get 2x as many connections as server 1. This is determined as a random weighted assignment–it does not account for the number of connections that a server already has, or the traffic load, and with a small number of connections may not be accurately represented. In the case where some servers are read-only, the share allocation will only be computed including the read-only servers if the setReadOnly() method is used on a connection. The act of using this method will force a reconnect based on the new weights, even if it was previously connected to another server.
A weight of 0 is considered a special case–it is used to designate a “server of last resort” for a data source, and allows a driver to perform an independent failover without coordinating with the central server, in configurations that allows this behavior. In the case that several servers have a weight of 0, then all servers with a weight of zero will have traffic balanced to them during the failure in a random manner.
When a server of last resort is used, it is ONLY used until at least one server with a weight greater than 0 comes back online. At that point, as connections complete transactions (or immediately if they are ide), the JDBC connections will again revert back to using the weighted server or servers, and the weights will again be used to compute the share they allocate.
Connection Hold Time: If no server of last resort is available, and all valid servers for a connection are in a non-active state, then the overall load balancing property of Connection Hold Time will take effect. This value is specified in milliseconds, and represents how long we should wait until one of two events occurs:
- A server becomes active again;
- The data source configuration is changed via the central manager.
When a configuration change takes place or a server state change, all waiting threads will be woken, and will then check to see if a new connection can be made. If not, they will again go to sleep until either of these events occur again until the hold-time has been passed, at which time an exception will be passed to the calling application.
Cluster Role Decision Making
The target read and write capacity is used to help select what server should be used for what role, in particular in the condition of a Galera cluster, as any node in theory can be a write node, but usually, it is desirable to only have one node act as the write node, one or more nodes as read nodes, and other nodes may be hot-standby or used for maintenance or reporting activities. The logic works as follows:
For each node, a write and read capacity will be associated. The online nodes will be ordered for write consideration based on the highest write weight, then the next, etc. The nodes will then be selected for a write role until the target write capacity is achieved. These nodes will also be used as read nodes unconditionally.
For the read roles, IF the write nodes don’t satisfy the desired read weight, then again, the read weights will be ordered from highest to lowest, and additional read-only nodes selected for reading until the target read capacity (including the write nodes) has been achieved.
If any nodes are not selected as write or read nodes, then they will be flagged as disabled, and will not be used by Heimdall until after a failure happens, at which point they will be evaluated for their role again.
The goal of this is to allow a customer to decide which servers in a cluster fill what role, based on their desires.
Database Tables for Cluster Roles
In order to provide greater flexibility in managing database clusters, for any system that supports more than two nodes (MySQL Replication, Galera, Postgres, & SQL Server), when cluster tracking is enabled, a small table will be configured on the database in the Heimdall database/schema called "servers_info" defined as:
CREATE TABLE heimdall.servers_info ( private_id VARCHAR(255), public_id VARCHAR(255), read_weight NUMERIC(2), write_weight NUMERIC(2), enabled NUMERIC(1), PRIMARY KEY(private_id) );
This will be created on the initial setup of the cluster by Heimdall, and provides an alternative control point to manage the values as they match in the GUI. In addition, this allows you to specify a private Id vs. Public ID, in order to provide mapping due to NAT. As Heimdall polls the cluster for the IP addresses of the nodes in the cluster, Heimdall will often receive a private IP that can't be directly connected to. By setting this private ID in the "private_id" field, and mapping it to the appropriate public IP or name, this allows the cluster logic to account for the actual topology. The read and write weights can also be adjusted, and if a node is to be brought offline, setting the enabled field to "0" will disable the node.
These settings, if adjusted, should take effect in 30 seconds, and will be reflected on the GUI once it is refreshed. Likewise, when used, a change on the GUI will be immediately reflected into the database. This provides a control point for a DB doing maintenance to take offline nodes without even logging into the Heimdall console.
Note: This table is not currently used with AWS Aurora due to the way cluster detection is done.
State Change Scripting
In order to account for the variety of different scenarios that may be necessary, Heimdall provides a generic state change scripting mechanism to allow simple scripts to orchestrate with third party tools as part of a failover. The script should be named statechange-
The script is provided the state of the data source via the command line, i.e.
statechange-dbdemo-mysql.bat Primary:true,false,true,jdbc,1:mysql://mysql.heimdalldata.com:3307/tpch Secondary:false,true,false,0,jdbc:mysql://mysql.heimdalldata.com:3308/tpch
The values for each server entry are:
- Server name
- Enabled (true or false)
- Healthy (true if healthy, or not monitored)
- Writeable state (true or false)
- Weight (0 or positive integer)
- jdbc URL
The script needs to take these values into account and then print the desired changes to the data source configuration on the standard output of the script. The valid commands that can be issued are:
The debug option results in all commands being printed as received. Enabled, writeable and weight all adjust the respective options for the specified server. The commit option enacts the new configuration. The changes are made to the stored configuration as the output is returned, and made effective on a commit. Multiple commits may be made in a single script execution, and is generally recommended. First, the failed server should be disabled and a commit made. Any orchestration that will then take time should then be performed, and finally, a new server should be made active.
It is possible to leverage both the scripted failover logic, as well as the cluster change logic. In this case, the scripted failover logic will be executed first when a failover is detected, which can promote a cluster node, then the cluster change logic will execute, which can then detect the changed servers based on the cluster state. To receive notifications of health change events, please insure the notification section in the Heimdall Management Console is configured and use the test button to verify it is working properly.
One of the problems when routing queries to a database cluster that uses replication between nodes is that most such configurations results in a delay between writing to a table and being able to read the results on the read-only server. As it is a common behavior to write to a table and then immediately use the new information as part of a new query, being able to track how long it takes for this replication to occur can be an important factor in implementing read/write split. With the replication lag detection in place, Heimdall will configure tables in the Heimdall database to write to on the db cluster, and then will use those tables to do performance testing to see how long it actually takes to perform replication. In addition, a static "safety" window can be configured in the "Lag Window Buffer" setting, which will be the baseline for replication lag, to which this is added. If dynamic replication lag detection is not desired, set the Lag window buffer to a sufficiently high value that should always be safe.
With the combination of either or both the "Track Replication Lag" and "Lag Window Buffer", Heimdall will then use the last write time for a table in order to determine if a particular query is safe to read from the read-only server or only from the write master. For example, if the combination of the two results in a value of 10s, then a read of a table that hasn't been read for 11s will be eligible to be sent to the read-only server, but if the table was written 5s ago, then it will be directed to the write server. This logic closely mimics (and leverages) the logic in the cache engine that determines if a particular cached object has been expired, based on the last write time to the table.