Optmial InnoDB configuration

How are transactions organized physically

When InnoDB creates a new transaction it is not yet committed. The database has not yet made any promises to the application and so we do not really have to make anything persistent so far.

To be fast InnoDB tries to assemble the transaction in a memory buffer, the innodb_log_buffer. It should be sufficiently large that you actually can assemble such a transaction in memory without needing to write it out in part into the redo log. A size of 1M to 8M is normal.

Once a transaction is to be committed InnoDB has to read the page from disk which contains the image of the row that is being changed. It then has to actually make that change in memory. The changed page is cached in a memory pool called the innodb_buffer_pool. This pool also caches unchanged pages that have been accessed by read operations. All of these pages on disk and in memory are 16K in size and the innodb_buffer_pool_size determines how much RAM we will use as a cache for such pages - usually as much as we can spare.


The modified page in memory is generally NOT written back to disk immediately. Instead the change is recorded at the end of the current redo log on disk - this is your ib_logfile0 or ib_logfile1. The page is also marked as "dirty" (= to be written back to disk) in memory.

A dirty page is written out to the tablespace in three cases:
  1. The redo log, which is organized as a ring buffer, is full. To free up some space we will write out dirty pages in redo log order so that we can advance the trailing pointer of the redo log ring buffer an make some room.

    This situation is called an Innodb_log_wait and will be registered in the status counter of the same name.
  2. InnoDB requires a free page from the InnoDB buffer pool but cannot find one. Usually we can free a page in the buffer pool by giving up a page that is not marked dirty. When a page is not marked dirty its contents can be reloaded from disk at any time and so we can safely give it up in memory. But when the buffer pool holds only dirty pages this is impossible and we actually have to flush dirty pages to disk before we can free them up for other uses.

    This situation is called Innodb_buffer_pool_wait_free and will be registered in a status counter of the same name. InnoDB tries to avoid this situation: Whenever more than innodb_max_dirty_pages_pct percent many pages are marked dirty a checkpoint is forced and dirty pges will be written.
  3. InnoDB feels idle and will write out batches of 64 pages each to disk once a second.

    This is normal and will not be specifically registered (but will of course bump Innodb_pages_written like everything else).


Relevant config entries in my.cnf:
CODE:
# Global buffer for transaction
# assembly before commit
innodb_log_buffer_size = 8M

# Percentage of buffer pool pages that
# may be dirty before forcing a
# checkpoint
#
# Default = 90, ok
innodb_max_dirty_pages_pct = 90


Relevant counters from SHOW GLOBAL STATUS:
CODE:
# Counter for "redo log full"
Innodb_log_waits

# Counter for "buffer pool all dirty"
Innodb_buffer_pool_wait_free


Choosing a useful redo log size



The redo log logs transactions and entries inside the redo log are proportional to the transactions size because we are logging rows not pages. The purpose of the log is to be able to delay writeback of those 16K pages to the tablespace. Often a page contains more than one row and many transactions close together in time will change the same page or even the same row. The write to the redo log will make all those changes sychronously persistent, but can do so using linear writes to disk. Writing back changes to the table space will involve disk seeks, and disk seeks are slow. Having a redo log minimizes or eliminates such random writes.

Usually the redo log should be large enough and never fill up. Consequently your Innodb_log_waits counter should be 0 or at least not move when you look at it twice. If you experience Innodb_log_wait events one of two situations exists: Your server has write bursts larger than your redo log - the redo log is too small and must be extended. Or your server has persistent high write load and the redo log will overflow no matter how large you make it. In this case, but more disks or choose other ways to distribute the write load to more spindles.

By default the redo log consists of two files (innodb_log_files_in_group), each of which is 5M in size (innodb_log_file_size), for a total of 10M. This is usually much to small. Ideally you should have two files which are 64M to 256M in size, resulting in a total redo log of 128M to 512M. In any case the redo log cannot be larger than 4096M = 4G, even if you are on a 64 bit box.

Before MySQL 5.0 it was important not to oversize your redo log: Recovery time after a server crash has been dependent on the size of the used part of the redo log because InnoDB first did the redo log recovery and then entered operational mode. With 5.0 this changed: Log recovery is done in background while the server is already online, so the redo log size is no longer controlling the servers recovery time.

If you change the innodb_log_file_size when ib_logfile0 and ib_logfile1 already exist your InnoDB will refuse to start and leave a message in your servers error log. The message is a complaint about real log file size and configured log file size disagreeing - and would you please kindly fix that?

So in order to change the size of your redo log proper procedure must be followed: Shut down your server cleanly. Double check the log and that there is no server process any more. Then move away the two existing ib_logfile? to some other location and change the my.cnf to reflect the desired new innodb_log_file_size. Restart the server. In your error log you will find messages about new ib_logfiles being created (and a number of really scary complaints all of which you must ignore). Once the server is online again and you have checked that your data is still present you may delete the old logfiles.

Relevant entries in my.cnf:
CODE:
# Number of ib_logfile?
innodb_log_files_in_group = 2

# Size of one ib_logfile?
innodb_log_file_size = 256M


How InnoDB stores data in files

As shown in an earlier article of this series, InnoDB has two modes of operation: If innodb_file_per_table = 0 all data is being stored in one or multiple ibdata central tablespace files. If innodb_file_per_table = 1 data is being stored in .ibd files. The central tablespace file still must exist, but stored only administrative data and the undo log.

The central tablespace files are being created in innodb_data_home_dir, unless you specify them with full pathnames. If innodb_data_home_dir is empty, datadir is being used - this is good, so leave it empty. The individual tablespace files are then specified with some magic syntax inside innodb_data_file_path - look it up in the manual for details. The default string is "ibdata1:10M:autoextend", which will create a 10M file named ibdata1 in datadir.

Because of the autoextend parameter this file will be grown on demand. Growth will be in steps of innodb_autoextend_increment megabytes. The default here is 8, which is far to small if you are using innodb_file_per_table = 0.

You should be operating with innodb_file_per_table = 1 to get one file per table. This enables you to reclaim disk space at the operating system level with OPTIMIZE TABLE. At innodb_file_per_table = 1 the defaults of 10M initial file and 8M steps are ok, so do not change them.

If you need or want to run with innodb_file_per_table = 0 you will have a very large ibdata1 file. Make sure that your operating system and backup tools can conveniently deal with such large files. If that is not the case you might need a more complicated innodb_data_file_path statement specifying a large number if ibdata files - each of them for example 2G in size or whatever your limit is.

One of your ibdata files should be autoextending and the step size should be reasonably large in order to help the operating system to allocate disk space in a nonfragmented way. A good step size is 1% to 5% of your disk space so that your tablespace file will eventually fill the entire disk in no more than 20 to 100 steps. This is nicely finegrained to manage disk space but coarsely grained enough for efficient allocation. On a filesystem of 200G you would then choose an innodb_autoextend_increment of 2048 (2048M = 1% of 200G) or even 10240 (10G = 10240M = 5% of 200G).

When using innodb_file_per_table = 1 please consider the increased need for filehandles. Set innodb_open_files to provide at least one filehandle per table. You might need to up open_files_limit as well - this is also used to cache .frm files and MyISAM files, so it must be even larger.

Relevant entries in my.cnf (File per Table):
CODE:
# Using one file per table
innodb_file_per_table = 1

# Where to put the ibdata (default: datadir)
# innodb_data_home_dir

# How to create the ibdata
innodb_data_file_path = "ibdata1:10M:autoextend"

# ibdata growth step
innodb_autoextend_increment = 8

# More file handles
#  One per table
innodb_open_files = 2048

# On linux, we might up this properly
open_files_limit  = 32768


Relevant entries in my.cnf (Single tablespace file):
CODE:
# Using single tablespace
innodb_file_per_table = 0

# Where to put the ibdata (default: datadir)
# innodb_data_home_dir

# How to create the ibdata
#   We want a single 2G file
innodb_data_file_path = "ibdata1:2048M:autoextend"

# ibdata growth step
#   Growing in 2G steps (1% of 200G)
innodb_autoextend_increment = 2048


How InnoDB paints data to disk

We have already seen that writing commands will cause only disk reads in InnoDB. Any INSERT or UPDATE will create a log buffer and will mark pages as dirty inside the buffer pool for data and undo log pages.

Once we commit the change is written to the redo log - at least if innodb_flush_log_at_trx_commit = 1. The commit will now perform a write system call to push data from the server into the operating system file system buffer cache. It will then issue a flush system call to force the operating system to flush the file system buffer cache to disk. Only then the data is persistent and the commit is done.

A linear write to the redo log is much faster than a seek and a write to the tablespace file, but even then this is a relatively slow operation which can incur wait times of several millisecons - unless you happen to have a disk controller with a battery buffers RAM (BBU RAM). If you do not care about your data you have the option to decouble the commit from the disk and avoid those waits at the expense of data persistence.

At innodb_flush_log_at_trx_commit = 2 a commit will trigger the write system call to push data out of the server into the file system buffer cache, but the flush to force the actual disk write will only happen once a second or so. This is much faster because we do no longer have to wait for the disk. If your mysql server process crashes, no data is lost - it is all inside the file system buffer cache and will eventually be written to disk. But if your mysql server hardware crashes, though, up to one second of redo log may be missing. There may be data where the application (or a user) has been signalled a successful write which now is lost.

Depending on the business case that is being implemented it may be that this is a relevant error or not. From the computer scientists point of view "innodb_flush_log_at_trx_commit = 2" is a violation of the ACID principle of Codd and therefore wrong. From the business perspective the behaviour can still be right. That would be the case when the missing data can be otherwise reproduced or then the correction of the error through customer service is cheaper than the additional cost in hardware that would be necessary to deliver the required performance at innodb_flush_log_at_trx_commit = 1.

With innodb_flush_log_at_trx_commit = 0 write strategy is even more relaxed - "commit" is now a purely logical operation generating now write or flush system calls. Instead the redo log is written and flushed only once a second. This is slightly faster than mode "2" but not a lot.

No matter what value you choose for this variable, the database server will recover correctly after a crash. In any case the database will return into a consistent transactional state. What is different is the point in time (the latest visible transaction number to which the system will recover) that will be reached when recovery finishes.

Another way to influence the behavior of InnoDB is the choise of a innodb_flush_method. In Unix the valid values are "fdatasync" (the default), "O_DSYNC", "O_DIRECT", "littlesync", "nosync", in Windows "normal" and "unbuffered" (the default) as well as "async_unbuffered" (the default in Windows XP and Windows 2000) are offered.

The major idea with O_DSYNC and O_DIRECT is to open the redo log file in a way that the file system buffer cache is disabled. The database then does all the buffering and each write system call will hit the disk immediately with no need for a flush system call ever. In Linux this can be achieves by using innodb_flush_method = O_DIRECT. InnoDB will then generate only write system calls and will never call flush. This is no longer necessary because writes will always write to disk unbuffered.

Relevant configuration for my.cnf (for Linux):
CODE:
# Fast insecure writing for many applications
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT

# Secure writing for ACID compliance
innodb_flush_log_at_trx_commit = 1
innodb_flush_method = O_DIRECT


Concurrency Tickets

InnoDB works better when we limit the number of threads concurrently operating inside the InnoDB storage engine proper. There can be innodb_thread_concurrency many threads inside the engine at once. Many formulas exist to calculate this number ("number of things you want to keep busy = cpus + disks", "cores times two"), but it all comes down to the fact that current versions of InnoDB expose worse performance when this value is set to larger values than 16 or 32 depending on the workload.

If you have more concurrent transactions than innodb_thread_concurrency allows surplus threads will have to wait. Often a single thread will call into the storage engine through the handler interface many times in order to execute a primitive operation such as key lookup only to return into the MySQL general SQL layer. To answer a single SQL query there may be many transitions between the general SQL layer and the InnoDB engine.

To prevent our thread from waiting each time it want to do something inside the engine it will get innodb_concurrency_tickets many "Tickets" when it is granted initial entry to the engine. The tread can now enter and leave the engine that often without needing to ask for additional permission each time. Useful values to experiment if you have a sufficiently big and busy box are "number of records in one block", "... in one 64 block segment" or "number of records we expect this query to read".

A similar variable is innodb_commit_concurrency, which limits the number of threads committing in parallel. This parameter limits the resource usage inside the log buffer and regulates contention on the redo log.

For historic reaons there is a variable named thread_concurrency. The value here ends up in a call to pthread_setconcurrency(), which does nothing in current versions of Linux and Solaris. In Solaris8 it affected the internal mapping of User threads to kernel threads. Ignore it on current machines, it is a no-op.

Relevant entries in my.cnf:
CODE:
innodb_commit_concurrency = 0
innodb_thread_concurrency = 16
innodb_concurrency_tickets = 500


Meta data structures

Judging by what I find in customer configurations the variable innodb_additional_mem_pool_size is the one most often set to strange values. What that value does is to control the size of a buffer for meta data structures, a cache for the Innodb internal data dictionary. The default value is 1M and normal system never will need more than 8M. I know of one customer with 40.000 InnoDB tables who required 20M for this.

Relevant entries in my.cnf:
CODE:
innodb_additional_mem_pool_size = 4M

InnoDB performance Methods

Tunning MySQL InnoDB storage engine is not an easy task. It’s more like a game where you set mini goals and then try to find the right value and achieve them. Playing with the InnoDB variables is an art where you have to check your performance results each time you change a configuration parameter. The start point of this game is your MySQL my.cnf file. The optimal configuration depends on your hardware and on the type of workload you are running. I played a few days with the configs and it was really helpful to see a performance gain at the end of the game. So I am sharing here some helpful commands and links that can save me some time in the next level of the game.
MySQL Innodb Only Memory-related variables:
innodb_buffer_pool_size – Set the amount of memory allocated to both Innodb data and index buffer cache. If the server requests data available in the cache, the data can be processed right away. Otherwise, the operating system will request that the data be loaded from the disk into the buffer. It is important to set this value as high as possible to use the more efficient innodb data and index buffer cache instead of operating system buffer. For the sysbench I/O bound workload on a T2000 server with 8G RAM, increasing innodb_buffer_pool_size from 4G to 5G can improve performance by around 11%.
Configure an Optimum Number of User Threads:
MySQL is a single-process, multithreaded application.  There is one master thread with highest priority to control the server. For every client request, it creates a dedicated user thread running at normal priority in the thread pools to process the user request and send back the result to each client once the result is ready. And there is one single user thread that waits for input from the console, and a group of utility threads running at lower priority to handle some background tasks. Currently, MySQL cannot scale well with the number of concurrent user connections. On a T2000 server, in the OLTP I/O bound read-write sysbench test, MySQL can scale from 2 up to 64 concurrent user threads to reach the peak performance point. After that, increasing the number of user connections will increase the user level lock contention observed from prstat –mL output(LCK) to reduce MySQL performance. For applications where the number of user connections is tunable, you need to test to get the optimum number of user connections for peak performance. For applications where the number of user connections is not tunable, the innodb_thread_concurrency parameter can be configured to set the number of threads working inside the InnoDB engine. You need to increase this value when you see many queries in the queue in show innodb status. Setting this value at 0 will disable it. On the T2000 server, we set it to be around 2*(Num of disks) in the sysbench OLTP I/O bound workload test. Testing and tuning the optimal value for the innodb_thread_concurrency parameter according to the kind of workload, and behavior of your system at runtime, can affect performance significantly.
Examples of  MySQL options:
Here is the  example of  /etc/my.cnf on T2000(32x1200MHz, 8GB RAM, Solaris 10 11/06) in sysbench OLTP I/O bound test(100M-row): Listing of another example my.cnf: grep -v “#” /etc/my.cnf|sed -e ‘/^$/d’
[mysqld] port = 3306
socket = /tmp/mysql.sock
basedir = /usr/local/mysql
datadir = /usr/local/mysql/data
log-error = /data/error.txt
user=root
skip-locking
max_connections = 3000
table_cache = 1024
max_allowed_packet = 1M
sort_buffer_size = 64K
thread_cache = 8
thread_concurrency = 32
query_cache_size = 0M
query_cache_type = 0
default-storage-engine = innodb
transaction_isolation = REPEATABLE-READ
tmp_table_size = 1M
innodb_data_file_path = ibdata1:100M:autoextend
innodb_buffer_pool_size = 5500M
innodb_additional_mem_pool_size = 20M
innodb_log_file_size =1900M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit =1
innodb_lock_wait_timeout = 300
innodb_max_dirty_pages_pct = 90
innodb_thread_concurrency =32
[client]
port            = 3306
socket          = /tmp/mysql.sock
[mysqld]
port            = 3306
socket          = /tmp/mysql.sock
back_log = 50
max_connections = 200
max_connect_errors = 10
table_cache = 2048
max_allowed_packet = 16M
binlog_cache_size = 1M
max_heap_table_size = 64M
sort_buffer_size = 8M
join_buffer_size = 8M
thread_cache_size = 8
thread_concurrency = 8
query_cache_size = 64M
query_cache_limit = 2M
ft_min_word_len = 4
default_table_type =myISAM
thread_stack = 192K
transaction_isolation = REPEATABLE-READ
tmp_table_size = 64M
log-bin=mysql-bin
log_slow_queries
long_query_time = 2
log_long_format
server-id = 1
key_buffer_size = 32M
read_buffer_size = 2M
read_rnd_buffer_size = 16M
bulk_insert_buffer_size = 64M
myisam_sort_buffer_size = 128M
myisam_max_sort_file_size = 10G
myisam_max_extra_sort_file_size = 10G
myisam_repair_threads = 1
myisam_recover
innodb_additional_mem_pool_size = 16M
innodb_buffer_pool_size = 1G
innodb_data_file_path = ibdata1:10M:autoextend
innodb_data_home_dir =/mnt/mysql/data/
innodb_file_io_threads = 4
innodb_thread_concurrency = 16
innodb_flush_log_at_trx_commit = 1
innodb_log_buffer_size = 8M
innodb_log_file_size = 256M
innodb_log_files_in_group = 3
innodb_max_dirty_pages_pct = 90
innodb_lock_wait_timeout = 120
[mysqldump]
quick
max_allowed_packet = 16M
[mysql]
no-auto-rehash
[isamchk]
key_buffer = 512M
sort_buffer_size = 512M
read_buffer = 8M
write_buffer = 8M
[myisamchk]
key_buffer = 512M
sort_buffer_size = 512M
read_buffer = 8M
write_buffer = 8M
[mysqlhotcopy]
interactive-timeout
[mysqld_safe]
open-files-limit = 8192

If InnoDB is disabled:
mysql> show engine innodb status;
ERROR 1235 (42000): Cannot call SHOW INNODB STATUS because skip-innodb is defined
mysql>

The solution is to remove the log files so that MySql can generate new one on start:
/etc/init.d/mysql stop
$mv /var/lib/mysql/ib_logfile0 /var/lib/mysql/ib_logfile0.bak
$mv /var/lib/mysql/ib_logfile1 /var/lib/mysql/ib_logfile1.bak
/etc/init.d/mysql start
Useful commands:
Connect to MySql :        $mysql –u root  -p
Show table properties: $mysql> SHOW TABLE STATUS LIKE ‘mytable’ \G
Print my.cnf in nice format :
perl -ne ‘m/^([^#][^\s=]+)\s*(=.*|)/ && printf(“%-35s%s\n“, $1, $2)’ /etc/my.cnf
Show MySQL  statistics:
mysql> show global status;  #mysql global resources and options
mysql> show variables;    #shows all mysql option variables and their values
mysql> show innodb status;   #status information specific to innodb

Some interesting thoughts on database High Availability

Unpleasant things occasionally happen to production databases. Much of this is due to chance. Occasionally, you'll be unlucky, but your good luck increases with the effort you put into ensuring the robustness of your database. The extent of your occasional misfortune will decrease in direct proportion to the increasing resilience of your database architecture and your level of preparedness.
Your measure of System Availability is determined by your speed of fixing any problems that happen. In this respect, High Availability is not so much a technology as a cultural mindset within the organisation. In order to achieve high availability, you must:
  • Be clear on the likely points of failure, the patterns and volume of use, the business requirements, and the strengths and weaknesses of the system architecture,
  • Be methodical in reducing risks and both scripting and rehearsing disaster recovery.
  • Build in resilience and 'pain reporting' into both the software and hardware.
  • Be able to fix problems rapidly. Although the fates alone will decide when failures happen, you have more control over the time you take to restore the system's functionality. To maintain high-availability, you have to reduce this time as much as possible
It is important to learn from the misfortunes of others in order to refine your database's availability, as there is neither the time nor opportunity to learn only from your own misfortunes.

Measuring System Availability

Calculating the current availability of your database system

Availability is measured from the desktop. Databases can be 'available' from the DBA's perspective when it is actually not available to the end user. An eCommerce site is only available when the customer can trade!
Database availability is a simple calculation
ALTER FUNCTION ufrDatabaseAvailability
    
(
      
@MeanTimeBetweenFailures real,  --in hours
      
@MeanTimeToRealise real,         --in hours
      
@MeanTimeToRepair real           --in hours
    
)RETURNS REALAS BEGIN
    RETURN 
@MeanTimeBetweenFailures - ( @MeanTimeToRealise
                                
@MeanTimeToRepair ) ) * 100.000
        
@MeanTimeBetweenFailures 1.0000
   
END
--so if you get a failure every two months, and it takes you,
--on average 3 hours to realise the
--fact and eight hours to fix it, your availability is...
SELECT  dbo.ufrDatabaseAvailability(30 2448)--99.16666
Even though your record is pretty bad, you can claim 99% availability. However, your SLA (Service-Level Agreement) could specify the four nines (99.99%), meaning that you'd have two minutes to discover the fault and two minutes to fix it.
The two nines (99% availability) is relatively easy to achieve by installing proper monitoring and alerting, performing frequent and proper backups, and having precise and clear instructions for disaster recovery.
The Three nines (99.9% availability) can only be achieved once you have resilient hardware, using redundant hardware components, including redundant network cards, routers, and Hot-Swappable RAID Arrays.
Moving to the four nines (99.99% availability) requires all of the 'Three nines' and, in aditionaddition, the service must also be maintained during routine maintenance, by using redundant systems. Your total downtime in a year can be 51.6 minutes.
Achieving the five nines would requires requires an annual down-time of just five minutes. To achieve this, you need geographically dispersed systems with automatic failover.

Calculating Cost of Downtime

This is a difficult figure to calculate. It is often quoted by the minute as a revenue figure; 'cost per minute'. Unfortunately, a failure in the middle of peak trading times is quite different from the same failure at a quiet period. For some production systems, this figure is almost impossible to calculate, and must be established and agreed with the accountants, before downtimes happen.

Calculating the cost of Availability

It is tempting to cost out only the hardware and buildings, rather than the man-hours. However, before a decision can be made on implementing high-availability, the cost of attaining each of the 'nines' levels of availability must be estimated for the particular production system. By doing this, you are more likely to be provided with the necessary budget to achieve it, and less likely to be presented with unrealistic demands over availability.

Preparing for Disaster recovery

Before implementing a resilient system, you need to:
  • Understand the processing requirements of all applications that are connecting to the databases.
  • List all security accounts, security settings, configuration options, databases, instance-level objects, DTS/SSIS packages, data feeds, remote/linked servers
You will need to write a disaster recovery plan that includes every step and detail that must be performed in order to rebuild the servers. And keep it updated with every change. Write it in plain English. Make no assumptions about the specialised knowledge of whoever ends up with the task of rebuilding the servers.
Ensure that there are several 'first-responders', people who are trained, able, and willing to deal with virtually any problem or issue.
Fully test your disaster recovery plan at least once a year.
The disaster recovery plan can be maintained as a project, making it easier to plot people, resources, hardware, software, dependencies and tasks.

Minimising unplanned downtime

Keep an up-to-date list of the people who will be responsible for making the server available, with contact numbers etc. Keep information as to who is offsite and onsite up to date. Keep logins and passwords, and software keys up-to-date. Do occasional disaster drill, and refine the plan in the light of what goes wrong.
Do a range of backup types to cope with all eventualities. file groups, for example, can be used to group database objects in a logical way, such as similar usage patterns, to allow rapid restores, so they can be a useful supplement, alongside full backups since file group backups can often be restored whilst the rest of the database is online. One must consider the placement of data within file-groups with an eye to rapid restores. Even where the database becomes unavailable, a file group restore incurs less downtime than a full restore, if it is possible to apply it.
You can speed up the time it takes to have a replacement server ready to switch in, by providing standby servers. These must be pre-configured with at least the correct OS and hot-fixes, and be sufficient in size to hold the production systems. There is some difference in opinion about what the terms 'cold' 'warm' and 'hot' standby means, but here are what seem to be the most common definitions.

Cold Standby

This is a spare server, of the same specification of the production server, which is configured and ready to receive a copy of the database taken from the backups of the production server.

Warm Standby

This is a redundant server with a mirrored, or log-shipped, copy of the database that requires only a manual intervention to failover, and promote it to being the production server.

Hot standby

This is a server kept in sync with the production server and able to detect failure of the production server and automatically failover without the need for manual intervention. The ideal is a geographically dispersed failover cluster.
Server synchronisation for Warm or Hot standby can be achieved by:
Log Shipping
This is simple, cheap and dependable. It can maintain a 'warm standby', but not 'hot standby', at a distance, but you will need to script the role-switching, login-synchronisation and client redirects, in order to minimise downtime.
When failover occurs, one may lose transactions from the time of the last transaction log transfer. As log shipping copies the database rather then the entire server, it does not copy new logins, database users, DTS/SSIS packages, Agent jobs etc. These must be done separately. It can be implemented without downtime, but will always require manual intervention for failover.
SQL Backup now provides a very slick log-shipping facility that requires less bandwidth than the standard log shipping and is faster.
Mirroring,
This is a new technology in SQL Server 2005. Like Clustering, it can manage a failover at the database level, and thereby maintain a 'hot standby'. It requires brief outage of production server during synchronization but, unlike Clustering, does not need a high-bandwidth link, specially certified hardware, or special skills to set up and manage.
It uses log records to synchronise the databases and can quickly fail over to the standby server. Client applications have to be coded to automatically redirect their connection information and, in the event of a failover, automatically connect to the standby server and database. Database mirroring, however, can fail over quickly with no loss of committed data. Mirroring uses a 'Witness' member of a database mirroring session to determine if a failure has really occurred.
When the partner servers in a mirroring cannot see each other, the partner servers contact the witness to see whether the witness can contact the other partner
Failover clustering
This requires the whole setup to be designed and built from the ground up. You will need specially certified hardware, modification to the applications, transfer of all databases, and a great deal of patience and methodical work to set up and maintain.
Synchronisation
By using a command-line version of SQL Data Compare, it is possible to keep databases in synchronisation in order to maintain a 'Warm Standby'. However, it is effective only for small databases, puts loading on the server, and synchronisation can be delayed by up to ten minutes, depending on how often the synchronisation takes place. This requires manual fail over, as with Log Shipping
Replication
This is reasonably simple to set up and requires no backup restoration as synchronisation is done via a snapshot. This is not really a high-availability option but is included here as it is sometimes suggested. Most useful for Web Farm applications, using merge synchronisation, but, generally, there are too many single-points-of-failure in such a system to qualify it as an alternative.

Reducing planned down-time

The use of enterprise-quality servers will assist in reducing planned downtime. By using hot-swap RAM and RAID drives to add memory and disk storage, for example, you can avoid any downtime for such upgrades.
There will always be maintenance tasks that require that the server is taken off-line, however briefly. For example, if you have only one server and no standby then performing any maintenance that requires a system restart, such as replacing failing system components, or installing service packs, will require planned system downtime.
With a standby, you can do rolling upgrades by performing a manual failover to switch the workload off the node that requires maintenance, thereby maintaining the service and avoiding any downtime.
Every database will also require routine maintenance tasks; tuning needs to be performed, back ups need to be executed, indexes need to be maintained, and so on. In most cases, SQL Server will allow these to take place whilst maintaining the service. There is usually a way of selecting a technique and scheduling the task so as to maintain the service. For example, by using the DBCC INDEXDEFRAG statement, which does not hold locks, you can avoid blocking any running queries or updates.

Taking precautions against failures

Although one cannot prevent failures, one can take all reasonable precautions. The most obvious and most important precaution to take is to put in place a well-designed and implemented backup regime, taking into account the special requirements of the application. This is a subject beyond the scope of this article but, nonetheless, it is worth emphasising its importance. Also…

Process failure precautions

  • Control access to the server and server room
  • Ensure that all of your team clearly understand their roles and responsibilities.
  • Implement change controls to ensure that all software and hardware changes to a production server are documented. Because Change control systems require the signoff by several team specialists it allows them to check for potential problem .
  • Document and map all of your SQL Server instances, being particularly careful to record application relationships such as replication or log-shipping, data-feeds, message routes, links, remoting, and file transfer routes.
  • Make sure your Test servers are identical in configuration to your Production servers.
  • Before applying patches, hotfixes, and service packs, test them first on a Test Server.

Change failure precautions

  • Document all proposed changes
  • List the expected impact on the production system
  • Gain consensus and signoff for the changes, as appropriate.
  • Test the effect of the changes in terms of functionality and Stress/Load.
  • Document the rollback/reversion plan and test it out on those who are likely to be 'early responders' to a system failure.

Natural, and man-made, disaster precautions

  • Arrange for the service to be mirrored, or held at 'Warm stand-by', a long way away. Test out the ability to switch the service remotely. SAN replication is a popular solution, but mirroring is very effective

Hardware failure precautions

  • Simulate failure in all likely places to check that secondary hardware 'kicks-in' as expected.
  • Make sure there is an architecture diagram, and clear instructions for all hardware recovery routines, which are easily understandable to the 'first responder'.
  • Provide generous battery-backup.
  • Use redundant power supplies
  • Use hardware and software monitoring tools: hardware often gives out warning signs before 'letting go'.
  • Use a RAIDed array or SAN for storing your data, with hot-swappable drives with available spares. A 'Stripe of Mirrors' (Raid 10) is probably best practice.
  • Install redundancy in storage controllers.
  • Place the databases of your server on a different raid array to the transaction log. Locate TempDB on a high performance RAID array. SQL Server cannot function without it.
  • Provide both Network card and router redundancy
  • Ensure at least 'Warm Standby' fallback servers by using clustering, database mirroring, synchronisation or log shipping.

Software Failure precautions

Software failure can happen due to software changes, but also when data changes. Even date changes can cause failure. 'Code Rot' is the common term for software system failure when no recent software changes have been made.
  • Use Change and source control (see change failure above)
  • Before rolling out a production release, do strict 'limit' testing (testing under the extremes of data or throughput, and with hardware components randomly unplugged to assess whether software degradation is 'graceful' or not)
  • Perform Regular regression testing on the test server with different simulated loads
  • Avoid overlapping jobs in the SQL Server Agent; do routine DBCC checks and re-indexes of tables at off-peak times.

Network Failure precautions

TCP/IP is designed fundamentally as a resilient system in the event of disaster, but this relies on the network infrastructure being able to route network packets via alternative pathways in the event of the failure of a pathway.
  • Secondary DNS/WINS servers must be provided.
  • The system must not be reliant on a single domain server or active directory.
  • There should be Redundant routers/switches
  • Redundant WAN/Internet connections are generally important.
  • Ensure that there is no single point of failure in the network by regular 'limit-testing'

Security Failure precautions

  • Ensure the physical security of each SQL Server.
  • Create alerts and reports for any unusual patterns of user activity on the server, and investigate them (SQL Data Compare is very handy for this)
  • Give users the fewest permissions they need to perform their job.
  • Audit all login and logout events
  • Use DDL triggers to log and notify all changes to the security configuration of the server.
  • Adopt all current security best-practices when implementing the Server

Conclusions

It is a mistake to believe that high availability can be implemented merely with a chequebook. Although there are some attractive technologies around that will minimise your downtime in most circumstances, they are only part of the solution. At the heart of every robust system, there is planning, documentation, scripting, testing and drill. Lack of budget is not a barrier to achieving high availability. There are several solutions that do not have a high cost attached. There is much that can be achieved by exploiting the features available in SQL Server, in the use of SQL Server tools, and in designing a resilient architecture.

Oracle Clusterware 11g release 2 preinstallation steps

Verify System Requirements

Enter the following commands to check available memory:
grep MemTotal /proc/meminfo
grep SwapTotal /proc/meminfo
 
On 64-bit systems and IBM: Linux on System z, the minimum required RAM is 2.5 GB for Oracle Grid Infrastructure for a Cluster and Oracle RAC. Oracle recommends that you set swap space to 2 times the amount of RAM for systems with 4 to 8 GB of RAM. For systems with 8 to 32 GB RAM, use swap space equal to 1.5 times the size of RAM, up to 32 GB. For systems with greater than 32 GB of RAM, use 32 GB RAM.

On 32-bit systems, the minimum required RAM is 2.5 GB for Oracle Grid Infrastructure for a Cluster, including installations where you plan to install Oracle RAC. Oracle recommends that for systems with 2.5 GB to 16 GB RAM, use swap space equal to RAM. For systems with more than 16 GB RAM, use 16 GB of RAM for swap space.

If the swap space and the Grid home are on the same filesystem, then add together their respective requirements for the total minimum space required.
 
df -h

This command checks the available space on file systems. If you use normal redundancy for Oracle Clusterware files, which is three Oracle Cluster Registry (OCR) locations and three voting disk locations, then you should have at least 2 GB of file space available on shared storage volumes reserved for Oracle Grid Infrastructure files.

If you plan to install on Oracle ASM, then to ensure high availability of OCR or voting disk files on Oracle ASM, you need to have at least 2 GB of for Oracle Clusterware files in three separate failure groups, with at least three physical disks. Each disk must have at least 1 GB of capacity to ensure that there is sufficient space to create Oracle Clusterware files.

Ensure you have at least 6.5 GB of space for the Oracle Grid Infrastructure for a Cluster home (Grid home). This includes Oracle Clusterware and Oracle Automatic Storage Management (Oracle ASM) files and log files, ACFS log files, and includes the Cluster Health Monitor repository.
 
df -h /tmp
Ensure that you have at least 1 GB of space in /tmp. If this space is not available, then increase the size, or delete unnecessary files in /tmp.

Check Network Requirements

Single Client Access Name (SCAN) for the Cluster

During Typical installation, you are prompted to confirm the default Single Client Access Name (SCAN), which is used to connect to databases within the cluster irrespective of which nodes they are running on. By default, the name used as the SCAN is also the name of the cluster. The default value for the SCAN is based on the local node name. If you change the SCAN from the default, then the name that you use must be globally unique throughout your enterprise.
In a Typical installation, the SCAN is also the name of the cluster. The SCAN and cluster name must be at least one character long and no more than 15 characters in length, must be alphanumeric, cannot begin with a numeral, and may contain hyphens (-).

For example:
 
NE-Sa89
If you require a SCAN that is longer than15 characters, then be aware that the cluster name defaults to the first 15 characters of the SCAN.


IP Address Requirements

Before starting the installation, you must have at least two interfaces configured on each node: One for the private IP address and one for the public IP address.

IP Address Requirements for Manual Configuration
If you do not enable GNS, then the public and virtual IP addresses for each node must be static IP addresses, configured before installation for each node, but not currently in use. Public and virtual IP addresses must be on the same subnet.
Oracle Clusterware manages private IP addresses in the private subnet on interfaces you identify as private during the installation interview.
The cluster must have the following addresses configured:
  • A public IP address for each node, with the following characteristics:
    • Static IP address
    • Configured before installation for each node, and resolvable to that node before installation
    • On the same subnet as all other public IP addresses, VIP addresses, and SCAN addresses
  • A virtual IP address for each node, with the following characteristics:
    • Static IP address
    • Configured before installation for each node, but not currently in use
    • On the same subnet as all other public IP addresses, VIP addresses, and SCAN addresses
  • A Single Client Access Name (SCAN) for the cluster, with the following characteristics:
    • Three Static IP addresses configured on the domain name server (DNS) before installation so that the three IP addresses are associated with the name provided as the SCAN, and all three addresses are returned in random order by the DNS to the requestor
    • Configured before installation in the DNS to resolve to addresses that are not currently in use
    • Given a name that does not begin with a numeral
    • On the same subnet as all other public IP addresses, VIP addresses, and SCAN addresses
    • Conforms with the RFC 952 standard, which allows alphanumeric characters and hyphens ("-"), but does not allow underscores ("_").
  • A private IP address for each node, with the following characteristics:
    • Static IP address
    • Configured before installation, but on a separate, private network, with its own subnet, that is not resolvable except by other cluster member nodes


      Redundant Interconnect Usage

      In previous releases, to make use of redundant networks for the interconnect, bonding, trunking, teaming, or similar technology was required. Oracle Grid Infrastructure and Oracle RAC can now make use of redundant network interconnects, without the use of other network technology, to enhance optimal communication in the cluster. This functionality is available starting with Oracle Database 11g Release 2 (11.2.0.2).
      Redundant Interconnect Usage enables load-balancing and high availability across multiple (up to 4) private networks (also known as interconnects).

      Intended Use of Network Interfaces

      During installation, you are asked to identify the planned use for each network interface that OUI detects on your cluster node. You must identify each interface as a public or private interface, or as "do not use." For interfaces that you plan to have used for other purposes—for example, an interface dedicated to a network file system—you must identify those instances as "do not use" interfaces, so that Oracle Clusterware ignores them.
      Redundant Interconnect Usage cannot protect interfaces used for public communication. If you require high availability or load balancing for public interfaces, then use a third party solution. Typically, bonding, trunking or similar technologies can be used for this purpose.
      You can enable Redundant Interconnect Usage for the private network by selecting multiple interfaces to use as private interfaces. Redundant Interconnect Usage creates a redundant interconnect when you identify more than one interface as private. This functionality is available starting with Oracle Grid Infrastructure 11g Release 2 (11.2.0.2).

      Create Groups and Users

      Enter the following commands to create default groups and users:
      One system privileges group for all operating system-authenticated administration privileges, including Oracle RAC (if installed):
      # groupadd -g 1000 oinstall
      # groupadd -g 1031 dba
      # useradd -u 1101 -g oinstall -G dba oracle
      # mkdir -p  /u01/app/11.2.0/grid
      # mkdir -p /u01/app/oracle
      # chown -R oracle:oinstall /u01
      # chmod -R 775 /u01/
      
      This set of commands creates a single installation owner, with required system privileges groups to grant the OraInventory system privileges (oinstall), and to grant the OSASM/SYSASM and OSDBA/SYSDBA system privileges. It also creates the Oracle base for both Oracle Grid Infrastructure and Oracle RAC, /u01/app/oracle. It creates the Grid home (the location where Oracle Grid Infrastructure binaries are stored), /u01/app/11.2.0/grid.


      Check Storage

      You must have space available on Oracle ASM for Oracle Clusterware files (voting disks and Oracle Cluster Registries), and for Oracle Database files, if you install standalone or Oracle Real Application Clusters Databases. Creating Oracle Clusterware files on block or raw devices is no longer supported for new installations.

      Prepare Storage for Oracle Automatic Storage Management

      Install the Linux ASMLIB RPMs to simplify storage administration. ASMLIB provides persistent paths and permissions for storage devices used with Oracle ASM, eliminating the need for updating udev or devlabel files with storage device paths and permissions.

      ASMLib 2.0 is delivered as a set of three Linux packages:
      • oracleasmlib-2.0 - the Oracle ASM libraries
      • oracleasm-support-2.0 - utilities needed to administer ASMLib
      • oracleasm - a kernel module for the Oracle ASM library
      Each Linux distribution has its own set of ASMLib 2.0 packages, and within each distribution, each kernel version has a corresponding oracleasm package.

      Installing ASMLIB Packages Automatically with ULN

      If you are registered with the Oracle Unbreakable Linux Network (ULN), then you can download and install ASMLIB packages for your system automatically. To install ASMLIB from ULN:
      1. Log in as root.
      2. Run the following command:
        # up2date -i oracleasm-support oracleasmlib oracleasm-'uname -r'
        
        This command installs the support tools, the library, and the kernel driver for the Linux kernel version running on your system.

        Determine the Correct Oracleasm Package
        Determine which kernel you are using by logging in as root and running the following command:
        uname -rm
        
        For example:
        # uname –rm
        2.6.9-5.ELsmp i686 
         
         
        Download and Install the Oracleasm Package
        After you determine the kernel version for your system, complete the following task:
      3. Open a Web browser using the following URL:
        http://www.oracle.com/technetwork/server-storage/linux/downloads/index-088143.html
        
      4. Click Linux Drivers for Automatic Storage Management, and select the ASMLIB link for your version of Linux.
      5. Download the oracleasmlib and oracleasm-support packages for your version of Linux.
      6. Download the oracleasm package corresponding to your kernel version.
      7. Log in as root and install the Oracle ASM packages.
      Configure ASMLib
      Log in as root, and enter the following command:
      # oracleasm configure -i
      
      Provide information as prompted for your system. the oracleasm command by default is in the path /usr/sbin. If you enter the command oracleasm configure without the -i flag, then you are shown the current configuration.

      Mark Oracle ASM Candidate Disk Partitions

      For OUI to recognize partitions as Oracle ASM disk candidates, you must log in as root and mark the disk partitions that Oracle ASM can use. To mark a disk for use by Oracle ASM, enter the following command syntax, where ASM_DISK_NAME is the name of the Oracle ASM disk group, and candidate_disk is the name of the disk device that you want to assign to that disk group:
      oracleasm createdisk ASM_DISK_NAME candidate_disk
      
      For example:
      # oracleasm createdisk data1 /dev/sdf 
       

      Install Oracle Grid Infrastructure Software

      1. Start OUI from the root level of the installation media. For example:
        ./runInstaller
        
      2. Select Install and Configure Grid Infrastructure for a Cluster, then select Typical Installation. In the installation screens that follow, enter the configuration information as prompted.

       



      1.  







Overview of the Automatic Workload Repository

The Automatic Workload Repository (AWR) collects, processes, and maintains performance statistics for problem detection and self-tuning purposes. This data is both in memory and stored in the database. The gathered data can be displayed in both reports and views.
The statistics collected and processed by AWR include:
  • Object statistics that determine both access and usage statistics of database segments
  • Time model statistics based on time usage for activities, displayed in the V$SYS_TIME_MODEL and V$SESS_TIME_MODEL views
  • Some of the system and session statistics collected in the V$SYSSTAT and V$SESSTAT views
  • SQL statements that are producing the highest load on the system, based on criteria such as elapsed time and CPU time
  • ASH statistics, representing the history of recent sessions activity
Gathering database statistics using the AWR is enabled by default and is controlled by the STATISTICS_LEVEL initialization parameter. The STATISTICS_LEVEL parameter should be set to the TYPICAL or ALL to enable statistics gathering by the AWR. The default setting is TYPICAL. Setting STATISTICS_LEVEL to BASIC disables many Oracle Database features, including the AWR, and is not recommended. If STATISTICS_LEVEL is set to BASIC, you can still manually capture AWR statistics using the DBMS_WORKLOAD_REPOSITORY package. However, because in-memory collection of many system statistics—such as segments statistics and memory advisor information—will be disabled, the statistics captured in these snapshots may not be complete.

How do we interpret Oracle 11g statistics

When initially examining performance data, you can formulate potential theories by examining your statistics. One way to ensure that your interpretation of the statistics is correct is to perform cross-checks with other data. This establishes whether a statistic or event is really of interest. Also, because foreground activities are tunable, it is better to first analyze the statistics from foreground activities before analyzing the statistics from background activities.
Some pitfalls are discussed in the following sections:

Hit ratios
When tuning, it is common to compute a ratio that helps determine whether there is a problem. Such ratios include the buffer cache hit ratio, the soft-parse ratio, and the latch hit ratio. Do not use these ratios as definitive identifiers of whether a performance bottleneck exists. Rather, use them as indicators. To identify whether a bottleneck exists, examine other related evidence.

Wait events with timed statistics
Setting TIMED_STATISTICS to true at the instance level directs the database to gather wait time for events, in addition to available wait counts. This data is useful for comparing the total wait time for an event to the total elapsed time between the data collections. For example, if the wait event accounts for only 30 seconds out of a 2-hour period, then little is to be gained by investigating this event, although it may be the highest ranked wait event when ordered by time waited. However, if the event accounts for 30 minutes of a 45-minute period, then the event is worth investigating.

Comparing Oracle Database statistics with other factors
When looking at statistics, it is important to consider other factors that influence whether the statistic is of value. Such factors include the user load and the hardware capability. Even an event that had a wait of 30 minutes in a 45-minute period might not be indicative of a problem if you discover that there were 2000 users on the system, and the host hardware was a 64-node computer.

Wait events without timed statistics
If TIMED_STATISTICS is false, then the amount of time waited for an event is not available. Therefore, it is only possible to order wait events by the number of times each event was waited for. Although the events with the largest number of waits might indicate the potential bottleneck, they might not be the main bottleneck. This can happen when an event is waited for a large number of times, but the total time waited for that event is small. The converse is also true: an event with fewer waits might be a problem if the wait time is a significant proportion of the total wait time. Without having the wait times to use for comparison, it is difficult to determine whether a wait event is really of interest.

Idle wait events
Oracle Database uses some wait events to indicate if the Oracle server process is idle. Typically, these events are of no value when investigating performance problems, and they should be ignored when examining the wait events.

Computed statistics
When interpreting computed statistics (such as rates, statistics normalized over transactions, or ratios), it is important to cross-verify the computed statistic with the actual statistic counts. This confirms whether the derived rates are really of interest: small statistic counts usually can discount an unusual ratio. For example, on initial examination, a soft-parse ratio of 50% generally indicates a potential tuning area. If, however, there was only one hard parse and one soft parse during the data collection interval, then the soft-parse ratio would be 50%, even though the statistic counts show this is not an area of concern. In this case, the ratio is not of interest due to the low raw statistic counts.


Is MySQL's SLEEP() function a busy-wait?

It is not a busy-wait, but it is not a useful thing to do either.

The MySQL sleep() function occupies the server thread for the length of the sleep, which means that it can't do other (more useful work) and uses up the thread stack in the mysql server process. These are not desirable things and should not be encouraged.
In some cases (depending on the context) locks may be held while sleep() is sleeping.
Of course the sleep() function is very useful for debugging, but I feel that it should not be in production code.

If a client-thread wants to sleep, it should sleep itself, not ask mysql to do it.