Translators/cluster/unify

Translator cluster/unify

The unify translator combines multiple storage bricks into one large, fast storage volume. If the bricks already hold data, unify is the best translator to use; if your setup is fresh, use the distribute translator instead. For I/O scheduling, you bind your preferred I/O scheduler module to the unify volume. Several I/O schedulers are available, and you pick one based on your application requirements.

See "Understanding Unify Translator" to learn more about the unify translator.

volume unify
   type cluster/unify
   subvolumes brick1 brick2 brick3 brick4 brick5 brick6 brick7 brick8
   option namespace brick-ns # should be a node which is not present in 'subvolumes'
   option scheduler rr    # simple round-robin scheduler
end-volume


GlusterFS Schedulers

The scheduler decides where new files are created across the clustered file system, based on load, availability and other determining factors. The following I/O schedulers are available:

ALU Scheduler

ALU stands for "Adaptive Least Usage". It is the most advanced scheduler available in GlusterFS:

  • It balances the load across volumes taking several factors into account.
  • It adapts itself to changing I/O patterns according to its configuration.

When properly configured, it can eliminate the need for regular tuning of the filesystem to keep volume load nicely balanced.

The ALU scheduler is composed of multiple least-usage sub-schedulers. Each sub-scheduler keeps track of a certain type of load, for each of the subvolumes, getting the actual statistics from the subvolumes themselves. The sub-schedulers are these:

  • disk-usage - the used and free disk space on the volume
  • read-usage - the amount of reading done from this volume
  • write-usage - the amount of writing done to this volume
  • open-files-usage - the number of files currently opened from this volume
  • disk-speed-usage - the speed at which the disks are spinning. This is a constant value and therefore not very useful.

The ALU scheduler needs to know which of these sub-schedulers to use and in which order to evaluate them. This is done through the "option alu.order" configuration directive.

Each sub-scheduler needs to know two things: when to kick in (the entry-threshold) and how long to stay in control (the exit-threshold). For example, when unifying three disks of 100GB, keeping an exact balance of disk-usage is not necessary. Instead, a 1GB margin can be left in which to balance other factors, such as read-usage. The disk-usage scheduler can be told to kick in only when the discrepancy passes a certain threshold, such as 1GB. When it assumes control under this condition, it writes all subsequent data to the least-used volume. It would be unwise to hand back control as soon as the discrepancy falls below the entry-threshold again, since the same situation would very likely recur almost immediately. The ALU would then spend most of its time on disk-usage scheduling, which is unfair to the other sub-schedulers. The exit-threshold therefore defines the amount of data that needs to be written to the least-used disk before control is relinquished again.
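The entry/exit behaviour described above can be illustrated with a small Python model. This is only a sketch of the idea, not the GlusterFS implementation: the class name, its methods, and the choice to fix the target volume at the moment of entry are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

MB = 1024 ** 2
GB = 1024 ** 3


@dataclass
class DiskUsageSubScheduler:
    """Hypothetical model of one least-usage sub-scheduler."""
    entry_threshold: int          # discrepancy (bytes) that triggers control
    exit_threshold: int           # bytes to write before giving control back
    active: bool = False
    target: Optional[str] = None
    written_since_entry: int = 0

    def pick(self, used: Dict[str, int]) -> Optional[str]:
        """Return the volume to write to, or None if not in control."""
        if not self.active:
            discrepancy = max(used.values()) - min(used.values())
            if discrepancy > self.entry_threshold:
                # Assumption: the least-used volume is chosen once, at entry.
                self.active = True
                self.target = min(used, key=used.get)
                self.written_since_entry = 0
        return self.target if self.active else None

    def record_write(self, nbytes: int) -> None:
        """Account for data written while this sub-scheduler is in control."""
        if not self.active:
            return
        self.written_since_entry += nbytes
        if self.written_since_entry >= self.exit_threshold:
            # Enough data went to the least-used volume: relinquish control.
            self.active = False
            self.target = None
```

With an entry-threshold of 1GB and an exit-threshold of 60MB, the model stays idle while the volumes are within 1GB of each other, takes control once one volume is more than 1GB ahead, and hands control back after 60MB has been recorded against the least-used volume.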

In addition to the sub-schedulers, the ALU scheduler also has "limits" options. These stop the creation of new files on a volume once a value crosses a certain threshold. For example, setting "option alu.limits.min-free-disk 5GB" stops the scheduling of files to volumes that have less than 5GB of free disk space, leaving the files already on that disk some room to grow.

The actual values you assign to the thresholds for sub-schedulers and limits depend on your situation. If you have fast-growing files, you would want to stop file-creation on a disk much earlier than when hardly any of your files are growing. If you care less about disk-usage balance than about read-usage balance, you would want a bigger disk-usage scheduler entry-threshold and a smaller read-usage scheduler entry-threshold.

For thresholds that define a size, a percentage of free space is also allowed. For example: "option alu.limits.min-free-disk 5%".
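As a rough illustration of how a min-free-disk limit given either as a size or as a percentage could be applied, consider the Python sketch below. The function names, the accepted unit suffixes, and the use of binary units (1GB = 1024^3 bytes) are assumptions for this example and are not taken from the GlusterFS source.

```python
# Hypothetical unit table; binary multiples are an assumption.
UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}


def parse_min_free(value):
    """Parse '5%' or '5GB' into ('percent', 5.0) or ('bytes', n)."""
    value = value.strip()
    if value.endswith("%"):
        return ("percent", float(value[:-1]))
    for suffix, factor in UNITS.items():
        if value.upper().endswith(suffix):
            return ("bytes", int(float(value[:-len(suffix)]) * factor))
    return ("bytes", int(value))


def eligible(volumes, min_free):
    """Return the volumes that may still receive new files.

    volumes maps a volume name to a (total_bytes, free_bytes) tuple.
    """
    kind, amount = parse_min_free(min_free)
    result = []
    for name, (total, free) in volumes.items():
        if kind == "percent":
            ok = free * 100 >= amount * total
        else:
            ok = free >= amount
        if ok:
            result.append(name)
    return result
```

For two 100GB volumes with 10GB and 4GB free, both "5GB" and "5%" leave only the first volume eligible for new files.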

ALU Scheduler volume example:

volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4 brick5
  option alu.read-only-subvolumes brick5 # Makes brick5 read-only: no new files are created on it
  option scheduler alu   # use the ALU scheduler
  option alu.limits.min-free-disk  5%      # Don't create files on a volume with less than 5% free disk space
  option alu.limits.max-open-files 10000   # Don't create files on a volume with more than 10000 files open
  
  # When deciding where to place a file, first look at the disk-usage, then at  
  # read-usage, write-usage, open files, and finally the disk-speed-usage.
  option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage
  option alu.disk-usage.entry-threshold 2GB   # Kick in if the discrepancy in disk-usage between volumes is more than 2GB
  option alu.disk-usage.exit-threshold  60MB   # Don't stop writing to the least-used volume until the discrepancy is 1988MB 
  option alu.open-files-usage.entry-threshold 1024   # Kick in if the discrepancy in open files is 1024
  option alu.open-files-usage.exit-threshold 32   # Don't stop until the discrepancy in open files has been reduced to 992 (1024 - 32)
# option alu.read-usage.entry-threshold 20%   # Kick in when the read-usage discrepancy is 20%
# option alu.read-usage.exit-threshold 4%   # Don't stop until the discrepancy has been reduced to 16% (20% - 4%)
# option alu.write-usage.entry-threshold 20%   # Kick in when the write-usage discrepancy is 20%
# option alu.write-usage.exit-threshold 4%   # Don't stop until the discrepancy has been reduced to 16%
# option alu.disk-speed-usage.entry-threshold # NEVER SET IT. SPEED IS CONSTANT!!!
# option alu.disk-speed-usage.exit-threshold  # NEVER SET IT. SPEED IS CONSTANT!!!
  option alu.stat-refresh.interval 10sec   # Refresh the statistics used for decision-making every 10 seconds
# option alu.stat-refresh.num-file-create 10   # Refresh the statistics used for decision-making after creating 10 files
end-volume

NUFA Scheduler

The Non-Uniform Filesystem (NUFA) scheduler is similar in design to NUMA (http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access) memory. It is mainly used in HPC environments where the filesystem server and client must run within the same cluster. In such an environment, the NUFA scheduler gives the local system priority over other nodes for file creation.

volume posix1
  type storage/posix               # POSIX FS translator
  option directory /home/export    # Export this directory
end-volume 

volume bricks
  type cluster/unify
  subvolumes posix1 brick2 brick3 brick4
  option scheduler nufa
  option nufa.local-volume-name posix1
  option nufa.limits.min-free-disk 5%
end-volume

NOTE: NUFA now supports more than one local volume.
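The local-first preference can be pictured with the following Python sketch. It is a simplified model under stated assumptions: the function name is hypothetical, free space is passed in as a percentage per volume, and the fallback rule (the remote volume with the most free space) is invented for the example, since this page does not describe how NUFA chooses among non-local volumes.

```python
def nufa_pick(free_percent, local_volumes, min_free_percent=5.0):
    """Pick a volume for a new file, preferring local volumes.

    free_percent maps volume name to its free disk space in percent.
    """
    # Local volumes get priority as long as they respect the limit.
    for name in local_volumes:
        if free_percent[name] >= min_free_percent:
            return name
    # Assumed fallback: the remote volume with the most free space.
    remote = [
        name for name in free_percent
        if name not in local_volumes
        and free_percent[name] >= min_free_percent
    ]
    if not remote:
        return None
    return max(remote, key=free_percent.get)
```

In this model, new files land on posix1 until it drops below the min-free-disk limit, after which creation moves to the other subvolumes.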

Random Scheduler

Random scheduler randomly scatters file creation across storage bricks.

volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4
  option scheduler random
  option random.limits.min-free-disk 5%
end-volume

Round-Robin Scheduler

The Round-Robin (RR) scheduler creates files in a round-robin fashion. Each client has its own round-robin loop. This scheduler is a good choice when your files are mostly similar in size and I/O access pattern. The RR scheduler checks the free disk space of a server before scheduling to it, which also tells you when it is time to add another server brick. The default value of min-free-disk is 5%, and it is checked every 10 seconds (by default) while create calls are happening.

volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4
  option scheduler rr
  option rr.read-only-subvolumes brick4  # No files will be created in 'brick4'
  option rr.limits.min-free-disk 5%          # Unit in %
  option rr.refresh-interval 10               # Check server brick after 10s.
end-volume
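A minimal Python sketch of round-robin selection with read-only subvolumes and a min-free-disk check follows. It is an illustration only: the class and parameter names are hypothetical, free space is supplied by the caller on every call, and the refresh interval (which in the real scheduler controls how often free space is re-checked) is not modelled.

```python
class RoundRobin:
    """Hypothetical per-client round-robin loop over writable bricks."""

    def __init__(self, subvolumes, read_only=(), min_free_percent=5.0):
        skip = set(read_only)
        self.candidates = [v for v in subvolumes if v not in skip]
        self.min_free_percent = min_free_percent
        self.next_index = 0

    def pick(self, free_percent):
        """Return the next brick with enough free space, or None."""
        n = len(self.candidates)
        for step in range(n):
            idx = (self.next_index + step) % n
            name = self.candidates[idx]
            if free_percent.get(name, 0.0) >= self.min_free_percent:
                # Resume the loop after the brick that was just used.
                self.next_index = (idx + 1) % n
                return name
        return None
```

With brick4 read-only and brick2 below the 5% limit, successive calls alternate between brick1 and brick3.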


Switch Scheduler

The Switch scheduler is the latest addition to the GlusterFS code base. It schedules files according to the filename patterns specified, as the example below shows.

volume bricks
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4 brick5 brick6 brick7
  option scheduler switch
  option switch.case *jpg:brick1,brick2;*mpg:brick3;*:brick4,brick5,brick6
  option switch.read-only-subvolumes brick7
end-volume

The example above shows only the unify translator section of a spec file. Here, files matching the pattern '*jpg' are created in brick1 and brick2, files matching '*mpg' are created in brick3, and all other files are created in brick4, brick5 and brick6. brick7 is a read-only subvolume: data can be read from it, but no new files are created on it.
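To make the pattern matching concrete, here is a small Python sketch that parses a switch.case value and looks up the candidate bricks for a filename. It assumes shell-style glob matching and first-match-wins ordering, neither of which is confirmed by this page; the function names are hypothetical.

```python
from fnmatch import fnmatchcase


def parse_switch_case(spec):
    """Parse 'pattern:brick,brick;pattern:brick' into a list of rules."""
    rules = []
    for clause in spec.split(";"):
        clause = clause.strip()
        if not clause:
            continue
        pattern, _, bricks = clause.partition(":")
        names = [b.strip() for b in bricks.split(",") if b.strip()]
        rules.append((pattern.strip(), names))
    return rules


def candidates_for(filename, rules):
    """Return the bricks of the first rule whose pattern matches."""
    for pattern, bricks in rules:
        if fnmatchcase(filename, pattern):
            return bricks
    return []


rules = parse_switch_case("*jpg:brick1,brick2;*mpg:brick3;*:brick4,brick5,brick6")
print(candidates_for("photo.jpg", rules))   # ['brick1', 'brick2']
print(candidates_for("notes.txt", rules))   # ['brick4', 'brick5', 'brick6']
```

How a single brick is then chosen from the candidate list is left out of the sketch.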

 

Copyright © 2009 Gluster, Inc. All Rights Reserved.