Translators/cluster/unify
From GlusterDocumentation
Contents |
Translator cluster/unify
If you have some existing data, unify is the best translator. If your setup is fresh, use the distribute translator. The unify translator combines multiple storage bricks into one big fast storage server. For I/O scheduling, you can bind your preferred I/O scheduler module to the unify volume. You have a variety of I/O schedulers to pick from, based on your application requirements.
Read Understanding Unify Translator to know more about the unify translator.
volume unify type cluster/unify subvolumes brick1 brick2 brick3 brick4 brick5 brick6 brick7 brick8 option namespace brick-ns # should be a node which is not present in 'subvolumes' option scheduler rr # simple round-robin scheduler end-volume
GlusterFS Schedulers
The scheduler decides how to distribute the new creation operations across the clustered file system based on load, availability and other determining factors. Here is a list of I/O schedulers you can pick from...
ALU Scheduler
ALU stands for "Adaptive Least Usage". It is the most advanced scheduler available in GlusterFS:
- It balances the load across volumes taking several factors into account.
- It adapts itself to changing I/O patterns according to its configuration.
When properly configured, it can eliminate the need for regular tuning of the filesystem to keep volume load nicely balanced.
The ALU scheduler is composed of multiple least-usage sub-schedulers. Each sub-scheduler keeps track of a certain type of load, for each of the subvolumes, getting the actual statistics from the subvolumes themselves. The sub-schedulers are these:
- disk-usage - the used and free disk space on the volume
- read-usage - the amount of reading done from this volume
- write-usage - the amount of writing done to this volume
- open-files-usage - the number of files currently opened from this volume
- disk-speed-usage - the speed at which the disks are spinning. This is a constant value and therefore not very useful.
The ALU scheduler needs to know which of these sub-schedulers to use and in which order to evaluate them. This is done through the "option alu.order" configuration directive.
Each sub-scheduler needs to know two things: when to kick in (the entry-threshold), and how long to stay in control (the exit-threshold). For example: when unifying three disks of 100GB, keeping an exact balance of disk-usage is not necessary. Instead, there could be a 1GB margin which can be used to nicely balance other factors, such as read-usage. The disk-usage scheduler can be told to kick in only when a certain threshold of discrepancy is passed, such as 1GB. When it assumes control under this condition, it will write all subsequent data to the least-used volume. If it is doing so, it is unwise to stop right after the values are below the entry-threshold again since that would make it very likely that the situation will occur again very soon. Such a situation would cause the ALU to spend most of its time disk-usage scheduling which is unfair to the other sub-schedulers. The exit-threshold therefore defines the amount of data that needs to be written to the least-used disk before control is relinquished again.
In addition to the sub-schedulers, the ALU scheduler also has "limits" options. These can stop the creation of new files on a volume once values drop below a certain threshold. For example, setting "option alu.limits.min-free-disk 5GB" will stop the scheduling of files to volumes that have less than 5GB of free disk space leaving the files on that disk some room to grow.
The actual values you assign to the thresholds for sub-schedulers and limits depend on your situation. If you have fast-growing files, you would want to stop file-creation on a disk much earlier than when hardly any of your files are growing. If you care less about disk-usage balance than about read-usage balance, you would want a bigger disk-usage scheduler entry-threshold and a smaller read-usage scheduler entry-threshold.
For thresholds defining a size, percentage of free space is allowed. For example: "option alu.limits.min-free-disk 5%".
- ALU Scheduler Volume example
volume bricks type cluster/unify subvolumes brick1 brick2 brick3 brick4 brick5 option alu.read-only-subvolumes brick5 # This option makes brick5 to be readonly, where no new files are created. option scheduler alu # use the ALU scheduler option alu.limits.min-free-disk 5% # Don't create files one a volume with less than 5% free diskspace option alu.limits.max-open-files 10000 # Don't create files on a volume with more than 10000 files open # When deciding where to place a file, first look at the disk-usage, then at # read-usage, write-usage, open files, and finally the disk-speed-usage. option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage option alu.disk-usage.entry-threshold 2GB # Kick in if the discrepancy in disk-usage between volumes is more than 2GB option alu.disk-usage.exit-threshold 60MB # Don't stop writing to the least-used volume until the discrepancy is 1988MB option alu.open-files-usage.entry-threshold 1024 # Kick in if the discrepancy in open files is 1024 option alu.open-files-usage.exit-threshold 32 # Don't stop until 992 files have been written the least-used volume # option alu.read-usage.entry-threshold 20% # Kick in when the read-usage discrepancy is 20% # option alu.read-usage.exit-threshold 4% # Don't stop until the discrepancy has been reduced to 16% (20% - 4%) # option alu.write-usage.entry-threshold 20% # Kick in when the write-usage discrepancy is 20% # option alu.write-usage.exit-threshold 4% # Don't stop until the discrepancy has been reduced to 16% # option alu.disk-speed-usage.entry-threshold # NEVER SET IT. SPEED IS CONSTANT!!! # option alu.disk-speed-usage.exit-threshold # NEVER SET IT. SPEED IS CONSTANT!!! option alu.stat-refresh.interval 10sec # Refresh the statistics used for decision-making every 10 seconds # option alu.stat-refresh.num-file-create 10 # Refresh the statistics used for decision-making after creating 10 files end-volume
NUFA Scheduler
Non-Uniform Filesystem Scheduler similar to NUMA (http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access) memory design. It is mainly used in HPC environments where you are required to run the filesystem server and client within the same cluster. Under such environment, NUFA scheduler gives the local system more priority for file creation over other nodes.
volume posix1 type storage/posix # POSIX FS translator option directory /home/export # Export this directory end-volume volume bricks type cluster/unify subvolumes posix1 brick2 brick3 brick4 option scheduler nufa option nufa.local-volume-name posix1 option nufa.limits.min-free-disk 5% end-volume
NOTE: Now NUFA comes with support for more than one local volume option.
Random Scheduler
Random scheduler randomly scatters file creation across storage bricks.
volume bricks type cluster/unify subvolumes brick1 brick2 brick3 brick4 option scheduler random option random.limits.min-free-disk 5% end-volume
Round-Robin Scheduler
Round-Robin (RR) scheduler creates files in a round-robin fashion. Each client will have its own round-robin loop. When your files are mostly similar in size and I/O access pattern, this scheduler is a good choice. RR scheduler now checks for free disk size of the server before scheduling, so you can get to know when to add another server brick. The default value of min-free-disk is 5% and is checked every 10seconds (by default) if there is any create call happening.
volume bricks type cluster/unify subvolumes brick1 brick2 brick3 brick4 option scheduler rr option rr.read-only-subvolumes brick4 # No files will be created in 'brick4' option rr.limits.min-free-disk 5% # Unit in % option rr.refresh-interval 10 # Check server brick after 10s. end-volume
Switch Scheduler
Switch Scheduler is the latest addition to the GlusterFS code base, which actually schedules the file according the the filename patterns specified. One can understand it with the example given below.
volume bricks type cluster/unify subvolumes brick1 brick2 brick3 brick4 brick5 brick6 brick7 option scheduler switch option switch.case *jpg:brick1,brick2;*mpg:brick3;*:brick4,brick5,brick6 option switch.read-only-subvolumes brick7 end-volume
Above is the snapshot of just unify translator in a spec file. Here, files with pattern '*jpg' will be created in brick1 and brick2, and '*mpg' will be created in brick3, and all other files will be created in brick4,brick5, and brick6. And brick7 will be just read-only subvolume, from which just data can be read.


