Understanding Unify Translator
From GlusterDocumentation
The Unify Translator is a key feature of GlusterFS: it is what allows GlusterFS to be classified as a clustered file system. Before going further into the details of the unify design, it is important to know how a translator works in GlusterFS.
NOTE:
- Unify, being part of a file system, has one requirement: it creates directories on all the child nodes and each file on exactly one of the child nodes, and it expects the backend to keep to this layout.
- The explanation here is based on GlusterFS release 1.3.0-pre5 onwards. For earlier releases everything still holds, except for the parts about namespace and self-heal.
Unify's Volume specification
1 volume unify
2   type cluster/unify
3   option scheduler rr                       # check alu, random, nufa
4   option rr.limits.min-free-disk 5          # 5% of free disk is minimum.
5   option namespace namespace-child
6   subvolumes child1 child2 child3 child4    # client[1-n], n < 32bit number :D
7 end-volume
Block Diagram
CLIENT:
              ----------------
              |     FUSE     |  (/mnt/glusterfs)
              ----------------
                      |
                      V
                      .   <- There may be some 'performance' translators
                      .
                      |
                      V
               ---------------
               |    UNIFY    |   <- "volume unify" in the spec file.
               ---------------\
                .   .   .   .  -------------------->---------->-------------+
            /     /       |     \  <--~ can load 'afr', 'stripe' etc here.  |
        .        .         .        .                                       V
    .           .           .           .                                   |
----------  ----------  ----------  ----------                     -------------------
| CHILD1 |  | CHILD2 |  | CHILD3 |  | CHILD4 |                     | namespace-child |
----------  ----------  ----------  ----------                     -------------------
    .           .           .           .                                   .
-.-.-.-.-.-.-.-.-.-..-.-.-..-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-...-.-.-.-.-.-.-.-..-
              ++++++ Network (can be TCP/IP, IB-SDP, or IB-VERBS) +++++++++
-.-.-.-.-.-.-.-.-.-..-.-.-..-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-...-.-.-.-.-.-.-.-..-
    .           .           .           .                                   |
    |           |           |           |        <- [GlusterFSD] ->         |
    V           V           V           V                                   V
    .           .           .           .                                   .
---------   ---------   ---------   ---------                           ---------
| POSIX |   | POSIX |   | POSIX |   | POSIX |                           | POSIX |
---------   ---------   ---------   ---------                           ---------
Understanding volume spec
Now take a look at the spec file shown above, line by line.
- LINE 1: "volume unify"
Here 'volume' is an identifier for the parser that starts the definition of a new translator. 'unify' is the name of the volume, which must be unique within a given spec file.
- LINE 2: "type cluster/unify"
Here 'type' is an identifier for the parser, used to specify the type of the translator (there are many; see the list in the wiki or check the source). 'cluster/unify' is the path inside libdir where the shared object file is looked up: "/<libdir>/glusterfs/xlators/cluster/unify.so". As you can see, every translator in GlusterFS is loaded at run time after the spec file has been parsed, so the behaviour of GlusterFS is decided by the spec file.
- LINE 3: "option scheduler rr"
Here 'option' is an identifier for the parser, used to set the configurable flags of a translator. 'option' always takes two more parameters, which can be treated as a key and a value ('option key value'). 'scheduler' is an option of the unify translator that specifies which scheduler to use when creating a file. (Remember, a file is created on only one child node out of the 'n' child nodes.) GlusterFS already ships a few schedulers; see the wiki for further information, or check the 'schedulers/' directory in the release tarball. 'rr' is the name of the scheduler to be used ('rr' means round-robin). The other schedulers are named 'alu', 'random', and 'nufa'.
- LINE 4: "option rr.limits.min-free-disk 5"
'rr.limits.min-free-disk' is the key the rr scheduler uses to decide when to stop considering a child node for scheduling: a node is skipped once its available free disk falls below the given percentage (here 5%) of its total disk space.
- LINE 5: "option namespace namespace-child"
This option specifies that namespace-child is the place where the whole filesystem's namespace (i.e., the directory/file tree structure) is maintained. 'namespace-child' must be another volume that is already defined in the spec.
'namespace-child' is configured like any other server brick, except that its content is used only to manage the namespace. It should not be listed along with the other subvolumes.
- LINE 6: "subvolumes child1 child2 child3 child4"
Here 'subvolumes' is an identifier that tells the parser that the volumes that follow are the 'children' nodes of the current volume. 'subvolumes' takes one or more volume names as arguments. Here child1, child2, child3 and child4 are also called storage nodes, as these are the nodes where the actual data is stored.
- LINE 7: "end-volume"
Here 'end-volume' is an identifier that tells the spec file parser that the definition of 'volume <x>' is complete.
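The round-robin behaviour described for lines 3 and 4 can be sketched as a small Python model. This is only an illustration under stated assumptions, not the GlusterFS scheduler code: the Child and RoundRobinScheduler names, and the byte counts used to compute the free percentage, are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Child:
    """Hypothetical stand-in for one storage node."""
    name: str
    total_bytes: int
    free_bytes: int

    def free_percent(self) -> float:
        return 100.0 * self.free_bytes / self.total_bytes


class RoundRobinScheduler:
    """Toy model of 'option scheduler rr' plus 'rr.limits.min-free-disk'."""

    def __init__(self, children, min_free_disk_percent=5):
        self.children = list(children)
        self.min_free = min_free_disk_percent
        self._next = 0  # index of the child to try first on the next call

    def schedule(self) -> Child:
        # Try each child at most once, starting where the last call stopped.
        for offset in range(len(self.children)):
            idx = (self._next + offset) % len(self.children)
            child = self.children[idx]
            # A node below the free-disk limit is skipped for scheduling.
            if child.free_percent() >= self.min_free:
                self._next = (idx + 1) % len(self.children)
                return child
        raise OSError("no child node has enough free disk")


if __name__ == "__main__":
    nodes = [
        Child("child1", 100, 50),
        Child("child2", 100, 3),   # 3% free: below the 5% limit
        Child("child3", 100, 40),
        Child("child4", 100, 60),
    ]
    rr = RoundRobinScheduler(nodes, min_free_disk_percent=5)
    print([rr.schedule().name for _ in range(5)])
```

In this model child2 is never chosen while it stays under the limit, and the rotation simply continues with the next eligible node.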
NOTE: "option scheduler <x>" and "option namespace <volume>" are mandatory options; unify cannot complete its initialization without them.
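To make the keyword roles concrete, here is a minimal Python sketch of a parser for the spec syntax walked through above, including a check for the two mandatory unify options. It is not the GlusterFS parser; parse_spec and check_unify are hypothetical helper names, and treating '#' as a comment marker is an assumption taken from the way the sample spec is written.

```python
SPEC = """
volume unify
  type cluster/unify
  option scheduler rr               # check alu, random, nufa
  option rr.limits.min-free-disk 5  # 5% of free disk is minimum.
  option namespace namespace-child
  subvolumes child1 child2 child3 child4
end-volume
"""


def parse_spec(text):
    """Return {volume name: {"type", "options", "subvolumes"}}."""
    volumes = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        keyword, *args = line.split()
        if keyword == "volume":
            name = args[0]
            if name in volumes:
                raise ValueError(f"duplicate volume name: {name}")
            current = {"type": None, "options": {}, "subvolumes": []}
            volumes[name] = current
        elif keyword == "end-volume":
            current = None
        elif current is None:
            raise ValueError(f"'{keyword}' outside a volume definition")
        elif keyword == "type":
            current["type"] = args[0]
        elif keyword == "option":
            # 'option key value': the first word is the key, the rest the value.
            current["options"][args[0]] = " ".join(args[1:])
        elif keyword == "subvolumes":
            current["subvolumes"].extend(args)
        else:
            raise ValueError(f"unknown keyword: {keyword}")
    return volumes


def check_unify(volume):
    """Raise if a cluster/unify volume lacks its mandatory options."""
    missing = [k for k in ("scheduler", "namespace") if k not in volume["options"]]
    if missing:
        raise ValueError("unify needs option(s): " + ", ".join(missing))


if __name__ == "__main__":
    unify = parse_spec(SPEC)["unify"]
    check_unify(unify)
    print(unify["type"], unify["options"], unify["subvolumes"])
```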
Design of Unify
- Whenever a directory has to be created in GlusterFS, it is created on all the child nodes, including the namespace.
- Whenever a file (or symlink, or mknod entry) has to be created, unify first checks the namespace. If the entry is not already there, it is created in the namespace and on one of the child nodes, chosen by the scheduler. If it is already present in the namespace, an open request is sent to the node where the file exists.
- A 'rename()' call is sent to the namespace node first and, if that succeeds, to the node which contains the source file. If the destination file exists and is on a different node, an unlink fop is sent for that file.
- A 'link()' call is sent to the same node where the source file exists. ('link' is the call used to create hard links.)
- When a 'lookup()' call comes to unify, it is sent to all the child nodes, including the namespace. If the file is present, the lookup returns successfully from the namespace and from the node that holds the file. If it is a directory, the lookup should succeed on all the child nodes. If the file is not present, unify itself returns -1.
- All the fops work on the result of lookup: when a file or directory is looked up, an inode map is maintained that records which child node has which inode number for the given file. All the fops send their requests to the child nodes according to this mapping.
- Whenever a stat structure ('struct stat *') is returned in a _cbk(), it is the one returned from the namespace, because the inode number of the namespace entry is what is sent to the FUSE layer to keep inode numbers persistent.
- If the stat is for a file, only the 'st_size' and 'st_blocks' fields are taken from the _cbk() of the storage node which contains the file, since the namespace entry has size 0.
- In 'readdir_cbk()', unify returns the dirents from the namespace, but also checks for consistency across all the storage nodes. The stat ('struct stat *') for each entry is handled as explained in the previous point.
- In 'statfs_cbk()', unify adds up the free disk space, total disk space, etc. of all the storage nodes and returns the unified value to the layer above.
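The placement rules above can be summarised in a toy in-memory model. This Python sketch only mirrors the behaviour described in the list (directories everywhere, files on one node plus a size-0 namespace entry, size taken from the storage node, statfs summed over storage nodes). Node and ToyUnify are hypothetical names, and itertools.cycle stands in for whichever scheduler is configured.

```python
import itertools


class Node:
    """Stand-in for one child volume (hypothetical, not a GlusterFS type)."""

    def __init__(self, name, total_bytes=0, free_bytes=0):
        self.name = name
        self.total_bytes = total_bytes
        self.free_bytes = free_bytes
        self.entries = {}  # path -> {"kind": "dir" or "file", "size": int}


class ToyUnify:
    def __init__(self, namespace, children):
        self.namespace = namespace
        self.children = list(children)
        self._scheduler = itertools.cycle(self.children)  # plain round-robin

    def mkdir(self, path):
        # Directories exist on every storage node and on the namespace.
        for node in [self.namespace, *self.children]:
            node.entries[path] = {"kind": "dir", "size": 0}

    def lookup(self, path):
        if path not in self.namespace.entries:
            raise FileNotFoundError(path)
        result = dict(self.namespace.entries[path])
        if result["kind"] == "file":
            holders = [n for n in self.children if path in n.entries]
            if not holders:
                raise FileNotFoundError(path)
            # The namespace entry has size 0; the real size is on the holder.
            result["size"] = holders[0].entries[path]["size"]
            result["node"] = holders[0].name
        return result

    def create(self, path, size=0):
        # Already in the namespace: hand back the node that holds the file.
        if path in self.namespace.entries:
            return self.lookup(path).get("node")
        target = next(self._scheduler)
        self.namespace.entries[path] = {"kind": "file", "size": 0}
        target.entries[path] = {"kind": "file", "size": size}
        return target.name

    def statfs(self):
        # Only the storage nodes contribute to the unified totals.
        return {
            "total_bytes": sum(n.total_bytes for n in self.children),
            "free_bytes": sum(n.free_bytes for n in self.children),
        }


if __name__ == "__main__":
    ns = Node("namespace-child")
    fs = ToyUnify(ns, [Node("child1", 100, 40), Node("child2", 100, 60)])
    fs.mkdir("/docs")
    print(fs.create("/docs/a.txt", size=10), fs.create("/docs/b.txt", size=20))
    print(fs.lookup("/docs/a.txt"), fs.statfs())
```

The model leaves out rename, link, the inode map and all error paths; it is meant only to show where entries end up.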
Design of self-heal in Unify
- When a 'lookup()/stat()' call is made on a directory for the first time, a self-heal call is made, which checks the consistency of its child nodes. If an entry is present on a storage node but not in the namespace, that entry is created in the namespace, and vice versa. A writedir() API was introduced for this purpose. Self-heal also checks permissions and uid/gid consistency.
- This check is also done when a server goes down and comes back up.
- If one starts with an empty namespace export but has data on the storage nodes, a 'find . >/dev/null' or 'ls -lR >/dev/null' should help to build the namespace in one shot. Even otherwise, the namespace is built on demand when a file is looked up for the first time.
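The entry reconciliation described above can be sketched as follows. This is a rough Python model, not the unify self-heal code: self_heal is a hypothetical function, directory listings are plain dicts, and permission and uid/gid checks are left out. One assumption is labelled in the code: in the namespace-to-storage direction only directories are recreated on every storage node, since by design a file lives on just one node.

```python
def self_heal(namespace, storage_nodes):
    """Reconcile one directory's entries.

    namespace: dict mapping entry name -> "dir" or "file"
    storage_nodes: dict mapping node name -> dict of the same shape
    Returns a list of (where, entry name) pairs that were created.
    """
    created = []

    # Entries found on a storage node but missing from the namespace
    # are created in the namespace.
    for entries in storage_nodes.values():
        for name, kind in entries.items():
            if name not in namespace:
                namespace[name] = kind
                created.append(("namespace", name))

    # Assumption: only directories are pushed back out to storage nodes,
    # because a file is expected on exactly one node.
    for name, kind in list(namespace.items()):
        if kind != "dir":
            continue
        for node_name, entries in storage_nodes.items():
            if name not in entries:
                entries[name] = "dir"
                created.append((node_name, name))

    return created


if __name__ == "__main__":
    namespace = {}  # empty namespace export
    storage = {
        "child1": {"docs": "dir", "a.txt": "file"},
        "child2": {"b.txt": "file"},
    }
    print(self_heal(namespace, storage))
    print(namespace)
```

Starting from an empty namespace, one pass over the directory rebuilds its namespace entries, which is the effect the 'find' or 'ls -lR' walk relies on.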
NOTE: Some issues (kernel 'Oops' messages) have been seen with fuse-2.6.3 when someone deletes the namespace in the backend while glusterfs is running. With fuse-2.6.5 this issue does not occur.
Read-only volumes and unify
In 1.3.2 and mainline-2.5 (9/26/2007), it is not possible to prevent the schedulers from scheduling writes to read-only volumes.
The option "read-only-subvolumes" is available in the schedulers from the 1.3.8preX releases onwards (and in TLA after the 1.3.7 release).
Configuration needed to have redundant namespace bricks
Q: What happens when the namespace brick crashes? Does that f** your namespace? If so, how is a redundant namespace brick configured -- using afr?
A: If the namespace is configured with AFR, the failure of one namespace brick should not affect the filesystem. But if the namespace is not redundant, then yes, the file system will not be accessible until the brick comes back (this is the same problem GlusterFS had in earlier versions with the lock-server). A design for a distributed namespace is in progress, so that the impact of such a failure is as small as possible. Until then, if you use unify without a redundant namespace, there is a single point of failure.
Namespace FAQ
Q: Does the size of the unify translator's namespace affect the storage capacity of my cluster?
A: No. The storage capacity of your cluster isn't affected by the size of your namespace. The unify translator uses the namespace to store information about the files; the data itself is stored on the bricks. The namespace usually needs little space.
Example: a namespace holding 666000+ files on reiserfs takes 24MB of disk space, though this varies from filesystem to filesystem. The important criterion for a namespace is inode availability: whether your namespace export can provide enough inodes for all your data.
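Since inode availability is the limiting factor, it can be checked before populating a namespace export. The following Python sketch uses os.statvfs (available on Unix) to compare the free inode count against the number of entries you expect. namespace_has_room is a hypothetical helper, and treating a reported inode total of 0 as "no fixed limit" is an assumption about filesystems that allocate inodes dynamically.

```python
import os


def namespace_has_room(namespace_dir, expected_entries):
    """Return True if the filesystem under namespace_dir reports enough free inodes."""
    st = os.statvfs(namespace_dir)
    # Assumption: a total of 0 means the filesystem has no fixed inode table.
    if st.f_files == 0:
        return True
    # f_favail is the number of free inodes available to unprivileged users.
    return st.f_favail >= expected_entries


if __name__ == "__main__":
    # "." is only a placeholder; point this at the namespace export directory.
    print(namespace_has_room(".", 666000))
```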
Q: Can someone explain whether, and why, I need a namespace with the unify translator? Is it mandatory or not? If not, what's the disadvantage of not using a namespace?
A: Thinking about it in general terms (without going very deep technically), you can see how complex cluster filesystems are. Given our goal of not having a metadata server for the filesystem, we had to think of some way to provide a unified view of the namespace.
We could have given a unified view by doing parallel readdir()s on the servers, but that would certainly hurt performance. More than that, during file creation, if two nodes tried to create files with the same name without a centralized server maintaining namespace locks, it would lead to filesystem corruption. Hence we came up with a namespace design that does not hold any critical data and that can be rebuilt from scratch if you later choose to use a different, new namespace.
So the namespace brick is required for high performance and proper functioning of the cluster file system. So far we have not hit performance or scalability issues from having just one namespace (other than when the namespace brick goes down and it is not AFR'd). But since we know it is going to be an issue in the future, we have decided to bring in a distributed namespace from the 1.4.x versions.
As for your question of whether it is mandatory to use a namespace: YES, it is. If you don't give this option, you will not be able to mount the GlusterFS filesystem. (As I said earlier, this is going to change from the 1.4.x releases.)


