Taxonomy of Distributed File System Functions

BACKGROUND

The deployment of Storage Area Networks (SAN) and Network Attached Storage (NAS) has the goal of improving many facets of file system behavior. Ultimately, the possibility of attaching user machines directly to disks offers the hope of maximizing the performance, scalability, flexibility, and adaptability of the storage system. A big problem, however, is that existing distributed file system protocols do not accommodate the idea of having the client talk directly to the disk. This is no surprise, of course: NFS, AFS, and DFS were all designed, to varying degrees, with the idea of moving work from the servers to the clients, but none of them considered the possibility that the server would not have complete control of its disks.

The idea of further improving performance and scalability by also moving work from the server to the disk is certainly appealing. The simplest schemes involve little more than porting the SCSI protocol to run over a network. More ambitious plans call for object-oriented (OO) disks, which export a file-like abstraction. Even present-day disks do a lot more than just store data on physically addressed disk sectors.

How should a distributed file system evolve to take advantage of smarter, client-accessible disks? Deciding how to accomplish this seems to be a matter of partitioning the file system's functions among the various agents: the client (close to the user), the disk (close to the data), and the file server. In the limit the file server would have no functions left at all, though other servers (location, replication, backup, and security) will continue to be important.

SERVER-LITE FILE SYSTEM

A traditional file system transports data from producers to consumers (both are called "users") across time. A distributed file system transports data between producers and consumers (both now called "clients") across time and space. From the hardware perspective, the time transport is provided by disks and the space transport by networks. "Servers" don't appear at all in this high-level picture. To the extent that servers are *forced* into the picture they are likely to have a negative impact on performance and scalability. Looking at a distributed file system in more detail, what intrinsic functions do servers have? If we consider the job of a distributed file system we have these major tasks:

data storage -- remembering data: transporting it through time.

naming -- the mapping of human-visible names to objects. Once time and space are conquered, naming is everything. I believe global pathnames are the correct model. These names lead to objects via an indirection layer provided by a file identifier (fid). This approach is shared by AFS and DFS. In other parts of the solution space we have NFS (fids, but no global names), CIFS (global names, but no fids), and the WWW (global names, but no fids).

data location -- determining the location of an object. This is certainly the most complex and important part of a distributed file system. The goals of scalability, performance, and fault tolerance all demand that nearby copies be preferred to distant, unavailable, or overloaded sources. This leads to the caching and replication solutions in AFS/DFS, but more flexibility and adaptability will continue to improve performance. In practice location is typically tied up with naming: even in AFS/DFS, mountpoints in the name space have locational implications. Tying location to naming is not necessary, however, and it always reduces flexibility.
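To make the two indirection layers concrete, here is a minimal sketch in Python. It is my own illustration, not part of AFS, DFS, or any existing protocol: directories map name components to fids, and a small location map (anticipating the per-client location database discussed later) maps each fid to the candidate sources for a copy.

    # A minimal sketch (my own illustration, not any existing protocol) of the
    # two indirections discussed above: pathnames resolve to fids, and a
    # separate location map resolves fids to candidate sources.

    from typing import Dict, List

    # Each directory maps a name component to a file identifier (fid).
    # The fid is an arbitrary tag; here it is just a string.
    Directory = Dict[str, str]

    class MiniLocationMap:
        """Maps a fid to the places where copies may currently be found."""
        def __init__(self) -> None:
            self.locations: Dict[str, List[str]] = {}

        def register(self, fid: str, source: str) -> None:
            self.locations.setdefault(fid, []).append(source)

        def lookup(self, fid: str) -> List[str]:
            # Callers should prefer nearby or lightly loaded sources;
            # ranking by route and load is a separate concern (data distribution).
            return self.locations.get(fid, [])

    def resolve(path: str, root: Directory, dirs: Dict[str, Directory]) -> str:
        """Walk a global pathname component by component, returning the fid."""
        current = root
        fid = ""
        for component in path.strip("/").split("/"):
            fid = current[component]          # name -> fid indirection
            current = dirs.get(fid, {})       # descend if it is a directory
        return fid

    # Usage: resolve("/afs/cell/user/notes.txt", root_dir, all_dirs) yields a fid,
    # and MiniLocationMap.lookup(fid) yields the candidate sources to fetch from.

The point of the separation is exactly the one argued above: the name space decides *what* object a name denotes, while an independent location service decides *where* to get it.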
security -- this takes the form of the twin tasks of authentication and authorization. This is always a thorny but important problem. Issues like "who decides?" (user / administrator, a.k.a. discretionary / non-discretionary) complicate the already difficult authentication problem[1]. The presence of caching and replication has created a significant difference between read and write permission. The authorization problem develops two faces: access to data and control of the name space. Read permission controls who can see the data. Imagine that all data were encrypted and signed; read permission then reduces to who can access the decryption key. Writing to signed data just replaces it, so write permission reduces to control of the mapping between names and objects. The occurrence of bona fide database-like files that actually need write permission controls is quite rare. Directories, however, are a common example of database-like objects.

data transport -- moving the data through space between producers, consumers, and disks. The protocol for doing this isn't really part of a file system; unfortunately most distributed file systems come with a (unique) network protocol anyway, and there is no end to the proliferation of protocols in sight. Ideally a file system should be as agnostic about data transport as possible.

data distribution -- performance is a large challenge for distributed file systems. The two big mechanisms for getting good delivery performance are caching and replication. An overlooked consideration is routing: some routes may be much faster or cheaper. Another issue is server responsiveness (loaded / idle, fast / slow).

sharing -- while there are some other advantages of distributed file systems, the most important one is data sharing. For users to find shared data coherent, the cached (and replicated) data must have well-defined consistency.

storage management -- a smaller and probably decreasing advantage of distributed file systems is off-loading storage from the client. The tremendous advantage of data sharing means that the amount of individual storage a user or client needs is quite small, easily provided in most cases by the capacities of today's disks. The advantage of being able to access individual storage from whatever client machine one is using, the "no matter where you go, there you are"[7] feature, is very convenient in some environments. Providing storage for multiple users introduces the problems of sharing it, namely quota management.

data management -- another small, but probably growing, advantage of distributed file systems is that data services with large economies of scale, such as 24x7 operation, hierarchical storage, and backup, can be provided conveniently and cheaply.

Architecturally, DFS and AFS tie all these functions intimately to the server; only data location is provided by a different server. The file system's security function is supported by an auxiliary server, but the file server still evaluates the mode bits and ACLs. Data transport is tied to DCE/RPC for DFS and Rx for AFS (even NFS, LANMAN, and web browsers have their own protocols). The server holds all the cards.

To investigate the architecture of distributed file systems in the small-server limit, I am going to disregard the constraints of existing systems and consider a family of technologically possible scenarios.
Without describing new features in great detail, I'll outline which of the functions listed above can reasonably be moved to the client or the disk; then we'll see what is left.

naming --

  low in the hierarchy -- allow servers to delegate long-term token management[8] for a directory to a client that is actively updating it, and subsequently redirect other clients to the delegatee. Clients that have responsibility for a directory will have to export a token management interface as well as server-like operations on directories (so other clients can make changes without forcing a change of delegation).

  high in the hierarchy -- replication, with its advantages (scalability, clean snapshots) and disadvantages (outdated contents), is still the preferred approach.

data storage -- there isn't much argument that interposing a server between client and disk adds latency, restricts bandwidth, and reduces scalability. I see two components to this.

  metadata -- in the limit, smart disks will allow the client to read and write data to the disk without contacting the server to obtain metadata.

  fid assignment -- currently the server assigns the fid for a new file, which puts the server into the file creation path. An object-oriented disk will need some way to name objects (files). If the client can assign fids, it can do file creation locally, at least in directories delegated to it. The implication for the data storage system is that it must treat the file identifier as an arbitrary tag. This will require the use of a hash table, or similar structure, to locate file metadata instead of relying on the internal structure of a fid the storage system assigned itself.

data location -- clients can be much more active at maintaining and updating a comprehensive, persistent idea of data location (i.e. a mini fileset location database (fldb)), and they can even swap their mini-fldbs with each other. But, ultimately, the advantages of having specialized servers collect, collate, and publish data location information will be compelling. Because location information is fail-safe (at least when supported by some form of authentication), a collection (even a hodge-podge one) of complementary data location services will outperform highly structured and regimented services. This looks like a robustly server-centric task.

security --

  hash fids -- if clients assign fids, the most convenient and useful fid they can select is a cryptographically strong hash of the file's contents. This provides automatic data authentication, thus freeing data location services and their clients from any security worries.

  private data -- clients generating non-public data can encrypt it at the source, which removes any security risks from data transport or storage. By creating certificates that make the encryption key available to the desired recipients, the client handles the access control function itself. With public key (PK) cryptography the client can do this with only occasional reference to a key server; using secret key systems it will need to contact a certificate service for each distinct ACL[3] it uses. Readers will have to decode certificates to access the desired data; a certificate cache should make this no more onerous than it was for the data's writer. One casualty of this model is non-discretionary access control (a big Defense Department desideratum during the 70s and 80s): the client alone decides who can see the data[4]. Possibly non-discretionary access control policies could be enforced by a certificate service in a secret-key-based environment.
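To make the hash-fid and private-data items concrete, here is a minimal sketch in Python. The names (assign_fid, PrivateWriter, Certificate), the choice of SHA-256, and the use of the third-party cryptography package's Fernet recipe are my own assumptions, purely for illustration; the point is only that the fid is a strong hash of the stored bytes and that read permission reduces to possession of a key conveyed by a certificate.

    # A minimal sketch of hash fids and encryption at the source. It is my own
    # illustration, not an existing protocol: fids are SHA-256 hashes of the
    # stored bytes, private data is encrypted with a per-ACL key, and a
    # "certificate" is modeled as a plain record granting that key to a named
    # reader (a real system would seal the key so only that reader can open it).

    import hashlib
    from dataclasses import dataclass
    from typing import Dict, Tuple

    from cryptography.fernet import Fernet   # third-party 'cryptography' package

    def assign_fid(stored_bytes: bytes) -> str:
        """The client assigns the fid: a cryptographically strong hash of the
        bytes actually handed to the storage system (ciphertext, if private)."""
        return hashlib.sha256(stored_bytes).hexdigest()

    def verify(fid: str, fetched_bytes: bytes) -> bool:
        """Automatic data authentication: any source at all can be used, because
        the reader re-hashes what it fetched and compares it to the fid."""
        return hashlib.sha256(fetched_bytes).hexdigest() == fid

    @dataclass
    class Certificate:
        """Grants the decryption key for one ACL to one reader."""
        acl_name: str
        reader: str
        key: bytes          # in practice: sealed so only `reader` can open it

    class PrivateWriter:
        """A client that encrypts non-public data at the source, one key per ACL
        (see note [3]), and issues certificates to the intended readers."""
        def __init__(self) -> None:
            self.acl_keys: Dict[str, bytes] = {}

        def store(self, plaintext: bytes, acl_name: str) -> Tuple[str, bytes]:
            key = self.acl_keys.setdefault(acl_name, Fernet.generate_key())
            ciphertext = Fernet(key).encrypt(plaintext)
            return assign_fid(ciphertext), ciphertext

        def grant(self, acl_name: str, reader: str) -> Certificate:
            return Certificate(acl_name, reader, self.acl_keys[acl_name])

    def read_private(fid: str, fetched: bytes, cert: Certificate) -> bytes:
        """A reader verifies the data against its fid, then decrypts it with the
        key obtained from a certificate (typically via a certificate cache)."""
        assert verify(fid, fetched), "data does not match its hash-fid"
        return Fernet(cert.key).decrypt(fetched)

Hashing the ciphertext rather than the plaintext is deliberate in this sketch: any location service or cache can then verify and serve the object without being able to read it, which is exactly the "no security worries" property claimed for hash fids above.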
  write access -- with hash-based fids, data is immutable, so write access doesn't apply. Instead, control of the mapping between names and data resides in the name space. This makes it clear that directory delegation must be limited to clients with write access to the directory, and that delegations must be conveyed via secure and authenticated channels. Directories, except in read-only replicas, cannot use hash-based fids. They really are (usually) small databases with well-defined update semantics implemented by a server. The idea proposed above is to allow delegation of this server role to authorized clients on a per-directory basis.

data transport -- using hash-based fids and encryption at the source, the data transport is freed from many complications now placed on it. Any old thing will do: Kermit, TCP/ftp, NetSCSI, SAN/fibre channel, DirectPC, or CD-by-FedEx. One still needs an RPC interface to talk to servers and other clients, but it becomes easier to decouple the control and data channels. This decoupling is exactly what is needed for SAN and even DirectPC satellite systems, and it will become more important as network technology grows more complex with a richer mix of price / performance trade-offs.

sharing -- using hash-based fids eliminates the need for a cache consistency protocol on files[2], confining this function to directories. This provides a big improvement in scalability, since there are many more files than directories. Caching directories will require token management overhead on a server: tracking the client delegates for each directory, and doing full token management (and handling directory updates) for undelegated directories.

data distribution --

  caching -- imposes little burden on any server. For plain files, the only issue is finding and fetching the data[5]. For directories the client either becomes the delegate and "owns" the directory or contacts another client or server, but it doesn't care which. Another advantage of hash-fids and end-to-end encryption is that files can be safely fetched from other clients' caches; call this "cache sharing". Such a client may be little more than an OO disk with extra smarts enabling it to perform cache replacement according to some policy. Attractive as this sounds, it adds to the data location job; this can be viewed as a problem or an opportunity.

  replication -- it is a small step from cache sharing of immutable file objects to replication. The main difference is that the unit of replication is a collection of files (e.g. a DFS fileset), which has some semantic meaning, while caching is done blindly (without regard to the semantic relationships between files) on a per-file basis. With a little less structure (i.e. useful bits assembled from various sources) this could also be called mirroring. In another way cache sharing and replication are very different: with cache sharing there is zero semantic content and the data location problem is difficult, while with replication the semantic content is high and the data location problem is easy.

Another view of the data distribution problem is sorting the output of the data location process to provide the best performance or lowest cost. Part of selecting the best source for a file is considering the characteristics of the routes available to reach that source. Another factor will be the behavior of the source itself: an under-utilized, but distant, source may be preferred to an overloaded nearby one. A service that collects, collates, and publishes data on sources and routes would be an important component of an efficient distributed file system.
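As a sketch of this sorting step, here is a minimal example; the cost model (route latency inflated by a load penalty) and the field names are my own assumptions, not part of any existing system, and serve only to show that route characteristics and source load both enter the ranking.

    # A minimal sketch of sorting the output of data location by estimated cost.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Source:
        name: str
        route_latency_ms: float   # characteristics of the route to this source
        load: float               # 0.0 = idle, 1.0 = saturated

    def estimated_cost(s: Source) -> float:
        # A heavily loaded source is penalized enough that an under-utilized
        # but distant source can win, as argued above.
        return s.route_latency_ms * (1.0 + 4.0 * s.load)

    def rank_sources(candidates: List[Source]) -> List[Source]:
        """Order the candidates returned by data location, cheapest first."""
        return sorted(candidates, key=estimated_cost)

    # Usage:
    #   rank_sources([Source("nearby-cache", 2.0, 0.95),
    #                 Source("distant-replica", 5.0, 0.10)])
    # prefers the distant, lightly loaded replica (cost 7.0) over the
    # overloaded nearby cache (cost 9.6).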
storage management -- requires allocating a shared resource: the disk hardware. Ultimately, the disk owner must establish a mechanism controlling the use of the disk, likely implemented by a server of some sort. If the disk is managing file metadata, it will be in the best position to track disk usage and enforce limits. However, it will collaborate with the owner's server to update the limits, and perhaps with an agent of the storing client to communicate this information in a timely fashion. If multiple clients are sharing the same disk storage they may need to cooperate with each other so that one doesn't unexpectedly consume the space needed by another to store its data.

data management -- is, by definition, a centrally managed process. Like any other process, its cost and performance will depend on the extent to which it can be automated. This naturally suggests an intelligent server; neither the client nor the disk is in a good position to perform these types of functions.

SUMMARY

    naming              / low           -- client
                        / high          -- server
    data storage        / metadata      -- disk
                        / fid asgn      -- client
    data location                       -- server
    security            / private data  -- client
                        / write access  -- client
                        / non-PK envs   -- server
    data transport                      -- client + disk
    sharing             / files         -- client
                        / dirs          -- client + server
    data dist           / caching       -- client
                        / other         -- server
    storage management                  -- disk + server
    data management                     -- server

The result of this investigation is that while servers do not disappear from the picture, they are not file servers: disks serve files. Instead we have data location servers, replication servers, security servers, data and storage management servers, and perhaps token management servers for undelegated directories.

Ted Anderson

NOTES

[1] To some extent technology is overrunning the traditional security issue, because the ubiquitous network and inexpensive storage mean that it is almost impossible to guarantee that "deleted" data will not surface some day or that "secure" data won't eventually leak out. The bedrock in the security business can be summarized by two aphorisms: "If you want something done right, do it yourself" (encrypt data at its source) and "Three may keep a secret, if two of them are dead" [Benjamin Franklin] (don't share the data's key more than absolutely necessary).

[2] The suitability of hash values as fids depends on how frequently data is changed. Statistics show that the vast majority of files are written only once, and most of the rest are rewritten shortly after being written the first time; the likelihood of writes to a file declines very rapidly as the time since the last write increases. By having the producer (the client that originally writes a file) delay assigning a fid until it needs to flush a file from its cache (or until another client requests it), the odds that a file will need to be overwritten and its hash-fid reassigned can be made very small. This is especially easy if the directory in which the file is being created has been delegated to the same client. A library providing the familiar rewrite interface could be written using copy-on-write logic to accommodate an underlying file system interface that only supports write-once semantics. The few applications that really require updating files in place should generally be using a real database, not a file system.
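As a rough illustration of the copy-on-write library mentioned in note [2], here is a minimal sketch assuming only a write-once store keyed by hash-fid; the WriteOnceStore and RewritableFile names are my own, purely for illustration.

    # A minimal sketch of the rewrite-on-top-of-write-once idea in note [2]:
    # an "update" buffers changes locally and, on flush, stores a brand-new
    # immutable object with a freshly assigned hash-fid, then rebinds the
    # name-to-fid mapping in the directory.

    import hashlib
    from typing import Dict

    class WriteOnceStore:
        """Stores immutable objects keyed by their hash-fid."""
        def __init__(self) -> None:
            self.objects: Dict[str, bytes] = {}

        def put(self, data: bytes) -> str:
            fid = hashlib.sha256(data).hexdigest()   # fid assigned at flush time
            self.objects[fid] = data
            return fid

        def get(self, fid: str) -> bytes:
            return self.objects[fid]

    class RewritableFile:
        """Presents the familiar rewrite interface over write-once storage."""
        def __init__(self, store: WriteOnceStore, directory: Dict[str, str],
                     name: str) -> None:
            self.store, self.directory, self.name = store, directory, name
            fid = directory.get(name)
            self.buffer = bytearray(store.get(fid)) if fid else bytearray()

        def write(self, offset: int, data: bytes) -> None:
            # Copy-on-write: mutate only the local buffer, never stored objects.
            self.buffer[offset:offset + len(data)] = data

        def flush(self) -> str:
            # Delayed fid assignment: the new contents get a new hash-fid, and
            # the directory entry is rebound to it (the old object is untouched).
            new_fid = self.store.put(bytes(self.buffer))
            self.directory[self.name] = new_fid
            return new_fid

Delaying put() until flush is exactly the delayed fid assignment argued for in note [2]: repeated writes shortly after creation never reach the store, so the hash-fid rarely needs to be reassigned.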
[3] By using the same encryption key for all files with the same ACL, the client can avoid contacting the certificate service for each file. In practice, it may be desirable to change the keys a little more often, say once per day or once per directory. With a public key system, a public key encryption (e.g. a bignum exponentiation) is required about as often as a secret key system would need to contact a server. On very lightweight clients, the CPU burden of PK operations may be onerous; in some environments they could be off-loaded to a trusted PK server with hardware support for the requisite arithmetic operations.

[4] The use of encryption and certificates to replace trusted servers and ACLs means that revocation is no longer as easy as it once appeared. Ever since the invention of paper, the confidence one could have in revocation of privilege has always been somewhat limited. With widespread caching, replication, off-shore mirrors, and anonymous mailers augmenting floppy disks, writable CDs, and laptops, revocation has never been shakier. In this model, however, revocation of data requires decrypting it, re-encrypting it with a new key, and issuing a new certificate. Revocation of a user's privilege in a PK environment requires revocation of all data whose certificates mention that user; this may frequently be impossible. With a secret key system utilizing a trusted certificate server, it only requires revoking that user's credentials. For this reason alone, not the performance issues often cited, PK systems may have trouble catching on in corporate and government environments.

[5] A consequence of using hash-based fids is that chunking, as used by DFS and AFS, doesn't really make much sense: the client must fetch the whole file to verify that the hash of the data matches the fid it requested. Similarly, encryption and decryption will probably be easier if the whole file is available at one time. Very large files would be unwieldy, so instead a maximum size can be imposed which is convenient for caching, efficient transfer, hashing, and encryption (say 1 MB). Larger files can be composed by assembling individual files containing each part of the whole large file. The file describing the assemblage could be either a simple array of offsets and hash-fids or something more sophisticated like XML (a sketch of such a manifest follows these notes).

[7] "Wherever you go there you are" -- Buckaroo Banzai, or "No matter where you go, there you are..." Also see http://www.slip.net/~figment/bb/q32.html.

[8] By "token management" I mean something like DFS read-data and write-data tokens. In DFS these allow byte-range granularity, but for directories the special case of whole-file data tokens would be adequate. It would be convenient to provide byte-range data and lock tokens for a few applications, but this service could be offered separately and probably doesn't need to be welded into the file system.
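To make the assemblage in note [5] concrete, here is a minimal sketch. The 1 MB part size comes from the note; the manifest serialization (a JSON list of offset/fid pairs) and the function names are my own assumptions for illustration only.

    # A minimal sketch of composing a large file from <= 1 MB parts (note [5]).
    # Each part is an ordinary hash-fid object; the manifest describing the
    # assemblage is itself stored as a hash-fid object.

    import hashlib
    import json
    from typing import Dict, List, Tuple

    PART_SIZE = 1 << 20   # 1 MB, the convenient maximum suggested in note [5]

    def store_part(store: Dict[str, bytes], data: bytes) -> str:
        fid = hashlib.sha256(data).hexdigest()
        store[fid] = data
        return fid

    def store_large_file(store: Dict[str, bytes], data: bytes) -> str:
        """Split into parts, store each, then store the manifest: a simple
        array of (offset, hash-fid) pairs, as note [5] suggests."""
        manifest: List[Tuple[int, str]] = []
        for offset in range(0, len(data), PART_SIZE):
            part = data[offset:offset + PART_SIZE]
            manifest.append((offset, store_part(store, part)))
        return store_part(store, json.dumps(manifest).encode())

    def read_large_file(store: Dict[str, bytes], manifest_fid: str) -> bytes:
        """Reassemble the file, verifying each part against its hash-fid."""
        out = bytearray()
        for offset, fid in json.loads(store[manifest_fid]):
            part = store[fid]
            assert hashlib.sha256(part).hexdigest() == fid
            out[offset:offset + len(part)] = part
        return bytes(out)

Each part stays within the convenient size for caching, transfer, hashing, and encryption, while the manifest keeps the whole-file abstraction intact.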
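Finally, a minimal sketch of the whole-directory data tokens note [8] has in mind; the class and method names are my own assumptions, and byte-range tokens are deliberately omitted since, for directories, the whole-object special case is adequate.

    # A minimal sketch of whole-directory read/write data tokens (note [8]).

    from typing import Dict, Set

    class DirectoryTokenManager:
        """Tracks, per directory fid, which clients hold read or write tokens."""
        def __init__(self) -> None:
            self.readers: Dict[str, Set[str]] = {}
            self.writer: Dict[str, str] = {}

        def grant_read(self, dir_fid: str, client: str) -> None:
            # Any number of readers may coexist, but not with a writer.
            if dir_fid in self.writer:
                self.revoke(dir_fid, self.writer.pop(dir_fid))
            self.readers.setdefault(dir_fid, set()).add(client)

        def grant_write(self, dir_fid: str, client: str) -> None:
            # A writer is exclusive: revoke all other outstanding tokens first.
            for reader in self.readers.pop(dir_fid, set()) - {client}:
                self.revoke(dir_fid, reader)
            old = self.writer.get(dir_fid)
            if old is not None and old != client:
                self.revoke(dir_fid, old)
            self.writer[dir_fid] = client

        def revoke(self, dir_fid: str, client: str) -> None:
            # In a real system this is the callback telling `client` to discard
            # its cached copy (or return the directory's updated contents).
            pass

This is the role that either a server or a delegate client would play for a directory; offering it as a separate service, rather than welding it into the file system, is the point note [8] makes.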