Taxonomy of Distributed File System Functions

BACKGROUND

The deployment of Storage Area Networks (SAN) and Network Attached Storage (NAS) has the goal of improving many facets of file system behavior. Ultimately, the possibility of attaching user machines directly to disks offers the hope of maximizing the performance, scalability, flexibility, and adaptability of the storage system. A big problem, however, is that existing distributed file system protocols do not accommodate the idea of having the client talk directly to the disk. This is no surprise, of course: NFS, AFS, and DFS were all designed, to varying degrees, with the idea of moving work from the servers to the clients, but none of them considered the possibility that the server would not have complete control of its disks.

The idea of further improving performance and scalability by also moving work from the server to the disk is certainly appealing. The simplest schemes involve little more than porting the SCSI protocol to run over a network. More ambitious plans call for object-oriented (OO) disks, which export a file-like abstraction. Even present-day disks do a lot more than just store data on physically addressed disk sectors.

How should a distributed file system evolve to take advantage of smarter, client-accessible disks? Deciding how to accomplish this seems to be a matter of partitioning the file system's functions among the various agents: the client (close to the user), the disk (close to the data), and the file server. In the limit the file server would have no functions left at all, though other servers (location, replication, backup, and security) will continue to be important.

SERVER-LITE FILE SYSTEM

A traditional file system transports data from producers to consumers (both are called "users") across time. A distributed file system transports data between producers and consumers (both now called "clients") across time and space. From the hardware perspective, the time transport is provided by disks and the space transport by networks. "Servers" don't appear at all in this high-level picture. To the extent that servers are *forced* into the picture they are likely to have a negative impact on performance and scalability. Looking at a distributed file system in more detail, what intrinsic functions do servers have? If we consider the job of a distributed file system we have these major tasks:

data storage -- remembering data: transporting it through time.

naming -- the mapping of human-visible names to objects. Once time and space are conquered, naming is everything. I believe global pathnames are the correct model. These names lead to objects via an indirection layer provided by a file identifier (fid). This approach is shared by AFS and DFS. In other parts of the solution space we have NFS (fids, but no global names), CIFS (global names, but no fids), and the WWW (global names, but no fids).

data location -- determining the location of an object. This is certainly the most complex and important part of a distributed file system. The goals of scalability, performance, and fault tolerance all demand that nearby copies be preferred to distant, unavailable, or overloaded sources. This leads to the caching and replication solutions in AFS/DFS, but more flexibility and adaptability will continue to improve performance. In practice location is typically tied up with naming: even in AFS/DFS, mountpoints in the name space have locational implications. Tying location to naming is not necessary, however, and it always reduces flexibility.
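To make the two indirection layers concrete, here is a minimal sketch in Python. It is my own illustration, not part of AFS, DFS, or any existing protocol: directories map name components to fids, and a small location map (anticipating the per-client location database discussed later) maps each fid to the candidate sources for a copy.

    # A minimal sketch (my own illustration, not any existing protocol) of the
    # two indirections discussed above: pathnames resolve to fids, and a
    # separate location map resolves fids to candidate sources.

    from typing import Dict, List

    # Each directory maps a name component to a file identifier (fid).
    # The fid is an arbitrary tag; here it is just a string.
    Directory = Dict[str, str]

    class MiniLocationMap:
        """Maps a fid to the places where copies may currently be found."""
        def __init__(self) -> None:
            self.locations: Dict[str, List[str]] = {}

        def register(self, fid: str, source: str) -> None:
            self.locations.setdefault(fid, []).append(source)

        def lookup(self, fid: str) -> List[str]:
            # Callers should prefer nearby or lightly loaded sources;
            # ranking by route and load is a separate concern (data distribution).
            return self.locations.get(fid, [])

    def resolve(path: str, root: Directory, dirs: Dict[str, Directory]) -> str:
        """Walk a global pathname component by component, returning the fid."""
        current = root
        fid = ""
        for component in path.strip("/").split("/"):
            fid = current[component]          # name -> fid indirection
            current = dirs.get(fid, {})       # descend if it is a directory
        return fid

    # Usage: resolve("/afs/cell/user/notes.txt", root_dir, all_dirs) yields a fid,
    # and MiniLocationMap.lookup(fid) yields the candidate sources to fetch from.

The point of the separation is exactly the one argued above: the name space decides *what* object a name denotes, while an independent location service decides *where* to get it.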
security -- this takes the form of the twin tasks of authentication and authorization. This is always a thorny but important problem. Issues like "who decides?" (user / administrator, a.k.a. discretionary / non-discretionary) complicate the already difficult authentication problem[1]. The presence of caching and replication has created a significant difference between read and write permission. The authorization problem develops two faces: access to data and control of the name space. Read permission controls who can see the data. Imagine that all data were encrypted and signed; read permission then reduces to who can access the decryption key. Writing to signed data just replaces it, so write permission reduces to control of the mapping between names and objects. The occurrence of bona fide database-like files that actually need write permission controls is quite rare. Directories, however, are a common example of database-like objects.

data transport -- moving the data through space between producers, consumers, and disks. The protocol for doing this isn't really part of a file system; unfortunately most distributed file systems come with a (unique) network protocol anyway, and there is no end to the proliferation of protocols in sight. Ideally a file system should be as agnostic about data transport as possible.

data distribution -- performance is a large challenge for distributed file systems. The two big mechanisms for getting good delivery performance are caching and replication. An overlooked consideration is routing: some routes may be much faster or cheaper. Another issue is server responsiveness (loaded / idle, fast / slow).

sharing -- while there are some other advantages of distributed file systems, the most important one is data sharing. For users to find shared data coherent, the cached (and replicated) data must have well-defined consistency.

storage management -- a smaller and probably decreasing advantage of distributed file systems is off-loading storage from the client. The tremendous advantage of data sharing means that the amount of individual storage a user or client needs is quite small, easily provided in most cases by the capacities of today's disks. The advantage of being able to access individual storage from whatever client machine one is using, the "no matter where you go, there you are"[7] feature, is very convenient in some environments. Providing storage for multiple users introduces the problems of sharing it, namely quota management.

data management -- another small, but probably growing, advantage of distributed file systems is that data services with large economies of scale, such as 24x7 operation, hierarchical storage, and backup, can be provided conveniently and cheaply.

Architecturally, DFS and AFS tie all these functions intimately to the server; only data location is provided by a different server. The file system's security function is supported by an auxiliary server, but the file server still evaluates the mode bits and ACLs. Data transport is tied to DCE/RPC for DFS and Rx for AFS (even NFS, LANMAN, and web browsers have their own protocols). The server holds all the cards.

To investigate the architecture of distributed file systems in the small-server limit, I am going to disregard the constraints of existing systems and consider a family of technologically possible scenarios.
Without describing new features in great detail, I'll outline which of the functions listed above can reasonably be moved to the client or the disk; then we'll see what is left.

naming --

  low in the hierarchy -- allow servers to delegate long-term token management[8] for a directory to a client that is actively updating it, and subsequently redirect other clients to the delegatee. Clients that have responsibility for a directory will have to export a token management interface as well as server-like operations on directories (so other clients can make changes without forcing a change of delegation).

  high in the hierarchy -- replication, with its advantages (scalability, clean snapshots) and disadvantages (outdated contents), is still the preferred approach.

data storage -- there isn't much argument that interposing a server between client and disk adds latency, restricts bandwidth, and reduces scalability. I see two components to this.

  metadata -- in the limit, smart disks will allow the client to read and write data to the disk without contacting the server to obtain metadata.

  fid assignment -- currently the server assigns the fid for a new file, which puts the server into the file creation path. An object-oriented disk will need some way to name objects (files). If the client can assign fids, it can do file creation locally, at least in directories delegated to it. The implication for the data storage system is that it must treat the file identifier as an arbitrary tag. This will require the use of a hash table, or similar structure, to locate file metadata instead of relying on the internal structure of a fid the storage system assigned itself.

data location -- clients can be much more active at maintaining and updating a comprehensive, persistent idea of data location (i.e. a mini fileset location database (fldb)), and they can even swap their mini-fldbs with each other. But, ultimately, the advantages of having specialized servers collect, collate, and publish data location information will be compelling. Because location information is fail-safe (at least when supported by some form of authentication), a collection (even a hodge-podge one) of complementary data location services will outperform highly structured and regimented services. This looks like a robustly server-centric task.

security --

  hash fids -- if clients assign fids, the most convenient and useful fid they can select is a cryptographically strong hash of the file's contents. This provides automatic data authentication, thus freeing data location services and their clients from any security worries.

  private data -- clients generating non-public data can encrypt it at the source, which removes any security risks from data transport or storage. By creating certificates that make the encryption key available to the desired recipients, the client handles the access control function itself. With public key (PK) cryptography the client can do this with only occasional reference to a key server; using secret key systems it will need to contact a certificate service for each distinct ACL[3] it uses. Readers will have to decode certificates to access the desired data; a certificate cache should make this no more onerous than it was for the data's writer. One casualty of this model is non-discretionary access control (a big Defense Department desideratum during the 70s and 80s): the client alone decides who can see the data[4]. Possibly non-discretionary access control policies could be enforced by a certificate service in a secret-key-based environment.
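To make the hash-fid and private-data items concrete, here is a minimal sketch in Python. The names (assign_fid, PrivateWriter, Certificate), the choice of SHA-256, and the use of the third-party cryptography package's Fernet recipe are my own assumptions, purely for illustration; the point is only that the fid is a strong hash of the stored bytes and that read permission reduces to possession of a key conveyed by a certificate.

    # A minimal sketch of hash fids and encryption at the source. It is my own
    # illustration, not an existing protocol: fids are SHA-256 hashes of the
    # stored bytes, private data is encrypted with a per-ACL key, and a
    # "certificate" is modeled as a plain record granting that key to a named
    # reader (a real system would seal the key so only that reader can open it).

    import hashlib
    from dataclasses import dataclass
    from typing import Dict, Tuple

    from cryptography.fernet import Fernet   # third-party 'cryptography' package

    def assign_fid(stored_bytes: bytes) -> str:
        """The client assigns the fid: a cryptographically strong hash of the
        bytes actually handed to the storage system (ciphertext, if private)."""
        return hashlib.sha256(stored_bytes).hexdigest()

    def verify(fid: str, fetched_bytes: bytes) -> bool:
        """Automatic data authentication: any source at all can be used, because
        the reader re-hashes what it fetched and compares it to the fid."""
        return hashlib.sha256(fetched_bytes).hexdigest() == fid

    @dataclass
    class Certificate:
        """Grants the decryption key for one ACL to one reader."""
        acl_name: str
        reader: str
        key: bytes          # in practice: sealed so only `reader` can open it

    class PrivateWriter:
        """A client that encrypts non-public data at the source, one key per ACL
        (see note [3]), and issues certificates to the intended readers."""
        def __init__(self) -> None:
            self.acl_keys: Dict[str, bytes] = {}

        def store(self, plaintext: bytes, acl_name: str) -> Tuple[str, bytes]:
            key = self.acl_keys.setdefault(acl_name, Fernet.generate_key())
            ciphertext = Fernet(key).encrypt(plaintext)
            return assign_fid(ciphertext), ciphertext

        def grant(self, acl_name: str, reader: str) -> Certificate:
            return Certificate(acl_name, reader, self.acl_keys[acl_name])

    def read_private(fid: str, fetched: bytes, cert: Certificate) -> bytes:
        """A reader verifies the data against its fid, then decrypts it with the
        key obtained from a certificate (typically via a certificate cache)."""
        assert verify(fid, fetched), "data does not match its hash-fid"
        return Fernet(cert.key).decrypt(fetched)

Hashing the ciphertext rather than the plaintext is deliberate in this sketch: any location service or cache can then verify and serve the object without being able to read it, which is exactly the "no security worries" property claimed for hash fids above.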
  write access -- with hash-based fids, data is immutable, so write access doesn't apply. Instead, control of the mapping between names and data resides in the name space. This makes it clear that directory delegation must be limited to clients with write access to the directory, and that delegations must be conveyed via secure and authenticated channels. Directories, except in read-only replicas, cannot use hash-based fids. They really are (usually) small databases with well-defined update semantics implemented by a server. The idea proposed above is to allow delegation of this server role to authorized clients on a per-directory basis.

data transport -- using hash-based fids and encryption at the source, the data transport is freed from many complications now placed on it. Any old thing will do: Kermit, TCP/ftp, NetSCSI, SAN/fibre channel, DirectPC, or CD-by-FedEx. One still needs an RPC interface to talk to servers and other clients, but it becomes easier to decouple the control and data channels. This decoupling is exactly what is needed for SAN and even DirectPC satellite systems, and it will become more important as network technology grows more complex with a richer mix of price / performance trade-offs.

sharing -- using hash-based fids eliminates the need for a cache consistency protocol on files[2], confining this function to directories. This provides a big improvement in scalability, since there are many more files than directories. Caching directories will require token management overhead on a server: tracking the client delegates for each directory, and doing full token management (and handling directory updates) for undelegated directories.

data distribution --

  caching -- imposes little burden on any server. For plain files, the only issue is finding and fetching the data[5]. For directories the client either becomes the delegate and "owns" the directory or contacts another client or server, but it doesn't care which. Another advantage of hash-fids and end-to-end encryption is that files can be safely fetched from other clients' caches; call this "cache sharing". Such a client may be little more than an OO disk with extra smarts enabling it to perform cache replacement according to some policy. Attractive as this sounds, it adds to the data location job; this can be viewed as a problem or an opportunity.

  replication -- it is a small step from cache sharing of immutable file objects to replication. The main difference is that the unit of replication is a collection of files (e.g. a DFS fileset), which has some semantic meaning, while caching is done blindly (without regard to the semantic relationships between files) on a per-file basis. With a little less structure (i.e. useful bits assembled from various sources) this could also be called mirroring. In another way cache sharing and replication are very different: with cache sharing there is zero semantic content and the data location problem is difficult, while with replication the semantic content is high and the data location problem is easy.

Another view of the data distribution problem is sorting the output of the data location process to provide the best performance or lowest cost. Part of selecting the best source for a file is considering the characteristics of the routes available to reach that source. Another factor will be the behavior of the source itself: an under-utilized, but distant, source may be preferred to an overloaded nearby one. A service that collects, collates, and publishes data on sources and routes would be an important component of an efficient distributed file system.
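As a sketch of this sorting step, here is a minimal example; the cost model (route latency inflated by a load penalty) and the field names are my own assumptions, not part of any existing system, and serve only to show that route characteristics and source load both enter the ranking.

    # A minimal sketch of sorting the output of data location by estimated cost.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Source:
        name: str
        route_latency_ms: float   # characteristics of the route to this source
        load: float               # 0.0 = idle, 1.0 = saturated

    def estimated_cost(s: Source) -> float:
        # A heavily loaded source is penalized enough that an under-utilized
        # but distant source can win, as argued above.
        return s.route_latency_ms * (1.0 + 4.0 * s.load)

    def rank_sources(candidates: List[Source]) -> List[Source]:
        """Order the candidates returned by data location, cheapest first."""
        return sorted(candidates, key=estimated_cost)

    # Usage:
    #   rank_sources([Source("nearby-cache", 2.0, 0.95),
    #                 Source("distant-replica", 5.0, 0.10)])
    # prefers the distant, lightly loaded replica (cost 7.0) over the
    # overloaded nearby cache (cost 9.6).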
storage management -- requires allocating a shared resource: the disk hardware. Ultimately, the disk owner must establish a mechanism controlling the use of the disk, likely implemented by a server of some sort. If the disk is managing file metadata, it will be in the best position to track disk usage and enforce limits. However, it will collaborate with the owner's server to update the limits, and perhaps with an agent of the storing client to communicate this information in a timely fashion. If multiple clients are sharing the same disk storage they may need to cooperate with each other so that one doesn't unexpectedly consume the space needed by another to store its data.

data management -- is, by definition, a centrally managed process. Like any other process, its cost and performance will depend on the extent to which it can be automated. This naturally suggests an intelligent server; neither the client nor the disk is in a good position to perform these types of functions.

SUMMARY

    naming              / low           -- client
                        / high          -- server
    data storage        / metadata      -- disk
                        / fid asgn      -- client
    data location                       -- server
    security            / private data  -- client
                        / write access  -- client
                        / non-PK envs   -- server
    data transport                      -- client + disk
    sharing             / files         -- client
                        / dirs          -- client + server
    data dist           / caching       -- client
                        / other         -- server
    storage management                  -- disk + server
    data management                     -- server

The result of this investigation is that while servers do not disappear from the picture, they are not file servers: disks serve files. Instead we have data location servers, replication servers, security servers, data and storage management servers, and perhaps token management servers for undelegated directories.

Ted Anderson

NOTES

[1] To some extent technology is overrunning the traditional security issue, because the ubiquitous network and inexpensive storage mean that it is almost impossible to guarantee that "deleted" data will not surface some day or that "secure" data won't eventually leak out. The bedrock in the security business can be summarized by two aphorisms: "If you want something done right, do it yourself" (encrypt data at its source) and "Three may keep a secret, if two of them are dead" [Benjamin Franklin] (don't share the data's key more than absolutely necessary).

[2] The suitability of hash values as fids depends on how frequently data is changed. Statistics show that the vast majority of files are written only once, and most of the rest are rewritten shortly after being written the first time; the likelihood of writes to a file declines very rapidly as the time since the last write increases. By having the producer (the client that originally writes a file) delay assigning a fid until it needs to flush a file from its cache (or until another client requests it), the odds that a file will need to be overwritten and its hash-fid reassigned can be made very small. This is especially easy if the directory in which the file is being created has been delegated to the same client. A library providing the familiar rewrite interface could be written using copy-on-write logic to accommodate an underlying file system interface that only supports write-once semantics. The few applications that really require updating files in place should generally be using a real database, not a file system.
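As a rough illustration of the copy-on-write library mentioned in note [2], here is a minimal sketch assuming only a write-once store keyed by hash-fid; the WriteOnceStore and RewritableFile names are my own, purely for illustration.

    # A minimal sketch of the rewrite-on-top-of-write-once idea in note [2]:
    # an "update" buffers changes locally and, on flush, stores a brand-new
    # immutable object with a freshly assigned hash-fid, then rebinds the
    # name-to-fid mapping in the directory.

    import hashlib
    from typing import Dict

    class WriteOnceStore:
        """Stores immutable objects keyed by their hash-fid."""
        def __init__(self) -> None:
            self.objects: Dict[str, bytes] = {}

        def put(self, data: bytes) -> str:
            fid = hashlib.sha256(data).hexdigest()   # fid assigned at flush time
            self.objects[fid] = data
            return fid

        def get(self, fid: str) -> bytes:
            return self.objects[fid]

    class RewritableFile:
        """Presents the familiar rewrite interface over write-once storage."""
        def __init__(self, store: WriteOnceStore, directory: Dict[str, str],
                     name: str) -> None:
            self.store, self.directory, self.name = store, directory, name
            fid = directory.get(name)
            self.buffer = bytearray(store.get(fid)) if fid else bytearray()

        def write(self, offset: int, data: bytes) -> None:
            # Copy-on-write: mutate only the local buffer, never stored objects.
            self.buffer[offset:offset + len(data)] = data

        def flush(self) -> str:
            # Delayed fid assignment: the new contents get a new hash-fid, and
            # the directory entry is rebound to it (the old object is untouched).
            new_fid = self.store.put(bytes(self.buffer))
            self.directory[self.name] = new_fid
            return new_fid

Delaying put() until flush is exactly the delayed fid assignment argued for in note [2]: repeated writes shortly after creation never reach the store, so the hash-fid rarely needs to be reassigned.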
[3] By using the same encryption key for all files with the same ACL, the client can avoid contacting the certificate service for each file. In practice, it may be desirable to change the keys a little more often, say once per day or once per directory. With a public key system, a public key encryption (e.g. a bignum exponentiation) is required about as often as a secret key system would need to contact a server. On very lightweight clients, the CPU burden of PK operations may be onerous; in some environments they could be off-loaded to a trusted PK server with hardware support for the requisite arithmetic operations.

[4] The use of encryption and certificates to replace trusted servers and ACLs means that revocation is no longer as easy as it once appeared. Ever since the invention of paper, the confidence one could have in revocation of privilege has always been somewhat limited. With widespread caching, replication, off-shore mirrors, and anonymous mailers augmenting floppy disks, writable CDs, and laptops, revocation has never been shakier. In this model, however, revocation of data requires decrypting it, re-encrypting it with a new key, and issuing a new certificate. Revocation of a user's privilege in a PK environment requires revocation of all data whose certificates mention that user; this may frequently be impossible. With a secret key system utilizing a trusted certificate server, it only requires revoking that user's credentials. For this reason alone, not the performance issues often cited, PK systems may have trouble catching on in corporate and government environments.

[5] A consequence of using hash-based fids is that chunking, as used by DFS and AFS, doesn't really make much sense: the client must fetch the whole file to verify that the hash of the data matches the fid it requested. Similarly, encryption and decryption will probably be easier if the whole file is available at one time. Very large files would be unwieldy, so instead a maximum size can be imposed which is convenient for caching, efficient transfer, hashing, and encryption (say 1 MB). Larger files can be composed by assembling individual files containing each part of the whole large file. The file describing the assemblage could be either a simple array of offsets and hash-fids or something more sophisticated like XML (a sketch of such a manifest follows these notes).

[7] "Wherever you go there you are" -- Buckaroo Banzai, or "No matter where you go, there you are..." Also see http://www.slip.net/~figment/bb/q32.html.

[8] By "token management" I mean something like DFS read-data and write-data tokens. In DFS these allow byte-range granularity, but for directories the special case of whole-file data tokens would be adequate. It would be convenient to provide byte-range data and lock tokens for a few applications, but this service could be offered separately and probably doesn't need to be welded into the file system.
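To make the assemblage in note [5] concrete, here is a minimal sketch. The 1 MB part size comes from the note; the manifest serialization (a JSON list of offset/fid pairs) and the function names are my own assumptions for illustration only.

    # A minimal sketch of composing a large file from <= 1 MB parts (note [5]).
    # Each part is an ordinary hash-fid object; the manifest describing the
    # assemblage is itself stored as a hash-fid object.

    import hashlib
    import json
    from typing import Dict, List, Tuple

    PART_SIZE = 1 << 20   # 1 MB, the convenient maximum suggested in note [5]

    def store_part(store: Dict[str, bytes], data: bytes) -> str:
        fid = hashlib.sha256(data).hexdigest()
        store[fid] = data
        return fid

    def store_large_file(store: Dict[str, bytes], data: bytes) -> str:
        """Split into parts, store each, then store the manifest: a simple
        array of (offset, hash-fid) pairs, as note [5] suggests."""
        manifest: List[Tuple[int, str]] = []
        for offset in range(0, len(data), PART_SIZE):
            part = data[offset:offset + PART_SIZE]
            manifest.append((offset, store_part(store, part)))
        return store_part(store, json.dumps(manifest).encode())

    def read_large_file(store: Dict[str, bytes], manifest_fid: str) -> bytes:
        """Reassemble the file, verifying each part against its hash-fid."""
        out = bytearray()
        for offset, fid in json.loads(store[manifest_fid]):
            part = store[fid]
            assert hashlib.sha256(part).hexdigest() == fid
            out[offset:offset + len(part)] = part
        return bytes(out)

Each part stays within the convenient size for caching, transfer, hashing, and encryption, while the manifest keeps the whole-file abstraction intact.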
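Finally, a minimal sketch of the whole-directory data tokens note [8] has in mind; the class and method names are my own assumptions, and byte-range tokens are deliberately omitted since, for directories, the whole-object special case is adequate.

    # A minimal sketch of whole-directory read/write data tokens (note [8]).

    from typing import Dict, Set

    class DirectoryTokenManager:
        """Tracks, per directory fid, which clients hold read or write tokens."""
        def __init__(self) -> None:
            self.readers: Dict[str, Set[str]] = {}
            self.writer: Dict[str, str] = {}

        def grant_read(self, dir_fid: str, client: str) -> None:
            # Any number of readers may coexist, but not with a writer.
            if dir_fid in self.writer:
                self.revoke(dir_fid, self.writer.pop(dir_fid))
            self.readers.setdefault(dir_fid, set()).add(client)

        def grant_write(self, dir_fid: str, client: str) -> None:
            # A writer is exclusive: revoke all other outstanding tokens first.
            for reader in self.readers.pop(dir_fid, set()) - {client}:
                self.revoke(dir_fid, reader)
            old = self.writer.get(dir_fid)
            if old is not None and old != client:
                self.revoke(dir_fid, old)
            self.writer[dir_fid] = client

        def revoke(self, dir_fid: str, client: str) -> None:
            # In a real system this is the callback telling `client` to discard
            # its cached copy (or return the directory's updated contents).
            pass

This is the role that either a server or a delegate client would play for a directory; offering it as a separate service, rather than welding it into the file system, is the point note [8] makes.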