Adam brings up several interesting points. I hold forth on my particular ideas on this subject from time to time, and perhaps it is time again.

I am not really interested in the AP-type aspects of the eternity business myself. I am of the school of thought that believes that mother nature (in the form of naive users, thumb-fingered site admins, backhoe operators, etc.) is a tougher opponent than all but the most determined and well-financed adversaries. My solution to the eternity problem is to build a universal file system that robustly handles all the "ordinary" problems of data delivery and file system security. The resulting mechanisms and traffic provide a great haystack in which to bury the minuscule quantity of really problematic content.

The primary step is to divide all files into two classes: immutable and database. The former consists of almost all regular files; the latter consists mainly of directories. In most cases read/write files are really just a series of immutable versions of a file that replace each other as the contents of a particular entry in the namespace. Write sharing is accomplished by obtaining the latest version of the file from the namespace. For very fine-grained write sharing you need a database, the premiere example of which is the directories of the namespace itself. A result of this split is that access to non-public files is best controlled by encrypting the data at its source (the producer) and handing out certificates giving authorized users access to the key. Write access resides in the namespace: the authority to change the "meaning" of a name.

A namespace entry points to a file via its file identifier. The best thing to use for this identifier is a hash of the file's content (a cfid). Except for public data, the hash is taken over the encrypted content. This makes the identifier self-signing, so there is never a question of getting the wrong data. The consumer looks up the name in a directory and fetches the file with the specified cfid. If the received bytes have the correct hash, it is the file the producer wrote. The immutability property makes caching very easy because cache coherence is automatic: if you have the specified cfid in your cache, it is guaranteed to be the correct data.

All the timeliness is vested in the namespace. Because directories are already a write-shared database, they must be updated via authenticated communication with the owner (or an online server acting on his behalf). Lookups need to go to the owner, or to a replica which the owner informs synchronously of directory updates. This assures that the cfid is current and any resulting cache hit is guaranteed correct. This provides open/close cache consistency.

The use of content-hash identifiers relieves the delivery mechanism of all security concerns. Privacy is provided by encryption, and authentication (connecting the name to the desired content) is provided by the namespace. The storage, transport, caching and replication services only have to deliver bytes that match the requested cfid: they are either working or broken. This neatly breaks the file system into two independent parts: a delivery system and a namespace. Certificates used to pass private data between producers and consumers are just a specially formatted type of encrypted file in the namespace. I spend more words on the taxonomy of file system functions here[1].
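To make the cfid idea concrete, here is a rough Python sketch. The particulars are mine, not part of the design above: SHA-256 as the content hash, Fernet (from the "cryptography" package) standing in for producer-side encryption, and an in-memory dict standing in for the whole delivery system.

    # A toy cfid scheme: encrypt at the producer, name the file by the hash
    # of the *encrypted* bytes, and verify that hash on every fetch.
    import hashlib
    from cryptography.fernet import Fernet

    store = {}   # cfid -> encrypted bytes; stand-in for the delivery system

    def publish(plaintext: bytes, key: bytes) -> str:
        # Producer: encrypt at the source, then derive the self-signing cfid.
        ciphertext = Fernet(key).encrypt(plaintext)
        cfid = hashlib.sha256(ciphertext).hexdigest()
        store[cfid] = ciphertext
        return cfid

    def fetch(cfid: str, key: bytes) -> bytes:
        # Consumer: any bytes whose hash matches the requested cfid are the
        # file the producer wrote, no matter which cache or mirror sent them.
        ciphertext = store[cfid]
        if hashlib.sha256(ciphertext).hexdigest() != cfid:
            raise ValueError("delivery system returned the wrong bytes")
        return Fernet(key).decrypt(ciphertext)

    key = Fernet.generate_key()   # in practice, handed out via a certificate
    cfid = publish(b"report, version 3", key)
    assert fetch(cfid, key) == b"report, version 3"

The point the sketch makes is the one above: the store, and any cache or mirror in front of it, never has to be trusted; it either delivers bytes with the right hash or it is broken.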
One way to look at the namespace problem is to consider each directory as a private channel identified by its fid (which, except for the top levels of the namespace, cannot be content-hash based). The channel is protected by encryption, and the key is shared by all authorized writers of the directory. At any time a single entity, running with the credentials of one of the writers, is called the owner. The owner has the current copy of the directory and is in charge of making updates and handling lookup requests. Readers are authorized by the owner, and the owner is in charge of ensuring coherence of cached lookups through some consistency protocol such as callbacks, leases or the like. Ownership can change hands by passing an encrypted copy of the current directory from one writer to another, updating a location database and forwarding requests from the old owner to the new owner. (I sketch a toy version of such an owner below.)

The delivery system must be able to provide content requested by cfid. There are considerable problems in handling the data location problem. Instead of the traditional simple and fragile server-centric star topology of servers and clients, we would like to support a dynamic hierarchy of caches, replicas, mirrors, archives and the like which provide data by cfid. Here is where e$ comes in. Producers contract with a data storage facility to keep their data at a primary repository: their home file server, as it were. Consumers pay for delivered data. Separating producers and consumers in both time and space, however, are data merchants. They may find it profitable to store popular data, as well as to provide data caching or location services. For example, if the primary repository of a very popular file is overloaded, it will be advantageous for requesters to turn to secondary sources for the file, who can then profit from having kept a copy. Similarly, local sources will often be preferred to distant ones (for performance reasons, even if the network doesn't charge for its role in delivering data), which also provides a niche for profitable neighborhood caches. Devising algorithms and protocols that allow the delivery system to respond efficiently and quickly to changes in demand, while still handling routine access to data across a tremendous range of popularity, will be challenging. I think that market-based mechanisms are up to the task. I covered this topic in a previous message[2].

The data location problem is severe. In the limit it becomes a huge advertising problem: how can a requester find a supplier that stocks the file it needs? It will probably be useful to annotate a cfid with a small list of collection identifiers which serve to categorize it. Membership in a collection need not be exclusive, and collections may or may not be hierarchical. One model is that the collection ids form a very broad, shallow tree which locates cfids: for example, my files might be identified by the collection tree US corporations => Transarc => ota. Another view is that the collection identifiers are orthogonal search keys. These could be based on a handful of well-known key concepts (the assignment of library card catalog numbers must use a similar system). These collection ids would help with the advertising problem because suppliers could specialize in files identified with certain collections. Because collections could be quite large, perhaps with 100K or more members, the difficulty of finding suppliers for collections would be greatly eased compared to the problem of finding an arbitrary cfid.
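Here is the toy directory owner promised above. I haven't pinned down a concrete protocol, so the class, the method names and the fixed 30-second lease are all inventions for illustration; real readers and writers would of course reach the owner over an authenticated, encrypted channel rather than calling methods in the same process.

    # Toy directory owner: the single entity that holds the current copy of a
    # directory, checks write authority, and bounds reader staleness with leases.
    import time

    class DirectoryOwner:
        LEASE_SECONDS = 30   # assumed lease length; callbacks would serve as well

        def __init__(self, writers):
            self.writers = writers    # principals allowed to change name bindings
            self.entries = {}         # name -> cfid (the directory contents)

        def update(self, writer, name, cfid):
            # Write access is exactly the authority to change the meaning of a name.
            if writer not in self.writers:
                raise PermissionError(f"{writer} is not a writer of this directory")
            self.entries[name] = cfid   # a new immutable version replaces the old

        def lookup(self, name):
            # Return the current binding plus a lease expiry; a reader may cache
            # the cfid only until the lease runs out, giving open/close consistency.
            return self.entries[name], time.time() + self.LEASE_SECONDS

        def transfer(self):
            # Ownership changes hands by shipping the current directory contents
            # (encrypted under the writers' shared key, in a real system) to
            # another writer.
            return dict(self.entries)

    d = DirectoryOwner(writers={"ota"})
    d.update("ota", "taxonomy.txt", "cfid-placeholder")   # stand-in, not a real hash
    cfid, lease_expiry = d.lookup("taxonomy.txt")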
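And a similarly toy illustration of collection identifiers as coarse location keys. The collection names are the example from the paragraph above; the index structures and the supplier host name are hypothetical.

    # Suppliers advertise whole collections rather than individual cfids, so a
    # requester only has to match a handful of coarse keys to find candidates.
    from collections import defaultdict

    collections_of = {
        "cfid-placeholder": ["US corporations", "Transarc", "ota"],
    }

    suppliers = defaultdict(set)
    suppliers["Transarc"].add("cache.example.net")   # hypothetical data merchant

    def candidate_suppliers(cfid):
        # Union of the suppliers advertising any collection this cfid belongs to.
        hits = set()
        for coll in collections_of.get(cfid, []):
            hits |= suppliers[coll]
        return hits

    print(candidate_suppliers("cfid-placeholder"))   # {'cache.example.net'}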
In summary, the delivery mechanism will need to grow into a complex adaptive system if it is to work at all. As such it would likely be highly resistant to tampering and tinkering. However, I admit that at this stage its design is mostly handwaving and fuzzy thinking.

As Adam said, the namespace is a crucial link. By decoupling the namespace from the delivery system, you reduce the cost of (naming) replication to an infinitesimal level. Like web links to popular or infamous sites, references to interesting files can replicate furiously, as any user can add one to his home directory, post them in messages to popular newsgroups, weblogs, etc. Once the "word" is out, the delivery system will ensure that the content is replicated robustly. A really popular file will be unsuppressible because as an attacker eliminates copies (through whatever nefarious means), the remaining copies become increasingly lucrative to their owners, so further elimination becomes exponentially more expensive and difficult for the attacker, and the system tends to heal itself. A more serious problem is safeguarding important but rarely used materials. These may depend on bands of zealous archivists who collect various types of data of interest to them.

Perhaps the key point is that the system I outline here is principally designed for routine use so, like the Internet itself, it can have a chance to grow and prosper even if it has occasional "incorrect" uses. A system designed specifically to support unpopular purposes will never live to see the light of day.

Ted Anderson

[1] http://www.transarc.com/~ota/taxonomy.txt
[2] http://www.transarc.com/~ota/Information-Silk-Road.html