Adam brings up several interesting points. I hold forth on my particular ideas on this subject from time to time, and perhaps it is time again.

I am not really interested in the AP-type aspects of the eternity business myself. I am of the school of thought that believes that mother nature (in the form of naive users, thumb-fingered site admins, backhoe operators, etc.) is a tougher opponent than all but the most determined and well-financed adversaries. My solution to the eternity problem is to build a universal file system that robustly handles all the "ordinary" problems of data delivery and file system security. The resulting mechanisms and traffic provide a great haystack in which to bury the minuscule quantity of really problematic content.

The primary step is to divide all files into two classes: immutable and database. The former consists of almost all regular files; the latter consists mainly of directories. In most cases read/write files are really just a series of immutable versions of a file that replace each other as the contents of a particular entry in the namespace. Write sharing is accomplished by obtaining the latest version of the file from the namespace. For very fine-grained write sharing you need a database, the premiere example of which is the directories of the namespace itself. A result of this split is that access to non-public files is best controlled by encrypting the data at its source (the producer) and handing out certificates giving authorized users access to the key. Write access resides in the namespace: the authority to change the "meaning" of a name.

A namespace entry points to a file via its file identifier. The best thing to use for this identifier is a hash of the file's content (a cfid). Except for public data, the hash is taken over the encrypted content. This makes the identifier self-signing, so there is never a question of getting the wrong data. The consumer looks up the name in a directory and fetches the file with the specified cfid. If the received bytes have the correct hash, it is the file the producer wrote. The immutability property makes caching very easy because cache coherence is automatic: if you have the specified cfid in your cache, it is guaranteed to be the correct data.

All the timeliness is vested in the namespace. Because directories are already a write-shared database, they must be updated via authenticated communication with the owner (or an online server acting on his behalf). Lookups need to go to the owner, or to a replica which the owner informs synchronously of directory updates. This assures that the cfid is current and any resulting cache hit is guaranteed correct. This provides open/close cache consistency.

The use of content-hash identifiers relieves the delivery mechanism of all security concerns. Privacy is provided by encryption, and authentication (connecting the name to the desired content) is provided by the namespace. The storage, transport, caching and replication services only have to deliver bytes that match the requested cfid: they are either working or broken. This neatly breaks the file system into two independent parts: a delivery system and a namespace. Certificates used to pass private data between producers and consumers are just a specially formatted type of encrypted file in the namespace. I spend more words on the taxonomy of file system functions here[1].
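To make the cfid idea concrete, here is a rough Python sketch. The particulars are mine, not part of the design above: SHA-256 as the content hash, Fernet (from the "cryptography" package) standing in for producer-side encryption, and an in-memory dict standing in for the whole delivery system.

    # A toy cfid scheme: encrypt at the producer, name the file by the hash
    # of the *encrypted* bytes, and verify that hash on every fetch.
    import hashlib
    from cryptography.fernet import Fernet

    store = {}   # cfid -> encrypted bytes; stand-in for the delivery system

    def publish(plaintext: bytes, key: bytes) -> str:
        # Producer: encrypt at the source, then derive the self-signing cfid.
        ciphertext = Fernet(key).encrypt(plaintext)
        cfid = hashlib.sha256(ciphertext).hexdigest()
        store[cfid] = ciphertext
        return cfid

    def fetch(cfid: str, key: bytes) -> bytes:
        # Consumer: any bytes whose hash matches the requested cfid are the
        # file the producer wrote, no matter which cache or mirror sent them.
        ciphertext = store[cfid]
        if hashlib.sha256(ciphertext).hexdigest() != cfid:
            raise ValueError("delivery system returned the wrong bytes")
        return Fernet(key).decrypt(ciphertext)

    key = Fernet.generate_key()   # in practice, handed out via a certificate
    cfid = publish(b"report, version 3", key)
    assert fetch(cfid, key) == b"report, version 3"

The point the sketch makes is the one above: the store, and any cache or mirror in front of it, never has to be trusted; it either delivers bytes with the right hash or it is broken.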
One way to look at the namespace problem is to consider each directory as a private channel identified by its fid (which, except for the top levels of the namespace, cannot be content-hash based). The channel is protected by encryption, and the key is shared by all authorized writers of the directory. At any time a single entity, running with the credentials of one of the writers, is called the owner. The owner has the current copy of the directory and is in charge of making updates and handling lookup requests. Readers are authorized by the owner, and the owner is in charge of ensuring coherence of cached lookups through some consistency protocol such as callbacks, leases or the like. Ownership can change hands by passing an encrypted copy of the current directory from one writer to another, updating a location database and forwarding requests from the old owner to the new owner. (I sketch a toy version of such an owner below.)

The delivery system must be able to provide content requested by cfid. There are considerable problems in handling the data location problem. Instead of the traditional simple and fragile server-centric star topology of servers and clients, we would like to support a dynamic hierarchy of caches, replicas, mirrors, archives and the like which provide data by cfid. Here is where e$ comes in. Producers contract with a data storage facility to keep their data at a primary repository: their home file server, as it were. Consumers pay for delivered data. Separating producers and consumers in both time and space, however, are data merchants. They may find it profitable to store popular data, as well as to provide data caching or location services. For example, if the primary repository of a very popular file is overloaded, it will be advantageous for requesters to turn to secondary sources for the file, who can then profit from having kept a copy. Similarly, local sources will often be preferred to distant ones (for performance reasons, even if the network doesn't charge for its role in delivering data), which also provides a niche for profitable neighborhood caches. Devising algorithms and protocols that allow the delivery system to respond efficiently and quickly to changes in demand, while still handling routine access to data across a tremendous range of popularity, will be challenging. I think that market-based mechanisms are up to the task. I covered this topic in a previous message[2].

The data location problem is severe. In the limit it becomes a huge advertising problem: how can a requester find a supplier that stocks the file it needs? It will probably be useful to annotate a cfid with a small list of collection identifiers which serve to categorize it. Membership in a collection need not be exclusive, and collections may or may not be hierarchical. One model is that the collection ids form a very broad, shallow tree which locates cfids: for example, my files might be identified by the collection tree US corporations => Transarc => ota. Another view is that the collection identifiers are orthogonal search keys. These could be based on a handful of well-known key concepts (the assignment of library card catalog numbers must use a similar system). These collection ids would help with the advertising problem because suppliers could specialize in files identified with certain collections. Because collections could be quite large, perhaps with 100K or more members, the difficulty of finding suppliers for collections would be greatly eased compared to the problem of finding an arbitrary cfid.
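Here is the toy directory owner promised above. I haven't pinned down a concrete protocol, so the class, the method names and the fixed 30-second lease are all inventions for illustration; real readers and writers would of course reach the owner over an authenticated, encrypted channel rather than calling methods in the same process.

    # Toy directory owner: the single entity that holds the current copy of a
    # directory, checks write authority, and bounds reader staleness with leases.
    import time

    class DirectoryOwner:
        LEASE_SECONDS = 30   # assumed lease length; callbacks would serve as well

        def __init__(self, writers):
            self.writers = writers    # principals allowed to change name bindings
            self.entries = {}         # name -> cfid (the directory contents)

        def update(self, writer, name, cfid):
            # Write access is exactly the authority to change the meaning of a name.
            if writer not in self.writers:
                raise PermissionError(f"{writer} is not a writer of this directory")
            self.entries[name] = cfid   # a new immutable version replaces the old

        def lookup(self, name):
            # Return the current binding plus a lease expiry; a reader may cache
            # the cfid only until the lease runs out, giving open/close consistency.
            return self.entries[name], time.time() + self.LEASE_SECONDS

        def transfer(self):
            # Ownership changes hands by shipping the current directory contents
            # (encrypted under the writers' shared key, in a real system) to
            # another writer.
            return dict(self.entries)

    d = DirectoryOwner(writers={"ota"})
    d.update("ota", "taxonomy.txt", "cfid-placeholder")   # stand-in, not a real hash
    cfid, lease_expiry = d.lookup("taxonomy.txt")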
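And a similarly toy illustration of collection identifiers as coarse location keys. The collection names are the example from the paragraph above; the index structures and the supplier host name are hypothetical.

    # Suppliers advertise whole collections rather than individual cfids, so a
    # requester only has to match a handful of coarse keys to find candidates.
    from collections import defaultdict

    collections_of = {
        "cfid-placeholder": ["US corporations", "Transarc", "ota"],
    }

    suppliers = defaultdict(set)
    suppliers["Transarc"].add("cache.example.net")   # hypothetical data merchant

    def candidate_suppliers(cfid):
        # Union of the suppliers advertising any collection this cfid belongs to.
        hits = set()
        for coll in collections_of.get(cfid, []):
            hits |= suppliers[coll]
        return hits

    print(candidate_suppliers("cfid-placeholder"))   # {'cache.example.net'}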
In summary, the delivery mechanism will need to grow into a complex adaptive system if it is to work at all. As such it would likely be highly resistant to tampering and tinkering. However, I admit that at this stage its design is mostly handwaving and fuzzy thinking.

As Adam said, the namespace is a crucial link. By decoupling the namespace from the delivery system, you reduce the cost of (naming) replication to an infinitesimal level. Like web links to popular or infamous sites, references to interesting files can replicate furiously, as any user can add one to his home directory, post them in messages to popular newsgroups, weblogs, etc. Once the "word" is out, the delivery system will ensure that the content is replicated robustly. A really popular file will be unsuppressible because as an attacker eliminates copies (through whatever nefarious means), the remaining copies become increasingly lucrative to their owners, so further elimination becomes exponentially more expensive and difficult for the attacker, and the system tends to heal itself. A more serious problem is safeguarding important but rarely used materials. These may depend on bands of zealous archivists who collect various types of data of interest to them.

Perhaps the key point is that the system I outline here is principally designed for routine use so, like the Internet itself, it can have a chance to grow and prosper even if it has occasional "incorrect" uses. A system designed specifically to support unpopular purposes will never live to see the light of day.

Ted Anderson

[1] http://www.transarc.com/~ota/taxonomy.txt
[2] http://www.transarc.com/~ota/Information-Silk-Road.html