Data Location and Distribution[1]

Ian Clarke's recent summary of the FreeNet protocol provides an opportunity for me to mention another of my concerns about it. But I'd like to wrap the description with some background explaining my perspective on this whole issue.

My primary motivation is to create a global file system that almost everyone can use for almost all their file storage needs. I think that a suitably designed file system can serve as the storage medium for most distributed applications and provide support for the rest (for instance, by providing static data used in conjunction with dynamically generated content). There are certainly several file system markets which are inappropriate for the file system I envision. These are situations where the performance cost of hashing all the data is too high and the benefits of sharing are too low: very high-end or cluster systems and very low-end or stand-alone systems. I also don't expect to supplant the boot file system on a machine.

Within 5-10 years the Internet will become nearly pervasive. At that point this file system model could address about 99% of the market volume and maybe 95% of the market dollars. With this in mind, and keeping an eye on the future, my target for the file system's scale is 2^40 nodes and 2^80 objects. This is clearly a long-term goal, as it amounts to 2^80 bytes (over a billion petabytes), but that's only 2^40 nodes exporting a terabyte each.

For scalability and privacy reasons, each user needs to have the ability to exercise complete control over access to his data. The obvious approach to this is to have the data producer encrypt the data and the consumer decrypt it, with keys flowing through various paths to enable these virtual channels between producers and consumers. This approach to data security will likely require new paradigms for thinking about information ownership and control.

Complete autonomy is a must for the individual nodes which comprise the system. Autonomy is crucial for providing sufficient scalability to enable such a ubiquitous system. This implies very aggressive caching of data and allowing nodes to do their own directory updates, manage locking on the files they control, and so forth. This requirement for operational independence fits very well with the above security model based on authorization independence.

To support the level of caching needed to make a system of this scale perform well, immutable files are necessary so that the data consistency of the system can be well defined. Present systems allow both names and the data they point at to change, leaving consistency a very fragile commodity. Clearly, producers must be able to "say what they mean" by mapping names to arbitrary data. By making data identifiers immutable, however, the system can also insist that "they mean what they say". In other words, there is no editing of history: "What was once said, stays writ". A file system with well-defined consistency guarantees based on a substrate of immutable files will also require some shift in the paradigm used by both people and applications.

Managing very large systems is extremely difficult. The very largest systems always operate on the verge of being out of control. To manage even larger and more complex systems requires making an actual virtue of being out of control.
Complex Adaptive Systems (CAS) work by breaking the job into very many small parts, each obeying simple local rules and interacting with a few associates; no one is "in charge". This architectural model is perfectly suited to the problem of decentralized data distribution. While it is hard to let go of the mind-set of centralized control, in sufficiently large and complex systems, decentralized approaches are the only ones that can work in practice. The goal of providing secure, censor-proof access to data makes a virtue of decentralization and will help keep centralized components from creeping into the system. Adopting an eventual goal of very large scale deployment will also help avoid leaning on unscalable components. A target of 2^40 nodes supporting 2^80 objects should put any thoughts of systems with any aspect of global knowledge out of the question. The challenge is to design local algorithms that, when embodied in many participating nodes, collectively achieve the desired goals.

-----

One of the problems with FreeNet is that it has no good story on data persistence. It is a pure cache, and unpopular data just falls off the end of the LRU list and disappears forever. In a real file system this is not a desirable property, though excellent caching capabilities are certainly crucial for acceptable performance in any large scale distributed file system. The Eternity system, as the name implies, focuses on persistence without addressing the issues of caching. We need a more comprehensive model for a general purpose file system.

I am interested in establishing the basic feasibility of a storage network on the scale I contemplate. To that end, both FreeNet and Eternity provide starting points for thinking about data distribution in really large scale file systems.

The FreeNet data storage system provides a global LRU cache using an adaptive network of servers. The ability of the cache to quickly and reliably produce the data matching a requested key is crucial. This ability needs to scale to large network sizes and be robust to various failure modes. More work needs to be done to clarify the operation of the FreeNet cache. Scalability measurements at the level of a few hundred or a few thousand nodes, unless very carefully done, will tell us little about the scalability to 2^20 nodes, let alone 2^32 or 2^40. The simplest thing would be to simulate much larger networks so that the behavior as a function of size can be measured, though there are practical considerations which will limit this approach. Better understanding at the current scale is possible and would help identify the limitations of very large networks. Parametric sensitivity studies of all the parameters affecting the scalability would greatly help.

For instance, the hop count is an 8-bit field which is decremented by every host along the request path. If one route results in a loop or backtrack failure, the node retries the search down another path, using the hop count from the just-received response. The effect of this is that the hop count is a measure of nodes searched, not of search depth. Unless the search converges very rapidly on a node caching the desired file, this small hop count will not allow files to be found in large networks. To explore this issue it would be useful to see how small networks perform with tiny hop counts.
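To make the hop-count concern concrete, here is a minimal simulation sketch, in Python, of a FreeNet-style search in which the hops-to-live budget is charged for every node visited, including retries after loops and dead ends. The random network model and the blind neighbor selection are invented for illustration (real FreeNet routes toward key similarity); the point is only that the counter bounds the number of nodes searched, not the depth reached.

    # Toy model: hops-to-live (HTL) is spent on every node visited, including
    # retries after loops and dead ends, so it limits nodes searched, not depth.
    # The random topology and blind routing below are invented for illustration.
    import random

    def make_network(n, degree=4, seed=1):
        """Random directed graph: node -> list of neighbours (hypothetical)."""
        rng = random.Random(seed)
        return {u: rng.sample([v for v in range(n) if v != u], degree)
                for u in range(n)}

    def search(net, start, target, htl):
        """Depth-first search charging one HTL unit per node visited."""
        visited, stack = set(), [start]
        while stack and htl > 0:
            node = stack.pop()
            if node in visited:
                continue               # loop: back up and try another path
            visited.add(node)
            htl -= 1                   # charged per node searched, not per level
            if node == target:
                return True, len(visited)
            stack.extend(v for v in net[node] if v not in visited)
        return False, len(visited)

    if __name__ == "__main__":
        net = make_network(2000)
        rng = random.Random(2)
        results = [search(net, 0, rng.randrange(2000), htl=255) for _ in range(50)]
        hits = sum(1 for found, _ in results if found)
        print("found %d of 50 random keys with HTL=255 in a 2000-node network" % hits)

Without routing that converges on the right region quickly, 255 units of hops-to-live can only ever touch 255 nodes, a vanishing fraction of a 2^20-node network.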
I am unpersuaded that nodes will really specialize adaptively to handle certain ranges of keys. Will these specializations be well defined, persistent over time, and known to their neighbors? Will data successfully make its way to these nodes upon insertion, and will appropriate requests likewise be directed to these nodes? It is pretty clear that, in the face of requests, the FreeNet caching behavior will migrate data towards its most frequent users. However, there is no pressure encouraging data to collect itself where it can be found later.

Consider the following example, in which FreeNet is widely used to distribute data for all sorts of applications. I am in Bosnia filing an important report, so I set the insertion hop count to 255. The data ends up on 255 nodes in and around Eastern Europe. Perhaps the report even makes it to my immediate supervisor in Brussels. However, I explicitly send the key to an associate in Chile. This guy doesn't get around to fetching the data for a few days. What happens? I imagine normal daily activity in the network causes a sort of tidal bulge in which high levels of load travel around the planet with the workday hours. These local surges in demand will cycle through the caches of the most active nodes. After a few days the recipient of my email tries to fetch the report. The nodes that might have had some affinity for the key identifying the report have long since purged it from their caches. The nodes that still have the report are those near nodes that fetched it recently, and they are all in Europe. Optimistically, we may assume that the request from Chile makes it to a node or two in Europe that used to cache the report, but the trail showing where the report migrated in the intervening days is gone. Because there is no tendency for data to migrate towards a rendezvous point where requesters know to look, many files that are still cached in the network will go undiscovered.

Another area of concern is the method for adding nodes to the network. There is little in the FreeNet report that describes this process, but I believe that in a large network the way in which new nodes are added to the system may be quite important. How do they find their initial points of contact with the network? How do paths of communication develop so that a user can obtain data residing on another continent without searching a million or a billion or more nodes on his own continent first?

-----

The core of my solution to these problems is to add active specialization to each node's data storage behavior. Each node will have a target key and it will collect files with nearby keys. In addition, I believe the whole system needs to be able to use market-based incentives to influence its allocation of resources. This will allow each node to devote a portion of its disk space to data it is paid to store. This combination of collecting behavior and profit motive will produce a decentralized distributed storage system with quantifiable reliability.

When a node joins the system it selects a random hash value which identifies its position in the data storage network, called its nodeid. I assume that data keys are produced using the same hash function, so they are the same size as nodeids. The node maintains a list of other nodes it associates with and their ids. These associates are arranged in an array of node groups. Each element of the array is labeled by a target nodeid whose value matches this node's id except at the bit whose position equals the array index, counting from the most significant end.
So the first element will collect associates whose nodeids are close to this node's id with the high bit complemented; the next element's members will cluster around the node's id with the second-highest bit complemented. Each node collects associates whose nodeids most closely match the target nodeid for each element of the array. If more than a few nodes are found for a group, only the best are kept, where best is defined as lowest response time, most reliable, or some other suitable metric. This array constitutes the edges of an n-cube graph linking the nodes of the network together. The distance between any two nodes of an n-cube is of order M if there are 2^M nodes in the network.

Hash functions such as SHA produce values that are uniformly distributed over the entire range of hash values (2^160 for SHA). The expected distance, defined as the XOR of the two hash values, between a node and its closest neighbor is the ratio of the maximum hash value to the number of nodes. So, if the size of the hash value is N bits and the number of nodes is 2^M, a typical node will have M elements of its array filled in, and the effective nodeid radius of each node is 2^(N-M). It also means that a path can be found to an arbitrary node, given its nodeid, in about M hops. A simple algorithm for accomplishing this is for each node to direct a message to its associate whose nodeid is closest to the target nodeid. The first hop will get the message to a node whose nodeid matches the destination nodeid in the high bit. The second hop will take it to a node whose top two bits match, and so on, until the message reaches the target node; a sketch of this greedy routing appears below.

In reality, each node's array of associates only approximately matches this ideal arrangement, partly because of imperfect knowledge of the other nodes in the system and partly because the randomly selected nodeids are not spread perfectly evenly over all possible values. The lack of knowledge can be remedied by explicitly requesting associates. A node with a sparsely populated associate array can send a message seeking contact with the node whose id is closest to a specific value, the value chosen to fill a weak spot in its array. Each array element should contain at least two associates to provide a variety of paths through the network. Depending on the availability of its associates and the general reliability of the network, it may be reasonable to maintain larger groups.

New nodes can be added to the network by copying the associate list from any existing node and re-sorting it according to the new node's id. A new node can then supplement its array by explicitly requesting contact with nodes whose ids are near the value needed for each element. By symmetry, the just-contacted associate will also find that the new node is a good match for an element in its own array. Such new participants can be added provisionally by more established members so that they are quickly woven into the n-cube network.
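To make the associate array and the greedy routing step concrete, here is a minimal sketch in Python. The 32-bit id width, the group size of two, the ranking of group members by closeness to the group's target, and the omniscient bootstrap loop in the demo are all assumptions of the example rather than part of the design described above; a real node would learn about associates only through messages.

    # Sketch of the per-bit associate array and greedy XOR routing described
    # above. The 32-bit ids, group size, and in-memory bootstrap are invented
    # for illustration; the text assumes 160-bit SHA values and real messages.
    import random

    N_BITS = 32

    class Node:
        def __init__(self, nodeid):
            self.nodeid = nodeid
            # element i holds associates whose ids first differ from ours at
            # bit i, counting from the most significant end
            self.groups = [[] for _ in range(N_BITS)]

        def consider(self, other, group_size=2):
            """Possibly keep `other` as an associate in the group it fits."""
            diff = self.nodeid ^ other.nodeid
            if diff == 0:
                return
            i = N_BITS - diff.bit_length()        # index of first differing bit
            target = self.nodeid ^ (1 << (N_BITS - 1 - i))
            group = self.groups[i]
            group.append(other)
            # ranked here by closeness to the group's target id; the text also
            # allows response time or reliability as the "best" metric
            group.sort(key=lambda n: n.nodeid ^ target)
            del group[group_size:]                # keep only the best few

        def next_hop(self, key):
            """Greedy step: the known associate whose nodeid is closest to key."""
            candidates = [n for g in self.groups for n in g]
            best = min(candidates, key=lambda n: n.nodeid ^ key, default=None)
            if best is None:
                return None
            if (best.nodeid ^ key) >= (self.nodeid ^ key):
                return None                       # no associate is closer than we are
            return best

    def route(start, key, max_hops=64):
        """Follow greedy hops until no associate is closer to the key."""
        node = start
        for hops in range(max_hops):
            nxt = node.next_hop(key)
            if nxt is None:
                return node, hops
            node = nxt
        return node, max_hops

    if __name__ == "__main__":
        rng = random.Random(7)
        nodes = [Node(rng.getrandbits(N_BITS)) for _ in range(2048)]
        for a in nodes:                           # omniscient bootstrap, demo only
            for b in rng.sample(nodes, 128):
                a.consider(b)
        key = rng.getrandbits(N_BITS)
        home, hops = route(nodes[0], key)
        print("key 0x%08x lands on node 0x%08x after %d hops" % (key, home.nodeid, hops))

Each hop moves to an associate strictly closer (by XOR) to the key, so the walk cannot loop and, with reasonably full arrays, reaches a node near the key in roughly M hops for a 2^M-node network.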
To make use of this network for locating files, each node needs to seek out and retain files whose keys are close to its nodeid. Newly inserted data is directed towards the appropriate nodes using the same algorithm the network uses to seek out associates. Nodes now have two conflicting goals: one is to cache recently requested files, whose keys bear no relation to their nodeids, and the other is to hoard files whose keys are close to their nodeids. Balancing these two goals will require additional input.

Some input can be had by observing the arrival rate of newly inserted data and the spread of its keys. The size of the network can be inferred from the number of non-empty elements in the associate array: if M elements are filled, the number of nodes will be within a small factor of 2^M. If the network size remains constant, the spread of data keys that arrive should also be constant. If the observed spread is 2^E and the key size is N, then the number of nodes in the network can also be estimated as 2^(N-E). Some uncertainty in these numbers comes from different choices for what happens when inserted data arrives at a node interested in storing it: does it also pass the data on, or does it just store it?

Some nodes will have larger storage capacity and will want to store files with a larger spread of keys. If a node has associates with nodeids closer to the data key than its own, it should pass the data on to them, but it should probably also pass it on to a few others with nearby nodeids in case some of them have larger data appetites than average. Nodes could also advertise their desired key radius so that neighbors with nearby nodeids would know whether to send new data on to them.

With all this, the network still ends up operating as an LRU cache. As data is inserted, the capacity of nodes to store new data must eventually be exhausted and some data must be discarded to make room. In the absence of other criteria, LRU is probably the best choice. To provide some reasonable level of reliability, however, the system must be able to do better. This is where market incentives enter the picture.

-----

To some degree, free caching services make sense because they are offered opportunistically, on an "as is" basis. There need be no long-term expectation of service. This is not sufficient for a file system that users depend upon for regular use. To make a large decentralized system work, some mechanism will be needed to provide data persistence. Essentially the only realistic option in this regard is to use real money to pay for data storage and retrieval. It may also be necessary at some point to pay for network services as well, but at the moment this seems like a separable issue.

There are three storage-related services that can be charged for: data retrieval, leased storage, and spot-market storage. Charging for data retrieval is easiest: the node requesting a file pays a small fee to the node that produces it. Leased storage is a contractual arrangement between a user and a data server, where the server provides storage for a certain amount of data. Spot-market storage works by creating a digital bearer bond attached to a file whose value increases as time goes by; data merchants can buy the bonds (and the associated data) to make profitable use of otherwise idle storage, or sell them to free up space for other (more profitable) uses.

Data retrieval fees reward a node for caching a file, and the reward is proportional to the demand for the data, which encourages caching popular data. On the other hand, for data that is extremely popular, very many nodes will have the data cached, so requests are diluted and any one node sees relatively few. This oversupply effect favors the few nodes that hold relatively rare data, because each of them gets a large fraction of that data's retrieval requests. A certain balance is thus maintained automatically between widely cached popular data and sparsely held rare data by the economic forces of supply and demand.
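A toy calculation makes this balance concrete. The fee, the demand figures, and the replica counts below are all invented for illustration; the only point is that revenue per replica is fee times demand divided by replicas, so widely cached popular data and sparsely held rare data can pay their holders comparably, while data that is never requested earns nothing, which is why the lease and bond mechanisms described next are needed.

    # Invented numbers illustrating per-replica retrieval revenue.
    FEE = 0.001  # hypothetical fee per retrieval

    files = [
        # (name,            requests/day, replicas in the network)
        ("popular-song",        1000000,      50000),
        ("niche-report",            200,         10),
        ("personal-archive",          0,          1),
    ]

    for name, demand, replicas in files:
        revenue = FEE * demand / replicas
        print("%-18s %.4f per replica per day" % (name, revenue))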
Leasing arrangements provide a steady, reliable home for data produced by a user which is private or not widely shared. These arrangements can be formal business contracts between otherwise unrelated parties, or can be provided as part of a larger relationship. The former case is much like the way an ISP provides a certain amount of web space for each customer's home pages. With end-to-end encryption, the storage provider doesn't need to be trusted with sensitive data; it only needs to store it and make it available reliably. The latter case is like an employee storing business data on company file servers, which is the normal practice within the workplace. The chief complication of implementing this mechanism is that the producer and the server must agree on *what* is to be stored. A database, which could take the form of a list of keys, reference counts and sizes, must be shared between the user and the data server. Because the relationship is strictly bilateral, between contractually bound parties, there are no trust or scaling issues.

Perhaps the most novel mechanism is the use of a digital bearer bond to attach value to the storage of a file. The bond consists of a face value, a maturity date, a hash value and the identity of the issuer. The issuer agrees to pay the face value, on the maturity date, to the holder who presents the bond and a file matching the specified hash. The issuer is an independent party, like an escrow agent or bank, that handles the transaction. A user creates a bond by giving the issuer the face value (minus the interest the issuer can expect to earn, plus a handling fee) along with the desired date and hash value. The issuer returns the resulting bond to the user using standard e-cash protocols. The user can then sell the bond, along with a copy of the data, to a node for a price reduced by the cost of storing the data until the maturity date, again using e-cash mechanisms. That node, acting as a data merchant, can later resell the bond (and data) to another node at a higher price because the maturity date is now closer. Nodes that hold popular data can also make extra money via retrieval fees. When a datum is known to be popular, the buyer can expect to pay a bit more for it, because the storage costs will be offset by retrieval fees.
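The bond itself can be pictured as a small record plus a pricing rule, sketched below in Python. The field names, the flat storage-cost rate, the linear discount, and the issuer name are assumptions of this illustration, not part of any e-cash protocol; the sketch only shows why a merchant can resell the bond at a higher price as maturity approaches, and why expected retrieval fees raise the price of popular data.

    # Illustrative sketch of a storage bearer bond and how a data merchant
    # might price it. Field names, the flat storage-cost rate, and the simple
    # discount model are invented for this example.
    from dataclasses import dataclass
    from datetime import date
    from hashlib import sha1

    @dataclass(frozen=True)
    class StorageBond:
        face_value: float        # paid by the issuer at maturity
        maturity: date           # when the bond can be redeemed
        data_hash: bytes         # hash the redeemer's file must match
        issuer: str              # escrow agent or bank backing the bond

    def resale_price(bond, today,
                     storage_cost_per_day=0.0001,
                     expected_retrieval_income_per_day=0.0):
        """What a merchant might pay: the face value, minus the cost of
        holding the data until maturity, plus any retrieval fees the merchant
        expects to earn along the way (all rates hypothetical)."""
        days = max((bond.maturity - today).days, 0)
        return (bond.face_value
                - days * storage_cost_per_day
                + days * expected_retrieval_income_per_day)

    if __name__ == "__main__":
        data = b"report contents ..."
        bond = StorageBond(1.00, date(2001, 1, 1), sha1(data).digest(),
                           "SomeEscrowAgent")                  # hypothetical issuer
        print(resale_price(bond, date(2000, 1, 1)))            # a year out: deep discount
        print(resale_price(bond, date(2000, 12, 1)))           # near maturity: worth more
        print(resale_price(bond, date(2000, 1, 1),
                           expected_retrieval_income_per_day=0.0002))  # popular data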
The basic roles within the file system can be described with this financial structure in mind. Each node tries to optimize its cash flow by minimizing its expenditures and maximizing its receipts. The standard LRU cache can then be viewed as a way to minimize retrieval fees. However, the strict LRU policy can be modified by information about the historical costs of obtaining and retaining data: a node wants to keep in its cache data that is either expensive to obtain or popular enough to earn revenue. Nodes which have excess storage can lease it out and so earn money, or can purchase storage bonds to profitably fill up their extra space. Of course, the balance between these demands on the available storage space will depend on the prices of each service the node may provide. To some degree each node may participate in all these services, but specialization is likely to be useful.

The whole financial pump is driven by the end users of the system. The user funds his local node to locate, fetch and cache the data he requests, contracts with a data server for storage of personal data, and creates bearer bonds for wider dispersal of the public data he publishes. The money eventually makes its way to the nodes to pay for computing, network and storage resources.

Ted Anderson

[1] This note is at http://www.transarc.com/~ota/datadist-19991026.txt