Data Location and Distribution[1]

Ian Clarke's recent summary of the FreeNet protocol provides an opportunity for me to mention another of my concerns about it. But I'd like to wrap the description with some background explaining my perspective on this whole issue.

My primary motivation is to create a global file system that almost everyone can use for almost all their file storage needs. I think that a suitably designed file system can serve as the storage medium for most distributed applications and provide support for the rest (for instance, by providing static data used in conjunction with dynamically generated content). There are certainly several file system markets which are inappropriate for the file system I envision. These are situations where the performance cost of hashing all the data is too high and the benefits of sharing are too low: very high-end or cluster systems and very low-end or stand-alone systems. I also don't expect to supplant the boot file system on a machine.

Within 5-10 years the Internet will become nearly pervasive. At that point this file system model could address about 99% of the market volume and maybe 95% of the market dollars. With this in mind, and keeping an eye on the future, my target for the file system's scale is 2^40 nodes and 2^80 objects. This is clearly a long-term goal, as it amounts to 2^80 bytes (over a billion petabytes), but that's only 2^40 nodes exporting a terabyte each.

For scalability and privacy reasons, each user needs to have the ability to exercise complete control over access to his data. The obvious approach to this is to have the data producer encrypt the data and the consumer decrypt it, with keys flowing through various paths to enable these virtual channels between producers and consumers. This approach to data security will likely require new paradigms for thinking about information ownership and control.

Complete autonomy is a must for the individual nodes which comprise the system. Autonomy is crucial for providing sufficient scalability to enable such a ubiquitous system. This implies very aggressive caching of data and allowing nodes to do their own directory updates, manage locking on the files they control, and so forth. This requirement for operational independence fits very well with the above security model based on authorization independence.

To support the level of caching needed to make a system of this scale perform well, immutable files are necessary so that the data consistency of the system can be well defined. Present systems allow both names and the data they point at to change, leaving consistency a very fragile commodity. Clearly, producers must be able to "say what they mean" by mapping names to arbitrary data. By making data identifiers immutable, however, the system can also insist that "they mean what they say". In other words, there is no editing of history: "What was once said, stays writ". A file system with well-defined consistency guarantees based on a substrate of immutable files will also require some shift in the paradigm used by both people and applications.

Managing very large systems is extremely difficult. The very largest systems always operate on the verge of being out of control. To manage even larger and more complex systems requires making an actual virtue of being out of control.
Complex Adaptive Systems (CAS) work by breaking the job into very many small parts, each obeying simple local rules and interacting with a few associates; no one is "in charge". This architectural model is perfectly suited to the problem of decentralized data distribution. While it is hard to let go of the mind-set of centralized control, in sufficiently large and complex systems, decentralized approaches are the only ones that can work in practice. The goal of providing secure, censor-proof access to data makes a virtue of decentralization and will help keep centralized components from creeping into the system. Adopting an eventual goal of very large scale deployment will also help avoid leaning on unscalable components. A target of 2^40 nodes supporting 2^80 objects should put any thoughts of systems with any aspect of global knowledge out of the question. The challenge is to design local algorithms that, when embodied in many participating nodes, collectively achieve the desired goals.

-----

One of the problems with FreeNet is that it has no good story on data persistence. It is a pure cache, and unpopular data just falls off the end of the LRU list and disappears forever. In a real file system this is not a desirable property, though excellent caching capabilities are certainly crucial for acceptable performance in any large scale distributed file system. The Eternity system, as the name implies, focuses on persistence without addressing the issues of caching. We need a more comprehensive model for a general purpose file system.

I am interested in establishing the basic feasibility of a storage network on the scale I contemplate. To that end, both FreeNet and Eternity provide starting points for thinking about data distribution in really large scale file systems.

The FreeNet data storage system provides a global LRU cache using an adaptive network of servers. The ability of the cache to quickly and reliably produce the data matching a requested key is crucial. This ability needs to scale to large network sizes and be robust to various failure modes. More work needs to be done to clarify the operation of the FreeNet cache. Scalability measurements at the level of a few hundred or a few thousand nodes, unless very carefully done, will tell us little about the scalability to 2^20 nodes, let alone 2^32 or 2^40. The simplest thing would be to simulate much larger networks so that the behavior as a function of size can be measured, though there are practical considerations which will limit this approach. Better understanding at the current scale is possible and would help identify the limitations of very large networks. Parametric sensitivity studies of all the parameters affecting the scalability would greatly help.

For instance, the hop count is an 8-bit field which is decremented by every host along the request path. If one route results in a loop or backtrack failure, the node retries the search down another path, using the hop count from the just-received response. The effect of this is that the hop count is a measure of nodes searched, not of search depth. Unless the search converges very rapidly on a node caching the desired file, this small hop count will not allow files to be found in large networks. To explore this issue it would be useful to see how small networks perform with tiny hop counts.
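To make the hop-count concern concrete, here is a minimal simulation sketch, in Python, of a FreeNet-style search in which the hops-to-live budget is charged for every node visited, including retries after loops and dead ends. The random network model and the blind neighbor selection are invented for illustration (real FreeNet routes toward key similarity); the point is only that the counter bounds the number of nodes searched, not the depth reached.

    # Toy model: hops-to-live (HTL) is spent on every node visited, including
    # retries after loops and dead ends, so it limits nodes searched, not depth.
    # The random topology and blind routing below are invented for illustration.
    import random

    def make_network(n, degree=4, seed=1):
        """Random directed graph: node -> list of neighbours (hypothetical)."""
        rng = random.Random(seed)
        return {u: rng.sample([v for v in range(n) if v != u], degree)
                for u in range(n)}

    def search(net, start, target, htl):
        """Depth-first search charging one HTL unit per node visited."""
        visited, stack = set(), [start]
        while stack and htl > 0:
            node = stack.pop()
            if node in visited:
                continue               # loop: back up and try another path
            visited.add(node)
            htl -= 1                   # charged per node searched, not per level
            if node == target:
                return True, len(visited)
            stack.extend(v for v in net[node] if v not in visited)
        return False, len(visited)

    if __name__ == "__main__":
        net = make_network(2000)
        rng = random.Random(2)
        results = [search(net, 0, rng.randrange(2000), htl=255) for _ in range(50)]
        hits = sum(1 for found, _ in results if found)
        print("found %d of 50 random keys with HTL=255 in a 2000-node network" % hits)

Without routing that converges on the right region quickly, 255 units of hops-to-live can only ever touch 255 nodes, a vanishing fraction of a 2^20-node network.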
I am unpersuaded that nodes will really specialize adaptively to handle certain ranges of keys. Will these specializations be well defined, persistent over time, and known to their neighbors? Will data successfully make its way to these nodes upon insertion, and will appropriate requests likewise be directed to these nodes? It is pretty clear that, in the face of requests, the FreeNet caching behavior will migrate data towards its most frequent users. However, there is no pressure encouraging data to collect itself where it can be found later.

Consider the following example, in which FreeNet is widely used to distribute data for all sorts of applications. I am in Bosnia filing an important report, so I set the insertion hop count to 255. The data ends up on 255 nodes in and around Eastern Europe. Perhaps the report even makes it to my immediate supervisor in Brussels. However, I explicitly send the key to an associate in Chile. This guy doesn't get around to fetching the data for a few days. What happens? I imagine normal daily activity in the network causes a sort of tidal bulge in which high levels of load travel around the planet with the workday hours. These local surges in demand will cycle through the caches of the most active nodes. After a few days the recipient of my email tries to fetch the report. The nodes that might have had some affinity for the key identifying the report have long since purged it from their caches. The nodes that still have the report are those near nodes that fetched it recently, and they are all in Europe. Optimistically, we may assume that the request from Chile makes it to a node or two in Europe that used to cache the report, but the trail showing where the report migrated in the intervening days is gone. Because there is no tendency for data to migrate towards a rendezvous point where requesters know to look, many files that are still cached in the network will go undiscovered.

Another area of concern is the method for adding nodes to the network. There is little in the FreeNet report that describes this process, but I believe that in a large network the way in which new nodes are added to the system may be quite important. How do they find their initial points of contact with the network? How do paths of communication develop so that a user can obtain data residing on another continent without searching a million or a billion or more nodes on his own continent first?

-----

The core of my solution to these problems is to add active specialization to each node's data storage behavior. Each node will have a target key and it will collect files with nearby keys. In addition, I believe the whole system needs to be able to use market-based incentives to influence its allocation of resources. This will allow each node to devote a portion of its disk space to data it is paid to store. This combination of collecting behavior and profit motive will produce a decentralized distributed storage system with quantifiable reliability.

When a node joins the system it selects a random hash value which identifies its position in the data storage network, called its nodeid. I assume that data keys are produced using the same hash function, so they are the same size as nodeids. The node maintains a list of other nodes it associates with and their ids. These associates are arranged in an array of node groups. Each element of the array is labeled by a target nodeid whose value matches this node's id except at the bit whose position equals the array index, counting from the most significant end.
So the first element will collect associates whose nodeids are close to this node's id with the high bit complemented; the next element's members will cluster around the node's id with the second-highest bit complemented. Each node collects associates whose nodeids most closely match the target nodeid for each element of the array. If more than a few nodes are found for a group, only the best are kept, where best is defined as lowest response time, most reliable, or some other suitable metric. This array constitutes the edges of an n-cube graph linking the nodes of the network together. The distance between any two nodes of an n-cube is of order M if there are 2^M nodes in the network.

Hash functions such as SHA produce values that are uniformly distributed over the entire range of hash values (2^160 for SHA). The expected distance, defined as the XOR of the two hash values, between a node and its closest neighbor is the ratio of the maximum hash value to the number of nodes. So, if the size of the hash value is N bits and the number of nodes is 2^M, a typical node will have M elements of its array filled in, and the effective nodeid radius of each node is 2^(N-M). It also means that a path can be found to an arbitrary node, given its nodeid, in about M hops. A simple algorithm for accomplishing this is for each node to direct a message to its associate whose nodeid is closest to the target nodeid. The first hop will get the message to a node whose nodeid matches the destination nodeid in the high bit. The second hop will take it to a node whose top two bits match, and so on, until the message reaches the target node; a sketch of this greedy routing appears below.

In reality, each node's array of associates only approximately matches this ideal arrangement, partly because of imperfect knowledge of the other nodes in the system and partly because the randomly selected nodeids are not spread perfectly evenly over all possible values. The lack of knowledge can be remedied by explicitly requesting associates. A node with a sparsely populated associate array can send a message seeking contact with the node whose id is closest to a specific value, the value chosen to fill a weak spot in its array. Each array element should contain at least two associates to provide a variety of paths through the network. Depending on the availability of its associates and the general reliability of the network, it may be reasonable to maintain larger groups.

New nodes can be added to the network by copying the associate list from any existing node and re-sorting it according to the new node's id. A new node can then supplement its array by explicitly requesting contact with nodes whose ids are near the value needed for each element. By symmetry, the just-contacted associate will also find that the new node is a good match for an element in its own array. Such new participants can be added provisionally by more established members so that they are quickly woven into the n-cube network.
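To make the associate array and the greedy routing step concrete, here is a minimal sketch in Python. The 32-bit id width, the group size of two, the ranking of group members by closeness to the group's target, and the omniscient bootstrap loop in the demo are all assumptions of the example rather than part of the design described above; a real node would learn about associates only through messages.

    # Sketch of the per-bit associate array and greedy XOR routing described
    # above. The 32-bit ids, group size, and in-memory bootstrap are invented
    # for illustration; the text assumes 160-bit SHA values and real messages.
    import random

    N_BITS = 32

    class Node:
        def __init__(self, nodeid):
            self.nodeid = nodeid
            # element i holds associates whose ids first differ from ours at
            # bit i, counting from the most significant end
            self.groups = [[] for _ in range(N_BITS)]

        def consider(self, other, group_size=2):
            """Possibly keep `other` as an associate in the group it fits."""
            diff = self.nodeid ^ other.nodeid
            if diff == 0:
                return
            i = N_BITS - diff.bit_length()        # index of first differing bit
            target = self.nodeid ^ (1 << (N_BITS - 1 - i))
            group = self.groups[i]
            group.append(other)
            # ranked here by closeness to the group's target id; the text also
            # allows response time or reliability as the "best" metric
            group.sort(key=lambda n: n.nodeid ^ target)
            del group[group_size:]                # keep only the best few

        def next_hop(self, key):
            """Greedy step: the known associate whose nodeid is closest to key."""
            candidates = [n for g in self.groups for n in g]
            best = min(candidates, key=lambda n: n.nodeid ^ key, default=None)
            if best is None:
                return None
            if (best.nodeid ^ key) >= (self.nodeid ^ key):
                return None                       # no associate is closer than we are
            return best

    def route(start, key, max_hops=64):
        """Follow greedy hops until no associate is closer to the key."""
        node = start
        for hops in range(max_hops):
            nxt = node.next_hop(key)
            if nxt is None:
                return node, hops
            node = nxt
        return node, max_hops

    if __name__ == "__main__":
        rng = random.Random(7)
        nodes = [Node(rng.getrandbits(N_BITS)) for _ in range(2048)]
        for a in nodes:                           # omniscient bootstrap, demo only
            for b in rng.sample(nodes, 128):
                a.consider(b)
        key = rng.getrandbits(N_BITS)
        home, hops = route(nodes[0], key)
        print("key 0x%08x lands on node 0x%08x after %d hops" % (key, home.nodeid, hops))

Each hop moves to an associate strictly closer (by XOR) to the key, so the walk cannot loop and, with reasonably full arrays, reaches a node near the key in roughly M hops for a 2^M-node network.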
To make use of this network for locating files, each node needs to seek out and retain files whose keys are close to its nodeid. Newly inserted data is directed towards the appropriate nodes using the same algorithm the network uses to seek out associates. Nodes now have two conflicting goals: one is to cache recently requested files, whose keys bear no relation to their nodeids, and the other is to hoard files whose keys are close to their nodeids. Balancing these two goals will require additional input.

Some input can be had by observing the arrival rate of newly inserted data and the spread of its keys. The size of the network can be inferred from the number of non-empty elements in the associate array: if M elements are filled, the number of nodes will be within a small factor of 2^M. If the network size remains constant, the spread of data keys that arrive should also be constant. If the observed spread is 2^E and the key size is N, then the number of nodes in the network can also be estimated as 2^(N-E). Some uncertainty in these numbers comes from different choices for what happens when inserted data arrives at a node interested in storing it: does it also pass the data on, or does it just store it?

Some nodes will have larger storage capacity and will want to store files with a larger spread of keys. If a node has associates with nodeids closer to the data key than its own, it should pass the data on to them, but it should probably also pass it on to a few others with nearby nodeids in case some of them have larger data appetites than average. Nodes could also advertise their desired key radius so that neighbors with nearby nodeids would know whether to send new data on to them.

With all this, the network still ends up operating as an LRU cache. As data is inserted, the capacity of nodes to store new data must eventually be exhausted and some data must be discarded to make room. In the absence of other criteria, LRU is probably the best choice. To provide some reasonable level of reliability, however, the system must be able to do better. This is where market incentives enter the picture.

-----

To some degree, free caching services make sense because they are offered opportunistically, on an "as is" basis. There need be no long-term expectation of service. This is not sufficient for a file system that users depend upon for regular use. To make a large decentralized system work, some mechanism will be needed to provide data persistence. Essentially the only realistic option in this regard is to use real money to pay for data storage and retrieval. It may also be necessary at some point to pay for network services as well, but at the moment this seems like a separable issue.

There are three storage-related services that can be charged for: data retrieval, leased storage, and spot-market storage. Charging for data retrieval is easiest: the node requesting a file pays a small fee to the node that produces it. Leased storage is a contractual arrangement between a user and a data server, where the server provides storage for a certain amount of data. Spot-market storage works by creating a digital bearer bond attached to a file whose value increases as time goes by; data merchants can buy the bonds (and the associated data) to make profitable use of otherwise idle storage, or sell them to free up space for other (more profitable) uses.

Data retrieval fees reward a node for caching a file, and the reward is proportional to the demand for the data, which encourages caching popular data. On the other hand, for data that is extremely popular, very many nodes will have the data cached, so requests are diluted and any one node sees relatively few. This oversupply effect favors the few nodes that hold relatively rare data, because each of them gets a large fraction of that data's retrieval requests. A certain balance is thus maintained automatically between widely cached popular data and sparsely held rare data by the economic forces of supply and demand.
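A toy calculation makes this balance concrete. The fee, the demand figures, and the replica counts below are all invented for illustration; the only point is that revenue per replica is fee times demand divided by replicas, so widely cached popular data and sparsely held rare data can pay their holders comparably, while data that is never requested earns nothing, which is why the lease and bond mechanisms described next are needed.

    # Invented numbers illustrating per-replica retrieval revenue.
    FEE = 0.001  # hypothetical fee per retrieval

    files = [
        # (name,            requests/day, replicas in the network)
        ("popular-song",        1000000,      50000),
        ("niche-report",            200,         10),
        ("personal-archive",          0,          1),
    ]

    for name, demand, replicas in files:
        revenue = FEE * demand / replicas
        print("%-18s %.4f per replica per day" % (name, revenue))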
Leasing arrangements provide a steady, reliable home for data produced by a user which is private or not widely shared. These arrangements can be formal business contracts between otherwise unrelated parties, or can be provided as part of a larger relationship. The former case is much like the way an ISP provides a certain amount of web space for each customer's home pages. With end-to-end encryption, the storage provider doesn't need to be trusted with sensitive data; it only needs to store it and make it available reliably. The latter case is like an employee storing business data on company file servers, which is the normal practice within the workplace. The chief complication of implementing this mechanism is that the producer and the server must agree on *what* is to be stored. A database, which could take the form of a list of keys, reference counts and sizes, must be shared between the user and the data server. Because the relationship is strictly bilateral, between contractually bound parties, there are no trust or scaling issues.

Perhaps the most novel mechanism is the use of a digital bearer bond to attach value to the storage of a file. The bond consists of a face value, a maturity date, a hash value and the identity of the issuer. The issuer agrees to pay the face value, on the maturity date, to the holder who presents the bond and a file matching the specified hash. The issuer is an independent party, like an escrow agent or bank, that handles the transaction. A user creates a bond by giving the issuer the face value (minus the interest the issuer can expect to earn, plus a handling fee) along with the desired date and hash value. The issuer returns the resulting bond to the user using standard e-cash protocols. The user can then sell the bond, along with a copy of the data, to a node for a price reduced by the cost of storing the data until the maturity date, again using e-cash mechanisms. That node, acting as a data merchant, can later resell the bond (and data) to another node at a higher price because the maturity date is now closer. Nodes that hold popular data can also make extra money via retrieval fees. When a datum is known to be popular, the buyer can expect to pay a bit more for it, because the storage costs will be offset by retrieval fees.
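The bond itself can be pictured as a small record plus a pricing rule, sketched below in Python. The field names, the flat storage-cost rate, the linear discount, and the issuer name are assumptions of this illustration, not part of any e-cash protocol; the sketch only shows why a merchant can resell the bond at a higher price as maturity approaches, and why expected retrieval fees raise the price of popular data.

    # Illustrative sketch of a storage bearer bond and how a data merchant
    # might price it. Field names, the flat storage-cost rate, and the simple
    # discount model are invented for this example.
    from dataclasses import dataclass
    from datetime import date
    from hashlib import sha1

    @dataclass(frozen=True)
    class StorageBond:
        face_value: float        # paid by the issuer at maturity
        maturity: date           # when the bond can be redeemed
        data_hash: bytes         # hash the redeemer's file must match
        issuer: str              # escrow agent or bank backing the bond

    def resale_price(bond, today,
                     storage_cost_per_day=0.0001,
                     expected_retrieval_income_per_day=0.0):
        """What a merchant might pay: the face value, minus the cost of
        holding the data until maturity, plus any retrieval fees the merchant
        expects to earn along the way (all rates hypothetical)."""
        days = max((bond.maturity - today).days, 0)
        return (bond.face_value
                - days * storage_cost_per_day
                + days * expected_retrieval_income_per_day)

    if __name__ == "__main__":
        data = b"report contents ..."
        bond = StorageBond(1.00, date(2001, 1, 1), sha1(data).digest(),
                           "SomeEscrowAgent")                  # hypothetical issuer
        print(resale_price(bond, date(2000, 1, 1)))            # a year out: deep discount
        print(resale_price(bond, date(2000, 12, 1)))           # near maturity: worth more
        print(resale_price(bond, date(2000, 1, 1),
                           expected_retrieval_income_per_day=0.0002))  # popular data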
The basic roles within the file system can be described with this financial structure in mind. Each node tries to optimize its cash flow by minimizing its expenditures and maximizing its receipts. The standard LRU cache can then be viewed as a way to minimize retrieval fees. However, the strict LRU policy can be modified by information about the historical costs of obtaining and retaining data: a node wants to keep in its cache data that is either expensive to obtain or popular enough to earn revenue. Nodes which have excess storage can lease it out and so earn money, or can purchase storage bonds to profitably fill up their extra space. Of course, the balance between these demands on the available storage space will depend on the prices of each service the node may provide. To some degree each node may participate in all these services, but specialization is likely to be useful.

The whole financial pump is driven by the end users of the system. The user funds his local node to locate, fetch and cache the data he requests, contracts with a data server for storage of personal data, and creates bearer bonds for wider dispersal of the public data he publishes. The money eventually makes its way to the nodes to pay for computing, network and storage resources.

Ted Anderson

[1] This note is at http://www.transarc.com/~ota/datadist-19991026.txt