Wednesday, June 12, 2002

WRL Resolution Rules

Here are the "WRL Resolution Rules". The "embedded host" information is the host from the http:// wrapper.

  • IF there is "embedded host" information
    • Check for a local copy
    • Do an HTTP HEAD against the embedded host for a list of alternative sites
    • Do a P2P query
    • Get the file from the embedded host
  • IF there isn't "embedded host" information
    • Check for a local copy
    • Do a P2P query

The idea is that the "embedded host" is the last resort for downloading the file. The WRL server that downloads the file may then suggest itself to the embedded host as an alternative host for downloading.
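The ordering above can be sketched as a small driver. This is only an illustration: the step names and the idea of passing each step in as a callable are my own assumptions, and I'm assuming the HEAD step returns the file itself if one of the alternative sites can supply it.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical step signature: (wrl, embedded_host) -> document bytes, or None.
Step = Callable[[str, Optional[str]], Optional[bytes]]

def resolution_order(embedded_host: Optional[str]) -> List[str]:
    """Names of the steps to try, in the order given by the rules."""
    if embedded_host:
        # The embedded host itself comes last: it is the last resort.
        return ["local", "head_alternatives", "p2p", "embedded_host"]
    return ["local", "p2p"]

def resolve(wrl: str, embedded_host: Optional[str],
            steps: Dict[str, Step]) -> Optional[bytes]:
    """Try each step in order; the first one that returns data wins."""
    for name in resolution_order(embedded_host):
        data = steps[name](wrl, embedded_host)
        if data is not None:
            return data
    return None
```

The real steps (local cache lookup, HTTP HEAD, Gnutella query, plain HTTP GET) would be plugged in through the "steps" dictionary.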

Bootstrapping

Since it's unlikely most browsers will recognize the "wrl:" URL protocol, it's essential that we find a "bootstrap" process to make this work and be useful. So...

Until WRLs become widespread (and maybe even afterwards):

  • WRLs should be in embedded format -- http://<server>/wrl/wrl:156BE178.61B5F769.D02E4FE5.D1A9E13C.8623F71D/docs/index.html.
  • WRL-aware browsers should re-write this WRL to point to their local WRL server.
  • WRL-aware upstream caches should follow the WRL resolution rules also.

We'll describe the "WRL resolution rules" soon. The main point is that if there is something in the path between the browser and the data that understands WRLs, "something different" can be done. If nothing understands WRLs, the original server of the document will be contacted, which leaves us no worse off than we were originally.
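Here is a rough sketch of the re-writing a WRL-aware browser might do. The local server address "localhost:8080" is made up, and since I haven't said how the original host gets passed along, the sketch simply returns it next to the rewritten URL.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit, urlunsplit

LOCAL_WRL_SERVER = "localhost:8080"  # hypothetical address of the local WRL server

def rewrite_to_local(url: str,
                     local_server: str = LOCAL_WRL_SERVER) -> Tuple[str, Optional[str]]:
    """Point an embedded WRL at the local WRL server.

    Returns (rewritten_url, embedded_host). URLs that are not embedded
    WRLs come back unchanged, with None as the embedded host.
    """
    parts = urlsplit(url)
    if parts.scheme == "http" and parts.path.startswith("/wrl/wrl:"):
        rewritten = urlunsplit(("http", local_server, parts.path,
                                parts.query, parts.fragment))
        return rewritten, parts.netloc
    return url, None
```

Anything that doesn't look like an embedded WRL passes through untouched, which is the "no worse off" case.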

Conceptual Document IDs

A WRL may have zero, one, or more Conceptual Document IDs (CDIs). CDIs provide a way of grouping related documents together -- for example, different revisions of a single document, different songs from the same album, or different versions of a software program will share either the same or similar CDIs.

CDIs always look like "<local-part>@<naming-authority>" -- for example, "introduction.wrl.documents@davidjanes.com".

The "<naming-authority>" is a well-formed Internet domain name controlled by the document creator. It does not have to map to an actual Internet site!

The "<local-part>" is a little more complicated. It is a dot-separated set of words that imply a hierarchy of document naming. This hierarchy can be used to find related documents. For example, the CDI (note the leading "."):

  • .wrl.documents@davidjanes.com

matches the following documents

  • introduction.wrl.documents@davidjanes.com
  • specification.wrl.documents@davidjanes.com
  • business-plan.wrl.documents@davidjanes.com

There's a lot more to be done here, but the idea is to allow versioning of documents to be specified using CDIs.
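The leading-"." matching described above might look something like this. The details are guesses on my part: I'm assuming a pattern without a leading "." has to match exactly, that a leading "." matches any number of extra words in front, and that the naming authority is compared case-insensitively like any domain name.

```python
def cdi_matches(pattern: str, cdi: str) -> bool:
    """Return True if the CDI falls under the (possibly dotted) pattern."""
    p_local, _, p_authority = pattern.rpartition("@")
    c_local, _, c_authority = cdi.rpartition("@")
    # The naming authority must be the same on both sides.
    if p_authority.lower() != c_authority.lower():
        return False
    # A leading "." means "anything below this point in the hierarchy".
    if p_local.startswith("."):
        return c_local.endswith(p_local)
    # Otherwise only the identical local part matches (an assumption).
    return p_local == c_local
```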

Actual Document IDs

Every document implicitly has a unique Actual Document ID (ADI). The ADI is created by running the SHA-1 hash algorithm against the document and encoding the result as a hexadecimal string. Within the bounds of reasonable probability, no two different documents will ever generate the same ADI.

Thus, given any document, it can always be named uniquely with a WRL. Humans will never be expected to type in a WRL with an ADI -- it will either be automatically generated or reached from a hyperlink.

Currently, I am thinking that ADIs should be encoded in the following format: five dot-separated groups, each group encoding 4 bytes in upper-case hexadecimal -- e.g. "D8569E7F.4B28E03B.1C4ED183.98DB3698.7D8C27E2". This takes a total of 44 bytes to encode, which is a little wasteful compared to Base32 (32 bytes) or Base64 (28 bytes with padding), but it is a hell of a lot easier on the eyes.
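A minimal sketch of that encoding, using Python's standard hashlib. The function name "adi" is just for illustration; the last lines check the length of the dotted form against Base32 and Base64 encodings of the same 20-byte digest.

```python
import base64
import hashlib

def adi(data: bytes) -> str:
    """SHA-1 of the data as five dot-separated groups of 8 upper-case hex digits."""
    hexdigest = hashlib.sha1(data).hexdigest().upper()  # 40 hex characters
    return ".".join(hexdigest[i:i + 8] for i in range(0, 40, 8))

digest = hashlib.sha1(b"").digest()        # 20 raw bytes
dotted_len = len(adi(b""))                 # 40 hex digits + 4 dots
base32_len = len(base64.b32encode(digest))
base64_len = len(base64.b64encode(digest))
```

For the empty document this gives "DA39A3EE.5E6B4B0D.3255BFEF.95601890.AFD80709", with lengths 44, 32 and 28 respectively.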

Why are WRLs URLs and not URNs?

Initially, we defined WRLs as a type of URN (rather than a type of URL), as this is more in line "with the standard". However, we changed our minds after we discovered several things:

  • Since URNs were first posited, almost no useful work has been done with them (check your RFCs);
  • RFC 2141, which specifies URN syntax, restricts the characters that can be used in URNs -- such as "/" -- so we couldn't do required things such as specifying files within an archive;
  • Browsers inherently work usefully with the URL syntax, and we want to build on that.

Finally, since URNs and URLs both need "resolvers", the distinction seems rather artificial, so we went with the more powerful and much more popular URL.

What are URLs, URNs, and URIs?

Here's the description from RFC 1630:

A Universal Resource Identifier (URI) is a member of this universal set of names in registered name spaces and addresses referring to registered protocols or name spaces. A Uniform Resource Locator (URL), defined elsewhere, is a form of URI which expresses an address which maps onto an access algorithm using network protocols. Existing URI schemes which correspond to the (still mutating) concept of IETF URLs are listed here. The Uniform Resource Name (URN) debate attempts to define a name space (and presumably resolution protocols) for persistent object names.

Or, to put it another way: URLs are supposed to point to something that can be accessed using a "network protocol"; URNs are supposed to define persistent objects; and URIs are the generic name referring to both URLs and URNs.

URI 
 +----- URL
 |       +---- wrl:
 |       +---- http:
 |       +---- ftp:
 |       +---- mailto:
 +------ URN
         +---- ...

Monday, June 10, 2002

Embedded WRLs

RFC 2169 is the prototype for embedding WRLs inside the normal "http:" URL. Here are some examples:

  • http://<server>/wrl/wrl:156BE178.61B5F769.D02E4FE5.D1A9E13C.8623F71D/docs/index.html
  • http://<server>/wrl/wrl:156BE178.61B5F769.D02E4FE5.D1A9E13C.8623F71D/docs/api/index.html
  • http://<server>/wrl/wrl:156BE178.61B5F769.D02E4FE5.D1A9E13C.8623F71D/docs/images/javalogo52x88.gif

I.e. all we've done is add "http://<server>/wrl/" in front of the WRL to make a URL. There's a lot more to be said about this also.
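Since embedding is just a prefix, wrapping and unwrapping are nearly trivial. A sketch, with made-up function names "embed" and "unembed" (the host returned by "unembed" is the "embedded host"):

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

def embed(wrl: str, server: str) -> str:
    """Wrap a pure WRL in an ordinary http: URL on the given server."""
    return "http://%s/wrl/%s" % (server, wrl)

def unembed(url: str) -> Optional[Tuple[str, str]]:
    """Return (embedded_host, wrl), or None if the URL is not an embedded WRL."""
    parts = urlsplit(url)
    if parts.scheme != "http" or not parts.path.startswith("/wrl/wrl:"):
        return None
    # Drop the "/wrl/" prefix; what remains is the pure WRL.
    return parts.netloc, parts.path[len("/wrl/"):]
```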

Example WRLs

Here are some "pure" WRLs:

  • wrl:D8569E7F.4B28E03B.1C4ED183.98DB3698.7D8C27E2/
    This WRL contains only an "Actual Document ID", which is a SHA-1 hash of the file encoded in hexadecimal. Much more on this to come.
  • wrl:wrlideas@davidjanes.com/
    This WRL contains only a "Conceptual Document ID".
  • wrl:wrlideas@davidjanes.com,D8569E7F.4B28E03B.1C4ED183.98DB3698.7D8C27E2/
    This WRL contains both an ADI and a CDI.

Contrast the first WRL "wrl:D8569E7F.4B28E03B.1C4ED183.98DB3698.7D8C27E2/" with one from the CAW: "urn:sha1:RMUVHIRSGUU3VU7FJWRAKW3YWG2S2RFB". The real beauty of the WRL system shows when we use archives -- the WRLs below all reference files from the JavaDocs for the 1.4.0 SDK. When displayed in your browser, the "index.html" files do exactly what you expect: display the webpages with all graphics and links intact.

  • wrl:156BE178.61B5F769.D02E4FE5.D1A9E13C.8623F71D/docs/index.html
  • wrl:156BE178.61B5F769.D02E4FE5.D1A9E13C.8623F71D/docs/api/index.html
  • wrl:156BE178.61B5F769.D02E4FE5.D1A9E13C.8623F71D/docs/images/javalogo52x88.gif
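To make the structure of these examples concrete, here is a sketch of a parser. The grammar is inferred from the examples only and nothing here is final: I'm assuming IDs are comma-separated, that an ID containing "@" is a CDI, that anything shaped like five groups of eight hex digits is the ADI, and that everything after the first "/" is a path inside the archive.

```python
import re
from typing import List, NamedTuple, Optional

# Five dot-separated groups of 8 upper-case hex digits.
ADI_RE = re.compile(r"[0-9A-F]{8}(?:\.[0-9A-F]{8}){4}")

class WRL(NamedTuple):
    adi: Optional[str]   # Actual Document ID, if present
    cdis: List[str]      # zero or more Conceptual Document IDs
    path: str            # path within the archive ("" if none)

def parse_wrl(wrl: str) -> WRL:
    if not wrl.startswith("wrl:"):
        raise ValueError("not a WRL: %r" % wrl)
    ids, _, path = wrl[len("wrl:"):].partition("/")
    adi, cdis = None, []
    for ident in filter(None, ids.split(",")):
        if ADI_RE.fullmatch(ident):
            adi = ident
        elif "@" in ident:
            cdis.append(ident)
        else:
            raise ValueError("unrecognized ID: %r" % ident)
    return WRL(adi, cdis, path)
```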

Spoofing

Here's my big worry about allowing "conceptual" IDs into WRLs: spoofing. Trust rings, perhaps?

Put up or shut up

Question: Where's the source?

Answer: Coming very, very soon. Really. I need to make sure that there are no dependencies on proprietary code. I also need to decide on a licensing model.

The Content-Addressable Web

Question: Doesn't this sound suspiciously like the Content-Addressable Web?

Answer: Yes, it does, but I came by my ideas independently (really). I think I do a few things nicer than CAW, and I hope to demonstrate that soon. In particular (in order of importance, I think):

  • We use URLs rather than URNs to identify documents,
  • We can publish archives and uniquely identify documents (and directories) within those archives,
  • We can "abstractly" identify documents,
  • We can associate metadata with files.

I'm not sure what Onion Networks is doing for P2P -- I'm basing my stuff on Gnutella.

What is the Whirlpool?

The Whirlpool is:

  • A technology for allowing any document to be uniquely identified and usefully retrieved using P2P protocols
  • The collection of all documents available through this technology

WRL nominally stands for , but it is secretly a reference to Vienna, where the Whirlpool was designed and where they tend to name almost everything beginning with the letter W.

What is this blog all about?

This is about a software project I've been working on for the last several months, called "The Whirlpool". Other things -- particularly my job, my life, my geographical position on the planet -- have bogged down my progress, so I thought I'd "throw out there" my work up till this point.