Gnutella Developer Forum                                         G. Mohr
                                                             Bitzi, Inc.
                                                       November 30, 2001


               Hash/URN Gnutella Extensions (HUGE) v0.93

Abstract

   HUGE is a collection of incremental extensions to the Gnutella
   protocol (v 0.4) which allow files to be identified and located by
   Uniform Resource Names (URNs) -- reliable, persistent,
   location-independent names, such as those provided by secure hash
   values.

Table of Contents

   1.    HUGE in a Nutshell
   2.    Background
   2.1   Motivation & Goals
   2.2   Hash and URN Conventions
   2.3   Gnutella Version
   3.    General Extension Mechanism
   4.    Query Extensions
   5.    QueryHit Extensions
   6.    Download Extensions
   6.1   URN-based Request-URI
   6.2   Headers
   6.2.1 X-Gnutella-Content-URN
   6.2.2 X-Gnutella-Alternate-Location
   7.    Implementation Recommendations
   8.    Acknowledgements
         References
         Author's Address

The GDF                        HUGE v0.93                  November 2001

1. HUGE in a Nutshell

   If you would like to receive URNs, such as hashes, reported on the
   hits for any other Query, insert a null-terminated string indicating
   the prefix of the kind(s) of URNs you'd like to receive after the
   first null, within the Query payload.
   For example:

      QUERY:
        STD-HEADER:
          [23 bytes]
        QUERY-SEARCH-STRING:
          Gnutella Protocol[0x00]urn:[0x00]

   Meaning: "Find files with the keywords 'Gnutella Protocol', and if
   possible, label the results with any 'urn:' identifiers available."

   If you would like to Query for files by hash value, leave the
   standard search-string empty, and insert a valid URN
   between-the-nulls.  (For SHA1, the hash portion of the URN is 20 raw
   bytes, Base32-encoded.)

   For example:

      QUERY:
        STD-HEADER:
          [23 bytes]
        QUERY-SEARCH-STRING:
          [0x00]urn:sha1:PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB[0x00]

   Meaning: "Find files with exactly this SHA1 hash."

   When you receive a Query that requests hashes, report them by
   inserting the valid URN between the two nulls which mark the end of
   each distinct result in your QueryHit.

   For example:

      QUERYHIT:
        STD-HEADER:
          [23 bytes]
        QUERY-HIT-HEADER:
          [11 bytes]
        EACH-RESULT:
          INDEX:
            [4 bytes]
          LEN:
            [4 bytes]
          FILENAME:
            GnutellaProtocol04.pdf[0x00]
          EXTRA:
            urn:sha1:PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB[0x00]
        SERVENT-IDENTIFIER:
          [16 bytes]

   Meaning: "Here's a file which matches your Query, and here also is
   its SHA1 hash."

   If you return such a URN, you must also accept it in an HTTP
   file-request, in accordance with the following syntax:

      GET /uri-res/N2R?urn:sha1:PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB HTTP/1.0

   This syntax is in addition to, not in place of, the traditional
   file-index/filename based GET convention.

   To be in compliance with this specification, you should support at
   least the SHA1 hash algorithm and format reflected here, and be able
   to downconvert "bitprint" requests/reports to SHA1.  Other URN
   namespaces are optional and should be gracefully ignored when not
   understood.

   Please refer to the rest of this document for other important
   details.

2. Background

2.1 Motivation & Goals

   By enabling the GnutellaNet to identify and locate files by
   hash/URN, a number of features could be offered with the potential
   to greatly enhance end-user experience.
   These include:

   o  Folding together the display of query results which represent the
      exact same file -- even if those identical files have different
      filenames.

   o  Parallel downloading from multiple sources ("swarming") with
      final assurance that the complete file assembled matches the
      remote source files.

   o  Safe "resume from alternate location" functionality, again with
      final assurance of file integrity.

   o  Cross-indexing GnutellaNet content against external catalogs
      (e.g. Bitzi) or foreign P2P systems (e.g. FastTrack, OpenCola,
      MojoNation, Freenet, etc.)

   The goal of these extensions, termed the "Hash/URN Gnutella
   Extensions" ("HUGE"), is to enable cooperating servents to identify
   and search for files by hash or other URN.  This is to be done in a
   way that does not interfere with the operation of older servents,
   servents which choose not to implement these features, or other
   Gnutella-extension proposals.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the
   rest of this document are to be interpreted as described in RFC
   2119.

2.2 Hash and URN Conventions

   URN syntax was originally defined in RFC2141; a procedure for
   registering URN namespaces is described in RFC2611.  URNs follow the
   general syntax:

      urn:[Namespace-ID]:[Namespace-Specific-String]

   All examples in this version of this document presume the
   Namespace-ID "sha1", which is not yet officially registered, and a
   Namespace-Specific-String which is a 32-character Base32-encoding of
   a 20-byte SHA1 hash value.  For example:

      urn:sha1:PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB

   Case is unimportant for these identifiers, although other
   URN-schemes will sometimes have case-sensitive
   Namespace-Specific-Strings.

   Formal documentation and registration of this namespace and encoding
   will proceed in separate documents, and this document will be
   updated with references when possible.
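
   As an illustration of the URN conventions described so far, the
   following Python sketch builds a "urn:sha1:" identifier from raw
   data and splits a URN into its parts.  The function names sha1_urn
   and parse_urn are hypothetical and not part of this specification.
   The sketch assumes that the alphabet used by Python's
   base64.b32encode (A-Z, 2-7) is the intended Base32 digit-set; a
   20-byte hash encodes to exactly 32 characters, so no padding is
   produced.

```python
import base64
import hashlib
from typing import Tuple

BASE32_DIGITS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"


def sha1_urn(data: bytes) -> str:
    """Return a urn:sha1: identifier for the given bytes (illustrative)."""
    digest = hashlib.sha1(data).digest()            # 20 raw bytes
    nss = base64.b32encode(digest).decode("ascii")  # 32 chars, no '=' padding
    return "urn:sha1:" + nss


def parse_urn(urn: str) -> Tuple[str, str]:
    """Split 'urn:[NID]:[NSS]' into (nid, nss); raise ValueError otherwise."""
    parts = urn.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn" or not parts[1] or not parts[2]:
        raise ValueError("not a URN: " + repr(urn))
    nid, nss = parts[1].lower(), parts[2]
    if nid == "sha1":
        # Case is unimportant for sha1 identifiers, so normalize to upper case.
        nss = nss.upper()
        if len(nss) != 32 or any(c not in BASE32_DIGITS for c in nss):
            raise ValueError("malformed sha1 value: " + repr(urn))
    return nid, nss
```

   Other namespaces are passed through with their
   Namespace-Specific-String untouched, since their case rules may
   differ.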

   The Base32 encoding to be used is the one described as "Canonical"
   in the Simon Josefsson-edited Internet-Draft, "Base Encodings" [1].
   However, the encoded output should not include any stray intervening
   characters or end-padding.

   A nutshell description of how to calculate such Base32 encodings
   from binary data is:

   o  Take bits in groups of 5, most-significant-bits first.

   o  Append zeroes if necessary to pad the last group to 5 bits.

   o  Replace each group with the corresponding value from the
      following digit-set, which leaves out the digits [0,1], for 5-bit
      values 0 through 31:

         ABCDEFGHIJKLMNOPQRSTUVWXYZ 234567

   For example, taking the two bytes 0x0F 0xF3:

      00001111 11110011 -> 00001 11111 11001 1[0000] -> B7ZQ (Base32)

   Another related URN Namespace which will be mentioned is that of
   "urn:bitprint".  This namespace, also pending formal documentation
   and registration, features a 32-character SHA1 value, a connecting
   period, then a 39-character TigerTree value.  This creates an
   identifier which is likely to remain robust against intentional
   manipulation further into the future than SHA1 alone, and offers
   other benefits for subrange verification.

   Any "bitprint" identifier which begins with 32 characters terminated
   by a period can be converted to a "sha1" value by truncating its
   Namespace-Specific-String to the first 32 characters.  That is,

      urn:bitprint:[32-character-SHA1].[39-character-TigerTree]

   ...can become...

      urn:sha1:[32-character-SHA1]

   All servents compliant with this specification MUST be capable of
   calculating and reporting SHA1 values when appropriate.  Further,
   servents which choose not to calculate extended "urn:bitprint"
   values SHOULD down-convert such values and requests, whenever
   received, to SHA1 values and requests.

2.3 Gnutella Version

   HUGE is designed as an extension to the Gnutella Protocol version
   0.4, as documented by Clip2, revision 1.2.
   That document was available as a PDF on 2001-10-08 from the Clip2
   website:

      http://www.clip2.com/GnutellaProtocol04.pdf
      urn:sha1:PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB

3. General Extension Mechanism

   At its heart, HUGE requires that new, distinct information be
   included in Query messages and the QueryHit responses.  The general
   mechanism used is to insert additional strings "between the nulls"
   -- the paired NULL characters which appear in Gnutella messages at
   the end of Query search-strings and QueryHit results.

   However, numerous potential Gnutella extensions might all wish to
   use that same space, even at the same time.  Thus a facility is
   required to segment and distinguish independent extensions.

   Servents compliant with this proposal MUST interpret the space
   between NULs in Queries and QueryHit results as zero or more
   independent extension strings, separated by ASCII character 28 --
   FS, "file separator", 0x1C.  (This character will not appear in any
   human-readable strings, and is also expressly illegal in XML.)  As
   many extension strings as will fit inside a legal Gnutella message
   of declared payload-size are allowed.

   Any document specifying the format and behavior of certain extension
   strings MUST provide a clear rule for identifying which strings are
   covered by its specification, based on one or more unique prefixes.
   Servents MUST ignore any individual extension strings they do not
   understand.

   Any extension strings beginning with "urn:" (case-insensitive) MUST
   be interpreted as per this specification.  Future extensions SHOULD
   NOT introduce ambiguities as to the interpretation of any given
   extension string, and thus SHOULD NOT claim to cover any prefixes
   which are substrings or extensions of "urn:".  (For example, "u",
   "ur", "urn", "urn:blah", etc.)
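
   The segmentation rule above can be sketched in Python as follows.
   The helper names split_extensions and urn_extensions are
   hypothetical, and the input is assumed to be the bytes found between
   the two NULs, already extracted from the message.  Dropping empty
   strings between adjacent separators is a choice made for this
   sketch, not something this document requires.

```python
from typing import List

FS = b"\x1c"  # ASCII 28, "file separator"


def split_extensions(between_nuls: bytes) -> List[bytes]:
    """Split the space between the NULs into zero or more extension strings."""
    # Empty pieces (from an empty block or adjacent separators) are dropped.
    return [ext for ext in between_nuls.split(FS) if ext]


def urn_extensions(between_nuls: bytes) -> List[bytes]:
    """Keep only extension strings beginning with 'urn:' (case-insensitive).

    Strings with any other prefix are ignored, as a servent must do for
    extensions it does not understand.
    """
    return [ext for ext in split_extensions(between_nuls)
            if ext[:4].lower() == b"urn:"]
```

   For instance, urn_extensions(b"foo\x1cURN:sha1:ABC") keeps only the
   second string, leaving "foo" for whichever other extension claims
   that prefix.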

   So, a Query with two extension strings would fit the following
   general format:

      QUERY:
        STD-HEADER:
          [23 bytes]
        QUERY-SEARCH-STRING:
          traditional search string[0x00]
        EXTRA:
          extension1[0x1C]extension2[0x00]

   A QueryHit with two extension strings would look like:

      QUERYHIT:
        STD-HEADER:
          [23 bytes]
        QUERY-HIT-HEADER:
          [11 bytes]
        EACH-RESULT:
          INDEX:
            [4 bytes]
          LEN:
            [4 bytes]
          FILENAME:
            Filename[0x00]
          EXTRA:
            extension1[0x1C]extension2[0x00]
        SERVENT-IDENTIFIER:
          [16 bytes]

4. Query Extensions

   HUGE adds two new Query capabilities: the ability to request that
   URNs be included on returned search results, and the ability to
   Query-by-URN.

   To request that URNs be attached to search results, servents MUST
   include either the generic string "urn:" or namespace-specific URN
   prefixes, such as "urn:sha1:", as Query extension strings.  For
   example:

      QUERY:
        STD-HEADER:
          [23 bytes]
        QUERY-SEARCH-STRING:
          Gnutella Protocol[0x00]urn:[0x00]

   Servents MAY request multiple specific URN types, but use of the
   generic "urn:" is recommended.

   When answering a Query which includes such URN requests, a remote
   servent SHOULD include any URNs it can provide that meet the
   request.  In the generic "urn:" case, this means one or more URNs of
   the responder's choosing.  When specific namespaces like "urn:sha1:"
   are requested, those URNs should be provided if possible.  A servent
   MUST still return otherwise-valid hits, even if it cannot supply
   requested URNs.

   To search for a file with a specific URN, servents MUST include the
   whole URN as an extension string.  Servents may include multiple
   URNs as separate extension strings, and/or include a non-empty
   traditional search string.  Any Query for a specific URN is also an
   implicit request that the same sort of URN appear on all search
   results.
   For example:

      QUERY:
        STD-HEADER:
          [23 bytes]
        QUERY-SEARCH-STRING:
          [0x00]urn:sha1:PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB[0x00]

   When answering a Query message, a servent SHOULD return any file
   matching any of the included URNs, or matching the traditional
   search string, if present.

5. QueryHit Extensions

   When answering a Query that has requested URN-annotated results,
   place the URNs as an extension string or strings inside each
   individual result.  For example:

      QUERYHIT:
        STD-HEADER:
          [23 bytes]
        QUERY-HIT-HEADER:
          [11 bytes]
        EACH-RESULT:
          INDEX:
            [4 bytes]
          LEN:
            [4 bytes]
          FILENAME:
            GnutellaProtocol04.pdf[0x00]
          EXTRA:
            urn:sha1:PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB[0x00]
        SERVENT-IDENTIFIER:
          [16 bytes]

6. Download Extensions

6.1 URN-based Request-URI

   Servents which report URNs MUST support a new syntax for requesting
   files, based on their URN rather than their filename and local "file
   index".  This syntax is adopted from RFC2169.

   Traditional Gnutella GETs are of the form:

      GET /get/[file-index]/[file-name] HTTP/1.0

   Servents reporting URNs must also accept requests of the form:

      GET /uri-res/N2R?[URN] HTTP/1.0

   For example:

      GET /uri-res/N2R?urn:sha1:PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB HTTP/1.0

   (The PUSH/GIV facilities are unaffected by the HUGE extensions.)

6.2 Headers

   Two new headers, for inclusion on HTTP requests and responses, are
   defined to assist servents in ascertaining that certain files are
   exact duplicates of each other, and in finding alternate locations
   for identical files.

6.2.1 X-Gnutella-Content-URN

   When responding to any GET, servents compliant with this
   specification SHOULD use the "X-Gnutella-Content-URN" header
   whenever possible to report a reliable URN for the file they are
   providing.  The URN MUST be for the full file, even when responding
   to "Range" requests, and multiple "X-Gnutella-Content-URN" headers
   MAY be used to report multiple valid URNs for the same file.
   For example:

      X-Gnutella-Content-URN: urn:sha1:PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB

   When initiating a GET, servents MAY use the "X-Gnutella-Content-URN"
   header to indicate the URN of the content they are attempting to
   retrieve, regardless of the Request-URI used.  If the responder is
   certain that the given URN does not apply to the resource it would
   otherwise return, it may respond with a 404 Not Found error.

6.2.2 X-Gnutella-Alternate-Location

   This header MUST only be used in conjunction with
   "X-Gnutella-Content-URN", and indicates, either in requests or
   responses, other locations at which a file with the same URN may be
   found.

   The header's contents must include a full URL from which the file
   may be retrieved.  After this full URL, a date and time MAY be
   supplied, indicating when that location was last known to be valid
   (i.e. used for a successful fetch of any sort).  The date and time
   MUST be supplied in the RFC1123 format preferred by the HTTP/1.1
   specification (RFC2616, section 3.3).

   This header MAY be provided on "not found" and "busy" responses,
   when it is possible to suggest other locations more likely to yield
   success.

   For example:

      X-Gnutella-Content-URN: urn:sha1:PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB
      X-Gnutella-Alternate-Location:
       http://www.clip2.com/GnutellaProtocol04.pdf
      X-Gnutella-Alternate-Location:
       http://10.0.0.10:6346/get/2468/GnutellaProtocol04.pdf
      X-Gnutella-Alternate-Location:
       http://10.0.0.25:6346/uri-res/N2R?urn:sha1:PLSTHIPQGSSZTS5FJUPAKUZWUGYQYPFB
       Sun, 11 Nov 2001 08:49:37 GMT

   This indicates 3 known potential alternate sources for the same
   file, with only the third bearing a known-valid timestamp.

   Note that even places which already have a file may learn of new
   alternate locations on inbound requests.

7. Implementation Recommendations

   While full compliance with this document is recommended,
   functionality can be adopted in stages, without adversely affecting
   other servents.
   In particular, the facilities of this document can be addressed
   according to the following logical ordering:

   1.  Accept extension strings, gracefully ignoring unknown
       extensions, passing along even traditionally "empty" Query
       messages if they have extensions.  With these steps, HUGE
       traffic will not cause any degradation in normal behavior.

   2.  Report URNs and accept URN GETs, and use the Content-URN header.
       After these steps, remote servents can begin to improve their
       downloading features, even before making any changes to search
       features.

   3.  Request URNs on generated Query messages, so that local
       downloading behavior can be improved.

   4.  Remember -- and share -- alternate-locations via headers.  At
       this stage, even normal downloading activity helps build
       redundant source-meshes.

   5.  Generate exact URN Queries for local needs -- for example, safe
       resuming -- or in reaction to user choices -- such as clicks
       inside file-listings or web-pages.  After this step, servents
       will be able to safely resume downloads, even days after they
       began, or give users the ability to request exact files.

8. Acknowledgements

   Thanks go to Robert Kaye, Mike Linksvayer, Oscar Boykin, Justin
   Chapweske, Tony Kimball, Greg Bildson, Lucas Gonze and all
   discussion participants in the Gnutella Developer Forum for their
   contributions, ideas, and comments which helped shape and improve
   this proposal.

References

   [1]  Josefsson, S., "Base Encodings",
        draft-josefsson-base-encoding-03 (work in progress), November
        2001.

Author's Address

   Gordon Mohr
   Bitzi, Inc.

   EMail: gojomo@bitzi.com
   URI:   http://bitzi.com/