Gnutella Web Caching System
Version 2 Specifications Client Developers' Guide
Copyright (c) 2003 Hauke Dämpfling, version 1.9.4 / 18.6.2003, http://www.gnucleus.com/gwebcache/newgwc.html
This document serves as a guide for client developers and covers how to use the
"new" GWebCache system (according to the "version 2 specifications", also
referred to as GWC2). This document should be considered "beta". Clients
and caches using these specs have not been thoroughly tested.
GWebCache, even though it is designed for simplicity, will only work if
several key functionalities are implemented by developers. Therefore,
developers, read this document carefully.
To understand why this is so important: because some clients had errors in
their code, people who ran GWebCaches had (and may still have) much grief.
These clients relentlessly hammered away at the servers, in some cases
even continuing to hammer the servers' IPs after the virtual web servers were shut
down. Such utter lack of responsibility in coding put many users in a situation
that they could not escape from, and such a situation must not be repeated.
Therefore, I hope that you understand why it is critical that you read and
understand this entire document. And, when you get ready to release your
shiny new client with GWebCache v2 functionality, you will thoroughly test
the interaction with a web cache before making any releases.
A bunch of Thank Yous for support of the GWebCache project with many
ideas and code: John Marshall, Robert Rainwater, Guo Xu, Tor Klingberg,
Christopher Rohrs, Mike Green, Nick Randall, ...
If you have any questions, comments, suggestions, (constructive)
criticisms, etc., please post them in the Forum right away.
Overview
A GWebCache is a script on a web server; clients talk to it using normal HTTP. It stores
IP addresses of Gnutella nodes and the URLs of other caches. Clients
(ultrapeers) make regular updates to GWCs to keep the information fresh.
Summary of Important Things to Remember
Each of these points is described in more detail below.
- Your client must use GWebCache only if it has no other way to discover
hosts. First, use your Pong cache and such.
- Your client may send updates only if it meets certain
criteria. For example, it must accept incoming connections as an
ultrapeer. More details below.
- In any case, your client must not send more than one request per
hour. Your client will be rejected anyway, and you don't want to be
rejected.
- If your client fails to contact a cache, it must not send requests to that
cache again. If a cache is down, it's down!
- Remember that GWebCaches are run by volunteers in their own webspace.
Do not abuse the privilege to be able to access GWebCaches, as they have
limited CPU and Bandwidth resources. Don't DDoS your users and service
providers.
Step 1: How to store GWC data in your client
- Keep an array of GWebCache URLs, and for each URL, store a flag as to
whether or not your client has successfully contacted this cache. The client
should forget this flag when it exits and stores the URLs to (for example)
a text file, but it must keep the flag in memory while running.
- Do not hardcode any cache URLs. Include a default list of GWCs with
your client, but do not hardcode the URLs.
- You must remove any caches from your list that do not respond
correctly. More on this later.
- Hosts will be returned in the standard numerical IP:port format (e.g.
123.45.67.8:123).
- URLs always begin with http://
- Before your client accepts new URLs into its internal list, it must
make the following changes:
- If the URL contains any %XX sequences where XX is a pair of hex digits (0-9,
a-f, A-F), replace them by the ASCII character with that hex value (e.g. %7E
is ASCII character 0x7E, decimal 126, char "~").
- If the URL ends in "index.EXT" where EXT is any of the following: "php",
"cgi", "asp", "cfm", "jsp" (this list is not complete), then trim this
filename. (For example http://zero-g.net/gcache/index.cgi
becomes http://zero-g.net/gcache/ )
- Trim any trailing slashes ( / ). (For example http://zero-g.net/gcache/
becomes http://zero-g.net/gcache )
- This check is encouraged: perform a DNS lookup of the web server you are
adding to your list and compare that IP address to those of the servers
already in the list. Do not replace the webserver's hostname with its IP
address! This would screw up virtual servers very badly. This check is
meant to avoid ambiguities between hostnames that have the same IP address.
For example, both "zero-g.net" and "www.zero-g.net" are working hostnames
for the same site, but this should not cause duplicate entries in your list
of cache URLs.
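The URL clean-up rules above can be sketched in Python roughly as follows. This is only an illustration: the function name normalize_cache_url and the regular expressions are assumptions, the extension list is the (incomplete) one given above, and the optional DNS duplicate check is not covered.

```python
import re

# Extensions taken from the list above; the guide says this list is not complete.
_INDEX_RE = re.compile(r"/index\.(php|cgi|asp|cfm|jsp)$", re.IGNORECASE)
_PERCENT_RE = re.compile(r"%([0-9A-Fa-f]{2})")


def _decode(match):
    """Replace %XX by its ASCII character; leave non-ASCII values untouched."""
    value = int(match.group(1), 16)
    return chr(value) if value < 0x80 else match.group(0)


def normalize_cache_url(url):
    """Return the cleaned-up cache URL, or None if it does not start with http://."""
    if not url.lower().startswith("http://"):
        return None
    url = _PERCENT_RE.sub(_decode, url)   # 1. undo %XX escapes
    url = _INDEX_RE.sub("/", url)         # 2. trim a trailing index.EXT filename
    return url.rstrip("/")                # 3. trim trailing slashes
```

Running every incoming URL through one such function before comparing it with the internal list means that http://zero-g.net/gcache/index.cgi and http://zero-g.net/gcache/ end up as the same entry.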
Step 2: How to interact with GWebCaches
- Your client must not exclusively rely on GWebCache. Your client
must use its internal host cache (information gathered from Pongs) and X-Try
headers with priority above GWebCache.
- Use a standard HTTP library. GWebCaches are regular scripts on
regular web servers and therefore rely on your client understanding regular,
full HTTP. (For example, 3xx responses mean "redirect" and 4xx-5xx means
"error".) Make sure that your HTTP libraries provide a mechanism
for identifying HTTP error codes.
- Do not use proxies. If the HTTP library you use uses proxies, they
should be disabled. (Scripts need to see the client's IP.)
- This should not be an issue if you use standard HTTP libraries, but since
it's happened before: make sure your libraries speak HTTP/1.1 and support
virtual hosts.
- When you contact a GWebCache, you can get four different kinds of
responses, listed here. If you get anything that is not a normal
GWebCache response, delete that cache's URL from your internal
list.
- Normal GWebCache responses (described below)
- GWebCache error (response begins with string "ERROR")
- Invalid response (not parseable)
- HTTP error (HTTP codes 400 to 599)
- In all cases except the first, your client must forget about that
cache and must not retry. Note that in cases 2 and 3,
the HTTP response code will still be 2xx ("OK"), but these responses still
mean that the cache has had an error. In other words, the request succeeded
only if your client can successfully parse the response.
- Note that, as defined below, a GWebCache will now always output at
least one line. This differs from the original GWebCache specifications,
which said that a GWebCache may return an empty string. Now, returning an
empty file is invalid (note that a file containing nothing but one or more
CRLF/CR/LFs still counts as an "empty file").
- When contacting a web cache, pick a random cache from your internal
list of caches.
- There is absolutely no reason to send more than one request per
hour. Updates can be combined with Gets and Pings. Ideally, your
client will make one request at startup only if necessary (more
on this below), and then only one update an hour if it meets the
criteria (more on this below too).
- Make sure your client can handle different end-of-line formats. Clients
and servers may be on different platforms so there is no guarantee as to
whether you will get CR, LF, or CRLF. As an example, here is some simple logic
for converting everything to LFs: If the returned file contains any LFs, then
remove all CRs, else replace all CRs by LFs.
- Your client must supply version information to a GWebCache. This is done
via the "client" parameter. Version information is a 4-character string of
uppercase letters (your client's ID) plus a maximum of 16 characters for the
version number. (Examples: "GNUC1.8.4.0", "LIME2.7.9 Pro")
- IP Addresses must not begin with leading zeros, i.e. not
001.002.003.012 (this is dumb, and nobody does this anyway, but I just wanted
to be clear).
- Your client will send requests via HTTP GET. This means that your request
will be:
[the cache's URL] + "?" + any number of the following:
[parameter name] + "=" + [escape-encoded parameter value] + "&" +
[next parameter name] + "=" + [escape-encoded value] etc.
The order of the parameters does not matter.
Each parameter should appear only once.
- "Escape Encoding" (RFC1738) means replacing
all characters that are not letters, numbers, dashes "-", underscores
"_", or periods "." with the following:
"%" + [2-character ASCII code of character in Hex]
To make this replacement:
Step 1: replace all "%" by their representation: "%25"
Step 2: replace all non-alphanumeric characters except "%", "-", "_" and "."
by a percent (%) sign followed by two hex digits.
Example:
"http://www.zero-g.net/gcache/gcache.php" becomes
"http%3a%2f%2fwww.zero-g.net%2fgcache%2fgcache.php"
- Example requests:
http://www.server.com/path/to/gcache.cgi?client=TEST1.0&get=1
http://www.server.com/path/to/gcache.cgi?client=TEST1.0&update=1&ip=194.64.64.1%3A123&url=http%3a%2f%2fwww.otherserver.net%2fwebcache.cgi
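As a rough illustration of the request format and escape encoding described above, here is a minimal Python sketch. The names escape_encode and build_request_url are made up for this example, and encoding non-ASCII characters via their UTF-8 bytes is an assumption that the guide does not address. A single pass over the string gives the same result as the two-step procedure, because "%" is itself escaped to "%25" and nothing is escaped twice.

```python
import string

# Characters the guide allows to pass through unescaped.
_SAFE = set(string.ascii_letters + string.digits + "-_.")


def escape_encode(value):
    """Escape everything except letters, digits, '-', '_' and '.' as %xx."""
    out = []
    for ch in value:
        if ch in _SAFE:
            out.append(ch)
        else:
            # Assumption: non-ASCII characters are escaped byte by byte (UTF-8).
            out.extend("%{:02x}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


def build_request_url(cache_url, params):
    """Build [cache URL] + '?' + name=value pairs joined by '&'.

    params is a dict, so each parameter name can appear only once.
    """
    query = "&".join(
        name + "=" + escape_encode(str(value)) for name, value in params.items()
    )
    return cache_url + "?" + query
```

For example, build_request_url("http://www.server.com/path/to/gcache.cgi", {"client": "TEST1.0", "get": "1"}) yields the first example request shown above.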
Step 3: GWebCache output format
- Output of a GWebCache is in line-by-line format, according to the
following pattern:
x|field1|field2|field3|...
- "x" can be either "I" = Informational, "U" = URL, "H" = Host. So far, the
following responses have been defined:
I - Informational Response
  - field 1: pong
    field 2: (version string)
    Included in response to a Ping request; returns the GWebCache version.
  - field 1: update
    field 2: OK
    Returned when the update completed successfully (but possibly there were
    warnings!).
    field 2: WARNING
    field 3: "You came back too early", "Rejected IP" or "Rejected URL"
    (others may be added as needed)
    A WARNING response to an update generally means that your client did
    something wrong. Note that warnings can appear in addition to an OK
    response.
  - field 1: nothing
    Returned when there is no other output, so your client doesn't get bored.
    (Actually, this is because GWC must always output at least one line.)
U - URLs
  - field 1: URL
    The URL of the alternate cache, beginning with http://
  - field 2: age
    The time since submission of this URL to the cache in seconds
H - Hosts
  - field 1: Host:Port
    The Host:Port of a host
  - field 2: age
    The time since submission of this host to the cache in seconds
- Your client should of course be prepared to expect any other responses, as
long as they are in the above format: they begin with a character (a-z, A-Z,
0-9), then a pipe (|), then any number of characters and pipes. Also make sure
your client can handle extensions to the above formats (for example, expect to
have more information following an "I|pong|(version)" response,
i.e. something like "I|pong|(version)|something|else" etc.). In other
words, your parser should be very general.
- A GWebCache may also provide an extra HTTP header for your client,
"X-Remote-IP". This header is analogous to the "Remote-IP" header provided in
the Gnutella handshaking protocol, with the difference that it cannot be
trusted as much. Trust the Remote-IP header from Gnutella connections
instead. X-Remote-IP is what the web server thinks your IP address is, and
this could be wrong due to transparent proxies and the like.
- Example responses:
- Short response to a simple Get:
H|127.0.0.2:321|400
H|127.0.0.1:123|4456
U|http://www.server2.com/gcache/gcache.cgi|400
U|http://www.server.net/gcache/gcache.cgi|4456
- Response to an update combined with a ping:
I|pong|GWebCache 0.9.0b
I|update|WARNING|You came back too early
- Some responses that are currently not given but that are valid and that your
parser should still handle:
I|whatever
I|blah||bar
H|192.168.0.1:123|321||foo
U|http://gcache.com|321|xyz|
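A very general parser along these lines might look like the following Python sketch. It is an illustration only: the function names and the "ok" / "error" / "invalid" status strings are assumptions, and the end-of-line handling follows the simple CR/LF logic suggested in Step 2.

```python
import re

# One alphanumeric type character, a pipe, then anything (including more pipes).
_LINE_RE = re.compile(r"^[A-Za-z0-9]\|")


def normalize_newlines(text):
    """Convert CR, LF or CRLF line endings to LF."""
    if "\n" in text:
        return text.replace("\r", "")
    return text.replace("\r", "\n")


def parse_response(body):
    """Return (status, records); records is a list of field lists.

    status is "ok", "error" (body begins with ERROR) or "invalid".
    Anything other than "ok" means the cache should be dropped from the list.
    """
    text = normalize_newlines(body).strip()
    if not text:
        return "invalid", []          # empty file, or only line terminators
    if text.startswith("ERROR"):
        return "error", []
    records = []
    for line in text.split("\n"):
        if not line.strip():
            continue                  # tolerate blank lines between records
        if not _LINE_RE.match(line):
            return "invalid", []
        records.append(line.split("|"))
    return "ok", records
```

Splitting on every pipe and keeping all fields, including empty ones, lets callers pick out the fields they know and ignore extensions such as extra fields after the age.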
Step 4: How to make updates to a cache
- To make an update, your client must meet the following
criteria. Note that these are the same as the standard Ultrapeer
criteria:
- Your client must have been online (running & connected) for at
least an hour.
- Your client must accept incoming connections. (This is usually
tested by keeping track of whether or not your client has received any
incoming connections.)
- In other words, leaf nodes must not send updates.
- Your client must support the Remote-IP Gnutella header. This
header is essential for a client so that it can find its own IP address
(for example, if your client is behind a firewall or NAT router). If your
client does not yet support this header, you should start supporting it now.
Ask on the GDF if you have any questions regarding implementation.
- If your client meets these criteria, your client should send updates
once an hour. This is limited by the GWebCache and any updates sent too
early will be rejected. Again, there is absolutely no reason to send more
than one request per hour to a GWC.
- Updates are sent through the following parameters:
update=1
ip=[your client's numerical IP]:[your client's port for incoming connections]
url=[the url of a web cache that your client has successfully contacted]
- Notes
- The IP address you send must be your client's IP address. This
IP address will be checked against the one that the server sees. If
your client is behind a transparent HTTP proxy, there is not much you can do
about it; your updates will most likely fail. However, if your IP address is
rejected ("I|update|WARNING|Rejected IP") on more than one
cache, then your client should consider not sending any updates.
- The URL you send must be one that your client has successfully
contacted. This is why I said above, keep track of which caches your
client has successfully contacted.
For example, Gnucleus keeps GWebCaches
flagged with either "Alive" or "Untested". Any web cache that is added to
the internal list is initially flagged as "Untested". When making Get
requests, Gnucleus uses a cache flagged as "Untested". If the cache is
successfully contacted, the URL is flagged as "Alive". When making updates,
Gnucleus sends the update to an "Untested" cache, and sends an "Alive" cache
in the url parameter.
- Don't forget that the parameter values must be
URL-escape-encoded. (See the above explanation.)
- Examples:
- To send an update to the cache running at
"http://www.server.com/path/to/gcache.cgi" with your IP/port
194.64.64.1:123 and sending the URL
"http://www.otherserver.net/webcache.cgi":
http://www.server.com/path/to/gcache.cgi?client=TEST1.0&update=1&ip=194.64.64.1%3A123&url=http%3a%2f%2fwww.otherserver.net%2fwebcache.cgi
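The update criteria and the Gnucleus-style "Alive" / "Untested" bookkeeping described above could be sketched in Python like this. All names here (may_send_update, choose_update_caches, the lowercase flag strings) are hypothetical, and returning None when no suitable pair of caches exists is an assumption rather than something the guide prescribes.

```python
import random

HOUR = 3600  # seconds


def may_send_update(online_seconds, accepts_incoming, seconds_since_last_request):
    """True only if the client meets the update criteria listed in this step.

    seconds_since_last_request is None if no request has been sent yet.
    """
    if online_seconds < HOUR or not accepts_incoming:
        return False
    return seconds_since_last_request is None or seconds_since_last_request >= HOUR


def choose_update_caches(cache_flags, rng=random):
    """Pick (target cache, cache to advertise in url=) or None.

    cache_flags maps cache URL -> "alive" or "untested".
    The update goes to an untested cache if there is one; the advertised
    URL must be a different cache that was contacted successfully.
    """
    alive = [u for u, flag in cache_flags.items() if flag == "alive"]
    untested = [u for u, flag in cache_flags.items() if flag == "untested"]
    if not alive:
        return None
    target = rng.choice(untested or alive)
    advertised = [u for u in alive if u != target]
    if not advertised:
        return None
    return target, rng.choice(advertised)


def build_update_params(client_id, ip, port, advertised_url):
    """Unescaped parameter values; they still need escape encoding (Step 2)."""
    return {
        "client": client_id,
        "update": "1",
        "ip": "{}:{}".format(ip, port),
        "url": advertised_url,
    }
```

Keeping the eligibility check separate from the cache selection makes it easy to test the "one request per hour" rule on its own.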
Step 5: How to request information from a GWebCache
- When your client needs IP addresses to connect to, first try your
internal host cache (information gathered from Pongs and X-Try headers).
On startup, your client should try about 20 IPs from its internal cache, and
only then should it contact a GWebCache.
- Requesting information is simple: send the following parameter:
get=1
- If the GWebCache has hosts and/or URLs stored, it will return them
according to the format defined above.
- Examples:
http://www.server.com/path/to/gcache.cgi?client=TEST1.0&get=1
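The "internal host cache first, GWebCache last" order can be sketched as below. This is a simplified Python illustration: find_first_host, try_connect and query_gwebcache are hypothetical names, and the two callables stand in for whatever connection and HTTP code your client already has.

```python
def find_first_host(local_hosts, try_connect, query_gwebcache, max_local_attempts=20):
    """Return the first host that accepts a connection, or None.

    local_hosts      -- hosts from the internal cache (Pongs, X-Try headers)
    try_connect      -- callable(host) -> bool
    query_gwebcache  -- callable() -> list of hosts; only called as a last resort
    """
    # Try about 20 hosts from the internal cache before touching any GWebCache.
    for host in local_hosts[:max_local_attempts]:
        if try_connect(host):
            return host
    # Only now fall back to a single GWebCache request.
    for host in query_gwebcache():
        if try_connect(host):
            return host
    return None
```

Because query_gwebcache is passed in as a callable, the web cache is never contacted unless every local attempt has failed.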
Extras: Using the "Network" Parameter
- GWebCache now supports storing more than one list of Hosts/URLs. A cache
owner may enable his/her cache to store more than just the default Gnutella
hosts. Your client should simply send the extra parameter:
"net=[name of network]". When you contact a cache, there are two
situations:
- The cache supports the network you are asking for. Interaction with the
GWC will be unchanged.
- The cache does not support the network you are asking for. The following
things will happen:
- The cache will send the extra response "I|net-not-supported"
- When sending Updates: The cache will assume that the URL you are
submitting supports the network that you are asking for (!). The URL will be
stored internally along with the network name. Any other clients that ask
for this network will be given this URL as a kind of "redirect" or "try
other".
- When sending Gets: If the cache knows about a URL that supports this
network then it will return that URL. Think of this as a "redirect".
- Examples:
http://www.server.com/path/to/gcache.cgi?client=TEST1.0&net=shareaza&get=1
http://www.server.com/path/to/gcache.cgi?client=TEST1.0&net=shareaza&update=1&ip=194.64.64.1%3A123&url=http%3a%2f%2fwww.otherserver.net%2fwebcache.cgi
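One way a client might react to the two situations above is sketched here in Python. The function name interpret_net_response and the returned dictionary layout are assumptions; the input is taken to be a list of already-split response lines (one list of fields per line). Ignoring any H lines when the network is not supported is also an assumption, since the guide only describes URLs being returned in that case.

```python
def interpret_net_response(records):
    """Summarise a response to a request that carried a net= parameter.

    records -- list of field lists, e.g. [["H", "1.2.3.4:6346", "30"], ...]
    """
    supported = True
    hosts, urls = [], []
    for fields in records:
        if not fields:
            continue
        kind = fields[0]
        if kind == "I" and len(fields) > 1 and fields[1] == "net-not-supported":
            supported = False
        elif kind == "H" and len(fields) > 1:
            hosts.append(fields[1])
        elif kind == "U" and len(fields) > 1:
            urls.append(fields[1])
    if not supported:
        # Treat returned URLs as "try these caches instead" redirects.
        hosts = []
    return {"supported": supported, "hosts": hosts, "try_caches": urls}
```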
Extras: Using the Timestamp information
- This feature is experimental. We will keep the timestamp
information but might add more information as we see necessary.
- As you may have noticed, GWC returns the "age" (time since submission) of
all URLs and IPs it stores. This information is provided as a kind of
"freshness" information.
- What your client can do with this information:
- If you notice that the information in the cache is "very fresh" then
your client can consider not sending an update for a while. For example: if
you notice that a cache has information that was submitted less than a
minute ago, you can wait two hours instead of one until you send an update.
- Be very careful with this: If you notice that the information in
the cache is very old, then your client can consider sending an update a
little earlier. For example: if you notice a cache hasn't gotten an update
for more than an hour, you can send an update right away. Remember, this
is very dangerous - your client should still not send more than one request
an hour.
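A conservative use of the age information might look like the following Python sketch. The function name and the 60-second "very fresh" threshold are taken from the example above, not from any fixed rule. The sketch deliberately implements only the safe half of the idea: it can lengthen the interval to two hours, but never shortens it below one hour since the last request.

```python
HOUR = 3600  # seconds


def seconds_until_next_update(ages, seconds_since_last_request):
    """Suggest how long to wait before sending the next update to a cache.

    ages -- the age fields (in seconds) from the H and U lines of the
            last response received from that cache
    """
    interval = HOUR
    if ages and min(ages) < 60:
        # The cache was updated less than a minute ago: back off to two hours.
        interval = 2 * HOUR
    return max(0, interval - seconds_since_last_request)
```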
Extras: Clustering Information
- The GWC2 beta supports the new "cluster=[keywords]" parameter. This
functionality is currently provided for testing only, so consider it "alpha".
- On update requests, if you include the extra parameter
"cluster=keyword1,keyword2,...", these keywords will be stored
along with the host you submit.
- The following limitations are placed on the keyword string: it may only
contain the characters [A-Za-z0-9.-_:], and it may not be longer than 256
characters (yes, the entire keyword string).
- Characters that aren't allowed are stripped and any keywords beyond the
256 characters are dropped.
- On get requests, the keywords are returned in the field after the
age parameter, like so:
H|127.0.0.2:321|400|keyword1,keyword2,...
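A client can apply the same limits before sending, so that what the cache stores matches what the client intended. The Python sketch below is an illustration under stated assumptions: build_cluster_value is a made-up name, the allowed set is read as the literal characters letters, digits, ".", "-", "_" and ":" with "," as the separator, and keywords that would push the string past 256 characters are dropped whole.

```python
import re

# Everything outside letters, digits, '.', '-', '_' and ':' is stripped.
_DISALLOWED = re.compile(r"[^A-Za-z0-9._:\-]")


def build_cluster_value(keywords, max_len=256):
    """Return a comma-separated keyword string that fits the cache's limits."""
    kept = []
    length = 0
    for keyword in keywords:
        clean = _DISALLOWED.sub("", keyword)
        if not clean:
            continue
        extra = len(clean) + (1 if kept else 0)  # +1 for the joining comma
        if length + extra > max_len:
            break  # this keyword and all later ones are dropped
        kept.append(clean)
        length += extra
    return ",".join(kept)
```

The resulting string still has to be escape-encoded like any other parameter value, since "," and ":" are not in the unescaped set.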
v1.9.4
- Changed "alpha" to "beta" status
- Added clustering information
- Smaller corrections and updates
v1.9.3.4
- Replaced "Important Traffic Issues" by "Summary of Important Things to Remember"
v1.9.3.3
- Added Timestamp information
v1.9.3.2
- Added Traffic section
v1.9.3.1
- Clarified Remote-IP/X-Remote-IP issues
v1.9.3
- First release of "Developers' Guide"
See also: http://www.gnucleus.com/
Copyright (c) 2003 Hauke Dämpfling. License Terms: FDL.