The Digital Agency for International Development

Offline Websites and Low Bandwidth Simulator in Go

By Chris Wilson on 16 February 2011

Jon Thompson writes about Jeff Allen's interesting new work on tools for coping with low bandwidth:

Jeff continues to try to solve the low-bandwidth/high-latency problems that aid workers face in the field every day and that we encountered in Indonesia. We all know the joy of VSAT networks that slow to a crawl because either some folks on the team are downloading stuff they shouldn’t be downloading or all the computers are infected with bandwidth-sucking viruses. It appears Jeff has moved one step closer to sorting out some of the problems surrounding bandwidth optimization by utilizing the Go programming language.

Rather than try to explain to you what Jeff has done, I’ll let you read ‘A rate-limiting HTTP proxy in Go’ and ‘How to control your HTTP transactions in Go’ and sort out what he is talking about. Hopefully, this post will bait Jeff into leaving a lengthy comment that explains exactly what the hell he is up to.

My understanding is that Jeff is developing two useful tools:

  • a low-bandwidth simulator: a rate-limiting HTTP proxy that lets developers test how their sites behave over the kind of slow, high-latency links common in the field; and
  • a caching proxy that can serve previously downloaded websites offline.

People have been trying to make offlineable websites for a long time. Some of the best examples so far use entirely client-side (in-browser) technology, such as the Logistics Operational Guide, developed by the World Food Programme for the Logistics Cluster, which can run entirely offline using Google Gears.

Gears had a lot of potential for developers creating offlineable websites, but Google has abandoned its development in favour of the open standard HTML5, which is not ready yet. So there is currently no obvious, future-proof way to develop offlineable websites. Jeff's proxy, combined with a spidering system, could be one way to download an entire site, even one that its developers never designed to be downloaded.

Another important opportunity comes from content management systems (CMSs) such as WordPress, Drupal and Joomla. More and more websites are built with such systems rather than coded from scratch. A CMS knows all of the pages on the site, and the links between them, and could easily build an offlineable version of the site for download into Gears, HTML5 or Jeff's proxy. A single plugin could potentially make thousands of sites offlineable, especially if it were included in the CMS distribution and enabled by default.

A few wikis, such as MediaWiki, MoinMoin, DokuWiki and JSPWiki, have a programming interface (XML-RPC or WebDAV) that allows a smart client to download pages in their original text format, which could make them more efficient to store offline and potentially even editable offline. Jeff's proxy could be extended to support sites built on such wikis automatically. There are still some limitations to this approach:

  • The pages would not look the same as the online versions, since the styling wouldn't be downloaded and the effects of CMS plugins would not be visible;
  • It would probably still be quite slow to download an entire site this way, by spidering, without server-side support for downloading multiple pages at once;
  • Few websites are built on wikis, so the potential reach is limited compared to better support for WordPress, Drupal or Joomla.

Anyway, I wish I knew Go, and had time to hack on Jeff's proxy tools.

Feb. 16, 2011, 3:08 p.m. - Jeff Allen

Thanks for reading my post. My ideas run along the same lines as yours: content publishers who want their sites usable in an offline context should be able to publish one or more packages of content that my proxy could pick up, either in an automated way or by an admin loading them from offline media. I'm envisioning something like robots.txt, that says, "if you want to download the following zip files and use them to seed a proxy server, we don't mind." And of course, nothing stops a third party from spidering a web site and making such a package, except that the origin of the content holds copyright on it. Third-party offlining would work better for Creative Commons licensed content.

Along the same lines as the offline content idea, there's Google's Sitemaps, which is a way for a content provider to explain to the outside world which of its URLs are interesting to download. But given that the point of that is search engine discovery, it's unlikely the current sitemaps on the net would be very useful. The syntax could be worth reusing, though.

Going as far as imagining a plugin that ships with CMSs had not occurred to me, but it makes a lot of sense. Bravo!

If I went farther with this, I would specify the format of the offline package to be not just a ZIP full of files, but a "proxy pre-load". That means that for every resource in the package, there would be the same things a proxy that had fetched the resource over the public network would have: expiration times and origin server headers (in particular Content-Type). The goal is to give origin servers a scalable way to get their content onto proxy servers, not to give the proxy carte blanche to serve up the same stale content until the end of time.

These are casual thoughts, not backed up by code. I was just using this interesting problem to guide my explorations into Go, and there's no guarantee that my Go exploration time will lead to anything usable.
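To make the pre-load idea concrete, one throwaway sketch (not a real format, just an illustration) is a ZIP where each resource body has a sidecar `<path>.headers` entry carrying the origin server's headers, built and read with the standard library's archive/zip:

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// resource is one entry in a hypothetical "proxy pre-load" package:
// the body plus the headers the origin server would have sent,
// including expiration times and Content-Type.
type resource struct {
	headers string
	body    string
}

// writePreload packs resources into a ZIP, storing each resource's
// headers in a sidecar "<path>.headers" entry next to its body.
func writePreload(res map[string]resource) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for path, r := range res {
		for name, data := range map[string]string{path: r.body, path + ".headers": r.headers} {
			w, err := zw.Create(name)
			if err != nil {
				return nil, err
			}
			if _, err := w.Write([]byte(data)); err != nil {
				return nil, err
			}
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readPreload loads a package back into memory, pairing each body with
// its headers: the state a caching proxy would hold if it had fetched
// the resources over the network itself.
func readPreload(data []byte) (map[string]resource, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	out := map[string]resource{}
	for _, f := range zr.File {
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		content, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return nil, err
		}
		if strings.HasSuffix(f.Name, ".headers") {
			name := strings.TrimSuffix(f.Name, ".headers")
			r := out[name]
			r.headers = string(content)
			out[name] = r
		} else {
			r := out[f.Name]
			r.body = string(content)
			out[f.Name] = r
		}
	}
	return out, nil
}

func main() {
	pkg, _ := writePreload(map[string]resource{
		"index.html": {"Content-Type: text/html", "<h1>hello</h1>"},
	})
	res, _ := readPreload(pkg)
	fmt.Println(res["index.html"].headers) // prints: Content-Type: text/html
}
```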
Another problem with Go is that it is production quality only on the x86 and x86_64 architectures. It works on ARM, but it's not ready for prime time. The most interesting way to put a proxy in the field is to use low-power network hardware reflashed with a new OS into which you add your proxy. I've worked with OpenWrt with a view to making something like this, but it used to be that most devices like this had MIPS processors, making a proxy in Go useless on them. However, I just noticed that the SheevaPlug is based on ARM, so that's an interesting development. It seems like low-power embedded Linux is moving towards ARM, as a result of Android phones. -jeff