Author: Werner 'Menneisyys' Ruotsalainen, member of the Pocket PC magazine
Board of Experts 2005, tech writer, PPCMag forum moderator, frequent
contributor to, say, the PPCMag/FirstLoox/PPCT/Brighthand/PDAMania.hu etc.
forums
Last edited: 14-Jun-2005 16:55 CET
There are two problems with most web sites. First, they return pages that
are dozens or sometimes even hundreds of kilobytes long, full of
unnecessary stuff (whitespace, scripts and formatting that are useless for PDA
users) and highly compressible. Second, the pages they return are
pretty hard for a PDA browser to display, especially for pre-WM2003SE Pocket
Internet Explorer (PIE) browsers without an additional PIE plug-in that does
something similar to the "One Column" view mode in the WM2003SE
version of PIE.
Therefore, if you still use a pre-WM2003SE device
without any of these PIE plug-ins, an on-web or PDA-based content ripper
and/or compressor service, combined with heavy-weight compression
techniques, may be advantageous.
There are quite a few of these services, with radically different
capabilities.
As has already been mentioned, the majority of web
pages contain a lot of unnecessary whitespace. Web browsers simply ignore it
- for example, if you insert ten space characters one after another, the Web
browser will just display one instead of ten. (This is why &nbsp; has been
introduced, BTW.) Web browsers also ignore
CR/LF characters unless they are in <PRE> blocks - that is, these can also
be safely removed to save bandwidth.
Also, there are a lot of additional constructs that
aren't necessarily needed in a "simple", "dumb" browser
like a Pocket PC-based one. For example, objects, plug-ins and, in a lot of
cases, stylesheets are not needed and can safely be omitted. Omitting them
also helps a lot in reducing bandwidth usage.
Please note that not all HTML files can be content
stripped. For example, HTML files that don't contain special tags or excess
whitespace can't be "compressed" this way. You'll even see an example
of these HTML pages later, in the "How
did I test?" section.
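The whitespace rule described above can be illustrated with a minimal Python sketch. It is only an illustration of the idea, not the algorithm any of the tested services uses; the function name and the sample page are made up for this example. It collapses every whitespace run to a single space (rather than deleting it outright, so that words separated only by a line break don't get glued together) and leaves <PRE> blocks untouched.

```python
import re

# Matches <pre>...</pre> blocks, whose whitespace must be preserved.
PRE_BLOCK = re.compile(r"(<pre\b.*?</pre>)", re.IGNORECASE | re.DOTALL)


def strip_whitespace(html: str) -> str:
    """Collapse whitespace runs to one space, except inside <pre> blocks."""
    parts = PRE_BLOCK.split(html)
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            out.append(part)  # odd indexes hold the captured <pre> blocks
        else:
            out.append(re.sub(r"\s+", " ", part))
    return "".join(out)


page = "<p>Hello,     world!</p>\r\n\r\n<pre>a\n  b</pre>\r\n<p>bye</p>"
stripped = strip_whitespace(page)
print(len(page), "->", len(stripped), "characters")
```

A page that has no excess whitespace to begin with comes out of this function unchanged, which is exactly why such pages can't be "compressed" this way.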
You can not only strip unnecessary textual information
(tags, whitespace) from an HTML file before sending it to the client, but also
compress it with a compression technology that is compatible with the client
browser. The most common one is gzip. Note that HTTP, the protocol that
makes textual content compression possible, allows for other compression
techniques too (for example, compress), but they are much
more rarely used in lightweight clients like PDA-based browsers.
Gzip, especially the latest, algorithmically most
advanced versions, can achieve much better compression ratios than the
content stripping/whitespace eliminating algorithm described above. For
example, as you'll see in the benchmark section, an ordinary test page that is
originally 582 kbytes can be "compressed" to around 200 kbytes
with content/whitespace stripping; with gzip compression, however, it shrinks
to 20 kbytes, which is an order of
magnitude better than simple content/whitespace stripping.
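To get a feel for the difference between the two approaches, here is a small Python sketch that applies both to a synthetic page. The page is invented for this example (heavily indented, repetitive table markup), so the byte counts it prints are not the benchmark figures quoted above; only the tendency is the point.

```python
import gzip
import re

# Synthetic page: heavy indentation and repeated markup, as on many real sites.
row = ('        <tr>\r\n'
       '            <td><font color="#000000">item</font></td>\r\n'
       '        </tr>\r\n')
page = ("<html><body><table>\r\n" + row * 500
        + "</table></body></html>").encode("ascii")

stripped = re.sub(rb"\s+", b" ", page)  # whitespace stripping only
gzipped = gzip.compress(page)           # gzip only
both = gzip.compress(stripped)          # stripping first, then gzip

for label, data in [("original", page), ("stripped", stripped),
                    ("gzipped", gzipped), ("stripped+gzipped", both)]:
    print(f"{label:18} {len(data):7} bytes")
```

On repetitive input like this, stripping removes only the indentation, while gzip also exploits the repeated tags themselves.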
As has already been pointed out, HTTP, the protocol in
charge of sending page and resource requests and returning pages and images,
allows for gzip compression. However, to save CPU time, almost none of the
current Web servers actually return compressed pages. (For the server
operator, the CPU resources needed to compress a page before returning it to
the client cost more than the bandwidth overhead of the uncompressed page.)
That is, in the simplest setup, where a
simple browser communicates directly with a Web server as depicted in the
following figure, the responses will almost always be uncompressed:

Here, along the Internet connection (the arrow in this
figure), only uncompressed data flows. (From now on, I use solid arrows to denote uncompressed
and dotted arrows to denote
compressed data flow to avoid making the figures hard-to-read by adding
excessive comments.)
This is where so-called proxies come into the picture. What are they?
Proxies are servers that act as intermediaries
for the browser. They receive HTML page and
related resource (for example, linked image) requests from the browser
and send them to the real Web server as if they (the proxies) were the clients
requesting these resources. They receive the response from the Web server and
send it over to their client, that is, the originating browser.
They (the proxy servers) do not need to send the
response to the browser verbatim, without any transformation. If they notice
that the client is, for example, a PDA-based Web browser, they may choose to
remove unneeded HTML markup from the code (that is, transform the HTML) and/or
compress it with gzip.
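The decision a compressing proxy has to make for each response can be sketched in a few lines of Python. This is a simplified illustration and not the code of any proxy discussed in this article: the function names are hypothetical, headers are passed as a plain dictionary with exactly spelled names, and quality values such as "gzip;q=0" are ignored. The Accept-Encoding and Content-Encoding headers themselves are standard HTTP.

```python
import gzip


def client_accepts_gzip(request_headers: dict) -> bool:
    """True if the request's Accept-Encoding header lists gzip."""
    value = request_headers.get("Accept-Encoding", "")
    encodings = [token.split(";")[0].strip().lower()
                 for token in value.split(",")]
    return "gzip" in encodings


def transform_response(request_headers: dict, body: bytes):
    """Return (response_headers, body), gzipped only if the client allows it."""
    if client_accepts_gzip(request_headers):
        packed = gzip.compress(body)
        headers = {"Content-Encoding": "gzip",
                   "Content-Length": str(len(packed))}
        return headers, packed
    return {"Content-Length": str(len(body))}, body


html = b"<p>this is just a test\r\n" * 100
headers, payload = transform_response({"Accept-Encoding": "gzip, deflate"}, html)
print(headers, len(html), "->", len(payload))
```

A client that doesn't announce gzip support simply gets the body back unchanged, so such a proxy stays usable with any browser.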
The address of the proxy server must be entered in the
proxy settings dialog of each browser. An exception is Pocket Internet
Explorer (PIE) starting with the Pocket PC 2002 operating system: in
recent Pocket PC/Windows Mobile operating systems, the HTTP proxy settings are
separated from Pocket Internet Explorer and belong to the operating system. In
alternative browsers like Netfront, however, they can be set right in the
browser. I've explained setting a proxy
server in both PIE (that is, the operating system) and Netfront 3.1/3.2 at,
say, http://www.pocketpcmag.com/forum/topic.asp?TOPIC_ID=16017.
With proxy servers coming into the picture, the previous
two-tier model (a browser and a Web server communicating directly) effectively
becomes three-tier, as in the following figure. Please note that in this figure
I've used a dotted arrow to denote compressed responses from the proxy server
to the client.

What does this mean? It's pretty simple: if the proxy
server is running somewhere on the Internet and not on the PDA, then the
(probably very slow and expensive) Internet connection of the client is
already in the "compressed" zone (that is, where only compressed data
is transferred - the thick, vertical line in the figure), so you'll save quite
a lot of money and time by using compression services!
There are quite a few public, free proxies on the
Internet that you can use. A list of pages containing free proxy addresses will
follow in the Anonymity section. (Almost) none of them support compression by
default, though, so we need to keep searching for alternatives. The
above introduction was still needed to draw a generic picture of the simplest
case so that more advanced architectures can be understood.
Because of the lack of free public servers that also compress,
you may opt for deploying a compressing proxy server of your own on, say, your
home desktop PC or, if it isn't in a heavily-defended (firewalls, other
restrictions) environment, even on your work PC. These proxy servers
differ from third-party proxy servers, Web-based compression/content
stripping services and on-PDA clients alike in that they must be run on one of
your desktop PCs. This means they're only useful for people that have a PC on
the Net with sufficiently large Internet bandwidth. If you aren't one of them,
you may want to forget this option. If you, on the other hand, do have one,
these proxy servers are for you because they have
several advantages over other solutions:
1, as it's your own server without other users, it
will be much faster than other, mostly heavily overloaded services (including
PDA-based proxy clients), assuming the desktop machine it's running on has a
fast Internet connection
2, you can freely configure them
3, they are not only able to do basic stuff (content
compression, image downscaling), but also goodies like ad filtering.
One of the best personal, free proxy servers is RabbIT
( http://www.khelekore.org/rabbit/ ). It is a great tool because it supports
gzipping, HTTP/1.1 (just like
Pocket Internet Explorer), image downscaling, proxy chaining (in order to
remain anonymous) and all the goodies you can think of. It's written in Java
(so you'll need a JVM to run it) but uses ImageMagick
for super-fast image conversion.
You can start it (after downloading and decompressing
it to c:\Rabbit) with the command
java.exe -classpath . rabbit.proxy.Proxy
Before running it, especially on Windows platforms, you must configure it.
It isn't very complicated. conf\rabbit.conf contains
the generic configuration. Under Windows, open it with some kind of
non-Notepad text editor - for example, Wordpad - at first because its lines
end in LF characters only, which Notepad doesn't display properly.
If the proxy running on your particular desktop machine can't resolve host
names but plain IPs work (which is likely under Windows), you most probably
want to switch off rabbit.proxy.DNSJavaHandler and enable
rabbit.proxy.DNSSunHandler instead.
That is, comment out the first and remove the hash before the second property
as in:
#dnsHandler=rabbit.proxy.DNSJavaHandler
dnsHandler=rabbit.proxy.DNSSunHandler
To enable proxy chaining, fill in the two
properties proxyhost and proxyport (see the section of the configuration file
commented with "This is the proxy that RabbIT should use when getting its
files. Leave it blank to dont have a proxy. Both of these need to be set, or
they will be ignored."). The default port number the proxy uses is 9666.
If you want image downscaling, look for the convert property and make it point to
the convert program in ImageMagick.
In Windows, it'll be something like this (the actual ImageMagick version may be
different on your PC!):
convert=\Program Files\ImageMagick-6.2.1-Q8\convert.exe
Please note that you can't use " marks in here.
Furthermore, if RabbIT is unable to find ImageMagick, the only sign of it
will be a warning in logs/error_log in the form of
[13/Jun/2005:10:16:48 GMT][WARN][convert -"C:\Programme\ImageMagick-6.2.1-Q8\convert.exe"- not found, is your path correct?]
To enable advertisement blocking, you may want to modify
the httpinfilters property as
follows:
httpinfilters=rabbit.filter.BlockFilter,rabbit.filter.HTTPBaseFilter
You may also want to edit the contents of the blockURLmatching property to add/remove
blocked URL's.
If you also want to use the proxy with a desktop-based
Internet Explorer, enable the HTTP/1.1 extensions for proxies in Internet Options/Advanced tab.
If you don't have the resources (a desktop PC, preferably one that is on
24/7, to run your proxy server) or are afraid of hackers and unwanted
guests using it (the latter is not really an issue because, for example, RabbIT
has quite sophisticated authentication and client filtering capabilities),
another option is using a third-party compressing proxy server.
There are also Web-based services that do essentially
the same as outside-the-PDA proxies. With these, you won't need to set a proxy
server on your Pocket PC (unlike in the first case), but you will need to
navigate to a proxy-like web page with your browser and enter there the URL of
the target Web server that you want to access. Content stripping and
compression will be done by the middle-layer, proxy-like Web server and you
will only receive well-compressed pages (that is, if you choose the right
service that does this - for example, Skweezer). They also save a lot of
bandwidth because only already-compressed HTML pages are transferred over the
direct connection of the PDA. There are quite a few services
like these; the majority of the services I've tested in this article are in
this category.
The resulting infrastructure can be depicted as
follows:

These services have clear advantages over the pure-proxy solution.
First, they are much easier to use
for the technologically challenged because no proxy options need to be set on
the PDA (which is not that simple if you use PIE and not Netfront). You
just navigate to their page (in this example, to http://1.2.3.4) and enter the
URL of the target Web server you want to access. Second, they not only gzip
their contents, but may also convert
them to a more PDA-friendly format (convert them to one-column to avoid the
need for horizontal scrolling etc.)
Their disadvantages will be explained later when I
point out the problems of several Web-based services: for example, the hidden
URL, the lack of cookie handling etc.
There is another category, with services like toonel (also described in this
article). They run locally on the PDA (so far,
we have only used proxy servers and/or third-party compressing servers
somewhere on the Web, but definitely not on the PDA itself) and are even more
useful because
1, as far as Web browsing (that is, the HTTP protocol)
is concerned, they compress not only incoming textual HTML pages but
everything else too: outgoing requests, JavaScript and CSS files (which are
generally not compressed, not even by Web-based services).
2, they go even further by also compressing
SMTP and POP3, the two most important mail sending and receiving protocols.
This is a big thing because the standards defining these protocols
(along with the MIME standard) don't allow for any kind of compression
(as opposed to HTTP). This is why only a locally (on the PDA in this
case) running client can introduce any kind of decent compression - that is,
one much better than the simple dictionary-based compression of V.42bis, the
protocol widely used in modem-based communication, including GPRS.
Because local proxy-like clients sit behind the Internet connection of the
PDA (all programs running on the PDA share the same connection, which, with
mobile phones, is slow and expensive), they need another server somewhere on
the Internet to connect to. This server is hidden, meaning the PDA user
doesn't need to know its address. It takes care of compressing the incoming
POP3/HTTP responses and the outgoing (POP3/)SMTP/HTTP requests, as is depicted
in the following figure:

Please note that the mailer client communicates with
the cruncher (compressing) client uncompressed (as per the POP3/SMTP
standards); it's only between the cruncher client and server that communication
is compressed. This causes no runtime/bandwidth problems: as in the previous
case, only compressed content is sent over the PDA-Internet connection (and
here it is compressed even better than in the previous case).
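How a cruncher client and its server could compress a conversational protocol like POP3 or SMTP can be sketched with Python's zlib module. This is a generic illustration under my own assumptions, not toonel's actual wire format (which isn't documented in this article); the function names and the sample mail are made up. The important detail is the sync flush: without it, a short command would sit in the compressor's buffer instead of reaching the other side.

```python
import zlib

deflater = zlib.compressobj()    # would live in the on-PDA cruncher client
inflater = zlib.decompressobj()  # would live in the remote cruncher server


def pack(chunk: bytes) -> bytes:
    """Compress one outgoing chunk so that the peer can decode it right away."""
    # Z_SYNC_FLUSH pushes out everything buffered so far.
    return deflater.compress(chunk) + deflater.flush(zlib.Z_SYNC_FLUSH)


def unpack(wire_bytes: bytes) -> bytes:
    """Restore the original bytes on the other end of the link."""
    return inflater.decompress(wire_bytes)


mail = b"Subject: test\r\n\r\n" + b"This is just a test line.\r\n" * 200
wire = pack(mail)
print(len(mail), "bytes from the mailer ->", len(wire), "bytes on the wire")
```

Because both ends keep one compression stream open for the whole session, later chunks also benefit from what has already been sent.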
This solution results in the most bandwidth saving
(not only HTTP inbound, but also SMTP outbound, POP3 inbound etc), but it
definitely eats up some of the system resources, which are quite scarce on a
PDA. Advanced compressing clients like toonel, however, don't cause a
considerable runtime hit on the device.
Setting up a local client on your PDA is easier than
you think. As far as the Java-based toonel is concerned, you may want to read http://www.pocketpcmag.com/forum/topic.asp?TOPIC_ID=16017
on this.
As toonel, unlike some of its alternatives, doesn't integrate into the core
operating system (the alternatives, being still in beta, will
only be discussed at length later in this article), you will need to manually
configure the operating system / the browsers / the mailer client(s) to use it.
This is why you need to explicitly configure a proxy server on the PDA and fool
the mailer client into believing that the SMTP/POP3 server is at localhost. It
isn't particularly complicated, though, and is
thoroughly explained in the above link.
The second category of local services integrates into
the Windows Mobile operating system even more. Such a service just sits in the
background and silently compresses everything, without user interaction. As
with the first category, it uses a Web-based compression server, also hidden
from the user.
This means you don't need to make your Web
browsers'/the operating system's proxy and/or the mailer clients point to it.
This is why these services are much easier to set up (no need for proxy
configuration / mailer reconfiguration) and use.
Their structure is depicted as follows:

The only solution in this category is being developed by
Globility ( http://www.globility.co.nz ). It has not
been released yet; it's only available as a (closed) beta. It is very good and
works just great, I can tell you ;) As soon as it
becomes final, I will also release its benchmark results.
Some solutions (namely, Thunderhawk, http://www.bitstream.com/wireless/ )
not only heavily strip the Web content and make it much more PDA-friendly,
but also use their own client instead of PIE (or an alternative browser like
Netfront or Minimo). This has clear advantages:

Thunderhawk is a very good
solution, with very few problems. These are as follows:
- as there is no VGA-optimized version, on VGA devices, the visual experience
Thunderhawk delivers is not as good as it could be - with images, it's much
inferior to both Netfront and Pocket Internet Explorer.
- it only allows for displaying one page; there are no tabs, unlike with all
PIE plug-ins or Netfront. This makes it very hard to, say, copy information
between different web pages.
- it doesn't have a local cache,
unlike other PPC-based Web browsers (except for NetFront versions prior to 3.2
- they had cache-related bugs explained at, say,
http://www.pocketpcthoughts.com/forums/viewtopic.php?t=39674 ).
- it does strip Web content, but by no
means as well as other compression solutions - see the compression benchmarks
below.
You may have guessed that stand-alone proxy servers (that is,
proxy servers running somewhere on the Internet and not on your PDA) have a
different Internet address than your PDA and act as the
client for the real HTTP servers you're accessing resources at, so you can
effectively "hide" behind them. Then, the target HTTP server you access
will only see the proxy server as the client. This is a very important question
for people that want to hide their Internet address (just Google for the word
"anonymity" to see how popular a question this is).
Most public proxy servers (and compressing services,
because they effectively also act as proxy servers in this case), however, send
the X-Forwarded-For HTTP header to the
HTTP server, which undermines anonymity. This is why I've paid special
attention to checking anonymity. I've written a simple HTTP server that
just sends the headers it receives back to the client. The Java source code
can be found here.
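For readers who want to run a similar check themselves, a header-echoing server takes only a few lines. The following is a Python sketch of the same idea, not the Java code mentioned above; the class and function names, as well as the default port 8080, are arbitrary choices made for this example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_headers(headers) -> bytes:
    """Format the received request headers as plain text, one per line."""
    lines = [f"{name}: {value}" for name, value in headers.items()]
    return ("\r\n".join(lines) + "\r\n").encode("utf-8")


class EchoHeadersHandler(BaseHTTPRequestHandler):
    """Answers every GET with the request headers it has just received."""

    def do_GET(self):
        body = render_headers(self.headers)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the echo server until interrupted."""
    HTTPServer(("", port), EchoHeadersHandler).serve_forever()
```

Call serve() on a machine reachable from the Internet, then open its address through the proxy or compression service under test: if an X-Forwarded-For line carrying your own address shows up in the answer, the service gives you away.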
Unfortunately, Skweezer also tells the client Internet
address to the HTTP server. Toonel, on the other hand, doesn't - although it
does send the above HTTP header, its value is "unknown".
You can hide your identity even if you keep using Skweezer
if you use an anonymous proxy server to access Skweezer. This is possible
because Skweezer is just an HTTP server and not a proxy server - that is, even
without the so-called 'proxy chaining',
you can put a real proxy server in front of Skweezer. Then, the problem
with Skweezer's not hiding your identity goes away. A figure depicting this
situation follows:
In this case, the anonymous proxy server the PDA uses by
default behaves like a client for the Skweezer-like service; this is why it's
the address of the anonymous server (which is entirely different than that of
your PDA and can, of course, be located anywhere in the internet) that is
passed further to the real Web server and not that of your PDA.
As most public anonymous proxy servers don't touch the
contents flowing through them in any way, the compressed content from the
Skweezer-like Web-based crunching services continues its way to the PDA
unmodified - that is, still compressed.
The anonymity situation with PDA-based proxy servers
is pretty different: you can't chain any on-PDA proxy to toonel or similar
on-PDA proxy servers. This doesn't cause any problem with PDA-based proxies,
however, as they physically use a proxy server somewhere on the Internet. That
is, you don't need to be afraid of PDA-based proxy servers: they can also
offer a high degree of anonymity. You may still want to check whether they
send the X-Forwarded-For HTTP
header to the accessed server. For this, please see the table row
"Anonymity?" in the tables summarizing my article.
Personal proxy servers like RabbIT that you need to
run at, say, your home or workplace are a little bit different because they
act as clients to the accessed HTTP servers by default; that is, the HTTP
server identifies the device running your personal proxy server (for example,
your work computer) as the client. This may
be undesirable; this is where proxy chaining
comes into the picture. I've given detailed instructions on setting up RabbIT
to use an external chained proxy server in the section dedicated to RabbIT.
You may want to check out the following pages for
public proxy servers and more on anonymity:
http://www.atomintersoft.com/products/alive-proxy/proxy-list/
http://www.publicproxyservers.com/page1.html
http://www.checker.freeproxy.ru/
These pages also list commercial proxy services that
offer full anonymity and sometimes even compression/image downscaling. They
don't do content ripping/PDA-specific formatting, though.
Testing these services is a time-consuming task even
for a TCP/IP veteran like me (I've written several HTTP filter proxies and
know other TCP/IP protocols like the palm of my hand) because there is a
lot to test.
First, the most important of them is the HTML
compression ratio, which comes from two factors: extracting (eliminating)
unneeded HTML / script markup from a file and utilizing the built-in capability
of using some kind of compression.
Second, I've tested additional HTTP goodies like utilizing the
local cache. All-in-one solutions like Thunderhawk, which also have their own
client to display the result, surprisingly failed this very important test.
Third, I've paid special attention to cookie handling.
Independent of the type of the compressor/content stripper
service (remote or local), I've scrutinized whether the cookies sent
back by the HTTP server reach the client untouched (as with local compression
clients) or rewritten to contain the path of the compressor service itself.
Surprisingly, I've found out that the otherwise very good Skweezer service
failed this test because it keeps all cookies on the server and the
PDA-based client only uses a single, non-persistent cookie to authorize itself,
which is why the stored cookies don't survive a client restart.
Fourth, I've also scrutinized other things like
compression/downsampling ratio of images (where applicable), URL preservation
etc.
Fifth, I've also examined whether the given
services/servers offer anonymity.
I've already listed the non-Web-based solutions
(toonel, Thunderhawk) and personal proxy solutions (RabbIT). As there are way
more Web-based compression/content ripper services than solutions in the former
categories, the Web-based ones deserve a complete section of their own.
Skweezer is the most important of them: a very good service that also has a
free version.
Pros:
Cons:
This is indeed a top-notch service. However, as it
is currently the best client-less Internet GZIP compression and content
stripping service, it should also implement image downscaling - at least in the
Pro version.
I've also listed the URL format it (and WebWarper)
uses to show how easy it is to pass these services a URL without entering it
into the field in their homepage. Fortunately, you can easily concatenate any
URL to both Skweezer and WebWarper.
With Skweezer, the full URL is as follows:
http://www.skweezer.net/skweeze.aspx?m=2&q=<URL, without http://>
Cons:
Pros:
URL format:
http://webwarper.net/ww/~av/<URL, without http://>; in text-only mode, use
http://webwarper.net/ww/~s/ instead of this.
Because of the JavaScript download problem, I do not
recommend this service. Skweezer is much better than this service in terms of
bandwidth usage.
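The two URL formats quoted above (Skweezer and WebWarper) make it easy to build such links programmatically, for example to generate bookmarks. A small Python sketch follows; the function names are made up, and the target URL is concatenated verbatim as described above. Whether a target URL with its own query string needs any escaping is not something I've stated above, so treat that as untested.

```python
SKWEEZER_PREFIX = "http://www.skweezer.net/skweeze.aspx?m=2&q="
WEBWARPER_PREFIX = "http://webwarper.net/ww/~av/"
WEBWARPER_TEXT_PREFIX = "http://webwarper.net/ww/~s/"  # text-only mode


def without_scheme(url: str) -> str:
    """Drop a leading http://, as both services expect."""
    if url.lower().startswith("http://"):
        return url[len("http://"):]
    return url


def skweezer_url(url: str) -> str:
    return SKWEEZER_PREFIX + without_scheme(url)


def webwarper_url(url: str, text_only: bool = False) -> str:
    prefix = WEBWARPER_TEXT_PREFIX if text_only else WEBWARPER_PREFIX
    return prefix + without_scheme(url)


target = "http://pocketpcmag.com/forum/active.asp"
print(skweezer_url(target))
print(webwarper_url(target))
print(webwarper_url(target, text_only=True))
```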
Cons:
Pros:
I can't really recommend this service because of the
major problems described above.
Also see http://www.pocketpcthoughts.com/forums/viewtopic.php?p=345365
on these problems.
http://216.103.91.135/rfxDM/ppcframe.php
Doesn't compress but cuts contents
pretty well.
Cookies are real server
cookies (unlike with Skweezer) but they aren't persistent. Furthermore, if you
want to navigate to different sections in a lot of pages (for example, the
Pocket PC Thoughts forum), you'll need to click [+] a lot of times, which
really gets on your nerves. Furthermore, once you have used this link, all the
subsequent pages will use static HTML URLs (like /rfx/ss1868008.htm)
and not real URLs, unlike, say, Skweezer. This means you won't be able to
bookmark a lot of pages.
http://mobileleap.net/app/demo/translator
This is a simple service
with really reduced capabilities as it doesn't support HTML forms at all. (I've
scrutinized the stripped HTML source it sends back of a page that originally
contained FORM tags: the stripper engine just strips all <form>...</form> tags;
this is why there is no way of
logging in.) This means you won't be able to, say, log in to most forums or run
form-based searches, which is possible with almost all the other
compression/content stripping techniques. With Web pages that also
accept GET requests for logging in, you can 'hack' the engine to let you in by
passing the POST request body inside the GET
request, but this won't work in most
cases, for example, with phpBB.
Pros:
- image compression (in the MobileLeap
mode; in the two other, Lynx and Text Extractor modes, it completely strips images
too)
Cons:
- no GZIP compression
- no form support at all
- no page-specific URL's
(it's always http://mobileleap.net/app/demo/translator
because, internally, it only uses POST to communicate parameters)
Bottom line: don't use it.
Although this is, officially,
only a Google front-end, you can enter the full URL of the page you want to
access into it, without the leading http://.
Then, if you are lucky (it doesn't work in a lot of cases!) you'll be able to
use this service to access the page without any URL hacking or manual editing
(which would be rather complicated with WML Proxy). It doesn't have images,
though, and, because it's a widely used service, it's at times very slow or
doesn't work at all.
Because it didn't work when
I finished this article, I couldn't test its capabilities. This is why it's
also missing from the comparison chart.
For this test, I've used two
HTML files. One of them, a 45k-long standalone HTML file, only contains one
line (<p>this is just a test), repeated over and over.
This means that there is no strippable content (except for the
CR-LF characters at the end of each line) in it. This file is, however, very
compressible: both WinZip and WinRAR produce archives
of between 100 and 200 bytes from this file.
The other HTML test file was
a snapshot of the Pocket PC Magazine 'Active Topics' page
( http://pocketpcmag.com/forum/active.asp ). Like most other webpages, it's
full of repeated whitespace, font and color declarations, tables and
JavaScript, so content and whitespace stripping will work
very well with it. WinZip and WinRAR are able to compress it to 23k and 28k,
respectively.
The two input files can be
found at http://www.winmobiletech.com/062005CompressionTester/ppcmag596k.htm
and http://www.winmobiletech.com/062005CompressionTester/45kToistuva.html.
| | Thunderhawk | Toonel | | RabbIT |
| 45k test HTML file-test; down/upload | 84k/5k (!!!!) | 1.2k/2k | 3k/3k | |
| 590k PPCMag-test; down/upload | 163k/7.4k | 40k/10k | 52k/8k | |
| Anonymity? | | + (X-Forwarded-For: unknown) | - (X-Forwarded-For is sent) | + (a single Via: HTTP/1.0 RabbIT header is sent). |
| HTTP/1.1 compliant? | | + | + | + |
As can clearly be seen,
Thunderhawk doesn't use any compression, just some light content stripping. It
really failed the test of the unstrippable 45kbyte-long file - it used twice
as much bandwidth as the original size!
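The "twice the original size" figure can be checked with some quick arithmetic on the table above. The Python sketch below uses only the Thunderhawk and Toonel figures from the table; treating "k" as a plain multiplier and adding download and upload together is my simplification for this example.

```python
ORIGINAL_KB = {"45k test file": 45, "590k PPCMag page": 590}

# (download, upload) in kilobytes, taken from the table above.
MEASURED_KB = {
    "Thunderhawk": {"45k test file": (84, 5), "590k PPCMag page": (163, 7.4)},
    "Toonel": {"45k test file": (1.2, 2), "590k PPCMag page": (40, 10)},
}


def traffic_ratio(down_kb: float, up_kb: float, original_kb: float) -> float:
    """Total traffic (download + upload) relative to the original file size."""
    return (down_kb + up_kb) / original_kb


for service, results in MEASURED_KB.items():
    for test_name, (down, up) in results.items():
        ratio = traffic_ratio(down, up, ORIGINAL_KB[test_name])
        print(f"{service:12} {test_name:17} {ratio:6.2f} x original")
```

For Thunderhawk and the 45k file this gives (84 + 5) / 45, that is, just under two times the original size, while the same client needs well under a third of the original size for the PPCMag page.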
| | Skweezer | clickfisch | WebXCope | MobileLeap | WebWarper |
| Price? | free; Pro version: 15 $/year | free | free | free | free |
| URL format | | | | | |
| POST? | + | + (doesn't really work) | + | - (not even FORM support!) | + |
| ZIP'ed HTML? | + | - | - | - | + |
| Downscaled images? | - (not even in Pro version!) | + | no images at all | + | - |
| URL's reflecting the current page? | + | - | generally, yes; after clicking [+], no | - | + |
| In-session cookies? | +, stored on the Skw. server | - | +, real | - | +, real |
| Inter-session cookies? | + | - | - | - | + (!) |
| Size of the 45k test HTML file after possible compression with HTTP headers | 1216 bytes w/ GZIP | (44k) | | | 900 bytes |
| Size of the 590k PPCMag HTML file after content stripping and possible compression | 184k w/o GZIP; 18599 bytes w/ GZIP | answered with a 500 Internal Server Error in 4-5 seconds | | | 28k |
| Time for cutting (and compressing) the PPCMag page | ~15s | see above | | | ~15s |
| Anonymity? | - (X-Forwarded-For) | | | | + |
It's not very hard to say which services are the best as of now (13/06/2005):
- Thunderhawk, if you can put up with the
relatively high price of the service and the problems (QVGA, no tabs, lack of
real compression) of the Thunderhawk client
- Toonel, if you don't mind running a personal proxy
server on your PDA.
- Skweezer. Please note that I've chosen it over
WebWarper mostly because of the latter's 195k JavaScript download problem and
its lack of a PDA-formatted view that still has images. WebWarper,
however, offers complete anonymity.