Welcome to my HTTP filter & Mobipocket Web Companion Support Page!


Download the latest (1.7.8) build


Version history (really worth checking out for warnings/announcements!)


Download the Word manuscript of this article (it may be a bit newer than the HTML if I forget to update the latter)




Please note that this software suite is still evolving. Use it AT YOUR OWN RISK. I've provided the sources so that you can check the code (and even recompile it). As I'm still adding new features and fixing bugs, the suite is NOT guaranteed to work.


You should also note that my HTTPProxy's functionality, except for some particular areas (page merging, POST-based logins etc), is not as mature as that of iSiloX or, even better, Sitescooper. However, you will definitely find utilities in this suite that make Mobipocket Reader, in particular, the best and most versatile e-News reader.


Just a sneak preview and some new info, before I completely rewrite the reader comparison section:


Some of my mails to the MP mailing lists about the latest version of MPR (4.7 build 408):


"I have another idea: what about implementing a text highlighting schema similar to that in uBook (http://www.gowerpoint.com/)? On a lot of PDA's, it's pretty hard to select a large amount of text on a given page. Newer models (I'm able to select a lot of text on both the h2210 and the Palm Zire 71, but not on any of the iPAQ 36xx models) don't suffer from the relative insensitivity of the touch screen, but older models do.


On older models, it sometimes takes 3 or 4 repeated tries to select some text because you have to drag the stylus all around the screen. It is very tiring (not to mention that it isn't very good for the touch screen either - think of the scratches!). This is why some innovative text selecting capability would be great."


"what about the full screen toggling in the PPC version? The current solution is VERY bad IMHO for PPC's. PPC's are not like Smartphones - PPC's do have touch screens, which makes the current (as of 4.7) Smartphone-optimized cursor movement scheme superfluous on them. This scheme is OK on Smartphones, but not on PPC's, where users can simply tap the screen if they want to follow a hyperlink.


What about releasing an updated 4.7 PPC build with the old Full screen-toggle? I think the vast majority of PPC users would prefer that one to the new, Smartphone-biased button scheme."





Welcome to my HTTP filter & Mobipocket Web Companion Support Page!

Contents

Introduction to e-News
iSiloX, MWC, SiteScooper and my HTTP proxy: capabilities
Other readers without an official Web extractor utility
HTML-capable readers
uBook (http://www.gowerpoint.com/)
Team One's Reader v3.0 (http://www.teamonesoft.com/en/Products.htm)
Non-HTML-capable readers
Microsoft Reader 2.0
Haali Reader (http://haali.cs.msu.ru/pocketpc/)
TomeRaider (http://www.tomeraider.com/)
Palm Reader (http://www.palmdigitalmedia.com/S=2f7c86ffed7244ce36a1cf2ac89a995cPrkrDcCoAAsAACOeM0c3046011/product/reader/browse/free)
Why did I choose Mobipocket? Why do I recommend it?
Mobipocket's e-News/e-Book format
Problems with Mobipocket Web Companion
How does MWC work?
.enews files
action="download-site"
.XSL files
Offline MB utilities
.enews generators
.in generators for forum softwares
Picture utilities
Resizing and no high/true-color support
Animated GIF's
BMP support
Proxy-based, fully automatic solution?
PRC reconstruction
MBHelper
Installation, running
Setting up your MWC/browser to use the proxy
MBHelper.conf
The target URL
Available actions
substitute
mergeallpagesfollownextlink
changeURL
killURL
onlyAllowURL
returnonlyhtmlbetweenbeginend
prekiller
uniquepictures
addURL
POSTLogin
Using HTTPSnoopProxy to get login URL's
When it won't work (JavaScript, POST-only)
forceUTF8Conversion
addServerNameToAllURLsWhenNecessary
converttables
cacheSimulation
enableBloggerConversion
URLDecode
Some tips and tricks to using MBHelper
Future Plans



Introduction to e-News


One of the most compelling advantages of any PDA (whether Palm-, H/PC- or Pocket PC-based) and of Symbian / PPC Phone Edition-based mobile phones is the ability to read electronic documents.


Reading 'traditional' e-books on a lightweight PDA is topped only by its ability to store and present the daily news or forum archives - one of a PDA's greatest capabilities.


There're three major players in the e-News field. The first is AvantGo (http://avantgo.com/frontdoor/index.html; for reviews, see e.g. http://www.epinions.com/cmsw-PalmSoftware-All-Avantgo_3_1/display_~reviews), probably the best-known service, because Pocket Internet Explorer (PIE), the built-in Internet Explorer in Pocket PC (2002), has a link to it on its main page.


The second is Mazingo (http://www.mazingo.net/) and the third is Mobipocket (http://www.mobipocket.com/).


The former two companies only offer PDA products for news reading (not for book reading). Mobipocket's reader product, Mobipocket Reader, is a bit different: rather than a plain tool that turns Pocket Internet Explorer into an offline news reader and synchronizes e-News, it's a full-fledged e-book reader.


Another e-News-compliant player is iSilo, with its extremely capable HTTP downloader, iSiloX.


There're other site extractors too; for example, the Perl-based Sitescooper (also see this and this article). It supports tons of output formats: HTML, Palm DOC, iSilo etc (the latter two via external conversion tools - just like XDOCGenerator).


Sitescooper is a very good extractor; except for its table handling, it's much superior to both MWC (even with my toolkit) and iSiloX. However, because it handles table-based pages much worse than iSiloX and doesn't support page merging as well as my toolkit does, there may be cases when it shouldn't be used.


Sitescooper, of course, has its own shortcomings (not just the table handling inferior to iSiloX's). For example, while MWC can handle page structures of any depth, Sitescooper can only handle site structures up to 3 levels deep, so it can't extract sites with more than 3 levels.
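The depth limit above is easy to picture as a breadth-first crawl with a level cap. Here is a minimal sketch in Python - the `fetch_links` callback and the toy site graph are hypothetical, and this is not Sitescooper's actual code:

```python
from collections import deque

def crawl(start_url, fetch_links, max_depth=3):
    """Breadth-first crawl that stops at max_depth levels.
    fetch_links(url) -> list of linked URLs (hypothetical callback)."""
    seen = {start_url}
    queue = deque([(start_url, 1)])  # level 1 = the start page itself
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # don't follow links beyond the depth cap
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

With `max_depth=3`, a chain of pages a -> b -> c -> d stops at c: the third level is fetched but its links are no longer followed.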


Actually, mainly because e-News was (or may have been) added to Mobipocket's battery of applications as an afterthought, e-book support in both Mobipocket's and iSilo's products is far superior to their e-News synchronization support. Had Mobipocket's e-News support been as good as their e-book support, I wouldn't have developed a complete toolkit to fix the shortcomings and bugs in their e-News support.


My toolkit has two kinds of tools. On the one hand, I've written an HTTP proxy with advanced filtering (e.g. page merging) capabilities, plus Palm DOC/Mobipocket XDOC creators/decompressors.


Being an HTTP proxy also means that my HTTP filter can be used with iSiloX. In other words, the already-excellent iSiloX gets even better: the proxy it connects to introduces features iSiloX itself lacks.
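The idea of adding features "in front of" an extractor can be sketched as a tiny filtering HTTP proxy. This is only an illustration of the principle, not MBHelper's actual code (which is a Java program); the substitute rules and the port number are made up:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Illustrative substitute rules (byte patterns), not MBHelper's real config.
FILTERS = [(b"<blink>", b""), (b"</blink>", b"")]

def apply_filters(body, filters=FILTERS):
    """Run every (pattern, replacement) rule over the response body."""
    for pattern, replacement in filters:
        body = body.replace(pattern, replacement)
    return body

class FilterProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # When a client uses us as an HTTP proxy, self.path is the
        # absolute URL the client asked for.
        with urlopen(self.path) as upstream:
            body = apply_filters(upstream.read())
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def run(port=8080):
    # Point MWC or iSiloX at 127.0.0.1:<port> as an HTTP proxy
    # and every fetched page passes through apply_filters().
    HTTPServer(("127.0.0.1", port), FilterProxy).serve_forever()
```

The extractor stays unmodified; only its proxy setting changes, which is exactly why a proxy-based design works with third-party tools.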


On the other hand, I've written several tools for Mobipocket's products because I find their Pocket PC-based reader superior to all the competing products. It's just their web extractor and PRC creator tools that are very immature and almost useless for serious work. One of my aims was to enhance Mobipocket's Web extractor and PRC creator tool, Mobipocket Web Companion (MWC), to be even better than iSiloX.


iSiloX, MWC, SiteScooper and my HTTP proxy: capabilities


Here is a table of what iSiloX, Sitescooper and MWC can do with and without my HTTP proxy in terms of HTML extraction.





MWC 4.5

MWC 4.5 + toolkit

iSiloX 3.35b4

iSiloX 3.35b4 + toolkit

SiteScooper 3.1.2

SiteScooper 3.1.2 + toolkit



Limited (given table header; ASCII)


No point

Limited (only completely deleting them)

Limited (see column 2 and 5)

Including filter


+ (allowURL)


No point


+ (allowURL)

Including filter - wildcard




No point



Including filter - regex




No point



Excluding URL filter


+ (killURL)


No point

- (!)

+ (killURL)

Excluding URL filter - wildcard




No point



Excluding URL filter - regex




No point





No point

+ (much better configurable)

No point


No point

Pre-made cookies




No point



Accessing op. system-level cookies


No point


No point


No point

GET login

+ (undocumented)

No point


* (only POST is natively supported; GET may work)

- (only BASIC HTTP authentication)

* (only POST is natively supported; GET may work)

POST  login, cookie







<PRE> handling


+ (basic support for inserting linebreaks)

+ (Excellent config options; Courier New support in the reader)

No point


+ (basic support for inserting linebreaks)

Image bit depth


No point (reader-dependent)

2-16 bit; can include several pics for one

No point

Keeps original; piping to XDOCGenerator works great


Image conversion configure


+ (2 options: original conversion and full quality/size GIF versions)


No point


(by modifying either the Perl or the Java program)

Page merging





+ (although more limited than that of my toolkit - pre-defined pattern of names; no forum-specific merger code etc)


Automatic redirection


+ (changeURL)





HTTP User Agent string configurable


By editing & recompiling HTTPProxy


No point


By editing & recompiling HTTPProxy

HTTP timings configurable

- (5 minutes max for a page to return)

Depends on MWC; can't be overridden from proxy

+ (to 999 sec!)

No point


can't be overridden from proxy

BMP including


100% working

50% (wasn't able to process

No iSilo generator as yet

* (everything is passed straight to DOC creator via pipes; SiteScooper doesn't modify pics)

100% working

Animated GIF conversion success rate


100% working; all frames

100% working

No iSilo generator as yet

* (everything is passed straight to DOC creator via pipes; SiteScooper doesn't modify pics)

100% working; all frames

Automatic synching to memory cards instead of system memory

+ (only with an external, non-GUI-based hack!)

No point (it's not HTTP filtering)


No point (it's not HTTP filtering)





(only the toolkit)


(only the toolkit)


+ (both SiteScooper and the toolkit)


As you can see, MWC, without an external toolkit, is pretty crippled. iSiloX is better in almost every respect. Unfortunately, iSilo, the reader, is far inferior to Mobipocket's competing product, Mobipocket Reader.


First, speed. iSilo's scroll (page up/down) speed is about a third of MPR's. I've tested this with several multi-megabyte archives and iSilo has always been much slower.


Note that the speed problem only affects the iSilo reader, not the iSiloX HTTP converter and iSilo PDB builder. The latter is much faster than even my XDOC creator. Here're some benchmark results (converting Brighthand's full iPAQ 5450 forum; as of 05/03/2003, it had 6 Mbytes of (converted and merged) input HTML, 5.2 Mbytes of JPG and 700k of GIF). You can download the original input from here so that you can also give the tools a try (in iSilo, don't forget to switch off link following: Properties/Links; set Maximum link depth to 0).




Creation time:

iSiloX, default pics: 34 sec
iSiloX, no resize, 16 bit: 47 sec
XDOCCreate: 2 min 52 sec (without reconverting animated GIF's: 2:42)
MB Publisher: 3 min 10 sec
MWC: 10 min 10 sec



As you can see, even when using the no-resize mode and converting all GIF's to 16 bit, it took iSiloX only 47 seconds to build the 24-Mbyte-long archive. It took my XDOCCreate about 3 minutes to do the same, while MWC, Mobipocket's official e-News creator tool, spent 10 minutes on it.


Benchmarks (MPR 4.5 vs 4.6 vs iSilo 3.3 vs TomeRaider)


Compressing AVForum's complete back-up as of 05/17/2003; 223 Mbytes of input HTML

iSiloX (no pics, no tables, no link following): 10:30; 79.475.942 bytes

MWC 4.5: 5:00; 189.XXX.XXX bytes (no compression, just removing some HTML tags!)

MP Publisher: froze in New/Edit file properties window



Benchmarks: Palm (Zire 71) vs Pocket PC; HWSW.hu; 'csekkold'

Palm: 02:10

PPC: 5 secs

70s: 4s = 0.2 Mbyte/s : x

x = 0.2 / 17.5


Scrolling and rendering

iSilo vs MPR: see above



Source:; looked for the word 'animatedy' in the latest article (on around the 502000th page).


MPR 4.5: 9:03 in foreground (9:10 with spb GPRS monitor 2.0 w/o GPRS monitoring; 9:18 with spb GPRS monitor 2.0 with GPRS monitoring); 11:30 in background


MPR 4.6 build 405: 10:33; 10:32


iSilo (it has a progress bar in search mode, unlike MPR; this is a big plus): 34:13 in foreground.


TomeRaider, full Encarta (36.346.354 bytes), searching for 'urumchi' (it's at the end of the file): 01:23 - slightly more than two times faster than MPR (TR: 36 Mbytes / 83 sec = 0.4337 M/s; MPR 4.5: 110 Mbytes / 555 sec = 0.198 M/s).
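The throughput figures above are easy to re-derive; here is a quick sanity check of the arithmetic (sizes in Mbytes, times in seconds):

```python
def throughput(mbytes, seconds):
    # search throughput in Mbytes/s
    return mbytes / seconds

tomeraider = throughput(36, 83)    # ~0.434 Mbytes/s
mpr_45     = throughput(110, 555)  # ~0.198 Mbytes/s
ratio = tomeraider / mpr_45        # ~2.19, i.e. "slightly more than two times faster"
```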


My latest mail to the MPR newsgroup:


I've just benchmarked my new h2210 with AVForum's complete back-up as of 05/17/2003, searching for the word 'animatedy'. I've made all tests with no tasks running in the background: I soft reset the PPC before each test. In some tests, I only used Spb GPRS monitor 2.0 as a background task; I reran the test after uninstalling it to find out how much processing power it needs, especially under WM2003. (Not much, so it doesn't take much processor time.)


The results:


iPAQ 3660: 10:33 (old test with build 405!)

iPAQ 3630: 10:30 w/ Spb GPRS monitor 2.0 (build 406)

iPAQ h2210: 8:07 w/ Spb GPRS monitor 2.0; without the monitor, 7:52 (build 406)


I've run the tests off a 20x RITEK CF card.


The test clearly shows that the 2210 isn't THAT fast when it comes to mundane tasks like searching in MPR. The benchmark results published at http://www.pocketpcthoughts.com/forums/viewtopic.php?t=15275 based on Spb Benchmark seem to heavily correlate with real-life usage scenarios and real speed of a given device (except for some gaming: PocketSNES (see thread at http://discussion.brighthand.com/printthread.php?threadid=81209&perpage=1000), Pocket Quake and games like these).


It should be noted, however, that the Asus 620 is unlikely to be faster than the h2210 for reading/searching e-Books. It's just that its excellence in graphics made its overall score much higher than that of the competition (incl. the h2210). For a detailed comparison, see http://www.softspb.com/products/benchmark/compare.asp


It can be clearly seen that more mundane tasks like file reading and plain CPU instructions are executed at almost the same speed by both the h2210 and the Asus 620.



And my previous one:


I've made some new benchmarks on three different platforms with Mobipocket Reader (MPR) 4.6 and some other book reader apps. The latest iSilo still seems to be as slow on Pocket PC/WinCE machines as before (no benchmark results yet; 3.35b3 seems to be as slow as the previous versions, which I have already benchmarked). The latest release of uBook was able to read and search a 7M HTML file pretty fast without crashing (good news; previous versions were pretty unreliable and crashed often). I've checked build 407 for Palm OS5 again and it's still as slow as the previous builds.


MPR 4.6 still excels on Pocket PC's/Palm-size/Handheld PC's, when compared to both uBook and iSilo. (Compare, e.g., its pure HTML loading and processing times to those of uBook and check out my older )


WindowsCE benchmarks on Casio E-15:



Mobi Reader 4.6 build 406

iSilo 3.35b3

uBook 0.8b


Aesop's Fables; searching for 'drought'


Not tested yet (the HPC 3.0 version DOES run on Palm-sized PC's!)

Not tested yet


DVDRHelp.prc; 'maestro~'

3m 14s






Palm OS5 benchmarks on Palm Zire 71:


Mobi Reader 4.6 build 407

iSilo 3.35b3

NO uBook for Palm


Aesop's Fables; searching for 'drought'

14s from both main memory and SD

Under 1s



DVDRHelp.prc; 'maestro~'

20m 30s






iPAQ 3660 benchmarks:


Mobi Reader 4.6 build 406

iSilo 3.35b3

uBook 0.8b


DVDRHelp.prc; 'maestro~'


Not tested yet



DVDRHelp HTML version, compressed version; 'maestro~'

No compression support

Not applicable

32s (after fully loading the document; that took 2:30 for loading + 1:15 for post-processing (which meant a non-responding PPC))


DVDRHelp HTML version, uncompressed version; 'maestro~'

29 sec; almost no loading time; the entire doc was available right after loading

Not applicable







Positioning to one of the last pages made iSilo (and the entire PPC) respond sluggishly. The percentage counter is completely off (it behaved strangely during the search too: it always reset itself to 0% when it reached 20-25%; it must be an overflow problem) and the vertical scrollbar is useless. See picture below:




Second: image handling. Although MPR only supports 8-bit images (the next version will support 16-bit ones too), its support for inline pics is far superior to iSilo's. In the latter, images are always displayed at their full size. You can't use the stylus to drag the picture, unlike with MPR. The only way to conveniently scroll the picture (if you don't want to use the scroll bars) is overriding the default behaviour (one line left/right/up/down) of the hardware cursor block to scroll one page at a time. There is no dedicated Image mode, unlike in MPR.


Furthermore, since iSilo supports not only 16-bit modes but also 8-bit ones, 8-bit image mode in iSilo shouldn't be compared to MPR's default 8-bit mode. The latter doesn't involve any colour transformation; the former does. Let's take a look at some sample pictures of the two modes of iSilo. The original can be found at . Note that the original big picture is a GIF (that is, an 8-bit file). Still, in 8-bit mode, iSilo converts its colours.


iSilo, 8 bit mode:




iSilo, 16 bit mode:


If you make iSilo always convert GIF's into their 16-bit equivalents, it effectively doubles their file size (assuming you're only using GIF input)!
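The doubling claim follows directly from the pixel sizes: a 16-bit pixel takes two bytes where an 8-bit (palette) pixel takes one. A back-of-the-envelope sketch of the raw (uncompressed) data sizes; actual file sizes also depend on compression, so treat this as an approximation:

```python
def raw_image_bytes(width, height, bits_per_pixel):
    # raw pixel data size before any compression is applied
    return width * height * bits_per_pixel // 8

# A 240x320 (QVGA screen-sized) image as an example:
eight_bit   = raw_image_bytes(240, 320, 8)   # 76800 bytes
sixteen_bit = raw_image_bytes(240, 320, 16)  # 153600 bytes, exactly double
```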


The iSilo reader has a full screen mode, but you can't set the margins. Pages in this mode are much harder to read than in MPR's full screen mode.


There is one area where the iSilo reader (strictly on the Pocket PC OS) is better than MPR: memory utilization. iSilo always takes about 600k of program memory, independent of the number of visited pages. MPR, however, has memory leaks. If you scroll through an extremely large archive (for example, the 24-Mbyte-long Brighthand archive), MPR can eat up all your free program memory. For example, scrolling straight through the first 10000 (of the 14000) pages of the Brighthand archive, MPR allocated over 20 Mbytes of memory and only deallocated it when I switched to another document. This is clearly a bug in MPR. (It's highly unlikely, though, that it will ever cause problems in most cases: reading through thousands of pages in one session with only 2-3 Mbytes of free program memory is a rare scenario.)


This is not a serious bug, anyway, because it doesn't make the PDA freeze. When the central memory (both program and data memory) is totally filled up, MPR frees up all the memory it has allocated, so it never runs into problems, even with extremely large uncompressed files (even around the de jure (and, fortunately, also the de facto) highest threshold, 240 Mbytes).


Note that neither iSiloX nor SiteScooper can act as an HTTP proxy, unlike MBHelper. This means you can only use MBHelper with MWC or a stand-alone browser. However, as my XDOCGenerator can automatically accept and process the output of SiteScooper, you may choose to use SiteScooper instead of MWC.


Also note that iSiloX can only create iSilo-compatible files, unlike the jack-of-all-trades SiteScooper. This is why, despite some of its nice features, my offline tools don't support working together with iSiloX. However, I do support SiteScooper - read the comments of Version 1.7.8. This means you don't have to use MWC + MBHelper to download HTML; you can also use SiteScooper. Furthermore, you can combine SiteScooper and MBHelper (as an HTTP proxy) so that the features SiteScooper is missing (e.g., merging TOC pages, to name the most important) will still be delivered by MBHelper.
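Chaining a downloader through MBHelper only requires telling the HTTP client to use a local proxy. A sketch with Python's urllib - the 127.0.0.1:8080 address is an assumption; use whatever port your MBHelper instance actually listens on:

```python
import urllib.request

# MBHelper's listen address is assumed to be 127.0.0.1:8080 here.
proxy = urllib.request.ProxyHandler({"http": "http://127.0.0.1:8080"})
opener = urllib.request.build_opener(proxy)

# Every request made through `opener` now passes through the proxy,
# so its filters (page merging, URL rewriting, ...) apply transparently:
# html = opener.open("http://example.com/forum").read()
```

Tools like SiteScooper take the proxy from their own config or the standard proxy environment variables; the mechanism is the same either way.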


In the next section, I compare Mobipocket to the other e-News players. Don't forget that, due to the size of this document, I sometimes speak only of MWC when iSiloX can also be meant. For example, iSiloX can't do page merging by itself either, and doesn't support POST-based logins or on-the-fly URL changing. By using the non-offline (that is, not necessarily Mobipocket-related) tools in the suite, you can also work together with iSiloX (or, for that matter, SiteScooper).


Palm OS5 Benchmark Results


Aesop's Fables; 'drought': MPR 4.6b407: 14 sec; iSilo 3.35b3: under 1 sec



I've played with both the latest (407) build of Mobi Reader 4.6 and the latest (3.35) beta build of iSilo for Palm OS5. I think the Palm OS version of iSilo is much better than the PPC version (if you're interested in a direct comparison with tons of benchmarks of MPR 4.5/4.6 and iSilo 3.3 on Pocket PC, visit


I've found the Mobipocket Reader for Palm OS5 much inferior to the Pocket PC version of the same program (on Pocket PC, Mobipocket Reader is THE best book reader app when it comes to speed, openness and capabilities). It doesn't even use native ARM code - it runs Motorola code under PACE. This also means very bad speed (about 30 times worse search and picture rendering speed than on a StrongARM Pocket PC). Unfortunately, the Palm OS version also lacks certain extra imaging capabilities present in the Pocket PC/WinCE version (magnification).


Creating a bookmark in Mobi Reader is 3 taps (Go To/Add bookmark/Add); accessing the list of bookmarks is 2 taps (book name in the upper left corner / Annotations).


Mobi Reader doesn't install small fonts on Palm OS5 by default; you have to install them yourself if you don't want to use the 4 system fonts (all of them rather large). iSilo, on the contrary, contains small fonts too.


All in all, I find iSilo much faster on Palm OS5 than the Mobi Reader. (Strangely, on WindowsCE-based machines, the situation is exactly the opposite. Perhaps this is because the Palm OS5 version of iSilo is optimized for ARM, unlike the Mobi Reader.)


However, if you don't need to search a lot and/or your documents don't contain many pictures, the recent Palm OS 5 version of Mobi Reader can be OK. (I really hope the Mobi staff will spruce up the Palm OS5 version so that it'll be THE book reader on Palms too.)


Other readers without an official Web extractor utility


There're several standalone readers that don't have a specific tool to download and transfer HTML content. However, most of them are capable of displaying HTML ZIP archives. This means they'll work just great with archives created with MWC, iSiloX or SiteScooper.


The following standalone HTML-capable readers are worth mentioning:


HTML-capable readers

HTML is pretty inferior to Palm DOC/MM XDOC when it comes to PDA's. The reason is utterly simple: an HTML file compressed into a single ZIP file needs to be read into main memory at once (if it's stored in PDB format, however, the situation can be very different; see the discussion of uBook below). This means an HTML-capable reader will never be as fast at loading/displaying large HTML's as, for example, Mobipocket Reader. Furthermore, an HTML-based reader will never be able to load documents larger than the (free) central memory of the PDA. Actually, the situation is made even worse by the fact that Windows CE limits files in the object store to 16 Mbytes - this, too, can be a huge disadvantage of HTML readers compared to MPR.
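The contrast can be illustrated with record-based storage: Palm-style formats compress the text in fixed-size records, so a reader only ever decompresses the record it is displaying instead of inflating the whole document. A sketch of the idea (real Palm DOC uses 4096-byte records with its own LZ77 variant; zlib stands in for it here):

```python
import zlib

def compress_in_records(text, record_size=4096):
    """Split the text into fixed-size records and compress each separately,
    Palm DOC-style (zlib is a stand-in for the real compressor)."""
    data = text.encode("utf-8")
    return [zlib.compress(data[i:i + record_size])
            for i in range(0, len(data), record_size)]

def read_record(records, index):
    # Only one record is decompressed; memory use stays bounded
    # no matter how large the whole document is.
    return zlib.decompress(records[index]).decode("utf-8")
```

A ZIP-of-HTML reader has no such record boundaries: it must inflate the entire stream before it can render anything, which is exactly the memory ceiling described above.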



uBook (http://www.gowerpoint.com/)


As of version 0.7c (released on 20th May), it's probably the prettiest e-book reader. It's highly customizable and well-documented; you have, for example, four options for the ClearType setting. Its screen output is just wonderful.


It's much slower at following links and at the initial repagination. However, its scrolling speed is tolerable and visually much better than that of iSilo.


Unfortunately, it still has no separate image mode. This means that although you can zoom in on an image, it won't pop up in another window (unlike in MPR).


Unfortunately, it handles large documents pretty badly. It decompresses and repaginates (in the background) the entire HTML file at start (it takes 1:53 to decompress and repaginate the Brighthand file).


Jumping to hyperlinks is pretty bad. When jumping over 9000 pages, nothing happens (tested; I waited for 10+ minutes). Simple 'Go to' doesn't work with moderately large files either.


It tends to freeze even with HTML files under the object store limitation (16M; tested with a 13M HTML file on a machine with some 60 Mbytes free in central memory). Much smaller files are rendered OK.


It can handle HTML files in a PDB archive if you override the default text parser setting for PDB files. However, any HTML file that contains hyperlinks will cause program termination. Therefore, the usability of this otherwise nice feature is pretty limited. (Think of it: you could even display Mobipocket PRC files if the standard HTML attributes are kept in the anchor tags - that is, if you create the PRC file with an independent tool and not MWC/MB Publisher.)


Despite its pretty screen output, the reader isn't capable of navigating really large documents.


Addition on 9th Aug: the latest version, 0.8c, is even able to read Mobipocket PRC files. It is, however, inferior to the Mobi reader because of its speed, especially when displaying large pictures in Mobi PRC documents.


Team One's Reader v3.0 (http://www.teamonesoft.com/en/Products.htm)

Frankly, I don't know why this application received such good reviews from Pocket PC Life (see http://www.pocketpclife.co.uk/featureddetails.asp?article=187). It's a highly buggy, unreliable and ugly book viewer. Its PDF capabilities are just laughable: it managed to open very few PDF's.



Tapping the ZIP name, TOR created a new 'document' called entry, which contained a lot of useless links. I've tried opening entry. Nothing happened for 2 minutes and 45 seconds; then the document content screen appeared - displaying nothing. Upon consecutive tests, the same happened. Sometimes not even the document content screen appeared; the machine just ground to a halt.


Furthermore, it litters \Windows\TeamOneTemp and doesn't clear this directory upon restarting the application. With the directory named so obscurely, few TOR users will ever find it.


All in all, this application should be avoided.



Non-HTML-capable readers

Microsoft Reader 2.0

Probably the worst book reader out there (not counting Team One's Reader v3.0). It's very slow, has memory leaks (http://www.infosyncworld.com/news/n/1757.html - note that this problem is much more severe than Mobipocket Reader's because, according to the reports, even a small, 10-kByte-long e-book can make Microsoft Reader allocate tens of megabytes (!) of memory!) and the LIT creation utilities just don't like non-well-formed HTML's.


I've tested both ReaderWorks Publisher PRO v2.0 (http://www.overdrive.com/readerworks/software/publisher.asp) and Microsoft's own Word plug-in (http://www.microsoft.com/downloads/details.aspx?FamilyID=199be874-1f5e-4fb7-8fe0-6bca50c7d356&DisplayLang=en). The former didn't want to compile the Brighthand HTML into an e-Book because of missing files/invalid URL's. I wasn't able to test the latter because Office XP SP1's Word just froze upon importing the HTML.


Therefore, Microsoft Reader is useless for e-News purposes because downloaded HTML's will never be well-formed.


Haali Reader (http://haali.cs.msu.ru/pocketpc/)

It doesn't support HTML, although the reader itself seems to be pretty good. As there're no web downloaders with FictionBook support (actually, none of them generates XML, let alone XML with XLink), it can't really be used for reading e-News.


TomeRaider (http://www.tomeraider.com/)

A text-based reader (this means no pictures at all). Its input is HTML-like (bold, italic, lists, but no in-page links), but it can't convert real HTML.


Pretty fast (even at searching) and, as opposed to any compressed HTML-based viewer, doesn't take much memory (200kB at most, even with very large archives). Also supports real hyperlinks (unlike Palm DOC, which only supports lame and pretty useless bookmarks). Unfortunately, for some reason, hyperlinks are not supported on Palm OS (this is pretty strange as the top-notch readers, for example, iSilo and MPR, all support hyperlinks on Palms). See e.g. http://www.geek.com/hwswrev/psion/tomeraider/.


The blurb for the application


Unfortunately, the lack of inline images and the inability to convert real-life HTML into its own format make it pretty useless for reading multimedia e-News.

Palm Reader (http://www.palmdigitalmedia.com/S=2f7c86ffed7244ce36a1cf2ac89a995cPrkrDcCoAAsAACOeM0c3046011/product/reader/browse/free)


The standard Palm DOC format reader (the format is also called PDB, but it should be emphasized that PDB is a generic Palm Database format and, therefore, does not always contain Palm DOC-formatted contents). As the Palm DOC format is inferior to all other major e-Book formats (lack of easily usable hyperlinks - they're emulated with bookmarks; lack of picture support etc.), using any Palm DOC format reader is highly discouraged.

Why did I choose Mobipocket? Why do I recommend it?


The answer becomes pretty clear if you play around with all three e-News services a bit.


First and foremost, both AvantGo and Mazingo are plain e-News services, using only PIE as the display. (Sure, Mazingo has its own browser, but it's no better than PIE - its only plus is ClearType support.) If you have ever seen PIE and compared it to any e-book reader (not necessarily Mobipocket's), you know that even Microsoft Reader delivers a much better reading experience and a much better toolkit (annotations, page numbers, bookmarks, highlighting, built-in dictionary support) than PIE. And Microsoft Reader's speed and capabilities are definitely inferior to those of the other e-Book readers out there (mainly Mobipocket and the free uBook (http://www.gowerpoint.com/)), which offer, for example, far superior image modes.


What's the point in an image mode? The answer is simple: advanced e-book reader applications show only a thumbnail of a picture in the text, but if you step into image mode by tapping the thumbnail, you can navigate the picture at full resolution. This is not that important with casual daily news pics, but it is essential with tech magazines and comics.


Speaking of comics, Mazingo also offers (or at least claims to offer, because the http://www.mazingo.net/pc/list_subcat.php?category_id=15 Comics category currently doesn't have anything) comics. The Mazingo staff recommends using full-size mode for 'reading' comics. Did you ever try to navigate a large picture in PIE with the help of the scrollbars? Pretty awkward, isn't it? PIE, unlike e-book readers (or PDF readers, for that matter), doesn't offer any dynamic pic navigation capability. You either read the document 'fit to screen' with pic thumbnails, or without fitting to screen so that you can see the pictures in full size. Unfortunately, you always have to switch between the two modes if you want to read the text between pictures. And, as you may have already guessed, this takes a lot of time, in both PIE and Mazingo's reader client, even with moderately sized pages (say, 20-30 PDA-sized pages and a few in-line pics). With Mobipocket Reader, the first page appears at once and you don't have to wait.


Although Mazingo's PDA installation package offers the trial version of PicturePerfect, which is tolerably fast (although definitely slower than Mobipocket's dedicated Image mode), you can't switch to it by tapping on a picture in either PIE or Mazingo's own browser.


There is only one area where AvantGo and Mazingo are clearly better than Mobipocket's e-News capability, even when using my toolkits: both can synchronize their e-News without using any PC with a pre-installed MWC - that is, even through GPRS, Wi-Fi etc.


Mobipocket's e-News/e-Book format


It should be stressed again and again that e-News are no different from traditional 'e-books'. That is, with a little effort, you can even emulate what the e-news creator/synchronizer applications do by using a document creator suite for the given archive type. But that can't be batched and made automatic.


It's exactly at this where applications written for synchronizing e-news shine. They don't require manual Web page downloading, converting into an e-book format (or, just ZIPping them up) and sending them to the PDA. The three e-news creator/synchronizer apps with the largest market share, AvantGo, Mazingo and Mobipocket, all free the PDA/mobile phone user from the chores of doing anything manually.


As has already been stated, the downloading/transformation process is highly automated. This means you don't have to start any application on the PC to download the latest news from the Net. Nor do you have to copy anything by hand to your PDA. Just imagine what would happen if you tried to do the same with, say, Microsoft Reader. You would have to download the page and all its linked pages to your local PC, and then convert the HTML into LIT by hand, using either the Read In plug-in or ReaderWorks Publisher.


I only recommend Mobipocket's e-news synchronizer tool, Mobipocket Web Companion (MWC), and their PRC format, because of the following:

  • MWC's output, unlike that of AvantGo, doesn't use a proprietary format that can only be read with two Pocket PC-based browsers (Microsoft's Pocket Internet Explorer and Access NetFront; not even the two other browsers, Thunderhawk and ftxPBrowser (http://www.af.wakwak.com/~ftoshi/pocket/index_e.html), can read it). (Incidentally, Mazingo's file format is plain HTML. What is more, any transfer between the PDA and either the host PC or mazingo.com is compressed, so there is no bandwidth loss compared to AvantGo.)

    As has already been pointed out, Mazingo's file format is plain HTML, so its files can be read by anything. Still, as Pocket PC-based HTML browsers are pretty immature (no image mode, and very slow at loading/redisplaying longer documents because of the lack of the concept of pages), it's still not a preferable format to read documents in.
  • Mobipocket's reader,  Mobipocket Reader (MPR), is a pleasure to use, especially compared to Pocket Internet Explorer (PIE) or some of the competing e-news/e-book reader applications (mainly Microsoft Reader, which is far worse than MPR in almost every respect). Its Image Mode is far superior to that of PIE or Microsoft Reader. For example, the latter can't even switch to full-resolution mode.
  • Furthermore, MWC is very simple to configure. Actually, you can write a configuration file on your own for MWC telling it what to download, where from and what to do with the content.
  • With MWC, you can access any kind of Web content, not just stripped-down or downright useless, mostly pictureless content.
  • Web content providers (newspaper operators, forum moderators etc) don't have to pay big bucks to be accessible through avantgo.com. Actually, they don't even have to be aware that there're people downloading their pages via MWC and reading them on their PDA/phone with MPR. They don't have to provide specially formatted pages either, because MWC's layout conversion capabilities are pretty advanced, especially with the help of my toolkit and HTML pre-filtering using third-party tools.

Incidentally, Mazingo is free for content providers (and even pays, in the 'Mazingo Bounty Program'); see http://www.mazingo.net/pc/publishing.htm. Mazingo offers some other pretty cool stuff too (e.g. shopping carts and automatically downloading specific files into any (!!) target directory), as described at http://www.mazingo.net/pc/publishing_examples.htm.

  • You can only subscribe to 20 AvantGo channels for free at the same time. With MWC/MPR, you can subscribe to any number of channels at the same time.
  • AvantGo only lets its clients receive content from its centralized server.


Mazingo is a bit better in this respect. It first connects to http://www.mazingo.net/client/get_account_440.php to get the URL's to download the individual pages from. Then, it visits all the URL's and downloads all of them. Unfortunately, much as it sends out cookies (unlike MWC), it never sends out If-Modified-Since request headers to reduce network traffic. This is certainly a big problem, especially because, as you may already have guessed, the HTTP transmission is not compressed.
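To make the If-Modified-Since point concrete, here is a minimal sketch (all names hypothetical, not Mazingo's actual code) of the conditional GET a well-behaved client would perform: it remembers when a page was last fetched, sends that timestamp, and keeps the cached copy when the server answers 304 Not Modified.

```python
from email.utils import formatdate

def build_request_headers(last_fetch_epoch=None):
    """Build headers for a conditional GET; hypothetical helper name."""
    headers = {"User-Agent": "enews-client-sketch"}
    if last_fetch_epoch is not None:
        # HTTP dates must be in RFC 1123 format, GMT
        headers["If-Modified-Since"] = formatdate(last_fetch_epoch, usegmt=True)
    return headers

def handle_response(status, body, cached_body):
    # 304 means "use what you already have"; anything else replaces the cache
    return cached_body if status == 304 else body
```

An unchanged page then costs only a round trip of headers instead of the whole uncompressed body.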


MWC is certainly the best in this respect because it doesn't have to turn to a centralized account management service: all information is kept locally. This also means you don't have to 'hack' into it - unlike with Mazingo, where you have to capture its access to http://www.mazingo.net/client/get_account_440.php to return a set of URL's of your own.


The lack of a centralized account approach in MWC is only problematic when you want to synchronize your e-News from several machines. You have to install MWC on all of them, and you have to subscribe to all the papers you want to receive on all of these PC's. This can be made easier by copying the central MWC config file (\Program Files\Mobipocket.com\MobiPocket Reader\config.xml) and the contents of \Program Files\Mobipocket.com\MobiPocket Reader\data\ from one PC to another after subscription but before synchronizing anything. But, still, it's not as elegant a solution as centralized account management, especially since the latter also lets you synchronize e-news from your PDA (both AvantGo and Mazingo support this; my tools don't, but this may change if I get permission to write a PRC creator server accessible from a PDA).
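The copy step described above can be scripted; this is only a sketch (the function name is mine, and the MWC paths are the ones quoted in the text - adjust them for your installation):

```python
import shutil
from pathlib import Path

def mirror_mwc_config(src_root, dst_root):
    """Copy MWC's config.xml and its data directory to another location.

    Run this after subscribing on the source PC but before the first
    synchronization on the target PC, as described in the text.
    """
    src, dst = Path(src_root), Path(dst_root)
    dst.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src / "config.xml", dst / "config.xml")
    # dirs_exist_ok lets repeated mirroring overwrite an earlier copy
    shutil.copytree(src / "data", dst / "data", dirs_exist_ok=True)

# Example (paths from the text):
# mirror_mwc_config(r"C:\Program Files\Mobipocket.com\MobiPocket Reader",
#                   r"Z:\mwc-backup")
```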


One thing should be mentioned in favour of AvantGo, however. MWC downloads plain HTML content from the web, while AvantGo only gets already-compressed and trimmed files. This means using MWC instead of AvantGo can take up much more bandwidth. Over a GPRS or analogue modem connection, AvantGo may be a better choice.


Incidentally, when compared to Microsoft Reader (or all the other e-book readers I've tested, even the very promising uBook), MPR is the killer app. If you refuse to give MPR and MWC a try because of your bad experience with Microsoft Reader, just read on:

  • MPR's pagination, linking and scrolling are blazingly fast compared to all the other e-book readers
  • no memory leaks: Microsoft Reader really slows down with archives over 1000 pages and/or with several pictures. Compare this to MPR's ability to navigate a 150 000-page-long (!) archive as fast as a 20-page-long one.
  • it has full screen mode
  • dedicated image mode: Microsoft Reader can only display a smaller-than-the-PDA-screen image, while MPR also has a dedicated image mode where you can examine the stored picture in full resolution
  • annotations, settings etc. are stored in an easily saveable file (compare this to Microsoft Reader: you can't save your annotations in it so they won't survive a hard reset/switching to another PDA)
  • Palm DOC (and, therefore, MWC) is able to store source (HTML) text documents (without images) up to 256 Mbytes (268 431 360 bytes = 65535 records * 4096 bytes) in size. This is equivalent to some 600 000 pages on a Pocket PC, assuming default settings in MPR and non-full screen operation. Not bad, eh?
  • MPR's e-news/e-book directory reading is very fast. It never takes more than 1-2 seconds on a StrongARM to read all the book/e-News names in a directory. Compare this to, for example, Microsoft Reader's booting speed. When MR boots, it reads the titles of all the books. This is very time-consuming, even with a moderate number of books in your My Documents (or its subdirectories). With more than 15-20 books, you may have to wait even 1-2 minutes, especially on Pocket PC's with older processors. Jornada 52x and 54x (SH3 processor, PPC2k operating system) users always complain about Microsoft Reader's very slow booting.
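The Palm DOC capacity mentioned in the list above (65535 records of 4096 bytes each) can be checked with two lines of arithmetic; note that 268 431 360 bytes is just under 256 MB:

```python
# Palm DOC capacity: 65535 records, each at most 4096 bytes of source text
MAX_RECORDS = 65535
RECORD_SIZE = 4096

max_bytes = MAX_RECORDS * RECORD_SIZE
print(max_bytes)                  # 268431360
print(max_bytes / (1024 * 1024))  # just under 256 (MB)
```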


Its disadvantages (lack of true-color JPEG, ClearType and ZIP'ed HTML support) pale in the light of the above advantages.


Problems with Mobipocket Web Companion


The application that actually downloads and converts Web content into the format MPR can digest, Mobipocket Web Companion (MWC), on the other hand, could be much better. It has a very simple configuration format that doesn't even allow very simple actions to be run. Actually, it's so dumb it doesn't even handle cookies. However, as it can access proxy servers, it's a great idea to implement much more sophisticated filtering, page merging, cookie handling etc. functionality in a proxy server. This keeps the synchronization process automatic because it's still MWC that generates and synchronizes the PRC files, and not us. We just make sure MWC gets the correct, already-transformed input, resend cookies, merge pages etc. in the background. MWC doesn't even know about this because it only sees a proxy server it connects to and doesn't know it's actually a content filtering and generator server.
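As an illustration only (this is not the author's actual HTTPProxy, and `transform()` is a stand-in for whatever filtering, merging or cookie logic is needed), a transforming proxy of the kind described above can be sketched with Python's standard library:

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def transform(body: bytes) -> bytes:
    # Stand-in for the real page transformation; here we just strip a tag.
    return body.replace(b"<blink>", b"").replace(b"</blink>", b"")

class FilteringProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # When a proxy is configured, the client puts the absolute URL in
        # the request line, so self.path is the full target URL.
        with urllib.request.urlopen(self.path) as upstream:
            body = transform(upstream.read())
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), FilteringProxy).serve_forever()
```

The client (MWC or a browser) is simply pointed at 127.0.0.1:8080 as its proxy and never learns that the pages were rewritten on the way through.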


Another advantage of the proxy-based approach is the reusability of the proxy code. As you may have guessed, the proxy can work not only with MWC but, for that matter, with any e-news content generator or even web browsers. Actually, users who write content filtering configuration files for the proxy should test them with a browser first and only after that start the e-news synchronization tests. (The latter is far slower than a quick test with a browser.)


What's so cool about a transforming middle layer between your browser and a Web site? The answer is very simple. As has already been stated, MWC lacks configuration options. It only knows really basic rules when processing pages, which may not be enough. For example, it doesn't know regular expressions, so following links on some pages may be very hard. And the list goes on; the regex problem is just one of the most prevalent shortcomings of MWC.
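To make the regex complaint concrete: a prefiltering step can select links by pattern before MWC ever sees the page, something MWC's substring-only matching cannot express. A hypothetical sketch (the URL scheme is invented for illustration):

```python
import re

def article_links(html):
    """Return only links that look like articles, e.g. .../article-1234.html."""
    return re.findall(r'href="([^"]*article-\d+\.html)"', html)

page = '<a href="/a/article-17.html">x</a> <a href="/ads/banner.html">y</a>'
print(article_links(page))   # ['/a/article-17.html']
```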


As I find reading papers and even forums on my PDA far superior and more pleasant than reading them on either paper or a traditional desktop PC, I've decided to fix MWC's problems with both proxy-based prefiltering and, where that wasn't at all possible, post-filtering.


Unfortunately, there is an additional problem that I couldn't work around so far: neither MWC nor MobiPocket's 'official' PRC builder, MobiPocket Publisher, is able to include all records in a PRC. The actual number of articles MWC/MobiPocket Publisher is able to include is between 1000 and 3000. It seems to be a bug in Mobipocket's current PRC creator tools. If you use my XDOCGenerator utility, however, this won't cause problems, as it includes all articles.


An example of this can be found here (it's just one example; in real life, you will find several cases of this). MWC (and MobiPocket Publisher) only included the first 16% of all the articles (see MWC-generated-erroreneous-Microsoft_Web-Based_Newsgroups.prc). Based on the source HTML, Microsoft_Web-Based_Newsgroups.html, XDOCGenerator created the file XDOCGenerator-Generated Microsoft Newsgroups archive.prc, which has no such problem: it contains all of the over 6000 articles.

How does MWC work?


Before digging into the usage and configuration options of my applications, let's dive into the configuration file formats of MWC.


Incidentally, there're two other tutorials on this subject, which may also be worth checking out. Their addresses are http://www.geocities.com/philpw99/explain.htm and  http://mitglied.lycos.de/martinstaubach/.  Unfortunately, the latter only shows examples of the better-to-be-avoided “download-site" type of downloading (see below).


MWC uses two types of configuration files: .enews and .xsl files. The latter are based on XSL (for a great tutorial on them, check out http://www.w3schools.com/xsl/xsl_intro.asp; however, it's not a mandatory subject as we will only use parts of it), but have their own additions (date inclusion and anchor handling).


.enews files


.enews files tell MWC where to download the pages from, how the pages should be traversed, what links should be followed, what content should be considered HTML text that should be saved etc. For example, let's have a look at a uBB traversing .enews configuration file (you can find it in the forums\uBB\tech\CEWindows.net directory):



   <enews xsl-rendering="file:///c:/enews/CEWindows.NET Forums - Compaq iPAQ Forum.xsl"> // 1

      <title >CEWindows.NET Forums - Compaq iPAQ Forum</title> // 2

      <enewsitem selected="yes">  // 3

         <title >all</title>  // 4





<SECTION action="section" > // 5

   <url action="follow" >http://discuss.cewindows.net/cgi-bin/ubb/forumdisplay.cgi?action=topics&number=35&DaysPrune=1000</url> // 6

      <TOPICS action="iterate"> // 7

      <filter action="extract-url follow" >HTML</filter> // 8

      <TITLE action="extract" html-filter="no-tags"> // 9

           <begin><TITLE></begin>  // 10

           <end></TITLE></end>  // 11



         <MESSAGES action="iterate"> // 12

          <AUTHOR action="extract" html-filter="no-tags">  // 13

             <begin><FONT SIZE="2" face="Verdana, Arial"><B></begin>



          <DATE action="extract" html-filter="no-tags">

             <begin><FONT SIZE="1" color="#000000" face="Verdana, Arial">posted </begin>



          <BODY action="extract" html-filter="get-pics">


             <end></FONT><P align=right></end>






This .enews file tells MWC the following in the <webcompanion-config> tag:

1. the XSL formatter file (more on them later) belonging to this .enews config file is called CEWindows.NET Forums - Compaq iPAQ Forum.xsl and is located at c:\enews.

2. the title of the file is CEWindows.NET Forums - Compaq iPAQ Forum. This title will be used in naming the PRC file too.

3-4: there is one so-called 'section' in it. Its name doesn't really matter as it will only be shown in MWC and not in the final PRC document. For now, I've called it 'all'.


Please note that this example shows the minimum number of parameters of the first section (the <webcompanion-config> tag). There are quite a few other attributes too, but they aren't mandatory. If, however, you remove any attribute/tag from the configuration entry above, it either won't work or will be pretty hard to use in MWC (if you, for example, omit the <title>…</title> pair).


The <section> tag is of much more importance. Actually, it doesn't have to be called <section> at all. Any name will do; it's the action attribute that counts. This means you can use any tag names in the second part of your .enews file (except for the <begin> and <end> tag pairs used for HTML extraction; the first section, <webcompanion-config>, has restricted tag names too). However, try to create tag names that speak for themselves and are as understandable as possible.


First of all, all sections defined in the first section should have a corresponding “section" in the lower part. The names don't need to match (actually, there aren't any names here); it's just the ordering that counts.


The forum page we're speaking of consists of topics. Each topic, in turn, has several messages, each of which has a body, an author and a posting date, just to name their most important attributes. We will only extract the topic names so that we can create a Table of Contents (TOC) containing the topic titles and a clickable URL. Furthermore, we also extract all messages, along with their authors and the posting date/times.


It should be clear from the above that we'll be using a doubly nested loop. The outer one iterates over the TOC (the forum) page and, after extracting the next topic page, starts the inner loop on that page. The inner loop goes over the topic page and extracts all messages, along with their author and date.


Fortunately, telling MWC to iterate over a page and follow a given URL is pretty simple. Just take a look at


   <url action="follow" >http://discuss.cewindows.net/cgi-bin/ubb/forumdisplay.cgi?action=topics&number=35&DaysPrune=1000</url> // 6


This construction tells MWC to get the contents of the forum page (that is, the TOC) http://discuss.cewindows.net/cgi-bin/ubb/forumdisplay.cgi?action=topics&number=35&DaysPrune=1000. Note that its 'action' attribute has the value of 'follow'.


The next row, <TOPICS action="iterate">, tells MWC to actually start iterating through the page whose URL has been given in the tag containing the action=“follow" attribute. What this iteration really means is, at this point, not yet known to MWC. It can be both content extraction (as will be the case with the inner loop, the topic extraction) and URL following. Because we're still at the TOC page (in the outer loop), we do the latter, URL following - that is, we step over the pages the URL's point to. This is what the action attribute in


      <filter action="extract-url follow" >HTML</filter> // 8


says. Please also note that this tag, like the 'follow' action in step 6, also has textual information. In this case, it's not a full URL, but just a part of it, 'HTML'. HTML has been chosen in this case because the topic links on the TOC page all contain the string HTML (actually, it's a directory name), while other links (the links that should not be followed) don't.


Always try to find a part of the target URL that is unique to the pages you want to be visited. In some cases this isn't possible and unwanted links will also be followed (especially when only one server-side script serves all kinds of content, or when there is almost nothing that is really unique; check out the example of Helsingin Sanomat for the latter case). That's when you should consider using my HTTP proxy: it greatly helps remove unwanted content from the target PRC files by filtering out unwanted URL's.


Unfortunately, you can't concatenate the action attributes above to save writing. The configuration code above is as concise as possible.


Please note that “extract-url follow" is inside the outer “iterate" action. This means MWC will iterate over all the links that fulfill the criteria of having 'HTML' as part of them.


Under the 8th step, <filter action="extract-url follow" >HTML</filter>, we're already on the linked page (the topic page), not on the TOC (the forum page). The first thing to do, before starting the inner iteration, will be saving the topic's title. This must be done only once because

1. it's the same for all messages in the same topic

2. if we tried to access <title> after actually reading it, we wouldn't find anything - we would just get empty <title></title> contents when we tried to process the output. We could use the 'goto-top' action, but then we would get into an infinite loop. Actually, the action 'goto-top' should only be used with the action 'eraser'.


To extract the textual content of a given part of a page, use the action "extract". It also has a non-obligatory attribute, 'html-filter'. In this case, the "no-tags" value of the latter states that titles (or, for that matter, simple author names and dates, as will be seen below) shouldn't contain HTML markup code.


"extract" can be used with the <begin>…</begin> and <end>…</end> pairs. This isn't true of the other actions, unfortunately. Actually, this is why you can't tell MWC (without using my proxy) to follow only a subset of links - that is, links in a given section.


You should tell MWC the beginning and the ending of the HTML text that should be saved as “TITLE". We're quite fortunate here because UBB displays the topic title between the HTML <TITLE> and </TITLE> tags. With most newspapers, however, this isn't true because they don't set the right browser title. Their titles have to be found and, especially, defined for MWC by much subtler means.


Because reading the title of the page is just a one-step process, we can close the opening <TITLE> right after the </end> tag. Please note that the two (outer and seemingly inner) <TITLE> tags have nothing to do with each other. The outer could have been named anything else (assuming the .XSL is modified accordingly) and the inner tag, in practice, will be something else with most sites.


The title extraction declaration is as follows:


      <TITLE action="extract" html-filter="no-tags"> // 9

           <begin><TITLE></begin>  // 10

           <end></TITLE></end>  // 11



Now, remember what I said about topics. Yeah, they contain several messages. If we do want to handle them separately (and we DO want this, because MWC can't convert complex HTML tables), we have to define another iteration (action="iterate"), now for the messages themselves. (Note the tag name, which can, again, be anything!)


         <MESSAGES action="iterate"> // 12


What do messages contain? An author, a date and the message body. As MWC processes a HTML document strictly serially (remember my remark about extracting the topic title?), we have to find out not only the HTML markup (or any text) that surrounds the author, the date and the body, but also the order of the three pieces of information. This can very easily be done by consulting the HTML source of any topic on the site (but not of another site, as the markup will be different because of the different colours/character types). With uBB's topics, the author is defined first, then comes the date and finally the body.


You extract them exactly the same way as you did the topic title. Please note that the message body is extracted by using the attribute html-filter="get-pics". It states that the most basic HTML tags should be kept in the message, and also that the pictures it links to should be downloaded and included along with the extracted message.



           <AUTHOR action="extract" html-filter="no-tags">  // 13

             <begin><FONT SIZE="2" face="Verdana, Arial"><B></begin>



          <DATE action="extract" html-filter="no-tags">

             <begin><FONT SIZE="1" color="#000000" face="Verdana, Arial">posted </begin>



          <BODY action="extract" html-filter="get-pics">


             <end></FONT><P align=right></end>



The code generated based on this configuration will be executed until MWC hits the end of the page. Then, it stops the iteration (as it leaves </MESSAGES>) and goes on with the next URL in the forum page.
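The strictly serial begin/end scanning described above can be emulated in a few lines of Python. This is my own sketch, not MWC's actual code: the cursor only ever moves forward through the page, which is exactly why the order of AUTHOR/DATE/BODY in the .enews file must match the order in the HTML.

```python
def extract_serial(html, fields):
    """Walk the page once, pulling out each (name, begin, end) field in order."""
    pos, out = 0, {}
    for name, begin, end in fields:
        i = html.find(begin, pos)
        if i == -1:
            break                     # field not found: stop, as a serial parser would
        j = html.find(end, i + len(begin))
        if j == -1:
            break
        out[name] = html[i + len(begin):j]
        pos = j + len(end)            # the cursor never moves backwards
    return out

page = "<TITLE>Hello</TITLE> ... <B>alice</B> posted 01-02-2003"
print(extract_serial(page, [("TITLE", "<TITLE>", "</TITLE>"),
                            ("AUTHOR", "<B>", "</B>")]))
```

Swapping the two fields in the list would make AUTHOR extraction fail, mirroring what happens when a .enews file lists its extract actions in the wrong order.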


When it finishes processing the forum page too (it gets past over </TOPICS>), it finishes its job.




I haven't spoken of the action “download-site" so far. It's mostly for already-PDA formatted pages, which are quite rare when compared to the vast number of non-PDA-aware sites worth reading offline.


I certainly discourage using the action “download-site" because PDA-formatted pages are, as has already been stated, very rare and are mostly stripped-down versions of the full pages. Most PDA versions only contain daily news without pictures. With MPR's superfast paging and close-to-excellent imaging capabilities, this is really annoying, because both today's PDA's and MPR are capable of displaying even newspaper archives hundreds of megabytes in size without slowing down.


As has already been stated, it's only with  action="extract" that you can define from…to HTML positions (“start extraction from this tag/group of tags/text and do it until you get to that one"), not with anything else. This means that unwanted stuff (frames, link tables, ads etc) will always be downloaded along with all the linked pages and you just can't filter them out without using external tools (e.g. my filtering proxy).


There're some cases when it's clearly beneficial to use “download-site" instead of action="extract", even when downloading huge sites. The most prevalent situation is downloading a newspaper that has only one big page, with the TOC page only linking to parts of it. This can't be processed in action="extract" mode because MWC would download and effectively multiply the entire page every time it finds an URL with an in-page reference (that is, <some url>#reference). Remember that, much as MWC is clever enough not to store the same page with exactly the same URL twice, any additional GET parameters (read more about this in the 'killurl' section below) will make it download and store the entire page again. So, pages like http://www.hhrf.org/erdelyinaplo/frissc.htm, http://www.hhrf.org/frissujsag/frissc.htm etc. should be downloaded using “download-site".
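The duplicate-download problem comes from treating `page.htm#a` and `page.htm#b` as distinct URLs. A downloader that strips the fragment before consulting its cache avoids it; a sketch (my illustration, not MWC's behaviour - the text says MWC does not do this for such references):

```python
from urllib.parse import urldefrag

seen = set()

def should_download(url):
    """Return True only the first time a page (fragment ignored) is seen."""
    base, _fragment = urldefrag(url)   # two URLs differing only in #fragment
    if base in seen:                   # point at the same page
        return False
    seen.add(base)
    return True

print(should_download("http://example.com/frissc.htm#cikk1"))  # True
print(should_download("http://example.com/frissc.htm#cikk2"))  # False
```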


There're some other special cases when using “download-site" is preferable to the content extraction method. JupiterMedia's publications, for example, don't contain the title of the article textually, just as a JPG. This means you could only extract titles from the main TOC, not from the article itself. This is pretty complicated (you have to run “extract" on not only article pages but also TOC pages, and play around with the links a lot). In cases like this, it's preferable to use “download-site" because then you don't have to fuss around with links and title extraction. Just use my filtering proxy's 'returnonlyhtmlbetweenbeginend' to filter out unwanted content from pages.
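The idea behind a directive like 'returnonlyhtmlbetweenbeginend' is simple; the exact semantics of the author's proxy may differ, but a plausible reimplementation is:

```python
def only_between(html, begin, end):
    """Keep only the HTML between the first begin marker and the next end marker."""
    i = html.find(begin)
    j = html.find(end, i + len(begin)) if i != -1 else -1
    if j == -1:
        return html        # markers missing: pass the page through unchanged
    return html[i + len(begin):j]

page = "<html>junk<!--start-->the article<!--stop-->more junk</html>"
print(only_between(page, "<!--start-->", "<!--stop-->"))   # the article
```

Applied in the proxy, this strips frames, link tables and ads before MWC ever stores the page.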


Please also note that, unlike action="extract", “download-site" uses its own XML tags. This means its XSL files have to use these tag names. Check out the first pass' XML output (look for the page structures) and some XSL files to see how it works.


It should also be pointed out that one of my .enews generator utilities, DownloadSiteEnewsGenerator, supports the automatic (and very easy) generation of “download-site"-type .enews and the corresponding .xsl files. Check out both its documentation and the “download-site"-type .enews files I've created. The latter all have their DownloadSiteEnewsGenerator-compatible, simplified input file.


See the Picture utilities section for more thorough information on the .enews and .xsl file format used for “download-site".


.XSL files


Now, the XML files are ready. These XML files are produced by the first part of the newspaper synchronization process. Feel free to check them out in MWC's home directory. You'll see that they are indeed XML files, because the tag names we've defined above are used to denote the different sections of the text. For example, if the forum contained two topics, then there're two <topics>…</topics> sections in the XML file. Please also note that the messages belonging to a topic are strictly inside their <topics>…</topics> tags and nowhere else.


Also note that the XML file doesn't contain links - they'll be added later, by a very convenient proprietary XSL addition of MWC.


These XML files must be processed, though, to create HTML files out of them. This is not complicated either. The following is the CEWindows.NET Forums - Compaq iPAQ Forum.xsl file. Remember? We told MWC in the 'enews' tag what its name is and where it is (see the xsl-rendering attribute).


Again, it's a minimalist XSL file. As you can see, it's not a valid one (an XSL parser won't accept it), but that doesn't matter because MWC has its own parser. Note that only the <body> and </body> HTML tags are mandatory in it (unless you want an empty PRC as output); the others (e.g. <h1>…</h1>) are not. Also note that you may have to play around with the HTML inside to get what you want. For example, MWC can't even render titles defined in <h1><center>…</center></h1> tags correctly. More importantly, tags like <BLOCKQUOTE> may cause serious problems because MWC, in some cases, just doesn't find the closing </BLOCKQUOTE>. This is a very serious problem with some Version 6.0 uBB forums (e.g. that of hwsw.hu).


The contents of CEWindows.NET Forums - Compaq iPAQ Forum.xsl is as follows:



<body> // 1.

<h1>CEWindows.NET Forums - Compaq iPAQ Forum</h1> // 2.

<xsl:for-each select="/section/topics"> // 3.

  <xsl:ahref><xsl:value-of select="title"/></xsl:ahref><BR> // 4.



<xsl:for-each select="/section/topics"> // 5.



<h1><xsl:value-of select="title"/></h1>


<xsl:for-each select="messages"> // 6.

<br><br><i>Posted by <xsl:value-of select="author"/> on <xsl:value-of select="date"/>:</i> // 7.


       <xsl:value-of select="body"/>







As has already been stated, MWC creates XML files in the first pass. Each section has its own XML file. These files (or, if there is only one section, this file) contain all the data MWC has extracted by following the rules in the .enews file. As has also been emphasized, this XML file is a valid one in that its structure strictly follows that of the .enews file: the date/author/body of a message can only be placed in a <messages> tag, and messages (denoted by <messages> tags) can only be placed in <topics> tags.


The order in which these appear in the XML file is strictly serial. The XML file is created according to the 'first come, first served' rule. This means that if we convert the XML file to a HTML file strictly serially, we can keep the order of all topics and, what is more, of all messages.
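What the XSL pass does can be emulated with a strictly serial walk over such an XML file. This sketch uses Python's standard ElementTree and the tag names from the example above; it is an illustration of the 'first come, first served' rule, not MWC's actual parser:

```python
import xml.etree.ElementTree as ET

XML = """<section>
  <topics><title>Topic A</title>
    <messages><author>alice</author><date>01-02</date><body>hi</body></messages>
  </topics>
</section>"""

def to_html(xml_text):
    """Emit HTML in document order: output order equals XML order."""
    out = []
    for topic in ET.fromstring(xml_text).iter("topics"):
        out.append("<h1>%s</h1>" % topic.findtext("title"))
        for msg in topic.iter("messages"):
            out.append("<i>Posted by %s on %s:</i> %s<hr>" % (
                msg.findtext("author"), msg.findtext("date"),
                msg.findtext("body")))
    return "\n".join(out)

print(to_html(XML))
```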


I've already mentioned that MWC uses an ingenious way of creating in-archive references. As has already been noted, the XML files themselves don't have internal references. But that's no problem: they will be created during the XSL-based transformation.


Let's first have a look at the second section of the XSL file, starting with comment 5. It iterates over all the available <topics> tags (note that a MWC-generated XML file has only one <section> because different sections are put in separate XML files). This is what <xsl:for-each select="/section/topics"> is for. By itself, this directive doesn't do much because it has no output; it's just a control structure that tells the XSL parser to iterate over the rightmost tag in the argument (that is, <topics> in this case).


The following <MBP:PAGEBREAK> already has some output. This tag tells MWC to insert a page break in the PRC file. It has no effect on a HTML browser's output.


<xsl:aname/>, as you can guess, inserts the well-known <a name="some anchor name"></a> in the HTML output. The "some anchor name" is very interesting in this case; it's built up on the serial number of the enclosing tags. Remember that I've over-emphasized the fact that everything is strictly ordered in the first pass's XML output? Now you can see it certainly pays off. Creating an in-archive link to this anchor will be explained later, when I describe the first part of the XSL too.


Now for the <h1><xsl:value-of select="title"/></h1> tags. This prints the contents of the <section>…<topics>…<title>some topic title</title>…</topics>…</section> tag. (Note that, as has already been said, you should not put a <center> inside a <h1>, because MPR won't be able to render it as a real title.)


If you recall how the .enews file was built, you can see it used exactly the same iteration structures. This means that if you want to describe some extremely complicated structure in your config files, you can make the job simpler by re-using the .enews config code when writing the .xsl: just copy the contents of the .enews into the would-be XSL and change action="iterate" attributes to <xsl:for-each select="<current enclosing tags>"> tags and action="extract" attributes to <xsl:value-of select="title"/> tags, with some additional work (mostly HTML formatting).


I've followed vBulletin's very simple and printer-friendly HTML formatting in converting uBB topics to a simply displayable format. In the tag <br><br><i>Posted by <xsl:value-of select="author"/> on <xsl:value-of select="date"/>:</i>, you can see how the author and the date are printed. All these are enclosed in an inner loop (remember the same loop in the .enews config file?). Just under these headers, the body of the message is copied into the HTML output, verbatim. We close the body with a <hr> so that messages are easily separated - it's printed out after every message because it's still in the inner loop.


Now that the actual formatting and the extraction of the messages/topic titles from the XML have been discussed, let's move on to the first part. I've already shown how unique <a name>'s can be generated for any (!) part of the HTML output. It's worth explaining how they can be linked to.


Rows 3 and 4 do exactly this:


<xsl:for-each select="/section/topics"> // 3.

  <xsl:ahref><xsl:value-of select="title"/></xsl:ahref><BR> // 4.



The enclosing tag is our old friend, <xsl:for-each>. It iterates over the <section>…<topics>…</topics>…</section> tags. During this iteration, we print the textual name of these tags with the <xsl:value-of> tag and make sure that this text is enclosed in a pair of <xsl:ahref> tags. The value of the opening <A> tag's HREF attribute will come from the above-mentioned number of the <title> we're just printing. This is how MWC generates in-page links so that TOC sections are preserved (or, to put it another way, built up again).


Note that the example above is one of the most complicated ones. I've deliberately chosen this to show you how easy it is to transfer even complicated structures into a very simple, PDA-displayable format. Most of the non-uBB-related .enews files in my archive only contain one iteration, because they follow the traditional TOC and linked pages model and don't iterate in the pages themselves.


Also, check out allaboutsymbian.com's phpBB scripts (All About Symbian.enews and .xsl). They differ from the script above in that they contain several sections and, therefore,

  • the .enews file contains a <title> tag for each section so that section names can be retrieved when building sub-TOC's
  • the XSL file's TOC-building section separates the sections and uses subsection links.


Also note that my .enews generators only support this one-level iteration, because two-level iterations (like extracting uBB forums) are pretty rare and most newspapers and non-uBB forums only require one level. For example, vBulletin topics are already table-free in their printer-friendly version. That is, if you download the printer-friendly version instead of the original, it will display just right on your PDA without individually extracting and re-formatting the individual messages, dates and authors.


Offline MB utilities


.enews generators

There are several utilities in my package. The first three help both in generating MWC configuration files and in parsing old, 'legacy' files into the input format used by the first two .enews generators.


They have been written to greatly reduce the work needed to create .enews files. Their default parameters (1-day periodicity, standardized Table of Contents (TOC) display and link section, picture downloading) will fit most needs. Only a few parameters must be present in their input files: the only parameter passed to the download-site generator, DownloadSiteEnewsGenerator, and the first parameter passed to the TOC-based .enews and .xsl generator, GenerateMobipocketConfigFiles. For a complete description of these parameters, see the opening comment section of the two files.


I've included the input configuration files to all the newspapers I've written an .enews for; they're named either <newspapername>.in or, when generated with GenerateInputFilesFromExistingMobipocketConfigFiles (see below), <newspapername>.generatedin.


I've also included a utility class, GenerateInputFilesFromExistingMobipocketConfigFiles, that converts already existing extract-type .enews files into the input format of GenerateMobipocketConfigFiles. Read the comments section for usage instructions.


.in generators for forum software


Most sites have several forums, for which it can be very tedious and error-prone to write an .enews file (or, better, an .in that can later be converted into .enews using my GenerateMobipocketConfigFiles) by hand. The utility classes in offline tools\forum in generators do exactly this. You only have to pass them two parameters: (1) the URL of the main forum index and (2) the site name. An example of generating a GenerateMobipocketConfigFiles input config file for HowardForums:


java vBulletin http://www.howardforums.com/index.php HowardForums


And for the PPC subsection on Brighthand,


java vBulletin "http://discussion.brighthand.com/forumdisplay.php?forumid=126" "Brighthand PPC"


(Notice that I used quotes because of the ? sign in the first and the space in the second parameter.)


As of version 1.7.2, only a vBulletin and a phpBB converter are available. Config generators for the other MBHelper-supported forums (uBB5/6 etc.) will follow soon.


Note that


  1. the vBulletin .in files may need a quick edit before .enews generation. Change the “>All times are" in the 5th row to the actual after-page text used in the printer-friendly pages, because this is where vBulletin boards differ (or may differ).


The vBulletin HTML parser handles section names as of version 1.7.1. Older versions, however, didn't; this is why some vBulletin forum .enews files don't contain category names. (Sections are links to subforum pages; these forums are all listed below the section names too.)


  2. a few phpBB sites (for example, http://www.allaboutsymbian.com/phpBB2/index.php) don't support printing pages; that is, their topic pages must be merged manually. This means there is no point in generating their .in files with the phpBB generator, because GenerateMobipocketConfigFiles only handles one-level-deep iteration. So, before making a phpBB .enews file, check whether the site in question supports printing at all (change the viewtopic.php in topic URL's to printview.php and see what happens - if the site returns an error message, you have to create the .enews file by hand).

Picture utilities


A commonly asked question is whether MPR can display really large pictures and/or how the quality of the pictures stored in PRC files can be improved. You can also formulate the question as 'do we really need to convert all JPEG files into low-quality GIF's in order to store them in a PRC?'


Resizing and no high/true-color support


Unfortunately, the current situation, without deploying my ChangeAllConvertedImageFilesBackToTheirOriginalsInAPRCFile, is pretty hopeless: for reading tech docs with large pics, MPR is not the best solution. MPR natively supports GIF pictures of any (!) size. Even with GIF's as large as 4-5 Mbytes, you only have to wait a second (on a 206 MHz StrongARM PPC2002 device) for the downsampled picture to appear, which is definitely very good speed. (The authors of MWC/MPR should think of adding a thumbnail option to their proprietary <IMG> tag so that the large image is only looked up in the PRC file when it's really needed - that is, when the user enters Image mode for the given image.) Switching to Image mode with such large images is very fast too - there is virtually no lag.


However, even though MPR is really fast on Pocket PC's, you can't make any PRC creator tool (that is, MWC and Mobipocket Publisher 3.0) actually include GIF's over 64520 bytes (more precisely, the threshold is somewhere between 64512 and 64525 - I couldn't generate a GIF with a size between those values so far) verbatim, without resizing, in a PRC document. Both MWC and the recently introduced Mobipocket Publisher will always resize pictures with larger file sizes and save them as GIF87's when converting to PRC. (Actually, Mobipocket Publisher uses the even more inferior BMP format internally; MWC seems to be a bit more up-to-date than Mobipocket Publisher. My application has only been tested with MWC.)


Furthermore, MWC resizes and converts all JPEG/PNG images into GIF. This also introduces tremendous losses in picture quality - and not just because of the true color -> 256 colors conversion, which isn't at all disturbing with tech figures and screen captures.


Animated GIF's


To make life even worse, about a third of animated GIF's aren't included either - just an invisible GIF87 is stored in the PRC instead. (If you want to play around with them to find out why some of the pics aren't displayed in MPR, I've collected some that I've downloaded from the PPCPassion forums. You can find them here. I've also provided some very simple .enews and .xsl files for a quick test - see the next paragraph.) The other two-thirds of animated pics are displayed OK.


I've also collected some examples of working and non-working animated GIF's from pocketgamer.org's News section.

Works:

http://www.pocketgamer.org/showthread.php?threadid=2528 (Whistle while you work...; whisraider.gif)

http://www.pocketgamer.org/showthread.php?threadid=2531 (AIM Releases Toki Tori; tokitori_2.gif)


Doesn't work:

http://www.pocketgamer.org/showthread.php?threadid=2518 (Block Busters; pocketblocks.gif)

http://www.pocketgamer.org/showthread.php?threadid=2511 (Scramble; scramble1.gif)



To make all frames (not just the first) of all animated GIF's (and not just some 60-70% of them) visible, I've included animation support in ChangeAllConvertedImageFilesBackToTheirOriginalsInAPRCFile. It extracts the frames of all animated GIF's and appends them into one large GIF. Finally, it's this new GIF that is inserted into the new PRC.


This way you will be able to see all the frames in Image mode. This is far superior to MWC/MPR's way of doing things: MPR will, at best, display only the first animation frame of the originally included GIF's (and, if you're unlucky, not even that). Not so after the ChangeAllConvertedImageFilesBackToTheirOriginalsInAPRCFile conversion.


To test ChangeAllConvertedImageFilesBackToTheirOriginalsInAPRCFile's superiority in presenting animated GIF's, create an .enews (of any name) file with the following (minimalist) content:



      <enews xsl-rendering="file:///c:/enews/AnimatedPicTest.xsl">


            <enewsitem selected="yes">






<SEC action="section">

      <website action="download-site" get-pics="yes">





And the accompanying XSL (AnimatedPicTest.xsl) should look like this:




<xsl:value-of select="/SEC/website/page/body" />




(Incidentally, now you can see how simple it is to write an XSL for MWC. For the action="download-site" style of site download, where there is no iteration at all, you don't have to iterate in the XSL either; this is why just an <xsl:value-of select="/SEC/website/page/body" /> suffices to print the downloaded content. It should be stressed again that inside your tags (<SEC><website>…</website></SEC>) MWC adds another pair of tags: <page><body>…</body></page>. This is why we access the real contents of the page as /SEC/website/page/body and not just /SEC/website. In the case of action="extract", there are no MWC-inserted additional tags to pay attention to. Also note that unless we explicitly put a <BODY>…</BODY> pair in the XSL file, the PRC will not be generated.)


After registering the .enews file with MWC and downloading the pics, go to <drive letter>:\Program Files\Mobipocket.com\MobiPocket Reader\data\AnimatedPicTest and load AnimatedPicTest.prc (the MWC-generated PRC file). Yeah, only 4 of the 7 pictures are shown. Run ChangeAllConvertedImageFilesBackToTheirOriginalsInAPRCFile  in the same directory and load AnimatedPicTest2.prc (the newly generated, modified PRC file). See the difference?


For another demo of the animated conversion, check out these two PocketGamer PRC's: the original and the converted (please note that the first article contained a bad animated GIF that ImageMagick could not convert; this is why ChangeAllConvertedImageFilesBackToTheirOriginalsInAPRCFile inserted an empty GIF there). You can also try the full download's original and converted versions (beware of their sizes: 25M and 40M, respectively).


Please note two important points:

  • framed, uncompressed GIF files can be pretty large - some of them can take 2-3 Mbytes. After June, however, with the introduction of an ImageMagick with LZW compression, this won't be a problem any more.

    To avoid problems with large animation sizes, you should consider using the option “only convert the files of a given pattern" if you only really need a few pictures at their original size.


  • ImageMagick can't decompress some kinds of animated GIF's. The first converted frame is OK with all animated GIF's; the later frames, however, may have displaced/strange colours. The pictures are, at least, still recognisable. This is a problem with ImageMagick, not with ChangeAllConvertedImageFilesBackToTheirOriginalsInAPRCFile.


BMP support


MWC doesn't convert BMP's either - of the BMP's I've tried, none worked. ChangeAllConvertedImageFilesBackToTheirOriginalsInAPRCFile also excels in this area: it converts BMP files to GIF's and includes them in the PRC without problems. An example of pages with BMP images can be found here. If you change the URL in the .enews file above to this URL, synchronize the site and, after MWC has finished, run ChangeAllConvertedImageFilesBackToTheirOriginalsInAPRCFile in the target directory (<drive letter>:\Program Files\Mobipocket.com\MobiPocket Reader\data\AnimatedPicTest). See the difference?

Proxy-based, fully automatic solution?

If only MWC didn't resize GIF's over 64520 bytes, downloading JPEG's that must be kept at their original size and quality would cause no headaches: MBHelper already contains code to intercept and modify picture links in the source HTML and, with very little effort, a new action 'convertalljpegstogifs' could be added. It would convert all received JPG files to GIF files in the background and give MWC an HTML that only contains GIF references. Actually, most of this functionality has already been implemented for the command 'uniquepictures[REGEX]'. The target GIF's would be at least 2-3 times larger than the source JPEG's but, at least, the texts, annotations etc. would all remain readable on them, even on a PDA.


This, however, because of the already-mentioned size threshold, can't be used: a fully automatic way of replacing JPEG pictures with their high-resolution, high-quality GIF conversions won't work because MWC will always downsample and resize them. This means I had to find a way to edit the PRC files after they have been created by MWC and before they are deployed to the PDA.


PRC reconstruction


Because (unlike with the other transformations done by MBHelper) we can't act before MWC creates the PRC, we must do the changes afterwards.


This only requires knowledge of the PRC file format (see e.g. http://web.mit.edu/tytso/www/pilot/prc-format.html for an intro; the PRC format MWC uses is a bit different - for example, the directory format differs and it uses much better, non-RLE compression). You just have to change the downsampled, resized picture records to their original versions if they were GIF's, and to a full-sized GIF conversion if they were originally JPG's.
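As a rough illustration, here is how walking a standard Palm-format record directory might look in Java. This follows the generic PDB/PRC layout from the linked intro (2-byte record count at offset 76, 8-byte directory entries from offset 78, each starting with a 4-byte record offset) - not MWC's modified variant; the class and method names are mine:

```java
import java.nio.ByteBuffer;

// Minimal sketch of walking a standard Palm PDB/PRC record directory.
// MWC's variant differs in some details, as noted above.
public class PrcDirectory {
    // Returns the file offsets of all records listed in the directory.
    public static int[] recordOffsets(byte[] prc) {
        ByteBuffer buf = ByteBuffer.wrap(prc);  // big-endian by default, as in the format
        int count = buf.getShort(76) & 0xFFFF;  // record count lives at offset 76
        int[] offsets = new int[count];
        for (int i = 0; i < count; i++) {
            // each directory entry is 8 bytes, starting at offset 78;
            // the first 4 bytes hold the record's offset in the file
            offsets[i] = buf.getInt(78 + 8 * i);
        }
        return offsets;
    }
}
```

With the offsets in hand, a record's length is simply the distance to the next record's offset (or to the end of the file for the last record), which is all that's needed to cut a picture record out and splice a replacement in.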


I've chosen not to use the official Java Advanced Imaging API (http://java.sun.com/products/java-media/jai), mostly because it isn't able to write GIF (due to licensing reasons, I bet). I stayed with the well-known, free ImageMagick picture tool suite (http://www.imagemagick.org/; download the binary from ftp://ftp.imagemagick.org/pub/ImageMagick/binaries/; the latest build as of 04/16 can be downloaded from ftp://ftp.imagemagick.org/pub/ImageMagick/binaries/ImageMagick-i686-pc-windows.exe) instead. Surely, this means using operating-system-level calls, but it works OK and really fast. (Actually, it's approximately 2-3 times faster than MWC's own GIF converter.)
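A minimal sketch of what such an operating-system-level call to ImageMagick could look like (the class and method names are hypothetical, not the tool's actual code; it assumes the `convert` binary is on the PATH):

```java
import java.io.IOException;

// Sketch of shelling out to ImageMagick's `convert` tool,
// as the text describes; assumes `convert` is on the PATH.
public class GifConverter {
    // Builds the command line: convert <source> <target.gif>
    // (ImageMagick picks the output format from the extension)
    public static String[] buildCommand(String src, String dst) {
        return new String[] { "convert", src, dst };
    }

    // Runs the conversion and returns true on a zero exit code.
    public static boolean convert(String src, String dst)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(src, dst)).start();
        return p.waitFor() == 0;
    }
}
```

For example, `GifConverter.convert("pic.jpg", "pic.gif")` would produce a GIF version of the JPEG for splicing into the PRC.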

Please note that, for most input (I've tested it mostly with JupiterMedia papers and tomshardware.com), PRC's with full-blown pictures may be 2-3 times larger than PRC's with scaled-down images. Note that because Unisys claims a patent on the LZW algorithm (expiring in the US as of June 2003) used by GIF, ImageMagick binary distributions do not include support for LZW, so GIF files are written uncompressed. However, if you can put your enews directory on your storage card (and you can with Pocket PC devices - other operating systems may be a bit different), that isn't an issue.

As stated, converting files is currently semi-automatic. You have to run ChangeAllConvertedImageFilesBackToTheirOriginalsInAPRCFile in every directory you want converted inside <drive>:\Program Files\Mobipocket.com\MobiPocket Reader\data\. It automatically computes the input HTML and PRC filenames. The output PRC filename is <original PRC name>2.prc.


You should also check out some examples of before and after PRC's. The example is that of http://www.tomshardware.com/network/20030408/index.html and http://www.tomshardware.com/network/20030325/index.html. Pay special attention to the pictures on http://www.tomshardware.com/network/20030408/bluetooth-07.html (and on) and http://www.tomshardware.com/network/20030325/wireless-10.html (and on). They're useless in the downscaled version (no text can be read), while in the PRC version created by my tool the pic quality is as good as that of the original (sure, it's only 256 colors, but that's no problem on a PDA screen). You can also check this out by downloading the two files from here and here. The difference speaks for itself.


The Java sources for the application can be downloaded here. (I link it here too because it's my working version; therefore, it may be a bit more up-to-date than the archived version in the ZIP archive)


Please note that you don't need to include all the original pictures in the PRC (which would mean a big increase in PRC file size); you can choose to include only a few of them. Telling ChangeAllConvertedImageFilesBackToTheirOriginalsInAPRCFile to do so is extremely easy. Let's see an example, again from THG.


THG's above-mentioned (and linked) two Networking articles, the Belkin and the Linksys article, make heavy use of high-resolution screenshot JPG's from Chariot (http://www.netiq.com/products/chr/default.asp). When converted by MWC, they become useless. Unfortunately, there is no naming standard on THG to denote files created by Chariot (unlike the images named imageXXX.gif, the standard benchmark picture names at THG).


Furthermore, articles like Benchmark Marathon: 65 CPUs from 100 MHz to 3066 MHz  (http://www6.tomshardware.com/cpu/20030217/index.html) make heavy use of high-resolution PNG images, which become absolutely useless after MWC's conversion (see e.g. http://www6.tomshardware.com/cpu/20030217/cpu_charts-22.html).


If you want to keep all (converted or not) pictures in the PRC file and just reinsert the ones above, you have to create a file containing the filename patterns of those files. By examining the benchmark pic names used in the Belkin and Linksys articles, you can easily determine the file's content:







(Incidentally, I've also included the benchmark picture names for the article series '802.11g Need-To-Know', http://www6.tomshardware.com/network/20030317/index.html.)


By passing the name of this file to ChangeAllConvertedImageFilesBackToTheirOriginalsInAPRCFile, you can override the default 'convert everything'. Now, the file will contain at original resolution only the pictures that match the patterns above. (Note that these are in the standard regular expression pattern format.)
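A sketch of how such pattern-based selection can work (the class name and the sample patterns are hypothetical, not the tool's actual code):

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch of the pattern-based selection described above: only pictures
// whose names match one of the configured regular expressions are
// restored at their original resolution.
public class PictureFilter {
    // Returns true if fileName matches any of the patterns.
    public static boolean shouldRestore(String fileName, List<String> patterns) {
        for (String p : patterns) {
            // find() does a substring match, like grep, rather than
            // requiring the pattern to cover the whole name
            if (Pattern.compile(p).matcher(fileName).find()) return true;
        }
        return false;
    }
}
```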



And a final note: as has already been said, MWC is exceptionally slow at converting pictures. The conversion of the approx. 25 Mbytes of JPEG/large GIF pics in the entire PPCPassion forum (except for the iPAQ and the Amigo forums) took slightly less than an hour on a 1.6 GHz machine with 1 GB DDR RAM; the same figures for a recent full THG download (iteration = 15) were a bit above five hours (82 Mbytes of source JPEG/GIF/PNG's to be converted). Notice the estimated time and size below, which are based on the last synchronization. Most of the displayed 379 minutes were spent on building the 200-Mbyte PRC file (don't pay attention to the “Update completed" below: I only took the snapshot of MWC after it had already tried to re-synchronize the site, which is why it also displays 0:20).




This is one of the reasons why future versions of MWC should offer picture-resizing-related config options and, hopefully, JPEG support. I hope they'll also use an approach to animated GIF's similar to mine so that no information is lost.


It's worth mentioning that my ChangeAllConvertedImageFilesBackToTheirOriginalsInAPRCFile's conversion of the 200-Mbyte PRC (with 82 Mbytes of JPG's/over-64k GIF's/PNG's) took one and a half hours (compare this to MWC's 5 hours of picture conversion time!) and the resulting file size was a whopping 530 Mbytes - I couldn't test it on a PDA. Incidentally, MPR on Pocket PC 2002 seems to be much more stable and quicker than on PC. It took the PC-based MPR a long time to turn pages in the 200-Mbyte THG archive (and it froze a lot of times), while the PPC-based MPR worked flawlessly. This means that if a large PRC doesn't work in the PC version of MPR, it won't necessarily fail on your PPC too.


Note that as of version 1.7.7, I've released my XDOCGenerator utility. It makes ChangeAllConvertedImageFilesBackToTheirOriginalsInAPRCFile - and even creating PRC files with MWC - obsolete.



The really big thing is MBHelper, the filtering proxy that helps overcome the shortcomings of MWC. As has already been mentioned, MWC suffers from some major problems. Just to recap: it doesn't handle cookies or automated user login. Furthermore, neither the .enews configuration file nor the .xsl formatting sheet offers any possibility of substituting, page merging, advanced URL removal etc.


The MBHelper proxy is a stand-alone HTTP proxy, specially targeted at people wanting HTML filtering and page merging. This suits MWC users too, because it fulfills almost all the requirements of MWC power users.


Please note that, as I've written the proxy with simplicity and speed in mind, I didn't implement support for any special HTTP method (just GET is supported, but that isn't a problem because MWC only issues GET requests). This means you can't use it as a generic HTTP proxy. (Actually, its brother HTTPSnoopProxy, also included in the package, supports PUT too, so it's more suitable for generic proxy purposes. Also note that neither of them supports error return codes (access forbidden; see e.g. http://jguru.com/faq/printableview.jsp?EID=9920), but that won't be a problem in casual use.)

Installation, running


If you already have a 1.4+ JDK or JRE, just unzip the MBHelper.jar file to a directory and run 'run.bat'. It will start the proxy on port 3456.


If you don't have a JDK/JRE, download one first from http://java.sun.com/j2se/. The much smaller (about 8 Mbytes) JRE will suffice if you don't want to recompile the sources. Install it before starting run.bat.


Now the filter is almost ready to work; one thing remains, though. You need to place the MBHelper.conf filter configuration file in the same directory as the above-extracted jar file. This configuration file contains all the actions that the filter should perform. (More on this later.)

Setting up your MWC/browser to use the proxy


You can now set either MWC or your browser (or both) to use the proxy server at localhost:3456. It can't be emphasized enough that you should always use a traditional browser on the PC first to check whether the MBHelper configuration file is OK, and only after that start working on the .enews files.


Please note that if you want proxy chaining (e.g., you're behind a firewall), you have to explicitly pass the proxy server's address and port number to MBHelper. See http://jguru.com/faq/printableview.jsp?EID=9920 on this.




MBHelper.conf is the configuration file for the proxy server. It may contain any number of configuration options and comments (the latter preceded by a #).


A configuration option, unless it is a parameterless action, consists of an action, a target URL and parameters. Parameters must be separated by <TAB> characters.
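A minimal sketch of parsing such a line (a hypothetical helper, not MBHelper's actual parser):

```java
// Sketch of parsing one MBHelper.conf line as described: fields are
// separated by <TAB> characters, and lines starting with # are comments.
public class ConfLine {
    // Returns null for comments/blank lines; otherwise the TAB-split
    // fields: field 0 is the action, field 1 the target URL,
    // and the remaining fields are the parameters.
    public static String[] parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) return null;
        return line.split("\t+"); // tolerate runs of tabs between fields
    }
}
```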


Because choosing the right target URL can save a lot of work, I'll elaborate on this subject a bit more.

The target URL


MBHelper requires that you supply a URL for most actions. This ensures that URL transformations, URL killing, picture uniqueness transformations, custom login, page merging, UTF-8 conversion etc. will only be executed on the right pages and nothing else.


This URL must be either the current page URL or, at least, some part of it. It can be generic - for example, just a period for certain actions (because every URL contains at least one period, the action will then be executed for any URL). However, only use URL's without the server part if you're absolutely sure that matching every URL containing the fragment you provide won't have unwanted side effects on other sites, and that the action's other parameters prevent the filtering proxy from executing it on unwanted pages.


For example, if you want to change all <DD>&nbsp;&nbsp;&nbsp;</DD> occurrences to <P>'s only on pages downloaded from www.helsinginsanomat.fi, you supply 'www.helsinginsanomat.fi' to the command substitute (along with the other parameters explaining what to change to what - see below). If you want to return only pages from www.helsinginsanomat.fi containing URL's that fulfil a given criterion, you again supply www.helsinginsanomat.fi to onlyAllowURL (or any part of it; please make sure it's not too generic). Note that you can also include http:// in the URL's.
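The matching rule itself boils down to a substring test; a sketch (hypothetical helper class, not MBHelper's actual code):

```java
// Sketch of the matching rule described above: an action fires when the
// configured URL fragment appears as a substring of the request URL.
public class UrlMatch {
    public static boolean applies(String requestUrl, String configuredFragment) {
        return requestUrl.contains(configuredFragment);
    }
}
```

This is also why a single period works as a wildcard: every URL contains at least one period, so the substring test always succeeds.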


An example of generic URL's is changeURL. As can be seen from the vBulletin configuration files, it's preferable to change the showthread.php string to printthread.php in all outgoing page requests (if they contain it!) on any vBulletin site. If we gave changeURL something other than a period, we would have to write a changeURL action for every site we want to synchronize. However, as it's only on vBulletin sites that URL's on a forum page contain showthread.php, we can be pretty safe giving changeURL a generic parameter.


Unfortunately, it's only vBulletin that lets you use generic configuration options. As an example of other forum engines, consider the uBB5 config files (you can find some of them in the uBB subdirectory). As you can see, the extraction of forum pages must start at different points (different background colors) for CEWindows.net and MacDebate.com. This means that if you want to synchronize both forums at the same time, you must use full server URL's in the MBHelper configuration file instead of just forumdisplay.cgi?action=topics. Incidentally, exactly the same holds for merging uBB5 topic pages: the subsequent pages' extraction starts at different tags on different sites.


Again, some URL's should not have server addresses. Let's recall the example of vBulletin's showthread.php (inside a topic) and forumdisplay.php (inside a forum). They will always have exactly the same additional parameters ('addURL'); programmatic URL changing always works the same way (showthread.php to printthread.php before sending out the request); forum pages always begin and end at the same pair of lines (<!-- topic table --> and <!-- /topic table -->, respectively) and always have the same Next symbol (&raquo;); and, from forum pages, the same URL parts should always be used for URL killing before returning the page to the HTTP client (URL's containing action=, goto= and pagenumber=).


If you want to synchronize more than one vBulletin forum, it's beneficial to use generic URL's rather than specific ones, so that you don't have to copy-paste and slightly modify the same configuration data for every vBulletin site you synch. A configuration file with the following content will do just fine (the bold sections stand for one-time archiving only and should not be used regularly!):


changeURL .         showthread.php     printthread.php

addURL    forumdisplay.php   &daysprune=1000&perpage=999999

killurl   forumdisplay.php   action=   goto=     pagenumber=

mergeallpagesfollownextlink  forumdisplay.php   &raquo;

<!-- topic table -->

<!-- /topic table -->


uniquepictures     .         denyUniquenessFromNowOn      icon_arrow.gif     icon_biggrin.gif   icon_confused.gif  icon_cool.gif      icon_cry.gif          icon_eek.gif       icon_frown.gif     icon_idea.gif      icon_mad.gif       icon_question.gif  icon_razz.gif      icon_redface.gif          icon_sad.gif       icon_smile.gif     icon_wink.gif      thumbsup.gif       thumbsdown.gif     images/icons       images/smilies



Note the . in the changeURL action. It effectively tells MBHelper to check all outgoing URL's (because every URL contains a period) to see whether they should be changed before the request is sent out.


Also note the uniquepictures action: it makes sure all downloaded pictures have a unique local name, but doesn't give the widely used vBulletin smileys unique names. The cacheSimulation global action, in addition, ensures that nothing is downloaded more than once from the same URL.


Incidentally, note that you shouldn't use either &daysprune=1000 or the mergeallpagesfollownextlink section during regular synchronizations; they should only be used when you really want to download the entire database. For daily, incremental synchronizations, you will mostly only need the first (that is, the most recent) forum page. Proper netiquette requires that you don't download more from a site than you really need. I may ask tech sites' sysadmins whether I can host their forums' PDA versions on my homepage, so stay tuned.


For daily use (to download only the most recent 20-30 forums automatically), you will only need the following for vBulletin boards. Remember that the first action changes "showthread.php" to "printthread.php" in all outgoing URL's, so that the results - the topic texts - come in printer-friendly, PDA-savvy format; the second adds &perpage=999999 to all topic displayer URL's (but not to forum displayers!) so that all messages are included in printthread.php's output; the third (the most important) kills all duplicated URL's from the returned TOC page (notice it's only invoked for forumdisplay.php's output - that is, for TOC pages!) so that every topic is included only once in the PRC file.


changeURL     .           showthread.php          printthread.php

addURL           forumdisplay.php        &perpage=999999

killurl   forumdisplay.php        action=           goto=  pagenumber=
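To illustrate what the first two actions above do to outgoing URL's, here is a sketch (a hypothetical helper, not MBHelper's actual code; killurl is omitted because it operates on the returned page, not the URL):

```java
// Sketch of the URL rewrites performed by the changeURL and addURL
// lines above (per those config lines, addURL targets forumdisplay.php).
public class VBulletinRewrite {
    // changeURL: showthread.php -> printthread.php
    public static String changeUrl(String url) {
        return url.replace("showthread.php", "printthread.php");
    }

    // addURL: append &perpage=999999 to matching URL's
    public static String addUrl(String url) {
        return url.contains("forumdisplay.php") ? url + "&perpage=999999" : url;
    }
}
```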


For vBulletin boards, you may also add the login and cookie keeper action so that you only download your favourite topics. However, because every site uses different URL's for logging in, this can't be as generic as the three actions above. Notice that I had to make the first URL site-specific (see POSTLogin for more information on logging in to a site).


POSTLogin     www.pocketpcpassion.com/forum            http://www.pocketpcpassion.com/forum/member.php?action=login&username=<login>&password=<pwd>



Please note that most actions allow only one URL (the generic 'substitute' is the only exception; I didn't want to convolute the code by letting the user supply any number of URL parts for all actions, so I went with the single-URL approach). This URL will always be checked against the actual target URL of every request: if the URL you supply is contained in the actual URL, the action will be executed.


Please note that this also means you can't have more than one instance of the same action with exactly the same URL fragment. If you still really need this (I never did), make sure the URL's are somewhat different so that two different actions are stored. (E.g. use ww.<some URL> instead of www.<some URL>; the action will still be executed.)


It's better to be cautious: if you supply, for example, a killURL action with a too-generic URL (e.g. a single period), ALL URL's containing the second parameter to killURL will be deleted. This may lead to strange errors.


Available actions




Substituting (changing) a given string inside the HTML page to anything else. Some tags, which look cool in a desktop browser, cause problems in the MPR. Just try to synchronize www.helsinginsanomat.fi's articles with your PDA and you will see what I mean (the <DD>&nbsp;&nbsp;&nbsp;</DD>'s cause big problems). Now, you can change all occurrences of a given string to anything else in any HTML page, and MWC will receive the already-changed HTML contents.


Note the magic word 'thisissomethingsurelyunique' to be used in URL's that you want to filter out. 'killURL' isn't able (yet) to filter out IMG SRC URL's (only A HREF ones), but if you change offensive image URL's (URL's that don't exist / make URLConnection throw exceptions etc. - e.g. ad.tomshardware.com for tomshardware.com pages) to contain 'thisissomethingsurelyunique', the related images won't be downloaded, because HTTPProxy will return a 402 error for all URL's that contain the magic word, and MWC doesn't include missing pictures in a PRC. This trick, when used with, for example, tomshardware.com, greatly reduces network traffic and decreases synchronization time.


To decide what to filter out, scrutinize HTTPProxy's console. If you see java.net.MalformedURLException: no protocol exceptions, the URL is probably an ad URL that should be filtered out. An example from THG pages:


url : http://amch.questionmarket.com/adscgen/sta.php?survey_num=123988&site=tmsh


java.net.MalformedURLException: no protocol: /adsc/d123988/st_atlas.php?survey_n


        at java.net.URL.<init>(URL.java:579)

        at java.net.URL.<init>(URL.java:476)

        at java.net.URL.<init>(URL.java:425)

        at sun.net.www.protocol.http.HttpURLConnection.followRedirect(HttpURLConnection.java:1081)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:675)
        at HTTPProxy.run(HTTPProxy.java:77)


Taking into account all ads etc. on THG, its substitute section should look as follows:


substitute         tomshardware.com

ad.tomshardware.com            thisissomethingsurelyunique

ping.nnselect.com       thisissomethingsurelyunique

amch.questionmarket.com      thisissomethingsurelyunique



(see command 'substitute')
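A rough sketch of how the substitute pass and the magic-word rule interact (the method names and the hard-coded host list are my illustrative assumptions; the real proxy is driven by the config file):

```java
// Sketch: rewrite ad-server host names to the magic word, then have the
// proxy answer 402 for any URL containing it, so MWC skips those images.
public class MagicWordFilter {
    static final String MAGIC = "thisissomethingsurelyunique";

    public static String substitute(String html) {
        return html.replace("ad.tomshardware.com", MAGIC)
                   .replace("amch.questionmarket.com", MAGIC);
    }

    public static int statusFor(String url) {
        return url.contains(MAGIC) ? 402 : 200;
    }

    public static void main(String[] args) {
        String html = "<img src=\"http://ad.tomshardware.com/banner.gif\">";
        System.out.println(substitute(html)); // host replaced by the magic word
        System.out.println(statusFor("http://" + MAGIC + "/banner.gif"));
    }
}
```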




Merging the pages of an article (or a forum page/forum directory page; see command 'mergeallpagesfollownextlink'). This is of great importance not only with MWC, but even with a standard browser. Some forums, for example, only show some 20-25 topic titles at a time. The worst of them is clearly Microsoft's HTTP-based forum engine (http://communities2.microsoft.com/home/msnewsgroups.aspx); the NNTP-based one has very few articles. It's slow, the navigation buttons are hidden during scrolling most of the time etc. Browsing these forums through a pagemerger vastly reduces the time needed.


Some news sites (http://www.tomshardware.com, http://www.pcmag.com) always store their articles in this multi-page format and offer no one-page capability (e.g., a printer-friendly version). My proxy plug-in helps download these articles as one article too, so that they can also be read on a PDA.


Please note that merging a number of TOC pages may take pretty long to process. Java isn't the fastest language, after all. There may be cases when MWC just stops waiting for the proxy and won't do anything. This happens when merging the directory page (TOC) takes more than five minutes. (No, returning some dummy No Operation HTML tag every 2-3 seconds just to keep MWC alive won't work - MWC stops if it doesn't receive the entire page in five minutes. MWC is clearly inferior in this respect to, for example, iSiloX, which allows timeouts of up to 999 seconds.) This won't happen in regular cases, anyway, because it's only when downloading entire forums that this may occur (e.g. PPCPassion's entire iPAQ forum, which has over 7000 URL's, or BrightHand's iPAQ forum with 23000 threads. Do NOT download any vBulletin forum that has over 2-4000 threads with MWC without using the trick described below, because all the bandwidth (and, even worse, the server load!) spent on downloading the given forum will be completely lost!). In cases like this, the easiest way to go is switching on global debugging mode (the 'debug' action in the configuration file; no URL required), going to the TOC URL in IE and, after the proxy has finished working, saving the merged TOC file (it'll be named MBFilter_out.000000). Please note that in debug mode, the filter also saves the incoming, unfiltered contents into files named MBFilter_in.xxxxxx. Their extension will be the same as that of the filtered stream, so you can easily compare the unfiltered input and the filtered output to find bugs. Also, especially with single pages, you can freely give these debug files the .html extension and check them with a plain HTML browser.


Note that the exact number of threads that will cause MWC to stop varies from forum to forum. On forums like Clié Source (http://www.cliesource.com/forums), where individual TOC pages are around 90 kBytes (because of the TOC table columns on the left), not even full TOC forum pages with 2000 threads will be downloaded. Most other vBulletin sites that don't have large links sections like those of Clié Source have individual TOC pages of around 50k, so you can automatically download forums with up to about 3000 topics.


Note that all this is caused by the Java RE's relative slowness and not by that of the internet connection, provided it's at least a 384 kbps ADSL. (Surely, over a GPRS/56k connection, don't even think about downloading more than 100-200 topics at once.) The 2000/3000 pages I've mentioned are meant for a P1.6 GHz machine with moderate load and the Java console in the background.


There is a dedicated action, 'addServerNameToAllURLsWhenNecessary', to include server information in all URL's where it is not present. Just define it in the configuration file. It takes no parameters.


With all this, an example configuration file for merging the TOC page of the entire iPAQ forum is as follows:


killurl   www.pocketpcpassion.com/forum/forumdisplay.php action=           goto=  pagenumber=

mergeallpagesfollownextlink  www.pocketpcpassion.com/forum/forumdisplay.php &raquo;

<!-- topic table -->

<!-- /topic table -->



Enter the URL http://www.pocketpcpassion.com/forum/forumdisplay.php?forumid=22&daysprune=1000 in your browser (make sure it uses my HTTPProxy as the proxy) and, after the download has finished, rename the debugger output, MBFilter_out.000000, to, say, MBFilter_out.html. Put it in the local filesystem and edit the 'official' PPCPassion iPAQ .enews file to contain <url action="follow">file:///c:/MBFilter_out.html</url> (if the HTML file is in the root) instead of the original <url action="follow">http://www.pocketpcpassion.com/forum/forumdisplay.php?forumid=22&daysprune=1000</url>. That's all you have to do; now, import the modified .enews file in MWC and you can already download all messages from the site. Now, MBHelper.conf should only contain the


changeURL     .           showthread.php          printthread.php?perpage=999999


row, which tells MBHelper to change all showthread.php request URL's to printthread.php and also to add ?perpage=999999 to them so that all the messages are shown.


Note that you don't need to start up a heavyweight standalone browser to initiate the full download of a merged TOC page; you can also use a new addition to the MBHelper suite, InitiateHTTPDownload. It greatly helps reduce the load on the local PC because it won't try to render the HTML (which, in most cases, will be way over 10 Mbytes) returned by MBHelper; it just prints it into a file named 'out.html'. It also contains functionality to emulate the addServerNameToAllURLsWhenNecessary MBHelper action, which doesn't work in all cases (e.g. with Javascript URL's) in MBHelper.


Usage: java InitiateHTTPDownload <URL> <local proxy port> <first part of relative URL's to change> <full URL's, including the previous part>


An example from Microsoft's Web-based newsgroups:


java InitiateHTTPDownload "http://communities.microsoft.com/NewsGroups/messageList.asp?ICP=pocketpc&sLCID=US&NewsGroup=microsoft.public.pocketpc&iPageNumber=1" 3456 "previewFrame.asp" "http://communities.microsoft.com/NewsGroups/previewFrame.asp"


If you use this external program (once again, you won't necessarily need it, but it's certainly advantageous over a memory-hog, full-blown browser, which can really slow down your PC if you download a really large TOC page), you only have to change the URL of a section in the .enews file to the absolute path of the file just downloaded. Let's have a look at the example of MS.com's newsgroups.


The General PPC forum has the URL of http://communities.microsoft.com/NewsGroups/messageList.asp?ICP=pocketpc&sLCID=US&NewsGroup=microsoft.public.pocketpc&iPageNumber=1. In the .enews file, you only have to change this to file:///c:/enews/out.html to make MWC look for the TOC page not at the former URL, but at the latter.


Please note that InitiateHTTPDownload doesn't handle cookies. In the very few cases when it's a problem (for example www.thedvdforums.com, which does require login even for forum access), it can't be used. In cases like this, you must use your standard browser instead of InitiateHTTPDownload (don't forget to add 'debug' and 'addServerNameToAllURLsWhenNecessary' to MBHelper.conf and remove any 'uniquepictures' so that the picture names won't be renamed - remember that this renaming only works until you shut down MBHelper).


Incidentally, you can also make an .enews file that has both local and remote URL's. The local URL's should point to HTML TOC files created by InitiateHTTPDownload, while the remote ones may keep their original versions. (Make sure the TOC pages they address aren't very large so there won't be timeouts.) Staying with the example of the Microsoft forums, it's only 'Reader, eBooks and Audio Books' that can be accessed without downloading the TOC first; the other three groups (Developer Questions, Marketplace - Buy and Sell and General PPC) will surely result in a timeout (meaning a lot of lost bandwidth) when accessed online. If you download the three large TOC's to individual HTML files and address all three from my MS .enews file (by just changing the URL's in it), you can still have all the MS forums in one PRC file, so you don't have to create a separate file for each forum.


Also note that redirected and merged pages are only stored in one pair of files in debug mode. To find the actual URL of a page in a log file, look for comments of the following form:


<!-- MBHelper log file; actual URL: <URL> -->


There can be more than one of them in a file if the pages are redirected and/or merged.


There are a lot of web sites that don't allow an article to be read at once (just to name a few: http://www.tomshardware.com, http://www.dpreview.com). The same stands for most Web-based forum software: they just don't allow forum or (as with SuperGamez/uBB, or some flavors of phpBB, see http://www.allaboutsymbian.com) topic pages to be displayed at once in their entirety. Most web-based newsreader applications (the most prevalent example being microsoft.com's newsgroups) also suffer from the same problem. By merging the pages in the background, you will already see/print the merged content, which can really be a timesaver with forum sites like that of microsoft.com.


Merging the pages in the background is just one of the several advantages you can benefit from not only in MWC, but also in any HTTP client.


Please note that the next link &raquo; above (for vBulletin forums) is unique enough not to get into an infinite loop. With other forums, however, you should also include the ending tag (or the starting, still unique tag parts) in the Next URL part. Consider, for example, phpBB. It uses the word 'Next' to link to the next forum/topic page. If, however, a topic page also has 'Next' in its name (as with http://www.allaboutsymbian.com/phpBB2/viewtopic.php?t=4766), MBHelper will get into an infinite loop. To avoid these cases, always use the longest possible 'next' link; with phpBB, it's “>Next</a>".


Please note that the code, especially the pagemerger, has been written to be as configurable and generic as possible. You only have to supply the 'next' link's text when the text is the link itself (with vBulletin, it's '&raquo;'). If the 'next' text isn't a link itself, but is followed by the link (with a non-constant name), use the 'next' linkname and an 'f' (as in 'forward'). This is the case with tomshardware.com.


Believe it or not: the linkname-based next-page approach (in this case, the linkname is &gt;) worked at once with the Microsoft Forums, without entering any Microsoft Forum-specific code into the filter class - which is a great thing, because the Microsoft Forums are known to have a very illogical HTML engine and convoluted HTML markup. This shows how powerful my approach to page merging is. You'll very rarely need to add any site-specific code to the filtering proxy itself: playing with the config file will work in most cases. Two exceptions are uBB and SuperGamez, because they don't use anything like 'next' - they only contain page number links. That is, it's impossible for non-site-specific code to know which link to follow. This is why I've also included special support for these two forum engines. As has already been emphasized, vBulletin (the most popular BB software right now) and phpBB don't need any special support because they follow the above-mentioned pattern.




Changing a request URL (or some parts of it) to anything else at request time (that is, the requested URL's are changed on the fly and the webserver sees the changed request URL). This is of great importance with forums or news sites where the normal and the printer-friendly version of a page differ only in a non-changing part of the URL. See, for example, the vBulletin and the phpBB examples.


An example of this: in vBulletin, the PHP script serving non-printer-friendly requests is called showthread.php. If you change this string to printthread.php in the request URL, you get a printer-friendly version (you may also have to add &perpage=<some big number> to the URL; see addURL). A lot of sites and forum software packages support generating printer-friendly pages.
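As a sketch (a hypothetical helper of mine, not the proxy's actual code), the swap is a plain string replacement on the outgoing request URL:

```java
// Sketch of the changeURL idea: swap the non-printer-friendly script
// name for the printer-friendly one in the request URL.
public class PrinterFriendly {
    public static String toPrinterFriendly(String url) {
        return url.replace("showthread.php", "printthread.php");
    }

    public static void main(String[] args) {
        System.out.println(toPrinterFriendly(
            "http://www.example.com/forum/showthread.php?threadid=42"));
        // -> http://www.example.com/forum/printthread.php?threadid=42
    }
}
```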


(see command 'changeURL')




Killing a URL (or some parts of it) from the HTML page before passing it to MWC for further processing. As MWC's built-in erasing capabilities are very limited, not configurable and sometimes don't even work (just try to delete the above-mentioned <DD>&nbsp;&nbsp;&nbsp;</DD>), it's mostly impossible to delete unwanted links from an HTML page without using my proxy plug-in. This is a problem with MWC, because it will blindly follow these links if they are different from each other (identical links are followed only once inside a page).


Just take a look at any vBulletin (probably the most widely used forum software; its homepage can be found at http://www.vbulletin.com/), phpBB (another, not as widely used forum software; http://www.phpbb.com/) or uBB (a Perl-based forum software, not very widely used on tech sites; http://www.infopop.com/products/ubb/) board. If you try to save them in a TOC-pages manner, topics that have multiple pages will be saved multiple times because they're referred to by only slightly different URL's. You can tell MWC to follow links that have something in them (see the action="extract-url follow" attribute), but you can't tell it what NOT to follow (e.g. URL's that have attributes named mode, mark and start). This is where filtering an unwanted URL from the returned page comes into the picture: MWC just won't receive any URL that you want killed, so it won't follow it, either. See the vBulletin/phpBB examples for this in practice.


Remark: I've only found one out of 30 news (not forum!) sites that had doubled/tripled articles because of this. You will be using this feature mostly with forums.


(see command 'killURL')




Only allowing a (regex) URL: as has already been said, MWC doesn't understand regular expressions (abbr. regex). This is especially problematic with newspapers like Helsingin Sanomat. HS only uses two letters in its TOC links to denote the section name after the actual date (KO stands for Kotimaa, that is, National news; other section abbreviations include UR for Urheilu (Sports) etc.):




Because the section abbreviation is separated from the non-date-dependent parts of the URL, it's almost impossible to specify a sensible filter string to MWC. If you choose the filter string to be the section abbreviation, there'll be a lot of extra hits, because consonant+vowel (or vowel+consonant) pairs are very common in other URL's / other parts of a URL.


You can't use the pre-date part of the URL, uutiset/juttu.asp?id=, either, because it carries no section-specific information, and we don't want to include articles in the section that belong to other sections.


You can't, unfortunately, tell MWC to expect eight numerals between id= and the section abbreviation (KO in this case), because these numbers will always change and MWC doesn't know regexes. It is exactly in these situations that the regex-capable onlyAllowURL action should be used.


Describing the condition above with a regular expression is very simple:


onlyAllowURL http://www.helsinginsanomat.fi/kotimaa/       /uutiset/juttu.asp\?id=[0-9]{8}KO[0-9]{1,2}


Please examine the [0-9]{8} section. It states that there must be exactly eight numerals. The URL part after the section abbreviation, [0-9]{1,2}, states that there must be at least one and at most two numerals here.


Also pay special attention to escaping the question mark, that is, \?. This is very important and must be done with all URL's that contain a GET parameter section.


You should consult the API documentation for java.util.regex.Pattern (http://java.sun.com/j2se/1.4.1/docs/api/java/util/regex/Pattern.html) for more on regular-expressions.
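You can verify such a pattern directly with java.util.regex, the same engine the proxy uses (the sample URL's below are made up for illustration):

```java
import java.util.regex.Pattern;

// Checking the onlyAllowURL pattern above with java.util.regex.
public class RegexCheck {
    static final Pattern KOTIMAA = Pattern.compile(
        "/uutiset/juttu\\.asp\\?id=[0-9]{8}KO[0-9]{1,2}");

    public static boolean allowed(String url) {
        return KOTIMAA.matcher(url).find();
    }

    public static void main(String[] args) {
        // Eight digits, the KO section code, then one or two digits: allowed.
        System.out.println(allowed("/uutiset/juttu.asp?id=20030429KO12"));
        // A Sports (UR) article doesn't match the Kotimaa filter:
        System.out.println(allowed("/uutiset/juttu.asp?id=20030429UR1"));
    }
}
```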


(see command 'onlyAllowURL')




Return only subpart of a page: although you can use the <begin> and <end> tags with the action="extract" attribute, you still cannot:

1. trim the TOC page (when not in the action="download-site" mode). This causes problems if a site doesn't contain section names in its URL's, so you can't filter them out (an example of this is http://www.gondola.hu) and a section TOC will always contain all the other, unwanted URL's. By using TOC trimming, you can return an already-trimmed page to MWC so it will only extract and follow the links you want.

2. trim ANY page (in the action="download-site" mode). This is of even more importance than TOC page trimming with sites that can (could; actually, using mergeallpagesfollownextlink with action="extract" helps in most cases like this) only be downloaded with 'download-site' (because the majority of the articles are not linked from a central TOC repository but from each other). This makes it possible to delete stuff from all pages that would otherwise be duplicated, making reading the PDA version annoying (and the PRC file large).


Note the optional parameter addTitleHTMLAndBodyTags to returnonlyhtmlbetweenbeginend. It's of extreme importance when you want to write generic site downloaders. Consider, for example, news TOC's (e.g., that of http://www.gigalaw.com/news/index.html) where the links point to pages of entirely different structure, because they're all hosted on different sites. If you wanted to download this to PRC without using MBHelper, the only way to go would be download-site with iteration depth 2 (so that it doesn't follow sub-links), with a lot of unwanted header/footer/ad stuff downloaded (don't forget that download-site downloads full pages and can't delete parts of them). However, with a cool trick, you can use the much cleaner extract mode: find the start and end HTML markup for the pages of all the referenced sites and create one returnonlyhtmlbetweenbeginend section for each of these sites (one section per site) with the optional addTitleHTMLAndBodyTags parameter. For the extract mode, define the plain <title>…</title> and <body>…</body> tags for the article titles and contents, respectively. Now, your .enews files will be very simple; it's MBHelper that cuts off the unwanted stuff and returns pages that are guaranteed to contain the two tag pairs.


If you don't define addTitleHTMLAndBodyTags, MBHelper won't return these tags, so MWC won't be able to extract anything from the article.


Alternatively, you can still use download-site with returnonlyhtmlbetweenbeginend without addTitleHTMLAndBodyTags because, as has already been stated, returnonlyhtmlbetweenbeginend will take care of removing unwanted HTML content, depending on the actual site. As returnonlyhtmlbetweenbeginend, like most other actions, requires the address of the site to work on, you can define different start/end HTML markup for all the major sites linked from these kinds of news feeds. Still, even when there is no junk in the returned pages, the extract type of PRC generation is advantageous because you can define your own TOC section, unlike with the download-site mode. However, in most cases, the original TOC that download-site downloads can be satisfactory.


The package (tested with version 1.6) also contains the configuration files for an addTitleHTMLAndBodyTags-based setup. The .enews file only declares that it wants to extract the title/article content from inside <title>…</title> and <body>…</body> tags, respectively. The .enews file knows nothing about the sites it will collect (filtered) information from, because MBHelper completely hides these site-dependent things. The majority of the work is done by MBHelper. Just check out the MBHelper.conf file in the Gigalaw.com Newsfeeds directory. This config file can be used not only with Gigalawcom_newsfeed.enews, but with any .enews file that wants to access any of the sites listed inside.


Incidentally, you really should play around with this file, because I've also included a lot of comments on my decisions, and it contains reusable actions for quite a few popular news sites.


Also note that if you use recursion (mergeallpagesfollownextlink), returnonlyhtmlbetweenbeginend will only cut off the first/last part of the first page. It's up to mergeallpagesfollownextlink's parameters to cut off the headers/footers (including unwanted tables) from the pages inside.


(see command 'returnonlyhtmlbetweenbeginend')




Filling up <pre>…</pre> blocks with <br>'s: MPR doesn't insert line breaks in <pre>…</pre> blocks. This renders most programming-oriented articles unreadable. This can't be helped from either .enews or .xsl.


By adding a prekiller declaration in MBHelper's configuration file, you can make the proxy fill up these sections with <br>'s before passing the HTML page to MWC.
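A minimal sketch of such a pass (my own simplification; the real prekiller action is configured in MBHelper.conf and may differ in detail): append a <br> to every line break inside <pre>…</pre> blocks.

```java
// Sketch: add <br> after each line break inside <pre> blocks so MPR
// shows the breaks even though it ignores <pre> formatting.
public class PreBreaks {
    public static String addBreaks(String html) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (true) {
            int start = html.indexOf("<pre>", pos);
            if (start < 0) { out.append(html.substring(pos)); break; }
            int end = html.indexOf("</pre>", start);
            if (end < 0) { out.append(html.substring(pos)); break; }
            out.append(html, pos, start);
            // Only the text between <pre> and </pre> gets the <br>'s.
            out.append(html.substring(start, end).replace("\n", "<br>\n"));
            out.append("</pre>");
            pos = end + "</pre>".length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(addBreaks("<pre>int a;\nint b;\n</pre>"));
    }
}
```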


(see command 'prekiller')




Guaranteeing unique picture names with action="extract": a well-known bug of MWC is not being able to save pictures that have the same name under different directories or on another site. This affects several sites (for example, http://www.tomshardware.com, http://www.gondola.hu and http://www.hetivalasz.hu). Also, when converting offline HTML's into PRC with the TOC/page architecture, it can be very frustrating to have the same picture names in different directories. Files with the same name will always overwrite each other, and the PRC will display only the last downloaded one at all occurrences of the identically named file.


The remedy is telling MBHelper to include the server URL and the full directory path of each referenced picture in the name of the file when storing it in the local filesystem. This makes sure all pictures have different names; still, only one instance is stored of an image that is referenced multiple times.
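The naming idea can be sketched like this (the exact naming scheme below is my assumption; the real uniquepictures action may mangle names differently):

```java
// Sketch of the 'uniquepictures' idea: build the local file name from
// the full picture URL, so same-named images from different directories
// or servers can no longer overwrite each other.
public class UniqueName {
    public static String uniqueName(String pictureUrl) {
        // Strip the protocol, then flatten the path separators.
        String s = pictureUrl.replaceFirst("^https?://", "");
        return s.replace('/', '_');
    }

    public static void main(String[] args) {
        // Two images called logo.gif no longer collide:
        System.out.println(uniqueName("http://www.tomshardware.com/img/logo.gif"));
        System.out.println(uniqueName("http://www.gondola.hu/pics/logo.gif"));
    }
}
```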


Please note that the 'can't download pictures of the same name' problem doesn't exist in the action="download-site" mode; but that mode should be avoided with most complex sites if you don't use the above-introduced returnonlyhtmlbetweenbeginend mode.


Please also note that if you don't supply a second parameter to uniquepictures, it'll store all pictures with the extension jpg/jpeg/gif/png under a unique name.


(see command 'uniquepictures')




Adding a string to a URL: this action tells the proxy to add a given string to all outgoing request URL's. It's of extreme importance with forum displayers that require an additional daysprune=1000 attribute to display all pages of the forum, and topic displayers that require &perpage=<some big number> to display all pages in printer-friendly mode (see, for example, the Brighthand, the PDABuzz, the Pocketgamer.org and the PPC-Welt-Community config files. Actually, the vast majority of vBulletin sites require this parameter. One notable exception is PPCPassion).


(see command 'addURL')




Officially, MWC doesn't support logging in to a site if downloading content requires a login. However, MWC has some ways to help with this:


-         it checks the standard cookie directory of the system, and if it finds a cookie, it'll automatically use it. This means that if you log in to any site that has cookie-based authentication, you only need to log in to that site once, and from then on, MWC will also be able to access the restricted pages of the site. For sites that have very long-lived cookies (NYT, Washington Times - their cookies last an entire year!) this is rather convenient. You only have to re-login if you deliberately delete your cookies from the default directory (e.g. by pressing Tools/Internet Options/Delete Cookies in IE).


Fortunately, MWC not only uses pre-made cookies, but also reuses cookies sent back by the sites it visits. This made it possible to 'hack' MWC into auto-login, even without external (proxy-based) help.


-         There is another, undocumented way in MWC to automatically log in using your username/password pair before each transfer. This means you don't have to log into a site each day before downloading its articles if the site's cookies expire in (less than) a day. Actually, you don't need to log into the site at all.

Actually, it's very easy to make MWC auto-login to a site even without using my tools. Just put the following tag pair in your .enews file (outside all tags with the action attributes "section" and "ignore", but inside the enclosing, global <site>, if it exists):


<url action="follow" ><login URL></url>


The <login URL> between the opening and the closing <url> tags will depend on the actual site. Getting the URL to enter will be explained later (see section Using HTTPSnoopProxy to get login URL's).


Most of the sites requiring login will work great with the 'hack' above. You don't even need to log into the site you want downloaded. There are some cases, however, when it won't work; I will speak of them later.


Using HTTPSnoopProxy to get login URL's


HTTPSnoopProxy is an HTTP proxy which makes it easy to capture both HTTP requests and responses. Using it is a snap - just start it and make your browser use it. In Internet Explorer, navigate to Tools/Internet Options/Connections, press LAN Settings and check the checkbox next to 'Use a proxy server for your LAN'. Fill in 'localhost' in the Address field and 3456 in the Port field.


Now, go to the page that has the login.


An example with New York Times


On NYT, just pressing any article link on http://www.nytimes.com/pages/world/index.html will take you straight there. Fill in your account information and press Log In. On the console screen of HTTPSnoopProxy, you will see something like this in the sea of 'req to:' messages:


POSTed: is_continue=true&URI=http%3A%2F%2Fwww.nytimes.com%2F2003%2F04%2F29%2Finternational%2Fworldspecial%2F29CND-IRAQ.html&OQ=&USERID=<username>&PASSWORD=<pwd>&log=Log+In&SAVEOPTION=YES

to: http://www.nytimes.com/auth/login


Please note that the URI=… parameter is not needed (it only contains the URL that I originally clicked).


The two rows above contain all you need: the body of the message (the text after 'POSTed:') and the URL it was posted to (after 'to:'). All you have to do is concatenate the two: write the URL first, put a ? (question mark) after it, and then the body. The concatenated address looks like this:
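The concatenation itself is trivial; as a sketch (a hypothetical helper name, and the placeholder credentials are from the captured example above):

```java
// Sketch: turn a captured POST target and POST body into a single
// GET-style login URL by joining them with a question mark.
public class LoginUrl {
    public static String toGetUrl(String postTarget, String postBody) {
        return postTarget + "?" + postBody;
    }

    public static void main(String[] args) {
        System.out.println(toGetUrl(
            "http://www.nytimes.com/auth/login",
            "USERID=<username>&PASSWORD=<pwd>&log=Log+In&SAVEOPTION=YES"));
    }
}
```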




To test whether it works at all (for NYT, it will), just go to the address you've just created (after shutting down the browser and deleting its cookies) and check whether it works. After visiting the URL, you shouldn't need to log in any more. If, however, you see (mostly unformatted) messages like 'GET method is not supported' or 'this is the GET method' and you aren't logged in, you will indeed have to use my suite and POSTLogin. However, only a small minority of Web sites accept just POST (not GET) messages, so you'll very rarely need to use POSTLogin.


Some other examples of getting the URL above:


vBulletin forums


Go to http://www.pocketpcpassion.com/forum/usercp.php (PocketPCPassion uses vBulletin) and fill in your login information (name and password). Click the button. HTTPSnoopProxy's console screen will contain something like this (only the parameters in bold will be different):


POSTed: s=2488d1dce5dd45132fd0a0eef9618c9e&username=username&password=password&action=login&url=%2Fforum%2Fusercp.php

to: http://www.pocketpcpassion.com/forum/member.php


Now, concatenate the two strings, using a ? between them. (Incidentally, you can remove s=2488d1dce5dd45132fd0a0eef9618c9e& from the POST body before doing this, because it's just a volatile, always-changing server hash. Note that I've also removed the & after it):
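Removing the volatile hash can be sketched with a small regex (my own helper, not part of the suite); it strips a leading or embedded s=<hex>& pair from the captured POST body:

```java
// Sketch: strip the volatile vBulletin session hash (s=<hex>&) from a
// captured POST body before building the GET-style login URL.
public class StripHash {
    public static String strip(String postBody) {
        // "$1" keeps the '&' (or start-of-string) preceding the hash.
        return postBody.replaceFirst("(^|&)s=[0-9a-f]+&", "$1");
    }

    public static void main(String[] args) {
        System.out.println(strip(
            "s=2488d1dce5dd45132fd0a0eef9618c9e&username=username&password=password&action=login"));
    }
}
```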




With the right username and password, you will be able to log in to the site without explicitly visiting http://www.pocketpcpassion.com/forum/usercp.php again, by just visiting http://www.pocketpcpassion.com/forum/member.php?username=username&password=password&action=login&url=%2Fforum%2Fusercp.php with cookies enabled. (Give it a test drive!)


All you have to do is put this between <url> tags inside the .enews file, as with all other (working) examples.


Nando Times


Nando Times also used to have a login screen (at the time of writing, it doesn't. I still list it here because it may come back). The URL to log in is (was):




When it won't work (JavaScript, POST-only)


There are some cases when the auto-login 'hack' in MWC won't work. Some of these sites use client-side generated cookies. Washington Post is one of the really few sites that rely on them (http://www.washingtonpost.com/wp-dyn/digest/; see function setWPNIUCID() in the page source of the link above). This means it's not the server that sends us the authentication cookies; they're generated on the client side. As JavaScript is only supported by full-fledged browsers and not by Web downloader clients like MWC, logging into sites that use this kind of authentication can't be automated.


Fortunately, as the Washington Post cookies live for a year, it isn't really an issue. Just remember to log in (by filling in the form at the link) each time you delete the cookies from your browser and before starting the very first synchronization.


The other problem is caused by POST-only sites. As has already been stated, this only affects very few sites because, although as much as 99% of logins are done by POST, most authenticating server-side scripts also accept GET requests.


For POST-only sites, you can't use MWC to do the login. Either log in before the synchronization in your standard browser (and if you're lucky, the cookie will be a long-lived one, so you won't have to re-login each day) or use the POSTLogin action in my suite.


The action is very simple:


POSTLogin     <URL>           <auth script URL>     <POST body>


An example with NYT:


POSTLogin     nytimes.com    http://www.nytimes.com/auth/login            is_continue=true&URI=&OQ=&USERID=<userid>&PASSWORD=<pwd>&log=Log+In&SAVEOPTION=YES
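

At the HTTP level, the POSTLogin action above makes the proxy issue a request roughly like the following. This is a sketch: headers besides the essential ones are omitted, and the actual Content-Length would be the byte length of the body line.

```
POST /auth/login HTTP/1.0
Host: www.nytimes.com
Content-Type: application/x-www-form-urlencoded
Content-Length: <byte length of the body below>

is_continue=true&URI=&OQ=&USERID=<userid>&PASSWORD=<pwd>&log=Log+In&SAVEOPTION=YES
```

The server's answer carries the Set-Cookie headers; those cookies are what authenticate all the page downloads that follow during synchronization.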


Please note that it's much cleaner to use the MWC 'hack' outlined above than POSTLogin, because the latter means you have to use my proxy. Only use POSTLogin when the script that authenticates users accepts POST requests only.


(see command 'POSTLogin')




Converting UTF-8 pages into the default local encoding: MPR is known to be unable to decode UTF-8 text (see e.g. http://www.petesguide.com/WebStandards/eBooks/ on PDA readers' compatibility with UTF-8). With the forceUTF8Conversion command, you can explicitly force the proxy plugin to handle the input as UTF-8.
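

The core of such a conversion can be sketched in two lines of Java. This is an illustration, not the plugin's actual code; ISO-8859-2 is used here as a stand-in for a Hungarian "default local encoding", and the real plugin may choose the target charset differently.

```java
import java.io.UnsupportedEncodingException;

public class Utf8ToLocal {
    // Decodes bytes that arrived as UTF-8 into a Java String, then
    // re-encodes the text in a single-byte local encoding so that a
    // UTF-8-unaware reader like MPR can display it.
    public static byte[] convert(byte[] utf8Input) throws UnsupportedEncodingException {
        String text = new String(utf8Input, "UTF-8");
        return text.getBytes("ISO-8859-2");
    }
}
```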


Remark: fortunately, very few Unicode sites use UTF-8, even in non-Western language areas. Of the several Hungarian sites I've examined, only one used it (http://hvg.hu/).


(see command 'forceUTF8Conversion')




Use full server URLs when needed: if you want to reuse the debug files (especially for merged pages; see the PPCPassion iPAQ example above), you may need their URLs to be absolute rather than relative. By adding the 'addServerNameToAllURLsWhenNecessary' action (it has no URL argument), you can force MBHelper to do so. It's especially useful when downloading and archiving large forums, e.g., those of PPCPassion. Don't forget that MWC will not wait more than 3-4 minutes for the proxy to return the entire page, while building a merged page from hundreds of original pages may take even 20-30 minutes, depending on the other actions you supply (most likely, killurl). In cases like these, initiating the page merging from IE and making MWC use the saved debug file as a TOC is the best and easiest solution.


(see command 'addServerNameToAllURLsWhenNecessary')




Convert in-text tables to some digestible format: although MPR is actually able to render HTML tables, MWC always drops them upon conversion to HTML (and, then, PRC). This causes the same problem as with <PRE> sections: not even line breaks will be used to separate table rows. Actually, as with <PRE> tags, MWC just removes <TR> and <TD> tags from the output. This is exactly where converttables comes into the picture: it puts a line of - characters before all <TR>'s and a | after all <TD>'s and </TD>'s. MWC won't remove these additional characters when it removes table markup because they're plain text.


Use the 'converttables' action by supplying the table's beginning row and end tag, or any unique part of them. Try to find some HTML markup that is common to all the tables you want formatted even in MWC's output. On tomshardware.com, for example, there are two kinds of in-text tables that really should be preserved. One of them starts with the row <TABLE WIDTH="400" BORDER="0" CELLPADDING="3" CELLSPACING="1">, the other (slightly wider) one with <TABLE WIDTH="585" BORDER="0" CELLPADDING="3" CELLSPACING="1">. What is the largest common part of these tags? Yeah, "BORDER="0" CELLPADDING="3" CELLSPACING="1">". Check whether this string appears anywhere else on the page to make sure it's unique. Then, use the converttables action the following way:


converttables  BORDER="0" CELLPADDING="3" CELLSPACING="1">         </TABLE>
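

The transformation itself is simple string surgery. The following is a hypothetical sketch of what converttables does to matching markup; the real plugin applies it only between the start/end strings you supply, and its exact regular expressions are an assumption here.

```java
public class TableConverter {
    // Sketch of the converttables idea: a line of '-' characters before
    // every <TR>, and a '|' after every <TD> and </TD>, so that readable
    // row/cell separators survive MWC's later removal of the table tags.
    public static String convert(String html) {
        return html
                .replaceAll("(?i)(<TR)", "----------------------------------------\n$1")
                .replaceAll("(?i)(<TD[^>]*>)", "$1|")
                .replaceAll("(?i)(</TD>)", "$1|");
    }
}
```

After MWC strips the <TR>/<TD> markup, the dashes and pipes remain as plain-text row and cell boundaries.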


Note that if you don't let MWC build the .PRC but use my XDOCGenerator in XDOC mode (by supplying it a parameter) for this purpose instead, in-text tables will be retained. This is another area where using XDOCGenerator over Mobipocket's official builders is recommended. (Mobipocket Publisher also gets rid of tables.)




Cache simulation: MWC spends a lot of time/bandwidth downloading the same files over and over again. Consider smileys in forum software like vBulletin: there may be tons of them on each topic page. MWC would download them all. With the global, parameterless command 'cacheSimulation', you can prevent MWC from doing so.


Remember that this will only work with MWC and other e-book generators, not with traditional browsers, because it's unconditional: I don't check the HTTP If-Modified-Since header. But be assured: even if MWC doesn't send out the If-Modified-Since header, the file will be there, so we can pretty safely send back a 304 and not contact the server at all.


(see command 'cacheSimulation')
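

The gist of the unconditional 304 reply can be sketched like this. This is a hypothetical illustration: the real proxy's cache lookup and file-naming scheme are certainly more involved than an exists() check on a flat directory.

```java
import java.io.File;

public class CacheSimulation {
    // Sketch of the cacheSimulation idea: if the requested file already
    // sits in the local cache directory, answer with an unconditional
    // "304 Not Modified" and never contact the origin server at all.
    // The cache-directory/file-name mapping here is made up.
    public static String respond(String cacheDir, String fileName) {
        File cached = new File(cacheDir, fileName);
        if (cached.exists()) {
            return "HTTP/1.0 304 Not Modified\r\n\r\n"; // client reuses its copy
        }
        return null; // not cached yet: forward the request to the real server
    }
}
```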




The (as yet) parameterless 'enableBloggerConversion' transforms all XML input into HTML so that MWC will also be able to understand it. For more information on Blogger, check out http://www.blogger.com/ and http://www.cnet.com/software/0-3228341-1204-9053580.html. Please note that, as of version 1.5, no extra checks are made to ensure an incoming XML document is indeed a valid Blogger document.


(see command 'enableBloggerConversion')



Some sites use JavaScript (or some other means) to send back e.g. page titles. These texts may be URL-encoded. The URLDecode command helps in decoding such lines.


Take, for example, the case of microsoft.com's Web-based newsgroups (for example, http://communities.microsoft.com/NewsGroups/messageList.asp?ICP=pocketpc&sLCID=US&NewsGroup=microsoft.public.pocketpc&iPageNumber=1). Individual articles return their title inside a JavaScript expression. An example (the subject is buried inside the long encoded string):


                        var sMessage = "mailto:%2522Paulo%2520Amaral%2522%2520%253Cpaulo.amaral@letswork.pt%253E?subject=RE:pocket%20pc%20microsoft%20client&body=%250A-----Original%2520Message-----%250AFrom%253A%2520%2522Paulo%2520Amaral%2522%2520%253Cpaulo.amaral@letswork.pt%253E%250ASent%253A%25204/28/2003%25204%253A08%253A31%2520AM%250ASubject%253A%2520pocket%2520pc%2520microsoft%2520client%250A%250Ahello%2520%252C%2520i%2520have%2520wireless%2520device%2520on%2520my%2520ipac%2520pocket%2520pc%2520and%2520i%2520%250Awoul%2520like%2520to%2520access%2520my%2520win2k%2520servers%2520with%2520url%2520path%252C%2520is%2520%250Athere%2520a%2520microsoft%2520client%2520for%2520microsoft%2520networks%2520for%2520%250Apocket%2520pc%2520%253F%2520and%2520whre%2520can%2520i%2520get%2520it%253F%250Athanks%250Argds%2520Paulo%2520Amaral%250A%250A.%250A";


Telling MWC to extract the title from this can be done (?subject= and &body= delimit it), but it won't do the URL decoding, so titles would be copied to PRCs verbatim (i.e., RE:pocket%20pc%20microsoft%20client). URL decoding has to be done before passing the page to MWC.


Please note that, in addition to the mandatory URL part of the action, another parameter must be passed to URLDecode: any part of the line that should be URL-decoded. In the case of microsoft.com's Web-based newsgroups, I've chosen this to be 'var sMessage = '.


Including all this, the action looks as follows:


URLDecode     microsoft.com var sMessage =


(see command 'URLDecode')



Some tips and tricks for using MBHelper


The most important trick: when you work on MBHelper.conf, try it in your browser first and only then in MWC. You can check the result of the actions right in your browser window and, therefore, don't have to wait for MWC to finish. Furthermore, re-registering .enews files to be able to download pages during development is a real pain in the neck. And, most importantly, there is absolutely no output in MWC if either the <begin> or the <end> tag doesn't match. In that case, in a stand-alone browser, you can still see what the proxy gives back, and you can go on rewriting the extraction start/end pairs.


When working on new MBHelper/MWC configuration files, you absolutely don't need pictures while you haven't even fixed the text begin/end bugs (you'll have time to switch them on later to find picture bugs - uniqueness etc.). Always use html-filter="no-pics" instead of the default html-filter="get-pics" to reduce server load / build time.


It should also be emphasized that, in order to shorten response times and reduce server load, especially at test/development time, on TOC pages, along with action="iterate", you should define another attribute in the same tag with the name max and the value 2 or 3 (for example, <ARTICLE action="iterate" max="3">). This will only follow the first three URLs on the page. You should also use this attribute in the final, deployable version of your .enews file if you only need the latest articles. (Use Ctrl-H in Winedit and change action="iterate" to action="iterate" max="3" in the default output file of my .enews generator, which doesn't support setting the max iteration.)


When testing a new .enews/MBHelper configuration pair, do not specify mergeallpagesfollownextlink (for example, just comment it out) so that you reduce server load and the time needed.


If you decide to modify the MBHelper / MBHelperFilter source files, use the debug messages as you wish. I've left them there so that you can check the program flow by just uncommenting the necessary System.out.println()'s.


Also, take advantage of the whichIfBranchWasTRUEPrinter method inside decision expressions, where you want to find out which of the sub-expressions evaluated to true and which didn't. Note that this method always returns true, so it doesn't alter the value of an AND expression. This means you should always use it right after the expression you want to test at runtime. In an expression, you can use any number of them, for example, to test all individual sub-expressions. I've made heavy use of them when writing the page-merging if header (which is quite large and contains tons of sub-expressions). An example of the usage is as follows:


In the final ps.println() section, you want to decide whether to print the line at all, depending on whether we're inside the page area that should be returned (between the HTML markup denoted by the optional startTextFound and the endTextFound markup) and whether there are special tags in the page text that should be returned (for example, </body>, if you gave the optional addTitleHTMLAndBodyTags to returnonlyhtmlbetweenbeginend). In this case, because whichIfBranchWasTRUEPrinter always returns true, modify the sub-expression


line.toLowerCase().indexOf("</body>")!=-1


to the following (the whichIfBranchWasTRUEPrinter call and the surrounding parentheses are the new code):


(line.toLowerCase().indexOf("</body>")!=-1 &&
whichIfBranchWasTRUEPrinter("/body found!!!!"))


Notice the additional pair of parentheses. If you didn't use them, the entire OR expression would be evaluated to true.
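

For reference, a minimal stand-in for whichIfBranchWasTRUEPrinter might look like this (the real method lives in the MBHelper sources; this sketch just reproduces the behavior the text describes):

```java
public class DebugHelper {
    // Prints its message and always returns true, so AND-ing it into a
    // boolean expression reports that evaluation reached this point
    // without changing the expression's value.
    public static boolean whichIfBranchWasTRUEPrinter(String msg) {
        System.out.println(msg);
        return true;
    }
}
```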


If something strange happens in MWC (the output is strange) while IE/NS can render MBHelper's output, check whether the HTML files that MBHelper outputs have a starting <HTML> tag. I got very strange errors because of this with the offline JupiterMedia converter: when I didn't explicitly insert <HTML> into the MWC input, MWC just cut off the first half of the first IMG SRC, no matter what HTML tags were there before it. (This is why I add <html>\r\n<body>\r\n in JupiterMediaCDConverter.java after cutting out the menu.)


Converting very complex stuff (e.g. the offline JupiterMedia CDs) into a format that is easily processed by MWC may be done more easily by offline file transformations. Check out the offline tools/JupiterMedia directory in the distribution ZIP. It contains two classes. JupiterMediaCDConverter is the more important of the two: it cuts off the menus on the left and the upper image. It also replaces image links with full links (try to tap a small image link on a PDA screen - it will only work in 30-40% of the cases). Actually, this is something MBHelper cannot do without explicit Java coding.


JupiterMediaCDConverter also takes into account that, over the years of publication, the HTML tags used for locating the main contents changed in the HTML files. It knows all the article start/end markers used during these years.


The other class, InGeneratorForJupiterMediaCD, creates standard DownloadSiteEnewsGenerator input files. It must be run in the directory of the monthly issues (it's \archives for the JupiterMedia CDs). It even calls DownloadSiteEnewsGenerator, so you don't have to convert the .in files to .enews/.xsl files by hand.


All in all, when it comes to really complex conversions/transformations, you can also give offline tools a try. However, in most (99% of) cases, MBHelper will work just great.


Future Plans

  • Right in the next version: an online filter/converter app, at first for forum software with POST support and “Reply/Email" links, so that you can not only browse vBulletin[/phpBB/UBB] forums via GPRS/CDMA/Wi-Fi in real time and in PIE, but also answer any post. All this offers real-time browsing (unlike e-News-based forum downloads) and interactivity, while keeping the compact, no-tables form of the forums. From now on, PDA-based forum browsing will be really cool - you won't have to rely on the horizontal scrollbar when you browse popular forum pages any more!
  • Expanding the idea of online content filtering: making all kinds of Web content PIE-friendly, with very little manual work. No more side-scrolling, no more slowish Thunderhawk!
  • Adding more sophisticated authentication support to HTTPProxy so that you can be sure it's only you who is using your proxy.




I am not employed or in any way associated with MobiPocket A.S. My endorsement is solely based on my satisfaction with MobiPocket Reader.