TBTF for 1999-03-01 discussed the theory of HTML smudging -- a gradual degradation of the code base of the Web caused by inattention (human and programmatic) to maintaining consistent, unambiguous HTML structure on rapidly changing Web pages. Before-and-after pages were offered for readers who wanted to test whether a "smudge" effect could be measured. Here are some readers' responses (newest at the top).
OS -- Win NT 4.0 SP3, MSIE 5.00.0910.1309, T1 connection.
Keep up the good work!!!
One solution would be to provide HTML checking tools. I have not come across an HTML editor yet that does strict checking of HTML syntax; most just try to show you what the page looks like. It would also be helpful if these tools didn't automatically include tags and 'bots that the coder didn't put there. I must admit, I do not have a great deal of experience with direct HTML coding.
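The strict check this reader asks for can be sketched in a few lines of Python (html.parser is a modern stdlib module, so this is an after-the-fact illustration rather than a tool a 1999 reader could have used; the class name and tag list are my own):

```python
from html.parser import HTMLParser

# Elements defined as empty in HTML; they never take a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link",
             "area", "base", "col", "param"}

class SmudgeChecker(HTMLParser):
    """Report unclosed or mismatched tags -- one simple kind of 'smudge'."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags
        self.problems = []  # human-readable complaints

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # <br/>-style tags open and close themselves; nothing to track

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.problems.append("unexpected </%s>" % tag)

    def check(self, html):
        self.feed(html)
        self.close()
        self.problems.extend("unclosed <%s>" % t for t in self.stack)
        return self.problems
```

A clean page comes back with an empty list; a page with overlapping or unclosed tags gets one complaint per problem. A real validator would also check against the DTD, but even this much catches the commonest smudges.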
Another method is to stop non-technical users from updating web pages directly. Let them enter content into a database back-end from which the pages are generated. This should stop smudging.
OS: GNU/Linux slackware 3.4 with 2.0.36 kernel
Browser: Communicator 4.08
Connection: ISDN, 115kbps
Not much of a difference either way. I noticed that the unsmudged page did a better job of displaying before it had finished downloading. TBTF is not necessarily a good page to demonstrate this with, since by its nature it is less cluttered than other pages. Think we could identify a culprit and mirror it alongside an unsmudged version of it?
TBTF rocks! Thanks.
Before - 7 seconds
After - 4 seconds
NT4, Pentium 200, Netscape 4.05, 64K line to UUNet in the UK.
I'm off to check my own pages...
But the smudged-unsmudged test is silly. How can you possibly compare load times for two different files over the Internet from a production web server, using a stopwatch and a browser you can't trust? There are too many places for delays to creep in.
In fact, it could be that it's your webserver which handles smudged HTML more slowly than unsmudged HTML.
I just downloaded the source to the pages in question to my local filesystem, and my browsers all bring them up much too quickly to really measure how long it takes, let alone measure a difference between the two pages.
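The readers' methodological complaints can be addressed by timing many repeated fetches programmatically and reporting the spread, rather than eyeballing a stopwatch. A minimal Python sketch, assuming nothing about the actual test setup (the URL and repeat count are placeholders):

```python
import time
from statistics import mean, stdev
from urllib.request import urlopen

def time_runs(fetch, n=5):
    """Time n repetitions of fetch() and return (mean, stddev) in seconds.

    Repeating the measurement and reporting the spread is the minimum
    needed to distinguish a real difference from network jitter.
    """
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fetch()
        samples.append(time.perf_counter() - start)
    return mean(samples), stdev(samples)

# Hypothetical usage; the URL is a placeholder, not TBTF's real test page:
# smudged = lambda: urlopen("http://example.com/smudged.html").read()
# print(time_runs(smudged, n=15))
```

Note this still measures transfer plus server delay, not rendering alone; separating those, as the reader's local-filesystem trick does, needs the page served from disk.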
Although you do not say so, your observation implies a long-term case for XML, with its draconian well-formedness-handling and its support for external validation of document types. Without it, we have both the "degrading code base" described, and the associated browser bloat, choking the marketplace and stalling innovation. With XML, we can simply dump HTML and start fresh with decent tag sets tailored to our requirements. (Well, HTML will probably stick around as a de-facto display language for the web....)
Summary: Smudginess is undesirable, but I could find no statistically supportable evidence with my system and the two pages you made available for testing that eliminating smudginess improves rendering speed significantly.
Last night I did the test using a nominally 115 kbps ISDN connection. I have recalculated the results from that test, which involved first 5 downloads in a row of the "smudged" page and then 5 in a row of the "desmudged" page. For the smudged page, the average time to render was 16.2 seconds (standard deviation: 6.3 seconds). For the desmudged page, the average time to render was 10.8 seconds (standard deviation 3.4 seconds). The test was run between about 5:45 and 6:00 PM CST (on a Sunday). These results indicate at most a slight, barely statistically significant improvement in rendering time from reducing smudginess.
This morning I did the test on two different analog connections, both nominally 56 kbps: one through a Milwaukee, WI ISP, to which I connect by a local call from Madison, and the other through the MSN network, to which I also connect by a local call from Madison. On each connection, I did 15 cycles of a download of the smudged version followed by a download of the desmudged version.
The downloads through the local ISP were done between 7:26 and 8:02 AM CST (on a Monday). This, of course, is when the Net is being heavily accessed by people just after arriving at their offices in the Eastern and Central time zones after the weekend. The average rendering time for the smudged version was 21.0 seconds (standard deviation 1.2 seconds). The average rendering time for the desmudged version was 20.7 seconds (standard deviation 1.4 seconds). The difference in rendering times is not statistically significant.
The downloads through the MSN network were done between 8:35 and 8:57 AM CST (on a Monday). During this period, the Net is being accessed significantly less heavily than an hour earlier, because the early morning rushes in the Eastern and Central time zones are about over while the much lighter one in the Mountain Zone (with its significantly lower population) is still ramping up and the one in the Pacific Zone has not yet started. The average rendering time for the smudged version was 13.9 seconds (standard deviation 1.4 seconds). The average rendering time for the desmudged version was 13.5 seconds (standard deviation 1.0 seconds). Again, the difference in rendering times is not statistically significant.
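The summary statistics above are enough for anyone to check the "not statistically significant" claim: with two means, two standard deviations, and the sample sizes, Welch's t statistic falls out directly. A quick Python sketch using the local-ISP figures:

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic for two independent samples, from summary stats."""
    return (mean1 - mean2) / math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Local-ISP run: 15 cycles each; smudged 21.0 s (sd 1.2),
# desmudged 20.7 s (sd 1.4).
t = welch_t(21.0, 1.2, 15, 20.7, 1.4, 15)
# |t| comes out around 0.63, well below ~2.05 (the 5% critical value
# for roughly 27 degrees of freedom), so the 0.3-second difference is
# indistinguishable from noise, matching the reader's conclusion.
```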
At least with the smudged and desmudged pages and the OS/browser combination used in the tests conducted here, it appears that there is very little or no improvement in rendering time attributable to desmudging.
There are a number of factors that might explain this lack of observed improvement, even if there is some theoretical reason to believe there should be one. First, background noise from random variations in transfer times over the Net might hide any improvement desmudging provides. It is not possible to determine from these tests whether an increase in smudginess would at some point allow an improvement in rendering time to be observed. To make such a determination, it would be necessary to test a page at three or more different levels of smudginess, possibly including the corresponding, completely smudge-free page.
Second, so many people might have been accessing your server this morning in order to review or download the March 1 issue of TBTF that the dominant factor by far in determining rendering speed (i.e., the "rate-limiting factor") would have been delay in gaining access to your server. This would again represent a masking by characteristics of net traffic of any theoretical gain there might be due to desmudging.
The chance is high that early yesterday evening (a Sunday evening), when I ran the tests with ISDN connectivity just a short time after you sent the e-mail version of TBTF for March 1, traffic to your server was much lower than it was this morning. This significant reduction in net-traffic background might be why a barely statistically significant improvement in rendering time was observed yesterday evening but could not be observed this morning.
Finally, it is possible that, at least with a 300 MHz Netscape 4.5 system, smudging can be compensated for so well by the system that desmudging does not yield any gains until the smudging becomes so bad that a page cannot be successfully loaded at all.
Much more data is needed to establish that smudging is a problem for rendering speed that needs attention anytime soon.
I do agree, though, that smudginess represents both sloppiness and potential problems down the road. So, considering both the elegance of doing things properly and the desirability of avoiding problems over the long run, smudginess is something to be avoided, and purchasers of web pages should, for their own good over the long haul, insist on smudge-free pages and be willing to pay for them.
However, there's at least one complicating factor here, and that's the routing. Although TCP/IP isn't a circuit protocol, the effect of fetching a page (or otherwise touching a remote site) is somewhat circuit-like.
Specifically, every machine between me and you has to know how to route the packets. If the next hop in the chain is in the machine's cache, the lookup is fast and the packet moves out quickly. If even one site has to look up an address or do a name-to-IP-number translation, that can slow things down.
Thus the test is, at least potentially, order-dependent. Since you offered the smudged ("before") page first, it is more likely to be penalized. That's not to say your basic concept is wrong, just that your test needs to eliminate these other factors first. Perhaps by running a traceroute first?
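Short of running a traceroute before each trial, the cheapest way to take name lookup out of the picture is to resolve the server's name once before any timed fetch, so both pages see a warm resolver cache. A minimal Python sketch (the hostname in the comment is a placeholder):

```python
import socket

def warm_lookup(host):
    """Resolve host once before timing, so the name-to-IP translation
    (and any resolver cache miss) is paid outside the measured runs."""
    return socket.gethostbyname(host)

# e.g. warm_lookup("example.com") before the first timed download
```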