<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
<channel>
<title>Regular Sucking Schedule</title>
<link>http://regsched.bookinfo.info/</link>
<description>A blog about the issues of RSS bandwidth usage</description>
<copyright>Copyright 2005</copyright>
<lastBuildDate>Tue, 18 Jan 2005 15:15:15 -0800</lastBuildDate>
<generator>http://www.movabletype.org/?v=3.14</generator>
<docs>http://blogs.law.harvard.edu/tech/rss</docs> 

<item>
<title>More from an Aggregator Developer</title>
<description><![CDATA[<p>
Andy Henderson, who wrote in with some detailed questions (and got some detailed responses) a few weeks ago, has revised his news aggregator intended for a specialized industry, and had some comments he was willing to share.
</p><p>
Andy writes:
</p><p>
You were good enough to publish some questions I had about implementing good RSS behaviour in my Aggregator.  I got some very useful feedback and have published an updated version of my Aggregator at:
</p><p>
<a href="http://www.seeita.com/RSSA/RSSA.html">http://www.seeita.com/RSSA/RSSA.html</a>
</p><p>
I thought you might like an update on what I found.
</p><p>
I had already implemented ETag and Last-Modified handling.  In .NET it's just a case of using HttpWebResponse.Headers.Get to pick up the ETag and Last-Modified response headers, then using HttpWebRequest.IfModifiedSince and HttpWebRequest.Headers.Add("If-None-Match", LastETAGValue) to pass the values back on the next request (provided the date supplied was valid). Then just monitor for a 304 'Not Modified' response from the server.
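</p><p>
A minimal sketch of the same conditional-GET round trip, written in Python rather than .NET. The function names conditional_headers and fetch_if_changed are illustrative assumptions and are not taken from the Aggregator itself; only the header names come from the description above.
</p><pre>
```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators saved on the previous poll."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None when the feed is unchanged."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request) as response:
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the validators we already hold
            return None, etag, last_modified
        raise
```
</pre><p>
The caller stores the returned ETag and Last-Modified values with the feed and passes them back on the next poll; a body of None means there is nothing new to parse.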
</p><p>
Handling ttl (Time To Live) tags in feeds is unfortunately more subjective. It appears the intent of this tag has drifted since its definition. Originally it told file sharing programs the maximum time they should cache data.  Since then, however, it's been used by some feeds to specify the minimum polling interval they expect from Aggregators.  So, what I've done is implement ttl as a minimum polling interval subject to a blanket minimum of 60 minutes (unless people click a manual 'update now' button).  That seems to safely split out the two different kinds of use.
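</p><p>
The rule above can be sketched in a few lines of Python. This is only an illustration of the stated policy (ttl as a minimum, a 60-minute floor, and a manual override); the function name and the handling of a missing or malformed ttl are assumptions.
</p><pre>
```python
MIN_POLL_MINUTES = 60  # blanket floor described in the text


def poll_interval_minutes(ttl=None, manual_update=False):
    """Minutes to wait before polling again; 0 means poll right away."""
    if manual_update:  # the 'update now' button bypasses the floor
        return 0
    try:
        ttl_minutes = int(ttl)
    except (TypeError, ValueError):  # ttl element missing or not a number
        ttl_minutes = 0
    return max(ttl_minutes, MIN_POLL_MINUTES)
```
</pre><p>
With this rule a feed declaring a ttl of 30 is still polled hourly, while a feed declaring 120 is polled every two hours.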
</p><p>
I have a similar problem with the Syndication tags: updatePeriod and updateFrequency.  They describe the feed's update frequency which does not strictly relate to polling frequency.  Again, I have implemented a blanket minimum poll interval of 60 minutes.  I also assume that 24 hours is a suitable minimum polling interval for any feed updated less frequently than once a day (otherwise it could take up to a week to spot a change in a feed that is updated weekly).  I then take values in between as minimum poll intervals.
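</p><p>
One way to read that policy, sketched in Python: divide the update period by the update frequency, then clamp the result between 60 minutes and 24 hours. The function name, the month and year lengths, and the fallback to "daily" and 1 when the tags are absent are assumptions made for the example.
</p><pre>
```python
PERIOD_MINUTES = {
    "hourly": 60,
    "daily": 24 * 60,
    "weekly": 7 * 24 * 60,
    "monthly": 30 * 24 * 60,   # approximation
    "yearly": 365 * 24 * 60,   # approximation
}
FLOOR_MINUTES = 60         # never poll more often than hourly
CEILING_MINUTES = 24 * 60  # slow feeds are still checked once a day


def syndication_poll_minutes(update_period="daily", update_frequency=1):
    """Minimum poll interval derived from updatePeriod and updateFrequency."""
    period = PERIOD_MINUTES.get(update_period, PERIOD_MINUTES["daily"])
    frequency = max(int(update_frequency), 1)
    interval = period // frequency
    return min(max(interval, FLOOR_MINUTES), CEILING_MINUTES)
```
</pre><p>
A feed updated four times a day gets a six-hour minimum interval; a weekly feed is capped at one poll a day, so a change is spotted within 24 hours rather than a week.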
</p><p>
I found out more about feed compression.  The .net framework doesn't handle it automatically.  However, there is a free .net library that supports gzip and deflate compression available at:
</p><p>
<a href="http://www.icsharpcode.net/OpenSource/SharpZipLib/Default.aspx">http://www.icsharpcode.net/OpenSource/SharpZipLib/Default.aspx</a>
</p><p>
There is a good article explaining how to use it at:
</p><p>
<a href="http://feralboy.com/log/archives/000420/">http://feralboy.com/log/archives/000420/</a>
</p><p>
Note, however, that the code as supplied uses Microsoft encoding defaults i.e. UTF-8 encoding.  That's true for many pages but not all (if your Aggregator drops or misrepresents UK pound signs you now know why).  Note that I did not have to code for the server 406 response.  Unless told otherwise, servers assume they can send an uncompressed response - which is what they do if they do not support either gzip or deflate.  So it's simply a matter of requesting compression and then checking HttpWebResponse.ContentEncoding to see what you get back.  I was disappointed, though, to find out how few feeds supported compression.
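</p><p>
A small Python sketch of the "request compression, then check what came back" approach, assuming the request was sent with Accept-Encoding: gzip, deflate. The decode_body name and its parameters are hypothetical; the charset should come from the response or the XML declaration rather than being assumed to be UTF-8.
</p><pre>
```python
import gzip
import zlib


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Undo gzip or deflate if the server applied it, then decode the text."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        try:
            raw = zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            # Some servers send a raw deflate stream with no zlib header.
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)
    # No Content-Encoding means the server sent the body uncompressed.
    return raw.decode(charset)
```
</pre><p>
Passing the feed's real charset (for example iso-8859-1) instead of relying on the default is what keeps characters such as the pound sign intact.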
</p><p>
I have decided not to implement the skipDays tag.  It is rarely used and, I suspect, getting rarer.  It is IMHO fatally flawed.  The RSS spec doesn't say how different time zones should interpret 'Sunday', for example.  If both feed supplier and aggregator assume GMT then, during the Winter, a UK feed that specifies skipping on Saturday and Sunday will not be polled after 1pm Friday from some locations and up to 11am Monday from others.  If either feed suppliers or Aggregators assume local times then the skipping could be 'out' by up to 23 hours!
</p><p>
I'm still thinking about the skipHours tag.  The RSS spec says times are GMT.  Even if that is correctly interpreted by both feed suppliers and Aggregators, it is inappropriate for Aggregators to interpret the tag literally; otherwise a feed made available during New York business hours would never be read by a Singaporean using an Aggregator at work during normal hours.  Right now, I'm thinking it is difficult to develop a watertight implementation for limited benefit, given that using header tags already makes the interaction a very small one and that, according to Syndic8, very few feeds implement skipHours.
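</p><p>
The New York and Singapore case can be checked with a short Python sketch of a literal skipHours test. The feed's skip list and the office hours below are made-up values chosen to match the example; only the GMT interpretation comes from the RSS spec.
</p><pre>
```python
from datetime import datetime, timedelta, timezone


def in_skip_hours(skip_hours, now):
    """True if `now` (an aware datetime) falls in a GMT hour the feed asked to skip."""
    return now.astimezone(timezone.utc).hour in skip_hours


# Hypothetical feed that only wants polling 14:00-21:59 GMT
# (09:00-17:00 in New York during winter).
NEW_YORK_SKIP = set(range(0, 14)) | {22, 23}

# A reader whose Aggregator only runs 09:00-17:00 Singapore time (GMT+8).
SINGAPORE = timezone(timedelta(hours=8))
office_hours = [datetime(2005, 1, 19, hour, tzinfo=SINGAPORE)
                for hour in range(9, 17)]
never_polled = all(in_skip_hours(NEW_YORK_SKIP, moment)
                   for moment in office_hours)
```
</pre><p>
Here never_polled comes out True: every Singapore office hour lands inside the skipped range, so a literal implementation would never fetch the feed for that reader.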
</p><p>
The Syndication spec says it "supercedes (sic) the RSS 0.91 skipDay and skipHour elements".  I'm guessing there are other people out there as wary as I am about them.
</p><p>
I got an excellent suggestion to suspend polling when a PC is inactive. This has two major benefits: when people are away from their desks, the Aggregator does not waste bandwidth looking for changes; and when they return, the Aggregator is reactivated and they get to see the latest news.
</p><p>
To start with, it looked tricky to implement an inactivity routine because it used to require 'hooking' keyboard and mouse events - and that looks like spyware activity.  However, since Windows 2000 we have had a Win32 routine called GetLastInputInfo.  The description at:
</p><p>
<a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winui/winui/windowsuserinterface/userinput/keyboardinput/keyboardinputreference/keyboardinputfunctions/getlastinputinfo.asp">http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winui/winui/windowsuserinterface/userinput/keyboardinput/keyboardinputreference/keyboardinputfunctions/getlastinputinfo.asp</a>
</p><p>
talks about keyboard activity, but the routine also tracks mouse movements and clicks. It makes it really easy to build in an inactivity monitor - which I've done.  I have had to allow people to switch it off, though, since some people leave their PCs switched on overnight to pick up Podcasts and BitTorrents.
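</p><p>
A hedged Python sketch of such an inactivity gate. GetLastInputInfo and GetTickCount are the real Win32 calls, reached here through ctypes; the should_poll helper, its ten-minute threshold, and the on/off switch are assumptions standing in for whatever the Aggregator actually does.
</p><pre>
```python
import ctypes
import sys


class LASTINPUTINFO(ctypes.Structure):
    """Mirror of the Win32 LASTINPUTINFO structure."""
    _fields_ = [("cbSize", ctypes.c_uint), ("dwTime", ctypes.c_uint)]


def windows_idle_seconds():
    """Seconds since the last keyboard or mouse input (Windows only)."""
    if sys.platform != "win32":
        raise OSError("GetLastInputInfo is only available on Windows")
    info = LASTINPUTINFO()
    info.cbSize = ctypes.sizeof(LASTINPUTINFO)
    if not ctypes.windll.user32.GetLastInputInfo(ctypes.byref(info)):
        raise ctypes.WinError()
    ticks = ctypes.windll.kernel32.GetTickCount()
    # Both values are 32-bit millisecond counters; mask to survive wraparound.
    return ((ticks - info.dwTime) & 0xFFFFFFFF) / 1000.0


def should_poll(idle_seconds, idle_limit_seconds=600, suspend_when_idle=True):
    """Poll unless the user has been idle too long and suspension is enabled."""
    if not suspend_when_idle:  # user switched the monitor off (overnight downloads)
        return True
    return idle_limit_seconds > idle_seconds
```
</pre><p>
On Windows the polling timer would call should_poll(windows_idle_seconds()) before each scheduled fetch; no keyboard or mouse hooks are involved.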
</p><p>
I hope that's helpful.  Certainly I've found the feedback to date to be helpful.  I remain open to any suggestions to improve or refine my attempts to be well-behaved.
</p><p>
Andy
</p>]]></description>
<link>http://regsched.bookinfo.info/archives/2005/01/index.html#004711</link>
<guid>http://regsched.bookinfo.info/archives/2005/01/index.html#004711</guid>
<category></category>
<pubDate>Tue, 18 Jan 2005 15:15:15 -0800</pubDate>
</item>
<item>
<title>Give Poor Aggregators Less</title>
<description><![CDATA[<p>
John Wilson has <strong><a href="http://www.crazybutable.com/weblog/archives/2004/12/20/another-rss-bandwidth-reducing-technique/">a great idea</a></strong> about dealing with bad aggregators: give them less. He has reconfigured his copy of WordPress, the blogging software, to provide tiny summaries for aggregators that fail to send the HTTP headers that demonstrate polite behavior. This is a nifty idea, and one that I would hope we could convince blogging software makers to adopt. It's a pretty trivial change, but it really depends on how they wire in header parsing with response information. I'd like a checkbox in Movable Type that says "restrict text to subject only for aggregators that fail to demonstrate that they understand what changed content means."
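</p><p>
As a rough illustration only (this is not John's WordPress change, and the function name is made up), the server-side decision could look something like this in Python: treat a request carrying If-None-Match or If-Modified-Since as coming from a polite aggregator and give everyone else the short version.
</p><pre>
```python
def choose_feed_variant(request_headers):
    """Return 'full' for clients sending a conditional-GET header, else 'summary'."""
    names = {name.lower() for name in request_headers}
    if "if-none-match" in names or "if-modified-since" in names:
        return "full"
    return "summary"
```
</pre><p>
One caveat with a check this simple: even a well-behaved aggregator has no validators to send on its very first request, so it would get the summary once before being upgraded.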
</p>]]></description>
<link>http://regsched.bookinfo.info/archives/2004/12/index.html#004603</link>
<guid>http://regsched.bookinfo.info/archives/2004/12/index.html#004603</guid>
<category></category>
<pubDate>Mon, 20 Dec 2004 20:48:31 -0800</pubDate>
</item>
<item>
<title>Aggregator Developer</title>
<description><![CDATA[<p>
Andy Henderson wrote in with some very interesting questions that he is allowing me to share. Please post comments or reply to him.
</p><p>
Andy writes:
</p><p>
I am a developer of an Aggregator - the CITA RSS Aggregator available from <a href="http://www.SeeITA.com/RSSA/RSSA.html">www.SeeITA.com/RSSA/RSSA.html</a>.  I developed it for a specific target market and I do not expect its use to grow to the extent that it will materially affect any RSS servers.  However, I want to be responsible so I am trying to take suggestions from this forum seriously.  I have already implemented ETag and Last-Modified header processing and now I am considering Randy Charles Morin's HowTo document to improve my Aggregator's behaviour.
</p><p>
It gives rise to several issues:
</p><p>
1) How should I understand the word 'Hint' in this context?  Is the information an instruction that the Aggregator should obey, or is it a suggestion the Aggregator should convey to its user and allow them to decide how they act on it?
</p><p>
2) The skipHours/skipDays tags give me several concerns:
</p><p>
a) A straw poll of 29 sites that I monitor reveals that none of them implement skipHours or skipDays.  Syndic8 says that 1.98% of feeds use skipHours and 0.18% use skipDays.  That's a pretty small minority to code for.
</p><p>
b) How many RSS publishers properly understand that the hours are in GMT? Most Americans I have met (I'm a Brit) think London is GMT - but that's only for half the year.  Alternatively, how many will simply use their local time zone by mistake?
</p><p>
c) There is an obvious issue when people are in different time zones.  For example, a reader in Singapore has no working hours overlapping with working hours in, say, New York.  Simple implementation of the skipHours tag could mean that an Aggregator would never poll some feeds for some people.
</p><p>
d) Similarly, not polling on, say, Sundays is subject to interpretation.  Is it Sunday for the reader or the publisher that the Aggregator should exclude from polling?  I guess it has to be subject to the reader's time zone, but does that implement the publisher's expectation?
</p><p>
3) ttl is more widely implemented - Syndic8 says 7.74%.  Use of the Syndication module will push this up a bit.  However, I find that most sites specify hourly caching or less.  My Aggregator's minimum poll interval is one hour, so implementing ttl would increase polling in many cases!  That having been said, I plan to implement the higher of ttl and hourly polling as a minimum polling interval - subject to question 1 above.
</p><p>
4) I'm obviously interested in the Accept-Encoding header because it looks like everyone wins from correct implementation of that one.  However:
</p><p>
a) I use the .NET HttpWebRequest class to read RSS feeds but I can't find any authoritative statement about whether this header is sent automatically by the .NET framework, or not.  I can't see why it wouldn't be (and the referenced w3.org document suggests that, by default, the server can send a compressed response), but we are talking Microsoft here.
</p><p>
b) As I understand it, the reader tries Accept-Encoding with, say, "gzip,deflate"; if it gets a 406 (Not Acceptable) status code back, it has to try again without the Accept-Encoding header.  But doesn't this mean a double hit on servers that don't support compression?  OK, it could remember the initial response, but suppose the server is upgraded to support compression later on?
</p><p>
c) I can't find any tutorial on how to handle compressed responses from a server.  If .net doesn't handle them automatically, can anyone point me in the right direction?
</p><p>
Sorry if my naivety is showing.  Obviously, I don't expect you to be able to answer my questions but maybe one of your readers will be able to help.
</p><p>
<a href="mailto:Andy@SeeITA.com">Andy Henderson</a> Constructive IT Advice Andy@SeeITA.com
</p>]]></description>
<link>http://regsched.bookinfo.info/archives/2004/12/index.html#004601</link>
<guid>http://regsched.bookinfo.info/archives/2004/12/index.html#004601</guid>
<category></category>
<pubDate>Mon, 20 Dec 2004 16:17:06 -0800</pubDate>
</item>
<item>
<title>Throttle Off</title>
<description><![CDATA[<p>
I've had to turn my RSS throttle off until I figure out a more reliable solution. I was definitely breaking various people's tools, feeds, aggregators, and suppressing more than I wanted to.
</p><p>
Also, the system load on the machine, which had to handle many thousands of MySQL-based queries, often right at the top of the hour, was more than it needed to be. It wasn't stressing the machine, but I have too many demands on my systems to add that load and still have good performance right now.
</p><p>
So I'll keep the throttle off through Epiphany, and we'll see what materializes after looking at revised statistics.
</p>]]></description>
<link>http://regsched.bookinfo.info/archives/2004/12/index.html#004600</link>
<guid>http://regsched.bookinfo.info/archives/2004/12/index.html#004600</guid>
<category></category>
<pubDate>Mon, 20 Dec 2004 14:40:43 -0800</pubDate>
</item>
<item>
<title>Specific Advice and Configuration Tips for RSS Goodness</title>
<description><![CDATA[<p>
Randy Charles Morin sends in a link to <strong><a href="http://www.kbcafe.com/rss/rssfeedstate.html">his very detailed advice</a></strong> on how to configure Web servers and modify RSS feeds, along with marching orders for aggregator developers, all aimed at reducing the bandwidth wasted on RSS today.
</p><p>
A main point here is that aggregators need to be well behaved. Publishers can tweak all we want, but lightly implemented or buggy aggregators that don't observe Web server and feed directives and responses are (in my opinion) most of the problem. That's not to say that those of us running servers can't tweak our settings to improve matters, too!
</p>]]></description>
<link>http://regsched.bookinfo.info/archives/2004/12/index.html#004588</link>
<guid>http://regsched.bookinfo.info/archives/2004/12/index.html#004588</guid>
<category></category>
<pubDate>Fri, 17 Dec 2004 10:43:05 -0800</pubDate>
</item>
<item>
<title>Tristan Louis on Ways to Cope with RSS</title>
<description><![CDATA[<p>
Tristan Louis sent me a link to a <strong><a href="http://www.tnl.net/blog/entry/Capacity_planning_and_RSS">post he wrote a few months ago</a></strong> that's perfectly relevant today. He recommends server-side improvements that could help reduce the bandwidth impact of aggregators.
</p><p>
Tristan also <strong><a href="http://news.netcraft.com/archives/2004/10/20/rss_focus_shifts_to_bandwidth_management.html">pointed to this great summary</a></strong> of RSS issues at NetCraft from October. It covers quite a lot of the current ground.
</p>]]></description>
<link>http://regsched.bookinfo.info/archives/2004/12/index.html#004586</link>
<guid>http://regsched.bookinfo.info/archives/2004/12/index.html#004586</guid>
<category></category>
<pubDate>Thu, 16 Dec 2004 15:10:19 -0800</pubDate>
</item>
<item>
<title>Slashdot&apos;s Jamie on RSS Throttling</title>
<description><![CDATA[<p>
Jamie McCarthy <strong><a href="http://slashdot.org/~jamie/journal/93006">writes very specifically</a></strong> about how to implement Slashdot's approach to throttling aggressive little buggers trying to request-bomb your site, intentionally or not. I'm not sure that on lower-trafficked sites this strategy works. The problem with RSS over-requesting is often that many thousands of aggregators might request a page 24 times a day when they need it once or twice. That's a hard thing to throttle, as I'm discovering.
</p><p>
Jamie's post is <strong><a href="http://dsandler.org/wp/archives/2004/12/15/real-world-rss-throttling-on-slashdot">dissected</a></strong> by Dan Sandler.
</p>]]></description>
<link>http://regsched.bookinfo.info/archives/2004/12/index.html#004582</link>
<guid>http://regsched.bookinfo.info/archives/2004/12/index.html#004582</guid>
<category></category>
<pubDate>Wed, 15 Dec 2004 16:37:58 -0800</pubDate>
</item>
<item>
<title>Julian Bond&apos;s Three RSS Conclusions</title>
<description><![CDATA[<p>
Julian Bond <strong><a href="http://www.voidstar.com/node.php?id=2123">writes about</a></strong> how old resources he created that use RSS never die: they just continue to get pounded on by abandoned projects. Is there an Internet term for abandoned resources that live on? Not zombies. Perhaps The Und3@d? 
</p><p>
His conclusions, brief already, rewritten to be even terser by yours truly: Crappy aggregators abandon ship. Opportunistic programmers abandon ship. Abandoned ships are abandoned. How long has RSS been the Marie Celeste?
</p>]]></description>
<link>http://regsched.bookinfo.info/archives/2004/12/index.html#004564</link>
<guid>http://regsched.bookinfo.info/archives/2004/12/index.html#004564</guid>
<category></category>
<pubDate>Sun, 12 Dec 2004 08:15:05 -0800</pubDate>
</item>
<item>
<title>It&apos;s the Server, Stupid!</title>
<description><![CDATA[<p>
Chris DeSalvo <strong><a href="http://www.desalvo.org/blog/index.php?p=232">made the point</a></strong> about two weeks ago--and it's still timely--that servers, not clients, are really the problem in the RSS world. Well-behaved servers should be sending appropriate responses for If-Modified-Since and ETags.
</p>]]></description>
<link>http://regsched.bookinfo.info/archives/2004/12/index.html#004563</link>
<guid>http://regsched.bookinfo.info/archives/2004/12/index.html#004563</guid>
<category></category>
<pubDate>Sat, 11 Dec 2004 20:30:13 -0800</pubDate>
</item>
<item>
<title>Statistics a-Boing-Boing</title>
<description><![CDATA[<p>
BoingBoing has <strong><a href="http://boingboing.net/stats/">made their statistics readily available</a></strong>. They get about 300,000 page views a day and feed over 3 GB per day in HTML--and 3 GB per day in aggregation feeds (XML).
</p>]]></description>
<link>http://regsched.bookinfo.info/archives/2004/12/index.html#004562</link>
<guid>http://regsched.bookinfo.info/archives/2004/12/index.html#004562</guid>
<category></category>
<pubDate>Sat, 11 Dec 2004 18:08:28 -0800</pubDate>
</item>
<item>
<title>My Posts on Throttling</title>
<description><![CDATA[<p>
I posted three items over the last few weeks about RSS bandwidth use and my attempts to throttle it back. The first one, <strong><a href="http://blog.glennf.com/mtarchives/004445.html">on Nov. 13</a></strong>, shows a chart of usage and how rapidly it's grown. My second, <strong><a href="http://blog.glennf.com/mtarchives/004469.html">on Nov. 20</a></strong>, shows my attempts to throttle usage through a script that quickly identified unpleasantly frequent aggregators. My third, <strong><a href="http://blog.glennf.com/mtarchives/004540.html">on Dec. 7</a></strong>, shows that the throttling dropped bandwidth usage down to nearly 50 percent.
</p><p>
But as two Dans discuss--see earlier posts on this blog linking to their comments--my approach is probably restricting multiple users behind NAT and anonymous proxies. More rethink needed to strike a balance between appropriate throttling and losing subscribers.
</p>]]></description>
<link>http://regsched.bookinfo.info/archives/2004/12/index.html#004560</link>
<guid>http://regsched.bookinfo.info/archives/2004/12/index.html#004560</guid>
<category></category>
<pubDate>Sat, 11 Dec 2004 14:25:12 -0800</pubDate>
</item>
<item>
<title>Dan Sheridan Suggests RSS GUID</title>
<description><![CDATA[<p>
<strong><a href="http://www.xlogs.net/2004/12/09.html#a1590">Another very cool idea on RSS throttling</a></strong>: have your system assign a unique GUID to handle the user-side tracking through firewalls, proxies, anonymizers, etc. This Dan (do all Dans write about aggregation?) notes that I'd have to update all existing feeds to get the best benefit. But this is a good moving-forward idea. Perhaps a migration would be in order over time.
</p><p>
I can see how to implement this using PHP, but there is a load factor. I would want every page load with an RSS icon on it to have a unique session ID--unique in a MySQL database table--rather than a redirect.
</p><p>
If I just had rss.xml -&gt; rss_GUID.xml calculated through my throttling script, then aggregators that don't honor redirects during subscription would still request the old file time and again. Actually, that's probably okay. That would mean less-sophisticated aggregators would be restricted across IP + proxy + http_via, while more sophisticated (and thus better-behaved) aggregators would be throttled just by their GUID.
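</p><p>
A rough sketch of that fallback, in Python rather than the PHP mentioned above. The rss_GUID.xml path shape comes from the paragraph above; the 32-hex-digit GUID format, the throttle_key name, and the way the IP, proxy, and Via values are joined are all assumptions for illustration.
</p><pre>
```python
import re

# Hypothetical per-subscriber feed path: /rss_ followed by 32 hex digits.
GUID_FEED = re.compile(r"^/rss_([0-9a-f]{32})\.xml$")


def throttle_key(path, remote_addr, forwarded_for="", via=""):
    """Pick the identity to throttle on: the GUID if present, else the address trio."""
    match = GUID_FEED.match(path)
    if match:
        return "guid:" + match.group(1)
    # Older subscriptions still hit the plain feed; fall back to IP + proxy + Via.
    return "addr:" + "|".join([remote_addr, forwarded_for, via])
```
</pre><p>
Requests for the plain rss.xml share a key with everyone behind the same proxy, which is exactly the coarse restriction described above, while GUID requests are counted per subscriber.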
</p><p>
There are some issues around how to recognize the same aggregator when it keeps getting new GUIDs that it doesn't "recall" over time. More thinking on this is necessary to avoid restricting users.
</p>]]></description>
<link>http://regsched.bookinfo.info/archives/2004/12/index.html#004559</link>
<guid>http://regsched.bookinfo.info/archives/2004/12/index.html#004559</guid>
<category></category>
<pubDate>Sat, 11 Dec 2004 14:21:47 -0800</pubDate>
</item>
<item>
<title>Dan Sandler Comments on My RSS Throttling Technique</title>
<description><![CDATA[<p>
<strong><a href="http://dsandler.org/wp/archives/2004/12/11/rss-throttling">Dan points out very adeptly</a></strong> that my idea of trying to throttle aggregation file requests on a per-client basis is riddled with holes. So perhaps my bandwidth gains (losses, as it were) are because I'm feeding out fewer unique requests to users behind generic proxies and other gateways.
</p><p>
Dan <strong><a href="http://dsandler.org/wp/archives/2004/11/09/iris-student-workshop-04">offers a presentation</a></strong> he made on switching RSS from direct polling to peer-to-peer distribution.
</p>]]></description>
<link>http://regsched.bookinfo.info/archives/2004/12/index.html#004558</link>
<guid>http://regsched.bookinfo.info/archives/2004/12/index.html#004558</guid>
<category></category>
<pubDate>Sat, 11 Dec 2004 12:15:22 -0800</pubDate>
</item>
<item>
<title>Regular Sucking Schedule</title>
<description><![CDATA[<p>
Yet another blog. Yes, yes, I know. Just try and stop me.
</p><p>
I've seen a lot of discussion lately about RSS and aggregation bandwidth usage and behavior. This blog will collect details. I also welcome email that I can post. And comments are open. Let's talk about regular sucking schedules.
</p>]]></description>
<link>http://regsched.bookinfo.info/archives/2004/12/index.html#004557</link>
<guid>http://regsched.bookinfo.info/archives/2004/12/index.html#004557</guid>
<category></category>
<pubDate>Sat, 11 Dec 2004 12:13:16 -0800</pubDate>
</item>


</channel>
</rss>