December 20, 2004
Give Poor Aggregators Less
John Wilson has a great idea about dealing with bad aggregators: give them less. He has reconfigured his copy of WordPress, the blogging software, to provide tiny summaries to aggregators that fail to send the HTTP headers that demonstrate polite behavior. This is a nifty idea, and one that I hope we can convince blogging software makers to adopt. It's a pretty trivial change, but it really depends on how they wire header parsing into response generation. I'd like a checkbox in Movable Type that says "restrict text to subject only for aggregators that fail to demonstrate that they understand what changed content means."
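A minimal sketch of the "give them less" idea, in Python rather than WordPress's own code. It assumes that "the right HTTP headers" means the conditional-GET headers If-None-Match and If-Modified-Since; the post does not say exactly which headers John checks, and the function names here are hypothetical.

```python
def is_polite(request_headers):
    """Assumption: a polite aggregator sends a conditional-GET header."""
    names = {name.lower() for name in request_headers}
    return "if-none-match" in names or "if-modified-since" in names


def summarize(text, max_words=20):
    """Cut text to its first max_words words, marking the cut with ' ...'."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + " ..."


def feed_description(request_headers, full_text, max_words=20):
    """Return the full post for polite requests and a tiny summary otherwise."""
    if is_polite(request_headers):
        return full_text
    return summarize(full_text, max_words)
```

One consequence of this particular test: an aggregator's very first request carries no conditional headers, so even a well-behaved client would see summaries until its second poll.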
Posted by Glennf at 08:48 PM | Comments (1)
Aggregator Developer
Andy Henderson wrote in with some very interesting questions that he is allowing me to share. Please post comments or reply to him.
Andy writes:
I am a developer of an Aggregator - the CITA RSS Aggregator available from www.SeeITA.com/RSSA/RSSA.html. I developed it for a specific target market and I do not expect its use to grow to the extent that it will materially affect any RSS servers. However, I want to be responsible so I am trying to take suggestions from this forum seriously. I have already implemented ETag and Last-Modified header processing and now I am considering Randy Charles Morin's HowTo document to improve my Aggregator's behaviour.
It gives rise to several issues:
1) How should I understand the word 'Hint' in this context? Is the information an instruction that the Aggregator should obey, or is it a suggestion the Aggregator should convey to its user and allow them to decide how they act on it?
2) The skipHours/skipDays tags give me several concerns:
a) A straw poll of 29 sites that I monitor reveals that none of them implement skipHours or skipDays. Syndic8 says that 1.98% of feeds use skipHours and 0.18% use skipDays. That's a pretty small minority to code for.
b) How many RSS publishers properly understand that the hours are in GMT? Most Americans I have met (I'm a Brit) think London is GMT - but that's only for half the year. Alternatively, how many will simply use their local time zone by mistake?
c) There is an obvious issue when people are in different time zones. For example, a reader in Singapore has no working hours overlapping with working hours in, say, New York. Simple implementation of the skipHours tag could mean that an Aggregator would never poll some feeds for some people.
d) Similarly, not polling on, say, Sundays is subject to interpretation. Is it Sunday for the reader or the publisher that the Aggregator should exclude from polling? I guess it has to be subject to the reader's time zone, but does that implement the publisher's expectation?
3) ttl is more widely implemented - Syndic8 says 7.74%. Use of the Syndication module will push this up a bit. However, I find that most sites specify hourly caching or less. My Aggregator's minimum poll interval is one hour, so implementing ttl would increase polling in many cases! That having been said, I plan to implement the higher of ttl and hourly polling as a minimum polling interval - subject to question 1 above.
4) I'm obviously interested in the Accept-Encoding tag because it looks like everyone wins from correct implementation of that one. However:
a) I use the .net HttpWebRequest class to read RSS feeds but I can't find any authoritative statement about whether this tag is implemented automatically in the .net framework, or not. I can't see why it wouldn't be (and the referenced w3.org document suggests that, by default, the server can send a compressed response), but we are talking Microsoft here.
b) As I understand it, the reader tries Accept-Encoding with, say, "gzip,deflate". If it gets a 406 (Not Acceptable) status code back, it has to try again without the Accept-Encoding. But doesn't this mean a double hit on servers that don't support compression? OK, it could remember the initial response, but suppose the server is upgraded to support compression later on?
c) I can't find any tutorial on how to handle compressed responses from a server. If .net doesn't handle them automatically, can anyone point me in the right direction?
Sorry if my naivety is showing. Obviously, I don't expect you to be able to answer my questions but maybe one of your readers will be able to help.
Andy Henderson Constructive IT Advice Andy@SeeITA.com
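On Andy's question 4, here is a hedged sketch of handling compressed responses by hand. It is written in Python, not .NET, so it says nothing about what HttpWebRequest does automatically. As I read the HTTP/1.1 specification, the uncompressed ("identity") encoding stays acceptable unless the client explicitly rules it out, so a server without compression support should normally just reply uncompressed, with no Content-Encoding header, rather than with a 406. If that holds, no second request is needed. The function names are hypothetical.

```python
import gzip
import urllib.request
import zlib


def decode_body(raw, content_encoding):
    """Undo the Content-Encoding the server reports; pass other bodies through."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # The specified form: deflate data inside a zlib wrapper.
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send raw deflate data with no wrapper.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    return raw


def fetch_feed(url, timeout=30):
    """Request a feed, advertising gzip and deflate, and return decoded bytes."""
    request = urllib.request.Request(
        url, headers={"Accept-Encoding": "gzip, deflate"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)
```

Because the decision is driven by the response's Content-Encoding header on every fetch, nothing has to be remembered per server, and a server that adds compression later is picked up on the next poll.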
Posted by Glennf at 04:17 PM | Comments (1)
Throttle Off
I've had to turn my RSS throttle off until I figure out a more reliable solution. I was definitely breaking various people's tools, feeds, aggregators, and suppressing more than I wanted to.
Also, the load on the machine that handles the many thousands of MySQL-based queries, often bunched right at the top of the hour, was more than it needed to be. It wasn't stressing the machine, but I have too many demands on my systems to add that load and still have good performance right now.
So I'll leave the throttle off through Epiphany, and we'll see what materializes after I look at revised statistics.
Posted by Glennf at 02:40 PM | Comments (0)
December 17, 2004
Specific Advice and Configuration Tips for RSS Goodness
Randy Charles Morin sends in a link to his very detailed advice on how to configure Web servers and modify RSS feeds, along with marching orders for aggregator developers, all aimed at reducing the bandwidth wasted by RSS today.
A main point here is that aggregators need to be well behaved. Publishers can tweak all we want, but lightly implemented or buggy aggregators that don't observe Web server and feed directives and responses are (in my opinion) most of the problem. That's not to say that those of us running servers can't tweak our settings to improve matters, too!
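As an illustration of what observing feed directives could look like on the aggregator side, here is a small Python sketch for the ttl, skipHours and skipDays elements. The one-hour floor, the function names, and the choice to read skipDays in GMT (the RSS 2.0 specification states GMT for skipHours) are assumptions of the sketch, not something Randy's document is known to prescribe.

```python
from datetime import datetime, timedelta, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
MIN_POLL_MINUTES = 60  # assumed floor chosen by the aggregator


def poll_interval_minutes(ttl_minutes, floor_minutes=MIN_POLL_MINUTES):
    """Use the feed's ttl, but never poll more often than the floor."""
    if ttl_minutes is None:
        return floor_minutes
    return max(ttl_minutes, floor_minutes)


def may_poll(now, skip_hours=(), skip_days=()):
    """Check skipHours/skipDays against GMT, wherever the reader is."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now_utc = now.astimezone(timezone.utc)
    if now_utc.hour in skip_hours:
        return False
    return DAY_NAMES[now_utc.weekday()] not in skip_days


if __name__ == "__main__":
    # A reader in Singapore (UTC+8) early on a Monday morning: in GMT it
    # is still Sunday evening, so a feed that skips Sundays is not polled.
    singapore = timezone(timedelta(hours=8))
    monday_morning = datetime(2004, 12, 20, 7, 30, tzinfo=singapore)
    print(may_poll(monday_morning, skip_days={"Sunday"}))
```

Converting to GMT before checking keeps the publisher's intent intact regardless of the reader's time zone, at the cost of skipped hours landing at odd local times for distant readers.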
Posted by Glennf at 10:43 AM | Comments (2)
December 16, 2004
Tristan Louis on Ways to Cope with RSS
Tristan Louis sent me this link to a post he wrote a few months ago that's perfectly relevant today. He recommends server-side improvements that could help reduce the bandwidth impact of aggregators.
Tristan also pointed to this great summary of RSS issues at NetCraft from October. It covers quite a lot of the current ground.
Posted by Glennf at 03:10 PM | Comments (0)
December 15, 2004
Slashdot's Jamie on RSS Throttling
Jamie McCarthy writes very specifically about how to implement Slashdot's approach to throttling aggressive little buggers trying to request-bomb your site, intentionally or not. I'm not sure that on lower-trafficked sites this strategy works. The problem with RSS over-requesting is often that many thousands of aggregators might request a page 24 times a day when they need it once or twice. That's a hard thing to throttle, as I'm discovering.
Jamie's post is dissected by Dan Sandler.
Posted by Glennf at 04:37 PM | Comments (0)
December 12, 2004
Julian Bond's Three RSS Conclusions
Julian Bond writes about how old resources he created that use RSS never die: they just continue to get pounded on by abandoned projects. Is there an Internet term for abandoned resources that live on? Not zombies. Perhaps The Und3@d?
His conclusions, brief already, rewritten to be even terser by yours truly: Crappy aggregators abandon ship. Opportunistic programmers abandon ship. Abandoned ships are abandoned. How long has RSS been the Mary Celeste?
Posted by Glennf at 08:15 AM | Comments (1)
December 11, 2004
It's the Server, Stupid!
Chris DeSalvo made the point about two weeks ago--but it's still timely--that servers, not clients, are really the problem in the RSS world. Well-defined servers should be sending appropriate responses for If-Modified-Since and ETags.
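To make "appropriate responses" concrete, here is a hedged Python sketch of the decision a server faces on each feed request: answer 304 Not Modified when the client's validators still match, and 200 with the full feed otherwise. The function name and the simplified header handling are assumptions for illustration; this is not Chris's code or any particular server's implementation.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _strip_weak(tag):
    """Drop the W/ prefix so weak and strong ETags compare equal."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def conditional_status(request_headers, etag, last_modified: datetime):
    """Return 304 if the client's copy is current, else 200."""
    headers = {name.lower(): value for name, value in request_headers.items()}

    # If-None-Match takes precedence over If-Modified-Since when present.
    if_none_match = headers.get("if-none-match")
    if if_none_match is not None:
        candidates = [_strip_weak(tag) for tag in if_none_match.split(",")]
        matched = "*" in candidates or _strip_weak(etag) in candidates
        return 304 if matched else 200

    if_modified_since = headers.get("if-modified-since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            # HTTP dates carry whole seconds only.
            current = last_modified.astimezone(timezone.utc).replace(microsecond=0)
            return 304 if current <= since else 200
        except (TypeError, ValueError):
            return 200  # unparseable or zone-less date: send the feed

    return 200
```

A 304 response carries no body, so every matching poll costs the publisher a few hundred bytes of headers instead of the whole feed.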
Posted by Glennf at 08:30 PM | Comments (1)
Statistics a-Boing-Boing
BoingBoing has made their statistics readily available. They get about 300,000 page views a day and feed over 3 GB per day in HTML--and 3 GB per day in aggregation feeds (XML).
Posted by Glennf at 06:08 PM | Comments (1)
My Posts on Throttling
I posted three items over the last few weeks about RSS bandwidth use and my attempts to throttle it back. The first one, on Nov. 13, shows a chart of usage and how rapidly it's grown. My second, on Nov. 20, shows my attempts to throttle usage through a script that quickly identified unpleasantly frequent aggregators. My third, on Dec. 7, shows that the throttling dropped bandwidth usage down to nearly 50 percent.
But as two Dans discuss--see earlier posts on this blog linking to their comments--my approach is probably restricting multiple users behind NAT and anonymous proxies. More rethink needed to strike a balance between appropriate throttling and losing subscribers.
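For readers who want to see the shape of the problem, here is a minimal in-memory sketch of per-client throttling in Python. It is not my actual script, which runs against MySQL; the class name, the limits, and the use of a bare IP address as the client key are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class Throttle:
    """Allow at most max_requests per client key in a sliding time window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client key -> times of served requests

    def allow(self, client_key, now):
        """Return True and record the request if the client is under its limit."""
        hits = self._hits[client_key]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    throttle = Throttle(max_requests=1, window_seconds=3600)
    shared_ip = "203.0.113.7"  # two different readers behind one NAT gateway
    print(throttle.allow(shared_ip, now=0))    # first reader's request
    print(throttle.allow(shared_ip, now=300))  # second reader's request
```

Keyed by IP address alone, the second reader in the example is indistinguishable from the first polling too often, which is exactly the NAT and proxy problem described above.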
Posted by Glennf at 02:25 PM | Comments (0)
Dan Sheridan Suggests RSS GUID
Another very cool idea on RSS throttling: have your system assign a unique GUID to handle the user-side tracking through firewalls, proxies, anonymizers, etc. This Dan (do all Dans write about aggregation?) notes that I'd have to update all existing feeds to get the best benefit. But this is a good moving-forward idea. Perhaps a migration would be in order over time.
I can see how to implement this using PHP, but there is a load factor. I would want every page load with an RSS icon on it to have a unique session ID--unique in a MySQL database table--rather than a redirect.
If I just had rss.xml -> rss_GUID.xml calculated through my throttling script, then aggregators that don't honor redirects during subscription would still request the old file time and again. Actually, that's probably okay. That would mean less-sophisticated aggregators would be restricted across IP + proxy + http_via, while more sophisticated (and thus better behaved) aggregators would be throttled just by their GUID.
There are some issues about what happens to differentiate the same aggregator that keeps getting new GUIDs that it doesn't "recall" over time. More thinking on this is necessary to avoid restricting users.
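The two-tier scheme above can be sketched as a function that picks the throttling key for each request. This is written in Python rather than PHP and is only an outline under stated assumptions: the rss_GUID.xml naming follows the post, but the 32-hex-digit GUID format, the choice of X-Forwarded-For as the "proxy" value, and the function names are hypothetical.

```python
import re
import uuid

# Assumed format: /rss_<32 hex digits>.xml
GUID_FEED = re.compile(r"^/rss_([0-9a-fA-F]{32})\.xml$")


def new_feed_path():
    """Mint a per-subscriber feed path to redirect a new subscriber to."""
    return "/rss_" + uuid.uuid4().hex + ".xml"


def throttle_key(path, remote_addr, headers):
    """Throttle by GUID when the path carries one, else by network identity."""
    match = GUID_FEED.match(path)
    if match:
        return "guid:" + match.group(1).lower()
    names = {name.lower(): value for name, value in headers.items()}
    parts = [
        remote_addr,
        names.get("x-forwarded-for", ""),  # assumed stand-in for "proxy"
        names.get("via", ""),
    ]
    return "net:" + "|".join(parts)
```

The sketch does nothing about an aggregator that keeps collecting fresh GUIDs without remembering them; each new GUID would start with a clean slate, so that case still needs separate handling.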
Posted by Glennf at 02:21 PM | Comments (0)
Dan Sandler Comments on My RSS Throttling Technique
Dan points out very adeptly that my idea of trying to throttle aggregation file requests on a per-client basis is riddled with holes. So perhaps my bandwidth gains (losses, as it were) are because I'm feeding out fewer unique requests to users behind generic proxies and other gateways.
Dan offers a presentation he made on switching RSS from direct polling to peer-to-peer distribution.
Posted by Glennf at 12:15 PM | Comments (0)
Regular Sucking Schedule
Yet another blog. Yes, yes, I know. Just try and stop me.
I've seen a lot of discussion lately about RSS and aggregation bandwidth usage and behavior. This blog will collect details. I also welcome email that I can post. And comments are open. Let's talk about regular sucking schedules.
Posted by Glennf at 12:13 PM | Comments (1)