January 18, 2005
More from an Aggregator Developer
Andy Henderson, who wrote in with some detailed questions (and got some detailed responses) a few weeks ago, has revised his news aggregator, which is intended for a specialized industry, and has some comments he is willing to share.
Andy writes:
You were good enough to publish some questions I had about implementing good RSS behaviour in my Aggregator. I got some very useful feedback and have published an updated version of my Aggregator at:
http://www.seeita.com/RSSA/RSSA.html
I thought you might like an update on what I found.
I had already implemented ETag and Last-Modified request headers. In .NET it's just a case of using HttpWebResponse.Headers.Get to pick up the ETag and Last-Modified headers and then using HttpWebRequest.IfModifiedSince and HttpWebRequest.Headers.Add("If-None-Match", LastETAGValue) to pass the values back on the next request (provided the date supplied was valid). Then just monitor for a 'Not modified' (HTTP 304) response from the server.
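The same exchange can be sketched in Python for readers who do not use .NET. This is only an illustration of the conditional GET described above, not the Aggregator's actual code; the function names build_conditional_request and fetch_if_changed are hypothetical.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying the validators saved from the last poll."""
    request = urllib.request.Request(url)
    if etag:
        # Echo back the ETag value exactly as the server sent it.
        request.add_header("If-None-Match", etag)
    if last_modified:
        # Echo back the Last-Modified string from the previous response.
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None on 304 Not Modified."""
    request = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as error:
        if error.code == 304:
            # Nothing changed: keep the validators we already hold.
            return None, etag, last_modified
        raise
```

The caller stores the returned ETag and Last-Modified values alongside the feed and passes them in on the next poll.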
Handling ttl (Time To Live) tags in feeds is unfortunately more subjective. It appears the intent of this tag has drifted since its definition. Originally it told file sharing programs the maximum time they should cache data. Since then, however, it's been used by some feeds to specify the minimum polling interval they expect from Aggregators. So, what I've done is implement ttl as a minimum polling interval subject to a blanket minimum of 60 minutes (unless people click a manual 'update now' button). That seems to safely split out the two different kinds of use.
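A minimal Python sketch of that rule, assuming ttl has already been parsed into a number of minutes. The names poll_interval_from_ttl and is_poll_due are hypothetical, and only the 60-minute floor comes from the description above.

```python
MIN_POLL_MINUTES = 60  # blanket minimum described above


def poll_interval_from_ttl(ttl_minutes=None, floor_minutes=MIN_POLL_MINUTES):
    """Treat ttl as a minimum polling interval, never below the floor."""
    if ttl_minutes is None or ttl_minutes <= 0:
        # Missing or nonsensical ttl: fall back to the blanket minimum.
        return floor_minutes
    return max(ttl_minutes, floor_minutes)


def is_poll_due(minutes_since_last_poll, ttl_minutes=None, manual=False):
    """A manual 'update now' request bypasses the interval check."""
    if manual:
        return True
    return minutes_since_last_poll >= poll_interval_from_ttl(ttl_minutes)
```

A feed with a ttl of 15 is still polled no more than once an hour, while a feed with a ttl of 240 is polled no more than once every four hours.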
I have a similar problem with the Syndication tags: updatePeriod and updateFrequency. They describe the feed's update frequency which does not strictly relate to polling frequency. Again, I have implemented a blanket minimum poll interval of 60 minutes. I also assume that 24 hours is a suitable minimum polling interval for any feed updated less frequently than once a day (otherwise it could take up to a week to spot a change in a feed that is updated weekly). I then take values in between as minimum poll intervals.
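One way to express that mapping, sketched in Python. The function name is hypothetical, the month and year lengths are rough approximations, and the defaults ("daily", 1) are the ones the Syndication module specifies for missing elements; the 60-minute and 24-hour bounds are the ones described above.

```python
# Approximate length of each updatePeriod value, in minutes.
PERIOD_MINUTES = {
    "hourly": 60,
    "daily": 24 * 60,
    "weekly": 7 * 24 * 60,
    "monthly": 30 * 24 * 60,   # approximation
    "yearly": 365 * 24 * 60,   # approximation
}

MIN_POLL_MINUTES = 60        # blanket minimum
MAX_POLL_MINUTES = 24 * 60   # never wait longer than a day


def poll_interval_from_syndication(update_period="daily", update_frequency=1):
    """Turn updatePeriod/updateFrequency into a clamped minimum poll interval."""
    period = PERIOD_MINUTES[update_period]
    # updateFrequency is the number of updates within one period.
    interval = period / max(update_frequency, 1)
    return int(min(max(interval, MIN_POLL_MINUTES), MAX_POLL_MINUTES))
```

A weekly feed is therefore checked daily, a feed updated four times an hour is checked hourly, and a feed updated six times a day is checked every four hours.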
I found out more about feed compression. The .NET framework doesn't handle it automatically. However, there is a free .NET library that supports gzip and deflate compression available at:
http://www.icsharpcode.net/OpenSource/SharpZipLib/Default.aspx
There is a good article explaining how to use it at:
http://feralboy.com/log/archives/000420/
Note, however, that the code as supplied uses Microsoft encoding defaults, i.e. UTF-8 encoding. That's right for many pages but not all (if your Aggregator drops or misrepresents UK pound signs you now know why). Note that I did not have to code for the server 406 response. Unless told otherwise, servers assume they can send an uncompressed response - which is what they do if they do not support either gzip or deflate. So it's simply a matter of requesting compression and then checking HttpWebResponse.ContentEncoding to see what you get back. I was disappointed, though, to find out how few feeds supported compression.
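The decompress-then-decode step can be sketched in Python using the standard gzip and zlib modules rather than SharpZipLib. The function name decode_body is hypothetical, and the sketch assumes the request was sent with "Accept-Encoding: gzip, deflate" and that the caller has already worked out the feed's character set from the Content-Type header or the XML declaration.

```python
import gzip
import zlib


def decode_body(raw, content_encoding=None, charset=None):
    """Undo the transfer compression, then decode with the feed's own charset."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        try:
            # Most servers send a zlib-wrapped stream for "deflate"...
            raw = zlib.decompress(raw)
        except zlib.error:
            # ...but some send a raw deflate stream with no zlib header.
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)
    # No Content-Encoding means the server sent the body uncompressed.
    # Falling back to UTF-8 is only a default; a Latin-1 pound sign
    # decoded as UTF-8 comes out as a replacement character.
    return raw.decode(charset or "utf-8", errors="replace")
```

Passing the declared charset through is what keeps characters such as the pound sign intact when a feed is not UTF-8.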
I have decided not to implement the skipDays tag. It is rarely used and, I suspect, getting rarer. It is IMHO fatally flawed. The RSS spec doesn't say how different time zones should interpret 'Sunday', for example. If both feed supplier and aggregator assume GMT then, during the winter, a UK feed that specifies skipping on Saturday and Sunday will not be polled from 1pm Friday local time in some locations and until 11am Monday local time in others. If either feed suppliers or Aggregators assume local times then the skipping could be 'out' by up to 23 hours!
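The arithmetic behind those two times can be checked with a short Python sketch. The UTC-11 and UTC+11 offsets are assumptions chosen to match the 1pm Friday and 11am Monday figures above, the dates are simply a weekend in January 2005, and the helper name local_time is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def local_time(gmt_moment, utc_offset_hours):
    """Express a GMT moment as wall-clock time at a fixed UTC offset."""
    zone = timezone(timedelta(hours=utc_offset_hours))
    return gmt_moment.astimezone(zone)


# A GMT weekend: from Saturday 00:00 GMT until Monday 00:00 GMT.
skip_start = datetime(2005, 1, 22, 0, 0, tzinfo=timezone.utc)  # Saturday
skip_end = datetime(2005, 1, 24, 0, 0, tzinfo=timezone.utc)    # Monday

# Where skipping starts earliest and ends latest in local terms.
start_far_west = local_time(skip_start, -11)
end_far_east = local_time(skip_end, 11)

print(start_far_west.strftime("%A %H:%M"))
print(end_far_east.strftime("%A %H:%M"))
```

At UTC-11 the GMT weekend begins on Friday afternoon local time, and at UTC+11 it does not end until late Monday morning.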
I'm still thinking about the skipHours tag. The RSS spec says times are GMT. Even if that is correctly interpreted by both feed suppliers and Aggregators, it is inappropriate for Aggregators to interpret the tag literally; otherwise a feed made available during New York business hours would never be read by a Singaporean using an Aggregator at work during normal hours. Right now, I'm thinking it is difficult to develop a watertight implementation for limited benefit given:
- Using header tags makes the interaction a very small one
- Very few feeds implement skipHours according to Syndic8.
The Syndication spec says it "supercedes (sic) the RSS 0.91 skipDay and skipHour elements". I'm guessing there are other people out there as wary as I am about them.
I got an excellent suggestion to suspend polling when a PC is inactive. This has two major benefits:
- When people are away from their desks, the Aggregator does not waste bandwidth looking for changes
- When people return, the Aggregator is reactivated and they get to see the latest news.
To start with it looked tricky to implement an inactivity routine because it used to require 'hooking' keyboard and mouse events - and that looks like spyware activity. However, since Windows 2000 (W2K) we have had a new Win32 routine called GetLastInputInfo. Its description talks about keyboard activity but it also tracks mouse movements and clicks. This routine makes it real easy to build in an inactivity monitor - which I've done. I have had to allow people to switch it off, though, since some people leave their PCs switched on overnight to pick up Podcasts and BitTorrents.
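A rough Python sketch of such an inactivity monitor, calling GetLastInputInfo through ctypes. It is not the Aggregator's code: idle_milliseconds only works on Windows, and the should_poll helper, its 15-minute limit and the suspend_when_idle switch are hypothetical stand-ins for the behaviour described above.

```python
import ctypes


class LASTINPUTINFO(ctypes.Structure):
    """Mirror of the Win32 LASTINPUTINFO structure (UINT cbSize, DWORD dwTime)."""
    _fields_ = [("cbSize", ctypes.c_uint32), ("dwTime", ctypes.c_uint32)]


def idle_milliseconds():
    """Milliseconds since the last keyboard or mouse input (Windows only)."""
    info = LASTINPUTINFO()
    info.cbSize = ctypes.sizeof(LASTINPUTINFO)
    if not ctypes.windll.user32.GetLastInputInfo(ctypes.byref(info)):
        raise ctypes.WinError()
    now = ctypes.windll.kernel32.GetTickCount()
    # Both values are 32-bit tick counts, so mask to survive wrap-around.
    return (now - info.dwTime) & 0xFFFFFFFF


def should_poll(idle_ms, idle_limit_ms=15 * 60 * 1000, suspend_when_idle=True):
    """Skip polling while the PC is idle, unless the user switched that off."""
    if not suspend_when_idle:
        # E.g. someone leaving the PC on overnight to collect podcasts.
        return True
    return idle_ms < idle_limit_ms
```

On Windows, the polling loop would call should_poll(idle_milliseconds()) before each scheduled fetch.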
I hope that's helpful. Certainly I've found the feedback to date to be helpful. I remain open to any suggestions to improve or refine my attempts to be well-behaved.
Andy
Posted by Glennf at 03:15 PM