Allow apps control over which Worker Role instances are shut down
When you want to downscale your pool of WorkerRole instances, the fabric simply picks one to shut down.
However, this means it might pick an instance that is in the middle of a long-running task.
It would be better if you could somehow "hint" which one to shut down - e.g. one that is currently idle.
There are several ways this could be achieved, including having the fabric ask each role instance whether it is "OK" with the proposed shutdown.
Of course, they might all say "no", in which case you'd have to shut one down anyway... but at least the decision would be within the developer's control.
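The proposed "ask first" protocol could look something like the following minimal sketch. This is purely illustrative - `Instance`, `ok_to_shutdown` and `pick_instances_to_retire` are hypothetical names, not real Azure APIs - and it assumes each instance can cheaply report whether it is busy:

```python
# Hypothetical sketch of the "ask before shutdown" idea.
# None of these names are real Azure APIs.

class Instance:
    def __init__(self, instance_id):
        self.instance_id = instance_id
        self.busy = False  # set True while a long task is running

    def ok_to_shutdown(self):
        # The "hint": an idle instance agrees, a busy one objects.
        return not self.busy


def pick_instances_to_retire(instances, count):
    """Prefer instances that say they are OK to shut down; if too few
    agree, fall back to today's behaviour and retire busy ones anyway."""
    willing = [i for i in instances if i.ok_to_shutdown()]
    unwilling = [i for i in instances if not i.ok_to_shutdown()]
    return (willing + unwilling)[:count]
```

With four instances where #1 and #3 are busy, retiring two would take the idle instances #0 and #2 first, and only dip into busy instances if the scale-down target demanded it.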
Mike Volodarsky commented
Hi Azure team,
We are building a large-scale SaaS app that runs a network of "cache" worker roles in Azure. We absolutely need this feature to be able to retire a specific cache node while keeping the cache nodes we still need operational.
Not having this feature is a huge blocker - enough to make us potentially consider EC2 as an alternative for this system (even though we love Azure storage and are already deeply integrated with the Azure platform).
I feel inclined to add my voice here too. I get the impression the Azure architects think that anyone building a worker service has complete control over the tasks they are running, and that their service's failure-handling logic should therefore also cover Azure's stochastic stop behaviour. But this assumption just doesn't hold in many cases. I am building Azure support into a parametric computing meta-scheduler (alongside traditional cluster, grid and, of course, EC2 support). For all intents and purposes, we run black boxes, usually many hundreds to hundreds of thousands of times in a single experiment.
We don't know anything about the user code, except how to start it. Sometimes the black-box will run for seconds, sometimes days. When we hit the tail-end of an experiment or batch of jobs we obviously want to reduce the service (and therefore billing) footprint. But with Azure I can't do this, because I just don't know whether an idle or active instance is going to be trashed, potentially losing days of work and increasing overall time-to-completion. This is different to transient failures. Those are to be expected occasionally, and you have to live with them. But I have this shutdown issue at the end of every experiment!
Mark Richards commented
I've noticed the fabric always shuts down the instances with the highest instance numbers. There are times when I may be going from 20 to 10 instances while several instances are busy with a long-running task. I'd like the fabric to ask an instance if it is OK to shut down, and if it says no, move on and leave that one alone. Of course, if enough instances say it is not OK, you have a decision to make: shut down as it does today, or just wait...
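If the fabric really does retire the highest-numbered instances first, as observed above, one workaround is to route long-running work to the lowest-numbered free instances, so the instances most likely to be retired stay idle. A minimal sketch of that dispatch rule (all names here are illustrative, not real Azure APIs):

```python
# Workaround sketch, assuming highest instance numbers are removed first
# when scaling down. Illustrative only - not a real Azure API.

def assign_job(free_instance_ids):
    """Route the next long-running job to the lowest-numbered free
    instance, keeping high-numbered instances idle and safe to retire."""
    if not free_instance_ids:
        return None
    return min(free_instance_ids)


def safe_scale_down_target(busy_instance_ids):
    """Since the highest numbers go first, we can only shrink the pool
    down to just above the highest busy instance number."""
    if not busy_instance_ids:
        return 0
    return max(busy_instance_ids) + 1
```

For example, with instances #1 and #3 busy you could safely scale down to 4 instances (keeping #0-#3), but no further without killing in-flight work. This is fragile, though, since it relies on undocumented fabric behaviour, which is exactly why an explicit hint from the app would be better.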