During Launchpad downtimes, many (>1000) imports failed and had to be re-queued semi-manually. The importer would have been better off making tea than queuing imports that were bound to fail.
The circuit breaker
An automatically operated electrical switch designed to protect an electrical circuit <…> a circuit breaker can be reset (either manually or automatically) to resume normal operation.
This looks like a good candidate to avoid import failures while Launchpad is down.
In this automaton representing the behaviour of a circuit breaker, three events are used (remember that here closed == works ;)):
- attempt: we try to use the circuit,
- failure: an undesired event has occurred,
- success: the circuit is working.
The main scenario here is:
closed -- failure --> open -- attempt --> half open -- success --> closed
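To make the automaton concrete, here is a minimal sketch in Python. It is not the importer's actual code: the class name, the method names and the `run` helper are made up for illustration, and sending a failure from half open back to open is an assumption (the usual circuit breaker behaviour) that the main scenario above does not spell out.

```python
class CircuitBreaker:
    """Hypothetical three-state circuit breaker (closed == works)."""

    def __init__(self):
        self.state = 'closed'

    def attempt(self):
        # Trying to use an open circuit puts it on probation.
        if self.state == 'open':
            self.state = 'half open'

    def failure(self):
        # Assumption: a failure opens the circuit from any state,
        # including half open.
        self.state = 'open'

    def success(self):
        # A success confirms the circuit is working again.
        self.state = 'closed'


def run(events):
    """Feed event names to a fresh breaker and return the visited states."""
    breaker = CircuitBreaker()
    trace = [breaker.state]
    for event in events:
        getattr(breaker, event)()
        trace.append(breaker.state)
    return trace


main_scenario = run(['failure', 'attempt', 'success'])
print(main_scenario)  # ['closed', 'open', 'half open', 'closed']
```

While the service is still down, the breaker simply bounces between open and half open: `run(['failure', 'attempt', 'failure', 'attempt'])` ends in 'half open' without ever reaching 'closed'.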
The reality test
A Launchpad rollout happened on Friday 30 September at 08:32. The importer log file said:
2011-09-30 08:32:02,308 - __main__ - INFO - Launchpad is down, re-trying jcifs
2011-09-30 08:34:09,337 - __main__ - INFO - Launchpad *is* back
The successful import took 27 seconds, so the importer knew Launchpad was down for 1 minute 40 seconds (back - down - duration(import)). I asked the Launchpad admins how long the outage lasted on their side and their log said:
2011-09-30 08:33:41 INFO Outage complete. 0:01:40.919527
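For the record, the arithmetic can be checked with a few lines of Python. This is only a sketch: the variable names are mine, the two timestamps are copied from the importer log above, and the 27 seconds is the rounded import duration quoted in the text, so the result can only be trusted to about a second.

```python
from datetime import datetime, timedelta

LOG_FORMAT = '%Y-%m-%d %H:%M:%S,%f'

# Timestamps copied from the importer log.
down = datetime.strptime('2011-09-30 08:32:02,308', LOG_FORMAT)
back = datetime.strptime('2011-09-30 08:34:09,337', LOG_FORMAT)

# Rounded duration of the first successful import, as quoted in the text.
import_duration = timedelta(seconds=27)

# back - down - duration(import)
observed_outage = back - down - import_duration
print(observed_outage)  # 0:01:40.029000

# Outage duration reported by the Launchpad admins' log.
reported_outage = timedelta(minutes=1, seconds=40, microseconds=919527)
difference = abs(reported_outage - observed_outage)
```

The two figures differ by less than a second, which is within the rounding of the import duration.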
Make tea… or not
Another interesting number here is that we retried 498 times during this downtime. This is probably excessive and can be fixed by reducing the importer concurrency while Launchpad is down. Before the circuit breaker, these 498 attempts would have been seen as failures for 498 different packages.
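One way to reduce the concurrency could look like the sketch below. This is an idea, not something the importer does today: the function name, the state strings and the figure of 8 parallel imports are all hypothetical.

```python
def allowed_concurrency(breaker_state, normal_concurrency):
    """Return how many imports may run in parallel for a breaker state.

    Hypothetical policy: full speed while the circuit is closed,
    a single probing import while it is open or half open.
    """
    if breaker_state == 'closed':
        return normal_concurrency
    return 1


# 8 is an arbitrary example value, not the importer's real setting.
for state in ('closed', 'open', 'half open'):
    print(state, allowed_concurrency(state, 8))
```

With a single probe at a time, the number of retries during an outage would depend only on how long each failed attempt takes, not on how many workers are running.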
In the end, not only did we avoid these 498 spurious failures, but the imports were only suspended for as long as Launchpad was down, to the second!
But that’s a bit short to make tea…