This incident has been resolved. Unfortunately, our automated monitoring systems did not alert us to the issue, so it took longer than usual for an ops engineer to intervene.
The automated monitoring system was configured to raise an alarm when it detected long queue times, but not when queue stats were missing (which indicates that the process manager has stopped reporting). We have extended our monitoring alarms to alert us if this situation occurs again.
Our apologies for the inconvenience.
Posted Oct 27, 2018 - 10:34 AEDT
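For readers curious about the monitoring gap described in the update above, here is a minimal sketch of a check that treats missing queue stats as an alarm condition in its own right. It is an illustration only, not our actual monitoring configuration: the function name, the alarm names, and both thresholds are hypothetical.

```python
from typing import Optional

# Hypothetical thresholds; the real values are not stated in this report.
MAX_QUEUE_SECONDS = 120.0       # alarm when 'time on queue' exceeds this
MAX_STATS_AGE_SECONDS = 300.0   # alarm when no stats have arrived for this long


def evaluate_queue_alarm(
    last_queue_seconds: Optional[float],
    last_report_time: Optional[float],
    now: float,
) -> Optional[str]:
    """Return an alarm name, or None when the queue looks healthy."""
    # No stats at all: the process manager may never have reported.
    if last_queue_seconds is None or last_report_time is None:
        return "missing_queue_stats"
    # Stale stats: the process manager may have stopped reporting.
    if now - last_report_time > MAX_STATS_AGE_SECONDS:
        return "missing_queue_stats"
    # Fresh stats: apply the usual long-queue check.
    if last_queue_seconds > MAX_QUEUE_SECONDS:
        return "long_queue_time"
    return None
```

A check that only compares the reported queue time against a threshold stays silent when reports stop arriving, because there is no value to compare. Checking the age of the last report first closes that gap.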
It appears that there was an issue with the process that manages the workers in the transcoding cluster, which caused it to stop raising alarms as the 'time on queue' grew. This in turn stopped it from adding workers to the cluster as needed, which slowed down the preview process.
We are continuing to investigate the issue with the management cluster to prevent this from happening again, but for now a reboot of the cluster seems to have resolved it.
Posted Oct 27, 2018 - 09:44 AEDT
We are investigating reports that it's taking longer than usual for previews to appear.
Posted Oct 27, 2018 - 09:26 AEDT
This incident affected: South East Asia (South East Asia Transcoding Servers), Oceania (Oceania Transcoding Servers), Europe (Europe Transcoding Servers), and USA West (USA West Transcoding Servers).