Folks recognise that systems do fail (most of us have been sysadmins at some point), but Red Hat needs to communicate earlier and more consistently when issues do happen, provide updates during the incident and, as we are all geeks, share some details from the root cause analysis.
There was some good discussion about keeping the signal-to-noise ratio right. The initial ideas were:
A new forum tag that people can subscribe to for email notifications, possibly critical-service-status
Maybe the Gatus configuration could live in a community repository on GitHub? Any new community endpoints that need monitoring could then be added there: community members could raise a PR, and a GitHub Action could run to update the config on the host running Gatus.
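To make the PR idea a bit more concrete, here is a minimal sketch of a check that a GitHub Action could run on each PR before the config is deployed. Everything in it is an assumption for illustration: the function names, the `ALLOWED_DOMAINS` allow-list, the `example.org` hosts, and the entry keys (`name`, `url`, `conditions`), which are only modelled on the kind of fields a Gatus endpoint entry has. In a real workflow the entries would be loaded from the config file in the repository rather than written inline.

```python
from urllib.parse import urlparse

# Hypothetical allow-list of community domains; replace with the real ones.
ALLOWED_DOMAINS = {"example.org"}

# Keys every endpoint entry is expected to carry (an assumption for this sketch).
REQUIRED_KEYS = ("name", "url", "conditions")


def validate_endpoint(endpoint: dict) -> list[str]:
    """Return the problems found in one endpoint entry (empty list if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if not endpoint.get(key):
            problems.append(f"missing or empty '{key}'")
    url = endpoint.get("url", "")
    if url:
        parsed = urlparse(url)
        host = parsed.hostname or ""
        if parsed.scheme != "https":
            problems.append("url must use https")
        # Accept the allowed domain itself or any of its subdomains.
        if not any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS):
            problems.append(f"host '{host}' is not an allowed community domain")
    return problems


def validate_config(endpoints: list[dict]) -> dict[str, list[str]]:
    """Map each failing endpoint name to its problems, flagging duplicate names."""
    report = {}
    seen = set()
    for index, endpoint in enumerate(endpoints):
        name = endpoint.get("name") or f"<entry {index}>"
        problems = validate_endpoint(endpoint)
        if name in seen:
            problems.append("duplicate name")
        seen.add(name)
        if problems:
            report[name] = problems
    return report


if __name__ == "__main__":
    # Inline sample data standing in for a community-submitted config.
    sample = [
        {"name": "forum", "url": "https://forum.example.org", "conditions": ["[STATUS] == 200"]},
        {"name": "rogue", "url": "http://example.com", "conditions": []},
    ]
    for name, problems in validate_config(sample).items():
        print(f"{name}: {', '.join(problems)}")
```

A check like this would let maintainers merge community PRs with a bit more confidence, since malformed or off-domain entries are caught before the Action touches the Gatus host.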
We heavily discussed a workflow of status page → forum → community member. I think this is a good approach, but there should also be other ways to consume the status page in case parts other than Galaxy are unavailable.
Therefore:
Include other parts of community infrastructure on the status page
forum
docs
matrix
probably other things that don't come to mind right now
Have additional communication channels (e.g. subscribing directly to emails from the status page)
Work on SEO so that people can find the status page during a larger outage of the default communication channels
I'm happy to help on this topic or be part of a beta users group
I think that the forum has evolved into quite an important part of the Ansible Community. So a status page should cover it too, because it can also be down. But this would more or less rule out relying on a forum banner alone, since a banner cannot be shown while the forum itself is down. If we also want to cover the status of the forum, a separate and dedicated system would be needed.
Adding a forum banner as well when other important systems are down doesn't hurt and can even be helpful. This banner could link to the status page when anything but the forum is down, which would make people aware of the status page and advertise it. And if the forum is down, at least some people will hopefully remember the status page and have a look there.