Most of us have been sysadmins at some point, so folks recognise that systems do fail. Even so, Red Hat needs to communicate earlier and more consistently when issues happen, post updates during the incident and, as we are all geeks, share some details of the root cause analysis.
There was some good discussion about ensuring the signal-to-noise ratio is correct. The initial ideas were:
A new Forum tag that people can subscribe to for email notifications, possibly critical-service-status
Maybe an idea could be to have the configuration in a community repository on GitHub? If any new community endpoints need monitoring then this could be done here? Community members could raise a PR to do this and a GitHub Action could run to update the config on the host running Gatus.
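To make the PR idea a bit more concrete, here is a minimal sketch of a check that a GitHub Action could run on each PR before the config is pushed to the Gatus host. It assumes the config has already been parsed into a dict and that endpoints use `name`, `url` and `conditions` keys, which is how I understand the Gatus format; please check that against the Gatus docs. The function name `validate_endpoints` and the sample endpoints are made up for illustration.

```python
from urllib.parse import urlparse

# Keys every endpoint entry is assumed to need (assumption based on the
# Gatus config format; verify against the Gatus documentation).
REQUIRED_KEYS = ("name", "url", "conditions")


def validate_endpoints(config):
    """Return a list of human-readable problems; an empty list means OK."""
    endpoints = config.get("endpoints")
    if not isinstance(endpoints, list) or not endpoints:
        return ["config must contain a non-empty 'endpoints' list"]

    errors = []
    seen_names = set()
    for index, endpoint in enumerate(endpoints):
        label = f"endpoints[{index}]"
        if not isinstance(endpoint, dict):
            errors.append(f"{label}: must be a mapping")
            continue

        # Report every missing or empty required key.
        for key in REQUIRED_KEYS:
            if not endpoint.get(key):
                errors.append(f"{label}: missing '{key}'")

        # Names should be unique so the status page is unambiguous.
        name = endpoint.get("name")
        if name:
            if name in seen_names:
                errors.append(f"{label}: duplicate name '{name}'")
            seen_names.add(name)

        # Only plain web endpoints are accepted in this sketch.
        url = endpoint.get("url")
        if url:
            parsed = urlparse(url)
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                errors.append(f"{label}: url must be http(s) with a host")
    return errors


if __name__ == "__main__":
    sample = {
        "endpoints": [
            {
                "name": "forum",
                "url": "https://forum.ansible.com",
                "conditions": ["[STATUS] == 200"],
            }
        ]
    }
    print(validate_endpoints(sample) or "config looks fine")
```

The Action would fail the PR if the returned list is non-empty, so a typo in a community-submitted endpoint never reaches the running instance.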
We heavily discussed a workflow of status page → forum → community member. I think this is a good approach, but there should also be other ways to consume the status page in case parts other than Galaxy are unavailable.
Therefore:
Include other parts of community infrastructure on the status page
forum
docs
matrix
probably other things that don’t come to my mind
have additional communication channels (e.g. directly subscribe to mails from the status page)
SEO so that people find the status page in case of a larger blackout of default communication channels
I’m happy to help on this topic or be part of a beta users group
I think that the forum evolved into quite an important part of the Ansible Community. So a status page should cover it, too, because it also can be down. But this would somehow rule out a forum banner. If we also want to cover the status of the forum, a separate and dedicated system would be needed.
Adding a forum banner as well when other important systems are down doesn’t hurt and can even be helpful. This banner could link to the status page when anything but the forum is down, which would make people aware of the status page and advertise it. And if the forum is down, at least some people will hopefully remember the status page and have a look there.
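The banner rule above could be sketched roughly like this. It is only an illustration: `forum_banner`, the service names and the `STATUS_PAGE_URL` value are all hypothetical, and how the banner actually gets set on the forum is left out.

```python
# Hypothetical status page address; the real one is still undecided.
STATUS_PAGE_URL = "https://status.example.org"


def forum_banner(down_services):
    """Return banner text for the forum, or None if no banner is needed.

    The forum itself is ignored: if it is down, nobody can see a banner
    on it anyway, so only outages of other services produce one.
    """
    others = sorted(name for name in down_services if name != "forum")
    if not others:
        return None
    affected = ", ".join(others)
    return f"Service disruption: {affected}. See {STATUS_PAGE_URL} for updates."


if __name__ == "__main__":
    print(forum_banner({"galaxy", "docs"}))
```

Whether this is triggered automatically from the monitoring side or set by hand is a separate decision; the rule for what the banner says stays the same either way.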
@mariolenz You are correct. I’m thinking of a new status.ansible.com which doesn’t use the same hosting/infrastructure as the Forum (or any other Community infrastructure). If there is a service degradation, we would then (automatically or manually) add a Forum banner.
I’d suggest using a different TLD from ansible.com, in case that is down. Perhaps Red Hat has some spare ones that would be suitable? If not, ansible.website is available (£2.99 at Gandi.net for the first year), so status.ansible.website might work? Best to use a different registrar and DNS from ansible.com as well…
Other things that could be monitored (or linked to):
AZP (CI for ansible-core and some collections)
Ansible’s internal testing infrastructure that is used by ansible-test when handling VMs
AZP likely has its own status page as well, and the other might be too specialized (not sure how many folks outside Red Hat have access to it anyway - also I don’t remember it having problems so far, as opposed to many other things).
I’ll need to double check, though from memory ansible.com DNS is managed by Cloudflare. If that’s down, it’s fairly likely a large chunk of the Internet is as well.
Red Hat will happily host an EC2 instance for this, and once the prototype is working, we could look at high(er) availability with a shared DB, etc. if we feel that’s useful.
That’s the power of open source right there.
Nice, thanks for that and confirming we can link to other status pages.