[JENKINS-50294] Essentials Instance Client Health Checking
batmat authored and R. Tyler Croy committed Apr 23, 2018
1 parent b5d8196 commit 958d5d2
Showing 1 changed file with 226 additions and 0 deletions.
226 changes: 226 additions & 0 deletions jep/0000/README.adoc
@@ -0,0 +1,226 @@
= JEP-0000: Essentials Instance Client Health Checking
:toc: preamble
:toclevels: 3
ifdef::env-github[]
:tip-caption: :bulb:
:note-caption: :information_source:
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:
endif::[]

.Metadata
[cols="2"]
|===
| JEP
| 0000

| Title
| Essentials Instance Client Health Checking

| Sponsor
| https://github.com/batmat

// Use the script `set-jep-status <jep-number> <status>` to update the status.
| Status
| Not Submitted :information_source:

| Type
| Standards

| Created
| 2018-04-05
//
//
// Uncomment if there is an associated placeholder JIRA issue.
| JIRA
| https://issues.jenkins-ci.org/browse/JENKINS-50294[JENKINS-50294]
//
//
// Uncomment if there will be a BDFL delegate for this JEP.
//| BDFL-Delegate
//| :bulb: Link to github user page :bulb:
//
//
// Uncomment if discussion will occur in forum other than jenkinsci-dev@ mailing list.
//| Discussions-To
//| :bulb: Link to where discussion and final status announcement will occur :bulb:
//
//
// Uncomment if this JEP depends on one or more other JEPs.
| Requires
| link:https://github.com/jenkinsci/jep/tree/master/jep/300[JEP-300]
//
//
// Uncomment and fill if this JEP is rendered obsolete by a later JEP
//| Superseded-By
//| :bulb: JEP-NUMBER :bulb:
//
//
// Uncomment when this JEP status is set to Accepted, Rejected or Withdrawn.
//| Resolution
//| :bulb: Link to relevant post in the jenkinsci-dev@ mailing list archives :bulb:

|===


== Abstract

The first pillar of _Jenkins Essentials_ is that it is an link:https://github.com/jenkinsci/jep/tree/master/jep/300#auto-update[Automatically Updated Distribution].

To achieve this goal in a durable way, we need to be able to _automatically assess the health_ of a *given* instance.
The scope of this proposal is to design how we decide whether or not to link:https://github.com/jenkinsci/jep/tree/master/jep/302[automatically roll back].

The health status will also be reported back to the backend regularly so that we can compute global health statistics for a given setup, but that is out of scope for the current document.

== Specification

We expect the health-checking process to evolve as we learn, but because the local healthcheck is a critical part of the overall _Essentials_ story, we are deliberately starting _small_.
Once we judge that we have learned enough, we will create new proposals to discuss and document the new checks we want to add.

We will check two URLs:

* the `/login` page
* the `/metrics/evergreen/healthcheck` endpoint

=== Login URL

We check that:

* it is reachable,
* and returns a 200 HTTP status code.
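
The two conditions above could be sketched as follows. This is only an illustration, not the _evergreen-client_ implementation: the function name, the `base_url` argument, the timeout value and the injectable `fetch` parameter are all assumptions made for this example.

```python
import urllib.error
import urllib.request


def login_page_is_healthy(base_url, timeout=10.0, fetch=urllib.request.urlopen):
    """Return True only if <base_url>/login is reachable and answers HTTP 200.

    `fetch` is a hypothetical injection point (defaulting to urlopen) so the
    check can be exercised without a running Jenkins instance.
    """
    url = base_url.rstrip("/") + "/login"
    try:
        with fetch(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status
        # (HTTPError is a URLError subclass): all count as unhealthy here.
        return False
```

Treating every network failure as "unhealthy" rather than "unknown" is a design choice of this sketch; a real client may want to retry before drawing a conclusion.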

=== `/metrics/evergreen/healthcheck` URL

We configure the link:https://github.com/jenkinsci/metrics-plugin/[Metrics Jenkins plugin] to provide a healthcheck under the specified URL.
The returned format, pretty-printed, is the following:

[source,json,title=`/metrics/evergreen/healthcheck` URL output]
----
{
    "disk-space": {
        "healthy": true
    },
    "plugins": {
        "healthy": true,
        "message": "No failed plugins"
    },
    "temporary-space": {
        "healthy": true
    },
    "thread-deadlock": {
        "healthy": true
    }
}
----

From this URL, we check that:

* it returns a 200 HTTP status code
* On the produced JSON
** it is valid JSON
** `plugins.healthy` attribute is `true`
** `thread-deadlock.healthy` attribute is `true`

We are deliberately *not* checking the space-related attributes, at least for now.
The rationale is that upgrading to a new _Essentials_ BOM
footnote:[Bill Of Materials: the configuration file describing what an Essentials release is made of: what exact WAR version, which plugins, etc.]
could consume a bit more disk space and trigger a disk space warning.
We probably do not want to revert a whole upgrade because of this.
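
The rules above, including the deliberate omission of the space-related attributes, could be sketched as follows. The function and constant names are hypothetical, and treating a missing `plugins` or `thread-deadlock` entry as unhealthy is an assumption of this sketch that the proposal does not spell out.

```python
import json

# Checks whose "healthy" flag must be true. "disk-space" and
# "temporary-space" are intentionally absent from this tuple.
REQUIRED_CHECKS = ("plugins", "thread-deadlock")


def healthcheck_passes(status_code, body):
    """Apply the rules above to one /metrics/evergreen/healthcheck response."""
    if status_code != 200:
        return False
    try:
        report = json.loads(body)
    except (TypeError, ValueError):
        return False  # the body is not valid JSON
    if not isinstance(report, dict):
        return False
    for name in REQUIRED_CHECKS:
        entry = report.get(name)
        # Assumption: a missing or malformed entry counts as unhealthy.
        if not isinstance(entry, dict) or entry.get("healthy") is not True:
            return False
    return True
```

With this shape, a response in which only `disk-space` reports `"healthy": false` still passes, so a disk space warning on its own would not trigger a rollback.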

==== Absence of the `metrics` plugin

Making this plugin a part of the healthchecking story obviously makes it a *required* plugin.
So the _evergreen-client_ should make sure it is always present and active when upgrading.
For instance, if it is disabled, or removed from the disk, it *must* be forcefully reinstalled and enabled automatically next time.

If, for some reason, the plugin fails to start, then the healthcheck should fall back to checking only the `/login` page, and report this issue as critical to the backend.
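
The fallback described above could be expressed as a small decision function. This is a sketch under stated assumptions: the `HealthVerdict` type, the function name and its three boolean inputs are hypothetical, and how the client detects that the plugin failed to start is left open by this proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthVerdict:
    healthy: bool          # overall outcome of the local healthcheck
    report_critical: bool  # whether a critical issue must be sent to the backend


def assess_instance(login_ok, metrics_plugin_started, healthcheck_ok):
    """Combine the individual checks into one verdict.

    All three arguments are booleans computed elsewhere (hypothetical inputs).
    """
    if not metrics_plugin_started:
        # Fallback: only /login is considered, and the missing healthcheck
        # endpoint is itself flagged as a critical issue.
        return HealthVerdict(healthy=login_ok, report_critical=True)
    return HealthVerdict(healthy=login_ok and healthcheck_ok, report_critical=False)
```

Keeping the decision separate from the HTTP calls makes each fallback case easy to cover with unit tests.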

==== Metrics plugin Configuration

The plugin is configured using the link:https://github.com/jenkinsci/configuration-as-code-plugin[Configuration As Code] Jenkins plugin, using the following syntax:

[source,yaml,title=Essentials Configuration-as-code file]
----
---
jenkins:
  # [snip other configurations]
  metricsaccesskey:
    accessKeys:
      - key: "evergreen"
        description: "Key for evergreen health-check"
        canHealthCheck: true
        canPing: false
        canThreadDump: false
        canMetrics: false
        origins: "*"
----
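
The access key configured above appears to be the path segment used in the healthcheck URL, as the `/metrics/evergreen/healthcheck` path used throughout this document suggests. A minimal sketch of building that URL, assuming this relationship holds (the function name and the `base_url` argument are hypothetical):

```python
from urllib.parse import quote


def healthcheck_url(base_url, access_key="evergreen"):
    """Build the healthcheck URL for a given Metrics access key.

    Assumption: the URL has the shape <base_url>/metrics/<key>/healthcheck.
    """
    return "{}/metrics/{}/healthcheck".format(
        base_url.rstrip("/"), quote(access_key, safe="")
    )
```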

== Motivation

No existing mechanism covers this area.

== Reasoning

=== Why not leverage the error logging

In the link:https://github.com/jenkinsci/jep/tree/master/jep/304[JEP-304 on _Essentials Client Error Telemetry Logging_], we describe how the Jenkins instance is _publishing_ its error logging.

We are not going to use those logs for now, for the reason stated previously: we do not think we know enough yet about how to use them correctly.
So we are taking a careful path here. In any case, those logs are going to be sent to the backend as one of the data points for assessing the quality of given releases.

Over time, once we have a better idea of what they typically contain and how to use them, it is likely that we will design a new proposal to enrich the healthchecking process performed by the _evergreen-client_.

== Backwards Compatibility

There are no backwards compatibility concerns related to this proposal.

== Security

=== Accessing the /metrics/evergreen/healthcheck URL from outside the container

=== Absence of the `metrics` plugin



TODO

== Infrastructure Requirements

There are no new infrastructure requirements related to this proposal.

== Testing

[TIP]
====
If the JEP involves any kind of behavioral change to code
(whether in a Jenkins product or backend infrastructure),
give a summary of how its correctness (and, if applicable, compatibility, security, etc.) can be tested.
In the preferred case that automated tests can be developed to cover all significant changes, simply give a short summary of the nature of these tests.
If some or all of changes will require human interaction to verify, explain why automated tests are considered impractical.
Then summarize what kinds of test cases might be required: user scenarios with action steps and expected outcomes.
Might behavior vary by platform (operating system, servlet container, web browser, etc.)?
Are there foreseeable interactions between different permissible versions of components (Jenkins core, plugins, etc.)?
Are any special tools, proprietary software, or online service accounts required to exercise a related code path (Active Directory server, GitHub login, etc.)?
When will testing take place relative to merging code changes, and might retesting be required if other changes are made to this area in the future?
If this proposal requires no testing, this section may simply say:
There are no testing issues related to this proposal.
====

== Prototype Implementation

* https://github.com/jenkins-infra/evergreen and more specifically the link:https://github.com/jenkins-infra/evergreen/pull/44[PR-44].

== References

* This proposal relates to link:https://github.com/jenkinsci/jep/tree/master/jep/302[JEP-302: Evergreen snapshotting data safety system] FIXME explain relationship

[TIP]
====
Provide links to any related documents.
====

[IMPORTANT]
====
When moving this JEP from a Draft to "Accepted" or "Final" state,
include links to the pull requests and mailing list discussions which were involved in the process.
====
