A watcher runs in every job action executor and monitors all RUNNING jobs to check whether they are progressing. The watcher inspects both ActionProgress and non-ActionProgress jobs.
The following logic is only applicable to ActionProgress jobs:
If a RUNNING job is not progressing, the watcher moves the job to a TIMED_OUT state. The watcher can take the job either from its own job executor microservice or from one of the other available executors.
If a TIMED_OUT job is not progressing, the watcher will attempt to FAIL the job.
Any job executor microservice with an available slot to run new jobs takes the TIMED_OUT jobs before consuming new job requests from the message queue. A basic backpressure mechanism exists to deal with this.
If the original job action executor is still running and has attempted to progress the job, the job is still marked as TIMED_OUT, and the attempt is treated as a fatal error and ignored.
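The ActionProgress rules above can be sketched as a small state machine. This is an illustrative sketch only, not the product's implementation: the `Job` class, the `FAILED` state name, `STALL_THRESHOLD_SECONDS`, and the `watch` and `next_job` functions are all hypothetical names introduced here, and the stall threshold value is an assumption because the text does not state one.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Deque, Optional


class JobState(Enum):
    RUNNING = "RUNNING"
    TIMED_OUT = "TIMED_OUT"
    FAILED = "FAILED"  # assumed name for the state reached when the watcher fails a job


@dataclass
class Job:
    job_id: str
    state: JobState
    last_progress_at: float  # hypothetical timestamp (seconds) of the last reported progress
    uses_action_progress: bool = True


# Assumption: the source does not say how long a job may go without progress.
STALL_THRESHOLD_SECONDS = 300.0


def watch(job: Job, now: float) -> JobState:
    """Apply one watcher pass to a job and return the resulting state."""
    if not job.uses_action_progress:
        return job.state  # this sketch covers ActionProgress jobs only
    if now - job.last_progress_at <= STALL_THRESHOLD_SECONDS:
        return job.state  # the job is still progressing
    if job.state is JobState.RUNNING:
        job.state = JobState.TIMED_OUT
        # Assumption: restart the stall clock so an executor can take the job over.
        job.last_progress_at = now
    elif job.state is JobState.TIMED_OUT:
        job.state = JobState.FAILED
    return job.state


def next_job(timed_out: Deque[str], queue: Deque[str], free_slots: int) -> Optional[str]:
    """Pick the next job id for an executor, preferring TIMED_OUT jobs."""
    if free_slots <= 0:
        return None  # no available slot, so nothing is taken
    if timed_out:
        return timed_out.popleft()  # TIMED_OUT jobs come before new requests
    if queue:
        return queue.popleft()
    return None
```

In this sketch, a stalled RUNNING job needs two consecutive stalled watcher passes to reach the failed state, which leaves a window for another executor with a free slot to pick it up through `next_job`.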