Watcher and Restart Job Mechanism

Asynchronous job executors typically run long-running tasks, use external services to execute jobs, or both.

If an outage occurs, the whole microservice may stall or stop entirely while jobs are running. These jobs are managed by a watcher. Every job executor runs a watcher unless the watcher is disabled. This scheduled thread monitors all running jobs (across all job executors) in order to identify the following scenarios:

Queued Jobs

QUEUED jobs are considered “stalled” if they have not moved to a RUNNING state before a timeout deadline. Stalled QUEUED jobs can be taken by another job executor with an available slot.

  • Any job executor microservice can take this job and move it to a RUNNING state.
  • Any new job executor microservice can take this job at startup and move it to a RUNNING state.
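The queued-job check above can be sketched as follows. This is only an illustration under stated assumptions: the Job record, its queued_at and owner fields, and the claim_stalled_queued() helper are hypothetical names that do not come from the product, and the timeout value is supplied by the caller.

```python
from dataclasses import dataclass
from typing import List, Optional

QUEUED, RUNNING = "QUEUED", "RUNNING"


@dataclass
class Job:
    # Hypothetical job record; field names are assumptions for this sketch.
    job_id: str
    state: str
    queued_at: float              # time (in seconds) the job entered QUEUED
    owner: Optional[str] = None   # executor currently responsible for the job


def is_stalled_queued(job: Job, now: float, queue_timeout: float) -> bool:
    """A QUEUED job is stalled once the timeout deadline has passed."""
    return job.state == QUEUED and now - job.queued_at >= queue_timeout


def claim_stalled_queued(jobs: List[Job], executor_id: str, free_slots: int,
                         now: float, queue_timeout: float) -> List[str]:
    """Take stalled QUEUED jobs, one per available slot, and mark them RUNNING."""
    claimed = []
    for job in jobs:
        if free_slots == 0:
            break  # this executor has no capacity left
        if is_stalled_queued(job, now, queue_timeout):
            job.state = RUNNING
            job.owner = executor_id
            claimed.append(job.job_id)
            free_slots -= 1
    return claimed
```

An executor with two free slots would take at most two stalled jobs and leave the rest QUEUED for other executors or for a later scan.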

Running Jobs

RUNNING jobs are considered “stalled” if they have not updated their progress before the timeout deadline. A stalled RUNNING job is moved to a TIMED_OUT state.

TIMED_OUT jobs can be taken by any other job executor microservice with an available slot. If a job is taken by another job executor, it is moved to a RUNNING state.
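As a rough sketch of the two steps above (marking a stalled RUNNING job as TIMED_OUT, then letting a different executor take it), assuming a hypothetical Job record with a last_progress_at timestamp and an owner field. Resetting the progress clock on takeover is also an assumption of this sketch, not documented behaviour.

```python
from dataclasses import dataclass
from typing import Optional

RUNNING, TIMED_OUT = "RUNNING", "TIMED_OUT"


@dataclass
class Job:
    # Hypothetical job record; field names are assumptions for this sketch.
    job_id: str
    state: str
    last_progress_at: float       # time (in seconds) of the last progress update
    owner: Optional[str] = None   # executor currently responsible for the job


def mark_if_stalled(job: Job, now: float, progress_timeout: float) -> bool:
    """Move a RUNNING job to TIMED_OUT if it has not reported progress in time."""
    if job.state == RUNNING and now - job.last_progress_at >= progress_timeout:
        job.state = TIMED_OUT
        return True
    return False


def take_over(job: Job, executor_id: str, free_slots: int, now: float) -> bool:
    """Let another executor with an available slot take a TIMED_OUT job."""
    if job.state != TIMED_OUT or free_slots < 1 or job.owner == executor_id:
        return False
    job.state = RUNNING
    job.owner = executor_id
    job.last_progress_at = now  # assumption: the progress clock restarts on takeover
    return True
```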

Timed Out Jobs

TIMED_OUT jobs are moved back to a RUNNING state if the original job execution progresses.

TIMED_OUT jobs are considered “stalled” if they still have not updated their progress by a second timeout deadline.

For TIMED_OUT jobs, the resume() method of the same plugin is called:

  • If resume() is implemented and the job progresses, the job is moved to a RUNNING state.
  • If resume() is not implemented or the job has not progressed, the job is marked as FAILED.
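The handling of TIMED_OUT jobs described in this section can be sketched as one decision function. Everything here other than the state names and resume() is an assumption made for illustration: the Job fields, the Plugin base class, the convention that an unimplemented resume() raises NotImplementedError, and the convention that resume() returns True when the job progressed.

```python
from dataclasses import dataclass

RUNNING, TIMED_OUT, FAILED = "RUNNING", "TIMED_OUT", "FAILED"


@dataclass
class Job:
    # Hypothetical job record; field names are assumptions for this sketch.
    job_id: str
    state: str
    timed_out_at: float       # time (in seconds) the job entered TIMED_OUT
    last_progress_at: float   # time (in seconds) of the last progress update


class Plugin:
    """Hypothetical plugin base class: resume() is not implemented by default."""

    def resume(self, job: Job) -> bool:
        raise NotImplementedError


class ResumablePlugin(Plugin):
    """Hypothetical plugin whose resume() reports whether the job progressed."""

    def __init__(self, makes_progress: bool):
        self.makes_progress = makes_progress

    def resume(self, job: Job) -> bool:
        return self.makes_progress


def handle_timed_out(job: Job, plugin: Plugin, now: float,
                     second_timeout: float) -> str:
    """Apply the TIMED_OUT rules and return the resulting state."""
    if job.state != TIMED_OUT:
        return job.state
    if job.last_progress_at > job.timed_out_at:
        # The original execution progressed after the job timed out.
        job.state = RUNNING
    elif now - job.timed_out_at >= second_timeout:
        # Second deadline missed: try to resume through the same plugin.
        try:
            progressed = plugin.resume(job)
        except NotImplementedError:
            progressed = False
        job.state = RUNNING if progressed else FAILED
    return job.state
```

A job that is still within the second deadline and has reported no new progress is left in the TIMED_OUT state until the next scan.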