Executing and killing ruby parallel/background jobs

January 2, 2018 -
Tags: concurrency, linux, ruby, sysadmin

In a project of mine, I’m implementing a feature that runs a background job in order to perform a search; in particular, it needs to support stopping at any time.

There is a variety of strategies to do this in Ruby. In this article I will expose what is the exact outcome of the common strategies, examining how this affects the underlying operating system.

The analysis is targeted to POSIX operating systems, although, at least part of it applies to Windows machines as well.

Contents:

Brief introduction to Ruby concurrency

Ruby concurrency is a largely documented subject; I will recap a few critical points:

  • the reference Ruby implementation (MRI) can’t execute threads in parallel, but can with processes;
  • threads can still be used to achieve effective parallelism if they spend their time waiting on I/O;
  • processes are supposed to be slower to instantiate and more resource consuming, but this applies generally to large-scale processing, and, as with any performance concept, the impact must be always profiled before taking conclusions.

Technical context and preliminary notes

All the concepts in this article refer to the MRI interpreter.

The processes information is checked on an Ubuntu machine, using ps x --forest; only the relevant information is displayed (the bash/upstart processes are displayed for completeness).

The examples are simplified forms of concurrent programming. The statement sleep 0.1 is used to give a reasonable certainty that forked processes did actually start; in a rigorously written concurrent program, this call would be replaced with thread notifications, as there is no absolute guarantee that this amount of time (or any, for that matter) will be enough.

Problem statement

The requirements are:

  • to run a file search in background (backed by the the Linux find shell command);
  • to gather results from a completed search;
  • to have the ability to stop the search at any moment, predictably and cleanly, and possibly, with minimal engineering.

Effect of the common strategies on a Linux operating systems

In this first part, I will expose the most common strategy Ruby provides for performing background jobs (in general).

The code is executed in an irb session inside a Bash shell.

Threads, and Thread#kill

Threads in Ruby (MRI) are convenient as a lightweight framework to perform operations which are blocked by I/O.

Threading suffer in the area of management, though; while killing them is possible, the effects are unspecified.
Java has for example deprecated threads killing long ago, declaring it’s not possible to perform cleanup deterministically.

Ruby threads don’t play well with subshells, for example:

irb> thread = Thread.new { `sleep 10` }
irb> thread.terminate # => #<Thread:0x00000000be6798@(irb):3 dead>

The thread will be killed from a Ruby perspective, but the subshell will still run:

14937 pts/9    Ss     0:00  |   \_ -bash
18837 pts/9    Sl+    0:00  |   |   \_ irb
18841 pts/9    S+     0:00  |   |       \_ sleep 10

After that, a zombie process will also remain (notice defunct and Z+):

14937 pts/9    Ss     0:00  |   \_ -bash
18837 pts/9    Sl+    0:00  |   |   \_ irb
18841 pts/9    Z+     0:00  |   |       \_ [sleep] <defunct>

(more on zombie processes later)

Threads are therefore not a good solution, at least, for the defined problem.

Kernel#fork

Kernel#fork will fork the current process, and execute it.

The most common Ruby forking command is the block version; this is how the functionality is expressed:

irb> child_pid = fork { `sleep 10` }
irb> sleep 0.1
irb> Process.kill('SIGHUP', child_pid)

For the defined use case, using processes has a significant advantage: we know precisely the termination details, as we control the signal sent, and how the processes deal with it.

In the example we send a SIGHUP signal. SIGHUP responses vary by program; in the case of irb and shell sleep/find, they will cleanly terminate, so it’s an appropriate signal for our purpose.

This is the process information, pre-kill:

14937 pts/9    Ss     0:00  |   \_ -bash
21085 pts/9    Sl+    0:00  |   |   \_ irb
21087 pts/9    Sl+    0:00  |   |       \_ irb
21090 pts/9    S+     0:00  |   |           \_ sleep 10

We can see that irb is forked, becoming a child of the top-level interpreter, and then, in turn, it generates a child of its own, with the subshell executing the sleep command.

This is the process information after the kill:

1836  ?        Ss     0:00 /sbin/upstart --user
14937 pts/9    Ss     0:00  |   \_ -bash
21085 pts/9    Sl+    0:00  |   |   \_ irb
21087 pts/9    Z+     0:00  |   |       \_ [ruby] <defunct>
21090 pts/9    S+     0:00  \_ sleep 10

Yikes! The job (sleep) still runs. What’s happening?

It turns out that signals are not cascading, so the child irb (21087) receives the hangup signal, and terminates, but its child (21090) doesn’t, and it’s detached from it, still running in the background.

When a process is detached, it becomes a child of the root process (upstart or init), in this system, upstart.

So now we have two problems:

  1. we still have a zombie process
  2. we’re still not terminating the background job

In the next section, we’ll deal with those pesky zombie processes.

Dealing with zombie processes: Process.detach

As we’ve seen, when a signal is sent to a process, it’s its own task to handle its children; Ruby doesn’t clean forked processes with subshells, though.

We can solve this problem by detaching the child process, which then becomes a child of the root process; this will take care of it:

child_pid = fork { `sleep 10` }
sleep 0.1
Process.detach(child_pid)
Process.kill('SIGHUP', child_pid)

After fork:

 1836 ?        Ss     0:00 /sbin/upstart --user
14937 pts/9    Ss     0:00  |   \_ -bash
23144 pts/9    Sl+    0:00  |   |   \_ irb
23156 pts/9    Sl+    0:00  |   |       \_ irb
23159 pts/9    S+     0:00  |   |           \_ sleep 10

After kill:

 1836 ?        Ss     0:00 /sbin/upstart --user
14937 pts/9    Ss     0:00  |   \_ -bash
23144 pts/9    Sl+    0:00  |   |   \_ irb
23156 pts/9    Z+     0:00  |   |       \_ [ruby] <defunct>
23159 pts/9    S+     0:00  \_ sleep 10

After detach:

 1836 ?        Ss     0:00 /sbin/upstart --user
14937 pts/9    Ss     0:00  |   \_ -bash
23144 pts/9    Sl+    0:00  |   |   \_ irb
23159 pts/9    S+     0:00  \_ sleep 10

The zombie process is now gone; upstart took care of it.

Kernel#spawn

Kernel#spawn, introduced long ago in Ruby 1.9, performs two operations: fork and exec.

exec replaces the current process with the shell execution, effectively terminating the Ruby interpreter:

irb> exec 'sleep 1'; puts 'Other operation'

The second statement won’t be executed; the interpreter will exit after the sleep invocation. So why is this useful?

Sometimes people want to “fire and forget” jobs; spawn is the appropriate tool (for this requirement):

irb> child_pid = spawn 'sleep 10'

This works, without exiting the interpreter, because exec will replace the subprocess. This is the fork + exec equivalent:

irb> child_pid = fork { exec 'sleep 10' }

And this is the processes information:

14937 pts/9    Ss     0:00  |   \_ -bash
18837 pts/9    Sl+    0:00  |   |   \_ irb
19124 pts/9    S+     0:00  |   |       \_ sleep 10

Compare this with using fork { `sleep 10` }:

14937 pts/9    Ss     0:00  |   \_ -bash
19161 pts/9    Sl+    0:00  |   |   \_ irb
19222 pts/9    Sl+    0:00  |   |       \_ irb
19225 pts/9    S+     0:00  |   |           \_ sleep 10

There is a very interesting advantage; since the forked irb process, using spawn, has been replaced with the subshell (19124), we don’t have an intermediate process in the middle, and the signals go directly to the intended background job:

irb> child_pid = spawn 'sleep 10'
irb> sleep 0.1
irb> Process.kill('SIGHUP', child_pid)
irb> Process.detach(child_pid)

Status at the end:

14937 pts/9    Ss     0:00  |   \_ -bash
21330 pts/9    Sl+    0:00  |   |   \_ irb

The sleep background job is now under our direct control!

Sadly, we can’t use spawn to solve the defined problem, because we still want a child interpreter running in order to return the search result.

Introduction to process groups and their usage

While we can theoretically manually track children of children, it’s of course better to find a way to manage this automatically.

Fortunately, Process groups come to the rescue:

In a POSIX-conformant operating system, a process group denotes a collection of one or more processes. Among other things, a process group is used to control the distribution of a signal; when a signal is directed to a process group, the signal is delivered to each process that is a member of the group.

When a process is forked, it inherits its PGID from its parent. The PGID changes when a process becomes a process group leader, then its PGID is copied from its PID. From then on, the new child processes it spawns, and their descendants, inherit that PGID (unless they start new process groups of their own).

By using process groups, we just deal with the forked process, and its child(ren) will receive the intended signal, too.

Note that in the next sections, the Ruby code is executed directly by a the interpreter, and not run in an interactive session.
Also, importantly, since the example is concurrent, there are no strict guarantees of the execution order; the interest is in the (guaranteed) end result.

Non-working usage of process groups

This example shows an example which prints all the information, and attempts to use the process groups strategy to terminate the background job:

puts "Parent pid: #{Process.pid}. pgid: #{Process.getpgrp}"

child_pid = fork do
  puts "Child pid: #{Process.pid}, pgid: #{Process.getpgrp}"

  puts 'Child: long operation...'

  system 'sleep 10'
end

sleep 0.1

pgid = Process.getpgid(child_pid)

puts "Sending HUP to group #{pgid}..."

Process.kill('SIGHUP', -pgid)
Process.detach(child_pid)

puts 'Parent: exiting...'

The notable call is the negative number in Process.kill('SIGHUP', -pgid); the POSIX meaning is that we want to send the signal to the process group, not to the given process.

This is the output:

Parent pid: 4887. pgid: 4887
Child pid: 4889, pgid: 4887
Child: long operation...
Sending HUP to group 4887...
Hangup

The parent doesn’t get to the last statement. Why?

We can see from the logs, that when the fork happens, the child doesn’t get a new process group id - it still remains in the parent process group, which makes sense as a default behavior.

Therefore, when we send the hangup to the child process group, since the parent is in the same group, it will terminate as well, not reaching the last statement.

What we need is to move the child process to its own process group.

Working implementation, using setsid

Moving a process to its own group is performed by setsid; from the Linux man page:

setsid() creates a new session if the calling process is not a process group leader. The calling process is the leader of the new session, the process group leader of the new process group, and has no controlling terminal. The process group ID and session ID of the calling process are set to the PID of the calling process. […]

Ruby supports this directly through Process.setsid; let’s see what happens:

puts "Parent pid: #{Process.pid}. pgid: #{Process.getpgrp}"

child_pid = fork do
  puts "Child pid: #{Process.pid}, pgid: #{Process.getpgrp}"

  Process.setsid

  puts "Child new pgid: #{Process.getpgrp}"

  puts "Child: long operation..."

  system "sleep 10"
end

sleep 5 # for taking the process information

pgid = Process.getpgid(child_pid)

puts "Sending HUP to group #{pgid}..."

Process.kill('HUP', -pgid)

Process.detach(pgid)

puts "Parent: exiting..."

sleep 10

Output:

Parent pid: 30731. pgid: 30731
Child pid: 30733, pgid: 30731
Child new pgid: 30733
Child: long operation...
Sending HUP to group 30733...
Parent: exiting...

We can see on the third line that Process.setsid changed the forked process group id; the parent won’t belong any more to it, so we’re free to send signals without interrupting the parent process.

Processes information before killing:

 1836 ?        Ss     0:00 /sbin/upstart --user
17512 pts/10   Ss     0:00  |   \_ -bash
30731 pts/10   Sl+    0:00  |   |   \_ ruby /tmp/test.rb
30733 ?        Ssl    0:00  |   |       \_ ruby /tmp/test.rb
30736 ?        S      0:00  |   |           \_ sleep 10

and after killing:

 1836 ?        Ss     0:00 /sbin/upstart --user
17512 pts/10   Ss     0:00  |   \_ -bash
30731 pts/10   Sl+    0:00  |   |   \_ ruby /tmp/test.rb

The background process is gone, and we still have the option of managing the child process as we wish (ie. both returning results and terminating).

Conclusion

Although this article doesn’t deal with thread-specific programming, instead with processes, it gives tools to handle with many cases, especially, system(-administration) related programming, where invoking subshells is common.

Using signals and process groups allows the programmer to have a precise and clean control over the lifecycle of a background job, which is crucial for concurrent programming.