Working with dates and numbers in Bash (and other goodies)

January 1, 2020
Tags: awk, linux, perl, shell_scripting, text_processing

Every month, I purge the files that were trashed more than one month earlier.

Since it’s been scientifically proven that manual operations cause PTSD in system administrators, I’ve made a script.

In this short article, I’ll explain some of the concepts involved, most notably working with dates and numbers in Bash, along with a few other scripting techniques.

The base structure

The purge_trash script is pretty simple:

  • it lists the content of the trash (via trash-list);
  • it extracts the timestamp of each file;
  • if the trashing timestamp is before the threshold, it purges the file (via trash-rm <filename>).

Readers can check the source; here, I’ll explain the most interesting parts.

Writing safer Bash scripts

Bash provides a set of options, some of which make scripts more robust and should always be used.

The extended syntax for setting an option is set -o <extended_option_name>. Generally, all the option settings go at the top of the file.

The next sections will explain the relevant options.
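
As a minimal sketch, the top of a script using the three options covered in the next sections could look like this (the header of the actual purge_trash script may differ; the comments are my own summary):

```bash
#!/usr/bin/env bash

# Option settings grouped at the top, before any other logic.
set -o errexit   # exit when a command fails
set -o nounset   # treat references to unset variables as errors
set -o pipefail  # a pipeline fails if any of its commands fails

echo "options set"
```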

errexit

The option errexit causes the script to terminate when there is an error.

Bash, by default, uses an On Error Resume Next approach (RIP Visual Basic 6 😉), so setting this option is a no-brainer.

However, this requires some cases to be handled: some commands exit with a nonzero status as part of their normal workflow, grep being one of the most common (it exits with status 1 when nothing matches).

There is an exception to this behavior: a command used as the condition of an if will not cause the script to terminate when it fails. In other words, in a script like:

if <command_that_fails>; then
  echo "Extensive error message!"
  exit 1
fi

The “Extensive error message!” will be printed; Bash will not terminate after executing <command_that_fails>.

A typical real-world example is grep filtering; a subsection below is dedicated to this.
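
Here is a small sketch of both situations, using a hypothetical two-line package list (packages) instead of real command output. Inside the if, the failing grep is harmless; outside a conditional, appending || true is one common way to tolerate an expected failure:

```bash
set -o errexit

# Hypothetical sample data standing in for a real package list.
packages=$'vim\ngit'

# Inside an `if` condition, a failing grep (exit status 1: no match)
# does not trigger errexit.
if grep "trash-cli" <<< "$packages" > /dev/null; then
  status="installed"
else
  status="not installed"
fi
echo "trash-cli is $status"

# Outside a conditional, `|| true` is one way to tolerate the expected failure.
matches=$(grep "trash-cli" <<< "$packages" || true)
echo "matches: '$matches'"
```

Both echo commands run, and the script reaches its end despite the two grep failures.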

nounset, with pattern

The option nounset treats a reference to a variable that hasn’t been initialized as an error.

The logic that most commonly needs special treatment when this option is set is the handling of user parameters.

Let’s suppose we write a script with the following definition: myscript [<optional_parameter>], which can translate to:

set -o nounset

if [[ "$1" == "value_1" ]]; then
  # ...
fi

If the user invokes myscript without the parameter, the script will blow up, because $1 hasn’t been initialized!

The Bash functionality handling this case is a form of parameter expansion; the syntax is ${<variable_name>:-<value>}.

If we want to default to an empty string, we can omit <value>; therefore, the myscript specific code will be:

set -o nounset

if [[ "${1:-}" == "value_1" ]]; then
  # ...
fi

This won’t fail.

Note that this expansion considers uninitialized variables and empty strings the same; see:

$ myvar=
$ myvar=${myvar:-myvalue}
$ echo $myvar
myvalue
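
For contrast, Bash also has a form without the colon, ${<variable_name>-<value>}, which substitutes the value only when the variable is unset and keeps an empty string as it is. A quick sketch (the variable names are made up for the example):

```bash
set -o nounset

empty_var=""
unset -v missing_var

with_colon_empty="${empty_var:-default}"       # empty counts as missing -> "default"
without_colon_empty="${empty_var-default}"     # empty is kept           -> ""
without_colon_unset="${missing_var-default}"   # unset                   -> "default"

echo "[$with_colon_empty] [$without_colon_empty] [$without_colon_unset]"
# prints: [default] [] [default]
```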

pipefail

The option pipefail treats a pipeline (a sequence of commands chained via the pipe operator, |) as failed if any of its commands fails.

By default, Bash considers, in a pipeline, only the exit status of the last command; for example:

$ bash -c '
set -o errexit

false | echo "pipeline with error"
echo "following command"
'
pipeline with error
following command

Let’s see what happens with pipefail enabled:

$ bash -c '
set -o errexit
set -o pipefail

false | echo "pipeline with error"
echo "following command"
'
pipeline with error

The “following command” is not executed; the script exited.

Note how set -o errexit is required; pipefail will mark the pipeline as errored, but that alone doesn’t imply that an error will cause an exit.
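
To look at pipefail in isolation, without errexit getting in the way, one can simply inspect the exit status of a pipeline. A minimal sketch (the variable names are only for illustration):

```bash
# Without pipefail: the pipeline's status is that of the last command.
false | true
status_default=$?

set -o pipefail
# With pipefail: the status is that of the rightmost command that failed.
false | true
status_pipefail=$?
set +o pipefail

echo "default: $status_default, pipefail: $status_pipefail"
# prints: default: 0, pipefail: 1
```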

pipefail versus grep -q

A real head-scratcher is grep -q causing a pipeline to fail (and, with errexit, the script to exit) when pipefail is enabled.

grep -q is used to test some data against a pattern without printing the result (“--quiet”).

Let’s suppose we want to test if a package is not installed. In our example, the package trash-cli is installed.

The first version could be¹:

bash -c '
set -o pipefail

if ! dpkg --get-selections | grep -P "trash-cli\t+install"; then
  echo "Package not installed"
fi
'
trash-cli         install

As expected, the string Package not installed is not printed; however, trash-cli install is, since it’s the output of the grep command.

Therefore, one needs to discard grep’s output:

bash -c '
set -o pipefail

if ! dpkg --get-selections | grep -P "trash-cli\t+install" > /dev/null; then
  echo "Package not installed"
fi
'

All good! No more noise.

Now, the problem is that some smartpants (like me 😬) will discover the -q/--quiet option in the grep man page, which supposedly yields the same result:

bash -c '
set -o pipefail

if ! dpkg --get-selections | grep -qP "trash-cli\t+install"; then
  echo "Package not installed"
fi
'
Package not installed

D’oh!!

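What happened? In simple terms, grep -q exits as soon as it finds the first match, which closes the reading end of the pipe; dpkg, which is still writing, then fails with a broken pipe error. With pipefail enabled, that failure becomes the status of the whole pipeline, so the negated condition is true and the message is printed.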
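
The effect can be reproduced without dpkg. In this sketch I use yes as a stand-in for a command that keeps writing after grep has already quit, and the PIPESTATUS array to see which side of the pipeline failed:

```bash
set -o pipefail

# `yes` writes "y" forever; grep -q exits at the first match, so the next
# write by `yes` hits a closed pipe.
yes | grep -q "y"
statuses=("${PIPESTATUS[@]}")   # must be captured right after the pipeline

echo "producer: ${statuses[0]}, grep: ${statuses[1]}"
```

On a typical Linux system, the producer’s status is 141 (128 + 13, the number of SIGPIPE), while grep itself reports success.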

For those who really want to use grep -q (I recognize the appeal of not using > /dev/null), the “here-string” operator (<<<) will do the trick:

bash -c '
set -o pipefail

if ! grep -qP "trash-cli\t+install" <<< "$(dpkg --get-selections)"; then
  echo "Package not installed"
fi
'

All good! No more pipe, no more problems, and smartpants are satisfied 😄

First step: Handling dates, cycling the data

One may occasionally want to process timestamps.

With the aid of the utility date, and the Bash arithmetic expansion ($(( <expression> ))), we can do this easily.

Let’s suppose the input:

$ trash-list
2019-12-19 22:06:09 /path/to/test abc.png
2019-12-04 23:16:48 /path/to/xorgxrdp-0.2.11.tar.gz
2019-12-25 00:15:27 /path/to/probe-data.json.bak
2019-12-26 19:13:43 /path/to/subiquity_notes.md
2019-12-04 23:16:48 /path/to/xrdp-0.9.11.tar.gz
2019-12-25 00:31:20 /path/to/issue_subiquity.txt
2019-12-25 20:57:49 /path/to/ubuntu-mate-18.04.3-desktop-amd64.iso
2019-12-25 00:31:20 /path/to/probe-data.json

We want to print, say, the files trashed more than 15 days before today (01/01/2020). We also want to sort the output, for good UX 😉

In order to work with timestamps, we need to convert them to integers. Let’s see an example:

$ timestamp=$(echo "2019-12-19 22:06:09 /path/to/test abc.png" | awk '{print $1 " " $2 }') # 2019-12-19 22:06:09
$ date -d "$timestamp" +"%s"
1576789569

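There you go. The awk command prints the first two tokens (date and time), and date converts them to the number of seconds since the Unix epoch. The exact value depends on the machine’s time zone (CET in this example).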
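
A sketch of the same conversion with the time zone pinned to UTC, so that the result is reproducible on any machine, plus the reverse conversion via the @<seconds> syntax. Both assume GNU date, as does the -d option used above:

```bash
# Assumes GNU date. TZ is pinned to UTC so the result doesn't depend on
# the machine's time zone.
timestamp="2019-12-19 22:06:09"

seconds=$(TZ=UTC date -d "$timestamp" +"%s")
round_trip=$(TZ=UTC date -d "@$seconds" +"%Y-%m-%d %H:%M:%S")

echo "$seconds"      # 1576793169
echo "$round_trip"   # 2019-12-19 22:06:09
```

The value is 3600 seconds larger than the one printed above, which is consistent with that example having been run on a UTC+1 machine.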

Now, let’s write a basic loop, which prints the sorted output:

$ trash_content=$(trash-list | sort)
$ while IFS= read -r line; do
  echo "$line"
done <<< "$trash_content"
# full list...

The while expression you see above is a common pattern for iterating data (in this case, the output of trash-list | sort) line by line.

The IFS= assignment empties the field separator for the read command only; in this context, its effect is to preserve leading and trailing whitespace.

The -r option of read prevents backslashes in the data (e.g. \n) from being interpreted as escape characters.

The two technicalities above are not required for this dataset, but I write them here for completeness’ sake².
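
To see what the two technicalities actually change, here is a small sketch with a made-up line containing padding and a literal backslash:

```bash
data='  padded\tline  '

# Default: read trims leading/trailing whitespace and treats the backslash
# as an escape character (it is removed, the "t" stays).
read line_default <<< "$data"

# IFS= keeps the whitespace, -r keeps the backslash.
IFS= read -r line_raw <<< "$data"

printf '[%s]\n' "$line_default" "$line_raw"
# prints:
# [paddedtline]
# [  padded\tline  ]
```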

Next bit: extracting a filename. We have several options!

Splitting a string via Perl

We want the tokens from index 2 (base 0) to the last; Awk doesn’t have nice syntax for this, so we use Perl (❤️):

$ echo "2019-12-19 22:06:09 /path/to/test abc.png" | perl -lane 'print "@F[2..$#F]"'
/path/to/test abc.png

This reads “print the entries of the array @F, from the index 2 to the size of the array”, in other words, “read from index 2 to the end”³.

Splitting a string via cut

We can also use cut:

$ echo "2019-12-19 22:06:09 /path/to/test abc.png" | cut -d ' ' -f 3-
/path/to/test abc.png

Read as: use a space as the delimiter (-d ' '), and extract the fields from 3 onwards (-f 3-).

Note how cut uses 1-based indexing (therefore, we select from index 3 onwards).

For files with a fixed structure, we can also index by character:

$ echo "2019-12-19 22:06:09 /path/to/test abc.png" | cut -c 21-
/path/to/test abc.png

Read as: index by characters, from the number 21 onwards (-c 21-).
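
A third option, which is not the one used in the script, is the read builtin itself: when given several variable names, it assigns the rest of the line to the last one, inner whitespace included. This matters for filenames with consecutive spaces, since the Perl one-liner above rejoins the tokens with single spaces. A sketch, with variable names of my choosing:

```bash
line="2019-12-19 22:06:09 /path/to/test  abc.png"   # note the two spaces in the filename

# The last variable receives the remainder of the line, inner whitespace included.
read -r trash_date trash_time filename <<< "$line"

echo "$trash_date $trash_time"   # 2019-12-19 22:06:09
echo "$filename"                 # /path/to/test  abc.png
```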

Dates, arithmetic, and putting it all together

The last bit is the arithmetic:

if (( trashing_timestamp_in_seconds < time_now_in_seconds - threshold_seconds )); then
  echo "File before threshold: $filename"
fi

The double round brackets delimit an arithmetic expression; note how variables don’t require the $.

If we need to assign the result of an arithmetic expression to a variable, this is the syntax:

myvar1=12
myvar2=$(( myvar1 + 2 + 3 )) # 17
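
As a worked example of the arithmetic, this sketch computes the age in days of one trashed file, using two timestamps that appear in this article; the names seconds_per_day, age_in_days and verdict are mine. Bash arithmetic is integer-only, so the division truncates:

```bash
seconds_per_day=$(( 24 * 60 * 60 ))
time_now_in_seconds=1577919970            # "now" in this article's example
trashing_timestamp_in_seconds=1576789569  # 2019-12-19 22:06:09 CET

# Integer division: 1130401 / 86400 is truncated to 13.
age_in_days=$(( (time_now_in_seconds - trashing_timestamp_in_seconds) / seconds_per_day ))

if (( age_in_days >= 15 )); then
  verdict="old"
else
  verdict="recent"
fi
echo "$age_in_days days: $verdict"
# prints: 13 days: recent
```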

Now we can put it all together!

# Result of `trash_content=$(trash-list | sort)`
#
$ trash_content="2019-12-19 22:06:09 /path/to/test abc.png
2019-12-04 23:16:48 /path/to/xorgxrdp-0.2.11.tar.gz
2019-12-25 00:15:27 /path/to/probe-data.json.bak
2019-12-26 19:13:43 /path/to/subiquity_notes.md
2019-12-04 23:16:48 /path/to/xrdp-0.9.11.tar.gz
2019-12-25 00:31:20 /path/to/issue_subiquity.txt
2019-12-25 20:57:49 /path/to/ubuntu-mate-18.04.3-desktop-amd64.iso
2019-12-25 00:31:20 /path/to/probe-data.json"

$ threshold_seconds=$(( 15 * 24 * 60 * 60 ))
$ time_now_in_seconds=1577919970 #  2 Jan 00:06:10 CET 2020

$ while IFS= read -r line || [[ -n "$line" ]]; do
  trashing_timestamp=$(echo "$line" | awk '{print $1 " " $2 }') # 2019-12-19 22:06:09
  trashing_timestamp_in_seconds=$(date -d "$trashing_timestamp" +"%s")

  if (( trashing_timestamp_in_seconds < time_now_in_seconds - threshold_seconds )); then
    filename=$(echo "$line" | perl -lane 'print "@F[2..$#F]"')
    echo "File before threshold: $filename"
  fi
done <<< "$trash_content"
File before threshold: /path/to/xorgxrdp-0.2.11.tar.gz
File before threshold: /path/to/xrdp-0.9.11.tar.gz
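
Outside of a reproducible example, the current time and the threshold wouldn’t be hardcoded. One possible sketch, assuming GNU date (which understands relative expressions such as "15 days ago"), is to let date compute the threshold directly; threshold_timestamp_in_seconds is a name I made up for this variant:

```bash
# Assumes GNU date, which accepts relative expressions like "15 days ago".
time_now_in_seconds=$(date +"%s")
threshold_timestamp_in_seconds=$(date -d "15 days ago" +"%s")

# One of the sample timestamps from the trash listing above.
trashing_timestamp_in_seconds=$(date -d "2019-12-04 23:16:48" +"%s")

if (( trashing_timestamp_in_seconds < threshold_timestamp_in_seconds )); then
  verdict="before threshold"
else
  verdict="within threshold"
fi
echo "$verdict"
```

Note that, with this approach, the threshold may differ by an hour from an exact multiple of 24 hours when a daylight saving change falls in between.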

Conclusion

Although we’ve witnessed an ugly quirk (grep -q), all the concepts introduced in this article, from Bash functionalities to Unix tools, fit together smoothly to produce solid, readable and flexible scripts.

Happy scripting!

Footnotes

¹: I’m using an inappropriately simplified version of the pattern, for simplicity’s sake; see the full script for the exact expression.

²: I’ve also omitted another technicality, || [[ -n "$line" ]], which may be excessive in this context.

³: This is not a rigorous interpretation, but good enough in this context.