7. Shell Scripts

We are finally ready to see what makes the shell such a powerful programming environment. We are going to take the commands we have been repeating and save them in files, so that we can re-run all of those operations again later by typing a single command. For historical reasons, a bunch of commands saved in a file is usually called a shell script, but make no mistake: these are actually small programs.

Let’s start by going back to molecules/ and creating a new file, middle.sh, which will become our shell script:
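    $ cd molecules
    $ nano middle.sh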

The command nano middle.sh opens the file middle.sh within the text editor “nano” (which runs within the shell). If the file does not exist, it will be created. We can use the text editor to directly edit the file – we’ll simply insert the following line:
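    head -n 15 octane.pdb | tail -n 5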

This is a variation on the pipe we constructed earlier: it selects lines 11-15 of the file octane.pdb. Remember, we are not running it as a command just yet: we are putting the commands in a file.

Then we save the file (Ctrl-O in nano), and exit the text editor (Ctrl-X in nano). Check that the directory molecules now contains a file called middle.sh.

Once we have saved the file, we can ask the shell to execute the commands it contains. Our shell is called bash, so we run the following command:
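    $ bash middle.sh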

Sure enough, our script’s output is exactly what we would get if we ran that pipeline directly.

What if we want to select lines from an arbitrary file? We could edit middle.sh each time to change the filename, but that would probably take longer than just retyping the command. Instead, let’s edit middle.sh and make it more versatile:
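    $ nano middle.sh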

Now, within “nano”, replace the text octane.pdb with the special variable called $1:
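    head -n 15 "$1" | tail -n 5

Note that we surround $1 with double-quotes: as with loop variables, this keeps the script working even if the filename happens to contain spaces.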

Inside a shell script, $1 means “the first filename (or other argument) on the command line”. We can now run our script like this:
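    $ bash middle.sh octane.pdb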

or on a different file like this:
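    $ bash middle.sh pentane.pdb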

We still need to edit middle.sh each time we want to adjust the range of lines, though. Let’s fix that by using the special variables $2 and $3 for the number of lines to be passed to head and tail respectively:
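    head -n "$2" "$1" | tail -n "$3"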

We can now run:
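    $ bash middle.sh octane.pdb 15 5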

By changing the arguments to our command we can change our script’s behaviour:
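    $ bash middle.sh octane.pdb 20 5

This prints lines 16-20 of octane.pdb instead of lines 11-15.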

This works, but it may take the next person who reads middle.sh a moment to figure out what it does. We can improve our script by adding some comments at the top:
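    # Select lines from the middle of a file.
    # Usage: bash middle.sh filename end_line num_lines
    head -n "$2" "$1" | tail -n "$3"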

A comment starts with a # character and runs to the end of the line. The computer ignores comments, but they’re invaluable for helping people (including your future self) understand and use scripts. However, each time you change your script remember to check that the comment is still accurate: misleading comments are worse than no comments.

What if we want to process many files in a single pipeline? For example, if we want to sort our .pdb files by length, we would type:
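    $ wc -l *.pdb | sort -n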

Since wc -l counts lines and sort -n sorts numerically, we could put this pipeline in a script, but then it would only ever sort a list of .pdb files in the current directory. To let the script work on whatever files we give it, we use the special variable $@, which means “all of the command-line arguments to the shell script”. As with $1, we put $@ inside double-quotes to handle the case of arguments containing spaces. For example, we could save the following as sorted.sh:
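    # Sort files by their length.
    # Usage: bash sorted.sh one_or_more_filenames
    wc -l "$@" | sort -n

We can then run it on any mix of files, for example (assuming a creatures/ directory of .dat files exists alongside molecules/):

    $ bash sorted.sh *.pdb ../creatures/*.dat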

List Unique Species

Suppose we’ve executed a useful series of commands, like creating a graph for a paper. To ensure we can recreate the graph accurately later, and to avoid the errors that retyping might introduce, we can save those commands in a file with the following command:
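    $ history | tail -n 5 > redo-figure-3.sh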

The file redo-figure-3.sh now contains:
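Its exact contents depend on what we actually typed; in this running example it might look something like:

    297 bash goostats.sh NENE01729B.txt stats-NENE01729B.txt
    298 bash goodiff.sh stats-NENE01729B.txt /data/validated/01729.txt > 01729-differences.txt
    299 cut -d ',' -f 2-3 01729-differences.txt > 01729-time-series.txt
    300 ygraph --format scatter --color bw --borders none 01729-time-series.txt figure-3.png
    301 history | tail -n 5 > redo-figure-3.sh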

After a moment’s work in an editor to remove the serial numbers on the commands, and to remove the final line where we called the history command, we have a completely accurate record of how we created that figure.

In practice, most people develop shell scripts by running commands at the shell prompt a few times to make sure they’re doing the right thing, then saving them in a file for re-use. This style of work allows people to recycle what they discover about their data and their workflow with one call to history and a bit of editing to clean up the output and save it as a shell script.

Nelle’s Pipeline: Creating a Script

Nelle’s supervisor insisted that all her analytics must be reproducible. The easiest way to capture all the steps is in a script.

First we return to Nelle’s data directory:
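    $ cd ../north-pacific-gyre/

(The exact path will depend on where north-pacific-gyre/ sits relative to the current directory.)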

She runs the editor and writes the following:
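    # Calculate stats for data files.
    for datafile in "$@"
    do
        echo $datafile
        bash goostats.sh $datafile stats-$datafile
    done

The echo prints each filename as it is processed, so she can see how far the script has gotten; goostats.sh is the statistics script she has been running on each file.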

She saves this in a file called do-stats.sh so that she can now re-do the first stage of her analysis by typing:
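    $ bash do-stats.sh NENE*A.txt NENE*B.txt

She can also do this:

    $ bash do-stats.sh NENE*A.txt NENE*B.txt | wc -l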

so that the output is just the number of files processed rather than the names of the files that were processed.

One thing to note about Nelle’s script is that it lets the person running it decide what files to process. She could have written it as:
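    # Calculate stats for Site A and Site B data files.
    for datafile in NENE*A.txt NENE*B.txt
    do
        echo $datafile
        bash goostats.sh $datafile stats-$datafile
    done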

The advantage is that this always selects the right files: she doesn’t have to remember to exclude the ‘Z’ files. The disadvantage is that it always selects just those files — she can’t run it on all files (including the ‘Z’ files), or on the ‘G’ or ‘H’ files her colleagues in Antarctica are producing, without editing the script. If she wanted to be more adventurous, she could modify her script to check for command-line arguments, and use NENE*[AB].txt if none were provided. Of course, this introduces another tradeoff between flexibility and complexity.

Variables in Shell Scripts

Find the Longest File With a Given Extension

Debugging Scripts