4 Creating Command-line Tools

Throughout the book, I’ll introduce you to many commands and pipelines that basically fit on one line. These are known as one-liners or pipelines. Being able to perform complex tasks with just a one-liner is what makes the command line powerful. It’s a very different experience from writing and using traditional programs.

Some tasks you perform only once, and some you perform more often. Some tasks are very specific and others can be generalized. If you need to repeat a certain one-liner on a regular basis, it’s worthwhile to turn this into a command-line tool of its own. So, both one-liners and command-line tools have their uses. Recognizing the opportunity requires practice and skill. The advantages of a command-line tool are that you don’t have to remember the entire one-liner and that it improves readability if you include it in some other pipeline. In that sense, you can think of a command-line tool as similar to a function in a programming language.

The benefit of working with a programming language, however, is that the code is stored in one or more files. This means that you can easily edit and reuse that code. If the code has parameters, it can even be generalized and reapplied to problems that follow a similar pattern.

Command-line tools have the best of both worlds: they can be used from the command line, accept parameters, and only have to be created once. In this chapter, you’re going to get familiar with creating command-line tools in two ways. First, I explain how to turn those one-liners into reusable command-line tools. By adding parameters to your commands, you can add the same flexibility that a programming language offers. Subsequently, I demonstrate how to create reusable command-line tools from code that’s written in a programming language. By following the Unix philosophy, your code can be combined with other command-line tools, which may be written in an entirely different language. In this chapter, I will focus on three programming languages: Bash, Python, and R.

I believe that creating reusable command-line tools makes you a more efficient and productive data scientist in the long run. You will gradually build up your own data science toolbox from which you can draw existing tools and apply them to problems you have encountered previously. It requires practice to recognize the opportunity to turn a one-liner or existing code into a command-line tool.

In order to turn a one-liner into a shell script, I’m going to use a tiny bit of shell scripting. This book only demonstrates a small subset of concepts from shell scripting, including variables, conditionals, and loops. A complete course in shell scripting deserves a book of its own, and is therefore beyond the scope of this one. If you want to dive deeper into shell scripting, I recommend the book Classic Shell Scripting by Arnold Robbins and Nelson H. F. Beebe ⁴⁶.

4.1 Overview

In this chapter, you’ll learn how to:

Convert one-liners into parameterized shell scripts
Turn existing Python and R code into reusable command-line tools

This chapter starts with the following files:

$ cd /data/ch04
 
$ l
total 32K
-rwxr-xr-x 1 dst dst 400 Dec 14 11:46 fizzbuzz.py*
-rwxr-xr-x 1 dst dst 391 Dec 14 11:46 fizzbuzz.R*
-rwxr-xr-x 1 dst dst 182 Dec 14 11:46 stream.py*
-rwxr-xr-x 1 dst dst 147 Dec 14 11:46 stream.R*
-rwxr-xr-x 1 dst dst 105 Dec 14 11:46 top-words-4.sh*
-rwxr-xr-x 1 dst dst 128 Dec 14 11:46 top-words-5.sh*
-rwxr-xr-x 1 dst dst 647 Dec 14 11:46 top-words.py*
-rwxr-xr-x 1 dst dst 584 Dec 14 11:46 top-words.R*

The instructions to get these files are in Chapter 2. Any other files are either downloaded or generated using command-line tools.

4.2 Converting One-liners into Shell Scripts

In this section I’m going to explain how to turn a one-liner into a reusable command-line tool. Let’s say that you would like to get the most frequent words used in a piece of text. Take the book Alice’s Adventures in Wonderland by Lewis Carroll, which is, like many other great books, freely available on Project Gutenberg.

$ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | trim
The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis …
 
This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.
 
… with 3751 more lines

The following sequence of tools, or pipeline, should do the job:

$ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | ➊
> tr '[:upper:]' '[:lower:]' | ➋
> grep -oE "[a-z\']{2,}" | ➌
> sort | ➍
> uniq -c | ➎
> sort -nr | ➏
> head -n 10 ➐
   1839 the
    942 and
    811 to
    638 of
    610 it
    553 she
    486 you
    462 said
    435 in
    403 alice

➊ Download an ebook using curl.
➋ Convert the entire text to lowercase using tr⁴⁷.
➌ Extract all the words using grep⁴⁸ and put each word on a separate line.
➍ Sort these words in alphabetical order using sort⁴⁹.
➎ Remove all the duplicates and count how often each word appears in the list using uniq⁵⁰.
➏ Sort this list of unique words by their count in descending order using sort.
➐ Keep only the top 10 lines (i.e., words) using head.
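
To see what each stage contributes, the same pipeline can be run on a sentence small enough to check by hand. This is only a minimal sketch: the sample sentence and the variable names text and top are made up for illustration.

```shell
# Made-up sample text, small enough to count by hand.
text='The cat and the dog. The end.'

# Same stages as the pipeline above: lowercase, one word per line,
# sort, count duplicates, sort by count, keep the top line.
top=$(printf '%s\n' "$text" |
  tr '[:upper:]' '[:lower:]' |
  grep -oE "[a-z']{2,}" |
  sort |
  uniq -c |
  sort -nr |
  head -n 1)

echo "$top"
```

The word "the" occurs three times in the sample, so the single line that remains is "the" preceded by its count.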

Those words indeed appear the most often in the text. Because those words (apart from the word “alice”) appear very frequently in many English texts, they carry very little meaning. In fact, these are known as stopwords. If we get rid of those, we keep the most frequent words that are related to this text.

Here’s a list of stopwords I’ve found:

$ curl -sL "https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/
stopwords-en.txt" |
> sort | tee stopwords | trim 20
10
39
a
able
ableabout
about
above
abroad
abst
accordance
according
accordingly
across
act
actually
ad
added
adj
adopted
ae
… with 1278 more lines

With grep we can filter out the stopwords right before we start counting:

$ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |
> tr '[:upper:]' '[:lower:]' |
> grep -oE "[a-z\']{2,}" |
> sort |
> grep -Fvwf stopwords | ➊
> uniq -c |
> sort -nr |
> head -n 10
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon

➊ Obtain the patterns from a file (stopwords in our case), one per line, with -f. Interpret those patterns as fixed strings with -F. Select only those lines containing matches that form whole words with -w. Select non-matching lines with -v.
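
The effect of combining these four options is easier to see on a handful of words. The following sketch assumes a tiny, made-up stopword list written to a temporary directory; the names workdir and kept are hypothetical.

```shell
workdir=$(mktemp -d)

# A tiny, made-up stopword list: one fixed string per line.
printf '%s\n' the and of > "$workdir/stopwords"

# "then" contains "the", but -w only counts whole-word matches,
# so "then" is not treated as a stopword.
kept=$(printf '%s\n' alice and the queen of then |
  grep -Fvwf "$workdir/stopwords")

echo "$kept"
rm -r "$workdir"
```

Only alice, queen, and then survive the filter. Without -w, the line "then" would also be dropped because it contains the fixed string "the".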

Each command-line tool used in this one-liner offers a man page. So in case you would like to know more about, say, grep, you can run man grep from the command line. The command-line tools tr, grep, uniq, and sort will be discussed in more detail in the next chapter.

There is nothing wrong with running this one-liner just once. However, imagine if you wanted to have the top 10 words of every e-book on Project Gutenberg. Or imagine that you wanted the top 10 words of a news website on an hourly basis. In those cases, it would be best to have this one-liner as a separate building block that can be part of something bigger. To add some flexibility to this one-liner in terms of parameters, let’s turn it into a shell script.

This allows us to take the one-liner as the starting point, and gradually improve on it. To turn this one-liner into a reusable command-line tool, I’ll walk you through the following six steps:

Copy and paste the one-liner into a file.
Add execute permissions.
Define a so-called shebang.
Remove the fixed input part.
Add a parameter.
Optionally extend your PATH.

4.2.1 Step 1: Create File

The first step is to create a new file. You can open your favorite text editor and copy and paste the one-liner. Let’s name the file top-words-1.sh to indicate that this is the first step towards our new command-line tool. If you like to stay at the command line, you can use the builtin fc, which stands for fix command, and allows you to fix or edit the last-run command.

$ fc

Running fc invokes the default text editor, which is stored in the environment variable EDITOR. In the Docker container, this is set to nano, a straightforward text editor. As you can see, this file contains our one-liner:

  GNU nano 5.4                     /tmp/zshpu0emE                               
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |                        
tr '[:upper:]' '[:lower:]' |            
grep -oE "[a-z\']{2,}" |                
sort |              
grep -Fvwf stopwords |                  
uniq -c |           
sort -nr |          
head -n 10          
 
 
 
 
                                [ Read 8 lines ]                                
^G Help      ^O Write Out ^W Where Is  ^K Cut       ^T Execute   ^C Location    
^X Exit      ^R Read File ^\ Replace   ^U Paste     ^J Justify   ^_ Go To Line

Let’s give this temporary file a proper name by pressing Ctrl-O, removing the temporary filename, and typing top-words-1.sh:

  GNU nano 5.4                     /tmp/zshpu0emE                               
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |                        
tr '[:upper:]' '[:lower:]' |            
grep -oE "[a-z\']{2,}" |                
sort |              
grep -Fvwf stopwords |                  
uniq -c |           
sort -nr |          
head -n 10          
 
 
 
 
File Name to Write: top-words-1.sh                                              
^G Help             M-D DOS Format      M-A Append          M-B Backup File     
^C Cancel           M-M Mac Format      M-P Prepend         ^T Browse

Press Enter:

  GNU nano 5.4                     /tmp/zshpu0emE                               
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |                        
tr '[:upper:]' '[:lower:]' |            
grep -oE "[a-z\']{2,}" |                
sort |              
grep -Fvwf stopwords |                  
uniq -c |           
sort -nr |          
head -n 10          
 
 
 
 
Save file under DIFFERENT NAME?                                                 
 Y Yes                                                                          
 N No           ^C Cancel

Press Y to confirm that you want to save under a different filename:

  GNU nano 5.4                     top-words-1.sh                               
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |                        
tr '[:upper:]' '[:lower:]' |            
grep -oE "[a-z\']{2,}" |                
sort |              
grep -Fvwf stopwords |                  
uniq -c |           
sort -nr |          
head -n 10          
 
 
 
 
                               [ Wrote 8 lines ]                                
^G Help      ^O Write Out ^W Where Is  ^K Cut       ^T Execute   ^C Location    
^X Exit      ^R Read File ^\ Replace   ^U Paste     ^J Justify   ^_ Go To Line

Press Ctrl-X to exit nano and go back from whence you came.

We are using the file extension .sh to make clear that we are creating a shell script. However, command-line tools don’t need to have an extension. In fact, command-line tools rarely have extensions.

Confirm the contents of the file:

$ pwd
/data/ch04
 
$ l
total 44K
-rwxr-xr-x 1 dst dst  400 Dec 14 11:46 fizzbuzz.py*
-rwxr-xr-x 1 dst dst  391 Dec 14 11:46 fizzbuzz.R*
-rw-r--r-- 1 dst dst 7.5K Dec 14 11:47 stopwords
-rwxr-xr-x 1 dst dst  182 Dec 14 11:46 stream.py*
-rwxr-xr-x 1 dst dst  147 Dec 14 11:46 stream.R*
-rw-r--r-- 1 dst dst  173 Dec 14 11:47 top-words-1.sh
-rwxr-xr-x 1 dst dst  105 Dec 14 11:46 top-words-4.sh*
-rwxr-xr-x 1 dst dst  128 Dec 14 11:46 top-words-5.sh*
-rwxr-xr-x 1 dst dst  647 Dec 14 11:46 top-words.py*
-rwxr-xr-x 1 dst dst  584 Dec 14 11:46 top-words.R*
 
$ bat top-words-1.sh
───────┬────────────────────────────────────────────────────────────────────────
       │ File: top-words-1.sh
───────┼────────────────────────────────────────────────────────────────────────
   1   │ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |
   2   │ tr '[:upper:]' '[:lower:]' |
   3   │ grep -oE "[a-z\']{2,}" |
   4   │ sort |
   5   │ grep -Fvwf stopwords |
   6   │ uniq -c |
   7   │ sort -nr |
   8   │ head -n 10
───────┴────────────────────────────────────────────────────────────────────────

You can now use bash⁵¹ to interpret and execute the commands in the file:

$ bash top-words-1.sh
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon

This saves you from typing the one-liner again next time.

However, because the file cannot be executed on its own, it’s not yet a real command-line tool. Let’s change that in the next step.

4.2.2 Step 2: Give Permission to Execute

The reason we cannot execute our file directly is that we don’t have the correct access permissions. In particular, you, as a user, need to have permission to execute the file. In this section we change the access permissions of our file.

In order to compare differences between steps, copy the file to top-words-2.sh using cp -v top-words-{1,2}.sh.

If you ever want to verify what the brace expansion or any other form of file expansion leads to, replace the command with echo to just print the result. For example, echo book_{draft,final}.md or echo agent-{001..007}.

To change the access permissions of a file, we need to use a command-line tool called chmod⁵², which stands for change mode. It changes the file mode bits of a specific file. The following command gives the user, you, permission to execute top-words-2.sh:

$ cp -v top-words-{1,2}.sh
'top-words-1.sh' -> 'top-words-2.sh'
 
$ chmod u+x top-words-2.sh

The argument u+x consists of three characters: (1) u indicates that we want to change the permissions for the user who owns the file, which is you, because you created the file; (2) + indicates that we want to add a permission; and (3) x indicates the permission to execute.

Let’s now have a look at the access permissions of both files:

$ l top-words-{1,2}.sh
-rw-r--r-- 1 dst dst 173 Dec 14 11:47 top-words-1.sh
-rwxr--r-- 1 dst dst 173 Dec 14 11:47 top-words-2.sh*

The first column shows the access permissions for each file. For top-words-2.sh, this is -rwxr--r--. The first character - (hyphen) indicates the file type. A - means regular file and a d means directory. The next three characters, rwx, indicate the access permissions for the user who owns the file. The r, w, and x mean read, write, and execute, respectively. (As you can see, top-words-1.sh has a - instead of an x, which means that we cannot execute that file.) The next three characters, r--, indicate the access permissions for all members of the group that owns the file. Finally, the last three characters in the column, r--, indicate access permissions for all other users.
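
The change in the first column can be reproduced on a throwaway file. This is a minimal sketch that assumes a temporary directory and a hypothetical file demo.sh; the initial chmod 644 is my addition, used only so that the starting permissions do not depend on your umask.

```shell
workdir=$(mktemp -d)
script="$workdir/demo.sh"
echo 'echo hello' > "$script"

chmod 644 "$script"                      # start from rw-r--r--
before=$(ls -l "$script" | cut -c1-10)   # first column of the long listing

chmod u+x "$script"                      # add execute permission for the owner only
after=$(ls -l "$script" | cut -c1-10)

echo "before: $before"
echo "after:  $after"
rm -r "$workdir"
```

Only the fourth character changes, from - to x, because u+x touches nothing but the owner's execute bit.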

Now you can execute the file as follows:

$ ./top-words-2.sh
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon

If you try to execute a file for which you don’t have the correct access permissions, as with top-words-1.sh, you will see the following error message:

$ ./top-words-1.sh
zsh: permission denied: ./top-words-1.sh

4.2.3 Step 3: Define Shebang

Although we can already execute the file on its own, we should add a so-called shebang to the file. The shebang is a special line in the script that instructs the system which executable it should use to interpret the commands.

The name shebang comes from the first two characters: a hash (she) and an exclamation mark (bang): #!. It’s not a good idea to leave it out, as we have done in the previous step, because each shell has a different default executable. The Z shell, the one we’re using throughout the book, uses the executable /bin/sh by default if no shebang is defined. In this case I’d like bash to interpret the commands as that will give us some more functionality than sh.

Again, you’re free to use whatever editor you like, but I’m going to stick with nano⁵³, which is installed in the Docker image.

$ cp -v top-words-{2,3}.sh
'top-words-2.sh' -> 'top-words-3.sh'
 
$ nano top-words-3.sh

  GNU nano 5.4                     top-words-3.sh                               
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |                        
tr '[:upper:]' '[:lower:]' |            
grep -oE "[a-z\']{2,}" |                
sort |              
grep -Fvwf stopwords |                  
uniq -c |           
sort -nr |          
head -n 10          
 
 
 
 
                                [ Read 8 lines ]                                
^G Help      ^O Write Out ^W Where Is  ^K Cut       ^T Execute   ^C Location    
^X Exit      ^R Read File ^\ Replace   ^U Paste     ^J Justify   ^_ Go To Line

Go ahead and type #!/usr/bin/env bash and press Enter. When you’re ready, press Ctrl-X to save and exit.

  GNU nano 5.4                     top-words-3.sh *                             
#!/usr/bin/env bash 
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |                        
tr '[:upper:]' '[:lower:]' |            
grep -oE "[a-z\']{2,}" |                
sort |              
grep -Fvwf stopwords |                  
uniq -c |           
sort -nr |          
head -n 10          
 
 
 
Save modified buffer?                                                           
 Y Yes                                                                          
 N No           ^C Cancel

Press Y to indicate that you want to save the file.

  GNU nano 5.4                     top-words-3.sh *                             
#!/usr/bin/env bash 
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |                        
tr '[:upper:]' '[:lower:]' |            
grep -oE "[a-z\']{2,}" |                
sort |              
grep -Fvwf stopwords |                  
uniq -c |           
sort -nr |          
head -n 10          
 
 
 
File Name to Write: top-words-3.sh                                              
^G Help             M-D DOS Format      M-A Append          M-B Backup File     
^C Cancel           M-M Mac Format      M-P Prepend         ^T Browse

Let’s confirm what top-words-3.sh looks like:

$ bat top-words-3.sh
───────┬────────────────────────────────────────────────────────────────────────
       │ File: top-words-3.sh
───────┼────────────────────────────────────────────────────────────────────────
   1   │ #!/usr/bin/env bash 
   2   │ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |
   3   │ tr '[:upper:]' '[:lower:]' |
   4   │ grep -oE "[a-z\']{2,}" |
   5   │ sort |
   6   │ grep -Fvwf stopwords |
   7   │ uniq -c |
   8   │ sort -nr |
   9   │ head -n 10
───────┴────────────────────────────────────────────────────────────────────────

That’s exactly what we need: our original pipeline with a shebang in front of it.

Sometimes you will come across scripts that have a shebang in the form of #!/usr/bin/bash or #!/usr/bin/python (in the case of Python, as we will see in the next section). While this generally works, if the bash or python⁵⁴ executables are installed in a different location than /usr/bin, then the script does not work anymore. It is better to use the form that I present here, namely #!/usr/bin/env bash and #!/usr/bin/env python, because the env⁵⁵ executable looks up bash and python in the directories listed in your PATH. In short, using env makes your scripts more portable.
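
A quick way to convince yourself that the shebang selects the interpreter is to let a script report which shell is running it. The sketch below assumes that /usr/bin/env exists, that bash is on your PATH, and that files in the temporary directory may be executed; the script name which-shell.sh is hypothetical.

```shell
workdir=$(mktemp -d)
script="$workdir/which-shell.sh"

# The quoted 'EOF' keeps the shell from expanding ${BASH_VERSION} while writing.
cat > "$script" <<'EOF'
#!/usr/bin/env bash
echo "interpreted by bash ${BASH_VERSION:-unknown}"
EOF
chmod u+x "$script"

first_line=$(head -n 1 "$script")
output=$("$script")   # the system reads the shebang and starts bash via env

echo "$first_line"
echo "$output"
rm -r "$workdir"
```

BASH_VERSION is only set by bash itself, so seeing a version number in the output suggests that bash, and not the calling shell, interpreted the file.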

4.2.4 Step 4: Remove Fixed Input

We now have a valid command-line tool that we can execute from the command line. But we can do better than this. We can make our command-line tool more reusable. The first command in our file is curl, which downloads the text from which we wish to obtain the top 10 most-used words. So, the data and operations are combined into one.

What if we wanted to obtain the top 10 most-used words from another e-book, or any other text for that matter? The input data is fixed within the tool itself. It would be better to separate the data from the command-line tool.

If we assume that the user of the command-line tool will provide the text, the tool will become generally applicable. So, the solution is to remove the curl command from the script. Here is the updated script named top-words-4.sh:

$ cp -v top-words-{3,4}.sh
'top-words-3.sh' -> 'top-words-4.sh'
 
$ sed -i '2d' top-words-4.sh
 
$ bat top-words-4.sh
───────┬────────────────────────────────────────────────────────────────────────
       │ File: top-words-4.sh
───────┼────────────────────────────────────────────────────────────────────────
   1   │ #!/usr/bin/env bash 
   2   │ tr '[:upper:]' '[:lower:]' |
   3   │ grep -oE "[a-z\']{2,}" |
   4   │ sort |
   5   │ grep -Fvwf stopwords |
   6   │ uniq -c |
   7   │ sort -nr |
   8   │ head -n 10
───────┴────────────────────────────────────────────────────────────────────────

This works because if a script starts with a command that needs data from standard input, like tr, it will take the input that is given to the command-line tool. For example:

$ curl -sL 'https://www.gutenberg.org/files/11/11-0.txt' | ./top-words-4.sh
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon
 
$ curl -sL 'https://www.gutenberg.org/files/12/12-0.txt' | ./top-words-4.sh
    469 alice
    189 queen
     98 gutenberg
     88 project
     72 time
     71 red
     70 white
     67 king
     63 head
     59 knight
 
$ man bash | ./top-words-4.sh
    585 command
    332 set
    313 word
    313 option
    304 file
    300 variable
    298 bash
    258 list
    257 expansion
    238 history

Although we have not done so in our script, the same principle holds for saving data. It is, in general, better to let the user take care of that using output redirection than to let the script write to a specific file. Of course, if you intend to use a command-line tool only for your own projects, then there are no limits to how specific you can be.
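
Both ideas, reading from standard input and leaving the destination of the output to the user, can be sketched with a deliberately tiny script. The file names shout.sh and out.txt are hypothetical, and the script is run with bash explicitly so that the sketch does not depend on execute permissions.

```shell
workdir=$(mktemp -d)

# The script contains no input or output file names at all.
cat > "$workdir/shout.sh" <<'EOF'
#!/usr/bin/env bash
tr '[:lower:]' '[:upper:]'
EOF

# The caller decides where the data comes from (a pipe)
# and where the result goes (output redirection).
printf 'hello\n' | bash "$workdir/shout.sh" > "$workdir/out.txt"

result=$(cat "$workdir/out.txt")
echo "$result"
rm -r "$workdir"
```

Because tr is the first command in the script, it reads whatever was piped into the script, and the redirection on the calling side captures whatever the script writes.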

4.2.5 Step 5: Add Arguments

There is one more step to make our command-line tool even more reusable: parameters. In our command-line tool there are a number of fixed command-line arguments, for example -nr for sort and -n 10 for head. It is probably best to keep the former argument fixed. However, it would be very useful to allow for different values for the head command. This would allow the end user to set the number of most-often used words to output. Here is what our file top-words-5.sh looks like:

$ bat top-words-5.sh
───────┬────────────────────────────────────────────────────────────────────────
       │ File: top-words-5.sh
───────┼────────────────────────────────────────────────────────────────────────
   1   │ #!/usr/bin/env bash
   2   │
   3   │ NUM_WORDS="${1:-10}"
   4   │
   5   │ tr '[:upper:]' '[:lower:]' |
   6   │ grep -oE "[a-z\']{2,}" |
   7   │ sort |
   8   │ grep -Fvwf stopwords |
   9   │ uniq -c |
  10   │ sort -nr |
  11   │ head -n "${NUM_WORDS}"
───────┴────────────────────────────────────────────────────────────────────────

The variable NUM_WORDS is set to the value of $1, which is a special variable in Bash. It holds the value of the first command-line argument passed to our command-line tool. If no value is specified, NUM_WORDS takes on the value “10”; that is what the :-10 part of ${1:-10} takes care of. The table below lists the other special variables that Bash offers.
Note that in order to use the value of the NUM_WORDS variable, you need to put a dollar sign in front of it. When you set it, you don’t write a dollar sign.

We could have also used $1 directly as an argument for head and not bothered creating an extra variable such as NUM_WORDS. However, with larger scripts and a few more command-line arguments such as $2 and $3, your code becomes more readable when you use named variables.
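
The default-value mechanism can be tried in isolation. The following sketch uses a hypothetical script args.sh that prints the number of arguments it received ($#) next to a named variable filled from $1 with a fallback of 10.

```shell
workdir=$(mktemp -d)

# Hypothetical script: $1 is the first argument, $# the number of arguments.
cat > "$workdir/args.sh" <<'EOF'
#!/usr/bin/env bash
NUM="${1:-10}"   # use the first argument, or 10 when it is missing or empty
echo "arguments=$# num=${NUM}"
EOF

with_arg=$(bash "$workdir/args.sh" 20)
without_arg=$(bash "$workdir/args.sh")

echo "$with_arg"
echo "$without_arg"
rm -r "$workdir"
```

The first call reports one argument and the value 20; the second reports zero arguments and falls back to 10.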

Now if you wanted to see the top 20 most-used words of our text, you would invoke our command-line tool as follows:

$ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" > alice.txt
 
$ < alice.txt ./top-words-5.sh 20
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon
     53 rabbit
     50 head
     48 voice
     45 looked
     44 mouse
     42 duchess
     40 tone
     40 dormouse
     37 cat
     34 march

If the user does not specify a number, then our script will show the top 10 most common words:

$ < alice.txt ./top-words-5.sh
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon

4.2.6 Step 6: Extend Your PATH

After the previous five steps we are finally finished building a reusable command-line tool. There is, however, one more step that can be very useful. In this optional step we are going to ensure that you can execute your command-line tools from everywhere.

Currently, when you want to execute your command-line tool, you either have to navigate to the directory it is in or include the full path name as shown in step 2. This is fine if the command-line tool is specifically built for, say, a certain project. However, if your command-line tool could be applied in multiple situations, then it is useful to be able to execute it from everywhere, just like the command-line tools that come with Ubuntu.

To accomplish this, Bash needs to know where to look for your command-line tools. It does this by traversing a list of directories which are stored in an environment variable called PATH. In a fresh Docker container, the PATH looks like this:

$ echo $PATH
/usr/local/lib/R/site-library/rush/exec:/usr/bin/dsutils:/home/dst/.local/bin:/u
sr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

The directories are delimited by colons. We can print it as a list of directories by translating the colons to newlines:

$ echo $PATH | tr ':' '\n'
/usr/local/lib/R/site-library/rush/exec
/usr/bin/dsutils
/home/dst/.local/bin
/usr/local/sbin
/usr/local/bin
/usr/sbin
/usr/bin
/sbin
/bin

To change the PATH permanently, you’ll need to edit the .bashrc or .profile file located in your home directory. If you put all your custom command-line tools into one directory, say, ~/tools, then you only need to change the PATH once. Once a directory is on the PATH, you no longer need to add the ./ prefix; you can just use the filename. Moreover, you no longer need to remember where the command-line tool is located.

$ cp -v top-words{-5.sh,}
'top-words-5.sh' -> 'top-words'
 
$ export PATH="${PATH}:/data/ch04"
 
$ echo $PATH
/usr/local/lib/R/site-library/rush/exec:/usr/bin/dsutils:/home/dst/.local/bin:/u
sr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/data/ch04
 
$ curl "https://www.gutenberg.org/files/11/11-0.txt" |
> top-words 10
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  170k  100  170k    0     0   115k      0  0:00:01  0:00:01 --:--:--  116k
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon
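
How the shell finds a tool by name can be checked without touching your real setup. In this sketch a temporary directory stands in for a tools directory such as ~/tools, and hello-tool-demo is a hypothetical tool name; command -v reports where the shell would find a command.

```shell
tools_dir=$(mktemp -d)   # stands in for a directory such as ~/tools
old_path=$PATH

cat > "$tools_dir/hello-tool-demo" <<'EOF'
#!/usr/bin/env bash
echo "hello from a custom tool"
EOF
chmod u+x "$tools_dir/hello-tool-demo"

# Before: the shell cannot find the tool by name.
if command -v hello-tool-demo > /dev/null; then before=found; else before=missing; fi

# After appending the directory, a lookup by name succeeds.
PATH="${PATH}:${tools_dir}"
after=$(command -v hello-tool-demo)

echo "before: $before"
echo "after:  $after"

# Undo the change so that the sketch leaves the session as it was.
PATH=$old_path
rm -r "$tools_dir"
```

Directories are searched in the order in which they are listed, so appending a directory to the end of the PATH means that existing tools with the same name keep precedence.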

4.3 Creating Command-line Tools with Python and R

The command-line tool that we created in the previous section was written in Bash. (Sure, not every feature of the Bash programming language was employed, but the interpreter still was bash.) As you know by now, the command line is language agnostic, so we don’t necessarily have to use Bash for creating command-line tools.

In this section I’m going to demonstrate that command-line tools can be created in other programming languages as well. I’ll focus on Python and R because these are the two most popular programming languages within the data science community. I cannot offer a complete introduction to either language, so I assume that you have some familiarity with Python and/or R. Other programming languages, such as Java, Go, and Julia, follow a similar pattern when it comes to creating command-line tools.

There are three main reasons for creating command-line tools in a programming language other than Bash. First, you may already have some code that you’d like to be able to use from the command line. Second, the command-line tool would end up encompassing more than a hundred lines of Bash code. Third, the command-line tool needs to be safer and more robust (Bash lacks many features such as type checking).

The six steps that I discussed in the previous section roughly apply to creating command-line tools in other programming languages as well. The first step, however, would not be copying and pasting from the command line, but rather copying and pasting the relevant code into a new file. Command-line tools written in Python and R need to specify python and Rscript⁵⁶, respectively, as the interpreter after the shebang.

When it comes to creating command-line tools using Python and R, there are two more aspects that deserve special attention. First, processing standard input, which comes naturally to shell scripts, has to be taken care of explicitly in Python and R. Second, as command-line tools written in Python and R tend to be more complex, we may also want to offer the user the ability to specify more elaborate command-line arguments.

4.3.1 Porting The Shell Script

As a starting point, let’s see how we would port the shell script we just created to both Python and R. In other words, what Python and R code gives us the top most-often used words from standard input? We will first show the two files top-words.py and top-words.R and then discuss the differences with the shell code. In Python, the code would look something like:

$ cd /data/ch04
 
$ bat top-words.py
───────┬────────────────────────────────────────────────────────────────────────
       │ File: top-words.py
───────┼────────────────────────────────────────────────────────────────────────
   1   │ #!/usr/bin/env python
   2   │ import re
   3   │ import sys
   4   │
   5   │ from collections import Counter
   6   │ from urllib.request import urlopen
   7   │
   8   │ def top_words(text, n):
   9   │     with urlopen("https://raw.githubusercontent.com/stopwords-iso/stopw
       │ ords-en/master/stopwords-en.txt") as f:
  10   │         stopwords = f.read().decode("utf-8").split("\n")
  11   │
  12   │     words = re.findall("[a-z']{2,}", text.lower())
  13   │     words = (w for w in words if w not in stopwords)
  14   │
  15   │     for word, count in Counter(words).most_common(n):
  16   │         print(f"{count:>7} {word}")
  17   │
  18   │
  19   │ if __name__ == "__main__":
  20   │     text = sys.stdin.read()
  21   │
  22   │     try:
  23   │         n = int(sys.argv[1])
  24   │     except:
  25   │         n = 10
  26   │
  27   │     top_words(text, n)
───────┴────────────────────────────────────────────────────────────────────────

Note that this Python example doesn’t use any third-party packages. If you want to do advanced text processing, then I recommend you check out the NLTK package⁵⁷. If you’re going to work with a lot of numerical data, then I recommend you use the Pandas package⁵⁸.

And in R the code would look something like:

$ bat top-words.R
───────┬────────────────────────────────────────────────────────────────────────
       │ File: top-words.R
───────┼────────────────────────────────────────────────────────────────────────
   1   │ #!/usr/bin/env Rscript
   2   │ n <- as.integer(commandArgs(trailingOnly = TRUE))
   3   │ if (length(n) == 0) n <- 10
   4   │
   5   │ f_stopwords <- url("https://raw.githubusercontent.com/stopwords-iso/sto
       │ pwords-en/master/stopwords-en.txt")
   6   │ stopwords <- readLines(f_stopwords, warn = FALSE)
   7   │ close(f_stopwords)
   8   │
   9   │ f_text <- file("stdin")
  10   │ lines <- tolower(readLines(f_text))
  11   │
  12   │ words <- unlist(regmatches(lines, gregexpr("[a-z']{2,}", lines)))
  13   │ words <- words[is.na(match(words, stopwords))]
  14   │
  15   │ counts <- sort(table(words), decreasing = TRUE)
  16   │ cat(sprintf("%7d %s\n", counts[1:n], names(counts[1:n])), sep = "")
  17   │ close(f_text)
───────┴────────────────────────────────────────────────────────────────────────

Let’s check that all three implementations (i.e., Bash, Python, and R) return the same top 5 words with the same counts:

$ time < alice.txt top-words 5
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
top-words 5 < alice.txt  0.38s user 0.04s system 149% cpu 0.278 total
 
$ time < alice.txt top-words.py 5
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
top-words.py 5 < alice.txt  1.36s user 0.03s system 95% cpu 1.454 total
 
$ time < alice.txt top-words.R 5
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
top-words.R 5 < alice.txt  1.48s user 0.14s system 91% cpu 1.774 total

Wonderful! Sure, the output itself is not very exciting. What’s exciting is that we can accomplish the same task with multiple languages. Let’s look at the differences between the approaches.

First, what’s immediately obvious is the difference in the amount of code. For this specific task, both Python and R require much more code than Bash. This illustrates that, for some tasks, it is better to use the command line. For other tasks, you may be better off using a programming language. As you gain more experience on the command line, you will start to recognize when to use which approach. When everything is a command-line tool, you can even split up the task into subtasks, and combine a Bash command-line tool with, say, a Python command-line tool. Use whichever approach works best for the task at hand.

4.3.2 Processing Streaming Data from Standard Input

In the previous two code snippets, both Python and R read the complete standard input at once. On the command line, most tools pipe data to the next command-line tool in a streaming fashion. There are a few command-line tools which require the complete data before they write any data to standard output, like sort. This means the pipeline is blocked by such command-line tools. This doesn’t have to be a problem when the input data is finite, like a file. However, when the input data is a non-stop stream, such blocking command-line tools are useless.
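
The difference between a blocking and a streaming tool can be sketched in a few lines of Python. This is only an illustration: the function names are made up, and endless_input is a hypothetical stand-in for a non-stop stream arriving on standard input.

```python
import itertools
from typing import Iterable, Iterator, List


def streaming_upper(lines: Iterable[str]) -> Iterator[str]:
    """Streaming: emit each result as soon as its input line arrives."""
    for line in lines:
        yield line.upper()


def blocking_sorted(lines: Iterable[str]) -> List[str]:
    """Blocking: like sort, nothing can be emitted until the input ends."""
    return sorted(lines)


def endless_input() -> Iterator[str]:
    """Hypothetical stand-in for a non-stop stream on standard input."""
    for i in itertools.count(1):
        yield f"line {i}"


# The streaming function produces output from an endless stream right away.
first_three = list(itertools.islice(streaming_upper(endless_input()), 3))
print(first_three)

# The blocking function only works on finite input;
# blocking_sorted(endless_input()) would never return.
print(blocking_sorted(["pear", "apple", "fig"]))
```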

Luckily, Python and R both support processing streaming data. You can apply a function on a line-by-line basis, for example. Here are two minimal examples that demonstrate how this works in Python and R, respectively.

Both the Python and the R tool solve the, by now infamous, Fizz Buzz problem, which is defined as follows: Print the numbers from 1 to 100, except that if the number is divisible by 3, instead print “fizz”; if the number is divisible by 5, instead print “buzz”; and if the number is divisible by 15, instead print “fizzbuzz.” Here’s the Python code⁵⁹:

$ bat fizzbuzz.py
───────┬────────────────────────────────────────────────────────────────────────
       │ File: fizzbuzz.py
───────┼────────────────────────────────────────────────────────────────────────
   1   │ #!/usr/bin/env python
   2   │ import sys
   3   │
   4   │ CYCLE_OF_15 = ["fizzbuzz", None, None, "fizz", None,
   5   │                "buzz", "fizz", None, None, "fizz",
   6   │                "buzz", None, "fizz", None, None]
   7   │
   8   │ def fizz_buzz(n: int) -> str:
   9   │     return CYCLE_OF_15[n % 15] or str(n)
  10   │
  11   │ if __name__ == "__main__":
  12   │     try:
  13   │         while (n:= sys.stdin.readline()):
  14   │             print(fizz_buzz(int(n)))
  15   │     except:
  16   │         pass
───────┴────────────────────────────────────────────────────────────────────────
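
The lookup table in fizzbuzz.py is less obvious than the definition it implements. One way to convince yourself that the two agree is to compare the table against a version that follows the definition literally. The sketch below does that for the numbers 1 to 100. The function fizz_buzz_by_definition is a hypothetical reference implementation added for this comparison and is not part of the tool:

```python
CYCLE_OF_15 = ["fizzbuzz", None, None, "fizz", None,
               "buzz", "fizz", None, None, "fizz",
               "buzz", None, "fizz", None, None]


def fizz_buzz(n: int) -> str:
    """Lookup-table version, copied from fizzbuzz.py."""
    return CYCLE_OF_15[n % 15] or str(n)


def fizz_buzz_by_definition(n: int) -> str:
    """Hypothetical reference version that follows the definition literally."""
    if n % 15 == 0:
        return "fizzbuzz"
    if n % 3 == 0:
        return "fizz"
    if n % 5 == 0:
        return "buzz"
    return str(n)


# Collect every number for which the two versions disagree.
mismatches = [n for n in range(1, 101)
              if fizz_buzz(n) != fizz_buzz_by_definition(n)]
print(mismatches)  # prints []
```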

And here’s the R code:

$ bat fizzbuzz.R
───────┬────────────────────────────────────────────────────────────────────────
       │ File: fizzbuzz.R
───────┼────────────────────────────────────────────────────────────────────────
   1   │ #!/usr/bin/env Rscript
   2   │ cycle_of_15 <- c("fizzbuzz", NA, NA, "fizz", NA,
   3   │                  "buzz", "fizz", NA, NA, "fizz",
   4   │                  "buzz", NA, "fizz", NA, NA)
   5   │
   6   │ fizz_buzz <- function(n) {
   7   │   word <- cycle_of_15[as.integer(n) %% 15 + 1]
   8   │   ifelse(is.na(word), n, word)
   9   │ }
  10   │
  11   │ f <- file("stdin")
  12   │ open(f)
  13   │ while(length(n <- readLines(f, n = 1)) > 0) {
  14   │   write(fizz_buzz(n), stdout())
  15   │ }
  16   │ close(f)
───────┴────────────────────────────────────────────────────────────────────────

Let’s test both tools (to save space I pipe the output to column):

$ seq 30 | fizzbuzz.py | column -x
1               2               fizz            4               buzz
fizz            7               8               fizz            buzz
11              fizz            13              14              fizzbuzz
16              17              fizz            19              buzz
fizz            22              23              fizz            buzz
26              fizz            28              29              fizzbuzz
 
$ seq 30 | fizzbuzz.R | column -x
1               2               fizz            4               buzz
fizz            7               8               fizz            buzz
11              fizz            13              14              fizzbuzz
16              17              fizz            19              buzz
fizz            22              23              fizz            buzz
26              fizz            28              29              fizzbuzz

This output looks correct to me! It’s difficult to demonstrate that these two tools actually work in a streaming manner. You can verify this yourself by piping the input data to sample -d 100 before it’s piped to the Python or R tool. That way, you’ll add a small delay between each line, which makes it easier to confirm that the tools don’t wait for all the input data but instead operate on a line-by-line basis.
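
The same idea can be imitated inside a single Python script. In the sketch below, the generator delayed is a hypothetical stand-in for a slow input source (it is not the sample tool), and the events list records the order in which lines are read and results are written. If the processing is truly line by line, reads and writes alternate:

```python
import time
from typing import Iterable, Iterator, List

events: List[str] = []  # records the order in which things happen


def delayed(lines: Iterable[str], seconds: float) -> Iterator[str]:
    """Hypothetical slow input source: pause before passing on each line."""
    for line in lines:
        time.sleep(seconds)
        events.append(f"in:{line}")
        yield line


def fizz_buzz(n: int) -> str:
    """Plain Fizz Buzz, following the definition given earlier."""
    if n % 15 == 0:
        return "fizzbuzz"
    if n % 3 == 0:
        return "fizz"
    if n % 5 == 0:
        return "buzz"
    return str(n)


# Each line is processed as soon as it arrives, before the next one is read.
for line in delayed(["1", "2", "3"], seconds=0.01):
    events.append(f"out:{fizz_buzz(int(line))}")

print("\n".join(events))
```

A tool that read the complete input first would instead record all of the in: events before the first out: event.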

4.4 Summary

In this intermezzo chapter, I have shown you how to build your own command-line tool. Only six steps are needed to turn your code into a reusable building block. You’ll find that it makes you much more productive. I advise you to keep an eye out for opportunities to create your own tools. The next chapter covers the second step of the OSEMN model for data science, namely scrubbing data.

4.5 For Further Exploration

Adding help documentation to your tool becomes important when the tool has many options to remember, and even more so when you want to share your tool with others. docopt is a language-agnostic framework to provide help and define the possible options that your tool accepts. There are implementations available in just about any programming language including Bash, Python, and R.
If you want to learn more about programming in Bash, I recommend Classic Shell Scripting by Arnold Robbins and Nelson Beebe and Bash Cookbook by Carl Albing and JP Vossen.
Writing a robust and safe Bash script is quite tricky. ShellCheck is an online tool that will check your Bash code for mistakes and vulnerabilities. There’s also a command-line tool available.
The book Ten Essays on Fizz Buzz by Joel Grus is an insightful and fun collection of ten different ways to solve Fizz Buzz with Python.
