4 Creating Command-line Tools
Throughout the book, I’ll introduce you to many commands and pipelines that fit on a single line. These are known as one-liners. Being able to perform complex tasks with just a one-liner is what makes the command line powerful. It’s a very different experience from writing and using traditional programs.
Some tasks you perform only once, and some you perform more often. Some tasks are very specific and others can be generalized. If you need to repeat a certain one-liner on a regular basis, it’s worthwhile to turn this into a command-line tool of its own. So, both one-liners and command-line tools have their uses. Recognizing the opportunity requires practice and skill. The advantages of a command-line tool are that you don’t have to remember the entire one-liner and that it improves readability if you include it into some other pipeline. In that sense, you can think of a command-line tool as similar to a function in a programming language.
The benefit of working with a programming language, however, is that the code lives in one or more files. This means that you can easily edit and reuse that code. If the code accepts parameters, it can even be generalized and reapplied to problems that follow a similar pattern.
Command-line tools have the best of both worlds: they can be used from the command line, accept parameters, and only have to be created once. In this chapter, you’re going to get familiar with creating command-line tools in two ways. First, I explain how to turn those one-liners into reusable command-line tools. By adding parameters to your commands, you can add the same flexibility that a programming language offers. Subsequently, I demonstrate how to create reusable command-line tools from code that’s written in a programming language. By following the Unix philosophy, your code can be combined with other command-line tools, which may be written in an entirely different language. In this chapter, I will focus on three programming languages: Bash, Python, and R.
I believe that creating reusable command-line tools makes you a more efficient and productive data scientist in the long run. You will gradually build up your own data science toolbox from which you can draw existing tools and apply them to problems you have encountered previously.
4.1 Overview
In this chapter, you’ll learn how to:
- Convert one-liners into parameterized shell scripts
- Turn existing Python and R code into reusable command-line tools
This chapter starts with the following files:
$ cd /data/ch04
$ l
total 32K
-rwxr-xr-x 1 dst dst 400 Dec 14 11:46 fizzbuzz.py*
-rwxr-xr-x 1 dst dst 391 Dec 14 11:46 fizzbuzz.R*
-rwxr-xr-x 1 dst dst 182 Dec 14 11:46 stream.py*
-rwxr-xr-x 1 dst dst 147 Dec 14 11:46 stream.R*
-rwxr-xr-x 1 dst dst 105 Dec 14 11:46 top-words-4.sh*
-rwxr-xr-x 1 dst dst 128 Dec 14 11:46 top-words-5.sh*
-rwxr-xr-x 1 dst dst 647 Dec 14 11:46 top-words.py*
-rwxr-xr-x 1 dst dst 584 Dec 14 11:46 top-words.R*
The instructions to get these files are in Chapter 2. Any other files are either downloaded or generated using command-line tools.
4.2 Converting One-liners into Shell Scripts
In this section I’m going to explain how to turn a one-liner into a reusable command-line tool. Let’s say that you would like to get the top most frequent words used in a piece of text. Take the book Alice’s Adventures in Wonderland by Lewis Carroll, which is, like many other great books, freely available on Project Gutenberg.
$ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | trim
The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis …
This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.
… with 3751 more lines
The following sequence of tools or pipeline should do the job:
$ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" | ➊
> tr '[:upper:]' '[:lower:]' | ➋
> grep -oE "[a-z\']{2,}" | ➌
> sort | ➍
> uniq -c | ➎
> sort -nr | ➏
> head -n 10 ➐
   1839 the
    942 and
    811 to
    638 of
    610 it
    553 she
    486 you
    462 said
    435 in
    403 alice
➊ Download an ebook using curl.
➋ Convert the entire text to lowercase using tr.
➌ Extract all the words using grep and put each word on a separate line.
➍ Sort these words in alphabetical order using sort.
➎ Remove all the duplicates and count how often each word appears in the list using uniq.
➏ Sort this list of unique words by their count in descending order using sort.
➐ Keep only the top 10 lines (i.e., words) using head.
Those words indeed appear the most often in the text. Because those words (apart from the word “alice”) appear very frequently in many English texts, they carry very little meaning. In fact, these are known as stopwords. If we get rid of those, we keep the most frequent words that are related to this text.
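To see on a small scale what stopword filtering does, here’s a sketch that uses a tiny, made-up stopword list (a hypothetical file /tmp/mini-stopwords with just three words) instead of the full list we download next:

```shell
# Create a tiny stopword list (hypothetical example; the real list is
# downloaded below).
printf 'the\nand\nto\n' > /tmp/mini-stopwords

# Count the words of a sample sentence, filtering out the stopwords
# right before counting, just like the pipeline above.
echo "the cat and the hat ran to the cat" |
tr ' ' '\n' |
sort |
grep -Fvwf /tmp/mini-stopwords |
uniq -c |
sort -nr
```

Only cat, hat, and ran survive the filter, with cat counted twice at the top of the list.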
Here’s a list of stopwords I’ve found:
$ curl -sL "https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt" |
> sort | tee stopwords | trim 20
10
39
a
able
ableabout
about
above
abroad
abst
accordance
according
accordingly
across
act
actually
ad
added
adj
adopted
ae
… with 1278 more lines
With grep we can filter out the stopwords right before we start counting:
$ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |
> tr '[:upper:]' '[:lower:]' |
> grep -oE "[a-z\']{2,}" |
> sort |
> grep -Fvwf stopwords | ➊
> uniq -c |
> sort -nr |
> head -n 10
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon
➊ Obtain the patterns from a file (stopwords in our case), one per line, with -f. Interpret those patterns as fixed strings with -F. Select only those lines containing matches that form whole words with -w. Select non-matching lines with -v.
To learn more about the options of grep, you can run man grep from the command line.
The command-line tools tr, grep, uniq, and sort will be discussed in more detail in the next chapter.
There is nothing wrong with running this one-liner just once. However, imagine if you wanted to have the top 10 words of every e-book on Project Gutenberg. Or imagine that you wanted the top 10 words of a news website on an hourly basis. In those cases, it would be best to have this one-liner as a separate building block that can be part of something bigger. To add some flexibility to this one-liner in terms of parameters, let’s turn it into a shell script.
This allows us to take the one-liner as the starting point, and gradually improve on it. To turn this one-liner into a reusable command-line tool, I’ll walk you through the following six steps:
- Copy and paste the one-liner into a file.
- Add execute permissions.
- Define a so-called shebang.
- Remove the fixed input part.
- Add a parameter.
- Optionally extend your PATH.
4.2.1 Step 1: Create File
The first step is to create a new file.
You can open your favorite text editor and copy and paste the one-liner.
Let’s name the file top-words-1.sh to indicate that this is the first step towards our new command-line tool. If you like to stay at the command line, you can use the builtin fc, which stands for fix command and allows you to fix or edit the last-run command.
$ fc
Running fc invokes the default text editor, which is stored in the environment variable EDITOR. In the Docker container, this is set to nano, a straightforward text editor.
As you can see, this file contains our one-liner:
GNU nano 5.4                             /tmp/zshpu0emE
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |
tr '[:upper:]' '[:lower:]' |
grep -oE "[a-z\']{2,}" |
sort |
grep -Fvwf stopwords |
uniq -c |
sort -nr |
head -n 10

                            [ Read 8 lines ]
^G Help      ^O Write Out  ^W Where Is   ^K Cut        ^T Execute    ^C Location
^X Exit      ^R Read File  ^\ Replace    ^U Paste      ^J Justify    ^_ Go To Line
Let’s give this temporary file a proper name by pressing Ctrl-O, removing the temporary filename, and typing top-words-1.sh:
GNU nano 5.4                             /tmp/zshpu0emE
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |
tr '[:upper:]' '[:lower:]' |
grep -oE "[a-z\']{2,}" |
sort |
grep -Fvwf stopwords |
uniq -c |
sort -nr |
head -n 10

File Name to Write: top-words-1.sh
^G Help      M-D DOS Format   M-A Append       M-B Backup File
^C Cancel    M-M Mac Format   M-P Prepend      ^T Browse
Press Enter:
GNU nano 5.4                             /tmp/zshpu0emE
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |
tr '[:upper:]' '[:lower:]' |
grep -oE "[a-z\']{2,}" |
sort |
grep -Fvwf stopwords |
uniq -c |
sort -nr |
head -n 10

Save file under DIFFERENT NAME?
 Y Yes
 N No           ^C Cancel
Press Y to confirm that you want to save under a different filename:
GNU nano 5.4                             top-words-1.sh
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |
tr '[:upper:]' '[:lower:]' |
grep -oE "[a-z\']{2,}" |
sort |
grep -Fvwf stopwords |
uniq -c |
sort -nr |
head -n 10

                            [ Wrote 8 lines ]
^G Help      ^O Write Out  ^W Where Is   ^K Cut        ^T Execute    ^C Location
^X Exit      ^R Read File  ^\ Replace    ^U Paste      ^J Justify    ^_ Go To Line
Press Ctrl-X to exit nano and go back from whence you came.
We are using the file extension .sh to make clear that we are creating a shell script. However, command-line tools don’t need to have an extension. In fact, command-line tools rarely have extensions.
Confirm the contents of the file:
$ pwd
/data/ch04
$ l
total 44K
-rwxr-xr-x 1 dst dst  400 Dec 14 11:46 fizzbuzz.py*
-rwxr-xr-x 1 dst dst  391 Dec 14 11:46 fizzbuzz.R*
-rw-r--r-- 1 dst dst 7.5K Dec 14 11:47 stopwords
-rwxr-xr-x 1 dst dst  182 Dec 14 11:46 stream.py*
-rwxr-xr-x 1 dst dst  147 Dec 14 11:46 stream.R*
-rw-r--r-- 1 dst dst  173 Dec 14 11:47 top-words-1.sh
-rwxr-xr-x 1 dst dst  105 Dec 14 11:46 top-words-4.sh*
-rwxr-xr-x 1 dst dst  128 Dec 14 11:46 top-words-5.sh*
-rwxr-xr-x 1 dst dst  647 Dec 14 11:46 top-words.py*
-rwxr-xr-x 1 dst dst  584 Dec 14 11:46 top-words.R*
$ bat top-words-1.sh
───────┬────────────────────────────────────────────────────────────────────────
       │ File: top-words-1.sh
───────┼────────────────────────────────────────────────────────────────────────
   1   │ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |
   2   │ tr '[:upper:]' '[:lower:]' |
   3   │ grep -oE "[a-z\']{2,}" |
   4   │ sort |
   5   │ grep -Fvwf stopwords |
   6   │ uniq -c |
   7   │ sort -nr |
   8   │ head -n 10
───────┴────────────────────────────────────────────────────────────────────────
You can now use bash to interpret and execute the commands in the file:
$ bash top-words-1.sh
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon
This saves you from typing the one-liner again next time.
However, because the file cannot be executed on its own, it’s not yet a real command-line tool. Let’s change that in the next step.
4.2.2 Step 2: Give Permission to Execute
The reason we cannot execute our file directly is that we don’t have the correct access permissions. In particular, you, as a user, need to have permission to execute the file. In this section we change the access permissions of our file.
In order to compare differences between steps, copy the file to top-words-2.sh using cp -v top-words-{1,2}.sh.
The curly braces in this command are expanded by the shell before cp runs: cp -v top-words-{1,2}.sh becomes cp -v top-words-1.sh top-words-2.sh. This is known as brace expansion. If you want to check whether a brace expression expands to what you expect, you can use echo to just print the result. For example, echo book_{draft,final}.md or echo agent-{001..007}.
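Because brace expansion happens before the command runs, you can always preview it with echo first. This sketch assumes a shell with brace expansion, such as bash or zsh:

```shell
# Preview what a brace expression expands to, without touching any files.
echo cp -v top-words-{1,2}.sh   # the comma form lists alternatives
echo agent-{001..007}           # the .. form generates a (padded) sequence
```

The first echo prints cp -v top-words-1.sh top-words-2.sh, which is exactly the command the shell would otherwise run.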
To change the access permissions of a file, we need to use a command-line tool called chmod, which stands for change mode. It changes the file mode bits of a specific file. The following command gives you, the user, permission to execute top-words-2.sh:
$ cp -v top-words-{1,2}.sh
'top-words-1.sh' -> 'top-words-2.sh'
$ chmod u+x top-words-2.sh
The argument u+x consists of three characters: (1) u indicates that we want to change the permissions for the user who owns the file, which is you, because you created the file; (2) + indicates that we want to add a permission; and (3) x indicates the permission to execute.
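The u+x argument is one instance of chmod’s symbolic mode, which combines who (u, g, o, or a), an operator (+, -, or =), and a permission (r, w, or x). Here’s a sketch of a few common combinations on a hypothetical scratch file:

```shell
# Create a scratch file to experiment on (hypothetical name).
touch /tmp/scratch.sh

chmod u+x /tmp/scratch.sh   # user: add execute
chmod go-w /tmp/scratch.sh  # group and others: remove write
chmod a+r /tmp/scratch.sh   # all (user, group, others): add read
chmod 755 /tmp/scratch.sh   # octal form, equivalent to u=rwx,go=rx

ls -l /tmp/scratch.sh       # first column now reads -rwxr-xr-x
```

The octal form sets all nine permission bits at once, which is handy in scripts; the symbolic form is easier to read when you only want to flip one bit.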
Let’s now have a look at the access permissions of both files:
$ l top-words-{1,2}.sh
-rw-r--r-- 1 dst dst 173 Dec 14 11:47 top-words-1.sh
-rwxr--r-- 1 dst dst 173 Dec 14 11:47 top-words-2.sh*
The first column shows the access permissions for each file. For top-words-2.sh, this is -rwxr--r--. The first character, - (hyphen), indicates the file type. A - means regular file and a d means directory. The next three characters, rwx, indicate the access permissions for the user who owns the file. The r and w mean read and write, respectively. (As you can see, top-words-1.sh has a - instead of an x, which means that we cannot execute that file.) The next three characters, r--, indicate the access permissions for all members of the group that owns the file. Finally, the last three characters in the column, r--, indicate access permissions for all other users.
Now you can execute the file as follows:
$ ./top-words-2.sh
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon
If you try to execute a file for which you don’t have the correct access permissions, as with top-words-1.sh, you will see the following error message:
$ ./top-words-1.sh
zsh: permission denied: ./top-words-1.sh
4.2.3 Step 3: Define Shebang
Although we can already execute the file on its own, we should add a so-called shebang to the file. The shebang is a special line in the script that instructs the system which executable it should use to interpret the commands.
The name shebang comes from the first two characters: a hash (she) and an exclamation mark (bang): #!.
It’s not a good idea to leave it out, as we have done in the previous step, because each shell has a different default executable. The Z shell, the one we’re using throughout the book, uses the executable /bin/sh by default if no shebang is defined. In this case I’d like bash to interpret the commands, as that gives us some more functionality than sh does.
Again, you’re free to use whatever editor you like, but I’m going to stick with nano, which is installed in the Docker image.
$ cp -v top-words-{2,3}.sh
'top-words-2.sh' -> 'top-words-3.sh'
$ nano top-words-3.sh
GNU nano 5.4                             top-words-3.sh
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |
tr '[:upper:]' '[:lower:]' |
grep -oE "[a-z\']{2,}" |
sort |
grep -Fvwf stopwords |
uniq -c |
sort -nr |
head -n 10

                            [ Read 8 lines ]
^G Help      ^O Write Out  ^W Where Is   ^K Cut        ^T Execute    ^C Location
^X Exit      ^R Read File  ^\ Replace    ^U Paste      ^J Justify    ^_ Go To Line
Go ahead and type #!/usr/bin/env bash and press Enter. When you’re ready, press Ctrl-X to save and exit.
GNU nano 5.4                             top-words-3.sh *
#!/usr/bin/env bash
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |
tr '[:upper:]' '[:lower:]' |
grep -oE "[a-z\']{2,}" |
sort |
grep -Fvwf stopwords |
uniq -c |
sort -nr |
head -n 10

Save modified buffer?
 Y Yes
 N No           ^C Cancel
Press Y to indicate that you want to save the file.
GNU nano 5.4                             top-words-3.sh *
#!/usr/bin/env bash
curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |
tr '[:upper:]' '[:lower:]' |
grep -oE "[a-z\']{2,}" |
sort |
grep -Fvwf stopwords |
uniq -c |
sort -nr |
head -n 10

File Name to Write: top-words-3.sh
^G Help      M-D DOS Format   M-A Append       M-B Backup File
^C Cancel    M-M Mac Format   M-P Prepend      ^T Browse
Let’s confirm what top-words-3.sh looks like:
$ bat top-words-3.sh
───────┬────────────────────────────────────────────────────────────────────────
       │ File: top-words-3.sh
───────┼────────────────────────────────────────────────────────────────────────
   1   │ #!/usr/bin/env bash
   2   │ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" |
   3   │ tr '[:upper:]' '[:lower:]' |
   4   │ grep -oE "[a-z\']{2,}" |
   5   │ sort |
   6   │ grep -Fvwf stopwords |
   7   │ uniq -c |
   8   │ sort -nr |
   9   │ head -n 10
───────┴────────────────────────────────────────────────────────────────────────
That’s exactly what we need: our original pipeline with a shebang in front of it.
Sometimes you will come across scripts that have a shebang in the form of #!/usr/bin/bash or #!/usr/bin/python (in the case of Python, as we will see in the next section). While this generally works, if the bash or python executables are installed in a location other than /usr/bin, then the script no longer works. It is better to use the form that I present here, namely #!/usr/bin/env bash and #!/usr/bin/env python, because the env executable knows where bash and python are installed. In short, using env makes your scripts more portable.
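You can check for yourself where env would find an interpreter: it searches the same PATH directories the shell does. A quick sketch, assuming bash is installed somewhere on your PATH:

```shell
# Where would the shell (and env) find bash? Prints the full path.
command -v bash

# Running an interpreter through env works regardless of where it lives.
env bash -c 'echo portable'
```

On the Docker image, command -v bash prints /usr/bin/bash; on other systems, it may print /bin/bash or another location, which is exactly why env is the more portable choice.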
4.2.4 Step 4: Remove Fixed Input
We now have a valid command-line tool that we can execute from the command line. But we can do better than this; we can make our command-line tool more reusable.
The first command in our file is curl, which downloads the text from which we wish to obtain the top 10 most-used words. So the data and the operations are combined into one. What if we wanted to obtain the top 10 most-used words from another e-book, or any other text for that matter? The input data is fixed within the tool itself. It would be better to separate the data from the command-line tool. If we assume that the user of the command-line tool will provide the text, the tool becomes generally applicable.
So the solution is to remove the curl command from the script. Here is the updated script, named top-words-4.sh:
$ cp -v top-words-{3,4}.sh
'top-words-3.sh' -> 'top-words-4.sh'
$ sed -i '2d' top-words-4.sh
$ bat top-words-4.sh
───────┬────────────────────────────────────────────────────────────────────────
       │ File: top-words-4.sh
───────┼────────────────────────────────────────────────────────────────────────
   1   │ #!/usr/bin/env bash
   2   │ tr '[:upper:]' '[:lower:]' |
   3   │ grep -oE "[a-z\']{2,}" |
   4   │ sort |
   5   │ grep -Fvwf stopwords |
   6   │ uniq -c |
   7   │ sort -nr |
   8   │ head -n 10
───────┴────────────────────────────────────────────────────────────────────────
This works because when a script starts with a command that reads from standard input, like tr, it uses whatever input is piped to the command-line tool. For example:
$ curl -sL 'https://www.gutenberg.org/files/11/11-0.txt' | ./top-words-4.sh
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon
$ curl -sL 'https://www.gutenberg.org/files/12/12-0.txt' | ./top-words-4.sh
    469 alice
    189 queen
     98 gutenberg
     88 project
     72 time
     71 red
     70 white
     67 king
     63 head
     59 knight
$ man bash | ./top-words-4.sh
    585 command
    332 set
    313 word
    313 option
    304 file
    300 variable
    298 bash
    258 list
    257 expansion
    238 history
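You can confirm this behavior with a throwaway script of your own. This is a hypothetical example, not one of the chapter’s files: its first command, tr, reads from standard input, so whatever you pipe to the script flows straight into tr:

```shell
# Create a minimal script whose first command reads from standard input.
cat > /tmp/shout.sh << 'EOF'
#!/usr/bin/env bash
tr '[:lower:]' '[:upper:]'
EOF
chmod u+x /tmp/shout.sh

# The piped input is passed on to tr inside the script.
echo "quiet please" | /tmp/shout.sh   # prints QUIET PLEASE
```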
4.2.5 Step 5: Add Arguments
There is one more step to make our command-line tool even more reusable: parameters. In our command-line tool there are a number of fixed command-line arguments, for example -nr for sort and -n 10 for head. It is probably best to keep the former argument fixed. However, it would be very useful to allow for different values for the head command. This would allow the end user to set the number of most-often-used words to output. The file top-words-5.sh looks like this:
$ bat top-words-5.sh
───────┬────────────────────────────────────────────────────────────────────────
       │ File: top-words-5.sh
───────┼────────────────────────────────────────────────────────────────────────
   1   │ #!/usr/bin/env bash
   2   │
   3   │ NUM_WORDS="${1:-10}"
   4   │
   5   │ tr '[:upper:]' '[:lower:]' |
   6   │ grep -oE "[a-z\']{2,}" |
   7   │ sort |
   8   │ grep -Fvwf stopwords |
   9   │ uniq -c |
  10   │ sort -nr |
  11   │ head -n "${NUM_WORDS}"
───────┴────────────────────────────────────────────────────────────────────────
- The variable NUM_WORDS is set to the value of $1, a special variable in Bash that holds the value of the first command-line argument passed to our command-line tool. (Bash offers more special variables, such as $2 and $3 for the second and third argument.) If no value is specified, NUM_WORDS takes on the value “10.”
- Note that in order to use the value of the NUM_WORDS variable, you need to put a dollar sign in front of it: $NUM_WORDS. When you set it, you don’t write a dollar sign.
We could have also used $1 directly as the argument for head and not bothered creating an extra variable such as NUM_WORDS. However, with larger scripts and a few more command-line arguments such as $2 and $3, your code becomes more readable when you use named variables.
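The ${1:-10} construct is an instance of Bash’s default-value parameter expansion: ${parameter:-word} evaluates to word whenever parameter is unset or empty. A quick sketch, using bash -c so we can control the positional arguments:

```shell
# Without a first argument, the expansion falls back to 10.
bash -c 'echo "${1:-10}"' --        # prints 10

# With an argument, the expansion uses it instead.
bash -c 'echo "${1:-10}"' -- 20     # prints 20
```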
Now if you want to see the top 20 most-used words of our text, you invoke the command-line tool as follows:
$ curl -sL "https://www.gutenberg.org/files/11/11-0.txt" > alice.txt
$ < alice.txt ./top-words-5.sh 20
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon
     53 rabbit
     50 head
     48 voice
     45 looked
     44 mouse
     42 duchess
     40 tone
     40 dormouse
     37 cat
     34 march
If the user does not specify a number, then our script will show the top 10 most common words:
$ < alice.txt ./top-words-5.sh
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon
4.2.6 Step 6: Extend Your PATH
After the previous five steps we are finally finished building a reusable command-line tool. There is, however, one more step that can be very useful. In this optional step we are going to ensure that you can execute your command-line tools from everywhere.
Currently, when you want to execute your command-line tool, you either have to navigate to the directory it is in or include the full path name as shown in step 2. This is fine if the command-line tool is specifically built for, say, a certain project. However, if your command-line tool could be applied in multiple situations, then it is useful to be able to execute it from everywhere, just like the command-line tools that come with Ubuntu.
To accomplish this, Bash needs to know where to look for your command-line tools. It does this by traversing a list of directories which are stored in an environment variable called PATH. In a fresh Docker container, the PATH looks like this:
$ echo $PATH
/usr/local/lib/R/site-library/rush/exec:/usr/bin/dsutils:/home/dst/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
The directories are delimited by colons. We can print it as a list of directories by translating the colons to newlines:
$ echo $PATH | tr ':' '\n'
/usr/local/lib/R/site-library/rush/exec
/usr/bin/dsutils
/home/dst/.local/bin
/usr/local/sbin
/usr/local/bin
/usr/sbin
/usr/bin
/sbin
/bin
To change the PATH permanently, you’ll need to edit the .bashrc or .profile file located in your home directory. If you put all your custom command-line tools into one directory, say ~/tools, then you only have to change the PATH once. Afterwards, you no longer need to add the ./; you can just use the filename. Moreover, you no longer need to remember where the command-line tool is located.
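Here’s a sketch of that setup, assuming a ~/tools directory and a setup where Bash reads ~/.bashrc on startup:

```shell
# Create a directory for your own command-line tools.
mkdir -p ~/tools

# Append the PATH change to ~/.bashrc so future shells pick it up.
echo 'export PATH="${PATH}:${HOME}/tools"' >> ~/.bashrc

# Apply it to the current shell right away.
export PATH="${PATH}:${HOME}/tools"
```

Any executable file you then place in ~/tools can be invoked by name from any directory.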
$ cp -v top-words{-5.sh,}
'top-words-5.sh' -> 'top-words'
$ export PATH="${PATH}:/data/ch04"
$ echo $PATH
/usr/local/lib/R/site-library/rush/exec:/usr/bin/dsutils:/home/dst/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/data/ch04
$ curl "https://www.gutenberg.org/files/11/11-0.txt" |
> top-words 10
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  170k  100  170k    0     0   115k      0  0:00:01  0:00:01 --:--:--  116k
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
     63 king
     60 turtle
     57 mock
     56 hatter
     55 gryphon
4.3 Creating Command-line Tools with Python and R
The command-line tool that we created in the previous section was written in Bash. (Sure, not every feature of the Bash programming language was employed, but the interpreter still was bash.) As you know by now, the command line is language agnostic, so we don’t necessarily have to use Bash for creating command-line tools.
In this section I’m going to demonstrate that command-line tools can be created in other programming languages as well. I’ll focus on Python and R because these are the two most popular programming languages within the data science community. I cannot offer a complete introduction to either language, so I assume that you have some familiarity with Python and/or R. Other programming languages such as Java, Go, and Julia follow a similar pattern when it comes to creating command-line tools.
There are three main reasons for creating command-line tools in a programming language other than Bash. First, you may already have some code that you’d like to be able to use from the command line. Second, the command-line tool may end up encompassing more than a hundred lines of Bash code. Third, the command-line tool may need to be safer and more robust (Bash lacks many features, such as type checking).
The six steps that I discussed in the previous section roughly apply to creating command-line tools in other programming languages as well. The first step, however, is not copying and pasting from the command line, but rather copying and pasting the relevant code into a new file. Command-line tools written in Python and R need to specify python and Rscript, respectively, as the interpreter after the shebang.
When it comes to creating command-line tools using Python and R, there are two more aspects that deserve special attention. First, processing standard input, which comes naturally to shell scripts, has to be taken care of explicitly in Python and R. Second, as command-line tools written in Python and R tend to be more complex, we may also want to offer the user the ability to specify more elaborate command-line arguments.
4.3.1 Porting The Shell Script
As a starting point, let’s see how we would port the shell script we just created to both Python and R. In other words, what Python and R code gives us the top most-often used words from standard input? We will first show the two files top-words.py and top-words.R and then discuss the differences with the shell code. In Python, the code would look something like:
$ cd /data/ch04
$ bat top-words.py
───────┬────────────────────────────────────────────────────────────────────────
       │ File: top-words.py
───────┼────────────────────────────────────────────────────────────────────────
   1   │ #!/usr/bin/env python
   2   │ import re
   3   │ import sys
   4   │
   5   │ from collections import Counter
   6   │ from urllib.request import urlopen
   7   │
   8   │ def top_words(text, n):
   9   │     with urlopen("https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt") as f:
  10   │         stopwords = f.read().decode("utf-8").split("\n")
  11   │
  12   │     words = re.findall("[a-z']{2,}", text.lower())
  13   │     words = (w for w in words if w not in stopwords)
  14   │
  15   │     for word, count in Counter(words).most_common(n):
  16   │         print(f"{count:>7} {word}")
  17   │
  18   │
  19   │ if __name__ == "__main__":
  20   │     text = sys.stdin.read()
  21   │
  22   │     try:
  23   │         n = int(sys.argv[1])
  24   │     except:
  25   │         n = 10
  26   │
  27   │     top_words(text, n)
───────┴────────────────────────────────────────────────────────────────────────
Note that this Python example doesn’t use any third-party packages. If you want to do advanced text processing, then I recommend you check out the NLTK package. If you’re going to work with a lot of numerical data, then I recommend you use the Pandas package.
And in R the code would look something like:
$ bat top-words.R
───────┬────────────────────────────────────────────────────────────────────────
       │ File: top-words.R
───────┼────────────────────────────────────────────────────────────────────────
   1   │ #!/usr/bin/env Rscript
   2   │ n <- as.integer(commandArgs(trailingOnly = TRUE))
   3   │ if (length(n) == 0) n <- 10
   4   │
   5   │ f_stopwords <- url("https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt")
   6   │ stopwords <- readLines(f_stopwords, warn = FALSE)
   7   │ close(f_stopwords)
   8   │
   9   │ f_text <- file("stdin")
  10   │ lines <- tolower(readLines(f_text))
  11   │
  12   │ words <- unlist(regmatches(lines, gregexpr("[a-z']{2,}", lines)))
  13   │ words <- words[is.na(match(words, stopwords))]
  14   │
  15   │ counts <- sort(table(words), decreasing = TRUE)
  16   │ cat(sprintf("%7d %s\n", counts[1:n], names(counts[1:n])), sep = "")
  17   │ close(f_text)
───────┴────────────────────────────────────────────────────────────────────────
Let’s check that all three implementations (i.e., Bash, Python, and R) return the same top 5 words with the same counts:
$ time < alice.txt top-words 5
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
top-words 5 < alice.txt  0.38s user 0.04s system 149% cpu 0.278 total
$ time < alice.txt top-words.py 5
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
top-words.py 5 < alice.txt  1.36s user 0.03s system 95% cpu 1.454 total
$ time < alice.txt top-words.R 5
    403 alice
     98 gutenberg
     88 project
     76 queen
     71 time
top-words.R 5 < alice.txt  1.48s user 0.14s system 91% cpu 1.774 total
Wonderful! Sure, the output itself is not very exciting. What’s exciting is that we can accomplish the same task with multiple languages. Let’s look at the differences between the approaches.
First, what’s immediately obvious is the difference in the amount of code. For this specific task, both Python and R require much more code than Bash. This illustrates that, for some tasks, it is better to use the command line; for other tasks, you may be better off using a programming language. As you gain more experience on the command line, you will start to recognize when to use which approach. When everything is a command-line tool, you can even split up the task into subtasks and combine a Bash command-line tool with, say, a Python command-line tool, whichever approach works best for the task at hand.
4.3.2 Processing Streaming Data from Standard Input
In the previous two code snippets, both Python and R read the complete standard input at once. On the command line, most tools pipe data to the next command-line tool in a streaming fashion. There are a few command-line tools which require the complete data before they write any data to standard output, like sort. This means that the pipeline is blocked by such command-line tools. This doesn’t have to be a problem when the input data is finite, like a file. However, when the input data is a non-stop stream, such blocking command-line tools are useless.
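The difference is easy to see with an endless stream: streaming tools produce output immediately, while a blocking tool like sort would wait forever for the end of the input. A quick sketch (don’t run the commented-out variant):

```shell
# Streaming: yes emits "y" forever; head takes the first three lines and
# then terminates the whole pipeline.
yes | head -n 3

# Blocking: sort would wait for end-of-input, which never comes.
# yes | sort | head -n 3   # don't run this; it would hang
```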
Luckily Python and R support processing streaming data. You can apply a function on a line-per-line basis, for example. Here are two minimal examples that demonstrate how this works in Python and R, respectively.
Both the Python and the R tool solve the by now infamous Fizz Buzz problem, which is defined as follows: Print the numbers from 1 to 100, except that if the number is divisible by 3, instead print “fizz”; if the number is divisible by 5, instead print “buzz”; and if the number is divisible by 15, instead print “fizzbuzz.” Here’s the Python code:
$ bat fizzbuzz.py
───────┬────────────────────────────────────────────────────────────────────────
       │ File: fizzbuzz.py
───────┼────────────────────────────────────────────────────────────────────────
   1   │ #!/usr/bin/env python
   2   │ import sys
   3   │
   4   │ CYCLE_OF_15 = ["fizzbuzz", None, None, "fizz", None,
   5   │                "buzz", "fizz", None, None, "fizz",
   6   │                "buzz", None, "fizz", None, None]
   7   │
   8   │ def fizz_buzz(n: int) -> str:
   9   │     return CYCLE_OF_15[n % 15] or str(n)
  10   │
  11   │ if __name__ == "__main__":
  12   │     try:
  13   │         while (n := sys.stdin.readline()):
  14   │             print(fizz_buzz(int(n)))
  15   │     except:
  16   │         pass
───────┴────────────────────────────────────────────────────────────────────────
And here’s the R code:
$ bat fizzbuzz.R
───────┬────────────────────────────────────────────────────────────────────────
       │ File: fizzbuzz.R
───────┼────────────────────────────────────────────────────────────────────────
   1   │ #!/usr/bin/env Rscript
   2   │ cycle_of_15 <- c("fizzbuzz", NA, NA, "fizz", NA,
   3   │                  "buzz", "fizz", NA, NA, "fizz",
   4   │                  "buzz", NA, "fizz", NA, NA)
   5   │
   6   │ fizz_buzz <- function(n) {
   7   │   word <- cycle_of_15[as.integer(n) %% 15 + 1]
   8   │   ifelse(is.na(word), n, word)
   9   │ }
  10   │
  11   │ f <- file("stdin")
  12   │ open(f)
  13   │ while(length(n <- readLines(f, n = 1)) > 0) {
  14   │   write(fizz_buzz(n), stdout())
  15   │ }
  16   │ close(f)
───────┴────────────────────────────────────────────────────────────────────────
Let’s test both tools (to save space I pipe the output to column):
$ seq 30 | fizzbuzz.py | column -x
1         2         fizz      4         buzz      fizz
7         8         fizz      buzz      11        fizz
13        14        fizzbuzz  16        17        fizz
19        buzz      fizz      22        23        fizz
buzz      26        fizz      28        29        fizzbuzz
$ seq 30 | fizzbuzz.R | column -x
1         2         fizz      4         buzz      fizz
7         8         fizz      buzz      11        fizz
13        14        fizzbuzz  16        17        fizz
19        buzz      fizz      22        23        fizz
buzz      26        fizz      28        29        fizzbuzz
This output looks correct to me!
It’s difficult to demonstrate here that these two tools actually work in a streaming manner. You can verify this yourself by piping the input data to sample -d 100 before it’s piped to the Python or R tool. That way, you add a small delay between each line, making it easier to confirm that the tools don’t wait for all the input data but operate on a line-by-line basis.
4.4 Summary
In this intermezzo chapter, I have shown you how to build your own command-line tool. Only six steps are needed to turn your code into a reusable building block. You’ll find that it makes you much more productive. I advise you to keep an eye out for opportunities to create your own tools. The next chapter covers the second step of the OSEMN model for data science, namely scrubbing data.
4.5 For Further Exploration
- Adding help documentation to your tool becomes important when the tool has many options to remember, and even more so when you want to share your tool with others. docopt is a language-agnostic framework to provide help and define the possible options that your tool accepts. There are implementations available in just about any programming language, including Bash, Python, and R.
- If you want to learn more about programming in Bash, I recommend Classic Shell Scripting by Arnold Robbins and Nelson Beebe and Bash Cookbook by Carl Albing and JP Vossen.
- Writing a robust and safe Bash script is quite tricky. ShellCheck is an online tool that will check your Bash code for mistakes and vulnerabilities. There’s also a command-line tool available.
- The book Ten Essays on Fizz Buzz by Joel Grus is an insightful and fun collection of ten different ways to solve Fizz Buzz with Python.