$ docker pull datascienceworkshops/data-science-at-the-command-line
Data Science at the Command Line is a book written by Jeroen Janssens. This website contains information about upcoming events, instructions on how to get started with the Docker image, and an overview of all the command-line tools discussed in the book.
This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
Discover why the command line is an agile, scalable, and extensible technology. Even if you're already comfortable processing data with, say, Python or R, you'll greatly improve your data science workflow by also leveraging the power of the command line.
Did you know you can learn from the author Jeroen Janssens in person? Data Science Workshops B.V., a company he founded in January 2017, offers in-company training and coaching in topics such as programming, data visualization, and machine learning. Jeroen's first hands-on workshop three years ago was about Data Science at the Command Line, and this remains, together with Python and R, one of the more popular subjects.
The process is simple: we determine the topics that you're interested in, decide whether it should be a one-, two-, or three-day hands-on workshop, and pick the date(s). Typical group sizes are between 5 and 15 people and prices range from €3,500 to €12,000. The exact price depends on the duration, location, number of participants, and whether you want to include your own data. Don't hesitate to contact us if you'd like to know more.
Because the book discusses more than 80 command-line tools, many of which have very particular installation instructions, it would take you the better part of the day to get started. To get you up and running quickly, whether you're on Linux, MacOS, or Windows, we created a Docker image that contains all the tools pre-installed.
To install the Docker image, you first need to download and install Docker itself and then run the following command on your terminal or command prompt (don't type the dollar sign):
$ docker pull datascienceworkshops/data-science-at-the-command-line
You can start a Docker container as follows (remember the directory from which you run this):
$ docker run --rm -v`pwd`:/data \
> -it datascienceworkshops/data-science-at-the-command-line
Please note that those are backticks, not single quotes. You're now inside a Linux environment (Alpine Linux, to be exact) with all the necessary command-line tools installed. The directory from which you started the container is mapped to the /data directory. This enables you to get data in and out of the container.
If the following command produces an enthusiastic cow, then you know everything is working correctly:
$ cowsay "Let's go!"
Enjoy!
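This is an overview of all the command-line tools discussed in the book. (Please note that, due to time constraints, the webcast will only discuss a subset.) This includes binary executables, interpreted scripts, and Bash builtins and keywords. For each command-line tool, we state, when available and appropriate, the following information: a description, the software package it belongs to, the version, the author(s), the year, a website with more information, how to install it, how to obtain help, and an example usage.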
All command-line tools listed here are included in the Data Science Toolbox for Data Science at the Command Line. See the previous sections for instructions on how to set it up. The install commands assume that you're running Ubuntu 14.04. Please note that citing open source software is not trivial, and that some information may be missing or incorrect.
Define or display aliases. Alias is a Bash builtin.
$ help alias
$ alias ll='ls -alF'
Pattern scanning and text processing language. Mawk (version 1.3.3) by Mike Brennan (1994). http://invisible-island.net/mawk.
$ sudo apt-get install mawk
$ man awk
$ seq 5 | awk '{sum+=$1} END {print sum}'
15
Manage AWS Services such as EC2 and S3 from the command line. AWS Command Line Interface (version 1.3.24) by Amazon Web Services (2014). http://aws.amazon.com/cli.
$ sudo pip install awscli
$ aws help
$ aws ec2 describe-regions | head -n 5
{
    "Regions": [
        {
            "Endpoint": "ec2.eu-west-1.amazonaws.com",
            "RegionName": "eu-west-1"
GNU Bourne-Again SHell. Bash (version 4.3) by Brian Fox and Chet Ramey (2010). http://www.gnu.org/software/bash.
$ sudo apt-get install bash
$ man bash
Evaluate equation from standard input. Bc (version 1.06.95) by Philip A. Nelson (2006). http://www.gnu.org/software/bc.
$ sudo apt-get install bc
$ man bc
$ echo 'e(1)' | bc -l
2.71828182845904523536
Access BigML’s prediction API. BigMLer (version 1.12.2) by BigML (2014). http://bigmler.readthedocs.org.
$ sudo pip install bigmler
$ bigmler --help
Apply expression to all but the first line. Useful if you want to apply classic command-line tools to CSV files with a header. Body by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ echo -e "value\n7\n2\n5\n3" | body sort -n
value
2
3
5
7
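The idea behind body can be approximated with a small shell function. This is only a sketch under the assumption that the header is exactly one line; the name body_sketch is hypothetical and this is not the script that ships with the book.

```shell
# Hypothetical sketch of the idea behind body (not the book's implementation):
# print the first line unchanged, then run the given command on the rest.
body_sketch() {
    IFS= read -r header          # consume only the header line from standard input
    printf '%s\n' "$header"      # pass the header through untouched
    "$@"                         # the given command processes the remaining lines
}

printf 'value\n7\n2\n5\n3\n' | body_sketch sort -n
```

Because the header is read and printed before the command starts, sort only ever sees the data rows.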
Concatenate files and standard input, and print on standard output. Cat (version 8.21) by Torbjorn Granlund and Richard M. Stallman (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man cat
$ cat results-01 results-02 results-03 > results-all
Change the shell working directory. Cd is a Bash builtin.
$ help cd
$ cd ~; pwd; cd ..; pwd
/home/vagrant
/home
Change file mode bits. We use it to make our command-line tools executable. Chmod (version 8.21) by David MacKenzie and Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man chmod
$ chmod u+x experiment.sh
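As a slightly fuller illustration, the following sketch creates a throwaway script, makes it executable, and runs it. The temporary directory and the script contents are made up for this example.

```shell
# Create a tiny script in a temporary directory (contents are illustrative).
dir=$(mktemp -d)
printf '#!/bin/sh\necho "experiment done"\n' > "$dir/experiment.sh"

chmod u+x "$dir/experiment.sh"   # add execute permission for the file's owner
"$dir/experiment.sh"             # the script can now be invoked directly
```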
Apply a command to a subset of the columns and merge the result back with the remaining columns. Cols by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ < iris.csv cols -C species body tapkee --method pca | header -r x,y,species
Generate an ASCII picture of a cow with a message. Particularly useful when building up a particular pipeline is starting to frustrate you a bit too much. Cowsay (version 3.03+dfsg1) by Tony Monroe (1999).
$ sudo apt-get install cowsay
$ man cowsay
$ echo 'The command line is awesome!' | cowsay
 ______________________________
< The command line is awesome! >
 ------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
Copy files and directories. Cp (version 8.21) by Torbjorn Granlund, David MacKenzie, and Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man cp
Extract columns from CSV data. Like the cut command-line tool, but for tabular data. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ csvcut --help
Filter tabular data to only those rows where certain columns contain a given value or match a regular expression. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ csvgrep --help
Merge two or more CSV tables together using a method analogous to SQL JOIN operation. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ csvjoin --help
Render a CSV file to the command line in a readable, fixed-width format. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ csvlook --help
$ echo -e "a,b\n1,2\n3,4" | csvlook
|----+----|
| a | b |
|----+----|
| 1 | 2 |
| 3 | 4 |
|----+----|
Sort CSV files. Like the sort command-line tool, but for tabular data. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ csvsort --help
Execute SQL queries directly on CSV data or insert CSV into a database. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ csvsql --help
Stack up the rows from multiple CSV files, optionally adding a grouping value to each row. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ csvstack --help
Print descriptive statistics for all columns in a CSV file. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ csvstat --help
Download data from a URL. cURL (version 7.35.0) by Daniel Stenberg (2012). http://curl.haxx.se.
$ sudo apt-get install curl
$ man curl
Remove sections from each line of files. Cut (version 8.21) by David M. Ihnat, David MacKenzie, and Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man cut
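A minimal example, assuming comma-delimited input made up for illustration, that keeps the first and third fields of each line:

```shell
# -d sets the field delimiter, -f selects which fields to keep.
printf 'a,b,c\n1,2,3\n' | cut -d, -f1,3
# prints:
# a,c
# 1,3
```

Note that cut splits on every delimiter it sees, so it is not aware of quoted CSV fields.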
Display an image or image sequence on any X server. Can read image data from standard input. Display (version 8:6.7.7.10) by ImageMagick Studio LLC (2009). http://www.imagemagick.org.
$ sudo apt-get install imagemagick
$ man display
Manage a data workflow. Drake (version 0.1.6) by Factual (2014). https://github.com/Factual/drake.
$ # Please see Chapter 6 for installation instructions.
$ drake --help
Generate sequence of dates relative to today. Dseq by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ dseq -2 0 # day before yesterday till today
2014-07-15
2014-07-16
2014-07-17
Display a line of text. Echo (version 8.21) by Brian Fox and Chet Ramey (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man echo
Run a program in a modified environment. It’s often used to specify which interpreter should run our script. Env (version 8.21) by Richard Mlynarik and David MacKenzie (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man env
$ #!/usr/bin/env python
Set export attribute for shell variables. Useful for making shell variables available to other command-line tools. Export is a Bash builtin.
$ help export
$ export WEKAPATH=$HOME/bin
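To see what export changes, the sketch below compares what a child process sees before and after exporting. The variable name GREETING is hypothetical and chosen only for this illustration.

```shell
unset GREETING                                    # start from a clean slate
GREETING=hello                                    # plain shell variable, not exported
sh -c 'echo "child sees: ${GREETING:-nothing}"'   # prints: child sees: nothing

export GREETING                                   # mark it for export to child processes
sh -c 'echo "child sees: ${GREETING:-nothing}"'   # prints: child sees: hello
```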
Generate a script for gnuplot while passing data to standard input. Feedgnuplot (version 1.32) by Dima Kogan (2014). http://search.cpan.org/perldoc?feedgnuplot.
$ sudo apt-get install feedgnuplot
$ man feedgnuplot
Search for files in a directory hierarchy. Find (version 4.4.2) by James Youngman (2008). http://www.gnu.org/software/findutils.
$ sudo apt-get install findutils
$ man find
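A small sketch of a typical use, locating all CSV files below a directory. The directory layout and file names are invented for this example.

```shell
# Build a throwaway directory tree to search in.
dir=$(mktemp -d)
mkdir -p "$dir/raw" "$dir/clean"
touch "$dir/raw/a.csv" "$dir/raw/notes.txt" "$dir/clean/b.csv"

# List regular files whose names end in .csv anywhere below $dir.
find "$dir" -type f -name '*.csv' | sort
```

The pattern is quoted so that the shell passes it to find unchanged instead of expanding it first.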
Execute commands for each member in a list. In Chapter 8, we discuss the advantages of using parallel instead of for. For is a Bash keyword.
$ help for
$ for i in {A..C} "It's easy as" {1..3}; do echo $i; done
A
B
C
It's easy as
1
2
3
Manage repositories for git, which is a distributed version control system. Git (version 1:1.9.1) by Linus Torvalds and Junio C. Hamano (2014). http://git-scm.com.
$ sudo apt-get install git
$ man git
Print lines matching a pattern. Grep (version 2.16) by Jim Meyering (2012). http://www.gnu.org/software/grep.
$ sudo apt-get install grep
$ man grep
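A minimal example with made-up input, printing only the lines that contain the pattern:

```shell
# Only lines matching the pattern 'an' are printed.
printf 'apple\nbanana\ncherry\n' | grep 'an'
# prints:
# banana
```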
Output the first part of files. Head (version 8.21) by David MacKenzie and Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man head
$ seq 5 | head -n 3
1
2
3
Add, replace, and delete header lines. Header by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ header -h
Convert common, but less awesome, tabular data formats to CSV. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ in2csv --help
Process JSON. Jq (version jq-1.4) by Stephen Dolan (2014). http://stedolan.github.com/jq.
$ # See website for installation instructions
$ # See website for documentation
Convert JSON to CSV. Json2Csv (version 1.1) by Jehiah Czebotar (2014). https://github.com/jehiah/json2csv.
$ go get github.com/jehiah/json2csv
$ json2csv --help
Paginate large files. Less (version 458) by Mark Nudelman (2013). http://www.greenwoodsoftware.com/less.
$ sudo apt-get install less
$ man less
$ csvlook iris.csv | less
List directory contents. Ls (version 8.21) by Richard M. Stallman and David MacKenzie (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man ls
Read reference manuals of command-line tools. Man (version 2.6.7.1) by John W. Eaton and Colin Watson (2014).
$ sudo apt-get install man
$ man man
$ man grep
Make directories. Mkdir (version 8.21) by David MacKenzie (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man mkdir
Move or rename files and directories. Mv (version 8.21) by Mike Parker, David MacKenzie, and Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man mv
Build and execute shell command lines from standard input in parallel. GNU Parallel (version 20140622) by Ole Tange (2014). http://www.gnu.org/software/parallel.
$ # See website for installation instructions
$ man parallel
$ seq 3 | parallel echo Processing file {}.csv
Processing file 1.csv
Processing file 2.csv
Processing file 3.csv
Merge lines of files. Paste (version 8.21) by David M. Ihnat and David MacKenzie (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man paste
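As a short illustration, paste can also fold consecutive lines of standard input into columns. Each "-" below stands for standard input, so two of them consume two lines per output row.

```shell
# Join every two input lines with a comma.
seq 6 | paste -d, - -
# prints:
# 1,2
# 3,4
# 5,6
```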
Run bc with parallel. First column of input CSV is mapped to {1}, second to {2}, and so forth. Pbc by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ seq 5 | pbc '{1}^2'
1
4
9
16
25
Install and manage Python packages. Pip (version 1.5.4) by PyPA (2014). https://pip.pypa.io.
$ sudo apt-get install python-pip
$ man pip
Print name of current working directory. Pwd (version 8.21) by Jim Meyering (2012). Pwd is also available as a Bash builtin. http://www.gnu.org/software/coreutils.
$ man pwd
$ pwd
/home/vagrant
Execute Python, which is an interpreted, interactive, and object-oriented programming language. Python (version 2.7.5) by Python Software Foundation (2014). http://www.python.org.
$ sudo apt-get install python
$ man python
Analyze data and create visualizations with the R programming language. To install the latest version of R on Ubuntu, follow the instructions on http://cran.r-project.org/bin/linux/ubuntu/README. R (version 3.1.1) by R Foundation for Statistical Computing (2014). http://www.r-project.org.
$ sudo apt-get install r-base-dev
$ man R
Load CSV from standard input into R as a data.frame, execute given commands, and get the output as CSV or PNG. Rio by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ Rio -h
$ seq 10 | Rio -nf sum
55
Create a scatter plot from CSV using Rio. Rio-Scatter by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ < iris.csv Rio-scatter sepal_length sepal_width species > iris.png
Remove files or directories. Rm (version 8.21) by Paul Rubin, David MacKenzie, Richard M. Stallman, and Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man rm
Run machine learning experiments with scikit-learn. SciKit-Learn Laboratory (version 0.26.0) by Educational Testing Service (2014). https://skll.readthedocs.org.
$ sudo pip install skll
$ run_experiment --help
Print lines from standard input with a given probability, for a given duration, and with a given delay between lines. Sample by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ sample --help
Copy remote files securely. Scp (version 1:6.6p1) by Timo Rinne and Tatu Ylonen (2014). http://www.openssh.com.
$ sudo apt-get install openssh-client
$ man scp
Extract HTML elements using an XPath query or CSS3 selector. Scrape by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ curl -sL 'http://datasciencetoolbox.org' | scrape -e 'head > title'
<title>Data Science Toolbox</title>
Filter and transform text. Sed (version 4.2.2) by Jay Fenlason, Tom Lord, Ken Pizzini, and Paolo Bonzini (2012). http://www.gnu.org/software/sed.
$ sudo apt-get install sed
$ man sed
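A minimal example of the most common use, substituting text on each line. The input is made up for illustration.

```shell
# s/old/new/ replaces the first occurrence of "old" on each line.
echo 'hello world' | sed 's/world/command line/'
# prints:
# hello command line
```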
Print a sequence of numbers. Seq (version 8.21) by Ulrich Drepper (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man seq
$ seq 5
1
2
3
4
5
Generate random permutations. Shuf (version 8.21) by Paul Eggert (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man shuf
Sort lines of text files. Sort (version 8.21) by Mike Haertel and Paul Eggert (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man sort
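The following sketch, with made-up numbers, shows why the -n option matters when sorting numeric data:

```shell
# Default sort compares lines as text, so "100" sorts before "9".
printf '10\n9\n100\n' | sort

# With -n, lines are compared by numeric value.
printf '10\n9\n100\n' | sort -n
# prints:
# 9
# 10
# 100
```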
Split a file into pieces. Split (version 8.21) by Torbjorn Granlund and Richard M. Stallman (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man split
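As an illustration, the sketch below splits ten lines into pieces of at most four lines each. The prefix part_ and the temporary directory are arbitrary choices for this example.

```shell
dir=$(mktemp -d)

# Read from standard input ("-") and write pieces named part_aa, part_ab, ...
( cd "$dir" && seq 10 | split -l 4 - part_ )

ls "$dir"              # three pieces: part_aa, part_ab, part_ac
cat "$dir/part_ac"     # the last piece holds the remaining two lines
```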
Execute arbitrary commands against an SQL database and output the results as CSV. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ sql2csv --help
Log in to remote machines. OpenSSH client (version 1.8.9) by Tatu Ylonen, Aaron Campbell, Bob Beck, Markus Friedl, Niels Provos, Theo de Raadt, and Dug Song (2014). http://www.openssh.com.
$ sudo apt-get install ssh
$ man ssh
Execute a command as another user. Sudo (version 1.8.9p5) by Todd C. Miller (2013). http://www.sudo.ws/sudo.
$ sudo apt-get install sudo
$ man sudo
Output the last part of files. Tail (version 8.21) by Paul Rubin, David MacKenzie, Ian Lance Taylor, and Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man tail
$ seq 5 | tail -n 3
3
4
5
Reduce dimensionality of a data set using various algorithms. Tapkee by Sergey Lisitsyn and Fernando Iglesias (2014). http://tapkee.lisitsyn.me.
$ # See website for installation instructions
$ tapkee --help
$ < iris.csv cols -C species body tapkee --method pca | header -r x,y,species
Create, list, and extract tar archives. Tar (version 1.27.1) by Jeff Bailey, Paul Eggert, and Sergey Poznyakoff (2014). http://www.gnu.org/software/tar.
$ sudo apt-get install tar
$ man tar
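A small sketch of creating and then listing a compressed archive. The directory and file names are invented for this example.

```shell
dir=$(mktemp -d)
mkdir "$dir/data"
echo 'x' > "$dir/data/a.csv"

# -c create, -z compress with gzip, -f archive file; -C changes directory first
# so that the archive stores the relative path data/ instead of the full path.
tar -czf "$dir/data.tgz" -C "$dir" data

# -t lists the contents without extracting them.
tar -tzf "$dir/data.tgz"
```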
Read from standard input and write to standard output and files. Tee (version 8.21) by Mike Parker, Richard M. Stallman, and David MacKenzie (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man tee
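A minimal illustration: tee saves an intermediate result to a file while the pipeline continues. The temporary file stands in for whatever file you would keep.

```shell
log=$(mktemp)

# The numbers are written to $log and also passed on to wc.
seq 3 | tee "$log" | wc -l

cat "$log"
# prints:
# 1
# 2
# 3
```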
Translate or delete characters. Tr (version 8.21) by Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man tr
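Two short examples with made-up input, one translating and one deleting characters:

```shell
# Translate lowercase letters to uppercase.
echo 'hello world' | tr '[:lower:]' '[:upper:]'
# prints:
# HELLO WORLD

# Delete every occurrence of the characters l and o.
echo 'hello world' | tr -d 'lo'
# prints:
# he wrd
```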
List contents of directories in a tree-like format. Tree (version 1.6.0) by Steve Baker (2014). https://launchpad.net/ubuntu/+source/tree.
$ sudo apt-get install tree
$ man tree
Display the type of a command-line tool. Type is a Bash builtin.
$ help type
$ type cd
cd is a shell builtin
Report or omit repeated lines. Uniq (version 8.21) by Richard M. Stallman and David MacKenzie (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man uniq
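Because uniq only collapses adjacent repeated lines, it is usually combined with sort. The sketch below, with made-up input, shows the difference:

```shell
# Without sorting, the final "a" is not adjacent to the first two and survives.
printf 'a\na\nb\na\n' | uniq
# prints:
# a
# b
# a

# Sorting first brings all duplicates together.
printf 'a\na\nb\na\n' | sort | uniq
# prints:
# a
# b
```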
Extract common file formats. Unpack by Patrick Brisbin (2013). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ unpack file.tgz
Extract files from rar archives. Unrar (version 1:0.0.1+cvs20071127) by Ben Asselstine, Christian Scheurer, and Johannes Winkelmann (2014). http://home.gna.org/unrar.
$ sudo apt-get install unrar-free
$ man unrar
List, test and extract compressed files in a ZIP archive. Unzip (version 6.0) by Samuel H. Smith (2009). http://www.info-zip.org/pub/infozip.
$ sudo apt-get install unzip
$ man unzip
Print newline, word, and byte counts for each file. Wc (version 8.21) by Paul Rubin and David MacKenzie (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man wc
$ echo 'hello world' | wc -c
12
Weka is a collection of machine learning algorithms for data mining tasks by Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. This command-line tool allows you to run Weka from the command line. Weka command-line tool by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
Locate a command-line tool. Does not work for Bash builtins. Which by Unknown (2009).
$ man which
$ which man
/usr/bin/man
Convert XML to JSON. Xml2Json (version 0.0.2) by Francois Parmentier (2014). https://github.com/parmentf/xml2json.
$ npm install xml2json-command
$ xml2json < input.xml > output.json