3 Obtaining Data

This chapter deals with the first step of the OSEMN model: obtaining data. After all, without any data, there is not much data science that we can do. I assume that the data you need to solve your data science problem already exists. Your first task is to get this data onto your computer (and possibly also inside the Docker container) in a form that you can work with.

According to the Unix philosophy, text is a universal interface. Almost every command-line tool takes text as input, produces text as output, or both. This is the main reason why command-line tools can work so well together. However, as we’ll see, even just text can come in multiple forms.

Data can be obtained in several ways—for example by downloading it from a server, querying a database, or connecting to a Web API. Sometimes, the data comes in a compressed form or in a binary format such as a Microsoft Excel Spreadsheet. In this chapter, I discuss several tools that help tackle this from the command line, including: curl33, in2csv34, sql2csv35, and tar36.

3.1 Overview

In this chapter, you’ll learn how to:

  • Copy local files to the Docker container
  • Download data from the Internet
  • Decompress files
  • Extract data from spreadsheets
  • Query relational databases
  • Call web APIs

This chapter starts with the following files:

$ cd /data/ch03
 
$ l
total 924K
-rw-r--r-- 1 dst dst 627K Dec 14 11:46 logs.tar.gz
-rw-r--r-- 1 dst dst 189K Dec 14 11:46 r-datasets.db
-rw-r--r-- 1 dst dst  149 Dec 14 11:46 tmnt-basic.csv
-rw-r--r-- 1 dst dst  148 Dec 14 11:46 tmnt-missing-newline.csv
-rw-r--r-- 1 dst dst  181 Dec 14 11:46 tmnt-with-header.csv
-rw-r--r-- 1 dst dst  91K Dec 14 11:46 top2000.xlsx

The instructions to get these files are in Chapter 2. Any other files are either downloaded or generated using command-line tools.

3.2 Copying Local Files to the Docker Container

A common situation is that you already have the necessary files on your own computer. This section explains how you can get those files into the Docker container.

I mentioned in Chapter 2 that the Docker container is an isolated virtual environment. Luckily, there is one exception to that: files can be transferred in and out of the Docker container. The local directory from which you ran docker run is mapped to a directory inside the Docker container. This directory is called /data. Note that this is not the home directory, which is /home/dst.
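
As a refresher from Chapter 2, this mapping is set up by the -v option of docker run. A minimal sketch, where the image name is a placeholder for whichever image you're using:

$ docker run --rm -it -v "$(pwd)":/data <image>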

If you have one or more files on your local computer, and you want to apply some command-line tools to them, all you have to do is copy or move the files to that mapped directory. Let’s assume that you have a file called logs.csv in your Downloads directory.

If you’re running Windows, open the command prompt or PowerShell and run the following two commands:

> cd %UserProfile%\Downloads
> copy logs.csv MyDataScienceToolbox\

If you are running Linux or macOS, open a terminal and execute the following command on your operating system (and not inside the Docker container):

$ cp ~/Downloads/logs.csv ~/my-data-science-toolbox

You can also drag-and-drop the file into the right directory using a graphical file manager such as Windows Explorer or macOS Finder.

3.3 Downloading from the Internet

The Internet provides, without a doubt, the largest resource for interesting data. The command-line tool curl can be considered the command line’s Swiss Army knife when it comes to downloading data from the Internet.

3.3.1 Introducing curl

When you browse to a URL, which stands for uniform resource locator, your browser interprets the data it downloads. For example, the browser renders HTML files, plays video files automatically, and shows PDF files. However, when you use curl to access a URL, it downloads the data and, by default, prints it to standard output. curl doesn’t do any interpretation, but luckily other command-line tools can be used to process the data further.

The easiest invocation of curl is to specify a URL as a command-line argument. Let’s try downloading an article from Wikipedia:

$ curl "https://en.wikipedia.org/wiki/List_of_windmills_in_the_Netherlands" |
> trim 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0<!
DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of windmills in the Netherlands - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":…
"wgRelevantPageName":"List_of_windmills_in_the_Netherlands","wgRelevantArticleI…
,"site.styles":"ready","user.styles":"ready","ext.globalCssJs.user":"ready","us…
"ext.growthExperiments.SuggestedEditSession"];</script>
<script>(RLQ=window.RLQ||[]).push(function(){mw.loader.implement("user.options@…
100  249k    0  249k    0     0   853k      0 --:--:-- --:--:-- --:--:--  867k
… with 1754 more lines

Remember, trim is used only to make the output fit nicely in the book.

As you can see, curl downloads the raw HTML returned by Wikipedia’s server; no interpretation is being done and all the content is immediately printed on standard output. Because of the URL, you’d think that this article would list all the windmills in the Netherlands. However, there are apparently so many windmills left that each province has its own page. Fascinating.

By default, curl outputs a progress meter that shows the download rate and the expected time of completion. This output isn’t written to standard output but to a separate channel, known as standard error, so it doesn’t interfere when you add another tool to the pipeline. While this information can be useful when downloading very large files, I usually find it distracting, so I specify the -s option to silence it.

$ curl -s "https://en.wikipedia.org/wiki/List_of_windmills_in_Friesland" |
> pup -n 'table.wikitable tr' 
234

I’ll discuss pup37, a handy tool for scraping websites, in more detail in Chapter 5.

And what do you know, there are apparently 234 windmills in the province of Friesland alone!

3.3.2 Saving

You can let curl save the output to a file by adding the -O option. The filename will be based on the last part of the URL.

$ curl -s "https://en.wikipedia.org/wiki/List_of_windmills_in_Friesland" -O
 
$ l
total 1.4M
-rw-r--r-- 1 dst dst 432K Dec 14 11:46 List_of_windmills_in_Friesland
-rw-r--r-- 1 dst dst 627K Dec 14 11:46 logs.tar.gz
-rw-r--r-- 1 dst dst 189K Dec 14 11:46 r-datasets.db
-rw-r--r-- 1 dst dst  149 Dec 14 11:46 tmnt-basic.csv
-rw-r--r-- 1 dst dst  148 Dec 14 11:46 tmnt-missing-newline.csv
-rw-r--r-- 1 dst dst  181 Dec 14 11:46 tmnt-with-header.csv
-rw-r--r-- 1 dst dst  91K Dec 14 11:46 top2000.xlsx

If you don’t like that filename, you can use the -o option together with a filename, or redirect the output to a file yourself:

$ curl -s "https://en.wikipedia.org/wiki/List_of_windmills_in_Friesland" > fries
land.html
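
Or you can use the -o option to pick the filename yourself; this produces the same file as the redirection above:

$ curl -s "https://en.wikipedia.org/wiki/List_of_windmills_in_Friesland" -o friesland.html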

3.3.3 Other Protocols

In total, curl supports more than 20 protocols. To download from an FTP (File Transfer Protocol) server, you use curl the same way. Here I download the file welcome.msg from ftp.gnu.org:

$ curl -s "ftp://ftp.gnu.org/welcome.msg" | trim
NOTICE (Updated October 15 2021):
 
If you maintain scripts used to access ftp.gnu.org over FTP,
we strongly encourage you to change them to use HTTPS instead.
 
Eventually we hope to shut down FTP protocol access, but plan
to give notice here and other places for several months ahead
of time.
 
--
… with 19 more lines

If the specified URL is a directory, curl will list the contents of that directory. When the URL is password protected, you can specify a username and a password with the -u option.
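
For example, a sketch with a hypothetical FTP server and made-up credentials:

$ curl -s -u username:password "ftp://ftp.example.com/path/to/data.csv"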

Or how about the DICT protocol, which allows you to access various dictionaries and look up definitions? Here’s the definition of “windmill” according to the Collaborative International Dictionary of English:

$ curl -s "dict://dict.org/d:windmill" | trim
220 dict.dict.org dictd 1.12.1/rf on Linux 4.19.0-10-amd64 <auth.mime> <8291886…
250 ok
150 1 definitions retrieved
151 "Windmill" gcide "The Collaborative International Dictionary of English v.0…
Windmill \Wind"mill`\, n.
   A mill operated by the power of the wind, usually by the
   action of the wind upon oblique vanes or sails which radiate
   from a horizontal shaft. --Chaucer.
   [1913 Webster]
.
… with 2 more lines

When downloading data from the Internet, however, the protocol will most likely be HTTP, so the URL will start with either http:// or https://.

3.3.4 Following Redirects

When you access a shortened URL, such as the ones that start with http://bit.ly/ or http://t.co/, your browser automatically redirects you to the correct location. With curl, however, you need to specify the -L or --location option in order to be redirected. If you don’t, you can get something like:

$ curl -s "https://bit.ly/2XBxvwK"
<html>
<head><title>Bitly</title></head>
<body><a href="https://youtu.be/dQw4w9WgXcQ">moved here</a></body>
</html>%

Sometimes you get nothing back, just like when we follow the URL mentioned above:

$ curl -s "https://youtu.be/dQw4w9WgXcQ"

By specifying the -I or --head option, curl fetches only the HTTP header of the response, which allows you to inspect the status code and other information returned by the server.

$ curl -sI "https://youtu.be/dQw4w9WgXcQ" | trim
HTTP/2 303
content-type: application/binary
x-content-type-options: nosniff
cache-control: no-cache, no-store, max-age=0, must-revalidate
pragma: no-cache
expires: Mon, 01 Jan 1990 00:00:00 GMT
date: Tue, 14 Dec 2021 10:46:22 GMT
location: https://www.youtube.com/watch?v=dQw4w9WgXcQ&feature=youtu.be
content-length: 0
x-frame-options: SAMEORIGIN
… with 11 more lines

The first line shows the protocol followed by the HTTP status code, which is 303 in this case. You can also see the location this URL redirects to. Inspecting the header and getting the status code is a useful debugging tool in case curl does not give you the expected result. Other common HTTP status codes include 404 (not found) and 403 (forbidden). Wikipedia has a page that lists all HTTP status codes.
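
Once you know where the short URL points, adding the -L option makes curl follow the redirect and download the final page (output omitted here; the filename is arbitrary):

$ curl -sL "https://bit.ly/2XBxvwK" > video.html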

In summary, curl is a useful command-line tool for downloading data from the Internet. Its three most common options are -s to silence the progress meter, -u to specify a username and password, and -L to automatically follow redirects. See its man page for more information (and to make your head spin):

$ man curl | trim 20
curl(1)                           Curl Manual                          curl(1)
 
NAME
       curl - transfer a URL
 
SYNOPSIS
       curl [options / URLs]
 
DESCRIPTION
       curl  is  a tool to transfer data from or to a server, using one of the
       supported protocols (DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS,  IMAP,
       IMAPS,  LDAP,  LDAPS,  MQTT, POP3, POP3S, RTMP, RTMPS, RTSP, SCP, SFTP,
       SMB, SMBS, SMTP, SMTPS, TELNET and TFTP). The command  is  designed  to
       work without user interaction.
 
       curl offers a busload of useful tricks like proxy support, user authen‐
       tication, FTP upload, HTTP post, SSL connections, cookies, file  trans‐
       fer  resume,  Metalink,  and more. As you will see below, the number of
       features will make your head spin!
 
… with 3986 more lines

3.4 Decompressing Files

If the original dataset is very large or it’s a collection of many files, the file may be a compressed archive. Datasets which contain many repeated values (such as the words in a text file or the keys in a JSON file) are especially well suited for compression.

Common file extensions of compressed archives are: .tar.gz, .zip, and .rar. To decompress these, you would use the command-line tools tar, unzip38, and unrar39, respectively. (There are a few more, though less common, file extensions for which you would need yet other tools.)
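
For instance, a hypothetical ZIP archive could be extracted into its own directory like so:

$ unzip data.zip -d data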

Let’s take tar.gz (pronounced as “gzipped tarball”) as an example. In order to extract an archive named logs.tar.gz, you would use the following incantation:

$ tar -xzf logs.tar.gz  

It’s common to combine these three short options, like I did here, but you can also specify them separately as -x -z -f. In fact, many command-line tools allow you to combine options that consist of a single character.

Indeed, tar is notorious for its many command-line arguments. In this case, the three options -x, -z, and -f specify that tar should extract files from an archive, use gzip as the decompression algorithm, and read from the file logs.tar.gz.
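
With GNU tar, the same invocation can also be written using long options, which can make scripts easier to read:

$ tar --extract --gzip --file logs.tar.gz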

However, since we’re not yet familiar with this archive, it’s a good idea to first examine its contents. This can be done with the -t option (instead of the -x option):

$ tar -tzf logs.tar.gz | trim
E1FOSPSAYDNUZI.2020-09-01-00.0dd00628
E1FOSPSAYDNUZI.2020-09-01-00.b717c457
E1FOSPSAYDNUZI.2020-09-01-01.05f904a4
E1FOSPSAYDNUZI.2020-09-01-02.36588daf
E1FOSPSAYDNUZI.2020-09-01-02.6cea8b1d
E1FOSPSAYDNUZI.2020-09-01-02.be4bc86d
E1FOSPSAYDNUZI.2020-09-01-03.16f3fa32
E1FOSPSAYDNUZI.2020-09-01-03.1c0a370f
E1FOSPSAYDNUZI.2020-09-01-03.76df64bf
E1FOSPSAYDNUZI.2020-09-01-04.0a1ade1b
… with 2427 more lines

It seems that this archive contains a lot of files, and they are not inside a directory. In order to keep the current directory clean, it’s a good idea to first create a new directory using mkdir and extract those files there using the -C option.

$ mkdir logs
 
$ tar -xzf logs.tar.gz -C logs

Let’s verify the number of files and some of their contents:

$ ls logs | wc -l
2437
 
$ cat logs/* | trim
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem…
2020-09-01      00:51:54        SEA19-C1        391     206.55.174.150  GET    …
2020-09-01      00:54:59        CPH50-C2        384     82.211.213.95   GET    …
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem…
2020-09-01      00:04:28        DFW50-C1        391     2a03:2880:11ff:9::face:…
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem…
2020-09-01      01:04:14        ATL56-C4        385     2600:1700:2760:da20:548…
… with 10279 more lines

Excellent. Now, I understand that you’d like to scrub and explore these log files, but that’s for later in Chapter 5 and Chapter 7.

In time, you’ll get used to these options, but I’d like to show you an alternative, which might be more convenient. Rather than remembering the different command-line tools and their options, there’s a handy script called unpack40, which will decompress many different formats. unpack looks at the extension of the file that you want to decompress, and calls the appropriate command-line tool. Now, in order to decompress this same file, you would run:

$ unpack logs.tar.gz

3.5 Converting Microsoft Excel Spreadsheets to CSV

For many people, Microsoft Excel offers an intuitive way to work with small datasets and perform calculations on them. As a result, a lot of data is embedded into Microsoft Excel spreadsheets. These spreadsheets are, depending on the extension of the filename, stored in either a proprietary binary format (.xls) or as a collection of compressed XML files (.xlsx). In both cases, the data is not readily usable by most command-line tools. It would be a shame if we could not use those valuable datasets just because they are stored this way.

Especially when you’re just starting out at the command line, you might be tempted to convert your spreadsheet to CSV by opening it in Microsoft Excel or an open source variant such as LibreOffice Calc, and manually exporting it to CSV. While this works as a one-off solution, the disadvantage is that it does not scale well to multiple files and cannot be automated. Furthermore, when you’re working on a server, chances are that you don’t have such an application available. Trust me, you’ll get the hang of it.

Luckily, there is a command-line tool called in2csv that converts Microsoft Excel spreadsheets to CSV files. CSV stands for comma-separated values. Working with CSV can be tricky because it lacks a formal specification. Yakov Shafranovich defines the CSV format according to the following three points:41

  1. Each record is located on a separate line, delimited by a line break (CRLF). Take, for example, the following CSV file with crucial information about the Teenage Mutant Ninja Turtles:
$ bat -A tmnt-basic.csv 
───────┬────────────────────────────────────────────────────────────────────────
       │ File: tmnt-basic.csv
───────┼────────────────────────────────────────────────────────────────────────
   1   │ Leonardo,Leo,blue,two·ninjakens
   2   │ Raphael,Raph,red,pair·of·sai
   3   │ Michelangelo,Mikey·or·Mike,orange,pair·of·nunchaku
   4   │ Donatello,Donnie·or·Don,purple,staff
───────┴────────────────────────────────────────────────────────────────────────

The -A option makes bat show all non-printable characters like spaces, tabs, and newlines.

  2. The last record in the file may or may not have an ending line break. For example:
$ bat -A tmnt-missing-newline.csv
───────┬────────────────────────────────────────────────────────────────────────
       │ File: tmnt-missing-newline.csv
───────┼────────────────────────────────────────────────────────────────────────
   1   │ Leonardo,Leo,blue,two·ninjakens
   2   │ Raphael,Raph,red,pair·of·sai
   3   │ Michelangelo,Mikey·or·Mike,orange,pair·of·nunchaku
   4   │ Donatello,Donnie·or·Don,purple,staff
───────┴────────────────────────────────────────────────────────────────────────
  3. There may be a header appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file. For example:
$ bat -A tmnt-with-header.csv
───────┬────────────────────────────────────────────────────────────────────────
       │ File: tmnt-with-header.csv
───────┼────────────────────────────────────────────────────────────────────────
   1   │ name,nickname,mask_color,weapon
   2   │ Leonardo,Leo,blue,two·ninjakens
   3   │ Raphael,Raph,red,pair·of·sai
   4   │ Michelangelo,Mikey·or·Mike,orange,pair·of·nunchaku
   5   │ Donatello,Donnie·or·Don,purple,staff
───────┴────────────────────────────────────────────────────────────────────────

As you can see, CSV by default is not too readable. You can pipe the data to a tool called csvlook42, which will nicely format it into a table. If the CSV data has no header, like tmnt-missing-newline.csv, then you need to add the -H option; otherwise, the first line will be interpreted as the header.

$ csvlook tmnt-with-header.csv
│ name         │ nickname      │ mask_color │ weapon           │
├──────────────┼───────────────┼────────────┼──────────────────┤
│ Leonardo     │ Leo           │ blue       │ two ninjakens    │
│ Raphael      │ Raph          │ red        │ pair of sai      │
│ Michelangelo │ Mikey or Mike │ orange     │ pair of nunchaku │
│ Donatello    │ Donnie or Don │ purple     │ staff            │
 
$ csvlook tmnt-basic.csv
│ Leonardo     │ Leo           │ blue   │ two ninjakens    │
├──────────────┼───────────────┼────────┼──────────────────┤
│ Raphael      │ Raph          │ red    │ pair of sai      │
│ Michelangelo │ Mikey or Mike │ orange │ pair of nunchaku │
│ Donatello    │ Donnie or Don │ purple │ staff            │
 
$ csvlook -H tmnt-missing-newline.csv 
│ a            │ b             │ c      │ d                │
├──────────────┼───────────────┼────────┼──────────────────┤
│ Leonardo     │ Leo           │ blue   │ two ninjakens    │
│ Raphael      │ Raph          │ red    │ pair of sai      │
│ Michelangelo │ Mikey or Mike │ orange │ pair of nunchaku │
│ Donatello    │ Donnie or Don │ purple │ staff            │

The -H option specifies that the CSV file has no header.

Let’s demonstrate in2csv using a spreadsheet that contains the 2000 most popular songs according to the annual Dutch marathon radio program Top 2000. To extract its data, you invoke in2csv as follows:

$ curl https://cms-assets.nporadio.nl/npoRadio2/TOP-2000-2020.xlsx > top2000.xlsx
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 41744  100 41744    0     0  46024      0 --:--:-- --:--:-- --:--:-- 46228
 
$ in2csv top2000.xlsx | tee top2000.csv | trim
NR.,ARTIEST,TITEL,JAAR
1,Danny Vera,Roller Coaster,2019
2,Queen,Bohemian Rhapsody,1975
3,Eagles,Hotel California,1977
4,Billy Joel,Piano Man,1974
5,Led Zeppelin,Stairway To Heaven,1971
6,Pearl Jam,Black,1992
7,Boudewijn de Groot,Avond,1997
8,Coldplay,Fix You,2005
9,Pink Floyd,Wish You Were Here,1975
… with 1991 more lines

Who is Danny Vera? The most popular song is supposed to be Bohemian Rhapsody, of course. Well, at least Queen appears plenty of times in the Top 2000 so I can’t really complain:

$ csvgrep top2000.csv --columns ARTIEST --regex '^Queen$' | csvlook -I 
| NR.  | ARTIEST | TITEL                           | JAAR |
| ---- | ------- | ------------------------------- | ---- |
| 2    | Queen   | Bohemian Rhapsody               | 1975 |
| 11   | Queen   | Love Of My Life                 | 1975 |
| 46   | Queen   | Innuendo                        | 1991 |
| 55   | Queen   | Don't Stop Me Now               | 1979 |
| 70   | Queen   | Somebody To Love                | 1976 |
| 85   | Queen   | Who Wants To Live Forever       | 1986 |
| 89   | Queen   | The Show Must Go On             | 1991 |
| 131  | Queen   | Killer Queen                    | 1974 |
| 146  | Queen   | I Want To Break Free            | 1984 |
| 217  | Queen   | Radio Ga Ga                     | 1984 |
| 218  | Queen   | We Will Rock You                | 1977 |
| 342  | Queen   | These Are The Days Of Our Lives | 1991 |
| 383  | Queen   | We Are The Champions            | 1977 |
| 387  | Queen   | Another One Bites The Dust      | 1980 |
| 447  | Queen   | Crazy Little Thing Called Love  | 1979 |
| 569  | Queen   | Hammer To Fall                  | 1984 |
| 606  | Queen   | I Want It All                   | 1989 |
| 771  | Queen   | A Kind Of Magic                 | 1986 |
| 776  | Queen   | Fat Bottomed Girls              | 1978 |
| 845  | Queen   | You're My Best Friend           | 1976 |
| 969  | Queen   | Bicycle Race                    | 1978 |
| 973  | Queen   | Breakthru                       | 1989 |
| 986  | Queen   | Tie Your Mother Down            | 1977 |
| 1140 | Queen   | '39                             | 1975 |
| 1206 | Queen   | Thank God It's Christmas        | 1984 |
| 1207 | Queen   | Spread Your Wings               | 1978 |
| 1267 | Queen   | Friends Will Be Friends         | 1986 |
| 1308 | Queen   | One Vision                      | 1985 |
| 1325 | Queen   | Heaven For Everyone             | 1995 |
| 1365 | Queen   | You Don't Fool Me               | 1996 |
| 1493 | Queen   | Save Me                         | 1980 |
| 1860 | Queen   | Good Old-Fashioned Lover Boy    | 1976 |

The value after the --regex option is a regular expression (or regex). It’s a special syntax for defining patterns. Here, I only want to match artists that exactly match “Queen,” so I use the caret (^) and dollar sign ($) to match the start and end of the values in the column ARTIEST.
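
Without those anchors, the pattern would also match any artist whose name merely contains “Queen” (a sketch; output not shown):

$ csvgrep top2000.csv --columns ARTIEST --regex 'Queen' | csvlook -I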

By the way, the tools in2csv, csvgrep, and csvlook are part of CSVkit, which is a collection of command-line tools to work with CSV data.

The format of the file is automatically determined by the extension, .xlsx in this case. If you were to pipe the data into in2csv, you would have to specify the format explicitly.
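
A minimal sketch of that, using the --format option to tell in2csv that it’s reading an xlsx file from standard input:

$ < top2000.xlsx in2csv --format xlsx | trim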

A spreadsheet can contain multiple worksheets. By default, in2csv extracts the first worksheet. To extract a different worksheet, you need to pass the name of the worksheet to the --sheet option. If you’re not sure what the worksheet is called, you can use the --names option, which prints the names of all the worksheets. Here we see that top2000.xlsx has only one sheet, named Blad1 (which is Dutch for Sheet1).

$ in2csv --names top2000.xlsx
Blad1
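
To extract that sheet explicitly, you could pass its name to --sheet; this yields the same CSV as before:

$ in2csv --sheet Blad1 top2000.xlsx | trim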

3.6 Querying Relational Databases

Many companies store their data in a relational database. Just as with spreadsheets, it would be great if we could obtain that data from the command line.

Examples of relational databases are MySQL, PostgreSQL, and SQLite. These databases all have a slightly different way of interfacing with them. Some provide a command-line tool or a command-line interface, while others do not. Moreover, they are not very consistent when it comes to their usage and output.

Fortunately, there is a command-line tool called sql2csv, which is part of the CSVkit suite. It works with many different databases through a common interface, including MySQL, Oracle, PostgreSQL, SQLite, Microsoft SQL Server, and Sybase. The output of sql2csv is, as its name suggests, in CSV format.

We can obtain data from relational databases by executing a SELECT query on them. (sql2csv also supports INSERT, UPDATE, and DELETE queries, but that’s not the purpose of this chapter.)

sql2csv needs two arguments: --db, which specifies the database URL, of which the typical form is dialect+driver://username:password@host:port/database, and --query, which contains the SELECT query. For example, given an SQLite database that contains the standard datasets from R43, I can select all the rows from the table mtcars and sort them by the mpg column as follows:

$ sql2csv --db 'sqlite:///r-datasets.db' \
> --query 'SELECT row_names AS car, mpg FROM mtcars ORDER BY mpg' | csvlook
│ car                 │  mpg │
├─────────────────────┼──────┤
│ Cadillac Fleetwood  │ 10.4 │
│ Lincoln Continental │ 10.4 │
│ Camaro Z28          │ 13.3 │
│ Duster 360          │ 14.3 │
│ Chrysler Imperial   │ 14.7 │
│ Maserati Bora       │ 15.0 │
│ Merc 450SLC         │ 15.2 │
│ AMC Javelin         │ 15.2 │
… with 24 more lines

This SQLite database is a local file, so in this case I don’t need to specify any username, password, or host. If you wish to query the database of your employer, you’d of course need to know how to access it, and you’d need permission to do so.
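
For example, here’s a sketch of what querying a hypothetical PostgreSQL database could look like; the credentials, host, database, and table are all placeholders:

$ sql2csv --db 'postgresql://username:password@localhost:5432/company' \
> --query 'SELECT * FROM customers LIMIT 10' | csvlook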

3.7 Calling Web APIs

In the previous section I explained how to download files from the Internet. Another way data can come from the Internet is through a web API (API stands for application programming interface). The number of APIs being offered is growing at an increasing rate, which means a lot of interesting data for us data scientists.

Web APIs are not meant to be presented in a nice layout the way websites are. Instead, most web APIs return data in a structured format, such as JSON or XML. Having data in a structured form has the advantage that it can easily be processed by other tools, such as jq. For example, the API of Ice and Fire, which contains a lot of information about George R.R. Martin’s fictional world, in which the Game of Thrones books and TV series take place, returns data in the following JSON structure:

$ curl -s "https://anapioficeandfire.com/api/characters/583" | jq '.'
{
  "url": "https://anapioficeandfire.com/api/characters/583",
  "name": "Jon Snow",
  "gender": "Male",
  "culture": "Northmen",
  "born": "In 283 AC",
  "died": "", 
  "titles": [
    "Lord Commander of the Night's Watch"
  ],
  "aliases": [
    "Lord Snow",
    "Ned Stark's Bastard",
    "The Snow of Winterfell",
    "The Crow-Come-Over",
    "The 998th Lord Commander of the Night's Watch",
    "The Bastard of Winterfell",
    "The Black Bastard of the Wall",
    "Lord Crow"
  ],
  "father": "",
  "mother": "",
  "spouse": "",
  "allegiances": [
    "https://anapioficeandfire.com/api/houses/362"
  ],
  "books": [
    "https://anapioficeandfire.com/api/books/5"
  ],
  "povBooks": [
    "https://anapioficeandfire.com/api/books/1",
    "https://anapioficeandfire.com/api/books/2",
    "https://anapioficeandfire.com/api/books/3",
    "https://anapioficeandfire.com/api/books/8"
  ],
  "tvSeries": [
    "Season 1",
    "Season 2",
    "Season 3",
    "Season 4",
    "Season 5",
    "Season 6"
  ],
  "playedBy": [
    "Kit Harington"
  ]
}

Spoiler alert: this data is not entirely up-to-date.

The data is piped to the command-line tool jq just to display it in a nice way. jq has many more scrubbing and exploring capabilities that I’ll discuss in Chapter 5 and Chapter 7.

3.7.1 Authentication

Some web APIs require you to authenticate (that is, to prove your identity) before you can consume their output. There are several ways to do this. Some web APIs use API keys while others use the OAuth protocol. News API, an independent source of headlines and news articles, is a great example. Let’s see what happens when you try to access this API without an API key:

$ curl -s "http://newsapi.org/v2/everything?q=linux" | jq .
{
  "status": "error",
  "code": "apiKeyMissing",
  "message": "Your API key is missing. Append this to the URL with the apiKey pa
ram, or use the x-api-key HTTP header."
}

Well, that was to be expected. The part after the question mark, by the way, is where we pass any query parameters. That’s also the place where you need to specify an API key. I’d like to keep my own API key a secret, so I insert it below by reading the file /data/.secret/newsapi.org_apikey using command substitution.
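
Command substitution, written $(...), replaces itself with the output of the expression inside, and $(< file) is a shorthand for reading a file’s contents. A tiny, throwaway illustration:

$ echo "secret123" > /tmp/example_apikey
 
$ echo "apiKey=$(< /tmp/example_apikey)"
apiKey=secret123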

$ curl -s "http://newsapi.org/v2/everything?q=linux&apiKey=$(< /data/.secret/new
sapi.org_apikey)" |
> jq '.' | trim 30
{
  "status": "ok",
  "totalResults": 9653,
  "articles": [
    {
      "source": {
        "id": "engadget",
        "name": "Engadget"
      },
      "author": "Igor Bonifacic",
      "title": "'Arma 3' and 'DayZ' add BattlEye anti-cheat support through Val…
      "description": "While there are still many unknowns about Steam Deck\r\n
      "url": "https://www.engadget.com/arma-3-dayz-proton-battleye-support-2246…
      "urlToImage": "https://s.yimg.com/os/creatr-uploaded-images/2021-12/bcb0f…
      "publishedAt": "2021-12-03T22:46:25Z",
      "content": "While there are still many unknowns about Steam Deck\r\ns lib…
    },
    {
      "source": {
        "id": null,
        "name": "Slashdot.org"
      },
      "author": "EditorDavid",
      "title": "AWS Embraces Fedora Linux for Its Cloud-Based 'Amazon Linux'",
      "description": "ZDNet reports:\n\nBy and large, the public cloud runs on …
      "url": "https://linux.slashdot.org/story/21/11/27/0328223/aws-embraces-fe…
      "urlToImage": "https://a.fsdn.com/sd/topics/cloud_64.png",
      "publishedAt": "2021-11-27T16:34:00Z",
      "content": "By and large, the public cloud runs on Linux. Most users, eve…
    },
… with 236 more lines

You can get your own API key at News API’s website.

3.7.2 Streaming APIs

Some web APIs return data in a streaming manner. This means that once you connect to it, the data will continue to pour in until the connection is closed. A well-known example is the Twitter “firehose,” which constantly streams all the tweets being sent around the world. Luckily, most command-line tools also operate in a streaming manner.

Let’s take a 10-second sample of one of Wikimedia’s streaming APIs, for example:

$ curl -s "https://stream.wikimedia.org/v2/stream/recentchange" |
> sample -s 10 > wikimedia-stream-sample

This particular API returns all changes that have been made to Wikipedia and other properties of Wikimedia. The command-line tool sample is used to close the connection after 10 seconds. The connection can also be closed manually by pressing Ctrl-C to send an interrupt. The output is saved to the file wikimedia-stream-sample. Let’s take a peek using trim:

$ < wikimedia-stream-sample trim
:ok
 
event: message
id: [{"topic":"eqiad.mediawiki.recentchange","partition":0,"timestamp":16101133…
data: {"$schema":"/mediawiki/recentchange/1.0.0","meta":{"uri":"https://en.wiki…
 
event: message
id: [{"topic":"eqiad.mediawiki.recentchange","partition":0,"timestamp":16101133…
data: {"$schema":"/mediawiki/recentchange/1.0.0","meta":{"uri":"https://www.wik…
 
… with 1078 more lines

With a little bit of sed and jq I can scrub this data to get a glimpse of the changes happening on the English version of Wikipedia:

$ < wikimedia-stream-sample sed -n 's/^data: //p' | 
> jq 'select(.type == "edit" and .server_name == "en.wikipedia.org") | .title' 
"Odion Ighalo"
"Hold Up (song)"
"Talk:Royal Bermuda Yacht Club"
"Jenna Ushkowitz"
"List of films released by Yash Raj Films"
"SP.A"
"Odette (musician)"
"Talk:Pierre Avoi"
"User:Curlymanjaro/sandbox3"
"List of countries by electrification rate"
"Grieg (crater)"
"Gorman, Edmonton"
"Khabza"
"QAnon"
"Khaw Boon Wan"
"Draft:Oggy and the Cockroaches (1975 TV series)"
"Renzo Reggiardo"
"Greer, Arizona"
"National Curriculum for England"
"Mod DB"
"Jordanian Pro League"
"List of foreign Serie A players"

This sed expression only prints lines that start with data: and prints the part after the colon, which happens to be JSON.
This jq expression prints the title key of JSON objects that have a certain type and server_name.

Speaking of streaming, did you know that you could stream Star Wars: Episode IV – A New Hope for free using telnet44?

$ telnet towel.blinkenlights.nl

And after some time, we see that Han Solo did shoot first!

 
 
                       -===                    `"',
       I'll bet you   ""o o                    O O|)
           have!      _\ -/_                  _\o/ _
                     || || |*                /|\ / |\
                     \\ || ***              //| |  |\\
                      \\o=***********      // | |  | ||
                      |\(#'***\\        -==#  | |  | ||
                      |====|*  ')         '\  |====| /#
                      |/|| |                  | || |  "
                      ( )( )                  | || |
                      |-||-|                  | || |
                      | || |                  | || |
      ________________[_][__\________________/__)(_)_____________________
 

Sure, it’s probably not a good source of data, but there’s nothing wrong with enjoying an old classic while training your machine learning models45.

3.8 Summary

Congratulations, you have finished the first step of the OSEMN model. You’ve learned a variety of ways to obtain data, ranging from downloading to querying a relational database. In the next chapter, which is an intermezzo chapter, I’ll teach you how to create your own command-line tools. Feel free to skip this and go to Chapter 5 (the second step of the OSEMN model) if you cannot wait to learn about scrubbing data.

3.9 For Further Exploration

  • Looking for a dataset to practice on? The GitHub repository Awesome Public Datasets lists hundreds of high-quality datasets that are publicly available.
  • Or perhaps you’d rather practice with an API? The GitHub repository Public APIs lists many free APIs. City Bikes and The One API are among my favorites.
  • Writing SQL queries to obtain data from a relational database is an important skill. The first 15 lessons of the book SQL in 10 Minutes a Day by Ben Forta teach the SELECT statement and its filtering, grouping, and sorting capabilities.