mnist and the idx file format
Boser, Guyon, and Vapnik (BGV) introduce support vector machines (SVMs) in [1] and their immediate application is that of optical character recognition (OCR). More specifically, given images of handwritten numerals, they train a classifier to categorize a given image by digit. BGV show that SVMs can achieve performance comparable to that of a more sophisticated five-layer neural network classifier. I’ve been working through their paper to learn about SVMs, and I thought it might be fun to try to follow along. There is a fairly large dataset of handwritten numerals due to LeCun, Cortes, and Burges [2], called the MNIST database that’s publicly available. What follows are my notes (mostly for my future self) on how to get started working with this dataset.
the idx
file format
At the time of writing, unfortunately, the website hosting the MNIST database and the relevant documentation seems to be locked behind a authentication wall. The dataset’s popular enough, though, that you can find it with a quick internet search. Once you’ve downloaded the dataset (and uncompressed it), you should have 4 files: 2 files for training data and 2 files for testing data.
1ls -lh data/mnist/
1total 53M
2-rw-r--r-- 1 nilay nilay 7.5M Jun 17 17:31 t10k-images.idx3-ubyte
3-rw-r--r-- 1 nilay nilay 9.8K Jun 17 17:31 t10k-labels.idx1-ubyte
4-rw-r--r-- 1 nilay nilay 45M Jun 17 17:31 train-images.idx3-ubyte
5-rw-r--r-- 1 nilay nilay 59K Jun 17 17:31 train-labels.idx1-ubyte
You’ll notice immediately that the data is stored in a slightly idiosyncratic
manner. A cursory head data/train-images.idx3-ubyte
will show that we’re
working with binary blobs, not images or comma-separated values.1 A bit of
elbow grease is necessary, then, in order to work with this dataset. This short
post is an overview of how to work with this file format. If you’d like a quick
and easy solution, I’d suggest taking a look at idx2numpy
project [3].
I found this project halfway through writing up my own code, but it’s a pretty
straightforward exercise regardless.
The idx
format is used to store data that has the shape of an
\(d\)-dimensional matrix.2 In the files above, the digit following .idx
indicates the dimension \(d\) and the ubyte
indicates that the data (the matrix
entries) will be unsigned bytes. Let’s fix a bit of notation: bytes can take
values between \x00
and \xff
. That is, a byte consists of two hexadecimal
digits. Since each digit can take \(16=2^4\) values, two hex digits gives us \(16\times
16=256=2^8\) values, which is exactly 8 bits = 1 byte.3 The dimension and
the datatype are actually described in the binary, so the filenames are just a
helpful reminder.
An idx
file is structured as follows (all multibyte sequences are big-endian
[4]):
1| \x00 | \x00 | dtype | d | n_1 | ... | n_d |
2
3| a_{1, 1, ..., 1} | ... | a_{1, 1, ..., 1, n_d} | ... | a_{1, n_2, ..., n_d} |
4| . | | . | | . |
5| . | | . | | . |
6| . | | . | | . |
7| a_{n_1, 1, ..., 1} | ... | a_{n_1, 1, ..., 1, n_d} | ... | a_{n_1, n_2, ..., n_d} |
I’ve written this out as a table, but remember that this is a binary file so the newlines/formatting here are just for readability. Let’s break down what’s going on here:
- the first two bytes are always zero,
\x00
.4 - the third byte
dtype
is the datatype:We’ll be using unsigned bytes, so we expect to see1 | value | type | size (bytes) | 2 |-------+---------------+--------------| 3 | \x08 | unsigned byte | 1 | 4 | \x09 | signed byte | 1 | 5 | \x0b | short | 2 | 6 | \x0c | int | 4 | 7 | \x0d | float | 4 | 8 | \x0e | double | 8 |
\x08
. - the fourth byte
d
is the dimension of the data matrix, and the following \(4d\) bytes are 4-byte integers \(n_1,…,n_d\). - the rest of the file stores the \(d\)-dimensional data matrix, flattened to a
2-dimensional matrix as the notation above indicates. Each cell has size
determined by the
dtype
above.
To make things more concrete, let’s take a look at the MNIST data.
the mnist images
There are a number of standard utilities for displaying the contents of a binary
file as hexadecimal bytes. We’ll use hexdump
to display the first 16 bytes (in
the “canonical” format, hence the -C
flag):
1hexdump -C --length=16 data/mnist/train-images.idx3-ubyte
100000000 00 00 08 03 00 00 ea 60 00 00 00 1c 00 00 00 1c |.......`........|
200000010
The column on the left is a column of hexadecimal offsets (essentially line
numbers), so the first row is displaying the first 16 bytes of our file. The
second row of 16 bytes (represented by the hex 10
) is empty, because we’ve
only queried 16 bytes with the --length=16
flag. The 16 middle columns are
what we’re after. We’ll return to those in a moment. The last column is an
attempt at ASCII representations of the bytes. This is irrelevant for us, as
we’re not expecting any plaintext strings in our file, so you can ignore this
column.
The first two bytes are 0, as required by the idx
format. The next byte \x08
tells us that the elements of the data matrix appearing later will be 1-byte
each, to be interpreted directly as unsigned integers between 0 and 255. The
fourth byte \x03
tells us that the data matrix is 3-dimensional. What are the
dimensions \(n_1, n_2, n_3\)? We check the next 12 bytes: \x0000ea60
,
\x0000001c
, and \x0000001c
. Converting these to decimal using the calculator
bc
,
1echo "ibase=16; EA60; 1C" | bc
160000
228
we find that
\begin{equation*} n_1 = 60000, \qquad n_2 = n_3 = 28. \end{equation*}
This means that we’re dealing with a \((60000 \times 28 \times 28)\) matrix. We have 60000 training images, with each square image having \(28 \times 28 = 784\) pixels, and with each pixel colored in grayscale from 0 (white) to 255 (black). According to the format specification above, all this data is stored in the file as a \((60000 \times 784)\) matrix of bytes. The \(k\)th row represents the \(k\)th image — the \(k\)th feature vector.
The first image consists of the subsequent 784 bytes:
1hexdump -C --length=784 --skip=16 data/mnist/train-images.idx3-ubyte
100000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
2*
3000000a0 00 00 00 00 00 00 00 00 03 12 12 12 7e 88 af 1a |............~...|
4000000b0 a6 ff f7 7f 00 00 00 00 00 00 00 00 00 00 00 00 |................|
5000000c0 1e 24 5e 9a aa fd fd fd fd fd e1 ac fd f2 c3 40 |.$^............@|
6000000d0 00 00 00 00 00 00 00 00 00 00 00 31 ee fd fd fd |...........1....|
7000000e0 fd fd fd fd fd fb 5d 52 52 38 27 00 00 00 00 00 |......]RR8'.....|
8000000f0 00 00 00 00 00 00 00 12 db fd fd fd fd fd c6 b6 |................|
900000100 f7 f1 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
1000000110 00 00 00 00 50 9c 6b fd fd cd 0b 00 2b 9a 00 00 |....P.k.....+...|
1100000120 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
1200000130 00 0e 01 9a fd 5a 00 00 00 00 00 00 00 00 00 00 |.....Z..........|
1300000140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 8b |................|
1400000150 fd be 02 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
1500000160 00 00 00 00 00 00 00 00 00 00 00 0b be fd 46 00 |..............F.|
1600000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
1700000180 00 00 00 00 00 00 00 00 23 f1 e1 a0 6c 01 00 00 |........#...l...|
1800000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
19000001a0 00 00 00 00 00 51 f0 fd fd 77 19 00 00 00 00 00 |.....Q...w......|
20000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
21000001c0 00 00 2d ba fd fd 96 1b 00 00 00 00 00 00 00 00 |..-.............|
22000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 10 |................|
23000001e0 5d fc fd bb 00 00 00 00 00 00 00 00 00 00 00 00 |]...............|
24000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 f9 fd f9 |................|
2500000200 40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |@...............|
2600000210 00 00 00 00 00 00 2e 82 b7 fd fd cf 02 00 00 00 |................|
2700000220 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
2800000230 27 94 e5 fd fd fd fa b6 00 00 00 00 00 00 00 00 |'...............|
2900000240 00 00 00 00 00 00 00 00 00 00 18 72 dd fd fd fd |...........r....|
3000000250 fd c9 4e 00 00 00 00 00 00 00 00 00 00 00 00 00 |..N.............|
3100000260 00 00 00 00 17 42 d5 fd fd fd fd c6 51 02 00 00 |.....B......Q...|
3200000270 00 00 00 00 00 00 00 00 00 00 00 00 00 00 12 ab |................|
3300000280 db fd fd fd fd c3 50 09 00 00 00 00 00 00 00 00 |......P.........|
3400000290 00 00 00 00 00 00 00 00 37 ac e2 fd fd fd fd f4 |........7.......|
35000002a0 85 0b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
36000002b0 00 00 00 00 88 fd fd fd d4 87 84 10 00 00 00 00 |................|
37000002c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
38*
3900000320
You’ll notice that a number of bytes are omitted because the offsets jump from
\x10
to \xa0
. The asterisk *
denotes that the omitted bytes are identical
(in \(16 \times 16\) chunks) to the 16 bytes displayed on the previous line.5 This
isn’t surprising in the case of handwritten digits: we expect a good chunk of
whitespace towards the top and bottom of the image. It’s clearly not easy to
visualize the digit just from this hexdump. We’ll write some python code to help
us visualize them later, after we’ve loaded the images into numpy arrays.
First, though, we’ll look at the labels provided for learning supervision. Just as before, let’s look into the training labels.
1hexdump -C --length=8 data/mnist/train-labels.idx1-ubyte
100000000 00 00 08 01 00 00 ea 60 |.......`|
200000008
The first 8 bytes tell us that this file contains a vector of length 60000, whose elements are unsigned bytes. Let’s take a look at the first 16 elements:
1hexdump -C --length=16 --skip=8 data/mnist/train-labels.idx1-ubyte
100000008 05 00 04 01 09 02 01 03 01 04 03 05 03 06 01 07 |................|
200000018
As expected for labels indicating digits, all these values are between 0 and 9.
Great, now that we understand how the data is stored, we can get stop messing around in the command line and load the data into our python scripts.
reading images into numpy arrays
The python implementation of loading the MNIST images is actually pretty
straightforward, now that we know exactly what our datafiles look like. We’re
not going to write a program to take an arbitrary idx
datafile and convert
it to a numpy array because that’s overkill (just use idx2numpy
!). We’ll stick
to our specific case. That means we just need to read 60000 blocks of size 784
bytes, each (as well as the associated labels). Well okay, let’s just read, say,
the first \(N=2\) images and the first two labels, because the extension to
general N will be clear.
We make sure to open the datafile for binary reading with the rb
flag. This
allows us to read data from the file into a bytes
object, which stores an
immutable sequence of bytes. We throw away the first 16 bytes that we analyzed
above. As it happens, numpy has a super convenient frombuffer
method (I
learned about this by peeking at idx2numpy
’s code). This function takes any
object that exposes “the buffer interface” and produces from it a numpy array.
This was my first time hearing about python’s buffer protocol, but I’ll banish
my short description of it to the footnotes, since it’s a bit off-topic.6
The upshot is that we can pass in our sequence of bytes and get back out a
vector of size 784. All that’s left, then, is to .reshape((28, 28))
to get it
into a nice square matrix, and follow that up with a bit of custom pretty
printing.
1import numpy as np
2# let's look at the first two images
3N = 2
4# 'rb' means that we're reading bytes
5with open('data/mnist/train-images.idx3-ubyte', 'rb') as f:
6 # discard the first 16 bytes
7 f.read(16)
8 for n in range(N):
9 # read the next 784 bytes into a numpy array and reshape
10 image = np.frombuffer(f.read(784), dtype = np.ubyte).reshape((28, 28))
11
12 # draw in ['.', 'o', 'O', '@'] depending on the intensity of a pixel
13 for i in range(28):
14 for j in range(28):
15 if image[i][j] == 0:
16 print('.', end='')
17 elif image[i][j] < 64:
18 print('o', end='')
19 elif image[i][j] < 128:
20 print('O', end='')
21 else:
22 print('@', end='')
23 print()
24 print()
1............................
2............................
3............................
4............................
5............................
6............ooooO@@o@@@O....
7........ooO@@@@@@@@@@@@O....
8.......o@@@@@@@@@@OOOoo.....
9.......o@@@@@@@@@@..........
10........O@O@@@o.o@..........
11.........oo@@O..............
12...........@@@o.............
13...........o@@O.............
14............o@@@Oo..........
15.............O@@@Oo.........
16..............o@@@@o........
17...............oO@@@........
18.................@@@O.......
19..............o@@@@@o.......
20............o@@@@@@@........
21..........oO@@@@@@O.........
22........oO@@@@@@Oo..........
23......o@@@@@@@Oo............
24....o@@@@@@@@o..............
25....@@@@@@@o................
26............................
27............................
28............................
29
30............................
31............................
32............................
33............................
34...............o@@@o........
35..............o@@@@@........
36.............o@@@@@@oo......
37...........oo@@@@@O@@O......
38...........@@@@@@@O@@@......
39..........o@@@@O@@oO@@......
40.........o@@@@oOOo..@@o.....
41........o@@@@O......@@@.....
42.......o@@@Ooo......@@@.....
43.......o@@o.........@@@.....
44.......@@@..........@@@.....
45......O@@O..........@@@.....
46......O@@o........o@@@o.....
47......O@@........o@@@O......
48......O@@.......o@@@........
49......O@@......O@@@.........
50......O@@@ooO@@@@@o.........
51......O@@@@@@@@@@...........
52......o@@@@@@@@.............
53.......o@@@@@o..............
54............................
55............................
56............................
57............................
That looks like a 5 and a 0 to me. Let’s double check by loading in the first two training labels, which involves more or less the same code as above.
1import numpy as np
2# let's look at the first two labels
3N = 2
4# 'rb' means that we're reading bytes
5with open('data/mnist/train-labels.idx1-ubyte', 'rb') as f:
6 # discard the first 8 bytes
7 f.read(8)
8 # read the next byte into a numpy array (corresponding to the first label)
9 labels = np.frombuffer(f.read(N), dtype = np.ubyte)
10 print(labels)
1[5 0]
Looks good!
Well, that about wraps up this post. We haven’t looked at the testing data in the remaining two files, but those are treated exactly the same. The only difference there is that the testing set is comprised of 10000 images instead of 60000, so there’s not much to change.
This is probably done to keep file sizes down. If the numerals were stored as images, the dataset would likely be much larger in size. Moreover, since the numerals will have large numbers of white pixels, I would guess that this binary format, which stores white pixels as
\x00
, yields relatively high compression ratios. ↩︎What is
idx
short for? Image Data matriX? Incredibly Dense Xylophones? ↩︎Notice that a signed byte uses one bit of information to store the sign, and hence can only store integers in \((-2^7, 2^7)\). Here we’re interested in keeping track of how dark a pixel is, for which it’s perfectly convenient to just use an unsigned byte. ↩︎
Why include the first two bytes if they’re always zero? I think it’s because the first two bytes of a file are often used as a file signature to indicate what the filetype is. The creators of the
idx
format were probably just adhering to this convention, and indicating that the file is not in a commonly used format. ↩︎You can force
hexdump
to show all bytes explicitly by passing it the no-squeezing flag--no-squeezing
, or-v
for short. ↩︎Straight from PEP 3118 [5]: the python buffer protocol “allows different Python types to exchange a pointer to a sequence of internal buffers. This functionality is extremely useful for sharing large segments of memory between different high-level objects…” This seems to be quite important for heavily performance-reliant code (say in numpy) or code that interacts with older legacy software packages written in C or Fortran. ↩︎