Sampling ImageNet

Sunday April 30, 2017

ImageNet is a standard image dataset. It's pretty big; just the IDs and URLs of the images take over a gigabyte of text. I collected a fun sampling for small-scale purposes.

ImageNet is distributed primarily as a text file of image URLs. The compressed file is 334 megabytes. The unpacked file is 1.1 gigabytes.

$ wget
$ tar zxvf imagenet_fall11_urls.tgz
$ wc fall11_urls.txt
##  14197122 28414665 1134662781 fall11_urls.txt
$ head -3 fall11_urls.txt
## n00004475_6590
## n00004475_15899
## n00004475_32312

The first field is an image ID. The part before the underscore is a WordNet ID, so the first image is of n00004475. What's that?
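As a minimal sketch of that ID structure in Python: the helper names below are hypothetical, and the code assumes each line of fall11_urls.txt holds an image ID followed by whitespace and then the URL.

```python
def split_image_id(image_id):
    """Split an image ID such as 'n00004475_6590' at the first underscore."""
    wnid, _, index = image_id.partition("_")
    return wnid, index


def parse_url_line(line):
    """Return (wnid, index, url) for one line of the URL list.

    Assumption: the image ID comes first, then whitespace, then the URL.
    """
    fields = line.strip().split(None, 1)
    image_id = fields[0]
    url = fields[1] if len(fields) > 1 else ""
    wnid, index = split_image_id(image_id)
    return wnid, index, url


# The URL here is a placeholder, not a real ImageNet entry.
example = parse_url_line("n00004475_6590\thttp://example.com/a.jpg\n")
```

Here `example[0]` is the WordNet ID, which is the part to look up next.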

The mapping from WordNet ID to a brief text label can be downloaded from a link on the ImageNet API page.

$ wget
$ wc words.txt
##   82114  302059 2655750 words.txt
$ head -3 words.txt
## n00001740   entity
## n00001930   physical entity
## n00002137   abstraction, abstract entity

There are 82,114 WordNet IDs. Now we can decode the one we're interested in.

$ grep n00004475 words.txt
## n00004475    organism, being
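The same lookup can be sketched in Python by reading words.txt into a dictionary. This assumes each line is a WordNet ID, then whitespace, then the label; `load_labels` is a hypothetical helper name.

```python
def load_labels(lines):
    """Build {wordnet_id: label} from lines shaped like 'n00001740<whitespace>entity'."""
    labels = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) == 2:  # skip blank or malformed lines
            labels[parts[0]] = parts[1]
    return labels


# Small in-memory stand-in for words.txt, using lines shown above.
sample_lines = [
    "n00001740\tentity\n",
    "n00001930\tphysical entity\n",
    "n00004475\torganism, being\n",
]
labels = load_labels(sample_lines)
print(labels["n00004475"])
```

With the real file, something like `with open("words.txt") as f: labels = load_labels(f)` should work the same way.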

So the first picture in ImageNet is of an "organism, being". What does such a thing look like?

[Image: organism, being]

There are eight examples of "organism, being" and two of the others are cats.
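A count like that can be sketched by tallying lines per WordNet ID. This is only an illustration: `count_by_wnid` is a hypothetical name, and it assumes the image ID is the first whitespace-separated field on each line.

```python
from collections import Counter


def count_by_wnid(lines):
    """Count URL-list lines per WordNet ID (the part of the image ID before '_')."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # ignore blank lines
            counts[fields[0].partition("_")[0]] += 1
    return counts


# Placeholder URLs; n99999999 is a made-up ID for contrast.
sample_lines = [
    "n00004475_6590\thttp://example.com/1.jpg\n",
    "n00004475_15899\thttp://example.com/2.jpg\n",
    "n99999999_1\thttp://example.com/3.jpg\n",
]
counts = count_by_wnid(sample_lines)
```

Because it only iterates over lines, the function can be handed the open 1.1 gigabyte file directly without loading it all into memory.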

I think 82,114 categories is too many to try to sample randomly from, for my purposes. I'll use the 200 categories specified for the ILSVRC2017 object detection challenge.

$ wget -O 200words.html

I used Emacs to pull out the 200 WordNet IDs and convenient extra-short descriptions, and saved them in 200words.csv. The script produces 200words100urls.csv, with 100 random URLs for each of the categories. Finally, five working JPGs are downloaded for each category. A couple came back with "missing" images, so I manually replaced those with others from the list.
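The two steps might look roughly like the sketch below. This is not the original script: the function names, the whitespace-separated line format, and the "starts with the JPEG magic bytes" test for a working JPG are all assumptions made for illustration.

```python
import http.client
import random
import urllib.request
from collections import defaultdict


def sample_urls(lines, wnids, k=100, seed=None):
    """Return {wnid: up to k randomly chosen URLs} for the requested WordNet IDs."""
    wanted = set(wnids)
    by_wnid = defaultdict(list)
    for line in lines:
        fields = line.strip().split(None, 1)
        if len(fields) != 2:
            continue  # skip blank lines and lines without a URL
        wnid = fields[0].partition("_")[0]
        if wnid in wanted:
            by_wnid[wnid].append(fields[1])
    rng = random.Random(seed)
    return {w: rng.sample(urls, min(k, len(urls))) for w, urls in by_wnid.items()}


def looks_like_jpeg(data):
    """True if the bytes start with the JPEG signature FF D8 FF."""
    return data[:3] == b"\xff\xd8\xff"


def fetch_working_jpgs(urls, want=5, timeout=10):
    """Try URLs in order; return up to `want` (url, bytes) pairs that look like JPEGs."""
    good = []
    for url in urls:
        if len(good) >= want:
            break
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                data = resp.read()
        except (OSError, ValueError, http.client.HTTPException):
            continue  # dead link, timeout, or malformed URL: move on
        if looks_like_jpeg(data):
            good.append((url, data))
    return good


# Placeholder data standing in for fall11_urls.txt.
sample_lines = ["n02118333_%d\thttp://example.com/%d.jpg\n" % (i, i) for i in range(10)]
picked = sample_urls(sample_lines, ["n02118333"], k=3, seed=0)
```

One limitation of this check: a placeholder picture served in place of a removed image is itself a valid JPEG, so it would pass and still need replacing by hand.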

The results are packaged up on GitHub at ajschumacher/imagen and feature such beauties as n02118333_27_fox.jpg.