Working with Shodan Data Files

Only a tiny fraction of the data that Shodan gathers is viewable via the main website. To leverage the full power of Shodan it's necessary to look at all the information that's contained in the banners. There are a few different ways of getting the full banner information but for now we're going to focus on the Shodan file format which is used for storing banners. The data files are most commonly generated from either:

Exporting results via the website,
Downloading using the Shodan command-line interface, or
Using the Bulk Data API

In any case, you will end up with a local file that ends in a .json.gz extension. It doesn't matter whether you downloaded the file from the Shodan website, created it yourself from the command line, or received it as part of your Enterprise Data License: they all use the same file format.

The format itself is simple: the file is compressed using Gzip and each line corresponds to a JSON-encoded banner. Fortunately, tools and helper methods are available that make working with these data files easy.
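
Because the format is just Gzip-compressed, line-delimited JSON, it can also be read with nothing more than a standard library. The following is a minimal Python sketch under that assumption; the helper name read_banners is hypothetical and is not part of any Shodan tool.

```python
import gzip
import json


def read_banners(path):
    """Yield one banner (a dict) per line of a Gzip-compressed data file.

    Hypothetical helper for illustration; it assumes one JSON document per line.
    """
    # "rt" opens the compressed stream in text mode so each line is a str
    with gzip.open(path, "rt", encoding="utf-8") as fin:
        for line in fin:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)
```

Calling read_banners('malware.json.gz') would then return an iterator of dictionaries, one per banner, that can be looped over like any other Python iterable.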

Command-Line Processing

First, let's check out what is arguably the easiest way of working with Shodan data files: the command-line interface for Shodan.

$ shodan parse -h
Usage: shodan parse [OPTIONS] <filenames>

  Extract information out of compressed JSON files.

Options:
  --color / --no-color
  --fields TEXT         List of properties to output.
  -f, --filters TEXT    Filter the results for specific values using key:value
                        pairs.
  -O, --filename TEXT   Save the filtered results in the given file (append if
                        file exists).
  --separator TEXT      The separator between the properties of the search
                        results.
  -h, --help            Show this message and exit.

The shodan parse command is the workhorse for processing Shodan data files. It lets you extract information, filter based on specific property values and create new data files.

One of the most common tasks is generating a list of IPs from a Shodan data file. This can be done by parsing the file and printing only the ip_str property using the --fields option. The following command reads all the banners from the file called malware.json.gz and prints the IP address from each banner:

shodan parse --fields ip_str malware.json.gz

Let's say that we want output that contains both the IP address of the malware command & control center and the name of the malware as stored in the product property:

shodan parse --fields ip_str,product malware.json.gz

The --fields option accepts a comma-separated list of property names. Nested properties can be shown by using the dot . as a hierarchical separator. For example, to print out the SSL certificate issuer which is stored in the ssl.cert.issuer.CN property as well as its IP and port:

shodan parse --fields ssl.cert.issuer.CN,ip_str,port https-data.json.gz

The properties will be printed to the terminal in the order that they're defined in the --fields option.
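
To illustrate how a dot-separated name can be resolved against a nested banner, here is a small Python sketch. It is not the CLI's actual implementation: the helper names get_nested and format_fields, the tab separator, and the sample banner are all assumptions made for illustration.

```python
def get_nested(banner, dotted_name, default=None):
    """Resolve a dot-separated property name such as 'ssl.cert.issuer.CN'."""
    value = banner
    for key in dotted_name.split("."):
        # Stop as soon as a level is missing or is not a dictionary
        if not isinstance(value, dict) or key not in value:
            return default
        value = value[key]
    return value


def format_fields(banner, fields, separator="\t"):
    """Join the requested properties in the order given (separator is assumed)."""
    return separator.join(str(get_nested(banner, name, "")) for name in fields)


# A made-up banner used purely for illustration
sample = {
    "ip_str": "198.51.100.7",
    "port": 443,
    "ssl": {"cert": {"issuer": {"CN": "Example CA"}}},
}
print(format_fields(sample, ["ssl.cert.issuer.CN", "ip_str", "port"]))
```

The values are emitted in the same order as the names in the list, mirroring the ordering behavior described above.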

Let's say you have a data file which you would like to split into smaller files depending on the content of the banners. To filter banners based on a property, the CLI provides the -f or --filters option. It has the following syntax:

shodan parse -f propertyname:value data.json.gz

For example, here is how to keep only the results on port 443 (HTTPS) from a file:

shodan parse -f port:443 mixed-data.json.gz

You can also specify multiple filters by providing multiple -f options:

shodan parse -f port:443 -f product:Apache mixed-data.json.gz

Note that the value of a filter is matched case-sensitively. This means that filtering for a value of apache is not the same as filtering for Apache.

You can combine the --fields option with the --filters option, so you don't need to show the content of the property you're filtering on. For example, here we are printing out a list of IPs where the product was identified as an nginx web server:

shodan parse --fields ip_str -f product:nginx mixed-data.json.gz
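
The same idea of combining filters with a single output field can be sketched in Python. This is only an approximation of the behavior described above, not the CLI's own matching logic: it assumes that every filter must match, that scalar values are compared for exact, case-sensitive equality, and that list properties match when they contain the value. The function name matches and the sample banners are hypothetical.

```python
def matches(banner, filters):
    """Return True only if the banner satisfies every 'name:value' filter."""
    for item in filters:
        name, _, wanted = item.partition(":")
        actual = banner.get(name)
        if isinstance(actual, list):
            # For list properties, look for the value among the items
            if wanted not in [str(v) for v in actual]:
                return False
        elif actual is None or str(actual) != wanted:
            # String comparison is case-sensitive: 'apache' != 'Apache'
            return False
    return True


# Made-up banners used purely for illustration
banners = [
    {"ip_str": "198.51.100.7", "port": 443, "product": "nginx"},
    {"ip_str": "198.51.100.8", "port": 443, "product": "Apache"},
    {"ip_str": "198.51.100.9", "port": 80, "product": "nginx"},
]

# Keep only the IP of banners that pass both filters
selected = [b["ip_str"] for b in banners if matches(b, ["port:443", "product:nginx"])]
print(selected)
```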

And to store the filtered banners in a separate data file there is the -O (output) option. It's useful to generate new, smaller data files that match specific criteria to share with others or load into other tools for further processing. For example, the following command reads mixed-data.json.gz, extracts all the banners for industrial control systems from the data file and stores them in a separate file called ics-data.json.gz:

shodan parse -f tags:ics -O ics-data.json.gz mixed-data.json.gz
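
For comparison, a rough standard-library equivalent of this filter-and-save step might look like the following Python sketch. The function name save_matching is hypothetical, and the code assumes that tags is a list of strings and that appending to an existing output file is acceptable, as the help text for -O describes.

```python
import gzip
import json


def save_matching(src_path, dst_path, tag):
    """Append banners from src_path whose 'tags' list contains tag to dst_path.

    Returns the number of banners written.
    """
    count = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as fin, \
            gzip.open(dst_path, "at", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue
            banner = json.loads(line)
            # 'tags' may be absent, so fall back to an empty list
            if tag in (banner.get("tags") or []):
                # One JSON document per line keeps the file in the same format
                fout.write(json.dumps(banner) + "\n")
                count += 1
    return count
```

Because the output is written as one JSON-encoded banner per line inside a Gzip stream, the resulting file should remain readable by anything that understands the format described earlier.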

Run the command with the -h option to see all available options:

$ shodan parse -h

The CLI also supports the ability to convert a Shodan data file into a different file format. The following conversions are currently supported:

CSV
Excel
GeoJSON
Image Extraction
KML

To convert files from json.gz into other file formats use the shodan convert command:

$ shodan convert -h
Usage: shodan convert [OPTIONS] <input file> <output format>

  Convert the given input data file into a different format. The following
  file formats are supported:

  kml, csv, geo.json, images, xlsx

  Example: shodan convert data.json.gz kml

Options:
  --fields TEXT  List of properties to output.
  -h, --help     Show this message and exit.

Note that most of the conversions are lossy, which means that the new file will contain less information than the original Shodan data file. For example, if you convert to the KML format then you are discarding most non-geographical information. This is why it's important to always keep the original Shodan data file in case you want to perform additional analysis on it.

Converting files is extremely simple - the command has the following syntax:

$ shodan convert <datafile.json.gz> <format>

For example, to extract all the images from a file called remote-desktops.json.gz:

$ shodan convert remote-desktops.json.gz images
Successfully extracted images to directory: remote-desktops-images

Or to convert the data file into the native Excel format xlsx:

$ shodan convert remote-desktops.json.gz xlsx
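
To make the earlier point about lossy conversions concrete, the following Python sketch renders banners as CSV while keeping only a chosen set of top-level properties. It is an illustration of the general idea rather than the converter the CLI uses; the function name banners_to_csv is hypothetical.

```python
import csv
import io


def banners_to_csv(banners, fields):
    """Render banners as CSV text, keeping only the listed top-level fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for banner in banners:
        # Properties not named in fields are dropped here (the lossy part)
        writer.writerow({name: banner.get(name, "") for name in fields})
    return buffer.getvalue()
```

Anything that is not listed in fields, such as the raw banner text, cannot be recovered from the CSV output, which is why the original data file is worth keeping.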

Custom Analysis

The Python library for Shodan provides developer-friendly helper methods for working with the data files:

shodan.helpers.iterate_files(filenames)
shodan.helpers.open_file(filename)
shodan.helpers.write_banner(file, banner)

The shodan.helpers.iterate_files() method accepts a data filename (or a list of filenames) and returns an iterator where each item is a Shodan banner.

Following is a sample Python script that reads Shodan data files and prints out their banners:

# Import the method that helps us parse the data file
from shodan.helpers import iterate_files

# Standard Python libraries
from pprint import pprint
from sys import argv, exit

# The user has to provide at least 1 file
if len(argv) == 1:
    print('Usage: {} <file1.json.gz> [file2.json.gz] ...'.format(argv[0]))
    exit(1)

# Iterate over all of the provided data files
for banner in iterate_files(argv[1:]):
    # The banner object can be very large
    pprint(banner)

There are also utility methods for creating and writing Shodan data files available in the Python library. Following is a modified version of the above script which keeps only the banners that belong to VPN services. It uses the shodan.helpers.open_file() and shodan.helpers.write_banner() methods to create a Shodan data file that is compatible with the CLI and other tools that are able to consume Shodan data:

# Import the methods that help us read and write data files
from shodan.helpers import iterate_files, open_file, write_banner

# Standard Python libraries
from pprint import pprint
from sys import argv, exit

# Settings
OUTPUT_FILENAME = 'mydata.json.gz'

# The user has to provide at least 1 file
if len(argv) == 1:
    print('Usage: {} <file1.json.gz> [file2.json.gz] ...'.format(argv[0]))
    exit(1)

# Create the output file
with open_file(OUTPUT_FILENAME) as fout:
    # Iterate over all of the provided data files
    for banner in iterate_files(argv[1:]):
        # Is this a VPN service?
        if 'tags' in banner and 'vpn' in banner['tags']:
            # Show the banner
            pprint(banner)

            # Store it in the output file
            write_banner(fout, banner)
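
Once banners are available as Python dictionaries, simple aggregate analysis needs very little code. As a sketch, the hypothetical helper below counts how often each value of a scalar top-level property (for example product or port) occurs; it assumes nothing about where the banners come from.

```python
from collections import Counter


def count_property(banners, name):
    """Count how often each scalar value of a top-level property occurs."""
    counts = Counter()
    for banner in banners:
        value = banner.get(name)
        # Only count simple values; skip missing, list, and dict properties
        if isinstance(value, (str, int, float)):
            counts[value] += 1
    return counts
```

The banners argument can be any iterable of banner dictionaries, so the iterator returned by iterate_files(argv[1:]) could be passed in directly, and count_property(...).most_common(10) would then list the ten most frequent values.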