Datasets#

It is often useful to save results from flatspin simulations to persistent storage. Simulations can take a long time, and storing results allows us to perform analysis afterwards without needing to re-run the simulations. Or maybe we simply wish to archive the results in case we need them later.

Whatever the motivation, the flatspin dataset provides a powerful format in which to organize and save results.

All the flatspin command-line tools work with datasets, so a good understanding of the dataset format is essential when working with flatspin data from the command line.

Anatomy of a dataset#

A flatspin dataset consists of three major components:

  1. Dictionaries with simulation parameters and information

  2. An index of the runs included in the dataset

  3. The results of each run, stored in one or more tables

As you can see, (1) and (2) are metadata about the results, while (3) is the actual result data.

A simple example#

Let us begin with a simple example, where we store the results from a single run in a dataset.

First, we run a simulation and store results in two tables: spin contains the state of all the spins over time, and h_ext contains the external field values.

import pandas as pd
from flatspin.model import PinwheelSpinIceDiamond
from flatspin.encoder import Triangle

# Model parameters
model_params = {
    'size': (4, 4),
    'disorder': 0.05,
    'use_opencl': True,
}

# Encoder parameters
encoder_params = {
    'H': 0.2,
    'phi': 30,
    'phase': 270,
}

# Create the model object
model = PinwheelSpinIceDiamond(**model_params)

# Create the encoder
encoder = Triangle(**encoder_params)

# Use the encoder to create a global external field
input = [1]
h_ext = encoder(input)

# Save spin state over time
spin = []

# Loop over field values and flip spins accordingly
for h in h_ext:
    model.set_h_ext(h)
    model.relax()
    # Take a snapshot (copy) of the spin state
    spin.append(model.spin.copy())

# Create two tables, one for spin and one for h_ext
result = {}
result['spin'] = pd.DataFrame(spin)
result['spin'].index.name = 't'
result['h_ext'] = pd.DataFrame(h_ext, columns=['h_extx', 'h_exty'])
result['h_ext'].index.name = 't'

display(result['spin'])
display(result['h_ext'])
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
t
0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
96 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
97 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
98 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
99 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

100 rows × 40 columns

h_extx h_exty
t
0 0.000000 0.000
1 -0.006928 -0.004
2 -0.013856 -0.008
3 -0.020785 -0.012
4 -0.027713 -0.016
... ... ...
95 0.034641 0.020
96 0.027713 0.016
97 0.020785 0.012
98 0.013856 0.008
99 0.006928 0.004

100 rows × 2 columns

Creating the Dataset object#

We are now ready to create the Dataset object. A Dataset object only contains metadata about the results, but offers several methods to inspect the results available in the dataset.

import os
import shutil
import flatspin
from flatspin.data import Dataset

# Where to store the dataset and results
basepath = '/tmp/flatspin/mydataset'
if os.path.exists(basepath):
    shutil.rmtree(basepath)

# Create params dictionary
# We store both the model params and encoder params
# (there is no overlap between model and encoder params)
params = model_params.copy()
params.update(encoder.get_params())

# Create info dictionary (misc info)
info = {
    'model': f'{model.__class__.__module__}.{model.__class__.__name__}',
    'version': flatspin.__version__,
    'comment': 'My simple dataset'
}

# Create the index table, with a single entry for the above run
# The index must contain a column named 'outdir' which points
# to the location of the result directory / archive file
outdir = 'myrun.npz'
index = pd.DataFrame({'outdir': [outdir]})

# Create the dataset directory
os.makedirs(basepath)

# Create the dataset object
dataset = Dataset(index, params, info, basepath)
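The params dictionary above is built with dict.update(), which silently overwrites any key that appears in both dictionaries. When the "no overlap" assumption is not obvious, a small check can make it explicit. The sketch below uses only the standard library. merge_params is a hypothetical helper and not part of flatspin, and the parameter values are copied from the example above.

```python
def merge_params(*param_dicts):
    """Merge dictionaries, raising if any key appears more than once.

    Hypothetical helper for illustration; not part of flatspin.
    """
    merged = {}
    for d in param_dicts:
        # Keys already seen in an earlier dictionary
        overlap = merged.keys() & d.keys()
        if overlap:
            raise ValueError(f"Overlapping parameter names: {sorted(overlap)}")
        merged.update(d)
    return merged


model_params = {'size': (4, 4), 'disorder': 0.05, 'use_opencl': True}
encoder_params = {'H': 0.2, 'phi': 30, 'phase': 270}

params = merge_params(model_params, encoder_params)
print(sorted(params))
```

If a model and an encoder ever shared a parameter name, this would raise a ValueError instead of quietly discarding one of the values.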

Once the dataset object has been created, it can be saved to disk with Dataset.save().

print("Saving dataset:", repr(dataset))
dataset.save()
Saving dataset: Dataset('/tmp/flatspin/mydataset'): 1 items

Saving results#

As mentioned, a Dataset object only keeps track of metadata. The result tables must be saved separately using save_table(), to the location referenced by the outdir column in the dataset index.

flatspin tools (and save_table) support a few different table formats, selected based on the file extension. The formats referred to in this guide are csv, hdf, npy and npz.

We recommend using the npy or npz formats, since they are fast, efficient and easy to work with.

Note

Depending on the table format, the outdir column in the dataset index refers to either a filesystem directory (in case of csv or npy) or the name of an archive file (in case of hdf or npz). Hence tables are either stored as separate files inside an output directory, or as entries inside an archive file.

Below we save the result tables we created earlier in an npz archive.

from flatspin.data import save_table

# Save the result tables to the npz archive
for name, df in result.items():
    filename = f"{basepath}/{outdir}/{name}"
    print(f"Saving table {name} to {filename}")
    save_table(df, filename)
Saving table spin to /tmp/flatspin/mydataset/myrun.npz/spin
Saving table h_ext to /tmp/flatspin/mydataset/myrun.npz/h_ext
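The paths above treat myrun.npz like a directory with one entry per table. To see what "entries inside an archive file" means for the npz format, here is a minimal NumPy-only sketch. It illustrates the archive format itself and is not necessarily how save_table() writes its tables. The array values are synthetic.

```python
import os
import tempfile

import numpy as np

# Two small arrays standing in for result tables (synthetic values)
spin = np.ones((3, 4), dtype=int)
h_ext = np.zeros((3, 2))

with tempfile.TemporaryDirectory() as tmpdir:
    archive = os.path.join(tmpdir, "myrun.npz")

    # np.savez stores each keyword argument as a named entry in a single file
    np.savez(archive, spin=spin, h_ext=h_ext)

    # np.load on an npz file gives dictionary-like access to the entries
    with np.load(archive) as data:
        names = sorted(data.files)
        loaded_spin = data["spin"]

print(names)
print(loaded_spin.shape)
```

Both arrays end up in one file on disk, which matches the single myrun.npz file that holds both the spin and h_ext tables.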

The dataset directory#

Let us take a look at the files and directories in our newly created dataset:

!tree $dataset.basepath
/tmp/flatspin/mydataset
├── index.csv
├── info.csv
├── myrun.npz
└── params.csv

0 directories, 4 files

As you can see, the dataset basepath directory contains three CSV files: params.csv and info.csv contain the parameters and misc information (dataset.params and dataset.info), while index.csv contains the index of the runs (dataset.index).

index.csv contains a list of all the runs included in the dataset. In this simple example, the dataset contains only one run, with results stored in the archive myrun.npz.

!cat $dataset.basepath/index.csv
# outdir
myrun.npz

Reading datasets#

To read a dataset from disk, use Dataset.read():

dataset = Dataset.read(basepath)

Printing a dataset object produces a summary of its contents:

print(dataset)
Dataset: mydataset

params:
 disorder=0.05
 H=0.2
 H0=0
 phase=270
 phi=30
 size=(4, 4)
 timesteps=100
 use_opencl=True

info:
 comment: My simple dataset
 model: flatspin.model.PinwheelSpinIceDiamond
 version: 2.3.dev2+g269b179

index:
      outdir
0  myrun.npz

Reading results#

To read the results from a run, use tablefile() to get the path to the table of interest, and read_table() to read it into memory:

from flatspin.data import read_table

print(dataset.tablefile('h_ext'))
df = read_table(dataset.tablefile('h_ext'), index_col='t')
display(df)
/tmp/flatspin/mydataset/myrun.npz/h_ext
h_extx h_exty
t
0 0.000000 0.000
1 -0.006928 -0.004
2 -0.013856 -0.008
3 -0.020785 -0.012
4 -0.027713 -0.016
... ... ...
95 0.034641 0.020
96 0.027713 0.016
97 0.020785 0.012
98 0.013856 0.008
99 0.006928 0.004

100 rows × 2 columns

You can also get a list of all available tables with tablefiles():

dataset.tablefiles()
['/tmp/flatspin/mydataset/myrun.npz/h_ext',
 '/tmp/flatspin/mydataset/myrun.npz/spin']

Datasets with many runs#

The real power of the dataset becomes apparent when dealing with the results from many simulations. A common use case is parameter sweeps, where simulations are run repeatedly using different values of some parameter(s). In this case, we add columns to the index that keep track of the parameters being swept.

In the next example, we sweep the angle phi of the external field and store the results from each run in results.

import numpy as np

# Sweep angle phi of external field
phis = np.arange(0, 41, 10)
results = []
for phi in phis:
    # Model params are not swept in this example
    model = PinwheelSpinIceDiamond(**model_params)
    
    # Override phi from encoder_params
    ep = encoder_params.copy()
    ep['phi'] = phi
    encoder = Triangle(**ep)
    h_ext = encoder([1])
    
    spin = []
    for h in h_ext:
        model.set_h_ext(h)
        model.relax()
        spin.append(model.spin.copy())

    result = {}
    result['spin'] = pd.DataFrame(spin)
    result['spin'].index.name = 't'
    result['h_ext'] = pd.DataFrame(h_ext, columns=['h_extx', 'h_exty'])
    result['h_ext'].index.name = 't'

    results.append(result)

print(len(results), 'results')
5 results

Next, we create the dataset for the sweep. The code is the same as earlier, except the index now:

  • contains several rows of results

  • has an additional column phi which contains the swept field angles

# Where to store the dataset and results
basepath = '/tmp/flatspin/mysweep'
if os.path.exists(basepath):
    shutil.rmtree(basepath)
os.makedirs(basepath)

# params and info unchanged

# Create the index table, with one entry per value of phi
outdirs = [f'run{i:02d}.npz' for i in range(len(phis))]
index = pd.DataFrame({'phi': phis, 'outdir': outdirs})
display(index)

# Create and save the dataset
dataset = Dataset(index, params, info, basepath)
print("Saving dataset:", repr(dataset))
dataset.save()

# Save the results of each run
for outdir, result in zip(outdirs, results):
    for name, df in result.items():
        filename = f"{basepath}/{outdir}/{name}"
        print(f"Saving table {name} to {filename}")
        save_table(df, filename)
phi outdir
0 0 run00.npz
1 10 run01.npz
2 20 run02.npz
3 30 run03.npz
4 40 run04.npz
Saving dataset: Dataset('/tmp/flatspin/mysweep'): 5 items
Saving table spin to /tmp/flatspin/mysweep/run00.npz/spin
Saving table h_ext to /tmp/flatspin/mysweep/run00.npz/h_ext
Saving table spin to /tmp/flatspin/mysweep/run01.npz/spin
Saving table h_ext to /tmp/flatspin/mysweep/run01.npz/h_ext
Saving table spin to /tmp/flatspin/mysweep/run02.npz/spin
Saving table h_ext to /tmp/flatspin/mysweep/run02.npz/h_ext
Saving table spin to /tmp/flatspin/mysweep/run03.npz/spin
Saving table h_ext to /tmp/flatspin/mysweep/run03.npz/h_ext
Saving table spin to /tmp/flatspin/mysweep/run04.npz/spin
Saving table h_ext to /tmp/flatspin/mysweep/run04.npz/h_ext

When the dataset contains multiple items, tablefile() returns a list of files:

dataset.tablefile('spin')
['/tmp/flatspin/mysweep/run00.npz/spin',
 '/tmp/flatspin/mysweep/run01.npz/spin',
 '/tmp/flatspin/mysweep/run02.npz/spin',
 '/tmp/flatspin/mysweep/run03.npz/spin',
 '/tmp/flatspin/mysweep/run04.npz/spin']

… and tablefiles() returns a list of lists:

dataset.tablefiles()
[['/tmp/flatspin/mysweep/run00.npz/h_ext',
  '/tmp/flatspin/mysweep/run00.npz/spin'],
 ['/tmp/flatspin/mysweep/run01.npz/h_ext',
  '/tmp/flatspin/mysweep/run01.npz/spin'],
 ['/tmp/flatspin/mysweep/run02.npz/h_ext',
  '/tmp/flatspin/mysweep/run02.npz/spin'],
 ['/tmp/flatspin/mysweep/run03.npz/h_ext',
  '/tmp/flatspin/mysweep/run03.npz/spin'],
 ['/tmp/flatspin/mysweep/run04.npz/h_ext',
  '/tmp/flatspin/mysweep/run04.npz/spin']]
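Since tablefile() returns one file per run, a common next step is to read each table and combine them for analysis. The sketch below shows one possible way to do the combining with pandas.concat, labelling each run by its swept parameter. The small tables are synthetic stand-ins for what read_table() would return, and the column values are made up for illustration.

```python
import pandas as pd

phis = [0, 10, 20]

# Synthetic stand-ins for the per-run tables (one DataFrame per run)
tables = []
for phi in phis:
    df = pd.DataFrame({"h_extx": [0.0, 0.1], "h_exty": [0.0, 0.2]})
    df.index.name = "t"
    tables.append(df)

# Stack the runs, adding an outer index level that records phi
combined = pd.concat(tables, keys=phis, names=["phi"])

# Select the rows belonging to a single run
print(combined.loc[10])
```

The result is a single DataFrame in which the outer index level identifies the run and the inner level is the time step.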

Selecting parts of a dataset#

Dataset offers several methods to select a subset of the runs in a dataset. These methods return a new Dataset object with a modified index containing only the selected subset.

A subset of the dataset can be selected by the IDs in the index:

# Select run with id 3
print(repr(dataset[3]))
print('id =', dataset[3].id())
dataset[3].index
Dataset('/tmp/flatspin/mysweep'): 1 items
id = 3
phi outdir
3 30 run03.npz
# Select a range of runs (ids 1 through 3; the end of the range is included)
dataset[1:3].index
phi outdir
1 10 run01.npz
2 20 run02.npz
3 30 run03.npz
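Note that the slice dataset[1:3] above returns runs 1, 2 and 3, unlike a regular Python list slice. This matches label-based slicing in pandas, where the end label is included. The following pandas-only sketch contrasts the two behaviours on a copy of the index. It assumes, consistent with the output above, that Dataset indexing selects by run id in the same way.

```python
import pandas as pd

index = pd.DataFrame({
    "phi": [0, 10, 20, 30, 40],
    "outdir": [f"run{i:02d}.npz" for i in range(5)],
})

# Label-based slice: both endpoints are included
by_label = index.loc[1:3]

# Position-based slice: the end position is excluded
by_position = index.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
```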

It is also possible to filter results based on a column in the index:

dataset.filter(phi=30).index
phi outdir
3 30 run03.npz

The nth row in the index can be obtained with row(n), independently of the run id:

display(dataset[3:].index)
dataset[3:].row(0)
phi outdir
3 30 run03.npz
4 40 run04.npz
phi              30
outdir    run03.npz
Name: 3, dtype: object
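The difference between a run id and a row position can be reproduced with plain pandas. In this sketch, which only mimics the index and does not use Dataset itself, the first row of the subset sits at position 0 but keeps its original id 3.

```python
import pandas as pd

index = pd.DataFrame({
    "phi": [0, 10, 20, 30, 40],
    "outdir": [f"run{i:02d}.npz" for i in range(5)],
})

# Keep the runs with id 3 and above
subset = index.loc[3:]

# Positional access: the first row of the subset
first = subset.iloc[0]

# The Series name is the original run id
print(first.name, first["outdir"])
```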

Iterating over a dataset#

Iterating over a dataset will generate a dataset object for each row in the index:

for ds in dataset:
    print(repr(ds), ds.row(0)['outdir'])
Dataset('/tmp/flatspin/mysweep'): 1 items run00.npz
Dataset('/tmp/flatspin/mysweep'): 1 items run01.npz
Dataset('/tmp/flatspin/mysweep'): 1 items run02.npz
Dataset('/tmp/flatspin/mysweep'): 1 items run03.npz
Dataset('/tmp/flatspin/mysweep'): 1 items run04.npz

Iterate over (id, dataset) tuples with items():

for i, ds in dataset.items():
    print(i, repr(ds))
0 Dataset('/tmp/flatspin/mysweep'): 1 items
1 Dataset('/tmp/flatspin/mysweep'): 1 items
2 Dataset('/tmp/flatspin/mysweep'): 1 items
3 Dataset('/tmp/flatspin/mysweep'): 1 items
4 Dataset('/tmp/flatspin/mysweep'): 1 items

… or over (row, dataset) tuples with iterrows():

for row, ds in dataset.iterrows():
    print(row, repr(ds))
(0, 0, 'run00.npz') Dataset('/tmp/flatspin/mysweep'): 1 items
(1, 10, 'run01.npz') Dataset('/tmp/flatspin/mysweep'): 1 items
(2, 20, 'run02.npz') Dataset('/tmp/flatspin/mysweep'): 1 items
(3, 30, 'run03.npz') Dataset('/tmp/flatspin/mysweep'): 1 items
(4, 40, 'run04.npz') Dataset('/tmp/flatspin/mysweep'): 1 items

… or iterate over a column in the index with groupby():

for phi, ds in dataset.groupby('phi'):
    print(phi, repr(ds))
0 Dataset('/tmp/flatspin/mysweep'): 1 items
10 Dataset('/tmp/flatspin/mysweep'): 1 items
20 Dataset('/tmp/flatspin/mysweep'): 1 items
30 Dataset('/tmp/flatspin/mysweep'): 1 items
40 Dataset('/tmp/flatspin/mysweep'): 1 items

Sweeping several parameters#

Sweeping several parameters produces more complex datasets. Below we create a dataset where two parameters are swept: alpha and phi. The purpose of the example below is to demonstrate some features of Dataset, so we don’t actually run any simulations or save any data.

import itertools
sweep = {
    'alpha': [0.001, 0.002, 0.003],
    'phi': [0, 10, 20, 30, 40],
}
sweep_values = list(itertools.product(sweep['alpha'], sweep['phi']))
sweep_values = np.transpose(sweep_values)

# Create the index table, with one entry per combination of alpha and phi
alphas = sweep_values[0]
phis = sweep_values[1]
outdirs = [f'run{i:02d}.npz' for i in range(len(phis))]
index = pd.DataFrame({'alpha': alphas, 'phi': phis, 'outdir': outdirs})
display(index)

# Create the dataset object
dataset = Dataset(index)
alpha phi outdir
0 0.001 0.0 run00.npz
1 0.001 10.0 run01.npz
2 0.001 20.0 run02.npz
3 0.001 30.0 run03.npz
4 0.001 40.0 run04.npz
5 0.002 0.0 run05.npz
6 0.002 10.0 run06.npz
7 0.002 20.0 run07.npz
8 0.002 30.0 run08.npz
9 0.002 40.0 run09.npz
10 0.003 0.0 run10.npz
11 0.003 10.0 run11.npz
12 0.003 20.0 run12.npz
13 0.003 30.0 run13.npz
14 0.003 40.0 run14.npz
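In the table above, phi is shown as a floating-point number because np.transpose converts the mixed list of alpha and phi values into a single float array. If keeping phi as integers matters, one possible alternative is to build the index directly from the list of combinations, so that pandas infers a dtype per column. This is a sketch of that variation, not the approach used in the rest of this guide.

```python
import itertools

import pandas as pd

sweep = {
    'alpha': [0.001, 0.002, 0.003],
    'phi': [0, 10, 20, 30, 40],
}

# One row per combination; columns are named after the swept parameters
index = pd.DataFrame(
    list(itertools.product(*sweep.values())),
    columns=list(sweep.keys()),
)
index['outdir'] = [f'run{i:02d}.npz' for i in range(len(index))]

print(index.dtypes)
```

The rows appear in the same order as before, with alpha varying slowest and phi fastest.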

In this case, filter() and groupby() now return subsets with several items:

dataset.filter(phi=30).index
alpha phi outdir
3 0.001 30.0 run03.npz
8 0.002 30.0 run08.npz
13 0.003 30.0 run13.npz
for alpha, ds in dataset.groupby('alpha'):
    print(alpha)
    print(ds.index)
0.001
   alpha   phi     outdir
0  0.001   0.0  run00.npz
1  0.001  10.0  run01.npz
2  0.001  20.0  run02.npz
3  0.001  30.0  run03.npz
4  0.001  40.0  run04.npz
0.002
   alpha   phi     outdir
5  0.002   0.0  run05.npz
6  0.002  10.0  run06.npz
7  0.002  20.0  run07.npz
8  0.002  30.0  run08.npz
9  0.002  40.0  run09.npz
0.003
    alpha   phi     outdir
10  0.003   0.0  run10.npz
11  0.003  10.0  run11.npz
12  0.003  20.0  run12.npz
13  0.003  30.0  run13.npz
14  0.003  40.0  run14.npz

It is also possible to filter() and groupby() with multiple parameters:

dataset.filter(alpha=0.003, phi=30).index
alpha phi outdir
13 0.003 30.0 run13.npz
# The group key follows the order of the columns passed to groupby()
for (phi, alpha), ds in dataset.groupby(['phi', 'alpha']):
    print(phi, alpha, ds.id(), repr(ds))
0.0 0.001 0 Dataset(None): 1 items
0.0 0.002 5 Dataset(None): 1 items
0.0 0.003 10 Dataset(None): 1 items
10.0 0.001 1 Dataset(None): 1 items
10.0 0.002 6 Dataset(None): 1 items
10.0 0.003 11 Dataset(None): 1 items
20.0 0.001 2 Dataset(None): 1 items
20.0 0.002 7 Dataset(None): 1 items
20.0 0.003 12 Dataset(None): 1 items
30.0 0.001 3 Dataset(None): 1 items
30.0 0.002 8 Dataset(None): 1 items
30.0 0.003 13 Dataset(None): 1 items
40.0 0.001 4 Dataset(None): 1 items
40.0 0.002 9 Dataset(None): 1 items
40.0 0.003 14 Dataset(None): 1 items
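When grouping by several columns, each group key is a tuple whose elements follow the order of the column names passed to groupby(), so ['phi', 'alpha'] yields (phi, alpha) pairs. That is why the first printed column above holds the phi values. Here is a minimal pandas-only sketch of that ordering, assuming Dataset.groupby() mirrors pandas in this respect. The small index is a cut-down version of the sweep above.

```python
import pandas as pd

index = pd.DataFrame({
    'alpha': [0.001, 0.001, 0.002, 0.002],
    'phi': [0, 10, 0, 10],
})

# Collect the group keys; each key is a (phi, alpha) tuple
keys = [key for key, group in index.groupby(['phi', 'alpha'])]

for phi, alpha in keys:
    print(phi, alpha)
```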

Further reading#

While this guide has introduced the core concepts of the dataset, the Dataset class contains a few extra bells and whistles. In addition, flatspin.data contains several useful functions for processing simulation results.