Datasets#
It is often useful to save results from flatspin simulations to persistent storage. Simulations can take a long time, and storing results allows us to perform analysis afterwards without needing to re-run the simulations. Or maybe we simply wish to archive the results in case we need them later.
Whatever the motivation, the flatspin dataset provides a powerful format in which to organize and save results.
All the flatspin command-line tools work with datasets, so a good understanding of the dataset format is essential when working with flatspin data from the command line.
Anatomy of a dataset#
A flatspin dataset consists of three major components:
1. Dictionaries with simulation parameters and information
2. An index of the runs included in the dataset
3. The results of each run, stored in one or more tables
As you can see, (1) and (2) are metadata about the results, while (3) is the actual result data.
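Conceptually, the three components can be pictured with plain Python objects. The sketch below uses pandas only and is not actual flatspin code; it simply mirrors the structure described above (the `outdir` column is introduced properly later in this guide):

```python
import pandas as pd

# (1) dictionaries with simulation parameters and misc information
params = {'size': (4, 4), 'disorder': 0.05}
info = {'comment': 'example dataset'}

# (2) the index: one row per run, with an 'outdir' column pointing
#     to where that run's result tables are stored
index = pd.DataFrame({'outdir': ['run00.npz', 'run01.npz']})

# (3) the result tables themselves live under each 'outdir'
print(index)
```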
A simple example#
Let us begin with a simple example, where we store the results from a single run in a dataset.
First, we run a simulation and store results in two tables: spin
contains the state of all the spins over time, and h_ext
contains the external field values.
import pandas as pd
from flatspin.model import PinwheelSpinIceDiamond
from flatspin.encoder import Triangle
# Model parameters
model_params = {
    'size': (4, 4),
    'disorder': 0.05,
    'use_opencl': True,
}
# Encoder parameters
encoder_params = {
    'H': 0.2,
    'phi': 30,
    'phase': 270,
}
# Create the model object
model = PinwheelSpinIceDiamond(**model_params)
# Create the encoder
encoder = Triangle(**encoder_params)
# Use the encoder to create a global external field
input = [1]
h_ext = encoder(input)
# Save spin state over time
spin = []
# Loop over field values and flip spins accordingly
for h in h_ext:
    model.set_h_ext(h)
    model.relax()
    # Take a snapshot (copy) of the spin state
    spin.append(model.spin.copy())
# Create two tables, one for m_tot and one for h_ext
result = {}
result['spin'] = pd.DataFrame(spin)
result['spin'].index.name = 't'
result['h_ext'] = pd.DataFrame(h_ext, columns=['h_extx', 'h_exty'])
result['h_ext'].index.name = 't'
display(result['spin'])
display(result['h_ext'])
    0   1   2   3   4  ...  35  36  37  38  39
t                      ...
0   1   1   1   1   1  ...   1   1   1   1   1
1   1   1   1   1   1  ...   1   1   1   1   1
2   1   1   1   1   1  ...   1   1   1   1   1
3   1   1   1   1   1  ...   1   1   1   1   1
4   1   1   1   1   1  ...   1   1   1   1   1
..  ..  ..  ..  ..  .. ...  ..  ..  ..  ..  ..
95  1   1   1   1   1  ...   1   1   1   1   1
96  1   1   1   1   1  ...   1   1   1   1   1
97  1   1   1   1   1  ...   1   1   1   1   1
98  1   1   1   1   1  ...   1   1   1   1   1
99  1   1   1   1   1  ...   1   1   1   1   1

[100 rows x 40 columns]
      h_extx  h_exty
t
0   0.000000   0.000
1  -0.006928  -0.004
2  -0.013856  -0.008
3  -0.020785  -0.012
4  -0.027713  -0.016
..       ...     ...
95  0.034641   0.020
96  0.027713   0.016
97  0.020785   0.012
98  0.013856   0.008
99  0.006928   0.004

[100 rows x 2 columns]
Creating the Dataset object#
We are now ready to create the Dataset
object.
A Dataset
object only contains metadata about the results, but offers several methods to inspect the results available in the dataset.
import os
import shutil
import flatspin
from flatspin.data import Dataset
# Where to store the dataset and results
basepath = '/tmp/flatspin/mydataset'
if os.path.exists(basepath):
    shutil.rmtree(basepath)
# Create params dictionary
# We store both the model params and encoder params
# (there is no overlap between model and encoder params)
params = model_params.copy()
params.update(encoder.get_params())
# Create info dictionary (misc info)
info = {
    'model': f'{model.__class__.__module__}.{model.__class__.__name__}',
    'version': flatspin.__version__,
    'comment': 'My simple dataset',
}
# Create the index table, with a single entry for the above run
# The index must contain a column named 'outdir' which points
# to the location of the result directory / archive file
outdir = 'myrun.npz'
index = pd.DataFrame({'outdir': [outdir]})
# Create the dataset directory
os.makedirs(basepath)
# Create the dataset object
dataset = Dataset(index, params, info, basepath)
Once the dataset object has been created, it can be saved to disk with Dataset.save()
.
print("Saving dataset:", repr(dataset))
dataset.save()
Saving dataset: Dataset('/tmp/flatspin/mydataset'): 1 items
Saving results#
As mentioned, a Dataset
object only keeps track of metadata.
The result tables must be saved separately using save_table()
, to the location referenced by the outdir
column in the dataset index.
flatspin tools (and save_table) support a few different table formats, selected based on the file extension:
- csv: simple CSV text files
- hdf: a Hierarchical Data Format archive
- npy: NumPy binary files
- npz: compressed archive of .npy files
We recommend using the npy
or npz
formats, since they are fast, efficient and easy to work with.
Note
Depending on the table format, the outdir
column in the dataset index refers to either a filesystem directory (in case of csv
or npy
) or the name of an archive file (in case of hdf
or npz
).
Hence tables are either stored as separate files inside an output directory, or as entries inside an archive file.
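To make the Note concrete, here is a small sketch of how a table location is composed from the dataset `basepath`, the run's `outdir` and the table name. The `table_path` helper is hypothetical, not part of flatspin:

```python
# Hypothetical helper: compose the location of a table.
# With an archive format (npz/hdf), 'myrun.npz' is a single file and
# the table is an entry inside it; with csv/npy, the outdir would be
# a plain directory and the table a separate file within it.
def table_path(basepath, outdir, name):
    return f"{basepath}/{outdir}/{name}"

print(table_path('/tmp/flatspin/mydataset', 'myrun.npz', 'spin'))
# /tmp/flatspin/mydataset/myrun.npz/spin
```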
Below we save the result tables we created earlier in a npz
archive.
from flatspin.data import save_table
# Save the result tables to the npz archive
for name, df in result.items():
    filename = f"{basepath}/{outdir}/{name}"
    print(f"Saving table {name} to {filename}")
    save_table(df, filename)
Saving table spin to /tmp/flatspin/mydataset/myrun.npz/spin
Saving table h_ext to /tmp/flatspin/mydataset/myrun.npz/h_ext
The dataset directory#
Let us take a look at the files and directories in our newly created dataset:
!tree $dataset.basepath
/tmp/flatspin/mydataset
├── index.csv
├── info.csv
├── myrun.npz
└── params.csv
0 directories, 4 files
As you can see, the dataset basepath
directory contains three CSV files: params.csv
and info.csv
contain the parameters and misc information (dataset.params
and dataset.info
), while index.csv
contains the index of the runs (dataset.index
).
index.csv
contains a list of all the runs included in the dataset.
In this simple example, the dataset contains only one run, with results stored in the archive myrun.npz
.
!cat $dataset.basepath/index.csv
# outdir
myrun.npz
Reading datasets#
To read a dataset from disk, use Dataset.read()
:
dataset = Dataset.read(basepath)
Printing a dataset object produces a summary of its contents:
print(dataset)
Dataset: mydataset
params:
disorder=0.05
H=0.2
H0=0
phase=270
phi=30
size=(4, 4)
timesteps=100
use_opencl=True
info:
comment: My simple dataset
model: flatspin.model.PinwheelSpinIceDiamond
version: 2.1
index:
outdir
0 myrun.npz
Reading results#
To read the results from a run, use tablefile()
to get the path to the table of interest, and read_table()
to read it into memory:
from flatspin.data import read_table
print(dataset.tablefile('h_ext'))
df = read_table(dataset.tablefile('h_ext'), index_col='t')
display(df)
/tmp/flatspin/mydataset/myrun.npz/h_ext
      h_extx  h_exty
t
0   0.000000   0.000
1  -0.006928  -0.004
2  -0.013856  -0.008
3  -0.020785  -0.012
4  -0.027713  -0.016
..       ...     ...
95  0.034641   0.020
96  0.027713   0.016
97  0.020785   0.012
98  0.013856   0.008
99  0.006928   0.004

[100 rows x 2 columns]
You can also get a list of all available tables with tablefiles()
:
dataset.tablefiles()
['/tmp/flatspin/mydataset/myrun.npz/h_ext',
'/tmp/flatspin/mydataset/myrun.npz/spin']
Datasets with many runs#
The real power of the dataset becomes apparent when dealing with the results from many simulations. A common use case is parameter sweeps, where simulations are run repeatedly using different values of some parameter(s). In this case, we add columns to the index that keep track of the parameters being swept.
In the next example, we sweep the angle phi
of the external field and store the results from each run in results
.
import numpy as np

# Sweep angle phi of external field
phis = np.arange(0, 41, 10)
results = []
for phi in phis:
    # Model params are not swept in this example
    model = PinwheelSpinIceDiamond(**model_params)
    # Override phi from encoder_params
    ep = encoder_params.copy()
    ep['phi'] = phi
    encoder = Triangle(**ep)
    h_ext = encoder([1])
    spin = []
    for h in h_ext:
        model.set_h_ext(h)
        model.relax()
        spin.append(model.spin.copy())
    result = {}
    result['spin'] = pd.DataFrame(spin)
    result['spin'].index.name = 't'
    result['h_ext'] = pd.DataFrame(h_ext, columns=['h_extx', 'h_exty'])
    result['h_ext'].index.name = 't'
    results.append(result)
print(len(results), 'results')
print(len(results), 'results')
5 results
Next, we create the dataset for the sweep.
The code is the same as earlier, except the index now:
- contains several rows of results
- has an additional column phi which contains the swept field angles
# Where to store the dataset and results
basepath = '/tmp/flatspin/mysweep'
if os.path.exists(basepath):
    shutil.rmtree(basepath)
os.makedirs(basepath)
# params and info unchanged
# Create the index table, with one entry per value of phi
outdirs = [f'run{i:02d}.npz' for i in range(len(phis))]
index = pd.DataFrame({'phi': phis, 'outdir': outdirs})
display(index)
# Create and save the dataset
dataset = Dataset(index, params, info, basepath)
print("Saving dataset:", repr(dataset))
dataset.save()
# Save the results of each run
for outdir, result in zip(outdirs, results):
    for name, df in result.items():
        filename = f"{basepath}/{outdir}/{name}"
        print(f"Saving table {name} to {filename}")
        save_table(df, filename)
   phi     outdir
0    0  run00.npz
1   10  run01.npz
2   20  run02.npz
3   30  run03.npz
4   40  run04.npz
Saving dataset: Dataset('/tmp/flatspin/mysweep'): 5 items
Saving table spin to /tmp/flatspin/mysweep/run00.npz/spin
Saving table h_ext to /tmp/flatspin/mysweep/run00.npz/h_ext
Saving table spin to /tmp/flatspin/mysweep/run01.npz/spin
Saving table h_ext to /tmp/flatspin/mysweep/run01.npz/h_ext
Saving table spin to /tmp/flatspin/mysweep/run02.npz/spin
Saving table h_ext to /tmp/flatspin/mysweep/run02.npz/h_ext
Saving table spin to /tmp/flatspin/mysweep/run03.npz/spin
Saving table h_ext to /tmp/flatspin/mysweep/run03.npz/h_ext
Saving table spin to /tmp/flatspin/mysweep/run04.npz/spin
Saving table h_ext to /tmp/flatspin/mysweep/run04.npz/h_ext
When the dataset contains multiple items, tablefile()
returns a list of files:
dataset.tablefile('spin')
['/tmp/flatspin/mysweep/run00.npz/spin',
'/tmp/flatspin/mysweep/run01.npz/spin',
'/tmp/flatspin/mysweep/run02.npz/spin',
'/tmp/flatspin/mysweep/run03.npz/spin',
'/tmp/flatspin/mysweep/run04.npz/spin']
… and tablefiles()
returns a list of lists:
dataset.tablefiles()
[['/tmp/flatspin/mysweep/run00.npz/h_ext',
'/tmp/flatspin/mysweep/run00.npz/spin'],
['/tmp/flatspin/mysweep/run01.npz/h_ext',
'/tmp/flatspin/mysweep/run01.npz/spin'],
['/tmp/flatspin/mysweep/run02.npz/h_ext',
'/tmp/flatspin/mysweep/run02.npz/spin'],
['/tmp/flatspin/mysweep/run03.npz/h_ext',
'/tmp/flatspin/mysweep/run03.npz/spin'],
['/tmp/flatspin/mysweep/run04.npz/h_ext',
'/tmp/flatspin/mysweep/run04.npz/spin']]
Selecting parts of a dataset#
Dataset
offers several methods to select a subset of the runs in a dataset.
These methods return a new Dataset
object with a modified index
containing only the selected subset.
A subset of the dataset can be selected by the IDs in the index:
# Select run with id 3
print(repr(dataset[3]))
print('id =', dataset[3].id())
dataset[3].index
Dataset('/tmp/flatspin/mysweep'): 1 items
id = 3
   phi     outdir
3   30  run03.npz
# Select a range of runs
dataset[1:3].index
   phi     outdir
1   10  run01.npz
2   20  run02.npz
3   30  run03.npz
It is also possible to filter
results based on a column in the index:
dataset.filter(phi=30).index
   phi     outdir
3   30  run03.npz
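Under the hood, filtering amounts to row selection on the index DataFrame. A rough pandas-only equivalent (a sketch, not the actual flatspin implementation) looks like this:

```python
import pandas as pd

# Stand-in for the sweep's index table
index = pd.DataFrame({'phi': [0, 10, 20, 30, 40],
                      'outdir': [f'run{i:02d}.npz' for i in range(5)]})

# Rough equivalent of dataset.filter(phi=30): select matching rows,
# keeping the original run ids in the DataFrame index
subset = index[index['phi'] == 30]
print(subset)
```

Note how the run id (3) is preserved in the DataFrame index of the selection, just as in the dataset example above.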
The nth row in the index can be obtained with row(n), independently of the run id:
display(dataset[3:].index)
dataset[3:].row(0)
   phi     outdir
3   30  run03.npz
4   40  run04.npz
phi 30
outdir run03.npz
Name: 3, dtype: object
Iterating over a dataset#
Iterating over a dataset will generate a dataset object for each row in the index:
for ds in dataset:
    print(repr(ds), ds.row(0)['outdir'])
Dataset('/tmp/flatspin/mysweep'): 1 items run00.npz
Dataset('/tmp/flatspin/mysweep'): 1 items run01.npz
Dataset('/tmp/flatspin/mysweep'): 1 items run02.npz
Dataset('/tmp/flatspin/mysweep'): 1 items run03.npz
Dataset('/tmp/flatspin/mysweep'): 1 items run04.npz
Iterate over (id, dataset)
tuples with items()
:
for i, ds in dataset.items():
    print(i, repr(ds))
0 Dataset('/tmp/flatspin/mysweep'): 1 items
1 Dataset('/tmp/flatspin/mysweep'): 1 items
2 Dataset('/tmp/flatspin/mysweep'): 1 items
3 Dataset('/tmp/flatspin/mysweep'): 1 items
4 Dataset('/tmp/flatspin/mysweep'): 1 items
… or over (row, dataset)
tuples with iterrows()
:
for row, ds in dataset.iterrows():
    print(row, repr(ds))
(0, 0, 'run00.npz') Dataset('/tmp/flatspin/mysweep'): 1 items
(1, 10, 'run01.npz') Dataset('/tmp/flatspin/mysweep'): 1 items
(2, 20, 'run02.npz') Dataset('/tmp/flatspin/mysweep'): 1 items
(3, 30, 'run03.npz') Dataset('/tmp/flatspin/mysweep'): 1 items
(4, 40, 'run04.npz') Dataset('/tmp/flatspin/mysweep'): 1 items
… or iterate over a column in the index with groupby()
:
for phi, ds in dataset.groupby('phi'):
    print(phi, repr(ds))
0 Dataset('/tmp/flatspin/mysweep'): 1 items
10 Dataset('/tmp/flatspin/mysweep'): 1 items
20 Dataset('/tmp/flatspin/mysweep'): 1 items
30 Dataset('/tmp/flatspin/mysweep'): 1 items
40 Dataset('/tmp/flatspin/mysweep'): 1 items
Sweeping several parameters#
Sweeping several parameters produces more complex datasets.
Below we create a dataset where two parameters are swept: alpha
and phi
.
The purpose of the example below is to demonstrate some features of Dataset
, so we don’t actually run any simulations or save any data.
import itertools
import numpy as np

sweep = {
    'alpha': [0.001, 0.002, 0.003],
    'phi': [0, 10, 20, 30, 40],
}
sweep_values = list(itertools.product(sweep['alpha'], sweep['phi']))
sweep_values = np.transpose(sweep_values)
# Create the index table, with one entry per (alpha, phi) combination
alphas = sweep_values[0]
phis = sweep_values[1]
outdirs = [f'run{i:02d}.npz' for i in range(len(phis))]
index = pd.DataFrame({'alpha': alphas, 'phi': phis, 'outdir': outdirs})
display(index)
display(index)
# Create the dataset object
dataset = Dataset(index)
    alpha   phi     outdir
0   0.001   0.0  run00.npz
1   0.001  10.0  run01.npz
2   0.001  20.0  run02.npz
3   0.001  30.0  run03.npz
4   0.001  40.0  run04.npz
5   0.002   0.0  run05.npz
6   0.002  10.0  run06.npz
7   0.002  20.0  run07.npz
8   0.002  30.0  run08.npz
9   0.002  40.0  run09.npz
10  0.003   0.0  run10.npz
11  0.003  10.0  run11.npz
12  0.003  20.0  run12.npz
13  0.003  30.0  run13.npz
14  0.003  40.0  run14.npz
In this case, filter() and groupby() return subsets with several items:
dataset.filter(phi=30).index
    alpha   phi     outdir
3   0.001  30.0  run03.npz
8   0.002  30.0  run08.npz
13  0.003  30.0  run13.npz
for alpha, ds in dataset.groupby('alpha'):
    print(alpha)
    print(ds.index)
0.001
alpha phi outdir
0 0.001 0.0 run00.npz
1 0.001 10.0 run01.npz
2 0.001 20.0 run02.npz
3 0.001 30.0 run03.npz
4 0.001 40.0 run04.npz
0.002
alpha phi outdir
5 0.002 0.0 run05.npz
6 0.002 10.0 run06.npz
7 0.002 20.0 run07.npz
8 0.002 30.0 run08.npz
9 0.002 40.0 run09.npz
0.003
alpha phi outdir
10 0.003 0.0 run10.npz
11 0.003 10.0 run11.npz
12 0.003 20.0 run12.npz
13 0.003 30.0 run13.npz
14 0.003 40.0 run14.npz
It is also possible to filter()
and groupby()
with multiple parameters:
dataset.filter(alpha=0.003, phi=30).index
    alpha   phi     outdir
13  0.003  30.0  run13.npz
for (phi, alpha), ds in dataset.groupby(['phi', 'alpha']):
    print(phi, alpha, ds.id(), repr(ds))
0.0 0.001 0 Dataset(None): 1 items
0.0 0.002 5 Dataset(None): 1 items
0.0 0.003 10 Dataset(None): 1 items
10.0 0.001 1 Dataset(None): 1 items
10.0 0.002 6 Dataset(None): 1 items
10.0 0.003 11 Dataset(None): 1 items
20.0 0.001 2 Dataset(None): 1 items
20.0 0.002 7 Dataset(None): 1 items
20.0 0.003 12 Dataset(None): 1 items
30.0 0.001 3 Dataset(None): 1 items
30.0 0.002 8 Dataset(None): 1 items
30.0 0.003 13 Dataset(None): 1 items
40.0 0.001 4 Dataset(None): 1 items
40.0 0.002 9 Dataset(None): 1 items
40.0 0.003 14 Dataset(None): 1 items
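Grouping over multiple columns mirrors pandas DataFrame.groupby applied to the index table. A rough pandas-only sketch (again, not the real implementation) of grouping the two-parameter index on (phi, alpha):

```python
import itertools
import pandas as pd

# Rebuild the two-parameter index from the example above
sweep = {'alpha': [0.001, 0.002, 0.003], 'phi': [0, 10, 20, 30, 40]}
rows = list(itertools.product(sweep['alpha'], sweep['phi']))
index = pd.DataFrame(rows, columns=['alpha', 'phi'])
index['outdir'] = [f'run{i:02d}.npz' for i in range(len(index))]

# Rough equivalent of dataset.groupby(['phi', 'alpha']):
# each (phi, alpha) pair selects a single run here
for (phi, alpha), group in index.groupby(['phi', 'alpha']):
    print(phi, alpha, list(group['outdir']))
```

As in the dataset output above, the groups are ordered by phi first, then alpha.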
Further reading#
While this guide has introduced the core concepts of the dataset, the Dataset
class contains a few extra bells and whistles.
In addition, flatspin.data
contains several useful functions for processing simulation results.