{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "2dff730d", "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "%config InlineBackend.figure_formats = ['svg']\n", "\n", "import numpy as np\n", "from numpy.linalg import norm\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "id": "572ac839", "metadata": {}, "source": [ "(dataset)=\n", "\n", "# Datasets\n", "\n", "It is often useful to save results from flatspin simulations to persistent storage.\n", "Simulations can take a long time, and storing results allows us to perform analysis afterwards without needing to re-run the simulations.\n", "Or maybe we simply wish to archive the results in case we need them later.\n", "\n", "Whatever the motivation, the flatspin dataset provides a powerful format in which to organize and save results.\n", "\n", "All the flatspin [command-line tools](cmdline) work with datasets, so a good understanding of the dataset format is essential when working with flatspin data from the command line.\n", "\n", "## Anatomy of a dataset\n", "\n", "A flatspin dataset consists of three major components:\n", "\n", "1. Dictionaries with simulation **parameters** and **information**\n", "2. An **index** of the runs included in the dataset\n", "3. The results of each run, stored in one or more **tables**\n", "\n", "As you can see, (1) and (2) are *metadata* about the results, while (3) is the actual result *data*.\n", "\n", "## A simple example\n", "\n", "Let us begin with a simple example, where we store the results from a single run in a dataset.\n", "\n", "First, we run a simulation and store results in two tables: `spin` contains the state of all the spins over time, and `h_ext` contains the external field values."
] }, { "cell_type": "code", "execution_count": null, "id": "d30fdac1", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from flatspin.model import PinwheelSpinIceDiamond\n", "from flatspin.encoder import Triangle\n", "\n", "# Model parameters\n", "model_params = {\n", " 'size': (4, 4),\n", " 'disorder': 0.05,\n", " 'use_opencl': True,\n", "}\n", "\n", "# Encoder parameters\n", "encoder_params = {\n", " 'H': 0.2,\n", " 'phi': 30,\n", " 'phase': 270,\n", "}\n", "\n", "# Create the model object\n", "model = PinwheelSpinIceDiamond(**model_params)\n", "\n", "# Create the encoder\n", "encoder = Triangle(**encoder_params)\n", "\n", "# Use the encoder to create a global external field\n", "input = [1]\n", "h_ext = encoder(input)\n", "\n", "# Save spin state over time\n", "spin = []\n", "\n", "# Loop over field values and flip spins accordingly\n", "for h in h_ext:\n", " model.set_h_ext(h)\n", " model.relax()\n", " # Take a snapshot (copy) of the spin state\n", " spin.append(model.spin.copy())\n", "\n", "# Create two tables, one for spin and one for h_ext\n", "result = {}\n", "result['spin'] = pd.DataFrame(spin)\n", "result['spin'].index.name = 't'\n", "result['h_ext'] = pd.DataFrame(h_ext, columns=['h_extx', 'h_exty'])\n", "result['h_ext'].index.name = 't'\n", "\n", "display(result['spin'])\n", "display(result['h_ext'])" ] }, { "cell_type": "markdown", "id": "f8b0e8c4", "metadata": {}, "source": [ "## Creating the Dataset object\n", "\n", "We are now ready to create the {class}`Dataset ` object.\n", "A {class}`Dataset ` object *only contains metadata* about the results, but offers several methods to inspect the results available in the dataset."
] }, { "cell_type": "code", "execution_count": null, "id": "e566c890", "metadata": {}, "outputs": [], "source": [ "import os\n", "import shutil\n", "import flatspin\n", "from flatspin.data import Dataset\n", "\n", "# Where to store the dataset and results\n", "basepath = '/tmp/flatspin/mydataset'\n", "if os.path.exists(basepath):\n", " shutil.rmtree(basepath)\n", "\n", "# Create params dictionary\n", "# We store both the model params and encoder params\n", "# (there is no overlap between model and encoder params)\n", "params = model_params.copy()\n", "params.update(encoder.get_params())\n", "\n", "# Create info dictionary (misc info)\n", "info = {\n", " 'model': f'{model.__class__.__module__}.{model.__class__.__name__}',\n", " 'version': flatspin.__version__,\n", " 'comment': 'My simple dataset'\n", "}\n", "\n", "# Create the index table, with a single entry for the above run\n", "# The index must contain a column named 'outdir' which points\n", "# to the location of the result directory / archive file\n", "outdir = 'myrun.npz'\n", "index = pd.DataFrame({'outdir': [outdir]})\n", "\n", "# Create the dataset directory\n", "os.makedirs(basepath)\n", "\n", "# Create the dataset object\n", "dataset = Dataset(index, params, info, basepath)" ] }, { "cell_type": "markdown", "id": "1f4ebcf5", "metadata": {}, "source": [ "Once the dataset object has been created, it can be saved to disk with {func}`Dataset.save() `." 
] }, { "cell_type": "code", "execution_count": null, "id": "8b1f7e88", "metadata": {}, "outputs": [], "source": [ "print(\"Saving dataset:\", repr(dataset))\n", "dataset.save()" ] }, { "cell_type": "markdown", "id": "3f81b663", "metadata": {}, "source": [ "(dataset-saving)=\n", "## Saving results\n", "As mentioned, a {class}`Dataset ` object only keeps track of metadata.\n", "The result tables must be saved separately using {func}`save_table() `, to the location referenced by the `outdir` column in the dataset index.\n", "\n", "flatspin tools (and {func}`save_table `) support a few different table formats, selected based on the file extension:\n", "* `csv`: Simple CSV text files\n", "* `hdf`: A [Hierarchical Data Format](https://pandas.pydata.org/docs/reference/api/pandas.read_hdf.html) archive\n", "* `npy`: [NumPy binary files](https://numpy.org/doc/stable/reference/generated/numpy.lib.format.html)\n", "* `npz`: Compressed archive of .npy files\n", "\n", "We recommend using the `npy` or `npz` formats, since they are fast, efficient and easy to work with.\n", "\n", "```{note}\n", "Depending on the table format, the `outdir` column in the dataset index refers to either a filesystem directory (in case of `csv` or `npy`) or the name of an archive file (in case of `hdf` or `npz`).\n", "Hence tables are either stored as separate files inside an output directory, or as entries inside an archive file.\n", "```\n", "\n", "Below we save the result tables we created earlier in an `npz` archive."
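, "\n", "\n", "To make the note concrete, the same two tables would land on disk roughly as follows, depending on whether `outdir` names an archive file or a directory (a sketch; the exact entry names inside the archive are illustrative):\n", "\n", "```text\n", "myrun.npz       # npz: tables stored as entries inside one archive file\n", "├── spin\n", "└── h_ext\n", "\n", "myrun/          # csv: tables stored as separate files in a directory\n", "├── spin.csv\n", "└── h_ext.csv\n", "```\n"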
] }, { "cell_type": "code", "execution_count": null, "id": "286d1259", "metadata": {}, "outputs": [], "source": [ "from flatspin.data import save_table\n", "\n", "# Save the result tables to the npz archive\n", "for name, df in result.items():\n", " filename = f\"{basepath}/{outdir}/{name}\"\n", " print(f\"Saving table {name} to {filename}\")\n", " save_table(df, filename)" ] }, { "cell_type": "markdown", "id": "c088ad47", "metadata": {}, "source": [ "## The dataset directory\n", "\n", "Let us take a look at the files and directories in our newly created dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "51b934e5", "metadata": {}, "outputs": [], "source": [ "!tree $dataset.basepath" ] }, { "cell_type": "markdown", "id": "7b8e7288", "metadata": {}, "source": [ "As you can see, the dataset `basepath` directory contains three CSV files: `params.csv` and `info.csv` contain the parameters and misc information (`dataset.params` and `dataset.info`), while `index.csv` contains the index of the runs (`dataset.index`).\n", "\n", "`index.csv` contains a list of all the runs included in the dataset.\n", "In this simple example, the dataset contains only one run, with results stored in the archive `myrun.npz`."
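, "\n", "\n", "Since `npz` is just NumPy's standard zip-based archive format, a file like `myrun.npz` can also be inspected without flatspin. Below is a minimal standalone sketch (using a throwaway archive rather than the real `myrun.npz`, whose exact entry names are an implementation detail):\n", "\n", "```python\n", "import numpy as np\n", "\n", "# Build a small stand-in archive with two named tables,\n", "# analogous to the spin and h_ext tables saved above\n", "spin = np.array([[1, -1], [-1, 1]])\n", "h_ext = np.array([[0.0, 0.1], [0.2, 0.3]])\n", "np.savez_compressed('/tmp/peek.npz', spin=spin, h_ext=h_ext)\n", "\n", "# List the entries stored inside the archive\n", "with np.load('/tmp/peek.npz') as npz:\n", "    print(sorted(npz.files))  # prints ['h_ext', 'spin']\n", "```\n"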
] }, { "cell_type": "code", "execution_count": null, "id": "77e51dd9", "metadata": {}, "outputs": [], "source": [ "!cat $dataset.basepath/index.csv" ] }, { "cell_type": "markdown", "id": "a302bb19", "metadata": {}, "source": [ "## Reading datasets\n", "\n", "To read a dataset from disk, use {func}`Dataset.read() `:" ] }, { "cell_type": "code", "execution_count": null, "id": "ffedd751", "metadata": {}, "outputs": [], "source": [ "dataset = Dataset.read(basepath)" ] }, { "cell_type": "markdown", "id": "8f22b7a4", "metadata": {}, "source": [ "Printing a dataset object produces a summary of its contents:" ] }, { "cell_type": "code", "execution_count": null, "id": "d6e9fe39", "metadata": {}, "outputs": [], "source": [ "print(dataset)" ] }, { "cell_type": "markdown", "id": "9cfd8b34", "metadata": {}, "source": [ "## Reading results\n", "\n", "To read the results from a run, use {func}`tablefile() ` to get the path to the table of interest, and {func}`read_table() ` to read it into memory:" ] }, { "cell_type": "code", "execution_count": null, "id": "c4051369", "metadata": {}, "outputs": [], "source": [ "from flatspin.data import read_table\n", "\n", "print(dataset.tablefile('h_ext'))\n", "df = read_table(dataset.tablefile('h_ext'), index_col='t')\n", "display(df)" ] }, { "cell_type": "markdown", "id": "78292e69", "metadata": {}, "source": [ "You can also get a list of all available tables with {func}`tablefiles() `:" ] }, { "cell_type": "code", "execution_count": null, "id": "048cd622", "metadata": {}, "outputs": [], "source": [ "dataset.tablefiles()" ] }, { "cell_type": "markdown", "id": "d5c5d3fe", "metadata": {}, "source": [ "## Datasets with many runs\n", "\n", "The real power of the dataset becomes apparent when dealing with the results from many simulations.\n", "A common use case is parameter sweeps, where simulations are run repeatedly using different values of some parameter(s).\n", "In this case, we add columns to the index that keep track of the parameters being 
swept.\n", "\n", "In the next example, we sweep the angle `phi` of the external field and store the results from each run in `results`." ] }, { "cell_type": "code", "execution_count": null, "id": "95fab100", "metadata": {}, "outputs": [], "source": [ "# Sweep angle phi of external field\n", "phis = np.arange(0, 41, 10)\n", "results = []\n", "for phi in phis:\n", " # Model params are not swept in this example\n", " model = PinwheelSpinIceDiamond(**model_params)\n", " \n", " # Override phi from encoder_params\n", " ep = encoder_params.copy()\n", " ep['phi'] = phi\n", " encoder = Triangle(**ep)\n", " h_ext = encoder([1])\n", " \n", " spin = []\n", " for h in h_ext:\n", " model.set_h_ext(h)\n", " model.relax()\n", " spin.append(model.spin.copy())\n", "\n", " result = {}\n", " result['spin'] = pd.DataFrame(spin)\n", " result['spin'].index.name = 't'\n", " result['h_ext'] = pd.DataFrame(h_ext, columns=['h_extx', 'h_exty'])\n", " result['h_ext'].index.name = 't'\n", "\n", " results.append(result)\n", "\n", "print(len(results), 'results')" ] }, { "cell_type": "markdown", "id": "df7f603b", "metadata": {}, "source": [ "Next, we create the dataset for the sweep.\n", "The code is the same as earlier, except the `index` now:\n", "* contains several rows of results\n", "* has an additional column `phi` which contains the swept field angles" ] }, { "cell_type": "code", "execution_count": null, "id": "6b347c55", "metadata": {}, "outputs": [], "source": [ "# Where to store the dataset and results\n", "basepath = '/tmp/flatspin/mysweep'\n", "if os.path.exists(basepath):\n", " shutil.rmtree(basepath)\n", "os.makedirs(basepath)\n", "\n", "# params and info unchanged\n", "\n", "# Create the index table, with one entry per value of phi\n", "outdirs = [f'run{i:02d}.npz' for i in range(len(phis))]\n", "index = pd.DataFrame({'phi': phis, 'outdir': outdirs})\n", "display(index)\n", "\n", "# Create and save the dataset\n", "dataset = Dataset(index, params, info, basepath)\n", "print(\"Saving 
dataset:\", repr(dataset))\n", "dataset.save()\n", "\n", "# Save the results of each run\n", "for outdir, result in zip(outdirs, results):\n", " for name, df in result.items():\n", " filename = f\"{basepath}/{outdir}/{name}\"\n", " print(f\"Saving table {name} to {filename}\")\n", " save_table(df, filename)" ] }, { "cell_type": "markdown", "id": "611cec3f", "metadata": {}, "source": [ "When the dataset contains multiple items, {func}`tablefile() ` returns a list of files:" ] }, { "cell_type": "code", "execution_count": null, "id": "9e7bd60a", "metadata": {}, "outputs": [], "source": [ "dataset.tablefile('spin')" ] }, { "cell_type": "markdown", "id": "1e89b537", "metadata": {}, "source": [ "... and {func}`tablefiles() ` returns a list of lists:" ] }, { "cell_type": "code", "execution_count": null, "id": "02fbaf99", "metadata": {}, "outputs": [], "source": [ "dataset.tablefiles()" ] }, { "cell_type": "markdown", "id": "acc0bbdf", "metadata": {}, "source": [ "(dataset-subset)=\n", "## Selecting parts of a dataset\n", "\n", "{class}`Dataset ` offers several methods to select a subset of the runs in a dataset.\n", "These methods return a new {class}`Dataset ` object with a modified `index` containing only the selected subset.\n", "\n", "A subset of the dataset can be selected by the IDs in the index:" ] }, { "cell_type": "code", "execution_count": null, "id": "bfc90bab", "metadata": {}, "outputs": [], "source": [ "# Select run with id 3\n", "print(repr(dataset[3]))\n", "print('id =', dataset[3].id())\n", "dataset[3].index" ] }, { "cell_type": "code", "execution_count": null, "id": "6310deaf", "metadata": {}, "outputs": [], "source": [ "# Select a range of runs\n", "dataset[1:3].index" ] }, { "cell_type": "markdown", "id": "fb15aa93", "metadata": {}, "source": [ "It is also possible to {func}`filter ` results based on a column in the index:" ] }, { "cell_type": "code", "execution_count": null, "id": "141d594c", "metadata": {}, "outputs": [], "source": [ 
"dataset.filter(phi=30).index" ] }, { "cell_type": "markdown", "id": "ff0aed53", "metadata": {}, "source": [ "The `n`th row in the index can be obtained with {func}`row(n) `, independently of the run id:" ] }, { "cell_type": "code", "execution_count": null, "id": "b16b81d8", "metadata": {}, "outputs": [], "source": [ "display(dataset[3:].index)\n", "dataset[3:].row(0)" ] }, { "cell_type": "markdown", "id": "26136541", "metadata": {}, "source": [ "## Iterating over a dataset\n", "\n", "Iterating over a dataset will generate a dataset object for each row in the index:" ] }, { "cell_type": "code", "execution_count": null, "id": "e93fa41b", "metadata": {}, "outputs": [], "source": [ "for ds in dataset:\n", " print(repr(ds), ds.row(0)['outdir'])" ] }, { "cell_type": "markdown", "id": "3c2639d2", "metadata": {}, "source": [ "Iterate over `(id, dataset)` tuples with {func}`items() `:" ] }, { "cell_type": "code", "execution_count": null, "id": "7016ef17", "metadata": {}, "outputs": [], "source": [ "for i, ds in dataset.items():\n", " print(i, repr(ds))" ] }, { "cell_type": "markdown", "id": "8798e490", "metadata": {}, "source": [ "... or over `(row, dataset)` tuples with {func}`iterrows() `:" ] }, { "cell_type": "code", "execution_count": null, "id": "446775ca", "metadata": {}, "outputs": [], "source": [ "for row, ds in dataset.iterrows():\n", " print(row, repr(ds))" ] }, { "cell_type": "markdown", "id": "110985e3", "metadata": {}, "source": [ "... 
or iterate over a column in the index with {func}`groupby() `:" ] }, { "cell_type": "code", "execution_count": null, "id": "0cd6292f", "metadata": {}, "outputs": [], "source": [ "for phi, ds in dataset.groupby('phi'):\n", " print(phi, repr(ds))" ] }, { "cell_type": "markdown", "id": "7dfbd1ff", "metadata": {}, "source": [ "## Sweeping several parameters\n", "\n", "Sweeping several parameters produces more complex datasets.\n", "Below we create a dataset where two parameters are swept: `alpha` and `phi`.\n", "The purpose of the example below is to demonstrate some features of {class}`Dataset `, so we don't actually run any simulations or save any data." ] }, { "cell_type": "code", "execution_count": null, "id": "fc5744e9", "metadata": {}, "outputs": [], "source": [ "import itertools\n", "sweep = {\n", " 'alpha': [0.001, 0.002, 0.003],\n", " 'phi': [0, 10, 20, 30, 40],\n", "}\n", "sweep_values = list(itertools.product(sweep['alpha'], sweep['phi']))\n", "sweep_values = np.transpose(sweep_values)\n", "\n", "# Create the index table, with one entry per (alpha, phi) combination\n", "alphas = sweep_values[0]\n", "phis = sweep_values[1]\n", "outdirs = [f'run{i:02d}.npz' for i in range(len(phis))]\n", "index = pd.DataFrame({'alpha': alphas, 'phi': phis, 'outdir': outdirs})\n", "display(index)\n", "\n", "# Create the dataset object\n", "dataset = Dataset(index)" ] }, { "cell_type": "markdown", "id": "875f7dd0", "metadata": {}, "source": [ "In this case, {func}`filter() ` and {func}`groupby() ` now return subsets with several items:" ] }, { "cell_type": "code", "execution_count": null, "id": "ad0cce75", "metadata": {}, "outputs": [], "source": [ "dataset.filter(phi=30).index" ] }, { "cell_type": "code", "execution_count": null, "id": "53ee4423", "metadata": {}, "outputs": [], "source": [ "for alpha, ds in dataset.groupby('alpha'):\n", " print(alpha)\n", " print(ds.index)" ] }, { "cell_type": "markdown", "id": "537e4d61", "metadata": {}, "source": [ "It is also possible to {func}`filter() 
` and {func}`groupby() ` with multiple parameters:" ] }, { "cell_type": "code", "execution_count": null, "id": "bf95fccf", "metadata": {}, "outputs": [], "source": [ "dataset.filter(alpha=0.003, phi=30).index" ] }, { "cell_type": "code", "execution_count": null, "id": "41e39ff9", "metadata": {}, "outputs": [], "source": [ "for (alpha, phi), ds in dataset.groupby(['alpha', 'phi']):\n", " print(alpha, phi, ds.id(), repr(ds))" ] }, { "cell_type": "markdown", "id": "f774985c", "metadata": {}, "source": [ "## Further reading\n", "\n", "While this guide has introduced the core concepts of the dataset, the {class}`Dataset ` class contains a few extra bells and whistles.\n", "In addition, {mod}`flatspin.data` contains several useful functions for processing simulation results." ] } ], "metadata": { "jupytext": { "formats": "md:myst", "text_representation": { "extension": ".md", "format_name": "myst", "format_version": 0.13, "jupytext_version": "1.11.5" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "source_map": [ 15, 23, 53, 101, 108, 143, 147, 150, 172, 180, 186, 188, 195, 197, 203, 205, 209, 211, 217, 223, 227, 229, 239, 268, 275, 300, 304, 306, 310, 312, 322, 329, 332, 336, 338, 342, 345, 351, 354, 358, 361, 365, 368, 372, 375, 383, 401, 405, 409, 413, 417, 421, 424 ] }, "nbformat": 4, "nbformat_minor": 5 }