Skip to content

Colab   Kaggle   Binder

Batch Analysis


Setup

%reload_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd

from neuro_py.process import batch_analysis

Section 1: Define the analysis

Here, I'm defining the analysis in the notebook, but in a real project, you would define it in a separate .py file and import it here.

def toy_analysis(basepath, parameter_1=1, parameter_2=2):
    results = pd.DataFrame()
    results["basepath"] = [basepath]
    results["parameter_1"] = parameter_1
    results["parameter_2"] = parameter_2
    results["random_number"] = np.random.randint(0, 100)
    return results

For your project, you will have a .csv file with the basepaths you want to analyze. Here, I'm creating a DataFrame with the basepaths for the purpose of this notebook.

sessions = pd.DataFrame(
    dict(
        basepath=[
            "U:/data/hpc_ctx_project/HP01/day_1_20240227",
            "U:/data/hpc_ctx_project/HP01/day_2_20240228",
            "U:/data/hpc_ctx_project/HP01/day_3_20240229",
        ]
    )
)

You will need to define the path where you want to save the results of your analysis.

It's useful to nest the analysis version in a subfolder (toy_analysis\toy_analysis_v1) to keep track of the different versions of your analysis.

save_path = r"Z:\home\ryanh\projects\hpc_ctx\toy_analysis\toy_analysis_v1"

Section 2: Run the analysis

Finally, you can run your analysis in batch mode. This will loop through the basepaths and save the results in the specified folder.

The batch_analysis function is a general function that you can use for any analysis. You just need to pass the function you want to run, the basepaths you want to analyze, and the save path.

If your analysis fails, running again will start from where it left off.

There is a parallel option that you can set to True if you want to run the analysis in parallel. This will speed up the analysis if you have multiple cores.

batch_analysis.run(
    sessions,
    save_path,
    toy_analysis,
    parallel=False,
    verbose=True,
)
100%|██████████| 3/3 [00:00<00:00, 759.52it/s]

U:\data\hpc_ctx_project\HP01\day_1_20240227
U:\data\hpc_ctx_project\HP01\day_2_20240228
U:\data\hpc_ctx_project\HP01\day_3_20240229

Section 3: Load the results

There is a built in loader that concatenates the results of the analysis into a single DataFrame.

results = batch_analysis.load_results(save_path)
results
basepath paramater_1 paramater_2 random_number
0 U:\data\hpc_ctx_project\HP01\day_1_20240227 1 2 34
1 U:\data\hpc_ctx_project\HP01\day_2_20240228 1 2 30
2 U:\data\hpc_ctx_project\HP01\day_3_20240229 1 2 66

Bonus: More complicated results

Your results won't always fit nicely into a single DataFrame. Sometimes you will have multiple data types you need to save.

For example, you might have values for each cell in a DataFrame and also PSTHs for each cell. Your analysis will store both in a dictionary and you will construct a custom loader in your analysis.

Define the analysis

import glob
import os
import pickle


def toy_analysis_2(basepath, paramater_1=1, paramater_2=2):
    results_df = pd.DataFrame()
    results_df["basepath"] = [basepath]
    results_df["paramater_1"] = paramater_1
    results_df["paramater_2"] = paramater_2
    results_df["random_number"] = np.random.randint(0, 100)

    window_starttime, window_stoptime = [-1, 1]
    window_bins = int(np.ceil(((window_stoptime - window_starttime) * 1000)))
    time_lags = np.linspace(window_starttime, window_stoptime, window_bins)
    psths = pd.DataFrame(
        index=time_lags,
        columns=np.arange(1),
    )
    psths[:] = np.random.rand(window_bins, 1)

    results = {
        "results_df": results_df,
        "psth": psths,
    }
    return results


# custom loader
def load_results(save_path, verbose=False):
    # check if folder exists
    if not os.path.exists(save_path):
        raise ValueError(f"folder {save_path} does not exist")

    # get all the sessions
    sessions = glob.glob(save_path + os.sep + "*.pkl")

    results_df = []
    psths = []

    # iterate over the sessions
    for session in sessions:
        if verbose:
            print(session)

        # load the session
        with open(session, "rb") as f:
            results_ = pickle.load(f)

        if results_ is None:
            continue
        results_df.append(results_["results_df"])
        psths.append(results_["psth"])

    results_df = pd.concat(results_df, axis=0, ignore_index=True)
    psths = pd.concat(psths, axis=1, ignore_index=True)

    return results_df, psths

Run the analysis

save_path = r"Z:\home\ryanh\projects\hpc_ctx\toy_analysis\toy_analysis_v2"

batch_analysis.run(
    sessions,
    save_path,
    toy_analysis_2,
    parallel=False,
    verbose=True,
)
100%|██████████| 3/3 [00:00<00:00, 840.94it/s]

U:\data\hpc_ctx_project\HP01\day_1_20240227
U:\data\hpc_ctx_project\HP01\day_2_20240228
U:\data\hpc_ctx_project\HP01\day_3_20240229

Load the results

results_df, psths = load_results(save_path)

display(results_df)
display(psths)
basepath paramater_1 paramater_2 random_number
0 U:\data\hpc_ctx_project\HP01\day_1_20240227 1 2 56
1 U:\data\hpc_ctx_project\HP01\day_2_20240228 1 2 32
2 U:\data\hpc_ctx_project\HP01\day_3_20240229 1 2 56
0 1 2
-1.000000 0.190685 0.490553 0.248958
-0.998999 0.078999 0.689063 0.40577
-0.997999 0.094847 0.788747 0.966084
-0.996998 0.287616 0.804512 0.846309
-0.995998 0.723807 0.996373 0.850087
... ... ... ...
0.995998 0.023565 0.136486 0.120244
0.996998 0.298943 0.844828 0.227437
0.997999 0.514455 0.847778 0.782702
0.998999 0.975054 0.795339 0.898294
1.000000 0.122129 0.228904 0.168518

2000 rows × 3 columns


Section 4: HDF5 Format and Partial Loading

The batch analysis system now supports HDF5 format, which offers several advantages over pickle:

  • Better performance for large datasets
  • Selective loading of specific data components
  • Cross-platform compatibility
  • More efficient storage for numerical data

Run analysis with HDF5 format

# Use HDF5 format for better performance and selective loading
save_path_hdf5 = r"Z:\home\ryanh\projects\hpc_ctx\toy_analysis\toy_analysis_v3_hdf5"

batch_analysis.run(
    sessions,
    save_path_hdf5,
    toy_analysis_2,
    parallel=False,
    verbose=True,
    format_type="hdf5",  # Use HDF5 format
)
100%|██████████| 3/3 [00:00<00:00, 380.03it/s]

U:\data\hpc_ctx_project\HP01\day_1_20240227
U:\data\hpc_ctx_project\HP01\day_2_20240228
U:\data\hpc_ctx_project\HP01\day_3_20240229

Partial loading with load_specific_data()

# Get a specific file path
session_file = batch_analysis.encode_file_path(
    sessions.iloc[0]["basepath"], save_path_hdf5, format_type="hdf5"
)

print(f"Loading from: {session_file}")

# Load only the results DataFrame
results_only = batch_analysis.load_specific_data(session_file, key="results_df")
print("Results DataFrame only:")
display(results_only)

# Load only the PSTH data
psth_only = batch_analysis.load_specific_data(session_file, key="psth")
print("\nPSTH data only:")
display(psth_only.head())

# Load everything (equivalent to not specifying a key)
all_data = batch_analysis.load_specific_data(session_file)
print(f"\nAll data keys: {list(all_data.keys())}")
Loading from: Z:\home\ryanh\projects\hpc_ctx\toy_analysis\toy_analysis_v3_hdf5\U---___data___hpc_ctx_project___HP01___day_1_20240227.h5
Results DataFrame only:
basepath paramater_1 paramater_2 random_number
0 U:\data\hpc_ctx_project\HP01\day_1_20240227 1 2 42
PSTH data only:
0
-1.000000 0.09495039896565927
-0.998999 0.025459594964744592
-0.997999 0.7897323765370252
-0.996998 0.3043882313446068
-0.995998 0.08990904706906877
All data keys: ['psth', 'results_df']

When to use HDF5 vs Pickle

Use HDF5 when:

  • Working with large datasets (>100MB per file)
  • You need to load only specific components
  • Cross-platform compatibility is important
  • You have mostly numerical data (pandas DataFrames, numpy arrays)

Use Pickle when:

  • Working with small datasets
  • You have complex Python objects that don't translate well to HDF5
  • You always need to load the complete dataset
  • Simplicity is preferred

This new functionality maintains backward compatibility while providing more efficient options for large-scale analyses.