Batch Analysis¶

Setup¶

In [1]:

  Copied!     
 
%reload_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd

from neuro_py.process import batch_analysis
%reload_ext autoreload %autoreload 2 import numpy as np import pandas as pd from neuro_py.process import batch_analysis

Section 1: Define the analysis¶

Here, I'm defining the analysis in the notebook, but in a real project, you would define it in a separate .py file and import it here.

In [2]:

  Copied!     
 
def toy_analysis(basepath, parameter_1=1, parameter_2=2):
    results = pd.DataFrame()
    results["basepath"] = [basepath]
    results["parameter_1"] = parameter_1
    results["parameter_2"] = parameter_2
    results["random_number"] = np.random.randint(0, 100)
    return results
def toy_analysis(basepath, parameter_1=1, parameter_2=2): results = pd.DataFrame() results["basepath"] = [basepath] results["parameter_1"] = parameter_1 results["parameter_2"] = parameter_2 results["random_number"] = np.random.randint(0, 100) return results

For your project, you will have a .csv file with the basepaths you want to analyze. Here, I'm creating a DataFrame with the basepaths for the purpose of this notebook.

In [3]:

  Copied!     
 
sessions = pd.DataFrame(
    dict(
        basepath=[
            r"U:\data\hpc_ctx_project\HP01\day_1_20240227",
            r"U:\data\hpc_ctx_project\HP01\day_2_20240228",
            r"U:\data\hpc_ctx_project\HP01\day_3_20240229",
        ]
    )
)
sessions = pd.DataFrame( dict( basepath=[ r"U:\data\hpc_ctx_project\HP01\day_1_20240227", r"U:\data\hpc_ctx_project\HP01\day_2_20240228", r"U:\data\hpc_ctx_project\HP01\day_3_20240229", ] ) )

You will need to define the path where you want to save the results of your analysis.

It's useful to nest the analysis version in a subfolder (toy_analysis\toy_analysis_v1) to keep track of the different versions of your analysis.

In [4]:

  Copied!     
 
save_path = r"Z:\home\ryanh\projects\hpc_ctx\toy_analysis\toy_analysis_v1"
save_path = r"Z:\home\ryanh\projects\hpc_ctx\toy_analysis\toy_analysis_v1"

Section 2: Run the analysis¶

Finally, you can run your analysis in batch mode. This will loop through the basepaths and save the results in the specified folder.

The batch_analysis function is a general function that you can use for any analysis. You just need to pass the function you want to run, the basepaths you want to analyze, and the save path.

If your analysis fails, running again will start from where it left off.

There is a parallel option that you can set to True if you want to run the analysis in parallel. This will speed up the analysis if you have multiple cores.

In [5]:

  Copied!     
 
batch_analysis.run(
    sessions,
    save_path,
    toy_analysis,
    parallel=False,
    verbose=True,
)
batch_analysis.run( sessions, save_path, toy_analysis, parallel=False, verbose=True, )

100%|██████████| 3/3 [00:00<00:00, 759.52it/s]

U:\data\hpc_ctx_project\HP01\day_1_20240227
U:\data\hpc_ctx_project\HP01\day_2_20240228
U:\data\hpc_ctx_project\HP01\day_3_20240229

Section 3: Load the results¶

There is a built in loader that concatenates the results of the analysis into a single DataFrame.

In [6]:

  Copied!     
 
results = batch_analysis.load_results(save_path)
results
results = batch_analysis.load_results(save_path) results

Out[6]:

	basepath	paramater_1	paramater_2	random_number
0	U:\data\hpc_ctx_project\HP01\day_1_20240227	1	2	34
1	U:\data\hpc_ctx_project\HP01\day_2_20240228	1	2	30
2	U:\data\hpc_ctx_project\HP01\day_3_20240229	1	2	66

Bonus: More complicated results¶

Your results won't always fit nicely into a single DataFrame. Sometimes you will have multiple data types you need to save.

For example, you might have values for each cell in a DataFrame and also PSTHs for each cell. Your analysis will store both in a dictionary and you will construct a custom loader in your analysis.

Define the analysis¶

In [7]:

  Copied!     
 
import glob
import os
import pickle


def toy_analysis_2(basepath, paramater_1=1, paramater_2=2):
    results_df = pd.DataFrame()
    results_df["basepath"] = [basepath]
    results_df["paramater_1"] = paramater_1
    results_df["paramater_2"] = paramater_2
    results_df["random_number"] = np.random.randint(0, 100)

    window_starttime, window_stoptime = [-1, 1]
    window_bins = int(np.ceil(((window_stoptime - window_starttime) * 1000)))
    time_lags = np.linspace(window_starttime, window_stoptime, window_bins)
    psths = pd.DataFrame(
        index=time_lags,
        columns=np.arange(1),
    )
    psths[:] = np.random.rand(window_bins, 1)

    results = {
        "results_df": results_df,
        "psth": psths,
    }
    return results


# custom loader
def load_results(save_path, verbose=False):
    # check if folder exists
    if not os.path.exists(save_path):
        raise ValueError(f"folder {save_path} does not exist")

    # get all the sessions
    sessions = glob.glob(save_path + os.sep + "*.pkl")

    results_df = []
    psths = []

    # iterate over the sessions
    for session in sessions:
        if verbose:
            print(session)

        # load the session
        with open(session, "rb") as f:
            results_ = pickle.load(f)

        if results_ is None:
            continue
        results_df.append(results_["results_df"])
        psths.append(results_["psth"])

    results_df = pd.concat(results_df, axis=0, ignore_index=True)
    psths = pd.concat(psths, axis=1, ignore_index=True)

    return results_df, psths
import glob import os import pickle def toy_analysis_2(basepath, paramater_1=1, paramater_2=2): results_df = pd.DataFrame() results_df["basepath"] = [basepath] results_df["paramater_1"] = paramater_1 results_df["paramater_2"] = paramater_2 results_df["random_number"] = np.random.randint(0, 100) window_starttime, window_stoptime = [-1, 1] window_bins = int(np.ceil(((window_stoptime - window_starttime) * 1000))) time_lags = np.linspace(window_starttime, window_stoptime, window_bins) psths = pd.DataFrame( index=time_lags, columns=np.arange(1), ) psths[:] = np.random.rand(window_bins, 1) results = { "results_df": results_df, "psth": psths, } return results # custom loader def load_results(save_path, verbose=False): # check if folder exists if not os.path.exists(save_path): raise ValueError(f"folder {save_path} does not exist") # get all the sessions sessions = glob.glob(save_path + os.sep + "*.pkl") results_df = [] psths = [] # iterate over the sessions for session in sessions: if verbose: print(session) # load the session with open(session, "rb") as f: results_ = pickle.load(f) if results_ is None: continue results_df.append(results_["results_df"]) psths.append(results_["psth"]) results_df = pd.concat(results_df, axis=0, ignore_index=True) psths = pd.concat(psths, axis=1, ignore_index=True) return results_df, psths

Run the analysis¶

In [8]:

  Copied!     
 
save_path = r"Z:\home\ryanh\projects\hpc_ctx\toy_analysis\toy_analysis_v2"

batch_analysis.run(
    sessions,
    save_path,
    toy_analysis_2,
    parallel=False,
    verbose=True,
)
save_path = r"Z:\home\ryanh\projects\hpc_ctx\toy_analysis\toy_analysis_v2" batch_analysis.run( sessions, save_path, toy_analysis_2, parallel=False, verbose=True, )

100%|██████████| 3/3 [00:00<00:00, 840.94it/s]

U:\data\hpc_ctx_project\HP01\day_1_20240227
U:\data\hpc_ctx_project\HP01\day_2_20240228
U:\data\hpc_ctx_project\HP01\day_3_20240229

Load the results¶

In [9]:

  Copied!     
 
results_df, psths = load_results(save_path)

display(results_df)
display(psths)
results_df, psths = load_results(save_path) display(results_df) display(psths)

	basepath	paramater_1	paramater_2	random_number
0	U:\data\hpc_ctx_project\HP01\day_1_20240227	1	2	56
1	U:\data\hpc_ctx_project\HP01\day_2_20240228	1	2	32
2	U:\data\hpc_ctx_project\HP01\day_3_20240229	1	2	56

	0	1	2
-1.000000	0.190685	0.490553	0.248958
-0.998999	0.078999	0.689063	0.40577
-0.997999	0.094847	0.788747	0.966084
-0.996998	0.287616	0.804512	0.846309
-0.995998	0.723807	0.996373	0.850087
...	...	...	...
0.995998	0.023565	0.136486	0.120244
0.996998	0.298943	0.844828	0.227437
0.997999	0.514455	0.847778	0.782702
0.998999	0.975054	0.795339	0.898294
1.000000	0.122129	0.228904	0.168518

2000 rows × 3 columns

Section 4: HDF5 Format and Partial Loading¶

The batch analysis system now supports HDF5 format, which offers several advantages over pickle:

Better performance for large datasets
Selective loading of specific data components
Cross-platform compatibility
More efficient storage for numerical data

Run analysis with HDF5 format¶

In [10]:

  Copied!     
 
# Use HDF5 format for better performance and selective loading
save_path_hdf5 = r"Z:\home\ryanh\projects\hpc_ctx\toy_analysis\toy_analysis_v3_hdf5"

batch_analysis.run(
    sessions,
    save_path_hdf5,
    toy_analysis_2,
    parallel=False,
    verbose=True,
    format_type="hdf5",  # Use HDF5 format
)
# Use HDF5 format for better performance and selective loading save_path_hdf5 = r"Z:\home\ryanh\projects\hpc_ctx\toy_analysis\toy_analysis_v3_hdf5" batch_analysis.run( sessions, save_path_hdf5, toy_analysis_2, parallel=False, verbose=True, format_type="hdf5", # Use HDF5 format )

100%|██████████| 3/3 [00:00<00:00, 380.03it/s]

U:\data\hpc_ctx_project\HP01\day_1_20240227
U:\data\hpc_ctx_project\HP01\day_2_20240228
U:\data\hpc_ctx_project\HP01\day_3_20240229

Partial loading with load_specific_data()¶

In [11]:

  Copied!     
 
# Get a specific file path
session_file = batch_analysis.encode_file_path(
    sessions.iloc[0]["basepath"], save_path_hdf5, format_type="hdf5"
)

print(f"Loading from: {session_file}")

# Load only the results DataFrame
results_only = batch_analysis.load_specific_data(session_file, key="results_df")
print("Results DataFrame only:")
display(results_only)

# Load only the PSTH data
psth_only = batch_analysis.load_specific_data(session_file, key="psth")
print("\nPSTH data only:")
display(psth_only.head())

# Load everything (equivalent to not specifying a key)
all_data = batch_analysis.load_specific_data(session_file)
print(f"\nAll data keys: {list(all_data.keys())}")
# Get a specific file path session_file = batch_analysis.encode_file_path( sessions.iloc[0]["basepath"], save_path_hdf5, format_type="hdf5" ) print(f"Loading from: {session_file}") # Load only the results DataFrame results_only = batch_analysis.load_specific_data(session_file, key="results_df") print("Results DataFrame only:") display(results_only) # Load only the PSTH data psth_only = batch_analysis.load_specific_data(session_file, key="psth") print("\nPSTH data only:") display(psth_only.head()) # Load everything (equivalent to not specifying a key) all_data = batch_analysis.load_specific_data(session_file) print(f"\nAll data keys: {list(all_data.keys())}")

Loading from: Z:\home\ryanh\projects\hpc_ctx\toy_analysis\toy_analysis_v3_hdf5\U---___data___hpc_ctx_project___HP01___day_1_20240227.h5
Results DataFrame only:

	basepath	paramater_1	paramater_2	random_number
0	U:\data\hpc_ctx_project\HP01\day_1_20240227	1	2	42

PSTH data only:

	0
-1.000000	0.09495039896565927
-0.998999	0.025459594964744592
-0.997999	0.7897323765370252
-0.996998	0.3043882313446068
-0.995998	0.08990904706906877

All data keys: ['psth', 'results_df']

When to use HDF5 vs Pickle¶

Use HDF5 when:¶

Working with large datasets (>100MB per file)
You need to load only specific components
Cross-platform compatibility is important
You have mostly numerical data (pandas DataFrames, numpy arrays)

Use Pickle when:¶

Working with small datasets
You have complex Python objects that don't translate well to HDF5
You always need to load the complete dataset
Simplicity is preferred

This new functionality maintains backward compatibility while providing more efficient options for large-scale analyses.