Batch Analysis¶
Setup¶
%reload_ext autoreload
%autoreload 2
from neuro_py.process import batch_analysis
import pandas as pd
import numpy as np
Section 1: Define the analysis¶
Here, I'm defining the analysis in the notebook, but in a real project, you would define it in a separate .py file and import it here.
def toy_analysis(basepath, parameter_1=1, parameter_2=2):
    results = pd.DataFrame()
    results["basepath"] = [basepath]
    results["parameter_1"] = parameter_1
    results["parameter_2"] = parameter_2
    results["random_number"] = np.random.randint(0, 100)
    return results
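In a real project, the pattern would look something like this (hypothetical file and package names):

# my_project/analyses.py
def toy_analysis(basepath, parameter_1=1, parameter_2=2):
    ...  # same body as above

# in the notebook
from my_project.analyses import toy_analysis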
For your project, you will have a .csv file with the basepaths you want to analyze. Here, I'm creating a DataFrame with the basepaths for the purpose of this notebook.
sessions = pd.DataFrame(dict(basepath=[
    r"U:\data\hpc_ctx_project\HP01\day_1_20240227",
    r"U:\data\hpc_ctx_project\HP01\day_2_20240228",
    r"U:\data\hpc_ctx_project\HP01\day_3_20240229",
]))
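In a real project, you would instead read that table from disk, e.g. (hypothetical filename, assuming it has a basepath column):

sessions = pd.read_csv(r"Z:\home\ryanh\projects\hpc_ctx\sessions.csv")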
You will need to define the path where you want to save the results of your analysis. It's useful to nest the analysis version in a subfolder (toy_analysis\toy_analysis_v1) to keep track of the different versions of your analysis.
save_path = r"Z:\home\ryanh\projects\hpc_ctx\toy_analysis\toy_analysis_v1"
Section 2: Run the analysis¶
Finally, you can run your analysis in batch mode. This will loop through the basepaths and save the results in the specified folder.

The batch_analysis.run function is general: you pass it the analysis function you want to run, the basepaths you want to analyze, and the save path.

There is a parallel option that you can set to True if you want to run the analysis in parallel. This will speed up the analysis if you have multiple cores.

If your analysis fails partway through, running it again will start from where it left off.
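Conceptually, that resume behavior amounts to skipping any session whose results file already exists. Here is a minimal sketch of the idea, not the library's actual implementation (the filename encoding is illustrative; the real one is handled by batch_analysis.encode_file_path, shown in Section 4):

import os
import pickle

def run_sketch(sessions, save_path, analysis_func, **kwargs):
    # illustrative only: one results file per basepath; skip finished ones
    for basepath in sessions.basepath:
        # hypothetical filename encoding, e.g. "U:\data\..." -> "U---___data___..."
        fname = basepath.replace(":", "---").replace(os.sep, "___") + ".pkl"
        save_file = os.path.join(save_path, fname)
        if os.path.exists(save_file):
            continue  # finished on a previous run, so skip it
        results = analysis_func(basepath, **kwargs)
        with open(save_file, "wb") as f:
            pickle.dump(results, f)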
batch_analysis.run(
    sessions,
    save_path,
    toy_analysis,
    parallel=False,
    verbose=True,
)
100%|██████████| 3/3 [00:00<00:00, 759.52it/s]
U:\data\hpc_ctx_project\HP01\day_1_20240227
U:\data\hpc_ctx_project\HP01\day_2_20240228
U:\data\hpc_ctx_project\HP01\day_3_20240229
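To run the analysis in parallel instead, flip the parallel flag. One caveat, assuming worker processes are used under the hood: the analysis function should live in an importable .py module rather than be defined in the notebook, so the workers can find it:

batch_analysis.run(
    sessions,
    save_path,
    toy_analysis,
    parallel=True,   # distribute sessions across cores
    verbose=False,   # interleaved prints are hard to read in parallel
)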
Section 3: Load the results¶
There is a built-in loader that concatenates the results of the analysis into a single DataFrame.
results = batch_analysis.load_results(save_path)
results
| | basepath | parameter_1 | parameter_2 | random_number |
| --- | --- | --- | --- | --- |
| 0 | U:\data\hpc_ctx_project\HP01\day_1_20240227 | 1 | 2 | 34 |
| 1 | U:\data\hpc_ctx_project\HP01\day_2_20240228 | 1 | 2 | 30 |
| 2 | U:\data\hpc_ctx_project\HP01\day_3_20240229 | 1 | 2 | 66 |
Bonus: More complicated results¶
Your results won't always fit nicely into a single DataFrame. Sometimes you will have multiple data types to save. For example, you might have per-cell values in a DataFrame and also a PSTH for each cell. Your analysis can store both in a dictionary, and you write a custom loader to put them back together.
Define the analysis¶
import glob
import os
import pickle
def toy_analysis_2(basepath, parameter_1=1, parameter_2=2):
    # single-row DataFrame of scalar results for this session
    results_df = pd.DataFrame()
    results_df["basepath"] = [basepath]
    results_df["parameter_1"] = parameter_1
    results_df["parameter_2"] = parameter_2
    results_df["random_number"] = np.random.randint(0, 100)

    # toy PSTH: ~1 ms bins spanning a -1 to 1 s window
    window_starttime, window_stoptime = [-1, 1]
    window_bins = int(np.ceil(((window_stoptime - window_starttime) * 1000)))
    time_lags = np.linspace(window_starttime, window_stoptime, window_bins)
    psths = pd.DataFrame(
        index=time_lags,
        columns=np.arange(1),
    )
    psths[:] = np.random.rand(window_bins, 1)

    # pack both result types into a dictionary
    results = {
        "results_df": results_df,
        "psth": psths,
    }
    return results
# custom loader
def load_results(save_path, verbose=False):
    # check if folder exists
    if not os.path.exists(save_path):
        raise ValueError(f"folder {save_path} does not exist")

    # get all the saved session files
    sessions = glob.glob(save_path + os.sep + "*.pkl")

    results_df = []
    psths = []
    # iterate over the sessions
    for session in sessions:
        if verbose:
            print(session)

        # load the session
        with open(session, "rb") as f:
            results_ = pickle.load(f)

        if results_ is None:
            continue

        results_df.append(results_["results_df"])
        psths.append(results_["psth"])

    results_df = pd.concat(results_df, axis=0, ignore_index=True)
    psths = pd.concat(psths, axis=1, ignore_index=True)

    return results_df, psths
Run the analysis¶
save_path = r"Z:\home\ryanh\projects\hpc_ctx\toy_analysis\toy_analysis_v2"
batch_analysis.run(
    sessions,
    save_path,
    toy_analysis_2,
    parallel=False,
    verbose=True,
)
100%|██████████| 3/3 [00:00<00:00, 840.94it/s]
U:\data\hpc_ctx_project\HP01\day_1_20240227
U:\data\hpc_ctx_project\HP01\day_2_20240228
U:\data\hpc_ctx_project\HP01\day_3_20240229
Load the results¶
results_df, psths = load_results(save_path)
display(results_df)
display(psths)
| | basepath | parameter_1 | parameter_2 | random_number |
| --- | --- | --- | --- | --- |
| 0 | U:\data\hpc_ctx_project\HP01\day_1_20240227 | 1 | 2 | 56 |
| 1 | U:\data\hpc_ctx_project\HP01\day_2_20240228 | 1 | 2 | 32 |
| 2 | U:\data\hpc_ctx_project\HP01\day_3_20240229 | 1 | 2 | 56 |
| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| -1.000000 | 0.190685 | 0.490553 | 0.248958 |
| -0.998999 | 0.078999 | 0.689063 | 0.405770 |
| -0.997999 | 0.094847 | 0.788747 | 0.966084 |
| -0.996998 | 0.287616 | 0.804512 | 0.846309 |
| -0.995998 | 0.723807 | 0.996373 | 0.850087 |
| ... | ... | ... | ... |
| 0.995998 | 0.023565 | 0.136486 | 0.120244 |
| 0.996998 | 0.298943 | 0.844828 | 0.227437 |
| 0.997999 | 0.514455 | 0.847778 | 0.782702 |
| 0.998999 | 0.975054 | 0.795339 | 0.898294 |
| 1.000000 | 0.122129 | 0.228904 | 0.168518 |

2000 rows × 3 columns
Section 4: HDF5 Format and Partial Loading¶
The batch analysis system now supports the HDF5 format, which offers several advantages over pickle:
- Better performance for large datasets
- Selective loading of specific data components
- Cross-platform compatibility
- More efficient storage for numerical data
Run analysis with HDF5 format¶
# Use HDF5 format for better performance and selective loading
save_path_hdf5 = r"Z:\home\ryanh\projects\hpc_ctx\toy_analysis\toy_analysis_v3_hdf5"

batch_analysis.run(
    sessions,
    save_path_hdf5,
    toy_analysis_2,
    parallel=False,
    verbose=True,
    format_type="hdf5",  # use HDF5 instead of the default pickle
)
100%|██████████| 3/3 [00:00<00:00, 380.03it/s]
U:\data\hpc_ctx_project\HP01\day_1_20240227
U:\data\hpc_ctx_project\HP01\day_2_20240228
U:\data\hpc_ctx_project\HP01\day_3_20240229
Partial loading with load_specific_data()¶
# Get the saved file path for a specific session
session_file = batch_analysis.encode_file_path(
    sessions.iloc[0]["basepath"],
    save_path_hdf5,
    format_type="hdf5",
)
print(f"Loading from: {session_file}")

# Load only the results DataFrame
results_only = batch_analysis.load_specific_data(session_file, key="results_df")
print("Results DataFrame only:")
display(results_only)

# Load only the PSTH data
psth_only = batch_analysis.load_specific_data(session_file, key="psth")
print("\nPSTH data only:")
display(psth_only.head())

# Load everything (equivalent to not specifying a key)
all_data = batch_analysis.load_specific_data(session_file)
print(f"\nAll data keys: {list(all_data.keys())}")
Loading from: Z:\home\ryanh\projects\hpc_ctx\toy_analysis\toy_analysis_v3_hdf5\U---___data___hpc_ctx_project___HP01___day_1_20240227.h5
Results DataFrame only:
| | basepath | parameter_1 | parameter_2 | random_number |
| --- | --- | --- | --- | --- |
| 0 | U:\data\hpc_ctx_project\HP01\day_1_20240227 | 1 | 2 | 42 |
PSTH data only:
| | 0 |
| --- | --- |
| -1.000000 | 0.09495039896565927 |
| -0.998999 | 0.025459594964744592 |
| -0.997999 | 0.7897323765370252 |
| -0.996998 | 0.3043882313446068 |
| -0.995998 | 0.08990904706906877 |
All data keys: ['psth', 'results_df']
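Because each session gets its own file, you can combine a glob with load_specific_data to pull a single component across all sessions without touching the rest; a minimal sketch using only the calls shown above:

import glob
import os

# concatenate just the results DataFrames, skipping the PSTHs entirely
files = glob.glob(os.path.join(save_path_hdf5, "*.h5"))
results_df = pd.concat(
    [batch_analysis.load_specific_data(f, key="results_df") for f in files],
    ignore_index=True,
)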
When to use HDF5 vs Pickle¶
Use HDF5 when:¶
- Working with large datasets (>100MB per file)
- You need to load only specific components
- Cross-platform compatibility is important
- You have mostly numerical data (pandas DataFrames, numpy arrays)
Use Pickle when:¶
- Working with small datasets
- You have complex Python objects that don't translate well to HDF5
- You always need to load the complete dataset
- Simplicity is preferred
This new functionality maintains backward compatibility while providing more efficient options for large-scale analyses.