bursty_dynamics.trains
This module contains functions for detecting trains, calculating the burstiness parameter (BP) and memory coefficient (MC) of each train, and summarising train-level information.
- bursty_dynamics.trains.train_detection(df, subject_id, time_col, max_iet, time_unit='days', min_burst=3, only_trains=True)
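Detects trains in the provided DataFrame and assigns a train ID to each event. A train is a run of events for one subject in which consecutive events are no more than max_iet apart and which contains at least min_burst events.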
Parameters
- df (DataFrame)
The DataFrame containing the data.
- subject_id (str)
The column name for subject IDs.
- time_col (str)
The column name for the datetime values.
- max_iet (int)
Maximum inter-event time (gap between consecutive events) allowed within a train, in the units specified by time_unit.
- time_unit (str, optional)
Unit of time for the intervals: 'seconds', 'minutes', 'hours', 'days', 'weeks', 'months', or 'years'. Default is 'days'.
- min_burst (int, optional)
Minimum number of events required to form a train. Default is 3.
- only_trains (bool, optional)
Whether to return only the events that form trains. Default is True.
Returns
- DataFrame
The DataFrame with an added train_id column indicating the train each event belongs to.
Examples
>>> import pandas as pd
>>> from bursty_dynamics.trains import train_detection
>>> data = {
...     'subject_id': [1, 1, 1, 1, 2, 2],
...     'event_time': ['2023-01-01', '2023-01-02', '2023-01-10', '2023-01-20', '2023-01-01', '2023-01-03']
... }
>>> df = pd.DataFrame(data)
>>> train_df = train_detection(df, 'subject_id', 'event_time', max_iet=30, time_unit='days', min_burst=2)
>>> train_df
   subject_id  event_time  train_id
0           1  2023-01-01         1
1           1  2023-01-02         1
2           1  2023-01-10         1
3           1  2023-01-20         1
4           2  2023-01-01         1
5           2  2023-01-03         1
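The role of max_iet and min_burst can be illustrated with a minimal plain-Python sketch for a single subject. It is not the package's implementation: the function name detect_trains, the day-based gap, and the use of train ID 0 for events outside any train are assumptions made for illustration only.

```python
from datetime import date, timedelta


def detect_trains(event_dates, max_iet_days, min_burst=3):
    """Label the event dates of ONE subject with train IDs (illustrative sketch).

    Returns (date, train_id) pairs sorted by date. Events that do not belong
    to a train get train_id 0, which is an assumed convention.
    """
    dates = sorted(event_dates)

    # Split the sorted dates into runs wherever the gap exceeds max_iet_days.
    runs = []
    for d in dates:
        if runs and d - runs[-1][-1] <= timedelta(days=max_iet_days):
            runs[-1].append(d)
        else:
            runs.append([d])

    # Number only the runs that contain at least min_burst events.
    labelled = []
    next_id = 1
    for run in runs:
        if len(run) >= min_burst:
            labelled.extend((d, next_id) for d in run)
            next_id += 1
        else:
            labelled.extend((d, 0) for d in run)
    return labelled


# Subject 1 from the example above.
subject_1 = [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 10), date(2023, 1, 20)]

# With a 30-day limit every gap (1, 8 and 10 days) is allowed: one train.
result = detect_trains(subject_1, max_iet_days=30, min_burst=2)

# With a 5-day limit only the first two events remain in a train.
tighter = detect_trains(subject_1, max_iet_days=5, min_burst=2)
```

In this sketch, dropping the pairs whose train ID is 0 would correspond to the behaviour described for only_trains=True.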
- bursty_dynamics.trains.train_info(train_df, subject_id, time_col, summary_statistic=False)
Calculate summary statistics for train data. This function processes event data grouped by subject and train ID, calculating key metrics such as event counts, total terms, train start and end times, train durations, and total trains per subject. Optionally, it prints descriptive statistics about the dataset.
Parameters
- train_df (pd.DataFrame)
DataFrame containing the train data, including subject IDs, train IDs, and event timestamps.
- subject_id (str)
Name of the column containing subject IDs.
- time_col (str)
Name of the column containing timestamps (e.g., 'event_time').
- summary_statistic (bool, optional)
If True, prints summary statistics of train durations and event counts. Default is False.
Returns
- pd.DataFrame
A DataFrame containing aggregated train-level information with the following columns:
subject_id (str): Unique identifier for each subject.
train_id (int): Identifier for each train sequence (group of events).
unique_event_counts (int): Number of distinct event dates after removing duplicate events on the same day.
total_term_counts (int): Total number of events including duplicates (e.g., multiple events on the same date).
train_start (datetime): Earliest event timestamp for the train.
train_end (datetime): Latest event timestamp for the train.
train_duration_yrs (float): Duration of the train in years, rounded to two decimal places.
total_trains (int): Total number of non-zero trains for each subject.
Examples
>>> train_info(train_df, subject_id='subject_id', time_col='event_time')
   subject_id  train_id  unique_event_counts  total_term_counts  train_start   train_end  train_duration_yrs  total_trains
0           1         1                    4                  4   2023-01-01  2023-01-20                0.05             1
1           2         1                    2                  2   2023-01-01  2023-01-03                0.01             1
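The columns listed above can be reproduced by hand with a short plain-Python sketch. This is an illustration under stated assumptions, not the package's code: the helper name summarise_trains, the 365.25-day year, and the treatment of train ID 0 as "not in a train" are all hypothetical choices.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows (subject_id, train_id, event_date) mirroring the example above.
rows = [
    (1, 1, date(2023, 1, 1)), (1, 1, date(2023, 1, 2)),
    (1, 1, date(2023, 1, 10)), (1, 1, date(2023, 1, 20)),
    (2, 1, date(2023, 1, 1)), (2, 1, date(2023, 1, 3)),
]


def summarise_trains(rows, days_per_year=365.25):
    """Aggregate events into one summary record per (subject, train)."""
    grouped = defaultdict(list)
    for subject, train, day in rows:
        if train != 0:  # assumed: 0 marks events outside any train
            grouped[(subject, train)].append(day)

    # Count how many trains each subject has.
    trains_per_subject = defaultdict(int)
    for subject, _train in grouped:
        trains_per_subject[subject] += 1

    summary = []
    for (subject, train), days in sorted(grouped.items()):
        start, end = min(days), max(days)
        summary.append({
            "subject_id": subject,
            "train_id": train,
            "unique_event_counts": len(set(days)),  # same-day duplicates removed
            "total_term_counts": len(days),         # duplicates included
            "train_start": start,
            "train_end": end,
            "train_duration_yrs": round((end - start).days / days_per_year, 2),
            "total_trains": trains_per_subject[subject],
        })
    return summary


summary = summarise_trains(rows)
```

For the two trains in the example, this sketch yields durations of 0.05 and 0.01 years (19 and 2 days), in line with the output shown above.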
- bursty_dynamics.trains.train_scores(train_df, subject_id, time_col, min_event_n=None, scatter=False, hist=False)
Calculate Burstiness Parameter (BP) and Memory Coefficient (MC) for each train_id per subject_id.
Parameters
- train_df (pd.DataFrame)
Input DataFrame containing subject IDs, train IDs, and event timestamps.
- subject_id (str)
Name of the column containing subject IDs.
- time_col (str)
Name of the column containing the date.
- min_event_n (int, optional)
Minimum number of unique events (duplicates at the same time counted once) required in a train for it to be included in the output. If None (default), no filtering is applied.
- scatter (bool, optional)
Whether to draw a scatter plot. Default is False.
- hist (bool or str, optional)
Type of histogram to plot. Options:
True: Plot histograms for both BP and MC.
"BP": Plot histogram for BP only.
"MC": Plot histogram for MC only.
"Both": Plot histograms for both BP and MC on the same plot.
False: Do not plot any histograms (default).
Returns
- tuple or DataFrame
If both scatter and hist are True: returns (merged_df, scatter_plot, hist_plot).
If only scatter is True: returns (merged_df, scatter_plot).
If only hist is True: returns (merged_df, hist_plot).
If neither scatter nor hist is True: returns merged_df.
Notes
- merged_df (DataFrame)
The input DataFrame with burstiness parameter (BP) and memory coefficient (MC) for each train_id per subject_id.
- scatter_plot (matplotlib.figure.Figure or None)
The figure object containing the scatter plot (if scatter=True).
- hist_plot (matplotlib.figure.Figure or None)
The figure object containing the histogram(s) (if hist=True).
Multiple events occurring at the same time are aggregated into a single event when calculating the BP and MC.
Examples
>>> train_scores(train_df, subject_id='subject_id', time_col='event_time', min_event_n=3)
   subject_id  train_id        BP   MC
0           1         1  -0.19709  1.0
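This page does not state which formulas are used for BP and MC. The sketch below assumes the finite-size-corrected burstiness parameter of Kim and Jo, with n taken as the number of events and a population standard deviation, and a memory coefficient equal to the Pearson correlation between consecutive inter-event times. The function name burstiness_and_memory is hypothetical. Under these assumptions the sketch gives the values shown for subject 1 in the example above, which is suggestive but does not confirm the package's implementation.

```python
import math


def burstiness_and_memory(event_times):
    """Return (BP, MC) for one train (illustrative sketch).

    event_times: sorted, de-duplicated event times as numbers (e.g. day offsets).
    Needs at least three events and non-constant inter-event times.
    """
    n = len(event_times)
    iets = [b - a for a, b in zip(event_times, event_times[1:])]

    # Coefficient of variation of the inter-event times (population std).
    mean = sum(iets) / len(iets)
    std = math.sqrt(sum((x - mean) ** 2 for x in iets) / len(iets))
    r = std / mean

    # Assumed formula: finite-size-corrected burstiness (Kim and Jo).
    bp = (math.sqrt(n + 1) * r - math.sqrt(n - 1)) / (
        (math.sqrt(n + 1) - 2) * r + math.sqrt(n - 1)
    )

    # Assumed formula: Pearson correlation between each inter-event time
    # and the one that follows it.
    first, second = iets[:-1], iets[1:]
    m1 = sum(first) / len(first)
    m2 = sum(second) / len(second)
    cov = sum((a - m1) * (b - m2) for a, b in zip(first, second))
    s1 = math.sqrt(sum((a - m1) ** 2 for a in first))
    s2 = math.sqrt(sum((b - m2) ** 2 for b in second))
    mc = cov / (s1 * s2)
    return bp, mc


# Subject 1 from the example: 1, 2, 10 and 20 January as day offsets.
bp, mc = burstiness_and_memory([0, 1, 9, 19])
```

With only three inter-event times there are just two consecutive pairs, so the correlation is necessarily +1 or -1; this is why short trains produce extreme MC values and why a min_event_n filter can be useful.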