bursty_dynamics.trains

This module contains functions for detecting trains, calculating the burstiness parameter (BP) and memory coefficient (MC) of each train, and summarising information about the trains.

bursty_dynamics.trains.train_detection(df, subject_id, time_col, max_iet, time_unit='days', min_burst=3, only_trains=True)

Detects and assigns train IDs to events in the provided DataFrame based on the specified parameters.

Parameters

df (DataFrame): The DataFrame containing the data.
subject_id (str): The column name for subject IDs.
time_col (str): The column name for the datetime values.
max_iet (int): Maximum distance between consecutive events in a train, in units specified by time_unit.
time_unit (str, optional): Unit of time for the intervals ('seconds', 'minutes', 'hours', 'days', 'weeks', 'months', or 'years'). Default is 'days'.
min_burst (int, optional): Minimum number of events required to form a train. Default is 3.
only_trains (bool, optional): Whether to return only the events that form trains. Default is True.

Returns

DataFrame: The input DataFrame with an added train_id column indicating the train each event belongs to.

Examples

>>> import pandas as pd
>>> from bursty_dynamics.trains import train_detection
>>> data = {
...     'subject_id': [1, 1, 1, 1, 2, 2],
...     'event_time': ['2023-01-01', '2023-01-02', '2023-01-10', '2023-01-20', '2023-01-01', '2023-01-03']
... }
>>> df = pd.DataFrame(data)
>>> train_df = train_detection(df, 'subject_id', 'event_time', max_iet=30, time_unit='days', min_burst=2)
>>> train_df
     subject_id  event_time  train_id
0      1         2023-01-01    1
1      1         2023-01-02    1
2      1         2023-01-10    1
3      1         2023-01-20    1
4      2         2023-01-01    1
5      2         2023-01-03    1
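
The grouping rule implied by max_iet and min_burst can be sketched in plain pandas. This is an illustrative sketch under stated assumptions, not the library's implementation: the helper name sketch_train_detection is hypothetical, max_iet is handled in days only, and labelling events outside any train with train_id 0 is an assumption.

```python
import pandas as pd


def sketch_train_detection(df, subject_id, time_col, max_iet_days,
                           min_burst=3, only_trains=True):
    """Illustrative only: hypothetical helper, not bursty_dynamics code."""
    out = df.copy()
    out[time_col] = pd.to_datetime(out[time_col])
    out = out.sort_values([subject_id, time_col]).reset_index(drop=True)

    # Gap to the same subject's previous event (NaT for each subject's first event).
    gap = out.groupby(subject_id)[time_col].diff()

    # A candidate train starts at a subject's first event or after a gap above max_iet.
    starts = gap.isna() | (gap > pd.Timedelta(days=max_iet_days))
    out["_candidate"] = starts.astype(int).groupby(out[subject_id]).cumsum()

    # Keep only candidates that contain at least min_burst events.
    size = out.groupby([subject_id, "_candidate"])[time_col].transform("size")
    valid = size >= min_burst

    # Number surviving trains 1, 2, ... per subject; 0 marks "no train" (assumed).
    numbered = (starts & valid).astype(int).groupby(out[subject_id]).cumsum()
    out["train_id"] = numbered.where(valid, 0)
    out = out.drop(columns="_candidate")
    return out[out["train_id"] > 0] if only_trains else out


df = pd.DataFrame({
    "subject_id": [1, 1, 1, 1, 2, 2],
    "event_time": ["2023-01-01", "2023-01-02", "2023-01-10",
                   "2023-01-20", "2023-01-01", "2023-01-03"],
})
# Same settings as the documented example: every event falls in train 1.
loose = sketch_train_detection(df, "subject_id", "event_time", 30, min_burst=2)
# A tighter threshold splits subject 1; the isolated events get train_id 0.
tight = sketch_train_detection(df, "subject_id", "event_time", 5,
                               min_burst=2, only_trains=False)
```

With a 5-day threshold, subject 1's events on 2023-01-10 and 2023-01-20 no longer chain to a neighbour, so only the first two events form a train.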

bursty_dynamics.trains.train_info(train_df, subject_id, time_col, summary_statistic=False)

Calculate summary statistics for train data. This function processes event data grouped by subject and train ID, calculating key metrics such as event counts, total terms, train start and end times, train durations, and total trains per subject. Optionally, it prints descriptive statistics about the dataset.

Parameters

train_df (pd.DataFrame): DataFrame containing the train data, including subject IDs, train IDs, and event timestamps.
subject_id (str): Name of the column containing subject IDs.
time_col (str): Name of the column containing timestamps (e.g., 'event_time').
summary_statistic (bool, optional): If True, prints summary statistics of train durations and event counts. Default is False.

Returns

pd.DataFrame

A DataFrame containing aggregated train-level information with the following columns:

subject_id (str): Unique identifier for each subject.
train_id (int): Identifier for each train sequence (group of events).
unique_event_counts (int): Number of distinct event dates after removing duplicate events on the same day.
total_term_counts (int): Total number of events including duplicates (e.g., multiple events on the same date).
train_start (datetime): Earliest event timestamp for the train.
train_end (datetime): Latest event timestamp for the train.
train_duration_yrs (float): Duration of the train in years, rounded to two decimal places.
total_trains (int): Total number of non-zero trains for each subject.

Examples

>>> train_info(train_df, subject_id='subject_id', time_col='event_time')
    subject_id train_id  unique_event_counts  total_term_counts  train_start  train_end  train_duration_yrs  total_trains
0      1          1          4                   4                2023-01-01   2023-01-20    0.05                 1
1      2          1          2                   2                2023-01-01   2023-01-03    0.01                 1
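
The columns listed above can be approximated with a single pandas groupby. This is a minimal sketch, not the library's code: the helper name sketch_train_info is hypothetical, and both the 365.25-day year and the treatment of train_id 0 as "not in a train" are assumptions.

```python
import pandas as pd


def sketch_train_info(train_df, subject_id, time_col):
    """Illustrative only: hypothetical helper, not bursty_dynamics code."""
    df = train_df[train_df["train_id"] > 0].copy()  # assumed: 0 means no train
    df[time_col] = pd.to_datetime(df[time_col])

    info = (
        df.groupby([subject_id, "train_id"])
        .agg(
            # Distinct calendar dates, so same-day duplicates count once.
            unique_event_counts=(time_col, lambda s: s.dt.normalize().nunique()),
            # Every row counts, duplicates included.
            total_term_counts=(time_col, "size"),
            train_start=(time_col, "min"),
            train_end=(time_col, "max"),
        )
        .reset_index()
    )
    # Duration in years, assuming 365.25 days per year.
    days = (info["train_end"] - info["train_start"]).dt.days
    info["train_duration_yrs"] = (days / 365.25).round(2)
    # Number of trains per subject, repeated on each of that subject's rows.
    info["total_trains"] = info.groupby(subject_id)["train_id"].transform("nunique")
    return info


train_df = pd.DataFrame({
    "subject_id": [1, 1, 1, 1, 2, 2],
    "event_time": ["2023-01-01", "2023-01-02", "2023-01-10",
                   "2023-01-20", "2023-01-01", "2023-01-03"],
    "train_id": [1, 1, 1, 1, 1, 1],
})
info = sketch_train_info(train_df, "subject_id", "event_time")
```

On the example data this gives durations of 0.05 and 0.01 years, in line with the documented output.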

bursty_dynamics.trains.train_scores(train_df, subject_id, time_col, min_event_n=None, scatter=False, hist=False)

Calculate Burstiness Parameter (BP) and Memory Coefficient (MC) for each train_id per subject_id.

Parameters

train_df (pd.DataFrame): Input DataFrame.
subject_id (str): Name of the column containing subject IDs.
time_col (str): Name of the column containing the date.
min_event_n (int, optional): Minimum number of unique events (events with distinct timestamps) required in a train for it to be included in the dataset. If None (default), no filtering is applied.
scatter (bool, optional): Whether to draw a scatter plot. Default is False.
hist (bool or str, optional): Type of histogram to plot. Options:

True: Plot histograms for both BP and MC.
"BP": Plot histogram for BP only.
"MC": Plot histogram for MC only.
"Both": Plot histograms for both BP and MC on the same plot.
False: Do not plot any histograms (default).

Returns

tuple or DataFrame

If both scatter and hist are True: returns (merged_df, scatter_plot, hist_plot).
If only scatter is True: returns (merged_df, scatter_plot).
If only hist is True: returns (merged_df, hist_plot).
If neither scatter nor hist is True: returns merged_df.

Notes

merged_df (DataFrame): The input DataFrame with the burstiness parameter (BP) and memory coefficient (MC) for each train_id per subject_id.
scatter_plot (matplotlib.figure.Figure or None): The figure object containing the scatter plot (if scatter=True).
hist_plot (matplotlib.figure.Figure or None): The figure object containing the histogram(s) (if hist is enabled).

Multiple events occurring at the same time are aggregated into a single event when calculating the BP and MC.

Examples

>>> train_scores(train_df, subject_id='subject_id', time_col='event_time', min_event_n=3)
    subject_id  train_id  BP         MC
0      1           1      -0.19709   1.0
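
The source does not state which formulas produce BP and MC. The sketch below is an inference from the example output rather than confirmed library behaviour: the documented BP of -0.19709 is numerically consistent with the finite-size-corrected burstiness of Kim and Jo (using the population standard deviation of the inter-event times), and the MC of 1.0 with the Pearson correlation between consecutive inter-event times. The helper name sketch_bp_mc is hypothetical, and event times are given in days since the first event.

```python
import numpy as np


def sketch_bp_mc(event_times):
    """Illustrative only: hypothetical helper, not bursty_dynamics code."""
    # np.unique sorts and collapses simultaneous events into one.
    t = np.unique(np.asarray(event_times, dtype=float))
    n = t.size              # number of distinct events
    iet = np.diff(t)        # inter-event times

    # Coefficient of variation of the inter-event times (population std).
    r = iet.std() / iet.mean()

    # Finite-size-corrected burstiness (assumed formula, see lead-in).
    bp = (np.sqrt(n + 1) * r - np.sqrt(n - 1)) / (
        (np.sqrt(n + 1) - 2) * r + np.sqrt(n - 1)
    )

    # Memory coefficient: correlation between each inter-event time and the next.
    # Needs at least three inter-event times, i.e. four distinct events.
    mc = np.corrcoef(iet[:-1], iet[1:])[0, 1]
    return bp, mc


# Subject 1 from the example: 2023-01-01, -02, -10, -20 as day offsets,
# with one duplicated timestamp to show that it is ignored.
bp, mc = sketch_bp_mc([0, 1, 1, 9, 19])
```

For these offsets the inter-event times are 1, 8 and 10 days, which gives a BP of about -0.197 and an MC of 1.0, in line with the row shown above.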