bursty_dynamics.trains

This module contains functions for detecting trains, calculating the burstiness parameter (BP) and memory coefficient (MC) of each train, and summarizing train-level information.

bursty_dynamics.trains.train_detection(df, subject_id, time_col, max_iet, time_unit='days', min_burst=3, only_trains=True)

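Detects trains and assigns a train ID to each event in the provided DataFrame. Consecutive events of a subject that are no more than max_iet apart are grouped together, and a group counts as a train only if it contains at least min_burst events.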

Parameters

df : DataFrame

The DataFrame containing the event data.

subject_id : str

The column name for subject IDs.

time_col : str

The column name for the datetime values.

max_iet : int

Maximum inter-event time (the gap between consecutive events) allowed within a train, in the units specified by time_unit.

time_unit : str, optional

Unit of time for the intervals: 'seconds', 'minutes', 'hours', 'days', 'weeks', 'months', or 'years'. Default is 'days'.

min_burst : int, optional

Minimum number of events required to form a train. Default is 3.

only_trains : bool, optional

Whether to return only the events that form trains. Default is True.

Returns

DataFrame

DataFrame with an added train_id column indicating the train each event belongs to.

Examples

>>> import pandas as pd
>>> from bursty_dynamics.trains import train_detection
>>> data = {
...     'subject_id': [1, 1, 1, 1, 2, 2],
...     'event_time': ['2023-01-01', '2023-01-02', '2023-01-10', '2023-01-20', '2023-01-01', '2023-01-03']
... }
>>> df = pd.DataFrame(data)
>>> train_df = train_detection(df, 'subject_id', 'event_time', max_iet=30, time_unit='days', min_burst=2)
>>> train_df
     subject_id  event_time  train_id
0      1         2023-01-01    1
1      1         2023-01-02    1
2      1         2023-01-10    1
3      1         2023-01-20    1
4      2         2023-01-01    1
5      2         2023-01-03    1
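
The grouping rule described by max_iet and min_burst can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name detect_trains, the day-based gap, and the convention that events outside any train get train_id 0 are all assumptions made for this sketch.

```python
from datetime import date, timedelta
from itertools import groupby


def detect_trains(events, max_iet_days, min_burst=3):
    """Label (subject, day) events with a per-subject train id.

    Hypothetical helper: train ids start at 1 for each subject, and
    0 marks events that do not belong to any train (an assumption).
    """
    labelled = []
    ordered = sorted(events)  # sort by subject, then by time
    for subject, group in groupby(ordered, key=lambda e: e[0]):
        times = [t for _, t in group]
        # Split the subject's timeline wherever the gap exceeds max_iet.
        runs, current = [], [times[0]]
        for prev, cur in zip(times, times[1:]):
            if cur - prev > timedelta(days=max_iet_days):
                runs.append(current)
                current = []
            current.append(cur)
        runs.append(current)
        # Runs with at least min_burst events become trains 1, 2, ...
        next_id = 1
        for run in runs:
            if len(run) >= min_burst:
                train_id = next_id
                next_id += 1
            else:
                train_id = 0
            labelled.extend((subject, t, train_id) for t in run)
    return labelled


events = [
    (1, date(2023, 1, 1)), (1, date(2023, 1, 2)),
    (1, date(2023, 1, 10)), (1, date(2023, 1, 20)),
    (2, date(2023, 1, 1)), (2, date(2023, 1, 3)),
]
labelled = detect_trains(events, max_iet_days=30, min_burst=2)
print([train_id for _, _, train_id in labelled])  # [1, 1, 1, 1, 1, 1]
```

With max_iet_days=5 instead of 30, the same data splits subject 1 after 2023-01-02, so only the first two events of subject 1 would remain in a train under this sketch.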
bursty_dynamics.trains.train_info(train_df, subject_id, time_col, summary_statistic=False)

Calculate summary statistics for train data. This function groups event data by subject and train ID and computes, for each train, the unique and total event counts, the start and end times, and the duration, along with the total number of trains per subject. Optionally, it prints descriptive statistics about the dataset.

Parameters

train_df : pd.DataFrame

DataFrame containing the train data, including subject IDs, train IDs, and event timestamps.

subject_id : str

Name of the column containing subject IDs.

time_col : str

Name of the column containing timestamps (e.g., 'event_time').

summary_statistic : bool, optional

If True, prints summary statistics of train durations and event counts. Default is False.

Returns

pd.DataFrame

A DataFrame containing aggregated train-level information with the following columns:

  • subject_id (str): Unique identifier for each subject.

  • train_id (int): Identifier for each train sequence (group of events).

  • unique_event_counts (int): Number of distinct event dates after removing duplicate events on the same day.

  • total_term_counts (int): Total number of events including duplicates (e.g., multiple events on the same date).

  • train_start (datetime): Earliest event timestamp for the train.

  • train_end (datetime): Latest event timestamp for the train.

  • train_duration_yrs (float): Duration of the train in years, rounded to two decimal places.

  • total_trains (int): Total number of non-zero trains for each subject.

Examples

>>> from bursty_dynamics.trains import train_info
>>> train_info(train_df, subject_id='subject_id', time_col='event_time')
    subject_id train_id  unique_event_counts  total_term_counts  train_start  train_end  train_duration_yrs  total_trains
0      1          1          4                   4                2023-01-01   2023-01-20    0.05                 1
1      2          1          2                   2                2023-01-01   2023-01-03    0.01                 1
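
The per-train columns can be illustrated with a small standard-library sketch. The helper name summarize_train is hypothetical, and the 365.25-day year is an assumption; it is consistent with the rounded durations in the example above but is not confirmed by this page.

```python
from datetime import date


def summarize_train(event_days):
    """Summarize one train given the dates of its events (hypothetical helper)."""
    unique_days = sorted(set(event_days))  # drop same-day duplicates
    start, end = unique_days[0], unique_days[-1]
    return {
        "unique_event_counts": len(unique_days),
        "total_term_counts": len(event_days),  # duplicates included
        "train_start": start,
        "train_end": end,
        # Assumption: a year is taken as 365.25 days.
        "train_duration_yrs": round((end - start).days / 365.25, 2),
    }


train_1 = [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 10), date(2023, 1, 20)]
summary = summarize_train(train_1)
print(summary["unique_event_counts"], summary["train_duration_yrs"])  # 4 0.05
```

A train with two events on the same day would report a total_term_counts one higher than its unique_event_counts.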
bursty_dynamics.trains.train_scores(train_df, subject_id, time_col, min_event_n=None, scatter=False, hist=False)

Calculate Burstiness Parameter (BP) and Memory Coefficient (MC) for each train_id per subject_id.

Parameters

train_df : pd.DataFrame

Input DataFrame containing subject IDs, train IDs, and event timestamps.

subject_id : str

Name of the column containing subject IDs.

time_col : str

Name of the column containing the event timestamps.

min_event_n : int, optional

Minimum number of unique events (events with distinct timestamps) required in a train for it to be included in the output. If None (default), no filtering is applied.

scatter : bool, optional

Whether to draw a scatter plot. Default is False.

hist : bool or str, optional

Type of histogram to plot. Options:

  • True: Plot histograms for both BP and MC.

  • "BP": Plot histogram for BP only.

  • "MC": Plot histogram for MC only.

  • "Both": Plot histograms for both BP and MC on the same plot.

  • False: Do not plot any histograms (default).

Returns

tuple or DataFrame
  • If both scatter and hist are True: returns (merged_df, scatter_plot, hist_plot).

  • If only scatter is True: returns (merged_df, scatter_plot).

  • If only hist is True: returns (merged_df, hist_plot).

  • If neither scatter nor hist is True: returns merged_df.

Notes

  • merged_df : DataFrame

    DataFrame containing the burstiness parameter (BP) and memory coefficient (MC) for each train_id per subject_id.

  • scatter_plot : matplotlib.figure.Figure or None

    The figure object containing the scatter plot (if scatter=True).

  • hist_plot : matplotlib.figure.Figure or None

    The figure object containing the histogram(s) (if hist is enabled).

  • Multiple events occurring at the same time are aggregated into a single event when calculating the BP and MC.
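
Because the return value is either a bare DataFrame or a tuple, calling code may want to normalize it before use. The helper below is a hypothetical convenience, not part of bursty_dynamics; it relies only on the return shapes listed above.

```python
def split_scores_result(result):
    """Return (scores, figures) for either documented return shape.

    Hypothetical helper: `result` is whatever train_scores returned,
    either the scores DataFrame alone or a tuple that starts with it.
    """
    if isinstance(result, tuple):
        scores, *figures = result
        return scores, figures
    return result, []


# Stand-in values are used here so the sketch runs without the library.
scores, figures = split_scores_result(("scores_df", "scatter_fig"))
print(scores, len(figures))  # scores_df 1
```

In real use the argument would be the value returned by train_scores, for example with scatter=True, and figures would hold the matplotlib figure objects in the order listed under Returns.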

Examples

>>> from bursty_dynamics.trains import train_scores
>>> train_scores(train_df, subject_id='subject_id', time_col='event_time', min_event_n=3)
    subject_id  train_id  BP         MC
0      1           1      -0.19709   1.0
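
This page does not state the formulas behind BP and MC. The sketch below assumes a finite-size-corrected burstiness parameter (the form proposed by Kim and Jo, using the population standard deviation of the inter-event times) and a memory coefficient defined as the Pearson correlation between consecutive inter-event times. Under those assumptions it reproduces the values in the example above; the function names are hypothetical, and event times are given as day offsets from the first event.

```python
from math import sqrt
from statistics import mean, pstdev


def inter_event_times(event_days):
    """Gaps between distinct event times; same-time events count once."""
    times = sorted(set(event_days))
    return len(times), [b - a for a, b in zip(times, times[1:])]


def burstiness_parameter(event_days):
    """Finite-size-corrected BP (assumed form); needs at least 2 distinct times."""
    n, iets = inter_event_times(event_days)
    r = pstdev(iets) / mean(iets)  # coefficient of variation of the gaps
    return (sqrt(n + 1) * r - sqrt(n - 1)) / ((sqrt(n + 1) - 2) * r + sqrt(n - 1))


def memory_coefficient(event_days):
    """Pearson correlation of consecutive gaps; needs at least 3 distinct
    times and gaps that are not all equal."""
    _, iets = inter_event_times(event_days)
    x, y = iets[:-1], iets[1:]
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / norm


# Subject 1 from the example: 2023-01-01, 01-02, 01-10, 01-20 as day offsets.
days = [0, 1, 9, 19]
print(round(burstiness_parameter(days), 5))  # -0.19709
print(memory_coefficient(days))              # 1.0
```

With only two pairs of consecutive gaps, as here, the correlation can only be +1 or -1, so MC values for short trains should be read with caution.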