This page contains the auto-generated API reference documentation for the sequifier package.

Preprocessing Config

class sequifier.config.preprocess_config.PreprocessorModel(*, project_path: str, data_path: str, read_format: str = 'csv', write_format: str = 'parquet', combine_into_single_file: bool = True, selected_columns: list[str] | None = None, group_proportions: list[float], seq_length: int, seq_step_sizes: list[int] | None = None, max_rows: int | None = None, seed: int, n_cores: int | None = None, batches_per_file: int = 1024, process_by_file: bool = True)[source]

Pydantic model for preprocessor configuration.

project_path

The path to the sequifier project directory.

Type:

str

data_path

The path to the input data file.

Type:

str

read_format

The file type of the input data. Can be ‘csv’ or ‘parquet’.

Type:

str

write_format

The file type for the preprocessed output data.

Type:

str

combine_into_single_file

If True, combines all preprocessed data into a single file.

Type:

bool

selected_columns

A list of columns to be included in the preprocessing. If None, all columns are used.

Type:

list[str] | None

group_proportions

A list of floats that define the relative sizes of data splits (e.g., for train, validation, test). The sum of proportions must be 1.0.

Type:

list[float]

seq_length

The sequence length for the model inputs.

Type:

int

seq_step_sizes

A list of step sizes for creating subsequences within each data split.

Type:

list[int] | None

max_rows

The maximum number of input rows to process. If None, all rows are processed.

Type:

int | None

seed

A random seed for reproducibility.

Type:

int

n_cores

The number of CPU cores to use for parallel processing. If None, all available CPU cores are used.

Type:

int | None

batches_per_file

The number of batches to process per file.

Type:

int

process_by_file

If True, processing is done file by file.

Type:

bool
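
Example

A minimal sketch of constructing this config directly in Python. All field values here are illustrative, and any field validators (e.g., checks that data_path exists) may impose additional constraints.

from sequifier.config.preprocess_config import PreprocessorModel

config = PreprocessorModel(
    project_path=".",                   # sequifier project directory
    data_path="data/input.csv",         # hypothetical input file
    read_format="csv",
    group_proportions=[0.8, 0.1, 0.1],  # train/validation/test; must sum to 1.0
    seq_length=16,
    seed=101,
)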

Training Config

class sequifier.config.train_config.ModelSpecModel(*, d_model: int, d_model_by_column: dict[str, int] | None = None, nhead: int, d_hid: int, nlayers: int)[source]

Pydantic model for model specifications.

d_model

The number of expected features in the input.

Type:

int

d_model_by_column

The embedding dimensions for each input column. Must sum to d_model.

Type:

dict[str, int] | None

nhead

The number of heads in the multi-head attention models.

Type:

int

d_hid

The dimension of the feedforward network model.

Type:

int

nlayers

The number of layers in the transformer model.

Type:

int
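
Example

A minimal sketch. When d_model_by_column is provided, the per-column embedding dimensions must sum to d_model; the column names here are hypothetical.

from sequifier.config.train_config import ModelSpecModel

spec = ModelSpecModel(
    d_model=64,
    d_model_by_column={"itemId": 48, "price": 16},  # 48 + 16 == 64
    nhead=4,   # standard multi-head attention requires d_model % nhead == 0
    d_hid=128,
    nlayers=2,
)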

class sequifier.config.train_config.TrainModel(*, project_path: str, ddconfig_path: str, model_name: str, training_data_path: str, validation_data_path: str, read_format: str = 'parquet', selected_columns: list[str], column_types: dict[str, str], categorical_columns: list[str], real_columns: list[str], target_columns: list[str], target_column_types: dict[str, str], id_maps: dict[str, dict[str | int, int]], seq_length: int, n_classes: dict[str, int], inference_batch_size: int, seed: int, export_generative_model: bool, export_embedding_model: bool, export_onnx: bool = True, export_pt: bool = False, export_with_dropout: bool = False, model_spec: ModelSpecModel, training_spec: TrainingSpecModel)[source]

Pydantic model for training configuration.

project_path

The path to the sequifier project directory.

Type:

str

ddconfig_path

The path to the data-driven configuration file.

Type:

str

model_name

The name of the model being trained.

Type:

str

training_data_path

The path to the training data.

Type:

str

validation_data_path

The path to the validation data.

Type:

str

read_format

The file format of the input data (e.g., ‘csv’, ‘parquet’).

Type:

str

selected_columns

The list of input columns to be used for training.

Type:

list[str]

column_types

A dictionary mapping each column to its numeric type (‘int64’ or ‘float64’).

Type:

dict[str, str]

categorical_columns

A list of columns that are categorical.

Type:

list[str]

real_columns

A list of columns that are real-valued.

Type:

list[str]

target_columns

The list of target columns for model training.

Type:

list[str]

target_column_types

A dictionary mapping target columns to their types (‘categorical’ or ‘real’).

Type:

dict[str, str]

id_maps

For each categorical column, a map from distinct values to their indexed representation.

Type:

dict[str, dict[str | int, int]]

seq_length

The sequence length of the model’s input.

Type:

int

n_classes

The number of classes for each categorical column.

Type:

dict[str, int]

inference_batch_size

The batch size to be used for inference after model export.

Type:

int

seed

The random seed for numpy and PyTorch.

Type:

int

export_generative_model

If True, exports the generative model.

Type:

bool

export_embedding_model

If True, exports the embedding model.

Type:

bool

export_onnx

If True, exports the model in ONNX format.

Type:

bool

export_pt

If True, exports the model using torch.save.

Type:

bool

export_with_dropout

If True, exports the model with dropout enabled.

Type:

bool

model_spec

The specification of the transformer model architecture.

Type:

sequifier.config.train_config.ModelSpecModel

training_spec

The specification of the training run configuration.

Type:

sequifier.config.train_config.TrainingSpecModel
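
Example

A sketch of loading and validating a training configuration from YAML; it assumes the file at the hypothetical path configs/train.yaml supplies every required field listed above, including nested model_spec and training_spec sections.

import yaml

from sequifier.config.train_config import TrainModel

with open("configs/train.yaml") as f:
    config = TrainModel(**yaml.safe_load(f))

print(config.model_spec.d_model, config.training_spec.lr)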

class sequifier.config.train_config.TrainingSpecModel(*, device: str, device_max_concat_length: int = 12, epochs: int, log_interval: int = 10, class_share_log_columns: list[str] = None, early_stopping_epochs: int | None = None, iter_save: int, batch_size: int, lr: float, criterion: dict[str, str], class_weights: dict[str, list[float]] | None = None, accumulation_steps: int | None = None, dropout: float = 0.0, loss_weights: dict[str, float] | None = None, optimizer: DotDict = None, scheduler: DotDict = None, continue_training: bool = True, distributed: bool = False, load_full_data_to_ram: bool = True, world_size: int = 1, num_workers: int = 0, backend: str = 'nccl')[source]

Pydantic model for training specifications.

device

The torch.device to train the model on (e.g., ‘cuda’, ‘cpu’, ‘mps’).

Type:

str

device_max_concat_length

Maximum sequence length for concatenation on device.

Type:

int

epochs

The total number of epochs to train for.

Type:

int

log_interval

The interval in batches for logging.

Type:

int

class_share_log_columns

A list of column names for which to log the class share of predictions.

Type:

list[str]

early_stopping_epochs

Number of epochs to wait for validation loss improvement before stopping.

Type:

int | None

iter_save

The interval in epochs for checkpointing the model.

Type:

int

batch_size

The training batch size.

Type:

int

lr

The learning rate.

Type:

float

criterion

A dictionary mapping each target column to a loss function.

Type:

dict[str, str]

class_weights

A dictionary mapping categorical target columns to a list of class weights.

Type:

dict[str, list[float]] | None

accumulation_steps

The number of gradient accumulation steps.

Type:

int | None

dropout

The dropout value for the transformer model.

Type:

float

loss_weights

A dictionary mapping columns to specific loss weights.

Type:

dict[str, float] | None

optimizer

The optimizer configuration.

Type:

sequifier.config.train_config.DotDict

scheduler

The learning rate scheduler configuration.

Type:

sequifier.config.train_config.DotDict

continue_training

If True, continue training from the latest checkpoint.

Type:

bool

distributed

If True, enables distributed training.

Type:

bool

load_full_data_to_ram

If True, loads the entire dataset into RAM.

Type:

bool

world_size

The number of processes for distributed training.

Type:

int

num_workers

The number of worker processes for data loading.

Type:

int

backend

The distributed training backend (e.g., ‘nccl’).

Type:

str
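
Example

A sketch of a training spec. The criterion value follows PyTorch loss-class naming, and the optimizer/scheduler dicts are assumed to be coerced into DotDict objects; the exact keys they accept are assumptions here.

from sequifier.config.train_config import TrainingSpecModel

training_spec = TrainingSpecModel(
    device="cpu",
    epochs=100,
    iter_save=10,                              # checkpoint every 10 epochs
    batch_size=32,
    lr=1e-3,
    criterion={"itemId": "CrossEntropyLoss"},  # one loss per target column
    optimizer={"name": "Adam"},
    scheduler={"name": "StepLR", "step_size": 1, "gamma": 0.99},
    dropout=0.1,
)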

Inference Config

class sequifier.config.infer_config.InfererModel(*, project_path: str, ddconfig_path: str, model_path: str | list[str], model_type: str, data_path: str, training_config_path: str = 'configs/train.yaml', read_format: str = 'parquet', write_format: str = 'csv', selected_columns: list[str], categorical_columns: list[str], real_columns: list[str], target_columns: list[str], column_types: dict[str, str], target_column_types: dict[str, str], output_probabilities: bool = False, map_to_id: bool = True, seed: int, device: str, seq_length: int, inference_batch_size: int, distributed: bool = False, load_full_data_to_ram: bool = True, world_size: int = 1, num_workers: int = 0, sample_from_distribution_columns: list[str] | None = None, infer_with_dropout: bool = False, autoregression: bool = False, autoregression_extra_steps: int | None = None)[source]

Pydantic model for inference configuration.

project_path

The path to the sequifier project directory.

Type:

str

ddconfig_path

The path to the data-driven configuration file.

Type:

str

model_path

The path to the trained model file(s).

Type:

str | list[str]

model_type

The type of model, either ‘embedding’ or ‘generative’.

Type:

str

data_path

The path to the data to be used for inference.

Type:

str

training_config_path

The path to the training configuration file.

Type:

str

read_format

The file format of the input data (e.g., ‘csv’, ‘parquet’).

Type:

str

write_format

The file format for the inference output.

Type:

str

selected_columns

The list of input columns used for inference.

Type:

list[str]

categorical_columns

A list of columns that are categorical.

Type:

list[str]

real_columns

A list of columns that are real-valued.

Type:

list[str]

target_columns

The list of target columns for inference.

Type:

list[str]

column_types

A dictionary mapping each column to its numeric type (‘int64’ or ‘float64’).

Type:

dict[str, str]

target_column_types

A dictionary mapping target columns to their types (‘categorical’ or ‘real’).

Type:

dict[str, str]

output_probabilities

If True, outputs the probability distributions for categorical target columns.

Type:

bool

map_to_id

If True, maps categorical output values back to their original IDs.

Type:

bool

seed

The random seed for reproducibility.

Type:

int

device

The device to run inference on (e.g., ‘cuda’, ‘cpu’, ‘mps’).

Type:

str

seq_length

The sequence length of the model’s input.

Type:

int

inference_batch_size

The batch size for inference.

Type:

int

distributed

If True, enables distributed inference.

Type:

bool

load_full_data_to_ram

If True, loads the entire dataset into RAM.

Type:

bool

world_size

The number of processes for distributed inference.

Type:

int

num_workers

The number of worker processes for data loading.

Type:

int

sample_from_distribution_columns

A list of categorical target columns whose predictions are sampled from the predicted distribution instead of taken by argmax.

Type:

list[str] | None

infer_with_dropout

If True, applies dropout during inference.

Type:

bool

autoregression

If True, performs autoregressive inference.

Type:

bool

autoregression_extra_steps

The number of additional steps for autoregressive inference.

Type:

int | None
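
Example

A sketch of loading an inference configuration from YAML; the path is hypothetical, and the file is assumed to supply every required field listed above. With autoregression enabled, autoregression_extra_steps controls how many future steps are appended per sequence (see expand_data_by_autoregression under Internals).

import yaml

from sequifier.config.infer_config import InfererModel

with open("configs/infer.yaml") as f:
    config = InfererModel(**yaml.safe_load(f))

print(config.autoregression, config.autoregression_extra_steps)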

Hyperparameter Search Config

class sequifier.config.hyperparameter_search_config.HyperparameterSearch(*, project_path: str, ddconfig_path: str, hp_search_name: str, search_strategy: str = 'sample', n_samples: int | None = None, model_config_write_path: str, training_data_path: str, validation_data_path: str, read_format: str = 'parquet', selected_columns: list[list[str]], column_types: list[dict[str, str]], categorical_columns: list[list[str]], real_columns: list[list[str]], target_columns: list[str], target_column_types: dict[str, str], id_maps: dict[str, dict[str | int, int]], seq_length: list[int], n_classes: dict[str, int], inference_batch_size: int, export_onnx: bool = True, export_pt: bool = False, export_with_dropout: bool = False, model_hyperparameter_sampling: ModelSpecHyperparameterSampling, training_hyperparameter_sampling: TrainingSpecHyperparameterSampling)[source]

Pydantic model for hyperparameter search configuration.

project_path

The path to the sequifier project directory.

Type:

str

ddconfig_path

The path to the data-driven configuration file.

Type:

str

hp_search_name

The name for the hyperparameter search.

Type:

str

search_strategy

The search strategy, either “sample” or “grid”.

Type:

str

n_samples

The number of samples to draw for the search.

Type:

int | None

model_config_write_path

The path to write the model configurations to.

Type:

str

training_data_path

The path to the training data.

Type:

str

validation_data_path

The path to the validation data.

Type:

str

read_format

The file format of the input data.

Type:

str

selected_columns

A list of lists of columns to be used for training.

Type:

list[list[str]]

column_types

A list of dictionaries mapping columns to their types.

Type:

list[dict[str, str]]

categorical_columns

A list of lists of categorical columns.

Type:

list[list[str]]

real_columns

A list of lists of real-valued columns.

Type:

list[list[str]]

target_columns

The list of target columns for model training.

Type:

list[str]

target_column_types

A dictionary mapping target columns to their types.

Type:

dict[str, str]

id_maps

A dictionary mapping categorical values to their indexed representation.

Type:

dict[str, dict[str | int, int]]

seq_length

A list of possible sequence lengths.

Type:

list[int]

n_classes

The number of classes for each categorical column.

Type:

dict[str, int]

inference_batch_size

The batch size for inference.

Type:

int

export_onnx

If True, exports the model in ONNX format.

Type:

bool

export_pt

If True, exports the model using torch.save.

Type:

bool

export_with_dropout

If True, exports the model with dropout enabled.

Type:

bool

model_hyperparameter_sampling

The sampling configuration for model hyperparameters.

Type:

sequifier.config.hyperparameter_search_config.ModelSpecHyperparameterSampling

training_hyperparameter_sampling

The sampling configuration for training hyperparameters.

Type:

sequifier.config.hyperparameter_search_config.TrainingSpecHyperparameterSampling

grid_sample(i)[source]

Select a full training configuration based on a grid search index.

This method generates a grid of all possible configurations and selects the configuration at the given index.

Parameters:

i – The index of the configuration to select from the grid.

Returns:

A TrainModel instance populated with the selected configuration.

n_combinations()[source]

Calculate the total number of possible configurations.

This method computes the total number of unique configurations that can be generated by a grid search over all defined hyperparameters.

Returns:

The total number of possible hyperparameter configurations.

random_sample(i)[source]

Randomly sample a full training configuration.

This method generates a complete training configuration by randomly sampling model and training hyperparameters, as well as selecting a column set and sequence length.

Parameters:

i – The index of the sample, used to create a unique model name.

Returns:

A TrainModel instance populated with a randomly sampled configuration.

sample(i)[source]

Sample a configuration based on the specified search strategy.

This method delegates to either random_sample or grid_sample based on the search_strategy attribute.

Parameters:

i – The index of the sample or grid combination to generate.

Returns:

A TrainModel instance with a generated configuration.

Raises:

Exception – If the search_strategy is not ‘sample’ or ‘grid’.
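
Example

A sketch of enumerating every grid combination; the config path is hypothetical, and search_strategy is assumed to be set to “grid” in the loaded file.

import yaml

from sequifier.config.hyperparameter_search_config import HyperparameterSearch

with open("configs/hyperparameter_search.yaml") as f:
    search_config = HyperparameterSearch(**yaml.safe_load(f))

for i in range(search_config.n_combinations()):
    train_config = search_config.sample(i)  # delegates to grid_sample(i)
    print(train_config.model_name)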

class sequifier.config.hyperparameter_search_config.ModelSpecHyperparameterSampling(*, d_model: list[int], d_model_by_column: list[dict[str, int]] | None = None, nhead: list[int], d_hid: list[int], nlayers: list[int])[source]

Pydantic model for model specification hyperparameter sampling.

d_model

A list of possible numbers of expected features in the input.

Type:

list[int]

d_model_by_column

A list of possible embedding dimensions for each input column.

Type:

list[dict[str, int]] | None

nhead

A list of possible numbers of heads in the multi-head attention models.

Type:

list[int]

d_hid

A list of possible dimensions of the feedforward network model.

Type:

list[int]

nlayers

A list of possible numbers of layers in the transformer model.

Type:

list[int]

grid_sample(i)[source]

Select a set of model hyperparameters based on a grid search index.

This method generates a grid of all possible model hyperparameter combinations and selects the combination at the given index.

Parameters:

i – The index of the hyperparameter combination to select from the grid.

Returns:

A ModelSpecModel instance populated with the selected set of hyperparameters.

n_combinations()[source]

Calculate the total number of model hyperparameter combinations.

This method computes the total number of unique model hyperparameter sets that can be generated by the grid search.

Returns:

The total number of possible model hyperparameter combinations.

random_sample()[source]

Randomly sample a set of model hyperparameters.

This method selects a random combination of model hyperparameters from the defined lists of possibilities. It ensures that d_model, d_model_by_column, and nhead are paired correctly.

Returns:

A ModelSpecModel instance populated with a randomly sampled set of hyperparameters.
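
For example, assuming the grid is a full Cartesian product over the lists (the pairing of d_model, d_model_by_column, and nhead noted under random_sample may reduce this), d_model=[32, 64], nhead=[2, 4], d_hid=[128], nlayers=[2, 3] yields 2 × 2 × 1 × 2 = 8 candidate model specs, which grid_sample(i) enumerates for i in range(8).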

class sequifier.config.hyperparameter_search_config.TrainingSpecHyperparameterSampling(*, device: str, epochs: list[int], log_interval: int = 10, class_share_log_columns: list[str] = None, early_stopping_epochs: int | None = None, iter_save: int, batch_size: list[int], lr: list[float], criterion: dict[str, str], class_weights: dict[str, list[float]] | None = None, accumulation_steps: list[int], dropout: list[float] = [0.0], loss_weights: dict[str, float] | None = None, optimizer: list[DotDict] = None, scheduler: list[DotDict] = None, continue_training: bool = True)[source]

Pydantic model for training specification hyperparameter sampling.

device

The device to train on (e.g., ‘cuda’, ‘cpu’).

Type:

str

epochs

A list of possible numbers of epochs to train for.

Type:

list[int]

log_interval

The interval in batches for logging.

Type:

int

class_share_log_columns

Columns for which to log class share.

Type:

list[str]

early_stopping_epochs

Number of epochs for early stopping.

Type:

int | None

iter_save

Interval in epochs for saving model checkpoints.

Type:

int

batch_size

A list of possible batch sizes.

Type:

list[int]

lr

A list of possible learning rates.

Type:

list[float]

criterion

A dictionary mapping target columns to loss functions.

Type:

dict[str, str]

class_weights

Optional dictionary mapping columns to class weights.

Type:

dict[str, list[float]] | None

accumulation_steps

A list of possible gradient accumulation steps.

Type:

list[int]

dropout

A list of possible dropout rates.

Type:

list[float]

loss_weights

Optional dictionary mapping columns to loss weights.

Type:

dict[str, float] | None

optimizer

A list of possible optimizer configurations.

Type:

list[sequifier.config.train_config.DotDict]

scheduler

A list of possible scheduler configurations.

Type:

list[sequifier.config.train_config.DotDict]

continue_training

Flag to continue training from a checkpoint.

Type:

bool

__init__(**kwargs)[source]

Initialize the TrainingSpecHyperparameterSampling instance.

This method initializes the Pydantic BaseModel and then processes the optimizer and scheduler configurations from the provided keyword arguments, converting them into DotDict objects.

Parameters:

**kwargs – Keyword arguments that correspond to the attributes of this class. The ‘optimizer’ and ‘scheduler’ arguments are expected to be lists of dictionaries.

grid_sample(i)[source]

Select a set of training hyperparameters based on a grid search index.

This method generates a grid of all possible hyperparameter combinations and selects the combination at the given index.

Parameters:

i – The index of the hyperparameter combination to select from the grid.

Returns:

A TrainingSpecModel instance populated with the selected set of hyperparameters.

n_combinations()[source]

Calculate the total number of hyperparameter combinations.

This method computes the total number of unique hyperparameter sets that can be generated by the grid search.

Returns:

The total number of possible hyperparameter combinations.

random_sample()[source]

Randomly sample a set of training hyperparameters.

This method selects a random combination of hyperparameters from the defined lists of possibilities. It ensures that learning rates and schedulers are paired correctly.

Returns:

A TrainingSpecModel instance populated with a randomly sampled set of hyperparameters.

Non-standard Optimizers

class sequifier.optimizers.ademamix.AdEMAMix(params={}, lr=0.001, betas=(0.9, 0.999, 0.9999), eps=1e-08, weight_decay=0, alpha=5.0, T_alpha_beta3=None)[source]

Implements the AdEMAMix optimizer.

This optimizer is introduced in the paper “The AdEMAMix Optimizer: Better, Faster, Older”. It augments Adam with a second, slow-moving exponential moving average (EMA) of past gradients and mixes the two EMAs, which makes better use of older gradients and can improve performance.

Parameters:
  • params (iterable) – Iterable of parameters to optimize or dicts defining parameter groups.

  • lr (float, optional) – Learning rate (default: 1e-3).

  • betas (Tuple[float, float, float], optional) – Coefficients for the fast gradient EMA, the squared-gradient EMA, and the slow gradient EMA, respectively (default: (0.9, 0.999, 0.9999)).

  • eps (float, optional) – Term added to the denominator to improve numerical stability (default: 1e-8).

  • weight_decay (float, optional) – Weight decay (L2 penalty) (default: 0).

  • alpha (float, optional) – Mixing coefficient (default: 5.0).

  • T_alpha_beta3 (int, optional) – Time period for alpha and beta3 scheduling (default: None).

__setstate__(state)[source]

Set the state of the optimizer.

Parameters:

state (dict) – The state of the optimizer.

step(closure=None)[source]

Perform a single optimization step.

Parameters:

closure (callable, optional) – A closure that reevaluates the model and returns the loss. (default: None)

Returns:

The loss if a closure is provided; otherwise None.
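
Example

A minimal sketch of the standard PyTorch optimizer loop using AdEMAMix; the model and data are placeholders.

import torch

from sequifier.optimizers.ademamix import AdEMAMix

model = torch.nn.Linear(10, 1)
optimizer = AdEMAMix(model.parameters(), lr=1e-3, betas=(0.9, 0.999, 0.9999), alpha=5.0)

x, y = torch.randn(8, 10), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()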

Internals

sequifier.sequifier.build_args_config(args: Any) dict[str, Any][source]

Build configuration dictionary from command-line arguments.

Parameters:

args – Parsed command-line arguments.

Returns:

Dictionary containing configuration options.

sequifier.sequifier.main() None[source]

Main function to run the Sequifier CLI.

sequifier.sequifier.setup_parser() ArgumentParser[source]

Set up the argument parser for the command-line interface.

Returns:

Configured ArgumentParser object.

class sequifier.preprocess.Preprocessor(project_path: str, data_path: str, read_format: str, write_format: str, combine_into_single_file: bool, selected_columns: list[str] | None, group_proportions: list[float], seq_length: int, seq_step_sizes: list[int], max_rows: int | None, seed: int, n_cores: int | None, batches_per_file: int, process_by_file: bool)[source]

A class for preprocessing data for the sequifier model.

This class handles loading, preprocessing, and saving data. It supports single-file and multi-file processing, and can handle large datasets by processing them in batches.

project_path

The path to the sequifier project directory.

Type:

str

batches_per_file

The number of batches to process per file.

Type:

int

data_name_root

The root name of the data file.

Type:

str

combine_into_single_file

Whether to combine the output into a single file.

Type:

bool

target_dir

The target directory for temporary files.

Type:

str

seed

The random seed for reproducibility.

Type:

int

n_cores

The number of cores to use for parallel processing.

Type:

int

split_paths

The paths to the output split files.

Type:

list[str]

__init__(project_path: str, data_path: str, read_format: str, write_format: str, combine_into_single_file: bool, selected_columns: list[str] | None, group_proportions: list[float], seq_length: int, seq_step_sizes: list[int], max_rows: int | None, seed: int, n_cores: int | None, batches_per_file: int, process_by_file: bool)[source]

Initializes the Preprocessor with the given parameters.

Parameters:
  • project_path – The path to the sequifier project directory.

  • data_path – The path to the input data file.

  • read_format – The file type of the input data.

  • write_format – The file type for the preprocessed output data.

  • combine_into_single_file – Whether to combine the output into a single file.

  • selected_columns – A list of columns to be included in the preprocessing.

  • group_proportions – A list of floats that define the relative sizes of data splits.

  • seq_length – The sequence length for the model inputs.

  • seq_step_sizes – A list of step sizes for creating subsequences.

  • max_rows – The maximum number of input rows to process.

  • seed – A random seed for reproducibility.

  • n_cores – The number of CPU cores to use for parallel processing.

  • batches_per_file – The number of batches to process per file.

  • process_by_file – A flag to indicate if processing should be done file by file.

sequifier.preprocess.cast_columns_to_string(data: DataFrame) DataFrame[source]

Casts the column names of a Polars DataFrame to strings.

This is often necessary because Polars schemas may use integer-like column names (e.g., ‘0’, ‘1’, ‘2’, …), which must be strings for some operations.

Parameters:

data – The Polars DataFrame.

Returns:

The same DataFrame with its columns attribute modified.

sequifier.preprocess.combine_maps(map1: dict[str | int, int], map2: dict[str | int, int]) dict[str | int, int][source]

Combines two ID maps into a new, consolidated map.

Takes all unique keys from both map1 and map2, sorts them, and creates a new, single map where keys are mapped to 1-based indices based on the sorted order. This ensures a consistent mapping across different data chunks.

Parameters:
  • map1 – The first ID map.

  • map2 – The second ID map.

Returns:

A new, combined, and re-indexed ID map.
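
For example, combine_maps({"a": 1, "b": 2}, {"b": 1, "c": 2}) returns {"a": 1, "b": 2, "c": 3}: the union of keys is sorted and re-indexed from 1, regardless of the indices in the input maps.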

sequifier.preprocess.combine_multiprocessing_outputs(project_path: str, target_dir: str, n_splits: int, input_files: dict[int, list[str]], dataset_name: str, write_format: str, in_target_dir: bool = False, pre_split_str: str | None = None, post_split_str: str | None = None) None[source]

Combines multiple intermediate batch files into final split files.

This function iterates through each split and combines all the intermediate files listed in input_files[split] into a single final output file for that split.

  • For “csv” format, it uses the csvstack command-line utility.

  • For “parquet” format, it uses pyarrow.parquet.ParquetWriter to concatenate the files efficiently.

Parameters:
  • project_path – The path to the sequifier project directory.

  • target_dir – The temporary directory containing intermediate files.

  • n_splits – The number of data splits.

  • input_files – A dictionary mapping split index (int) to a list of input file paths (str) for that split.

  • dataset_name – The root name for the final output files.

  • write_format – The file format (“csv” or “parquet”).

  • in_target_dir – If True, the final combined file is written inside target_dir. If False, it’s written to data/.

  • pre_split_str – An optional string to insert into the filename before the “-split{i}” part.

  • post_split_str – An optional string to insert into the filename after the “-split{i}” part.

sequifier.preprocess.combine_parquet_files(files: list[str], out_path: str) None[source]

Combines multiple Parquet files into a single Parquet file.

This function reads the schema from the first file and uses it to initialize a ParquetWriter. It then iterates through all files in the list, reading each one as a table and writing it to the new combined file. This is more memory-efficient than reading all files into one large table first.

Parameters:
  • files – A list of paths to the Parquet files to combine.

  • out_path – The path for the combined output Parquet file.

sequifier.preprocess.create_file_paths_for_multiple_files1(project_path: str, target_dir: str, n_splits: int, n_batches: int, process_id: int, file_index: int, dataset_name: str, write_format: str) dict[int, list[str]][source]

Creates a dictionary of temporary file paths for a specific data file.

This is used in the multi-file, combine_into_single_file=True workflow. It generates file path names for intermediate batches before they are combined.

The naming pattern is: {dataset_name}-{process_id}-{file_index}-split{split}-{batch_id}.{write_format}

Parameters:
  • project_path – The path to the sequifier project directory.

  • target_dir – The temporary directory to place files in.

  • n_splits – The number of data splits.

  • n_batches – The number of batches created by the process.

  • process_id – The ID of the multiprocessing worker.

  • file_index – The index of the file being processed by this worker.

  • dataset_name – The root name of the dataset.

  • write_format – The file extension.

Returns:

A dictionary mapping a split index (int) to a list of file paths (str) for all batches in that split.

sequifier.preprocess.create_file_paths_for_multiple_files2(project_path: str, target_dir: str, n_splits: int, n_processes: int, n_files: dict[int, int], dataset_name: str, write_format: str) dict[int, list[str]][source]

Creates a dictionary of intermediate file paths for a multi-file run.

This is used in the multi-file, combine_into_single_file=True workflow. It generates the file paths for the combined files from each process, which are the inputs to the final combination step.

The naming pattern is: {dataset_name}-{process_id}-{file_index}-split{split}.{write_format}

Parameters:
  • project_path – The path to the sequifier project directory.

  • target_dir – The temporary directory where files are located.

  • n_splits – The number of data splits.

  • n_processes – The total number of multiprocessing workers.

  • n_files – A dictionary mapping process_id to the number of files that process handled.

  • dataset_name – The root name of the dataset.

  • write_format – The file extension.

Returns:

A dictionary mapping a split index (int) to a list of all intermediate combined file paths (str) for that split.

sequifier.preprocess.create_file_paths_for_single_file(project_path: str, target_dir: str, n_splits: int, n_batches: int, dataset_name: str, write_format: str) dict[int, list[str]][source]

Creates a dictionary of temporary file paths for a single-file run.

This is used in the single-file, combine_into_single_file=True workflow. It generates file path names for intermediate batches created by different processes before they are combined.

The naming pattern is: {dataset_name}-split{split}-{core_id}.{write_format}

Parameters:
  • project_path – The path to the sequifier project directory.

  • target_dir – The temporary directory to place files in.

  • n_splits – The number of data splits.

  • n_batches – The number of processes (batches) running in parallel.

  • dataset_name – The root name of the dataset.

  • write_format – The file extension.

Returns:

A dictionary mapping a split index (int) to a list of file paths (str) for all batches in that split.

sequifier.preprocess.create_id_map(data: DataFrame, column: str) dict[str | int, int][source]

Creates a map from unique values in a column to integer indices.

Finds all unique values in the specified column of the data DataFrame, sorts them, and creates a dictionary mapping each unique value to a 1-based integer index.

Parameters:
  • data – The Polars DataFrame containing the column.

  • column – The name of the column to map.

Returns:

A dictionary mapping unique values (str or int) to an integer index (starting from 1).
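
Example

A small illustration; per the description above, unique values are sorted and mapped to 1-based indices.

import polars as pl

from sequifier.preprocess import create_id_map

df = pl.DataFrame({"itemId": ["b", "a", "c", "a"]})
print(create_id_map(df, "itemId"))  # {'a': 1, 'b': 2, 'c': 3}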

sequifier.preprocess.delete_files(files: list[str] | dict[int, list[str]]) None[source]

Deletes a list of files from the filesystem.

Parameters:

files – A list of file paths to delete, or a dictionary whose values are lists of file paths to delete.

sequifier.preprocess.extract_sequences(data: DataFrame, schema: Any, seq_length: int, seq_step_size: int, columns: list[str]) DataFrame[source]

Extracts subsequences from a DataFrame of full sequences.

This function takes a DataFrame where each row contains all items for a single sequenceId. It iterates through each sequenceId, extracts all possible subsequences of seq_length using the specified seq_step_size, calculates the starting position of each subsequence within the original sequence, and formats them into a new, long-format DataFrame that conforms to the provided schema.

Parameters:
  • data – The input Polars DataFrame, grouped by “sequenceId”.

  • schema – The schema for the output long-format DataFrame.

  • seq_length – The length of the subsequences to extract.

  • seq_step_size – The step size to use when sliding the window to create subsequences.

  • columns – A list of the data column names (features) to extract.

Returns:

A new, long-format Polars DataFrame containing the extracted subsequences, matching the provided schema. Includes columns for sequenceId, subsequenceId, startItemPosition, inputCol, and the sequence items (‘0’, ‘1’, …).

sequifier.preprocess.extract_subsequences(in_seq: dict[str, list], seq_length: int, seq_step_size: int, columns: list[str]) dict[str, list[list[float | int]]][source]

Extracts subsequences from a dictionary of sequence lists.

This function takes a dictionary in_seq where keys are column names and values are lists of items for a single full sequence. It first pads the sequences with 0s at the beginning if they are shorter than seq_length. Then, it calculates the subsequence start indices using get_subsequence_starts and extracts all subsequences.

Parameters:
  • in_seq – A dictionary mapping column names to lists of items (e.g., {‘col_A’: [1, 2, 3, 4, 5], ‘col_B’: [6, 7, 8, 9, 10]}).

  • seq_length – The length of the subsequences to extract.

  • seq_step_size – The desired step size between subsequences.

  • columns – A list of the column names (keys in in_seq) to process.

Returns:

A dictionary mapping column names to a list of lists, where each inner list is a subsequence.
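
Example

An illustration of the output shape. The exact start indices are chosen by get_subsequence_starts, so the windows shown are plausible rather than guaranteed.

from sequifier.preprocess import extract_subsequences

subsequences = extract_subsequences(
    in_seq={"col_A": [1, 2, 3, 4, 5]},
    seq_length=3,
    seq_step_size=2,
    columns=["col_A"],
)
# plausibly: {"col_A": [[1, 2, 3], [3, 4, 5]]}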

sequifier.preprocess.get_batch_limits(data: DataFrame, n_batches: int) list[tuple[int, int]][source]

Calculates row indices to split a DataFrame into batches.

This function divides the DataFrame into n_batches roughly equal chunks. Crucially, it ensures that no sequenceId is split across two different batches. It does this by finding the ideal split points and then adjusting them to the nearest sequenceId boundary.

Parameters:
  • data – The DataFrame to split. Must be sorted by “sequenceId”.

  • n_batches – The desired number of batches.

Returns:

A list of (start_index, end_index) tuples, where each tuple defines the row indices for a batch.

sequifier.preprocess.get_combined_statistics(n1: int, mean1: float, std1: float, n2: int, mean2: float, std2: float) tuple[float, float][source]

Calculates the combined mean and standard deviation of two data subsets.

Uses a stable parallel algorithm (related to Welford’s algorithm) to combine statistics from two subsets without needing the original data.

Parameters:
  • n1 – Number of samples in subset 1.

  • mean1 – Mean of subset 1.

  • std1 – Standard deviation of subset 1.

  • n2 – Number of samples in subset 2.

  • mean2 – Mean of subset 2.

  • std2 – Standard deviation of subset 2.

Returns:

A tuple (combined_mean, combined_std) containing the combined mean and standard deviation of the two subsets.
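
Example

A sketch verifying the combined statistics against NumPy on the concatenated data; it assumes population standard deviations (ddof=0), which is an assumption about the implementation.

import numpy as np

from sequifier.preprocess import get_combined_statistics

a = np.random.randn(100)
b = np.random.randn(50) + 1.0

mean, std = get_combined_statistics(len(a), a.mean(), a.std(), len(b), b.mean(), b.std())

combined = np.concatenate([a, b])
assert np.isclose(mean, combined.mean())
assert np.isclose(std, combined.std())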

sequifier.preprocess.get_group_bounds(data_subset: DataFrame, group_proportions: list[float])[source]

Calculates row indices for splitting a sequence into groups.

This function takes a DataFrame data_subset (which typically contains all items for a single sequenceId) and calculates the row indices to split it into multiple groups (e.g., train, val, test) based on the provided group_proportions.

Parameters:
  • data_subset – The DataFrame (for a single sequence) to split.

  • group_proportions – A list of floats (e.g., [0.8, 0.1, 0.1]) that sum to 1.0, defining the relative sizes of the splits.

Returns:

A list of (start_index, end_index) tuples, one for each proportion, defining the row slices for each group.
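
For example, a 10-row sequence with group_proportions=[0.8, 0.1, 0.1] would plausibly be split into the bounds [(0, 8), (8, 9), (9, 10)], subject to how the implementation rounds fractional boundaries.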

sequifier.preprocess.get_subsequence_starts(in_seq_length: int, seq_length: int, seq_step_size: int) ndarray[source]

Calculates the start indices for extracting subsequences.

This function determines the starting indices for sliding a window of seq_length over an input sequence of in_seq_length. It aims to use seq_step_size, but adjusts the step size slightly to ensure that the windows are distributed as evenly as possible and cover the full sequence from the beginning to the end.

Parameters:
  • in_seq_length – The length of the original input sequence.

  • seq_length – The length of the subsequences to extract.

  • seq_step_size – The desired step size between subsequences.

Returns:

A numpy array of integer start indices for each subsequence.
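
For example, with in_seq_length=10, seq_length=4, and seq_step_size=3, valid start indices run from 0 to 6; distributing the windows evenly plausibly yields the starts [0, 3, 6], so that the final window ends exactly at the end of the sequence.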

sequifier.preprocess.insert_top_folder(path: str, folder_name: str) str[source]

Inserts a directory name into a file path, just before the filename.

Example

insert_top_folder("a/b/c.txt", "temp") returns "a/b/temp/c.txt"

Parameters:
  • path – The original file path.

  • folder_name – The name of the folder to insert.

Returns:

The new path string with the folder inserted.

sequifier.preprocess.preprocess(args: Any, args_config: dict[str, Any]) None[source]

Runs the main data preprocessing pipeline.

This function loads the preprocessing configuration, initializes the Preprocessor class, and executes the preprocessing steps based on the loaded configuration.

Parameters:
  • args – An object containing command-line arguments. Expected to have a config_path attribute specifying the path to the YAML configuration file.

  • args_config – A dictionary containing additional configuration parameters that may override or supplement the settings loaded from the config file.

sequifier.preprocess.preprocess_batch(project_path: str, data_name_root: str, process_id: int, batch: DataFrame, schema: Any, split_paths: list[str], seq_length: int, seq_step_sizes: list[int], data_columns: list[str], col_types: dict[str, str], group_proportions: list[float], target_dir: str, write_format: str, batches_per_file: int) None[source]

Processes a batch of data.

Parameters:
  • project_path – The path to the sequifier project directory.

  • data_name_root – The root name of the data file.

  • process_id – The id of the process.

  • batch – The batch of data to process.

  • schema – The schema for the preprocessed data.

  • split_paths – The paths to the output split files.

  • seq_length – The sequence length for the model inputs.

  • seq_step_sizes – A list of step sizes for creating subsequences.

  • data_columns – A list of data columns.

  • col_types – A dictionary containing the column types.

  • group_proportions – A list of floats that define the relative sizes of data splits.

  • target_dir – The target directory for temporary files.

  • write_format – The file format for the output files.

  • batches_per_file – The number of batches to process per file.

sequifier.preprocess.process_and_write_data_pt(data: DataFrame, seq_length: int, path: str, column_types: dict[str, str])[source]

Processes the sequence DataFrame and writes it to a .pt file.

This function takes the long-format sequence DataFrame (data), aggregates it by sequenceId and subsequenceId, and pivots it so that each inputCol becomes its own column containing a list of sequence items. It also extracts the startItemPosition.

It then converts these lists into NumPy arrays, splits them into sequences (all but last item) and targets (all but first item), and converts them to PyTorch tensors along with sequence/subsequence IDs and start positions. The final data tuple (sequences_dict, targets_dict, sequence_ids_tensor, subsequence_ids_tensor, start_item_positions_tensor) is saved to a .pt file using torch.save.

Parameters:
  • data – The long-format Polars DataFrame of extracted sequences.

  • seq_length – The total sequence length (N). The resulting tensors will have sequence length N-1.

  • path – The output file path (e.g., “data/batch_0.pt”).

  • column_types – A dictionary mapping column names to their string data types, used to determine the correct torch dtype.

class sequifier.train.TransformerEmbeddingModel(transformer_model: TransformerModel)[source]

A wrapper around the TransformerModel to expose the embedding functionality.

__init__(transformer_model: TransformerModel)[source]

Initializes the TransformerEmbeddingModel.

Parameters:

transformer_model – The TransformerModel to wrap.

forward(src: dict[str, Tensor])[source]

Forward pass for the embedding model.

Parameters:

src – The input data.

Returns:

The embedded output.

class sequifier.train.TransformerModel(hparams: Any, rank: int | None = None)[source]

The main Transformer model for the sequifier.

This class implements the Transformer model, including the training and evaluation loops, as well as the export functionality.

__init__(hparams: Any, rank: int | None = None)[source]

Initializes the TransformerModel.

Based on the hyperparameters, this initializes:
  • Embeddings for categorical and real features (self.encoder)

  • Positional encoders (self.pos_encoder)

  • The main TransformerEncoder (self.transformer_encoder)

  • Output decoders for each target column (self.decoder)

  • Loss functions (self.criterion)

  • Optimizer (self.optimizer) and scheduler (self.scheduler)

Parameters:
  • hparams – The hyperparameters for the model (e.g., from TrainModel config).

  • rank – The rank of the current process (for distributed training).

apply_softmax(target_column: str, output: Tensor) Tensor[source]

Applies softmax to the output of the decoder.

If the target is real, it returns the output unchanged. If the target is categorical, it applies LogSoftmax.

Parameters:
  • target_column – The name of the target column.

  • output – The decoded output tensor (logits or real value).

Returns:

The output tensor, with LogSoftmax applied if categorical.

decode(target_column: str, output: Tensor) Tensor[source]

Decodes the output of the transformer encoder.

Applies the appropriate final linear layer for a given target column.

Parameters:
  • target_column – The name of the target column to decode.

  • output – The raw output tensor from the TransformerEncoder (seq_length, batch_size, d_model).

Returns:

The decoded output (logits or real value) for the target column (seq_length, batch_size, n_classes/1).

forward(src: dict[str, Tensor]) dict[str, Tensor][source]

The main forward pass of the model.

This is typically used for inference/evaluation, returning the probabilities/values for the last token in the sequence.

Parameters:

src – A dictionary mapping column names to input tensors (batch_size, seq_length).

Returns:

A dictionary mapping target column names to their final output (LogSoftmax probabilities or real values) for the last token (batch_size, n_classes/1).

forward_embed(src: dict[str, Tensor]) Tensor[source]

Forward pass for the embedding model.

This returns only the embedding from the last token in the sequence.

Parameters:

src – A dictionary mapping column names to input tensors (batch_size, seq_length).

Returns:

The embedding tensor for the last token (batch_size, d_model).

forward_inner(src: dict[str, Tensor]) Tensor[source]

The inner forward pass of the model.

This handles embedding lookup, positional encoding, and passing the combined tensor through the transformer encoder.

Parameters:

src – A dictionary mapping column names to input tensors (batch_size, seq_length).

Returns:

The raw output tensor from the TransformerEncoder (seq_length, batch_size, d_model).

forward_train(src: dict[str, Tensor]) dict[str, Tensor][source]

Forward pass for training.

This runs the inner forward pass and then applies the appropriate decoder for each target column.

Parameters:

src – A dictionary mapping column names to input tensors (batch_size, seq_length).

Returns:

A dictionary mapping target column names to their raw output (logit) tensors (seq_length, batch_size, n_classes/1).

train_model(train_loader: DataLoader, valid_loader: DataLoader, train_sampler: RandomSampler | DistributedSampler | DistributedGroupedRandomSampler | None, valid_sampler: RandomSampler | DistributedSampler | DistributedGroupedRandomSampler | None) None[source]

Trains the model.

This method contains the main training loop, including epoch iteration, validation, early stopping logic, and model saving/exporting.

Parameters:
  • train_loader – DataLoader for the training dataset.

  • valid_loader – DataLoader for the validation dataset.

  • train_sampler – Sampler for the training DataLoader, used to set the epoch in distributed training.

  • valid_sampler – Sampler for the validation DataLoader, used to set the epoch in distributed training.

sequifier.train.cleanup()[source]

Cleans up the distributed training environment.

sequifier.train.format_number(number: int | float | float32) str[source]

Format a number for display.

Parameters:

number – The number to format.

Returns:

A formatted string representation of the number.

sequifier.train.infer_with_embedding_model(model: Module, x: list[dict[str, ndarray]], device: str, size: int, target_columns: list[str]) ndarray[source]

Performs inference with an embedding model.

Parameters:
  • model – The loaded TransformerEmbeddingModel.

  • x – A list of input data dictionaries (batched).

  • device – The device to run inference on.

  • size – The total number of samples (unused in this function).

  • target_columns – List of target column names (unused in this function).

Returns:

A NumPy array containing the concatenated embeddings from all batches.

sequifier.train.infer_with_generative_model(model: Module, x: list[dict[str, ndarray]], device: str, size: int, target_columns: list[str]) dict[str, ndarray][source]

Performs inference with a generative model.

Parameters:
  • model – The loaded TransformerModel.

  • x – A list of input data dictionaries (batched).

  • device – The device to run inference on.

  • size – The total number of samples to trim the final output to.

  • target_columns – List of target column names to extract from the output.

Returns:

A dictionary mapping target column names to their concatenated output NumPy arrays, trimmed to size.

sequifier.train.load_inference_model(model_type: str, model_path: str, training_config_path: str, args_config: dict[str, Any], device: str, infer_with_dropout: bool) Module[source]

Loads a trained model for inference.

Parameters:
  • model_type – “generative” or “embedding”.

  • model_path – Path to the saved .pt model file.

  • training_config_path – Path to the .yaml config file used for training.

  • args_config – A dictionary of override configurations.

  • device – The device to load the model onto (e.g., “cuda”, “cpu”).

  • infer_with_dropout – Whether to force dropout layers to be active during inference.

Returns:

The loaded and compiled torch.nn.Module (TransformerModel or TransformerEmbeddingModel) in evaluation mode.
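
Example

A sketch of loading an exported model for inference; both file paths are hypothetical.

from sequifier.train import load_inference_model

model = load_inference_model(
    model_type="generative",
    model_path="models/sequifier-model.pt",   # hypothetical checkpoint path
    training_config_path="configs/train.yaml",
    args_config={},
    device="cpu",
    infer_with_dropout=False,
)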

sequifier.train.setup(rank: int, world_size: int, backend: str = 'nccl')[source]

Sets up the distributed training environment.

Parameters:
  • rank – The rank of the current process.

  • world_size – The total number of processes.

  • backend – The distributed backend to use.

sequifier.train.train(args: Any, args_config: dict[str, Any]) None[source]

The main training function.

Parameters:
  • args – The command-line arguments.

  • args_config – The configuration dictionary.

sequifier.train.train_worker(rank: int, world_size: int, config: TrainModel, from_folder: bool)[source]

The worker function for distributed training.

Parameters:
  • rank – The rank of the current process.

  • world_size – The total number of processes.

  • config – The training configuration.

  • from_folder – Whether to load data from a folder (e.g., preprocessed .pt files) or a single file (e.g., .parquet).

class sequifier.infer.Inferer(model_type: str, model_path: str, project_path: str, id_maps: dict[str, dict[str | int, int]] | None, selected_columns_statistics: dict[str, dict[str, float]], map_to_id: bool, categorical_columns: list[str], real_columns: list[str], selected_columns: list[str] | None, target_columns: list[str], target_column_types: dict[str, str], sample_from_distribution_columns: list[str] | None, infer_with_dropout: bool, inference_batch_size: int, device: str, args_config: dict[str, Any], training_config_path: str)[source]

A class for performing inference with a trained sequifier model.

This class encapsulates the model (either ONNX session or PyTorch model), normalization statistics, ID mappings, and all configuration needed to run inference. It provides methods to handle batching, model-specific inference calls (PyTorch vs. ONNX), and post-processing (like inverting normalization).

model_type

‘generative’ or ‘embedding’.

map_to_id

Whether to map integer predictions back to original IDs.

selected_columns_statistics

Dict of ‘mean’ and ‘std’ for real columns.

index_map

The inverse of id_maps, for mapping indices back to values.

device

The device (‘cuda’ or ‘cpu’) for inference.

target_columns

List of columns the model predicts.

target_column_types

Dict mapping target columns to ‘categorical’ or ‘real’.

inference_model_type

‘onnx’ or ‘pt’.

ort_session

onnxruntime.InferenceSession if using ONNX.

inference_model

The loaded PyTorch model if using ‘pt’.

__init__(model_type: str, model_path: str, project_path: str, id_maps: dict[str, dict[str | int, int]] | None, selected_columns_statistics: dict[str, dict[str, float]], map_to_id: bool, categorical_columns: list[str], real_columns: list[str], selected_columns: list[str] | None, target_columns: list[str], target_column_types: dict[str, str], sample_from_distribution_columns: list[str] | None, infer_with_dropout: bool, inference_batch_size: int, device: str, args_config: dict[str, Any], training_config_path: str)[source]

Initializes the Inferer.

Parameters:
  • model_type – The type of model to use for inference.

  • model_path – The path to the trained model.

  • project_path – The path to the sequifier project directory.

  • id_maps – A dictionary of id maps for categorical columns.

  • selected_columns_statistics – A dictionary of statistics for numerical columns.

  • map_to_id – Whether to map the output to the original ids.

  • categorical_columns – A list of categorical columns.

  • real_columns – A list of real columns.

  • selected_columns – A list of selected columns.

  • target_columns – A list of target columns.

  • target_column_types – A dictionary of target column types.

  • sample_from_distribution_columns – A list of columns to sample from the distribution.

  • infer_with_dropout – Whether to use dropout during inference.

  • inference_batch_size – The batch size for inference.

  • device – The device to use for inference.

  • args_config – A dictionary of configuration overrides from command-line arguments.

  • training_config_path – The path to the training configuration file.

adjust_and_infer_embedding(x: dict[str, ndarray], size: int)[source]

Handles batching and backend-specific calls for embedding inference.

This function prepares the input data x into batches using prepare_inference_batches and then calls the correct inference backend based on self.inference_model_type (.pt or .onnx).

Parameters:
  • x – The complete dictionary of input features (NumPy arrays).

  • size – The total number of samples in x, used to truncate any padding added for batching.

Returns:

A NumPy array of embeddings, concatenated from all batches.

adjust_and_infer_generative(x: dict[str, ndarray], size: int)[source]

Handles batching and backend-specific calls for generative inference.

This function prepares the input data x into batches using prepare_inference_batches and then calls the correct inference backend based on self.inference_model_type (.pt or .onnx). It aggregates the results from all batches.

Parameters:
  • x – The complete dictionary of input features (NumPy arrays).

  • size – The total number of samples in x, used to truncate any padding added for batching.

Returns:

A dictionary mapping target column names to NumPy arrays of raw model outputs (logits or real values).

expand_to_batch_size(x: ndarray) ndarray[source]

Pads a NumPy array to match self.inference_batch_size.

Repeats samples from x until the array’s first dimension is equal to self.inference_batch_size.

Parameters:

x – The input NumPy array to pad.

Returns:

A new NumPy array of size self.inference_batch_size in the first dimension.

infer_embedding(x: dict[str, ndarray]) ndarray[source]

Performs inference with an embedding model.

This is a high-level wrapper that calls adjust_and_infer_embedding to handle batching and model-specific logic.

Parameters:

x – A dictionary mapping feature names to NumPy arrays. All arrays must have the same first dimension (batch size).

Returns:

A 2D NumPy array of the resulting embeddings.

infer_generative(x: dict[str, ndarray] | None, probs: dict[str, ndarray] | None = None, return_probs: bool = False) dict[str, ndarray][source]

Performs generative inference, returning probabilities or predictions.

This function orchestrates the generative inference process:

  1. If probs are not provided, it calls adjust_and_infer_generative to get the raw model output (logits or real values) using x.

  2. If return_probs is True, it normalizes the logits for categorical columns to get probabilities (using softmax, implemented in normalize) and returns a dictionary of probabilities (for categorical targets) and raw predicted values (for real targets).

  3. If return_probs is False (default), it converts the model outputs (either from x or probs) into final predictions: for categorical columns, it either takes the argmax or samples from the distribution (sample_with_cumsum); for real columns, it returns the value as-is.

Parameters:
  • x – A dictionary mapping feature names to NumPy arrays. Required if probs is not provided.

  • probs – An optional dictionary of probabilities/logits. If provided, this skips the model inference step.

  • return_probs – If True, returns normalized probabilities for categorical targets. If False, returns final class predictions (via argmax or sampling).

Returns:

A dictionary mapping target column names to NumPy arrays. The content of the arrays depends on return_probs.

infer_pure(x: dict[str, ndarray]) list[ndarray][source]

Performs a single inference pass using the ONNX session.

This function assumes x is already a single, correctly-sized batch. It formats the input dictionary to match the ONNX model’s input names and executes self.ort_session.run().

Parameters:

x – A dictionary of feature arrays for a single batch. This batch must be of size self.inference_batch_size.

Returns:

A list of NumPy arrays, representing the raw outputs from the ONNX model.

invert_normalization(values: ndarray, target_column: str) ndarray[source]

Inverts Z-score normalization for a given target column.

Uses the ‘mean’ and ‘std’ stored in self.selected_columns_statistics to transform normalized values back to their original scale.

Parameters:
  • values – A NumPy array of normalized values.

  • target_column – The name of the column whose statistics should be used for the inverse transformation.

Returns:

A NumPy array of values in their original scale.

prepare_inference_batches(x: dict[str, ndarray], pad_to_batch_size: bool) list[dict[str, ndarray]][source]

Splits input data into batches for inference.

This function takes a large dictionary of feature arrays and splits them into a list of smaller dictionaries (batches) of size self.inference_batch_size.

Parameters:
  • x – A dictionary of feature arrays.

  • pad_to_batch_size – If True (for ONNX), the last batch will be padded up to self.inference_batch_size by repeating samples. If False (for PyTorch), the last batch may be smaller.

Returns:

A list of dictionaries, where each dictionary is a single batch ready for inference.

sequifier.infer.expand_data_by_autoregression(data: DataFrame, autoregression_extra_steps: int, seq_length: int) DataFrame[source]

Expands a Polars DataFrame for autoregressive inference.

This function takes a DataFrame of sequences and adds autoregression_extra_steps new rows for each sequence. These new rows represent future time steps to be predicted.

For each new step, it:

  1. Copies the last known observation for a sequence.

  2. Increments the subsequenceId.

  3. Shifts the historical data columns (e.g., ‘1’, ‘2’, …, ‘50’) one position “older” (e.g., old ‘1’ becomes new ‘2’, old ‘49’ becomes new ‘50’).

  4. Fills the “newest” columns (e.g., new ‘1’ for the first extra step) with np.inf as a placeholder for the prediction.

Parameters:
  • data – The input Polars DataFrame, sorted by sequenceId and subsequenceId.

  • autoregression_extra_steps – The number of future time steps to add to each sequence.

  • seq_length – The sequence length, used to identify the historical data columns (named ‘1’ through seq_length).

Returns:

A new Polars DataFrame containing all original rows plus the newly generated future rows with placeholders.
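
The shift is easiest to picture on a single row. A plain-Python sketch with a hypothetical seq_length of 3:

```python
import math

seq_length = 3
row = {"sequenceId": 0, "subsequenceId": 7, "3": 1.2, "2": 3.4, "1": 5.6}

new_row = dict(row)                  # copy the last known observation
new_row["subsequenceId"] += 1        # increment the subsequenceId
for i in range(seq_length, 1, -1):   # shift "older": old '2' -> new '3', old '1' -> new '2'
    new_row[str(i)] = row[str(i - 1)]
new_row["1"] = math.inf              # placeholder for the prediction (np.inf in the real function)
```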

sequifier.infer.fill_in_predictions_pl(data: DataFrame, preds: dict[str, ndarray], current_subsequence_id: int, sequence_ids_present: Series, seq_length: int) DataFrame[source]

Fills in predictions into the main Polars DataFrame using a robust, join-based approach that preserves the original DataFrame’s structure.

This function broadcasts predictions to all relevant future rows via a join, then uses conditional expressions to update only the specific placeholder cells (np.inf) that correspond to the correct future time step.

Parameters:
  • data – The main DataFrame containing all sequences.

  • preds – A dictionary of new predictions, mapping target column names to NumPy arrays.

  • current_subsequence_id – The adjusted subsequence ID at which predictions were made.

  • sequence_ids_present – A Polars Series of the sequence IDs in the current batch.

  • seq_length – The length of the sequence.

Returns:

An updated Polars DataFrame with the same dimensions as the input, with future placeholder values filled in.

sequifier.infer.fill_number(number: int | float, max_length: int) str[source]

Pads a number with leading zeros to a specified string length.

Used for creating sortable string keys (e.g., “001-001”, “001-002”).

Parameters:
  • number – The integer or float to format.

  • max_length – The total desired length of the output string.

Returns:

A string representation of the number, padded with leading zeros.
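
One plausible implementation using str.zfill (a sketch, not necessarily the library's exact code):

```python
def fill_number(number, max_length: int) -> str:
    """Pad a number with leading zeros to a fixed string length."""
    return str(number).zfill(max_length)

key = f"{fill_number(1, 3)}-{fill_number(2, 3)}"  # "001-002", a sortable string key
```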

sequifier.infer.format_delta(time_delta: timedelta) str[source]

Formats a timedelta object into a human-readable string (seconds).

Parameters:

time_delta – The timedelta object to format.

Returns:

A string representing the total seconds with 3 decimal places.
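
A sketch assuming only what the docstring states (total seconds, three decimal places):

```python
from datetime import timedelta

def format_delta(time_delta: timedelta) -> str:
    # total elapsed seconds, formatted to 3 decimal places
    return f"{time_delta.total_seconds():.3f}"

format_delta(timedelta(seconds=1, microseconds=500000))  # "1.500"
```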

sequifier.infer.get_embeddings(config: Any, inferer: Inferer, data: DataFrame, column_types: dict[str, dtype]) ndarray[source]

Generates embeddings from a Polars DataFrame.

This function converts a Polars DataFrame into the NumPy array dictionary format expected by the Inferer. It uses numpy_to_pytorch for the main conversion, then transforms the tensors to NumPy arrays before passing them to inferer.infer_embedding.

Parameters:
  • config – The InfererModel configuration object.

  • inferer – The initialized Inferer instance.

  • data – The input Polars DataFrame chunk.

  • column_types – A dictionary mapping column names to torch.dtype.

Returns:

A NumPy array containing the computed embeddings for the batch.

sequifier.infer.get_embeddings_pt(config: Any, inferer: Inferer, data: dict[str, Tensor]) ndarray[source]

Generates embeddings from a batch of PyTorch tensor data.

This function serves as a wrapper for Inferer.infer_embedding when the input data is already in PyTorch tensor format (from loading .pt files which contain sequences, targets, sequence_ids, subsequence_ids, and start_positions). It converts the tensor dictionary to a NumPy array dictionary before passing it to the inferer.

Parameters:
  • config – The InfererModel configuration object (unused, but kept for consistent function signature).

  • inferer – The initialized Inferer instance.

  • data – A dictionary mapping column/feature names to `torch.Tensor`s (the sequences part loaded from the .pt file).

Returns:

A NumPy array containing the computed embeddings for the batch.

sequifier.infer.get_probs_preds(config: Any, inferer: Inferer, data: DataFrame, column_types: dict[str, dtype]) tuple[dict[str, ndarray] | None, dict[str, ndarray]][source]

Generates predictions from a Polars DataFrame (non-autoregressive).

This function converts a Polars DataFrame into the NumPy array dictionary format expected by the Inferer. It’s used for standard, non-autoregressive generative inference. It calls inferer.infer_generative once and returns the probabilities (if requested) and predictions.

Parameters:
  • config – The InfererModel configuration object.

  • inferer – The initialized Inferer instance.

  • data – The input Polars DataFrame chunk.

  • column_types – A dictionary mapping column names to torch.dtype.

Returns:

  • probs: A dictionary mapping target columns to NumPy arrays of probabilities, or None if config.output_probabilities is False.

  • preds: A dictionary mapping target columns to NumPy arrays of final predictions.

Return type:

A tuple (probs, preds)

sequifier.infer.get_probs_preds_autoregression(config: Any, inferer: Inferer, data: DataFrame, column_types: dict[str, dtype], seq_length: int) tuple[dict[str, ndarray] | None, dict[str, ndarray], ndarray][source]

Performs autoregressive inference using a time-step-based Polars loop.

This function orchestrates the autoregressive process by iterating through each unique, adjusted time step (subsequenceIdAdjusted).

For each time step, it:

  1. Filters the main DataFrame data to get the current slice of data for all sequences at that time step.

  2. Calls get_probs_preds to generate predictions for this slice.

  3. Uses fill_in_predictions_pl to update the main data DataFrame, filling in the np.inf placeholders for the next time steps using the predictions just made.

  4. Collects the predictions and a corresponding sort key.

After iterating through all time steps, it sorts all collected predictions based on the keys (sequenceId, subsequenceId) and returns the complete, ordered results.

Parameters:
  • config – The InfererModel configuration object.

  • inferer – The initialized Inferer instance.

  • data – The input Polars DataFrame, expanded with future rows (see expand_data_by_autoregression).

  • column_types – A dictionary mapping column names to torch.dtype.

  • seq_length – The sequence length, passed to fill_in_predictions_pl.

Returns:

  • probs: A dictionary mapping target columns to sorted NumPy arrays of probabilities, or None.

  • preds: A dictionary mapping target columns to sorted NumPy arrays of final predictions.

  • sequence_ids: A NumPy array of sequenceIds corresponding to each row in the preds arrays.

Return type:

A tuple (probs, preds, sequence_ids)

sequifier.infer.get_probs_preds_pt(config: Any, inferer: Inferer, data: dict[str, Tensor], extra_steps: int = 0) tuple[dict[str, ndarray] | None, dict[str, ndarray]][source]

Generates predictions from PyTorch tensor data, supporting autoregression.

This function performs generative inference on a batch of PyTorch tensor data loaded from .pt files (which contain sequences, targets, sequence_ids, subsequence_ids, and start_positions). It implements an autoregressive loop:

  1. Runs inference on the initial data X (sequences).

  2. For each subsequent step (i in extra_steps):

    • Creates the next input X_next by shifting the previous input X and appending the prediction from the last step.

    • Runs inference on X_next.

  3. Collects and reshapes all predictions and probabilities from all steps into a single flat batch, ordered by original sample index, then by step.

Parameters:
  • config – The InfererModel configuration object, used to check output_probabilities and selected_columns.

  • inferer – The initialized Inferer instance.

  • data – A dictionary mapping column/feature names to `torch.Tensor`s (the sequences part loaded from the .pt file).

  • extra_steps – The number of additional autoregressive steps to perform. A value of 0 means simple, non-autoregressive inference.

Returns:

  • probs: A dictionary mapping target columns to NumPy arrays of probabilities, ordered by sample index then step, or None if config.output_probabilities is False.

  • preds: A dictionary mapping target columns to NumPy arrays of final predictions, ordered by sample index then step.

Return type:

A tuple (probs, preds)

sequifier.infer.infer(args: Any, args_config: dict[str, Any]) None[source]

Runs the main inference pipeline.

This function orchestrates the inference process. It loads the main inference configuration, retrieves necessary metadata like ID maps and column statistics from a ddconfig file (if required for mapping or normalization), and then delegates the core work to the infer_worker function.

Parameters:
  • args – Command-line arguments, typically from argparse. Expected to have attributes like config_path and on_unprocessed.

  • args_config – A dictionary of configuration overrides, often passed from the command line, that will be merged into the loaded configuration file.

sequifier.infer.infer_embedding(config: InfererModel, inferer: Inferer, model_id: str, dataset: list[Any] | Iterator[Any], column_types: dict[str, dtype]) None[source]

Performs inference with an embedding model and saves the results.

This function iterates through the provided dataset (which can be a list of DataFrames or an iterator of tensors). For each data chunk, it calls the appropriate function (get_embeddings or get_embeddings_pt) to generate embeddings. It then formats these embeddings into a Polars DataFrame, associating them with their sequenceId and subsequenceId, and writes the resulting DataFrame to the configured output path.

Parameters:
  • config – The InfererModel configuration object.

  • inferer – The initialized Inferer instance.

  • model_id – A string identifier for the model, used for naming output files.

  • dataset – A list containing a Polars DataFrame (for parquet/csv) or an iterator of loaded PyTorch data (for .pt files).

  • column_types – A dictionary mapping column names to their torch.dtype.

sequifier.infer.infer_generative(config: InfererModel, inferer: Inferer, model_id: str, dataset: list[Any] | Iterator[Any], column_types: dict[str, dtype])[source]

Performs inference with a generative model and saves the results.

This function manages the generative inference workflow:

  1. Iterates through the dataset (chunks).

  2. Handles data preparation, including expanding data for autoregression if configured (expand_data_by_autoregression). It also calculates the corresponding itemPosition for each prediction.

  3. Calls the correct function to get probabilities and predictions based on data format and autoregression settings (e.g., get_probs_preds_autoregression, get_probs_preds_pt).

  4. Post-processes predictions:

    • Maps integer predictions back to original IDs if map_to_id is True.

    • Inverts normalization for real-valued target columns.

  5. Saves probabilities to disk (if config.output_probabilities is True).

  6. Saves the final predictions to disk, formatted as a Polars DataFrame with sequenceId, itemPosition, and target columns.

Parameters:
  • config – The InfererModel configuration object.

  • inferer – The initialized Inferer instance.

  • model_id – A string identifier for the model, used for naming output files.

  • dataset – A list containing a Polars DataFrame (for parquet/csv) or an iterator of loaded PyTorch data (for .pt files).

  • column_types – A dictionary mapping column names to their torch.dtype.

sequifier.infer.infer_worker(config: Any, args_config: dict[str, Any], id_maps: dict[str, dict[str | int, int]] | None, selected_columns_statistics: dict[str, dict[str, float]], percentage_limits: tuple[float, float] | None)[source]

Core worker function that performs inference.

This function handles the main workflow:

  1. Loads the dataset based on config.read_format (parquet, csv, or pt).

  2. Iterates over one or more model paths specified in the config.

  3. For each model, initializes an Inferer object with all necessary configurations, mappings, and statistics.

  4. Calls the appropriate inference function (infer_generative or infer_embedding) based on the config.model_type.

  5. Manages the data iterators and passes data chunks to the inference functions.

Parameters:
  • config – The fully resolved InfererModel configuration object.

  • args_config – A dictionary of command-line arguments, passed to the Inferer for potential model loading overrides.

  • id_maps – A nested dictionary mapping categorical column names to their value-to-index maps. None if map_to_id is False.

  • selected_columns_statistics – A nested dictionary containing ‘mean’ and ‘std’ for real-valued columns used for normalization.

  • percentage_limits – A tuple (start_pct, end_pct) used only when config.read_format == “pt” to slice the dataset.

sequifier.infer.load_pt_dataset(data_path: str, start_pct: float, end_pct: float) Iterator[source]

Lazily loads and yields data from .pt files in a directory.

This function scans a directory for .pt files, sorts them, and then yields the contents of a specific slice of those files defined by a start and end percentage. This allows for processing large datasets in chunks without loading everything into memory.

Parameters:
  • data_path – The path to the folder containing the .pt files.

  • start_pct – The starting percentage (0.0 to 100.0) of the file list to begin loading from.

  • end_pct – The ending percentage (0.0 to 100.0) of the file list to stop loading at.

Yields:

Iterator – An iterator where each item is the data loaded from a single .pt file (e.g., using torch.load).
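
A minimal sketch of the percentage-sliced lazy loader; the real function may differ in file-discovery details:

```python
import os
import torch

def load_pt_dataset(data_path: str, start_pct: float, end_pct: float):
    # scan and sort the .pt files, then yield only the requested slice
    files = sorted(f for f in os.listdir(data_path) if f.endswith(".pt"))
    start = int(len(files) * start_pct / 100)
    end = int(len(files) * end_pct / 100)
    for name in files[start:end]:
        yield torch.load(os.path.join(data_path, name))  # one chunk at a time, never all in memory
```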

sequifier.infer.normalize(outs: dict[str, ndarray]) dict[str, ndarray][source]

Applies the softmax function to a dictionary of logits.

Converts raw model logits for categorical columns into probabilities that sum to 1.

Parameters:

outs – A dictionary mapping target column names to NumPy arrays of logits.

Returns:

A dictionary mapping the same target column names to NumPy arrays of probabilities.
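
A numerically stable softmax over the last axis, sketching what normalize plausibly does per column:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max(axis=-1, keepdims=True)  # subtract max to avoid overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

outs = {"category": np.array([[1.0, 2.0, 3.0]])}
probs = {col: softmax(logits) for col, logits in outs.items()}  # rows now sum to 1
```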

sequifier.infer.sample_with_cumsum(probs: ndarray) ndarray[source]

Samples from a probability distribution using the inverse CDF method.

Takes an array of logits, computes the cumulative probability distribution, draws a random number r from [0, 1), and returns the index of the first class i where cumsum[i] > r.

Parameters:

probs – A 2D NumPy array of logits (not normalized probabilities). Shape is (batch_size, num_classes).

Returns:

A 1D NumPy array of shape (batch_size,) containing the sampled class indices.
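
A self-contained sketch matching the described behavior (softmax, cumulative sum, first index where the cumulative probability exceeds a uniform draw):

```python
import numpy as np

def sample_with_cumsum(logits: np.ndarray) -> np.ndarray:
    # normalize logits, then build the per-row cumulative distribution
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    cumprobs = np.cumsum(exp / exp.sum(axis=1, keepdims=True), axis=1)
    r = np.random.rand(logits.shape[0], 1)      # one uniform draw per row
    return (cumprobs > r).argmax(axis=1)         # first class where cumsum > r

sample_with_cumsum(np.array([[0.1, 2.0, 0.3]]))  # usually class 1
```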

sequifier.infer.verify_variable_order(data: DataFrame) None[source]

Verifies that the DataFrame is correctly sorted for autoregression.

Checks two conditions:

  1. sequenceId is globally sorted in ascending order.

  2. subsequenceId is sorted in ascending order within each sequenceId group.

Parameters:

data – The Polars DataFrame to check.

Raises:

AssertionError – If sequenceId is not globally sorted or if subsequenceId is not sorted within sequenceId groups.
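
An equivalent check can be sketched in Polars by comparing the frame to its sorted form (assumes a recent Polars version with DataFrame.equals):

```python
import polars as pl

def verify_variable_order(data: pl.DataFrame) -> None:
    # a frame sorted by (sequenceId, subsequenceId) must already equal itself
    sorted_data = data.sort(["sequenceId", "subsequenceId"])
    assert data.equals(sorted_data), "data must be sorted by sequenceId, then subsequenceId"

verify_variable_order(pl.DataFrame({"sequenceId": [0, 0, 1], "subsequenceId": [1, 2, 1]}))
```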

sequifier.make.make(args)[source]

Creates a new sequifier project.

Parameters:

args – The command-line arguments.

Main function for initiating a hyperparameter search process.

This function loads the hyperparameter search configuration, initializes the searcher, and starts the search.

Parameters:
  • config_path (str) – Path to the hyperparameter search YAML configuration file.

  • on_unprocessed (bool) – Flag indicating whether to run the search on unprocessed data.

Returns:

None

class sequifier.helpers.LogFile(path: str, open_mode: str, rank: int | None = None)[source]

Manages logging to multiple files based on verbosity levels.

This class opens multiple log files based on a path template and a hardcoded list of levels (2 and 3). Messages are written to files based on their assigned level, and high-level messages are also printed to the console on the main process (rank 0).

rank

The rank of the current process, used to control console output.

Type:

Optional[int]

levels

The hardcoded list of log levels [2, 3] for which files are created.

Type:

list[int]

_files

A dictionary mapping log levels to their open file handlers.

Type:

dict[int, io.TextIOWrapper]

_path

The original path template provided.

Type:

str

__init__(path: str, open_mode: str, rank: int | None = None)[source]

Initializes the LogFile and opens log files.

The path argument should be a template containing “[NUMBER]”, which will be replaced by the log levels (2 and 3) to create separate log files.

Parameters:
  • path – The path template for the log files (e.g., “run_log_[NUMBER].txt”).

  • open_mode – The mode for opening the log files (e.g., “a”, “w”).

  • rank – The rank of the current process (e.g., in distributed training). If None or 0, high-level messages will be printed to stdout.

close() None[source]

Closes all open log file handlers.

write(string: str, level: int = 3) None[source]

Writes a string to log files and potentially the console.

The string is written to all log files whose level is less than or equal to the specified level.

  • A message with level=2 goes to file 2.

  • A message with level=3 goes to file 2 and file 3.

If level is 3 or greater, the message is also printed to stdout if self.rank is None or 0.

Parameters:
  • string – The message to log.

  • level – The verbosity level of the message. Defaults to 3.
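
A sketch of the fan-out rule, with hypothetical file handles keyed by level:

```python
def write(files: dict, rank, string: str, level: int = 3) -> None:
    # write to every log file whose level is <= the message level
    for file_level, handle in files.items():  # e.g., {2: f2, 3: f3}
        if file_level <= level:
            handle.write(string + "\n")
    # high-level messages also go to the console on the main process
    if level >= 3 and rank in (None, 0):
        print(string)
```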

sequifier.helpers.construct_index_maps(id_maps: dict[str, dict[str | int, int]] | None, target_columns_index_map: list[str], map_to_id: bool | None) dict[str, dict[int, str | int]][source]

Constructs reverse index maps (int index to original ID).

This function creates reverse mappings from the integer indices back to the original string or integer identifiers. It only performs this operation if map_to_id is True and id_maps is provided.

A special mapping for index 0 is added:

  • If original IDs are strings, 0 maps to “unknown”.

  • If original IDs are integers, 0 maps to (minimum original ID) - 1.

Parameters:
  • id_maps – A nested dictionary mapping column names to their respective ID-to-index maps (e.g., {‘col_name’: {‘original_id’: 1, …}}). Expected to be provided if map_to_id is True.

  • target_columns_index_map – A list of column names for which to construct the reverse maps.

  • map_to_id – A boolean flag. If True, the reverse maps are constructed. If False or None, an empty dictionary is returned.

Returns:

A dictionary where keys are column names from target_columns_index_map and values are the reverse maps (index-to-original-ID). Returns an empty dict if map_to_id is not True.

Raises:
  • AssertionError – If map_to_id is True but id_maps is None.

  • AssertionError – If the values of a map are not consistently string or integer (excluding the added ‘0’ key).
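
A sketch of the reversal logic for a single integer-ID column, using hypothetical data:

```python
id_maps = {"itemId": {101: 1, 102: 2, 205: 3}}  # original ID -> integer index

index_maps = {}
for col in ["itemId"]:
    reverse = {index: original for original, index in id_maps[col].items()}
    # special index 0: "unknown" for string IDs; (min original ID) - 1 for integer IDs
    reverse[0] = min(id_maps[col]) - 1
    index_maps[col] = reverse
# {'itemId': {1: 101, 2: 102, 3: 205, 0: 100}}
```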

sequifier.helpers.normalize_path(path: str, project_path: str) str[source]

Normalizes a path to be relative to a project path, then joins them.

This function ensures that a given path is correctly expressed as an absolute path rooted at project_path. It does this by first removing the project_path prefix from path (if it exists) and then joining the result back to project_path.

This is useful for handling paths that might be provided as either relative (e.g., “data/file.txt”) or absolute (e.g., “/abs/path/to/project/data/file.txt”).

Parameters:
  • path – The path to normalize.

  • project_path – The absolute path to the project’s root directory.

Returns:

A normalized, absolute path.
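
A sketch of the strip-then-join behavior (uses str.removeprefix, so Python 3.9+):

```python
import os

def normalize_path(path: str, project_path: str) -> str:
    # drop the project prefix if present, then re-root at the project path
    relative = path.removeprefix(project_path).lstrip(os.sep)
    return os.path.join(project_path, relative)

normalize_path("data/file.txt", "/abs/project")               # "/abs/project/data/file.txt"
normalize_path("/abs/project/data/file.txt", "/abs/project")  # "/abs/project/data/file.txt"
```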

sequifier.helpers.numpy_to_pytorch(data: DataFrame, column_types: dict[str, dtype], all_columns: list[str], seq_length: int) dict[str, Tensor][source]

Converts a long-format Polars DataFrame to a dict of sequence tensors.

This function assumes the input DataFrame data is in a long format where each row represents a sequence for a specific feature. It expects a column named “inputCol” that contains the feature name (e.g., ‘price’, ‘volume’) and other columns representing time steps (e.g., “0”, “1”, …, “L”).

It generates two tensors for each column in all_columns:

  1. An “input” tensor (from time steps L down to 1).

  2. A “target” tensor (from time steps L-1 down to 0).

Example

For seq_length = 3 and all_columns = [‘price’], it will create:

  • ‘price’: Tensor from columns [“3”, “2”, “1”]

  • ‘price_target’: Tensor from columns [“2”, “1”, “0”]

Parameters:
  • data – The long-format Polars DataFrame. Must contain “inputCol” and columns named as strings of integers for time steps.

  • column_types – A dictionary mapping feature names (from “inputCol”) to their desired torch.dtype.

  • all_columns – A list of all feature names (from “inputCol”) to be processed and converted into tensors.

  • seq_length – The total sequence length (L). This determines the column names for time steps (e.g., “0” to “L”).

Returns:

A dictionary mapping feature names to their corresponding PyTorch tensors. Target tensors are stored with a _target suffix (e.g., {‘price’: <tensor>, ‘price_target’: <tensor>}).
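
The example above can be sketched end to end, assuming only the documented column layout:

```python
import polars as pl
import torch

# long format: one row per (feature, sequence), time-step columns "3".."0"
data = pl.DataFrame({
    "inputCol": ["price", "price"],
    "3": [1.0, 2.0], "2": [3.0, 4.0], "1": [5.0, 6.0], "0": [7.0, 8.0],
})
seq_length = 3

rows = data.filter(pl.col("inputCol") == "price")
input_cols = [str(i) for i in range(seq_length, 0, -1)]        # ["3", "2", "1"]
target_cols = [str(i) for i in range(seq_length - 1, -1, -1)]  # ["2", "1", "0"]

tensors = {
    "price": torch.tensor(rows.select(input_cols).to_numpy(), dtype=torch.float32),
    "price_target": torch.tensor(rows.select(target_cols).to_numpy(), dtype=torch.float32),
}
```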

sequifier.helpers.read_data(path: str, read_format: str, columns: list[str] | None = None) DataFrame[source]

Reads data from a CSV or Parquet file into a Polars DataFrame.

Parameters:
  • path – The file path to read from.

  • read_format – The format of the file. Supported formats are “csv” and “parquet”.

  • columns – An optional list of column names to read. This argument is only used when read_format is “parquet”.

Returns:

A Polars DataFrame containing the data from the file.

Raises:

ValueError – If read_format is not “csv” or “parquet”.

sequifier.helpers.subset_to_selected_columns(data: DataFrame | LazyFrame, selected_columns: list[str]) DataFrame | LazyFrame[source]

Filters a DataFrame to rows where ‘inputCol’ is in a selected list.

This function supports both Polars (DataFrame, LazyFrame) and Pandas DataFrames, dispatching to the appropriate filtering method.

  • For Polars objects, it uses data.filter(pl.col(“inputCol”).is_in(…)).

  • For other objects (presumably Pandas), it builds a numpy boolean mask and filters using data.loc[…].

Note: The type hint only specifies Polars objects, but the implementation includes a fallback path for Pandas-like objects.

Parameters:
  • data – The Polars (DataFrame, LazyFrame) or Pandas DataFrame to filter. It must contain a column named “inputCol”.

  • selected_columns – A list of values. Rows will be kept if their value in “inputCol” is present in this list.

Returns:

A filtered DataFrame or LazyFrame of the same type as the input.

sequifier.helpers.write_data(data: DataFrame, path: str, write_format: str, **kwargs) None[source]

Writes a Polars (or Pandas) DataFrame to a CSV or Parquet file.

This function detects the type of the input DataFrame:

  • For Polars DataFrames, it uses .write_csv() or .write_parquet().

  • For other DataFrame types (presumably Pandas), it uses .to_csv() or .to_parquet().

Note: The type hint specifies pl.DataFrame, but the implementation includes a fallback path that suggests compatibility with Pandas DataFrames.

Parameters:
  • data – The Polars (or Pandas) DataFrame to write.

  • path – The destination file path.

  • write_format – The format to write. Supported formats are “csv” and “parquet”.

  • **kwargs – Additional keyword arguments passed to the underlying write function (e.g., write_csv for Polars, to_csv for Pandas).

Returns:

None.

Raises:

ValueError – If write_format is not “csv” or “parquet”.
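
A sketch of the type dispatch described above:

```python
import polars as pl

def write_data(data, path: str, write_format: str, **kwargs) -> None:
    if write_format not in ("csv", "parquet"):
        raise ValueError(f"unsupported format: {write_format}")
    if isinstance(data, pl.DataFrame):
        # Polars writers: write_csv / write_parquet
        getattr(data, f"write_{write_format}")(path, **kwargs)
    else:
        # Pandas-style writers: to_csv / to_parquet
        getattr(data, f"to_{write_format}")(path, **kwargs)
```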

class sequifier.io.yaml.TrainModelDumper(stream, default_style=None, default_flow_style=False, canonical=None, indent=None, width=None, allow_unicode=None, line_break=None, encoding=None, explicit_start=None, explicit_end=None, version=None, tags=None, sort_keys=True)[source]

A custom YAML dumper for TrainModel objects.

This dumper extends the base yaml.Dumper to provide custom serialization for TrainModel and related objects, ensuring a clean and readable YAML output. It also modifies the indentation behavior for better formatting.

increase_indent(flow=False, indentless=False)[source]

Increase the indentation level for the YAML output.

This method overrides the default behavior to force indentation for all block-style collections, improving the readability of the output YAML.

Parameters:
  • flow – Whether the context is a flow-style collection.

  • indentless – Whether the context is an indentless sequence.

Returns:

The result of the parent class’s increase_indent method, with flow forced to False.

sequifier.io.yaml.represent_dot_dict(dumper, data)[source]

Represents DotDict objects as a simple YAML mapping. Since a DotDict is essentially a dictionary, it is serialized as a plain mapping rather than with the ‘dictitems’ attribute that appeared in the default YAML output.

sequifier.io.yaml.represent_numpy_float(dumper, data)[source]

Represents numpy.float64 (and similar numpy floats) as standard YAML floats.

sequifier.io.yaml.represent_numpy_int(dumper, data)[source]

Represents numpy.int64 (and similar numpy integers) as standard YAML integers.

sequifier.io.yaml.represent_sequifier_object(dumper, data)[source]

Represents objects from ‘sequifier.config.train_config’ (like TrainModel, ModelSpecModel, TrainingSpecModel) as a simple YAML mapping, using the object’s __dict__. This effectively removes the !!python/object tag and the explicit ‘__dict__:’, ‘__fields_set__:’ keys.

class sequifier.io.sequifier_dataset_from_folder.SequifierDatasetFromFolder(data_path: str, config: TrainModel)[source]

An efficient PyTorch Dataset that pre-loads all data into RAM.

This is the ideal strategy when the entire dataset split can fit into the system’s memory. It pays a one-time I/O cost at initialization, after which all data access during training is extremely fast (RAM access).

__getitem__(idx: int) Tuple[Dict[str, Tensor], Dict[str, Tensor], int, int, int][source]

Retrieves a single sample from the pre-loaded data.

Parameters:

idx – The index of the sample to retrieve.

Returns:

A tuple containing:

  • sequence (dict): Dictionary of feature tensors for the sample.

  • targets (dict): Dictionary of target tensors for the sample.

  • sequence_id (int): The sequence ID of the sample.

  • subsequence_id (int): The subsequence ID within the sequence.

  • start_position (int): The starting item position of the subsequence within the original full sequence.

__init__(data_path: str, config: TrainModel)[source]

Initializes the dataset by loading all .pt files from the data directory into memory. Each .pt file is expected to contain a tuple: (sequences_dict, targets_dict, sequence_ids_tensor, subsequence_ids_tensor, start_item_positions_tensor).

class sequifier.io.sequifier_dataset_from_folder_lazy.SequifierDatasetFromFolderLazy(data_path: str, config: TrainModel, ram_threshold: float = 70.0)[source]

An efficient PyTorch Dataset for datasets that do not fit into RAM.

This class loads data from individual .pt files on-demand (lazily) when an item is requested via __getitem__. It maintains an in-memory cache of recently used files to speed up access. To prevent memory exhaustion, the cache is managed by a Least Recently Used (LRU) policy, which evicts the oldest data chunks when the total system RAM usage exceeds a configurable threshold.

This strategy balances I/O overhead and memory usage, making it suitable for training on datasets larger than the available system memory.
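
A sketch of the LRU-under-RAM-threshold policy using OrderedDict and psutil (assumed available; the real class also adds thread safety):

```python
import collections
import psutil
import torch

class LazyCache:
    def __init__(self, ram_threshold: float = 70.0):
        self.ram_threshold = ram_threshold
        self._cache = collections.OrderedDict()  # file path -> loaded chunk

    def get(self, path: str):
        if path in self._cache:
            self._cache.move_to_end(path)   # mark as most recently used
        else:
            self._cache[path] = torch.load(path)  # load from disk on a cache miss
        # evict least recently used chunks while system RAM exceeds the threshold
        while psutil.virtual_memory().percent > self.ram_threshold and len(self._cache) > 1:
            self._cache.popitem(last=False)
        return self._cache[path]
```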

__getitem__(idx: int) Tuple[Dict[str, Tensor], Dict[str, Tensor], int, int, int][source]

Retrieves a single data sample, loading from disk if not in the cache.

This method is the core of the lazy-loading strategy. It is thread-safe and manages the cache automatically.

Parameters:

idx – The index of the sample to retrieve.

Returns:

A tuple containing:

  • sequence (dict): Dictionary of feature tensors for the sample.

  • targets (dict): Dictionary of target tensors for the sample.

  • sequence_id (int): The sequence ID of the sample.

  • subsequence_id (int): The subsequence ID within the sequence.

  • start_position (int): The starting item position of the subsequence within the original full sequence.

__init__(data_path: str, config: TrainModel, ram_threshold: float = 70.0)[source]

Initializes the dataset by reading metadata and setting up the cache. Each .pt file is expected to contain a tuple: (sequences_dict, targets_dict, sequence_ids_tensor, subsequence_ids_tensor, start_item_positions_tensor).

Parameters:
  • data_path (str) – The path to the directory containing the pre-processed .pt files and a metadata.json file.

  • config (TrainModel) – The training configuration object.

  • ram_threshold (float) – The system RAM usage percentage (0-100) at which to trigger cache eviction.

__len__() int[source]

Returns the total number of samples in the dataset.

class sequifier.io.sequifier_dataset_from_file.SequifierDatasetFromFile(data_path: str, config: TrainModel, shuffle: bool = True)[source]

An iterable-style dataset that pre-loads all data into CPU RAM and yields pre-collated batches.

This is the idiomatic PyTorch solution for implementing custom ‘en block’ batching. The __iter__ method handles shuffling and batch slicing, ensuring maximum performance.

__iter__() Iterator[Tuple[Dict[str, Tensor], Dict[str, Tensor], None, None, None]][source]

Yields batches of data.

Handles shuffling (if enabled) and slicing data based on distributed rank and worker ID.

Yields:

Iterator[Tuple[Dict[str, torch.Tensor], Dict[str, torch.Tensor], None, None, None]]

An iterator where each item is a tuple containing:
  • data_batch (dict): Dictionary of feature tensors for the batch.

  • targets_batch (dict): Dictionary of target tensors for the batch.

  • None: Placeholder for sequence_id (not used in this dataset type).

  • None: Placeholder for subsequence_id (not used in this dataset type).

  • None: Placeholder for start_position (not used in this dataset type).

__len__() int[source]

Returns the total number of samples in the dataset.

set_epoch(epoch: int)[source]

Allows the training loop to set the epoch for deterministic shuffling.

sequifier.optimizers.optimizers.get_optimizer_class(optimizer_name: str) Optimizer[source]

Gets the optimizer class from a string.

Parameters:

optimizer_name – The name of the optimizer.

Returns:

The optimizer class.
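
One plausible implementation resolves the name against torch.optim (a sketch; the actual lookup may differ, e.g., to support custom optimizers):

```python
import torch

def get_optimizer_class(optimizer_name: str):
    # look up the class by name, e.g., "AdamW" -> torch.optim.AdamW
    return getattr(torch.optim, optimizer_name)

get_optimizer_class("AdamW")
```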

class sequifier.samplers.distributed_grouped_random_sampler.DistributedGroupedRandomSampler(data_source: SequifierDatasetFromFolder | SequifierDatasetFromFolderLazy, num_replicas: int, rank: int, shuffle: bool = True, seed: int = 0)[source]

A distributed sampler that groups samples by file to improve cache efficiency.

This sampler partitions the set of data FILES across the distributed processes, not the individual samples. Each process then iterates through its assigned files in a random order. Within each file, the samples are also shuffled.

This ensures that each process sees a unique subset of the data per epoch while maximizing sequential reads from the same file, which is ideal for lazy-loading datasets.
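
A sketch of the file-grouped partitioning idea, with hypothetical per-file sample counts standing in for the dataset's batch_files_info:

```python
import random

def grouped_indices(samples_per_file, num_replicas, rank, seed, epoch):
    rng = random.Random(seed + epoch)                # epoch-dependent, deterministic shuffling
    file_order = list(range(len(samples_per_file)))
    rng.shuffle(file_order)                          # shuffle the order of files
    offsets = [0]
    for n in samples_per_file:
        offsets.append(offsets[-1] + n)              # cumulative start index of each file
    indices = []
    for file_idx in file_order[rank::num_replicas]:  # partition FILES across ranks, not samples
        within = list(range(offsets[file_idx], offsets[file_idx] + samples_per_file[file_idx]))
        rng.shuffle(within)                          # shuffle samples within each file
        indices.extend(within)
    return indices

grouped_indices([4, 4, 3], num_replicas=2, rank=0, seed=0, epoch=0)
```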

__init__(data_source: SequifierDatasetFromFolder | SequifierDatasetFromFolderLazy, num_replicas: int, rank: int, shuffle: bool = True, seed: int = 0)[source]
Parameters:
  • data_source – The dataset to sample from. Must have a batch_files_info attribute.

  • num_replicas – Number of processes participating in distributed training.

  • rank – Rank of the current process.

  • shuffle – If True, shuffles the order of files and samples within files.

  • seed – Random seed used to create the permutation.

__iter__() Iterator[int][source]

Returns an iterator over indices for the current rank.

__len__() int[source]

Returns the number of samples for the current rank, not the total.

set_epoch(epoch: int) None[source]

Sets the epoch for this sampler. This is used to create a different shuffling order for each epoch.
